Taming the Chaos: How Dual-Port Routing Saved Our Ollama Pipeline

I was debugging why our Trend Analysis pipeline kept crashing at the worst possible moments. The team reported that during peak enrichment cycles, Ollama would simply close connections without warning. “Remote end closed connection”—the error message that haunts every infrastructure engineer. I spent two hours staring at logs before the pattern clicked: we were hammering a single Ollama instance with concurrent requests to two different models, hermes3:8b and gemma4:e2b, and the poor thing was drowning in VRAM pressure.
Here’s where it got interesting. The fix wasn’t complicated, but it required touching the entire request pipeline. I split Ollama across two ports—11435 for gemma4:e2b (our workhorse), 11436 for hermes3:8b (the memory hog). But splitting the ports meant nothing if requests could still collide. So I added a global _ollama_mutex to serialize all requests, preventing the concurrent calls that were triggering Ollama’s connection resets. Two doors on the same building, but only one person could enter at a time.
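If you're curious what that looks like in practice, here's a minimal Python sketch of the routing plus the global mutex. Only the port numbers, the model names, and the `_ollama_mutex` idea come from the actual fix; the `MODEL_PORTS` dict, the `generate` helper, and the use of `requests` are illustrative assumptions.

```python
import threading
import requests

# One Ollama instance per port: gemma4:e2b on 11435, hermes3:8b on 11436.
MODEL_PORTS = {
    "gemma4:e2b": 11435,
    "hermes3:8b": 11436,
}

# Global mutex: only one in-flight Ollama request at a time, regardless of port.
_ollama_mutex = threading.Lock()

def generate(model: str, prompt: str, timeout: int = 300) -> str:
    """Route the request to the right port and serialize all Ollama calls."""
    port = MODEL_PORTS[model]
    with _ollama_mutex:  # prevents the concurrent calls that triggered connection resets
        resp = requests.post(
            f"http://localhost:{port}/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=timeout,
        )
        resp.raise_for_status()
        return resp.json()["response"]
```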
The mutex was half the battle. The other half was buried in configuration. Someone had set keep_alive="-1", which is invalid Go duration syntax—Ollama literally rejected every request coming through. I changed it to "999h", which keeps models pinned in VRAM without expiring. A tiny string, massive difference.
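For context, `keep_alive` rides along in the same request payload. The payload below is just an illustrative example, not our actual prompt; the only piece taken from the fix is the `keep_alive` value.

```python
# "999h" is a valid Go duration string, so the model stays pinned in VRAM
# instead of the request being rejected outright.
payload = {
    "model": "gemma4:e2b",
    "prompt": "Summarize the latest cluster.",
    "stream": False,
    "keep_alive": "999h",   # was "-1", which Ollama refused to parse
}
```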
But there was more. Our translation pipeline was chunking content into 50-character pieces and sending them to the LLM with 16K+ character prompts—pure context overflow. I dropped chunk_size from 50 to 5. Smaller prompts, fewer timeouts, cleaner responses. Separately, the SQLite busy_timeout was 15 seconds; under load, transactions were getting murdered. Bumped it to 60 seconds, and lock contention dropped noticeably.
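Roughly, those two knobs look like this. This is a sketch only: `TRANSLATION_CHUNK_SIZE` and `open_db` are made-up names, and I'm assuming the standard `sqlite3` module.

```python
import sqlite3

# Translation chunking: smaller chunks keep each prompt well under the
# model's context window (was 50, now 5).
TRANSLATION_CHUNK_SIZE = 5

def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Wait up to 60 seconds on a locked database before giving up
    # (was 15 s, which wasn't enough under load).
    conn.execute("PRAGMA busy_timeout = 60000")
    conn.execute("PRAGMA journal_mode = WAL")
    return conn
```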
The enrichment cycle itself was blocking during watchdog checks. I restructured it so enrichment runs before cluster detection and skips (doesn’t wait) when extraction is active. The crawler and watchdog now use a _crawl_active flag to serialize access. Small changes, but they eliminated deadlocks.
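A simplified sketch of the idea, with `_crawl_active` as a `threading.Event` and placeholder functions standing in for the real crawl, enrichment, and clustering steps:

```python
import threading

_crawl_active = threading.Event()  # set while the crawler/extraction step is running

def run_extraction(): ...      # placeholder for the real extraction step
def run_enrichment(): ...      # placeholder for the real enrichment step
def detect_clusters(): ...     # placeholder for the real cluster detection

def crawl_cycle():
    _crawl_active.set()
    try:
        run_extraction()
    finally:
        _crawl_active.clear()

def watchdog_cycle():
    # Enrichment runs before cluster detection and is skipped (not waited on)
    # whenever extraction is active, so the two never deadlock on each other.
    if not _crawl_active.is_set():
        run_enrichment()
    detect_clusters()
```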
One last detail: citation enrichment is now capped at 50 items, WAL checkpoints run every 5 minutes, and event times are normalized to UTC. Nothing flashy, but each fix removed a source of silent failure.
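If you want the shape of it, something like this; the names and the checkpoint strategy here are illustrative, not lifted from the codebase:

```python
import sqlite3
from datetime import datetime, timezone

CITATION_ENRICHMENT_LIMIT = 50          # cap per enrichment run
WAL_CHECKPOINT_INTERVAL_SECONDS = 300   # checkpoint every 5 minutes

def checkpoint_wal(conn: sqlite3.Connection) -> None:
    """Flush the WAL back into the main database file."""
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")

def normalize_event_time(dt: datetime) -> datetime:
    """Store every event time as UTC; naive datetimes are assumed to already be UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
```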
The result? Pipeline stability jumped from “crashes every hour” to “runs for days.” Concurrent requests no longer kill Ollama. Models stay in VRAM. Transactions complete without stalling. It’s the kind of fix where half the work is plumbing, half is understanding why the original design cracked under load.
Hey, here’s one for the engineers: I wish this Ollama instance was asynchronous… so it would finally give me a callback instead of closing the connection 😄
Metadata
- Branch: fix/trend-coherence-scoring
- Dev Joke: Why does Scala think it's better than everyone else? Because Stack Overflow said so.