BorisovAI
New Feature | C--projects-bot-social-publisher | Claude Code

How Silent Task Deaths Nearly Broke the Pipeline

I was hunting for a bug that didn’t exist—or rather, a bug that existed everywhere and nowhere at once. The Trend Analysis system I’d been building was supposed to extract real patterns from event clusters. Simple enough: feed in grouped events, extract directional trends. Instead, it kept crashing silently every 8–10 minutes with exit code 0, as if nothing had gone wrong.

The migration to track trends properly had gone smoothly. Three new tables, domain tags for context, event-trend linkage. Tests passed: 740 green checkmarks. I deployed the first cycle.

Then the phantom crashes began.

PM2 would restart the process like it was scheduled maintenance. Logs showed nothing suspicious: no exceptions, no stack traces. Just… silence. I added debug markers at critical points: before cluster formation, after extraction, before linking. The markers appeared right up to a certain moment, then stopped. The failure was inside an async task that I’d created with asyncio.create_task() and never awaited, instead of including it in the main asyncio.gather() call.

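That’s the trap. In Python, when you spin up a task with create_task() and never await it, an unhandled exception doesn’t propagate to the code that created it. The exception is stored on the Task object, and the most asyncio will do on its own is log “Task exception was never retrieved” when the task is garbage-collected. The task just dies silently, and in my case the process went down along with it. No error at the call site, no traceback where I was looking. Just gone.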
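A minimal sketch of that behavior, assuming a hypothetical `failing_worker` coroutine rather than anything from the real pipeline. It shows that the caller keeps running after the task fails, and that the exception is only visible if someone asks the task for it:

```python
import asyncio


async def failing_worker() -> None:
    # Hypothetical stand-in for a background job that hits an error.
    raise RuntimeError("extraction failed")


async def run_demo() -> tuple[bool, str]:
    # Fire-and-forget: the task is scheduled but never awaited directly.
    task = asyncio.create_task(failing_worker())

    # Yield to the event loop so the task gets a chance to run and fail.
    await asyncio.sleep(0.01)

    # The caller is still running; the exception is parked on the task.
    finished = task.done()
    error = task.exception()  # retrieving it also marks it as handled
    return finished, repr(error)


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

If `task.exception()` were never called and the task were never awaited, nothing in `run_demo` would notice the failure at all.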

The culprit was _extract_facts_pipeline, a background worker spawned inside crawl_once() with no exception handling. When it failed—and it was failing whenever the translation loop also ran—there was nothing to catch it.

I refactored the critical path: every long-running task now either handles its own exceptions or gets registered in the main gather() call. No more orphaned tasks. I also noticed that _extract_facts_pipeline and the translation loop were both hitting the same Ollama instance, causing contention on a single port. Dual-port routing wasn’t working as expected, so I split them across different endpoints.
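The other half of that rule, registering workers in a single `gather()` call, can be sketched roughly like this. The worker names and the simulated `ConnectionError` are illustrative assumptions; the real loops run indefinitely and are not shown in this post.

```python
import asyncio
from typing import Optional


async def crawl_loop() -> None:
    # Hypothetical worker; finishes quickly so the sketch terminates.
    await asyncio.sleep(0.01)


async def translation_loop() -> None:
    # Hypothetical worker that fails, standing in for a real outage.
    await asyncio.sleep(0.005)
    raise ConnectionError("model endpoint unavailable")


async def supervise() -> dict[str, Optional[BaseException]]:
    workers = {"crawl": crawl_loop(), "translate": translation_loop()}

    # return_exceptions=True collects failures as results instead of
    # raising the first one, so every worker's outcome is visible.
    results = await asyncio.gather(*workers.values(), return_exceptions=True)

    return {
        name: (result if isinstance(result, BaseException) else None)
        for name, result in zip(workers, results)
    }


if __name__ == "__main__":
    for name, error in asyncio.run(supervise()).items():
        print(name, "ok" if error is None else f"failed: {error!r}")
```

With the default `return_exceptions=False`, `gather()` would instead raise the first exception into the awaiting code, which is also loud rather than silent. Either way, the failure reaches code that can react to it.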

After the fixes, uptime stretched to 5+ minutes, then longer. The system stabilized. Trends hadn’t started accumulating yet—domain tags needed time to build up—but the pipeline held.

The lesson hit hard: asynchronous architecture demands as much attention to failure modes as synchronous code does. Maybe more. Silent failures are worse than loud ones.

And here’s the kicker: the object-oriented way to become wealthy? Inheritance. 😄

Metadata

Session ID:
grouped_C--projects-bot-social-publisher_20260418_1955
Branch:
main
Dev Joke
TypeScript: the only technology where “it works” counts as documentation.
