BorisovAI

Blog

Posts about the development process, solved problems and learned technologies

New Feature · llm-analisis

When Your Self-Teaching Model Eats Its Own Homework

I spent three weeks watching a machine learning model try to bootstrap itself into genius, and it was humbling in ways I didn't expect. The premise was elegant: we had a math reasoning model hitting 80% accuracy on GSM8K problems. Good, but stuck. The question became—could the model teach itself by generating its own training data? Not just solving problems, but creating them. Self-augmentation. A closed loop where the model improves by learning from problems it invented. It didn't work the way I thought it would.

We loaded the 80% MetaMath model and asked it to rephrase 1,000 training problems three times each. Seven thousand generations across augmentation, solving, and verification. The math was sound. The idea was sound. Then we trained on the output. The model got worse. Minus 3.5 percentage points.

The problem wasn't data volume—422 self-augmented examples should've helped. The problem was the model had learned to rephrase *like itself*, which meant it was essentially training on its own mistakes. A weak teacher produces weak students. The model was bootstrapping into a local minimum, not climbing toward improvement.

That's when I realized we'd been strengthening the wrong thing. We kept tinkering with model architecture—blocks, weights, neurons—when the bottleneck was actually **data quality**. The model wasn't hungry for new neurons. It was hungry for diverse, well-structured problems from the outside world.

So we pivoted. Instead of self-generation, we built a pipeline that *searched* for external data. SearXNG queries like "grade school math word problem with solution" or "multi-step arithmetic for grade 5." The model would tell us what it needed, the pipeline would fetch it from the web, parse it, validate it, and feed it back. It sounds simple. It wasn't. Web extraction is noisy. HTML is messy. But for the first time, we had a system where the model didn't just solve problems—it could *ask* for what it needed from the external world.

Did it work? The loss curve started improving. The model began learning from real, diverse problems instead of its own echo chamber. We haven't hit 85% yet, but we're moving in the right direction.

The joke writes itself: a byte walks into a bar looking miserable. The bartender asks what's wrong. "Parity error," it says. "Ah, I thought you looked a bit off." 😄 Our model had the same problem—it looked fine from the outside, but its internal reasoning was hopelessly corrupted. The fix wasn't better weights. It was better data.
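For the curious, here is a minimal sketch of the kind of external-data fetch the pivot relies on, assuming a local SearXNG instance with JSON output enabled; the URL, helper name, and query handling are illustrative rather than the project's actual code.

```python
import requests

SEARX_URL = "http://localhost:8080/search"  # assumed local SearXNG instance

def fetch_candidate_problems(query: str, pages: int = 2) -> list[dict]:
    """Query SearXNG's JSON API and return the raw result entries."""
    results = []
    for page in range(1, pages + 1):
        resp = requests.get(
            SEARX_URL,
            params={"q": query, "format": "json", "pageno": page},
            timeout=10,
        )
        resp.raise_for_status()
        results.extend(resp.json().get("results", []))
    return results

# The model (or a fixed list) supplies the queries it wants data for;
# the raw hits still need parsing and validation downstream.
queries = [
    "grade school math word problem with solution",
    "multi-step arithmetic for grade 5",
]
candidates = [r for q in queries for r in fetch_candidate_problems(q)]
print(f"fetched {len(candidates)} raw results to parse and validate")
```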

Apr 20, 2026
New Feature · trend-analisis

Five Gates That Caught What Code Missed

I was deep in trend extraction when the first problem emerged: garbage data passing all our filters. Oil prices, orange juice futures, and insurance claims—completely unrelated events—somehow clustered together as a "trend." Our code logic looked solid, but something was systematically wrong.

The real issue was that we were checking *individual* facts but ignoring whether they actually belonged together. We'd validate each event, calculate relevance scores, link entities—all by the book. Then we'd ship a trend built on noise. That's when I started adding gates. Not one, but five layers.

**The coherence gate came first.** I computed embedding vectors for all evidence events and measured their distance to the cluster centroid. Anything below 0.35 similarity got rejected. Simple, but brutal—56 out of 56 garbage trends from our backlog got filtered immediately. Oil and oranges finally stopped meeting.

**The relevance score came next.** Instead of a hardcoded 1.0 for every event-trend pair, I made it actual cosine similarity to the centroid. Now you could see *why* an event was part of a trend, not just whether it was. The transparency mattered more than I expected.

**Then the entity blacklist.** Generic entities like Russia, China, AI—they're everywhere, so they were matching everything. I marked them as non-discriminative. If "AI" was your only link between two events, they weren't actually related.

**The LLM confidence gate was practical.** Some extraction calls returned low confidence scores. No point materializing weak trends. We filter at ≥0.5 and save compute.

**The final gate was the cheapest and most effective.** I added a second LLM call—just one or two candidates per cluster—asking: "Is this actually a trend or just a recurring situation?" You'd be surprised how many things that look like trends are just background noise that never resolves. The LLM catches the semantic false positives our metrics miss.

Five gates, each catching different failure modes. The system stopped being a filter and started being a validator. Testing this felt like debugging a long-running service: each gate removed a class of problems, but you only discovered the next problem once the previous one was fixed. By the end, trend quality stopped being "good enough" and started being defensible.

Here's a tech fact: even rigorous mathematical filters can't detect semantic incoherence. You need multiple validation layers, some statistical, some linguistic, some logical. It's the difference between catching typos and catching conceptual errors.

So now when someone asks why we need five gates instead of one comprehensive metric, I have a simple answer: because garbage whispers different languages, and we learned to listen in five of them. 😄
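A minimal sketch of the first two gates combined, assuming the event embeddings are already computed as rows of a NumPy array; the 0.35 threshold comes from the post, the function name is mine.

```python
import numpy as np

def coherence_gate(event_embeddings: np.ndarray, threshold: float = 0.35):
    """Keep only events whose cosine similarity to the cluster centroid
    clears the threshold; the similarities double as relevance scores."""
    vecs = event_embeddings / np.linalg.norm(event_embeddings, axis=1, keepdims=True)
    centroid = vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = vecs @ centroid              # cosine similarity per event
    keep = sims >= threshold
    return keep, sims                   # boolean mask + per-event score

# Toy example: two related events and one outlier that gets rejected.
emb = np.array([[1.0, 0.1], [0.9, 0.2], [-0.8, 1.0]])
mask, scores = coherence_gate(emb)
print(mask, scores.round(2))
```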

Apr 20, 2026
New Feature · trend-analisis

Hunting a Silent Crash in the Trend Pipeline

I've been tracking trends across code repositories for weeks now, building a system that extracts coherent patterns from clusters of developer events. The **Trend Analysis** project seemed straightforward: parse events, link facts, extract emerging patterns. But somewhere in the pipeline, something was dying silently every eight to ten minutes, and I couldn't figure out where.

The setup was solid. I had domain tags extraction working—new JSON schema added, Pydantic model updated, migration 092 ready to deploy. The pipeline should extract things like "AI funding accelerating" by finding independent signals (OpenAI's $6.6B, Anthropic's $4B, Mistral's $600M) inside thematic clusters. Three separate events, one unmistakable direction. Clean concept.

Then came the weirdness. After deploying the domain tag changes and the new trend formation phase, the watchdog logs showed something alarming: **450 restarts in rapid succession**. The process would exit cleanly—exit code 0, PM2 reported stable restarts, no out-of-memory kills, no segfaults. Just... gone. Eight minutes of work, then silence.

I started adding debug markers everywhere. "PHASE_DEBUG" before the cluster extraction. "Extraction done" right before phase 3a. I waited through cycles, watching the logs. "Crawled 80 items" would appear, extraction would start, and then—nothing. The debug marker never showed up. The process exited before reaching the code that should have printed it.

That's when I realized: the crash wasn't in the main pipeline code. All the obvious loops caught exceptions. The real culprit had to be in `asyncio.create_task()`. Inside `crawl_once()`, I'd created a task for the extraction pipeline without adding it to the main `gather()` call. In Python 3.13, unhandled exceptions in detached tasks don't kill the event loop gracefully—they propagate through the task and cause the entire process to exit.

The fix was brutal in its simplicity: wrap the extraction task properly, add it to the supervision chain, let exceptions surface through controlled channels instead of crashing the event loop. I merged the extraction pipeline back into the monitored task family, added `return_exceptions=True` to the gather call, and redeployed. The restarts stopped.

What struck me most was how invisible the problem had been. No traceback, no error log, just a process that kept dying cleanly. The lesson: **in async Python, detached tasks are ticking bombs**. Every `create_task()` without explicit error handling is a potential silent failure. I now review every task creation the way I'd review a network socket—with skepticism and defensive coding.

The pipeline now runs stable. Trends extract properly. And I've got a new rule in my deployment checklist: *never trust a silent exit code*.

---

*Why did the Python programmer not respond to the foreign mails he got? Because his interpreter was busy collecting garbage.* 😄
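A minimal sketch of the supervised shape the fix describes: every long-running coroutine lives inside one `gather()` call with `return_exceptions=True`. The coroutine names mirror the post, but their bodies are placeholders.

```python
import asyncio

async def extraction_pipeline() -> None:
    ...  # phase 3a extraction work goes here

async def crawl_once() -> None:
    ...  # crawl work goes here

async def main() -> None:
    # Keep every long-running coroutine inside one supervised gather()
    # instead of spawning detached create_task() calls. With
    # return_exceptions=True a failure comes back as a value we can log
    # instead of tearing down the whole process.
    results = await asyncio.gather(
        crawl_once(),
        extraction_pipeline(),
        return_exceptions=True,
    )
    for r in results:
        if isinstance(r, Exception):
            print(f"task failed: {r!r}")  # controlled channel, no silent exit

asyncio.run(main())
```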

Apr 18, 2026
New Feature · C--projects-bot-social-publisher

How Silent Task Deaths Nearly Broke the Pipeline

I was hunting for a bug that didn't exist—or rather, a bug that existed everywhere and nowhere at once. The **Trend Analysis** system I'd been building was supposed to extract real patterns from event clusters. Simple enough: feed in grouped events, extract directional trends. Instead, it kept crashing silently every 8–10 minutes with exit code 0, as if nothing had gone wrong.

The migration to track trends properly had gone smoothly. Three new tables, domain tags for context, event-trend linkage. Tests passed: 740 green checkmarks. I deployed the first cycle. Then the phantom crashes began. PM2 would restart the process like it was scheduled maintenance. Logs showed nothing suspicious—no exceptions, no stack traces. Just... silence.

I added debug markers at critical points: before cluster formation, after extraction, before linking. The markers appeared right up to a certain moment, then stopped. The system was crashing in an async task that I'd created with `asyncio.create_task()` instead of wrapping it in `asyncio.gather()`.

That's the trap. In Python, when you spin up a task with `create_task()` and don't directly await it, an unhandled exception won't propagate to your main loop. The task just dies silently, and in our case it took the whole process down with it. No error, no traceback—just gone.

The culprit was `_extract_facts_pipeline`, a background worker spawned inside `crawl_once()` with no exception handling. When it failed—and it was failing whenever the translation loop also ran—there was nothing to catch it.

I refactored the critical path: every long-running task now either handles its own exceptions or gets registered in the main `gather()` call. No more orphaned tasks. I also noticed that `_extract_facts_pipeline` and the translation loop were both hitting the same Ollama instance, causing contention on a single port. Dual-port routing wasn't working as expected, so I split them across different endpoints.

After the fixes, uptime stretched to 5+ minutes, then longer. The system stabilized. Trends hadn't started accumulating yet—domain tags needed time to build up—but the **pipeline held**.

The lesson hit hard: asynchronous architecture demands as much attention to failure modes as synchronous code does. Maybe more. Silent failures are worse than loud ones.

And here's the kicker: the object-oriented way to become wealthy? Inheritance. 😄
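If a task genuinely has to run outside the main `gather()`, the pattern below is one way to keep it from becoming an orphan: hold a strong reference and log failures from a done-callback. This is a minimal sketch; the helper name and the simulated failure are illustrative, not the project's code.

```python
import asyncio
import logging

log = logging.getLogger("pipeline")
_background_tasks: set[asyncio.Task] = set()

def spawn_supervised(coro, name: str) -> asyncio.Task:
    """create_task() with two safeguards: keep a strong reference so the
    task isn't garbage-collected mid-flight, and report any exception via
    a done-callback instead of letting it vanish."""
    task = asyncio.create_task(coro, name=name)
    _background_tasks.add(task)

    def _on_done(t: asyncio.Task) -> None:
        _background_tasks.discard(t)
        if not t.cancelled() and t.exception() is not None:
            log.error("background task %s failed", t.get_name(),
                      exc_info=t.exception())

    task.add_done_callback(_on_done)
    return task

async def _extract_facts_pipeline() -> None:
    raise RuntimeError("boom")  # simulated failure for the demo

async def main() -> None:
    spawn_supervised(_extract_facts_pipeline(), "extract_facts")
    await asyncio.sleep(0.1)  # give the task time to fail and be logged

asyncio.run(main())
```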

Apr 18, 2026
New Feature · llm-analisis

Building the Self-Augmentation Loop: When Your Model Becomes Its Own Data Generator

I was staring at the MetaMath results—82% accuracy on GSM8K with voting, and the loss curve still declining at 3,000 training steps. The problem hit me: we had only scratched the surface of one dataset. The model was learning fast, but we were feeding it the same curated problems over and over. What if, instead of hunting for new external datasets, the model could generate its own training data?

The idea crystallized during a code review session. We had 7,473 problems in GSM8K's training split. With simple augmentation—rephrasing, backward reasoning, changing numerical values (what the MetaMath team calls FOBAR)—we could multiply that into 36,000 diverse problems. The beauty was that we didn't need SearXNG or any web scraper running on port 8888. We had everything already.

The plan became a three-stage closed loop. First, push the current MetaMath model further. We'd been training for 3K steps; the loss curve suggested we hadn't hit diminishing returns yet. I scheduled a full run with 395K problems from MetaMathQA (not just GSM8K, but also MATH for diversity) across 10,000 steps. That's 3.3 times longer. The target was straightforward: break 80% with greedy decoding, then test voting with N=8 and aim for 88-91%. Record territory.

But the real work was the second stage. I sketched out the self-augmentation pipeline: take each training problem, have the model rephrase it three ways, generate the backward reasoning (what mathematical path led to this problem), and vary the numbers while preserving the structure. No external API calls. No dataset downloads. Just the model and its own problems, recursively improving itself.

The third stage—the SearXNG agent—would wait. That was for unlimited data acquisition, feeding the loop continuously. But stages one and two? Those were self-contained. Closed. Independent of infrastructure.

While the training runs spun up, I kept thinking about why this matters. Most ML teams chase bigger, richer datasets. We were doing something different: proving that a focused model could bootstrap its own curriculum. MetaMath had shown the way with their augmentation pipeline. We were taking it inward, making it part of the learning cycle itself.

The voting layer alone was compelling. Eight different sampling passes over the same problem, then majority vote. It's not elegant, but it works—trading inference cost for accuracy. With a self-augmented training set running in parallel, the model wouldn't just get better at reasoning; it would learn to reason about reasoning.

And somewhere in that loop, there's a joke waiting: why are machine learning engineers always drowning in their own data? Because they built the pump themselves. 😄
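The voting layer is simple enough to show in full. Below is a minimal sketch of N=8 self-consistency voting, assuming a `solve(problem)` callable that samples one solution per call; the answer-extraction regex is illustrative.

```python
import re
from collections import Counter

def extract_answer(completion: str) -> str | None:
    """Pull the last number out of a generated solution (illustrative)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def vote(problem: str, solve, n: int = 8) -> str | None:
    """Self-consistency: sample n solutions, take the majority answer."""
    answers = [extract_answer(solve(problem)) for _ in range(n)]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# `solve` stands in for a temperature>0 sampling call to the model.
fake_solve = lambda p: "step 1 ... step 2 ... the answer is 42"
print(vote("A toy word problem", fake_solve))   # -> "42"
```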

Apr 18, 2026
New Feature · trend-analisis

How We Finally Stopped Treating Trends Like Stray Events

I was staring at our trend detection system when something clicked: we'd been treating outliers like patterns. A single spike in deployment frequency, a one-off refactor, a random config change—our old pipeline grabbed these and labeled them "trends." We weren't detecting patterns. We were collecting noise.

The fix came during the Trend Analysis project overhaul. We needed to stop extracting trends from individual events and start identifying *structural patterns* from event clusters instead.

Here's what actually happened: I sat down with the HDBSCAN clustering output and realized we had real clusters—groups of related events that actually meant something. A cluster of "config changes" across multiple services. A cluster of "security patches." A cluster of "database optimization attempts." These clusters deserved analysis, not the random single events we'd been fishing out before.

The new approach—ADR v5—extracts 0 to 3 structural patterns *per cluster*. Each pattern gets evidence: which events support it, whether the change is up or down, what type of signal it is, metrics, the key players involved. We also started assigning **domain tags** to events (3-5 broad categories like "infrastructure," "performance," "security") without any extra LLM calls—they come free from the extraction prompt itself.

The tricky part was matching new incoming events to existing trends. We went hybrid: check embedding similarity (threshold 0.55) *and* look for entity/tag overlap. It's not perfect, but it catches the real patterns and ignores the noise. We also killed Level 1 entity-based trend extraction entirely. It was generating false positives like a broken smoke detector. Sometimes less is more.

The migration was thorough—new tables for `event_domain_tags`, `trend_events`, plus extra columns in the trends table. We had to be careful with Ollama routing: dual-port setup, mutex locks, keep-alive set to "999h" to avoid connection thrashing, chunk sizes tuned to 5.

Testing on production data gave us 14 legitimate trends extracted from 5 clusters, with 56 events linked back to those trends. Not a massive number, but every single one made sense. No ghost patterns. No random events masquerading as trends.

What do you call a group of 8 Hobbits? A Hobbyte. 😄
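A minimal sketch of that hybrid matching check, assuming an event and a trend each carry an embedding vector plus a set of discriminative entities/tags; the function name and the rule that both checks must pass are my reading of the post, not the exact production logic.

```python
import numpy as np

def matches_trend(event_vec: np.ndarray,
                  event_tags: set[str],
                  trend_vec: np.ndarray,
                  trend_tags: set[str],
                  sim_threshold: float = 0.55) -> bool:
    """Hybrid check: the event must sit close to the trend centroid in
    embedding space *and* share at least one discriminative entity/tag."""
    cos = float(event_vec @ trend_vec /
                (np.linalg.norm(event_vec) * np.linalg.norm(trend_vec)))
    overlap = bool(event_tags & trend_tags)
    return cos >= sim_threshold and overlap

# Toy example: close vectors plus a shared "security" tag -> match.
ev = np.array([0.9, 0.1, 0.2])
tr = np.array([0.85, 0.15, 0.25])
print(matches_trend(ev, {"security", "postgres"}, tr, {"security"}))
```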

Apr 18, 2026
New Feature · llm-analisis

How Inspiration Saves a Project: A Lesson from Nemotron-3-Nano

When you've spent months building your LLM Orchestra—a model with modular architecture based on Qwen 2.5—you start to believe you already know almost everything about training neural networks. Then you stumble upon Nemotron-3-Nano from NVIDIA and realize: you were wrong.

It all started with a simple question. Our MoE (Mixture of Experts) was being inserted into the FFN blocks of the transformer, and we were preparing to add it to the architecture. It made sense to look at competitors: what's happening in 4B models? Maybe they've already solved everything there?

Nemotron-3-Nano turned out to be a shocking discovery. On the MATH500 benchmark, this 3.97B model shows **95.4%** solvability. Our Qwen 2.5, roughly the same size (3.09B), barely reaches 65% on similar tasks. The difference isn't in architecture—both use transformers. The difference is in how and on what they were trained.

NVIDIA didn't hide the secret. They used **distillation from DeepSeek R1**—knowledge from a stronger model was transferred to a smaller one. And not naively: they took Chain-of-Thought solutions from DeepSeek (97%+ on MATH), then trained Nemotron to predict these reasoning steps. Plus multi-stage reinforcement learning with an increasing KL-penalty and synthetic data at the scale of 10+ trillion tokens.

We did self-distillation: the model learned from itself. Qwen 2.5 with a 74% solve rate—a weak teacher for itself. That's where the mistake was.

The turning point came as an idea: what if instead of self-distillation we applied **cross-model distillation**? Take ready-made CoT solutions from DeepSeek R1 distill 7B (available free on HuggingFace) and train our Orchestra-MoE on them. This preserves the core principle of growth—we add new expert modules to the base architecture, but change the source of knowledge from self-prediction to external exemplars.

Now that's inspiration. Not from a sudden epiphany, but from **honestly looking at what others are doing** and being willing to admit: our path wasn't ambitious enough. Model size is not destiny. Quality of training data is destiny. Phase 40d, it turns out, should be about cross-model distillation.

And here's the kicker: Scala updated itself and looked in the mirror—"I'm not who I used to be." Our Orchestra will say the same thing when it starts learning from truly strong models. 😄
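At the data level, cross-model distillation is easy to sketch: pair each problem with the stronger model's chain-of-thought and fine-tune on those records. The prompt format and field names below are assumptions for illustration, not the project's actual format.

```python
import json

def to_sft_example(problem: str, teacher_cot: str) -> dict:
    """Turn a (problem, teacher chain-of-thought) pair into a plain
    prompt/completion record for supervised fine-tuning."""
    return {
        "prompt": f"Problem: {problem}\nSolve step by step.\n",
        "completion": teacher_cot.strip(),
    }

# In practice these pairs would be streamed from the stronger model's
# outputs; a single toy record stands in here.
teacher_pairs = [
    ("A pen costs 3 rubles and a notebook costs twice as much. Total for both?",
     "The notebook costs 3 * 2 = 6. Together 3 + 6 = 9. The answer is 9."),
]

with open("sft_train.jsonl", "w", encoding="utf-8") as dst:
    for problem, cot in teacher_pairs:
        dst.write(json.dumps(to_sft_example(problem, cot), ensure_ascii=False) + "\n")
```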

Mar 20, 2026
New Feature · scada-coating

Building the Open SCADA Revolution: From Tagat to Independence

When I finished my two-year tenure as the lead developer at Tagat, one thought consumed me: **why does the electroplating industry remain locked into proprietary SCADA systems?** Thousands of coating lines across the globe run on closed-source software, each facility dependent on a single vendor for updates, support, and innovation.

That frustration became the fuel for BorisovAI. I assembled a team with the same hunger for change. Together, we didn't just talk about an alternative—we **built one**. Our SCADA system for electroplating is production-ready, battle-tested, and fundamentally different. It runs on open standards, which means manufacturers gain something they've never had: *independence from vendor lock-in*.

The technical challenge was immense. Electroplating requires real-time control of temperature, current density, pH levels, and chemical composition across multiple tanks. One miscalibration cascades into waste and equipment damage. We engineered redundancy into every layer—from sensor input validation to fail-safe switching protocols. The system communicates via standard APIs, integrates with existing PLCs, and logs everything in a transparent database. No black boxes. No mystery bugs that only the vendor understands.

But building the software solved only half the puzzle. The real bottleneck? **We needed a manufacturing partner willing to take a risk on open-source SCADA.** That's where the partnership proposal came in. We approached leading electroplating equipment manufacturers with a simple offer: *your facility becomes our proof of concept*. You get a turnkey system that's already proven. We get the real-world validation and deployment case study we desperately need.

The economics are compelling. Traditional vendors charge licensing fees and lock customers into service contracts. Our model flips that—the software is free and open. Manufacturers profit through independence, customization freedom, and the knowledge that their investment in process optimization stays *their* investment, not licensed intellectual property they'll lose if the vendor goes under.

What we're proposing isn't just a technical upgrade; it's a structural shift. One coating line becomes two. Two become ten. Suddenly, the electroplating industry has options. That's the revolution we're building.

---

*The glass isn't half-full or half-empty—it's twice as big as it needs to be. Same with proprietary SCADA: oversized prices for undercapacity innovation.* 😄

Mar 18, 2026
New Feature · speech-to-text

Choosing the Right Whisper Model When Every Millisecond Counts

I was deep in the weeds of a Speech-to-Text project when a comment came in: *"Have you tested the HuggingFace Whisper large-v3 Russian finetuned model?"* It was a fair question. The model showed impressive metrics—6.39% WER on Common Voice 17, significantly beating the original Whisper's 9.84%. On paper, it looked like a slam dunk upgrade.

So I did what any engineer should: I dug into the actual constraints of what we were building. The project had a hard requirement I couldn't negotiate around: **sub-one-second latency for push-to-talk input**. That's not "nice to have"—that's the user experience. The moment speech recognition lags behind what someone just said, the interface feels broken.

I pulled the specs. The finetuned model is based on Whisper large-v3, which means it inherited the same 3 GB footprint and 1.5 billion parameters. A finetuning job doesn't shrink the model; it only adjusts weights. On my RTX 4090 test rig, the original large-v3 was clocking 2.30 seconds per utterance. The Russian finetuned version? Same architecture, same inference time ballpark. On CPU? 10–15 seconds. Completely out of bounds.

Meanwhile, I'd already benchmarked **GigaAM v3-e2e-rnnt**, a smaller RNN-T model purpose-built for low-latency scenarios. It was hitting 3.3% WER on my actual dataset—only half a percentage point worse than the finetuned Whisper—and doing it in 0.66 seconds on CPU. Even accounting for the fact that the finetuned Whisper might perform better on my data than on Common Voice, I was still looking at roughly **3–4× the latency for marginal accuracy gains**.

This is where real-world constraints collide with benchmark numbers. The HuggingFace model is genuinely good work—if your use case is batch transcription with GPU available, or offline processing where speed doesn't matter, it's worth a serious look. But for interactive, real-time push-to-talk? **Smaller, purpose-built models win on both accuracy and speed.**

I wrote back thanking them for the suggestion, explained the tradeoffs, and stayed with GigaAM. No regrets. Sometimes the best engineering decision isn't picking the flashiest model—it's picking the one that actually fits your constraints.

And hey, speaking of models and networks—I've got a really good UDP joke, but I'm not sure you'll get it. 😄
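The decision hinged on per-utterance latency, which can be measured the same way for any candidate model. Below is a minimal timing harness with a stub standing in for the actual `transcribe` call (GigaAM or a Whisper variant); the one-second budget comes from the post.

```python
import time
import statistics

def measure_latency(transcribe, audio_chunks, budget_s: float = 1.0):
    """Time each transcription call and check whether the per-utterance
    latency stays inside the real-time budget."""
    timings = []
    for chunk in audio_chunks:
        start = time.perf_counter()
        transcribe(chunk)
        timings.append(time.perf_counter() - start)
    p95 = statistics.quantiles(timings, n=20)[18]   # ~95th percentile
    print(f"median={statistics.median(timings):.2f}s  p95={p95:.2f}s  "
          f"budget={'OK' if p95 <= budget_s else 'EXCEEDED'}")
    return timings

# `transcribe` would wrap the real model call; a sleep stub stands in here.
measure_latency(lambda chunk: time.sleep(0.05), [b""] * 40)
```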

Mar 4, 2026
New Feature · borisovai-site

Tuning Whisper for Russian: The Real-Time Recognition Challenge

I was deep in the ScribeAir project—building real-time speech recognition that had to work in under a second per audio chunk. The bottleneck wasn't where I expected it. Everyone kept pointing me toward bigger, better models. Someone mentioned `whisper-large-v3-russian` from Hugging Face, finetuned on Common Voice 17.0, with impressive WER improvements (9.84% down to 6.39%). Sounds like a slam dunk, right? Better accuracy, Russian-optimized, problem solved.

But here's where the constraints bit back. The full `whisper-large-v3` model is 1.5B parameters. On CPU inference, that's not a milliseconds problem—it's a seconds problem. I had a hard real-time budget: roughly **1 second per audio chunk**. The finetuned Russian model, while phenomenal for accuracy, didn't magically shrink. It was still the same size under the hood, just with weights adjusted for Cyrillic phonetics and Russian linguistic patterns. No distillation, no architecture compression—just better training data.

I had to make a choice: chase the accuracy dragon or respect the physics of the system. That's when I pivoted to **distil-whisper**. It's radically smaller—a genuine distillation of the original Whisper architecture, stripped down to fit the real-time constraint. The tradeoff was obvious: I'd lose some of that Russian-specific fine-tuning, but I'd gain the ability to actually ship something that processes audio in real time on consumer hardware.

The decision crystallized something I'd been wrestling with: **in production systems, the perfect model that can't run fast enough is just as useless as a broken model.** The finetuned Russian Whisper is genuinely impressive research—it shows what's possible when you invest in language-specific training. But it lives in a different problem space than ScribeAir. If I were building offline batch transcription, a content moderation service, or something where latency wasn't the primary constraint, that Russian finetuned model would be the obvious choice. For real-time streaming, where every millisecond counts and the user is waiting for output *now*, distil-whisper was the practical answer.

The lesson stuck with me: **don't optimize for the metrics you *wish* mattered—optimize for the constraints that actually exist.** Accuracy is beautiful. Speed is infrastructure. Both matter. But in production, speed often wins.
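For reference, loading a distilled Whisper checkpoint through the Hugging Face ASR pipeline takes only a few lines; the checkpoint id is my assumption of the distil-whisper release in question, and `sample_chunk.wav` is a placeholder for a push-to-talk chunk.

```python
from transformers import pipeline

# Assumed checkpoint id; distil-whisper trades some accuracy for a model
# small enough to keep per-chunk latency near the 1-second budget.
asr = pipeline("automatic-speech-recognition",
               model="distil-whisper/distil-large-v3")

# Each push-to-talk chunk gets transcribed as soon as it arrives.
print(asr("sample_chunk.wav")["text"])
```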

Mar 4, 2026
New Feature · llm-analisis

The Hidden Peak: Why We Almost Missed Our Best Accuracy Score

I was staring at `results.json` when something felt wrong. Our **LLM Analysis** project had just completed Phase 29b, and the final accuracy number looked... unremarkable. But I'd noticed something in the intermediate logs that wouldn't leave me alone: a spike at **79.3%** that vanished by the end of the run.

The culprit? Our `eval_gsm8k()` function was only recording the final accuracy number. We'd built the entire evaluation pipeline around a single verdict—the last checkpoint, the ultimate truth. But mathematical models don't work that way. They *plateau*, they *spike*, they *crash*. We were missing the entire story.

Here's what happened: I was reviewing the stdout logs (the ones we don't normally save) and spotted that our curriculum-trained variant hit 79.3% accuracy on 150 GSM8K tasks—a **+4 percentage points improvement** over any previous experiment on the same checkpoint. That's massive in the LLM world. But because we only saved the final number, the `results.json` looked like just another run. The peak was invisible.

The fix seemed obvious in hindsight. I updated the `eval_gsm8k()` function across both `train_exp29a.py` and `train_exp29b.py` to return not just the final accuracy, but an **`intermediate` array**—accuracy measurements every 50 tasks—and a **`peak` object** capturing the maximum accuracy and when it occurred. Same function, smarter output.

But this wasn't really a coding fix. It was a *philosophy* shift. We'd been thinking like engineers—*optimize for the final metric*—when we should've been thinking like researchers—*track the trajectory*. The intermediate numbers tell you *which approach works for which problem subset*. They tell you whether a method is stable or lucky. They tell you *why* one approach outperforms another.

I added a critical note to `MEMORY.md`: **"КРИТИЧНО: Промежуточные eval данные"** (Critical: Intermediate eval data). Because this will happen again. Someone will optimize for the headline number and miss the real insight hiding in the curves.

The irony? The joke in the debugging world goes: *"The six stages are: that can't happen, that doesn't happen on my machine, that shouldn't happen, why does that happen, oh I see, how did that ever work?"* We'd been stuck at stage 3—thinking our 79.3% spike "shouldn't happen"—when we should've been asking stage 4: why *does* it happen? The curriculum data is giving us a signal on specific task subsets. Some problems love structure; others suffer from it. That's not noise. That's the answer.

Now we move to Phase 29c with this knowledge: **track everything, trust nothing at face value, and always ask what the numbers are really hiding.**
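A minimal sketch of the reshaped return value: the same evaluation loop, but recording an `intermediate` checkpoint every 50 tasks plus a `peak` object. `solve_and_check` stands in for the real solver; the field names mirror the post, everything else is illustrative.

```python
import random

def eval_gsm8k(tasks, solve_and_check, checkpoint_every: int = 50) -> dict:
    """Evaluate and keep the whole trajectory, not just the final number.
    `solve_and_check(task) -> bool` stands in for the real solver."""
    correct, intermediate = 0, []
    for i, task in enumerate(tasks, start=1):
        correct += bool(solve_and_check(task))
        if i % checkpoint_every == 0 or i == len(tasks):
            intermediate.append({"tasks": i, "accuracy": correct / i})
    peak = max(intermediate, key=lambda p: p["accuracy"])
    return {
        "final_accuracy": intermediate[-1]["accuracy"],
        "intermediate": intermediate,                       # every 50 tasks
        "peak": {"accuracy": peak["accuracy"], "at_task": peak["tasks"]},
    }

# Toy run: 150 pseudo-tasks where the middle stretch is easier, so the
# peak shows up mid-run and would be invisible in the final number alone.
random.seed(0)
report = eval_gsm8k(range(150),
                    lambda t: random.random() < (0.85 if 50 <= t < 100 else 0.72))
print(report["peak"], round(report["final_accuracy"], 3))
```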

Mar 4, 2026
New Feature · llm-analisis

The 79.3% Peak We Almost Missed: Why Intermediate Data Matters

We were drowning in numbers. **Phase 29a** of our LLM curriculum learning experiment had completed, and like always, I opened `results.json` to check the final accuracy score. **79.3%** jumped out at me—a stunning improvement over the baseline. I felt the familiar rush: breakthrough moment.

Then reality hit differently than expected. The problem wasn't that we *got* 79.3%. The problem was that we *almost didn't see it*.

Here's what happened: our `eval_gsm8k()` function was printing intermediate results every 50 GSM8K problems directly to stdout. The model achieved **119 correct answers out of 150** on the curriculum-selected subset—a crisp 79.3%. But the function only returned a final aggregate number to the results JSON. We had metrics, sure, but we had architecture blindness.

The curriculum learning pipeline was evaluating on curated problem sets, reporting aggregate accuracy, and we were reading the digest instead of analyzing the signal. When I dug into the stdout logs afterward, the pattern became visible: the curriculum data helped dramatically on certain problem categories while actively *harming* performance on others. The remaining 350 general GSM8K problems showed only 70.3% accuracy. Curriculum isn't magic—it's direction. And we weren't capturing the directional information.

**The fix was architectural, not mathematical.** I refactored `eval_gsm8k()` to return an `intermediate` array alongside the final result. Now every 50-problem checkpoint gets logged as a structured object: problem count, accuracy at that point, and the precise subset being evaluated. No more stdout archaeology. No more reading printed logs like ancient texts.

This isn't just about not missing peaks. It's about being able to *explain* them. When curriculum learning works, you want to know *which parts* worked. When it fails, you need the granular data to debug. We were optimizing blind, tweaking parameters based on a single final number while the real story—the inflection points, the divergence between curriculum and general problems—lived only in console output that scrolled past and vanished.

The joke among engineers is that four of us pile into a car that won't start. The IT engineer's solution? "Get out and get back in." Sometimes that's exactly what debugging requires: stepping out, restarting, and changing where you're looking. We weren't looking at intermediate checkpoints. Now we are.
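Once the checkpoints are structured, the divergence described above falls out of a few lines of aggregation. The record fields below are illustrative; the 119/150 and 246/350 splits reproduce the post's 79.3% and 70.3% figures rather than the project's actual schema.

```python
from collections import defaultdict

# Structured checkpoint records, tagged with the subset they evaluate.
checkpoints = [
    {"subset": "curriculum", "tasks": 150, "correct": 119},
    {"subset": "general",    "tasks": 350, "correct": 246},
]

by_subset = defaultdict(lambda: {"tasks": 0, "correct": 0})
for cp in checkpoints:
    by_subset[cp["subset"]]["tasks"] += cp["tasks"]
    by_subset[cp["subset"]]["correct"] += cp["correct"]

for subset, agg in by_subset.items():
    acc = agg["correct"] / agg["tasks"]
    print(f"{subset:10s} {acc:.1%}")   # curriculum ≈ 79.3%, general ≈ 70.3%
```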

Mar 4, 2026