From 20M Records/Day to AI Agents: What Backend Engineers Get Wrong About LLMs

I spent my first three years at Parcel Perform building data pipelines: file ingestion stacks, Kafka producers and consumers, Apache Flink streaming jobs, notification systems pushing 20M+ records a day. Good, solid engineering work. Then in late 2025 I moved to leading a GenAI product and discovered that almost everything I'd internalized about how to build reliable systems needed to be re-examined.

This isn't a post about how LLMs are magic. It's about the specific mental model mismatches that trip up experienced backend engineers when they first build with LLMs — and how to recalibrate.

Mental Model Mismatch #1: Latency

In backend systems, 100ms is slow. We optimize query plans, add Redis caching layers, tune Kafka consumer throughput. The goal is low-latency, high-throughput, predictable response times.

LLM calls take 1-10 seconds. That's not a bug; it's the baseline. The first mistake I made was trying to apply pipeline-style latency optimization to an LLM-heavy workflow. I spent a week investigating whether we could batch prompt calls to reduce per-call overhead. The real bottleneck wasn't per-call overhead at all: a 3-second inference was simply the floor.

The recalibration: design UX and SLAs around LLM latency from the start. Use streaming responses where possible. Move synchronous LLM calls to async background jobs for anything that isn't user-facing real-time. Accept that a 5-second sales email draft is fine if it's better than what a human produces in 30 minutes.
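A minimal sketch of the "async background job" idea, using only the standard library. The `draft_email` coroutine is a hypothetical stand-in for the real LLM call (simulated with a short sleep), and the queue-plus-workers layout is one possible shape for this, not a description of our production setup.

```python
import asyncio


async def draft_email(lead_id: str) -> str:
    """Hypothetical stand-in for a slow LLM call."""
    await asyncio.sleep(0.05)  # simulates multi-second inference
    return f"Draft for {lead_id}"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    # Pull lead IDs off the queue until cancelled.
    while True:
        lead_id = await queue.get()
        try:
            results[lead_id] = await draft_email(lead_id)
        finally:
            queue.task_done()


async def run_background_drafts(lead_ids: list[str], concurrency: int = 3) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    for lead_id in lead_ids:
        queue.put_nowait(lead_id)
    # A bounded worker pool keeps concurrent LLM calls under provider rate limits.
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued draft has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

The caller enqueues work and returns immediately in a real service; here `asyncio.run(run_background_drafts([...]))` simply waits for the whole batch so the sketch is easy to try.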

Mental Model Mismatch #2: Retry Logic

In Kafka consumers, we have dead-letter queues, exponential backoff, and exactly-once semantics. Retry is well-understood: if a message fails, retry it with the same input and eventually it succeeds or goes to DLQ.

LLM retries are different. An LLM call that fails due to rate limiting? Retry is fine. An LLM call that produces bad output (hallucinated data, wrong format, off-topic content)? Retrying with the same prompt will often reproduce the same kind of failure, because the problem sits in the prompt or the context rather than in the transport. The retry strategy has to include prompt variation, temperature adjustment, or a fallback model.

We implemented a two-stage retry: first retry identical (catches transient errors), second retry with a refined prompt that explicitly states what was wrong with the previous output. This cut our "permanently failed" email drafts by 60%.
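The two-stage retry can be sketched roughly as follows. `TransientError`, `call_llm`, and `validate` are hypothetical names introduced for illustration; `validate` is assumed to return `None` for an acceptable draft and a short description of the problem otherwise.

```python
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical: rate limit, timeout, or another retryable transport error."""


def generate_with_retry(
    prompt: str,
    call_llm: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return an acceptable draft, or None if every attempt fails."""
    last_problem: Optional[str] = None

    # Initial attempt plus stage 1: one identical retry (catches transient errors).
    for _ in range(2):
        try:
            output = call_llm(prompt)
        except TransientError:
            continue
        last_problem = validate(output)
        if last_problem is None:
            return output

    # Stage 2: tell the model what was wrong with the previous output.
    final_prompt = prompt
    if last_problem is not None:
        final_prompt = (
            f"{prompt}\n\nA previous draft was rejected because: {last_problem}. "
            "Avoid this problem in the new draft."
        )
    try:
        output = call_llm(final_prompt)
    except TransientError:
        return None
    return output if validate(output) is None else None
```

A `None` result is what would count as a "permanently failed" draft in this sketch; a real system might route it to a fallback model or a human instead.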

Mental Model Mismatch #3: Throughput vs Quality

High-throughput pipeline thinking: push more records per second. Optimize for volume. Use parallelism to scale.

LLM thinking: quality per inference matters more than throughput. A batch of 10,000 low-quality AI emails will damage your brand. A batch of 1,000 high-quality ones moves the sales pipeline. I had to completely reframe the success metric from "records processed" to "quality score × volume".

Practically: we added an LLM-as-judge scoring step after each email draft. Drafts below a quality threshold get regenerated or flagged for human review rather than sent. This slowed throughput by 20% and improved campaign reply rate by 3x.
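A simplified sketch of that gating step. The `judge` and `regenerate` callables, the 0.7 threshold, and the single regeneration attempt are all assumptions made for illustration; in practice `judge` would wrap an LLM call that returns a score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    action: str   # "send" or "human_review"
    draft: str
    score: float


def quality_gate(
    draft: str,
    judge: Callable[[str], float],
    regenerate: Callable[[str], str],
    threshold: float = 0.7,
    max_regenerations: int = 1,
) -> GateResult:
    """Score a draft; regenerate below-threshold drafts, then flag for review."""
    score = judge(draft)
    attempts = 0
    while score < threshold and attempts < max_regenerations:
        draft = regenerate(draft)
        score = judge(draft)
        attempts += 1
    # Anything still below the threshold goes to a human instead of being sent.
    action = "send" if score >= threshold else "human_review"
    return GateResult(action, draft, score)
```

Capping regenerations matters for cost: every extra loop iteration is two more LLM calls (one to regenerate, one to judge).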

Prompt Versioning: The Lesson I Learned Too Late

In backend systems, code is versioned, deployed, and rollback-able. Prompt changes felt lightweight — just edit a string. We treated prompts like config values, not like code.

Three months in, a prompt change for the subject line template caused a 40% drop in email open rates. We had no way to quickly identify *which* prompt change caused it because the prompts weren't versioned. We ended up building a prompt versioning system from scratch in an emergency sprint.

The right approach from day one: store prompts in a database with version IDs, link each agent run to the prompt version it used, and treat prompt changes as deployments — with review and staged rollout. We now use a prompt_version_id field in every LangSmith trace and ELK log entry.

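# Prompt versioning pattern we settled on
from datetime import datetime

from pydantic import BaseModel


class PromptVersion(BaseModel):
    version_id: str          # e.g. "subject-v2.3"
    template: str            # the prompt text itself
    model: str               # model this version runs against
    temperature: float
    created_at: datetime
    active: bool

# Every agent run records which version was used.
# `prompt` is the PromptVersion loaded for the run; `campaign_id` comes from the caller.
run_metadata = {
    "prompt_version_id": prompt.version_id,
    "campaign_id": campaign_id,
    "model": prompt.model,
}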
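One way to approach the "staged rollout" part is deterministic bucketing: hash a stable key and send only a percentage of traffic to the new prompt version. This is a sketch of that general technique, not a description of our rollout tooling; `pick_prompt_version` and its parameters are hypothetical.

```python
import hashlib


def pick_prompt_version(
    campaign_id: str,
    stable_version_id: str,
    candidate_version_id: str,
    rollout_percent: int,
) -> str:
    """Route a fixed share of campaigns to the candidate prompt version."""
    digest = hashlib.sha256(campaign_id.encode("utf-8")).digest()
    # Map the campaign to a bucket in [0, 100); the same campaign always
    # lands in the same bucket, so its prompt version does not flip between runs.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return candidate_version_id if bucket < rollout_percent else stable_version_id
```

Raising `rollout_percent` from 0 toward 100 widens exposure gradually, and setting it back to 0 acts as a rollback.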

ELK Cost Monitoring: Track LLM Spend Like DB Query Cost

We track database query performance obsessively — slow query logs, explain plans, index coverage. LLM inference costs money per token, but most teams treat it as an opaque line item until the AWS bill arrives.

We instrument every LLM call with input tokens, output tokens, model ID, and campaign ID, and ship these to ELK. This gives us: cost per campaign, cost per email draft, cost by model version, and — most usefully — cost anomaly detection. When a prompt change causes input tokens to balloon (usually a bug in context construction), we catch it within hours, not at month-end billing.
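A rough sketch of what one such log event and a naive anomaly check could look like. The price table, the `example-model` ID, the field names, and the "3x the recent median" rule are placeholders chosen for illustration; real per-token rates depend on the provider and model.

```python
import json
from statistics import median

# Hypothetical USD prices per 1,000 tokens.
PRICE_PER_1K = {"example-model": {"input": 0.003, "output": 0.015}}


def llm_cost_event(
    model_id: str,
    campaign_id: str,
    prompt_version_id: str,
    input_tokens: int,
    output_tokens: int,
) -> dict:
    """Build one structured log event for a single LLM call."""
    price = PRICE_PER_1K[model_id]
    cost = (
        input_tokens / 1000 * price["input"]
        + output_tokens / 1000 * price["output"]
    )
    return {
        "model_id": model_id,
        "campaign_id": campaign_id,
        "prompt_version_id": prompt_version_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
    }


def is_token_anomaly(
    input_tokens: int, recent_input_tokens: list[int], factor: float = 3.0
) -> bool:
    """Flag a call whose input size far exceeds the recent median."""
    if not recent_input_tokens:
        return False
    return input_tokens > factor * median(recent_input_tokens)


event = llm_cost_event("example-model", "campaign-1", "subject-v2.3", 2000, 500)
print(json.dumps(event))  # one JSON line per call, ready for a log shipper
```

Aggregating `cost_usd` by `campaign_id` or `prompt_version_id` in the log store then gives the per-campaign and per-version views described above.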

The Right Mental Model for LLM Systems

After a year working at both ends — 20M records/day pipelines and production AI agents — here's how I now think about LLM systems: they're probabilistic services with expensive, slow calls, where output quality varies and failures are often semantic, not just technical.

That means: optimize for quality over throughput, instrument everything (tokens, latency, quality scores, cost), version your prompts like code, design retry strategies that account for semantic failure, and accept that some percentage of outputs will need human review — that's not a bug in your system, it's just the nature of the technology.

Conclusion

The skills that made me a good backend engineer (systems thinking, fault tolerance design, observability, clean interfaces) all transfer to AI agent work. But the intuitions built from years of millisecond-latency, high-throughput, deterministic systems will mislead you if you apply them without adaptation. The engineers who do this well aren't the ones who forget their backend experience. They're the ones who know when to apply it and when to recalibrate.