The Hard Parts of AI Nobody Warns You About
Context engineering is just the beginning of building real AI features
Building and shipping real-world AI features in products people use
In my last post, I argued for calling it "context engineering" instead of "prompt engineering"—but that only scratched the surface. There's an art and science to providing the right context, and it turns out context engineering gets you maybe 20% of the way to a real AI feature.
After three years of building AI features at CoderPad, I've learned that most people think about AI development completely backwards. They obsess over prompts and evals while ignoring the real work: treating AI features like data engineering problems.
The actual complexity isn't in crafting better prompts—it's in:
Breaking problems down into smaller components
Managing context windows properly (both to avoid hallucinations and maintain speed)
Balancing the three-way tension between accuracy, latency, and cost
The infrastructure nightmares nobody mentions
Building on someone else's API—especially bleeding-edge AI APIs—brings a whole category of problems that mature industries solved decades ago:
Cost spirals: Token counting becomes a core product concern. Every feature request becomes a cost-benefit analysis. "Should we summarize the full transcript or just the highlights?" isn't just a UX question—it's a budget question.
Rate limiting roulette: Your feature works great until it doesn't. Suddenly you're implementing retry logic, backoff strategies, and queue management for what should be a simple API call (there's a sketch of this after the list).
Provider reliability: In mature industries, APIs don't just go down. But with frontier AI models under massive load? We've built fallback strategies, multiple provider integrations, and graceful degradation modes that we never needed for traditional services.
Model version chaos: Your carefully tuned prompts break when the provider updates their model. You're constantly monitoring for output drift and rebuilding context strategies.
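To make the retry and fallback points concrete, here's a minimal sketch using the OpenAI Python SDK. The model names, retry counts, and backoff schedule are illustrative rather than recommendations, and real code would be pickier about which errors are worth retrying:

```python
import random
import time

from openai import APIConnectionError, OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def call_with_retries(messages, model="gpt-4o", max_attempts=5):
    """Call the chat completions endpoint, backing off exponentially on transient failures."""
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except (RateLimitError, APIConnectionError):
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(2 ** attempt + random.random())


def call_with_fallback(messages):
    """Graceful degradation: try the primary model, then a cheaper fallback."""
    try:
        return call_with_retries(messages, model="gpt-4o")
    except Exception:
        return call_with_retries(messages, model="gpt-4o-mini", max_attempts=2)
```

Falling back to a cheaper model is one form of graceful degradation; another is returning a "try again in a moment" state to the UI instead of a hard error.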
This isn't theoretical—every AI feature team deals with this stuff daily. But somehow it never makes it into the "how to build with AI" tutorials.
The real architecture nobody talks about
And that's before you even get to the complex stuff. Andrej Karpathy, who coined the term "vibe coding," goes deeper into what building full AI applications actually requires:
On top of context engineering itself, an LLM app has to:
break up problems just right into control flows
pack the context windows just right
dispatch calls to LLMs of the right kind and capability
handle generation-verification UIUX flows
a lot more - guardrails, security, evals, parallelism, prefetching, ...
So context engineering is just one small piece of an emerging thick layer of non-trivial software that coordinates individual LLM calls (and a lot more) into full LLM apps. The term "ChatGPT wrapper" is tired and really, really wrong.
That last line hits hard. Calling production AI features "ChatGPT wrappers" is like calling a Tesla a "battery wrapper." Technically true, but it misses the entire engineering feat.
What production AI actually looks like
Let me show you what this means in practice. Here's how we built AI Assist three years ago—one of CoderPad's AI features that helps candidates during coding interviews. We've had real users and feedback ever since, unlike many companies just getting started with AI.
The simple version everyone imagines:
User asks: "How does this app work?" (where ‘this’ references the application running in our environment)
Send the question to the Responses API (GPT-4o today; back then it was GPT-3 or 3.5)
Display response
Done!
The actual implementation:
Step 1: Context intelligence. Before we even think about answering the question, we run the conversation through a specialized model to figure out what files from the codebase are relevant. We don’t have the luxury of time in a live interview to have the interviewer or candidate try to define which files should or shouldn’t be included in the context. We need to manage all of that quickly, on the fly.
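This isn't our actual code, but a minimal sketch of what that pre-call can look like, assuming the OpenAI Python SDK; the pick_relevant_files helper, the prompt, and the model choice are all placeholders:

```python
import json

from openai import OpenAI

client = OpenAI()


def pick_relevant_files(question: str, file_paths: list[str], max_files: int = 5) -> list[str]:
    """Ask a small, fast model which files matter for this question."""
    prompt = (
        "You route files for a coding assistant. Given a user question and a list "
        "of file paths, return a JSON array (no prose) of the paths most relevant "
        f"to answering the question, at most {max_files} items.\n\n"
        f"Question: {question}\n\nFiles:\n" + "\n".join(file_paths)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheap, fast model; the exact choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return json.loads(response.choices[0].message.content)[:max_files]
    except (json.JSONDecodeError, TypeError):
        return []  # no extra context beats failing the whole request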
Step 2: Data preprocessing. We can't just dump the entire codebase into context. We have file tree limits, token budgets, and relevance filtering. The preprocessing step batches related files, extracts the most relevant sections, and structures everything so the main model can actually use it.
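Here's a hedged sketch of the token-budget side of that step, using tiktoken for counting; the encoding name, the 6,000-token budget, and the greedy packing strategy are assumptions for illustration:

```python
import tiktoken

# The encoding name is an assumption; pick the one that matches your model.
enc = tiktoken.get_encoding("cl100k_base")


def pack_files_into_budget(files: dict[str, str], token_budget: int = 6000) -> str:
    """Greedily pack already-ranked files into the prompt until the budget runs out.

    `files` maps path -> contents, ordered most to least relevant.
    """
    parts, used = [], 0
    for path, contents in files.items():
        chunk = f"// File: {path}\n{contents}\n"
        cost = len(enc.encode(chunk))
        if used + cost > token_budget:
            continue  # skip files that would blow the budget (or truncate them instead)
        parts.append(chunk)
        used += cost
    return "\n".join(parts)
```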
Step 3: The actual response. Only now do we call the main LLM with a carefully constructed context (sketched in code after this list) that includes:
System instructions
Chat history (cleaned and formatted)
Relevant code files (preprocessed and structured)
The current question or user prompt
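Assembled, that final call can look something like this sketch (again assuming the OpenAI Python SDK; the system prompt and helper shapes are placeholders):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_INSTRUCTIONS = "You are a coding assistant embedded in a live interview..."  # placeholder


def answer_question(chat_history: list[dict], code_context: str, question: str) -> str:
    """Assemble the context in the order listed above, then make a single call."""
    messages = [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        *chat_history,  # already cleaned and formatted as {"role": ..., "content": ...} dicts
        {
            "role": "user",
            "content": f"Relevant files:\n{code_context}\n\nQuestion: {question}",
        },
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```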
Step 4: Post-processing and UI. The response gets checked for code blocks, formatted for our interface, and sometimes run through additional validation steps.
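A purely illustrative sketch of the code-block check, assuming the model answers in markdown-style fenced blocks:

```python
import re

# Matches fenced code blocks in the model's markdown-style output.
CODE_BLOCK = re.compile(r"`{3}(\w+)?\n(.*?)`{3}", re.DOTALL)


def postprocess(raw: str) -> dict:
    """Split the raw response into prose and code blocks so the UI can render each part."""
    blocks = [
        {"language": lang or "plaintext", "code": code.strip()}
        for lang, code in CODE_BLOCK.findall(raw)
    ]
    prose = CODE_BLOCK.sub("", raw).strip()
    return {"prose": prose, "code_blocks": blocks}
```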
This isn't a "wrapper"—it's a custom AI application with multiple layers of calls, data engineering, and sophisticated orchestration.
For a second example, you can look at our Interview summaries
Users expect a clean, organized summary of a 45-minute coding interview or a 2-hour async project. Here's what actually happens:
Data preprocessing: Identify chapter breaks in the transcript, translate our collaborative editor's change history into LLM-readable code diffs, batch related changes together
Parallel processing: Run different specialized prompts for different sections (intro summary, coding challenges, candidate performance analysis), as sketched in code after this list
Context management: Each step gets exactly the right data—intro timestamps get conversation snippets, coding sections get batched code changes
Assembly and validation: Combine outputs, check for consistency, handle edge cases where the LLM might have missed something
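To illustrate the parallel-processing step, here's a minimal sketch using the async OpenAI client and asyncio.gather; the section names and prompts are stand-ins for much more detailed ones:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

# Stand-in prompts; the real ones are far more detailed and carefully tuned.
SECTION_PROMPTS = {
    "intro_summary": "Summarize the introductory conversation below.\n\n{data}",
    "coding_challenges": "Describe what the candidate built, given these code diffs.\n\n{data}",
    "performance": "Assess the candidate's approach based on the material below.\n\n{data}",
}


async def summarize_section(name: str, data: str) -> tuple[str, str]:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": SECTION_PROMPTS[name].format(data=data)}],
    )
    return name, response.choices[0].message.content


async def build_summary(section_data: dict[str, str]) -> dict[str, str]:
    """Run each section's prompt concurrently, then assemble the pieces."""
    results = await asyncio.gather(
        *(summarize_section(name, data) for name, data in section_data.items())
    )
    return dict(results)
```

Running the sections concurrently keeps total latency close to the slowest single call rather than the sum of all of them.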
Once again, every step involves careful prompt design, but the prompts are maybe 5% of the total work. The other 95% is data engineering, orchestration, error handling, and user experience design.
The stack nobody sees
Building AI features means building an entire software stack:
Context Layer (20%): The context engineering we talked about—getting the right information to the right models. Remember, in our experience, actual prompt writing is only about 5% of the total work.
Data Engineering & Orchestration (40%): Preprocessing, structuring, cleaning data, plus breaking complex tasks into steps and managing multiple LLM calls
Product & UX Layer (30%): Making AI responses feel natural, handling loading states, managing user expectations
Infrastructure Layer (10%): Rate limiting, cost management, error handling, monitoring, security
What this means for builders
Before you even think about building the stack, you need to avoid the biggest trap: building AI features because you can, not because you should.
I've watched too many teams fall into what I call "AI theater"—shipping features that look impressive in demos but solve problems nobody actually has.
The hard truth: users don't want more AI capabilities, they want fewer steps to get their work done. Boring AI that eliminates friction beats flashy AI that impresses investors.
Here's the filter we use now: Does this remove steps from the user's workflow, or does it add complexity they have to learn? If you can't point to specific friction you're eliminating, you're probably building AI theater.
If you're building AI features, stop thinking about prompts first. Start with:
What problem are you solving? (Not "how can I use AI?" but "what specific user pain point are you addressing?")
What's your data pipeline? How will you get the right information to your LLM in the right format?
What's your UX for uncertainty? LLMs are probabilistic—how will you handle when they're wrong or unsure?
How will you iterate? Your first version will be wrong. How will you measure, learn, and improve?
What's your context strategy? Only after you've figured out the above should you think about prompts and context engineering.
How to actually start (without drowning)
Okay, so you've got a real problem and you're convinced AI can help. Here's the progression that actually works:
Phase 1: Manual prototype (Weeks 1-2). Build the dumbest possible version. Literally copy-paste user input into ChatGPT, manually format the output, and show it to users. This validates the use case before you build anything complex.
Phase 2: Simple automation (Weeks 3-6). Replace the manual steps with basic API calls. No fancy orchestration yet, just input → OpenAI → output. Focus on handling edge cases and getting the context basics right.
Phase 3: Add intelligence (Months 2-3). Now you can start building the data pipeline. Add preprocessing, better context management, maybe a second LLM call for validation. This is where you'll spend most of your time.
Phase 4: Production hardening (Months 3-6). Add retry logic, cost monitoring, rate limiting, error handling. Build the infrastructure layer that makes it actually reliable.
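As one concrete slice of that hardening work, here's a hedged sketch of per-call cost tracking built on the usage fields the chat completions API returns; the prices are illustrative and go stale quickly:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prices in USD per million tokens; check your provider's current pricing.
PRICE_PER_MILLION = {"gpt-4o": {"input": 2.50, "output": 10.00}}


def tracked_call(messages, model="gpt-4o"):
    """Make one call and record its token usage and approximate cost."""
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    price = PRICE_PER_MILLION[model]
    cost = (
        usage.prompt_tokens * price["input"] + usage.completion_tokens * price["output"]
    ) / 1_000_000
    print(f"{model}: {usage.total_tokens} tokens, ~${cost:.4f}")  # send to real metrics in production
    return response.choices[0].message.content
```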
Tools that can help:
LangChain/LangGraph for orchestrating multi-step LLM workflows
Pinecone, Weaviate, or Chroma for vector storage and retrieval
Weights & Biases or LangSmith for monitoring and evaluation
Modal or Replicate for model hosting and scaling
Reality check on timelines:
Simple feature (basic chat integration): 6-12 weeks
Complex feature (multi-step workflows): 3-6 months
Enterprise-grade system: 6-12 months
The key is resisting the urge to build everything at once. Start simple, get user feedback, then add complexity only when you need it.
The real competitive advantage
Here's what I've learned: the companies winning with AI aren't winning because they have better prompts. They're winning because they've built better systems around their LLMs.
They've figured out how to preprocess messy real-world data, how to break complex tasks into manageable steps, how to validate and improve outputs, and how to create user experiences that feel magical instead of frustrating.
Context engineering is crucial, but it's just the entry fee. The real game is building the full stack that makes AI features actually work in production.
Bottom line: If you're still thinking about "prompt engineering" or even just "context engineering," you're thinking too small. Production AI is about building sophisticated software systems that happen to include LLMs—not about getting better at talking to chatbots.
The sooner we start talking about it that way, the sooner we'll build AI features that users actually love.