Independent AI engineer
AI engineering for production systems.
I work with founders and engineering teams shipping LLM-powered products. The focus is reliability: evaluations, retrieval, agent workflows, and the application-layer engineering that takes a product from demo to something you can put in front of paying users.
What I work on
Production AI, four angles.
Evaluations
Custom eval harnesses, error analysis, LLM-as-judge calibrated against human labels. The discipline that turns demos into systems you can stake a roadmap on.
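A minimal sketch of what that looks like in practice. The EvalCase fields, the judge prompt, and the call_model helper are illustrative placeholders rather than a fixed recipe; the point is that the judge's agreement with human labels gets measured before the judge is trusted on unlabeled traffic.

```python
from dataclasses import dataclass

def call_model(prompt: str) -> str:
    """Placeholder for whatever model client you use (hypothetical helper)."""
    raise NotImplementedError

@dataclass
class EvalCase:
    prompt: str
    reference: str     # what a good answer must cover, written by a human
    human_label: bool  # human pass/fail verdict, used to calibrate the judge

def llm_judge(prompt: str, answer: str, reference: str) -> bool:
    """Ask a judge model for a pass/fail verdict against the reference."""
    verdict = call_model(
        f"Question: {prompt}\nReference: {reference}\nAnswer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def judge_agreement(cases: list[EvalCase], answers: list[str]) -> float:
    """Fraction of cases where the judge matches the human label.

    Only lean on the judge for unlabeled traffic once this number clears
    your bar."""
    hits = sum(
        llm_judge(c.prompt, a, c.reference) == c.human_label
        for c, a in zip(cases, answers)
    )
    return hits / len(cases)
```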
Retrieval (RAG)
Hybrid search, reranking, query routing, synthetic eval data. Retrieval that holds up on inputs you have not seen yet.
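One concrete example of the hybrid side: reciprocal rank fusion is a common way to merge keyword and vector rankings before a reranker sees them. The document ids below are made up, and k=60 is the conventional constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids (e.g. BM25 and vector search results)
    into a single ranking by summing 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from a keyword index and a vector index.
bm25_hits   = ["doc_7", "doc_2", "doc_9"]
vector_hits = ["doc_2", "doc_4", "doc_7"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits])[:3])
```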
Agent workflows
State machines and graphs for long-running tasks. Context engineering for agents that do not drift on day three.
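A stripped-down sketch of the state-machine shape, with placeholder transitions where real planning, tool calls, and review would go. The state names and step budget are assumptions for illustration.

```python
from enum import Enum, auto

class State(Enum):
    PLAN = auto()
    ACT = auto()
    REVIEW = auto()
    DONE = auto()

def run_agent(task: str, max_steps: int = 20) -> str:
    """Drive the loop through explicit states so every transition is
    observable, bounded, and easy to log."""
    state, transcript = State.PLAN, []
    for _ in range(max_steps):
        if state is State.PLAN:
            transcript.append(f"plan: break down {task!r}")  # planning model call goes here
            state = State.ACT
        elif state is State.ACT:
            transcript.append("act: tool call and result")   # tool execution goes here
            state = State.REVIEW
        elif state is State.REVIEW:
            # a real review step decides whether to loop back to PLAN or finish
            state = State.DONE
        elif state is State.DONE:
            return "\n".join(transcript)
    return "stopped: step budget exhausted"
```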
The boring leverage
Prompt caching, model routing, structured outputs, observability. The invisible engineering that moves production cost and reliability in the right direction.
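Two small examples of what that looks like in code: a routing rule that keeps easy traffic on a cheap model, and a guard that only accepts structured output it can actually parse. The model identifiers, length cutoff, and required keys are placeholders.

```python
import json
from typing import Optional

def route_model(prompt: str, needs_reasoning: bool) -> str:
    """Send easy traffic to the cheap model and hard traffic to the strong one.
    The model names and length threshold here are placeholders."""
    if needs_reasoning or len(prompt) > 4000:
        return "strong-model"
    return "cheap-model"

def parse_structured(raw: str, required_keys: set[str]) -> Optional[dict]:
    """Accept a structured output only if it is valid JSON containing every
    required key; return None to signal a retry or fallback."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required_keys <= data.keys():
        return None
    return data

# Usage: retry (or fall back to the stronger model) until the guard passes.
# result = parse_structured(raw_response, {"summary", "confidence"})
```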
Writing
Recent entries
Build the eval harness first
An evaluation harness is not exotic infrastructure for mature LLM applications. It is the first thing to build, before retrieval, before agent loops, before any optimisation. The minimum viable version fits in an afternoon.
RAG beyond embeddings: the techniques that move quality numbers
Vector search alone is not enough for production RAG. The techniques that consistently improve retrieval quality — hybrid search, rerankers, query routing, synthetic eval data — and when each one is worth the engineering cost.
Reliability is the pitch
The dominant pain in production LLM applications is reliability — systems that produce trustworthy answers consistently. Cost optimisation is a downstream consequence of doing reliability well, not a competing pitch.
Get in touch
Have a system that needs to actually work?
Selective engagements with founders and engineering teams building LLM-powered products. Reach out if you'd like to talk about what you're shipping.
work.shivamsharma@zohomail.in →