In February, I attended the India AI Impact Summit in Delhi. Three days of policy talks, vendor pitches, and a lot of conversation about whether Indian AI companies could compete with frontier labs.
I came back energised. I also came back with a question I wanted to answer for myself: if I were building a document intelligence product for Indian languages right now, would I pick Sarvam Vision or OpenAI Vision?
The answer required actually building something. So I gave myself a three-day timebox and built DocuBharat — a small document intelligence tool covering 22 Indian languages. Upload a scanned land record, get back structured data. Hindi, Tamil, Bengali, Marathi, the works.
The hardest call wasn’t the frontend or the API design. It was picking the OCR layer.
Two paths, one decision:
| | OpenAI Vision | Sarvam Vision |
|---|---|---|
| Training distribution | Global, English-dominant | Indian languages, Indian documents |
| Indian language support | Handled | Built for |
| Pricing during my build | ~$0.01 per page | Free (February 2026 promo) |
| Pricing after | Per-token / per-image | ₹1.50 per page |
| Maturity | Production-stable for 2 years | Released the week before |
| Architecture | General-purpose VLM | 3B VLM + dedicated layout + reading-order modules |
| Workflow shape | Sync request/response | Async job-based |
| My prior experience | Used in other projects | First time |
I picked Sarvam.
This is the honest version of why — including the integration code, the benchmarks I don’t fully trust, and the one thing I’d do differently if I rebuilt it.
The vendor I bet on was getting publicly mocked nine months earlier
Worth knowing before you take Sarvam at face value:
In May 2025, Sarvam released Sarvam-M, their flagship 24B Indic LLM. It got fewer than a thousand Hugging Face downloads in the first few days. Menlo Ventures’ Deedy Das called the launch “embarrassing” on X. He compared it unfavourably to an open-source model trained by two Korean college students that hit ~200,000 downloads in a month. His broader cut: “No one is asking for a slightly better 24B Indic model.”
In February 2026, Sarvam quietly released Sarvam Vision — a 3B vision-language model focused entirely on Indian document parsing. Two days later, the same Deedy Das walked his criticism back in public. Called Sarvam’s Indic OCR work “really valuable.” Called the pricing “very reasonable.”
That’s the company I bet my build on. A vendor mid-turnaround, not a settled winner.
This matters because vendors in that phase have predictable properties: support is unusually responsive, pricing is unusually generous, marketing claims are unusually aggressive. The first two helped me. The third is what I’m careful about throughout the rest of this post.
The integration: what calling Sarvam Vision actually looks like
This is the part that matters if you’re considering Sarvam for a real build. The API is asynchronous and job-based, not a simple sync request. That’s the first thing that surprised me coming from OpenAI’s pattern.
The four-step workflow
Every document goes through the same four-step lifecycle:
- Create a job — get back a `job_id` and an upload target
- Upload the file — PDF or a flat ZIP of JPGs/PNGs, max 10 pages per upload, max 200 MB
- Start the job — kicks off async processing
- Poll status, then download — pull the result as Markdown, HTML, or structured JSON
The design choice makes sense once you see it: documents take real time to process — especially multi-page PDFs with complex layouts — and an async pattern lets you scale workers independently from API request handling. Your client code is heavier than an OpenAI Vision call, but more resilient.
A minimal Python integration
Here’s the simplified version of the flow I ran during the build.
Setup:
```bash
pip install requests
export SARVAM_API_KEY="your_subscription_key_here"
```
The full extraction flow:
```python
import os
import time
import requests

API_KEY = os.environ["SARVAM_API_KEY"]
BASE_URL = "https://api.sarvam.ai"
HEADERS = {"api-subscription-key": API_KEY}


def extract_document(file_path: str, language: str = "hi-IN") -> dict:
    """
    Full Sarvam Document Intelligence flow:
    create job -> upload -> start -> poll -> download
    """
    # 1. Create the job
    create_resp = requests.post(
        f"{BASE_URL}/doc-digitization/job/v1",
        headers=HEADERS,
        json={
            "language": language,
            "output_format": "md",  # 'md' or 'html'; JSON included by default
        },
        timeout=30,
    )
    create_resp.raise_for_status()
    job = create_resp.json()
    job_id = job["job_id"]
    upload_url = job["upload_url"]

    # 2. Upload the file to the signed URL
    with open(file_path, "rb") as f:
        upload_resp = requests.put(upload_url, data=f, timeout=120)
    upload_resp.raise_for_status()

    # 3. Start the job
    start_resp = requests.post(
        f"{BASE_URL}/doc-digitization/job/v1/{job_id}/start",
        headers=HEADERS,
        timeout=30,
    )
    start_resp.raise_for_status()

    # 4. Poll until the job reaches a terminal state
    while True:
        status_resp = requests.get(
            f"{BASE_URL}/doc-digitization/job/v1/{job_id}/status",
            headers=HEADERS,
            timeout=30,
        )
        status_resp.raise_for_status()
        status = status_resp.json()
        if status["state"] in ("completed", "failed"):
            break
        time.sleep(2)

    if status["state"] == "failed":
        raise RuntimeError(f"Job {job_id} failed: {status.get('error')}")

    # 5. Download the result
    result_resp = requests.get(
        f"{BASE_URL}/doc-digitization/job/v1/{job_id}/result",
        headers=HEADERS,
        timeout=60,
    )
    result_resp.raise_for_status()
    return result_resp.json()


if __name__ == "__main__":
    result = extract_document("hindi-land-record.pdf", language="hi-IN")
    for page in result["pages"]:
        for block in page["blocks"]:
            print(f"[{block['layout_tag']}] {block['text']}")
```
The shape of the response (the part that actually mattered for DocuBharat):
```json
{
  "job_id": "doc_a7f3...",
  "pages": [
    {
      "page_number": 1,
      "language_detected": "hi",
      "blocks": [
        {"layout_tag": "title", "text": "खाता विवरण"},
        {"layout_tag": "table", "text": "<table>...</table>"},
        {"layout_tag": "paragraph", "text": "..."},
        {"layout_tag": "key_value", "text": "खाता संख्या: 1247"}
      ]
    }
  ],
  "page_metrics": {...},
  "markdown_url": "https://...",
  "html_url": "https://..."
}
```
The key bit isn’t the API shape — every vendor has one. It’s the `layout_tag` field. Sarvam returns parsed structure — titles, tables, paragraphs, key-value pairs — not just OCR text. With OpenAI Vision, you get extracted text and build the structure yourself with a second prompting pass over it. That’s a one-day vs three-day difference for a structured-extraction build. With three days total, that mattered.
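To show why that saved time, here’s the kind of thin post-processing DocuBharat could layer on top: a minimal sketch that folds `key_value` blocks into one flat record. The block shape matches the response above; the split-on-first-colon heuristic and the `blocks_to_record` helper are my assumptions, not something the API guarantees.

```python
def blocks_to_record(result: dict) -> dict:
    """Collapse key_value blocks across pages into one flat dict.

    Assumes each key_value block reads like "खाता संख्या: 1247",
    i.e. a key and a value separated by the first colon.
    """
    record = {}
    for page in result["pages"]:
        for block in page["blocks"]:
            if block["layout_tag"] != "key_value":
                continue
            key, _, value = block["text"].partition(":")
            if value:
                record[key.strip()] = value.strip()
    return record

# e.g. blocks_to_record(result) -> {"खाता संख्या": "1247", ...}
```

With OpenAI Vision, that structure would come from another model call per page instead of a dictionary walk.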
The two real constraints
Two things will catch you on the first integration:
- 10-page limit per upload. Both PDF and ZIP uploads cap at 10 pages. Beyond that, you split the document yourself and reassemble the result (a minimal splitter sketch follows this list).
- 500-page total cap per job, 200 MB file size. Neither bit me on DocuBharat, but plan around them for production.
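For the splitting, here’s roughly what I’d run: a sketch using pypdf, which is my library choice, with hypothetical chunk naming; none of this is part of Sarvam’s API.

```python
from pathlib import Path
from pypdf import PdfReader, PdfWriter

def split_pdf(path: str, max_pages: int = 10) -> list[Path]:
    """Split a PDF into sequential chunks of at most max_pages pages."""
    reader = PdfReader(path)
    chunk_paths = []
    for start in range(0, len(reader.pages), max_pages):
        writer = PdfWriter()
        for i in range(start, min(start + max_pages, len(reader.pages))):
            writer.add_page(reader.pages[i])
        out = Path(f"{Path(path).stem}_part{start // max_pages:03d}.pdf")
        with out.open("wb") as f:
            writer.write(f)
        chunk_paths.append(out)
    return chunk_paths

# Submit each chunk as its own job, then stitch the per-chunk
# "pages" arrays back together in order on your side.
```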
Once that’s understood, the integration is genuinely clean.
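It also parallelises cleanly. Because each job is independent and the client just polls, batching is a thread pool away. A sketch reusing the extract_document helper from above, with a hypothetical file list:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical batch of scans to digitise
files = ["land-record-1.pdf", "land-record-2.pdf", "land-record-3.pdf"]

# Each worker runs the full create -> upload -> start -> poll cycle;
# the 2-second polling sleeps cost idle threads, not API quota.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract_document, files))
```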
The three reasons I picked Sarvam
1. Native, not bolted on
The phrase that did it for me was “designed for it.”
OpenAI Vision handles Devanagari. Sarvam Vision is trained on Devanagari — specifically on government bulletins, financial records, textbooks, and historical manuscripts. When the model encounters a mixed-script document with Hindi text, English numerals, and a stamped seal, that’s not an edge case for it. That’s the design target.
A 3B model purpose-built for the input distribution will often beat a much larger general model on that distribution. That was the architectural bet.
I didn’t run the head-to-head benchmark to verify it. I made the call on principle and moved on.
2. Free during February changed how I tested
Sarvam Vision was free during the month I was building.
The cost number isn’t what matters. The posture it produces is. When inference is free, you test aggressively. You break the system on purpose. You feed it documents you wouldn’t have tried if every test was a line item.
On OpenAI Vision’s pricing, the same test load — hundreds of pages across dozens of document types, re-run repeatedly as the pipeline changed — would have run about $250 during the build. Not enough to derail a project. Enough that I’d have tested carefully rather than exhaustively.
The build I ended up with had table extraction working on six layout variants because I tested it on six layout variants. With paid inference and a 3-day timebox, that number is probably two.
3. Positioning, not posturing
If I’d ever taken DocuBharat further — pitched it to a buyer, opened it up to users — the positioning would have mattered as much as the technology.
Indian product, Indian documents, Indian infrastructure. “Built on Sarvam” tells a coherent story to an Indian buyer in a way that “Built on OpenAI” doesn’t. Not because OpenAI is bad — because the story is different. A foreign-API integration becomes a question to answer. An Indian-API integration becomes part of the pitch.
This reason inverts for other audiences. For a US enterprise buyer, “Built on Sarvam” would cut against you. Audience-mapping is part of the architecture decision, not separate from it.
How to read AI vendor benchmarks without getting played
This is the part most “I picked Vendor X” posts skip.
Sarvam published these numbers alongside the Vision launch:
| Benchmark | Sarvam Vision | Gemini 3 Pro | GPT 5.2 |
|---|---|---|---|
| olmOCR-Bench | 84.3% | 80.2% | 69.8% |
| OmniDocBench | 93.28% | “significantly lower” | “significantly lower” |
Two things to hold in mind before quoting these:
The benchmarks are vendor-published. Sarvam ran the eval. There’s nothing wrong with that — every AI company does it — but it means the team with the most incentive to make the numbers look good ran the test. Until an independent lab reproduces them, they’re marketing claims with technical backing, not settled facts.
The benchmarks favour Sarvam’s training distribution. olmOCR-Bench and OmniDocBench rely on dense, mixed-script Indian document layouts — the exact input Sarvam was trained on. Global models trained on a much broader distribution will, predictably, underperform on a narrow Indian-document test. The benchmark confirms specialisation. It doesn’t prove Sarvam is universally better — only that it’s better at what it was built for.
The defensible version of the claim: Sarvam Vision outperforms general models on the documents it was designed for. On standard global document types, the gap probably narrows or reverses.
The version that travels: “Sarvam beats Gemini and ChatGPT.” That’s the version I’d push back on if a CTO quoted it at me as fact.
A three-question rule I now apply to any AI vendor benchmark:
- Who ran the eval — vendor or independent party? (Almost always the vendor.)
- How does the benchmark relate to the model’s training distribution? (Specialists win their own benchmark by definition.)
- What’s the failure mode they’re not publishing? (Specialists fail outside their distribution. Generalists fail at the edges of any distribution.)
There’s also one unresolved thread worth flagging: a June 2025 Hugging Face post alleged Sarvam was inflating model downloads via bots after the initial criticism. Never conclusively resolved either way. Not raising it to indict the company. Raising it because anyone making a vendor call should know the question was asked.
None of this changes my pick. It changes how I talk about my pick.
What I’d do differently if I rebuilt it
I’d benchmark Sarvam against itself.
Not Sarvam vs OpenAI — Sarvam across document categories. Clean printed Hindi. Handwritten Marathi. Tamil government forms with seals and stamps. Bengali medical records. Multi-column Telugu newspapers. English-Hindi mixed contracts. Score each one on extraction accuracy, layout fidelity, and table parsing.
I didn’t. The three-day timebox meant I tested whatever Hindi, Tamil, and Bengali documents I had on hand, confirmed Sarvam handled them well, and moved on. What I came away with is anecdotal confidence that Sarvam handles Indian documents well within the categories I tested. What I don’t have is calibrated knowledge of where it breaks.
For a learning build, anecdotal confidence is enough. For a production pipeline, it isn’t.
A CTO evaluating Sarvam for an actual document workflow would need exactly that calibration. Where’s the cliff? What document types should be routed to a different OCR engine? At what layout complexity does extraction quality degrade past usable? I don’t know. That’s the work I should have done — and the work any CTO evaluating Sarvam should run themselves before committing.
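If I were running that calibration today, the harness would look something like this: a sketch that reuses extract_document from the integration section, assumes each scan has a hand-checked transcript beside it, and scores with a crude character-similarity ratio. The folder layout, language codes, and scoring choice are all my assumptions; layout fidelity and table parsing would need their own metrics.

```python
from difflib import SequenceMatcher
from pathlib import Path

# Hypothetical layout: one folder per category, each scan paired
# with a hand-checked transcript (same name, .txt extension).
CATEGORIES = {
    "printed_hindi":       ("docs/printed_hindi", "hi-IN"),
    "handwritten_marathi": ("docs/handwritten_marathi", "mr-IN"),
    "tamil_gov_forms":     ("docs/tamil_gov_forms", "ta-IN"),
    "bengali_medical":     ("docs/bengali_medical", "bn-IN"),
}

def char_accuracy(extracted: str, truth: str) -> float:
    """Crude proxy for extraction accuracy: character-level similarity."""
    return SequenceMatcher(None, extracted, truth).ratio()

for name, (folder, lang) in CATEGORIES.items():
    scores = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = extract_document(str(pdf), language=lang)
        text = "\n".join(
            block["text"]
            for page in result["pages"]
            for block in page["blocks"]
        )
        truth = pdf.with_suffix(".txt").read_text(encoding="utf-8")
        scores.append(char_accuracy(text, truth))
    if scores:
        print(f"{name}: mean {sum(scores) / len(scores):.1%} over {len(scores)} docs")
```

Per-category means like these are what tell you where the cliff is and which document types to route elsewhere.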
The framework: regional specialist or global generalist?
Three questions that matter more than benchmarks.
Is the vendor’s training distribution aligned with your input distribution? For Indian documents, Sarvam wins on principle. For global English documents, OpenAI or Gemini do. Match the distribution, win the architecture decision.
What reputation phase is the vendor in? Proving themselves, scaling, or mature? Each phase has different properties. Proving-phase vendors (Sarvam in Feb 2026) give you generous pricing and responsive support but aggressive marketing claims and unproven longevity. Mature vendors give you stability but no special treatment.
What does vendor support look like at 2 AM when something breaks? For a 3-day learning build this matters less. For production, it’s the whole game.
If your input distribution is global, OpenAI or Gemini are probably right. If your input distribution is narrow, regional, and underserved — and you’re early enough that vendor lock-in isn’t yet a concern — a specialist like Sarvam is worth the bet I made.
The real decision isn’t “which model is better.” It’s “which model is trained for what I’m building, and how much do I trust the people behind it to keep showing up.”
What I built, what I learned
I built DocuBharat as a learning exercise — to understand what the Sarvam Document Intelligence API actually does, where it holds up on Hindi, Tamil, and Bengali documents, and where the architecture starts to show its design choices. The Sarvam bet paid off for the document types I tested. The architecture matched the input.
The lesson isn’t “pick the Indian vendor” or “pick the global vendor.” The lesson is that vendor benchmarks are downstream of training distribution, and training distribution is what you should actually be picking on.
Read benchmarks. Then read past them.
I’m Muneeb Ullah — a software developer who builds with AI for founders and CTOs. If you’re making a similar architecture call and want a second pair of eyes, I’m at hiremuneeb@gmail.com.
