Introduction
Most RAG tutorials show you how to chunk a clean English PDF.
Real production data is never that simple.
When I built Raj Brain — a search system over 100+ hours of Hindi, English, and Hinglish podcast content — I ran into every chunking problem that nobody writes about. The dataset had 109 episodes, 1.6 million words, and three languages sometimes appearing in the same sentence.
The first version worked perfectly in demos. In real queries, it was useless.
This post is about what went wrong, why it went wrong, and the specific chunking strategy that fixed it. No theory. Real numbers from a production system.
What Is RAG Chunking and Why It Matters More Than the Model
Retrieval-Augmented Generation (RAG) works in two stages: retrieve relevant context, then generate an answer using that context.
Most developers spend 80% of their time picking the right LLM. The chunking step gets 20 minutes and a Stack Overflow answer.
That’s backwards.
The LLM can only answer well if the retrieved chunks actually contain the answer. If your chunking strategy is wrong, the best model in the world returns garbage — confidently.
Chunking is not a preprocessing step. It is the foundation of your retrieval quality.
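To make the two stages concrete, here is a minimal retrieve-then-generate sketch. The three-dimensional vectors and the chunk texts are toy stand-ins for real embeddings, and `build_prompt` is a hypothetical helper, not part of any particular framework:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, k=2):
    """Stage 1: rank stored chunks by similarity to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    """Stage 2: the retrieved text becomes the context the LLM answers from."""
    context = "\n".join(c["text"] for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Toy 3-dim vectors stand in for real embedding output.
chunks = [
    {"text": "Chunking strategy decides retrieval quality.", "vec": [0.9, 0.1, 0.0]},
    {"text": "The podcast has 109 episodes.",                "vec": [0.0, 0.2, 0.9]},
]
top = retrieve([0.8, 0.2, 0.1], chunks, k=1)
prompt = build_prompt("Why does chunking matter?", top)
```

Notice that the generator never sees the corpus, only whatever `retrieve` hands it. That is why the quality of the stored chunks bounds the quality of every answer.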
The Dataset: What Made This Hard
Before the solution, you need to understand the problem space.
Raj Brain processes podcast content from a creator who speaks in Hindi, English, and Hinglish (code-switched Hindi-English). The dataset when I built it:
- 109 episodes
- 100+ hours of audio
- 1.6 million words across transcripts
- 4,345 searchable chunks (final, after optimization)
- Language distribution: roughly 40% Hindi, 35% Hinglish, 25% English — often within the same paragraph
The transcripts were generated via speech-to-text, which means:
- No punctuation consistency
- Speaker diarization errors
- Hindi words occasionally transliterated into English characters
- English technical terms inside Hindi sentences
This is not an edge case. This is what real multilingual content looks like.
The 3 Chunking Mistakes That Broke Version 1
Mistake 1: Fixed-Size Chunking
The first version used fixed-size chunking: split every 500 tokens, overlap by 50.
Standard advice. Works fine for clean English documents.
For transcribed speech, it was a disaster.
Fixed-size chunking has no awareness of sentence or thought boundaries. It splits mid-sentence constantly. You end up with chunks like:
“…jo main bol raha tha ke aapko business mein success ke liye” (roughly: “…what I was saying, that for success in business you…”)
That’s half a thought. The embedding model has to make sense of an incomplete idea. It can’t. The vector it generates is meaningless, and retrieval pulls it for unrelated queries.
The result: Search returned chunks that looked topically relevant but were semantically incomplete. The LLM tried to answer from half-thoughts and produced confused, hedging responses.
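The failure mode is easy to reproduce. Below is a naive fixed-size splitter run over the transcript fragment from above, with toy sizes (8 words, 2-word overlap) chosen so the cut is visible; the production numbers were 500 and 50 tokens, but the behavior is identical:

```python
def fixed_size_chunks(words, size=8, overlap=2):
    """Naive fixed-size chunking: a window of `size` words, sliding by `size - overlap`."""
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

transcript = ("jo main bol raha tha ke aapko business mein success ke liye "
              "sabse pehle ek clear niche chahiye").split()
for chunk in fixed_size_chunks(transcript, size=8, overlap=2):
    print(chunk)
# The first chunk ends mid-thought at "...aapko business" --
# exactly the kind of fragment the embedding model cannot make sense of.
```

The splitter has no idea where a thought ends, so every boundary is arbitrary.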
Mistake 2: Treating Hindi and Hinglish as the Same Language
This is the mistake I’ve never seen documented anywhere.
Hindi and Hinglish are not the same language for embedding purposes.
Hinglish is code-switching: a speaker mid-sentence switches from Hindi to English and back. A sentence like:
“yeh system bohot fast hai but retrieval accuracy drop ho rahi hai jab queries complex hoti hain” (roughly: “this system is very fast, but retrieval accuracy is dropping when queries get complex”)
Standard Hindi tokenizers don’t handle the English words correctly. Standard English tokenizers don’t handle the Devanagari script or the romanized Hindi. Either way, the tokenizer mangles the sentence.
The embedding model then works with mangled tokens. The semantic meaning of the sentence — which is clear to any bilingual person — is lost in the vector representation.
What this causes in retrieval: You ask “why does retrieval accuracy drop?” and the system can’t find the chunk above because the embedding for that chunk is distorted. You get chunks that mention “accuracy” in a completely different context instead.
This is invisible during development if you only test in one language. It only surfaces in production when real users ask real questions.
The fix was switching to Multilingual MiniLM (384-dimensional embeddings), which is trained on multilingual data and copes with code-switched input. It treats Hinglish as first-class input, not as malformed Hindi or broken English.
Mistake 3: Chunks Were Too Large
My initial chunk size was 800 words with 100-word overlap.
The logic seemed sound: bigger chunks = more context per retrieval = better answers.
The reality: bigger chunks buried the answer.
A podcast host might spend 800 words on a topic, but the specific answer to a user’s question lives in 2-3 sentences in the middle. When you retrieve that 800-word chunk, you’re feeding the LLM hundreds of words of noise around the few sentences that matter.
The LLM struggles to isolate the relevant part. Response quality drops. Latency increases because you’re processing more tokens for generation.
The tradeoff: 800-word chunks meant fewer, larger chunks to search. Retrieval was fast, but answer quality was low, and users couldn’t find what they were looking for.
The Solution: Utterance-Based Chunking
After 3 weeks of failed attempts, I rebuilt the chunking pipeline from scratch using utterance-based chunking.
What Is Utterance-Based Chunking?
Instead of splitting at fixed token counts, you split at natural speech boundaries — the points where a speaker finishes a complete thought.
For transcribed audio, these boundaries are:
- Sentence-ending punctuation (where the transcription includes it)
- Speaker pauses (marked in some transcription formats)
- Logical paragraph breaks in the transcript
- Topic transitions (detectable via semantic similarity between adjacent sentences)
The goal is: every chunk should contain one complete idea or answer.
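The last boundary signal, topic transitions, can be detected by comparing adjacent sentences: where similarity drops, a new thought has started. The sketch below uses a crude bag-of-words vector as a stand-in for a real multilingual embedding, and the `0.2` threshold is an illustrative assumption, not a tuned value:

```python
from collections import Counter
from math import sqrt

def bow_vector(sentence):
    """Crude bag-of-words stand-in for a real multilingual embedding."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def utterance_boundaries(sentences, threshold=0.2):
    """Mark a boundary before any sentence dissimilar to the one preceding it."""
    vecs = [bow_vector(s) for s in sentences]
    boundaries = []
    for i in range(len(sentences) - 1):
        if cosine(vecs[i], vecs[i + 1]) < threshold:
            boundaries.append(i + 1)  # boundary falls before sentence i+1
    return boundaries

sentences = [
    "businesses fail with ai because they pick the wrong model",
    "the wrong model for a simple task wastes money",
    "now let us talk about podcast growth strategies",
]
print(utterance_boundaries(sentences))  # boundary before the topic change: [2]
```

With real embeddings the same comparison works across languages, because the vectors capture meaning rather than shared surface words.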
The Exact Parameters We Use
After extensive testing across Hindi, English, and Hinglish queries:
| Parameter | Value | Reason |
|---|---|---|
| Target chunk size | 300 words | Enough context, not too much noise |
| Maximum chunk size | 500 words | Hard ceiling to prevent over-sized chunks |
| Minimum chunk size | 100 words | Prevents fragments with no semantic value |
| Overlap | 1 utterance | Preserves context across boundaries |
| Splitting unit | Utterance (complete thought) | Language-agnostic boundary detection |
Why 300 words?
At 300 words, a Hinglish chunk typically contains one complete point from the podcast. Long enough to embed meaningful semantic content. Short enough that retrieval isn’t pulling paragraphs of noise.
We tested 200, 300, 400, and 500. At 200, chunks were too fragmented — single sentences with insufficient context. At 400+, answer precision dropped because chunks contained multiple distinct points and retrieval became ambiguous.
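The grouping step itself is simple once utterances are in hand. This is a minimal sketch of the idea, not the production pipeline: pack whole utterances toward the target, never split inside one, cap at the maximum, and fold a too-small trailing fragment into the previous chunk:

```python
def group_utterances(utterances, target=300, maximum=500, minimum=100):
    """Pack complete utterances into chunks near `target` words.

    A chunk closes once it reaches `target` or adding the next utterance
    would exceed `maximum`. A final fragment under `minimum` words is
    merged into the previous chunk instead of standing alone.
    """
    chunks, current, count = [], [], 0
    for utt in utterances:
        n = len(utt.split())
        if current and (count + n > maximum or count >= target):
            chunks.append(current)
            current, count = [], 0
        current.append(utt)
        count += n
    if current:
        if chunks and sum(len(u.split()) for u in current) < minimum:
            chunks[-1].extend(current)  # avoid a semantically empty fragment
        else:
            chunks.append(current)
    return [" ".join(c) for c in chunks]

# Fake utterances of 120, 150, 100, and 120 words.
utts = [" ".join(["word"] * n) for n in (120, 150, 100, 120)]
grouped = group_utterances(utts)  # two chunks: 370 words, then 120
```

Because the unit is always a complete utterance, chunk sizes vary around the target rather than hitting it exactly; the 287-word production average reflects that.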
Why 1-utterance overlap?
Adjacent utterances in speech often reference each other. A speaker might say:
Utterance 1: “The main reason businesses fail with AI is they choose the wrong model for every task.”
Utterance 2: “GPT-4o for a simple classification problem? You’re burning money.”
Without overlap, a query about “AI model selection mistakes” might retrieve Utterance 2 without Utterance 1, losing the framing.
1-utterance overlap ensures that boundary-spanning ideas are preserved in both adjacent chunks.
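Overlap injection then amounts to prepending each chunk with the final utterance of the chunk before it. A minimal sketch, using the two utterances above:

```python
def inject_overlap(chunks):
    """Prepend the previous chunk's final utterance to each chunk.

    `chunks` is a list of utterance lists; the first chunk is unchanged.
    """
    overlapped = [chunks[0][:]]
    for prev, cur in zip(chunks, chunks[1:]):
        overlapped.append([prev[-1]] + cur)
    return overlapped

chunks = [
    ["The main reason businesses fail with AI is they choose the wrong model.",
     "Model choice should match the task."],
    ["GPT-4o for a simple classification problem?",
     "You are burning money."],
]
with_overlap = inject_overlap(chunks)
# The second chunk now opens with the framing sentence from the first.
```

The cost is mild duplication in the index; the benefit is that a boundary-spanning idea is retrievable from either side of the cut.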
The Processing Pipeline
Here’s the actual flow:
Raw Transcript Text
↓
Language Detection (per segment)
↓
Sentence Boundary Detection (language-aware)
↓
Utterance Grouping (300-word target)
↓
Overlap Injection (1 utterance)
↓
Multilingual MiniLM Embedding (384-dim)
↓
ChromaDB Storage (HNSW index)
↓
Search Layer
Language detection happens at the segment level, not the document level. A single paragraph might switch languages twice. Detecting language per segment allows boundary detection to use the right rules for that segment.
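A cheap first pass for per-segment detection is script classification via Unicode ranges (Devanagari occupies U+0900 to U+097F). This sketch is only a proxy: a production system would layer a real language-ID model on top, and note that romanized Hindi still reads as “latin” here, which is exactly why script alone is not enough:

```python
def script_mix(segment):
    """Classify a segment by script: 'devanagari', 'latin', or 'mixed'."""
    dev = sum(1 for ch in segment if '\u0900' <= ch <= '\u097F')
    lat = sum(1 for ch in segment if ch.isascii() and ch.isalpha())
    if dev and lat:
        return "mixed"
    if dev:
        return "devanagari"
    return "latin"

print(script_mix("retrieval accuracy drop ho rahi hai"))  # romanized Hindi looks 'latin'
print(script_mix("यह सिस्टम बहुत fast है"))                 # Devanagari + English term: 'mixed'
```

Segments flagged as mixed or Devanagari can then be routed to the sentence-boundary rules appropriate for that script.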
HNSW indexing in ChromaDB was chosen for its approximate nearest-neighbor performance at scale. With 4,345 chunks, exact search would work fine — but HNSW gives us sub-200ms retrieval even as the dataset grows.
Results: Before vs After
| Metric | Version 1 (Fixed-Size) | Version 2 (Utterance-Based) |
|---|---|---|
| Total chunks | 2,100 | 4,345 |
| Avg chunk size | 800 words | 287 words |
| Vector search time | 80-120ms | 100-200ms |
| Full response time | 12-18 seconds | 3-4 seconds |
| Multilingual accuracy | Poor | High |
| Hindi query accuracy | ~40% relevant | ~85% relevant |
| Hinglish query accuracy | ~25% relevant | ~80% relevant |
The full response time improvement (12-18s → 3-4s) came from two sources: smaller chunks mean less token processing for generation, and better retrieval means fewer fallback attempts.
The accuracy numbers are internal evaluations based on 50 test queries per language, rated by a native Hindi/Hinglish speaker.
What This Means If You’re Building a Multilingual RAG System
Three things to apply immediately:
1. Never use fixed-size chunking for transcribed or conversational content. Fixed-size chunking was designed for structured documents. Speech transcripts are not structured documents. Use utterance or sentence boundaries.
2. Your embedding model choice is determined by your language mix. If you have any code-switching or non-English content, test your embedding model specifically on those language combinations. Multilingual MiniLM handles Hinglish. Standard OpenAI embeddings do not.
3. Chunk size is a retrieval precision dial, not a context dial. Smaller chunks = more precise retrieval, less generation context. Larger chunks = less precise retrieval, more generation context. Find the balance for your specific content. For conversational audio, 250-350 words is usually the right range.
The Stack
For reference, the full production stack for Raj Brain:
- Chunking: Custom Python pipeline, utterance-based
- Embeddings: Multilingual MiniLM L6 v2 (384-dim)
- Vector DB: ChromaDB with HNSW indexing
- Generation: Perplexity API
- Backend: Python/FastAPI
- Vector search latency: 100-200ms
- Full response time: 3-4 seconds
Conclusion
The Raj Brain search system works not because we used a powerful model — we use a relatively small embedding model — but because the chunks going into the vector database actually contain complete, retrievable ideas.
Chunking is unglamorous work. There are no benchmark leaderboards for it. Nobody writes about Hinglish tokenization edge cases.
But if you’re building a RAG system over real-world multilingual content, chunking is where your system will succeed or fail.
Get the chunks right first. The model does the rest.
Building something similar? Connect with me on LinkedIn
