Introduction
Most RAG tutorials show you how to chunk a clean English PDF.
Real production data is never that simple.
When I built Raj Brain — a search system over 100+ hours of Hindi, English, and Hinglish podcast content — I ran into every chunking problem that nobody writes about. The dataset had 109 episodes, 1.6 million words, and three languages sometimes appearing in the same sentence.
The first version worked perfectly in demos. In real queries, it was useless.
This post is about what went wrong, why it went wrong, and the specific chunking strategy that fixed it. No theory. Real numbers from a production system.
What Is RAG Chunking and Why It Matters More Than the Model
Retrieval-Augmented Generation (RAG) works in two stages: retrieve relevant context, then generate an answer using that context.
Most developers spend 80% of their time picking the right LLM. The chunking step gets 20 minutes and a Stack Overflow answer.
That’s backwards.
The LLM can only answer well if the retrieved chunks actually contain the answer. If your chunking strategy is wrong, the best model in the world returns garbage — confidently.
Chunking is not a preprocessing step. It is the foundation of your retrieval quality.
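To make the two stages concrete, here is a minimal retrieve-then-generate sketch. The three-dimensional vectors and the chunk texts are toy stand-ins for real embeddings, and `build_prompt` is a hypothetical helper, not part of any particular framework:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, k=2):
    """Stage 1: rank stored chunks by similarity to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    """Stage 2: the retrieved text becomes the context the LLM answers from."""
    context = "\n".join(c["text"] for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Toy 3-dim vectors stand in for real embedding output.
chunks = [
    {"text": "Chunking strategy decides retrieval quality.", "vec": [0.9, 0.1, 0.0]},
    {"text": "The podcast has 109 episodes.",                "vec": [0.0, 0.2, 0.9]},
]
top = retrieve([0.8, 0.2, 0.1], chunks, k=1)
prompt = build_prompt("Why does chunking matter?", top)
```

Notice that the generator never sees the corpus, only whatever `retrieve` hands it. That is why the quality of the stored chunks bounds the quality of every answer.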
The Dataset: What Made This Hard
Before the solution, you need to understand the problem space.
Raj Brain processes podcast content from a creator who speaks in Hindi, English, and Hinglish (code-switched Hindi-English). The dataset when I built it:
- 109 episodes
- 100+ hours of audio
- 1.6 million words across transcripts
- 4,345 searchable chunks (final, after optimization)
- Language distribution: roughly 40% Hindi, 35% Hinglish, 25% English — often within the same paragraph
The transcripts were generated via speech-to-text, which means:
- No punctuation consistency
- Speaker diarization errors
- Hindi words occasionally transliterated into English characters
- English technical terms inside Hindi sentences
This is not an edge case. This is what real multilingual content looks like.
The 3 Chunking Mistakes That Broke Version 1
Mistake 1: Fixed-Size Chunking
The first version used fixed-size chunking: split every 500 tokens, overlap by 50.
Standard advice. Works fine for clean English documents.
For transcribed speech, it was a disaster.
Fixed-size chunking has no awareness of sentence or thought boundaries. It splits mid-sentence constantly. You end up with chunks like:
“…jo main bol raha tha ke aapko business mein success ke liye” (roughly: “…what I was saying, that for success in business you…”)
That’s half a thought. The embedding model has to make sense of an incomplete idea. It can’t. The vector it generates is meaningless, and retrieval pulls it for unrelated queries.
The result: Search returned chunks that looked topically relevant but were semantically incomplete. The LLM tried to answer from half-thoughts and produced confused, hedging responses.
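The failure mode is easy to reproduce. Below is a naive fixed-size splitter run over the transcript fragment from above, with toy sizes (8 words, 2-word overlap) chosen so the cut is visible; the production numbers were 500 and 50 tokens, but the behavior is identical:

```python
def fixed_size_chunks(words, size=8, overlap=2):
    """Naive fixed-size chunking: a window of `size` words, sliding by `size - overlap`."""
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

transcript = ("jo main bol raha tha ke aapko business mein success ke liye "
              "sabse pehle ek clear niche chahiye").split()
for chunk in fixed_size_chunks(transcript, size=8, overlap=2):
    print(chunk)
# The first chunk ends mid-thought at "...aapko business" --
# exactly the kind of fragment the embedding model cannot make sense of.
```

The splitter has no idea where a thought ends, so every boundary is arbitrary.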
Mistake 2: Treating Hindi and Hinglish as the Same Language
This is the mistake I’ve never seen documented anywhere.
Hindi and Hinglish are not the same language for embedding purposes.
Hinglish is code-switching: a speaker mid-sentence switches from Hindi to English and back. A sentence like:
“yeh system bohot fast hai but retrieval accuracy drop ho rahi hai jab queries complex hoti hain” (roughly: “this system is very fast, but retrieval accuracy is dropping when queries get complex”)
Standard Hindi tokenizers don’t handle the English words correctly. Standard English tokenizers don’t handle the Devanagari script or the romanized Hindi. Either way, the tokenizer mangles the sentence.
The embedding model then works with mangled tokens. The semantic meaning of the sentence — which is clear to any bilingual person — is lost in the vector representation.
What this causes in retrieval: You ask “why does retrieval accuracy drop?” and the system can’t find the chunk above because the embedding for that chunk is distorted. You get chunks that mention “accuracy” in a completely different context instead.
This is invisible during development if you only test in one language. It only surfaces in production when real users ask real questions.
The fix was switching to Multilingual MiniLM (384-dimensional embeddings), which is trained on multilingual data and copes with code-switched input. It treats Hinglish as first-class input, not as malformed Hindi or broken English.
Mistake 3: Chunks Were Too Large
My initial chunk size was 800 words with 100-word overlap.
The logic seemed sound: bigger chunks = more context per retrieval = better answers.
The reality: bigger chunks buried the answer.
A podcast host might spend 800 words on a topic, but the specific answer to a user’s question lives in 2-3 sentences in the middle. When you retrieve that 800-word chunk, you’re feeding the LLM hundreds of words of noise around the few sentences that matter.
The LLM struggles to isolate the relevant part. Response quality drops. Latency increases because you’re processing more tokens for generation.
The tradeoff: 800-word chunks meant fewer, larger chunks to search. Retrieval was fast, but answer quality was low, and users couldn’t find what they were looking for.
The Solution: Utterance-Based Chunking
After 3 weeks of failed attempts, I rebuilt the chunking pipeline from scratch using utterance-based chunking.
What Is Utterance-Based Chunking?
Instead of splitting at fixed token counts, you split at natural speech boundaries — the points where a speaker finishes a complete thought.
For transcribed audio, these boundaries are:
- Sentence-ending punctuation (where the transcription includes it)
- Speaker pauses (marked in some transcription formats)
- Logical paragraph breaks in the transcript
- Topic transitions (detectable via semantic similarity between adjacent sentences)
The goal is: every chunk should contain one complete idea or answer.
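The last boundary signal, topic transitions, can be detected by comparing adjacent sentences: where similarity drops, a new thought has started. The sketch below uses a crude bag-of-words vector as a stand-in for a real multilingual embedding, and the `0.2` threshold is an illustrative assumption, not a tuned value:

```python
from collections import Counter
from math import sqrt

def bow_vector(sentence):
    """Crude bag-of-words stand-in for a real multilingual embedding."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def utterance_boundaries(sentences, threshold=0.2):
    """Mark a boundary before any sentence dissimilar to the one preceding it."""
    vecs = [bow_vector(s) for s in sentences]
    boundaries = []
    for i in range(len(sentences) - 1):
        if cosine(vecs[i], vecs[i + 1]) < threshold:
            boundaries.append(i + 1)  # boundary falls before sentence i+1
    return boundaries

sentences = [
    "businesses fail with ai because they pick the wrong model",
    "the wrong model for a simple task wastes money",
    "now let us talk about podcast growth strategies",
]
print(utterance_boundaries(sentences))  # boundary before the topic change: [2]
```

With real embeddings the same comparison works across languages, because the vectors capture meaning rather than shared surface words.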
The Exact Parameters We Use
After extensive testing across Hindi, English, and Hinglish queries:
| Parameter | Value | Reason |
|---|---|---|
| Target chunk size | 300 words | Enough context, not too much noise |
| Maximum chunk size | 500 words | Hard ceiling to prevent over-sized chunks |
| Minimum chunk size | 100 words | Prevents fragments with no semantic value |
| Overlap | 1 utterance | Preserves context across boundaries |
| Splitting unit | Utterance (complete thought) | Language-agnostic boundary detection |
Why 300 words?
At 300 words, a Hinglish chunk typically contains one complete point from the podcast. Long enough to embed meaningful semantic content. Short enough that retrieval isn’t pulling paragraphs of noise.
We tested 200, 300, 400, and 500. At 200, chunks were too fragmented — single sentences with insufficient context. At 400+, answer precision dropped because chunks contained multiple distinct points and retrieval became ambiguous.
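The grouping step itself is simple once utterances are in hand. This is a minimal sketch of the idea, not the production pipeline: pack whole utterances toward the target, never split inside one, cap at the maximum, and fold a too-small trailing fragment into the previous chunk:

```python
def group_utterances(utterances, target=300, maximum=500, minimum=100):
    """Pack complete utterances into chunks near `target` words.

    A chunk closes once it reaches `target` or adding the next utterance
    would exceed `maximum`. A final fragment under `minimum` words is
    merged into the previous chunk instead of standing alone.
    """
    chunks, current, count = [], [], 0
    for utt in utterances:
        n = len(utt.split())
        if current and (count + n > maximum or count >= target):
            chunks.append(current)
            current, count = [], 0
        current.append(utt)
        count += n
    if current:
        if chunks and sum(len(u.split()) for u in current) < minimum:
            chunks[-1].extend(current)  # avoid a semantically empty fragment
        else:
            chunks.append(current)
    return [" ".join(c) for c in chunks]

# Fake utterances of 120, 150, 100, and 120 words.
utts = [" ".join(["word"] * n) for n in (120, 150, 100, 120)]
grouped = group_utterances(utts)  # two chunks: 370 words, then 120
```

Because the unit is always a complete utterance, chunk sizes vary around the target rather than hitting it exactly; the 287-word production average reflects that.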
Why 1-utterance overlap?
Adjacent utterances in speech often reference each other. A speaker might say:
Utterance 1: “The main reason businesses fail with AI is they choose the wrong model for every task.”
Utterance 2: “GPT-4o for a simple classification problem? You’re burning money.”
Without overlap, a query about “AI model selection mistakes” might retrieve Utterance 2 without Utterance 1, losing the framing.
1-utterance overlap ensures that boundary-spanning ideas are preserved in both adjacent chunks.
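Overlap injection then amounts to prepending each chunk with the final utterance of the chunk before it. A minimal sketch, using the two utterances above:

```python
def inject_overlap(chunks):
    """Prepend the previous chunk's final utterance to each chunk.

    `chunks` is a list of utterance lists; the first chunk is unchanged.
    """
    overlapped = [chunks[0][:]]
    for prev, cur in zip(chunks, chunks[1:]):
        overlapped.append([prev[-1]] + cur)
    return overlapped

chunks = [
    ["The main reason businesses fail with AI is they choose the wrong model.",
     "Model choice should match the task."],
    ["GPT-4o for a simple classification problem?",
     "You are burning money."],
]
with_overlap = inject_overlap(chunks)
# The second chunk now opens with the framing sentence from the first.
```

The cost is mild duplication in the index; the benefit is that a boundary-spanning idea is retrievable from either side of the cut.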
The Processing Pipeline
Here’s the actual flow:
Raw Transcript Text
↓
Language Detection (per segment)
↓
Sentence Boundary Detection (language-aware)
↓
Utterance Grouping (300-word target)
↓
Overlap Injection (1 utterance)
↓
Multilingual MiniLM Embedding (384-dim)
↓
ChromaDB Storage (HNSW index)
↓
Search Layer
Language detection happens at the segment level, not the document level. A single paragraph might switch languages twice. Detecting language per segment allows boundary detection to use the right rules for that segment.
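A cheap first pass for per-segment detection is script classification via Unicode ranges (Devanagari occupies U+0900 to U+097F). This sketch is only a proxy: a production system would layer a real language-ID model on top, and note that romanized Hindi still reads as “latin” here, which is exactly why script alone is not enough:

```python
def script_mix(segment):
    """Classify a segment by script: 'devanagari', 'latin', or 'mixed'."""
    dev = sum(1 for ch in segment if '\u0900' <= ch <= '\u097F')
    lat = sum(1 for ch in segment if ch.isascii() and ch.isalpha())
    if dev and lat:
        return "mixed"
    if dev:
        return "devanagari"
    return "latin"

print(script_mix("retrieval accuracy drop ho rahi hai"))  # romanized Hindi looks 'latin'
print(script_mix("यह सिस्टम बहुत fast है"))                 # Devanagari + English term: 'mixed'
```

Segments flagged as mixed or Devanagari can then be routed to the sentence-boundary rules appropriate for that script.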
HNSW indexing in ChromaDB was chosen for its approximate nearest-neighbor performance at scale. With 4,345 chunks, exact search would work fine — but HNSW gives us sub-200ms retrieval even as the dataset grows.
Results: Before vs After
| Metric | Version 1 (Fixed-Size) | Version 2 (Utterance-Based) |
|---|---|---|
| Total chunks | 2,100 | 4,345 |
| Avg chunk size | 800 words | 287 words |
| Vector search time | 80-120ms | 100-200ms |
| Full response time | 12-18 seconds | 3-4 seconds |
| Multilingual accuracy | Poor | High |
| Hindi query accuracy | ~40% relevant | ~85% relevant |
| Hinglish query accuracy | ~25% relevant | ~80% relevant |
The full response time improvement (12-18s → 3-4s) came from two sources: smaller chunks mean less token processing for generation, and better retrieval means fewer fallback attempts.
The accuracy numbers are internal evaluations based on 50 test queries per language, rated by a native Hindi/Hinglish speaker.
What This Means If You’re Building a Multilingual RAG System
Three things to apply immediately:
1. Never use fixed-size chunking for transcribed or conversational content. Fixed-size chunking was designed for structured documents. Speech transcripts are not structured documents. Use utterance or sentence boundaries.
2. Your embedding model choice is determined by your language mix. If you have any code-switching or non-English content, test your embedding model specifically on those language combinations. Multilingual MiniLM handles Hinglish. Standard OpenAI embeddings do not.
3. Chunk size is a retrieval precision dial, not a context dial. Smaller chunks = more precise retrieval, less generation context. Larger chunks = less precise retrieval, more generation context. Find the balance for your specific content. For conversational audio, 250-350 words is usually the right range.
The Stack
For reference, the full production stack for Raj Brain:
- Chunking: Custom Python pipeline, utterance-based
- Embeddings: Multilingual MiniLM L6 v2 (384-dim)
- Vector DB: ChromaDB with HNSW indexing
- Generation: Perplexity API
- Backend: Python/FastAPI
- Vector search latency: 100-200ms
- Full response time: 3-4 seconds
Conclusion
The Raj Brain search system works not because we used a powerful model — we use a relatively small embedding model — but because the chunks going into the vector database actually contain complete, retrievable ideas.
Chunking is unglamorous work. There are no benchmark leaderboards for it. Nobody writes about Hinglish tokenization edge cases.
But if you’re building a RAG system over real-world multilingual content, chunking is where your system will succeed or fail.
Get the chunks right first. The model does the rest.
Building something similar? Connect with me on LinkedIn
