Transcribing 40,000 WhatsApp Voice Notes with Groq Whisper
- Author: Adrian Gan (@AdrianGanJY)
We run a POS hardware business. Our customer service team communicates almost entirely over WhatsApp — voice notes, videos, product demos, complaints, follow-ups. Three phone numbers. Thousands of conversations. Years of operational history locked inside .opus files.
This is the story of how we built a pipeline to transcribe all of it.
The Problem
We already had a WhatsApp chat viewer — a self-hosted Node.js app that decrypts .crypt15 backups delivered via Syncthing and serves them as a read-only UI. You could scroll through chats, search text messages, view images.
But voice notes were a black box. You could play them. You couldn't search them, summarize them, or analyze them. And most of our real communication happened in voice.
The goal: transcribe all voice notes and videos, make them searchable, and lay the groundwork for AI analytics (session classification, SOP compliance scoring, CS quality measurement).
Language Reality
Before picking a model, you have to understand what you are actually transcribing.
Our audio is a mix of:
- Bahasa Malaysia (BM) — most common
- English — mixed in freely
- Mandarin — some customers and staff
- Cantonese — occasional
And critically: code-switching within a single sentence. "Bos, boleh check tak the invoice untuk customer tadi?" is a single utterance mixing BM, English, and business jargon.
This ruled out any English-only model (distil-whisper-large-v3-en). We needed whisper-large-v3 — the full multilingual model.
We ran a 50-file comparison between whisper-large-v3 and whisper-large-v3-turbo. Turbo is faster and cheaper ($0.04/hr versus $0.111/hr for the full model), but it handled code-switching and Cantonese noticeably worse. The roughly $12 cost difference across the full backlog was not a deciding factor, so we locked in whisper-large-v3.
Architecture
data/
  {num}/
    Media/
      WhatsApp Business Voice Notes/   <- .opus files
      WhatsApp Business Video/         <- .mp4 files
    msgstore.dec.db                    <- WhatsApp message DB (read-only)
data/transcripts.db                    <- our enrichment DB (writable, persistent)
The key insight: the enrichment DB must be separate from the source DB.
msgstore.dec.db gets hot-replaced every time Syncthing delivers a new backup. If we stored transcripts inside it, they would be wiped. transcripts.db is a separate SQLite file that persists across all syncs.
transcripts.db Schema
CREATE TABLE transcripts (
  id            INTEGER PRIMARY KEY,
  number_id     TEXT NOT NULL,
  file_path     TEXT NOT NULL,
  audio_hash    TEXT NOT NULL,
  text          TEXT,
  language      TEXT,
  duration_sec  REAL,
  segments_json TEXT,
  model         TEXT,
  cost_usd      REAL,
  created_at    INTEGER,
  UNIQUE(number_id, file_path)
);
CREATE VIRTUAL TABLE transcripts_fts USING fts5(
  text, content='transcripts', content_rowid='id',
  tokenize='unicode61 remove_diacritics 2'
);
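The unicode61 tokenizer with remove_diacritics 2 works well for the Latin-script text (BM/EN). One caveat for CJK (Mandarin/Cantonese): unicode61 does not segment Chinese, so a run of characters with no spaces or punctuation between them is indexed as a single token. A phrase at the start of such a run can still be found with a prefix query (term*), but matching in the middle of a run needs a different tokenizer, such as FTS5's built-in trigram tokenizer.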
The audio_hash enables cross-chat deduplication: forwarded voice notes have identical bytes, so we copy the text instead of calling the API again.
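Because transcripts_fts is declared as an external-content table (content='transcripts'), SQLite does not update the index on its own when rows in transcripts change. The post does not say how the index is kept current. One common option, sketched here after the trigger pattern in the SQLite FTS5 documentation, is a set of three triggers. The trigger names and the installFtsTriggers helper are hypothetical, and db is assumed to be a better-sqlite3 handle.

```javascript
// Hypothetical helper: keep the external-content FTS5 index in sync with
// the transcripts table. Trigger names are illustrative.
const FTS_SYNC_SQL = `
CREATE TRIGGER IF NOT EXISTS transcripts_ai AFTER INSERT ON transcripts BEGIN
  INSERT INTO transcripts_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER IF NOT EXISTS transcripts_ad AFTER DELETE ON transcripts BEGIN
  INSERT INTO transcripts_fts(transcripts_fts, rowid, text)
  VALUES ('delete', old.id, old.text);
END;
CREATE TRIGGER IF NOT EXISTS transcripts_au AFTER UPDATE ON transcripts BEGIN
  INSERT INTO transcripts_fts(transcripts_fts, rowid, text)
  VALUES ('delete', old.id, old.text);
  INSERT INTO transcripts_fts(rowid, text) VALUES (new.id, new.text);
END;
`;

function installFtsTriggers(db) {
  // better-sqlite3's db.exec accepts a string with several statements.
  db.exec(FTS_SYNC_SQL);
}
```

With triggers like these in place, the batch runner only ever writes to transcripts, and the search index follows along.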
The Batch Runner
The batch runner is the workhorse. Key design decisions:
Resume-safe by default
On startup, we load all already-done file_path values into a Set, then filter them out of the queue. Kill it, restart it — it picks up exactly where it left off.
const done = getDonePathSet(db, numId); // Set of rel paths
const todo = allFiles.filter(f => !done.has(f.rel));
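For illustration, getDonePathSet could be as small as the sketch below. It assumes a better-sqlite3 handle and the transcripts schema shown earlier; the query and the pendingFiles wrapper are assumptions, not the original implementation.

```javascript
// Hypothetical implementation of getDonePathSet, assuming a better-sqlite3
// handle and the transcripts schema shown earlier.
function getDonePathSet(db, numId) {
  const rows = db
    .prepare('SELECT file_path FROM transcripts WHERE number_id = ?')
    .all(numId);
  return new Set(rows.map((r) => r.file_path));
}

// Files already in the Set are skipped; everything else is still to do.
function pendingFiles(allFiles, done) {
  return allFiles.filter((f) => !done.has(f.rel));
}
```

Because the Set is rebuilt from the database on every start, the database itself is the only checkpoint the runner needs.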
Hash dedup
Before calling the API, we SHA-256 hash the file. If we already have a transcript for that hash (from another chat or number), we copy the text — zero API cost.
const hash = hashFile(item.abs);
const existing = getTranscriptByHash(db, hash);
if (existing) {
  // copy text, no API call
} else {
  // transcribe
}
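A minimal sketch of what hashFile might look like, using Node's built-in crypto module. Reading the whole file into memory is an assumption that suits small voice notes; very large videos might call for a streaming hash instead.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// Hypothetical hashFile: SHA-256 of the file's bytes, hex-encoded.
// Identical bytes (for example a forwarded voice note) give the same hash.
function hashFile(absPath) {
  return crypto.createHash('sha256').update(fs.readFileSync(absPath)).digest('hex');
}
```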
Video: ffmpeg extraction
WhatsApp video files (.mp4) go through a pre-processing step before hitting the Whisper API:
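const { execFileSync } = require('node:child_process');

// Extract a mono 16 kHz audio track; ffmpeg picks the output format from outPath's extension.
function extractAudio(videoPath, outPath) {
  execFileSync('ffmpeg', [
    '-y', '-i', videoPath, // overwrite outPath if it already exists
    '-vn',                 // strip video track
    '-ar', '16000',        // 16 kHz, Whisper's native rate
    '-ac', '1',            // mono
    '-b:a', '32k',         // 32 kbps, sufficient for speech
    outPath,
  ], { stdio: 'pipe', timeout: 60_000 }); // throws on non-zero exit or after 60 s
}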
The temp .mp3 is created, sent to Groq, then deleted in a finally block. The file_path stored in the DB is the original .mp4 path — so it joins back to message_media correctly.
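The create, send, delete sequence can be sketched as a small wrapper. The function name transcribeVideo and its injected extract and transcribe parameters are hypothetical: extract stands in for extractAudio above and transcribe for the Groq call, so the cleanup logic can be exercised without ffmpeg or network access.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical wrapper: extract audio to a temp .mp3, hand it to `transcribe`,
// and always delete the temp file, even if extraction or the API call throws.
async function transcribeVideo(videoPath, extract, transcribe) {
  const tmp = path.join(os.tmpdir(), `wa-audio-${crypto.randomUUID()}.mp3`);
  try {
    extract(videoPath, tmp);
    return await transcribe(tmp);
  } finally {
    fs.rmSync(tmp, { force: true }); // force: no error if the file is missing
  }
}
```

The random file name keeps concurrent workers from overwriting each other's temp files.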
Retry-After handling
Groq returns Retry-After headers on 429s. We honor them:
if (resp.status === 429 && attempt < maxAttempts) {
  const ra = parseFloat(resp.headers.get('retry-after') || '0');
  const wait = (ra > 0 ? ra * 1000 : 5000 * Math.pow(2, attempt - 1)) + Math.random() * 500;
  await sleep(wait);
  continue;
}
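To show where that fragment sits, here is a sketch of a complete retry loop. The wait computation is the same as above, pulled into a pure function so it can be checked in isolation. The names retryDelayMs and withRetry, the default of 5 attempts, and the doRequest callback are assumptions for illustration.

```javascript
// Pure helper mirroring the wait computation above: prefer Retry-After
// (in seconds), otherwise back off exponentially from 5 s. `jitterMs` is a
// parameter so the result is deterministic when a fixed value is passed.
function retryDelayMs(retryAfterHeader, attempt, jitterMs = Math.random() * 500) {
  const ra = parseFloat(retryAfterHeader || '0');
  const base = ra > 0 ? ra * 1000 : 5000 * Math.pow(2, attempt - 1);
  return base + jitterMs;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical retry loop around any function that returns a fetch-style
// Response. After maxAttempts the last response is returned as-is.
async function withRetry(doRequest, maxAttempts = 5) {
  for (let attempt = 1; ; attempt++) {
    const resp = await doRequest();
    if (resp.status === 429 && attempt < maxAttempts) {
      await sleep(retryDelayMs(resp.headers.get('retry-after'), attempt));
      continue;
    }
    return resp;
  }
}
```

A Retry-After value that is an HTTP date rather than a number parses to NaN, so it falls through to the exponential backoff branch.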
Cost cap
if (totalCost >= MAX_COST) { stopFlag = true; return; }
Default cap: $19.
Multi-Key Rate Limit Strategy
Groq free tier is 2,000 requests/day per account. With 47,000 files, that is ~24 days on one account.
The fix: multiple Groq accounts, each with their own API key.
The naive approach is round-robin — rotate keys per request. The problem: with concurrency=3, two workers can hit the same key in the same minute, triggering a 429.
The better approach: key-per-worker. Each worker slot is pinned to one key.
await runPool(queue, CONCURRENCY, async (item, i, slot) => {
  const apiKey = apiKeys[slot % apiKeys.length];
  // ...
});
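The post does not show runPool itself. A minimal sketch with the same call shape (queue, concurrency, and a handler receiving item, index, and slot) could look like this; the implementation is an assumption. The property that matters is that each worker keeps one slot number for its whole life.

```javascript
// Hypothetical runPool: `concurrency` workers pull from a shared queue.
// Each worker keeps a fixed `slot`, which lets the caller pin one API key
// per worker.
async function runPool(queue, concurrency, handler) {
  let next = 0; // index of the next unclaimed item
  async function worker(slot) {
    while (next < queue.length) {
      const i = next++; // safe: no await between the check and the claim
      await handler(queue[i], i, slot);
    }
  }
  const workers = Array.from({ length: concurrency }, (_, slot) => worker(slot));
  await Promise.all(workers);
}
```

If the handler throws, Promise.all rejects and the other workers keep running until their current item finishes, so a real runner would likely catch errors per item instead.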
With 3 accounts at 20 RPM each, effective throughput is 60 RPM, with no cross-key interference.
Adding a new key is just one line in .env:
GROQ_API_KEY=gsk_...
GROQ_API_KEY_2=gsk_...
GROQ_API_KEY_3=gsk_...
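One way to pick up any number of such keys without touching the code is to scan the environment for the naming pattern. This is a sketch: loadApiKeys is a hypothetical helper, and it assumes the .env values have already been loaded into process.env.

```javascript
// Hypothetical loader: collect GROQ_API_KEY, GROQ_API_KEY_2, GROQ_API_KEY_3, ...
// from an env object (process.env by default), ordered by numeric suffix.
// The unsuffixed GROQ_API_KEY is treated as number 1.
function loadApiKeys(env = process.env) {
  return Object.keys(env)
    .map((name) => /^GROQ_API_KEY(?:_(\d+))?$/.exec(name))
    .filter((m) => m && env[m[0]]) // drop non-matching names and empty values
    .sort((a, b) => Number(a[1] || 1) - Number(b[1] || 1))
    .map((m) => env[m[0]]);
}
```

The resulting array can be used directly as the apiKeys list that each worker slot indexes into.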
Server Integration
The server reads transcripts.db once on startup and injects it into the Hono API. When a page of messages is fetched, we bulk-lookup transcripts for all audio/video messages on that page:
const audioPaths = enriched
  .filter(m => m.message_type === 2 && m.media)
  .map(m => m.media);
const transcriptMap = getTranscriptsByPaths(db, numId, audioPaths);
return enriched.map(msg => ({
  ...msg,
  transcript: transcriptMap.get(msg.media)?.text || null,
  transcript_lang: transcriptMap.get(msg.media)?.language || null,
}));
One query per page load. No N+1.
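A sketch of how getTranscriptsByPaths could do the whole page in one query, assuming a better-sqlite3 handle and the transcripts schema from earlier. The SQL and the Map return shape are inferred from how the result is used above, not taken from the original code.

```javascript
// Hypothetical getTranscriptsByPaths: one IN (...) query for the whole page,
// returned as a Map keyed by file_path.
function getTranscriptsByPaths(db, numId, paths) {
  const unique = [...new Set(paths)];
  if (unique.length === 0) return new Map(); // nothing to look up, skip the query
  const placeholders = unique.map(() => '?').join(', ');
  const rows = db
    .prepare(
      `SELECT file_path, text, language FROM transcripts
        WHERE number_id = ? AND file_path IN (${placeholders})`
    )
    .all(numId, ...unique);
  return new Map(rows.map((r) => [r.file_path, r]));
}
```

Only the placeholder list is interpolated into the SQL; the paths themselves are always bound as parameters.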
Search is also enriched — FTS5 matches against transcript text, resolves file_path back to the audio message via message_media, merges with text search results, deduplicates by message_id.
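The transcript half of that search, and the final merge, might look roughly like the sketch below. All three function names are hypothetical, db is assumed to be a better-sqlite3 handle, and the step that resolves file_path to a message via message_media is left out because its schema is not shown here. Wrapping the user's input in double quotes makes FTS5 treat it as a plain phrase rather than as query syntax.

```javascript
// Quote user input so FTS5 reads it as one phrase; embedded quotes are doubled.
function toFtsPhrase(query) {
  return '"' + query.replace(/"/g, '""') + '"';
}

// Hypothetical transcript search against the FTS5 index.
function searchTranscripts(db, numId, query) {
  return db
    .prepare(
      `SELECT t.file_path, t.text
         FROM transcripts_fts
         JOIN transcripts t ON t.id = transcripts_fts.rowid
        WHERE transcripts_fts MATCH ? AND t.number_id = ?`
    )
    .all(toFtsPhrase(query), numId);
}

// Merge text hits and transcript hits, keeping the first hit per message_id.
function mergeSearchResults(textHits, transcriptHits) {
  const byId = new Map();
  for (const hit of [...textHits, ...transcriptHits]) {
    if (!byId.has(hit.message_id)) byId.set(hit.message_id, hit);
  }
  return [...byId.values()];
}
```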
UI
Under each audio player, if a transcript exists:
[audio player]
BM Hai bos, sorry ya, semalam lupa nak beritahu D1 tu...
The language chip maps Whisper language detection output ("Malay (macrolanguage)") to short labels (BM, EN, ZH, YUE). Searching for a spoken word surfaces audio messages just like text messages.
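The mapping itself can be a small lookup table. In this sketch only the "Malay (macrolanguage)" key comes from the text above; the other keys are assumed spellings of the detected-language names and may need adjusting against real API output. The languageChip name is hypothetical.

```javascript
// Only 'malay (macrolanguage)' is taken from the post; the remaining keys
// are assumptions about how the other languages are reported.
const LANGUAGE_LABELS = new Map([
  ['malay (macrolanguage)', 'BM'],
  ['malay', 'BM'],
  ['english', 'EN'],
  ['chinese', 'ZH'],
  ['cantonese', 'YUE'],
]);

// Returns a short chip label, the raw value for unmapped languages,
// or null when no language was detected.
function languageChip(detected) {
  if (!detected) return null;
  const key = detected.trim().toLowerCase();
  return LANGUAGE_LABELS.get(key) || detected;
}
```

Falling back to the raw value means an unexpected language still shows up in the UI instead of silently disappearing.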
Live Progress Tracker
A batch job running for days needs visibility. The simplest approach that worked:
The batch runner writes progress.json every 10 items:
{
  "done": 4821,
  "total": 47807,
  "cost_usd": 1.42,
  "rate_per_sec": 0.81,
  "eta_min": 943,
  "updated_at": "2026-05-02T05:12:33.000Z",
  "numbers": {
    "cs": { "total": 24733, "done": 4821, "errors": 0 },
    "pos1": { "total": 23074, "done": 0, "errors": 0 }
  }
}
A static HTML page fetches this every 15 seconds and renders three progress bars — CS, POS 1, Overall — with ETA, cost, and rate. No WebSocket, no server changes. Just a file and a setInterval.
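On the writer side, a sketch of producing that file could look like this. The field names follow the progress.json example; buildProgress, writeProgress, and the inputs startedAtMs, nowMs, and processedThisRun are assumptions, as is computing the rate as a simple average over the current run. Writing to a temp file and renaming it is an extra precaution so the page never fetches a half-written file.

```javascript
const fs = require('node:fs');

// Hypothetical snapshot builder. `processedThisRun` matters after a resume,
// when `done` includes items finished by earlier runs.
function buildProgress({ done, total, costUsd, startedAtMs, nowMs, numbers,
                         processedThisRun = done }) {
  const elapsedSec = Math.max((nowMs - startedAtMs) / 1000, 1e-9);
  const rate = processedThisRun / elapsedSec; // items per second, this run
  const etaMin = rate > 0 ? Math.round((total - done) / rate / 60) : null;
  return {
    done,
    total,
    cost_usd: Number(costUsd.toFixed(2)),
    rate_per_sec: Number(rate.toFixed(2)),
    eta_min: etaMin,
    updated_at: new Date(nowMs).toISOString(),
    numbers,
  };
}

// Write to a temp file in the same directory, then rename over the target.
function writeProgress(filePath, snapshot) {
  const tmp = `${filePath}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(snapshot, null, 2));
  fs.renameSync(tmp, filePath);
}
```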
Results
| Metric | Value |
|---|---|
| Total files | 47,807 (40k audio + 7.7k video) |
| Estimated cost | ~$19 |
| Model | whisper-large-v3 |
| Languages detected | BM, EN, ZH, YUE, and mixes |
| Search latency | under 50ms (FTS5) |
| Time to build | 2 sessions |
What is Next
This transcription layer is Stage 1. Stage 2 is LLM analytics passes over the corpus:
- Session splitting — time gap over 4h = new session
- Issue classification — Sales / After-Sales / Support / Complaint
- SOP compliance — did CS greet, ask requirements, follow up, send quote?
- Response time analysis — first reply latency, avg per session
- Sentiment arc — detect frustration, early churn warning
The transcript corpus we are building now is the input to all of that. Build once, analyze many times.
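As a taste of Stage 2, the session-splitting rule (a gap of more than 4 hours starts a new session) is simple enough to sketch now. This is an illustration, not existing code: splitSessions is a hypothetical name, and messages is assumed to be one chat's messages, already sorted, each with a millisecond timestamp field.

```javascript
const SESSION_GAP_MS = 4 * 60 * 60 * 1000; // 4 h, the threshold named in the list above

// Hypothetical splitter: returns an array of sessions, each an array of messages.
function splitSessions(messages, gapMs = SESSION_GAP_MS) {
  const sessions = [];
  for (const msg of messages) {
    const current = sessions[sessions.length - 1];
    const prev = current && current[current.length - 1];
    if (prev && msg.timestamp - prev.timestamp <= gapMs) {
      current.push(msg); // within the gap: same session
    } else {
      sessions.push([msg]); // first message, or gap over the threshold
    }
  }
  return sessions;
}
```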
Stack
- Transcription: Groq Whisper API (whisper-large-v3)
- Audio extraction: ffmpeg (local)
- Storage: better-sqlite3, FTS5
- Server: Hono (Node.js)
- Sync: Syncthing (phone to server)
- Decryption: custom crypt15 (pure Node.js)
- UI: vanilla JS SPA