Transcribing 40,000 WhatsApp Voice Notes with Groq Whisper

We run a POS hardware business. Our customer service team communicates almost entirely over WhatsApp — voice notes, videos, product demos, complaints, follow-ups. Three phone numbers. Thousands of conversations. Years of operational history locked inside .opus files.

This is the story of how we built a pipeline to transcribe all of it.


The Problem

We already had a WhatsApp chat viewer — a self-hosted Node.js app that decrypts .crypt15 backups delivered via Syncthing and serves them as a read-only UI. You could scroll through chats, search text messages, view images.

But voice notes were a black box. You could play them. You couldn't search them, summarize them, or analyze them. And most of our real communication happened in voice.

The goal: transcribe all voice notes and videos, make them searchable, and lay the groundwork for AI analytics (session classification, SOP compliance scoring, CS quality measurement).


Language Reality

Before picking a model, you have to understand what you are actually transcribing.

Our audio is a mix of:

  • Bahasa Malaysia (BM) — most common
  • English — mixed in freely
  • Mandarin — some customers and staff
  • Cantonese — occasional

And critically: code-switching within a single sentence. "Bos, boleh check tak the invoice untuk customer tadi?" is a single utterance mixing BM, English, and business jargon.

This ruled out any English-only model such as distil-whisper-large-v3-en. We needed a multilingual model, which left whisper-large-v3 and its faster sibling, whisper-large-v3-turbo.

We ran a 50-file comparison between whisper-large-v3 and whisper-large-v3-turbo. Turbo is faster and cheaper ($0.04/hr vs $0.111/hr) but handled code-switching and Cantonese noticeably worse. The roughly $12 cost difference across the full backlog was not a blocker, so we locked in whisper-large-v3.
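The $12 figure can be sanity-checked from the two per-hour prices and the ~$19 full-backlog estimate quoted in the Results section. A back-of-the-envelope sketch, where the total audio duration is inferred from that estimate rather than measured:

```javascript
// Per-hour prices quoted above (USD per hour of audio).
const PRICE_LARGE_V3 = 0.111;
const PRICE_TURBO = 0.04;

// Assumption: total audio hours are inferred from the ~$19 backlog estimate,
// not taken from a measured duration.
const estimatedHours = 19 / PRICE_LARGE_V3;

function backlogCost(hours, pricePerHour) {
  return hours * pricePerHour;
}

const fullCost = backlogCost(estimatedHours, PRICE_LARGE_V3);
const turboCost = backlogCost(estimatedHours, PRICE_TURBO);
const delta = fullCost - turboCost;

console.log(estimatedHours.toFixed(0) + ' hours of audio (inferred)');
console.log('turbo: $' + turboCost.toFixed(2) + ', large-v3: $' + fullCost.toFixed(2));
console.log('difference: $' + delta.toFixed(2));
```

Under that assumption the backlog is roughly 171 hours of audio and the difference is about $12.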


Architecture

data/
  {num}/
    Media/
      WhatsApp Business Voice Notes/   <- .opus files
      WhatsApp Business Video/          <- .mp4 files
    msgstore.dec.db                     <- WhatsApp message DB (read-only)

data/transcripts.db                     <- our enrichment DB (writable, persistent)

The key insight: the enrichment DB must be separate from the source DB.

msgstore.dec.db gets hot-replaced every time Syncthing delivers a new backup. If we stored transcripts inside it, they would be wiped. transcripts.db is a separate SQLite file that persists across all syncs.

transcripts.db Schema

CREATE TABLE transcripts (
  id           INTEGER PRIMARY KEY,
  number_id    TEXT NOT NULL,
  file_path    TEXT NOT NULL,
  audio_hash   TEXT NOT NULL,
  text         TEXT,
  language     TEXT,
  duration_sec REAL,
  segments_json TEXT,
  model        TEXT,
  cost_usd     REAL,
  created_at   INTEGER,
  UNIQUE(number_id, file_path)
);

CREATE VIRTUAL TABLE transcripts_fts USING fts5(
  text, content='transcripts', content_rowid='id',
  tokenize='unicode61 remove_diacritics 2'
);

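unicode61 remove_diacritics 2 splits Latin-script text (BM/EN) into words and folds diacritics. CJK (Mandarin/Cantonese) is a weaker fit: unicode61 does not segment ideographs, so an unbroken run of Chinese characters between spaces or punctuation is indexed as a single token, and matching inside that run relies on prefix queries. In practice this gave us reasonable search without a dedicated Chinese tokenizer; FTS5's built-in trigram tokenizer is the usual next step if CJK recall becomes a problem.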
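One detail the schema leaves implicit: an external-content FTS5 table (content='transcripts') is not kept in sync automatically, so the index has to be updated whenever the base table changes. The post does not show how that is done. A minimal sketch using the trigger pattern from the SQLite FTS5 documentation, assuming a better-sqlite3 style handle with an exec method; the trigger names and the installFtsTriggers helper are hypothetical:

```javascript
// Trigger pattern for external-content FTS5 tables, adapted to the
// transcripts / transcripts_fts tables above. Trigger names are illustrative.
const FTS_TRIGGERS_SQL = `
CREATE TRIGGER IF NOT EXISTS transcripts_ai AFTER INSERT ON transcripts BEGIN
  INSERT INTO transcripts_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER IF NOT EXISTS transcripts_ad AFTER DELETE ON transcripts BEGIN
  INSERT INTO transcripts_fts(transcripts_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
END;
CREATE TRIGGER IF NOT EXISTS transcripts_au AFTER UPDATE ON transcripts BEGIN
  INSERT INTO transcripts_fts(transcripts_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
  INSERT INTO transcripts_fts(rowid, text) VALUES (new.id, new.text);
END;
`;

// db is assumed to expose exec(sql), as better-sqlite3 does.
function installFtsTriggers(db) {
  db.exec(FTS_TRIGGERS_SQL);
}
```

With triggers in place, the batch runner only ever writes to transcripts and the search index follows along.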

The audio_hash enables cross-chat deduplication: forwarded voice notes have identical bytes, so we copy the text instead of calling the API again.


The Batch Runner

The batch runner is the workhorse. Key design decisions:

Resume-safe by default

On startup, we load all already-done file_path values into a Set, then filter them out of the queue. Kill it, restart it — it picks up exactly where it left off.

const done = getDonePathSet(db, numId);  // Set of rel paths
const todo = allFiles.filter(f => !done.has(f.rel));
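getDonePathSet itself is not shown. A minimal sketch of how it might look, assuming a better-sqlite3 style handle (prepare(sql).all(...params)) and the transcripts schema above; pendingFiles is a hypothetical wrapper around the filter:

```javascript
// Hypothetical implementation: one query per phone number, collected into a
// Set so the membership check per file is O(1).
function getDonePathSet(db, numId) {
  const rows = db
    .prepare('SELECT file_path FROM transcripts WHERE number_id = ?')
    .all(numId);
  return new Set(rows.map((row) => row.file_path));
}

// Files whose relative path is already in the set are skipped.
function pendingFiles(allFiles, done) {
  return allFiles.filter((f) => !done.has(f.rel));
}
```

Because the set is rebuilt from the database on every start, a crash between two files costs at most the one file that was in flight.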

Hash dedup

Before calling the API, we SHA-256 hash the file. If we already have a transcript for that hash (from another chat or number), we copy the text — zero API cost.

const hash = hashFile(item.abs);
const existing = getTranscriptByHash(db, hash);
if (existing) {
  // copy text — no API call
} else {
  // transcribe
}
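A sketch of what hashFile and the copy-or-transcribe branch could look like. The lookupByHash and transcribe parameters are placeholders for the real database lookup and API call, and textFor is a hypothetical name:

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// SHA-256 over the raw file bytes, hex-encoded. Voice notes are small, so
// reading the whole file into memory is acceptable here.
function hashFile(absPath) {
  return crypto.createHash('sha256').update(fs.readFileSync(absPath)).digest('hex');
}

// Returns the transcript text, reusing an existing one when the bytes match.
// lookupByHash(hash) returns a row or undefined; transcribe(path) returns
// a promise of { text }. Both are injected placeholders.
async function textFor(absPath, lookupByHash, transcribe) {
  const hash = hashFile(absPath);
  const existing = lookupByHash(hash);
  if (existing) {
    return { hash, text: existing.text, reused: true }; // no API call
  }
  const result = await transcribe(absPath);
  return { hash, text: result.text, reused: false };
}
```

Hashing the bytes rather than the file name matters because WhatsApp gives a forwarded voice note a new name in each chat.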

Video: ffmpeg extraction

WhatsApp video files (.mp4) go through a pre-processing step before hitting the Whisper API:

const { execFileSync } = require('node:child_process');

function extractAudio(videoPath, outPath) {
  execFileSync('ffmpeg', [
    '-y', '-i', videoPath,
    '-vn',          // strip video track
    '-ar', '16000', // 16kHz — Whisper native rate
    '-ac', '1',     // mono
    '-b:a', '32k',  // 32kbps, sufficient for speech
    outPath,
  ], { stdio: 'pipe', timeout: 60_000 });
}

The temp .mp3 is created, sent to Groq, then deleted in a finally block. The file_path stored in the DB is the original .mp4 path — so it joins back to message_media correctly.
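The create, send, delete lifecycle can be wrapped in one helper so no call site forgets the cleanup. A minimal sketch; withExtractedAudio is a hypothetical name, extract stands in for extractAudio above, and use is whatever sends the file to the API:

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Extracts audio into a private temp directory, hands the temp path to
// `use`, and always removes the directory afterwards.
async function withExtractedAudio(videoPath, extract, use) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'wa-audio-'));
  const tempPath = path.join(dir, 'audio.mp3');
  try {
    extract(videoPath, tempPath);
    return await use(tempPath);
  } finally {
    // Runs on success and on failure, so temp files never pile up.
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

The caller keeps the original .mp4 path for the database row; the temp path never leaves the helper.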

Retry-After handling

Groq returns Retry-After headers on 429s. We honor them:

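if (resp.status === 429 && attempt < maxAttempts) {
  // Retry-After is in seconds; fall back to exponential backoff if it is missing or unparsable.
  const ra = parseFloat(resp.headers.get('retry-after') || '0');
  const base = ra > 0 ? ra * 1000 : 5000 * Math.pow(2, attempt - 1);
  const wait = base + Math.random() * 500; // jitter so workers do not retry in lockstep
  await sleep(wait);
  continue;
}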
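For context, here is a sketch of the full request loop this snippet could sit in. It assumes Node 18+ (global fetch, FormData, Blob) and Groq's OpenAI-compatible transcription endpoint. The transcribe name, the option names, and the injectable fetchImpl/sleepImpl parameters are illustrative, not taken from the original runner:

```javascript
const fs = require('node:fs');
const path = require('node:path');

const GROQ_URL = 'https://api.groq.com/openai/v1/audio/transcriptions';
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function transcribe(filePath, apiKey, opts = {}) {
  const { maxAttempts = 5, fetchImpl = fetch, sleepImpl = sleep } = opts;
  const bytes = fs.readFileSync(filePath);

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Rebuild the multipart body on every attempt.
    const form = new FormData();
    form.append('file', new Blob([bytes]), path.basename(filePath));
    form.append('model', 'whisper-large-v3');
    form.append('response_format', 'verbose_json');

    const resp = await fetchImpl(GROQ_URL, {
      method: 'POST',
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    });

    if (resp.status === 429 && attempt < maxAttempts) {
      const ra = parseFloat(resp.headers.get('retry-after') || '0');
      const base = ra > 0 ? ra * 1000 : 5000 * Math.pow(2, attempt - 1);
      await sleepImpl(base + Math.random() * 500);
      continue;
    }
    if (!resp.ok) throw new Error(`Groq HTTP ${resp.status}`);
    return resp.json();
  }
}
```

Only 429s are retried in this sketch; any other non-2xx status fails the file immediately so it shows up in the error count.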

Cost cap

if (totalCost >= MAX_COST) { stopFlag = true; return; }

Default cap: $20. The full 47,000-file backlog is estimated at ~$19.


Multi-Key Rate Limit Strategy

Groq's free tier allows 2,000 requests/day per account. With 47,000 files, that is ~24 days on one account.

The fix: multiple Groq accounts, each with their own API key.

The naive approach is round-robin — rotate keys per request. The problem: with concurrency=3, two workers can hit the same key in the same minute, triggering a 429.

The better approach: key-per-worker. Each worker slot is pinned to one key.

await runPool(queue, CONCURRENCY, async (item, i, slot) => {
  const apiKey = apiKeys[slot % apiKeys.length];
  // ...
});
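runPool is not shown in the post. A minimal sketch of a pool that passes a stable slot index to the worker, which is what makes key pinning possible; the implementation is an assumption, and only the call signature is taken from the snippet above:

```javascript
// Runs `worker(item, index, slot)` over all items with at most `concurrency`
// in flight. Each slot is a long-lived loop, so a given slot number always
// maps to the same API key for the whole run.
async function runPool(items, concurrency, worker) {
  let next = 0; // shared cursor; safe because JS runs these loops on one thread

  async function runSlot(slot) {
    while (next < items.length) {
      const index = next++;
      await worker(items[index], index, slot);
    }
  }

  const slotCount = Math.min(concurrency, items.length);
  const slots = Array.from({ length: slotCount }, (_, slot) => runSlot(slot));
  await Promise.all(slots);
}
```

In this sketch an exception from worker rejects the whole pool, so per-item error handling belongs inside the worker.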

With 3 accounts at 20 RPM each, effective throughput is 60 RPM, with no cross-key interference.

Adding a new key is just one line in .env:

GROQ_API_KEY=gsk_...
GROQ_API_KEY_2=gsk_...
GROQ_API_KEY_3=gsk_...
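One way to pick those variables up without hard-coding the count is to walk the numbered suffixes. A small sketch; loadApiKeys is a hypothetical helper, and it assumes the numbering is contiguous starting at 2:

```javascript
// Collects GROQ_API_KEY, GROQ_API_KEY_2, GROQ_API_KEY_3, ... from an env
// object. Stops at the first missing suffix, so a gap ends the list.
function loadApiKeys(env = process.env) {
  const keys = [];
  if (env.GROQ_API_KEY) keys.push(env.GROQ_API_KEY);
  for (let n = 2; env[`GROQ_API_KEY_${n}`]; n++) {
    keys.push(env[`GROQ_API_KEY_${n}`]);
  }
  return keys;
}
```

The resulting array is what apiKeys[slot % apiKeys.length] indexes into, so setting concurrency equal to the number of keys gives each worker its own account.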

Server Integration

The server opens transcripts.db once on startup and injects the handle into the Hono API. When a page of messages is fetched, we bulk-lookup transcripts for all audio/video messages on that page:

const audioPaths = enriched
  .filter(m => m.message_type === 2 && m.media)
  .map(m => m.media);

const transcriptMap = getTranscriptsByPaths(db, numId, audioPaths);

return enriched.map(msg => ({
  ...msg,
  transcript: transcriptMap.get(msg.media)?.text || null,
  transcript_lang: transcriptMap.get(msg.media)?.language || null,
}));

One query per page load. No N+1.
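A sketch of how getTranscriptsByPaths might achieve that single query, assuming a better-sqlite3 style handle and the transcripts schema above. The body is a guess; only the name and arguments come from the snippet:

```javascript
// One IN (...) query for the whole page, returned as a Map keyed by file_path.
function getTranscriptsByPaths(db, numId, paths) {
  const byPath = new Map();
  if (paths.length === 0) return byPath; // "IN ()" is not valid SQL

  // One placeholder per path keeps the values parameterized.
  const placeholders = paths.map(() => '?').join(', ');
  const rows = db
    .prepare(
      `SELECT file_path, text, language
         FROM transcripts
        WHERE number_id = ? AND file_path IN (${placeholders})`
    )
    .all(numId, ...paths);

  for (const row of rows) byPath.set(row.file_path, row);
  return byPath;
}
```

Page sizes are small, so the placeholder count stays far below SQLite's bound-parameter limit.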

Search is enriched too: an FTS5 query matches transcript text, each hit's file_path is resolved back to its audio message via message_media, and the results are merged with the text search results and deduplicated by message_id.
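Two small pieces of that search path are easy to get wrong: passing raw user input to MATCH, and merging the two result lists. A sketch of both, with hypothetical helper names and an assumed hit shape of { message_id, ... }:

```javascript
// Turns free-form user input into an FTS5 MATCH expression by quoting each
// term, so characters like - or : are not parsed as FTS5 query syntax.
// Embedded double quotes are escaped by doubling them.
function toFtsQuery(input) {
  return input
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map((term) => '"' + term.replace(/"/g, '""') + '"')
    .join(' ');
}

// Merges text hits and transcript hits, keeping the first hit per message_id.
function mergeHits(textHits, transcriptHits) {
  const byId = new Map();
  for (const hit of [...textHits, ...transcriptHits]) {
    if (!byId.has(hit.message_id)) byId.set(hit.message_id, hit);
  }
  return [...byId.values()];
}
```

Quoted terms separated by spaces are combined with an implicit AND in FTS5, so "check invoice" matches transcripts containing both words in any order.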


UI

Under each audio player, if a transcript exists:

[audio player]
BM  Hai bos, sorry ya, semalam lupa nak beritahu D1 tu...

The language chip maps Whisper language detection output ("Malay (macrolanguage)") to short labels (BM, EN, ZH, YUE). Searching for a spoken word surfaces audio messages just like text messages.
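The chip mapping can be a plain lookup table. In this sketch only "Malay (macrolanguage)" is taken from the text above; the other keys are guesses at what the API reports and would need checking against real responses:

```javascript
// Assumed mapping from the reported language name to a short chip label.
const LANGUAGE_CHIPS = new Map([
  ['malay (macrolanguage)', 'BM'],
  ['malay', 'BM'],
  ['english', 'EN'],
  ['chinese', 'ZH'],
  ['mandarin', 'ZH'],
  ['cantonese', 'YUE'],
]);

// Returns the short label, the raw name for unmapped languages, or null when
// no language was recorded.
function languageChip(language) {
  if (!language) return null;
  return LANGUAGE_CHIPS.get(language.trim().toLowerCase()) ?? language;
}
```

Falling back to the raw name keeps unexpected languages visible in the UI instead of silently hiding them.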


Live Progress Tracker

A batch job running for days needs visibility. The simplest approach that worked:

The batch runner writes progress.json every 10 items:

{
  "done": 4821,
  "total": 47807,
  "cost_usd": 1.42,
  "rate_per_sec": 0.81,
  "eta_min": 943,
  "updated_at": "2026-05-02T05:12:33.000Z",
  "numbers": {
    "cs":   { "total": 24733, "done": 4821, "errors": 0 },
    "pos1": { "total": 23074, "done": 0,    "errors": 0 }
  }
}

A static HTML page fetches this every 15 seconds and renders three progress bars — CS, POS 1, Overall — with ETA, cost, and rate. No WebSocket, no server changes. Just a file and a setInterval.
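On the writer side, the two details worth sketching are the ETA arithmetic and writing the file atomically so the page never fetches a half-written JSON. Function and parameter names here are hypothetical, and the per-number breakdown is omitted for brevity:

```javascript
const fs = require('node:fs');

// doneThisRun is what the rate is based on; after a resume, `done` also
// includes files finished in earlier runs.
function buildProgress({ done, total, costUsd, startedAtMs, doneThisRun = done, nowMs = Date.now() }) {
  const elapsedSec = Math.max((nowMs - startedAtMs) / 1000, 1e-9);
  const ratePerSec = doneThisRun / elapsedSec;
  const remaining = Math.max(total - done, 0);
  return {
    done,
    total,
    cost_usd: Number(costUsd.toFixed(2)),
    rate_per_sec: Number(ratePerSec.toFixed(2)),
    eta_min: ratePerSec > 0 ? Math.round(remaining / ratePerSec / 60) : null,
    updated_at: new Date(nowMs).toISOString(),
  };
}

// Write to a temp file, then rename over the target. A rename within the
// same directory replaces the file in one step.
function writeProgress(filePath, progress) {
  const tmpPath = filePath + '.tmp';
  fs.writeFileSync(tmpPath, JSON.stringify(progress, null, 2));
  fs.renameSync(tmpPath, filePath);
}
```

For example, 100 files done in 100 seconds with 300 remaining gives a rate of 1 file/sec and an ETA of 5 minutes.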


Results

Metric               Value
Total files          47,807 (40k audio + 7.7k video)
Estimated cost       ~$19
Model                whisper-large-v3
Languages detected   BM, EN, ZH, YUE, and mixes
Search latency       under 50ms (FTS5)
Time to build        2 sessions

What is Next

This transcription layer is Stage 1. Stage 2 is LLM analytics passes over the corpus:

  • Session splitting — time gap over 4h = new session
  • Issue classification — Sales / After-Sales / Support / Complaint
  • SOP compliance — did CS greet, ask requirements, follow up, send quote?
  • Response time analysis — first reply latency, avg per session
  • Sentiment arc — detect frustration, early churn warning
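The session-splitting rule in the first bullet is simple enough to sketch now. This assumes each message carries a timestamp in epoch milliseconds and that the list is already sorted ascending; both the field name and the function name are illustrative:

```javascript
const FOUR_HOURS_MS = 4 * 60 * 60 * 1000;

// Groups messages into sessions, starting a new session whenever the gap to
// the previous message exceeds gapMs.
function splitSessions(messages, gapMs = FOUR_HOURS_MS) {
  const sessions = [];
  let prev = null;
  for (const msg of messages) {
    if (prev === null || msg.timestamp - prev.timestamp > gapMs) {
      sessions.push([]);
    }
    sessions[sessions.length - 1].push(msg);
    prev = msg;
  }
  return sessions;
}
```

A gap of exactly four hours stays in the same session here; only a strictly larger gap starts a new one.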

The transcript corpus we are building now is the input to all of that. Build once, analyze many times.


Stack

  • Transcription: Groq Whisper API (whisper-large-v3)
  • Audio extraction: ffmpeg (local)
  • Storage: better-sqlite3, FTS5
  • Server: Hono (Node.js)
  • Sync: Syncthing (phone to server)
  • Decryption: custom crypt15 (pure Node.js)
  • UI: vanilla JS SPA