Local Speaker Diarization: How We Built a 100% Private Meeting Note Taker

OpenWhispr is the open source meeting assistant where your audio never leaves your device. Every voice fingerprint, every speaker label, every embedding — processed and stored on your own machine.

OpenWhispr

Engineering

April 15, 2026

Local speaker diarization is the process of labeling who spoke when in a meeting recording, run entirely on the user's device with no audio, embeddings, or transcripts transmitted to any external server.

Meeting transcripts have a “who said what” problem. Zoom gives you a wall of text. Most AI note takers (Otter, Fireflies, Read, Granola) solve it by uploading your meeting audio to their servers, running a speaker model there, and sending back labeled segments. Your voice, and your colleagues' voices, end up as vectors in someone else's database. We didn't want that, and neither do our users.

OpenWhispr is the open source meeting note taker built on the opposite assumption: audio never leaves your device, speaker fingerprints live in a local SQLite file on your disk, and nothing about a meeting gets uploaded unless you explicitly choose to. This article is the technical breakdown of the diarization side of that promise: what it is, which models we chose, how they fit together, and exactly where every byte of data lives at every step.

Last updated: April 15, 2026. Implementation details reference the speaker diarization feature merged into the OpenWhispr desktop app on April 14, 2026. Every technical claim below is sourced from primary references or directly inspectable in the open source repository.

What Speaker Diarization Actually Is

Speaker diarization is the process of labeling who spoke when in a recording — not what they said. Transcription answers what; diarization answers who. The two are complementary: a good meeting note taker needs both, and getting them working together is harder than either one alone.

Diarization is genuinely hard for a handful of stubborn reasons. Two voices can overlap on the same call. Background noise bleeds in. Short utterances (“yeah,” “okay”) carry almost no speaker information. New speakers show up halfway through a meeting. And unlike transcription, diarization has to make global decisions — it cannot label segment seven without comparing it to segment two. That global context is why most commercial solutions do diarization server-side on the full recording, then ship the labeled transcript back.

Transcription alone is not enough. A ten-person meeting with forty-five minutes of cross-talk, rendered as one undifferentiated wall of text, is unreadable. Meeting notes — the actual deliverable people want — require speaker attribution, because action items, decisions, and commitments are all attached to people. “Alice will ship the migration by Friday” is a note. “Will ship the migration by Friday” is noise.

Diarization is not the same as speaker recognition

Diarization says “these three chunks are Speaker A; these two are Speaker B.” It does not know who Speaker A is. Speaker recognition puts a name on Speaker A by matching a voice fingerprint to a stored profile. OpenWhispr does both — diarization runs on every meeting, recognition kicks in once you have labeled someone by name. Both stay local.

The Four-Stage Local Pipeline

OpenWhispr's speaker diarization pipeline has four stages: voice activity detection, segmentation, speaker embedding, and clustering — followed by a merge step that reconciles speaker segments with the transcript. Each stage is a separate ONNX model or native binary. No Python runtime, no PyTorch, no CUDA. The total on-disk footprint of the models is about 45MB.

The Local Diarization Pipeline

  • Audio Capture: Mic + system, 16kHz
  • VAD: Silero, 2MB
  • Segmentation: pyannote 3.0
  • Embedding: CAM++, 512-dim
  • Clustering: Agglomerative, threshold 0.5
  • Labels: Match profiles in SQLite

Why it stays local:

  • Every stage runs on your device via a single sherpa-onnx binary: no Python, no cloud.
  • ~45MB of ONNX models downloaded once, then cached at ~/.cache/openwhispr/diarization-models/.
  • Embeddings are 512-dim float32 vectors stored as SQLite BLOBs on your local disk.

Audio capture is dual-stream from day one. The microphone is definitionally “you,” so we never spend compute trying to diarize your own voice — every mic segment gets labeled you by source, not by voice matching. System audio (everyone else on the call) is the only thing that actually needs speaker-splitting, which immediately halves the work.

Once audio is captured, the four stages run in sequence. VAD drops silence so the expensive embedding stage never wastes cycles on empty frames. Segmentation cuts continuous speech into single-speaker chunks. The embedding model turns each chunk into a 512-dimensional voice fingerprint. Agglomerative clustering groups fingerprints that look alike, producing a final set of speaker IDs. A last pass merges those IDs back into the transcript by timestamp overlap, so every word ends up tagged with a speaker.

The whole thing runs inside a single child process spawned from the app — no server, no listening port, no IPC to anything outside Electron. You can confirm that by reading src/helpers/diarization.js in the open source repo.

Stage 1: Voice Activity Detection (Silero)

Voice activity detection (VAD) is the cheap filter that separates speech from silence before the expensive stages run. Running a speaker embedding model on silence is wasted CPU and produces garbage vectors. VAD is what keeps the rest of the pipeline fast.

We use Silero VAD: a 2MB ONNX model, MIT-licensed, effectively the open-source industry default. Silero returns a per-frame probability that the current 32-millisecond window contains speech. On a modern CPU, the cost is negligible — about 0.1 milliseconds per 32-millisecond chunk, or roughly a third of a percent of one core.

The live pipeline runs Silero continuously on system audio during a call. The thresholds we actually ship with are tuned for real meeting audio, not a clean benchmark:

  • Window size: 512 samples (32ms at 16kHz)
  • Speech threshold: 0.15 — deliberately aggressive so quiet or distant speakers are not missed
  • Silence threshold: 0.08
  • Segment ends after: 16 consecutive silent windows (~512ms)
  • Minimum segment for embedding: 0.8 seconds
  • Live identification cadence: every 1 second once ≥1.6 seconds of speech have accumulated
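
Taken together, the thresholds above describe a small hysteresis state machine. The following is a minimal sketch of that idea, not the code OpenWhispr ships. The function name `segmentSpeech` and its option names are hypothetical. It assumes one speech probability per 32ms window, that probabilities between the two thresholds count as "still speaking", and that a segment's end is trimmed back to the last non-silent window.

```javascript
// Hypothetical sketch: turn per-window speech probabilities into segments.
const VAD_WINDOW_MS = 32; // 512 samples at 16kHz

function segmentSpeech(probs, opts = {}) {
  const {
    speechThreshold = 0.15,  // open a segment at or above this
    silenceThreshold = 0.08, // count a window as silent below this
    maxSilentWindows = 16,   // ~512ms of silence closes the segment
    minSegmentMs = 800,      // shorter segments are dropped
  } = opts;

  const segments = [];
  let start = -1;      // first window of the open segment, -1 if none
  let lastSpeech = -1; // most recent non-silent window
  let silentRun = 0;

  const close = () => {
    const startMs = start * VAD_WINDOW_MS;
    const endMs = (lastSpeech + 1) * VAD_WINDOW_MS;
    if (endMs - startMs >= minSegmentMs) segments.push({ startMs, endMs });
    start = -1;
    silentRun = 0;
  };

  probs.forEach((p, i) => {
    if (start === -1) {
      if (p >= speechThreshold) {
        start = i;
        lastSpeech = i;
        silentRun = 0;
      }
      return;
    }
    if (p < silenceThreshold) {
      silentRun += 1;
      if (silentRun >= maxSilentWindows) close();
    } else {
      // Between the thresholds the segment stays open (hysteresis).
      lastSpeech = i;
      silentRun = 0;
    }
  });

  if (start !== -1) close(); // flush a segment still open at end of input
  return segments;
}
```

With these defaults, a 160ms burst (five windows, for example a loud keystroke) opens a segment but is discarded by the 0.8-second minimum, which is the filtering behavior described in the trade-off below.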

Honest trade-off

An aggressive speech threshold of 0.15 catches soft talkers on distant mics, but it also occasionally false-triggers on loud keystrokes or room noise. The 0.8-second minimum-segment filter discards most of those false positives before they reach the embedding stage, so downstream accuracy is protected.

Stage 2: Segmentation (pyannote 3.0)

Segmentation takes continuous speech and cuts it into chunks that each contain only one speaker, detecting both speaker-change points and overlap regions. It is the stage that decides where the speaker boundaries are. Everything after it assumes those boundaries are correct.

We use pyannote-segmentation-3.0, the current version of the model from the pyannote.audio team, ONNX-exported by the sherpa-onnx project. Pyannote is the de-facto academic baseline: the pyannote.audio pipeline achieves diarization error rates in the 12–15% range on standard benchmarks like AMI and CALLHOME — the same numbers commercial cloud providers like to quote.

Why the ONNX port specifically? Because shipping PyTorch inside an Electron app is a non-starter — the runtime alone is roughly 1.5GB and effectively requires a GPU to be fast. The ONNX export is a single 6.6MB file that runs on CPU via ONNX Runtime. Same model, tiny fraction of the weight.

We invoke it through sherpa-onnx's offline diarization binary with a handful of flags. The minimum-duration settings come from the pyannote paper's recommended defaults: short speech bursts get merged into adjacent silence, and short silences do not break a continuous speaker segment. Output is a list of start/end time pairs tagged with provisional speaker IDs (speaker_00, speaker_01, etc.).

Stage 3: Speaker Embeddings (CAM++)

A speaker embedding is a fixed-length list of numbers that represents a voice the way a hash represents a file. Two recordings of the same voice produce embeddings that are close in vector space; different voices produce embeddings that are far apart. “Close” is measured by cosine similarity — a single number between -1 and 1. This is the mathematical heart of everything else in the pipeline.

Our embedding model is CAM++ from the 3D-Speaker project at Alibaba's DAMO Academy, trained on the VoxCeleb speaker-verification benchmark. The specific file we ship is 3dspeaker_speech_campplus_sv_en_voxceleb_16k.onnx, a roughly 28MB ONNX export that produces a 512-dimensional float32 vector per segment. The paper is arXiv:2303.00332 (Wang et al., 2023).

We picked CAM++ over the obvious alternative, ECAPA-TDNN, for three concrete reasons: roughly half the parameters, lower equal-error rate on VoxCeleb1-O, and faster CPU inference. NVIDIA NeMo's MSDD model has no first-party ONNX export and leans heavily on PyTorch. Picovoice Falcon carries a commercial per-seat license. Apple's CoreML Speech framework is macOS-only. CAM++ was the only option that satisfied our constraints — cross-platform, ONNX, open license, CPU-fast.

To feed CAM++ we ship our own log-mel filterbank feature extractor in src/helpers/speakerEmbeddings.js: 80 mel bands, 25-millisecond windows, 10-millisecond hops, standard parameters for this class of model. Features flow into the ONNX model via onnxruntime-node, running in the Electron main process. Out the other side comes a 512-entry Float32Array, which we write into SQLite as a 2,048-byte BLOB.
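
The numbers in this paragraph are easy to check by hand. The sketch below is illustrative only: `frameCount`, `embeddingToBlob`, and `blobToEmbedding` are hypothetical names, not functions from the repository, and the frame count assumes no padding at the segment edges. The byte layout is the platform's native float32 order, which is little-endian on the platforms discussed here.

```javascript
const FBANK_SAMPLE_RATE = 16000;
const FBANK_WIN = (FBANK_SAMPLE_RATE * 25) / 1000; // 25ms -> 400 samples
const FBANK_HOP = (FBANK_SAMPLE_RATE * 10) / 1000; // 10ms -> 160 samples

// Number of full analysis windows that fit in a segment (no padding assumed).
function frameCount(numSamples) {
  if (numSamples < FBANK_WIN) return 0;
  return 1 + Math.floor((numSamples - FBANK_WIN) / FBANK_HOP);
}

// View the embedding's bytes as a Buffer suitable for a SQLite BLOB column.
// No copy is made: 512 float32 values occupy 512 * 4 = 2,048 bytes.
function embeddingToBlob(embedding) {
  return Buffer.from(embedding.buffer, embedding.byteOffset, embedding.byteLength);
}

// Copy the BLOB first so the Float32Array starts at a 4-byte-aligned offset,
// which a Buffer handed back by a database driver does not guarantee.
function blobToEmbedding(blob) {
  const bytes = new Uint8Array(blob);
  return new Float32Array(bytes.buffer, 0, bytes.byteLength / 4);
}
```

Under these assumptions the 0.8-second minimum segment (12,800 samples) yields 78 feature frames, each with 80 mel bands.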

The matching step itself is the simplest possible thing: cosine similarity between two vectors. Here is the literal code that decides whether two meeting segments are the same speaker:

// src/helpers/speakerEmbeddings.js
cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

Why 512 dimensions?

512 is the current sweet spot between accuracy and storage cost on the VoxCeleb speaker-verification leaderboard. 256 dimensions start losing discriminative power on close-but-different voices; 1024 dimensions add storage and compute with diminishing returns. 512 float32s take 2KB per embedding, so a thousand meetings of ten speakers each (10,000 embeddings) come to about 20MB on disk. That fits comfortably in a local SQLite file.

Stage 4: Clustering — the “how many speakers?” problem

After embedding every segment you have N vectors. Clustering groups them into K speakers — but the hard part is that you usually don't know K up front. A sales call might have 2 speakers. An all-hands might have 8. Fixed-K clustering breaks at both ends.

We use agglomerative clustering with a cosine-similarity threshold of 0.5: start with every segment as its own cluster, merge the two clusters with the highest similarity, repeat until no remaining pair exceeds the threshold. The threshold, not a target count, decides where to stop. This scales naturally from two speakers to ten without any configuration.
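
To make the stopping rule concrete, here is a minimal sketch of threshold-based agglomerative clustering in plain JavaScript. It is an illustration only: in OpenWhispr the clustering happens inside the sherpa-onnx binary, and its internals may differ. The names `cosine` and `clusterBySimilarity` are hypothetical, and average linkage is an assumption of this sketch, chosen because it is a common default.

```javascript
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Returns one cluster label per input embedding. K is never passed in:
// merging stops when no pair of clusters is more similar than `threshold`.
function clusterBySimilarity(embeddings, threshold = 0.5) {
  const clusters = embeddings.map((_, i) => [i]); // every segment starts alone

  // Average-linkage similarity between two clusters (assumption of this sketch).
  const linkage = (c1, c2) => {
    let sum = 0;
    for (const i of c1) for (const j of c2) sum += cosine(embeddings[i], embeddings[j]);
    return sum / (c1.length * c2.length);
  };

  while (clusters.length > 1) {
    let best = { sim: -Infinity, a: -1, b: -1 };
    for (let a = 0; a < clusters.length; a++) {
      for (let b = a + 1; b < clusters.length; b++) {
        const sim = linkage(clusters[a], clusters[b]);
        if (sim > best.sim) best = { sim, a, b };
      }
    }
    if (best.sim <= threshold) break; // no remaining pair exceeds the threshold
    clusters[best.a] = clusters[best.a].concat(clusters[best.b]);
    clusters.splice(best.b, 1);
  }

  const labels = new Array(embeddings.length);
  clusters.forEach((members, k) => members.forEach((i) => { labels[i] = k; }));
  return labels;
}
```

This brute-force version recomputes every pairwise linkage on each pass, which is fine for the few hundred segments of a typical meeting but is not how an optimized implementation would do it.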

The actual invocation lives in src/helpers/diarization.js. It is a spawn of the sherpa-onnx binary with a handful of CLI flags:

// src/helpers/diarization.js
const args = [
  `--segmentation.pyannote-model=${segPath}`,
  `--embedding.model=${embPath}`,
  `--clustering.num-clusters=${numSpeakers}`,    // -1 = auto
  `--clustering.cluster-threshold=${threshold}`, // 0.5 default
  "--min-duration-on=0.2",
  "--min-duration-off=0.5",
  wavPath,
];

Output is plain text: one line per segment, formatted as start_sec -- end_sec speaker_NN. We parse that into an array of objects and then walk the transcript, matching each transcript segment to the diarization segment with maximum time overlap. Mic-sourced segments are always labeled you — source beats voice, every time.
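
The parse-and-merge step can be sketched as follows. This is a hedged illustration rather than the repository's code: `parseDiarization` and `assignSpeakers` are hypothetical names, and the transcript segment shape (`start`, `end`, `source`, `text`) and the `"mic"` source value are assumptions made for the example.

```javascript
// Parse lines shaped like "0.32 -- 6.87 speaker_00"; anything else is skipped.
function parseDiarization(stdout) {
  const re = /^\s*(\d+(?:\.\d+)?)\s+--\s+(\d+(?:\.\d+)?)\s+(speaker_\d+)\s*$/;
  return stdout.split(/\r?\n/).flatMap((line) => {
    const m = re.exec(line);
    return m ? [{ start: Number(m[1]), end: Number(m[2]), speaker: m[3] }] : [];
  });
}

// Tag each transcript segment with the diarization speaker it overlaps most.
function assignSpeakers(transcript, diarization) {
  return transcript.map((seg) => {
    // Source beats voice: microphone audio is always the local user.
    if (seg.source === "mic") return { ...seg, speaker: "you" };

    let best = null;
    let bestOverlap = 0;
    for (const d of diarization) {
      const overlap = Math.min(seg.end, d.end) - Math.max(seg.start, d.start);
      if (overlap > bestOverlap) {
        bestOverlap = overlap;
        best = d.speaker;
      }
    }
    return { ...seg, speaker: best }; // null when nothing overlaps
  });
}
```

Returning `null` for a segment with no overlap leaves the decision about a fallback label to the caller instead of guessing one here.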

The 0.5 threshold is not a guess. We tuned it against our own eval harness (scripts/meeting-diarization-eval.js) on real meeting recordings. Higher thresholds under-cluster — one speaker gets split into two. Lower thresholds over-cluster — two similar voices get merged. 0.5 is the sweet spot we converged on.

Live Diarization: Labels As You Speak

Batch-only diarization is accurate but invisible during a call. Live-only is immediate but noisy. OpenWhispr runs both: live labels appear during the meeting, and a full batch pass refines them after.

The live path lives in src/helpers/liveSpeakerIdentifier.js. It taps the system audio stream at 16kHz, feeds every 32-millisecond frame into Silero VAD, and accumulates speech frames until a segment either closes (after ~512ms of silence) or hits the live-identification cadence (every 1 second once ≥1.6 seconds have accumulated). When it triggers, it extracts a CAM++ embedding, compares it against the in-memory map of active-call embeddings and stored speaker profiles, and fires an IPC event — meeting-speaker-identified — with the best match. The React side picks that up and updates the transcript bubble in place.

Once a human has manually set a speaker name in the UI, that assignment is locked. The post-meeting batch pass is not allowed to overwrite it, even if its own analysis disagrees. This lock behavior lives in src/utils/transcriptSpeakerState.ts, and it matters: the worst possible UX is letting an automated pass silently overwrite a user's correction.

When the recording stops, we run the full offline diarization on the complete WAV file. Batch has full context — it sees the whole meeting, not just the last 1.6 seconds — and produces a cleaner final labeling. The batch result gets reconciled with live labels, respects every user lock, and replaces provisional guesses with the more accurate grouping.
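
The reconciliation rule reduces to a short priority order: a user lock wins, then the batch result, then the live guess. A minimal sketch of that order, assuming a hypothetical data shape in which live labels are `{ speaker, locked }` objects keyed by segment id and batch results are plain speaker names keyed the same way:

```javascript
// Hypothetical shapes:
//   liveLabels:    { [segmentId]: { speaker: string, locked: boolean } }
//   batchSpeakers: { [segmentId]: string }
function reconcileLabels(liveLabels, batchSpeakers) {
  const ids = new Set([...Object.keys(liveLabels), ...Object.keys(batchSpeakers)]);
  const finalLabels = {};

  for (const id of ids) {
    const live = liveLabels[id];
    const batchSpeaker = batchSpeakers[id];

    if (live && live.locked) {
      finalLabels[id] = { ...live }; // a user correction is never overwritten
    } else if (batchSpeaker !== undefined) {
      finalLabels[id] = { speaker: batchSpeaker, locked: false }; // full-context result
    } else {
      finalLabels[id] = { ...live }; // batch had nothing for this segment
    }
  }
  return finalLabels;
}
```

The function returns a new object instead of mutating its inputs, so the provisional live labels remain available if the batch pass has to be re-run.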

Why hybrid beats either alone

Live-only is fast but noisy on short segments. Batch-only is accurate but silent until the meeting ends. Hybrid gives you both — instant feedback during the call, and a final transcript that reflects the full-context analysis. The cost is one extra reconciliation step, which runs in the background once and is effectively free.

Voice Profiles: Remembering Speakers Across Meetings

Once you label a speaker “Alice” in one meeting, OpenWhispr labels her automatically in every future meeting — without you doing anything, and without her voice fingerprint ever leaving your device.

The mechanism is straightforward: every speaker's centroid embedding gets stored in your local SQLite database. When new embeddings appear on a future call, we compare them by cosine similarity against every stored profile. A match above threshold auto-assigns the name. Nothing about this touches a server. The three tables that power it are plain SQLite:

-- src/helpers/database.js
CREATE TABLE speaker_profiles (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  display_name TEXT NOT NULL,
  email TEXT,
  embedding BLOB NOT NULL,          -- 512 float32s (~2KB)
  sample_count INTEGER DEFAULT 1,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE speaker_mappings (
  note_id INTEGER NOT NULL,
  speaker_id TEXT NOT NULL,
  profile_id INTEGER,
  display_name TEXT NOT NULL,
  PRIMARY KEY (note_id, speaker_id),
  FOREIGN KEY (note_id) REFERENCES notes(id) ON DELETE CASCADE,
  FOREIGN KEY (profile_id) REFERENCES speaker_profiles(id) ON DELETE SET NULL
);

CREATE TABLE note_speaker_embeddings (
  note_id INTEGER NOT NULL,
  speaker_id TEXT NOT NULL,
  embedding BLOB NOT NULL,
  PRIMARY KEY (note_id, speaker_id),
  FOREIGN KEY (note_id) REFERENCES notes(id) ON DELETE CASCADE
);

Matching is three-tier, to balance confidence against false positives:

  • Cosine ≥ 0.70: auto-confirm. The label appears immediately with no prompt.
  • 0.55 ≤ Cosine < 0.70: suggest. The UI shows “Is this Alice?” with confirm / dismiss controls, and waits for your input.
  • Cosine < 0.55: stay anonymous. The segment remains Speaker N until you name them manually.
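
The three tiers map directly onto a small decision function. The sketch below is illustrative: `matchProfile`, `cosineSim`, and the constant names are hypothetical, and the profile shape (`{ displayName, embedding }`) is assumed for the example. Only the two cut-offs, 0.70 and 0.55, come from the list above.

```javascript
const AUTO_CONFIRM_AT = 0.70;
const SUGGEST_AT = 0.55;

function cosineSim(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Find the most similar stored profile, then decide what the UI should do.
function matchProfile(embedding, profiles) {
  let best = null;
  for (const profile of profiles) {
    const score = cosineSim(embedding, profile.embedding);
    if (best === null || score > best.score) best = { profile, score };
  }

  if (best === null || best.score < SUGGEST_AT) {
    return { tier: "anonymous", profile: null, score: best ? best.score : null };
  }
  const tier = best.score >= AUTO_CONFIRM_AT ? "auto-confirm" : "suggest";
  return { tier, profile: best.profile, score: best.score };
}
```

Keeping the score in the result lets the caller log how close a rejected match was, which is useful when tuning the two cut-offs.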

When you confirm a match, the stored centroid updates via running mean rather than a raw overwrite. The formula is simple: new_avg = (stored_avg × sample_count + new_embedding) / (sample_count + 1). This handles voice drift — a cold morning, a different mic, a bad cable — without breaking future matches. The more samples of Alice's voice the system has seen, the more stable her profile becomes.
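
The running-mean formula translates into a few lines of code. This is a minimal sketch, with `updateCentroid` as a hypothetical name; it returns the new centroid together with the incremented sample count, mirroring the `embedding` and `sample_count` columns shown earlier.

```javascript
// new_avg = (stored_avg * sample_count + new_embedding) / (sample_count + 1)
function updateCentroid(storedAvg, sampleCount, newEmbedding) {
  if (storedAvg.length !== newEmbedding.length) {
    throw new Error("embedding dimension mismatch");
  }
  const updated = new Float32Array(storedAvg.length);
  for (let i = 0; i < storedAvg.length; i++) {
    updated[i] = (storedAvg[i] * sampleCount + newEmbedding[i]) / (sampleCount + 1);
  }
  return { embedding: updated, sampleCount: sampleCount + 1 };
}
```

For example, a stored centroid of [1, 0] built from three samples, updated with [0, 1], becomes [0.75, 0.25] with a sample count of four. The averaged vector is generally not unit length, but cosine similarity ignores magnitude, so matching is unaffected.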

New profiles also trigger retroactive relabeling. When you name a speaker for the first time, a background task walks every past meeting's stored embeddings, finds matches, and updates previously-unnamed speakers in historical transcripts. User-locked mappings are never touched. The result: naming Alice once retroactively labels her in every prior call she attended.

If Google Calendar is connected and the meeting has attendees, the speaker-picker UI pre-populates with their names and emails. One click maps Speaker 2 to alice@example.com and persists forever. Attendees and voice profiles stay linked, so the right names show up even across devices.

Where Your Data Actually Lives

Audio never leaves your device. That is not a marketing line — it is a mechanical property of the code, and because OpenWhispr is open source you can verify it yourself. Here is exactly what happens to each piece of data at each step.

| What | Where it lives | Leaves your device? |
|---|---|---|
| Raw meeting audio | OS temp directory, deleted after diarization | No, never |
| Diarization ONNX models | ~/.cache/openwhispr/diarization-models/ | Downloaded once on first launch, then never again |
| Speaker embeddings (voice fingerprints) | Local SQLite BLOB | No, never |
| Speaker-to-name mappings | Local SQLite table | No, never |
| Transcript text | Local SQLite (optional cloud sync if you turn it on) | Only if you opt in |

The raw PCM audio from a meeting is written to a file in your OS temp directory, handed to the locally-spawned sherpa-onnx binary, and deleted in a finally block when diarization completes. There is no code path in OpenWhispr's diarization pipeline that opens a network socket to upload audio. Because the code is open source, you can grep src/helpers/diarization.js, src/helpers/liveSpeakerIdentifier.js, and src/helpers/speakerEmbeddings.js for fetch, axios, or http and confirm that yourself.
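
The write, process, delete pattern can be sketched like this. It is a simplified illustration, not the code in src/helpers/diarization.js: `withTempWav` is a hypothetical helper, the file name is made up, and the synchronous file calls are used only to keep the example short (the real pipeline spawns a child process, which is asynchronous).

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Write audio to a private temp directory, run `work` on the file path,
// and remove the directory afterwards whether `work` succeeds or throws.
function withTempWav(audioBytes, work) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "diarize-"));
  const wavPath = path.join(dir, "meeting.wav");
  try {
    fs.writeFileSync(wavPath, audioBytes);
    return work(wavPath);
  } finally {
    // Runs on both the success and the error path.
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

The cleanup lives in `finally` so that a crash or non-zero exit in the processing step cannot leave meeting audio behind in the temp directory.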

IPC between Electron's main and renderer processes uses Electron's built-in IPC channels, exposed to the renderer through contextBridge. Those are local inter-process pipes on your own machine, not network sockets. Speaker embeddings and voice profiles are stored as SQLite BLOBs in the same database file as your notes, with the same OS-level file permissions.

One exception, transparently

The only network activity in this feature is the one-time model download on first launch — three ONNX files fetched from the sherpa-onnx GitHub releases CDN. We document this explicitly because “one network call during setup, zero after” is a very different privacy story from “always-on cloud inference,” and as an open source meeting assistant we would rather tell you than have you discover it.

Honest Limitations

Local diarization is good, not perfect. Here are the cases where it genuinely struggles, and what we do about each one.

  • Overlapping speakers. Two people talking at once is inherently lossy for single-label attribution. Pyannote 3.0 detects the overlap region, but a single speaker_id per segment is not enough to represent “both people talking.” We mark those segments as provisional and let you correct them in the UI.
  • Cold-start accuracy. The first meeting with a new colleague uses only the generic VoxCeleb-trained features. Once you label them once and the profile has a few samples, subsequent meetings are noticeably more accurate.
  • Short utterances. “Yeah,” “okay,” a quick laugh: anything under ~0.8 seconds cannot produce a reliable embedding. We drop those segments from the embedding stage and fall back to label propagation from the surrounding context.
  • Far-field or low-SNR audio. Someone on a phone speakerphone in a noisy room will degrade gracefully but visibly. Automatic gain control and VAD tuning help, but no amount of model sophistication completely overcomes bad physics.
  • CPU-only inference. A GPU would be faster, but we deliberately don't require one. In practice, throughput on a modern CPU is fine: ~100 milliseconds per embedding, and about 30 seconds of batch diarization for a 45-minute meeting on an M1 Mac.

We publish our eval harness (scripts/meeting-diarization-eval.js) in the open source repo so you can measure any of this yourself, on your own recordings — not our cherry-picked demo clips. That feels more honest than a glossy number in a marketing deck.

Why We Chose Sherpa-ONNX

We needed a diarization runtime that was cross-platform, Python-free, CPU-fast, and open source. Sherpa-ONNX was the only option that hit all four.

Our requirements were concrete. It had to run on macOS (both Intel and Apple Silicon), Windows x64, and Linux x64 without separate code paths. It could not bundle PyTorch — the runtime alone is about 1.5GB and essentially requires a GPU to be usable at interactive latencies. It could not require CUDA. And the license had to be permissive enough to ship in a commercial-but-open-source desktop app.

Here is what we actually evaluated:

  • pyannote.audio (Python): ruled out. Bundling PyTorch in an Electron app is a non-starter.
  • NVIDIA NeMo MSDD: PyTorch-dependent, no first-party ONNX export, GPU-centric.
  • Picovoice Falcon: commercial per-active-user license plus cloud-activation on first run.
  • Apple CoreML + Speech framework: macOS-only, no cross-platform parity.
  • sherpa-onnx: Apache 2.0, native ONNX binary, ~35MB of models, spawn-and-read-stdout interface, actively maintained by k2-fsa. Same toolchain powers production ASR in other shipping products.

One honest caveat: sherpa-onnx's offline speaker diarization binary is relatively new. We version-pin to v1.12.23 and run our full eval harness against every update before bumping the pin. That's the right trade-off for something this load-bearing.

Frequently Asked Questions

Is there a private meeting note taker that doesn't upload my audio to the cloud?
Yes. OpenWhispr is a free, open source meeting note taker where your audio never leaves your device. Transcription runs locally with OpenAI Whisper or NVIDIA Parakeet, speaker diarization runs locally with pyannote segmentation and CAM++ embeddings, and voice fingerprints are stored in a local SQLite file on your own machine — not on our servers.
Can Otter, Fireflies, or Granola see my meetings?
Yes. Cloud-based AI note takers upload your meeting audio to their servers in order to transcribe and diarize it. Speaker labels, transcripts, and voice embeddings are computed in their infrastructure and stored in their databases. If you need an alternative that keeps every byte of audio on your device, you need a local-first tool like OpenWhispr.
What is speaker diarization?
Speaker diarization is the process of labeling who spoke when in a recording, as distinct from transcription, which answers what was said. A diarized transcript shows each utterance tagged with a speaker label — Alice, Bob, Speaker 3 — instead of one undifferentiated wall of text.
Does local speaker diarization actually work, or is it a downgrade from the cloud?
It works. OpenWhispr's pipeline — pyannote-segmentation-3.0 plus CAM++ embeddings from the 3D-Speaker project, run via sherpa-onnx — achieves diarization error rates in the same ballpark as commercial cloud APIs on standard benchmarks. The main honest trade-offs are overlapping-speaker handling and cold-start accuracy on voices the system has never heard before.
How much disk space do the local diarization models use?
About 45MB in total: pyannote segmentation (6.6MB) plus the CAM++ speaker embedding model (28MB) plus Silero VAD (2MB). All three are downloaded once on first launch and cached at ~/.cache/openwhispr/diarization-models/. After that they stay on disk and never re-download.
Do I need a GPU to run OpenWhispr's local diarization?
No. Everything runs on CPU via ONNX Runtime. Modern Apple Silicon and recent x86 CPUs handle it comfortably — in practice, about 30 seconds of batch diarization for a 45-minute meeting on an M1 Mac, with live speaker labels appearing within a second or two during the call itself.
Is OpenWhispr actually open source?
Yes. The diarization code, the transcription pipeline, and the rest of the desktop app are open source on GitHub. The privacy claims in this article are directly verifiable: grep the diarization helpers for fetch, axios, or http and you will not find any network calls in the diarization path.
What happens when two people talk over each other on a call?
Pyannote 3.0 explicitly detects and annotates overlap regions, but single-label attribution during overlap is inherently lossy. We mark those segments as provisional in the transcript and let you correct them by clicking the speaker label. Your manual corrections are locked and are never overwritten by later automated passes.

Try the Private Meeting Note Taker

OpenWhispr is a free, open source meeting assistant where your audio never leaves your device. Local diarization, local transcription, local voice profiles. Works on macOS, Windows, and Linux.

No account required · Works offline · Open source forever