
How Whisper AI Works: A Complete Guide

A technical deep dive into OpenAI's open-source speech recognition model — from mel spectrograms to transformer decoding.

February 3, 2026

Whisper is an automatic speech recognition (ASR) model developed by OpenAI. Released in September 2022, it is a general-purpose speech recognition system trained on 680,000 hours of multilingual audio data collected from the internet. Whisper uses an encoder-decoder Transformer architecture that converts audio into log-mel spectrograms, processes them through a neural network encoder, and autoregressively decodes text tokens. It supports transcription in 99 languages, translation to English, language identification, and timestamp prediction — all within a single unified model. Unlike most commercial ASR services, Whisper's weights are fully open source under the MIT license, which has enabled a thriving ecosystem of ports, optimizations, and applications — including whisper.cpp, the C/C++ implementation that makes Whisper practical for real-time, on-device speech recognition.

Last updated: February 17, 2026. Quantitative claims in this article are cross-checked against at least two primary sources.

Fact-Check Snapshot (Dual Sources)

  • Whisper release, training scale, and license: OpenAI released Whisper in September 2022, trained on 680,000 hours, under MIT. arXiv paper · OpenAI Whisper repo
  • Task scope and language support: Whisper supports transcription, translation-to-English, language identification, and timestamp tokens across 99 languages. Whisper model card · Whisper README
  • English benchmark context: around 3% WER is a benchmark figure on LibriSpeech test-clean for English-focused evaluation, not a universal real-world rate. large-v2 model card · benchmark details
  • Recent model updates: large-v3 reports lower errors than large-v2 on many languages, and turbo focuses on lower latency with smaller decoder compute. large-v3 model card · turbo release discussion
  • OpenWhispr relevance: OpenWhispr uses Whisper via whisper.cpp for local, offline-first dictation across desktop platforms. whisper.cpp · OpenWhispr

How Whisper Processes Audio

Audio Input → Log-Mel Features → Encoder → Decoder → Text

Why it matters for OpenWhispr:

  • Runs locally via whisper.cpp — raw audio stays on-device.
  • Model size controls the speed/accuracy tradeoff.
  • Same pipeline powers dictation across macOS, Windows, and Linux.

What Is Whisper?

Whisper was introduced in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever at OpenAI. The code and model weights were published on GitHub in September 2022 under the MIT license.

Before Whisper, state-of-the-art speech recognition systems were typically trained on curated, human-labeled datasets — often limited to specific languages, accents, or domains. Whisper broke from this approach by training on a massive, diverse dataset of 680,000 hours of audio paired with transcripts scraped from the internet. The key insight was that this "weakly supervised" approach — using imperfect, machine-generated or user-uploaded transcripts rather than hand-labeled data — could produce a model with remarkable robustness across languages, accents, background noise, and technical vocabulary.

The result is a single model that can handle multiple speech processing tasks: transcription (speech to text in the same language), translation (speech in any language to English text), language identification, and voice activity detection with timestamps. This multitask capability is baked into the model's architecture through a system of special tokens that specify which task to perform.

Whisper's open-source release was significant. It democratized access to high-quality ASR, allowing developers, researchers, and companies to use a model competitive with commercial APIs — without per-minute pricing, without sending audio to the cloud, and without vendor lock-in. This catalyzed the development of tools like whisper.cpp, Faster Whisper, and dozens of applications built on top of them — including OpenWhispr.

Architecture Deep Dive

Whisper uses an encoder-decoder Transformer architecture — the same fundamental design behind models like the original Transformer (Vaswani et al., 2017) and many machine translation systems. The encoder processes audio, the decoder generates text. Here is how each stage works.

Audio Preprocessing

Before any neural network processing, raw audio is converted into a visual representation of sound called a log-mel spectrogram. The process works as follows:

  1. Audio is resampled to 16,000 Hz (16 kHz mono).
  2. A Short-Time Fourier Transform (STFT) is computed using 25-millisecond windows with a 10-millisecond stride (hop length).
  3. The frequency spectrum is projected onto 80 mel-frequency filter banks (128 for large-v3), which approximate human auditory perception by spacing frequency bands logarithmically.
  4. A logarithmic scaling is applied, producing the final log-mel spectrogram.
  5. The spectrogram is divided into 30-second chunks. Audio shorter than 30 seconds is zero-padded; longer audio is processed in sequential chunks.

The result is a 2D representation with shape (80, 3000) — 80 mel channels by 3,000 time steps (30 seconds at 10ms stride). This is analogous to an image, and the encoder processes it much like a vision model processes pixel data.
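The framing arithmetic behind that shape can be checked directly from the constants listed in the steps above (16 kHz sample rate, 25 ms window, 10 ms hop, 30-second chunks, 80 mel bins):

```python
# Whisper's audio framing constants, taken from the preprocessing steps above.
SAMPLE_RATE = 16_000                 # Hz, mono
WINDOW = int(0.025 * SAMPLE_RATE)    # 25 ms STFT window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)       # 10 ms stride -> 160 samples
CHUNK_SECONDS = 30
N_MELS = 80                          # 128 for large-v3

n_samples = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per 30 s chunk
n_frames = n_samples // HOP               # one frame per hop -> 3,000 time steps
print((N_MELS, n_frames))                 # log-mel spectrogram shape
```

Hence the (80, 3000) shape; large-v3 simply swaps 80 mel bins for 128.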

The Encoder

The encoder converts the mel spectrogram into a sequence of learned audio representations. It consists of:

  1. Two 1D convolution layers — The first convolution has a kernel size of 3 and padding of 1, followed by a GELU activation. The second convolution has a kernel size of 3, stride of 2, and padding of 1, which halves the time dimension. This downsamples the 3,000 time steps to 1,500 positions.
  2. Sinusoidal positional embeddings — Unlike the decoder, the encoder uses fixed sinusoidal embeddings (not learned) to encode the position of each time step.
  3. Transformer blocks — A stack of standard Transformer encoder layers, each containing multi-head self-attention and a feed-forward network with GELU activation, connected by residual connections and layer normalization.

The output is a sequence of 1,500 contextualized audio feature vectors — one for every 20 milliseconds of input audio — that the decoder will attend to during text generation.
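The effect of those two convolutions on the time axis follows from the standard convolution output-length formula; a quick check using the kernel, stride, and padding values given above:

```python
def conv1d_out_len(n, kernel=3, stride=1, padding=1):
    # Standard formula: floor((n + 2*padding - kernel) / stride) + 1
    return (n + 2 * padding - kernel) // stride + 1

t = conv1d_out_len(3000, stride=1)  # first conv: time length preserved
t = conv1d_out_len(t, stride=2)     # second conv: time dimension halved
print(t)                            # 1,500 encoder positions

ms_per_position = 30_000 / t        # 30 s of audio spread over the positions
print(ms_per_position)              # 20 ms of audio per feature vector
```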

The Decoder

The decoder generates text tokens autoregressively โ€” one token at a time, each conditioned on the encoded audio and all previously generated tokens. It consists of:

  1. Token embedding layer — Converts token IDs into dense vector representations.
  2. Learned positional embeddings — Unlike the encoder's sinusoidal embeddings, the decoder uses learned position embeddings that are trained alongside the model.
  3. Transformer blocks with cross-attention — Each decoder layer has three sub-layers: masked self-attention (preventing the model from looking at future tokens), cross-attention to the encoder output (allowing the model to "listen" to the audio), and a feed-forward network.

The final decoder output is projected onto the vocabulary to produce token probabilities. Decoding strategies include greedy search, beam search, and temperature-based sampling.
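The greedy variant of this loop can be sketched in a few lines. Here `next_token` is a hypothetical stand-in for a real decoder forward pass plus argmax over the vocabulary — it is not Whisper's actual API, just the shape of the loop:

```python
def greedy_decode(next_token, prompt, eot_id, max_tokens=448):
    # Generate one token at a time, feeding the growing sequence back in.
    # (Whisper's decoder context is 448 tokens, hence the default cap.)
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        tok = next_token(tokens)   # real model: argmax over vocabulary logits
        tokens.append(tok)
        if tok == eot_id:          # stop at end-of-text
            break
    return tokens

# Toy stand-in that emits 1, 2, 3, then the end-of-text id (99).
script = iter([1, 2, 3, 99])
print(greedy_decode(lambda toks: next(script), prompt=[50258], eot_id=99))
```

Beam search keeps several candidate sequences alive in this loop instead of one, and temperature sampling replaces the argmax with sampling from the softmax distribution.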

Special Tokens and Multitask Training

Whisper performs multiple tasks with a single model through a clever system of special tokens that serve as instructions. The decoder input follows a structured format:

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Hello, how are you today? <|endoftext|>
  • <|startoftranscript|> — Signals the start of the output sequence.
  • <|en|> — Specifies the language (one of 99 language tokens). Can be auto-detected or manually set.
  • <|transcribe|> or <|translate|> — Specifies whether to transcribe in the original language or translate to English.
  • <|notimestamps|> — Controls whether to produce timestamp tokens. When timestamps are enabled, the model generates time offset tokens like <|0.00|> and <|2.40|> that align text to audio.

This unified token format means a single model can perform transcription, translation, language detection, and timestamped transcription — all without separate model heads or fine-tuning. The task is determined entirely by the token sequence provided to the decoder.
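The task-selection scheme can be illustrated by assembling the prompt as token strings (strings here for readability — the real decoder consumes integer token IDs):

```python
def build_decoder_prompt(language="en", task="transcribe", timestamps=False):
    # Task and language are chosen purely by the prompt tokens.
    assert task in ("transcribe", "translate")
    seq = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        seq.append("<|notimestamps|>")
    return seq

# German audio -> English text, no timestamps:
print(build_decoder_prompt("de", "translate"))
```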

Training Data

The original Whisper models (tiny through large-v2) were trained on 680,000 hours of audio paired with transcripts collected from the internet. This dataset is not publicly available. The key characteristic is that it is weakly supervised — the transcript labels were not hand-verified by human annotators. Instead, they came from sources like subtitles, closed captions, and other user-generated text paired with audio.

Data Composition

According to the official model card, the 680,000-hour training set breaks down as follows:

Segment                    | Hours   | Share | Description
English ASR                | 438,000 | 65%   | English audio with English transcripts
Translation (X-to-English) | 126,000 | 18%   | Non-English audio with English transcripts
Multilingual ASR           | 117,000 | 17%   | Non-English audio with native-language transcripts (covering 98 languages)

The heavy English bias (65% of training data) explains why Whisper performs significantly better on English than on lower-resource languages. For the large-v3 model released in November 2023, OpenAI expanded the training set to 1 million hours of weakly labeled audio plus an additional 4 million hours of pseudo-labeled audio generated by running Whisper large-v2 over unlabeled data — a form of self-training or knowledge distillation.

Why "Weakly Supervised" Matters

Most prior ASR systems trained on datasets like LibriSpeech (960 hours) or Common Voice (a few thousand hours per language) — carefully curated, human-verified datasets. Whisper's approach traded label quality for sheer scale: 680,000 hours versus the hundreds or low thousands of hours used by conventional systems. The paper demonstrated that this trade-off was worthwhile. The diversity of the internet-sourced data gave Whisper exceptional robustness to real-world conditions: background noise, overlapping speech, accents, domain-specific vocabulary, and acoustic environments that would challenge models trained on studio-quality read speech.

However, the weak supervision also introduces a known limitation: hallucination. Because the training transcripts are imperfect, the model sometimes generates plausible-sounding text that was not actually spoken — particularly during silence or very quiet passages. This is a trade-off inherent to the approach and remains an active area of improvement.

Model Sizes

Whisper is available in multiple sizes, ranging from 39 million to 1.55 billion parameters. Smaller models are faster but less accurate; larger models are more accurate but require more compute and memory. The .en variants are English-only models that tend to perform better for English, especially at smaller sizes. No English-only variants exist for the large models.

Model                  | Parameters | English-Only | VRAM   | Relative Speed
tiny                   | 39 M       | tiny.en      | ~1 GB  | ~10x
base                   | 74 M       | base.en      | ~1 GB  | ~7x
small                  | 244 M      | small.en     | ~2 GB  | ~4x
medium                 | 769 M      | medium.en    | ~5 GB  | ~2x
large (v1, v2, v3)     | 1,550 M    | -            | ~10 GB | 1x
turbo (large-v3-turbo) | 809 M      | -            | ~6 GB  | ~8x

Speed is relative to the large model. VRAM requirements are for GPU inference using fp16 precision via the original Python implementation. whisper.cpp uses significantly less memory (e.g., the large model requires ~3.9 GB RAM instead of ~10 GB VRAM).

The turbo model (released October 2024) is a distilled version of large-v3 with a significantly smaller decoder. It retains most of large-v3's accuracy while running approximately 8x faster — making it a strong choice when speed matters. Note that turbo does not support the translation task.

The large model has gone through three iterations: large-v1 (September 2022), large-v2 (December 2022), and large-v3 (November 2023). Each version improved accuracy, with large-v3 introducing 128 mel-frequency bins (up from 80) and training on a much larger dataset. For a detailed breakdown of each size and how to choose, see our guide to Whisper model sizes.

How Accurate Is Whisper?

Whisper's accuracy is typically measured using Word Error Rate (WER) — the percentage of words that are incorrectly transcribed (insertions, deletions, and substitutions combined). Lower WER is better.
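Concretely, WER is a word-level edit distance normalized by the reference length. A minimal, standard-library implementation (a sketch for illustration, not a metrology-grade scorer — real evaluations also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / reference word count."""
    r, h = reference.split(), hypothesis.split()
    d = list(range(len(h) + 1))           # edit distances vs. empty reference
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,           # deletion
                      d[j - 1] + 1,       # insertion
                      prev + (rw != hw))  # substitution (free on a match)
            prev, d[j] = d[j], cur
    return d[-1] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

Because the numerator counts all three error types, WER can exceed 100% on badly garbled output.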

English Benchmarks

On LibriSpeech test-clean — the most widely used English ASR benchmark, consisting of clean read speech from audiobooks — Whisper large-v2 achieves a WER of approximately 3.0%[1][2]. This is competitive with commercial ASR systems and purpose-trained models, though specialized models fine-tuned on LibriSpeech can achieve lower WER on that specific benchmark. Whisper's strength lies not in beating narrow benchmarks but in robustness — performing consistently well across a wide range of real-world conditions.

Large-v3 Improvements

According to OpenAI's evaluation, Whisper large-v3 shows a 10-20% reduction in errors compared to large-v2 across languages where the model achieves below 60% error rate on the Common Voice 15 and Fleurs evaluation datasets[2][3]. The improvements are particularly notable for non-English languages, where the expanded training data (5 million total hours) makes the biggest difference.

How Whisper Compares (February 2026)

Since Whisper's release, newer open-source ASR models have surpassed it on clean-speech benchmarks. NVIDIA's Parakeet TDT (0.6B parameters) achieves 1.69% WER and their Canary model reaches 1.6% WER on LibriSpeech test-clean — both significantly below Whisper large-v3's ~2.7%. However, Whisper remains the most widely deployed open-source ASR model due to its ecosystem maturity, language coverage, and the availability of optimized runtimes like whisper.cpp.

Word Error Rate: Whisper vs Open-Source Competitors

LibriSpeech test-clean (English read speech) — lower is better


Sources: HuggingFace model cards, NVIDIA model cards, Open ASR Leaderboard (Feb 2026). English-clean read speech only — real-world accuracy varies by audio quality, accent, and domain. Commercial APIs omitted as they do not publish LibriSpeech benchmarks.

Note: Commercial cloud APIs (Deepgram Nova-3, Google Chirp, AssemblyAI Universal-2) do not publish LibriSpeech WER figures, so they cannot be directly compared in this chart. Their benchmarks use proprietary real-world test sets. OpenAI also released GPT-4o-transcribe in March 2025 as an API-only model (not open-source, not Whisper-based) with reportedly lower error rates — but it is cloud-only and not available for local inference.

Where Whisper Excels

  • Robustness to accents and dialects — Trained on diverse internet audio, Whisper handles regional accents better than systems trained on read speech.
  • Background noise tolerance — The model maintains reasonable accuracy even with music, other speakers, or ambient noise present.
  • Technical vocabulary — Due to the breadth of its training data, Whisper handles domain-specific terms (medical, legal, technical) better than many general-purpose ASR systems.
  • Multilingual capability — A single model supporting 99 languages, with no per-language fine-tuning required.

Where Whisper Struggles

  • Hallucination — The most commonly cited issue. During silence or very quiet passages, Whisper can generate text that was never spoken. This is a consequence of the weakly supervised training approach.
  • Low-resource languages — Languages with little training data (much of the 680K hours is English) have significantly higher error rates.
  • Repetitive text generation — The autoregressive decoder can get "stuck" in loops, producing the same phrase repeatedly. Beam search with specific parameters can mitigate but not eliminate this issue.
  • Not real-time by default — Whisper processes 30-second chunks, so there is inherent latency. Streaming solutions (like whisper.cpp's real-time mode) work around this but add complexity.

whisper.cpp and Local Inference

The original Whisper implementation requires Python, PyTorch, and a CUDA-capable GPU for reasonable performance. This makes it impractical for desktop applications, mobile devices, and edge deployments. whisper.cpp, created by Georgi Gerganov, solves this problem.

whisper.cpp is a complete reimplementation of Whisper inference in plain C/C++ with no dependencies. It reads the same model weights but runs without Python, without PyTorch, and without a GPU (though it benefits enormously from one). Key features include:

  • Zero runtime memory allocations — All memory is pre-allocated, making performance predictable and suitable for embedded systems.
  • Apple Silicon optimization — On Macs with M-series chips, the encoder can run on the Apple Neural Engine via Core ML, delivering more than 3x faster inference compared to CPU-only execution. Metal GPU acceleration is also supported for full GPU inference.
  • Quantization support — Models can be quantized to 4-bit, 5-bit, or 8-bit integer precision, dramatically reducing memory usage and increasing speed with minimal accuracy loss.
  • GPU acceleration — Supports NVIDIA CUDA (via cuBLAS), Vulkan (cross-vendor), OpenVINO (Intel), and OpenBLAS for CPU acceleration.
  • Voice Activity Detection (VAD) — Built-in VAD to skip silent segments, improving both speed and transcript quality.
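The memory savings from quantization are easy to estimate from parameter counts alone. This is a rough weights-only estimate — real ggml files add metadata and keep some tensors at higher precision, so actual sizes run somewhat larger:

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    # Parameters times bits per weight, converted to gigabytes.
    return n_params * bits_per_weight / 8 / 1e9

LARGE_PARAMS = 1.55e9  # the large model, per the size table above
for bits in (16, 8, 5, 4):
    print(f"{bits:>2}-bit: ~{approx_weight_size_gb(LARGE_PARAMS, bits):.2f} GB")
```

At 4-bit precision the large model's weights drop to well under 1 GB, which is why quantized whisper.cpp models fit comfortably on phones and single-board computers.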

Memory Requirements (whisper.cpp)

Model  | Disk Size | RAM Required
tiny   | 75 MB     | ~273 MB
base   | 142 MB    | ~388 MB
small  | 466 MB    | ~852 MB
medium | 1.5 GB    | ~2.1 GB
large  | 2.9 GB    | ~3.9 GB

whisper.cpp runs on macOS (Intel and Apple Silicon), Linux, Windows, iOS, Android, FreeBSD, Raspberry Pi, and even in the browser via WebAssembly. It achieves faster-than-real-time transcription on modest consumer hardware — including devices as small as the Raspberry Pi 4 and iPhone 13.

This portability is what makes local, private speech recognition practical. Applications like OpenWhispr use whisper.cpp under the hood to provide real-time push-to-talk dictation that runs entirely on the user's device — no audio ever leaves the machine.

Whisper vs Other ASR Systems

Whisper exists in a broader landscape of speech recognition systems. Here is how it compares to the most common alternatives.

Google Cloud Speech-to-Text

Google's cloud ASR service. Excellent accuracy, especially on English, with real-time streaming support and speaker diarization. The primary differences: it is proprietary, requires sending audio to Google's servers, and is priced per minute of audio processed. Whisper offers comparable accuracy for general use, runs locally, and is free — but lacks built-in streaming and diarization.

Cloud-only · Proprietary · Pay-per-use

Amazon Transcribe (AWS)

Amazon's cloud ASR, tightly integrated with the AWS ecosystem. Offers medical transcription, call analytics, and custom vocabulary. Similar trade-offs to Google: cloud dependency, per-minute pricing, and proprietary. Whisper is a better fit for applications that need offline operation, open-source licensing, or no vendor lock-in.

Cloud-only · Proprietary · AWS integration

Mozilla DeepSpeech

Mozilla's open-source ASR engine based on Baidu's Deep Speech research. It was one of the first high-quality open ASR models, but Mozilla discontinued active development in 2021. Whisper has effectively superseded DeepSpeech in accuracy, multilingual support, and community momentum. DeepSpeech remains useful as a lightweight, well-understood baseline but is no longer recommended for new projects.

Open source · Discontinued · English-focused

Vosk

An open-source offline speech recognition toolkit that supports 20+ languages with small, fast models. Vosk is excellent for resource-constrained environments — its models are a fraction of Whisper's size and run well on low-power hardware. The trade-off is accuracy: Vosk is noticeably less accurate than Whisper, particularly on accented speech, noisy audio, and technical vocabulary.

Open source · Lightweight · Lower accuracy

Faster Whisper (CTranslate2)

A reimplementation of Whisper using SYSTRAN's CTranslate2 inference engine. It runs the same Whisper models with the same accuracy but is up to 4x faster than the original OpenAI implementation with lower memory usage. Faster Whisper is Python-based (unlike whisper.cpp's C/C++) and is a strong choice for server-side batch transcription. It supports CUDA GPUs and CPU inference with INT8 quantization.

Open source · Python / CTranslate2 · Same Whisper accuracy

Practical Applications

Whisper's combination of accuracy, multilingual support, and open-source availability has made it the backbone of a wide range of applications:

Desktop Dictation

Apps like OpenWhispr use whisper.cpp to provide system-wide push-to-talk dictation on macOS, Windows, and Linux. Audio is processed locally — nothing is sent to the cloud.

Subtitle Generation

Whisper's timestamp prediction makes it well-suited for generating subtitles and closed captions. Tools like stable-ts and auto-subtitle automate this workflow.

Podcast and Meeting Transcription

Whisper handles long-form audio transcription well, especially with the chunked processing approach. Many podcast editors and meeting note tools use Whisper (or Faster Whisper) for batch transcription of recordings.

Translation and Localization

Whisper's built-in translation capability (any language to English) is used for cross-lingual content access, media localization, and real-time translation prototypes.

Accessibility

Real-time captioning for deaf and hard-of-hearing users, voice interfaces for users with motor disabilities, and audio description transcription are all active areas where Whisper is deployed.

Data Annotation and Research

Researchers use Whisper to create pseudo-labeled datasets for training other models — the same approach OpenAI used to create large-v3's training data. It is also used extensively in linguistic research and corpus creation.

Sources and Further Reading

  1. [Paper] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356. arxiv.org/abs/2212.04356
  2. [Code] OpenAI Whisper GitHub repository. github.com/openai/whisper
  3. [Code] whisper.cpp — C/C++ port by Georgi Gerganov. github.com/ggml-org/whisper.cpp
  4. [Model Card] Whisper large-v2 on Hugging Face (WER benchmark: ~3.0% on LibriSpeech test-clean). huggingface.co/openai/whisper-large-v2
  5. [Model Card] Whisper large-v3 on Hugging Face (10-20% error reduction over v2). huggingface.co/openai/whisper-large-v3
  6. [Model Card] Whisper large-v3-turbo on Hugging Face (latency-focused distilled variant). huggingface.co/openai/whisper-large-v3-turbo
  7. [Model Card] OpenAI Whisper model card (training data composition breakdown). github.com/openai/whisper/blob/main/model-card.md
  8. [Release Notes] OpenAI large-v3 announcement and evaluation notes. github.com/openai/whisper/discussions/1762
  9. [Release Notes] OpenAI turbo release discussion and speed-focused tradeoffs. github.com/openai/whisper/discussions/2363
  10. [Code] Faster Whisper — CTranslate2-based reimplementation by SYSTRAN. github.com/SYSTRAN/faster-whisper
  11. [Product] OpenWhispr uses Whisper through whisper.cpp for local dictation workflows. openwhispr.com

Try Whisper Locally with OpenWhispr

OpenWhispr uses Whisper (via whisper.cpp) for local, private dictation on macOS, Windows, and Linux. Push-to-talk, 99+ languages, no cloud required. Try it free.

No account required · Works offline · Open source forever