Whisper Model Sizes Explained: Tiny vs Base vs Small vs Medium vs Large
Every Whisper model, compared. Parameters, speed, accuracy, and which one you should actually use.
OpenAI's Whisper comes in a family of models across five size categories: tiny (39M parameters), base (74M), small (244M), medium (769M), and large (1.55B). The tiny, base, small, and medium sizes each have a multilingual version and an English-only version; large and turbo are multilingual only. There is also a turbo model (809M), a pruned and fine-tuned version of large-v3 that is nearly as accurate but dramatically faster. Larger models are more accurate but slower and require more memory.
For the whisper.cpp quantized (GGML) format used by most local transcription apps, model files range from 75 MB (tiny) to 2.9 GB (large). Memory usage at runtime ranges from around 273 MB for tiny to 3.9 GB for large. The right model depends on your hardware, your language, and how much accuracy you need.
Last updated: February 17, 2026. Parameter, memory, and benchmark claims are cross-checked against at least two primary sources.
Fact-Check Snapshot (Dual Sources)
- Official model families and parameter counts: tiny/base/small/medium/large plus turbo are documented by OpenAI. OpenAI Whisper README · Whisper model card
- GGML/GGUF disk and runtime memory context: local deployment footprints come from whisper.cpp conversion/runtime docs. whisper.cpp · whisper.cpp memory notes
- Benchmark caveat: WER varies heavily by dataset and language; LibriSpeech figures are not a universal quality score. Whisper paper · large-v2 model card
- Turbo model positioning: turbo is a speed-focused variant with accuracy tradeoffs documented in release notes/model cards. turbo model card · turbo release discussion
- OpenWhispr usage context: OpenWhispr exposes multiple Whisper sizes so users can tune speed/accuracy on-device. OpenWhispr · whisper.cpp backend
Whisper Model Sizes: Speed vs Accuracy
Bubble size reflects relative parameter count
Word Error Rate: Whisper vs Open-Source Competitors
LibriSpeech test-clean (English read speech) – lower is better
Sources: HuggingFace model cards, NVIDIA model cards, Open ASR Leaderboard (Feb 2026). English-clean read speech only; real-world accuracy varies by audio quality, accent, and domain. Commercial APIs omitted as they do not publish LibriSpeech benchmarks.
The Complete Model Comparison
This table covers every official Whisper model. "Speed" is relative to large (1x baseline). VRAM figures are for PyTorch inference on GPU. The GGML columns show file sizes and RAM usage for whisper.cpp, which is what most desktop apps use.
| Model | Params | English-Only | VRAM (GPU) | GGML Disk | RAM (whisper.cpp) | Speed | English WER | Multilingual WER |
|---|---|---|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | ~1 GB | 75 MiB | ~273 MB | ~10x | ~7.6% | ~12% |
| base | 74 M | base.en | ~1 GB | 142 MiB | ~388 MB | ~7x | ~5.0% | ~10% |
| small | 244 M | small.en | ~2 GB | 466 MiB | ~852 MB | ~4x | ~3.4% | ~7% |
| medium | 769 M | medium.en | ~5 GB | 1.5 GiB | ~2.1 GB | ~2x | ~2.9% | ~5% |
| large-v2 | 1,550 M | N/A | ~10 GB | 2.9 GiB | ~3.9 GB | 1x | ~2.7% | ~4% |
| large-v3 | 1,550 M | N/A | ~10 GB | 2.9 GiB | ~3.9 GB | 1x | ~2.4% | ~3.5% |
| turbo | 809 M | N/A | ~6 GB | 1.6 GiB | ~2.3 GB | ~8x | ~2.5% | ~3.7% |
Notes: WER (Word Error Rate) figures are approximate averages from the Whisper paper, HuggingFace model cards (LibriSpeech test sets for English), and OpenAI's published benchmarks. Exact WER varies significantly by dataset, language, audio quality, and evaluation methodology. Lower is better. Speed is relative to large (1x = slowest).
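If you select models programmatically, the table above translates naturally into a small lookup structure. The sketch below uses Python with the approximate figures from the table (they inherit all of the caveats above); the data layout and function name are our own illustration, not any library's API.

```python
# Approximate Whisper model specs, mirroring the comparison table above.
# WER figures are rough averages; real accuracy varies by dataset and language.
WHISPER_MODELS = {
    # name:      (params_millions, ggml_disk_mib, ram_mb, speed_vs_large, english_wer_pct)
    "tiny":     (39,    75,   273, 10.0, 7.6),
    "base":     (74,   142,   388,  7.0, 5.0),
    "small":    (244,  466,   852,  4.0, 3.4),
    "medium":   (769, 1536,  2100,  2.0, 2.9),
    "large-v3": (1550, 2969, 3900,  1.0, 2.4),
    "turbo":    (809,  1638, 2300,  8.0, 2.5),
}

def largest_model_for_ram(available_ram_mb: int) -> str:
    """Return the most accurate model whose runtime RAM fits the budget."""
    candidates = [
        (wer, name)
        for name, (_, _, ram, _, wer) in WHISPER_MODELS.items()
        if ram <= available_ram_mb
    ]
    if not candidates:
        raise ValueError("No Whisper model fits in the given RAM budget")
    return min(candidates)[1]  # lowest WER among the models that fit
```

For example, with roughly 1 GB of free RAM this heuristic lands on small, which matches the recommendations later in this article.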
Tiny (39M Parameters)
The tiny model is the smallest and fastest Whisper variant. At just 39 million parameters and a GGML file size of 75 MiB, it can run comfortably on almost any hardware, including Raspberry Pi, older laptops, and low-end mobile devices. It transcribes at roughly 10x real-time speed relative to the large model, making it the best choice when latency matters more than perfect accuracy.
- GGML: 75 MiB on disk
- RAM: ~273 MB runtime
- VRAM: ~1 GB (GPU mode)
- ~10x faster than large
- Near-instant on Apple Silicon
- Real-time on most CPUs
- English WER: ~7.6%
- Multilingual WER: ~12%
- Struggles with accents/noise
With the tiny.en variant (English-only), LibriSpeech test-clean WER is around 5.6%, rising to ~14.9% on test-other (noisy, accented audio). The multilingual tiny model performs slightly worse on English but supports all 99 Whisper languages.
Tiny is ideal for quick notes, low-stakes dictation, live captioning prototypes, and embedded or edge devices where compute is limited. It is the default starting model for many whisper.cpp-based tools because it downloads in seconds and runs instantly.
Base (74M Parameters)
The base model doubles the parameter count of tiny (74M vs 39M) while remaining very lightweight. At 142 MiB (GGML) and ~388 MB of RAM at runtime, it still fits comfortably on virtually any modern device. The accuracy improvement over tiny is noticeable, especially on noisy audio and accented speech.
- GGML: 142 MiB on disk
- RAM: ~388 MB runtime
- VRAM: ~1 GB (GPU mode)
- ~7x faster than large
- Still very fast on CPU
- Sub-second on Apple Silicon
- English WER: ~5.0%
- Multilingual WER: ~10%
- Better on noisy audio than tiny
The base.en variant achieves around 4.3% WER on LibriSpeech test-clean and ~12.8% on test-other. This is a meaningful improvement over tiny.en (5.6% and 14.9% respectively), particularly on challenging audio.
Base is a good choice for low-power devices where tiny is not quite accurate enough. It is also popular for real-time transcription pipelines where you want a balance of speed and quality without going above 500 MB of RAM.
Small (244M Parameters)
Small is where Whisper starts to feel genuinely accurate. At 244M parameters, it represents a 3.3x jump from base and delivers a significant accuracy improvement, especially for non-English languages and noisy recordings. The GGML file is 466 MiB, and runtime RAM usage is around 852 MB. It runs at about 4x the speed of large.
- GGML: 466 MiB on disk
- RAM: ~852 MB runtime
- VRAM: ~2 GB (GPU mode)
- ~4x faster than large
- Fast on modern laptops
- Real-time on Apple Silicon
- English WER: ~3.4%
- Multilingual WER: ~7%
- Handles accents well
The small.en variant reaches about 3.0% WER on LibriSpeech test-clean, a remarkable result for a model that fits in under 500 MB. For many English-only use cases, small.en provides accuracy that is close to medium at a fraction of the resource cost.
Small is often recommended as the default model for most users. It runs well on any modern laptop (including MacBook Air M1 with 8 GB RAM), delivers good accuracy across languages, and transcribes fast enough for real-time dictation workflows.
Our recommendation: If you are not sure which model to start with, start with small. It is the best balance of speed, accuracy, and resource usage for the majority of hardware and use cases.
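In practice, running a model with whisper.cpp boils down to pointing its CLI at a GGML file. Here is a minimal Python sketch that builds such a command; the `whisper-cli` binary name and the `models/ggml-<size>.bin` path layout are assumptions based on a typical whisper.cpp checkout, so adjust them to your install.

```python
def whisper_cpp_command(audio_path: str, model: str = "small",
                        models_dir: str = "models") -> list[str]:
    """Build an argument list for the whisper.cpp CLI.

    Binary name and model path layout are assumptions; older builds
    shipped the CLI as `main` rather than `whisper-cli`.
    """
    return [
        "whisper-cli",
        "-m", f"{models_dir}/ggml-{model}.bin",  # e.g. models/ggml-small.bin
        "-f", audio_path,
    ]

# Example: transcribe a recording with the recommended default model.
cmd = whisper_cpp_command("meeting.wav")
```

Swapping the `model` argument is all it takes to move up or down the size ladder once the corresponding GGML file is downloaded.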
Medium (769M Parameters)
Medium is where diminishing returns begin to set in, but the accuracy gains over small are still meaningful, especially for multilingual transcription, technical vocabulary, and challenging audio conditions. At 769M parameters, it requires 1.5 GiB of disk space (GGML) and about 2.1 GB of RAM at runtime.
- GGML: 1.5 GiB on disk
- RAM: ~2.1 GB runtime
- VRAM: ~5 GB (GPU mode)
- ~2x faster than large
- Comfortable on 8 GB+ machines
- Needs decent hardware
- English WER: ~2.9%
- Multilingual WER: ~5%
- Strong on technical content
The medium.en model achieves around 3.0% WER on LibriSpeech test-clean and approximately 7.5% on test-other, roughly matching small.en on clean read speech. Across broader benchmarks, however, particularly those with diverse accents, background noise, and domain-specific vocabulary, medium consistently outperforms small.
Medium is a strong choice for professional transcription workflows where accuracy matters but you do not have the hardware (or patience) for large. It runs well on machines with 8 GB or more of RAM and benefits significantly from GPU acceleration.
Large (1,550M Parameters)
The large model is the most accurate Whisper variant, with 1.55 billion parameters. There are three versions: large-v1 (the original release), large-v2 (trained for 2.5x more epochs with added regularization), and large-v3 (trained on more data with 128 Mel frequency bins instead of 80). Large-v3 is the current recommended version for maximum accuracy.
- GGML: 2.9 GiB on disk
- RAM: ~3.9 GB runtime
- VRAM: ~10 GB (GPU mode)
- 1x (baseline, slowest)
- GPU strongly recommended
- CPU: may be slower than real-time
- English WER: ~2.4% (v3)
- Multilingual WER: ~3.5% (v3)
- Best across all languages
large-v2 vs large-v3
large-v2
- Trained for 2.5x more epochs than large-v1
- Added regularization for better generalization
- LibriSpeech test-clean WER: ~3.0%
- More stable on some audio types
large-v3
- 128 Mel bins (vs 80) for finer frequency resolution
- Trained on 1M hours labeled + 4M hours pseudo-labeled audio
- 10-20% error reduction over large-v2 across languages
- Added Cantonese language support
Which large model to use: For new projects, use large-v3. It is strictly better than large-v2 on average across languages. However, some users have reported that large-v2 produces fewer hallucinations (generating text not present in the audio) in specific edge cases with very quiet or silent audio segments. If you encounter hallucination issues with large-v3, try large-v2 as a fallback.
Turbo (809M Parameters)
The turbo model is a pruned and fine-tuned version of large-v3 that dramatically reduces inference time while retaining most of the accuracy. OpenAI pruned the decoder from 32 layers down to just 4 and then fine-tuned the result, cutting the parameter count from 1,550M to 809M. The encoder (which does the heavy lifting) remains identical to large-v3.
- GGML: ~1.6 GiB on disk
- RAM: ~2.3 GB runtime
- VRAM: ~6 GB (GPU mode)
- ~8x faster than large
- Near-large accuracy, near-tiny speed
- Great on GPU with torch.compile
- English WER: ~2.5%
- Multilingual WER: ~3.7%
- Minor quality loss vs large-v3
Turbo is the best choice when you want near-large-v3 accuracy but cannot afford the latency of the full large model. On GPU with optimizations like torch.compile, turbo can achieve up to 4.5x speed improvement over standard inference, making it extremely fast.
Important limitation: The turbo model was not trained for translation tasks. If you need to translate speech from one language to English, use medium or large-v3 instead.
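If you dispatch transcription and translation jobs programmatically, it is worth failing fast on this combination. A minimal sketch; model and task names are plain strings here, not tied to any particular library's API:

```python
def check_task(model: str, task: str) -> None:
    """Reject unsupported model/task combinations before inference starts.

    turbo was not trained for speech translation, so fail fast rather
    than silently producing poor output.
    """
    if task == "translate" and model.startswith("turbo"):
        raise ValueError(
            "turbo does not support translation; use medium or large-v3"
        )

check_task("small", "translate")   # fine: small supports translation
check_task("turbo", "transcribe")  # fine: transcription is turbo's strength
```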
English-Only vs Multilingual Models
For the tiny, base, small, and medium sizes, Whisper offers two variants: a multilingual model and an English-only (.en) model. The large and turbo models are multilingual only; there are no English-only versions.
When to use .en (English-only)
- You only transcribe English audio
- Better English accuracy at tiny/base size levels
- Slightly faster (no language detection overhead)
- Same file size as multilingual equivalent
When to use multilingual
- You transcribe non-English audio
- Your audio contains mixed languages
- You need speech translation (to English)
- Required if using large or turbo (no .en variant exists)
The performance gap between .en and multilingual models is most significant at smaller sizes. For tiny and base, the .en variants are noticeably better at English. By the time you reach small and medium, the difference narrows considerably. At the large level, there is only a multilingual model, and it performs excellently on English regardless.
Rule of thumb: If you exclusively use English and are choosing tiny or base, pick the .en variant. For small or larger, the multilingual version works great for English and gives you flexibility for other languages.
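This rule of thumb is simple enough to encode directly. A sketch, assuming you know up front which languages you transcribe; the function name and signature are our own illustration:

```python
def pick_variant(size: str, languages: set[str]) -> str:
    """Apply the .en-vs-multilingual rule of thumb from this section."""
    english_only = languages == {"en"}
    # The .en variant pays off most at tiny/base; for small and up the
    # multilingual model is fine for English, and large/turbo have no .en.
    if english_only and size in {"tiny", "base"}:
        return f"{size}.en"
    return size

# English-only dictation on a small machine gets the .en model;
# anything multilingual, or small and larger, stays multilingual.
```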
Hardware Requirements: What Runs Where
The hardware you need depends heavily on the model size and whether you are using whisper.cpp (CPU/Metal/CUDA) or the PyTorch implementation (GPU-focused). Here is what to expect on common hardware classes.
MacBook Air M1 (8 GB RAM)
- tiny/base: Instant. Faster than real-time.
- small: Fast. Real-time or better with Metal.
- medium: Usable. May be slightly slower than real-time on CPU. Metal helps significantly.
- large: Possible but tight on RAM. Expect slower than real-time.
- turbo: Good fit. Near real-time with Metal acceleration.
MacBook Pro M2/M3 (16-36 GB RAM)
- tiny/base/small: Instant. All well within hardware limits.
- medium: Fast. Comfortable with Metal.
- large: Good. Real-time or near it with Metal.
- turbo: Excellent. Fast real-time transcription.
Mid-Range Windows/Linux Laptop (8 GB RAM, integrated GPU)
- tiny/base: Fast. CPU-only is fine.
- small: Good. Comfortable on most modern CPUs.
- medium: Workable. May be 2-3x slower than real-time on CPU.
- large: Challenging. CPU-only will be slow.
- turbo: Workable. Benefits from Vulkan if available.
Desktop with NVIDIA GPU (RTX 3060+, 8 GB+ VRAM)
- All models: Fast. GPU handles everything comfortably.
- large: Needs 10 GB VRAM (RTX 3060 12GB or better).
- turbo: Fastest option with CUDA. Significantly faster than large.
- Use CUDA backend in whisper.cpp for best performance.
whisper.cpp Acceleration Backends
whisper.cpp, the C/C++ port that most desktop apps use, supports several hardware acceleration backends that can dramatically improve performance:
Metal (Apple Silicon)
Full GPU inference on M1/M2/M3/M4 Macs. Runs the entire model on the GPU with zero memory copies. This is the fastest backend for macOS users and what OpenWhispr uses on Mac.
Core ML (Apple)
Uses Apple's neural engine for inference. Can be faster than Metal for some model sizes on newer chips, and more power-efficient. Requires converting models to Core ML format first.
CUDA (NVIDIA)
GPU acceleration for NVIDIA cards. Fastest backend for users with dedicated NVIDIA GPUs. Requires CUDA toolkit. Best choice for large and turbo models on desktop.
Vulkan (Cross-vendor GPU)
Works with Intel, AMD, and NVIDIA GPUs. Good fallback for non-NVIDIA systems. Available on Linux and Windows. Performance varies by GPU vendor and driver.
Approximate Transcription Speed
Real-time factor (RTF) for a 60-second audio clip. RTF < 1.0 means faster than real-time. These are rough estimates; actual performance depends on audio content, model quantization, and system load.
| Model | M1 MacBook Air | M3 Pro MacBook | Intel i7 (CPU) | RTX 3060 (CUDA) |
|---|---|---|---|---|
| tiny | ~0.05x | ~0.03x | ~0.1x | ~0.02x |
| base | ~0.08x | ~0.05x | ~0.15x | ~0.04x |
| small | ~0.2x | ~0.12x | ~0.4x | ~0.08x |
| medium | ~0.6x | ~0.3x | ~1.2x | ~0.15x |
| large-v3 | ~1.5x | ~0.7x | ~3.0x | ~0.3x |
| turbo | ~0.3x | ~0.15x | ~0.6x | ~0.08x |
Reading the table: 0.1x means transcribing 60 seconds of audio takes about 6 seconds. 1.0x = real-time. Values above 1.0x mean slower than real-time. All figures use whisper.cpp with default settings and the Metal (Mac) or CUDA (NVIDIA) backend where applicable.
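The arithmetic here is simply transcription time = RTF × audio duration, which is worth wrapping in a helper if you estimate batch jobs:

```python
def transcription_seconds(rtf: float, audio_seconds: float) -> float:
    """Estimated wall-clock time to transcribe a clip at a given RTF."""
    return rtf * audio_seconds

# 60 s of audio at RTF 0.1 takes about 6 s; RTF 1.0 is exactly real-time,
# and anything above 1.0 falls behind the audio.
```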
Which Model Should You Use?
Here are concrete recommendations based on common scenarios. When in doubt, start with small and move up or down based on your experience.
"I want the fastest possible"
Use tiny or tiny.en. Transcribes in milliseconds on modern hardware. Accuracy is rough but usable for quick notes and low-stakes dictation.
tiny / tiny.en"I want good accuracy on a laptop"
Use small. Best balance of speed and accuracy for modern laptops. If your machine has 16 GB+ RAM, medium is also an excellent choice.
small or medium"I want the best accuracy"
Use large-v3. It is the most accurate Whisper model overall. If speed matters too, turbo gets you 95% of the accuracy at 8x the speed.
large-v3 or turbo"I only use English"
Use the .en variant of your chosen size for the best English accuracy. For tiny and base, the .en models are noticeably better. For small+, the difference is minimal.
small.en or medium.en"I use multiple languages"
Use the multilingual variant. small or medium for balanced performance. large-v3 for best multilingual accuracy.
small, medium, or large-v3"I have very limited hardware"
Use tiny or base. Both run on 1 GB of RAM and even work on Raspberry Pi and other single-board computers. Base offers a meaningful accuracy bump over tiny with minimal added cost.
tiny or baseHow OpenWhispr Handles Model Selection
OpenWhispr is an open-source desktop dictation app that uses whisper.cpp under the hood. It lets you download and switch between any Whisper model directly from the settings โ no command-line required. Models are downloaded once and stored locally. You can change models at any time to balance speed and accuracy for your hardware.
Download any model
Pick from tiny through large-v3 in settings. OpenWhispr downloads the GGML version automatically.
Switch instantly
Change models at any time. No restart needed. Test different sizes to find what works best on your hardware.
Hardware optimized
Uses Metal on Apple Silicon, CUDA on NVIDIA, and optimized CPU inference elsewhere. All local, all private.
Our suggestion: Start with small. If it feels slow, drop down to base or tiny. If you want better accuracy and your hardware can handle it, move up to medium or large-v3. OpenWhispr makes it easy to experiment.
Sources
- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022) – the original Whisper paper with full benchmark results.
- OpenAI Whisper GitHub Repository – official model table, parameter counts, VRAM requirements, and relative speed figures.
- OpenAI Whisper Model Card – official training data composition and evaluation caveats.
- whisper.cpp GitHub Repository – GGML model sizes, RAM usage, and hardware acceleration details.
- Whisper large-v2 Model Card (HuggingFace) – LibriSpeech benchmark context used in many WER comparisons.
- Whisper large-v3 Model Card (HuggingFace) – training details, architecture changes from large-v2, and multilingual improvements.
- Whisper large-v3-turbo Model Card (HuggingFace) – turbo architecture details, decoder pruning specifics, and performance tradeoffs.
- OpenAI Whisper Turbo Announcement (GitHub Discussion #2363) – official turbo release notes and cross-language performance comparisons.
- OpenWhispr – practical local deployment context for choosing model sizes.
Try Different Whisper Models in OpenWhispr
Open source, local-first dictation. Download any Whisper model, switch between sizes with one click, and find the perfect balance for your hardware.
No account required · Works offline · Open source forever