Whisper Model Sizes Explained: Tiny vs Base vs Small vs Medium vs Large
Every Whisper model, compared. Parameters, speed, accuracy, and which one you should actually use.
OpenAI's Whisper comes in a family of models across five size categories: tiny (39M parameters), base (74M), small (244M), medium (769M), and large (1.55B). The tiny, base, small, and medium sizes each have a multilingual version and an English-only version; large and turbo are multilingual only. There is also a turbo model (809M), a pruned and fine-tuned version of large-v3 that is nearly as accurate but dramatically faster. Larger models are more accurate but slower and require more memory.
For the whisper.cpp quantized (GGML) format used by most local transcription apps, model files range from 75 MB (tiny) to 2.9 GB (large). Memory usage at runtime ranges from around 273 MB for tiny to 3.9 GB for large. The right model depends on your hardware, your language, and how much accuracy you need.
Last updated: February 17, 2026. Parameter, memory, and benchmark claims are cross-checked against at least two primary sources.
Fact-Check Snapshot (Dual Sources)
- Official model families and parameter counts: tiny/base/small/medium/large plus turbo are documented by OpenAI. OpenAI Whisper README · Whisper model card
- GGML/GGUF disk and runtime memory context: local deployment footprints come from whisper.cpp conversion/runtime docs. whisper.cpp · whisper.cpp memory notes
- Benchmark caveat: WER varies heavily by dataset and language; LibriSpeech figures are not a universal quality score. Whisper paper · large-v2 model card
- Turbo model positioning: turbo is a speed-focused variant with accuracy tradeoffs documented in release notes/model cards. turbo model card · turbo release discussion
- OpenWhispr usage context: OpenWhispr exposes multiple Whisper sizes so users can tune speed/accuracy on-device. OpenWhispr · whisper.cpp backend
Whisper Model Sizes: Speed vs Accuracy
Bubble size reflects relative parameter count
Word Error Rate: Whisper vs Open-Source Competitors
LibriSpeech test-clean (English read speech) – lower is better
Sources: HuggingFace model cards, NVIDIA model cards, Open ASR Leaderboard (Feb 2026). English-clean read speech only; real-world accuracy varies by audio quality, accent, and domain. Commercial APIs omitted as they do not publish LibriSpeech benchmarks.
The Complete Model Comparison
This table covers every official Whisper model. "Speed" is relative to large (1x baseline). VRAM figures are for PyTorch inference on GPU. The GGML columns show file sizes and RAM usage for whisper.cpp, which is what most desktop apps use.
| Model | Params | English-Only | VRAM (GPU) | GGML Disk | RAM (whisper.cpp) | Speed | English WER | Multilingual WER |
|---|---|---|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | ~1 GB | 75 MiB | ~273 MB | ~10x | ~7.6% | ~12% |
| base | 74 M | base.en | ~1 GB | 142 MiB | ~388 MB | ~7x | ~5.0% | ~10% |
| small | 244 M | small.en | ~2 GB | 466 MiB | ~852 MB | ~4x | ~3.4% | ~7% |
| medium | 769 M | medium.en | ~5 GB | 1.5 GiB | ~2.1 GB | ~2x | ~2.9% | ~5% |
| large-v2 | 1,550 M | N/A | ~10 GB | 2.9 GiB | ~3.9 GB | 1x | ~2.7% | ~4% |
| large-v3 | 1,550 M | N/A | ~10 GB | 2.9 GiB | ~3.9 GB | 1x | ~2.4% | ~3.5% |
| turbo | 809 M | N/A | ~6 GB | 1.6 GiB | ~2.3 GB | ~8x | ~2.5% | ~3.7% |
Notes: WER (Word Error Rate) figures are approximate averages from the Whisper paper, HuggingFace model cards (LibriSpeech test sets for English), and OpenAI's published benchmarks. Exact WER varies significantly by dataset, language, audio quality, and evaluation methodology. Lower is better. Speed is relative to large (1x = slowest).
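If you select models programmatically, the table above translates naturally into a small lookup structure. The sketch below uses Python with the approximate figures from the table (they inherit all of the caveats above); the data layout and function name are our own illustration, not any library's API.

```python
# Approximate Whisper model specs, mirroring the comparison table above.
# WER figures are rough averages; real accuracy varies by dataset and language.
WHISPER_MODELS = {
    # name:      (params_millions, ggml_disk_mib, ram_mb, speed_vs_large, english_wer_pct)
    "tiny":     (39,    75,   273, 10.0, 7.6),
    "base":     (74,   142,   388,  7.0, 5.0),
    "small":    (244,  466,   852,  4.0, 3.4),
    "medium":   (769, 1536,  2100,  2.0, 2.9),
    "large-v3": (1550, 2969, 3900,  1.0, 2.4),
    "turbo":    (809,  1638, 2300,  8.0, 2.5),
}

def largest_model_for_ram(available_ram_mb: int) -> str:
    """Return the most accurate model whose runtime RAM fits the budget."""
    candidates = [
        (wer, name)
        for name, (_, _, ram, _, wer) in WHISPER_MODELS.items()
        if ram <= available_ram_mb
    ]
    if not candidates:
        raise ValueError("No Whisper model fits in the given RAM budget")
    return min(candidates)[1]  # lowest WER among the models that fit
```

For example, with roughly 1 GB of free RAM this heuristic lands on small, which matches the recommendations later in this article.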
Tiny (39M Parameters)
The tiny model is the smallest and fastest Whisper variant. At just 39 million parameters and a GGML file size of 75 MiB, it can run comfortably on almost any hardware, including Raspberry Pi, older laptops, and low-end mobile devices. It transcribes at roughly 10x real-time speed relative to the large model, making it the best choice when latency matters more than perfect accuracy.
- GGML: 75 MiB on disk
- RAM: ~273 MB runtime
- VRAM: ~1 GB (GPU mode)
- ~10x faster than large
- Near-instant on Apple Silicon
- Real-time on most CPUs
- English WER: ~7.6%
- Multilingual WER: ~12%
- Struggles with accents/noise
With the tiny.en variant (English-only), LibriSpeech test-clean WER is around 5.6%, rising to ~14.9% on test-other (noisy, accented audio). The multilingual tiny model performs slightly worse on English but supports all 99 Whisper languages.
Tiny is ideal for quick notes, low-stakes dictation, live captioning prototypes, and embedded or edge devices where compute is limited. It is the default starting model for many whisper.cpp-based tools because it downloads in seconds and runs instantly.
Base (74M Parameters)
The base model doubles the parameter count of tiny (74M vs 39M) while remaining very lightweight. At 142 MiB (GGML) and ~388 MB of RAM at runtime, it still fits comfortably on virtually any modern device. The accuracy improvement over tiny is noticeable, especially on noisy audio and accented speech.
- GGML: 142 MiB on disk
- RAM: ~388 MB runtime
- VRAM: ~1 GB (GPU mode)
- ~7x faster than large
- Still very fast on CPU
- Sub-second on Apple Silicon
- English WER: ~5.0%
- Multilingual WER: ~10%
- Better on noisy audio than tiny
The base.en variant achieves around 4.3% WER on LibriSpeech test-clean and ~12.8% on test-other. This is a meaningful improvement over tiny.en (5.6% and 14.9% respectively), particularly on challenging audio.
Base is a good choice for low-power devices where tiny is not quite accurate enough. It is also popular for real-time transcription pipelines where you want a balance of speed and quality without going above 500 MB of RAM.
Small (244M Parameters)
Small is where Whisper starts to feel genuinely accurate. At 244M parameters, it represents a 3.3x jump from base and delivers a significant accuracy improvement, especially for non-English languages and noisy recordings. The GGML file is 466 MiB, and runtime RAM usage is around 852 MB. It runs at about 4x the speed of large.
- GGML: 466 MiB on disk
- RAM: ~852 MB runtime
- VRAM: ~2 GB (GPU mode)
- ~4x faster than large
- Fast on modern laptops
- Real-time on Apple Silicon
- English WER: ~3.4%
- Multilingual WER: ~7%
- Handles accents well
The small.en variant reaches about 3.0% WER on LibriSpeech test-clean, a remarkable result for a model that fits in under 500 MB. For many English-only use cases, small.en provides accuracy that is close to medium at a fraction of the resource cost.
Small is often recommended as the default model for most users. It runs well on any modern laptop (including MacBook Air M1 with 8 GB RAM), delivers good accuracy across languages, and transcribes fast enough for real-time dictation workflows.
Our recommendation: If you are not sure which model to start with, start with small. It is the best balance of speed, accuracy, and resource usage for the majority of hardware and use cases.
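In practice, running a model with whisper.cpp boils down to pointing its CLI at a GGML file. Here is a minimal Python sketch that builds such a command; the `whisper-cli` binary name and the `models/ggml-<size>.bin` path layout are assumptions based on a typical whisper.cpp checkout, so adjust them to your install.

```python
def whisper_cpp_command(audio_path: str, model: str = "small",
                        models_dir: str = "models") -> list[str]:
    """Build an argument list for the whisper.cpp CLI.

    Binary name and model path layout are assumptions; older builds
    shipped the CLI as `main` rather than `whisper-cli`.
    """
    return [
        "whisper-cli",
        "-m", f"{models_dir}/ggml-{model}.bin",  # e.g. models/ggml-small.bin
        "-f", audio_path,
    ]

# Example: transcribe a recording with the recommended default model.
cmd = whisper_cpp_command("meeting.wav")
```

Swapping the `model` argument is all it takes to move up or down the size ladder once the corresponding GGML file is downloaded.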
Medium (769M Parameters)
Medium is where diminishing returns begin to set in, but the accuracy gains over small are still meaningful, especially for multilingual transcription, technical vocabulary, and challenging audio conditions. At 769M parameters, it requires 1.5 GiB of disk space (GGML) and about 2.1 GB of RAM at runtime.
- GGML: 1.5 GiB on disk
- RAM: ~2.1 GB runtime
- VRAM: ~5 GB (GPU mode)
- ~2x faster than large
- Comfortable on 8 GB+ machines
- Needs decent hardware
- English WER: ~2.9%
- Multilingual WER: ~5%
- Strong on technical content
The medium.en model achieves around 3.0% WER on LibriSpeech test-clean and approximately 7.5% on test-other, roughly matching small.en on clean read speech. Across broader benchmarks, however, particularly those with diverse accents, background noise, and domain-specific vocabulary, medium consistently outperforms small.
Medium is a strong choice for professional transcription workflows where accuracy matters but you do not have the hardware (or patience) for large. It runs well on machines with 8 GB or more of RAM and benefits significantly from GPU acceleration.
Large (1,550M Parameters)
The large model is the most accurate Whisper variant, with 1.55 billion parameters. There are three versions: large-v1 (the original release), large-v2 (trained for 2.5x more epochs with added regularization), and large-v3 (trained on more data with 128 Mel frequency bins instead of 80). Large-v3 is the current recommended version for maximum accuracy.
- GGML: 2.9 GiB on disk
- RAM: ~3.9 GB runtime
- VRAM: ~10 GB (GPU mode)
- 1x (baseline, slowest)
- GPU strongly recommended
- CPU: may be slower than real-time
- English WER: ~2.4% (v3)
- Multilingual WER: ~3.5% (v3)
- Best across all languages
large-v2 vs large-v3
large-v2
- Trained for 2.5x more epochs than large-v1
- Added regularization for better generalization
- LibriSpeech test-clean WER: ~3.0%
- More stable on some audio types
large-v3
- 128 Mel bins (vs 80) for finer frequency resolution
- Trained on 1M hours labeled + 4M hours pseudo-labeled audio
- 10-20% error reduction over large-v2 across languages
- Added Cantonese language support
Which large model to use: For new projects, use large-v3. It is strictly better than large-v2 on average across languages. However, some users have reported that large-v2 produces fewer hallucinations (generating text not present in the audio) in specific edge cases with very quiet or silent audio segments. If you encounter hallucination issues with large-v3, try large-v2 as a fallback.
Turbo (809M Parameters)
The turbo model is a pruned and fine-tuned version of large-v3 that dramatically reduces inference time while retaining most of the accuracy. OpenAI pruned the decoder from 32 layers down to just 4 and then fine-tuned the result, cutting the parameter count from 1,550M to 809M. The encoder (which does the heavy lifting) remains identical to large-v3.
- GGML: ~1.6 GiB on disk
- RAM: ~2.3 GB runtime
- VRAM: ~6 GB (GPU mode)
- ~8x faster than large
- Near-large accuracy, near-tiny speed
- Great on GPU with torch.compile
- English WER: ~2.5%
- Multilingual WER: ~3.7%
- Minor quality loss vs large-v3
Turbo is the best choice when you want near-large-v3 accuracy but cannot afford the latency of the full large model. On GPU with optimizations like torch.compile, turbo can achieve up to 4.5x speed improvement over standard inference, making it extremely fast.
Important limitation: The turbo model was not trained for translation tasks. If you need to translate speech from one language to English, use medium or large-v3 instead.
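If you dispatch transcription and translation jobs programmatically, it is worth failing fast on this combination. A minimal sketch; model and task names are plain strings here, not tied to any particular library's API:

```python
def check_task(model: str, task: str) -> None:
    """Reject unsupported model/task combinations before inference starts.

    turbo was not trained for speech translation, so fail fast rather
    than silently producing poor output.
    """
    if task == "translate" and model.startswith("turbo"):
        raise ValueError(
            "turbo does not support translation; use medium or large-v3"
        )

check_task("small", "translate")   # fine: small supports translation
check_task("turbo", "transcribe")  # fine: transcription is turbo's strength
```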
English-Only vs Multilingual Models
For the tiny, base, small, and medium sizes, Whisper offers two variants: a multilingual model and an English-only (.en) model. The large and turbo models are multilingual only; there are no English-only versions.
When to use .en (English-only)
- You only transcribe English audio
- Better English accuracy at tiny/base size levels
- Slightly faster (no language detection overhead)
- Same file size as multilingual equivalent
When to use multilingual
- You transcribe non-English audio
- Your audio contains mixed languages
- You need speech translation (to English)
- Required if using large or turbo (no .en variant exists)
The performance gap between .en and multilingual models is most significant at smaller sizes. For tiny and base, the .en variants are noticeably better at English. By the time you reach small and medium, the difference narrows considerably. At the large level, there is only a multilingual model, and it performs excellently on English regardless.
Rule of thumb: If you exclusively use English and are choosing tiny or base, pick the .en variant. For small or larger, the multilingual version works great for English and gives you flexibility for other languages.
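This rule of thumb is simple enough to encode directly. A sketch, assuming you know up front which languages you transcribe; the function name and signature are our own illustration:

```python
def pick_variant(size: str, languages: set[str]) -> str:
    """Apply the .en-vs-multilingual rule of thumb from this section."""
    english_only = languages == {"en"}
    # The .en variant pays off most at tiny/base; for small and up the
    # multilingual model is fine for English, and large/turbo have no .en.
    if english_only and size in {"tiny", "base"}:
        return f"{size}.en"
    return size

# English-only dictation on a small machine gets the .en model;
# anything multilingual, or small and larger, stays multilingual.
```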
Hardware Requirements: What Runs Where
The hardware you need depends heavily on the model size and whether you are using whisper.cpp (CPU/Metal/CUDA) or the PyTorch implementation (GPU-focused). Here is what to expect on common hardware classes.
MacBook Air M1 (8 GB RAM)
- tiny/base: Instant. Faster than real-time.
- small: Fast. Real-time or better with Metal.
- medium: Usable. May be slightly slower than real-time on CPU. Metal helps significantly.
- large: Possible but tight on RAM. Expect slower than real-time.
- turbo: Good fit. Near real-time with Metal acceleration.
MacBook Pro M2/M3 (16-36 GB RAM)
- tiny/base/small: Instant. All well within hardware limits.
- medium: Fast. Comfortable with Metal.
- large: Good. Real-time or near it with Metal.
- turbo: Excellent. Fast real-time transcription.
Mid-Range Windows/Linux Laptop (8 GB RAM, integrated GPU)
- tiny/base: Fast. CPU-only is fine.
- small: Good. Comfortable on most modern CPUs.
- medium: Workable. May be 2-3x slower than real-time on CPU.
- large: Challenging. CPU-only will be slow.
- turbo: Workable. Benefits from Vulkan if available.
Desktop with NVIDIA GPU (RTX 3060+, 8 GB+ VRAM)
- All models: Fast. GPU handles everything comfortably.
- large: Needs 10 GB VRAM (RTX 3060 12GB or better).
- turbo: Fastest option with CUDA. Significantly faster than large.
- Use CUDA backend in whisper.cpp for best performance.
whisper.cpp Acceleration Backends
whisper.cpp, the C/C++ port that most desktop apps use, supports several hardware acceleration backends that can dramatically improve performance:
Metal (Apple Silicon)
Full GPU inference on M1/M2/M3/M4 Macs. Runs the entire model on the GPU with zero memory copies. This is the fastest backend for macOS users and what OpenWhispr uses on Mac.
Core ML (Apple)
Uses Apple's neural engine for inference. Can be faster than Metal for some model sizes on newer chips, and more power-efficient. Requires converting models to Core ML format first.
CUDA (NVIDIA)
GPU acceleration for NVIDIA cards. Fastest backend for users with dedicated NVIDIA GPUs. Requires CUDA toolkit. Best choice for large and turbo models on desktop.
Vulkan (Cross-vendor GPU)
Works with Intel, AMD, and NVIDIA GPUs. Good fallback for non-NVIDIA systems. Available on Linux and Windows. Performance varies by GPU vendor and driver.
Approximate Transcription Speed
Real-time factor (RTF) for a 60-second audio clip. RTF < 1.0 means faster than real-time. These are rough estimates; actual performance depends on audio content, model quantization, and system load.
| Model | M1 MacBook Air | M3 Pro MacBook | Intel i7 (CPU) | RTX 3060 (CUDA) |
|---|---|---|---|---|
| tiny | ~0.05x | ~0.03x | ~0.1x | ~0.02x |
| base | ~0.08x | ~0.05x | ~0.15x | ~0.04x |
| small | ~0.2x | ~0.12x | ~0.4x | ~0.08x |
| medium | ~0.6x | ~0.3x | ~1.2x | ~0.15x |
| large-v3 | ~1.5x | ~0.7x | ~3.0x | ~0.3x |
| turbo | ~0.3x | ~0.15x | ~0.6x | ~0.08x |
Reading the table: 0.1x means transcribing 60 seconds of audio takes about 6 seconds. 1.0x = real-time. Values above 1.0x mean slower than real-time. All figures use whisper.cpp with default settings and the Metal (Mac) or CUDA (NVIDIA) backend where applicable.
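The arithmetic here is simply transcription time = RTF × audio duration, which is worth wrapping in a helper if you estimate batch jobs:

```python
def transcription_seconds(rtf: float, audio_seconds: float) -> float:
    """Estimated wall-clock time to transcribe a clip at a given RTF."""
    return rtf * audio_seconds

# 60 s of audio at RTF 0.1 takes about 6 s; RTF 1.0 is exactly real-time,
# and anything above 1.0 falls behind the audio.
```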
Which Model Should You Use?
Here are concrete recommendations based on common scenarios. When in doubt, start with small and move up or down based on your experience.
"I want the fastest possible"
Use tiny or tiny.en. Transcribes in milliseconds on modern hardware. Accuracy is rough but usable for quick notes and low-stakes dictation.
tiny / tiny.en"I want good accuracy on a laptop"
Use small. Best balance of speed and accuracy for modern laptops. If your machine has 16 GB+ RAM, medium is also an excellent choice.
small or medium"I want the best accuracy"
Use large-v3. It is the most accurate Whisper model overall. If speed matters too, turbo gets you 95% of the accuracy at 8x the speed.
large-v3 or turbo"I only use English"
Use the .en variant of your chosen size for the best English accuracy. For tiny and base, the .en models are noticeably better. For small+, the difference is minimal.
small.en or medium.en"I use multiple languages"
Use the multilingual variant. small or medium for balanced performance. large-v3 for best multilingual accuracy.
small, medium, or large-v3"I have very limited hardware"
Use tiny or base. Both run on 1 GB of RAM and even work on Raspberry Pi and other single-board computers. Base offers a meaningful accuracy bump over tiny with minimal added cost.
tiny or baseHow OpenWhispr Handles Model Selection
OpenWhispr is an open-source desktop dictation app that uses whisper.cpp under the hood. It lets you download and switch between any Whisper model directly from the settings โ no command-line required. Models are downloaded once and stored locally. You can change models at any time to balance speed and accuracy for your hardware.
Download any model
Pick from tiny through large-v3 in settings. OpenWhispr downloads the GGML version automatically.
Switch instantly
Change models at any time. No restart needed. Test different sizes to find what works best on your hardware.
Hardware optimized
Uses Metal on Apple Silicon, CUDA on NVIDIA, and optimized CPU inference elsewhere. All local, all private.
Our suggestion: Start with small. If it feels slow, drop down to base or tiny. If you want better accuracy and your hardware can handle it, move up to medium or large-v3. OpenWhispr makes it easy to experiment.
Sources
- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022) – the original Whisper paper with full benchmark results.
- OpenAI Whisper GitHub Repository – official model table, parameter counts, VRAM requirements, and relative speed figures.
- OpenAI Whisper Model Card – official training data composition and evaluation caveats.
- whisper.cpp GitHub Repository – GGML model sizes, RAM usage, and hardware acceleration details.
- Whisper large-v2 Model Card (HuggingFace) – LibriSpeech benchmark context used in many WER comparisons.
- Whisper large-v3 Model Card (HuggingFace) – training details, architecture changes from large-v2, and multilingual improvements.
- Whisper large-v3-turbo Model Card (HuggingFace) – turbo architecture details, decoder pruning specifics, and performance tradeoffs.
- OpenAI Whisper Turbo Announcement (GitHub Discussion #2363) – official turbo release notes and cross-language performance comparisons.
- OpenWhispr – practical local deployment context for choosing model sizes.
Try Different Whisper Models in OpenWhispr
Open source, local-first dictation. Download any Whisper model, switch between sizes with one click, and find the perfect balance for your hardware.
No account required · Works offline · Open source forever