Deploying an Automatic Speech Recognition System
Explore the deployment of an automatic speech recognition system built around Whisper V3. Understand how to estimate storage, inference, and bandwidth resources for large-scale user traffic. Learn the design of modular subsystems including audio preprocessing, model hosting on GPUs, and post-processing for accurate and efficient transcription. Discover how to integrate these services into a reliable, scalable pipeline capable of handling real-time and batch audio transcription.
In the previous lesson, we trained and evaluated Whisper V3 for automatic speech recognition (ASR). Now that we have a production-ready model, the next step is designing the system that deploys it at scale.
This lesson walks through the full system design, from resource estimation to a detailed architecture, showing how the services work together to handle real-world audio transcription traffic reliably and efficiently.
Let’s start with resource estimation.
Resource estimation
Before designing the system, we need to estimate the resources required. We’ll look at three key areas: storage, inference servers, and network bandwidth. All estimates are based on 100 million daily active users, each submitting 10 audio clips per day.
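To make the traffic assumption concrete, the short Python sketch below converts it into a daily clip count and an average request rate. It assumes traffic is spread evenly over the day, which real workloads rarely are, so peak rates would be higher. The variable names are illustrative only.

```python
# Traffic assumptions stated in the lesson.
DAILY_ACTIVE_USERS = 100_000_000
CLIPS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 24 * 60 * 60

# Total clips submitted per day.
clips_per_day = DAILY_ACTIVE_USERS * CLIPS_PER_USER_PER_DAY

# Average arrival rate, assuming perfectly uniform traffic (a simplification).
avg_clips_per_second = clips_per_day / SECONDS_PER_DAY

print(f"{clips_per_day:,} clips/day")
print(f"{avg_clips_per_second:,.0f} clips/s on average")
```

One billion clips per day works out to an average of roughly 11,600 clips per second.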
Storage estimation
Storage requirements can be divided into two categories: stable storage (updated infrequently) and dynamic storage (scales with usage).
Stable storage
Model weights (Whisper V3, ~1.5 billion parameters at FP16 precision, 2 bytes per parameter): 1.5 B × 2 bytes ≈ 3 GB
User profile data (100 M users × 10 KB each): 100 M × 10 KB = 1 TB
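The stable storage figures can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal units (1 KB = 1,000 bytes) and 2 bytes per FP16 parameter; the variable names are illustrative only.

```python
# Inputs from the lesson; decimal units are an assumption (1 KB = 1,000 bytes).
MODEL_PARAMS = 1_500_000_000      # ~1.5 billion parameters
BYTES_PER_FP16_PARAM = 2          # FP16 stores each parameter in 2 bytes
USERS = 100_000_000
PROFILE_BYTES_PER_USER = 10_000   # 10 KB per user profile

GB = 10**9
TB = 10**12

model_weight_bytes = MODEL_PARAMS * BYTES_PER_FP16_PARAM
profile_bytes = USERS * PROFILE_BYTES_PER_USER

print(f"Model weights: {model_weight_bytes / GB:.1f} GB")
print(f"User profiles: {profile_bytes / TB:.1f} TB")
```

The model weights are negligible next to the profile data, and both are small compared with the storage that grows with usage.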
Dynamic storage
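Each audio clip is assumed to be 30 seconds long, encoded at 256 kbps, which is roughly 1 MB (30 s × 256 kbps = 7,680 kilobits ≈ 960 KB).
Daily audio uploads: 100 M users × 10 clips × ~1 MB ≈ 1 PB
Indexing overhead (25% for fast querying and retrieval): 0.25 × 1 PB ≈ 250 TB
Total daily storage: 1 PB + 250 TB ≈ 1.25 PB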
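The dynamic storage figures follow directly from the clip length, bitrate, and traffic assumptions. Below is a minimal Python sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes) and the exact 960 KB clip size rather than a rounded 1 MB, so the totals come out slightly below the round-number estimates. The variable names are illustrative only.

```python
# Inputs from the lesson; decimal units are an assumption.
CLIP_SECONDS = 30
BITRATE_BITS_PER_SECOND = 256_000    # 256 kbps
CLIPS_PER_DAY = 100_000_000 * 10     # 100 M users x 10 clips each
INDEX_OVERHEAD_PERCENT = 25

TB = 10**12
PB = 10**15

# Size of one clip: duration x bitrate, converted from bits to bytes.
clip_bytes = CLIP_SECONDS * BITRATE_BITS_PER_SECOND // 8

# Raw audio uploaded per day, plus the indexing overhead on top of it.
daily_audio_bytes = clip_bytes * CLIPS_PER_DAY
daily_index_bytes = daily_audio_bytes * INDEX_OVERHEAD_PERCENT // 100
daily_total_bytes = daily_audio_bytes + daily_index_bytes

print(f"Clip size:      {clip_bytes / 1000:.0f} KB")
print(f"Daily audio:    {daily_audio_bytes / TB:.0f} TB")
print(f"Daily indexing: {daily_index_bytes / TB:.0f} TB")
print(f"Daily total:    {daily_total_bytes / PB:.2f} PB")
```

Keeping the inputs as named constants makes it easy to see how sensitive the total is to the bitrate or clip length. Halving the bitrate, for example, halves the daily storage growth.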
...