Deploying an Automatic Speech Recognition System

Explore the deployment of an automatic speech recognition system built around Whisper V3. Understand how to estimate storage, inference, and bandwidth resources for large-scale user traffic. Learn the design of modular subsystems including audio preprocessing, model hosting on GPUs, and post-processing for accurate and efficient transcription. Discover how to integrate these services into a reliable, scalable pipeline capable of handling real-time and batch audio transcription.

We'll cover the following...

Resource estimation
- Storage estimation
  - Stable storage
  - Dynamic storage
Inference server estimation
Bandwidth estimation
High-level System Design
Achieving functional requirements
Detailed System Design
Putting it all together
- Infrastructure components
- End-to-end workflow
Achieving nonfunctional requirements
Conclusion

In the previous lesson, we trained and evaluated Whisper V3 for automatic speech recognition (ASR). Now that we have a production-ready model, the next step is designing the system that deploys it at scale.

This lesson walks through the full System Design, from resource estimation to a detailed architecture, showing how different services work together to handle real-world audio transcription traffic reliably and efficiently.

Let’s start with the resource estimation.

Resource estimation

Before designing the system, we need to estimate the resources required. We’ll look at three key areas: storage, inference servers, and network bandwidth. All estimates are based on 100 million daily active users, each submitting 10 audio clips per day.

Storage estimation

Storage requirements can be divided into two categories: stable storage (updated infrequently) and dynamic storage (scales with usage).

Stable storage

Model weights (Whisper V3, ~1.5 billion parameters at FP16 precision): $\text{3 GB}$
User profile data (100 M users × 10 KB each): $\sim\text{1 TB}$
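
The two stable-storage figures above can be reproduced with a short calculation. This is a minimal sketch that assumes decimal units (1 KB = 10^3 bytes) and 2 bytes per FP16 parameter; the constant names are illustrative and not part of the lesson.

```python
# Stable storage estimate, using the lesson's inputs and decimal units.
PARAMS = 1_500_000_000        # ~1.5 billion parameters (figure from the text)
BYTES_PER_PARAM_FP16 = 2      # FP16 stores each parameter in 2 bytes
USERS = 100_000_000           # 100 M users
PROFILE_BYTES = 10_000        # 10 KB of profile data per user

model_bytes = PARAMS * BYTES_PER_PARAM_FP16
profile_bytes = USERS * PROFILE_BYTES

print(f"Model weights: {model_bytes / 1e9:.1f} GB")
print(f"User profiles: {profile_bytes / 1e12:.1f} TB")
```

Changing the precision (for example, 4 bytes per parameter for FP32) only requires adjusting `BYTES_PER_PARAM_FP16`.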

Dynamic storage

Each audio clip is assumed to be 30 seconds long and encoded at 256 kbps, which comes to roughly 1 MB per clip: $256 \times 10^3 \text{ bits/s} \times 30 \text{ s} = 32 \times 10^3 \text{ bytes/s} \times 30 \text{ s} = 0.96 \text{ MB} \approx 1 \text{ MB}$.

Daily audio uploads: $\text{100 M users × 10 clips × 1 MB = 1 PB/day}$
Indexing overhead (25% for fast querying and retrieval): $\text{250 TB/day}$
Total daily storage: $\sim\text{1.25 PB/day}$ ...
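
The dynamic-storage arithmetic above can be checked with a small script. This is a sketch under the lesson's stated assumptions (30-second clips at 256 kbps, 100 M users, 10 clips each, 25% indexing overhead) with decimal units; the function and constant names are hypothetical.

```python
# Dynamic storage estimate, using the lesson's inputs and decimal units.
BITRATE_BPS = 256_000         # 256 kbps
CLIP_SECONDS = 30
USERS = 100_000_000           # 100 M daily active users
CLIPS_PER_USER = 10
INDEX_OVERHEAD_PCT = 25       # indexing overhead as a percentage of audio size

# Exact clip size: bits per second * seconds / 8 bits per byte.
clip_bytes_exact = BITRATE_BPS * CLIP_SECONDS // 8   # 960,000 bytes = 0.96 MB
clip_bytes_rounded = 1_000_000                       # rounded to 1 MB, as in the text


def daily_storage(clip_bytes):
    """Return (audio, index, total) bytes stored per day for a given clip size."""
    audio = USERS * CLIPS_PER_USER * clip_bytes
    index = audio * INDEX_OVERHEAD_PCT // 100
    return audio, index, audio + index


audio, index, total = daily_storage(clip_bytes_rounded)
print(f"Clip size:     {clip_bytes_exact / 1e6:.2f} MB")
print(f"Daily audio:   {audio / 1e15:.2f} PB")
print(f"Indexing:      {index / 1e12:.0f} TB")
print(f"Total per day: {total / 1e15:.2f} PB")
```

Passing `clip_bytes_exact` instead of the rounded value gives about 1.2 PB/day in total, which shows how little the rounding to 1 MB affects the estimate.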