High-Level Design of Scalable Data Infrastructure for AI/ML

Building on the requirements and estimations from the previous lesson, we now focus on the system’s high-level design.

High-level design

The platform consists of five core layers that establish a clear data flow from ingestion to serving. This structure ensures the system meets both functional and non-functional requirements.

High-level design of the AI/ML data platform

The high-level workflow operates as follows:

  1. Data ingestion: Data originates from diverse sources such as logs, transactional databases (via change data capture, or CDC, which identifies inserts, updates, and deletes and streams them in real time to destinations such as data warehouses or event streams like Kafka, with minimal impact on source performance), IoT devices, and third-party SaaS platforms. These sources push data through API connectors for synchronous loads or publish events to message queues for real-time streams. The ingestion layer is split into specialized stream and batch ingestion components: stream ingestion handles real-time events continuously, while batch ingestion manages scheduled bulk loads.

  2. Raw data storage layer: Ingested data lands in the raw data lake, typically using scalable object storage like S3. It employs a “schema-on-read” approach to preserve original fidelity for compliance and reprocessing.

  3. Data ...