Scalable Data Lake
Explore the concept of scalable data lakes on AWS using Amazon S3 and Lake Formation. Understand how these solutions differ from data warehouses and production databases, and how they support the storage and analysis of structured and unstructured data for business intelligence and machine learning.
A data lake is a centralized location for storing data that has been ingested from various places. The term was coined around 2011 to distinguish it from other forms of centralized data storage. Others creatively coined the term “data swamps” to describe badly managed data lakes.
In this lesson, we consider the AWS approach for setting up a data lake and how a data lake differs from data warehouses and production data stores.
AWS services for scalable data lake
The AWS team suggests the following two services for setting up a scalable data lake: Amazon Simple Storage Service (S3) and AWS Lake Formation.
Amazon S3
Amazon’s Data Lake on AWS architecture recommends S3 as the centralized location to store data of all formats.
Amazon S3 is a scalable and cost-effective way to store a variety of objects and has been widely used among AWS customers of all sizes and industries.
Launched in 2006, S3 now stores over 100 trillion objects and can handle tens of millions of requests per second.
Amazon S3 is similar to a cloud-based file system: it consists of buckets that hold objects, which can be organized with folder-like key prefixes.
AWS Lake Formation
Launched in 2018, AWS Lake Formation is designed to let teams set up a data lake on AWS quickly (“in days instead of months”).
Amazon S3 is used as the centralized location for the data lake and is the destination where AWS Lake Formation ingests data.
There are additional features to clean and classify the ingested data and to set up security permissions for accessing the data.
Data lakes vs. data warehouses
A data lake is designed to store all types of data, including unstructured raw data that doesn’t have a predefined database schema. This distinguishes it from a data warehouse, which was initially designed to store structured data with a known database schema (though some data warehouses have also been adding support for unstructured data).
Depending on the use case, there are benefits to either analyzing data within a data lake or within the relational structure of a data warehouse. Some examples:
For use cases involving machine learning (e.g., TensorFlow), using a data lake (e.g., Amazon S3) can be more effective. Amazon S3 can store all types of data for machine learning (ML) models and is also relatively cost-efficient for such use cases.
For use cases involving queries on relational data, using a data warehouse (e.g., Snowflake, Google BigQuery, Azure Synapse Analytics, or Amazon Redshift) can be more effective. Like a production database, a data warehouse is a relational database queried with SQL.
The topic of data lakes vs. data warehouses is an evolving area, as alluded to by the introduction of hybrid terms such as “data lakehouse.” Even within AWS, there are services that support a data lake approach (e.g., Amazon S3) and others that support a warehouse approach (e.g., Amazon Redshift).
Data warehouses are optimized for throughput, that is, how much data can be processed by a single query. Here are some example SQL queries that warehouses handle well:
COPY table FROM <data location>
MERGE table FROM <data location with many rows>
SELECT * FROM table WHERE <matches many rows>
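The throughput-oriented pattern above can be sketched in a few lines of Python. This is a minimal illustration using an in-memory SQLite database as a stand-in for a real warehouse; the `events` table and its columns are invented for the example, and a warehouse would replace the bulk insert with a command like COPY against files in object storage.

```python
import sqlite3

# In-memory SQLite as a stand-in for a warehouse (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")

# Load many rows in one call -- the COPY-like bulk-ingest step.
rows = [(i % 100, float(i)) for i in range(10_000)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# A scan-heavy analytical query that touches every row of the table.
total, n = conn.execute("SELECT SUM(amount), COUNT(*) FROM events").fetchone()
print(n)  # 10000
```

The point of the sketch is the shape of the workload: one bulk load, then queries that aggregate over the whole table rather than fetching individual rows.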
For most types of data analysis, an effective approach is to ingest data into a data warehouse and then use SQL or SQL-like queries to perform the analysis. Many startups implement this approach, using Fivetran (or other tools) to ingest the data to Snowflake or BigQuery.
Teams can also use business intelligence (BI) tools (e.g., Tableau, Sigma Computing, Power BI, Amazon QuickSight) for additional data analysis and visualization. As an example, the BI tool Sigma Computing supports the following data stores: BigQuery, Databricks, PostgreSQL, Amazon Redshift, and Snowflake.
Interestingly, the AWS data analytics architecture places Amazon S3, rather than its data warehouse Amazon Redshift, at the center of a scalable data lake.
Perspectives on data analytics architectures may differ. George Fraser, a cofounder of the data ingestion tool Fivetran, said that while his company supports both data lakes and warehouses, he believes that modern data warehouses are easier to use and maintain, especially for applications that can wait a few seconds to process data.
Fans of data lakes include Tomer Shiran, a cofounder of the lakehouse platform Dremio. He argues that as advancements make it easier to run SQL-like queries directly against data in a data lake, there is less need for a separate data warehouse. Within the AWS ecosystem, Amazon Athena can be used to analyze data in Amazon S3 using standard SQL.
Data lakes vs. production data stores
Both data lakes and data warehouses are conceptually different from the production data stores (or databases) used to power websites and mobile applications. Production data stores can be implemented with MySQL, Amazon Aurora, Amazon Relational Database Service (RDS), Microsoft SQL Server, Google Cloud SQL, PostgreSQL, and/or MongoDB.
Production data stores are typically optimized for latency, that is, how quickly a single row of data can be stored or retrieved. Administrators can set up indexes to further speed up queries and retrievals and improve the user experience of the website or application. Here are some example queries that SQL production data stores handle well:
INSERT INTO table <one row>
UPDATE table SET … WHERE <matches 1 row>
DELETE FROM table WHERE <matches 1 row>
SELECT * FROM table WHERE <matches 1 row>
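The latency-oriented pattern above can be sketched the same way. This is a minimal illustration using SQLite as a stand-in for a production database; the `users` table and its columns are invented for the example. The key detail is that each statement touches a single row, and the PRIMARY KEY index lets the database locate that row without scanning the table.

```python
import sqlite3

# In-memory SQLite as a stand-in for a production database (illustration only).
conn = sqlite3.connect(":memory:")
# PRIMARY KEY creates an index, so WHERE id = ? is a fast point lookup.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Each statement affects exactly one row -- the low-latency access pattern.
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "ada"))
conn.execute("UPDATE users SET name = ? WHERE id = ?", ("ada l.", 1))
row = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()
conn.execute("DELETE FROM users WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(row)  # ('ada l.',)
```

In a real application, each of these statements would typically run in response to a single user action, which is why per-query latency matters more than bulk throughput.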
As an example of how speed can affect user experience, the Google Search team discovered that users are sensitive to how quickly search results can appear. Research showed that a 400 millisecond delay leads to a 0.44 percent drop in search volume.
Compared to production databases, data lakes and warehouses are usually less sensitive to latency requirements. Delays may affect internal users of data reports, but they don’t usually affect the experience of a much broader audience. Also, many data reports don’t require data to be available in real time (i.e., to be current within seconds).