In 2011, James Dixon, then CTO of business intelligence firm Pentaho, coined the term data lake. He described the data lake as a counterpart to the siloed data marts popular at the time:
If you think of a data mart as a bottled water warehouse—purified, packaged, and structured for easy consumption—then the data lake is water in its unpackaged state. Data pours into a lake to fill it. The different users of the data lake can examine it, take samples or dive directly into it.
Since then, data lakes have evolved significantly and now compete with data warehouses for big data storage and analytics. Various tools support faster SQL queries in data lakes, and all major cloud providers now have data lake solutions in their portfolios. The data lakehouse even offers a hybrid concept that combines governance, security, and analytics with affordable storage.
A data lake is essentially a single store where all data is kept to make it available for analysis. Sometimes a data lake houses only data that does not fit into a data warehouse. The data is usually saved in its native file format, but it can also be transformed and unified to allow for more efficient analysis. The goal of a data lake is to derive business (or other analytical) value from data. Data lakes can contain the following types of data:
unstructured data,
semi-structured data (such as CSV and JSON files), and
structured data (usually from relational databases).
Structured data is the most useful for analysis, while semi-structured data is easily put into a structured form. In many cases, even unstructured data can be transformed into structured data through intelligent automation.
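To illustrate how semi-structured data can be put into a structured form, the short Python sketch below flattens nested JSON records into a tabular pandas DataFrame; the record layout is invented for the example:

```python
import pandas as pd

# Hypothetical semi-structured records, e.g. parsed from a JSON file in the lake
records = [
    {"id": 1, "user": {"name": "Ada", "country": "UK"}, "amount": 12.5},
    {"id": 2, "user": {"name": "Grace", "country": "US"}, "amount": 7.0},
]

# json_normalize flattens the nested "user" object into ordinary columns,
# yielding a structured table that is ready for analysis
df = pd.json_normalize(records)
print(df)
#    id  amount user.name user.country
# 0   1    12.5       Ada           UK
# 1   2     7.0     Grace           US
```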
The question is not whether you need a data lake or a data warehouse. Most likely you need both, but for different purposes. It is also possible to combine a data lake and a data warehouse. First, let's look at the key differences between the two:
Schema strategy: The database schema for data lakes is typically applied at the time of analysis, referred to as "schema-on-read". The database schema for enterprise data warehouses is typically designed before the data store is created and applied to the data during import. This is known as "schema-on-write" (see the sketch after this list).
Storage infrastructure: Data warehouses often rely on large amounts of expensive RAM and SSD storage to enable fast queries. Data lakes, on the other hand, often use cheap, conventional hard drives on clusters of commodity computers. Both data warehouses and data lakes rely on massively parallel processing (MPP) to speed up SQL queries.
Users: Data warehouse users are typically business analysts. Data lake users are typically data scientists or data engineers. Business analysts get access to the data as soon as it is curated.
Analytics form: Typical analytics for data warehouses include business intelligence, batch reporting, and visualizations. In the case of data lakes, machine learning, predictive analytics, data discovery and data profiling are typical.
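To make the schema-on-read strategy described above concrete, here is a minimal PySpark sketch: the JSON files land in the lake without any predeclared schema, and a schema is only derived when the data is read for analysis. The path and table definition are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Schema-on-read: the raw JSON files were dumped into the lake as-is;
# the schema is inferred at read time, not enforced at load time.
events = spark.read.json("s3a://example-lake/raw/events/")  # illustrative path
events.printSchema()

# Contrast with schema-on-write in a warehouse, where a table definition
# like the following must exist before any row can be loaded:
# CREATE TABLE events (event_id BIGINT, event_type VARCHAR(32), ts TIMESTAMP);
```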
Data marts, on the other hand, are analytical databases that are limited to the data of a department or a division. This is in contrast to data warehouses, which consolidate all of an organization's relational data into an analysis-ready form. Data marts offer efficient analysis options because they only contain data that is relevant for the respective department. As such, they are inherently siloed. In practice, this has an unfavorable effect: there is always a manager who needs reports based on combined data from several business areas. That's one reason why there are currently many data lakes and data warehouses but few data marts.
If you store raw data in a data lake, the data may be unusable for business analysts until it has been processed by a data engineer or data scientist. In addition to filtering and data transformation features, data lakes also require data catalogs, data security, and schema definitions. If these functions are missing, the result is a so-called data swamp.
Luckily, there are a number of tools to help you filter and organize the data in your data lake. For example, you can fill the need for a schema by registering table definitions in a Hive metastore over data stored in ORC format. Once set up, this supports fast SQL queries via an MPP SQL engine such as Presto. ORC (Optimized Row Columnar) is a compressed columnar storage format optimized for Hive that also works well with Presto.
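As a rough sketch of that setup, the Spark SQL statement below registers an external ORC table in the Hive metastore; once the definition exists, a Presto cluster pointed at the same metastore can query the files directly. Table name, columns, and location are assumptions for the example:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark register tables in the Hive metastore
spark = (SparkSession.builder
         .appName("hive-orc-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table: the schema lives in the metastore, while the ORC files
# stay where they are in the lake; Presto can query the table afterwards.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    )
    STORED AS ORC
    LOCATION 's3a://example-lake/curated/sales/'
""")
```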
Apache Spark is another MPP SQL engine. While it can work with the ORC format, it works even better with Parquet, another columnar store. Spark can partition Parquet files both vertically and horizontally, creating a query plan that reads only relevant data and skips irrelevant data.
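The following PySpark sketch illustrates both kinds of pruning: partitionBy() lays out the Parquet files horizontally by a key, and selecting a subset of columns prunes vertically, so the query plan reads only the relevant files and column chunks. Paths and column names are again illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pruning-sketch").getOrCreate()

df = spark.read.json("s3a://example-lake/raw/events/")

# Horizontal partitioning: one directory per country value
df.write.partitionBy("country").parquet("s3a://example-lake/curated/events/")

# At query time, the filter prunes partitions (horizontal) and the column
# selection prunes Parquet column chunks (vertical): only the files under
# country=DE and only the two named columns are read.
result = (spark.read.parquet("s3a://example-lake/curated/events/")
          .filter("country = 'DE'")
          .select("event_type", "ts"))
result.explain()  # the plan shows PartitionFilters and a pruned read schema
```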
Databricks, the company founded by the creators of Apache Spark and the maker of MLflow, offers a so-called data lakehouse, which, according to the provider, combines the best properties of data warehouses and data lakes.
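Databricks' lakehouse builds on Delta Lake, an open-source table format that adds ACID transactions and schema enforcement on top of Parquet files in the lake. A minimal sketch, assuming the delta-spark package is installed and using an invented path and data set:

```python
from pyspark.sql import SparkSession

# Delta Lake configuration for a plain Spark session (pip install delta-spark)
spark = (SparkSession.builder
         .appName("lakehouse-sketch")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.createDataFrame([(1, "open"), (2, "closed")], ["ticket_id", "status"])

# Writing as "delta" stores Parquet files plus a transaction log, giving
# warehouse-style guarantees (ACID, time travel) over cheap lake storage.
df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/tickets")

spark.read.format("delta").load("/tmp/lakehouse/tickets").show()
```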
Historically, on-premises data lakes have been implemented using Apache Hadoop clusters on commodity computers and the Hadoop Distributed File System (HDFS). Hadoop clusters were once big business for Cloudera, Hortonworks, and others. In 2018, Cloudera and Hortonworks merged, giving some indication of the direction the market was heading.
The cloud has contributed significantly to this development, in particular the hyperscalers Amazon Web Services (AWS), Microsoft (Azure), and Google Cloud (GCP). All three now offer data lake products:
Amazon Simple Storage Service (S3),
Amazon EMR,
Azure Data Lake Store, and
Google Cloud Storage.
Services for data ingestion, data processing, analytics, and machine learning are also part of the big three's portfolios. It is much easier and faster to create, manage, and scale cloud data lakes than to run Hadoop clusters in your own data center. The downside of cloud solutions: the long-term operating expenses eventually add up massively.
While SQL is essential and often the first means of analysis in any data lake, it is just one option. Business intelligence and deep learning tools should also be considered at this point. The hyperscalers, in turn, have their own analytics and machine learning tools that connect to their respective data lakes:
Amazon Athena uses Presto and Hive to run SQL queries against data in Amazon S3 (see the sketch after this list).
Amazon SageMaker is a managed service for building, training, and deploying machine learning models.
Azure Data Lake Analytics (ADLA) is a legacy on-demand (serverless) analytics service that simplifies big data processing and uses U-SQL. ADLA is being replaced by Azure Synapse Analytics, an analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Data and code from ADLA and existing data warehouses can be migrated automatically. Additionally, Synapse is deeply integrated with Azure Machine Learning, Azure Cognitive Services, and Power BI.
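For example, an Athena query against data in S3 can be submitted from Python with boto3; the database, query, and output bucket below are assumptions for the sketch:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; Athena writes its results to the given S3 location
resp = athena.start_query_execution(
    QueryString="SELECT country, COUNT(*) FROM events GROUP BY country",
    QueryExecutionContext={"Database": "example_lake"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```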
This post is based on an article from our US sister publication, Network World.