Data lakes, data warehouses, data cleansing, data mining... The world of big data and business intelligence is full of terminology to learn. In our new explainer series we will be taking a look at some of these concepts in detail. Whether you are just starting out on your BI journey or whether you've always wondered what that term means but been too afraid to ask, we hope you find this series of articles informative and useful.
In this post, we're taking a deep dive into data lakes. What are they, how are they used and how can they fit into your BI strategy?
So What's a Data Lake, Then?
Put simply, a data lake is a repository where you store large volumes of data, both structured and unstructured, in its raw, original format. It differs from a data warehouse, which stores only structured data, and which needs to have that data structure tightly defined in advance.
While data warehouses have been around since the 1980s, data lakes have come about more recently from a growing realisation that the overwhelming majority of business data is unstructured. In fact, some analysts have estimated that as much as 90% of all business data is unstructured.
With a data lake, you can store large amounts of this unstructured data without needing to predefine the formats and schemas ahead of time. Typically this is done by leveraging cloud object storage services, such as Amazon S3 or Azure Blob Storage.
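To make the "store it raw, worry about structure later" idea concrete, here is a minimal sketch of a raw-ingestion step. It uses the local filesystem as a stand-in for an object store such as Amazon S3, and the function name, the `/tmp/demo-lake` path and the source/date partitioning scheme are illustrative assumptions, not any vendor's API:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest_raw(lake_root: str, source: str, payload: bytes, extension: str) -> Path:
    """Store a raw payload untouched, partitioned by source and ingestion date."""
    now = datetime.now(timezone.utc)
    # Raw zones of a lake are commonly laid out by source and date, e.g.
    # <lake>/raw/clickstream/2024/05/01/<timestamp>.json
    key = (Path(lake_root) / "raw" / source
           / f"{now:%Y/%m/%d}" / f"{now:%H%M%S%f}.{extension}")
    key.parent.mkdir(parents=True, exist_ok=True)
    key.write_bytes(payload)  # no schema validation: the data lands as-is
    return key

# Drop a raw JSON event into the lake without defining any schema up front
event = json.dumps({"user": "alice", "action": "click"}).encode()
path = ingest_raw("/tmp/demo-lake", "clickstream", event, "json")
```

The point of the sketch is what is missing: there is no table definition, no column types and no validation. Any payload, in any shape, can land in the lake.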
A data lake also makes it easy to ingest real-time data, such as information from sensors and IoT devices, as well as all sorts of unstructured data from things like social media channels and mobile applications. Activities such as defining a schema and building queries only need to happen when the data is analysed, an approach often called schema-on-read, in contrast to the upfront schema-on-write ETL (Extract, Transform, Load) process of a data warehouse.
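Schema-on-read can be sketched in a few lines. The helper below is a hypothetical example, not a real library call: it applies a schema (a list of wanted fields) only at the moment of analysis, tolerating the differently shaped records that naturally accumulate in a lake:

```python
import json

def read_events(raw_lines, fields):
    """Apply a schema at read time: extract only the fields this analysis
    needs, tolerating records that lack them (nothing was validated on write)."""
    for line in raw_lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two raw records with different shapes, as a lake would happily store them
raw = [
    '{"user": "alice", "action": "click", "ts": 1}',
    '{"user": "bob", "device": "mobile"}',
]
rows = list(read_events(raw, ["user", "action"]))
# rows[1]["action"] is None: the missing field only matters now, at read time
```

A warehouse would have rejected the second record at load time; here, the decision about which fields matter is deferred until someone actually queries the data.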
In many organisations with large-scale data collection and analysis needs, it's likely that there will be both data warehouses and data lakes.
Staying out of the Data Swamp
While data lakes offer many benefits, they are not without their issues. Without proper oversight and governance over exactly what data is deposited into the lake, it can become an unwieldy mess that is hard to maintain and extract value from. Some commentators have taken to referring to this situation as a data swamp.
Although things like curation, metadata, and governance can help prevent the lake becoming a swamp, it is a fine balancing act: imposing enough structure and oversight without losing the advantages of having a data lake in the first place.
In an attempt to avoid these potential problems, the concept of a data lakehouse has emerged. This seeks to adopt a best of both worlds approach, by applying some degree of structure (similar to the type of structure employed in a data warehouse) to the unstructured data in the lake. Typically this is achieved through a metadata layer that sits on top of the raw data and helps to categorise and index the underlying data so that the relevant information can easily be extracted for analysis.
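Real lakehouse metadata layers (such as Delta Lake's transaction log or the AWS Glue Data Catalog) are far more sophisticated, but the core idea of indexing raw files so they can be found without scanning the whole lake can be sketched with a toy catalog. Everything here, from the class name to the paths, is an illustrative assumption:

```python
class Catalog:
    """A toy metadata layer: maps logical table names and partition tags to
    the raw files underneath, so an analysis can locate the data it needs
    without trawling through the entire lake."""

    def __init__(self):
        self.entries = []  # each entry: {"table": ..., "path": ..., "tags": {...}}

    def register(self, table: str, path: str, **tags):
        # Record where a raw file lives and how it is partitioned
        self.entries.append({"table": table, "path": path, "tags": tags})

    def find(self, table: str, **tags):
        # Return only the files whose tags match the query
        return [e["path"] for e in self.entries
                if e["table"] == table
                and all(e["tags"].get(k) == v for k, v in tags.items())]

catalog = Catalog()
catalog.register("clickstream", "/lake/raw/clicks/2024/05/01.json", date="2024-05-01")
catalog.register("clickstream", "/lake/raw/clicks/2024/05/02.json", date="2024-05-02")

# The metadata layer narrows a query down to the relevant raw files
paths = catalog.find("clickstream", date="2024-05-02")
```

The raw data itself stays untouched; the structure lives entirely in the metadata layer on top, which is the essence of the lakehouse approach described above.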
Whatever your views on the concept of the data lake, and the lakehouse, it seems likely to be with us for the foreseeable future. A number of major technology vendors have begun to offer data lakehouse solutions, including Databricks, AWS and Snowflake, and there is also the open source Delta Lake project.