Data warehouse vs Data lake
One of the most painful aspects of Big Data systems is supporting the full set of create, read, update and delete (CRUD) operations. Updates and deletes are especially costly: updates degrade performance considerably, particularly in near-real-time (NRT) systems, and for deletes, many Big Data systems prefer to mark data as logically deleted rather than physically remove it. This is why it is so important to understand what a data lakehouse is and how the data lakehouse architecture works.
This situation has also caused great discomfort among the business intelligence teams that manage a data warehouse. They already have all their reports working, and now they are being asked to bring the concept of a data lake and distributed computing into their data extraction, transformation and loading (ETL) mechanisms.
It even happens that the BI team comes to see the data engineering team as an enemy, because it believes the engineers are going to replace it with a new technology and push it out of its data warehouse comfort zone. The problem is so widely recognized in the industry that books have been written to show that the data warehouse is not in contradiction with the data lake but complementary to it.
The good news is that this war is over: why use Databricks if I already have Snowflake, why use S3 if I already have Redshift, why use Hadoop if I already have my data warehouse.
Data lakehouse concept origin
The Databricks team, with Delta Lake, and the Uber team, with Hudi, have made it possible for this war to end. Both Delta Lake and Hudi aim to let files stored in S3, Google Cloud Storage, Hadoop (HDFS) and Azure Data Lake Storage support data modifications and deletions efficiently.
This is accomplished by recording file changes in a manifest file and resolving data modifications and deletions against these manifests before querying the data files directly. As a result, there is no longer a need to maintain a separate data lake and data warehouse: both can coexist in a single technology (Delta Lake or Hudi) that supports CRUD. This is how the lakehouse concept was born.
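To make the manifest idea more concrete, here is a minimal sketch in plain Python. It is a toy model only and not how Delta Lake or Hudi are actually implemented: the class ToyLakehouseTable, its methods and the part-0000N file names are all hypothetical, and an in-memory dictionary stands in for object storage. It assumes a copy-on-write strategy, where data files are never edited in place, an update or delete rewrites the affected file, and an append-only log records which files are currently valid.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLakehouseTable:
    """Toy copy-on-write table: immutable data files plus an append-only log."""

    files: dict = field(default_factory=dict)  # file name -> rows (stands in for object storage)
    log: list = field(default_factory=list)    # ordered ("add" | "remove", file name) entries
    _counter: int = 0

    def _write_file(self, rows):
        # Data files are never modified in place; every write creates a new file.
        self._counter += 1
        name = f"part-{self._counter:05d}"
        self.files[name] = list(rows)
        return name

    def active_files(self):
        # Replay the log to find which files make up the current snapshot.
        active = []
        for action, name in self.log:
            if action == "add":
                active.append(name)
            else:
                active.remove(name)
        return active

    def insert(self, rows):
        self.log.append(("add", self._write_file(rows)))

    def _rewrite(self, predicate, transform):
        # Copy-on-write: only files that contain matching rows are rewritten.
        for name in self.active_files():
            rows = self.files[name]
            if not any(predicate(r) for r in rows):
                continue
            new_rows = []
            for r in rows:
                out = transform(r) if predicate(r) else r
                if out is not None:
                    new_rows.append(out)
            self.log.append(("remove", name))
            if new_rows:
                self.log.append(("add", self._write_file(new_rows)))

    def update(self, predicate, changes):
        self._rewrite(predicate, lambda r: {**r, **changes})

    def delete(self, predicate):
        self._rewrite(predicate, lambda r: None)

    def read(self):
        # Readers resolve the log first, then scan only the active files.
        return [row for name in self.active_files() for row in self.files[name]]


table = ToyLakehouseTable()
table.insert([{"id": 1, "city": "Lima"}, {"id": 2, "city": "Quito"}])
table.insert([{"id": 3, "city": "Bogota"}])
table.update(lambda r: r["id"] == 2, {"city": "Cusco"})
table.delete(lambda r: r["id"] == 3)
print(table.read())
```

In this sketch the superseded files stay in storage after the update and the delete. The log alone decides what a reader sees, so in this toy model removing old files physically is a separate clean-up step.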
Data lakehouse
A data lakehouse is, then, a data management architecture that combines the benefits of a traditional data warehouse and a data lake. It seeks to merge the ease of access and support for business analytics capabilities found in data warehouses with the flexibility and relatively low cost of the data lake.
Businesses face new demands in advanced analytics, a challenge that pushes organizations and people to give their best. The appearance of this new concept simply reinforces the clear need for evolution and continuous improvement.
It represents an important leap forward: a single architecture handles the processing of all types of data, both streaming and batch, and allows artificial intelligence models to be integrated to obtain complete reporting.
In other words, it represents the emergence of a new reference solution in Advanced Analytics.