Delta Lake: A Game-Changing Update in Version 4.0 with Coordinated Commits

Stefan Graf
Jun 18, 2024



Delta, one of the leading Lakehouse table formats, will receive a really big update with its new version 4.0. Coordinated Commits will help your solutions overcome one of the big shortcomings of data pipelines built with Delta as the underlying data format.

You might ask: what is the big deal? We already have commits with Delta, so why should I care whether I can coordinate them? Let me explain why I think this will be one of the most valuable additions to the Delta format in a long time.

Note: I will not go into technical details, because this feature is still in early preview and will probably change. Instead, I will talk about the idea behind it, which should not change anymore.

The current limitations of Delta commits

Being able to apply commit logic was truly groundbreaking for the Delta format. It essentially kickstarted the development of Lakehouses by providing ACID guarantees. This, among other innovations, elevated data-lake-based systems into competitors of Data Warehouses. Think of Databricks, for example: once their Lakehouse solution was in place, they targeted “classic” Data Warehouses as competitors to gain a bigger market share.

But getting back to commits: there is one big limitation of Delta commits. They are always scoped to a single table. You cannot create transactions that cover changes to multiple tables.

Why is this a problem? It means you can change your tables and end up in an inconsistent state. If you have, for example, a customers and an orders table, your model might look like this:

Example data model: an orders table that references the customers table through a customer_id column.
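To make the model concrete, here is a minimal PySpark sketch that creates the two tables as Delta tables. The table names, columns, and values are purely illustrative, and it assumes a Spark session that is already configured for Delta Lake.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parent table: one row per customer.
spark.createDataFrame(
    [(42, "Example Corp")], ["customer_id", "name"]
).write.format("delta").saveAsTable("customers")

# Child table: every order references a customer via customer_id.
spark.createDataFrame(
    [(1, 42), (2, 42)], ["order_id", "customer_id"]
).write.format("delta").saveAsTable("orders")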

If you delete a customer entry that is still referenced by the orders table, things get tricky when the corresponding orders task fails. You end up in the following situation:

  • There is a new commit to customers that includes the deletion of one customer
  • The orders table has not been updated and still points to a customer_id that no longer exists

This means your data is no longer consistent.
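As a minimal sketch of this failure mode, using the illustrative customers and orders tables from above: the two deletes below are two independent single-table commits, so a crash between them leaves orders pointing at a customer that is already gone.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = DeltaTable.forName(spark, "customers")
orders = DeltaTable.forName(spark, "orders")

# Commit 1: delete the customer. This commit is immediately visible to every reader.
customers.delete("customer_id = 42")

# Commit 2: delete the matching orders. If the job dies before this line
# commits, the orders table still references customer_id 42.
orders.delete("customer_id = 42")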

Now there are several workarounds for this scenario:

  • Build a DAG that first runs the INSERTs/UPDATEs for both tables in the order customers -> orders, and then runs the DELETEs in the reverse order, orders -> customers
  • Use staging schemas, so that the consumer-facing layer is not corrupted in case of an error
  • Implement rollback logic that also covers this problem (sketched below)

All of these approaches get you back to a consistent data state, but they mean additional effort and are a potential source of errors.
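To illustrate the last workaround, here is a rough sketch of a rollback, again using the illustrative tables from above. It is not a transaction: it only restores a consistent state after the fact by rolling the customers table back to the version recorded before the run.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = DeltaTable.forName(spark, "customers")
orders = DeltaTable.forName(spark, "orders")

# Remember the last committed version of customers before touching anything.
start_version = customers.history(1).select("version").first()[0]

try:
    customers.delete("customer_id = 42")
    orders.delete("customer_id = 42")
except Exception:
    # The orders step failed: roll customers back to the recorded version.
    # Note that readers may still have seen the intermediate state in between.
    customers.restoreToVersion(start_version)
    raise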

The solution we already know from the DWH world

The classic solution to these challenges in the Data Warehousing world is to use transactions that cover the full process across the two dependent tables. You start a transaction, update/insert/delete what you need in the customers table, and afterwards do the same for the orders table.

  • If the orders job fails again, as in the example above, the transaction fails and nothing is committed, so there is no corrupted state.
  • If everything runs successfully, there is a coordinated commit for both tables and you end up with an updated, consistent data state.

This leads to consistent data with minimal extra effort.
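For reference, this is roughly what that looks like in classic relational terms. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for a warehouse engine: both deletes run inside one transaction, so either both of them commit or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (42, 'Example Corp');
    INSERT INTO orders VALUES (1, 42);
""")

try:
    with conn:  # one transaction: commits on success, rolls back on any error
        conn.execute("DELETE FROM orders WHERE customer_id = ?", (42,))
        conn.execute("DELETE FROM customers WHERE customer_id = ?", (42,))
except sqlite3.Error:
    # Nothing was committed, so both tables are still in their previous,
    # consistent state.
    pass

This all-or-nothing behavior across tables is what Coordinated Commits aims to bring to the Delta format.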

Conclusion

I’m really excited about this announcement of Coordinated Commits; it will provide an additional tool to deliver high-quality data inside our Lakehouses. We will have to see how the technical implementation works out once the feature leaves the early preview, but it is great to see that there are still many new features coming to the Delta format!

Sources
Delta Lake 4.0 Preview | Delta Lake


Written by Stefan Graf

Data Engineer Consultant @Microsoft — Data and Cloud Enthusiast
