UniForm: Peace to Delta Lake + Iceberg + Hudi
In case you haven’t been paying attention the last few years, there is a war being waged… a war over which open table format will come out on top: Delta Lake, Iceberg, or Hudi. These formats offer similar yet distinct features for managing and scaling big data lakes, so the choice of which one to use has been a hot topic in the data engineering community for quite some time.
Organizations looking to maximize the value and scalability of their data lakes are the ones who draw the short straw. No one wants to be torn between multiple tools, but companies also want to ensure the choice they make doesn’t become obsolete or turn out to be wrong, forcing a complex migration down the road. Fortunately, the table format wars can finally be resolved in peace thanks to UniForm by Databricks.
Background: Iceberg vs. Delta Lake
Delta Lake and Iceberg have been the dominant formats in this field for a while, so we’ll focus on these:
Delta Lake
Developed by Databricks, open-sourced in April 2019.
Offers ACID transactions for data integrity and consistency guarantees.
Maintains metadata to support scalable operations and versioning, distinct from the data files, which use the Parquet format.
Versions data and enables time travel queries to view historical data easily.
Downloaded more than 20 million times per month as of this writing.
Delta Kernel is a set of libraries implementing the core Delta Lake logic, allowing users to operate on Delta Lake tables from almost any language or engine (Java, Python, C++, Rust, Spark, Trino, Pandas, Polars, DuckDB, etc.) without re-implementing the core behaviors.
Project home page: https://delta.io/
Iceberg
Developed by Netflix, donated to Apache Software Foundation in 2018.
Offers ACID transactions for data integrity and consistency guarantees.
Also maintains metadata and data files separately, enabling scalable operations on large tables.
Versions data and enables time travel to view historical data. Also borrows popular concepts from version control (e.g. Git) such as cherry-picking, branches/tags, and merges.
Supports a wide range of engines (Spark, Trino, Flink, Amazon Athena, etc.) and maintains most of these integrations within the main Iceberg repository itself, with a few exceptions such as Trino and Presto. See https://iceberg.apache.org/multi-engine-support/
Project home page: https://iceberg.apache.org/
For years these formats have competed for dominance in the data lake ecosystem. Businesses often had to choose between them, betting on the format and the set of languages and engines deemed better than the rest (or at least lower effort given the organization’s existing tech stack). This division created challenges in interoperability, data governance, subject matter expertise, and overall data lake management.
UniForm, our savior 🙌
Enter UniForm (Universal Format), a feature offered by Databricks to bring peace. UniForm aims to unify interoperability across the Delta Lake, Apache Iceberg, and Apache Hudi formats by capitalizing on the fact that all three share the same fundamental trait: Parquet data files combined with format-specific metadata. Because the metadata is the only real difference, UniForm resolves the format wars by automatically and asynchronously generating the metadata for all of the formats (Delta, Iceberg, Hudi), or only the ones you wish to enable in the event that you only care about two of the three.
It’s simple: when creating a table in Databricks, you just set a property to enable the formats you want metadata for:
CREATE TABLE main.sales.skus (sku_name STRING, inserted_at TIMESTAMP)
TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg,hudi');
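Once the table exists, you can confirm what was enabled by inspecting the table properties with standard Spark SQL commands. This is a sketch; the exact properties and output fields surfaced vary by Databricks Runtime version:

```sql
-- List the table properties; the UniForm setting should appear among them
SHOW TBLPROPERTIES main.sales.skus;

-- DESCRIBE EXTENDED also surfaces table metadata, which on recent runtimes
-- includes details about the generated Iceberg/Hudi metadata
DESCRIBE EXTENDED main.sales.skus;
```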
Now, for engines that may support only Iceberg and not Delta Lake, for example, you can access this table as if it were just an Iceberg table (and vice versa). You can also begin using UniForm on existing tables; the property does not have to be set at the time of creation:
ALTER TABLE table_name SET TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg');
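To illustrate the payoff, here is a sketch of reading the same table as plain Iceberg from an external Spark cluster via Iceberg’s REST catalog support. The catalog name `uc`, the endpoint URL, and the token below are hypothetical placeholders; the exact endpoint path and authentication mechanism depend on your workspace, so consult the Databricks documentation before wiring this up:

```properties
# Hypothetical spark-defaults.conf entries pointing Spark's built-in
# Iceberg catalog support at a REST catalog endpoint (placeholder values)
spark.sql.catalog.uc        org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.uc.type   rest
spark.sql.catalog.uc.uri    https://<workspace-url>/api/2.1/unity-catalog/iceberg
spark.sql.catalog.uc.token  <personal-access-token>
```

With that in place, the cluster needs no Delta Lake support at all to query the table:

```sql
-- Read the UniForm table as ordinary Iceberg from the external engine
SELECT sku_name, inserted_at FROM uc.sales.skus;
```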
Databricks acquires Tabular
As a matter of fact, this is old news: Databricks began offering UniForm in 2023. So what’s all the buzz? Aren’t we at peace already? Well, no, not exactly. These changes take time, from both a technology and a community perspective. From the community perspective, a new feature takes time to recognize and adopt, even though it is incredibly easy to get started with. From the technology perspective, it is very challenging and time-consuming for a single project (Delta Lake) to reconcile all the ongoing changes and features from the other two formats, which in niche cases can create a parity gap where your faux Iceberg table does not integrate successfully.
On June 4, 2024, Databricks announced it had agreed to acquire Tabular, a data management company founded by Ryan Blue, Daniel Weeks, and Jason Reid. Blue and Weeks were among the original creators of Apache Iceberg at Netflix. According to Databricks in the press release:
> Databricks intends to work closely with the Delta Lake and Iceberg communities to bring format compatibility to the lakehouse; in the short term, inside Delta Lake UniForm and in the long term, by evolving toward a single, open, and common standard of interoperability. Databricks and Tabular will work together towards a joint vision of the open lakehouse.

Credit: Databricks Press Release
Furthermore, Databricks CEO Ali Ghodsi, along with Ryan Blue, shared at the 2024 Data + AI Summit keynote that this strategic decision is intended specifically to combine technical subject matter expertise across the two leading formats, Delta Lake and Iceberg, so that we can expect interoperability within UniForm to become more and more seamless.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8984d486-a4b6-41e8-a560-f4477e20a94c_1024x596.png)
For this, these data technology giants have my respect, as I believe the long-term value is openness and maximum flexibility for all. What a breath of fresh air after countless hours of the usual one-sided marketing pitches that lead to vendor lock-in.