#apache-spark #delta #deletion-vectors #Databricks #merge-on-write

📂 Copy-on-Write & Merge-on-Read

⁜ Copy-on-write (CoW) is a resource-management technique used by query engines (such as Apache Spark) to create modified copies of data efficiently and safely. In Spark, this is especially relevant because data frames are immutable, so every transformation produces a new one. A practical example is updating or deleting rows in a Parquet file: the engine rewrites all unmodified rows to a new file, together with the updated ones, and leaves the original file untouched.
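To make the copy-on-write idea concrete, here is a minimal sketch in plain Python (no Spark involved). It assumes a "file" is just an immutable tuple of rows; `cow_update` and the sample rows are hypothetical names used only for illustration.

```python
from typing import Callable


def cow_update(
    file: tuple,
    predicate: Callable[[dict], bool],
    change: Callable[[dict], dict],
) -> tuple:
    """Return a NEW file: unmodified rows are carried over, matching rows are replaced."""
    return tuple(change(row) if predicate(row) else row for row in file)


# The original "file" is never modified in place.
old_file = ({"id": 1, "qty": 10}, {"id": 2, "qty": 20})

# Updating a single row still produces a full new file.
new_file = cow_update(
    old_file,
    predicate=lambda row: row["id"] == 2,
    change=lambda row: {**row, "qty": 25},
)
```

Note that changing one row costs a rewrite of the whole file, which is the trade-off merge-on-read tries to avoid.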

⁜ Merge-on-Read (MoR), on the other hand, tracks changes in log files instead of rewriting the data files. It is then up to the engine to piece the table together at read time, combining the data files with the additional log files to produce the final state. Systems implementing this technique include Delta, Deletion Vectors and the famous Git.
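The read-time merge can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how any particular engine stores its logs: the `change_log` entries and the `read_merged` helper are hypothetical.

```python
# Base data file: written once and never touched again.
base_file = ({"id": 1, "qty": 10}, {"id": 2, "qty": 20}, {"id": 3, "qty": 30})

# Hypothetical change log: writes only append entries here.
change_log = [
    {"op": "delete", "id": 1},
    {"op": "upsert", "row": {"id": 2, "qty": 25}},
    {"op": "upsert", "row": {"id": 4, "qty": 40}},
]


def read_merged(base, log):
    """Rebuild the current table state by applying the log on top of the base rows."""
    state = {row["id"]: row for row in base}
    for entry in log:
        if entry["op"] == "delete":
            state.pop(entry["id"], None)
        elif entry["op"] == "upsert":
            state[entry["row"]["id"]] = entry["row"]
    return sorted(state.values(), key=lambda row: row["id"])


current = read_merged(base_file, change_log)
```

Writes are cheap (one log entry each), while every read pays the cost of replaying the log.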

🌊 Delta

Delta (delta.io) is an open-source storage format that powers lakehouses and data lakes. It offers ACID guarantees, change data feeds, streaming and batch unification, schema enforcement, and upserts & deletes, to name a few. Additionally, it’s query engine-agnostic. Since the most common use cases are in Apache Spark and, most likely, within the Databricks environment, we will focus on that.

As we saw earlier, Delta implements the merge-on-read pattern, which lets it provide ACID guarantees and maintain a versioned history of the table through log files. These log files are stored under the directory _delta_log. Each commit is reflected in the filesystem by two companion files, a JSON and a CRC, named after an incremental version number (zero-padded to 20 digits).

.
└── _delta_log/
    ├── 00000000000000000000.crc
    ├── 00000000000000000000.json
    └── ...
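The naming scheme above is easy to reproduce. The following sketch only formats file names from a version number; `commit_file_names` is a hypothetical helper, not part of any Delta library.

```python
def commit_file_names(version: int) -> tuple:
    """Build the two companion file names (JSON and CRC) for a commit version."""
    stem = f"{version:020d}"  # zero-padded to 20 digits
    return (f"{stem}.json", f"{stem}.crc")


first_commit = commit_file_names(0)
eleventh_commit = commit_file_names(10)
```

Because the names are zero-padded to a fixed width, sorting them lexicographically also sorts them by version.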

For more details regarding the Delta protocol, you should check ‣.

<aside> ☝

We will skip any discussions regarding vacuuming and optimizing the log files, z-order and liquid clustering, time travel, CDF, etc. That’s for another time!

</aside>

On Versions and States

As depicted below, a version is composed of multiple actions (e.g., update, delete, insert, etc.), which in turn determine the state (or snapshot) of the table: a view of the data as of that version.

Fig 1: Delta actions, versions and states.


The state, in turn, provides isolation: a reader process only fetches the data indicated by the last version committed when that process started, and concurrent writes stay isolated until they are committed (if successful).
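The relationship between actions, versions and states can be sketched as a replay over the log. This is a deliberately simplified model: commits are reduced to hypothetical `("add" | "remove", data_file)` pairs, whereas real commit files carry more information, and `snapshot` is an illustrative helper rather than a library call.

```python
# Simplified commits: version -> list of (action, data_file) pairs.
commits = {
    0: [("add", "part-a.parquet")],
    1: [("add", "part-b.parquet")],
    2: [("remove", "part-a.parquet"), ("add", "part-c.parquet")],
}


def snapshot(commits, version):
    """Replay all actions up to and including `version` to get the live data files."""
    live = set()
    for v in sorted(commits):
        if v > version:
            break
        for action, path in commits[v]:
            if action == "add":
                live.add(path)
            elif action == "remove":
                live.discard(path)
    return live


# A reader pinned to version 1 keeps its view even after version 2 is committed.
reader_view = snapshot(commits, 1)
latest_view = snapshot(commits, 2)
```

Pinning a reader to the version it started with is what keeps its view stable while later commits land.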

♻️ Deletion Vectors