#apache-spark #delta #deletion-vectors #Databricks #merge-on-write
**Copy-on-write (CoW)** is a resource-management technique used by query engines (such as Apache Spark) to create modified copies of data efficiently and safely. In Spark's case this is especially relevant, since we work with immutable data and always create new DataFrames. A practical example is working with Parquet files and performing some operation (a filter, an update, etc.): the engine rewrites all unmodified rows to a new file and appends the updated ones.
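To make this concrete, here is a minimal PySpark sketch (the paths and column names are hypothetical): an "update" against immutable Parquet data is expressed by reading the files, deriving the modified rows, and writing the complete result back out to a new location.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: "update" the price of a single product.
# Parquet files are immutable, so rows cannot be modified in place;
# every row (modified or not) is rewritten to a new set of files.
orders = spark.read.parquet("/data/orders")

updated = orders.withColumn(
    "price",
    F.when(F.col("product_id") == 42, F.lit(9.99)).otherwise(F.col("price")),
)

# The full result -- untouched rows plus the updated ones -- is written out again.
updated.write.mode("overwrite").parquet("/data/orders_v2")
```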
**Merge-on-Read (MoR)**, on the other hand, tracks changes in log files instead of rewriting the data files. It is then up to the engine to piece the table together by reading both the data files and the additional log files to produce the final state. Other systems built on this idea include Delta with Deletion Vectors and, famously, Git.
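As a concrete example, the deletion-vectors feature in newer Delta versions applies this idea to deletes: instead of rewriting every affected Parquet file, a small "rows to skip" file is written and reconciled at read time. A minimal sketch (the table name `events` is hypothetical, and `spark` is assumed to be the active SparkSession, as in a Databricks notebook):

```python
# Sketch: enable deletion vectors on an existing Delta table (table name is hypothetical).
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# With deletion vectors enabled, this DELETE records which rows to skip
# (merge-on-read) instead of rewriting the Parquet files that contain them;
# readers reconcile the data files with the deletion vectors at query time.
spark.sql("DELETE FROM events WHERE event_type = 'bot'")
```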
Delta (delta.io) is an open-source storage format that powers lakehouses and data lakes. It offers ACID guarantees, change data feeds, streaming and batch unification, schema enforcement, and upserts and deletes, to name a few. Additionally, it's query-engine-agnostic. Since the most common use cases are in Apache Spark, most likely within the Databricks environment, we will focus on that.
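As an illustration of the upsert capability, here is a minimal sketch using the `delta` Python package (the table path, join key, and sample data are assumptions, and `spark` is the active SparkSession):

```python
from delta.tables import DeltaTable

# Hypothetical target Delta table stored at this path, keyed by "id".
target = DeltaTable.forPath(spark, "/data/customers")

updates = spark.createDataFrame(
    [(1, "alice@new.example"), (3, "carol@example.com")],
    ["id", "email"],
)

# Upsert: update matching rows, insert the rest. Delta records the outcome
# as a new version in the transaction log rather than mutating files in place.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```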
As we saw earlier, Delta implements the merge-on-read pattern, which allows it to support ACID guarantees and maintain a versioned history of the table through log files. These log files are stored under the `_delta_log` directory. Each commit is reflected in the filesystem by two companion files, a JSON and a CRC, named after an incremental version number (zero-padded to 20 digits).
```
.
└── _delta_log/
    ├── 00000000000000000000.crc
    ├── 00000000000000000000.json
    └── ...
```
JSON
Each log file is stored in a human-readable format: JSON. Each transaction (an atomic set of actions that modify the state of the table) is represented by one unique JSON file. Each file contains a number of actions, such as:
| Action | Description |
|---|---|
| `metaData` | Denotes changes to the metadata of the table. The most important (and required) field is the ID, which uniquely identifies the table on which all actions are performed. Other fields include `format` (the underlying file format and specification, i.e. Parquet), `schemaString` (the schema of the table), `partitionColumns` (an array of the columns used to partition the table), and `configuration` (a map containing any additional configuration options for the table's metadata). |
| `add` & `remove` | Both actions indicate modifications to the table, by adding or removing individual logical files respectively. |
| `commitInfo` | This action is optional and can contain additional runtime information. Databricks automatically populates it with details regarding the `user`, `cluster`, `job`, and `operation`. |

CRC
As the name implies, each CRC file stores a CRC32 checksum that ensures the integrity of its corresponding log file.
For more details regarding the Delta protocol, you should check the official protocol specification.
<aside>
We will skip any discussions regarding vacuuming and optimizing the log files, z-order and liquid clustering, time travel, CDF, etc. That's for another time!
</aside>
As depicted below, a `version` is composed of multiple actions (e.g., update, delete, insert), which in turn inform the `state` (or `snapshot`) of the table: a view of the data with respect to a `version`.
Fig 1 - Delta actions, versions and states.
Subsequently, the `state` allows isolation; any `reader` process will only fetch data indicated by the last committed `version` when that process starts, and concurrent `writes` are isolated until committed (if successful).
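One way to observe versions and their actions in practice is the table history, which returns one row per committed transaction, including the `commitInfo` details mentioned above. A minimal sketch, assuming the same hypothetical table path:

```python
from delta.tables import DeltaTable

# Each committed transaction appears as one row: its version number, timestamp,
# operation (WRITE, MERGE, DELETE, ...) and, on Databricks, user/cluster/job details.
history = DeltaTable.forPath(spark, "/data/customers").history()
history.select("version", "timestamp", "operation", "operationParameters").show(truncate=False)
```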