In data engineering, and in the data space in general, it is standard to receive data in a remote object store such as an S3 bucket, trigger a pipeline to apply some processing, and perform quality checks, filtering, cleaning, and business logic. The pipeline then acts on these data, for example by exporting them into another system, such as another bucket.
I sought to explore how a data processing pipeline, or at least some parts of it, could be translated into Rust using some of the most prominent libraries, such as DataFusion and Tokio. For the first time, I also tried LocalStack as a local AWS development environment; pretty cool.
Some of the topics I was interested in exploring were:
The `main` function is the entry point for the Lambda function. It only does two things: (1) set up the logger and (2) execute and wait for the Lambda function handler.
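A minimal sketch of what this entry point could look like, assuming the `lambda_runtime` crate together with `tracing`/`tracing_subscriber` for logging (the handler name `function_handler` is an assumption on my part, not taken from the actual code):

```rust
use lambda_runtime::{run, service_fn, Error};

mod event_handlers; // handler module discussed below

#[tokio::main]
async fn main() -> Result<(), Error> {
    // 1. Set up the logger.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .without_time() // CloudWatch adds its own timestamps
        .init();

    // 2. Execute and wait for the Lambda handler.
    run(service_fn(event_handlers::function_handler)).await
}
```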
Subsequently, the Lambda function handler, defined in `event_handlers.rs`, simply filters for the specific S3 events that we are interested in intercepting. In our case, `ObjectCreated:Put` triggers the `pipeline` function whenever new files are added to the bucket.
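A handler along these lines could be sketched as follows, assuming the `aws_lambda_events` crate for the typed S3 event and a `pipeline(bucket, key)` entry point whose exact signature is an assumption, not the project's actual API:

```rust
use aws_lambda_events::event::s3::S3Event;
use lambda_runtime::{Error, LambdaEvent};

pub async fn function_handler(event: LambdaEvent<S3Event>) -> Result<(), Error> {
    for record in event.payload.records {
        // Only react to the event we care about: a new object being put into the bucket.
        if record.event_name.as_deref() == Some("ObjectCreated:Put") {
            let bucket = record.s3.bucket.name.unwrap_or_default();
            let key = record.s3.object.key.unwrap_or_default();
            tracing::info!("new object s3://{bucket}/{key}, starting pipeline");
            // Hand over to the pipeline function (sketched further below).
            pipeline(&bucket, &key).await?;
        }
    }
    Ok(())
}
```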
The `pipeline` function contains all the logic to set up the environment and the actual discrete processing steps. Specifically, it (1) prepares the DataFusion session and passes along some configuration settings regarding execution and filtering. Additionally, it configures and registers the relevant S3 buckets, which are needed for reading and writing later on. Once everything in step (1) has passed properly, (2) the actual data processing happens. In our minute example, we just query the tables using standard SQL and write them back into another bucket, partitioned by `Partition_Date`, as Parquet files. In between, we create a metric that contains the result of `COUNT(*)` and send it to CloudWatch.
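To make the shape of this more concrete, here is a minimal sketch of such a `pipeline` function using DataFusion's `object_store` integration and the AWS SDK for CloudWatch. The bucket names, LocalStack endpoint, credentials, table and metric names, the `send_row_count_metric` helper, and the function signature are all illustrative assumptions rather than the project's actual code, and the partitioned Parquet write assumes a recent DataFusion version that exposes `with_partition_by` on `DataFrameWriteOptions`.

```rust
use std::sync::Arc;

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;
use lambda_runtime::Error;
use object_store::aws::AmazonS3Builder;
use url::Url;

/// Illustrative output bucket; a real pipeline would read this from configuration.
const OUTPUT_BUCKET: &str = "processed-data";

pub async fn pipeline(source_bucket: &str, source_key: &str) -> Result<(), Error> {
    // (1a) Prepare the DataFusion session; execution and filtering settings would go here.
    let ctx = SessionContext::new();

    // (1b) Configure and register the S3 buckets so DataFusion can read from and write to them.
    //      Endpoint and credentials point at LocalStack for local development.
    for bucket in [source_bucket, OUTPUT_BUCKET] {
        let store = AmazonS3Builder::new()
            .with_bucket_name(bucket)
            .with_region("us-east-1")
            .with_endpoint("http://localhost:4566") // LocalStack
            .with_allow_http(true)
            .with_access_key_id("test")
            .with_secret_access_key("test")
            .build()?;
        ctx.register_object_store(&Url::parse(&format!("s3://{bucket}"))?, Arc::new(store));
    }

    // (2) The actual processing: register the newly arrived file and query it with plain SQL.
    ctx.register_parquet(
        "events",
        &format!("s3://{source_bucket}/{source_key}"),
        ParquetReadOptions::default(),
    )
    .await?;
    let df = ctx.sql("SELECT * FROM events").await?;

    // In between, compute the row count and ship it to CloudWatch as a metric.
    let row_count = df.clone().count().await?;
    send_row_count_metric(row_count as f64).await?;

    // Write the result back as Parquet, partitioned by Partition_Date.
    df.write_parquet(
        &format!("s3://{OUTPUT_BUCKET}/events/"),
        DataFrameWriteOptions::new().with_partition_by(vec!["Partition_Date".to_string()]),
        None,
    )
    .await?;

    Ok(())
}

/// Send a single COUNT(*)-style metric to CloudWatch using the AWS SDK.
async fn send_row_count_metric(row_count: f64) -> Result<(), Error> {
    use aws_sdk_cloudwatch::types::MetricDatum;

    let config = aws_config::load_from_env().await;
    let client = aws_sdk_cloudwatch::Client::new(&config);

    let datum = MetricDatum::builder()
        .metric_name("row_count") // illustrative metric name
        .value(row_count)
        .build();

    client
        .put_metric_data()
        .namespace("data-pipeline") // illustrative namespace
        .metric_data(datum)
        .send()
        .await?;

    Ok(())
}
```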
Easy!
The following steps are required to run things. It's up to interested readers to decide how to install and manage the packages.