In data engineering, and in the data space in general, it is standard to receive data in a remote object store such as an S3 bucket, trigger a pipeline to apply some processing, and perform quality checks, filtering, cleaning, and business logic. The pipeline then acts on these data, for example by exporting them into another system, such as another bucket.
I sought to explore how a data processing pipeline, or at least some parts of it, could be translated into Rust using some of the most prominent libraries, such as DataFusion and Tokio. For the first time, I also tried LocalStack as a local AWS development environment; pretty cool.
Some of the topics I was interested in exploring were:
The `main` function is the entry point for the Lambda function. It only does two things: (1) set up the logger and (2) execute and wait for the Lambda function handler.
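A minimal sketch of what this entry point could look like, assuming the `lambda_runtime` crate together with `tracing`/`tracing_subscriber` for logging (the handler name `function_handler` is an assumption on my part, not taken from the actual code):

```rust
use lambda_runtime::{run, service_fn, Error};

mod event_handlers; // handler module discussed below

#[tokio::main]
async fn main() -> Result<(), Error> {
    // 1. Set up the logger.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .without_time() // CloudWatch adds its own timestamps
        .init();

    // 2. Execute and wait for the Lambda handler.
    run(service_fn(event_handlers::function_handler)).await
}
```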
Subsequently, the Lambda function handler, defined in `event_handlers.rs`, simply filters for the specific S3 events that we are interested in intercepting. In our case, `ObjectCreated:Put` triggers the `pipeline` function whenever new files are added to the bucket.
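A handler along these lines could be sketched as follows, assuming the `aws_lambda_events` crate for the typed S3 event and a `pipeline(bucket, key)` entry point whose exact signature is an assumption, not the project's actual API:

```rust
use aws_lambda_events::event::s3::S3Event;
use lambda_runtime::{Error, LambdaEvent};

pub async fn function_handler(event: LambdaEvent<S3Event>) -> Result<(), Error> {
    for record in event.payload.records {
        // Only react to the event we care about: a new object being put into the bucket.
        if record.event_name.as_deref() == Some("ObjectCreated:Put") {
            let bucket = record.s3.bucket.name.unwrap_or_default();
            let key = record.s3.object.key.unwrap_or_default();
            tracing::info!("new object s3://{bucket}/{key}, starting pipeline");
            // Hand over to the pipeline function (sketched further below).
            pipeline(&bucket, &key).await?;
        }
    }
    Ok(())
}
```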
The `pipeline` function contains all the logic to set up the environment and the actual discrete processing steps. Specifically, it (1) prepares the DataFusion session and passes along some configuration settings regarding execution and filtering. Additionally, it configures and registers the relevant S3 buckets, which are needed for reading and writing later on. Once everything in step (1) has passed properly, (2) the actual data processing happens. In our minute example, we just query the tables using standard SQL and write them back into another bucket, partitioned by `Partition_Date`, as Parquet files. In between, we create a metric that contains the result of `COUNT(*)` and send it to CloudWatch.
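To make the shape of this more concrete, here is a minimal sketch of such a `pipeline` function using DataFusion's `object_store` integration and the AWS SDK for CloudWatch. The bucket names, LocalStack endpoint, credentials, table and metric names, the `send_row_count_metric` helper, and the function signature are all illustrative assumptions rather than the project's actual code, and the partitioned Parquet write assumes a recent DataFusion version that exposes `with_partition_by` on `DataFrameWriteOptions`.

```rust
use std::sync::Arc;

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;
use lambda_runtime::Error;
use object_store::aws::AmazonS3Builder;
use url::Url;

/// Illustrative output bucket; a real pipeline would read this from configuration.
const OUTPUT_BUCKET: &str = "processed-data";

pub async fn pipeline(source_bucket: &str, source_key: &str) -> Result<(), Error> {
    // (1a) Prepare the DataFusion session; execution and filtering settings would go here.
    let ctx = SessionContext::new();

    // (1b) Configure and register the S3 buckets so DataFusion can read from and write to them.
    //      Endpoint and credentials point at LocalStack for local development.
    for bucket in [source_bucket, OUTPUT_BUCKET] {
        let store = AmazonS3Builder::new()
            .with_bucket_name(bucket)
            .with_region("us-east-1")
            .with_endpoint("http://localhost:4566") // LocalStack
            .with_allow_http(true)
            .with_access_key_id("test")
            .with_secret_access_key("test")
            .build()?;
        ctx.register_object_store(&Url::parse(&format!("s3://{bucket}"))?, Arc::new(store));
    }

    // (2) The actual processing: register the newly arrived file and query it with plain SQL.
    ctx.register_parquet(
        "events",
        &format!("s3://{source_bucket}/{source_key}"),
        ParquetReadOptions::default(),
    )
    .await?;
    let df = ctx.sql("SELECT * FROM events").await?;

    // In between, compute the row count and ship it to CloudWatch as a metric.
    let row_count = df.clone().count().await?;
    send_row_count_metric(row_count as f64).await?;

    // Write the result back as Parquet, partitioned by Partition_Date.
    df.write_parquet(
        &format!("s3://{OUTPUT_BUCKET}/events/"),
        DataFrameWriteOptions::new().with_partition_by(vec!["Partition_Date".to_string()]),
        None,
    )
    .await?;

    Ok(())
}

/// Send a single COUNT(*)-style metric to CloudWatch using the AWS SDK.
async fn send_row_count_metric(row_count: f64) -> Result<(), Error> {
    use aws_sdk_cloudwatch::types::MetricDatum;

    let config = aws_config::load_from_env().await;
    let client = aws_sdk_cloudwatch::Client::new(&config);

    let datum = MetricDatum::builder()
        .metric_name("row_count") // illustrative metric name
        .value(row_count)
        .build();

    client
        .put_metric_data()
        .namespace("data-pipeline") // illustrative namespace
        .metric_data(datum)
        .send()
        .await?;

    Ok(())
}
```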
Easy!
The following steps are required to run things. It's up to interested readers to decide how to install and manage the packages.