In data engineering, or in the data space in general, a standard pattern is to receive data in a remote object store, such as an S3 bucket, trigger a pipeline that applies some processing, performs quality checks, filters and cleans the data, and applies business logic. Subsequently, the pipeline acts on these data, for example by exporting them into another system, e.g., another bucket.

I sought to explore how a data processing pipeline, or at least some parts of it, could translate into Rust using some of the most prominent libraries, such as Datafusion and Tokio. For the first time, I also tried Localstack as a local AWS development environment; pretty cool 👍.

Some of the topics I was interested in exploring were:

On the code

The main function is the entry point for the lambda function. It only does two things: ➊ sets up the logger and ➋ executes and waits for the lambda function.
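A minimal sketch of what such an entry point might look like, assuming the lambda_runtime, tokio, tracing, and tracing-subscriber crates, and a handler named function_handler living in event_handlers.rs (the handler name here is illustrative, not necessarily the one in the repo):

```rust
use lambda_runtime::{run, service_fn, Error};

mod event_handlers;
use event_handlers::function_handler;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // ➊ Set up the logger; structured output plays nicely with CloudWatch Logs.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .with_target(false)
        .without_time() // Lambda already timestamps each log line.
        .init();

    // ➋ Execute and wait for the lambda handler.
    run(service_fn(function_handler)).await
}
```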

Subsequently, the lambda function handler, defined in event_handlers.rs, simply filters for the specific S3 events that we are interested in intercepting. In our case, ObjectCreated:Put triggers the pipeline function whenever new files are added to the bucket.
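A sketch of such a handler, assuming the aws_lambda_events crate for the S3 event types; the run_pipeline name and signature are assumptions for illustration, not the post's actual code:

```rust
use aws_lambda_events::event::s3::S3Event;
use lambda_runtime::{Error, LambdaEvent};

/// Handle the incoming S3 notification and trigger the pipeline
/// only for the events we care about.
pub async fn function_handler(event: LambdaEvent<S3Event>) -> Result<(), Error> {
    for record in event.payload.records {
        // Only react to new objects being put into the bucket.
        if record.event_name.as_deref() == Some("ObjectCreated:Put") {
            let bucket = record.s3.bucket.name.unwrap_or_default();
            let key = record.s3.object.key.unwrap_or_default();
            tracing::info!(%bucket, %key, "new object detected, running pipeline");
            run_pipeline(&bucket, &key).await?;
        }
    }
    Ok(())
}

// Hypothetical pipeline entry point; see the next section for what it does.
async fn run_pipeline(_bucket: &str, _key: &str) -> Result<(), Error> {
    Ok(())
}
```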

The pipeline function contains all the logic to set up the environment and the actual discrete processing steps. Specifically, it ➊ prepares the Datafusion session and passes along some configuration settings regarding execution and filtering. Additionally, it configures and registers the relevant S3 buckets, which are needed for reading from and writing to the buckets later on. As soon as everything passes properly, in step ➋ the actual data processing happens. In our minimal example, we just query the tables using standard SQL and write the result back into another bucket, partitioned by Partition_Date, as Parquet files. In between, we create a metric that contains the result of COUNT(*) and send it to CloudWatch.
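A condensed sketch of what such a pipeline function could look like, assuming the datafusion, object_store, url, aws-config, and aws-sdk-cloudwatch crates. The bucket names, table name, metric namespace, and query are illustrative, and some details (e.g. with_partition_by on DataFrameWriteOptions, new_with_config) depend on the DataFusion version:

```rust
use std::sync::Arc;

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;
use object_store::aws::AmazonS3Builder;
use url::Url;

pub async fn run_pipeline(
    bucket: &str,
    key: &str,
) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // ➊ Prepare the Datafusion session; execution and filtering settings go here.
    let config = SessionConfig::new()
        .with_target_partitions(4)
        .with_parquet_pruning(true);
    let ctx = SessionContext::new_with_config(config);

    // Register the input and output buckets as object stores.
    for name in [bucket, "output-bucket"] {
        let store = AmazonS3Builder::from_env().with_bucket_name(name).build()?;
        ctx.register_object_store(&Url::parse(&format!("s3://{name}"))?, Arc::new(store));
    }

    // Register the newly arrived file as a queryable table.
    ctx.register_parquet(
        "events",
        &format!("s3://{bucket}/{key}"),
        ParquetReadOptions::default(),
    )
    .await?;

    // ➋ The actual processing: plain SQL, then write back as partitioned Parquet.
    let df = ctx.sql("SELECT * FROM events").await?;

    // Count the rows and push the number to CloudWatch as a custom metric.
    let row_count = df.clone().count().await? as f64;
    let aws_cfg = aws_config::load_from_env().await;
    let cw = aws_sdk_cloudwatch::Client::new(&aws_cfg);
    cw.put_metric_data()
        .namespace("DataPipeline")
        .metric_data(
            aws_sdk_cloudwatch::types::MetricDatum::builder()
                .metric_name("RowCount")
                .value(row_count)
                .build(),
        )
        .send()
        .await?;

    // Write the result into the other bucket, partitioned by Partition_Date.
    df.write_parquet(
        "s3://output-bucket/processed/",
        DataFrameWriteOptions::new().with_partition_by(vec!["Partition_Date".to_string()]),
        None,
    )
    .await?;

    Ok(())
}
```

Registering the object stores up front is what lets plain `s3://...` URLs work in both register_parquet and write_parquet; with Localstack, the same code runs against a locally emulated S3 endpoint configured via environment variables.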

Easy 🫰

Running things

The following steps are required to run things (it's up to interested readers to decide how to install and manage the packages):