#streaming #deployment #spark #databricks #kafka

What the stream?

Recently, I have developed an interest in deploying streaming jobs (primarily on Spark, but anything else works, too) on Databricks and/or Kubernetes with no downtime.

There are numerous reasons one might want to follow such a deployment pattern: rolling out new features, fixing bugs, some A/B testing (especially if ML is involved), etc.

These deployment strategies make it challenging to maintain data consistency and manage state during the transition, which is particularly critical for streaming applications. Various techniques, such as checkpointing, state migration, and distributed caching, can be employed to address some of these challenges.

Keeping track of data, in data

When working with streaming pipelines, a pretty standard and nice method of keeping track of the pipeline version (as in, which Spark job wrote which rows) is to write this information in the data (rows) themselves. Additionally, this would be extremely useful when performing continuous/frequent upserts.

| Column 1 | Column 2 | LastUpdatedBy | LastUpdatedAt |
|----------|----------|---------------|---------------|
| 0        |          | job-v1        |               |
| 1        |          | job-v2        |               |
| 2        |          | job-v3_feat_x |               |
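
A minimal sketch of how this could look with PySpark and Delta on Databricks; the table paths, checkpoint location, and `JOB_VERSION` value are hypothetical placeholders, and for true upserts one would typically swap the plain append for a `foreachBatch` MERGE.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

JOB_VERSION = "job-v2"  # hypothetical: injected by the deployment / CI config

# Stamp every row with the job version that wrote it, plus a timestamp.
stamped = (
    spark.readStream
    .format("delta")
    .load("/mnt/bronze/events")                      # hypothetical source table
    .withColumn("LastUpdatedBy", F.lit(JOB_VERSION))
    .withColumn("LastUpdatedAt", F.current_timestamp())
)

(
    stamped.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events-job-v2")  # hypothetical path
    .outputMode("append")
    .start("/mnt/silver/events")                     # hypothetical sink table
)
```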

Another solution would be to keep a dictionary/lookup-like table with more details regarding our jobs, features, and configuration settings, and then simply associate rows with it via IDs/UUIDs of some sort. A lookup table provides more flexibility and scalability, especially when dealing with complex job configurations or frequent updates to job metadata. It also makes job-related data easier to query and analyze, which can be particularly useful for monitoring and optimization purposes.
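
A rough sketch of what such a lookup table might look like; the schema, table names, and descriptions here are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical lookup table: one row per job version, keyed by a compact job_id.
job_metadata = spark.createDataFrame(
    [
        ("a1", "job-v1", "baseline pipeline", "{}"),
        ("a2", "job-v2", "bugfix release", "{}"),
        ("a3", "job-v3_feat_x", "feature X rollout", '{"featureX": true}'),
    ],
    ["job_id", "job_name", "description", "config_json"],
)
job_metadata.write.format("delta").mode("overwrite").saveAsTable("meta.jobs")  # hypothetical table

# Data rows then only carry job_id; joining back gives the full job details for analysis.
events = spark.table("silver.events")  # hypothetical data table stamped with job_id
events.join(job_metadata, "job_id").groupBy("job_name").count().show()
```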

On Streaming

In streaming scenarios, a pretty common pattern is to employ a streaming job that targets specific partitions (or perhaps topics in the streaming layer) based on some criterion, e.g. the most recent data within a date range (a week, a year, etc.).
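
For instance, a partition-targeting filter could look roughly like this in PySpark; the path, partition column, and one-week window are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# This job instance only claims the most recent week of data; an older deployment
# could keep serving the rest.
recent = (
    spark.readStream
    .format("delta")
    .load("/mnt/bronze/events")                                   # partitioned by event_date (hypothetical)
    .filter(F.col("event_date") >= F.date_sub(F.current_date(), 7))
)
```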

Another approach, if the context allows, would be to have bucketed data, e.g. based on IDs; in this case, the jobs target different bucket ranges. Some buckets could correspond to high-volume data, while others could hold the data of less important, or individual, customers.
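
A bucket-range variant might be sketched as follows; the bucket count, ranges, customer_id column, and path are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

NUM_BUCKETS = 16      # hypothetical bucket count
MY_BUCKETS = (0, 7)   # this deployment owns buckets 0-7; another one takes 8-15

bucketed = (
    spark.readStream
    .format("delta")
    .load("/mnt/bronze/events")                                        # hypothetical source
    .withColumn("bucket", F.abs(F.hash(F.col("customer_id"))) % NUM_BUCKETS)
    .filter(F.col("bucket").between(*MY_BUCKETS))
)
```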

![[Untitled-2023-12-13-1532.png]]

In any case, this approach allows the deployment of multiple streaming jobs with different features while targeting different data (partitions/keys/etc.). Of course, it is important to note that all streaming jobs write data into DIFFERENT partitions so as to avoid merge conflicts 😉.

Of course, these partitions are just for demonstrative purposes; just as well, one could be reading from different topics or using a consumer group, although this would require additional logic to differentiate which messages to process with which job. The most common strategies include ❶ key-based routing (on the Kafka partition key), ❷ value-based routing (on a field in the message), and finally ❸ versioned topics, i.e. publishing data to separate topics to split the traffic for our downstream Spark jobs. If needed, a combination of these strategies can be employed in tandem.
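
A rough sketch of ❶ and ❷ with Structured Streaming reading from Kafka; the broker address, topic names, key prefix, and `pipeline_version` field are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream
    .format("kafka")                                        # requires the spark-sql-kafka connector
    .option("kafka.bootstrap.servers", "broker:9092")       # hypothetical broker
    .option("subscribe", "events")                          # or e.g. "events.v2" for a versioned topic (❸)
    .load()
)

# ❶ key-based: this job only processes messages whose Kafka key matches its slice.
keyed = raw.filter(F.col("key").cast("string").startswith("tenant-a"))

# ❷ value-based: route on a (hypothetical) field inside the message payload.
valued = raw.filter(
    F.get_json_object(F.col("value").cast("string"), "$.pipeline_version") == "v2"
)
```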