Table Of Contents

2024-10-23 - Filedrop queueing for large volumes of files - take 2

Release

Status: Available

Type: DataOps

Date: 2024-10-23

Problem

The original filedrop design assumed that each dropped file loads into its own target table. Loads run simultaneously, so different files can load into different targets at the same time.

For example, the customer drops a file called products.csv and it's loaded into a table called products. At the same time they drop a file called sales.csv, which loads into a table called sales.

This pattern doesn’t work as well when the customer batches up all their change data and sends 50 change data files at the same time, for example audit_log_1.csv, audit_log_2.csv … audit_log_50.csv, that all need to load into a single target table called audit logs. Now we need to queue the files and load them one at a time before continuing with downstream processing.

Solution

By implementing a file queue in our config database, we can queue every file that is dropped and then de-queue the files one at a time. This keeps processing under control even when we are handling hundreds of files and loading them into different targets.
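The queue-then-de-queue idea can be sketched in a few lines of Python. This is an illustration only, not our implementation: the real queue lives in the config database, the FileQueue class and its method names are hypothetical, and the sketch assumes that files are serialised per target table.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FileQueue:
    """In-memory stand-in for the queue held in the config database (hypothetical)."""

    pending: dict = field(default_factory=dict)  # target table -> deque of queued file names
    loading: dict = field(default_factory=dict)  # target table -> file currently being loaded

    def enqueue(self, target, file_name):
        """Record a dropped file against the target table it loads into."""
        self.pending.setdefault(target, deque()).append(file_name)

    def dequeue(self, target):
        """Hand out the next file for a target, or None if the target is busy or empty."""
        if target in self.loading:
            return None  # assumption: only one file per target loads at a time
        files = self.pending.get(target)
        if not files:
            return None
        file_name = files.popleft()
        self.loading[target] = file_name
        return file_name

    def complete(self, target):
        """Mark the current load for a target as finished so the next file can start."""
        self.loading.pop(target, None)


queue = FileQueue()
for i in range(1, 4):
    queue.enqueue("audit logs", f"audit_log_{i}.csv")
queue.enqueue("products", "products.csv")

first = queue.dequeue("audit logs")    # oldest audit log file
blocked = queue.dequeue("audit logs")  # None while the first file is still loading
other = queue.dequeue("products")      # a different target is not held up
```

Files for the same target come out in the order they were dropped, while a file for a different target can still start straight away.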

Leverage the Magic

We leverage Google Spanner's transactional processing and record locking to prevent simultaneous updates to the queue. This is required because the bucket trigger that starts the pipeline when files are dropped autoscales to create multiple load pipelines, all of which read from and write to the queue simultaneously.

ADI

Now that's magical!

Last Refreshed

Doc Refreshed: 2024-11-21