2023-11-10 - Refine identification of Change vs Event data sources¶

Release¶

Status: Available

Type: DataOps

Date: 2023-11-10

Problem¶

Our magic target-config parser, which determines whether a data source is event data or change data, was incorrectly identifying data in a number of projects.

As a result, some loads incurred high costs and others inserted duplicate records.

Solution¶

This is one of the most important patterns we run, as it determines how rules are run and how tiles are loaded. The logic was reworked and retested on all the rules that had been incorrectly configured.

Leverage the Magic¶

This was a rework of the existing target-config pattern, which checks for known event attributes and profiles both the range of effective dates in the source data and the volume of changes per effective date.
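To make the three signals concrete, here is a minimal sketch of how such a check could be combined. It is not the actual parser: the attribute names in `KNOWN_EVENT_ATTRIBUTES`, the function names, the thresholds, and the rule for combining the signals are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import date
from statistics import median

# Hypothetical attribute names that would suggest event data.
KNOWN_EVENT_ATTRIBUTES = {"event_id", "event_name", "event_timestamp"}


def profile_effective_dates(effective_dates):
    """Summarise the range of effective dates and the row volume per date."""
    per_date = Counter(effective_dates)
    span_days = (max(per_date) - min(per_date)).days + 1
    return {
        "distinct_dates": len(per_date),
        "span_days": span_days,
        "median_rows_per_date": median(per_date.values()),
    }


def classify_source(columns, effective_dates,
                    min_distinct_dates=7, min_median_rows=100):
    """Return 'event-data' or 'change-data' (thresholds are illustrative)."""
    # Signal 1: a known event attribute is present in the source columns.
    if KNOWN_EVENT_ATTRIBUTES & {c.lower() for c in columns}:
        return "event-data"
    if not effective_dates:
        return "change-data"
    # Signals 2 and 3: many effective dates, each carrying a high row volume.
    profile = profile_effective_dates(effective_dates)
    looks_like_events = (
        profile["distinct_dates"] >= min_distinct_dates
        and profile["median_rows_per_date"] >= min_median_rows
    )
    return "event-data" if looks_like_events else "change-data"
```

In this sketch a source with only a couple of effective dates and a handful of rows per date falls through to `change-data`, while a source with many dates and a steady high volume per date is tagged `event-data`.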

Magician Partner¶

The patterns we use to identify whether source data is event data or change data are our most important. The ‘change-data’ or ‘event-data’ tag on every rule/table determines the configuration of the target table (partitioning and clustering) and the templates used to insert/update data in that table.

Change data and event data are loaded very differently to keep costs under control, maintain data integrity, and maximise performance.

Before a rule is published, we run it through the target-config parser. The parser profiles the source data to work out whether it is change data (slowly changing) or event data (time series, insert only), which determines how we create every table and how we load it.

If the pattern is wrong, we might end up doing full table scans on event data looking for changes (expensive), or we might load change data by inserting without checking whether it already exists (introducing duplicates).
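As a rough illustration of how a tag could drive table configuration, the sketch below maps each tag to a partitioning column, clustering keys, and a load template. The `TargetConfig` fields, the column names, and the template names `insert_only` and `merge_on_key` are hypothetical and are not taken from the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetConfig:
    partition_by: str      # column the target table is partitioned on
    cluster_by: tuple      # columns the target table is clustered on
    load_template: str     # name of the insert/update template to use


def target_config_for(tag, business_key,
                      event_timestamp="event_timestamp",
                      effective_date="effective_date"):
    """Pick a table layout and load template from the tag (illustrative)."""
    if tag == "event-data":
        # Insert-only time series: partition on the event time, append rows.
        return TargetConfig(event_timestamp, (business_key,), "insert_only")
    if tag == "change-data":
        # Slowly changing data: compare against existing rows before writing.
        return TargetConfig(effective_date, (business_key,), "merge_on_key")
    raise ValueError(f"unknown tag: {tag!r}")
```

Keeping the mapping in one function means a wrong tag is the only way to get the wrong layout, which is why the accuracy of the tag matters so much.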
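The two failure modes can be reproduced with a small in-memory sketch. The function names and the row shape are assumptions for illustration only; real loads run as templated statements against the target table.

```python
def load_insert_only(target, batch):
    """Event-style load: append every row without looking at the target."""
    target.extend(batch)
    return target


def load_with_change_check(target, batch, key="id"):
    """Change-style load: read the whole target, insert only new versions."""
    latest = {}
    for row in target:            # full scan of the target, the costly part
        latest[row[key]] = row
    for row in batch:
        if latest.get(row[key]) != row:
            target.append(row)
            latest[row[key]] = row
    return target


# Re-delivering the same change-data batch with the wrong template
# duplicates the record; the change check keeps a single copy.
batch = [{"id": 1, "name": "Alice"}]
wrong = load_insert_only(load_insert_only([], batch), batch)
right = load_with_change_check(load_with_change_check([], batch), batch)
```

Here `wrong` ends up with two identical rows, while `right` holds one. The reverse mistake is visible in `load_with_change_check`: it reads every existing row on each load, which is wasted work when the source is insert-only event data.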

Last Refreshed¶

Doc Refreshed: 2024-05-23