
2024-01-16 - Data Quality Jobs Optimisation - Take 2

Release

Status: Available

Type: DataOps

Date: 2024-01-16

Problem

The system (default) data quality jobs that check for duplicate keys in concept and event tiles are too expensive.

This is because the check queries the _key column across the whole table to look for duplicates.

Solution

By updating the Jinja data quality template to query the clustered column (row_hash), which is just the hashed key, we can reduce the cost of these queries by about 10x.

This brings a 3000MB (3GB) query down to 300MB, which is a big performance and cost saving.
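
As a rough illustration of why the swap is safe, the sketch below runs the same duplicate check against _key and against row_hash and shows that both flag the same rows. It is not the actual template: the table name concept_tile, the helper names, and the use of SHA-256 as the hash are all assumptions made for the example, and SQLite stands in for the warehouse, so it demonstrates equivalence of the results only, not the scan-size saving that comes from clustering.

```python
import hashlib
import sqlite3


def row_hash(key: str) -> str:
    # Assumption: row_hash is a deterministic hash of _key.
    # SHA-256 is a stand-in; the real hash function is not stated in this note.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def duplicate_check_sql(table: str, column: str) -> str:
    # Hypothetical shape of the rendered check: group on one column
    # and keep only the values that occur more than once.
    return (
        f"SELECT {column} AS dup_value, COUNT(*) AS n "
        f"FROM {table} GROUP BY {column} HAVING COUNT(*) > 1"
    )


def find_duplicates(keys, column):
    # Load the keys into an in-memory table (hypothetical name: concept_tile)
    # and run the duplicate check on the requested column.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE concept_tile (_key TEXT, row_hash TEXT)")
    conn.executemany(
        "INSERT INTO concept_tile VALUES (?, ?)",
        [(k, row_hash(k)) for k in keys],
    )
    rows = conn.execute(duplicate_check_sql("concept_tile", column)).fetchall()
    conn.close()
    return sorted(rows)


if __name__ == "__main__":
    sample_keys = ["a", "b", "a", "c", "b", "a"]
    print(find_duplicates(sample_keys, "_key"))
    print(find_duplicates(sample_keys, "row_hash"))
```

Because equal keys always produce equal hashes, every duplicate key shows up as a duplicate row_hash with the same count. The reverse holds only as long as the hash has no collisions, which is the usual assumption for a hashed key column.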

Leverage the Magic

A small tweak to the data quality template for concept and event tiles means a big performance and cost saving.

Last Refreshed

Doc Refreshed: 2024-03-02