2024-01-16 - Data Quality Jobs Optimisation - Take 2
Problem
The system (default) data quality jobs that check for duplicate keys in concept and event tiles are too expensive.
This is because each check queries the _key column across the whole table to look for duplicates.
Solution
Updating the Jinja data quality template to query the clustered column (row_hash), which is just the hashed key, reduces the cost of these queries by 10x.
This brings a 3000MB (3GB) query down to 300MB, a significant performance and cost saving.
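The template itself is not shown in this note, so the following is only a minimal sketch of the idea. It assumes the duplicate check is a GROUP BY / HAVING query whose target column is a template parameter. The table name concept_tile, the SHA-256 hash, and the helper functions are hypothetical, and SQLite stands in for the real warehouse, so it demonstrates that both columns flag the same duplicates rather than the bytes-scanned saving.

```python
import hashlib
import sqlite3

# Hypothetical query shape: the real Jinja template is not shown in the note.
DUPLICATE_CHECK_SQL = """
SELECT {column} AS dup_value, COUNT(*) AS occurrences
FROM {table}
GROUP BY {column}
HAVING COUNT(*) > 1
"""


def render_duplicate_check(table: str, column: str) -> str:
    """Render the duplicate check for one table and one column."""
    return DUPLICATE_CHECK_SQL.format(table=table, column=column)


def find_duplicates(conn: sqlite3.Connection, table: str, column: str) -> list:
    """Run the rendered check and return (value, occurrences) rows."""
    return conn.execute(render_duplicate_check(table, column)).fetchall()


def build_demo_table() -> sqlite3.Connection:
    """Create an in-memory tile with one duplicated key (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE concept_tile (_key TEXT, row_hash TEXT)")
    keys = ["cust-1", "cust-2", "cust-2", "cust-3"]
    # Assumption: row_hash is a deterministic hash of _key (SHA-256 here).
    rows = [(k, hashlib.sha256(k.encode()).hexdigest()) for k in keys]
    conn.executemany("INSERT INTO concept_tile VALUES (?, ?)", rows)
    conn.commit()
    return conn


conn = build_demo_table()
by_key = find_duplicates(conn, "concept_tile", "_key")       # original check
by_hash = find_duplicates(conn, "concept_tile", "row_hash")  # optimised check
```

Because row_hash is derived deterministically from _key, a duplicated key always produces a duplicated hash, so switching the column should not hide duplicates. A hash collision could in principle produce a false positive, which is extremely unlikely with a strong hash.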
Leverage the Magic
A small tweak to the data quality template for concept and event tiles means a big performance and cost saving.
Last Refreshed
Doc Refreshed: 2024-05-20