Table Of Contents

History Data

When we collect data into the AgileData History area we treat it as immutable data.

We store all historical data over all time to provide an immutable corporate memory of that data. We never delete History data; the History area stores a copy of all your data over all time.

AgileData Data Collection Patterns

In AgileData we collect data based on a number of common patterns.

  • Full Snapshots

  • Deltas

  • Change Data Capture (CDC)

  • Events

Full Snapshots

A full snapshot is a full copy of the data at a specific point in time. It’s like taking a photograph of the data.

Each time we collect the data we collect a complete copy of it, not just the records that have changed.

Full Snapshot Data Example

As an example, let’s say we are collecting data about the products we sell.

The first time we collect the snapshot data we might collect:

T-Shirt Product Catalog First Snapshot

| Product ID | Name | Color | Size | Price |
|---|---|---|---|---|
| 001 | ADI T-shirt | Black | M | $19.99 |
| 002 | ADI Yellow T-shirt | Black | M | $19.99 |
| 003 | Return of the Data Model | Black | S | $21.99 |

The second time we collect the snapshot data we might collect:

T-Shirt Product Catalog Second Snapshot

| Product ID | Name | Color | Size | Price |
|---|---|---|---|---|
| 001 | ADI T-shirt | Black | M | $19.99 |
| 002 | ADI Yellow T-shirt | Black | M | $19.99 |
| 003 | Return of the Data Model T-shirt | Black | S | $22.99 |
| 004 | The Semantic Layer Strikes Back T-shirt | Black | S | $23.99 |

In this second snapshot we can see two things have happened.

First, the price for the Return of the Data Model T-shirt has changed from $21.99 to $22.99.

T-Shirt Product Catalog Price Change

| Product ID | Name | Color | Size | Price |
|---|---|---|---|---|
| 003 | Return of the Data Model T-shirt | Black | S | $22.99 |

Second, a fourth T-shirt has been added.

T-Shirt Product Catalog Second Snapshot

| Product ID | Name | Color | Size | Price |
|---|---|---|---|---|
| 004 | The Semantic Layer Strikes Back T-shirt | Black | S | $23.99 |

But we have to compare the first and second snapshots of the data to determine that these changes have happened.
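The comparison described above can be sketched in a few lines of Python. This is an illustration only, not how AgileData implements it: the function name `diff_snapshots` and the field names are assumptions, and the rows are trimmed to the Product ID and Price columns for brevity.

```python
def diff_snapshots(previous, current, key="product_id"):
    """Compare two full snapshots and return (inserted, updated, removed)."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}

    # Keys that only exist in the newer snapshot are new records.
    inserted = [row for k, row in curr_by_key.items() if k not in prev_by_key]
    # Keys that only exist in the older snapshot have disappeared from the source.
    removed = [row for k, row in prev_by_key.items() if k not in curr_by_key]

    # Keys present in both snapshots are compared field by field.
    updated = {}
    for k, row in curr_by_key.items():
        if k in prev_by_key:
            changes = {
                field: (prev_by_key[k].get(field), value)
                for field, value in row.items()
                if prev_by_key[k].get(field) != value
            }
            if changes:
                updated[k] = changes
    return inserted, updated, removed


# Hypothetical rows based on the two snapshots above (Product ID and Price only).
first = [
    {"product_id": "001", "price": "$19.99"},
    {"product_id": "002", "price": "$19.99"},
    {"product_id": "003", "price": "$21.99"},
]
second = [
    {"product_id": "001", "price": "$19.99"},
    {"product_id": "002", "price": "$19.99"},
    {"product_id": "003", "price": "$22.99"},
    {"product_id": "004", "price": "$23.99"},
]

inserted, updated, removed = diff_snapshots(first, second)
print(inserted)  # the new product 004
print(updated)   # the price change on product 003
print(removed)   # nothing was removed between the two snapshots
```

Every record has to be read and compared on each collection, which is the cost of the full snapshot pattern.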

Deltas

Deltas, on the other hand, refer to the changes in data over time. Instead of collecting a complete copy of the data each time, with a delta pattern we only collect the differences between one point in time and another.

Imagine you have a photograph of a landscape taken at a specific point in time. This photograph represents a “snapshot” of the landscape at that moment. Now, let’s say you take another photograph of the same landscape a week later. This second photograph is another “snapshot”.

The “delta” in this analogy would be the differences between the first and the second photograph. For instance, maybe a tree has lost some leaves, or a new building has been constructed in the background. These changes - the lost leaves, the new building - represent the “delta” between the two photographs.

Delta Data Example

Let’s continue with the example for collecting data about the products we sell.

If we collect the data using a delta pattern, the first time we collect the data we might collect:

T-Shirt Product Catalog First Snapshot

| Product ID | Name | Color | Size | Price |
|---|---|---|---|---|
| 001 | ADI T-shirt | Black | M | $19.99 |
| 002 | ADI Yellow T-shirt | Black | M | $19.99 |
| 003 | Return of the Data Model | Black | S | $21.99 |

When we collect the second delta we would collect:

T-Shirt Product Catalog Second Snapshot

| Product ID | Name | Color | Size | Price |
|---|---|---|---|---|
| 003 | Return of the Data Model T-shirt | Black | S | $22.99 |
| 004 | The Semantic Layer Strikes Back T-shirt | Black | S | $23.99 |

We are only collecting new or changed data. The System of Capture, or a separate collection system, does the work of identifying the changes and provides only those changes to AgileData.
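To see how a delta relates to the full picture, the sketch below merges a delta into the last known state. It is a minimal illustration under assumptions: `apply_delta` and the field names are hypothetical, rows are trimmed to Product ID and Price, and every delta row is assumed to carry the complete record for its product.

```python
def apply_delta(known_state, delta, key="product_id"):
    """Merge a delta into the last known state and return the new full state."""
    merged = {row[key]: row for row in known_state}
    for row in delta:
        # A new key is added; an existing key is replaced by the newer record.
        merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


# Hypothetical rows based on the first collection and the second delta above.
first_collection = [
    {"product_id": "001", "price": "$19.99"},
    {"product_id": "002", "price": "$19.99"},
    {"product_id": "003", "price": "$21.99"},
]
second_delta = [
    {"product_id": "003", "price": "$22.99"},
    {"product_id": "004", "price": "$23.99"},
]

current_state = apply_delta(first_collection, second_delta)
for row in current_state:
    print(row["product_id"], row["price"])
```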

Change Data Capture (CDC)

Change Data Capture (CDC) is a pattern that captures and tracks the changes in data, so action can be taken based on that change. CDC captures changes such as INSERT, UPDATE, and DELETE operations in the system of capture data, allowing other systems to consume and act on these changes.

To make it confusing, both the CDC and Delta patterns are a form of change data. The difference is that the Delta pattern is typically used to describe comparing Snapshot data to determine the changes, whereas CDC is typically used to describe the pattern where the system of capture only provides changes in data.

Change Data Capture Data Example

Let’s continue with the example for collecting data about the products we sell.

For the CDC pattern we would normally start by collecting an initial load of the data. We do this as we have never collected any of the data before, so we need to start with an initial copy of what the data looks like.

The initial load we collect would look like:

T-Shirt Catalog Change Data Initial Load

| Product ID | Name | Color | Size | Price | Change Type | Change TimeStamp |
|---|---|---|---|---|---|---|
| 001 | ADI T-shirt | Black | M | $19.99 | INITIAL_LOAD | null |
| 002 | ADI Yellow T-shirt | Black | M | $19.99 | INITIAL_LOAD | null |
| 003 | Return of the Data Model | Black | S | $21.99 | INITIAL_LOAD | null |

And yes as you have probably noticed this looks exactly the same as the initial data collection we do for both the Snapshot and Delta patterns.

When we collect the second set of change data capture data we would collect:

T-Shirt Product Catalog Second Snapshot

| Product ID | Name | Color | Size | Price | Change Type | Change TimeStamp |
|---|---|---|---|---|---|---|
| 003 | | | | $22.99 | UPDATE | 2023-07-18 10:00:00 |
| 004 | The Semantic Layer Strikes Back T-shirt | Black | S | $23.99 | INSERT | 2023-07-18 11:00:00 |

In this example we have collected a change for t-shirt #003 and been told the price has changed and is now $22.99.

We have also collected the creation of a new product and have been told all the details for that product.

There is another CDC pattern where we might have collected:

T-Shirt Product Catalog Second Snapshot

| Product ID | Name | Color | Size | Price | Change Type | Change TimeStamp |
|---|---|---|---|---|---|---|
| 003 | Return of the Data Model | Black | S | $22.99 | UPDATE | 2023-07-18 10:00:00 |

In this pattern the CDC collection system is providing us with all the details for the #003 product, whether they have changed or not, and we need to determine which fields have changed by comparing this record to the previous record we have collected.
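A minimal sketch of that comparison, assuming each record is held as a dictionary. The helper `changed_fields` and the field names are hypothetical; the CDC metadata columns are skipped because they describe the change rather than the product.

```python
def changed_fields(previous, incoming, ignore=("change_type", "change_timestamp")):
    """Return {field: (old_value, new_value)} for every field that differs."""
    return {
        field: (previous.get(field), value)
        for field, value in incoming.items()
        if field not in ignore and previous.get(field) != value
    }


# The #003 record from the initial load and the full-record UPDATE shown above.
previous = {
    "product_id": "003",
    "name": "Return of the Data Model",
    "color": "Black",
    "size": "S",
    "price": "$21.99",
    "change_type": "INITIAL_LOAD",
    "change_timestamp": None,
}
incoming = {
    "product_id": "003",
    "name": "Return of the Data Model",
    "color": "Black",
    "size": "S",
    "price": "$22.99",
    "change_type": "UPDATE",
    "change_timestamp": "2023-07-18 10:00:00",
}

changes = changed_fields(previous, incoming)
print(changes)  # only the price differs between the two records
```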

Key CDC Fields

In a CDC pattern there are three key fields that need to be provided in the data collection to ensure the pattern operates correctly.

  • Unique Business Key

  • Change Type

  • Change Effective Timestamp

Unique Business Key

This is a unique identifier for each record in the data. It’s used to track changes to each specific record over time. In a product catalog, for example, the unique business key might be the product ID. In a customer database, it might be the customer ID. The unique business key is what allows the CDC system to know which record a change applies to.

In our example the Unique Business Key is the Product ID, for example #003.

Change Type

This field indicates the type of change that occurred. Common values are INSERT, UPDATE, and DELETE, representing a new record being added, an existing record being modified, or a record being removed, respectively. This field is crucial for understanding what kind of operation needs to be performed to apply the change.

Change Effective Timestamp

This is the time when the change took effect. It’s important for understanding the order of changes, especially when multiple changes are made to the same record over time. By looking at the timestamps, you can reconstruct the state of the data at any point in time.
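The three fields work together when change records are replayed. The sketch below rebuilds the state of the catalog at a chosen point in time; it is an illustration under assumptions, not AgileData's implementation. The function `state_as_of`, the record layout, and the choice to sort initial-load rows (which have no timestamp) before all other changes are hypothetical.

```python
from datetime import datetime


def state_as_of(changes, as_of):
    """Replay change records up to `as_of`; return the state keyed by business key."""

    def sort_key(change):
        # Assumption: initial-load rows have no timestamp, so they sort first.
        ts = change["change_timestamp"]
        return (ts is not None, ts or datetime.min)

    state = {}
    for change in sorted(changes, key=sort_key):
        ts = change["change_timestamp"]
        if ts is not None and ts > as_of:
            break  # records are sorted, so everything after this is later still
        key = change["product_id"]
        if change["change_type"] == "DELETE":
            state.pop(key, None)
        else:
            # INITIAL_LOAD, INSERT and UPDATE: merge supplied fields over what we hold.
            state[key] = {**state.get(key, {}), **change["fields"]}
    return state


# Hypothetical records based on the CDC example above (Price only).
changes = [
    {"product_id": "003", "change_type": "INITIAL_LOAD",
     "change_timestamp": None, "fields": {"price": "$21.99"}},
    {"product_id": "003", "change_type": "UPDATE",
     "change_timestamp": datetime(2023, 7, 18, 10, 0), "fields": {"price": "$22.99"}},
    {"product_id": "004", "change_type": "INSERT",
     "change_timestamp": datetime(2023, 7, 18, 11, 0), "fields": {"price": "$23.99"}},
]

before_update = state_as_of(changes, datetime(2023, 7, 18, 9, 0))
mid_morning = state_as_of(changes, datetime(2023, 7, 18, 10, 30))
end_of_day = state_as_of(changes, datetime(2023, 7, 18, 23, 59))
print(before_update)
print(mid_morning)
print(end_of_day)
```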

Events

Event data refers to information that is generated when something happens, or an “event” occurs. These events can be anything that happens within a system or a process. In the context of computer systems and applications, events can include things like a user clicking a button, a system error occurring, a transaction being processed, or a sensor reading being taken. Event data is often used in event-driven programming, event sourcing, and complex event processing.

Event Data Example

Typically, the Event data collection pattern is not used to collect reference data, for example the addition or change of products within a product catalog.

So let’s use a typical example of a user adding a product to the shopping cart on an eCommerce website.

Shopping Cart Events

| Event ID | Event Type | User ID | Product ID | Quantity | Event Timestamp |
|---|---|---|---|---|---|
| E004 | ADD_TO_CART | U001 | 001 | 2 | 2023-07-18 13:00:00 |
| E005 | ADD_TO_CART | U002 | 003 | 1 | 2023-07-18 14:00:00 |
| E006 | ADD_TO_CART | U001 | 002 | 3 | 2023-07-18 15:00:00 |

In this example we can see that Event E004 is telling us that user U001 has added two of Product 001 to their shopping cart.

But if we wanted to continue with the example for collecting data about the products we sell, using the event data collection pattern an example would be:

Product Catalog Events

| Event ID | Event Type | User ID | Product ID | Product Name | Color | Size | Price | Event Timestamp |
|---|---|---|---|---|---|---|---|---|
| E404 | ADD_PRODUCT | U042 | 004 | The Semantic Layer Strikes Back T-shirt | Black | M | $23.99 | 2023-07-18 10:00:00 |

Key Event Fields

In an Event pattern there are a number of key fields that need to be provided in the data collection to ensure the pattern operates correctly.

  • Event Identifier

  • Event Type

  • Event Timestamp

  • Event Attributes

Event Identifier

This is a unique identifier for each event. It helps in distinguishing each event from others. This could be a simple sequential number, a UUID, or some other form of unique identifier.

Event Type

This field indicates the type or category of the event. It helps in classifying the event and determining how it should be processed. Examples of event types could include “click”, “purchase”, “error”, “login”, etc., depending on the context.

Event Timestamp

This is the time when the event occurred. Timestamps are crucial for understanding when things happened and in what order, especially when dealing with a large number of events.

Event Attributes

These are additional pieces of information about the event. They provide context and details about the event. The exact attributes will depend on the type of event and the system it’s coming from. For example, a “click” event might have attributes like “user_id”, “page_url”, and “button_id”. An “error” event might have attributes like “error_code”, “error_message”, and “system_state”.
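As a rough sketch, the four key fields map naturally onto a small record type. The `Event` class and the `cart_contents` helper below are hypothetical names used for illustration, with the shopping cart events from the example above as data; anything specific to an event type lives in the attributes.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class Event:
    event_id: str              # Event Identifier
    event_type: str            # Event Type
    event_timestamp: datetime  # Event Timestamp
    attributes: dict[str, Any] = field(default_factory=dict)  # Event Attributes


def cart_contents(events, user_id):
    """Total the quantity of each product a user has added to their cart."""
    cart = Counter()
    # Process events in the order they occurred.
    for event in sorted(events, key=lambda e: e.event_timestamp):
        if event.event_type == "ADD_TO_CART" and event.attributes["user_id"] == user_id:
            cart[event.attributes["product_id"]] += event.attributes["quantity"]
    return dict(cart)


events = [
    Event("E004", "ADD_TO_CART", datetime(2023, 7, 18, 13, 0),
          {"user_id": "U001", "product_id": "001", "quantity": 2}),
    Event("E005", "ADD_TO_CART", datetime(2023, 7, 18, 14, 0),
          {"user_id": "U002", "product_id": "003", "quantity": 1}),
    Event("E006", "ADD_TO_CART", datetime(2023, 7, 18, 15, 0),
          {"user_id": "U001", "product_id": "002", "quantity": 3}),
]

u001_cart = cart_contents(events, "U001")
print(u001_cart)
```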

Snapshots, Deltas, CDC and Events

Snapshots provide a full view of the data at a specific point in time, deltas capture the changes between different versions of the data, CDC captures and processes changes as they happen, and events represent individual occurrences within a system or process.

Each of these patterns has its own strengths and weaknesses, and they are often used together. In AgileData we support collecting data using any of these patterns, and the chosen pattern is typically dictated by the capabilities of the System of Capture that is capturing the data to be collected.

AgileData Data Collection Patterns

Change Data

When we receive data into the landing area (from dropped files) we append all the records into a landing table.

A default rule is autogenerated to load this data into the history area.

The default rule is for prototyping and won’t always be suitable for all types of landing data, because:

  • The default rule uses the date the file was dropped as the ‘effective date’ of the records

This pattern is fine for most data, except records that change intra-day and don’t have a timestamp for when they became effective. In that situation we would only capture the unique records by file drop date and not the intra-day changes.

https://storage.googleapis.com/docs-agiledata-io/product-guild/collect/landing-to-history-default.png

Best practice for change data is to specify an effective date (timestamp) attribute in the landing to history rule so change records can be safely sequenced and persisted into history.

Specifying an effective date also means that status changes on a record are correctly identified as the effective date gets included in the row key to create unique change records for loading.

https://storage.googleapis.com/docs-agiledata-io/product-guild/collect/landing-to-history-effective-date.png
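The effect of the effective date on the row key can be illustrated with a small sketch. The key format below (business key joined to the effective date) is an assumption made for illustration and is not AgileData's actual row key; `row_key` and the field names are hypothetical.

```python
from datetime import date, datetime


def row_key(business_key, effective):
    """Hypothetical row key: the business key joined to the effective date."""
    return f"{business_key}|{effective.isoformat()}"


# Two changes to the same product on the same day, landed in the same file drop.
landed = [
    {"product_id": "003", "price": "$21.99",
     "updated_at": datetime(2023, 7, 18, 9, 0), "file_drop_date": date(2023, 7, 18)},
    {"product_id": "003", "price": "$22.99",
     "updated_at": datetime(2023, 7, 18, 15, 0), "file_drop_date": date(2023, 7, 18)},
]

# Default rule: the file drop date is the effective date, so both rows share a key.
default_keys = {row_key(r["product_id"], r["file_drop_date"]) for r in landed}

# Effective timestamp specified: each intra-day change gets its own key.
effective_keys = {row_key(r["product_id"], r["updated_at"]) for r in landed}

print(len(default_keys), len(effective_keys))
```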

Event Data

Event data is automatically detected by the presence of the following attributes:

  • event_date

  • event_name

  • event_timestamp

  • event_params

  • user_pseudo_id

  • traffic_source

…and more than 50 change records per event date. Otherwise we default to a change data load pattern.
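The detection rule can be approximated as follows. This is a sketch under stated assumptions rather than the actual detection logic: it assumes all six attributes must be present and that every event date must exceed the threshold, and the function name `detect_load_pattern` is hypothetical.

```python
from collections import Counter

EVENT_ATTRIBUTES = {
    "event_date", "event_name", "event_timestamp",
    "event_params", "user_pseudo_id", "traffic_source",
}


def detect_load_pattern(columns, rows, threshold=50):
    """Return "event" or "change" for a landed table."""
    # Assumption: all six attributes must be present, not just some of them.
    if not EVENT_ATTRIBUTES.issubset(columns):
        return "change"
    per_date = Counter(row["event_date"] for row in rows)
    # Assumption: every event date must have more than `threshold` records.
    if per_date and all(count > threshold for count in per_date.values()):
        return "event"
    return "change"


busy_day = [{"event_date": "2023-07-18"} for _ in range(51)]
quiet_day = [{"event_date": "2023-07-18"} for _ in range(10)]

print(detect_load_pattern(EVENT_ATTRIBUTES, busy_day))
print(detect_load_pattern(EVENT_ATTRIBUTES, quiet_day))
print(detect_load_pattern({"product_id", "price"}, busy_day))
```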

Must have

  • Event Type

  • Event Date / Timestamp column

Event data can contain duplicate events. We de-duplicate them, which means the duplicate event is ignored.
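A minimal sketch of de-duplication, assuming a duplicate is an event whose fields are all identical to one already seen and that records are flat dictionaries. How duplicates are actually identified is not stated here, so the fingerprint and the name `dedupe_events` are illustrative assumptions.

```python
def dedupe_events(events):
    """Keep the first occurrence of each event and ignore exact duplicates."""
    seen = set()
    unique = []
    for event in events:
        # Assumption: two events are duplicates when every field matches.
        fingerprint = tuple(sorted(event.items()))
        if fingerprint in seen:
            continue  # ignore the duplicate
        seen.add(fingerprint)
        unique.append(event)
    return unique


landed_events = [
    {"event_name": "ADD_TO_CART", "event_timestamp": "2023-07-18 13:00:00", "user_id": "U001"},
    {"event_name": "ADD_TO_CART", "event_timestamp": "2023-07-18 13:00:00", "user_id": "U001"},
    {"event_name": "ADD_TO_CART", "event_timestamp": "2023-07-18 14:00:00", "user_id": "U002"},
]

unique_events = dedupe_events(landed_events)
print(len(unique_events))
```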

History vs Concepts, Details, Events and Consume

Refer to Data Load Patterns