.. _history_pattern:

===========================
History Data
===========================

When we collect data into the AgileData History area we treat it as immutable data.

We store all historical data over all time to provide an immutable corporate memory of that data.

We never delete History data; the History area stores a copy of all your data over all time.

AgileData Data Collection Patterns
====================================

In AgileData we collect data based on a number of common patterns:

* Full Snapshots
* Deltas
* Change Data Capture (CDC)
* Events

Full Snapshots
---------------------------

A full snapshot is a full copy of the data at a specific point in time. It's like taking a photograph of the data.

Every time we collect the data we collect a complete copy of it.

Full Snapshot Data Example
+++++++++++++++++++++++++++++++++

As an example, let's say we are collecting data about the products we sell.

The first time we collect the snapshot data we might collect:

.. list-table:: T-Shirt Product Catalog First Snapshot
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
   * - 001
     - ADI T-shirt
     - Black
     - M
     - $19.99
   * - 002
     - ADI Yellow T-shirt
     - Black
     - M
     - $19.99
   * - 003
     - Return of the Data Model
     - Black
     - S
     - $21.99

The second time we collect the snapshot data we might collect:

.. list-table:: T-Shirt Product Catalog Second Snapshot
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
   * - 001
     - ADI T-shirt
     - Black
     - M
     - $19.99
   * - 002
     - ADI Yellow T-shirt
     - Black
     - M
     - $19.99
   * - 003
     - Return of the Data Model
     - Black
     - S
     - $22.99
   * - 004
     - The Semantic Layer Strikes Back T-shirt
     - Black
     - S
     - $23.99

In this second snapshot we can see two things have happened.

First, the price for the Return of the Data Model T-shirt has changed from $21.99 to $22.99.

.. list-table:: T-Shirt Product Catalog Price Change
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
   * - 003
     - Return of the Data Model
     - Black
     - S
     - $22.99

Second, a fourth T-shirt has been added.

.. list-table:: T-Shirt Product Catalog New Product
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
   * - 004
     - The Semantic Layer Strikes Back T-shirt
     - Black
     - S
     - $23.99

But we have to compare the first and second snapshots of the data to determine that these changes have happened.

Deltas
---------------------------

Deltas, on the other hand, are the changes in data over time. Instead of collecting a complete copy of the data each time, with a delta pattern we only collect the differences between one point in time and another.

Imagine you have a photograph of a landscape taken at a specific point in time. This photograph represents a "snapshot" of the landscape at that moment. Now, let's say you take another photograph of the same landscape a week later. This second photograph is another "snapshot".

The "delta" in this analogy is the set of differences between the first and the second photograph. For instance, maybe a tree has lost some leaves, or a new building has been constructed in the background. These changes (the lost leaves, the new building) represent the "delta" between the two photographs.

Delta Data Example
+++++++++++++++++++++++++++++++++

Let's continue with the example of collecting data about the products we sell.

If we collect the data using a delta pattern, the first time we collect the data we might collect:

.. list-table:: T-Shirt Product Catalog First Delta Collection
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
   * - 001
     - ADI T-shirt
     - Black
     - M
     - $19.99
   * - 002
     - ADI Yellow T-shirt
     - Black
     - M
     - $19.99
   * - 003
     - Return of the Data Model
     - Black
     - S
     - $21.99

When we collect the second delta we would collect:

.. list-table:: T-Shirt Product Catalog Second Delta Collection
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
   * - 003
     - Return of the Data Model
     - Black
     - S
     - $22.99
   * - 004
     - The Semantic Layer Strikes Back T-shirt
     - Black
     - S
     - $23.99

We are only collecting new or changed data. The System of Capture, or a separate collection system, does the work of identifying the changes and provides those changes to AgileData.

Change Data Capture (CDC)
---------------------------

Change Data Capture (CDC) is a pattern that captures and tracks the changes in data, so action can be taken based on that change. CDC captures changes such as INSERT, UPDATE, and DELETE operations in the System of Capture data, allowing other systems to consume and act on these changes.

To make it confusing, both the CDC and Delta patterns are a form of change data. The difference is that the Delta pattern is typically used to describe comparing Snapshot data to determine the changes, whereas CDC is typically used to describe the pattern where the System of Capture only provides changes in data.

Change Data Capture Data Example
+++++++++++++++++++++++++++++++++

Let's continue with the example of collecting data about the products we sell.

For the CDC pattern we would normally start by collecting an initial load of the data. We do this because we have never collected any of the data before, so we need to start with an initial copy of what the data looks like.

The initial load we collect would look like:

.. list-table:: T-Shirt Catalog Change Data Initial Load
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
     - Change Type
     - Change TimeStamp
   * - 001
     - ADI T-shirt
     - Black
     - M
     - $19.99
     - INITIAL_LOAD
     - null
   * - 002
     - ADI Yellow T-shirt
     - Black
     - M
     - $19.99
     - INITIAL_LOAD
     - null
   * - 003
     - Return of the Data Model
     - Black
     - S
     - $21.99
     - INITIAL_LOAD
     - null

And yes, as you have probably noticed, apart from the two change columns this looks exactly the same as the initial data collection we do for both the Snapshot and Delta patterns.

When we collect the second set of change data capture data we would collect:

.. list-table:: T-Shirt Catalog Change Data Second Collection
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
     - Change Type
     - Change TimeStamp
   * - 003
     -
     -
     -
     - $22.99
     - UPDATE
     - 2023-07-18 10:00:00
   * - 004
     - The Semantic Layer Strikes Back T-shirt
     - Black
     - S
     - $23.99
     - INSERT
     - 2023-07-18 11:00:00

In this example we have collected a change for T-shirt #003 and been told the price has changed and is now $22.99. We have also collected the creation of a new product and have been told all the details for that product.

There is another CDC pattern where we might have collected:

.. list-table:: T-Shirt Catalog Change Data Full Record Update
   :header-rows: 1

   * - Product ID
     - Name
     - Color
     - Size
     - Price
     - Change Type
     - Change TimeStamp
   * - 003
     - Return of the Data Model
     - Black
     - S
     - $22.99
     - UPDATE
     - 2023-07-18 10:00:00

In this pattern the CDC collection system provides us with all the details for product #003, whether each field has changed or not, and we need to determine which fields have changed by comparing this record to the previous record we collected.

Key CDC Fields
+++++++++++++++++++++++++++++++++

In a CDC pattern there are three key fields that need to be provided in the data collection to ensure the pattern operates correctly:

* Unique Business Key
* Change Type
* Change Effective Timestamp

Unique Business Key
    This is a unique identifier for each record in the data. It's used to track changes to each specific record over time. In a product catalog, for example, the unique business key might be the product ID. In a customer database, it might be the customer ID. The unique business key is what allows the CDC system to know which record a change applies to.

    In our example the unique business key is the Product ID, for example #003.

Change Type
    This field indicates the type of change that occurred.
    Common values are INSERT, UPDATE, and DELETE, representing a new record being added, an existing record being modified, or a record being removed, respectively. This field is crucial for understanding what kind of operation needs to be performed to apply the change.

Change Effective Timestamp
    This is the time when the change took effect. It's important for understanding the order of changes, especially when multiple changes are made to the same record over time. By looking at the timestamps, you can reconstruct the state of the data at any point in time.

Events
---------------------------

Event data is information that is generated when something happens, that is, when an "event" occurs. These events can be anything that happens within a system or a process.

In the context of computer systems and applications, events can include things like a user clicking a button, a system error occurring, a transaction being processed, or a sensor reading being taken.

Event data is often used in event-driven programming, event sourcing, and complex event processing.

Event Data Example
+++++++++++++++++++++++++++++++++

Typically, the Event data collection pattern is not used to collect reference data, for example the addition or change of products within a product catalog.

So let's use a typical example of a user adding a product to the shopping cart on an eCommerce website.

.. list-table:: Shopping Cart Events
   :header-rows: 1

   * - Event ID
     - Event Type
     - User ID
     - Product ID
     - Quantity
     - Event Timestamp
   * - E004
     - ADD_TO_CART
     - U001
     - 001
     - 2
     - 2023-07-18 13:00:00
   * - E005
     - ADD_TO_CART
     - U002
     - 003
     - 1
     - 2023-07-18 14:00:00
   * - E006
     - ADD_TO_CART
     - U001
     - 002
     - 3
     - 2023-07-18 15:00:00

In this example we can see that Event E004 is telling us that user U001 has added two of Product 001 to their shopping cart.

But if we wanted to continue with the example of collecting data about the products we sell, an example using the event data collection pattern would be:

.. list-table:: Product Catalog Events
   :header-rows: 1

   * - Event ID
     - Event Type
     - User ID
     - Product ID
     - Product Name
     - Color
     - Size
     - Price
     - Event Timestamp
   * - E404
     - ADD_PRODUCT
     - U042
     - 004
     - The Semantic Layer Strikes Back T-shirt
     - Black
     - S
     - $23.99
     - 2023-07-18 10:00:00

Key Event Fields
+++++++++++++++++++++++++++++++++

In an Event pattern there are a number of key fields that need to be provided in the data collection to ensure the pattern operates correctly:

* Event Identifier
* Event Type
* Event Timestamp
* Event Attributes

Event Identifier
    This is a unique identifier for each event. It helps in distinguishing each event from others. This could be a simple sequential number, a UUID, or some other form of unique identifier.

Event Type
    This field indicates the type or category of the event. It helps in classifying the event and determining how it should be processed. Examples of event types could include "click", "purchase", "error", "login", etc., depending on the context.

Event Timestamp
    This is the time when the event occurred. Timestamps are crucial for understanding when things happened and in what order, especially when dealing with a large number of events.

Event Attributes
    These are additional pieces of information about the event. They provide context and details about the event. The exact attributes will depend on the type of event and the system it's coming from. For example, a "click" event might have attributes like "user_id", "page_url", and "button_id". An "error" event might have attributes like "error_code", "error_message", and "system_state".

Snapshots, Deltas, CDC and Events
-------------------------------------

Snapshots provide a full view of the data at a specific point in time, deltas capture the changes between different versions of the data, CDC captures and processes changes as they happen, and events represent individual occurrences within a system or process.
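
To make the relationship between these patterns concrete, here is a minimal Python sketch that compares two full snapshots of the T-shirt catalog by Product ID and derives the changes, labelled with CDC-style change types. It is an illustration only and not how AgileData implements collection; the function name ``diff_snapshots``, the field names and the ``change_type`` key are assumptions made for this example.

```python
from typing import Dict, List

Row = Dict[str, str]


def diff_snapshots(previous: List[Row], current: List[Row],
                   key: str = "product_id") -> List[Row]:
    """Compare two full snapshots and return the changed rows.

    Hypothetical helper: each returned row is a copy of the source row
    with an added "change_type" of INSERT, UPDATE or DELETE.
    """
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    changes: List[Row] = []

    # Rows in the current snapshot are either new or possibly modified.
    for business_key, row in curr_by_key.items():
        if business_key not in prev_by_key:
            changes.append({**row, "change_type": "INSERT"})
        elif row != prev_by_key[business_key]:
            changes.append({**row, "change_type": "UPDATE"})

    # Rows that have disappeared from the current snapshot were removed.
    for business_key, row in prev_by_key.items():
        if business_key not in curr_by_key:
            changes.append({**row, "change_type": "DELETE"})

    return changes


first_snapshot = [
    {"product_id": "001", "name": "ADI T-shirt", "price": "$19.99"},
    {"product_id": "002", "name": "ADI Yellow T-shirt", "price": "$19.99"},
    {"product_id": "003", "name": "Return of the Data Model", "price": "$21.99"},
]
second_snapshot = [
    {"product_id": "001", "name": "ADI T-shirt", "price": "$19.99"},
    {"product_id": "002", "name": "ADI Yellow T-shirt", "price": "$19.99"},
    {"product_id": "003", "name": "Return of the Data Model", "price": "$22.99"},
    {"product_id": "004", "name": "The Semantic Layer Strikes Back T-shirt",
     "price": "$23.99"},
]

changes = diff_snapshots(first_snapshot, second_snapshot)
for change in changes:
    print(change["product_id"], change["change_type"], change["price"])
# 003 UPDATE $22.99
# 004 INSERT $23.99
```

With a snapshot pattern this comparison has to happen after collection. With a delta or CDC pattern the System of Capture (or a separate collection system) has already done the equivalent work before the data is collected.
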
Each of these patterns has its own strengths and weaknesses, and they are often used together.

In AgileData we support collecting data using any of these patterns, and the chosen pattern is typically dictated by the capabilities of the System of Capture that is capturing the data to be collected.

AgileData Data Collection Patterns
====================================

Change Data
---------------------------

When we receive data into the landing area (from dropped files) we append all the records into a landing table. A default rule is autogenerated to load this data into the history area. The default rule is for prototyping and won't always be suitable for all types of landing data, because of the following:

* The default rule uses the date the file was dropped as the 'effective date' of the records

This pattern is fine for most data, except records that change intra-day and don't have a timestamp for when they are effective. In that situation we would only capture the unique records by file drop date and not the intra-day changes.

.. image:: https://storage.googleapis.com/docs-agiledata-io/product-guild/collect/landing-to-history-default.png
   :width: 800px

Best practice for change data is to specify an effective date (timestamp) attribute in the landing to history rule so change records can be safely sequenced and persisted into history. Specifying an effective date also means that status changes on a record are correctly identified, because the effective date is included in the row key to create unique change records for loading.

.. image:: https://storage.googleapis.com/docs-agiledata-io/product-guild/collect/landing-to-history-effective-date.png
   :width: 800px

Event Data
---------------------------

Event data is automatically detected by the presence of the following attributes:

* event_date
* event_name
* event_timestamp
* event_params
* user_pseudo_id
* traffic_source

...and more than 50 change records per event date. Otherwise we default to a change data load pattern.

Event data must have:

* Event Type
* Event Date / Timestamp column

Event data can contain duplicate events. We de-duplicate them, which means we ignore the duplicate event.

History vs Concepts, Details, Events and Consume
==================================================

Refer to :ref:`Data Load Patterns `