Backfill Data Loading Best Practices
This document assumes that your Lytics account has been configured to use the Bulk Topic feature. Reach out to your Lytics Account Manager for more information.
Some use cases involve having historical data available for segmentation. This data might be demographic in nature, or describe how customers prefer to be contacted. This document offers guidelines for cases where large amounts of this type of data need to be made available in your Lytics account.
Separate Backfill from Real-time Streams
A real-time data stream contains messages that are sent in response to the activity which they describe. This is distinguished from batched data streams where messages are sent in groups on a given schedule, or according to some other trigger.
When an attribute will be kept up to date by a real-time stream but must also be populated with a substantial amount of pre-existing data, separate that backfill from the now-forward messages.
Backfill messages can be sent using multiple means. API loads should be sent via the Bulk CSV or Bulk JSON endpoints. It is also possible to use integration workflows to import this data, such as Amazon S3 CSV imports or Lytics Managed SFTP CSV imports.
The benefit of separating this data from real-time message streams is that processing backfill messages does not impact the processing time of messages received from real-time streams. Bulk imports are processed in parallel with real-time messages, which means marketing activations reliant on real-time updates are not affected.
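As a minimal sketch of keeping the backfill on its own stream, the helper below constructs a bulk CSV upload URL for a dedicated backfill stream. The endpoint path, the base URL, and the stream name used here are illustrative assumptions, not the confirmed Lytics API contract; consult the Lytics API reference for the exact endpoint.

```python
import urllib.parse

def build_bulk_csv_url(base, stream, params=None):
    """Construct an upload URL for a bulk CSV backfill stream.

    The /collect/bulk/ path shown here is an illustrative assumption;
    check the Lytics API documentation for the real endpoint."""
    url = f"{base}/collect/bulk/{urllib.parse.quote(stream)}"
    query = urllib.parse.urlencode(params or {})
    return f"{url}?{query}" if query else url

# Keep the backfill stream name distinct from the real-time stream,
# so backfill processing runs in parallel with real-time ingestion.
url = build_bulk_csv_url("https://api.lytics.io", "crm_backfill")
```

Using a clearly named backfill stream (rather than reusing the real-time stream name) also makes it easier to audit or reprocess the historical load later.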
Whenever possible, every message should carry an explicit timestamp. Although Lytics also timestamps each message at ingestion, an explicit message timestamp is helpful in all circumstances, particularly when messages are received out of order, because it tells Lytics which value is the most up to date. When a backfill runs concurrently with a real-time stream updating the same attribute, an explicit timestamp is essential.
All means of loading data permit specifying timestamps. Via the API, this is done with the timestamp_field URL parameter. In the Admin UI, data import configuration options include a menu for selecting a timestamp field from the file schema.
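To make the timestamp guidance concrete, the sketch below builds a small backfill CSV with an explicit per-row timestamp column and an upload URL whose timestamp_field parameter names that column. The column names, stream name, and endpoint path are illustrative assumptions; only the timestamp_field parameter itself comes from this document.

```python
import csv
import io
import urllib.parse

# A small backfill file with an explicit per-row timestamp column.
# Column names (email, channel_pref, updated_at) are illustrative.
rows = [
    {"email": "a@example.com", "channel_pref": "sms",
     "updated_at": "2021-03-04T12:00:00Z"},
    {"email": "b@example.com", "channel_pref": "email",
     "updated_at": "2021-03-05T09:30:00Z"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["email", "channel_pref", "updated_at"])
writer.writeheader()
writer.writerows(rows)

# Point the timestamp_field URL parameter at the timestamp column so
# Lytics can order these rows against concurrent real-time updates.
# The endpoint path below is an assumption; see the Lytics API docs.
query = urllib.parse.urlencode({"timestamp_field": "updated_at"})
upload_url = f"https://api.lytics.io/collect/bulk/crm_backfill?{query}"
```

Because each row names its own updated_at value, a backfill row that arrives after a fresher real-time update will not overwrite it.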
All messages imported into your Lytics account are stored in their raw form in addition to being represented as profile attributes in the graph. The purpose of storing all messages is to enable the reprocessing of those messages, a process called rebuilding. Rebuilding enables all messages ever received to be represented in different ways with different attributes, identity resolution rules, and so on.
All data ingested into Lytics incrementally increases the overhead of rebuilding, making it a longer and more processing-intensive operation. Therefore, before importing large amounts of data, consider the value of that data. If there is no clear use case for backfilling, consider skipping it.