Engagement BigQuery
Engagement BigQuery is a package consisting of repeated data exports from the Bloomreach Engagement application to a Google BigQuery dataset, to which you are granted read access.
Feature for advanced users
Engagement BigQuery is aimed at advanced users who have solid experience with Google BigQuery, SQL, big data, and data warehouse solutions.
Engagement BigQuery (EBQ, formerly called Long Term Data Storage) is petabyte-scale data storage in Google BigQuery. It provides a flexible, secure, and scalable infrastructure to house your data in an Engagement-like structure. Frequent data updates ensure that your data is always available on demand for custom analytics using your own BI tools. The scalable design easily handles the increased data ingestion needs of your growing business. EBQ ensures first-class care of all the technical and security aspects connected with your data storage.
Engagement BigQuery is very different from BigQuery Integration, which is used for importing data from any BigQuery instance that you are using.
Engagement BigQuery can store any events or user properties. Currently, events are loaded from Bloomreach Engagement into BigQuery roughly every 6 hours, and customers are loaded once a day. The data is accessible through the BigQuery Console, but you can also access it using a GCP service account with the BigQuery client libraries in supported programming languages. GCP service accounts are not created automatically; if you need this kind of access, request it and we will create it for you. In both cases, we provide read-only access to the data; you cannot edit it. Deleting customer profiles from Bloomreach Engagement also deletes them from Engagement BigQuery (this includes customer IDs and customer attributes). Events, however, are not deleted automatically.
The data retention period in EBQ is a 10-year rolling window: 9 years in the past and 1 year in the future. As event data falls beyond its 9-year mark, it is deleted from all your EBQ event-related tables. Any incoming data timestamped more than 1 year in the future is discarded and will never be loaded into EBQ.
Google account needed
A Google Identity (user account or service account) is required to access the Google Console and the BigQuery Console. If you do not already have a Google account, it is recommended that you create your own GSuite for Business. It is, however, possible to create a Google account with an already existing email address (e.g. a company email) - read how to do that here.
Frequency of data-update
The data in Engagement BigQuery updates once a day for customer attributes and IDs, and four times a day for events.
Benefits
- Highly scalable infrastructure will always meet your demands
- Engagement-like data structure makes it easy to understand your data in 3rd party tools
- Advanced security and threat detection
- 10-year data retention - store your data for a 10-year rolling window, 9 years in the past and 1 year in the future
- Access management - distribute read access rights across your teams
- Usage-based pricing - packages based on the expected volume of stored and processed data
- Easily work with your data using Google BigQuery and 3rd party applications
Getting Started
To enable EBQ in your project, please contact your Customer Success Manager.
Accessing Engagement BigQuery
There are several ways to access your EBQ data loaded in BigQuery.
- Use the web-based BigQuery console. You can use it to write, store, and execute custom SQL queries.
Trouble finding the dataset
If you do not see the dataset after receiving access from your CSM, make sure you are not logged into another Google account in your browser and try loading the BigQuery console again.
- Use any of the modern BI tools that provide a connector to BigQuery, e.g. Tableau, Power BI, Qlik, or Google Data Studio, to create reports and analyses using your EBQ data.
- Install and set up JDBC/ODBC drivers for Google BigQuery to facilitate access by tools that do not have a native BigQuery connector (e.g. ETL tools).
- Access the Google BigQuery API using Google Cloud Client Libraries.
User Management
Google Groups
Your CSM is responsible for managing users and access rights to your Google BigQuery dataset through a Google Group. Only group members can access your BigQuery data. Invitations to the Google Group can be ignored; your CSM adds members directly, so you do not need to access the Group itself and will have access to the data via the Google BigQuery Console.
User Accounts
For the purposes of data analysis in the BigQuery console and for BI tools access, you will use your Google user account.
Service Accounts
For the purposes of BigQuery access using an API, a service account will be created for you.
Permissions
Clients are granted read-only access to BigQuery datasets within the Bloomreach-owned Google Cloud Project and permission to run jobs. Users and service accounts are granted the “BigQuery Job User” role on the project level and the “BigQuery Data Viewer” role on the dataset level. More details on which exact permissions these roles include can be found in the official BigQuery docs - BigQuery predefined IAM roles. We grant access to the BigQuery API (not the BigQuery Storage API).
User Access and Security Policy
Bloomreach has implemented a policy for managing user access to EBQ to enhance system security and efficiency. Accounts inactive for more than 180 days undergo review and may be revoked. This action is vital for system integrity and protection against security threats, including unauthorized access. We engage in this process with our customers, ensuring transparency and compliance with all relevant regulations. Customers are informed about deactivation reasons and their consent is obtained before any action. This approach not only enhances security but also aligns our actions with customer expectations and legal requirements.
Data Schema
EBQ is a set of tables in the Google BigQuery (GBQ) dataset that is kept up to date using regular loads. A BigQuery project always contains 2 types of tables:
Events tables | One table for each event type (e.g. session_start, item_view, etc.) |
Customers tables | customers_properties, customers_id_history (history of all customer merges), customers_external_ids (mapping between internal and external customer IDs) |
Event Tables
Event tables are loaded incrementally, which means that new rows are added to the tables during every load.
Each Event table contains tracked data in the same structure as in the application (see Data Manager > Events to understand the expected data structure). All tracked data are stored as a “properties” record and can be accessed in SQL in the following way:
SELECT properties.utm_campaign FROM
`gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start`
The first three columns in each Event table are:
internal_customer_id | An ID of the customer that can be used to join with the Customer tables |
ingest_timestamp | Timestamp of when a given event was processed by Bloomreach Engagement (first seen; time of receiving by the tracking API or time of processing by Imports) |
timestamp | Timestamp of when a given event actually happened (user-defined business timestamp) |
To query those fields, no prefix needs to be used:
SELECT internal_customer_id FROM
`gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start`
Customer Tables
Customer tables are loaded using a full load. During each daily load, the tables are reloaded from scratch, so they always contain the latest information about the customers.
customers_properties
The main customer table is customers_properties. The structure is the same as defined in the application (see Data Manager > Customer Properties to understand the expected data structure). All tracked data are stored as a “properties” record and can be accessed in SQL in the following way:
SELECT properties.last_name FROM
`gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.customers_properties`
customers_id_history
This table contains a history of customer merging. past_id is the ID of the customer that was merged into another customer, and internal_customer_id is the ID of the customer with which the past_id was merged. Until customers are merged, there is no record in this table.
customers_id_history is missing
The customers_id_history view is created only once there are customer merges in the exported project.
customers_external_ids
This table maps each internal_customer_id to all available external_customer_ids. The value of the external_customer_id is stored in id_value, and the type of the external_customer_id (e.g. cookie, email) is stored in id_name.
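For example, assuming your project tracks an email ID (the available id_name values depend on your project's ID setup), you could list the email for each internal customer ID like this:

```sql
-- Example: list the email external ID for each internal customer ID.
-- The id_name value 'email' is an assumption; use the ID names defined in your project.
SELECT internal_customer_id, id_value AS email
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.customers_external_ids`
WHERE id_name = 'email'
```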
Properties schema
Both event and customers_properties tables contain the properties field, which is a BigQuery nested record (struct). Fields in this record have types based on the types of properties specified in the Bloomreach Engagement application (Data Manager > Events and Data Manager > Customer Properties).
The conversion is done in the following way:
Bloomreach Engagement type | BigQuery type |
---|---|
number | NUMERIC |
boolean | BOOLEAN |
date* | TIMESTAMP |
datetime* | TIMESTAMP |
list** | STRING |
other | STRING |
* date and datetime properties will be converted correctly only if their value is a Unix timestamp in seconds.
** The list type is stored in BigQuery in a JSON-serialized form.
In addition to the properties field, customer and event tables also contain a raw_properties field. Properties in the raw_properties field are not converted according to the Bloomreach Engagement schema; all of them are of the BigQuery STRING type. This is useful for cases when the conversion in properties doesn't return the expected results (the returned value is null).
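As a sketch, you can compare the converted and raw values to find rows where the conversion failed (utm_campaign is reused here from the earlier example; substitute a property from your own schema):

```sql
-- Find rows where the typed conversion produced NULL but a raw value exists
SELECT raw_properties.utm_campaign AS raw_value
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start`
WHERE properties.utm_campaign IS NULL
  AND raw_properties.utm_campaign IS NOT NULL
```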
Data Naming
See any EBQ naming changes in the _system_mapping table.
Data Queries
To get query results faster and cheaper, it is recommended to use partitioned tables in BigQuery. Timestamp partitioning is available in your project.
Timestamp partitioning is done using the business timestamp, which is the time when the event was tracked. See the examples:
Notice the difference in the volume of data to be processed (indicated in the bottom right part of the screenshots).
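As a minimal sketch (the date range is illustrative), restricting the query to a range of business timestamps lets BigQuery prune partitions and scan less data:

```sql
-- Filtering on the partitioning column (the business timestamp)
-- limits the scan to the matching partitions only
SELECT COUNT(*) AS sessions
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start`
WHERE timestamp >= TIMESTAMP('2024-01-01')
  AND timestamp <  TIMESTAMP('2024-02-01')
```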
Load monitoring
To enable monitoring of the load process, there is a table _system_load_log in each dataset. The main benefit of the _system_load_log table is that it can act as a trigger for further data exports from your EBQ dataset. The common use case is to wait until you see an update in _system_load_log and then export new and updated data to another system. If you need to know how fresh the newly exported event data is, use max(ingest_timestamp).
For example, to find out the last time the session_start event type was loaded, execute the following query:
SELECT tabs, timestamp
FROM `gcloudltds.exp_4970734e_9ed3_11e8_b57b_0a580a205e7b_views._system_load_log`
CROSS JOIN unnest(tables) as tabs
WHERE tabs in('session_start')
ORDER BY 2 DESC
LIMIT 1
Here is a slightly more complex example showing the last 4 load timestamps for tables that have not been updated in the last 3 days. It can be used, for example, to display event tables that have not been tracked into the Bloomreach Engagement app for some period of time.
SELECT
loaded_table_name,
ARRAY_AGG(timestamp ORDER BY timestamp DESC LIMIT 4) as latest_updates -- limit to only X latest update timestamps for each table
FROM `gcloudltds.exp_c1f2061a_e5e6_11e9_89c3_0698de85a3d7_views._system_load_log`
LEFT JOIN UNNEST(tables) AS loaded_table_name
GROUP BY loaded_table_name
HAVING EXISTS (
SELECT 1
FROM UNNEST(latest_updates) AS table_timestamp WITH OFFSET AS offset
WHERE offset = 0 and DATETIME(table_timestamp) < DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 3 DAY) -- show only table names where the latest update timestamp is older than 3 days
)
ORDER BY loaded_table_name ASC
Best practice
Because unexpected delays in data processing may occur (especially during seasonal peaks), you should not rely on new data being exported and available in EBQ at a specific time.
Instead, monitor the _system_load_log table and trigger dependent queries and processing scripts only after there is a log entry confirming that data has been successfully exported to EBQ. If you run scripts at a fixed time, the data may not have been exported to EBQ yet and your scripts would run on an incomplete dataset.
Deleted events
Data in the event tables in BigQuery is loaded incrementally and never deleted, even if you delete some data in Bloomreach Engagement. If you delete events in Bloomreach Engagement, these will not be deleted from EBQ, meaning you will end up with BigQuery data that is not completely consistent with the app. EBQ was developed to serve as long-term storage with a 10-year rolling window of data retention, 9 years in the past and 1 year in the future, which means it retains data even after deletion within the app. If you want to exclude invalid, bugged, or redundant event data (which you deleted in the app), you need to keep track of which data has been deleted and set up an additional filter in your reporting based on EBQ data.
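The filtering approach described above can be sketched as follows; the my_project.audit.deleted_event_log table and its columns are hypothetical, standing in for whatever record of deletions you maintain yourself:

```sql
-- Sketch: exclude events you know were deleted in the app.
-- `my_project.audit.deleted_event_log` is a hypothetical table you maintain,
-- listing the time ranges of event data removed in Bloomreach Engagement.
SELECT s.*
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start` s
LEFT JOIN `my_project.audit.deleted_event_log` d
  ON s.timestamp BETWEEN d.deleted_from AND d.deleted_to
WHERE d.deleted_from IS NULL  -- keep only events outside the deleted ranges
```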
Please contact your CSM in case you need help or further clarification.
Deletion vs anonymization
Since you cannot rely on data deletion being mirrored from the app to EBQ, when fulfilling a data removal request (GDPR/CCPA) you should use the anonymization functions in the app. Anonymization removes personal data from EBQ as well; see Anonymizing customers.
Table _system_delete_event_type
In releases before version 1.160 (Nov 2019), information about deleted data was stored in EBQ in the table _system_delete_event_type. You could use this table to filter only data that had not been deleted from Bloomreach Engagement. However, this table is no longer updated since the release of the Delete events by filter feature.
Merging customers
Every time a customer uses a different browser or device to visit your website, they are considered a separate, unrelated entity. However, once the customer identifies themselves (for example, by registering or logging into their account), those 2 profiles are merged into one. In this way, customer activity can be tracked across multiple browsers and devices.
Until the customers are merged, there is no record about the first or the second customer in the customers_id_history table.
At the moment of the merge, the following information is stored in the table:
internal_customer_id | customer_on_device_1 |
past_id | customer_on_device_2 |
This means that the customer that was tracked on device 2 (customer_on_device_2) is merged into the customer on device 1 (customer_on_device_1) and both customers together are now considered as a single merged customer.
It is important to work with merged customers when you analyze the event data to get all events generated by the customer_on_device_2 and customer_on_device_1 assigned to a single customer. Use the following query to work with merged customers:
SELECT IFNULL(b.internal_customer_id, a.internal_customer_id) AS merged_id,
a.internal_customer_id AS premerged_id
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start` a
LEFT JOIN `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.customers_id_history` b
ON a.internal_customer_id = b.past_id
For each internal_customer_id in the event table (session_start in this case), if there is a merge available in customers_id_history, the internal_customer_id will be mapped to the merged customer. As a result, analysis can now be done on merged_id: all session_start events created historically by either customer_on_device_2 or customer_on_device_1 will have merged_id = customer_on_device_1.
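Building on the query above, you can then aggregate per merged customer, for example counting session_start events:

```sql
-- Count session_start events per merged customer
SELECT IFNULL(b.internal_customer_id, a.internal_customer_id) AS merged_id,
       COUNT(*) AS session_count
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start` a
LEFT JOIN `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.customers_id_history` b
  ON a.internal_customer_id = b.past_id
GROUP BY merged_id
```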
Anonymizing customers
When you anonymize a customer in Bloomreach Engagement, the data is also anonymized in the BigQuery. This is done in the following steps:
1. During anonymization through the app or API, an anonymization event is tracked for the customer.
2. The anonymization event is loaded into BigQuery during a periodic data update along with all the other events.
3. All the new anonymization events are queried.
4. Anonymization is done by an UPDATE query for all the anonymization events queried in step 3 which contained a customer_ID. This means that all the private event properties tracked for the particular customer IDs will be set to NULL.
5. ltds_anonymization_timestamp is tracked for all the anonymized events with the timestamp of the moment when the anonymization went through in BigQuery.
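As a hedged sketch, assuming ltds_anonymization_timestamp is exposed as a column on the event tables as described above, you could exclude anonymized events from an analysis like this:

```sql
-- Exclude events whose properties were nulled out by anonymization.
-- Assumption: ltds_anonymization_timestamp is a column on the event table
-- and is NULL for events that were never anonymized.
SELECT internal_customer_id, timestamp, properties.utm_campaign
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.session_start`
WHERE ltds_anonymization_timestamp IS NULL
```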
Exports or recalculation of data with Scheduled queries
With Engagement BigQuery, you can periodically run simple or complex queries and store the results in another table within your own BigQuery project. Afterward, you can access and manipulate these results or use a BI tool, such as Tableau, to report on the data within this table.
The prerequisites of this are:
- You must have your own Google Cloud Platform (GCP) project with BigQuery (BQ) enabled
- You must have the BigQuery Data Transfer API enabled
- As the person configuring this, you must have write access to your BQ project and read access to EBQ
To configure scheduled queries, follow this guide:
- Open the relevant GCP and ensure that the active project is the desired GCP project and not GCloudLTDS (EBQ)
- Create a new dataset in the BQ at the same location as EBQ (EU)
- Create a new Scheduled query, as shown in the images below
For the export use case, we recommend querying the raw_properties fields in customers_properties and event tables, as these always have the same data type, independent of the Bloomreach Engagement app's Data Management settings.
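For example, since raw_properties values are always STRING, you can cast them explicitly in your scheduled query (the total_price field and the purchase table are hypothetical; substitute fields from your own schema):

```sql
-- Cast raw STRING values to stable types for export;
-- SAFE_CAST returns NULL instead of failing on unparseable values
SELECT internal_customer_id,
       SAFE_CAST(raw_properties.total_price AS NUMERIC) AS total_price
FROM `gcloudltds.exp_558cba14_8a46_11e6_8da1_141877340e97_views.purchase`
```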
More resources can be found on official Google BigQuery documentation:
- https://cloud.google.com/bigquery/docs/writing-results
- https://cloud.google.com/bigquery/docs/scheduling-queries