
Problems with large customer entities and how to investigate and solve them

Problems with large customer entities

When a customer entity grows too large, it has the following impact:

  1. Some attributes will be missing values.
  2. Some customers will be unsearchable by attributes such as e-mail, due to 1.
  3. Some customers will not be segmentable, due to 1.
  4. Some customers will not be included in exports, due to 1.

The exact reason is as follows:

While processing attributes, the algorithm only considers the latest 10,000 customer events. This prevents the process from getting stuck recalculating a single entity while other entities wait: each attribute is calculated over all events that belong to an entity, so recalculating the attributes of an entity with more than 10,000 events would take prohibitively long.

On top of that, there is a limit of 250 values per attribute per entity. This limit exists to keep search results fast during segmentation, which uses OpenSearch. It is very unlikely for an attribute to have more than 250 values unless this large-entity scenario occurs, which is why the limit is set to 250.
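
As a rough check for entities likely to be hitting this cap, counting distinct stitching identifier values per entity can act as a proxy. A minimal sketch, assuming the same cdp_ce / cdp_ps tables used by the example queries below (actual attribute values are computed by the attribute pipeline, so this only approximates them):

select
  m.customer_entity_id,
  -- distinct stitching identifier values as a proxy for attribute values
  count(distinct coalesce(payload->>'stitching_identifier',
                          payload->'payload'->>'identifier')) as distinct_values
from cdp_ce.customer_events ce
left join cdp_ps.matching m on ce.id::uuid = m.customer_event_id
group by m.customer_entity_id
having count(distinct coalesce(payload->>'stitching_identifier',
                               payload->'payload'->>'identifier')) > 250
order by distinct_values desc;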

Investigating large customer entities

Other than the QA checks that you can do here, you can follow these steps as a guide when investigating a customer entity that has grown too large:

  1. Referring to the QA checks, first obtain a list of the largest customer entities.
  2. After that, check which identifiers contribute the most events, i.e. query for stitching identifiers in the largest customer entities and count them per entity. (Note that this is different from the QA check "Number of unique identifiers per entity": here we want the number of events a single identifier appears in, rather than the number of distinct identifier values.)
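
Example query for 1 (a minimal sketch, assuming the same cdp_ce / cdp_ps tables used in the query for 2 below; the exact QA query may differ):

select
  m.customer_entity_id,
  count(*) as event_count
from cdp_ce.customer_events ce
left join cdp_ps.matching m on ce.id::uuid = m.customer_event_id
group by m.customer_entity_id
-- entities above the 10,000-event attribute-processing threshold
having count(*) > 10000
order by event_count desc
limit 20;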

Example query for 2:

Cockroach and PostgreSQL CDP versions (the query below is identical in both):

select
  m.customer_entity_id,
  case when event_id in (
    -- list of event ids that apply the stitching rule
  ) then
    -- example stitching rule(s):
    coalesce(payload->>'stitching_identifier', payload->'payload'->>'identifier')
  end as
    -- example stitching identifier id:
    stitching_identifier_id,
  -- number of events carrying this identifier within the entity
  count(*) as event_count
from cdp_ce.customer_events ce
left join cdp_ps.matching m on ce.id::uuid = m.customer_event_id
where m.customer_entity_id in (
  -- list of the largest entities
)
group by 1, 2
order by event_count desc;

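When reading the results, an identifier value that accounts for a disproportionately large share of an entity's events is usually the one driving the unwanted stitching, and is the first candidate to examine against the causes below.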

Example problems and causes

Remember that profile stitching is only as reliable as the least unique identifier across all of the stitching rules, so most of the time problems arise because an identifier is not as unique as it was assumed to be. Drawing from previous real-world examples can help in finding the root cause of large entities, so here is a list of reasons, based on our experience, why an entity can grow so large:

1. A device identifier is used as a stitching rule (refer to issue 3 in this doc), and many users, or even test/bot users, used the same device to perform actions on the website/app, which caused the profile stitching process to assume the events were all coming from the same customer entity.
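
To confirm this pattern, a variation of the example query above can count events per identifier value inside the suspect entities (a sketch under the same schema assumptions; a single device identifier value dominating the counts points to this cause):

select
  -- example stitching rule(s):
  coalesce(payload->>'stitching_identifier',
           payload->'payload'->>'identifier') as identifier_value,
  count(*) as event_count
from cdp_ce.customer_events ce
left join cdp_ps.matching m on ce.id::uuid = m.customer_event_id
where m.customer_entity_id in (
  -- list of the largest entities
)
group by 1
order by event_count desc
limit 50;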

Solution to 1:

There are multiple ways to solve this, depending on the data that is available, or can be made available, to you.