How to do Quality Checking on Identity Stitching
Often times, it is not easy to see problems with identity stitching until it has finished running on a large portion of the dataset. This is why it is necessary to do quality checks on the data before loading the entire dataset. This article explains how to do so.
Before looking into the database though, try looking at the Analytics tab on Business Explorer about suspicious profiles as an initial check. There, suspicious profile entities with events >= 10,000 will be listed down, which can help identify if the stitching rules in place are over-stitching customer events together, and you can drill down into the database with those particular customer entities to investigate. Look into the edge cases to see what situations might arise that lead to problematic stitching rules.
In general, these are the steps you want to follow:
- Load sample data: load 1 week of data (adjust this to be more or less depending on the data volume) from all sources
- Set up identity stitching as per defined from the identity stitching graph
- Run the set of queries below
- Summarize the output for review
- Adjust the stitching rules or add additional cleaning steps during data processing if found to be necessary. Refer to things to look out for in identity stitching rules here.
List of Queries
Number of stitched entities
- A rough gauge on number of customers
Average number of events per entity
- Dependent on the type of events
SELECT sum(customer_events_count)::numeric / COUNT(*)::numeric as event_num_avg FROM public.customer_attributes;
Check the largest entities
- 100 biggest entities for false positive stitching and catching edge cases. For example, customers stitched together because their emails are blank or whitespace, or generic emails like firstname.lastname@example.org, email@example.com are used.
SELECT customer_entity_id, customer_events_count FROM public.customer_attributes ca GROUP BY 1 ORDER BY 2 DESC LIMIT 100;
Number of entities from different sources
- Check if the number of entities make sense. For example, entities from web traffic should be > EDM sources > Order transactions
SELECT source_id, count(*) FROM ( SELECT customer_entity_id, unnest(sources) source_id FROM customer_attributes ca ) foo GROUP BY 1;
Number of unique identifiers per entity
- Ideally, each entity should have 5 or less of each identifier (5 devices, 5 emails, 5 phone numbers).
- The attributes in this query like
mx_meiro_user_id_list needs to be built beforehand, according to your stitching rules, they do not come by default. This query will take some time to run.
select customer_entity_id, customer_events_count, array_length(mx_email_list, 1) as emails, array_length(mx_meiro_user_id_list, 1) as meiro_user_ids from customer_attributes ca group by customer_entity_id order by customer_events_count desc limit 10;
Histogram of events per customer entity
SELECT customer_events_count, count(*) customer_entities FROM ( SELECT customer_entity_id, customer_events_count FROM customer_attributes ca GROUP BY 1 ) foo GROUP BY 1 ORDER BY 1
Check if all events have identifiers
When both cdp_ps.customer_events_to_process and cdp_ps.prep are empty: check count(*) from cdp_ps.matching = count(*) from cdp_ce.customer_events
A lot of the time, there are problems that cannot be identified by these basic queries, for these edge cases, check this article.