How to do Quality Checking on Profile Stitching
Often times, it is not easy to see problems with profile stitching until it has finished running on a large portion of the dataset. This is why it is necessary to do quality checks on the data before loading the entire dataset. This article explains how to do so.
In general, these are the steps you want to follow:
- Load sample data: load 1 week of data (adjust this to be more or less depending on the data volume) from all sources
- Set up profile stitching as per defined from the profile stitching graph
- Run the set of queries below
- Summarize the output for review
- Adjust the stitching rules or add additional cleaning steps during data processing if found to be necessary. Refer to things to look out for in profile stitching rules here.
List of Queries
Number of stitched entities
- A rough gauge on number of customers
Average number of events per entity
- Dependent on the type of events
SELECTselect customer_events,sum(customer_events_count)::numeric customer_entities, (customer_events::numeric/customer_entities::numeric) as customer_events_per_customer_entity
FROM (
SELECT/ count(*)::numeric as customer_events,event_num_avg
count(distinctfrom customer_entity_id) as customer_entities
FROM cdp_ps.matching
) t;public.customer_attributes;
Check the largest entities
- 100 biggest entities for false positive stitching and catching edge cases. For example, customers stitched together because their emails are blank or whitespace, or generic emails like abc@gmail.com, foo@bar.com are used.
SELECTselect customer_entity_id, count(*) as customer_events_count FROMfrom cdp_ps.matchingpublic.customer_attributes GROUPca
BYgroup by 1
ORDERorder BYby 2 DESC;desc
limit 100;
Number of entities from different sources
- Check if the number of entities make sense. For example, entities from web traffic should be > EDM sources > Order transactions
SELECT source_id, count(*)
FROM (
SELECT customer_entity_id, unnest(sources) source_id
FROM cdp_attr.customer_entities ce
GROUP BY 1
) foo
GROUP BY 1;
Number of unique identifiers per entity
- Ideally, each entity should have 5 or less of each identifier (5 devices, 5 emails, 5 phone numbers)
- Please change the SELECT statement to cater to your stitching rules. This query will take some time to run.
SELECT customer_entity_id,
count(1) num_events,
count(DISTINCT payload->>'user_id') unique_meiro_user_ids,
count(DISTINCT coalesce(nullif(lower(trim(payload->>'email')), ''), nullif(lower(trim(payload->'payload'->>'email')), ''))) unique_emails,
count(DISTINCT coalesce(nullif(lower(trim(payload->>'phone_validated')), ''), nullif(lower(trim(payload->'payload'->>'phone_validated')), ''))) unique_phone
FROM public.customer_events ce
GROUP BY customer_entity_id
ORDER BY count(1) DESC;
Histogram of events per customer entity
SELECT num_events, count(*) num_customers
FROM (
SELECT customer_entity_id, count(1) num_events
FROM cdp_ps.matching
GROUP BY 1
) foo
GROUP BY 1
ORDER BY 1;
Check if all events have identifiers
When both cdp_ps.customer_events_to_process and cdp_ps.prep are empty: check count(*) from cdp_ps.matching = count(*) from cdp_ce.customer_events
Edge cases
A lot of the time, there are problems that cannot be identified by these basic queries, for these edge cases, check this article.