How to do Quality Checking on Identity Stitching

Often times, it is not easy to see problems with identity stitching until it has finished running on a large portion of the dataset. This is why it is necessary to do quality checks on the data before loading the entire dataset. This article explains how to do so.

Before looking into the database though, try looking at the Analytics tab on Business Explorer about suspicious profiles as an initial check. There, suspicious profile entities with events >= 10,000 will be listed down, which can help identify if the stitching rules in place are over-stitching customer events together, and you can drill down into the database with those particular customer entities to investigate. Look into the edge cases to see what situations might arise that lead to problematic stitching rules.

In general, these are the steps you want to follow:

Load sample data: load 1 week of data (adjust this to be more or less depending on the data volume) from all sources
Set up identity stitching as per defined from the identity stitching graph
Run the set of queries below
Summarize the output for review
Adjust the stitching rules or add additional cleaning steps during data processing if found to be necessary. Refer to things to look out for in identity stitching rules here.

List of Queries

Number of stitched entities

- A rough gauge on number of customers

Average number of events per entity

- Dependent on the type of events

SELECT sum(customer_events_count)::numeric / COUNT(*)::numeric as event_num_avg
FROM public.customer_attributes;

Check the largest entities

- 100 biggest entities for false positive stitching and catching edge cases. For example, customers stitched together because their emails are blank or whitespace, or generic emails like abc@gmail.com, foo@bar.com are used.

SELECT customer_entity_id, customer_events_count FROM public.customer_attributes ca
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;

Number of entities from different sources

- Check if the number of entities make sense. For example, entities from web traffic should be > EDM sources > Order transactions

SELECT source_id, count(*)
FROM (
	SELECT customer_entity_id, unnest(sources) source_id
	FROM customer_attributes ca 	
     ) foo
GROUP BY 1;

Number of unique identifiers per entity

- Ideally, each entity should have 5 or less of each identifier (5 devices, 5 emails, 5 phone numbers).

- The attributes in this query like mx_email_list and mx_meiro_user_id_list needs to be built beforehand, according to your stitching rules, they do not come by default. This query will take some time to run.

select 
customer_entity_id, 
customer_events_count, 
array_length(mx_email_list, 1) as emails,
array_length(mx_meiro_user_id_list, 1) as meiro_user_ids
from customer_attributes ca 
group by customer_entity_id
order by customer_events_count desc
limit 10;

Histogram of events per customer entity

SELECT customer_events_count, count(*) customer_entities
FROM (
      SELECT customer_entity_id, customer_events_count
      FROM customer_attributes ca
      GROUP BY 1
     ) foo
GROUP BY 1
ORDER BY 1

Check if all events have identifiers based on each stitching category

Run this query to generate a list of queries for each stitching category:

select concat('-- source_id: ', source_id, ' type:', type, chr(10), 'with events as (select * from customer_events where event_id = ''', event_id::text ,''' limit 20) select ', value_expression ,' as ', category_id, ' from events;') from stitching_attributes_config
left join events on events.id = stitching_attributes_config.event_id
order by event_id, category_id;

Run the generated queries on the database to do a basic check for if each stitching category is behaving as you would expect ( the values are empty, returning text values like "undefined", "null" or any such edge cases. )

Edge cases

A lot of the time, there are problems that cannot be identified by these basic queries, for these edge cases, check this article.

Meiro Implementation Training

SOPs for Project Implementation

Meiro Customer Data Platform - Software Limits & Requirements

How to Request for FBM Access to Clients' Accounts

How to set up MI and ME API connection in CDP

Data Audit Checklist

Format events

Transform events (CR)

Load events to CDP

Load events to CDP (CR)

Customer Entity Retention Principles

How to Implement Meiro Events in MI and CDP and set up alerts for Meiro Events

Identity Stitching Setup

Identity Stitching rules

How to do Quality Checking on Identity Stitching

Identity Stitching edge cases and solutions

How to rerun identity stitching for all/some entities/events and recalculate attributes

Problems with large customer entities and how to investigate and solve them

Description

Attribute data types and sample values

Compound attributes

How to create context queries that require information of every customer entity

How to troubleshoot when there are no attribute values

How to delete attributes correctly

How to set up attributes in Meiro Business Explorer

How to generate simulated attributes values for demo purpose

Tutorial: How to set up RFM attributes (WIP)

Set export destinations

Facebook Custom Audience destination - upsert

The Trade Desk export destination

Google Analytics Destination

Criteo Destination

Twitter Ads Audience Destination

Google Analytics 4 Destination

Facebook Conversion API Implementation

How to find credentials for GA4 destination

Step-by-step guide for creating reports in Metabase

Set Up Business Explorer Instance & QA

Workspace documentation

Workspaces Monitoring Alert & Analyst Duty

Best Practices for Insights Creation

Component Testing and Documentation guide

Best Practices for Attributes Creation

Best Practices when setting up Destinations

Tutorial: How to implement product feed

Project Handover Procedures

Project Management Handbook

Project Management FAQ

How PMs work with Jira

How to do Quality Checking on Identity Stitching

List of Queries

No Comments