Identity Stitching edge cases and solutions
This article describes identity stitching (PS) issues and edge cases that analysts have faced during implementation. It is intended as a reference for possible solutions.
General tips
Remove leading and trailing whitespace, and use lower case for emails. PS does not run on NULL values, so set empty strings to NULL as well.
NULLIF(LOWER(TRIM(payload ->> 'email')), '')
Sometimes the data contains empty strings, whitespace-only strings, or emails/ids that are the literal strings 'None' or 'NULL'. There are several ways to handle this:
- In PS rules, use trim() to remove leading and trailing whitespace.
- In data preprocessing, use replace(' ', '') or an equivalent method (if your processor is not Python) to remove any whitespace within the string.
- In data preprocessing, check your code to ensure that missing values are not replaced with the strings 'None' or 'NULL'.
- If such values are already in loaded data, you can avoid stitching them with a condition similar to:
CASE
WHEN lower(payload ->> 'id') in ('none', '0', 'null', '') THEN NULL
ELSE rtrim(payload ->> 'id')
END
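The preprocessing-side cleanup described in the list above can be sketched in Python. This is a minimal sketch, not part of any Meiro tooling: the helper name `normalize_identifier` and the `PLACEHOLDERS` set are assumptions for illustration, and the set should be adjusted to the junk values that actually appear in your data.

```python
# Hypothetical set of placeholder strings that should be treated as missing.
PLACEHOLDERS = {"", "none", "null", "0"}


def normalize_identifier(value):
    """Return a trimmed identifier, or None when it is missing or a placeholder."""
    if value is None:
        return None
    cleaned = str(value).strip()
    # Compare case-insensitively so 'None', 'NULL' and 'null' are all caught.
    if cleaned.lower() in PLACEHOLDERS:
        return None
    return cleaned
```

The helper keeps the original casing of valid identifiers, so lowercasing (for emails, for example) remains a separate, explicit step.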
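Warning: Using rtrim/trim with a second parameter in PostgreSQL may cause unintended behavior, because the second parameter is treated as a set of characters to remove, not as a literal suffix. Example: rtrim('123000.0', '.0') will yield 123, not 123000.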
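Python's `str.rstrip` also treats its argument as a set of characters rather than a literal suffix, so the same pitfall can show up during preprocessing. A small sketch of the difference; the helper name `strip_suffix` is an assumption for illustration:

```python
def strip_suffix(value, suffix):
    """Remove `suffix` once, and only when the string actually ends with it."""
    if suffix and value.endswith(suffix):
        return value[: -len(suffix)]
    return value


# rstrip removes every trailing '.' or '0' character, not the literal '.0'.
char_set_result = "123000.0".rstrip(".0")

# strip_suffix removes the literal suffix '.0' exactly once.
suffix_result = strip_suffix("123000.0", ".0")
```

Here `char_set_result` is `'123'`, while `suffix_result` is `'123000'`.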
Casing - PS will treat test@meiro.io and Test@meiro.io as different entities!
In PS rules, use lower() to make the string lower case where applicable. This usually applies to emails, which are not case sensitive. Do not apply it to case-sensitive identifiers.
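Validate your emails and phone numbers during data processing (example in Python):
# email regular expression
import re
EMAIL_REGEX = r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)'
email_pattern = re.compile(EMAIL_REGEX)

def validate_email(email):
    # normalize first so surrounding whitespace or casing does not fail the match
    email = email.strip().lower()
    # return the cleaned email, or None when it does not look like an email
    return email if email_pattern.fullmatch(email) else None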
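# the phonenumbers package can parse phone numbers that include a country code
# and split them into country code and national number
import phonenumbers

def add_validated_phone(event, phone_number):
    try:
        x = phonenumbers.parse(phone_number)
    except phonenumbers.NumberParseException:
        return event  # unparseable number: leave the validated fields unset
    if phonenumbers.is_valid_number(x):
        event['payload']['country_code'] = x.country_code
        event['payload']['phone_number_validated'] = x.national_number
    return event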
National Identification Numbers:
1. Identification numbers may be missing leading or trailing letters.
Fix: Validate identification numbers according to the client’s requirements and the country’s ID rules.
Device IDs:
iOS devices often prohibit tracking, in which case device_id and advertising_id are often sent in a zeroed-out 0000-0000-0000... format. To handle this in the stitching rule, you can use:
CASE
WHEN
regexp_extract(payload ->> 'device_id', '^[^1-9a-zA-Z]+$') is null
THEN payload ->> 'device_id'
ELSE NULL
END
Alternatively, you can turn invalid device_id and advertising_id values into null values during preprocessing. If the data is already loaded, use the stitching rule above.
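The preprocessing option can be sketched in Python with the same pattern as the stitching rule above. This is a minimal sketch under assumptions: the helper name `clean_device_id` is hypothetical, and empty or missing values are also mapped to None here, which the stitching rule itself does not do.

```python
import re

# Same pattern as the stitching rule: an ID made up only of characters other
# than 1-9 and letters (for example zeros and dashes) is a placeholder.
PLACEHOLDER_ID = re.compile(r"[^1-9a-zA-Z]+")


def clean_device_id(device_id):
    """Return the device ID, or None when it is missing or zeroed out."""
    if not device_id or PLACEHOLDER_ID.fullmatch(device_id):
        return None
    return device_id
```

The same helper can be applied to advertising_id, since both fields show the same zeroed-out format.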
Issue 1: Physical/In-store events
Some clients have a physical store or hold in-store events for customers. In most cases, a tablet is shared by multiple customers for browsing, logging in and making purchases. As a result, many different customers end up with the same meiro user id because they used the same device, which causes large entities during PS.
Fix based on filtering out geolocation or flags:
- Do not stitch meiro user id for these problematic events.
- Add flags/geolocation fields (like IP addresses) and update PS rules to exclude the appropriate fields.
-- in this example, the offline flag is used to show if the event was in-store (offline = '1') or online (offline = '0')
CASE
WHEN payload->'payload'->>'offline' = '1'
THEN NULL
ELSE payload->>'user_id'
END
- Process change: allow browsing in incognito mode only. Close and reopen the incognito window for each new customer. Closing and reopening incognito windows generates new meiro user ids, ensuring that different customers have different meiro user ids.
- In Meiro Events, implement MeiroEvents.resetIdentity() on specific triggers, such as a log out event. When the MeiroEvents.resetIdentity() method is called, the tracked meiro user id is refreshed, so another customer using the same device will be tracked under a new meiro user id.
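The flag-based fix above depends on each event carrying an offline flag. One way to set it during preprocessing is sketched below. This is only an illustration under assumptions: the `STORE_IPS` set, the `ip_address` payload field, the flat payload layout, and both helper names are hypothetical and need to be adapted to the actual event structure.

```python
# Hypothetical list of in-store tablet IP addresses (documentation-range examples).
STORE_IPS = {"203.0.113.10", "203.0.113.11"}


def flag_offline(event, store_ips=STORE_IPS):
    """Set payload['offline'] to '1' for in-store events and '0' otherwise."""
    payload = event.setdefault("payload", {})
    payload["offline"] = "1" if payload.get("ip_address") in store_ips else "0"
    return event


def stitchable_user_id(event):
    """Mirror of the stitching rule: no user id is returned for in-store events."""
    payload = event.get("payload", {})
    return None if payload.get("offline") == "1" else payload.get("user_id")
```

The second helper only mirrors the CASE expression so the flag can be checked in a unit test; the actual exclusion still happens in the PS rule.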
Issue 2: Emails with same tracking ID opened by different users
Many Email Direct Marketing (EDM) platforms now decorate the links in their emails with an ID to track users' browsing activity. This ID is usually unique per email campaign and per email address. However, users may forward the email, or links from it that contain this ID, to other users, which causes all such users to be stitched together because they share the same email ID.
Note: If you would like to implement this solution, please contact Meiro analysts.
Issue 3: Users in the same organization using the same device ID when testing or accessing their own apps/web services.
Some larger clients have many internal users who share the same device ID when doing "live" tests on their own apps/web services. These users are all legitimate, but because so many of them share the same device ID, stitching can merge the events from all of them into one large entity.
Refer to this documentation for further explanation of why a large entity is problematic, what is considered "large", and how to fix it, as the topic warrants its own page. This is only one of many cases that can cause a large entity, but it is worth noting for larger organizations that use the CDP.
Fix by using exclusion lists of identifiers:
You can refer back to the example fixes in Issue 1, which filter by a flag added to the payload or by certain IP addresses. If the list of IP addresses is huge, or if you have a specific list of device IDs to exclude, you can create an exclusion list in the external_data table schema.
Here is an example implementation that makes use of exclusion lists:
- Create a table in the external_data table schema with a single, indexed column. The easiest way is something like the following. This example uses device id, but it can be any identifier:
CREATE TABLE external_data.device_id_exclusion_list (
    identifier TEXT PRIMARY KEY
);
- Insert data into the table created, with identifiers that you wish to exclude from stitching.
- Note that you need to apply the exclusion list to all rules for the stitching attribute you want to apply it to. In order to find out all the rules for a stitching attribute, you can use this query:
-- In this example, the rules are for device id.
-- This needs to be changed if you will be creating an exclusion list for other values,
-- such as phone numbers or emails.
WITH events_with_device_id AS (
    SELECT event_id
    FROM stitching_attributes_config
    INNER JOIN events e ON stitching_attributes_config.event_id = e.id
    WHERE attribute_id = 'ps_device_id'
)
SELECT event_id, source_id, type, version, attribute_id, value_expression
FROM stitching_attributes_config
INNER JOIN events e ON stitching_attributes_config.event_id = e.id
WHERE event_id IN (SELECT event_id FROM events_with_device_id);
- In stitching rules, add the exclusion list in conditions using a case.. when... expression, e.g. (note that the path to the stitching identifier in the payload can be different for each rule applied to each event type):
-- Example 1: filtering out device ids, so specific device ids can be excluded from stitching
CASE
    WHEN (SELECT COUNT(*) FROM external_data.device_id_exclusion_list
          WHERE identifier = payload ->> 'device_id') > 0
    THEN null
    ELSE payload ->> 'device_id'
END

-- Example 2: exclude stitching device ids for certain IP address or e-mail domains
-- This example illustrates that the value used in the exclusion
-- does not need to be the value used for the stitching attribute
CASE
    WHEN (SELECT COUNT(*) FROM external_data.ip_address_exclusion_list
          WHERE identifier = payload ->> 'ip_address') > 0
    THEN null
    ELSE payload ->> 'device_id'
END
- Check presence of entities and events with excluded values.
The exclusion list is typically implemented after identity stitching has already run for all identifiers, so there will already be customer entities that contain events with these excluded IDs. These customer entities need to be re-stitched.
This process consumes a lot of DB capacity, so it was optimized heavily. First, the list of entities containing excluded values is exported to a separate table:
CREATE TABLE external_data.entities_with_excluded_device_ids (
    customer_entity_id uuid PRIMARY KEY,
    identifiers uuid[]
);

INSERT INTO external_data.entities_with_excluded_device_ids
WITH unnested AS (
    -- the alias is required so the join below can reference unnested.device_id
    SELECT customer_entity_id, identifiers, unnest(mx_device_id_all) AS device_id
    FROM customer_attributes
    WHERE mx_device_id_all IS NOT NULL
)
-- DISTINCT avoids primary key conflicts when one entity holds several excluded ids
SELECT DISTINCT customer_entity_id, identifiers
FROM unnested
INNER JOIN external_data.device_id_exclusion_list el ON el.identifier = unnested.device_id;
Then events of those entities are exported to a table in schema external_data.
CREATE TABLE external_data.events_with_excluded_device_ids (
    id uuid PRIMARY KEY
);

INSERT INTO external_data.events_with_excluded_device_ids
WITH unnested AS (
    SELECT unnest(identifiers::uuid[]) AS identifier, customer_entity_id
    FROM external_data.entities_with_excluded_device_ids
)
SELECT id
FROM customer_events
INNER JOIN unnested ON customer_events.identifier = unnested.identifier;
- After that, use the exported tables and follow this doc to restitch the affected events/entities. Start from step 1 ("First export event ids to separate table in external_data schema") in the "Restitching" section of that doc. In this case, you have already created the table containing the event ids to be restitched in the external_data schema.
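Before applying an exclusion rule to production data, it can help to check the CASE expression against a small sample. The sketch below does this in Python with an in-memory SQLite database standing in for PostgreSQL. Everything in it is an assumption for illustration: the tables have no external_data schema, device_id is a plain column instead of a JSON payload path, and the sample identifiers are made up.

```python
import sqlite3

# SQLite stands in for PostgreSQL; table and column names are simplified stand-ins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device_id_exclusion_list (identifier TEXT PRIMARY KEY);
    CREATE TABLE events (id INTEGER PRIMARY KEY, device_id TEXT);
    INSERT INTO device_id_exclusion_list VALUES ('shared-tablet-1');
    INSERT INTO events (id, device_id) VALUES
        (1, 'shared-tablet-1'),
        (2, 'personal-phone-9');
""")

# Same CASE WHEN shape as the stitching rule: excluded ids become NULL.
rows = conn.execute("""
    SELECT id,
           CASE
               WHEN (SELECT COUNT(*) FROM device_id_exclusion_list
                     WHERE identifier = events.device_id) > 0
               THEN NULL
               ELSE device_id
           END AS stitching_value
    FROM events
    ORDER BY id
""").fetchall()
conn.close()
```

In this sample, event 1 uses an excluded device id and gets a NULL stitching value, while event 2 keeps its device id.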