Identity Stitching edge cases and solutions
This article describes the identity stitching (PS) issues / edge cases Analysts have faced during implementation. We hope that it can serve as a reference point on the possible solutions to PS.
Remove trailing and leading whitespaces, and use lower case for emails. PS will not run on NULL values, therefore we want to set empty strings to NULL as well.
NULLIF(LOWER(TRIM(payload ->> 'email')), '')
Sometimes you may find empty strings, string with whitespace, emails/ids that are ‘None’ or ‘NULL’ strings in the data. There are several ways to handle this:
- In PS rules, use trim() to remove leading and trailing whitespaces.
- In data preprocessing, use replace(' ','') or any equivalent methods (if using processors other than python) to remove any whitespace within the string.
- In data preprocessing, check your code to ensure that missing strings are not replaced with the None or NULL strings.
- If you have such cases in already loaded data you can avoid stitching these with condition similar to:
CASE WHEN lower(payload ->> 'id') in ('none', '0', 'null', '') THEN NULL ELSE rtrim(payload ->> 'id') END
rtrim/trim with a second parameter in Postgresql may cause unintended behavior. Example:
rtrim('123000.0', '.0') will yield
Casing - PS will treat firstname.lastname@example.org and Test@meiro.io as different entities!
In PS rules, use lower() to make the string lower case where applicable. Usually this applies to e-mail where it is not case sensitive. For case sensitive identifiers, this does not apply.
Validate your emails and phone numbers during data processing (example with python)
# email regular expression import re EMAIL_REGEX = r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)' email_pattern = re.compile(EMAIL_REGEX) if email_pattern.fullmatch(email): return email.strip().lower()
# phonenumbers package can parse phone numbers with country code and split it to country code and phone number import phonenumbers x = phonenumbers.parse(phone_number) event['payload']['country_code'] = x.country_code event['payload']['phone_number_validated'] = x.national_number
National Identification Numbers:
1. Identification numbers may be missing leading or trailing letters.
Fix: Validate identification numbers according to the client’s requirements and the country’s ID rules.
iOS devices often prohibit tracking and in this case device_id, advertising_id often will be in 0000-0000-0000... format. To fix this in the stitching rule, this can be done:
CASE WHEN regexp_extract(payload ->> 'device_id', '^[^1-9a-zA-Z]+$') is null THEN payload ->> 'device_id' ELSE NULL END
Otherwise, you can turn device_id, advertising_id that are invalid into null values during preprocessing. If existing data is already loaded, use the method mentioned above.
Issue 1: Physical/In-store events
Some clients have a physical store or have in-store events for customers. In most cases, a tablet is used by multiple customers for browsing, logging in and making purchases. This results in many different customers having the same
meiro user id because they used the same device, which causes large entities during PS
Fix based on filtering out geolocation or flags:
- Do not stitch
meiro user idfor these problematic events
- Add flags/geolocation fields (like IP addresses) and update PS rules to exclude the appropriate fields
-- in this example, the offline flag is used to show if the event was in-store (offline = '1') or online (offline = '0') CASE WHEN payload->'payload'->>'offline' = '1' THEN NULL ELSE payload->>'user_id' END
- Process change - allow for browsing on incognito mode only. Close and open the incognito window for each new customer. The action of closing and re-opening incognito windows will generate new
meiro user ids, ensuring that different customers will have different
meiro user ids.
- In Meiro Events, implement
MeiroEvents.resetIdentity()on specific triggers, like on a log out event. When
MeiroEvents.resetIdentity()method is called, the tracked
meiro user idwill be refreshed, so another customer using the same device will be tracked under a new
meiro user id.
Issue 2: Emails with same tracking ID opened by different users
Many Email Direct Marketing (EDM) platforms now decorate their emails links with an ID to track user's browsing activity. This ID is usually unique per email campaign and per email address. However, users may forward their email or email links containing this ID to other users, which causes all such users to be stitched together because they have the same email ID.
Fix by stitching only the first instance of an event:
- Edit PS rule to only stitch the FIRST instance of email ID, ignoring all subsequent events with the same email ID. This can be done by add specific rule to
cdp_ps.stitching_attributes_configfor the event you would like to fix.
- Example of the rule for Cockroach Database implementation
CASE -- a general check on the id to ensure it has 32 characters WHEN length(payload->'url_parts'->'query_parsed'->'sid'->>0) != 32 OR ( -- compares the identifier already stitched (identifiers), against the identifier (ARRAY [md5(...)::uuid]) -- if it has been stitched already (count is > 0), then do not stitch this identifier -- here, 'stitching_se_sid' is the attribute_id and -- event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload SELECT COUNT(*) FROM customer_events ce WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view' AND identifiers @> ARRAY [md5('stitching_se_sid' || (event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid] ) > 0 THEN NULL ELSE payload->'url_parts'->'query_parsed'->'sid'->>0 END
- Example of the rule for Postgres Implementation:
CASE -- a general check on the id to ensure it has 32 characters WHEN length(payload->'url_parts'->'query_parsed'->'sid'->>0)=32 THEN CASE -- compares the list of identifiers already stitched (m.identifiers and p.identifiers), against the identifier -- if it has been stitched already (count is > 0), then do not stitch this identifier -- here, 'stitching_se_sid' is the attribute_id and -- st.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload WHEN ( ( SELECT COUNT(*) FROM cdp_ps.matching m INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = m.customer_event_id WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view' AND m.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid] ) > 0 ) OR ( ( SELECT COUNT(*) FROM cdp_ps.prep p INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = p.customer_event_id WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view' AND p.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid] ) > 0 ) THEN NULL ELSE payload->'url_parts'->'query_parsed'->'sid'->>0 END END
Issue 3: Users in the same organization using the same device ID when testing or accessing their own apps/web services.
Some larger clients might have a lot of their own internal users using the same device ID when doing "live" tests on their own apps/web services. In this case, the users are all legitimate however, since so many users are sharing the same device ID, it can cause stitching to stitch all the events coming from all the users into a large entity.
Refer to this documentation for further explanations as to why a large entity is problematic, what is considered "large", and for a fix as it warrants its own page. This is only one of many cases that can cause a large entity but is worth noting for larger organizations that uses the CDP.
Fix by using exclusion lists of identifiers:
You can refer back to the example fixes in Issue 1, which is filtering by a flag that's added to the payload, or by certain IP addresses. If the list of IP addresses is huge or if you have a specific list of devices IDs to exclude, you can create an exclusion list in the
external_data table schema.
Here is an example implementation on how you make use of exclusion lists:
- Create a table in
external_datatable schema with a single column, indexed. The easiest way is to do something like this - in this example we'll use device id, but it can be anything:
CREATE TABLE external_data.device_id_exclusion_list AS ( identifier TEXT PRIMARY KEY )
- Insert data into the table created, with identifiers that you wish to exclude from stitching.
- Note that you need to apply the exclusion list to all rules for the stitching attribute you want to apply it to. In order to find out all the rules for a stitching attribute, you can use this query:
-- In this example, the rules are for device id -- This needs to be changed if you will be creating an exclusion list for other values, -- such as phone numbers or emails. WITH events_with_device_id AS ( SELECT event_id FROM stitching_attributes_config INNER JOIN events e ON stitching_attributes_config.event_id = e.id WHERE attribute_id = 'ps_device_id' ) SELECT event_id, source_id, type, version, attribute_id, value_expression FROM stitching_attributes_config INNER JOIN events e on stitching_attributes_config.event_id = e.id WHERE event_id IN (SELECT event_id FROM events_with_device_id);
- In stitching rules, add the exclusion list in conditions using
case.. when...expression eg. ( Note that the path to the stitching identifier in the payload can be different for each rule applied to each event type )
-- Example 1: filtering out device ids, so specific device ids can be excluded from stitching CASE WHEN (SELECT COUNT(*) FROM external_data.device_id_exclusion_list WHERE identifier = payload ->> 'device_id') > 0 THEN null ELSE payload ->> 'device_id' END -- Example 2: exclude stitching device ids for certain IP address or e-mail domains -- This example illustrates that the value used in the exclusion -- does not need to be the value used for the stitching attribute CASE WHEN (SELECT COUNT(*) FROM external_data.ip_address_exclusion_list WHERE identifier = payload ->> 'ip_address') > 0 THEN null ELSE payload ->> 'device_id' END
- Check presence of entities and events with excluded values.
The exclusion list is typically implemented after we have already done identity stitching for all identifiers, so there would already be customer entities that contain events with these excluded IDs, These customer entities need to be re-stitched.
This process consumes a lot of DB capacity, so it was optimized heavily.
First the list of entities containing excluded values will be exported to a separate table:
CREATE TABLE external_data.entities_with_excluded_device_ids (customer_entity_id uuid primary key, identifiers uuid); INSERT INTO external_data.entities_with_excluded_device_ids WITH unnested AS ( SELECT customer_entity_id, identifiers, unnest(mx_device_id_all) FROM customer_attributes WHERE mx_device_id_all IS NOT NULL ) SELECT customer_entity_id, identifiers FROM unnested INNER JOIN external_data.device_id_exclusion_list el ON el.identifier = unnested.device_id;
Then events of those entities are exported to a table in schema external_data.
CREATE TABLE external_data.events_with_excluded_device_ids (id uuid primary key ); INSERT INTO external_data.events_with_excluded_device_ids WITH unnested AS ( SELECT unnest(identifiers::uuid) AS identifier, customer_entity_id FROM external_data.entities_with_excluded_device_ids ) SELECT id FROM customer_events INNER JOIN unnested on customer_events.identifier = unnested.identifier ;
- After that, make use of the exported tables and follow this doc to restitch the affected events/entities. Follow along from step 1: First export event ids to separate table in external_data schema in the doc in the "Restitching" section of the doc. In this case, you have already created the table containing event ids to be restitched in the external_data schema.