ePrivacy and GPDR Cookie Consent by Cookie Consent Skip to main content

Profile Stitching edge cases and solutions

This article describes the profile stitching (PS) issues / edge cases Analysts have faced during implementation. We hope that it can serve as a reference point on the possible solutions to PS.

General tips

Remove trailing and leading whitespaces, and use lower case for emails. PS will not run on NULL values, therefore we want to set empty strings to NULL as well.

NULLIF(LOWER(TRIM(payload ->> 'email')), '')

Sometimes you may find empty strings, string with whitespace, emails/ids that are ‘None’ or ‘NULL’ strings in the data. There are several ways to handle this:

  1. In PS rules, use trim() to remove leading and trailing whitespaces.
  2. In data preprocessing, use replace(' ','') or any equivalent methods (if using processors other than python) to remove any whitespace within the string.
  3. In data preprocessing, check your code to ensure that missing strings are not replaced with the None or NULL strings.
  4. If you have such cases in already loaded data you can avoid stitching these with condition similar to:
CASE
    WHEN lower(payload ->> 'id') in ('none', '0', 'null', '') THEN NULL
    ELSE rtrim(payload ->> 'id')
END

Warning: Using rtrim/trim with a second parameter in Postgresql may cause unintended behavior. Example: rtrim('123000.0', '.0') will yield 123

Casing - PS will treat test@meiro.io and Test@meiro.io as different entities!
In PS rules, use lower() to make the string lower case where applicable. Usually this applies to e-mail where it is not case sensitive. For case sensitive identifiers, this does not apply.

Validate your emails and phone numbers during data processing (example with python)

# email regular expression
import re
EMAIL_REGEX = r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)'
email_pattern = re.compile(EMAIL_REGEX)
if email_pattern.fullmatch(email):	
	return email.strip().lower()
# phonenumbers package can parse phone numbers with country code and split it to country code and phone number
import phonenumbers
x = phonenumbers.parse(phone_number)
event['payload']['country_code'] = x.country_code
event['payload']['phone_number_validated'] = x.national_number

National Identification Numbers:
1. Identification numbers may be missing leading or trailing letters.
Fix: Validate identification numbers according to the client’s requirements and the country’s ID rules.

Device IDs: 
iOS devices often prohibit tracking and in this case device_id, advertising_id often will be in 0000-0000-0000... format. To fix this in the stitching rule, this can be done:

CASE 
    WHEN 
        regexp_extract(payload ->> 'device_id', '^[^1-9a-zA-Z]+$') is null
    THEN payload ->> 'device_id'
    ELSE NULL
END

Otherwise, you can turn device_id, advertising_id that are invalid into null values during preprocessing. If existing data is already loaded, use the method mentioned above.

Issue 1: Physical/In-store events

Some clients have a physical store or have in-store events for customers. In most cases, a tablet is used by multiple customers for browsing, logging in and making purchases. This results in many different customers having the same meiro user id because they used the same device, which causes large entities during PS

Fix based on filtering out geolocation or flags:
  • Do not stitch meiro user id for these problematic events
  • Add flags/geolocation fields (like IP addresses) and update PS rules to exclude the appropriate fields
-- in this example, the offline flag is used to show if the event was in-store (offline = '1') or online (offline = '0')
CASE 
	WHEN payload->'payload'->>'offline' = '1'
    THEN NULL 
	ELSE payload->>'user_id' 
END
  • Process change - allow for browsing on incognito mode only. Close and open the incognito window for each new customer. The action of closing and re-opening incognito windows will generate new meiro user ids, ensuring that different customers will have different meiro user ids.
  • In Meiro Events, implement MeiroEvents.resetIdentity() on specific triggers, like on a log out event. When MeiroEvents.resetIdentity() method is called, the tracked meiro user id will be refreshed, so another customer using the same device will be tracked under a new meiro user id.

Issue 2: Emails with same tracking ID opened by different users

Many Email Direct Marketing (EDM) platforms now decorate their emails links with an ID to track user's browsing activity. This ID is usually unique per email campaign and per email address. However, users may forward their email or email links containing this ID to other users, which causes all such users to be stitched together because they have the same email ID.

Fix by stitching only the first instance of an event:
  • Edit PS rule to only stitch the FIRST instance of email ID, ignoring all subsequent events with the same email ID. This can be done by add specific rule to value_expression_prep in cdp_ps.stitching_attributes_config for the event you would like to fix.
  • Example of the rule for Cockroach Database implementation
CASE
    -- a general check on the id to ensure it has 32 characters
    WHEN length(payload->'url_parts'->'query_parsed'->'sid'->>0) != 32 OR
        (
        -- compares the identifier already stitched (identifiers), against the identifier (ARRAY [md5(...)::uuid])
        -- if it has been stitched already (count is > 0), then do not stitch this identifier
        -- here, 'stitching_se_sid' is the attribute_id and
        -- event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload
            SELECT COUNT(*)
            FROM customer_events ce 
            WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
            AND identifiers @> ARRAY [md5('stitching_se_sid' || (event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
        ) > 0
THEN
    NULL
ELSE
    payload->'url_parts'->'query_parsed'->'sid'->>0
END
  • Example of the rule for Postgres Implementation:
CASE
    -- a general check on the id to ensure it has 32 characters
    WHEN
        length(payload->'url_parts'->'query_parsed'->'sid'->>0)=32
    THEN
        CASE
			-- compares the list of identifiers already stitched (m.identifiers and p.identifiers), against the identifier
            -- if it has been stitched already (count is > 0), then do not stitch this identifier
            -- here, 'stitching_se_sid' is the attribute_id and
            -- st.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload
			WHEN 
            (
                (
                    SELECT COUNT(*)
                    FROM cdp_ps.matching m
                    INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = m.customer_event_id
                    WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
                    AND m.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
                ) > 0
            ) OR
            (
                (
                    SELECT COUNT(*)
                    FROM cdp_ps.prep p
                    INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = p.customer_event_id
                    WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
                    AND p.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
                ) > 0
            )
        THEN
            NULL
        ELSE
            payload->'url_parts'->'query_parsed'->'sid'->>0
    END
END

Issue 3: Users in the same organization using the same device ID when testing or accessing their own apps/web services.

Some larger clients might have a lot of their own internal users using the same device ID when doing "live" tests on their own apps/web services. In this case, the users are all legitimate however, since so many users are sharing the same device ID, it can cause stitching to stitch all the events coming from all the users into a large entity.

Refer to this documentation for further explanations as to why a large entity is problematic, what is considered "large", and for a fix as it warrants its own page. This is only one of many cases that can cause a large entity but is worth noting for larger organizations that uses the CDP.

Fix by using exclusion lists of identifiers:

You can refer back to the example fixes in Issue 1, which is filtering by a flag that's added to the payload, or by certain IP addresses. If the list of IP addresses is huge or if you have a specific list of devices IDs to exclude, you can create an exclusion list in the external_data table schema. 

Here is an example implementation on how you make use of exclusion lists:

  1. Create a table in external_data table schema with a single column, indexed. The easiest way is to do something like this - in this example we'll use device id, but it can be anything: 
    CREATE TABLE external_data.device_id_exclusion_list AS (
       identifier TEXT PRIMARY KEY
    )
  2. Insert data into the table created, with identifiers that you wish to exclude from stitching.
  3. Note that you need to apply the exclusion list to all rules for the stitching attribute you want to apply it to. In order to find out all the rules for a stitching attribute, you can use this query:
    -- In this example, the rules are for device id
    -- This needs to be changed if you will be creating an exclusion list for other values,
    -- such as phone numbers or emails.
    WITH events_with_device_id AS (
    SELECT
           event_id
    FROM stitching_attributes_config
    INNER JOIN events e ON stitching_attributes_config.event_id = e.id
    WHERE attribute_id = 'ps_device_id'
    )
    SELECT
           event_id, source_id, type, version, attribute_id, value_expression
    FROM stitching_attributes_config
    INNER JOIN events e on stitching_attributes_config.event_id = e.id
    WHERE event_id IN (SELECT event_id FROM events_with_device_id);
  4. In stitching rules, add the exclusion list in conditions using case.. when... expression eg. ( Note that the path to the stitching identifier in the payload can be different for each rule applied to each event type )
    -- Example 1: filtering out device ids, so specific device ids can be excluded from stitching
    CASE 
    	WHEN (SELECT COUNT(*) FROM external_data.device_id_exclusion_list WHERE identifier = payload ->> 'device_id') > 0
    	THEN null
    	ELSE payload ->> 'device_id'
    END
    
    -- Example 2: exclude stitching device ids for certain IP address or e-mail domains
    -- This example illustrates that the value used in the exclusion 
    -- does not need to be the value used for the stitching attribute
    CASE
    	WHEN (SELECT COUNT(*) FROM external_data.ip_address_exclusion_list WHERE identifier = payload ->> 'ip_address') > 0
    	THEN null
    	ELSE payload ->> 'device_id'
    END
  5. Check presence of entities and events with excluded values. 

    The exclusion list is typically implemented after we have already done identity stitching for all identifiers, so there would already be customer entities that contain events with these excluded IDs, These customer entities need to be re-stitched. 
    This process consumes a lot of DB capacity, so it was optimized heavily. 

    First the list of entities containing excluded values will be exported to a separate table:

    CREATE TABLE external_data.entities_with_excluded_device_ids (customer_entity_id uuid primary key, identifiers uuid[]);
    INSERT INTO external_data.entities_with_excluded_device_ids
    WITH unnested AS (
      SELECT customer_entity_id, identifiers, unnest(mx_device_id_all)
      FROM customer_attributes WHERE mx_device_id_all IS NOT NULL
    )
    SELECT customer_entity_id, identifiers FROM unnested
    INNER JOIN external_data.device_id_exclusion_list el ON el.identifier = unnested.device_id;
    

    Then events of those entities are exported to a table in schema external_data. 

    CREATE TABLE external_data.events_with_excluded_device_ids (id uuid primary key );
    INSERT INTO external_data.events_with_excluded_device_ids
        WITH unnested AS (
          SELECT unnest(identifiers::uuid[]) AS identifier, customer_entity_id FROM external_data.entities_with_excluded_device_ids
        )
        SELECT id
        FROM customer_events
        INNER JOIN unnested on customer_events.identifier = unnested.identifier
    ;
  6. After that, make use of the exported tables and follow this doc to restitch the affected events/entities. Follow along from step 1: First export event ids to separate table in external_data schema in the doc in the "Restitching" section of the doc. In this case, you have already created the table containing event ids to be restitched in the external_data schema.