ePrivacy and GPDR Cookie Consent by Cookie Consent Skip to main content

Profile Stitching edge cases and solutions

This article describes the profile stitching (PS) issues / edge cases Analysts have faced during implementation. We hope that it can serve as a reference point on the possible solutions to PS.

General tips

Remove trailing and leading whitespaces, and use lower case for emails. PS will not run on NULL values, therefore we want to set empty strings to NULL as well.

NULLIF(LOWER(TRIM(payload ->> 'email')), '')

Sometimes you may find empty strings, string with whitespace, emails/ids that are ‘None’ or ‘NULL’ strings in the data. There are several ways to handle this:

  1. In PS rules, use trim() to remove leading and trailing whitespaces.
  2. In data preprocessing, use replace(' ','') or any equivalent methods (if using processors other than python) to remove any whitespace within the string.
  3. In data preprocessing, check your code to ensure that missing strings are not replaced with the None or NULL strings.
  4. If you have such cases in already loaded data you can avoid stitching these with condition similar to:
CASE
    WHEN lower(payload ->> 'id') in ('none', '0', 'null', '') THEN NULL
    ELSE rtrim(payload ->> 'id')
END

Warning: Using rtrim/trim with a second parameter in Postgresql may cause unintended behavior. Example: rtrim('123000.0', '.0') will yield 123

Casing - PS will treat test@meiro.io and Test@meiro.io as different entities!
In PS rules, use lower() to make the string lower case where applicable. Usually this applies to e-mail where it is not case sensitive. For case sensitive identifiers, this does not apply.

Validate your emails and phone numbers during data processing (example with python)

# email regular expression
import re
EMAIL_REGEX = r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)'
email_pattern = re.compile(EMAIL_REGEX)
if email_pattern.fullmatch(email):	
	return email.strip().lower()
# phonenumbers package can parse phone numbers with country code and split it to country code and phone number
import phonenumbers
x = phonenumbers.parse(phone_number)
event['payload']['country_code'] = x.country_code
event['payload']['phone_number_validated'] = x.national_number

National Identification Numbers:
1. Identification numbers may be missing leading or trailing letters.
Fix: Validate identification numbers according to the client’s requirements and the country’s ID rules.

Device IDs: 
iOS devices often prohibit tracking and in this case device_id, advertising_id often will be in 0000-0000-0000... format. To fix this in the stitching rule, this can be done:

CASE 
    WHEN payload @@ '$.device_id like_regex "^[^1-9a-zA-Z]+$"'THEN NULL
    ELSE payload ->> 'device_id'
END

Otherwise, you can turn device_id, advertising_id that are invalid into null values during preprocessing. If existing data is already loaded, use the method mentioned above.

Issue 1: Physical/In-store events

Some clients have a physical store or have in-store events for customers. In most cases, a tablet is used by multiple customers for browsing, logging in and making purchases. This results in many different customers having the same meiro user id because they used the same device, which causes large entities during PS

PossibleFix fixesbased andon prevention:filtering out geolocation or flags:
  • Do not stitch meiro user id for these problematic events
  • Add flags/geolocation fields (like IP addresses) and update PS rules to exclude the appropriate fields
-- in this example, the offline flag is used to show if the event was in-store (offline = '1') or online (offline = '0')
CASE 
	WHEN payload->'payload'->>'offline' = '1'
    THEN NULL 
	ELSE payload->>'user_id' 
END
  • Process change - allow for browsing on incognito mode only. Close and open the incognito window for each new customer. The action of closing and re-opening incognito windows will generate new meiro user ids, ensuring that different customers will have different meiro user ids.
  • In Meiro Events, implement MeiroEvents.resetIdentity() on specific triggers, like on a log out event. When MeiroEvents.resetIdentity() method is called, the tracked meiro user id will be refreshed, so another customer using the same device will be tracked under a new meiro user id.

Issue 2: Emails with same tracking ID opened by different users

Many Email Direct Marketing (EDM) platforms now decorate their emails links with an ID to track user's browsing activity. This ID is usually unique per email campaign and per email address. However, users may forward their email or email links containing this ID to other users, which causes all such users to be stitched together because they have the same email ID.

Fix:Fix by stitching only the first instance of an event:
  • Edit PS rule to only stitch the FIRST instance of email ID, ignoring all subsequent events with the same email ID. This can be done by add specific rule to value_expression_prep in cdp_ps.stitching_attributes_config for the event you would like to fix.
  • Example of the rule for Cockroach Database implementation
CASE
    -- a general check on the id to ensure it has 32 characters
    WHEN length(payload->'url_parts'->'query_parsed'->'sid'->>0) != 32 OR
        (
        -- compares the identifier already stitched (identifiers), against the identifier (ARRAY [md5(...)::uuid])
        -- if it has been stitched already (count is > 0), then do not stitch this identifier
        -- here, 'stitching_se_sid' is the attribute_id and
        -- event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload
            SELECT COUNT(*)
            FROM customer_events ce 
            WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
            AND identifiers @> ARRAY [md5('stitching_se_sid' || (event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
        ) > 0
THEN
    NULL
ELSE
    payload->'url_parts'->'query_parsed'->'sid'->>0
END
  • Example of the rule for Postgres Implementation:
CASE
    -- a general check on the id to ensure it has 32 characters
    WHEN
        length(payload->'url_parts'->'query_parsed'->'sid'->>0)=32
    THEN
        CASE
			-- compares the list of identifiers already stitched (m.identifiers and p.identifiers), against the identifier
            -- if it has been stitched already (count is > 0), then do not stitch this identifier
            -- here, 'stitching_se_sid' is the attribute_id and
            -- st.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload
			WHEN 
            (
                (
                    SELECT COUNT(*)
                    FROM cdp_ps.matching m
                    INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = m.customer_event_id
                    WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
                    AND m.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
                ) > 0
            ) OR
            (
                (
                    SELECT COUNT(*)
                    FROM cdp_ps.prep p
                    INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = p.customer_event_id
                    WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
                    AND p.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
                ) > 0
            )
        THEN
            NULL
        ELSE
            payload->'url_parts'->'query_parsed'->'sid'->>0
    END
END

Issue 3: Users in the same organization using the same device ID when testing or accessing their own apps/web services.

Some larger clients might have a lot of their own internal users using the same device ID when doing "live" tests on their own apps/web services. In this case, the users are all legitimate however, since so many users are sharing the same device ID, it can cause stitching to stitch all the events coming from all the users into a large entity.

Refer to this documentation for further explanations as to why a large entity is problematic, what is considered "large", and for a fix as it warrants its own page. This is only one of many cases that can cause a large entity but is worth noting for larger organizations that uses the CDP.

PossibleFix fixes:by using exclusion lists of identifiers:

You can refer back to the example fixes in Issue 1, which is filtering by a flag that's added to the payload, or by certain IP addresses. If the list of IP addresses is huge or if you have a specific list of devices IDs to exclude, you can create an exclusion list in the external_data table schema. 

Here is an example implementation on how you make use of exclusion lists:

  1. Create a table in external_data table schema with a single column, indexed. The easiest way is to do something like this - in this example we'll use device id, but it can be anything: 
    create table external_data.device_id_exclusion_list as (
       identifier text primary key
    )
  2. Insert data into the table created, with identifiers that you wish to exclude from stitching
  3. In stitching rules, add the exclusion list in conditions using case.. when... expression eg.
    case 
    	when (select count(*) from external_data.exclusion_list_device_id where identifier = payload ->> 'device_id') > 0
    	then null
    	else payload ->> 'device_id'
    end

  4. Follow this doc to restitch affected events/entities