Profile Stitching edge cases and solutions
This article describes the profile stitching (PS) issues / edge cases Analysts have faced during implementation. We hope that it can serve as a reference point on the possible solutions to PS.
General tips
Remove trailing and leading whitespaces, and use lower case for emails. PS will not run on NULL values, therefore we want to set empty strings to NULL as well.
NULLIF(LOWER(TRIM(payload ->> 'email')), '')
Sometimes you may find empty strings, string with whitespace, emails/ids that are ‘None’ or ‘NULL’ strings in the data. There are several ways to handle this:
- In PS rules, use trim() to remove leading and trailing whitespaces.
- In data preprocessing, use replace(' ','') or any equivalent methods (if using processors other than python) to remove any whitespace within the string.
- In data preprocessing, check your code to ensure that missing strings are not replaced with the None or NULL strings.
- If you have such cases in already loaded data you can avoid stitching these with condition similar to:
CASE
WHEN lower(payload ->> 'id') in ('none', '0', 'null', '') THEN NULL
ELSE rtrim(payload ->> 'id')
END
Warning: Using rtrim/trim
with a second parameter in Postgresql may cause unintended behavior. Example: rtrim('123000.0', '.0')
will yield 123
Casing - PS will treat test@meiro.io and Test@meiro.io as different entities!
In PS rules, use lower() to make the string lower case where applicable. Usually this applies to e-mail where it is not case sensitive. For case sensitive identifiers, this does not apply.
Validate your emails and phone numbers during data processing (example with python)
# email regular expression
import re
EMAIL_REGEX = r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)'
email_pattern = re.compile(EMAIL_REGEX)
if email_pattern.fullmatch(email):
return email.strip().lower()
# phonenumbers package can parse phone numbers with country code and split it to country code and phone number
import phonenumbers
x = phonenumbers.parse(phone_number)
event['payload']['country_code'] = x.country_code
event['payload']['phone_number_validated'] = x.national_number
National Identification Numbers:
1. Identification numbers may be missing leading or trailing letters.
Fix: Validate identification numbers according to the client’s requirements and the country’s ID rules.
Device IDs:
iOS devices often prohibit tracking and in this case device_id, advertising_id often will be in 0000-0000-0000... format. To fix this in the stitching rule, this can be done:
CASE
WHEN payload @@ '$.device_id like_regex "^[^1-9a-zA-Z]+$"'THEN NULL
ELSE payload ->> 'device_id'
END
Otherwise, you can turn device_id, advertising_id that are invalid into null values during preprocessing. If existing data is already loaded, use the method mentioned above.
Issue 1: Physical/In-store events
Some clients have a physical store or have in-store events for customers. In most cases, a tablet is used by multiple customers for browsing, logging in and making purchases. This results in many different customers having the same meiro user id
because they used the same device, which causes large entities during PS
Possible fixes and prevention:
- Do not stitch
meiro user id
for these problematic events - Add flags/geolocation fields (like IP addresses) and update PS rules to exclude the appropriate fields
-- in this example, the offline flag is used to show if the event was in-store (offline = '1') or online (offline = '0')
CASE
WHEN payload->'payload'->>'offline' = '1'
THEN NULL
ELSE payload->>'user_id'
END
- Process change - allow for browsing on incognito mode only. Close and open the incognito window for each new customer. The action of closing and re-opening incognito windows will generate new
meiro user ids
, ensuring that different customers will have differentmeiro user ids
. - In Meiro Events, implement
MeiroEvents.resetIdentity()
on specific triggers, like on a log out event. WhenMeiroEvents.resetIdentity()
method is called, the trackedmeiro user id
will be refreshed, so another customer using the same device will be tracked under a newmeiro user id
.
Issue 2: Emails with same tracking ID opened by different users
Many Email Direct Marketing (EDM) platforms now decorate their emails links with an ID to track user's browsing activity. This ID is usually unique per email campaign and per email address. However, users may forward their email or email links containing this ID to other users, which causes all such users to be stitched together because they have the same email ID.
Fix:
- Edit PS rule to only stitch the FIRST instance of email ID, ignoring all subsequent events with the same email ID. This can be done by add specific rule to
value_expression_prep
incdp_ps.stitching_attributes_config
for the event you would like to fix. - Example of the rule for Cockroach Database implementation
CASE
-- a general check on the id to ensure it has 32 characters
WHEN length(payload->'url_parts'->'query_parsed'->'sid'->>0) != 32 OR
(
-- compares the identifier already stitched (identifiers), against the identifier (ARRAY [md5(...)::uuid])
-- if it has been stitched already (count is > 0), then do not stitch this identifier
-- here, 'stitching_se_sid' is the attribute_id and
-- event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload
SELECT COUNT(*)
FROM customer_events ce
WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
AND identifiers @> ARRAY [md5('stitching_se_sid' || (event_to_stitch.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
) > 0
THEN
NULL
ELSE
payload->'url_parts'->'query_parsed'->'sid'->>0
END
- Example of the rule for Postgres Implementation:
CASE
-- a general check on the id to ensure it has 32 characters
WHEN
length(payload->'url_parts'->'query_parsed'->'sid'->>0)=32
THEN
CASE
-- compares the list of identifiers already stitched (m.identifiers and p.identifiers), against the identifier
-- if it has been stitched already (count is > 0), then do not stitch this identifier
-- here, 'stitching_se_sid' is the attribute_id and
-- st.payload->'url_parts'->'query_parsed'->'sid'->>0 is the identifier in the payload
WHEN
(
(
SELECT COUNT(*)
FROM cdp_ps.matching m
INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = m.customer_event_id
WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
AND m.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
) > 0
) OR
(
(
SELECT COUNT(*)
FROM cdp_ps.prep p
INNER JOIN cdp_ce.customer_events ce ON ce.id::uuid = p.customer_event_id
WHERE ce.source_id = 'eshop_website' AND ce."type" = 'page_view'
AND p.identifiers && ARRAY [md5('stitching_se_sid' || (st.payload->'url_parts'->'query_parsed'->'sid'->>0))::uuid]
) > 0
)
THEN
NULL
ELSE
payload->'url_parts'->'query_parsed'->'sid'->>0
END
END