How to clean wrongly stitched customer entities without rerunning PS
PS edge cases
It may happen when setting up PS rules that you miss some edge cases, and customer entities get stitched wrongly as a result. Examples of such cases:
- ps_identifier in the payload was parsed to the text value None or 0. CPS checks that the PS identifier is not NULL, but None is a valid value for the algorithm, so it will stitch all customer events where ps_identifier = "None" into one huge entity.
- ps_identifier was parsed into integer for one data source and float for another: 12345 and 12345.0 are different values for CPS, so entities with these values won't be stitched together.
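If you are not sure whether your data is affected, a quick check like the one below can help. This is only a sketch: it assumes the event payload is stored as JSONB and the identifier lives under a top-level key named ps_identifier, so adjust the table and key names to your setup.
-- Hypothetical check for identifiers parsed as the literal text "None" or "0"
select payload ->> 'ps_identifier' as suspicious_value, count(*)
from public.customer_events
where payload ->> 'ps_identifier' in ('None', '0')
group by 1;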
Usually, at the very beginning of a project when only small samples of data have been loaded, it makes more sense to rerun PS from scratch. However, if a large volume of data has already been processed, rerunning PS from zero can take days to catch up and is simply not effective. In this case it is more efficient to clean only the problematic customer entities and customer events.
Check this documentation for more edge cases
Preparation
PostgreSQL process (to be deprecated)
Before the cleaning:
Define the problematic customer entity(ies). If it is just one huge entity which needs to be "unstitched", save its value; if there are many entities, decide how to filter them.
Make sure you know how the PS rule should be fixed and test it on a relevant data sample.
Disable the CPS component in MI and stop all processes with app_name "Meiro CPS" from running.
-- Check in the DB that there is no query with app_name "Meiro CPS" running:
SELECT * FROM pg_stat_activity WHERE application_name = 'Meiro CPS';
-- If there is, kill it:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'Meiro CPS';
Create a separate schema where the tables with entities and events which need to be restitched will be stored.
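The schema name is up to you; the queries below use temp_table as an example.
create schema if not exists temp_table;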
Warning: Make sure CPS is not running until you finish with the cleaning.
How to perform the cleaning
PostgreSQL process (to be deprecated)
Step 0
If events need to be deleted, delete them as the first step.
delete from cdp_ce.customer_events ce
where ce.id in (...) -- condition based on which entity(ies) are filtered
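For illustration only, this is what the delete could look like for the "None" edge case described above; the payload key name is an assumption, so replace the condition with your own filter.
-- hypothetical condition: remove events whose identifier was parsed as the text "None"
delete from cdp_ce.customer_events
where payload ->> 'ps_identifier' = 'None';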
Step 1
Create a table where entities to delete will be stored.
create table temp_table.entities_to_clean as
select customer_entity_id from public.customer_events
where --condition based on which entity(ies) are filtered
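As a hypothetical example, if the problem is one huge entity whose id you saved during preparation, the filter could be as simple as:
-- hypothetical: one known wrongly stitched entity, replace the uuid with the saved value
create table temp_table.entities_to_clean as
select distinct customer_entity_id from public.customer_events
where customer_entity_id = '00000000-0000-0000-0000-000000000000';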
Step 2
Create a table where the ids of the events belonging to these entities will be stored.
create table temp_table.events_to_clean as
select id from public.customer_events
where customer_entity_id in (select * from temp_table.entities_to_clean)
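Before touching the production tables, it is worth sanity-checking both helper tables, for example:
-- quick sanity check of the helper tables
select count(*) from temp_table.entities_to_clean;
select count(*) from temp_table.events_to_clean;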
Step 3
Delete these entities from the cdp_ps.matching table (this will "unstitch" the events).
delete from cdp_ps.matching
where matching.customer_entity_id in (select customer_entity_id::uuid from temp_table.entities_to_clean)
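Once the delete has finished, a query like this should return 0 for the cleaned entities (an optional check, not part of the original procedure):
select count(*) from cdp_ps.matching
where customer_entity_id in (select customer_entity_id::uuid from temp_table.entities_to_clean);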
Step 4
Insert a 'delete' record for each entity into cdp_attr.customer_entities_audit to clean everything related to these customer_entity_ids.
insert into cdp_attr.customer_entities_audit(customer_entity_id, created, action)
select distinct customer_entity_id::uuid, current_timestamp, 'delete' from temp_table.entities_to_clean
Step 5
Insert the events into cdp_ps.customer_events_to_process to restitch them.
insert into cdp_ps.customer_events_to_process(customer_event_id, created)
select id::uuid, current_timestamp from temp_table.events_to_clean
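Optionally, compare the number of queued events with the helper table; the counts may differ if other events were already waiting in the queue, but the queued number should be at least as large as events_to_clean:
select
  (select count(*) from temp_table.events_to_clean) as events_to_clean,
  (select count(*) from cdp_ps.customer_events_to_process) as queued_for_processing;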
Step 6
Update the PS rules and enable CPS to run again.
Important: Run the queries one by one and make sure the previous step has finished before running the next one.
Help! I have accidentally deleted the customer event ids from cdp_ps.matching and lost the customer event ids to clean before inserting them into cdp_ps.customer_events_to_process!
Don't panic. First, make sure CPS is not running, then run the following query to insert ALL events that are not present in the cdp_ps.matching table, i.e. events that have not been stitched.
insert into cdp_ps.customer_events_to_process(customer_event_id, created)
select id::uuid, current_timestamp from cdp_ce.customer_events ce
left join cdp_ps.matching m on ce.id::uuid = m.customer_event_id where m.customer_event_id is null;
Warning: Take note that this query can be expensive depending on how large the volume of data is. To avoid this problem from happening in the first place, follow the previous steps carefully. This is intended to be a workaround only.