# Format Events
Format events into a format suitable for the `customer_events` table.

The purpose of this component is to prepare the events' format for loading them into the `customer_events` table. To use the component, use this link to the repo:
Events should be supplied in newline-delimited JSON format (suffix `.ndjson`).
## Configuration
For example, given a file `data/in/files/events_mc_subscribed.ndjson` with the following 2 rows

```
{"email":"aaa@meiro.robin@meiro.io", "meta": {"date": "2018-08-18T14:15:16Z"}, "status": "subscribed", "list_id": "12345b", "list_name": "Loyal customers"}
{"email":"foo@bar.io", "meta": {"date": "2018-08-18T15:16:17Z"}, "status": "subscribed", "list_id": "12345b", "list_name": "Loyal customers"}
```
the goal is to produce `data/out/tables/events_mc_subscribed.csv`, which can be uploaded to the `customer_events` table:
| id | customer_entity_id | event_id | source_id | event_time | type | version | payload | created_at |
|---|---|---|---|---|---|---|---|---|
| md5(...) | | md5(...) | mailchimp | 2018-08-18T14:15:16Z | subscribed | 0-1-0 | {"email":"robin@"... (1:1 copy of original) | current_utc_iso8601 stamp |
| md5(...) | | md5(...) | mailchimp | 2018-08-18T15:16:17Z | subscribed | 0-1-0 | {"email":"foo@bar.io"... (1:1 copy of original) | current_utc_iso8601 stamp |
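The transformation above can be sketched in Python. This is a minimal illustration of the shape of the output row, not the component's actual code; the hash inputs and concatenation order are assumptions based on the formulas described later in this document:

```python
import hashlib
import json
from datetime import datetime, timezone

def format_event(raw_line, source="mailchimp", event_type="subscribed", version="0-1-0"):
    """Turn one ndjson line into a dict shaped like a customer_events row (sketch)."""
    event = json.loads(raw_line)
    event_time = event["meta"]["date"]  # event_time_rule: "meta.date"
    return {
        "id": hashlib.md5((event_time + source + event_type).encode()).hexdigest(),
        "customer_entity_id": "",       # filled in later, outside this component
        "event_id": hashlib.md5((source + event_type + version).encode()).hexdigest(),
        "source_id": source,
        "event_time": event_time,
        "type": event_type,
        "version": version,
        "payload": raw_line.strip(),    # 1:1 copy of the original JSON
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

row = format_event('{"email":"foo@bar.io", "meta": {"date": "2018-08-18T15:16:17Z"}, "status": "subscribed"}')
```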
In a nutshell, this component extracts some values from the ndjson events in order to

- construct the `id` of the event
- construct the `event_id` (a reference to the `events` table)
- extract the event time
- set the required columns
The `config.json` describes where to find these values in the event JSONs.
## The `id` calculation

By default, the event `id` is calculated as an md5 of `event_time`, `source`, `event_type`.
Optionally, you can specify extra values to be included in the hash via the `extra_id_rules` parameter (below). The values are dot-separated JSON paths: `meta.value` would resolve to `42` in the JSON `{"foo": "bar", "meta": {"value": 42}}`.
!!! Important !!! The order of the `extra_id_rules` DOES matter, as we are dealing with hashes!
## The `event_id` calculation

The formula, in SQL syntax, is `md5("source_id" || "type" || "version")`.
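The same formula expressed in Python, assuming plain concatenation with no separator (mirroring the SQL `||` operator). Note how swapping argument order changes the hash, which is why the ordering rules above matter:

```python
import hashlib

def event_id(source_id, type_, version):
    """md5("source_id" || "type" || "version"), as in the SQL formula above."""
    return hashlib.md5((source_id + type_ + version).encode()).hexdigest()

print(event_id("mailchimp", "subscribed", "0-1-0"))
```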
## Vanilla config
```json
{
  "parameters": {
    "events": [
      {
        "filename": "mc_subscribed_events.ndjson",
        "optional": true,
        "version": "0-1-0",
        "event_type": "subscribed",
        "source": "mailchimp",
        "event_time_rule": "meta.date",
        "extra_id_rules": ["email", "list_id"],
        "event_time_exclude": true
      }
    ]
  }
}
```
- `filename` - which input file contains the events
- `optional` - if `true`, doesn't raise an error if the file is not found (which can happen if there are no events for this particular batch)
- `source` - hardcoded `source_id` (as defined in the `sources` table)
- `event_time_rule` - `"path.to.event_time.in.payload"` (jq style) used to populate the `event_time` column
- `extra_id_rules` - an array of values which are included in the event `id` calculation. This must include values that uniquely identify the event (i.e. a customer id + the event id in the source system etc.). The values of this array are "paths" (=rules) of where to find the actual values in the event JSON.
- `event_type` - if set to `''` or `null` or left undefined, it is inferred from the `filename`. Used as the value of the `customer_events.type` column
- `event_time_exclude` - if set to `true`, `event_time` won't be included in the event `id` calculation, and `id` will be calculated as an md5 of `source`, `event_type`, `extra_ids` (if available). If not set or set to `false`, `event_time` will be included in the calculation as usual.
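Putting the `id` rules together, a hedged sketch of how `extra_id_rules` and `event_time_exclude` could combine (the exact concatenation order and separator are assumptions, not taken from the component's code):

```python
import hashlib

def calc_id(event, source, event_type, event_time,
            extra_id_rules=(), event_time_exclude=False):
    """Sketch: md5 over event_time (unless excluded), source, event_type,
    plus values resolved from extra_id_rules, in that order."""
    def resolve(rule):
        value = event
        for key in rule.split("."):
            value = value[key]
        return str(value)

    parts = [] if event_time_exclude else [event_time]
    parts += [source, event_type]
    parts += [resolve(rule) for rule in extra_id_rules]
    return hashlib.md5("".join(parts).encode()).hexdigest()

event = {"email": "foo@bar.io", "list_id": "12345b"}
with_time = calc_id(event, "mailchimp", "subscribed", "2018-08-18T15:16:17Z",
                    extra_id_rules=["email", "list_id"])
without_time = calc_id(event, "mailchimp", "subscribed", "2018-08-18T15:16:17Z",
                       extra_id_rules=["email", "list_id"], event_time_exclude=True)
```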
In addition to that, choose one of:

- `version` - hardcoded value of the `version` column
- `version_rule` - DEPRECATED - if the version is available directly in the payload, instead of the `"version"` parameter, supply `"version_rule": "path.to_version.in.payload"`, i.e. if the event (for example snowplow) looks like this `{"date": "foo", "event_version": "1-0-0", "value": "42"}`, you would set it to `"event_version"`
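The two version options could be handled like this. This is a hypothetical helper for illustration only; `resolve_version` is not a function from the component:

```python
def resolve_version(params, event):
    """Prefer a hardcoded 'version'; otherwise follow the deprecated 'version_rule' path."""
    if "version" in params:
        return params["version"]
    value = event
    for key in params["version_rule"].split("."):
        value = value[key]
    return value

event = {"date": "foo", "event_version": "1-0-0", "value": "42"}
print(resolve_version({"version_rule": "event_version"}, event))  # -> 1-0-0
```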
By default, all files defined in the `events` array need to be supplied, otherwise an error will be thrown. If you want to continue on missing files, set `optional` to `true` (`false` by default).
## Development
Build the docker image:

```
$ docker-compose build dev
```

Test with:

```
$ docker-compose run --rm dev python3 -m pytest
```
There is a pseudo-functional test to check there are no import or syntax errors in the `main.py` script:
```
$ docker-compose run --rm dev sh
$ find /tmp/data_functional   # shouldn't find anything
$ sh /code/functional.sh
```

should run the test. There are no assertions, take a peek manually:

```
$ find /tmp/data_functional
/tmp/data_functional/
/tmp/data_functional/in
/tmp/data_functional/in/files
/tmp/data_functional/in/files/mc_subscribed_events.ndjson
/tmp/data_functional/out
/tmp/data_functional/out/tables
/tmp/data_functional/out/tables/subscribed.csv
/tmp/data_functional/config.json
```
Especially make sure the table `/tmp/data_functional/out/tables/subscribed.csv` is there (the validity should be covered by the unit tests).