# Format Events
Format events into a format suitable for the `customer_events` table.

The purpose of this component is to prepare the events' format for loading them into the `customer_events` table. To use the component, use this link to the repo:
Events should be supplied in newline-delimited JSON format (suffix `.ndjson`).
## Configuration
For example, given a file `data/in/files/events_mc_subscribed.ndjson` with the following 2 rows

```
{"email":"aaa@meiro.robin@meiro.io", "meta": {"date": "2018-08-18T14:15:16Z"}, "status": "subscribed", "list_id": "12345b", "list_name": "Loyal customers"}
{"email":"foo@bar.io", "meta": {"date": "2018-08-18T15:16:17Z"}, "status": "subscribed", "list_id": "12345b", "list_name": "Loyal customers"}
```
the goal is to produce `data/out/tables/events_mc_subscribed.csv`, which can be uploaded to the `customer_events` table:
| id | customer_entity_id | event_id | source_id | event_time | type | version | payload | created_at |
|---|---|---|---|---|---|---|---|---|
| md5(...) | | md5(...) | mailchimp | 2018-08-18T14:15:16Z | subscribed | 0-1-0 | {"email":"robin@"... (1:1 copy of original) | current_utc_iso8601 stamp |
| md5(...) | | md5(...) | mailchimp | 2018-08-18T15:16:17Z | subscribed | 0-1-0 | {"email":"foo@bar.io"... (1:1 copy of original) | current_utc_iso8601 stamp |
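The transformation above can be sketched in Python. This is a minimal illustration of the shape of the output row, not the component's actual code; the hash inputs and concatenation order are assumptions based on the formulas described later in this document:

```python
import hashlib
import json
from datetime import datetime, timezone

def format_event(raw_line, source="mailchimp", event_type="subscribed", version="0-1-0"):
    """Turn one ndjson line into a dict shaped like a customer_events row (sketch)."""
    event = json.loads(raw_line)
    event_time = event["meta"]["date"]  # event_time_rule: "meta.date"
    return {
        "id": hashlib.md5((event_time + source + event_type).encode()).hexdigest(),
        "customer_entity_id": "",       # filled in later, outside this component
        "event_id": hashlib.md5((source + event_type + version).encode()).hexdigest(),
        "source_id": source,
        "event_time": event_time,
        "type": event_type,
        "version": version,
        "payload": raw_line.strip(),    # 1:1 copy of the original JSON
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

row = format_event('{"email":"foo@bar.io", "meta": {"date": "2018-08-18T15:16:17Z"}, "status": "subscribed"}')
```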
In a nutshell, this component extracts some values from the ndjson events in order to

- construct the `id` of the event
- construct the `event_id` (a reference to the `events` table)
- extract the event time
- set the required columns
The `config.json` describes where to find these values in the event JSONs.
## The `id` calculation

By default, the event `id` is calculated as an md5 of `event_time`, `source`, `event_type`.
Optionally, you can specify extra values to be included in the hash via the `extra_id_rules` parameter (below). The values are dot-separated JSON paths: `meta.value` would resolve to `42` in the JSON `{"foo": "bar", "meta": {"value": 42}}`.
!!! Important !!! The order of the `extra_id_rules` DOES matter, as we are dealing with hashes!
## The `event_id` calculation

The formula, in SQL syntax, is `md5("source_id" || "type" || "version")`.
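The same formula expressed in Python, assuming plain concatenation with no separator (mirroring the SQL `||` operator). Note how swapping argument order changes the hash, which is why the ordering rules above matter:

```python
import hashlib

def event_id(source_id, type_, version):
    """md5("source_id" || "type" || "version"), as in the SQL formula above."""
    return hashlib.md5((source_id + type_ + version).encode()).hexdigest()

print(event_id("mailchimp", "subscribed", "0-1-0"))
```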
## Vanilla config
```json
{
  "parameters": {
    "events": [
      {
        "filename": "mc_subscribed_events.ndjson",
        "optional": true,
        "version": "0-1-0",
        "event_type": "subscribed",
        "source": "mailchimp",
        "event_time_rule": "meta.date",
        "extra_id_rules": ["email", "list_id"],
        "event_time_exclude": true
      }
    ]
  }
}
```
- `filename` - which input file contains the events
- `optional` - if `true`, doesn't raise an error if the file is not found (which can happen if there are no events for this particular batch)
- `source` - hardcoded `source_id` (as defined in the `sources` table)
- `event_time_rule` - `"path.to.event_time.in.payload"` (jq style) used to populate the `event_time` column
- `extra_id_rules` - an array of values which are included in the event `id` calculation. This must include values that uniquely identify the event (i.e. a customer id + the event id in the source system etc.). The values of this array are "paths" (=rules) of where to find the actual values in the event JSON.
- `event_type` - if set to `''` or `null` or left undefined, it is inferred from the `filename`. Used as the value of the `customer_events.type` column
- `event_time_exclude` - if set to `true`, `event_time` won't be included in the event `id` calculation, and `id` will be calculated as an md5 of `source`, `event_type`, `extra_ids` (if available). If not set or set to `false`, `event_time` will be included in the calculation as usual.
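Putting the `id` rules together, a hedged sketch of how `extra_id_rules` and `event_time_exclude` could combine (the exact concatenation order and separator are assumptions, not taken from the component's code):

```python
import hashlib

def calc_id(event, source, event_type, event_time,
            extra_id_rules=(), event_time_exclude=False):
    """Sketch: md5 over event_time (unless excluded), source, event_type,
    plus values resolved from extra_id_rules, in that order."""
    def resolve(rule):
        value = event
        for key in rule.split("."):
            value = value[key]
        return str(value)

    parts = [] if event_time_exclude else [event_time]
    parts += [source, event_type]
    parts += [resolve(rule) for rule in extra_id_rules]
    return hashlib.md5("".join(parts).encode()).hexdigest()

event = {"email": "foo@bar.io", "list_id": "12345b"}
with_time = calc_id(event, "mailchimp", "subscribed", "2018-08-18T15:16:17Z",
                    extra_id_rules=["email", "list_id"])
without_time = calc_id(event, "mailchimp", "subscribed", "2018-08-18T15:16:17Z",
                       extra_id_rules=["email", "list_id"], event_time_exclude=True)
```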
In addition to that, choose one of:

- `version` - hardcoded value of the `version` column
- `version_rule` - DEPRECATED - if the version is available directly in the payload, instead of the `"version"` parameter, supply `"version_rule": "path.to_version.in.payload"`, i.e. if the event (for example snowplow) looks like this `{"date": "foo", "event_version": "1-0-0", "value": "42"}`, you would set it to `"event_version"`
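The two version options could be handled like this. This is a hypothetical helper for illustration only; `resolve_version` is not a function from the component:

```python
def resolve_version(params, event):
    """Prefer a hardcoded 'version'; otherwise follow the deprecated 'version_rule' path."""
    if "version" in params:
        return params["version"]
    value = event
    for key in params["version_rule"].split("."):
        value = value[key]
    return value

event = {"date": "foo", "event_version": "1-0-0", "value": "42"}
print(resolve_version({"version_rule": "event_version"}, event))  # -> 1-0-0
```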
By default, all files defined in the `events` array need to be supplied, otherwise an error will be thrown. If you want to continue on missing files, set `optional` to `true` (`false` by default).
## Development
Build the docker image:

```
$ docker-compose build dev
```

Test with:

```
$ docker-compose run --rm dev python3 -m pytest
```
There is a pseudo-functional test to check there are no import or syntax errors in the `main.py` script:
```
$ docker-compose run --rm dev sh
$ find /tmp/data_functional   # shouldn't find anything
$ sh /code/functional.sh
```

should run the test. There are no assertions, take a peek manually:

```
$ find /tmp/data_functional
/tmp/data_functional/
/tmp/data_functional/in
/tmp/data_functional/in/files
/tmp/data_functional/in/files/mc_subscribed_events.ndjson
/tmp/data_functional/out
/tmp/data_functional/out/tables
/tmp/data_functional/out/tables/subscribed.csv
/tmp/data_functional/config.json
```
Especially make sure the table `/tmp/data_functional/out/tables/subscribed.csv` is there (the validity should be covered by the unit tests).