ePrivacy and GPDR Cookie Consent by Cookie Consent Skip to main content

Meiro Analytics

Meiro Analytics is a solution for reporting the success of Meiro use cases. Instead of calculating the data from attributes, it takes values directly from events and calculates from there. It utilises the HyperLogLog (HLL) algorithm, which solves the count-distinct problem by approximating the number of distinct elements instead of trying to calculate them precisely. Link to the complete HLL reference doc on GitHub.

Meiro Analytics itself is a Python script saved in the GitHub repository. We use it via a Python from Git Repository processor and fill all the necessary parameters in its config. It transports events from Cockroach DB into staging tables along with event times and customer entity IDs into reporting DB and goes through a series of Postgres queries specified in the parameters of the configuration. It does not work with any Data In/Out.

Prerequisites

Meiro Analytics stores the aggregated data in a reporting database, separate from your CDP instance. 
Please inform the Meiro team to have this database set up before running your Meiro Analytics component.

Parameters

image.png

Repository Always use https://github.com/meiroio/crdb_analytics, where the component lives.
User Use a login that can read the above-mentioned repository. readonly-from-meiro can be found in 1Password and works well for this purpose.
Password Password for the above specified Github account.
Branch or Tag Unless specified otherwise by the development team, use the main branch.

Component Setup

All other parameters need to be specified in the code editor in JSON format.

src

Cockroach credentials of the instance need to be inserted here. The component will use them to connect to the events table and download specified events for further processing.

dst

Credentials to reporting the DB of the instance need to be inserted here. These are generally unavailable to users, so please contact the Meiro team to get these filled in.

skip_events_transport

Every component run drops all staging tables in the reporting DB and creates new ones. If we are building queries and testing, we can skip this by setting this to true and work with data from the previous run. Always set this to false to enable events transport for live deployment.

 

Remember: If we run the component for the first time or add new events, we need to make at least one run with events transport enabled to ensure we have data to work with.

events_since

Specify how far into the past are we looking with an integer. The component will then process data that number of days before the current date.

events_since_now

For live deployment, always leave this empty. For backloading purposes, we can specify a date in YYYY-MM-DD format to replace the "current date" for the events_since parameter. This is useful, for example, when we have loaded the last 10 days of data and want to load the 10 days before this but don't want to waste time re-processing the already loaded data by setting events_since to 20.

events

Here we specify what events we want to transport from the CR events table. First, we specify an event_id from the CR events table, and then we choose a name for the staging table into which we will load these events. For example, bde9d7e2-0ee1-5460-23e0-a8c8973091b4wb_impression will load events with this ID into table analytics_staging.wb_impression

events_since_type

If we need to make an exception from the events_since rule for specific events, we can do so here. This can help us to prevent loading and calculating with needlessly large amounts of data. Example: bde9d7e2-0ee1-5460-23e0-a8c8973091b4: 7 will cause the previously specified wb_impression event to load only 7 days, even if events_since would be set to 30 days.

analytics: transforms

The transforms section of the analytics field serves to specify transformations in the analytics_staging schema. These tables will be dropped at the beginning of each run and calculated again from the events specified above. Execution order is the same as the order the queries are listed in, so in case of reliance, we need to ensure that the order is correct. Basically, all Postgres functions and practices are allowed here. Indexing is very helpful for smooth and faster runs of the component. The goal of the transforms is to prepare as straightforward bases for the live reports themselves. Make sure to run the query through a json escape tool such as this one to prevent issues caused by json formatting.

analytics: reports

Tables created in this part of the config are created in the analytics_d schema of the reporting DB. They are not dropped with each run, they are incrementally added instead. For each table, we need to specify a query for its base (it can be as simple as a select of one of the transforms, or it can be more complex if needed), a list of dimensions to be taken from this base, and a list of metrics. For each metric, we need to specify the data type and definition. We can also specify a filter that will apply to the whole base query if needed. Beware, if a change to table structure is needed, we need to contact the Meiro team to have the table dropped with all data in it before we can load the newly structured table. Plan accordingly!

Warning: When setting up various values in events_since, events_since_now, and events_since_type make sure that there are events in CR for the days that will be loaded! In cases with short event retention, you can ruin your reports by trying to load days without events, which will cause the incremental nature of loading to load empty data.

{
  "src": {
    "host": "CR host of the instance",
    "port": "CR port",
    "user": "general CR user of the instance",
    "#pass": "password of the user",
    "db": "database name of the instance"
  },
  "dst": {
    "host": "host of reporting DB for the instance",
    "port": "port of the reporting DB",
    "user": "user of the reporing DB",
    "#pass": "password of the user",
    "db": "reporting"
  },
  "skip_events_transport": boolean,
  "events_since": "integer",
  "events_since_now": "empty or YYYY-MM-DD",
  "events": {
    "event_id1": "table_name1",
    "event_id2": "table_name2",
    "event_id3": "table_name3"
  },
  "events_since_type": {
    "event_id1": integer,
    "event_id2": integer
  },
  "analytics": {
    "transforms": [
      {
        "transform1": "query1; create index on analytics_staging.transform1 (column1);"
      },
      {
        "transform2": "query2; create index on analytics_staging.transform2 (column1);"
      },
      {
        "transform3": "query3;"
      }
    ],
    "reports": {
      "se_campaigns_emails": {
        "base": "select * from transform1",
        "dimensions": {
          "dimension1": "COALESCE(column1, '')",
          "dimension2": "COALESCE(column2, '')"
        },
        "metrics": {
          "metric1": {
            "data_type": "INT",
            "definition": "COUNT(*) FILTER (WHERE column2 = 'type1')"
          },
          "metric2": {
            "data_type": "INT",
            "definition": "SUM(value) FILTER (WHERE column2 = 'type2')"
          },
          "metric1_entity": {
            "data_type": "hll",
            "definition": "hll_add_agg(hll_hash_text(customer_entity_id::text)) FILTER (WHERE column2 = 'type1')"
          },
          "metric2_entity": {
            "data_type": "hll",
            "definition": "hll_add_agg(hll_hash_bytea(decode(replace(customer_entity_id::text, '-', ''), 'base64'))) FILTER (WHERE column2 = 'type2')"
          }
        },
        "filter": ""
      }
    }
  }
}

Metabase

Further steps are made in Metabase, where we again need to reach out to Jan or create a Jira ticket in #help_it Slack channel to have it connected to the reporting DB. Then we can start writing queries into our tables defined in analztics"reports section of the config. The basic function to work with the HLL aggregated values is hll_cardinality(hll_union_agg(metric)). More advanced methods can be researched if needed.

Troubleshooting

It is a good practice to watch the logs - the component lists the steps as it is going through them. A SQL error will always be displayed and will be related to the last query listed. 

Take care of run times - if the component spends too much time loading events, perhaps fewer events could be loaded. If a query takes too long, maybe it could be simplified or the source tables better indexed.