Data Audit Checklist

Before ingesting data and processing events, the first task is to explore the data source.

Typically we expect client to provide proper documentation on their database - including data dictionary and ERD (Entity Relationship Diagram), which will help with understanding the dataset. However, in reality, many times this is not readily available on client's side and our analyst will need to do some audit on our own and confirm/clarify with client any assumptions we have about the dataset.

The data audit is typically done on your local machine, and we have several tools to recommend for data coming in different forms:

Recommended tools

The nature of the data can be different for different data sources, they may come in the form of a relational database such as PostgreSQL or other SQL databases, or in the form of files with various formats, such as JSON or CSV files.

In the case of exploring data sitting in PostgreSQL, such as data coming from the Meiro Events SDK, we recommend the DBeaver SQL client, which supports PostgreSQL and other SQL databases as well, to do the data audit.

As for data of that comes in the form of a file with various file formats such as JSON or CSV files, we recommend using Python, a general purpose programming language to explore the data. (You can still use other tools and programming languages, but we do recommend using Python as Meiro Integrations have built-in Python components to use for processing data as well, and the data audit work can transfer over to data processing work.)

Here are the basic points an analyst should check on the data source before ingestion:

Databases

What are the important/relevant tables? (This could be sufficiently covered if client has proper documentation on their database - including data dictionary and ERD. Otherwise, we need to do data audit and describe it & get confirmation with the client)

Name
Description
Relation with other relevant tables

How can the tables join with each other? - preferably by unique identifiers
How often will data be received? One time load/ daily/weekly/monthly

Tables

How many rows are there in this table?
What is the time range of the data?
Is there a unique identifier field? Is this field unique? How is this ID field useful? (join with other tables/join with other databases/customer ID unique to Google/Facebook/Twitter)
What is the event time field for this table? This is especially important for transactional events like purchases, clicks, page views, transfers.
Is data aggregated or should it be aggregated? What are the rollup rules?
What is the type and contents of each field?
1. What is the % of duplicate data?
2. What is the % of missing data?
3. Should the field contain NULLs? If it should not, how can we interpret or fill in the NULLs?
1. What is the timezone? Do we need to change the timezone?
2. Take note of timezone formatting.
1. Can the field be negative or contain 0s?
2. Does it make sense if the field is a float?
3. Would a sum/count/min/max/mean be helpful?
1. What are the categories?
2. How many categories are there?
1. Are there strings that are too long? (Current CDP implementation allows for 256 characters max)
2. Is there a way to shorten the string? (Parse utms from url)
3. Can the strings be categorical information (Do a count, groupby to check)
1. Are the parameters in the JSON file standard across the field? (Ideally it should be)
2. Do you know how to pivot/flatten the JSON or extract the parameters (for non Postgres DB)
1. Do you know how to unnest the list/array? (for non Postgres DB)

General
Datetime
Numerical
Categorical
String
JSON
List / Array

CDP Structure and payload:

Data sources

What is the nature of the data?
How should it be organized in CDP?

Separate data source
Split into more than 1 data sources (ex.: events are from different mobile app)
Should be combined with another dataset

Events

What are the actions available in the dataset?
Are all these actions valuable for CDP?
Any actions which should be created based on the data?
How should actions be categorized to event types in CDP?
Are the events coming from legitimate traffic / sources that the business would consider as a customer? A real-world example of this:
- GoogleBot is a web crawler that uses stored cookies to generate a lot of events. These events can cause a large customer entity and should not be imported into CDP.
  When processing Meiro Events, it is recommended to use the following code block to exclude GoogleBot events
```
WHERE payload -> 'user_agent' ->> 'browser_family' NOT in ('AdsBot-Google', 'Googlebot')
```

Identity Stitching

What are unique identifiers in each data source?
Are there any edge cases (“null” as text, device ID in format “0000”, etc)? Do you need to unify identifiers?
How can the data sources be connected based on these identifiers?

Attributes

What are the fields available for each action?
Do these fields (or combination of fields) cover all requested attributes?
What are mandatory and exported attributes required for destinations?
Are there fields not required currently for attributes but potentially useful for the project?

Meiro Implementation Training

SOPs for Project Implementation

Meiro Customer Data Platform - Software Limits & Requirements

How to Request for FBM Access to Clients' Accounts

How to set up MI and ME API connection in CDP

Data Audit Checklist

Format events

Transform events (CR)

Load events to CDP

Load events to CDP (CR)

Customer Entity Retention Principles

How to Implement Meiro Events in MI and CDP and set up alerts for Meiro Events

Identity Stitching Setup

Identity Stitching rules

How to do Quality Checking on Identity Stitching

Identity Stitching edge cases and solutions

How to rerun identity stitching for all/some entities/events and recalculate attributes

Problems with large customer entities and how to investigate and solve them

Description

Attribute data types and sample values

Compound attributes

How to create context queries that require information of every customer entity

How to troubleshoot when there are no attribute values

How to delete attributes correctly

How to set up attributes in Meiro Business Explorer

How to generate simulated attributes values for demo purpose

Tutorial: How to set up RFM attributes (WIP)

Set export destinations

Facebook Custom Audience destination - upsert

The Trade Desk export destination

Google Analytics Destination

Criteo Destination

Twitter Ads Audience Destination

Google Analytics 4 Destination

Facebook Conversion API Implementation

How to find credentials for GA4 destination

Step-by-step guide for creating reports in Metabase

Set Up Business Explorer Instance & QA

Workspace documentation

Workspaces Monitoring Alert & Analyst Duty

Best Practices for Insights Creation

Component Testing and Documentation guide

Best Practices for Attributes Creation

Best Practices when setting up Destinations

Tutorial: How to implement product feed

Project Handover Procedures

Project Management Handbook

Project Management FAQ

How PMs work with Jira

Data Audit Checklist

Recommended tools

Databases

Tables

CDP Structure and payload:

Data sources

Events

Identity Stitching

Attributes

No Comments