
Data Exploration

Before ingesting data and processing events, the first task is to explore the data source. 

Typically we expect the client to provide proper documentation on their database, including a data dictionary and an ERD, which helps with understanding the dataset. In reality, however, this is often not readily available on the client's side, and our analysts will need to do some exploration on their own and confirm or clarify with the client any assumptions we have about the dataset. 

Here are the basic points an analyst should check on the data source before ingestion: 

Databases

  1. What are the important/relevant tables? (This is sufficiently covered if the client has proper documentation on their database, including a data dictionary and an ERD. Otherwise, we need to explore the data ourselves, describe it, and confirm our findings with the client.)
    1. Name 
    2. Description 
    3. Relation with other relevant tables 
  2. How can the tables be joined with each other? Preferably by unique identifiers.
  3. How often will data be received? One-time load / daily / weekly / monthly.
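The join-key checks above can be sketched in pandas. This is a minimal sketch, assuming hypothetical table and column names (`orders`, `customers`, `customer_id`) that stand in for whatever the client's schema actually uses:

```python
import pandas as pd

# Hypothetical tables; in practice these would be loaded from the client's database.
orders = pd.DataFrame({"customer_id": [1, 2, 2, 4], "amount": [10.0, 25.5, 5.0, 7.5]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Ben", "Cara"]})

# Verify the candidate join key is unique on the dimension side before joining
assert customers["customer_id"].is_unique

# Left-join with an indicator column to measure how many rows fail to match
joined = orders.merge(customers, on="customer_id", how="left", indicator=True)
match_rate = (joined["_merge"] == "both").mean()
print(f"match rate: {match_rate:.0%}")  # share of orders with a matching customer
```

A low match rate is itself a finding worth raising with the client: it may indicate deleted records, identifiers from a different system, or a wrong join-key assumption.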


Tables

  1. How many rows are there in this table? 
  2. What is the time range of the data?
  3. Is there a unique identifier field? Is it actually unique? How is this ID field useful? (Joining with other tables or other databases; customer IDs unique to Google/Facebook/Twitter.)
  4. What is the event time field for this table? This is especially important for transactional events like purchases, clicks, page views, transfers.
  5. Is data aggregated or should it be aggregated? What are the rollup rules?
  6. What is the type and contents of each field?
    1. General
      1. What is the % of duplicate data?
      2. What is the % of missing data? 
      3. Should the field contain NULLs? If it should not, how can we interpret or fill in the NULLs?
    2. Datetime
      1. What is the timezone? Do we need to change the timezone?
      2. Take note of timezone formatting.
    3. Numerical
      1. Can the field be negative or contain 0s?
      2. Does it make sense if the field is a float? 
      3. Would a sum/count/min/max/mean be helpful?
    4. Categorical
      1. What are the categories?
      2. How many categories are there?
    5. String
      1. Are there strings that are too long? (The current CDP implementation allows a maximum of 256 characters.)
      2. Is there a way to shorten the string? (e.g., parse UTM parameters from the URL.)
      3. Could the strings be categorical information? (Do a count/group-by to check.)
    6. JSON
      1. Are the parameters in the JSON standard across the field? (Ideally they should be.)
      2. Do you know how to pivot/flatten the JSON or extract the parameters? (For non-Postgres databases.)
    7. List / Array
      1. Do you know how to unnest the list/array? (For non-Postgres databases.)
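Several of the per-table checks above (row count, time range, identifier uniqueness, % duplicates, % missing, negative values) can be computed in a few lines of pandas. A minimal sketch, where the `events` table and its columns are illustrative assumptions rather than a real client schema:

```python
import pandas as pd

# Toy table standing in for a client table under exploration
events = pd.DataFrame({
    "event_id": [101, 102, 102, 104, 105],
    "event_time": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-02 11:30", "2024-01-02 11:30",
         None, "2024-01-05 09:15"]
    ),
    "amount": [10.0, -5.0, -5.0, 0.0, 12.5],
})

n_rows = len(events)                                        # 1. row count
time_range = (events["event_time"].min(),
              events["event_time"].max())                   # 2. time range
id_is_unique = events["event_id"].is_unique                 # 3. candidate identifier
pct_duplicates = events.duplicated().mean() * 100           # 6a. % fully duplicated rows
pct_missing = events.isna().mean() * 100                    # 6a. % missing per field
has_negatives = (events["amount"] < 0).any()                # 6c. negative numeric values

print(n_rows, id_is_unique, round(pct_duplicates, 1))
```

Running checks like these for every candidate table gives the analyst a concrete list of anomalies (non-unique IDs, missing timestamps, negative amounts) to confirm or clarify with the client.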
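For the JSON and list/array checks, pandas can pivot/flatten JSON fields and unnest arrays outside the database. A sketch under assumed column and key names (`payload`, `tags`, `utm_source` are illustrative, not from any real source):

```python
import pandas as pd

# Toy rows with a JSON-like column and a list/array column
raw = pd.DataFrame({
    "event_id": [1, 2],
    "payload": [{"utm_source": "google", "utm_medium": "cpc"},
                {"utm_source": "newsletter", "utm_medium": "email"}],
    "tags": [["sale", "spring"], ["welcome"]],
})

# Pivot/flatten the JSON column into one column per parameter
flat = pd.json_normalize(raw["payload"].tolist())
flattened = raw.drop(columns=["payload"]).join(flat)

# Unnest the list/array column: one output row per element
unnested = flattened.explode("tags", ignore_index=True)
print(len(unnested))  # 2 rows become 3 after exploding the tags
```

If the JSON parameters are not standard across the field, `json_normalize` will surface that immediately as sparsely populated columns, which is worth flagging to the client.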


CDP Structure and Payload

Data sources
  1. What is the nature of the data? 
  2. How should it be organized in CDP? 
    1. Separate data source
    2. Split into more than one data source (e.g., events from different mobile apps)
    3. Should be combined with another dataset
Events
  1. What are the actions available in the dataset? 
  2. Are all these actions valuable for CDP? 
  3. Any actions which should be created based on the data?
  4. How should actions be categorized to event types in CDP?
Profile Stitching
  1. What are the unique identifiers in each data source?
  2. Are there any edge cases (“null” as text, device IDs in the format “0000”, etc.)? Do you need to unify the identifiers?
  3. How can the data sources be connected based on these identifiers?
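Unifying identifiers before stitching usually means normalizing case and whitespace and nulling out placeholder values. The edge cases below (“null” as text, all-zero device IDs) come from the checklist above; the column name and exact rules are illustrative assumptions that would need to be confirmed per data source:

```python
import numpy as np
import pandas as pd

# Toy identifier column with the edge cases from the checklist
ids = pd.Series(["ABC123", "null", "NULL", "0000", " abc123 ", ""])

cleaned = ids.str.strip().str.lower()          # trim whitespace, unify case
cleaned = cleaned.replace({"null": np.nan,     # literal "null" text -> missing
                           "": np.nan})        # empty string -> missing
cleaned = cleaned.replace(r"^0+$", np.nan,     # all-zero device IDs are placeholders
                          regex=True)
print(cleaned.tolist())
```

After this normalization, "ABC123" and " abc123 " stitch to the same profile, while placeholder values no longer risk merging thousands of unrelated profiles into one.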
Attributes
  1. What are the fields available for each action?
  2. Do these fields (or combination of fields) cover all requested attributes?
  3. What are the mandatory attributes, and which exported attributes are required for destinations?
  4. Are there fields not currently required for attributes but potentially useful for the project?