Workspaces Monitoring Alert & Analyst Duty
Overview
Monitoring alerts have been set up on all client instances that are in production. Whenever an instance or workspace fails, an error notification is sent to the respective Slack channel to notify us of the error.
The alert channels and the alerts each one receives are listed below:
Slack Channel | Types of Alerts |
#alerts_cdp | Mainly cache refresh alerts and segment export failures in Business Explorer |
#alerts_cps | Alerts sent whenever CPS component in Meiro Integrations fails |
#alerts_me | Alerts sent by the Meiro Events Monitoring App whenever there is no Meiro Events data for a period of time |
#alerts_workspaces and #alerts_reports | Various alerts related to failure in workspaces in Meiro Integrations |
Analysts Duty
Analysts take turns monitoring these alert channels on a weekly rotation. As the analyst-on-duty for the week, please go through the alerts every working day and tag the analysts responsible for the affected client instance.
The list of all client instances and their stakeholders can be found here: Meiro All Instances, Project, Analysts - Google Sheets
Guidelines
General guidelines to follow for the analyst-on-duty:
- Please check the alerts at least twice a day.
- If you have access to the workspace, you may want to check the error message.
- If you have encountered similar errors before, you can include the error message and the suggested fix when tagging the analyst.
- If the workspace shows no errors when you check it, this usually means that it has been re-run successfully and there is no need to tag the analyst.
Cache refresh (#alerts_cdp)
Cache refreshes are done for the diagnostic dashboard and a few other things that are too expensive to calculate in real time. An alert is sent if a refresh is delayed for some reason (for example, the DB being down or overutilized, or some other error) or if the cache has not been refreshed within the scheduled time + 20%. Such an alert usually points to either 1) the cache being slow or 2) a temporary error.
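To make the "scheduled time + 20%" rule concrete, here is a minimal sketch of how such a check could work. It assumes the rule means "the time since the last refresh exceeds the scheduled interval by more than 20%". The function name `is_refresh_overdue` and its parameters are hypothetical and only for illustration; this is not the actual Meiro monitoring code.

```python
from datetime import datetime, timedelta


def is_refresh_overdue(last_refresh, scheduled_interval, now, tolerance=0.20):
    """Return True if the cache looks overdue for a refresh.

    Assumption: "scheduled time + 20%" is read as the scheduled
    interval multiplied by (1 + tolerance). All names are hypothetical.
    """
    allowed = scheduled_interval * (1 + tolerance)
    return now - last_refresh > allowed


# Example: a cache scheduled to refresh every hour.
last = datetime(2024, 1, 1, 0, 0)
hourly = timedelta(hours=1)

# 71 minutes after the last refresh is still within 60 min + 20% (72 min).
print(is_refresh_overdue(last, hourly, last + timedelta(minutes=71)))
# 73 minutes after the last refresh exceeds the allowed window.
print(is_refresh_overdue(last, hourly, last + timedelta(minutes=73)))
```

Under this reading, a cache scheduled to refresh hourly would trigger an alert once more than 72 minutes have passed without a refresh.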
When cache refresh alerts appear in #alerts_cdp, follow this procedure:
- Run the cache refresh manually.
- If that does not solve the problem, report it as a bug.