The glossary is a helpful reference for Retrace Alerts and Notifications terms.
A monitor represents a specific source and type of data that Stackify collects. The collection mechanism depends on the monitor type. Example Monitor Types include:
- CPU %
- Memory Used
- SQL Query Monitor
- Website Check Monitor
- Error Rate Monitors
An alert represents a period of time during which the stream of metrics for a specific monitor meets the thresholds of a configured severity.
When configuring alerts for a monitor, define Alert Severities. There are three possible severities, listed in ascending order of severity:
- Warning
- Critical
- Outage
Retrace uses this order when determining the most severe alert for a given resource, app, or server. Keep your configuration in line with these semantics. Alert Severities define two requirements that must be met for an alert to open at or transition to that severity:
- Value Threshold: the instantaneous value requirement for a monitor metric
- Duration Threshold: the amount of time all monitor metric values must meet the value threshold requirement
As an example, consider a CPU % monitor with Warning, Critical, and Outage severities configured. An alert is opened at Warning when CPU has been greater than or equal to 95% for 10 consecutive minutes. If the monitor metric values remain above 98% for 15 minutes, the alert transitions to Critical. After 30 minutes it transitions to Outage. If a monitor metric is received with a value below 95% at any point, the alert closes.
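The interaction between value thresholds and duration thresholds can be sketched in a few lines of Python. This is an illustrative model only, not Retrace's implementation: the `Severity` class and `evaluate` function are hypothetical names, every threshold is treated as "greater than or equal to", and the Outage value threshold of 98% is an assumption, since the example above states only its 30-minute duration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Severity:
    name: str
    value_threshold: float  # metric value must be >= this
    duration_minutes: int   # for at least this many consecutive minutes


# Ascending order of severity. The Outage value threshold is an assumption.
SEVERITIES = [
    Severity("Warning", 95.0, 10),
    Severity("Critical", 98.0, 15),
    Severity("Outage", 98.0, 30),
]


def evaluate(samples: List[Tuple[int, float]]) -> List[Optional[str]]:
    """Return the alert severity (or None) after each (minute, value) sample."""
    # Minute at which each severity's value threshold started being met.
    streak_start: Dict[str, Optional[int]] = {s.name: None for s in SEVERITIES}
    states: List[Optional[str]] = []
    for minute, value in samples:
        current = None
        for sev in SEVERITIES:
            if value >= sev.value_threshold:
                if streak_start[sev.name] is None:
                    streak_start[sev.name] = minute
                if minute - streak_start[sev.name] >= sev.duration_minutes:
                    current = sev.name  # later entries are more severe
            else:
                streak_start[sev.name] = None  # streak broken
        states.append(current)
    return states


# CPU sits at 99% for 30 minutes, then drops to 90%.
samples = [(m, 99.0) for m in range(31)] + [(31, 90.0)]
states = evaluate(samples)
print(states[10], states[15], states[30], states[31])
```

With these samples the alert opens at Warning at minute 10, moves to Critical at minute 15 and Outage at minute 30, and closes at minute 31 when the value falls below 95%.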
A notification is a message that can be sent when an event occurs. Possible notification targets include:
- Slack (this can also be configured with an @channel mention)
Only production servers are eligible for notifications.
A notification group is a collection of associated contacts.
Notifiable Events / Types of Notifications
There are four types of notifications. Every type is subject to the monitor severities associated with the notification group: an alert, escalation, or reminder is not sent if the current severity does not match the configuration for that monitor on the notification group.
- Alerts: Sent whenever an alert transitions to a severity that is at or above the severity configured for the notification group.
- Reminders: A reminder is a type of notification that is sent on a repeating interval, measured from the last alert severity transition. For example, if a reminder is set up for 20 minutes and an unacknowledged alert started at 9:00 AM, reminder notifications are sent at 9:20 AM, 9:40 AM, and so forth. The interval for a reminder is configured on the notification group.
- Escalations: An escalation is a type of notification that is sent when an alert remains above an acknowledged severity for the configured escalation interval. If an alert is never acknowledged, the escalation notification is sent as soon as the escalation interval expires.
- Clears: Sent whenever any alert is closed. Alerts are closed when a monitor metric that does not meet the required severity is received.
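The severity filter and the reminder schedule described above can be sketched as follows. This is a simplified illustration, not Retrace's implementation: `meets_group_severity` and `reminder_times` are hypothetical helper names, the severity names are taken from the alert example earlier in this glossary, and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta
from typing import List

# Ascending order of severity, as described earlier in this glossary.
SEVERITY_ORDER = ["Warning", "Critical", "Outage"]


def meets_group_severity(alert_severity: str, group_severity: str) -> bool:
    """True when the alert is at or above the severity configured for the group."""
    return SEVERITY_ORDER.index(alert_severity) >= SEVERITY_ORDER.index(group_severity)


def reminder_times(last_transition: datetime, interval_minutes: int,
                   until: datetime) -> List[datetime]:
    """Reminder times on a repeating interval since the last severity transition."""
    step = timedelta(minutes=interval_minutes)
    times = []
    t = last_transition + step
    while t <= until:
        times.append(t)
        t += step
    return times


# 20-minute reminders for an unacknowledged alert that started at 9:00 AM.
start = datetime(2024, 1, 1, 9, 0)  # arbitrary date, for illustration only
for t in reminder_times(start, 20, start + timedelta(hours=1)):
    print(t.strftime("%I:%M %p"))
```

The loop prints 09:20 AM, 09:40 AM, and 10:00 AM, matching the reminder example in the text.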
Types of Notification Suppression
- Acknowledge: An acknowledgement represents an accepted alert severity that should not send further notifications while the alert is open. Acknowledgements are not time-bound. They do not expire. Typically, an alert is acknowledged when someone is actively working to fix the issue but should be notified if it gets worse. The suppression behavior obeys the following rules:
  - If an alert transitions below the acknowledged severity, notifications are not sent.
  - If an alert transitions above the acknowledged severity, notifications are sent.
  - If the alert transitions back to the acknowledged severity, it remains suppressed.
- Snooze: A snooze is a time-based form of suppression applied to a monitor. No notifications of any kind are sent until the snooze period expires. When the snooze expires, notifications resume according to their configuration.
A note on reminders and escalations: Reminders and escalations are processed on an interval. A Snooze does not impact the timing of this process. For example, consider a notification group that has been configured with a reminder interval of 10 minutes. The following demonstrates how snoozes work with reminders and escalations:
- 00: alert started
- 10: reminder sent
- 15: 10 minute snooze applied
- 20: no reminder sent because the monitor is snoozed
- 25: snooze expires (no notifications are sent when the snooze expires)
- 30: reminder sent
- Maintenance Window: A maintenance window is a special form of snooze that is applied to all monitors. All behavioral rules that apply to snooze also apply to maintenance windows.
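The acknowledgement rules and the snooze timeline above can be modeled with two small functions. This is a rough sketch under stated assumptions, not Retrace's implementation: `suppressed_by_ack` and `reminder_timeline` are hypothetical names, the severity names come from the alert example earlier in this glossary, and the minute at which a snooze expires is treated as no longer snoozed.

```python
from typing import List, Optional, Tuple

# Ascending order of severity, as described earlier in this glossary.
SEVERITY_ORDER = ["Warning", "Critical", "Outage"]


def suppressed_by_ack(new_severity: str, acknowledged: Optional[str]) -> bool:
    """True when a transition to new_severity should not send a notification."""
    if acknowledged is None:
        return False  # nothing acknowledged, nothing suppressed
    # At or below the acknowledged severity stays suppressed; above it notifies.
    return SEVERITY_ORDER.index(new_severity) <= SEVERITY_ORDER.index(acknowledged)


def reminder_timeline(interval: int, snooze_start: int, snooze_length: int,
                      end: int) -> List[Tuple[int, str]]:
    """Reminder ticks (minutes since alert start) and whether each one is sent.

    The reminder schedule keeps running on its own interval; a snooze only
    suppresses the ticks that fall inside the snooze period.
    """
    snooze_end = snooze_start + snooze_length  # assumption: end is exclusive
    events = []
    for minute in range(interval, end + 1, interval):
        snoozed = snooze_start <= minute < snooze_end
        events.append((minute, "suppressed" if snoozed else "sent"))
    return events


# 10-minute reminders with a 10-minute snooze applied at minute 15.
print(reminder_timeline(interval=10, snooze_start=15, snooze_length=10, end=30))
```

The printed timeline marks the reminder at minute 10 as sent, minute 20 as suppressed, and minute 30 as sent, which mirrors the timeline listed above. Under the stated rule, a maintenance window could reuse `reminder_timeline` unchanged, applied to every monitor.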