Log Query Monitor Best Practices
Retrace Log Query Monitors
Retrace supports the monitoring of Errors and Logs and can alert on ingestion using our Metrics, Alerts and Notifications systems. The purpose of this document is to outline what log query monitoring is, how it works, and best practices for configuring the monitoring for durable and consistent use.
For the purposes of this document, Log Monitoring and Error Monitoring are considered the same thing. Errors are logs, so with one exception they are treated exactly the same way.
TL;DR: Monitor Configuration Recommendations
- Search Time frame of 30 minutes or greater
- Check Frequency such that the monitor runs at least 4 times over a given search time period. (These are minimum recommendations on the intervals. A more frequent interval will not hurt and might be beneficial.)
- Timing alert thresholds (After X minutes) should be no less than the frequency.
Examples:
- For a "Last 30 minutes" Search Time frame, do a "Every 5 minutes" Check Frequency interval minimum.
- For a "Last 60 minutes" Search Time frame, do a "Every 15 minutes" Check Frequency interval minimum.
- For a "Every 5 minutes" Check Frequency interval, make the "After X Minutes" time threshold 5 minutes or more.
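The three recommendations above can be expressed as a small check. This is only an illustrative sketch: `LogMonitorConfig` and `review` are hypothetical names, not part of any Retrace API, and the fields simply mirror the three timing settings described in this document.

```python
from dataclasses import dataclass


@dataclass
class LogMonitorConfig:
    """Hypothetical container for the three timing settings discussed above."""
    search_time_frame_min: int   # "Last N minutes"
    check_frequency_min: int     # "Every N minutes"
    after_x_minutes: int         # "After X minutes" alert threshold


def review(config: LogMonitorConfig) -> list[str]:
    """Return a warning for each recommendation the config does not meet."""
    warnings = []
    if config.search_time_frame_min < 30:
        warnings.append("Search Time frame is under 30 minutes")
    # At least 4 checks per time frame means the interval is at most 1/4 of it.
    if config.check_frequency_min * 4 > config.search_time_frame_min:
        warnings.append("Fewer than 4 checks per Search Time frame")
    if config.after_x_minutes < config.check_frequency_min:
        warnings.append("'After X minutes' is shorter than the Check Frequency")
    return warnings


print(review(LogMonitorConfig(30, 5, 5)))    # meets all three recommendations
print(review(LogMonitorConfig(5, 1, 2)))     # time frame too short
print(review(LogMonitorConfig(60, 30, 5)))   # too few checks, threshold too short
```

The second and third calls use the same settings as "The Short Timer" and "The Occasional Minder" cases discussed later in this document.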
What is Log Query Monitoring?
Log Query Monitoring captures the result count of a log query and records it as a metric that can be monitored and alerted on. It takes a point-in-time snapshot of a log search and reports the result through a metric, just like any other App or Server monitor.
A log query monitor can be built in two ways:
- An ad hoc query string attached to the monitor
- A saved query attached to the monitor
A saved query has two advantages over an ad hoc query string. First, a saved query can include specific property filters in addition to any query string filter, which allows for a more precise filter. Second, you can tailor the query to your desired result without having to jump through screens to update each monitor the saved query is attached to. Please note that log monitors ignore any time range filters on a saved query.
The anatomy of the log query monitor is:
- Monitor Name: configurable value that appears in the main Log Query Monitors list
- Check Frequency: the interval at which the query runs
- Search Query (ad hoc) vs Saved Search (saved query): the actual query that is run
- Search Time Frame: the time frame (in minutes) that is evaluated in the past to perform the query. As an example, for a 60 minute time frame the entire last hour is queried
- Query Matches: where the alerting is set up to alert on the number of matches returned by the query
How does Log Query Monitoring differ from other types of monitoring?
Most of the monitors Retrace ingests are considered synchronous monitors. This means that the data that is transmitted to the Metrics pipeline is directly drawn from the source. For example, a CPU % monitor takes a snapshot of the CPU % and submits that data to the Metrics pipeline. Synchronous monitors are very durable and stable by their nature.
Log Monitors are asynchronous monitors. This means that the data transmitted to the Metrics pipeline is separated from the ingestion of the source data. The data captured will eventually be complete, but it is not necessarily complete at a given time. When a snapshot of the count is taken, not ALL the data is necessarily available for that time range. There is a certain amount of fragility baked into log monitoring that needs to be accounted for when configuring the monitors.
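To make the gap between a snapshot and the eventual count concrete, here is a minimal sketch. The log events and their ingestion delays are invented for illustration, and the function names are hypothetical; real ingestion latency varies and is not modeled by Retrace in this way.

```python
# Each log is (event_minute, ingest_delay_minutes). The delays are invented
# for illustration only.
LOGS = [(1, 0), (2, 3), (3, 1), (4, 6), (4, 2)]


def count_visible(logs, window_start, window_end, now):
    """Count logs whose event time is in [window_start, window_end] and that
    have finished ingesting by `now`."""
    return sum(
        1
        for event, delay in logs
        if window_start <= event <= window_end and event + delay <= now
    )


def count_final(logs, window_start, window_end):
    """Count every log in the window, regardless of when it was ingested."""
    return sum(1 for event, _ in logs if window_start <= event <= window_end)


# Snapshot taken at minute 5 over the previous 5 minutes.
snapshot = count_visible(LOGS, 0, 5, now=5)
final = count_final(LOGS, 0, 5)
print(snapshot, final)  # 3 5
```

The snapshot at minute 5 sees only 3 of the 5 logs because two of them have not finished ingesting yet. Querying the same window again at minute 10 would return all 5.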
To understand how to properly configure a log monitor, we need to understand how log ingestion works.
Log Query Monitor Configuration Pitfalls
There are a number of ways to misconfigure a log query monitor. A configuration can miss its intended outcome because of a log ingestion delay, a collection error of some nature, or a check frequency or time frame interval that is not set correctly. Let's consider the following cases:
The Short Timer
Configuration
Time Frame: 5 minutes
Frequency: 1 minute
Value Threshold: > 0
Duration Threshold: 2 minutes
This means we are looking at a 5 minute search time frame every 1 minute and alerting if we see a count > 0 after 2 minutes. To highlight what is wrong with this configuration consider the following timeline.
Time | 12pm | :01 | :02 | :03 | :04 | :05 | :06 | :07 | :08 | :09 | :10 | :11 | :12 | :13 | :14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Logs Count at that time | 0 | 5 | 5 | 10 | 5 | 0 | 0 | 0 | 5 | 7 | 2 | 0 | 0 | 0 | 0 |
Logs Count final | 5 | 5 | 10 | 7 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 0 |
This example shows that the log counts, when close to the leading edge of ingestion, are not representative of the final count. A monitor with such a short time frame can easily miss logs that you want to act upon. A simple solution is to increase the time frame to something that accounts for the lack of data at the leading edge. We recommend a time frame of at least 30 minutes.
The Occasional Minder
Configuration
Time Frame: 60 minutes
Frequency: 30 minutes
Value Threshold: > 0
Duration Threshold: 5 minutes
Time | 12pm | :30 | 1pm | :30 | 2pm | :30 | 3pm | :30 | 4pm | :30 | 5pm | :30 | 6pm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Logs Count | 0 | 5 | 5 | 0 | 0 | 0 | 5 | 0 | 5 | 5 | 0 | 5 | 0 |
Note | -- | 5 logs submitted | -- | -- | -- | -- | 5 logs submitted | Collection Error | 5 logs submitted | -- | -- | 5 logs submitted | Monitor ran late |
This example shows the danger of checking too infrequently. Over any given 60 minutes, the monitor has only 2 chances to see the data. In the example, the monitor fires at the 1pm check and clears at the 1:30 check, then fails to fire at the 3:30 and 6pm checks because of a collection error and a late run, respectively.
This is fragile because any issue that occurs within the pipelines will cause a missed alert. We recommend that the check interval be no more than 1/4 of the time frame, so that the monitor runs at least 4 times per time frame. The more opportunities the monitor gets to act upon the data, the better.
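The number of chances a monitor gets can be worked out directly. The sketch below assumes checks are aligned to multiples of the frequency and that a check at minute `t` searches the window `(t - time_frame, t]`; `checks_that_see` is a hypothetical helper, not a Retrace function.

```python
def checks_that_see(event_min, time_frame_min, frequency_min, failed_checks=()):
    """List the check times (minutes) whose search window contains the event.

    A check at time t looks back over (t - time_frame_min, t]. Checks listed
    in `failed_checks` are treated as lost (collection error or late run).
    """
    seen = []
    # First check at or after the event, aligned to the frequency.
    t = -(-event_min // frequency_min) * frequency_min
    while t - time_frame_min < event_min:
        if t not in failed_checks:
            seen.append(t)
        t += frequency_min
    return seen


# Minutes are counted from 12pm, so 30 is 12:30 and 180 is 3pm.
print(len(checks_that_see(30, 60, 30)))                    # 2
print(len(checks_that_see(30, 60, 15)))                    # 4
print(checks_that_see(180, 60, 30, failed_checks={210}))   # [180]
```

With a 30 minute frequency, losing the 3:30 check leaves a single observation of the 3pm logs, which is not enough for the timer to start on one check and the alert to open on the next.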
Rocky Balboa
Configuration
Time Frame: 60 minutes
Frequency: 5 minutes
Value Threshold: > 0
Duration Threshold: 5 minutes
Time | 12pm | :05 | :10 | :15 | :20 | :25 | :30 | :35 | :40 | :45 | :50 | :55 | 1pm | :05 | :10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Logs Count | 0 | 5 | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 0 |
Action | -- | timer+ | Alert Opens & Notification | -- | Collection Error; Alert Closes & Clear Notification | timer+ | Alert Opens & Notification | -- | -- | -- | -- | -- | -- | -- | Alert Closes & Clear Notification |
This example shows a well-configured monitor. It runs often enough that, although the Collection Error at 12:20 causes an additional alert and notification, the logs were still alerted and notified on over that full hour. This monitor takes a beating and is still able to yell "ADRIAN!" at the end of it.
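The Action row can be reproduced with a simplified model of the alert timer. This is a sketch under assumptions, not Retrace's actual alerting logic: it assumes the timer starts on the first check over the threshold, the alert opens once the condition has held for the "After X minutes" value, and any check at or under the threshold closes the alert or resets the timer. `alert_actions` is a hypothetical name.

```python
def alert_actions(counts, frequency_min, after_x_min, threshold=0):
    """Return one action label per check for a '> threshold' monitor."""
    actions = []
    held_min = None      # minutes the condition has held; None = not breaching
    alert_open = False
    for count in counts:
        if count > threshold:
            held_min = 0 if held_min is None else held_min + frequency_min
            if alert_open:
                actions.append("--")
            elif held_min >= after_x_min:
                alert_open = True
                actions.append("open")
            else:
                actions.append("timer+")
        else:
            actions.append("close" if alert_open else "--")
            held_min = None
            alert_open = False
    return actions


# Logs Count rows from the two tables above.
rocky = [0, 5, 5, 5, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0]
minder = [0, 5, 5, 0, 0, 0, 5, 0, 5, 5, 0, 5, 0]

print(alert_actions(rocky, frequency_min=5, after_x_min=5))
print(alert_actions(minder, frequency_min=30, after_x_min=5).count("open"))  # 2
```

Under this model the "Rocky Balboa" counts produce the same open and close sequence as the Action row, while "The Occasional Minder" counts open an alert for only 2 of the 4 batches of logs submitted.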
Recommendations
Given all of this, our recommendation for log monitor configuration is:
- Search Time frame of 30 minutes or greater
- Check Frequency such that the monitor runs at least 4 times over a given search time period. (These are minimum recommendations on the intervals. A more frequent interval will not hurt and might be beneficial.)
- Timing alert thresholds (After X minutes) should be no less than the frequency.