Friday, June 26, 2020

Better Splunk alerts with the delta and predict keywords

Absolute-value-threshold triggered alert

A Splunk alert triggers when a quantity crosses a certain threshold.

For example, a bookstore owner wants to be alerted about abnormal ebook sales volume. A Splunk alert like the following could prompt the owner to investigate if sales are too low during a particular hour.
The following query can be scheduled to run every hour at 0 minutes past the hour, triggering when the number of results is greater than 0.

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*)
| eval platform = case(match(SERVER_NAME, "^aws-"), "AWS", 1=1, "EBAY")
| bin _time span=1h
| stats count by _time, stores, platform
| where count < 10
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"

When triggered, we can post a message to a Slack channel, for example:

$name$
$description$
The sales count dropped below 10 in the last hour.

Sales on store $result.stores$ on platform $result.platform$ were $result.count$,
which is below the threshold for the last hour.

Change-triggered alert with the delta keyword

This trigger won't be very useful if the bookstore is seasonal: on busy days it might never fire, and on slow days it might fire every single hour. What we really care about is not the absolute sales volume but abnormal sudden spikes compared to the background curve.

To accomplish this, we need to modify the above query so that the count in one bin is compared to the count in a prior bin, for example comparing this hour's sales count to the count one hour ago in order to get the difference.

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*)
| eval platform = case(match(SERVER_NAME, "^aws-"), "AWS", 1=1, "EBAY")
| bin _time span=1h
| stats count by _time, stores, platform
| delta count as sellchange p=1
| eval percentIncrease=(sellchange/count)
| where percentIncrease < -0.3
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"

Here we used the Splunk delta command with the p parameter.
The delta command computes the difference between nearby results using the value of a specific numeric field. For each event where the field is a number, delta computes the difference, in search order, between the field value for that event and the field value for the previous event, and writes the difference into newfield. If the newfield argument is not specified, the delta command uses delta(field). If the field is not a number in either of the two events, no output field is generated.

By calculating the difference from hour to hour, we can detect sudden changes in counts, averages, etc. between hours, which generally indicates an anomaly. This example has a bin size of 1 hour and p=1, i.e. an hour-over-hour comparison. For a day-over-day comparison we can use p=24; for week-over-week, p=168. The p parameter has to be interpreted together with the bin size: if the bin size is 1h, p=1 compares this hour with the last hour; if the bin size is 5m, p=1 compares the latest 5 minutes with the previous 5 minutes. With a small bin size the trigger is very sensitive to small, fast changes; with a larger bin size the trigger is less fussy, since small and rapid spikes are smoothed out while bigger problems are still revealed.
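
For instance, a day-over-day variant could aggregate all stores with timechart, so that each row is one hourly bin, and then compare each hour to the same hour of the previous day. This is only a hedged sketch: the -0.3 threshold and the two-day search window are illustrative assumptions, not values from the original alert.

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*) earliest=-2d@h latest=now
| timechart span=1h count as sells
| delta sells as dayOverDayChange p=24
| eval percentChange = dayOverDayChange / sells
| where percentChange < -0.3
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"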

Machine-learning-backed alert with the predict keyword

Splunk search results can be piped into a machine learning algorithm, so we can use previous data to predict the current data; if the actual data deviates too much from the prediction, an alarm should be triggered.

For example, the basic alert can be modified to:

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*) earliest=-7d@h latest=now
| eval platform = case(match(SERVER_NAME, "^aws-"), "AWS", 1=1, "EBAY")
| timechart span=1h count
| predict count as Prediction algorithm=LLP period=24 holdback=24 future_timespan=23 upper99=high lower99=low
| rename low(Prediction) as LowerThreshold
| rename high(Prediction) as UpperThreshold
| tail 24 | tail 23
| eval Result = case(count > UpperThreshold, "sellTooGood", count >= LowerThreshold, "Expected", count < LowerThreshold, "sellTooBad")
| where Result="sellTooGood" OR Result="sellTooBad"
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"

This query uses the LLP seasonal local level algorithm to make the prediction. Based on the assumption that the sales volume has a 24-hour cycle, we set period=24. future_timespan specifies how many data points we want to predict, and holdback specifies how many of the latest points we do not want to use for the prediction. The upper99 and lower99 options can be changed to upper90 and lower90, etc.; this decides how many false positives we can tolerate when LowerThreshold or UpperThreshold triggers an alert. upper90 gives a narrower confidence band than upper99, so it will fire more often and produce more false positives, while upper99 tolerates larger deviations before alerting.
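
For example, assuming the same schedule and data, switching to a 90% band only requires changing the predict line; the rest of the pipeline stays the same. This is a sketch only, reusing the illustrative high/low field names from the query above:

| predict count as Prediction algorithm=LLP period=24 holdback=24 future_timespan=23 upper90=high lower90=low
| rename low(Prediction) as LowerThreshold
| rename high(Prediction) as UpperThreshold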

Alert strategy

When setting alerts, we face the eternal question of how much to alert. If the alert is too sensitive, we get lots of false alarms; at the other extreme, we might miss critical events where even one miss could hurt us badly. A typical scenario: an instructor buys 100 books in one hour, and in the next hour the sales count is back to normal, say 20. The change alert will be triggered, but this is a normal situation. If the 100-book purchase happens at midnight and the alert triggers a PagerDuty call, the responder might get annoyed. On another midnight, a disk failure could also trigger the change alert, and ignoring it in that case would result in financial loss or harm to the business's reputation. Of course we can amend the existing alert for this particular case, but another two similar situations could trigger the alert, and the story goes on and on.

No matter what technology we use, we have to sample and then decide, to be or not to be; unfortunately there is no silver bullet. Someone has to apply domain knowledge to decide what is expected and what is not. However, we can use a set of alerts combined with other techniques to alleviate the problem when cutting a precise decision threshold in a single alert is hard.

One dimension we can engineer is the Splunk sampling strategy: we can sample the data in different ways. Let's call them the quick-early, quick-late, and slow-late strategies.
  • quick-early. If you want a "quick" alarm, sample every 5 minutes over the last 5 minutes. This has the most false positives, but it captures most fast-changing abnormal signals as soon as possible.
  • quick-late. If you want a "quick" alarm that takes more time into consideration, sample every 5 minutes, but over the last 15 minutes. This gets rid of the majority of false positives; however, abnormal signals are detected later than with the previous strategy.
  • slow-late. If you want a "slow" alarm, sample every 15 minutes over the last 15 minutes. This has the fewest false positives, but it may miss fast-changing anomalies and is also slower to detect them than the quick-early strategy.
For example, we can create a Splunk query that summarizes each hour over the last 15 hours and only triggers if the hour-based alarm goes off more than a few times. Scheduling that alarm to run every 5 hours, summarizing over 15 hours, should even out fast-changing spikes and thus reduce the false positives.
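
A hedged sketch of that idea, assuming the hourly ebook sales query from above, an illustrative per-hour threshold of 10, and a trigger of at least 3 bad hours out of the last 15 (all three numbers are assumptions for illustration):

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*) earliest=-15h@h latest=@h
| timechart span=1h count as sells
| eval badHour = if(sells < 10, 1, 0)
| stats sum(badHour) as badHours
| where badHours >= 3

Scheduling this every 5 hours and alerting when it returns a result gives roughly the slow-late behavior described above.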

Another dimension we can engineer is how we respond to the alert. For example, if the alert is configured to trigger PagerDuty, we can configure the PagerDuty response strategy. We can have an alarm go off and then auto-snooze or auto-resolve if it does not go off again. This way we can have it alert us at low priority, but escalate to high priority if it happens again, or let it be resolved naturally by PagerDuty if it does not recur within X hours.

Just as a work group needs different types of team members, we can use a group of Splunk alerts to monitor the same concern. For example, we can have one low-priority alert that accepts a lot of false positives and is set to auto-resolve if it does not repeat, or that simply logs the alert message to a Slack channel or sends an email for later review. We can have other high-priority alerts that only capture obvious issues and trigger PagerDuty.

