Friday, June 26, 2020

Better Splunk alerts with the delta and predict keywords

Absolute value threshold triggered alert

The most basic Splunk alert triggers when a quantity crosses a fixed threshold.

For example, a bookstore business wants to be alerted about abnormal ebook sales. A Splunk alert like the following could prompt the owner to investigate when sales are too low during a particular hour.
The query can be scheduled to run every hour at 0 minutes past the hour, with the trigger condition set to number of results greater than 0.

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*)
| eval platform = case(match(SERVER_NAME, "^aws-"), "AWS", 1=1, "EBAY")
| bin _time span=1h
| stats count by _time, stores, platform
| where count < 10
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"
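
For reference, the same schedule and trigger condition can be expressed in savedsearches.conf roughly as follows (a hedged sketch: the stanza name is made up, and these options are normally just picked in the alert editing UI):

[low_hourly_ebook_sales]
# search = <the full query above, on a single line>
enableSched = 1
cron_schedule = 0 * * * *
counttype = number of events
relation = greater than
quantity = 0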

When triggered, the alert can post a message like the following to a Slack channel:

$name$
$description$
The sales count dropped under 10 in the last hour.

Sales on store $result.stores$ on platform $result.platform$ came to $result.count$,
which is below the threshold for the last hour.

Change-triggered alert with the delta keyword

This trigger won't be very useful if the bookstore business is seasonal. On busy days the trigger might never fire, while on slow days it might fire every single hour. What we really care about is not the absolute amount of sales but abnormal sudden changes compared to the background curve.

To accomplish this, we modify the query above so that each time bin is compared with a prior bin, for example comparing this hour's sales count with the count one hour earlier to get the difference.

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*)
| eval platform = case(match(SERVER_NAME, "^aws-"), "AWS", 1=1, "EBAY")
| bin _time span=1h
| stats count by _time, stores, platform
| delta count as sellchange p=1
| eval percentIncrease=(sellchange/count)
| where percentIncrease < -0.3
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"

Here we use the Splunk delta command with the p parameter.
The delta command computes the difference between nearby results using the value of a specific numeric field. For each event where that field is a number, delta computes the difference, in search order, between the field value in the current event and the field value in the previous event, and writes the difference into a new field. If the new field name is not specified, delta uses delta(field). If the field is not a number in either of the two events, no output field is generated.

By calculating the difference from hour to hour, we can detect sudden changes in counts, averages, etc. between hours, which usually indicates something abnormal. This example uses a bin size of 1 hour and p=1, that is, an hour-over-hour comparison. For a day-over-day comparison we can use p=24; for week-over-week, p=168. The p parameter has to be interpreted together with the bin size: if the bin size is 1h, p=1 compares this hour with the last hour; if the bin size is 5m, p=1 compares the latest 5 minutes with the previous 5 minutes. With a small bin size the trigger is very sensitive to small, fast changes; with a larger bin size it is less fussy, so small rapid spikes get smoothed out while bigger problems are still revealed.
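
Note that delta works purely on the order of the search results and has no by-clause, so when the stats output is split by stores and platform, p=1 points at the previous row, which is not necessarily the previous hour of the same store. Here is a hedged day-over-day sketch that keeps one row per hour (aggregated over all stores) so that p=24 really means "the same hour yesterday"; the index and fields are the same hypothetical ones as above:

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*)
| bin _time span=1h
| stats count by _time
| delta count as sellchange p=24
| eval percentIncrease=(sellchange/count)
| where percentIncrease < -0.3
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"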

Machine-learning-backed alert with the predict keyword

Splunk search results can be piped into a machine learning algorithm, so previous data can be used to predict the current data; if the actual data deviates too much from the prediction, an alarm should be triggered.

For example, the basic alarm can be modified to:

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*) earliest=-7d@h latest=now
| eval platform = case(match(SERVER_NAME, "^aws-"), "AWS", 1=1, "EBAY")
| timechart span=1h count
| predict count algorithm=LLP period=24 holdback=24 future_timespan=23 upper99=high lower99=low as Prediction
| rename low(Prediction) as LowerThreshold
| rename high(Prediction) as UpperThreshold
| tail 24 | tail 23
| eval Result = case(count > UpperThreshold, "sellTooGood", count >= LowerThreshold, "Expected", count < LowerThreshold, "sellTooBad")
| where Result="sellTooGood" OR Result="sellTooBad"
| eval comment="see instruction http://bookstore.com/handle-low-sell.htm"

This query uses the seasonal LLP algorithm to make the prediction. Based on the assumption that sales follow a 24-hour cycle, we set period=24. future_timespan specifies how many data points we want to predict, and holdback specifies how many of the latest points should not be used to fit the model, so that the most recent actual values can be compared against what the model expected. The tail 24 | tail 23 step keeps the 23 most recent complete hours and drops the newest, usually still incomplete, hour (each tail also reverses the result order, so applying it twice restores chronological order). The upper99 and lower99 options can be changed to upper90 and lower90, etc.; they decide how much tolerance we have for false positives. A 99% band is wider than a 90% band, so upper99/lower99 fires fewer alerts and produces fewer false positives, while upper90/lower90 is more sensitive and catches problems earlier at the cost of more noise.
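
If the sales pattern repeats weekly rather than daily (weekends behaving differently from weekdays), the same idea works with period=168 and a longer training window. A hedged sketch, reusing the hypothetical index and fields from above and a 95% confidence band:

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*) earliest=-30d@h latest=now
| timechart span=1h count
| predict count algorithm=LLP period=168 holdback=24 future_timespan=23 upper95=high lower95=low as Prediction
| rename low(Prediction) as LowerThreshold
| rename high(Prediction) as UpperThreshold
| tail 24 | tail 23
| where count > UpperThreshold OR count < LowerThreshold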

Alert strategy

When setting alerts we face the eternal question of how much to alert. If the alert is too sensitive, we get lots of false alarms; at the other extreme, we might miss critical events where even a single miss could hurt us badly. A typical scenario: an instructor buys 100 books in one hour, and in the next hour the sales count is back to normal, say 20. The change alert fires, but nothing is actually wrong. If the 100-book purchase happens at midnight and the alert triggers a PagerDuty call, the responder will be annoyed. On another midnight, a disk failure could trigger the same change alert, and ignoring that one would cause financial loss or harm the business reputation. Of course we can amend the existing alert for this particular case, but two more similar situations will trigger it again, and the story goes on and on.

No matter what technology we use, we have to sample the data and then decide, to be or not to be; unfortunately there is no silver bullet. Someone has to apply domain knowledge to decide what is expected and what is not. However, when cutting a precise decision threshold in a single alert is hard, we can use a set of alerts combined with other techniques to alleviate the problem.

One dimension we can engineer is the Splunk sampling strategy; we can sample the data in different ways. Let's call the options quick-early, quick-late and slow-late.
  • quick-early. If you want a "quick" alarm, sample every 5 minutes over the last 5 minutes. This has the most false positives, but it captures most of the fast-changing abnormal signals as soon as possible.
  • quick-late. If you want a "quick" alarm that takes more time into consideration, sample every 5 minutes but over the last 15 minutes. This gets rid of the majority of the false positives; however, abnormal signals are detected later than with the previous strategy.
  • slow-late. If you want a "slow" alarm, sample every 15 minutes over the last 15 minutes. This has the fewest false positives, but it may miss fast-changing anomalies and is also slower to detect them than the quick-early strategy.
For example, we can create a Splunk query that summarizes each hour over the last 15 hours and only triggers if the hour-based condition fires more than a few times. Running that alarm every 5 hours, summarizing over 15 hours, evens out the fast-changing spikes and reduces the false positives, as sketched below.
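
A hedged sketch of that idea, reusing the hypothetical index and the "fewer than 10 sales per hour" condition from the first query: count how many of the last 15 complete hours fell below the threshold, and only return a row (i.e. trigger) when more than 3 of them did.

index=Sell sourcetype=Ebook stores IN (amazon109, ebay38, amazon339) (ACCOUNT_NAME=*-seller123-*) earliest=-15h@h latest=@h
| bin _time span=1h
| stats count by _time
| where count < 10
| stats count as lowHours
| where lowHours > 3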

Another dimension we can engineer is how we respond to the alert. For example, if the alert is configured to trigger PagerDuty, we can tune the PagerDuty response strategy. We can have an alarm go off and then auto-snooze or auto-resolve if it does not go off again. This way the alert can start at low priority, escalate to high priority if it fires again, or be resolved naturally by PagerDuty if it does not reoccur within X hours.

Just as a work group needs different types of team members, we can use a group of Splunk alerts to monitor the same concern. For example, one low-priority alert can cast a wide net and tolerate many false positives; it can be set to auto-resolve if it does not repeat, or simply log its message to a Slack channel or send an email for later review. Other high-priority alerts capture only the obvious issues and page through PagerDuty.


Wednesday, June 24, 2020

How to build rpms

prepare the source files

first of all, you need all your source files ready under the ~/RPMBUILD/SOURCES directory.
For example, if you want to build a jdk1.8.0 rpm, you need the java8-local_policy.jar and java8-US_export_policy.jar from the Java 8 JCE package and the jre-8u111-linux-x64.tar.gz from the Oracle website.
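
A hedged sketch of the preparation steps; the %_topdir override is only needed because this post uses ~/RPMBUILD instead of rpmbuild's default ~/rpmbuild, and it assumes you have no existing ~/.rpmmacros to preserve:

# point rpmbuild's %_topdir at ~/RPMBUILD
echo '%_topdir %(echo $HOME)/RPMBUILD' > ~/.rpmmacros

# create the standard build tree and drop the sources in place
mkdir -p ~/RPMBUILD/{SOURCES,SPECS,BUILD,RPMS,SRPMS}
cp jre-8u111-linux-x64.tar.gz java8-local_policy.jar java8-US_export_policy.jar ~/RPMBUILD/SOURCES/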

prepare the .spec file

once the source files are in place, you need to write a mybuild.spec file that tells rpmbuild what to do in each section of the package: %prep, %build, %install, %clean, %files, %preun, %post and so on.
this file has to be placed in the ~/RPMBUILD/SPECS directory.
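
A minimal skeleton for such a spec file might look like the following; the package name, install path and extracted directory name are illustrative assumptions, not the exact spec used in this post:

Name:           jdk8-local
Version:        1.8.0
Release:        1%{?dist}
Summary:        Local repackaging of Oracle JRE 8 with JCE policy files
License:        Oracle Binary Code License
Source0:        jre-8u111-linux-x64.tar.gz
Source1:        java8-local_policy.jar
Source2:        java8-US_export_policy.jar
# bundled binaries: skip automatic dependency generation
AutoReqProv:    no

%description
Repackages the Oracle JRE tarball together with the JCE unlimited-strength policy jars.

%prep
# assumes the tarball extracts to jre1.8.0_111
%setup -q -n jre1.8.0_111

%install
mkdir -p %{buildroot}/opt/jdk8
cp -a . %{buildroot}/opt/jdk8/
install -m 644 %{SOURCE1} %{buildroot}/opt/jdk8/lib/security/local_policy.jar
install -m 644 %{SOURCE2} %{buildroot}/opt/jdk8/lib/security/US_export_policy.jar

%files
/opt/jdk8

%changelog
* Wed Jun 24 2020 builder <builder@example.com> - 1.8.0-1
- initial local package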

build rpm

now you can run the following command to build the rpm
rpmbuild -ba ~/RPMBUILD/SPECS/mybuild.spec

this command will put the resulting rpm in ~/RPMBUILD/RPMS/x86_64.
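
To sanity-check the build, you can inspect the package metadata and payload before publishing it:

rpm -qpi ~/RPMBUILD/RPMS/x86_64/*.rpm
rpm -qpl ~/RPMBUILD/RPMS/x86_64/*.rpm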

post to yum repo

You can optionally set things up so that the rpm is managed by yum.
Assume you plan to place the repository in /var/www/html, which is the default document root for Red Hat based Apache installations.
Also assume you have already run

yum install yum-utils;
reposync -p /var/www/html/;

so that you can serve this directory with Apache.

now all you have to do is copy the rpm into /var/www/html/fedora/jdk8/
and then run:
createrepo /var/www/html/fedora/jdk8/

If your Apache server is running, your new rpm should now be available to yum clients.
If Apache is not running yet, install and start it:
yum install httpd;
chkconfig httpd on;
service httpd start;


config yum client

On the client host you need to point yum at the repository server you just set up.

For simplicity, let's use the yum server itself as the client.

Create a /etc/yum.repos.d/local.repo file:
[Local-Repository]
name=mylocal
baseurl=http://ip-of-my-server/fedora/jdk8
enabled=1
gpgcheck=0

final result

now once you type 
yum install rpm-name-set-in-spec-file

the rpm will be installed.
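
If the package does not show up right away, refreshing the client-side metadata usually helps:

yum clean all;
yum repolist;
yum list available | grep rpm-name-set-in-spec-file;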




4 types of git branches

If you have worked as a software engineer in a corporate environment, you are probably already familiar with the develop branch and the master branch. The svn merge strategy is slightly different from the git strategy, but they are quite similar. In git, branch names are arbitrary; we can name a branch whatever we like, and all branch names are equal. However, the industry has adopted some conventions. Usually we have a branch named develop, created from the master branch, and all the developers contribute to it. From this branch, test releases are often created, which are the temporary packages deployed to the QA servers.

Once the develop branch is tested and stable, software developers or release engineers will merge the develop branch to the master branch. The master branch is usually used to create the package that will be released to production and deployed on the production servers.

Besides the develop branch and the master branch, we usually encounter 4 other types of repository branches.

1. Bugfix branches are typically used for fixing bugs against a release branch. Conventionally these branches are prefixed with bugfix/.
2. Feature branches are typically created from, and merged back into, the development branch; they are often used for specific feature work. Conventionally these branches are prefixed with feature/.

Bugfix branches and feature branches are where developers check in code, create pull requests, and check in code-review updates. They are safe branches to experiment on: any mistake only affects that particular branch and won't affect developers working on other branches. A bugfix or feature branch only starts to affect other developers after it is merged into develop. Sometimes more than one developer updates the same line of the develop branch through their bugfix or feature branches, and there will be a merge conflict. The developer who merges earlier has no problem; the developer who merges later gets a merge conflict at the time of "git push".

The solution is to:
  • git checkout develop
  • git pull
  • git checkout bugfix/bugfixbranchname
  • git pull
  • git merge develop
At this point the merge will stop and report conflicts, listing all the files that are in conflict.
Open those files, manually fix the conflicting lines, then:
  • git commit -am "solve conflict"
  • git push

3. Hotfix branches are typically used to quickly fix the production branch. They are conventionally prefixed with hotfix/. Hotfixes take the fast path in emergency situations: they are often created directly from the master branch and merged back into the master branch, and then the changes are applied back to the development branch.

4. Release branches are typically branched from the develop branch, and their changes are merged back into the develop branch; they are used for release tasks and long-term maintenance. Some companies use the master branch as the release branch and then use tags to differentiate releases; other companies have multiple release branches with names other than master. They are conventionally prefixed with release/. Sketches of the hotfix and release flows follow below.
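
As a rough sketch of the hotfix and release flows described in points 3 and 4 (branch names and version numbers are made up for illustration):

# hotfix: branch from master, merge back to master, then bring the fix to develop
git checkout master && git pull
git checkout -b hotfix/fix-checkout-error
# ...commit the fix, get it reviewed...
git checkout master && git merge --no-ff hotfix/fix-checkout-error && git push origin master
git checkout develop && git merge master && git push origin develop

# release: branch from develop, stabilize, merge to master and tag
git checkout -b release/2.3 develop
# ...only stabilization fixes land here...
git checkout master && git merge --no-ff release/2.3
git tag -a v2.3.0 -m "release 2.3.0"
git push origin master --tags
git checkout develop && git merge release/2.3 && git push origin develop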


Wednesday, June 17, 2020

How to group splunk stats by common string in field value

We all know Splunk can make time charts. For example, suppose we want to know how many HTTP requests are received by a particular type of server. A typical Splunk query could be:

index=http_stats_10d sourcetype=FRONT_END_LB host=*-mobileweb-* | timechart count by host

The timechart will be grouped by host such as pvd-mobileweb-001, pvd-mobileweb-002, pvd-mobileweb-003, chi-mobileweb-001, chi-mobileweb-002, tor-mobileweb-001, tor-mobileweb-002, tor-mobileweb-003.

Now let's assume we want to group the timechart by the data-site prefix (pvd, chi and tor) instead of the whole hostname string. The following technique will do the trick.

eval site=mvindex(split(host, "-"), 0)

The expression above reads: split the host string by "-", take the element at index 0 from the resulting array, and assign it to the field site. This way we extract the site prefix from the host string.

Now we can revise our Splunk query to group by site instead of by host.

index=http_stats_10d sourcetype=FRONT_END_LB host=*-mobileweb-*
| eval site=mvindex(split(host, "-"),0)
| timechart count by site
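
The same trick works for other segments of the hostname too. mvindex also accepts negative indices (-1 is the last element), so a hedged variant against the same hypothetical index could group by the node number at the end of the hostname:

index=http_stats_10d sourcetype=FRONT_END_LB host=*-mobileweb-*
| eval node=mvindex(split(host, "-"), -1)
| timechart count by node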

