Thursday, May 14, 2020

SQL table join equivalent for splunk index join

splunk has a feature to join the data of two indexes together by a field. This is similar to sql's table join by column.
Here we can draw these analogies:

  • splunk query <-> sql query
  • splunk index <-> sql table
  • splunk field <-> sql column
  • splunk subquery join <-> sql table join
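To make the analogy concrete, here is a rough SQL sketch of the join we will build below in splunk. The table and column names are hypothetical, mirroring the indexes used in the example:

SELECT r.request_id, r.service_status
FROM service_request r
JOIN service_bill b ON b.request_id = r.request_id
WHERE r.vendor = 'Dell' AND b.status = 'PENDING';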

Subquery join


For example, suppose you have an index service_request and another index service_bill.

For the service_request index, you have a splunk query to get all the service requests for a particular vendor:
index="service_request" vendor=Dell

For the service_bill index, you have a query to get all service bills that are pending:
index="service_bill" PENDING

Now, the problem is that the billing PENDING status field is not in the service_request index. So if we want to generate stats such as the percentage of service requests in COMPLETE status that have not been paid, we need to join the two indexes. To join them, we need a common field; assume both indexes have a request_id field. We can then write the following splunk query to join the two indexes.

index="service_request" vendor=Dell
| eval REQUEST_ID=request_id
| join REQUEST_ID [search index="service_bill" PENDING | fields REQUEST_ID]
| stats count as total, count(eval(service_status="COMPLETE")) as outOfStatus
| stats avg(eval(outOfStatus/total*100)) as outOfStatusRate

There is an important detail: a splunk subsearch result can contain at most 10500 records. If you are querying a day of data, you might be fine; if you are querying 10 years of data, you will most likely exceed the limit.
If the query
index="service_bill" PENDING
returns more than 10500 rows, your final stats will be wrong!
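A quick sanity check is to run the subsearch on its own, over the same time range, and count its rows before trusting the join:

index="service_bill" PENDING
| stats count

If the count is close to or above the limit, the joined stats are suspect.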

Another important detail: when joining a query and a subquery by a common field, the two field names have to match exactly; upper case and lower case letters are considered different. In the example, the field name request_id in index="service_request" has to be mapped to the upper case REQUEST_ID with "eval REQUEST_ID=request_id" in order to match the field name REQUEST_ID in index="service_bill" before we join them.
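As a minimal alternative sketch, splunk's rename command achieves the same field mapping:

index="service_request" vendor=Dell
| rename request_id as REQUEST_ID
| join REQUEST_ID [search index="service_bill" PENDING | fields REQUEST_ID]

The small difference is that rename replaces the original request_id field, while the eval approach keeps both the original and the upper case copy.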

The subsearch record limit is set by the splunk platform; we can not really do much about it. One thing we can do as splunk users is to prune the subquery so it returns only the records we need.
For example, if the service_bill index also has a VENDOR field, use it to prune out rows that won't be relevant:
search index="service_bill" PENDING VENDOR="DELL"

and the join will be

index="service_request" vendor=Dell
| eval REQUEST_ID=request_id
| join REQUEST_ID [search index="service_bill" PENDING VENDOR="DELL" | fields REQUEST_ID]
| stats count as total, count(eval(service_status="COMPLETE")) as outOfStatus
| stats avg(eval(outOfStatus/total*100)) as outOfStatusRate

That way, we can compute stats over a larger time span.

The index subquery join technique can also be applied to generate a timechart. For example, suppose we want to know more about the trend of payment pending time for those requests. We can add that field from the subquery:

index="service_request" vendor=Dell
| eval REQUEST_ID=request_id
| join REQUEST_ID [search index="service_bill" PENDING VENDOR="DELL" | fields REQUEST_ID, PAYMENT_DUE_HOURS]
| timechart span=1h avg(PAYMENT_DUE_HOURS) by service_status

Subquery Inner join vs. Outer join

In the above example, the general shape is

...query... | join common_field [...subquery...]
| timechart/stats/etc.

The "join" keyword did an inner join, the main query and subquery need to have a common field. After join, we actually appended a few fields from the subquery to each events in the main query. The augmented main query events can then be used normally. We can filter it, pipe it to stats command or timechart command etc. The final effect is an inner join, but the splunk implementation 
run subquery query 
search index="service_bill" PENDING VENDOR="DELL" 
for the time range the main query use, then perform an inner join. So if the above query has more than 10500 events, the results set get truncated.
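One hedged mitigation: if we only care about bills posted recently, we can narrow the subsearch's own time range with the earliest/latest time modifiers, which override the outer search's time range for that subsearch only. The 30-day window below is just an example value:

index="service_request" vendor=Dell
| eval REQUEST_ID=request_id
| join REQUEST_ID [search index="service_bill" PENDING VENDOR="DELL" earliest=-30d@d | fields REQUEST_ID]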

There is another kind of query and subquery join in splunk:

...query... | append [...subquery...]
| timechart/stats/etc.

The "append" keyword did an outer join, the two indexes event fields don't have anything common exact they are arranged by time, the event set A of main query and the event set B of subquery are simply outer joined together, so we get A+B number of events.

A subquery append is useful when we want to generate a time chart of the ratio of two fields from different indexes that share no common field.

For example, we have one index impression_stats that stores advertisement impression events, and another index buy_stats that stores advertisement click events. We want a time chart of the ratio of ad click count to ad impression count over a time period.

We can use the following example query:

index="impression_stats" platform="blogspot" | bin _time=1h | stats count as impression by _time
| append [search index="buy_stats" sourcetype="web_ad" | bin _time=1h | stats count as buy by _time]
| stats max(impression) as a, max(buy) as b by _time
| eval c=rount((b/a),0)
| stats max(c) as "Click-to-Impression Ratio" by _time

Since we don't need to coordinate the results of the main query and the subquery by anything other than the _time field, we can reduce the number of records in the subquery with bin and stats.

After putting the records into 1 hour buckets and counting the events in each bucket, we have reduced the original event set to its 1-hour-interval, time-based statistics. If we need stats over a bigger time span, and therefore more events in the subquery, we can increase the time interval in "bin _time span=1h" to reduce the result set more aggressively, so that we do not exceed the 10500 maximum event count in the subquery results.

The main query and subquery have no common field, so how are they outer joined? They are joined by the _time field. If the subquery uses a bin size of 1 hour, the main query has to use the same 1 hour bin size; otherwise the later stats by _time won't make sense.
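As a side note, timechart can express the same bucket-then-count step in a single command; a minimal sketch for the impression side alone:

index="impression_stats" platform="blogspot"
| timechart span=1h count as impression

This yields the same hourly impression counts as the bin/stats pair, except that timechart also emits empty buckets with a count of 0.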

Here we use the following technique to coordinate the counts from the main query and the subquery:
| stats max(impression) as a, max(buy) as b by _time

At a particular _time value, we have 2 events: one from the main query and one from the subquery. Because of the outer join, the main query's event has a positive impression value but a null buy value, and the subquery's event has a positive buy value but a null impression value.
max(impression)
picks the non-null value from the two. For the same reason,
max(buy)
picks the non-null value from the 2 events as well.
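A self-contained sketch of why max works here, using makeresults to fabricate the two events at one _time value (the numbers are made up):

| makeresults count=2
| streamstats count as n
| eval _time=relative_time(now(), "@h")
| eval impression=if(n=1, 100, null()), buy=if(n=2, 7, null())
| stats max(impression) as a, max(buy) as b by _time

Because stats functions such as max skip null values, a and b each end up holding the single non-null value at that _time.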
So at the next step, we can get the ratio with
| eval c=round(b/a, 2)
Finally, we can draw the time chart with
| stats max(c) as "Click-to-Impression Ratio" by _time

Of course, nothing prevents us from putting multiple curves together with the same _time field as the x-axis:

index="impression_stats" platform="blogspot" | bin _time=1h | stats count as impression by _time
| append [search index="buy_stats" sourcetype="web_ad" | bin _time=1h | stats count as buy by _time]
| stats max(impression) as a, max(buy) as b by _time
| eval c=rount((b/a),0)
| stats max(a) as "Impression", max(b) as "Click", max(c) as "Click-to-Impression Ratio" by _time

