By Josh Cowling January 27, 2022

A common challenge I see when working with customers involves running complex statistics to describe the expected behaviour of a value and then using that description to assess the likelihood of a particular event happening. In short: we want something to tell us, "Is this event normal?" Sounds easy, right? Well, sometimes yes, sometimes no.

Let's look at how you might answer this question and then dive into some of the issues it poses as things scale up:

A First Pass Solution

  • This machine is sending lots of logs. Is this normal?
  • This user has logged in at 1 am. Is this normal?
  • We've seen a network communication with this particular signature. Is this normal?

The answer to these questions invariably requires us to define what we mean by normal for any given query. Thankfully, in some cases, this can be a pretty easy thing to do:

  • Count up the number of times you've seen something happen in the past
  • Count up the total number of times you've seen that thing or other things happen
  • Calculate how likely any event we've seen is by looking at the ratio of these two counts
  • Assess our likelihood value alongside recent data to find out how likely any new event is
  • Profit!

Here's what that might look like with some sample data and SPL if you're not thinking about it too hard:

Input Data Table - Events with identity (hostname) and event_category (could be anything that we want to assess the likelihood of).

Search - Assesses statistics of dataset and joins back against initial dataset

Output Data Table - A set of data with associated event likelihoods for each event category on each identity (host).

In this approach, we count up the occurrences of each event in our dataset by identity, using the "stats" and "eventstats" commands. We then "join" those counts back against our search to work out how likely each event is for each identity. You'll get similar results using the "anomalydetection" command, at an equivalent level of computational effort. This seems to be a perfectly reasonable solution, and it is, for small numbers of identities or categories and small datasets!
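
Written out, a first pass at that search might look something like the following sketch (the index name, time ranges and derived field names here are illustrative placeholders rather than anything canonical):

    ``` compute per-identity counts and likelihoods, then join them back onto the raw events ```
    index=my_index earliest=-7d
    | join type=left identity, event_category
        [ search index=my_index earliest=-7d
          | stats count as category_count by identity, event_category
          | eventstats sum(category_count) as total_count by identity
          | eval likelihood = category_count / total_count
          | fields identity, event_category, category_count, total_count, likelihood ]
    | table _time, identity, event_category, category_count, total_count, likelihood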

What happens when we go beyond our friendly, controlled test environment, though? How does this fare when faced with tens of millions of events and millions of identities or event types? Hint: things start to break.

This initial methodology has challenges that prevent it from being applied at scale. It is commonly implemented in ways that aren't ideal for Splunk to process efficiently: implementations like this often reach for the "join", "append" or "union" commands with sub-searches, and unfortunately such commands frequently run into memory or event buffer size limits as you scale up.

Another issue is that we're scanning a wide time range to compute our likelihood values in the same search that calls up the data we join them against. This can be more than a bit annoying: what if I want to run this anomaly detection hourly but calculate my reference likelihoods from a week's, a month's or even a year's worth of data? Do I really have to compute those likelihoods across that whole dataset every time? Of course not!

Making Splunk Shine

We can definitely improve on this, and there are just three tricks to it:

  • Break our monolithic search into two separate scheduled searches:
    • The first search computes the reference likelihoods and stores them in a KVStore-based lookup. This part runs rarely, maybe daily or weekly, over an extended time range. These values should not change much on a day-to-day basis, so pre-computing them out of hours can significantly reduce our computational burden.
    • A second search uses our new lookup data to assess whether a recent event is an unlikely one or one we've commonly seen before. We can do this using the "lookup" command. This part can run very frequently, say every 15 minutes or hourly, over just the most recent data. Since we're using the lookup command rather than join, append or any other sub-search, we side-step all of those memory and scale issues.
  • Store our reference likelihoods in a lookup backed by a KVStore collection, making full use of the _key field. I've seen the KVStore handle tens of millions of rows with relative ease in testing.
  • Use "stats" (primarily) and "streamstats", and "eventstats" (where necessary) because they are the real Splunk magic, and you should know what they are capable of as they allow you to avoid issues and limits typical with other approaches.
    • The stats command scales incredibly well across large datasets and can solve most problems that would otherwise need a "join" (see the sketch after this list).
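
To illustrate that last point, the usual pattern is to search both datasets in a single pass and let stats group them by the shared key, rather than joining a subsearch. A rough sketch, with purely illustrative index and field names:

    ``` enrich authentication events with an owner from an inventory index, without a join ```
    (index=auth_events) OR (index=asset_inventory)
    | eval identity = coalesce(host, asset_hostname)
    | stats values(owner) as owner, count(eval(index="auth_events")) as auth_count by identity

Everything arrives in one pass over the data, so there is no subsearch result limit to hit.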

Initial Approach: A scheduled search takes data from an index and compares it to historic data in the same index. When anomalies are found, alerts may be generated.

Modified Methodology: A scheduled search creates per-identity key-value pairs representing historic identity behaviour (counts of events and calculated likelihoods) on an infrequent basis and saves them into the KVStore. A second scheduled search, which runs more frequently, compares new data against the stored statistics with high efficiency, even at large scale and high cardinality.

All of this can be done alongside other optimisation methods such as using tstats and creating accelerated data models. And these tips don't stop at this kind of likelihood calculation: they can be adapted to many different data processing pipelines for significantly increased efficiency.

So hopefully, those ideas have whetted your appetite. If I've not convinced you to look into these topics further just yet, however, let's investigate the kind of improvements we can achieve with some performance testing.

Performance Testing

My reference data for this test consists of many records, each with an identity field (a hostname, drawn from a set of distinct hostnames) and an event_category field that can take several values. I aim to calculate the likelihood of seeing any particular event relative to the more likely events for each identity.

We will compare the method described initially against the improved methodology (for which an example is provided at the end of the blog).

I've created several datasets to test against varying data scales: 1 million and 10 million events, in "low" and "high" cardinality configurations. I've used these datasets to measure the computation time of each method. Lastly, I've used this information to predict the total time it would take to compute over a single day in my test environment, assuming that I want to run an anomaly detection search each hour and look for the least likely events.

Number of Events | Max Number of Distinct Identities | Max Number of Distinct Event Categories per Identity | Cardinality | Initial Approach Daily Computation Time | Modified Methodology Daily Computation Time | Reduction in Computation Time
1 Million | 1000 | 8 | "low" | 1,015s | 71s | 92.9%
10 Million | 1000 | 8 | "low" | 15,496s* | 901s | 94.2%
1 Million | 1000 | 1000 | "high" | 946s* | 121s | 87.1%
10 Million | 1000 | 1000 | "high" | Search Fail** | 1,200s | -

As you can see, there are some significant improvements to be had: our improved methodology is nearly 10 times faster in almost all of the tests! In addition, several of the initial-method tests hit memory or timeout limits (*) inherent to sub-search usage, leading to incomplete results or even outright failed (**) searches. In production environments, I've seen this sort of methodology deliver far greater improvements than shown here, as the benefits can scale with the size of the problem.

Now that I've managed to convince you it's worth it (I assume I have, right? I mean, look at those numbers!), here's an example of the searches used to implement the modified method described above.

An Implementation Example

The first search creates our reference statistics by identity and event category and stores them in a KVStore collection, using a custom _key field. This might be executed daily over several days', weeks' or months' worth of data in off-peak hours. You'll need to create a KVStore collection and a lookup definition to do this.
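
A minimal sketch of what this first search might look like, assuming a KVStore-backed lookup definition named event_likelihoods_lookup (the index name, time range and schedule are placeholders):

    ``` scheduled daily or weekly, out of hours, over an extended time range ```
    index=my_index earliest=-30d@d latest=@d
    | stats count as category_count by identity, event_category
    | eventstats sum(category_count) as total_count, max(category_count) as max_count by identity
    | eval likelihood = category_count / total_count
    | eval relative_likelihood = category_count / max_count
    ``` a composite _key gives each identity/category pair its own KVStore record ```
    | eval _key = identity . "|" . event_category
    | outputlookup event_likelihoods_lookup

Here, likelihood is the simple ratio of counts and relative_likelihood compares each category against the most common category for that identity; both field names, like the lookup itself, are assumptions for the sake of the example.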

This second search then allows us to look up our pre-computed statistics for a given event and apply a threshold to find only the most unlikely ones. It can be scheduled to run very frequently, as all it does is look up a specific identity-event_category pair in the KVStore lookup we created above and filter based on one of our calculated likelihood values. You could potentially even turn something like this into an automatic lookup, folding in the relevant statistics whenever your data is queried. Where we have no match on our lookup, we can surmise that we've never seen that particular identity-category combination before.
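
A matching sketch of that second search, with the same placeholder names and an arbitrary example threshold:

    ``` scheduled every 15 minutes or hourly, over recent data only ```
    index=my_index earliest=-60m@m
    | stats count as recent_count by identity, event_category
    | lookup event_likelihoods_lookup identity, event_category OUTPUT likelihood, relative_likelihood
    ``` no match in the lookup means we have never seen this identity/category pair before ```
    | fillnull value=0 likelihood relative_likelihood
    | where likelihood < 0.01

In practice, you would tune that threshold, or skip the filter entirely and feed the likelihood values into a broader risk score.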

In this particular example, I've gone a bit further than the initial search and included a couple of different likelihood calculations, which can be helpful in a variety of use cases. I'll be breaking this down in more detail in some upcoming content. I've seen these kinds of likelihoods used for detecting anomalies in user behaviours (has this person done this before?), system behaviours (does this machine usually send traffic like this at this time of day?) and transactional analysis (do we typically ship this many units to these addresses?). A common approach to building out this kind of anomaly detection is to use these methods to define several different behavioural detections, which can then be aggregated to identify clusters of abnormal usage that might indicate a significant threat. This is similar to the approach taken by Splunk's RBA and UBA and, indeed, can be used in conjunction with them to build effective detections.

And that's it! If you want to take it to the next level again, you can think about how you might update the KVStore rather than completely recalculating our statistics daily, or just reach out to me and I'll show you how! If you found these ideas and this approach helpful, please let me know on LinkedIn!

For more information on optimising away your joins, check out this tremendous .conf talk from 2019.

You can also see this approach applied for a slightly different use case in this whitepaper published by our SURGe team on detecting supply chain attacks using JA3.

Thanks very much to my global colleagues in the ML Architect team, Philipp Drieger, Jason Losco and Phil Salm, and to our fearless Product Manager, Greg Ainslie-Malik. Thanks also to many of my colleagues in SURGe, specifically Marcus LaFerrera; to my colleagues in the UK Architect and CSE teams, Darren Dance, Stefan Bogdanis, Rupert Truman and Ed Asiedu; and to the many others who contributed to the approach and ideas outlined in this blog.

