KMeans Unsupervised Anomaly Detection: Splunk, Engine Ops

Read on to learn about Splunk KMeans unsupervised anomaly detection. Related articles explain K-Means as a term and give examples of using KMeans in Splunk for clustering and classification.
In addition, before you can follow the process of using KMeans for anomaly detection in Splunk, you need to be familiar with the data preprocessing stage that KMeans anomaly detection requires.

Splunk KMeans Unsupervised Anomaly Detection – Understanding the Idea

The primary use of the KMeans algorithm is clustering, but we can also use it for unsupervised anomaly detection. The idea is simple: with only one cluster (K=1), the KMeans algorithm calculates the distance from each data point to the single cluster center (the cluster distance). We will use an example of an “Engine” operation to detect anomalies for Predictive Maintenance, collecting data from an external vibration sensor and an internal engine speed sensor. During training, we treat all the data points as the engine’s “good state,” which means that all the cluster distances calculated during training represent the “good state.” When we apply the trained model, it calculates new cluster distances for the new data points.

If a newly calculated cluster distance is lower than the minimum cluster distance calculated during training, or higher than the maximum cluster distance calculated during training, the data point is anomalous.
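To make the idea concrete, here is a minimal Python sketch of the same logic outside Splunk. It assumes that with K=1 the cluster center is simply the mean of the training data and that the cluster distance is a plain Euclidean distance; MLTK may report a differently scaled value (for example, a squared distance), which does not change the minimum/maximum check. The sample rows and the function names are hypothetical.

```python
import math

def centroid(points):
    """With k=1, the single cluster center is the per-feature mean."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def distance(point, center):
    """Euclidean distance from a data point to the cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, center)))

# Hypothetical "good state" rows: (vib_x, vib_y, vib_z, rpm / 1000)
train = [
    (0.24, 0.26, 0.25, 2.5),
    (0.25, 0.25, 0.26, 2.6),
    (0.26, 0.24, 0.24, 2.4),
]

center = centroid(train)
train_dist = [distance(p, center) for p in train]
min_dist, max_dist = min(train_dist), max(train_dist)

def is_anomaly(point):
    """A point is anomalous when its distance leaves the training range."""
    d = distance(point, center)
    return d < min_dist or d > max_dist

# A slowed-down engine: lower vibration and lower speed
print(is_anomaly((0.15, 0.15, 0.15, 1.5)))
```

A point taken from the training data stays inside the range, while the slowed-down engine row lands far from the center and is flagged.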

Engine Anomaly Detection (Predictive Maintenance) Using Vibration Sensor and Speed Data

We will use an external vibration sensor that outputs acceleration on three axes: x, y, and z. We will also use the engine’s internal speed sensor; if you don’t have an internal speed sensor, an external one can do the job. The speed sensor outputs the data in RPM, which is around 2000 – 3000 RPM. Of course, each engine is different and has different specifications, so you will need to adjust your data preprocessing techniques accordingly.

Data Preprocessing Before KMeans Unsupervised Anomaly Detection Model Training

Before using the KMeans unsupervised anomaly detection method, we need to understand and visualize the data from the vibration sensor and then the data from the speed sensor. Remember that different vibration sensors or accelerometers can have slightly different outputs. You can check the article about data visualization and preprocessing.

Here is a quick summary of data preprocessing. Our accelerometer forwarded float data on three axes. Ideally, each axis should be centered on “0,” with positive and negative acceleration points between “-1” and “1” (this is just an example for simplicity; the highest numbers depend on the sensor sensitivity). The sign of a number only indicates the direction of the acceleration. What we actually received from the sensor was slightly different: the axes were not centered on “0,” and each axis had a different center. The z-axis was centered at “1,” and the X and Y axes were centered around “0.3”. We needed to center the data around “0” in this case. Hence, we used Standard Scaling to center the data, but without the standard deviation scaling, to keep the alteration of the existing data to a minimum.
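As a rough illustration of this centering step, the Python sketch below subtracts the mean from one axis without dividing by the standard deviation. The sample values and the function name `center_axis` are hypothetical.

```python
def center_axis(samples):
    """Subtract the mean so the axis is centered on 0 (no std-dev scaling)."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

# Hypothetical z-axis readings centered around 1
z_axis = [0.8, 1.0, 1.2, 1.0]
centered = center_axis(z_axis)

print(centered)
```

The spread of the values (the difference between the highest and the lowest point) stays the same; only the center moves to 0.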

While training the KMeans unsupervised anomaly detection model, we noticed that the minimum and maximum cluster distances during training (the good state) tracked the minimum and maximum points of the three axes. If the force on the X-axis ranged between -0.5g and 0.5g, the minimum and maximum cluster distances scaled accordingly. Don’t be confused by the negative numbers: a cluster distance can’t be negative; just note the proportionality. Now, if the engine runs slower, the vibration should be lower. For example, suppose the trained “good state” ranged between -0.5g and 0.5g, and then we lowered the engine speed so that the vibration on the X-axis ranged between -0.3g and 0.3g. You might assume that the algorithm would calculate cluster distances outside the training range (the good state) and that you would see an anomaly. But this is not the case with KMeans. During training, the algorithm set the “good state” range between -0.5g and 0.5g, so the cluster distances calculated for values between -0.3g and 0.3g remain inside the good state range and are not flagged as anomalies.

After Standard Scaling, during the preprocessing stage for KMeans unsupervised anomaly detection, we took the absolute value of all the numbers on all three axes, because a negative sign is just the direction of the acceleration on a specific axis and nothing else. But even after converting the negative numbers to positive ones, we still had a problem: the minimum acceleration on an axis during training was around 0.0000001g. So, if the maximum was around 0.5g and we lowered the engine speed, the X-axis would show a maximum of 0.3g, while the minimum points would still be around 0.0000001g. Both values are inside the “good state” range, so KMeans would still calculate cluster distances within the minimum and maximum range of the training cluster distances (good state).

To solve this issue in the preprocessing stage for KMeans unsupervised anomaly detection, we averaged the data over 15-second time buckets to smooth it and remove the peaks. So, for a maximum of 0.5g and a minimum of 0.0000001g, the average would be around 0.25g. After lowering the engine’s speed, the maximum would be 0.3g, the minimum still around 0.0000001g, and the average around 0.15g. This is a noticeable difference for the KMeans algorithm: the cluster distances calculated for the new data points will now fall outside the minimum-to-maximum range of the cluster distances seen during training (the good state).
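The effect of the absolute value and the bucket average can be sketched in a few lines of Python. The readings below are hypothetical, already centered values from a single 15-second bucket, and `bucket_mean_abs` is an illustrative name, not a Splunk function.

```python
def bucket_mean_abs(samples):
    """Average of the absolute values inside one time bucket."""
    return sum(abs(s) for s in samples) / len(samples)

# Hypothetical centered readings from one 15-second bucket
normal_speed = [0.5, -0.5, 0.25, -0.25, 0.0]
reduced_speed = [0.3, -0.3, 0.15, -0.15, 0.0]

# Raw absolute values of the slower engine stay inside the normal range...
raw_inside = max(abs(s) for s in reduced_speed) <= max(abs(s) for s in normal_speed)

# ...but the bucket averages are clearly different
avg_normal = bucket_mean_abs(normal_speed)
avg_reduced = bucket_mean_abs(reduced_speed)

print(raw_inside, avg_normal, avg_reduced)
```

Every raw reading of the slower engine sits inside the range seen at normal speed, which is why the raw points are not flagged, while the per-bucket average separates the two states.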

Another problem during the preprocessing stage for KMeans unsupervised anomaly detection was the internal speed sensor data. The KMeans algorithm is sensitive to large differences in scale between features. For instance, if three features range from 0.00001 to 0.3 (vibration sensor axes), and a fourth feature ranges from 700 to 3000 (engine RPM), the calculation heavily favors the scale of RPM. The RPM feature alone can contribute a distance of around 700, while the vibration axes contribute only around 0.000001 to 0.001. The resulting combined cluster distance, ranging from about 700.000001 to 700.001 (the numbers are just an example for easier understanding), is dominated by RPM, which renders the vibration sensor data useless and undermines anomaly detection. To mitigate this, we divide the engine RPM by 1000 to bring it closer to the vibration sensor scale and lower its weight in the cluster distance calculation.
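The following Python sketch shows the scale problem with made-up numbers, assuming a Euclidean distance. A harmless 10 RPM jitter produces a far larger distance than a real vibration change until the RPM is divided by 1000. The rows and the helper `scale_rpm` are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scale_rpm(row):
    """Divide the last feature (RPM) by 1000, leave the vibration axes as is."""
    return row[:3] + (row[3] / 1000,)

# Hypothetical rows: (vib_x, vib_y, vib_z, rpm)
center = (0.25, 0.25, 0.25, 2500.0)
vib_change = (0.40, 0.40, 0.40, 2500.0)   # vibration rises, RPM unchanged
rpm_jitter = (0.25, 0.25, 0.25, 2510.0)   # vibration normal, RPM off by 10

raw_vib = distance(vib_change, center)
raw_rpm = distance(rpm_jitter, center)

scaled_vib = distance(scale_rpm(vib_change), scale_rpm(center))
scaled_rpm = distance(scale_rpm(rpm_jitter), scale_rpm(center))

print(raw_vib, raw_rpm, scaled_vib, scaled_rpm)
```

Before scaling, the RPM jitter dominates the distance; after scaling, the vibration change is the larger contribution.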

So, after all the preprocessing, here is our SPL command:

index=your_index
| eval splunk_stime = strptime(sampled_time, "%H:%M:%S.%f")
| eval stime = strftime(splunk_stime, "%H:%M:%S.%f")
| bin _time span=15s as interval
| eventstats avg(external_vib_x) as mean_x, avg(external_vib_y) as mean_y, avg(external_vib_z) as mean_z by interval
| eval SS_x = (external_vib_x - mean_x), SS_y = (external_vib_y - mean_y), SS_z = (external_vib_z - mean_z)
| eval abs_external_vib_x=abs(SS_x), abs_external_vib_y=abs(SS_y), abs_external_vib_z=abs(SS_z)
| eventstats avg(abs_external_vib_x) as avg_external_vib_x, avg(abs_external_vib_y) as avg_external_vib_y, avg(abs_external_vib_z) as avg_external_vib_z by interval
| eval normal_internal_speed = internal_speed / 1000
| table _time stime avg_* normal_internal_speed
| sort stime

Training KMeans Unsupervised Anomaly Detection Model in Splunk

Selecting Data for Model Training

Now, we need to train our KMeans unsupervised anomaly detection model on a dataset that we consider the normal operation of the engine. You may record the normal operation data from your engine for several hours: the three axes of the external vibration sensor and the internal speed sensor. Then, train the model on that data.

KMeans Splunk Usage

To better understand KMeans usage in Splunk, read how to use KMeans in Splunk for clustering and classification.
The difference is that we’ll use KMeans for anomaly detection and don’t need clustering. So, in our case, K=1:

| fit KMeans avg_external_vib_x avg_external_vib_y avg_external_vib_z normal_internal_speed k=1 random_state=0 into my_kmeans_model

In this command:
fit: The fit command is part of Splunk’s Machine Learning Toolkit and is used to fit (train) a model to the data.
KMeans: This specifies the type of machine learning model to fit to the data. In this case, it is a K-Means clustering model, an unsupervised machine learning algorithm that groups data into K clusters.
avg_external_vib_x avg_external_vib_y avg_external_vib_z normal_internal_speed: These are the names of the fields which our KMeans unsupervised anomaly detection model will train on. These could be any numerical fields in your data.
k=1: This sets the number of clusters that will form to 1. In a typical use case, you would expect the number of clusters to be more than 1. When k=1, K-Means assigns all data points to one cluster.
random_state=0: This will seed the random number generator for the KMeans algorithm, ensuring reproducible results. When “random_state” equals an integer, the KMeans algorithm will always produce the same clusters when run on the same data.
into my_kmeans_model: This saves the fitted model into a model artifact named “my_kmeans_model.” We will use this model artifact later to predict the cluster of new data points or to analyze the clusters created.

This command trains a KMeans model with one cluster on four input fields, uses a random number generator’s seed, and saves the model under “my_kmeans_model.”

Calculating the Minimum and Maximum of the Cluster Distance Field

After executing the line with the “fit” command to train our KMeans unsupervised anomaly detection model, we will get two new fields in the results: “cluster” and “cluster_distance.” Since we chose to use only 1 cluster, all the values in the “cluster” field will be “0”, the first cluster. And the “cluster_distance” for each event row will be in the “good state” range since this is what we trained the KMeans model on.

To capture this range, we will use the “eventstats” command with “min” and “max” on all the data points in the “cluster_distance” field:

| eventstats min(cluster_distance) as min_dist, max(cluster_distance) as max_dist

In this command:
| eventstats is an SPL command that calculates aggregate statistics, such as averages, sums, counts, and more, over all the events the search returns. It adds these statistics as new fields to each event in the results set. Unlike the stats command, which collapses all events into a single statistical summary event, “eventstats” leaves the original events intact.
min(cluster_distance) as min_dist: This part of the command calculates the minimum of the values in the cluster_distance field over all the events. It then adds this minimum value to each event in a new field named min_dist.
max(cluster_distance) as max_dist: This part of the command calculates the maximum of the values in the cluster_distance field over all the events. It then adds this maximum value to each event in a new field named max_dist.

This command runs after the training of the KMeans unsupervised anomaly detection model command. Its overall effect is adding two new fields, min_dist, and max_dist, to each event in the result set. These fields contain the minimum and maximum cluster_distance values over all events.

This command is useful when you want to track the range of a particular field (“cluster_distance” in this case) within the same event list. The calculated values are added to all events because this might be useful for calculations down the line in your Splunk pipeline or for enriching your event data with this extra information.

Save the Cluster Distance Values During Training for Later Model Application Usage

We need to save the “cluster_distance” range that the algorithm calculated during the training stage of the KMeans unsupervised anomaly detection model somewhere for the “apply” command to use later. Splunk allows us to keep the results in a CSV file (Lookup Table) for later usage in other SPL commands using the “outputlookup” SPL command. This command will output to CSV everything that results before the command. We need to craft the output so only the minimum and the maximum values of the “cluster_distance” field will be in the output.

Another question we need to ask is how to import the data from this CSV when applying the trained KMeans unsupervised anomaly detection model to new data. One of the fastest commands for that, as suggested by the Splunk community, is the “lookup” SPL command. But this command needs a reference value on every row to match against the CSV data. We want the minimum and the maximum values of the cluster_distance on all the rows. Hence, we need to add the same reference value both to all the event rows after applying the model to new data and to the CSV file, so that the “lookup” command knows how to align the “cluster_distance” data from the CSV Lookup Table.

| eval ref_key=1
| head 1
| table ref_key min_dist max_dist
| outputlookup min_max_dist.csv

In this command:
| eval ref_key=1: The eval command creates a new field named “ref_key” and gives it a constant value of “1”. This value is assigned to all the resulting rows under this field.
| head 1: This command limits the output to the first ‘n’ results. Here, “head 1” limits the results to the first row since we want only one row of minimum and maximum values.
| table ref_key min_dist max_dist: The table command creates a tabular data representation. It returns a table with the specified fields as column headers. This command generates a table with the fields “ref_key,” “min_dist,” and “max_dist,” ensuring that we have only one representation of minimum and maximum values and the reference value in the CSV after training of our KMeans unsupervised anomaly detection model.
| outputlookup min_max_dist.csv: This command writes the search results to a lookup table file. The results of the previous commands are saved to a CSV file named “min_max_dist.csv.” Your Splunk user needs appropriate permissions to save the CSV file with this command.

To view the contents of the CSV file, you can use the “inputlookup” command:

| inputlookup min_max_dist.csv

“inputlookup” is a generating command, so it must start with a leading pipe (“|”) and be the first command in the search, even when nothing else comes before it. This command is not part of the KMeans unsupervised anomaly detection training stage but will help you check the CSV file.

Another method is to use the Splunk web interface to see where the file is located and navigate there to see the contents:

Click on “Settings” on the top right menu.
In the drop-down, select “Lookups.”
Under “Lookups,” you will see several options. Click on “Lookup table files.”
On the “Lookup table files” page, you can view all the lookup table files in your Splunk environment.
Find min_max_dist.csv in the list of files. If you don’t find it, the file may be saved under another app context, so try looking in other apps.
Note: The ability to view lookup table files through the Splunk Web Interface can depend on the permissions associated with your Splunk role and the app context the file is under.

The Whole SPL Command with Preprocessing, Training, and Values Saving

Let’s merge everything into one command: the data preprocessing part that we had in the beginning, the training of the KMeans unsupervised anomaly detection model, and saving the minimum and the maximum values of the cluster distance to the CSV file:

index=your_index
| bin _time span=15s as interval
| eventstats avg(external_vib_x) as mean_x, avg(external_vib_y) as mean_y, avg(external_vib_z) as mean_z by interval
| eval SS_x = (external_vib_x - mean_x), SS_y = (external_vib_y - mean_y), SS_z = (external_vib_z - mean_z)
| eval abs_external_vib_x=abs(SS_x), abs_external_vib_y=abs(SS_y), abs_external_vib_z=abs(SS_z)
| eventstats avg(abs_external_vib_x) as avg_external_vib_x, avg(abs_external_vib_y) as avg_external_vib_y, avg(abs_external_vib_z) as avg_external_vib_z by interval
| eval normal_internal_speed = internal_speed / 1000
| fit KMeans avg_external_vib_x avg_external_vib_y avg_external_vib_z normal_internal_speed k=1 random_state=0 into my_kmeans_model
| eventstats min(cluster_distance) as min_dist, max(cluster_distance) as max_dist
| eval ref_key=1
| head 1
| table ref_key min_dist max_dist
| outputlookup min_max_dist.csv
| table *

We will not use the converted “sampled_time” field for training since sorting is unnecessary at this stage.
In addition, this Splunk SPL command will not show you the training results. There are methods to make it a secondary search that will not interfere with showing you the trained results. These methods are not in the scope of this article, however.

If you get an error:

Input event count exceeds max_inputs for KMeans (100000), model will be fit on a sample of events. To configure limits, use mlspl.conf or the "Settings" tab in the app navigation bar.

Consider reading our article about Input Event Count Exceeds Maximum Inputs Error for possible solutions.

Applying the Trained KMeans Unsupervised Anomaly Detection Model

Idea Discussion

The idea behind this stage is to apply the trained KMeans unsupervised anomaly detection model to the new data points, fetch the normal operation minimum and maximum cluster_distance values from the CSV, and check whether the cluster_distance calculated for each new data point is within this “normal operation” range. If it is not, we should create an alert. We will create a new field called “Alert” that is set to “0” if the cluster_distance of the new data point is within the normal operation range and to “1” if it is not. Then, based on the values of the “Alert” field, you can create an alert in Splunk that sends an email to whoever monitors the engine’s status.

Possible Issues

We need to solve another issue when applying our KMeans unsupervised anomaly detection model: what happens when any of the sensors doesn’t send data. Without data from all the sensors, the machine learning algorithm will give you an error that there is not enough data to apply the trained model. And if only one sensor gets disconnected, there is a chance that you won’t get an alert from the algorithm. To be on the safe side, handle this case in preprocessing, before the empty values from the sensors reach the “apply” command.

We used a Splunk Universal Forwarder to send data to the main Splunk Instance. The sensors connect to the forwarder, and a Python script converts the sensor data to a format that the forwarder understands so it can send the data to the main instance. If the sensors get disconnected or there is no data, the script turns the numbers into “0”. This works as a marker because the vibration on all three axes can’t be “0” when the connection is normal, and the speed sensor can’t report “0” either (except when the engine stops, which we also want to have an Alert on).

Creating Boolean Field for Zero Values

To solve the issue before applying the KMeans unsupervised anomaly detection model, we’ll first use the “eval” command to create an “is_zero” field. It is set to “1” if all the axes of the vibration sensor are “0” (or null) or the speed sensor is “0” (or null), and to “0” if everything is fine.

| eval is_zero=if(((isnull(external_vib_x) OR external_vib_x==0) AND (isnull(external_vib_y) OR external_vib_y==0) AND (isnull(external_vib_z) OR external_vib_z==0)) OR (isnull(internal_speed) OR internal_speed==0), 1, 0)

In this command:
The “if” function performs a conditional assignment, which checks for a series of boolean expressions related to several fields, and assigns a value of 1 or 0 to the “is_zero” field based on the result.
The conditions:
(isnull(external_vib_x) OR external_vib_x==0): This checks if the field “external_vib_x” is null or equal to zero. The “isnull” function returns “true” if the field is null, and “external_vib_x==0” checks if the field’s value equals zero.
(isnull(external_vib_y) OR external_vib_y==0): Same as the above condition, but for the field “external_vib_y.”
(isnull(external_vib_z) OR external_vib_z==0): Again, same as the above condition, but for the field “external_vib_z.”
* This step is crucial before applying the KMeans unsupervised anomaly detection model.

The “AND” operator combines the above three conditions, meaning all three must be true for the entire expression to be true.
Then, there’s an “OR” operator, followed by another condition:
(isnull(internal_speed) OR internal_speed==0): This checks if the “internal_speed” field is null or equal to zero.

Finally, based on these conditions, the if function assigns a value to the “is_zero” field:
If all three conditions for “external_vib_x,” “external_vib_y,” and “external_vib_z” are true, or the condition for “internal_speed” is true, the “is_zero” field is assigned a value of 1.
If none of the conditions are met, “is_zero” is assigned a value of 0.
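The same boolean logic can be written as a small Python sketch, which makes the grouping of the “AND” and “OR” parts easier to see. Here `None` stands in for a null field, and the function names are illustrative only.

```python
def missing(value):
    """True when a reading is null (None) or exactly zero."""
    return value is None or value == 0

def is_zero(vib_x, vib_y, vib_z, speed):
    """1 when all three axes are missing OR the speed is missing, else 0."""
    all_axes_missing = missing(vib_x) and missing(vib_y) and missing(vib_z)
    return 1 if all_axes_missing or missing(speed) else 0

print(is_zero(0, 0, 0, 2500))         # all axes are zero
print(is_zero(0.1, 0, 0, 2500))       # one axis still reports data
print(is_zero(0.1, 0.2, 0.3, None))   # speed sensor is missing
```

Note that a single zero axis is not enough to set the flag; all three axes must be zero or null, or the speed must be zero or null.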

Processing Conditions Based on The Boolean Outcome

This part still comes before applying the KMeans unsupervised anomaly detection model, and this is where it becomes trickier. We need to execute two searches, filter the results of each based on the “is_zero” outcome, and then combine the two search results into one table.

index=your_index earliest=-1m
| eval is_zero=if(((isnull(external_vib_x) OR external_vib_x==0) AND (isnull(external_vib_y) OR external_vib_y==0) AND (isnull(external_vib_z) OR external_vib_z==0)) OR (isnull(internal_speed) OR internal_speed==0), 1, 0)
| where is_zero==0
| <Here will be commands that preprocess for the "apply" command and the application of the model itself>
| eval Alert = if(cluster_distance > max_dist OR cluster_distance < min_dist, 1, 0)
| union
    [ search index=your_index earliest=-1m
    | eval is_zero=if(((isnull(external_vib_x) OR external_vib_x==0) AND (isnull(external_vib_y) OR external_vib_y==0) AND (isnull(external_vib_z) OR external_vib_z==0)) OR (isnull(internal_speed) OR internal_speed==0), 1, 0)
    | where is_zero==1
    | eval Alert=1 ]
| table _time stime Alert is_zero internal* _raw cluster_distance *dist *
| sort stime

We will divide the command explanation into two parts, the one before the “union” command and the part after.

The part before the “union”:
index=your_index: The index keyword specifies which index to search. Replace “your_index” with the name of your index.

earliest=-1m: This part of the command sets the time range of the search. In this case, “earliest=-1m” means the search should start from one minute ago from the current time and go up to the present moment. The “m” stands for minutes.
The “earliest” parameter specifies the start of the time range. When the value is negative, it counts back from the current time. So “-1m” means “one minute ago.” If you wanted to start the search from two hours ago, you would write “earliest=-2h”, where “h” stands for hours.
If you want to change the time range, you should also change it in the sub-search.

| eval is_zero=if(((isnull(external_vib_x) OR external_vib_x==0) AND (isnull(external_vib_y) OR external_vib_y==0) AND (isnull(external_vib_z) OR external_vib_z==0)) OR (isnull(internal_speed) OR internal_speed==0), 1, 0): we discussed that part earlier.
where is_zero==0: The where command filters the results to include only those where the is_zero field is 0. This means that the part before “union” processes only the events where at least one external vibration axis isn’t “0” and the engine’s internal speed sensor isn’t “0”.

After the “where” SPL command, we’ll put all the preprocess commands we used for training our model since the data should be in the same format as it got into the model during training. After preprocessing commands, we will use the MLTK “apply” command to apply our trained KMeans unsupervised anomaly detection model.

eval Alert = if(cluster_distance > max_dist OR cluster_distance < min_dist, 1, 0): This eval command creates a new field “Alert” that receives the value “1” if “cluster_distance” is either greater than “max_dist” or less than “min_dist”; otherwise, it receives the value “0”. The “max_dist” and “min_dist” values come from the CSV file we saved during model training; more on that later.

The second part after the “union” command, which will not include anything directly related to the application of the KMeans unsupervised anomaly detection trained model:
search: A secondary search will run. In most cases, the whole secondary search will have an encapsulation between brackets (“[]”). Each line of commands between the brackets will have an indentation of 4 spaces.
index=your_index earliest=-1m
| eval is_zero=if(((isnull(external_vib_x) OR external_vib_x==0) AND (isnull(external_vib_y) OR external_vib_y==0) AND (isnull(external_vib_z) OR external_vib_z==0)) OR (isnull(internal_speed) OR internal_speed==0), 1, 0)
After the “search” command come the same lines that we used in the part before “union,” since we need to filter the same data, this time for the rows with the value “1” in the “is_zero” field.
| where is_zero==1: In this second part, we filter the results where the is_zero field is 1.
| eval Alert=1 ]: At this stage of the secondary search, we have only the results where the “is_zero” field is “1”, so we’ll create the “Alert” that will have the value of “1” for all these results.

In summary, if all three axes of the external vibration sensor are “0” or the internal sensor speed is “0,” we should get an alert.

The “union”:
union: This command combines the main search results with a subsearch’s results.

After the secondary search, the results of both searches are already combined: the main search with the application of the trained KMeans unsupervised anomaly detection model and the secondary search with the Alert field set to “1” in case of empty data from sensors. So, we use “table” to show only the fields we want and sort the results by sampled time. More on the sampled time in the ML data preprocessing article.

Preprocess and KMeans Unsupervised Anomaly Detection Trained Model Application

After we filtered the results for the main part where the “is_zero” field equals 0, we’ll add the preprocessing part and the “apply” MLTK SPL command, to apply our trained KMeans unsupervised anomaly detection model:

| eval splunk_stime = strptime(sampled_time, "%H:%M:%S.%f")
| eval stime = strftime(splunk_stime, "%H:%M:%S.%f")
| bin _time span=15s as interval
| eventstats avg(external_vib_x) as mean_x, avg(external_vib_y) as mean_y, avg(external_vib_z) as mean_z by interval
| eval SS_x = (external_vib_x - mean_x), SS_y = (external_vib_y - mean_y), SS_z = (external_vib_z - mean_z)
| eval abs_external_vib_x=abs(SS_x), abs_external_vib_y=abs(SS_y), abs_external_vib_z=abs(SS_z)
| eventstats avg(abs_external_vib_x) as avg_external_vib_x, avg(abs_external_vib_y) as avg_external_vib_y, avg(abs_external_vib_z) as avg_external_vib_z by interval
| eval normal_internal_speed = internal_speed / 1000
| apply my_kmeans_model

To understand the preprocessing command, read the data preprocessing in the Splunk article. We’re only interested in the “apply” command.
| apply my_kmeans_model: The command applies a previously trained machine learning model to your data. In this case, the previously saved model is called “my_kmeans_model,” our KMeans unsupervised anomaly detection model. The MLTK’s “apply” command allows you to use a model that you’ve previously trained and saved to predict outcomes on new or existing data.
Please note, however, that the model should have been trained and saved under the name “my_kmeans_model” for this command to work. If no such model exists, you will get an error.
Moreover, it’s also essential to ensure that the structure and nature of the data you’re applying the model to are similar to the data the model used to train on. Machine learning models learn patterns from the training data and use them to make predictions or classifications on new data. If the new data drastically differs from the training data, the model’s performance will suffer.

Fetch the Cluster Distance Data of Normal Engine Operation from Saved CSV

At this stage, we will fetch the data for the cluster distance range of normal engine operation that we saved during the KMeans unsupervised anomaly detection model’s training. This data will apply to each event in the new data set so that we can fill the “Alert” field based on cluster_distance calculations for the new data set.

| eval ref_key=1
| lookup min_max_dist.csv ref_key OUTPUT min_dist max_dist

In this command:
| eval ref_key=1: The eval command creates a new field named “ref_key” and sets its value to “1” for all events in the search results. This mirrors what we did during the training of the model, when we created the “ref_key” field with the value “1” before exporting the data to CSV. With the same key on both sides, the CSV data can be matched to every event after the “apply” command.
| lookup min_max_dist.csv ref_key OUTPUT min_dist max_dist: The lookup command enriches the existing data with extra details from an external source. Here, it uses the file “min_max_dist.csv,” a lookup table in Splunk. This table should contain a column named “ref_key.” The “lookup” command matches the value of “ref_key” in the search result set with the values in the “ref_key” column in the lookup table.
“OUTPUT” is used to specify which fields you want to bring in from the lookup table to your search results. In this case, “min_dist” and “max_dist” are the fields in the lookup table that you want to add to your search results. If there is a match on “ref_key,” it will add the corresponding values of “min_dist” and “max_dist” from the lookup table to the search results.

Note: Several commands can combine the results from a particular CSV lookup table. As per the Splunk community, the “lookup” command is one of the fastest.

So, overall, this SPL query creates a field “ref_key” with a constant value of “1”. Then it uses that field to look up and bring in “min_dist” and “max_dist” values from an external CSV file (min_max_dist.csv). The final result set will include all the original data and the data from the application of the trained KMeans unsupervised anomaly detection model, plus the “min_dist” and “max_dist” values, wherever there is a match on “ref_key” in the lookup table (all the events in our case).
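To show how the constant key works, here is a small Python sketch that imitates the lookup and the Alert check on in-memory data. It is not how Splunk implements “lookup”; the CSV contents, the threshold values, and the event rows are hypothetical.

```python
import csv
import io

# Hypothetical contents of min_max_dist.csv saved during training
lookup_csv = "ref_key,min_dist,max_dist\n1,0.01,0.10\n"
lookup = {row["ref_key"]: row for row in csv.DictReader(io.StringIO(lookup_csv))}

# Hypothetical events after applying the model
events = [{"cluster_distance": 0.05}, {"cluster_distance": 0.30}]

for event in events:
    event["ref_key"] = "1"                  # same constant key as in the CSV
    match = lookup.get(event["ref_key"])    # match on ref_key
    if match:
        event["min_dist"] = float(match["min_dist"])
        event["max_dist"] = float(match["max_dist"])
        out_of_range = (event["cluster_distance"] > event["max_dist"]
                        or event["cluster_distance"] < event["min_dist"])
        event["Alert"] = 1 if out_of_range else 0

print(events)
```

Because every event carries the same key as the single CSV row, each event receives the same pair of thresholds, and the Alert check can run row by row.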

The Whole Trained KMeans Unsupervised Anomaly Detection Model Application Command

Since we have already explained all the issues and all the commands above, here is the whole command that solves the problems we discussed and applies the trained KMeans unsupervised anomaly detection model to the new data points to make the prediction and raise an Alert when necessary:

index=your_index earliest=-1m
| eval is_zero=if(((isnull(external_vib_x) OR external_vib_x==0) AND (isnull(external_vib_y) OR external_vib_y==0) AND (isnull(external_vib_z) OR external_vib_z==0)) OR (isnull(internal_speed) OR internal_speed==0), 1, 0)
| where is_zero==0
| eval splunk_stime = strptime(sampled_time, "%H:%M:%S.%f")
| eval stime = strftime(splunk_stime, "%H:%M:%S.%f")
| bin _time span=15s as interval
| eventstats avg(external_vib_x) as mean_x, avg(external_vib_y) as mean_y, avg(external_vib_z) as mean_z by interval
| eval SS_x = (external_vib_x - mean_x), SS_y = (external_vib_y - mean_y), SS_z = (external_vib_z - mean_z)
| eval abs_external_vib_x=abs(SS_x), abs_external_vib_y=abs(SS_y), abs_external_vib_z=abs(SS_z)
| eventstats avg(abs_external_vib_x) as avg_external_vib_x, avg(abs_external_vib_y) as avg_external_vib_y, avg(abs_external_vib_z) as avg_external_vib_z by interval
| eval normal_internal_speed = internal_speed / 1000
| apply my_kmeans_model
| eval ref_key=1
| lookup min_max_dist.csv ref_key OUTPUT min_dist max_dist
| eval Alert = if(cluster_distance > max_dist OR cluster_distance < min_dist, 1, 0)
| union
    [ search index=your_index earliest=-1m
    | eval is_zero=if(((isnull(external_vib_x) OR external_vib_x==0) AND (isnull(external_vib_y) OR external_vib_y==0) AND (isnull(external_vib_z) OR external_vib_z==0)) OR (isnull(internal_speed) OR internal_speed==0), 1, 0)
    | where is_zero==1
    | eval Alert=1 ]
| table _time stime Alert is_zero internal* _raw cluster_distance *dist *
| sort stime

The data is constantly streaming to our Splunk Instance, and the prediction query above runs each minute to process the last-minute data.
