What is KMeans Unsupervised Machine Learning Algorithm
Before looking at how to use the Splunk KMeans unsupervised algorithm, let’s understand K-means itself. K-means is a frequently used unsupervised machine learning algorithm for clustering problems. It is applied when data points need to be grouped into distinct clusters based on their similarity.
The fundamental steps of the K-means algorithm are:
Initialization: The process commences by randomly picking a specified number of centroids, signifying the cluster centers. The ‘K’ in K-means corresponds to this predetermined number of clusters.
Cluster Assignment: Each data point is associated with the closest centroid in this step. The measure of distance is typically the Euclidean distance. The assembly of data points linked with a centroid constitutes a cluster.
Centroid Recalculation: Next, the algorithm recalibrates the position of every centroid to reflect the mean (which is why it’s called “K-means”) of the locations of all data points that are part of that cluster.
Iteration: The algorithm repeats the Cluster Assignment and Centroid Recalculation steps until the cluster assignments and centroid positions no longer change. At that point the algorithm has converged. The result is a local optimum for the given initialization, not necessarily the best possible clustering.
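The four steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the scikit-learn or Splunk implementation; the function names, the sample points, and the choice to initialize from randomly picked data points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Cluster assignment: index of the closest centroid for each point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Iteration: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Centroid recalculation: mean of the points in each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Hypothetical sample data: two well-separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Which group ends up as cluster 0 depends on the random initialization, so only the grouping itself is meaningful, not the cluster numbers.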
K-means has its limitations. It can end up with different final clusters depending on the initial random placement of the centroids, which means it may not discover the globally optimal solution. For this reason, the algorithm is usually run several times with different initializations, and the outcome with the best score (e.g., the lowest within-cluster variance) is selected.
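As a small illustration of picking the best run, the sketch below scores two candidate clusterings of the same points by their within-cluster sum of squares and keeps the lower one. The points and both candidate results are made up for the example; each is a configuration in which K-means would stop changing, so either could result from a different initialization.

```python
def within_cluster_ss(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[a]))
        for p, a in zip(points, assignments)
    )

# Hypothetical data: four corners of a wide rectangle.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 1.0), (5.0, 2.0)]

# Two hypothetical outcomes from two different initializations.
runs = [
    # Left/right split: compact clusters.
    {"centroids": [(1.0, 1.5), (5.0, 1.5)], "assignments": [0, 0, 1, 1]},
    # Bottom/top split: stretched clusters, a poorer local optimum.
    {"centroids": [(3.0, 1.0), (3.0, 2.0)], "assignments": [0, 1, 0, 1]},
]

scores = [within_cluster_ss(points, r["centroids"], r["assignments"]) for r in runs]
best = runs[scores.index(min(scores))]
print(scores)               # [1.0, 16.0]
print(best["assignments"])  # [0, 0, 1, 1]
```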
Additionally, K-means presumes that clusters are convex and isotropic, which isn’t always the case. This makes K-means a poor fit for clusters that have complex shapes or very different scales. Furthermore, K-means requires the number of clusters (K) to be defined in advance. There are methods for estimating a good K value, such as the elbow method or the silhouette method, but these techniques aren’t flawless.
How to Use Splunk KMeans Unsupervised Algorithm for Clustering and Classification
The Splunk KMeans unsupervised algorithm is part of the Splunk Machine Learning Toolkit (MLTK) add-on and uses the KMeans implementation from the scikit-learn Python library.
Check our articles if you need help installing Splunk add-ons and preparing Splunk for Machine Learning.
The Splunk KMeans unsupervised algorithm works best with numeric data. Here’s an example in which we use the KMeans algorithm for classification. There are better ML algorithms for classification, but KMeans can be used for it and makes a simple illustration. Suppose you have several features that distinguish a Dog, a Cat, and a Parrot. You will train KMeans on known data points for these features with three clusters (K=3). Here are the example features (Splunk fields):
size has_feathers is_barks paw_size type
“size” is the size of the animal. “has_feathers” is a Boolean that is “1” if the animal has feathers and “0” otherwise. Of course, it will be “1” only for the Parrot class. “is_barks” is also a Boolean, “1” or “0”, and will be “1” only for a Dog. “paw_size” is a numeric field that holds the paw size of each animal. “type” is “0” for a Dog, “1” for a Cat, and “2” for a Parrot. KMeans is unsuitable for string data, so we convert each animal type into an integer.
If your type is stored as a string (here, in a field named “type_string”), use the “eval” command to convert it to an integer:
index=main sourcetype=animals
| eval type=case(type_string=="Dog", 0, type_string=="Cat", 1, type_string=="Parrot", 2)
Now we’ll use the Splunk Machine Learning Toolkit “fit” command to train our model and save it as “my_kmeans_model”:
index=main sourcetype=animals
| eval type=case(type_string=="Dog", 0, type_string=="Cat", 1, type_string=="Parrot", 2)
| fit KMeans size has_feathers is_barks paw_size type k=3 into my_kmeans_model
| table *
In this command:
index=main sourcetype=animals: This is the start of the search. The command specifies that Splunk should look at the “main” index for data with the source type of “animals.”
| eval type=case(type_string=="Dog", 0, type_string=="Cat", 1, type_string=="Parrot", 2): The eval command evaluates an expression and stores the result in a new or existing field. Here, it creates a field called “type” based on the value of “type_string.” If the value of “type_string” is “Dog,” “type” is set to 0. If it’s “Cat,” “type” is set to 1, and if it’s “Parrot,” “type” is set to 2.
| fit KMeans size has_feathers is_barks paw_size type k=3 into my_kmeans_model: The “fit” command trains a machine learning model on the data. In this case, it’s a KMeans clustering model with 3 clusters (k=3), since we have only 3 types of animals: Dog, Cat, and Parrot. The model trains on the fields “size,” “has_feathers,” “is_barks,” “paw_size,” and “type.” The “into” clause saves the trained model as “my_kmeans_model.”
| table *: This command generates a table of all available fields in the data. The asterisk (*) is a wildcard that matches all fields.
This sequence of commands, therefore, classifies the data from the “animals” source type in the “main” index into three groups using the KMeans algorithm based on the attributes “size,” “has_feathers,” “is_barks,” “paw_size,” and “type,” where the “type” is determined based on whether the “type_string” field is “Dog,” “Cat,” or “Parrot.” It then displays a table of all the fields in the data.
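If the “fit” command produces different clusters each time you run it, that is because the initial centroids are chosen randomly. To make the results reproducible, set the optional “random_state” parameter to a fixed value, for example “random_state=0”: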
| fit KMeans size has_feathers is_barks paw_size type k=3 random_state=0 into my_kmeans_model
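The new events need the same feature fields the model was trained on, so if “type” is missing from them, train the model without “type”. To apply the trained model to a new data source, “new_animals”, where the type of each animal is unknown, use the MLTK “apply” command: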
index=main sourcetype=new_animals
| apply my_kmeans_model
| table *
Running the “fit” or “apply” command with KMeans adds two new fields to the results:
cluster: After the KMeans algorithm has been applied to your data using the fit command, a new field named “cluster” is added to your results. The cluster field represents the cluster number that each event or data point has been assigned to by the algorithm. Remember that KMeans is an unsupervised machine learning algorithm that groups similar data points, and the user specifies the number of groups (in this case, k=3). The algorithm assigns each data point to one of these three clusters, and this assignment is set in the cluster field. The first cluster is “0”, the second one is “1,” and the third one is “2”.
cluster_distance: Along with the cluster field, the fit command also adds a “cluster_distance” field to your results when it runs the KMeans algorithm. The cluster_distance field represents the Euclidean distance from each event or data point to the centroid of the cluster it was assigned to. In the context of KMeans, the centroid is the “center” or “mean” point of each cluster. The cluster_distance tells you how close or far each data point is from the center of its cluster. The smaller the distance, the closer the data point is to the centroid of its cluster, and vice versa.
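To make the two fields concrete, here is a minimal Python sketch of how a cluster number and a distance can be derived for one event, following the description above. It is not the MLTK code. The centroid values and the new event are hypothetical, and the feature order (size, has_feathers, is_barks, paw_size) is an assumption for the example.

```python
import math

def assign(point, centroids):
    """Return (cluster index, Euclidean distance to that cluster's centroid)."""
    distances = [math.dist(point, c) for c in centroids]
    cluster = distances.index(min(distances))
    return cluster, distances[cluster]

# Hypothetical centroids of a trained 3-cluster model.
# Feature order: size, has_feathers, is_barks, paw_size
centroids = [
    (60.0, 0.0, 1.0, 8.0),
    (30.0, 0.0, 0.0, 4.0),
    (20.0, 1.0, 0.0, 2.0),
]

new_event = (33.0, 0.0, 0.0, 8.0)
cluster, cluster_distance = assign(new_event, centroids)
print(cluster, cluster_distance)  # 1 5.0
```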
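In this scenario, we’re only interested in which “cluster” each row was placed in, that is, whether it landed with the Dogs, Cats, or Parrots, rather than in the “cluster_distance.” Note that the cluster numbers are arbitrary: cluster “0” is not necessarily Dog, so compare “cluster” against “type” on the training data to see which cluster corresponds to which animal.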
Using KMeans for Anomaly Detection
“cluster_distance” is useful when you use the Splunk KMeans unsupervised algorithm for anomaly detection in your data set. The idea is that the known, “good” state of your data lies between the minimum and maximum “cluster_distance” values seen when the model was trained. Any new event whose “cluster_distance” falls outside that range after you apply the model is treated as anomalous.
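The range check described above can be sketched in Python as follows. This is a simplified illustration with a single hypothetical centroid and made-up training points; with several clusters you would first pick the nearest centroid and could keep a separate range per cluster.

```python
import math

# Hypothetical centroid and "good state" training points for one cluster.
centroid = (10.0, 10.0)
training_points = [(10.0, 11.0), (12.0, 10.0), (10.0, 7.0)]

# Distance range observed on the training data.
training_distances = [math.dist(p, centroid) for p in training_points]
low, high = min(training_distances), max(training_distances)

def is_anomalous(point):
    """Flag a point whose distance to the centroid is outside the trained range."""
    d = math.dist(point, centroid)
    return d < low or d > high

print(is_anomalous((10.0, 12.0)))  # distance 2.0, inside the range
print(is_anomalous((20.0, 10.0)))  # distance 10.0, outside the range
```

In practice you may prefer a looser threshold than the exact training maximum, since a single unusual training event would otherwise define the boundary.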
How to calculate the centroid of each cluster?
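The only centroid-related field that “fit” and “apply” add to the results is “cluster_distance”, the distance from the current point to the center of its cluster. A point that sits exactly on the centroid has a “cluster_distance” of 0. Because a centroid is the mean of the points in its cluster, you can also estimate it by averaging each feature field per cluster over the training data.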
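As a sketch of that averaging, the Python snippet below groups rows by their “cluster” value and takes the mean of each feature. The rows and the two-feature subset are hypothetical, and the result matches the model’s centroids only if the rows are the training data of a converged model.

```python
from collections import defaultdict

# Hypothetical training rows as they might look after "fit" (two features only).
rows = [
    {"cluster": 0, "size": 60.0, "paw_size": 8.0},
    {"cluster": 0, "size": 50.0, "paw_size": 6.0},
    {"cluster": 1, "size": 30.0, "paw_size": 4.0},
    {"cluster": 1, "size": 20.0, "paw_size": 2.0},
]
features = ["size", "paw_size"]

# Collect the feature vectors of each cluster.
grouped = defaultdict(list)
for row in rows:
    grouped[row["cluster"]].append([row[f] for f in features])

# Centroid estimate: per-feature mean of the members of each cluster.
centroids = {
    cluster: [sum(col) / len(col) for col in zip(*members)]
    for cluster, members in grouped.items()
}
print(centroids)  # {0: [55.0, 7.0], 1: [25.0, 3.0]}
```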