What is Standard Scaling
Before looking at how Splunk applies standard scaling, let's understand what standard scaling is. Standard scaling, also known as standardization or Z-score normalization, is a common preprocessing technique used in machine learning and statistics to normalize data.
The process subtracts the mean (average) of a feature (variable) from each data point and then divides the result by the feature's standard deviation. After applying standard scaling to a dataset, the transformed data has a mean of 0 and a standard deviation of 1. It's important to note, however, that the resulting data is not bound to a specific interval. It depends on the data: if there are extreme outliers, values can extend far in either direction, because standard scaling does not clip the range. For roughly normally distributed data, though, most scaled values will fall somewhere between about -3 and 3.
The formula for standardization is:
Z = (X – mean) / stdev
Z: the standard-scaled output.
X: the original data point.
mean: the mean (average) of the feature in the given time range.
stdev: the standard deviation of the feature in the given time range.
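The formula can be sketched in a few lines of Python (an illustration of the math only, not Splunk code):

```python
# A minimal sketch of the standardization formula Z = (X - mean) / stdev.
from statistics import mean, stdev

def standard_scale(points):
    m = mean(points)
    s = stdev(points)  # sample standard deviation, like SPL's stdev()
    return [(x - m) / s for x in points]

data = [10, 12, 14, 16, 18]
scaled = standard_scale(data)
# The scaled values are centered on 0: their mean is (approximately) 0,
# and points above/below the original mean become positive/negative.
print(scaled)
```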
Standard scaling is particularly useful when working with machine learning algorithms that assume the data is normally distributed, such as linear regression, logistic regression, and some types of neural networks. It also helps where an algorithm does not perform well when the input variables have different scales, as in K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Principal Component Analysis (PCA).
Splunk Standard Scaling using Machine Learning Toolkit
Splunk’s Machine Learning Toolkit (MLTK) provides a rich set of commands for preprocessing your data, training machine learning models, and evaluating their performance. One of these commands is “fit,” which we use for applying Splunk standard scaling to the data.
Here is a simple example of how you might standardize a single field in your data using the StandardScaler in Splunk MLTK (which wraps the StandardScaler preprocessing algorithm from the scikit-learn Python library):
index=your_index | fit StandardScaler your_field into model
In this command:
index=your_index: fetches data from the index “your_index.” Replace this with your actual index name.
fit: is the MLTK command for fitting a model to your data.
StandardScaler: is the algorithm to use (in this case, standard scaling).
your_field: is the field you want to standardize. Replace this with your actual field name.
into model: specifies the name (“model”) under which to save the fitted model.
The “into model” part matters mainly for machine learning models, where you later use the “apply” command to run the trained model against new data. With preprocessing algorithms we can omit it, because we're only interested in transforming the data, not applying the model later:
index=your_index | fit StandardScaler your_field
If you want to fit Splunk standard scaling to multiple fields, list them after the StandardScaler keyword:
index=your_index | fit StandardScaler field1 field2 field3
The StandardScaler algorithm will process each field separately.
The command will calculate the mean and standard deviation of the specified fields and immediately apply the standard scaling to these fields. The command will create new fields in your data named “SS_field1”, “SS_field2”, and “SS_field3,” which contain the standardized values of the fields “field1”, “field2”, and “field3,” respectively.
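As a rough sketch (plain Python, not the MLTK internals), here is how scaling several fields independently and writing SS_-prefixed outputs might look. Note that the divisor convention can differ: SPL's stdev() is the sample standard deviation, while scikit-learn's StandardScaler uses the population standard deviation; the sample version is used below.

```python
# Hedged sketch: how "fit StandardScaler field1 field2" treats each field
# independently and emits SS_-prefixed results (illustration only).
from statistics import mean, stdev

events = [
    {"field1": 5.0, "field2": 100.0},
    {"field1": 7.0, "field2": 300.0},
    {"field1": 9.0, "field2": 500.0},
]

def fit_standard_scaler(events, fields):
    # Each field gets its own mean and standard deviation.
    stats = {f: (mean(e[f] for e in events), stdev(e[f] for e in events))
             for f in fields}
    for e in events:
        for f in fields:
            m, s = stats[f]
            e["SS_" + f] = (e[f] - m) / s
    return events

scaled = fit_standard_scaler(events, ["field1", "field2"])
```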
In Splunk MLTK, when using the StandardScaler, you may adjust two more properties: “with_mean” and “with_std.” These properties are Boolean values (true or false), and they control whether the standard scaler should apply mean centering or standard deviation scaling.
with_mean controls whether the scaler should subtract the mean from each data point. If set to “true,” the scaler will subtract the mean from each data point (mean centering), which means that the data points will center to “0”. If set to “false,” the scaler will not subtract the mean. The default is “true.”
with_std controls whether the scaler should divide each data point by the standard deviation. If set to “true,” the scaler will divide each data point by the standard deviation (standard deviation scaling), meaning the algorithm will scale the data to unit variance. If set to “false,” the scaler will not divide by the standard deviation. The default is “true.”
Whether to use these parameters should be your decision based on the data. If the data points are not centered at 0, center them with “with_mean=true.” If there's a big difference between the minimum and maximum data points, use “with_std=true.” An example:
index=your_index | fit StandardScaler field1 with_mean=true with_std=false
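A minimal plain-Python sketch of what these two switches toggle (not the MLTK implementation):

```python
# with_mean / with_std: centering and unit-variance scaling can be
# toggled independently (illustration only).
from statistics import mean, stdev

def scale(points, with_mean=True, with_std=True):
    m = mean(points) if with_mean else 0.0   # with_mean=false: no centering
    s = stdev(points) if with_std else 1.0   # with_std=false: no scaling
    return [(x - m) / s for x in points]

data = [2.0, 4.0, 6.0]
centered_only = scale(data, with_mean=True, with_std=False)  # like the SPL example above
full = scale(data)  # default: both centering and unit-variance scaling
```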
The Splunk standard scaling algorithm is helpful when you have all the data for a known time range before applying a machine learning algorithm. But if you change that range, making it longer or shorter, the scaled data points will change even for the same period. Say you fit the StandardScaler on your data from 13:10:00 to 13:11:00 and get specific scaled results. If you extend the range to 13:10:00 through 13:15:00, you get more data points, so the mean and standard deviation may change, and so will the scaled data points. That can be inconvenient while testing your model.
You could apply the StandardScaler in windows of 15 seconds each, but that is more challenging than it sounds. In this case, you can apply custom or manual standard scaling with plain SPL, without the Splunk Machine Learning Toolkit.
Standardizing Data without MLTK Using Only SPL
Suppose you want to perform Splunk standard scaling without using the Machine Learning Toolkit (MLTK). In that case, you can use built-in Splunk commands to calculate the mean and standard deviation and then manually apply the standard scaling formula. Here is an example using Splunk’s “eventstats” and “eval” SPL commands:
index=your_index | eventstats avg(field1) as mean_field1, stdev(field1) as stdev_field1 | eval SS_field1=(field1-mean_field1)/stdev_field1 | table _time SS_* | sort _time
In this command:
| eventstats avg(field1) as mean_field1, stdev(field1) as stdev_field1: The eventstats command calculates aggregate statistics for the specified field1. It calculates the average (avg) and standard deviation (stdev) for field1 and assigns the results to new fields mean_field1 and stdev_field1, respectively.
| eval SS_field1=(field1-mean_field1)/stdev_field1: The eval command will create a new field, “SS_field1,” that holds the standardized score for each data point in the original field1. This score indicates how many standard deviations an individual data point is from the mean of its field.
| table _time SS_*: The table command generates a table with _time and all fields starting with SS_, which includes the standardized scores for field1 (and any other fields you scaled).
| sort _time: Finally, the sort command arranges the events in ascending order of the _time field.
If you want to skip centering the scale at 0, you can remove the mean variable:
index=your_index | eventstats stdev(field1) as stdev_field1 | eval SS_field1=field1/stdev_field1 | table _time SS_* | sort _time
and if you don't want to scale your data points to unit variance, you can omit the standard deviation variable:
index=your_index | eventstats avg(field1) as mean_field1 | eval SS_field1=field1-mean_field1 | table _time SS_* | sort _time
and here's an example of standard scaling with Splunk's built-in commands for three fields:
index=your_index | eventstats avg(field1) as mean_field1, stdev(field1) as stdev_field1, avg(field2) as mean_field2, stdev(field2) as stdev_field2, avg(field3) as mean_field3, stdev(field3) as stdev_field3 | eval SS_field1=(field1-mean_field1)/stdev_field1, SS_field2=(field2-mean_field2)/stdev_field2, SS_field3=(field3-mean_field3)/stdev_field3 | table _time SS_* | sort _time
Splunk Standard Scaling by Constant Interval
Let's take an example where you want to apply Splunk standard scaling every 15 seconds, because data constantly streams in from a sensor and you don't know how the data points will be distributed in the next hour. This is harder to do with the Splunk Machine Learning Toolkit, so we'll achieve it with built-in commands:
index=your_index | bin _time span=15s | eventstats avg(field1) as mean_field1, stdev(field1) as stdev_field1, avg(field2) as mean_field2, stdev(field2) as stdev_field2, avg(field3) as mean_field3, stdev(field3) as stdev_field3 by _time | eval SS_field1=(field1-mean_field1)/stdev_field1, SS_field2=(field2-mean_field2)/stdev_field2, SS_field3=(field3-mean_field3)/stdev_field3 | table _time SS_* | sort _time
New commands that we added:
| bin _time span=15s: The bin command will split the _time field into 15-second intervals. The _time field in Splunk represents the timestamp associated with each event.
by _time: This is now part of the “eventstats” command, which calculates the mean and standard deviation for each field. The calculations now happen within the 15-second intervals we specified in the “bin” command.
When using “bin _time span=15s” and “eventstats ... by _time,” the _time value of every event within each 15-second range is replaced with the same bucket timestamp. The problem appears when you sort the results of a table by the _time field: events inside each 15-second range end up in arbitrary order. We can overcome this by writing the bucket to a new field (“bin _time span=15s as interval”) and grouping the “eventstats” by that field (“by interval”):
index=your_index | bin _time span=15s as interval | eventstats avg(field1) as mean_field1, stdev(field1) as stdev_field1, avg(field2) as mean_field2, stdev(field2) as stdev_field2, avg(field3) as mean_field3, stdev(field3) as stdev_field3 by interval | eval SS_field1=(field1-mean_field1)/stdev_field1, SS_field2=(field2-mean_field2)/stdev_field2, SS_field3=(field3-mean_field3)/stdev_field3 | table _time SS_* | sort _time
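As a hedged Python analogue of this bucketed pipeline (field names and sample values are made up for illustration), each event is floored to its 15-second bucket, and the mean and standard deviation are computed within each bucket rather than over the whole range:

```python
# Sketch of "bin span=15s" + "eventstats ... by interval" + "eval" in plain Python.
from collections import defaultdict
from statistics import mean, stdev

SPAN = 15  # seconds, like "bin _time span=15s"

events = [
    {"_time": 0,  "field1": 1.0},
    {"_time": 5,  "field1": 3.0},
    {"_time": 20, "field1": 10.0},
    {"_time": 25, "field1": 30.0},
]

# bin: floor each timestamp to its 15-second bucket, kept in a new field
buckets = defaultdict(list)
for e in events:
    e["interval"] = e["_time"] // SPAN * SPAN
    buckets[e["interval"]].append(e)

# eventstats ... by interval, then eval: scale within each bucket
for interval_events in buckets.values():
    vals = [e["field1"] for e in interval_events]
    m, s = mean(vals), stdev(vals)
    for e in interval_events:
        e["SS_field1"] = (e["field1"] - m) / s
```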
No Change in Input Data Will Result in a Standard Scaling of 0
Say you have several data points that all have the same value of 11 over your whole preprocessing range, whether you're using the MLTK StandardScaler or 15-second “bin” buckets over a 1-minute range. The effect is easier to see with 15-second buckets, so let's use those as the example. A 1-minute range gives four 15-second buckets, and every data point has the value 11. Standard scaling will output 0 for every result, because the mean of points that all equal 11 is 11, no matter how many points there are: with 20 points, the mean is 11 * 20 / 20 = 11.
Recall the formula:
Z = (X - mean) / stdev
So the original value (X) of 11 minus the mean, which is also 11, gives 0. (Strictly speaking, the standard deviation is also 0 here; the MLTK StandardScaler guards against dividing by zero, while the manual SPL eval divides by a zero stdev and returns a null result instead.)
The phenomenon is neither good nor bad, but it is something to note while preprocessing your data with standard scaling: understand how it affects your machine learning model.
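The constant-input case can be sketched in a few lines of Python. Note the zero-variance guard: scikit-learn's StandardScaler (which MLTK wraps) treats a zero standard deviation as 1 to avoid dividing by zero, which is why the output is 0 rather than undefined; that guard is reproduced here.

```python
# Constant input: every point equals 11, so X - mean is 0 for every point.
from statistics import mean, pstdev

points = [11.0] * 20

m = mean(points)    # 11.0, regardless of how many points there are
s = pstdev(points)  # 0.0 -- no variation at all
if s == 0:
    s = 1.0         # zero-variance guard, as scikit-learn does
scaled = [(x - m) / s for x in points]
```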
Bonus: Splunk MLTK RobustScaler
Splunk Machine Learning Toolkit contains another scaling algorithm that works better if your data points have outliers: Machine Learning Toolkit RobustScaler.
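As a teaser for how robust scaling differs, here is a minimal sketch of the underlying idea: center on the median and divide by the interquartile range (IQR), so a single extreme outlier barely shifts the center or the scale. This is plain Python, not MLTK's implementation, and quartile conventions vary between implementations; a simple median-of-halves scheme is used here.

```python
# Rough sketch of robust scaling: (X - median) / IQR, illustration only.
from statistics import median

def robust_scale(points):
    srt = sorted(points)
    q1 = median(srt[: len(srt) // 2])         # lower half -> 1st quartile
    q3 = median(srt[(len(srt) + 1) // 2 :])   # upper half -> 3rd quartile
    med = median(points)
    return [(x - med) / (q3 - q1) for x in points]

data = [float(i) for i in range(1, 10)] + [1000.0]  # 1000 is an outlier
scaled = robust_scale(data)
# The non-outlier points keep modest scaled values; the median and IQR
# are barely affected by the single extreme point.
```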