About Splunk Supervised Machine Learning
Before looking at how to use Splunk supervised machine learning, let’s define supervised ML. Machine learning is a scientific field concerned with developing statistical models and algorithms that perform tasks without explicit instructions to the computer, and supervised machine learning is a subset of it. Splunk supervised machine learning applies these principles within the Splunk platform.
In supervised machine learning, algorithms learn from labeled data. This data is a collection of examples, each pairing input variables with the correct output. The algorithm analyzes the data and produces a model that can make predictions or decisions without being explicitly programmed for the task. The model is then tested and adjusted as necessary to improve its accuracy.
Splunk, an advanced data analytics platform, has integrated supervised machine learning into its operations. This integration lets Splunk offer intelligent, data-driven solutions to its users. The Splunk Machine Learning Toolkit (MLTK) is a critical component in this integration.
In the context of engine operation, Splunk supervised machine learning, like any supervised ML algorithm, needs examples of both a “good state” and a “bad state” in the training data set so that it knows what to predict from the incoming data set. If the training data contains only one state, the algorithm will predict everything as that state. You are not limited to “good state” and “bad state”: you can define more labels, and the algorithm will try to predict all of them based on the input data.
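The single-state problem described above can be caught before training by counting the labels. The following is a minimal sketch in plain Python, outside Splunk, assuming the training rows have been exported as dictionaries. The function names `label_counts` and `is_trainable` and the sample rows are hypothetical and not part of MLTK.

```python
from collections import Counter

def label_counts(rows, label_field="good_state"):
    """Count how many training rows carry each label value."""
    return Counter(row[label_field] for row in rows)

def is_trainable(rows, label_field="good_state"):
    """A classifier needs at least two distinct labels to learn a boundary."""
    return len(label_counts(rows, label_field)) >= 2

# Hypothetical training sets: the first contains only the "good state".
one_state = [{"good_state": 1}, {"good_state": 1}]
two_states = [{"good_state": 1}, {"good_state": 0}, {"good_state": 1}]

print(is_trainable(one_state))   # only one label present
print(is_trainable(two_states))  # both labels present
```

Inside Splunk, a similar check could be done by counting events per label value before running the training search.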
Unlike supervised ML, unsupervised machine learning algorithms in most cases don’t need any state besides the “good state,” since their job is to find an “anomaly” in the pattern of the good state. Some unsupervised algorithms also serve more than one use case. For example, K-Means is used for clustering and also for classification. We have two example use cases: the KMeans unsupervised algorithm for Classification in Splunk and the KMeans unsupervised algorithm for Anomaly Detection in Splunk.
Preparing Platform for Splunk Supervised Machine Learning
First, prepare Splunk for machine learning by installing the required add-ons. After that, prepare your data for machine learning. If you need more practice with machine learning for engine failure prediction or similar tasks, refer to the guidelines of the Splunk Essentials for Predictive Maintenance add-on.
Splunk Supervised Machine Learning Usage
We’ll use an SPL query and explain it:
index=your_index source=your_source
| eval good_state = if(external_vibration_sensor_x < threshold1 and external_vibration_sensor_y < threshold2 and external_vibration_sensor_z < threshold3 and internal_engine_sensor_speed < threshold4, 1, 0)
| fit DecisionTreeClassifier good_state from external_vibration_sensor_x external_vibration_sensor_y external_vibration_sensor_z internal_engine_sensor_speed into my_model
index=your_index source=your_source: These search terms specify the data set to analyze. “your_index” represents the name of the index where your data resides, and “your_source” is the source of the data in that index.
eval good_state = if(external_vibration_sensor_x < threshold1 and external_vibration_sensor_y < threshold2 and external_vibration_sensor_z < threshold3 and internal_engine_sensor_speed < threshold4, 1, 0): This command creates a new field named “good_state” using the “eval” command, with the “if” function deciding its value. If all conditions are met, i.e., the values of “external_vibration_sensor_x,” “external_vibration_sensor_y,” “external_vibration_sensor_z,” and “internal_engine_sensor_speed” are all below their respective thresholds, “good_state” is assigned the value “1”. Otherwise, it is assigned the value “0”, meaning a “bad state.” Of course, you can use different or additional conditions to define “good_state” to your liking.
fit DecisionTreeClassifier good_state from external_vibration_sensor_x external_vibration_sensor_y external_vibration_sensor_z internal_engine_sensor_speed into my_model: This command applies machine learning to the data. It uses the “fit” command, which trains a machine-learning model and which the Machine Learning Toolkit adds to Splunk SPL. Here we use the “DecisionTreeClassifier” algorithm, but you can use any supervised machine learning algorithm that suits your needs better. The target variable (what you want to predict) is “good_state.” The predictor variables (the data used to predict the target variable) are the sensor readings “external_vibration_sensor_x,” “external_vibration_sensor_y,” “external_vibration_sensor_z,” and “internal_engine_sensor_speed.” The “from” keyword introduces the predictor variables, a.k.a. the features, on which the ML model is trained. The “into” keyword saves the trained model under the name “my_model” for later use.
Essentially, this Splunk search labels each event with a “good” or “bad” state based on a set of conditions on the sensor readings and then trains a Decision Tree Classifier to predict this state from those same sensor readings.
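The labeling rule from the “eval” step can be sketched outside Splunk as follows. This is an illustration in plain Python, not the way Splunk evaluates the expression. The threshold values are hypothetical stand-ins for “threshold1” through “threshold4”, and the function name `good_state` simply mirrors the field name.

```python
# Hypothetical thresholds; real values depend on your sensors and engine.
THRESHOLDS = {
    "external_vibration_sensor_x": 0.5,
    "external_vibration_sensor_y": 0.5,
    "external_vibration_sensor_z": 0.7,
    "internal_engine_sensor_speed": 3000,
}

def good_state(event, thresholds=THRESHOLDS):
    """Return 1 when every reading is below its threshold, otherwise 0."""
    return int(all(event[field] < limit for field, limit in thresholds.items()))

# Two made-up events: the second one exceeds the speed threshold.
healthy = {
    "external_vibration_sensor_x": 0.2,
    "external_vibration_sensor_y": 0.3,
    "external_vibration_sensor_z": 0.4,
    "internal_engine_sensor_speed": 2500,
}
overspeed = dict(healthy, internal_engine_sensor_speed=3400)

print(good_state(healthy), good_state(overspeed))
```

Because the rule uses “and” semantics, a single reading at or above its threshold is enough to label the event as a “bad state.”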
Applying Splunk Supervised Machine Learning Trained Model
Applying the trained Splunk supervised machine learning model is the easiest part:
| inputlookup your_test_data_source
| apply my_model
inputlookup your_test_data_source: This command loads the data to which you want to apply your trained Splunk supervised machine learning model. It can be any data source that contains the same feature fields used for training.
apply my_model: This command is the actual application of the model. Once applied, the model adds a new field named “predicted(good_state)” to each result.
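If the test data also carries the true “good_state” label, the predictions can be compared against it. The sketch below assumes the results were exported from Splunk as rows of dictionaries. The `accuracy` helper and the sample rows are hypothetical, and values are compared as strings because exported fields may arrive as text.

```python
def accuracy(rows, target="good_state"):
    """Share of rows where predicted(<target>) equals the true label."""
    if not rows:
        raise ValueError("no rows to score")
    predicted = f"predicted({target})"
    # Compare as strings so that 1 and "1" count as the same label.
    hits = sum(1 for row in rows if str(row[target]) == str(row[predicted]))
    return hits / len(rows)

# Made-up scored rows: three correct predictions out of four.
scored = [
    {"good_state": 1, "predicted(good_state)": "1"},
    {"good_state": 0, "predicted(good_state)": "0"},
    {"good_state": 1, "predicted(good_state)": "0"},
    {"good_state": 0, "predicted(good_state)": "0"},
]

print(accuracy(scored))
```

Accuracy alone can mislead when one state is rare, so it is worth also checking how many “bad state” events were predicted correctly.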
DecisionTreeClassifier Model Algorithm
The “DecisionTreeClassifier” is a machine learning algorithm from the supervised learning class. It predicts an observation’s type or category based on its distinguishing features.
A decision tree model works like a tree-shaped decision-making structure. The whole dataset begins at the tree’s root, and from there, it branches off according to a decision rule set by the model, for instance, “Is Attribute A higher than a certain threshold?” This branch-splitting process repeats recursively, forming a tree where each internal node holds a decision rule and each leaf denotes a prediction or an outcome.
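To make the idea of a decision rule concrete, here is a minimal sketch of how a single split on one feature can be chosen. It uses Gini impurity, the default split criterion of scikit-learn’s DecisionTreeClassifier, but it is a simplified one-level illustration in plain Python and not the code MLTK or scikit-learn actually runs. The function names and the temperature readings are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one feature with the lowest weighted Gini impurity.

    Returns (threshold, impurity); rows with value < threshold go to the left branch.
    """
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up temperature readings with their good_state labels.
temps = [1580, 1590, 1595, 1605, 1610]
states = [1, 1, 1, 0, 0]

print(best_split(temps, states))
```

A full decision tree repeats this search on each branch, across all features, until the leaves are pure enough or a depth limit is reached.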
A few prominent attributes distinguish decision trees:
Interpretability: Decision trees stand out for their simplicity and transparency, which makes them easy to understand and interpret. This clarity is valuable in Splunk’s supervised machine learning because the entire decision path can be visualized.
Capacity to handle categorical and numerical data: Decision trees can work with both forms of data, although some implementations require categorical data to undergo one-hot encoding first.
Non-parametric: Decision trees don’t make assumptions about the distribution of the underlying data.
Feature Importance: Decision trees can highlight essential features that contribute to predictions.
The Splunk Machine Learning Toolkit utilizes a decision tree model tailored for classification tasks, built on the DecisionTreeClassifier class from the scikit-learn Python library.
This model’s simplicity and transparency make it an appropriate starting point for many classification tasks, provided the data has been prepared well for machine learning. However, all machine learning techniques, including those used in Splunk’s supervised machine learning, have their strengths and limitations, so other algorithms might be a better fit for some scenarios.
Another Example Based on the Splunk Essentials for Predictive Maintenance Add-on
Before training our model in production, we used the Splunk Essentials for Predictive Maintenance add-on to understand how it works in Splunk. We used this SPL command after completing the tutorial:
sourcetype=iot_pm_fail earliest=1
| reverse
| table unit_cycle sname*
| head 259
| where unit_cycle>0 AND unit_cycle<173
| eval good_state = if(sname_HPC_Outlet_Temp<1600 and sname_Bypass_Ratio<8.45 and sname_LPT_Outlet_Temp<1405 and sname_Static_HPC_Outlet_Pres<47.5, 1, 0)
| table unit_cycle good_state sname_Bypass_Ratio sname_HPC_Outlet_Temp sname_LPT_Outlet_Temp sname_Static_HPC_Outlet_Pres
| fit DecisionTreeClassifier "good_state" from sname_Bypass_Ratio sname_HPC_Outlet_Temp sname_LPT_Outlet_Temp sname_Static_HPC_Outlet_Pres into my_model
The command is for reference only. To make sense of the threshold values, you will need to understand the data in the example jet engine dataset that the add-on provides.
sourcetype=iot_pm_fail: The Splunk Essentials for Predictive Maintenance add-on adds the “iot_pm_fail” source type.
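earliest=1: This sets the start of the search time range to epoch time “1”, the earliest possible event, so the search covers everything from the beginning until now, like the “All time” frame.
reverse: Inverts the order of the results. Splunk returns events newest first by default, so after this command the earliest event comes first.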
table unit_cycle sname*: The table command creates a tabular data output with columns “unit_cycle” and any field that begins with “sname.”
head 259: This command limits the search results to the first 259 events, which is the first window of the whole engine cycle before the first maintenance.
where unit_cycle>0 AND unit_cycle<173: The “where” command filters the results. It only includes those where “unit_cycle” is greater than “0” and less than “173”.
eval good_state = if(sname_HPC_Outlet_Temp<1600 and sname_Bypass_Ratio<8.45 and sname_LPT_Outlet_Temp<1405 and sname_Static_HPC_Outlet_Pres<47.5, 1, 0): This eval command creates a new field called “good_state”. The if function assigns 1 to “good_state” if the event meets all four conditions and 0 otherwise.
table unit_cycle good_state sname_Bypass_Ratio sname_HPC_Outlet_Temp sname_LPT_Outlet_Temp sname_Static_HPC_Outlet_Pres: This command generates a new table with the specified fields.
fit DecisionTreeClassifier "good_state" from sname_Bypass_Ratio sname_HPC_Outlet_Temp sname_LPT_Outlet_Temp sname_Static_HPC_Outlet_Pres into my_model: The fit command trains a DecisionTreeClassifier model on the data. The model learns to predict the “good_state” field from the four “sname” fields, and the trained model is saved as “my_model”.
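The row-selection steps of this search (reverse, head, where) can be sketched in plain Python to show which events reach the “fit” command. The sketch assumes the search returns events newest first, which is Splunk’s default for event searches. The function name `select_training_window` and the synthetic events are hypothetical and stand in for the add-on’s jet engine data.

```python
def select_training_window(events_newest_first, head=259, min_cycle=0, max_cycle=173):
    """Mimic `| reverse | head 259 | where unit_cycle>0 AND unit_cycle<173`."""
    oldest_first = list(reversed(events_newest_first))  # | reverse
    window = oldest_first[:head]                        # | head 259
    # | where unit_cycle>0 AND unit_cycle<173 (both bounds exclusive)
    return [e for e in window if min_cycle < e["unit_cycle"] < max_cycle]

# Synthetic events with unit_cycle 300 down to 0, i.e. newest first.
events = [{"unit_cycle": cycle} for cycle in range(300, -1, -1)]

training = select_training_window(events)
print(len(training), training[0]["unit_cycle"], training[-1]["unit_cycle"])
```

With real data, “unit_cycle” restarts after each maintenance, so the exact rows kept depend on how the first 259 events are distributed.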