Predicting faults in a chemical manufacturing process with Random Forests and Bayesian Belief Networks.

Vivek Chaudhary
7 min read · Mar 7, 2022


Introduction:

Real-time root cause analysis (RCA) is an essential tool for personnel managing any process, since it reduces the risk of failure. The difficulty of carrying out RCA in real time grows rapidly with the size of the process. In this article, we will develop a program that identifies root causes and predicts the probability of a fault in a chemical manufacturing process using Random Forests and Bayesian Belief Networks. Although we will be working with a chemical manufacturing process, the same principles of root cause analysis apply to a variety of contexts, such as supply chains, logistics, and services.

Photo by Dimitry Anikin on Unsplash

Background:

The process we will be analyzing is the Tennessee Eastman Process (TEP), a real industrial process that was modeled computationally in 1993 by Downs and Vogel. The TEP comprises 8 chemical components in total: 4 reactants, 2 products, 1 by-product, and 1 inert component. These components are processed by 5 main process units: a reactor, where the gaseous feed components (A, C, D, and E) react to form the liquid products (G and H); a condenser, which cools the gaseous product stream leaving the reactor; a gas-liquid separator, which splits the cooled product stream into gas and liquid components; a centrifugal compressor, which recycles the gas stream back into the reactor; and a stripper, which efficiently separates the 2 products from any unreacted feed components. There is also a purge stream to remove the inert component (B) and the by-product (F) from the system. An overview of the industrial process is given below.

Process diagram of the Tennessee Eastman Process
Chemical reactions in the Tennessee Eastman Process

From this entire process, we can get over 50 different variables that record properties of the system such as flow rates, pressures, temperatures, levels, mole fractions, and compressor power output. Over 10 of these are manipulated variables (flow rates, valve positions, and the reactor agitator speed) which the operator can control to keep the chemical process operating under control. At the time of writing, one of the most prevalent sources of TEP data comes from here, which is the dataset referenced in Rieth et al. (2017). In this dataset, process variables are sampled every 3 minutes, for 25 hours in the training dataset and 48 hours in the testing dataset. The variables are tabulated as follows:

Description of variables

Now, one of the reasons this dataset is so widely used for benchmarking anomaly detection algorithms is that it contains both ‘fault-free’ and ‘faulty’ data. The former corresponds to the values of the TEP under normal operation, while the latter contains 20 different simulated process faults. In principle, a good anomaly detection algorithm should raise no false positives on the fault-free dataset while catching as many of the 20 faults introduced in the faulty dataset as possible.
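To make this benchmark concrete, here is a minimal sketch of how the two criteria could be computed, assuming y_true and y_pred are arrays of fault labels (0 for fault-free, 1-20 for the simulated faults); the values below are purely illustrative:

import numpy as np

# Hypothetical label arrays; in practice these come from your detector
y_true = np.array([0, 0, 0, 4, 4, 13, 13, 0])
y_pred = np.array([0, 0, 4, 4, 4, 13, 0, 0])

# False alarm rate: fraction of fault-free samples flagged as faulty
fault_free = y_true == 0
false_alarm_rate = np.mean(y_pred[fault_free] != 0)

# Detection rate: fraction of faulty samples flagged as faulty (any fault)
faulty = y_true != 0
detection_rate = np.mean(y_pred[faulty] != 0)

print(f"False alarm rate: {false_alarm_rate:.2%}")
print(f"Detection rate: {detection_rate:.2%}")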

The Solution:

In this solution, we will use a Random Forest to identify the 10 features that are most important for predicting a fault. We will then use these top 10 features to construct a Bayesian Belief Network, which will calculate the conditional probability of a fault in the system given the values of the features. The solution design is as follows:

Solution Design

Simulation Data:

The simulation data is available at this location in the RData format. It can be loaded into Python using the following lines of code:

import pyreadr as py
import pandas as pd

# Reading train data in .RData format
a1 = py.read_r("TEP_FaultFree_Training.RData")
a2 = py.read_r("TEP_Faulty_Training.RData")
# Reading test data in .RData format
a3 = py.read_r("TEP_FaultFree_Testing.RData")
a4 = py.read_r("TEP_Faulty_Testing.RData")

# See the objects in each RData file
print("Objects that are present in a1:", a1.keys())
print("Objects that are present in a2:", a2.keys())
print("Objects that are present in a3:", a3.keys())
print("Objects that are present in a4:", a4.keys())

# Extracting each object as a pandas dataframe
# Train data
b1 = a1['fault_free_training']
b2 = a2['faulty_training']
# Test data
b3 = a3['fault_free_testing']
b4 = a4['faulty_testing']
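As a quick sanity check, we can inspect the loaded dataframes; the column names shown in the comments assume the layout of the Rieth et al. (2017) dataset:

# Quick sanity check on the loaded dataframes
print(b1.shape, b2.shape)          # fault-free vs. faulty training runs
print(b1.columns[:5].tolist())     # e.g. faultNumber, simulationRun, sample, ...
print(b2['faultNumber'].unique())  # the 20 simulated fault classes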

Random Forest Model:

To extract the most important features for predicting a fault in the process, we will be using the scikit-learn library. The training data is assembled, normalized, and fitted to a Random Forest classifier using the following lines of code (the column names assume the layout of the Rieth et al. dataset):

# Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Assembling the training set from the fault-free and faulty runs
train = pd.concat([b1, b2], ignore_index=True)
y_train = train['faultNumber']  # target: fault class (0 = fault-free)
tr = train.drop(columns=['faultNumber', 'simulationRun', 'sample'])
train_norm = (tr - tr.mean()) / tr.std()  # normalize the process variables
# Fitting the data onto a Random Forest classifier
clf = RandomForestClassifier(n_jobs=-1, n_estimators=500, max_depth=35)
clf.fit(train_norm, y_train)
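Before relying on the feature importances, it is worth checking that the classifier generalizes to unseen runs. A minimal sketch, assuming the test dataframes b3 and b4 follow the same layout and are normalized with the training statistics:

# Preparing the test set with the same columns and normalization statistics
test = pd.concat([b3, b4], ignore_index=True)
y_test = test['faultNumber']
te = test.drop(columns=['faultNumber', 'simulationRun', 'sample'])
test_norm = (te - tr.mean()) / tr.std()  # reuse the training statistics
print("Test accuracy:", clf.score(test_norm, y_test))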

The most important features of the model are extracted using the following lines of code:

importances = pd.DataFrame({'Importance': clf.feature_importances_},
                           index=tr.columns).sort_values(by='Importance', ascending=False)

The features in descending order of importance are as follows:

Most important features in random forest classifier
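The ranking is easier to digest visually. Here is a quick bar chart of the top 10 importances (a sketch using seaborn defaults, not the exact figure above):

import seaborn as sns
import matplotlib.pyplot as plt

# Plot the 10 most important features as a horizontal bar chart
top10 = importances.head(10)
sns.barplot(x=top10['Importance'], y=top10.index)
plt.title('Top 10 feature importances')
plt.tight_layout()
plt.show()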

For making the Bayesian belief network, we will be considering the top 10 most important features which are as follows:

Most important features with descriptions

Bayesian Belief Network:

To create the Bayesian Belief Network, we will be using the bnlearn library. First, we will create an adjacency list over the top 10 most important features extracted in the previous step, based on the process flow diagram, and store it in a list called “edges”. Then we will build the DAG using the following lines of code:

# Importing the required libraries
import pandas as pd
import bnlearn as bn
edges = [('A feed flow valve', 'A feed stream'),
('A feed stream', 'Reactor temp'),
('Reactor cooling water flow valve', 'Reactor cooling water outlet temp'),
('Reactor temp', 'Reactor cooling water outlet temp'),
('Reactor temp', 'Condenser cooling water outlet temp'),
('Reactor cooling water outlet temp', 'Fault'),
('Stripper temperature', 'Fault'),
('Total feed flow stripper valve', 'Stripper temperature'),
('Stripper steam valve', 'Stripper temperature'),
('Stripper steam flow', 'Stripper steam valve')]
# Create the DAG
DAG = bn.make_DAG(edges)
# Plot the DAG.
bn.plot(DAG, interactive=True)

The resulting Bayesian Belief Network looks like this:

Bayesian Belief Network created by bnlearn
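Parameter learning in bnlearn expects a discrete dataframe whose columns match the node names in the DAG. Before fitting, we therefore need a dataframe df containing the 10 selected features (renamed to the node names above) plus a Fault label, with the continuous process variables binned into discrete states. A minimal sketch, assuming train is the combined training dataframe; the column mapping shown is partial and illustrative:

# Illustrative mapping from dataset columns to the node names used in `edges`
col_map = {'xmv_3': 'A feed flow valve',
           'xmeas_1': 'A feed stream',
           'xmeas_9': 'Reactor temp'}  # ...and so on for the remaining features

df = train[list(col_map)].rename(columns=col_map)
# Bin each continuous variable into 3 discrete states (0 = low, 1 = medium, 2 = high)
for col in df.columns:
    df[col] = pd.cut(df[col], bins=3, labels=False)
df['Fault'] = (train['faultNumber'] != 0).astype(int)  # 1 = faulty, 0 = fault-free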

In the final step, we will learn the parameters of the network using a maximum likelihood estimator from the bnlearn library. The following lines of code take the values of the features and return the probability of a fault in the system:

# Parameter learning on the user-defined DAG and input data using maximum likelihood
model_mle = bn.parameter_learning.fit(DAG, df, methodtype='maximumlikelihood')
# Print the learned CPDs
bn.print_CPD(model_mle)
# Evidence for 2 out of the 10 features is entered in the dictionary;
# values for all 10 variables can be added to make predictions
query = bn.inference.fit(model_mle, variables=['Fault'],
                         evidence={'Reactor temp': 1, 'Stripper steam valve': 2})
result_df = query.df
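Each row of the returned dataframe pairs a state of Fault with its conditional probability given the evidence, so the fault probability can be read off directly (assuming the 0/1 Fault encoding sketched above):

print(result_df)  # one row per Fault state with its conditional probability
p_fault = result_df.loc[result_df['Fault'] == 1, 'p'].values[0]
print(f"Probability of fault given the evidence: {p_fault:.3f}")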

Scope for future work

A solution like this can serve as a tool for real-time root cause analysis by plant personnel when deployed as a Flask app on a suitable cloud platform.

Flask application for prediction of fault
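A minimal sketch of such an app is given below; the route name and JSON payload format are hypothetical, not taken from the original application:

from flask import Flask, request, jsonify
import bnlearn as bn

app = Flask(__name__)
# model_mle is assumed to be fitted (or loaded from disk) at startup

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body of discretized feature values,
    # e.g. {"Reactor temp": 1, "Stripper steam valve": 2}
    evidence = request.get_json()
    query = bn.inference.fit(model_mle, variables=['Fault'], evidence=evidence)
    return jsonify(query.df.to_dict(orient='records'))

if __name__ == '__main__':
    app.run()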

It is also possible to learn the directed acyclic graph itself from the data using the structure-learning routines in the bnlearn library. However, the learned Bayesian Belief Network is prone to overfitting and may not be consistent with the real-world design of the process. The probability of fault could also be estimated more accurately by including more features in the Bayesian Belief Network instead of just the top 10.
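For reference, here is a structure-learning sketch with bnlearn; hill climbing with a BIC score is shown, but the method and score choices are illustrative:

# Learn the DAG structure directly from the discretized data
model_sl = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
bn.plot(model_sl)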

Please do reach out for feedback or if you need any help with your projects :)

Find the complete code in this GitHub Repository

Connect with me on Linkedin
