Equipment Failure Prediction

Dec 15, 2020 · 21 min read

Table of Contents:

1) Overview
2) Business Problem in machine learning terms
3) Metric to be used
4) Data Downloaded from
5) Existing approaches to the problem
6) My approach
7) EDA with observations
8) Exploratory Data Analysis
9) Performing a similar EDA on the outliers
10) Model Training
11) Comparison of all the models in a tabular format.
12) Future Work
13) The web application
14) Reference Links
15) Profile

1) Overview :

This is a case study based on ConocoPhillips, a multinational energy firm that funds multiple energy projects in the US. According to the company, 80% of oil wells in the US are stripper wells (oil or gas wells nearing the end of their economically useful life). Each of these wells produces a small volume, but in aggregate they account for a significant amount of oil production.
They have low operational costs and low capital intensity, which ultimately provides a steady cash flow to fund operations that need more money to get off the ground. In other words, less investment for relatively better outcomes. The company needs these low-cost wells to stay well maintained so that the cash flow remains steady.
But mechanical and electronic equipment in any field has a limited service life and breaks down over time. Repairs and replacements take a lot of money and resources and result in lost oil production.
The aim is to predict equipment failure from the sensor data so that teams are prepared to handle failures as they occur.

2) Business Problem in machine learning terms :

Given the data points and their features, use binary classification to decide whether each data point belongs to a surface failure or a downhole failure.

3) Metric to be used :

The F-beta score is the weighted harmonic mean of precision and recall, where beta sets how much more weight recall gets than precision. A perfect model has an F-beta score of 1.

Precision =TP/(TP+FP)
Recall =TP/(TP+FN)

Formula :

F_beta = ((1 + beta^2) * Precision * Recall) / ((beta^2 * Precision) + Recall)

What is the difference between F0.5 and F2?

a) F0.5 Score :

Here beta = 0.5, so beta^2 = 0.25 and the precision term in the denominator is cut to a quarter of its value (0.25 * Precision + Recall). Written as a weighted harmonic mean, this puts a weight of 1/(1 + 0.25) = 0.8 on 1/Precision and only 0.2 on 1/Recall, so a drop in precision (more false positives) hurts the score far more than the same drop in recall. The F0.5 score therefore gives more weight to precision, the component that contains the false positives.

Precision and recall differ only in their denominators: precision counts the false positives and recall counts the false negatives.

b) F2 Score :

Here beta = 2, so beta^2 = 4 and the precision term in the denominator is quadrupled (4 * Precision + Recall). Written as a weighted harmonic mean, this puts a weight of 4/(1 + 4) = 0.8 on 1/Recall and only 0.2 on 1/Precision, so a drop in recall (more false negatives) hurts the score far more than the same drop in precision. The F2 score therefore gives more weight to recall, the component that contains the false negatives.

So if I take downhole failures as my positive class, I do not want to mistake a downhole failure for a surface failure. Such a mistake is a false negative, so I want as few false negatives as possible, and for that reason I use the F2 score.

In this case my priority is the prevention of downhole failures.
(They are more expensive, have more impact when they fail, are hazardous, are less accessible for repair, are difficult to handle, and are far fewer in number.)
This means I should not confuse a downhole failure with a surface failure.
Accordingly, as the data is imbalanced, I will use the F2 score to give priority to the downhole failures.
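To make the effect of beta concrete, here is a small sketch that computes the F-beta score directly from confusion-matrix counts. The function name f_beta is my own; the counts are taken from the K-Nearest Neighbour confusion matrix reported in the Model Training section of this post.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from confusion-matrix counts.

    beta > 1 weights recall more, beta < 1 weights precision more.
    Equivalent to (1 + beta^2) * P * R / (beta^2 * P + R).
    """
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts from the K-Nearest Neighbour confusion matrix (positive class = downhole failure)
tp, fp, fn = 191, 230, 9

f2 = f_beta(tp, fp, fn, beta=2)
f_half = f_beta(tp, fp, fn, beta=0.5)

print("F2   :", f2)      # matches the f2_score reported for KNN
print("F0.5 :", f_half)  # much lower, because this model has many false positives
```

With only 9 false negatives but 230 false positives, the same predictions look acceptable under F2 and poor under F0.5, which is exactly the trade-off described above.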

4) Data Downloaded from :

https://www.kaggle.com/c/equipfailstest

5) Existing approaches to the problem :

This is a Kaggle problem, so there have been various attempts at solving it.

1. The machine learning approaches have included upsampling/balancing the dataset, imputation, passing the datasets through various machine learning models/MLPs, and using the model that gives the best prediction.

2. For the non machine learning methodology, I came across a research paper:

Failure Analysis of the Offshore Process Component Considering Causation Dependence by Samir M. Deyab, Mohammed Taleb-Berrouane, Faisal Khan, Ming Yang

Link :

https://www.researchgate.net/publication/321137529_Failure_analysis_of_the_offshore_process_component_considering_causation_dependence

This research paper includes the below proposed methodology :

Step 1: Data collection : Initial hazard identification like fire.

Step 2: Probabilistic analysis : Probabilistic failure analysis was performed (using Bayesian networks) by identifying dependencies between the root causes, linking the scenarios’ elements, and building model scenarios by asserting conditional probabilities of failure events. It aims at providing accurate analysis where the elements are interconnected in a conditional way.

Step 3: Sensitivity analysis of root causes : The sensitivity analysis is performed both for the case of dependency between the root causes and for the case of independence. To find out which of the basic events has more impact on the undesired event, a comparative study is performed based on the generated data.
The equation below is used for the sensitivity analysis calculations:
Percentage change = ((Posterior probability - Prior probability) / Prior probability) × 100

Step 4: Application of methodology to offshore processing system units:

a) Both have events divided into basic and intermediate events.
b) The basic events comprise major failure events.
c) The probability of an intermediate event is based on the conditional states of the basic events.
d) Formulation of the Bayesian networks is based on the conditional probabilities table (CPT). This table consists of both basic and intermediate events and their failure frequencies per year.

Step 5: Results and Discussion : The results contain the fully formed Bayesian network and sensitivity graphs for both the dependent and independent cases, mapping out how the probability of a hazard changes with the probability of a basic/intermediate event.

6) My approach :

It includes passing the data through a series of modifications. I pass my data set through zero-value imputation, mean-value imputation, and class-wise median-value imputation, and observe which graphs show the most variation (keeping in mind the effect of outliers on mean imputation). I then perform Truncated SVD to find the minimum number of features that gives a variance similar to that of 100% of the features, Recursive Feature Elimination to rank the features so that I keep the most important ones and discard those that are not that useful, and the Spearman correlation coefficient to keep only features that are not too strongly correlated.

The data set is highly imbalanced and all the features are numerical, so I then pass it through the following class balancing methods:
1) Adasyn (oversampling).
2) OneSidedSelection (undersampling).
3) SmoteTomek (combination of oversampling and undersampling).

I pass the results through multiple machine learning models, an MLP, and a custom model to see which combination of data and model gives the best results.
I then select that model with its set of hyperparameters and its dataset, run the same preprocessing on my test data, and run prediction using the best model to get my prediction results and find the F2 score.

7) EDA with observations :

Here I only show the observations from the median-imputed data with the smotetomek class balancing method applied to it (as this got me the best scores out of a lot of combinations of data sets and machine learning models).

A) Initial Preprocessing :

1. First I replace “na” values with np.NaN so that the median can be calculated and used as the replacement.

2. Then I divide the data into train and test.

3. Then I impute the median, computed on train, in place of np.NaN for both train and test.

4. Standardize.
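The four steps above can be sketched as follows. This is a minimal illustration with pandas and scikit-learn, not the exact code behind this post: the tiny frame and the column names target, sensor1 and sensor2 are made up for the example, and the median here is computed over the whole train split rather than class-wise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up frame standing in for the Kaggle CSV (column names are assumptions)
raw = pd.DataFrame({
    "target":  [0, 0, 0, 0, 0, 0, 1, 1],
    "sensor1": ["1.0", "na", "3.0", "4.0", "5.0", "na", "7.0", "8.0"],
    "sensor2": ["10", "20", "na", "40", "50", "60", "70", "na"],
})

# 1. Turn the "na" strings into real missing values and make every column numeric
data = raw.replace("na", np.nan).astype(float)
X = data.drop(columns="target")
y = data["target"]

# 2. Split before imputing so that test rows never influence the statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 3. Learn the medians on train only, then apply them to both splits
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# 4. Standardize with the mean and standard deviation of train
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_imp)
X_test_std = scaler.transform(X_test_imp)

print(X_train_std.shape, X_test_std.shape)
```

Fitting the imputer and the scaler on train only keeps the test split untouched by training statistics, which is what "as per train" in step 3 refers to.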

B) Perform Class Balancing :

It is important to perform class balancing as I have 47200 surface failures against only 800 downhole failures. I got the best F2 score with smotetomek class balancing, so here I show only smotetomek, which is a combination of oversampling the minority and undersampling the majority.

Perform smotetomek class balancing :

C) Try to reduce useless features :

Given in the dataset, there are a lot of sensor features and histogram based features , but it is better to use only the important features that help in differentiating between the 2 types of failures.

I will send the data through a series of operations to only use the best features to get the predictions from my model.

1. Perform Truncated SVD :

Truncated Singular Value Decomposition (SVD) is a matrix factorization technique that factors a matrix M into the three matrices U, Σ, and V. This is very similar to PCA, except that the factorization for SVD is done on the data matrix, whereas for PCA it is done on the covariance matrix. Here, I use TSVD to find just enough components to explain at least 99% of the variance of the data. This way I try to avoid extra features that may decrease the prediction scores.

truncated_svd(x_smotetomek,169)

I choose 120 components to describe my data, as 120 components explain at least 99% of the variance of the total data.

svd=truncated_svd(x_smotetomek,120)

This means that 99% of the variance of the 170 features can be explained by 120 components.

So I take approximately 120 features as the baseline needed to cover the variance of the 170 features.
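The truncated_svd helper called above is not shown in this post. A minimal sketch of how such a component count could be found with scikit-learn follows; components_for_variance is a hypothetical name and the data is synthetic (20 features generated from 5 underlying factors).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def components_for_variance(X, max_components, threshold=0.99):
    """Return the smallest number of SVD components whose cumulative
    explained variance ratio reaches `threshold`, plus the cumulative curve."""
    svd = TruncatedSVD(n_components=max_components, random_state=42)
    svd.fit(X)
    cumulative = np.cumsum(svd.explained_variance_ratio_)
    reached = np.nonzero(cumulative >= threshold)[0]
    n_needed = int(reached[0]) + 1 if reached.size else None
    return n_needed, cumulative


# Synthetic data: 20 observed features driven by 5 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

# As with 169 components for 170 features, fit one component fewer than the feature count
n_needed, cumulative = components_for_variance(X, max_components=X.shape[1] - 1)
print("components needed for 99% variance:", n_needed)
```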

2. Perform Recursive Feature Elimination.

Using TSVD I found how many features are enough to explain the variance of the 170 features, but which features should I actually use? I use Recursive Feature Elimination to rank the features.

I select only the features with rank 1.

This leaves 145 of the 170 features.
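A minimal sketch of selecting the rank-1 features with scikit-learn's RFE. The decision tree estimator, the synthetic data, and the choice of keeping 6 of 12 features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 12 features, only some of them informative
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# RFE repeatedly fits the estimator and removes the weakest feature (step=1)
selector = RFE(
    DecisionTreeClassifier(random_state=0), n_features_to_select=6, step=1
)
selector.fit(X, y)

# Features that survive elimination all receive rank 1
kept = [i for i, rank in enumerate(selector.ranking_) if rank == 1]
print("kept feature indices:", kept)
```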

3. Perform Spearman correlation coefficient.

Out of these 145 features, I want to keep those that are less correlated while still keeping enough features to cover the required variance. Therefore I take only the features that satisfy both conditions.

Issue faced : If I include only features with very low correlation, I am left with very few features.

Selecting the appropriate correlation threshold and number of features.

I choose to have at least 117 features. So I include the features within the .99 correlation range, because below this value the number of features keeps dropping.

One point to remember: make sure that exactly these features are the ones present in the test data as well.
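One way such a correlation filter could look, as a sketch with pandas. The function name drop_correlated and the toy columns a, b, c are hypothetical; the 0.99 threshold is the one mentioned above.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.99):
    """Drop one feature from every pair whose absolute Spearman
    correlation exceeds `threshold`."""
    corr = df.corr(method="spearman").abs()
    # Keep only the upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=200)
train = pd.DataFrame({
    "a": a,
    "b": a ** 3,                 # monotonic in a, so Spearman correlation is 1
    "c": rng.normal(size=200),   # unrelated feature
})

reduced, dropped = drop_correlated(train, threshold=0.99)
print("dropped:", dropped)

# The same column list is then applied to the test data
selected_columns = list(reduced.columns)
```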

8) Exploratory Data Analysis :

Here I did median value imputation because, when a value is missing, I do not want its replacement to be affected by the outliers.

But the data still contains outliers, and I want to see how outliers within a class affect the sensor values. Therefore, after imputation, I perform EDA class-wise on both the mean of the sensors (after median value imputation) and the median of the sensors (after median value imputation).

1) Print the mean of the imputed values for the sensors class wise (failure wise that is 0: surface failure, 1 : downhole failure).

Observation : Looking at the means of all the median-imputed features, there is a difference between the two types of failure.

If we want, we can take only those sensors whose mean/median readings differ significantly between the classes. The size of that difference can itself be a hyperparameter.

The bigger the gap, the better my chance of predicting which class a given failure belongs to.

Observation : Here we find the sensors that have the greatest differences between the median values of the two failure types, which helps in classifying them.

2) Print the median values for the sensors class wise (failure wise that is 0: surface failure , 1 : downhole failure).

Observation : Coming to the median values, we can observe that they are not as high as the mean values, and they seem to have shifted further towards the negative side.

Features with significant difference :

So we have approximately 64 sensors that show a significant difference (> .15) between the median readings of surface and downhole failures.
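A small sketch of how such a list of sensors can be obtained with pandas. The frame and the column names are made up; the 0.15 threshold is the one used above.

```python
import pandas as pd

# Made-up standardized readings: target 0 = surface failure, 1 = downhole failure
df = pd.DataFrame({
    "target":  [0, 0, 0, 1, 1, 1],
    "sensor1": [0.0, 0.1, -0.1, 1.0, 1.2, 0.9],
    "sensor2": [0.0, 0.05, 0.1, 0.1, 0.0, 0.05],
})

# Median of every sensor per class, then the absolute gap between the classes
medians = df.groupby("target").median()
gap = (medians.loc[1] - medians.loc[0]).abs()

# Sensors whose class medians differ by more than the chosen threshold
significant = gap[gap > 0.15].index.tolist()
print(significant)
```

Here sensor1 separates the classes (medians 0.0 and 1.0) while sensor2 does not (both medians are 0.05), so only sensor1 is kept.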

3) Histograms for means.

Observations :

Here, for the surface and downhole failures, we can see that the downhole feature values have more variance, whereas the majority of the surface feature values are very similar.

4) Histograms for medians

Observations :

Here, for the surface and downhole failures, we can see that the downhole feature values have more variance, whereas the majority of the surface feature values are very similar.

5) Box plots for means

Observation : Similar behaviour can be observed in the box plots: the downhole failure values are spread across a wide range, while the surface values stay close to 0.

6) Box plots for medians

Observation : For the median values, as we observed in the histograms, the values seem to have shifted more towards the negative side. This suggests that the values which were separating the classes possibly included the outliers.

7) Violin plots for means

Observation : In the violin plots we can see the density of the downhole failures spread across values from 0 to 4. The surface failures seem to be limited to values near 0. Even the outliers for the surface failures seem to be near 0.

8) Violin plots for medians.

Observation : We can see that there is more overlap for the median values than for the mean values. But these median values are closer to the median reading of a sensor than the mean values are. And I do not want to take outliers into consideration, so I prefer median imputation here.

9) Principal Component Analysis.

Now I take my smotetomek data and check whether PCA can separate the two types of failures.

9) Performing a similar EDA on the outliers :

Here we take the outliers from the features that have been selected.

For each sensor we find the outliers, again class-wise.

The issue is that there may be high outliers here, but there may be low outliers as well.

https://www.geeksforgeeks.org/interquartile-range-and-quartile-deviation-using-numpy-and-scipy/
Higher outlier = more than 1.5 ⋅ IQR above the third quartile.
Lower outlier = less than 1.5 ⋅ IQR below the first quartile.
The IQR is the difference between Q3 and Q1.
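The rule above can be written in a few lines of NumPy. This is a sketch with made-up readings; iqr_outlier_mask is a hypothetical helper name.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points more than k * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up readings of one sensor for one class
readings = np.array([0.1, 0.2, 0.15, 0.18, 0.22, 0.19, 5.0, -4.0])
mask = iqr_outlier_mask(readings)

print(readings[mask])   # one high outlier and one low outlier
```

Running this per sensor and per class gives both the high and the low outliers mentioned above.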

Total number of data points that are outliers in at least one feature : 79724 out of a total of 82598

1) Outliers for all the 117 sensors

Observation : Here we observe that the majority of the histogram features have more outliers for the surface failures than for the downhole failures.

2) Histogram for mean outliers :

Observations : For the surface failures, the majority of the outliers are concentrated around 0, with some variance as well.

For the downhole failures, the outliers have less density but more variance across the other ranges of values.

3) Histogram for median outliers:

Observation : A similar behaviour can be observed for lower values.

4) Box plot for mean outliers.

5) Box plot for median outliers :

Observation : These outliers show a very similar behaviour to the whole data set. We can observe that even the outliers for surface failures are near the 0 value whereas for downhole failures we can see that they are spread throughout the range and show more variance.

6) Violin plot for mean outliers :

7) Violin plot for median outliers :

Observation : We can observe the same behaviour in the violin plots as in the box plots. But since both failure types are at their densest in the 0–1 range, we might have trouble assigning them to the right class.

Conclusion from Exploratory Data Analysis : I can conclude that my features and sensor median values are good enough to classify the failures into their respective types, but not all the failures have a chance of being classified correctly.

10) Model Training

1. The Basic Classifier :

In this basic classifier, I take an input dataset in which each feature has two median values, one for surface failures and one for downhole failures.
For each data point, I check which of the two medians each feature value is closer to.
1) If the feature value is closer to the median of class 0, i.e. surface failure, I add a point for surface failure.
2) If the feature value is closer to the median of class 1, i.e. downhole failure, I add a point for downhole failure.
3) If the feature value is equally close to both medians, I go with the majority class and add a point for class 0.
In the end I check which has more points, surface or downhole.
Accordingly, I classify the data point into the class with the most points, i.e. the class whose medians most of its features are nearest to.
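The voting rule above can be sketched with NumPy as follows. The function names and the toy data are hypothetical, and sending a tie in the final vote count to class 0 is my assumption, since only per-feature ties are described above.

```python
import numpy as np


def fit_class_medians(X, y):
    """Per-feature medians for class 0 (surface) and class 1 (downhole)."""
    return np.median(X[y == 0], axis=0), np.median(X[y == 1], axis=0)


def predict_by_median_vote(X, med0, med1):
    dist0 = np.abs(X - med0)
    dist1 = np.abs(X - med1)
    # A feature votes for downhole only if it is strictly closer to that median;
    # per-feature ties go to class 0, the majority class
    votes1 = (dist1 < dist0).sum(axis=1)
    votes0 = X.shape[1] - votes1
    # Assumption: a tie in the overall vote count also goes to class 0
    return (votes1 > votes0).astype(int)


X_train = np.array([[0.0, 0.0, 0.0],
                    [0.2, 0.1, 0.0],
                    [1.0, 1.0, 1.0],
                    [1.2, 0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
med0, med1 = fit_class_medians(X_train, y_train)

X_new = np.array([[0.1, 0.0, 0.2],    # all three features sit near the surface medians
                  [1.0, 1.0, 0.1]])   # two of three features sit near the downhole medians
preds = predict_by_median_vote(X_new, med0, med1)
print(preds)
```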

2. Approach :

My first cut approach was to try most of the combinations of data sets with different class balancing methods and feed them to a series of machine learning models, an MLP, and a custom model. But here I show only the smotetomek balanced data, as this got me the best results.

3. Explanation for all the attempted models:

I gave three types of datasets (Median imputed dataset, median imputed Adasyn and median imputed Smotetomek) to each of the below models in order to find out which of the model can give the best result. But here too I only show results for smotetomek. I am not showing the results for Adasyn and OSS.

Also, I want more data to train my models, so I use GridSearch/RandomSearch with cross-validation instead of explicitly setting aside separate train and cross-validation data, because an explicitly held-out CV set would never be used to train my model.
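A minimal sketch of a grid search that optimizes the F2 score with scikit-learn. The decision tree, the parameter grid, and the synthetic data are placeholders and not the configurations used for the results below.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the balanced train set
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Scorer that makes the search pick the hyperparameters with the best F2
f2_scorer = make_scorer(fbeta_score, beta=2)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring=f2_scorer,
    cv=5,          # every fold is used for both training and validation in turn
)
search.fit(X, y)   # refit=True (the default) retrains the best model on all of X

print(search.best_params_, search.best_score_)
```

Because of the default refit, the final model is trained on the whole train set, which is the point made above about not losing data to a fixed validation split.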

1. K-Nearest Neighbour

f2_score :  0.7821457821457822
confusion_matrix : 
 Predicted    0.0  1.0
True                 
0.0        11570  230
1.0            9  191

2. Logistic Regression ( sklearn.linear_model.LogisticRegression)

Logistic Regression in Sklearn doesn’t have a ‘sgd’ solver.
It implements regularized logistic regression: it minimizes the regularized log loss (negative log-likelihood).
SGDClassifier is a linear classifier that uses Stochastic Gradient Descent as its solver.
With SGDClassifier you can use many different loss functions (the function that is minimized to find the optimum solution).
The SGD classifier can have the same loss function as Logistic Regression but a different solver.

f2_score :  0.739549839228296
confusion_matrix : 
 Predicted    0.0  1.0
True                 
0.0        11540  260
1.0           16  184

3. Logistic Regression (SGDClassifier)

f2_score :   0.6642066420664208
confusion_matrix : 
Predicted    0.0  1.0
True                 
0.0        11425  375
1.0           20  180

4. Support Vector Machine

f2_score :  0.6499636891793755
confusion_matrix : 
 Predicted    0.0  1.0
True                 
0.0        11402  398
1.0           21  179

5. Decision Trees

f2_score :  0.855899419729207
confusion_matrix : 
 Predicted    0.0  1.0
True                 
0.0        11743   57
1.0           23  177

6. Random Forest

f2_score :  0.9289883268482491
confusion_matrix :
Predicted    0.0  1.0
True
0.0        11763   37
1.0            9  191

7. GBDT(XGBClassifier)

f2_score :  0.9720837487537389
confusion_matrix :
Predicted    0.0  1.0
True
0.0        11792    8
1.0            5  195

8. Adaboost

f2_score :  0.9643916913946587
confusion_matrix : 
Predicted    0.0  1.0
True                 
0.0        11784   16
1.0            5  195

9. Multi Layer Perceptron : Here, I take a simple MLP model with different units in the hidden layers with different dropouts.


Run the model now with the given configurations :

Output: 
No Model yet.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 100)               11800     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 20)                1020      
_________________________________________________________________
dropout_2 (Dropout)          (None, 20)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                210       
_________________________________________________________________
dropout_3 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 5)                 55        
_________________________________________________________________
dropout_4 (Dropout)          (None, 5)                 0         
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 12        
=================================================================
Total params: 18,147
Trainable params: 18,147
Non-trainable params: 0
_________________________________________________________________
type(validation_data) <class 'tuple'>
validation_data[0] (16520, 117)
validation_data[1] (16520, 2)
Epoch : 0                        val_f2: 0.937062
Epoch 000: saving weights only to smotetomek_data_checkpoint_100_0.5_50_0.4_20_0.3_10_0.2_5_0.1_None_0.0001/checkpoint_20201112-031547_Epoch_000


Epoch : 1                        val_f2: 0.945431
Epoch 001: saving weights only to smotetomek_data_checkpoint_100_0.5_50_0.4_20_0.3_10_0.2_5_0.1_None_0.0001/checkpoint_20201112-031547_Epoch_001


Epoch : 2                        val_f2: 0.952316
Epoch 002: saving weights only to smotetomek_data_checkpoint_100_0.5_50_0.4_20_0.3_10_0.2_5_0.1_None_0.0001/checkpoint_20201112-031547_Epoch_002

Epoch : 39                       val_f2: 0.999125
Epoch 039: saving weights only to smotetomek_data_checkpoint_200_0.1_300_0.2_400_0.3_500_0.4_600_0.5_elu_0.01/checkpoint_20201112-043719_Epoch_039

Load the model from the checkpoint with the best weights.

confusion_matrix : 
Predicted    0.0  1.0
True                 
0.0        11778   22
1.0           17  183
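The parameter counts in the model summary above can be checked by hand: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases, and Dropout layers add no parameters. The input width of 117 is taken from the validation_data[0] shape in the log.

```python
# Layer widths read off the model summary: 117 inputs, then the Dense layers
layer_sizes = [117, 100, 50, 20, 10, 5, 2]

# weights (n_in * n_out) + biases (n_out) for every consecutive pair of layers
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(params)       # per-layer counts, matching the "Param #" column
print(sum(params))  # matches "Total params" in the summary
```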

10. Custom Model :

Here I try to build customised features from the data that I have, using the train data not only to create these features but also to train the main model that makes the prediction. I take a Decision Tree as my base model.

I split my train data into D1 and D2.

I pass D1 through k models (each model is a separate decision tree) to create k features.

My hyperparameters :
1) The batch size (how many data points from D1 to keep in each batch, sampled with replacement).
2) The number of features, i.e. the number of models (k).

I save the scores, parameters and models, and keep only the top 10 scoring configurations (found using GridSearchCV) for creating the features from the D2 batch.

Now I take these models and run the D2 dataset through them, which gives me my D2 data in the form of these new features.

I also pass the test set through each of the top 10 sets of models that were trained on the D1 train data, so that I get “model_count” predictions per data point, just as for the D2 train data.

Now, with the featurized D2_new data set and the D2_Y values, I train a new meta model for each of the 10 configurations and select the top meta model.

The top meta model is the one with the best CV score, and its configuration decides how the main test data is featurized.

Finally, I featurize the test dataset according to that top combination of hyperparameters and pass it to the meta_model to get the final prediction.
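The flow described above, for a single configuration, can be sketched like this. It is a simplified illustration on synthetic data: it skips the search over the top 10 configurations, uses a decision tree as the meta model (an assumption), and treats the batch size of 80 as a row count, which is also an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Split train into D1 (trains the base models) and D2 (trains the meta model)
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0
)

k = 7            # number of base models = number of new features
batch_size = 80  # rows drawn from D1 with replacement for each base model

base_models = []
for _ in range(k):
    idx = rng.choice(len(X_d1), size=batch_size, replace=True)
    tree = DecisionTreeClassifier(random_state=0).fit(X_d1[idx], y_d1[idx])
    base_models.append(tree)


def featurize(models, X):
    """One column per base model: its prediction for every row of X."""
    return np.column_stack([m.predict(X) for m in models])


# The meta model learns from the base-model predictions on D2
meta_model = DecisionTreeClassifier(max_depth=3, random_state=0)
meta_model.fit(featurize(base_models, X_d2), y_d2)

# The test set goes through the same base models before the meta model sees it
y_pred = meta_model.predict(featurize(base_models, X_test))
print(y_pred[:10])
```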

number of models 7, batch_size in 80
f2_score :  0.8737864077669903
confusion_matrix : 
 Predicted    0.0  1.0
True                 
0.0        11750   50
1.0           20  180

11) Comparison of all the models in a tabular format :
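As a cross-check, the F2 scores can be recomputed from the confusion matrices reported in the Model Training section and sorted into one table. The counts below are copied from those matrices in the order (TN, FP, FN, TP); the F2 score for the MLP is not printed above and is derived here from its confusion matrix. With beta = 2 the score reduces to 5*TP / (5*TP + 4*FN + FP).

```python
def f2_from_counts(tp, fp, fn):
    """F2 score from confusion-matrix counts (beta = 2)."""
    return 5 * tp / (5 * tp + 4 * fn + fp)


# (TN, FP, FN, TP) copied from the confusion matrices in section 10
results = {
    "K-Nearest Neighbour":                 (11570, 230, 9, 191),
    "Logistic Regression":                 (11540, 260, 16, 184),
    "Logistic Regression (SGDClassifier)": (11425, 375, 20, 180),
    "Support Vector Machine":              (11402, 398, 21, 179),
    "Decision Trees":                      (11743, 57, 23, 177),
    "Random Forest":                       (11763, 37, 9, 191),
    "GBDT (XGBClassifier)":                (11792, 8, 5, 195),
    "Adaboost":                            (11784, 16, 5, 195),
    "Multi Layer Perceptron":              (11778, 22, 17, 183),
    "Custom Model":                        (11750, 50, 20, 180),
}

rows = sorted(
    ((name, f2_from_counts(tp, fp, fn)) for name, (tn, fp, fn, tp) in results.items()),
    key=lambda row: row[1],
    reverse=True,
)

print(f"{'Model':<38}{'F2 score':>10}")
for name, score in rows:
    print(f"{name:<38}{score:>10.4f}")
```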

12) Future Work :

In spite of the smotetomek data balancing method, there are a few data points that our best model was not able to classify. That may be due to insufficient training or not enough data. We have very few of the downhole failures that we are trying to prevent. The class balancing methods that were used were Adasyn (upsampling), Smotetomek (upsampling and downsampling), and OSS (downsampling).

1) But I did not use GANs to train on the minority class (downhole failure) and generate more downhole data. Therefore one method would be to use GANs to generate enough minority data points for the model to learn better, as GANs are known for generating realistic synthetic data that resembles the original data without copying it.

2) In my MLP, each hidden layer neuron has weights that change a little when trained on class 0 and then change again when trained on class 1. So there is a chance that, when the minority class does not get enough training because it has fewer data points, the weights lean towards the majority class. Whatever the MLP had learned about the minority class may then have little impact, as the same weights are modified more by the majority class and less by the minority class.

So if I have 2 MLPs and train one exclusively on the majority class and the other exclusively on the minority class, the weights for each class are preserved in its own MLP, instead of the same weights being changed a little by the minority class and more by the majority class when there is only one MLP.

Therefore both classes, after being balanced using any of the methods (e.g. GANs), could be sent to 2 different MLPs: the surface failure class to one MLP and the downhole failure class to another.

During the prediction phase, each data point, irrespective of class, is sent to both MLPs. For each data point I find the probability of it belonging to each MLP's class, check which probability is higher, and give out the result accordingly.

Here I am assuming that both MLPs have been trained on their respective failure classes, and that for a data point that is a downhole failure, the downhole-trained MLP gives a higher probability of it belonging to the downhole class than the surface-trained MLP gives of it belonging to the surface class.

(We can think of this as two people specializing in two different types of failures and both predicting the probability of their respective types of failures to occur for a particular data point. Again I am assuming that if a data point is a downhole failure then I get a higher probability of it being a downhole failure by the downhole trained MLP.)

Taking the class with the higher probability score could give me a better result.

13) The web application :

This application takes in the test data with target data or test data only.

When only test data is given then predictions are shown along with any issues in the data.
When the test data is given along with the target data , then predictions are shown along with the target output and the F2 score that we are trying to find. Any issues that are present within the data are also shown.

Demo Link :

App Demo for Equipment Failure Prediction.mp4

App Link : https://failure-prediction.herokuapp.com/

14) Reference Links:

Dataset : https://www.kaggle.com/c/equipfails/
Research paper : https://www.researchgate.net/publication/321137529_Failure_analysis_of_the_offshore_process_component_considering_causation_dependence
https://www.appliedroots.com/
https://towardsdatascience.com/implementing-macro-f1-score-in-keras-what-not-to-do-e9f1aa04029d
https://medium.com/@thongonary/how-to-compute-f1-score-for-each-epoch-in-keras-a1acd17715a2
https://lerner.co.il/2014/10/14/python-attributes/
https://keras.io/guides/writing_your_own_callbacks/
https://keras.io/api/callbacks/#create-a-callback
https://keras.io/api/callbacks/base_callback/
https://research.aimultiple.com/synthetic-data-generation/
https://medium.com/datadriveninvestor/creating-a-custom-data-generator-in-keras-534fd3098a58
https://www.quora.com/profile/Fred-Feinberg
https://drillers.com/what-is-a-stripper-well/
https://en.wikipedia.org/wiki/Plunger_pump
https://production-technology.org/beam-pumping-unit/
https://www.youtube.com/watch?v=3y_LZq_yzzA&feature=emb_rel_end
https://www.quora.com/profile/Brad-Heers
https://www.youtube.com/watch?v=I74PMMZUgfc
https://www.glossary.oilfield.slb.com/en/Terms/f/fluid_pound.aspx
www.downholediagnostic.com
https://www.slideshare.net/RamezMaher/managing-downhole-failures-in-a-rod-pumped-well
https://www.thesciencethinkers.com/what-is-fracking-facts-about-fracking/
https://www.bbc.com/news/uk-14432401
https://sensing.honeywell.com/honeywell-sensors-switches-oil-rig-application-note-000756–4-en.pdf
https://www.hydraulicspneumatics.com/fluid-power-basics/sensors/article/21883925/fundamentals-of-pressure-transducers
https://www.youtube.com/watch?v=UZLiLRlJzbU&feature=youtu.be
https://www.stellartech.com/sti-products/oil-gas/
https://www.bandaslawfirm.com/personal-injury/work-injuries/oil-accidents/equipment-failure/
https://www.geeksforgeeks.org/interquartile-range-and-quartile-deviation-using-numpy-and-scipy/
https://stackoverflow.com/questions/43961225/sgdclassifier-vs-logisticregression-with-sgd-solver-in-scikit-learn-library

15) Profile:

Github : https://github.com/Mohit-eng-creator/Failure-Prediction-Theory
LinkedIn : https://www.linkedin.com/in/mohit-b199571b4/