This article outlines a machine learning approach to detect and diagnose anomalies in the context of machine maintenance. Before diving into the approach, we briefly review machine maintenance itself. The article is organized as follows:
- Introduction to machine maintenance
- What is predictive maintenance?
- Approaches for machine diagnosis
- Machine diagnosis using machine learning
1) Introduction to Machine Maintenance
For any industrial machinery, owners want to increase operational flexibility and reduce operating costs. To achieve this objective, system engineers mainly focus on 3 attributes of the machinery.
- Reliability (R): It is defined as the probability of a machine or machine component operating as expected without failure for a given period of time. The commonly used metric for it is “Mean time between failures”, i.e. “Total operating time” / “Total failures”
- Maintainability (M): It is defined as the probability that a machine or machine component can be repaired within a specified period of time. The commonly used metric for it is “Mean time to repair”, i.e. “Total downtime” / “Total outages”
- Availability (A): It is defined as the probability that a machine or machine component is functional at a given point in time. Availability depends on reliability and maintainability, defined as “Total operating time”/(“Total operating time” + “Total downtime”)
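The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original case study: the function names and the yearly figures (8000 operating hours, 200 downtime hours, 4 failures) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mtbf(total_operating_hours: float, total_failures: int) -> float:
    """Mean time between failures = total operating time / total failures."""
    return total_operating_hours / total_failures


def mttr(total_downtime_hours: float, total_outages: int) -> float:
    """Mean time to repair = total downtime / total outages."""
    return total_downtime_hours / total_outages


def availability(total_operating_hours: float, total_downtime_hours: float) -> float:
    """Availability = operating time / (operating time + downtime)."""
    return total_operating_hours / (total_operating_hours + total_downtime_hours)


# Hypothetical figures for one machine over a year (illustrative only).
operating_hours = 8000.0
downtime_hours = 200.0
failures = 4

print(mtbf(operating_hours, failures))   # 2000.0 hours between failures
print(mttr(downtime_hours, failures))    # 50.0 hours per repair
print(round(availability(operating_hours, downtime_hours), 3))  # 0.976
```

With these made-up numbers, improving either reliability (fewer failures) or maintainability (shorter repairs) raises availability, which is why the three attributes are usually considered together.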
A good maintenance strategy significantly improves the reliability and availability of assets and, as a result, decreases the number of unpredicted breakdowns. With advancements in technology, maintenance strategies have also evolved over time, as summarized in Table 1.
| ||Breakdown Maintenance or Run to Failure||Preventive Maintenance or Scheduled Maintenance||Predictive Maintenance or Condition-based Maintenance|
|Definition||Maintenance actions are taken only after a breakdown has occurred.||Planned maintenance actions are taken at specific time intervals.||Maintenance actions are taken according to the actual condition of the operating equipment, assessed through condition-monitoring procedures.|
| ||Fail and fix||Scheduled at regular intervals|| |
|Pros||Cost-effective for small, non-critical equipment||A proactive strategy that helps to minimize downtime and prevent costly repairs caused by secondary damage.||Reduces maintenance costs, downtime, and secondary damage, and avoids unnecessary parts replacement|
|Cons||Costly downtime, extensive secondary damage||The cost of maintenance is very high. Unplanned breakdowns can still occur.||None|
Table 1: Maintenance strategies (evolved with time in the order from left to right)
In this article, we will be talking about a machine learning approach that aligns with the predictive maintenance strategy. Hence, let’s understand “what is predictive maintenance?” before getting into the actual approach.
2) What is Predictive Maintenance?
Predictive maintenance, also known as condition-based maintenance (CBM), schedules maintenance based on the actual condition of the machine and its components. CBM suggests a maintenance action only when there is evidence of abnormal behaviour from a component.
CBM relies heavily on diagnostic (what is the current condition?) and prognostic (what will the condition be in the future?) information from the machine and its components. The two serve different objectives, as shown in Table 2.
| ||Diagnostic (What is the current condition?)||Prognostic (What will the condition be in the future?)|
|Definition||Diagnostics is the process of determining the current health status and the equipment deterioration using information delivered by the condition-monitoring system.||Prognostics is the ability to forecast the machine deterioration using information gathered from the machine and its components (such as vibrations, change in temperature, change in pressure, current consumption, etc.).|
| ||i. fault detection: indicates that a fault is about to happen; ii. fault isolation: locates the faulty component; iii. fault identification: determines the root cause of the fault||i. forecasting impending failures; ii. estimating the remaining useful life|
Table 2: The success of CBM relies on both diagnostic and prognostic ability
In this article, we address the diagnosis process of the CBM approach, which includes anomaly detection, isolation, and identification, to assist root cause analysis and maintenance planning.
3) Approaches for machine diagnosis
The first goal of diagnostics is to identify the malfunctioning components. The real need for diagnostics arises when observations from an operating machine differ from the expected behaviour. There are many approaches to diagnostics; a few commonly used ones are listed in Table 3.
|#||Diagnostic approach||Description||Limitations / Disadvantages|
|1||Fault tree analysis||Fault tree analysis is a top-down approach that was originally developed at Bell Laboratories in 1962. It uses predefined logic to identify the component-level failures that lead to a system-level failure.||1) Needs a lot of domain expertise; 2) Expensive to build and maintain|
|2||Rule based||As the name suggests, knowledge for diagnosis is captured in the form of IF-THEN rules. Rule-based systems are built with the help of expert diagnosticians to capture associations between the symptoms of an abnormal system and the underlying failures/faults.||1) Needs a lot of domain expertise; 2) Expensive to build and maintain|
|3||Model based||A machine learning based model learns how the system components are connected and how they normally behave. The model is then tasked with identifying those machine components which, when assumed to function abnormally, account for the difference between the observed and expected behaviour.||1) Needs high computing resources; 2) Requires a good amount of historical data|
Table 3: Commonly used approaches to do machine diagnostics
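To make the rule-based row of Table 3 concrete, here is a minimal sketch of IF-THEN diagnosis in Python. The sensor names, thresholds, and diagnoses are hypothetical and purely illustrative; a real rule base would be written with expert diagnosticians for the specific machine.

```python
from typing import Callable, Dict, List, Tuple

Reading = Dict[str, float]
Rule = Tuple[Callable[[Reading], bool], str]

# Hypothetical IF-THEN rules; sensor names and thresholds are illustrative only.
RULES: List[Rule] = [
    (lambda r: r["bearing_temp_c"] > 95.0, "possible bearing overheating"),
    (lambda r: r["vibration_mm_s"] > 7.0, "possible rotor imbalance"),
    (lambda r: r["oil_pressure_bar"] < 1.5, "possible lubrication fault"),
]


def diagnose(reading: Reading) -> List[str]:
    """Return the diagnosis of every rule whose IF-condition holds."""
    return [diagnosis for condition, diagnosis in RULES if condition(reading)]


reading = {"bearing_temp_c": 101.2, "vibration_mm_s": 3.1, "oil_pressure_bar": 1.2}
print(diagnose(reading))
# ['possible bearing overheating', 'possible lubrication fault']
```

Even in this toy form, every new symptom needs another hand-written rule, which is the maintenance cost listed in the table.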
In this article, we will discuss the model-based approach with a case study.
4) Machine diagnosis using machine learning
We will explain the ML approach using a case study on “Condition-based predictive maintenance of Gas Turbines in a Power Plant”. The solution uses the ideas discussed in the paper by Zhang et al. (listed in the references) from the AAAI Conference on Artificial Intelligence, Jul 2019.
This solution is designed to address the most commonly faced challenges, listed below:
- In most real-world scenarios, it is very difficult to find a sufficient number of anomaly events in historical data. This makes supervised learning techniques infeasible for detecting anomalies or distinguishing them from normal behaviour.
- In multivariate time series data, the model must not only capture the temporal dependency within each time series but also encode the inter-correlations between different pairs of time series.
- In real-world applications, it is common to have noise that does not eventually lead to a true system failure. Therefore, an anomaly detection system should provide operators with anomaly scores indicating the severity of an incident.
4.1) Modelling methodology
We attempt to model an accurate short-term estimate of gas turbine engine performance and integrity conditions which can be invaluable for maintenance strategy and planning. We build our model based on the operational data of a gas turbine engine collected from different sensors deployed for monitoring the engine’s status.
We have the historical data from n sensors, monitoring the engine’s status for a period T,
i.e., $X = (x_1, \ldots, x_n)^{\top} \in \mathbb{R}^{n \times T}$.
During this period T, we assume that there were no anomalies or fault events and that the engine was operated under normal operating conditions. Given this data, we train a model to learn the different statuses of an engine during normal operation and to detect differences in the engine’s status during abnormal operation. We aim to detect anomaly events during operation and diagnose the severity and root cause of each anomaly.
The modelling methodology is unsupervised learning using an auto-encoder, which learns to compress the original data into an encoded representation and then to reconstruct the original input from that representation. More details about the model are given in section 4.1.1.
4.1.1) Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED)
MSCRED is an unsupervised learning technique that learns the normal operating conditions of the equipment from operational data by learning the signature matrices representing the different states of operation of the machine in normal conditions. We train our model only on the normal signature matrices and assume that the signature matrices of machines in abnormal operations differ from the normal operations.
What is a Signature Matrix?
A signature matrix is a way of representing the data wherein the multivariate time series data is transformed into correlation matrices to characterize the system status. The inter-correlations between different pairs of time series in a multivariate time series segment capture the shape similarities and value scale correlation between pairs of time series.
Fig 1. Signature Matrices
Characterizing System Status with Signature Matrices:
- We first fix the window sizes for the different resolutions and the step size by which to slide these windows. For example, suppose we have windows w1, w2, w3 such that w3 > w2 > w1 and a step size t. We right-align these windows on the time series and divide it into segments while sliding by the step size.
- Next, for each segment and each window size, we calculate the correlations between the different pairs of time series to get an n * n matrix Mt, where n is the number of sensors (time series). For example, with 30 sensors, each window gives a 30*30 matrix.
- We stack the correlation matrices from the different resolutions (window sizes) to form the signature matrix. For example, with 3 window sizes, the signature matrix has dimension 30*30*3.
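The construction above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' code: following our reading of the cited AAAI 2019 paper, each entry is the inner product of two series over the window divided by the window length (a Pearson correlation via `np.corrcoef` would be an alternative reading of "correlation"). The window sizes 10, 30, and 60 match those mentioned later in this article; the step size of 10, the function names, and the random data are assumptions for the example.

```python
import numpy as np


def signature_matrix(X: np.ndarray, t: int, windows=(10, 30, 60)) -> np.ndarray:
    """Multi-scale signature matrix for the segment ending at time step t.

    X has shape (n_sensors, T). For each window size w, entry (i, j) is the
    inner product of series i and j over the last w steps, divided by w.
    Returns an array of shape (n_sensors, n_sensors, len(windows)).
    """
    if t < max(windows):
        raise ValueError("t must be at least the largest window size")
    channels = []
    for w in windows:
        segment = X[:, t - w:t]                   # right-aligned window, shape (n, w)
        channels.append(segment @ segment.T / w)  # n x n pairwise matrix
    return np.stack(channels, axis=-1)


def sliding_signatures(X: np.ndarray, windows=(10, 30, 60), step: int = 10) -> np.ndarray:
    """Signature matrices for every segment, sliding the windows by `step`."""
    ends = range(max(windows), X.shape[1] + 1, step)
    return np.stack([signature_matrix(X, t, windows) for t in ends])


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))   # synthetic data: 30 sensors, 200 time steps
M = sliding_signatures(X)
print(M.shape)                   # (15, 30, 30, 3)
```

Each of the 15 segments yields a 30*30*3 signature matrix, which is the input that the encoder-decoder model learns to reconstruct.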
MSCRED modelling framework: The steps to build an unsupervised model with the MSCRED modelling framework are as follows.
Step 1: Construct multi-resolution signature matrices to characterize multiple levels of the system operational statuses across different time steps (segments) as discussed in the section above. Multi-resolution signatures are used to reduce the operational noise in data and help us indicate the severity of abnormal incidents.
Fig 2. MSCRED Model Framework: (a) Signature matrices encoding via CNN. (b) Temporal patterns modelling by attention based convLSTM. (c) Signature matrices decoding via deconvolution neural networks. (d) Loss function.
Step 2: Use a convolutional encoder to capture and encode the inter-sensor correlation patterns from the signature matrices (as shown in part (a) of Fig 2).
Step 3: Use an attention-based Convolutional Long Short-Term Memory (ConvLSTM) network to capture the temporal patterns (as shown in part (b) of Fig 2).
Step 4: Use a convolutional decoder to reconstruct the signature matrices from the feature maps which encode the inter-sensor correlations and temporal information (as shown in part (c) of Fig 2).
Step 5: The residual error between reconstructed signature matrices and original signature matrices is then utilized to detect and diagnose anomalies (as shown in part (d) of Fig 2).
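One way to write the reconstruction objective and the residual used in Step 5 is shown below. This is a sketch that assumes a squared Frobenius-norm loss summed over segments and channels, which is our reading of part (d) of Fig 2; the symbols $\hat{M}^{t}$ (reconstructed signature matrix), $R^{t}$ (residual matrix), and $s$ (number of channels) are notation introduced here for illustration.

$$
\mathcal{L} = \sum_{t} \sum_{c=1}^{s} \left\| M^{t}_{:,:,c} - \hat{M}^{t}_{:,:,c} \right\|_F^2,
\qquad
R^{t}_{:,:,c} = M^{t}_{:,:,c} - \hat{M}^{t}_{:,:,c}
$$

Because the model is trained only on normal data, a large residual $R^{t}$ at inference time indicates a segment whose inter-sensor correlations the model could not reproduce.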
4.1.2) Anomaly Detection and Root Cause Identification
The steps to detect an anomaly and identify its root cause are as follows.
- We use the residual error matrix to detect the anomaly and identify the root cause(s).
- An anomaly score is calculated for each window by adding up the absolute values of the residuals in the residual matrix.
- A segment whose anomaly score exceeds a defined threshold is marked as an anomaly. Analysing the corresponding residual matrix to find the rows and columns with the highest error gives us the root cause or affected components.
- The signature matrices of the operational data include s = 3 channels (one per window size) that capture the system status at different scales. Anomaly severity is obtained by computing the anomaly scores from the residual error matrices of the three channels, i.e., small, medium, and large, with window sizes w = 10, 30, and 60, respectively, as shown in Fig 3.
Fig 3. Anomaly Diagnosis Results
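The scoring and root-cause steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold value, and the toy matrices are hypothetical, and the "reconstruction" is a hand-made array standing in for real model output.

```python
import numpy as np


def anomaly_score(residual: np.ndarray) -> float:
    """Sum of absolute residuals in one (n x n) residual matrix."""
    return float(np.abs(residual).sum())


def suspect_sensors(residual: np.ndarray, top_k: int = 3) -> list:
    """Sensors whose rows and columns carry the largest total error."""
    abs_res = np.abs(residual)
    per_sensor = abs_res.sum(axis=0) + abs_res.sum(axis=1)
    order = np.argsort(per_sensor)[::-1]          # largest error first
    return [int(i) for i in order[:top_k]]


def diagnose_segment(original, reconstructed, threshold, top_k=3):
    """Score every scale (channel) of one segment and list suspect sensors."""
    residual = original - reconstructed           # shape (n, n, s)
    report = []
    for c in range(residual.shape[-1]):
        channel = residual[:, :, c]
        score = anomaly_score(channel)
        is_anomaly = score > threshold
        report.append({
            "score": score,
            "is_anomaly": is_anomaly,
            "suspects": suspect_sensors(channel, top_k) if is_anomaly else [],
        })
    return report


# Toy example: 5 sensors, 3 scales; sensor 2 deviates in the smallest scale only.
n = 5
original = np.zeros((n, n, 3))
reconstructed = np.zeros((n, n, 3))
original[2, :, 0] = 1.0
original[:, 2, 0] = 1.0
report = diagnose_segment(original, reconstructed, threshold=5.0, top_k=1)
print(report[0])   # {'score': 9.0, 'is_anomaly': True, 'suspects': [2]}
```

In this toy case only the short-window channel is flagged; an incident that also exceeded the threshold in the medium and large channels would be read as more severe.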
Analysis of the anomalous residual matrices helps operators perform root cause analysis and identify the affected components. This method captures the temporal patterns in the time series as well as the inter-sensor correlation patterns. For machine diagnosis, the authors of the cited paper report that this method outperforms other state-of-the-art models.
- Ramana PV, Fault Tree Analysis, https://sixsigmastudyguide.com/fault-tree-analysis/
- Xiao-Wen Deng, Qing-Shui Gao, Chu Zhang, Di Hu, Tao Yang, “Rule-based Fault Diagnosis Expert System for Wind Turbine”, ITM Web Conf. 11, 07005 (2017), DOI: 10.1051/itmconf/20171107005
- Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, Nitesh V. Chawla, “A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data”, AAAI 2019: 1409-1416
Ankush Kundaliya is a Data Scientist at Abzooba. With more than 5 years of extensive experience in the field of data science, Ankush has expertise in building data-driven solutions to complex business problems using advanced deep learning and machine learning algorithms. Having worked across multiple industry domains, including IT Service Management, Human Resources, Manufacturing, Life Sciences, and Financial Services, he has diverse knowledge and business acumen.
Aditya Aggarwal serves as Data Science – Practice Lead at Abzooba Inc. With more than 12 years of experience in driving business goals through data-driven solutions, Aditya specializes in predictive analytics, machine learning, business intelligence & business strategy across a range of industries.