
How Data Mining is Used by Nasdaq, DHL, Cerner, PBS, and The Pegasus Group: Case Studies


Companies understand that data mining can provide insights to improve the organization. Yet, many struggle with the right types of data to collect, where to start, or what project may benefit from data mining.

Examining the data mining success of others in a variety of circumstances illuminates how certain methods and software in the market can assist companies. See below how five organizations benefited from data mining in different industries: cybersecurity, finance, health care, logistics, and media.


1. Cerner Corporation

Over 14,000 hospitals, physician’s offices, and other medical facilities use Cerner Corporation’s software solutions.

Cerner’s access allows them to combine patient medical records and medical device data to create an integrated medical database and improve health care.

Cloudera’s data mining platform lets different devices feed a common database, which is then used to predict medical conditions.

“In our first attempts to build this common platform, we immediately ran into roadblocks,” says Ryan Brush, senior director and distinguished engineer at Cerner.

“Our clients are reporting that the new system has actually saved hundreds of lives by being able to predict if a patient is septic more effectively than they could before.”

Industry: Health care

Data mining provider: Cloudera

  • Collect data from unlimited and different sources
  • Enhance operational and financial performance for health care facilities
  • Improve patient diagnosis and save lives

Read the Cerner Corporation and Cloudera, Inc. case study.

2. DHL

DHL Temperature Management Solutions provides temperature-controlled pharmaceutical logistics to ensure pharmaceutical and biological goods stay within required temperature ranges to retain potency.

Previously, DHL transferred data into spreadsheets that took a week to compile and would only contain a portion of the potential information.

Moving to DOMO’s data mining platform allows for real-time reporting of a broader set of data categories to improve insight.

“We’re able to pinpoint issues that we couldn’t see before. For example, a certain product, on a certain lane, at a certain station is experiencing an issue repeatedly,” says Dina Bunn, global head of central operations and IT for DHL Temperature Management Solutions.

Industry: Logistics

Data mining provider: DOMO

  • Real-time versus week-old logistics information
  • More insight into sources of delays or problems at both a high and a detailed level
  • More customer engagement

Read the DHL and DOMO case study.


3. Nasdaq

The Nasdaq electronic stock exchange integrates Sisense’s data mining capabilities into its IR Insight software to help customers analyze huge data sets.

“Our customers rely on a range of content sets, including information that they license from others, as well as data that they input themselves,” says James Tickner, head of data analytics for Nasdaq Corporate Solutions.

“Being able to layer those together and attain a new level of value from content that they’ve been looking at for years but in another context.”

The combined application provides real-time analysis and clear reports that are easy for customers to understand and communicate internally.

Industry: Finance

Data mining provider: Sisense

  • Meets rigorous data security regulations
  • Quickly processes huge data sets from a variety of sources
  • Provides clients with new ways to visualize and interpret data to extract new value

Read or watch the Nasdaq and Sisense case study.

4. PBS

The Public Broadcasting Service (PBS) of the U.S. runs a website that serves 353 PBS member stations and their viewers. Its 330 million sessions, 800 million page views, and 17.5 million episode plays generate an enormous volume of data that the PBS team struggled to analyze.

PBS worked with LunaMetrics to perform data mining on the Google Analytics 360 platform to speed up insights into PBS customers.

Dan Haggerty, director of digital analytics for PBS, says “that was the coolest thing about it. A machine took our data without prior assumptions and reaffirmed and strengthened ideas that subject matter experts already suspected about our audiences based on our contextual knowledge.”

Industry: Media

Data mining provider: Google Analytics and LunaMetrics

  • Identified seven key audience segments based on web behaviors
  • Developed in-depth personas per segment through data mining
  • Insights help direct future content and feature development

Read the PBS, LunaMetrics, and Google Analytics case study.

5. The Pegasus Group

Cyber attackers compromised and targeted the data mining system (DMS) of a major network client of The Pegasus Group and launched a distributed denial-of-service (DDoS) attack against 1,500 services.

Under extreme time pressure, The Pegasus Group needed to find a way to use data mining to analyze up to 35GB of data with no prior knowledge of the data contents.

“[I analyzed] the first three million lines and [used RapidMiner’s data mining to perform] a stratified sampling to see which ones [were] benign, which packets [were] really part of the network, and which packets were part of the attack,” says Rodrigo Fuentealba Cartes of The Pegasus Group.

“In just 15 minutes … I used this amazing simulator to see what kinds of parameters I could use to filter packets … and in another two hours, the attack was stopped.”

Industry: Cybersecurity

Data mining provider: RapidMiner

  • Uploaded and analyzed three million lines of data 
  • Recommended analysis models provided answers within 15 minutes
  • Data analysis suggested solutions that stopped the attack within two hours

Watch The Pegasus Group and RapidMiner case study.


An energy efficiency solution based on time series data mining algorithm—a case study of small hotel


Qiang Gong, Ying Zeng, Ebu Adu, Shanshan Han, Shuming Zhang, Weiwen Cui, Haode Sun, Xiaodong Liu, An energy efficiency solution based on time series data mining algorithm—a case study of small hotel, International Journal of Low-Carbon Technologies , Volume 17, 2022, Pages 1406–1419, https://doi.org/10.1093/ijlct/ctac115


This study conducts data mining research on the time series energy consumption dataset of a small hotel. Earlier studies on data mining have demonstrated that cluster and association analysis are commonly used methods, but they have not yet been investigated in the time series dimension. For this reason, this article uses the K-shape and Apriori algorithms, coded in Python, to explore the time series energy consumption data of the small hotel in terms of subentry and total energy consumption. The results reveal that the energy consumption curves and association rules effectively reflect the working characteristics of the small hotel. The clustering results show that the hotel’s working features and the weather conditions determine the shape of the time series energy consumption curve. As for the association rules, there is a different chain relationship between the energy consumption of each subentry and the total energy consumption of the small hotel; in particular, the consumption of heating gas and cooling mainly determines the changes in the other energy consumption.

1.1 Background

A great deal of attention is now being paid to rising energy consumption, because the world’s energy resources are being rapidly depleted, which may plunge the world into an energy crisis. As noted by the International Energy Agency, buildings have become the world’s largest consumer of energy, accounting for more than one-third of final energy consumption and an equally important share of carbon dioxide emissions [ 1 ]. Moreover, because of various building operation defects, 16% of building operation energy consumption is wasted, which shows the large energy saving potential [ 2 ]. Some research has been carried out to develop advanced data analysis methods for mining large volumes of building operational data. In particular, data mining is attracting increasing attention as a promising solution because of its advantages in discovering knowledge from big data [ 3 ]. Consequently, this research aims to employ an advanced solution to mine knowledge from time series energy consumption data, so as to provide efficiency suggestions for energy saving behavior.

1.2 Application of clustering algorithm in the field of architecture

In the field of architecture, much research has been done on building energy efficiency based on clustering algorithms [ 4 ]. Clustering is widely used to support building energy consumption pattern analysis [ 5 ], fault detection of abnormal energy consumption patterns [ 6 ] and prediction of future energy consumption [ 7 ].

For example, McLoughlin applied k-means, k-medoid and SOM to segment households into several groups according to their pattern of electricity use across the day, which resulted in a series of profile classes representing common patterns of electricity use within the home [ 8 ]. Li et al. [ 9 ] and Song et al. [ 10 ] adopted clustering algorithms to obtain daily building electricity consumption patterns from hourly operation data. In addition, Yu et al. [ 33 ] showed that this method could also be employed to identify the monthly energy consumption situation of residential building systems. Similarly, Rhodes et al. [ 11 ] performed an experiment to reveal the daily power usage modes of residential buildings via a clustering solution. Abreu et al. [ 12 ] developed a framework using a clustering algorithm to extract daily house occupancy patterns. Fan et al. [ 31 ] discovered the typical operating modes of a building cooling system. However, the performance of most clustering techniques depends on the distance measure chosen for comparing two time series. To solve this problem, Paparrizos proposed the K-shape time series clustering algorithm, which adopts a normalized cross-correlation measure to take the shape of the time series into account [ 13 ]. That study compares dataset patterns and shows that K-shape outperforms other partitional, hierarchical and spectral clustering approaches.

In addition, in terms of energy forecasting, Tang et al. [ 14 ] applied k-means to cluster data before building a prediction model and found that it could significantly improve the final accuracy. A hierarchical clustering solution was also adopted to forecast building electricity consumption [ 15 ]. What is more, clustering algorithms are not only widely applied to identify building energy issues but also used to recognize the factors that influence energy consumption. For instance, Carmo et al. adopted a clustering solution to divide the hourly heat load data of 139 single-family detached houses into three groups in order to illustrate the influence of household and building characteristics on heat load demand [ 16 ]. They also developed more cost-effective approaches that used the k-means method to determine the typical heating load curve of a single-family detached house in Denmark.

In short, the above research shows that clustering is one of the most important data mining algorithms in the construction field and can uncover valuable knowledge concealed in large databases.

1.3 Application of association algorithm in the field of architecture

On the other hand, association rule mining is another unsupervised analytics algorithm, which discovers the internal relationships between variables. Table 1 lists basic information on the related research.

Related literature in recent years on data mining in building research.

Fan et al. (2015) applied the Apriori algorithm to reveal the correlation between building cooling system components. In recent years, Qiu et al. [ 27 ] developed a control strategy based on association algorithms to detect heating, ventilation and air conditioning (HVAC) systems. Li et al. [ 24 ] used association algorithms to identify equipment failures and improper operating modes of variable refrigerant flow air conditioning systems. To discover dynamic anomalies in the HVAC system, Fan et al. (2015) further developed a time-based association algorithm that applies the association solution in the time dimension. Sun et al. [ 29 ] took a similar approach, creating fault detection thresholds to find anomalies in buildings. Yu et al. [ 18 ] applied association algorithms to discover all correlations in building operation data so as to directly identify energy waste and equipment failures during air-conditioning system operation.

What is more, the Apriori algorithm is one of the most popular and long-established solutions for association rule mining. Li et al. [ 24 ] proposed an Apriori-based method to discover the energy consumption mode and compressor control strategy of a refrigeration system. Wang et al. [ 25 ] adopted this method to reveal the electricity consumption patterns of the remaining buildings. Moreover, some improved Apriori algorithms have been proposed that can identify rare and unexpected building energy patterns [ 25 ]. Yu et al. (2011) proposed two association algorithms that identified the behavior of occupants with low energy consumption in residential buildings. Wang et al. [ 25 ] analyzed the impact of occupants’ behavior on residential electricity consumption using an improved Apriori algorithm.

To sum up, clustering and association algorithms are of great significance for data mining in the field of architecture. This investigation therefore applies clustering and association algorithms to a data mining task. The rest of this article is arranged as follows: Section 1 reviews the literature to date. Section 2 introduces the methodology. Section 3 describes the clustering and association results. Section 4 discusses research deficiencies and future research directions. Section 5 concludes the research.

2.1 Outline

As depicted in Figure 1 , the research consists of four main phases: data collection, data mining, data analysis and conclusion. First, time series energy consumption data are gathered with the EnergyPlus software, which simulates the small hotel model provided by the US Department of Energy and produces a CSV-format data set. Second, the collected data are preprocessed by converting the units of energy consumption into kWh. After that, the program code for K-shape clustering and the Apriori association algorithm is written in Python to mine the time series energy consumption database. Lastly, the data mining results for the subentry and total energy consumption are analyzed to obtain the relevant energy consumption patterns of the small hotel.
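The preprocessing step can be illustrated with a minimal sketch. It assumes the simulation output reports energy in joules, so dividing by 3.6 million gives kWh; the column name and the two sample rows are hypothetical and stand in for the real CSV file.

```python
import csv
import io

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def load_hourly_kwh(csv_text, column):
    """Read one meter column from CSV text and convert joules to kWh."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) / JOULES_PER_KWH for row in reader]


# Hypothetical two-row sample; real column names may differ.
SAMPLE = (
    "Date/Time,Cooling:Electricity [J](Hourly)\n"
    "01/01 01:00:00,7200000\n"
    "01/01 02:00:00,3600000\n"
)
hourly_kwh = load_hourly_kwh(SAMPLE, "Cooling:Electricity [J](Hourly)")
print(hourly_kwh)  # [2.0, 1.0]
```

If the simulation is configured to report another unit, only the conversion constant would need to change.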

Outline of the research methodology workflow.


2.2 Database

The database of this study is based on the commercial building simulation database provided by the US Department of Energy, which was published by the National Renewable Energy Laboratory in 2011. This database provides standard and reference building energy consumption models that serve as a starting point for energy conservation studies, as shown in Table 2 . Compared with actual building energy load data, the database offered by the National Building Stock Commercial Reference Building Model of the US Department of Energy has the distinctive advantage that it accurately represents the characteristics and construction practices of existing buildings. It therefore provides a useful research database for measuring progress toward the department’s energy efficiency goals for commercial buildings.

Building types in the commercial reference building energy database.

2.3 Software

EnergyPlus, a classical energy simulation software package developed by the Department of Energy and Lawrence Berkeley National Laboratory [ 30 ], is used in this research to simulate building energy consumption and collect time series energy consumption data. To match EnergyPlus, the database has three versions of the model for each building type: new buildings, buildings after 1980 and buildings before 1980. Models of the same type have identical form, area and operating schedule, and differ predominantly in the insulation of the envelope.

For the sake of generality, the small hotel was selected as the research object from among the 16 building types, and the Miami area of the USA was taken as the climate zone input. Figure 2 depicts the small hotel building model used in EnergyPlus. Tables 3 and 4 give the basic information and the envelope conditions of the model. Once the EnergyPlus simulation is complete, the relevant data are stored directly in CSV files, recorded at hourly resolution for the different energy types.

Schematic diagram of the small hotel energy consumption simulation model.


The parameters of the small hotel in Miami.

Building envelope related parameters.

2.4 Algorithms

In terms of algorithms, this study mainly adopts the K-shape and Apriori algorithms.

where R represents the cross-correlation between the variables and is also used as the divisor in the coefficient normalization process. This measure assigns each time series to a cluster by comparing it with the cluster centroids: the centroid that gives the maximum normalized cross-correlation value determines the cluster. The algorithm finds similar time series curves via this indicator.
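A sketch of the normalization this paragraph describes, assuming the standard shape-based distance (SBD) of the K-shape formulation rather than any notation specific to this study:

$$ \mathrm{SBD}(x,y)=1-\max_{w}\frac{CC_{w}(x,y)}{\sqrt{R_{0}(x,x)\,R_{0}(y,y)}} $$

Here x and y are two time series, CC_w(x, y) is their cross-correlation at shift w, and R_0 is the cross-correlation at zero shift, which acts as the normalizing divisor. Under this definition SBD ranges from 0 (identical shape) to 2.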

For clustering, all required time series data must also be scaled into a normalized coordinate system that preserves the curve shape. All data are standardized so that the data set mean is 0 and the standard deviation is 1. In addition, once the data files are available in CSV format, the algorithms must be implemented in a programming language so that the clustering and association analysis can be run. Figure 3 illustrates the logic of the clustering algorithm in this research. First, the data are imported into the K-shape algorithm. Second, the optimal number of clusters k is determined through the elbow rule and the silhouette coefficient. Finally, the algorithm is run to analyze the dataset and investigate energy consumption patterns.
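The scaling step and the shape comparison can be sketched in plain Python. This is a minimal illustration, assuming the standard shape-based distance from the K-shape literature; the function names and the two sample curves are hypothetical, and a real analysis would more likely use an existing K-shape implementation.

```python
import math


def z_normalize(series):
    """Scale a series to mean 0 and standard deviation 1."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        return [0.0] * n  # a flat curve carries no shape information
    return [(v - mean) / std for v in series]


def cross_correlation(x, y, shift):
    """Sum of x[i + shift] * y[i] over the overlapping indices."""
    n = len(x)
    return sum(x[i + shift] * y[i] for i in range(n) if 0 <= i + shift < n)


def shape_based_distance(x, y):
    """1 minus the maximum normalized cross-correlation over all shifts."""
    n = len(x)
    norm = math.sqrt(cross_correlation(x, x, 0) * cross_correlation(y, y, 0))
    if norm == 0:
        return 1.0  # assumption: treat a flat series as uncorrelated
    best = max(cross_correlation(x, y, s) for s in range(-(n - 1), n))
    return 1.0 - best / norm


# Hypothetical curves: the same peak, two hours apart.
morning_peak = z_normalize([0, 0, 1, 3, 1, 0, 0, 0])
later_peak = z_normalize([0, 0, 0, 0, 1, 3, 1, 0])
print(shape_based_distance(morning_peak, morning_peak))
print(shape_based_distance(morning_peak, later_peak))
```

Because the distance searches over shifts, two curves with the same shape but offset peaks stay close, which is the property that makes this measure suit daily load curves.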

Logic of the clustering algorithm.


Furthermore, an association algorithm is used to study the relationship between energy consumption values at each moment. This research adopts the Apriori algorithm, which captures the internal chain of dynamic relationships between the different types of energy.
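For readers unfamiliar with Apriori, the level-wise search can be sketched as follows. This is a generic textbook version and not the program used in this study; the item names such as `cooling_up` are hypothetical labels for changes in each energy type.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset.

    Expects a non-empty list of transactions (iterables of items).
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Keep only candidates that meet the minimum support.
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


# Hypothetical transactions describing hour-to-hour changes.
demo = [
    {"cooling_up", "pumps_up", "total_up"},
    {"cooling_up", "total_up"},
    {"cooling_down", "total_down"},
    {"cooling_up", "pumps_up", "total_up"},
]
frequent_itemsets = apriori(demo, min_support=0.5)
print(len(frequent_itemsets))  # 7
```

The pruning step relies on the fact that every subset of a frequent itemset must itself be frequent, which is what keeps the candidate set small.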

Figure 4 shows the logic of the association algorithm. First, the raw database is loaded into the algorithm. Second, the difference between the data at each moment and at the previous moment is computed to form a dataset of changes. The difference dataset is then mined by the Apriori program. Finally, the resulting rules are output for analysis and exported to a txt file for viewing. To be retained, association rules must meet the minimum support and confidence values specified by the user. These two indicators measure the credibility of the mined information.
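The differencing step can be sketched as below. How the changes are turned into discrete items is an assumption made for illustration: each change is labeled `up`, `down` or `flat`, and the `tolerance` parameter and the energy type names are hypothetical.

```python
def to_change_transactions(hourly_by_type, tolerance=0.0):
    """Turn hourly values per energy type into one transaction per time step.

    Each transaction holds one item per energy type describing how that
    type changed relative to the previous moment.
    """
    lengths = {len(values) for values in hourly_by_type.values()}
    if len(lengths) != 1:
        raise ValueError("all energy types need the same number of values")
    n = lengths.pop()
    transactions = []
    for t in range(1, n):
        items = set()
        for name, values in hourly_by_type.items():
            delta = values[t] - values[t - 1]
            if delta > tolerance:
                items.add(f"{name}_up")
            elif delta < -tolerance:
                items.add(f"{name}_down")
            else:
                items.add(f"{name}_flat")
        transactions.append(items)
    return transactions


# Hypothetical four-hour excerpt in kWh.
changes = to_change_transactions(
    {"cooling": [1.0, 2.0, 2.0, 1.5], "total": [5.0, 6.5, 6.5, 6.0]}
)
print(changes)
```

The resulting list of item sets is the kind of input an Apriori program expects.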

Logic of the association algorithm.


2.5 Performance indicators

2.5.1 Silhouette coefficient

The silhouette coefficient is generally used to evaluate clustering performance, although there are many other options for analyzing building energy consumption data. During clustering, the distance between each sample and its cluster centroid is taken as the objective function to be minimized. When the algorithm runs, the centroids are first specified at random and each sample is then assigned to the nearest cluster based on distance. This process is repeated until the stopping condition specified by the researchers is reached. However, since there are no fixed category labels in unsupervised learning, the number of clusters needs to be determined in advance via the silhouette coefficient.

The silhouette coefficient uses the two basic concepts of cohesion and separation to measure clustering quality. Cohesion indicates the degree of similarity between objects and their own clusters, while separation compares the differences among clusters. The silhouette coefficient ranges from −1 to 1. The number of clusters k with the largest silhouette coefficient is the target of the search. The search process is as follows.

  • 1) Calculate the average distance a(i) between a sample i and the other samples within the same cluster. All elements should be classified into the cluster with the smallest distance. The average of a(i) over all samples in a cluster reflects the degree of similarity within that cluster.

  • 2) Compute the mean distance from sample i to the samples in each of the other clusters, and define b(i) = min{bi1, bi2, …, bik}. b(i) measures the degree of dissimilarity between clusters.

  • 3) Based on the two parameters defined above, the silhouette coefficient is calculated as follows:

$$ s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}} $$ (2)

$$ s(i)=\begin{cases}1-\dfrac{a(i)}{b(i)}, & a(i)<b(i)\\ 0, & a(i)=b(i)\\ \dfrac{b(i)}{a(i)}-1, & a(i)>b(i)\end{cases} $$ (3)

Equations ( 2 ) and ( 3 ) are equivalent ways of calculating the silhouette coefficient s(i); Equation ( 3 ) is a piecewise rewriting of Equation ( 2 ).

  • 4) Judge the clustering effect according to the calculation results. The closer s(i) is to 1, the more reasonable the chosen number of clusters k.
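The steps above can be sketched in plain Python. This is a minimal illustration that assumes Euclidean distance and at least two clusters; the study itself may use a different distance, and the four sample points are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_samples(points, labels, distance=euclidean):
    """Return s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every sample."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def silhouette_score(points, labels, distance=euclidean):
    scores = silhouette_samples(points, labels, distance)
    return sum(scores) / len(scores)


# Hypothetical points forming two obvious groups.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1])
print(good, bad)
```

A labeling that matches the natural groups scores close to 1, while one that splits them scores below 0; comparing such scores across candidate values of k is how the cluster number is chosen.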

2.5.2 Three indicators for association evaluation

In the association algorithm, the three concepts of support, confidence and lift are generally used to evaluate the results.

$$ \mathrm{support}(X\Rightarrow Y)=P(X\cup Y) $$

where X and Y represent itemsets and P(X ∪ Y) is the fraction of transactions that contain every item of both X and Y. Support indicates the importance of an association rule. Rules with low support generally have little meaning or research value, so this parameter can be used to discover valuable rules efficiently.

$$ \mathrm{confidence}(X\Rightarrow Y)=P(Y\mid X)=\frac{\mathrm{support}(X\Rightarrow Y)}{\mathrm{support}(X)} $$

Confidence indicates the credibility of an association rule. The purpose of this research is to find rules with a high confidence level.

$$ \mathrm{lift}(X\Rightarrow Y)=\frac{\mathrm{confidence}(X\Rightarrow Y)}{\mathrm{support}(Y)} $$

Lift > 1 indicates a positive correlation between the two itemsets, lift < 1 indicates a negative correlation, and lift = 1 indicates that there is no correlation between them.
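As a worked example of the three indicators, the sketch below uses their standard definitions: support is the fraction of transactions containing both itemsets, confidence is that support divided by the support of the antecedent, and lift is confidence divided by the support of the consequent. The transactions and item names are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    count_x = sum(1 for t in sets if x <= t)
    count_y = sum(1 for t in sets if y <= t)
    count_xy = sum(1 for t in sets if (x | y) <= t)
    if count_x == 0 or count_y == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = count_xy / n
    confidence = count_xy / count_x
    lift = confidence / (count_y / n)
    return support, confidence, lift


# Hypothetical hour-to-hour change transactions.
rows = [
    {"cooling_up", "pumps_up", "total_up"},
    {"cooling_up", "total_up"},
    {"cooling_down", "total_down"},
    {"cooling_up", "pumps_up", "total_up"},
]
support, confidence, lift = rule_metrics(rows, {"cooling_up"}, {"total_up"})
print(support, confidence, lift)
```

In this sample the rule holds in three of four transactions (support 0.75), always holds when cooling rises (confidence 1.0), and has lift 4/3, which is above 1 and therefore indicates a positive correlation.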

Relationship between data cycles in the entire research process.


2.6 Data circle path

Figure 5 shows the data flow through the entire research process. The basic cluster analysis results provide the basis for association rule mining. There are 365 days in a year and 24 hours in a day. Because this study takes one hour as the time resolution of energy consumption, EnergyPlus generates a total of 8760 energy consumption values per year.

In cluster analysis, the energy variation curve formed by the 24 hourly energy consumption values of one day is treated as the transaction to be analyzed. The clustering algorithm is thus run on the 365 daily energy curves of the year. The results show the different energy usage patterns of the various energy subtypes.

After the clustering process, groups of energy consumption curves and their corresponding dates are obtained. Based on the cluster analysis results, the energy usage curves within each group are generated, and the time range of each cluster must also be filtered out. The association algorithm works on the energy performance data within the same time period. The differences in energy consumption between adjacent moments for each energy type constitute the transaction set for the association algorithm. Note that the final association rules focus on each hourly gap; hence, the Apriori algorithm needs to be run 23 times.
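The data shapes described in this section can be sketched as follows. This is one reading of the text, assumed for illustration: the 8760 hourly values are cut into 365 daily curves of 24 values, and each of the 23 gaps between adjacent hours yields one set of differences, one per day. The function names and the synthetic data are hypothetical.

```python
HOURS_PER_DAY = 24


def to_daily_curves(hourly):
    """Split a flat list of hourly values into one 24-value curve per day."""
    if len(hourly) % HOURS_PER_DAY != 0:
        raise ValueError("expected a whole number of days")
    return [
        hourly[d * HOURS_PER_DAY:(d + 1) * HOURS_PER_DAY]
        for d in range(len(hourly) // HOURS_PER_DAY)
    ]


def hourly_gap_differences(daily_curves):
    """For each of the 23 adjacent-hour gaps, collect the change on every day."""
    return [
        [day[h + 1] - day[h] for day in daily_curves]
        for h in range(HOURS_PER_DAY - 1)
    ]


# Synthetic year: consumption simply equals the hour of the day.
year = [float(h % HOURS_PER_DAY) for h in range(365 * HOURS_PER_DAY)]
curves = to_daily_curves(year)
gaps = hourly_gap_differences(curves)
print(len(year), len(curves), len(curves[0]), len(gaps), len(gaps[0]))
```

With this layout the clustering step consumes `curves`, and an association run per entry of `gaps` gives the 23 runs mentioned above.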

3.1 Cluster analysis

3.1.1 Cooling electricity consumption

Cooling electricity consumption refers to the electricity consumed to lower the interior temperature, that is, the air conditioning portion of an HVAC system. Before the cluster analysis, the number of clusters must be determined via the silhouette coefficient. Table 5 shows the silhouette coefficient for different numbers of clusters. Comparing the degree of distortion shows that the algorithm performs best on the whole temporal energy consumption when the number of clusters is 2.

Silhouette coefficient of different cluster number of cooling electricity.

Figure 6 shows the cluster analysis result for cooling electricity consumption. As can be seen from the figure, the hotel’s sequential cooling power curves over a year can be clustered into two patterns. In each pattern, the red line represents the cluster result curve and the black lines are the daily energy consumption curves belonging to that group. The specific dates corresponding to each cluster can also be obtained from the Python program, which helps in studying the building operation characteristics.

Clustering effect of cooling electricity consumption.


The figure shows that type 1 is the summer cooling energy usage pattern, which has an inverted U shape. Because the local temperature in Miami always remains high, the cooling system consumes a large amount of energy day and night. Type 2 is the winter energy consumption pattern, which has an M shape. This is attributed to the special features of the Miami climate. Because the outdoor temperature is not high in winter, the cooling consumption curve stays low during the day. In the morning and at night, the cooling electricity curve shows two peaks, mainly because of occupant activity during these periods. For example, in winter the cooling energy consumption curve shows two peaks at 7:00–11:00 and 18:00–24:00. In summary, the cooling power consumption curve is stable, because the hot Miami climate requires the chiller plant to operate almost all year round.

Based on the above clustering results, it is recommended in summer to decrease the cooling load at night by using outside air, as shown by the purple arrow. The daytime cooling load cannot be reduced because of the high outdoor temperature. In winter, considering the active periods in the morning and at night, the small hotel could extend the opening hours of rooms such as the restaurant in order to move the peak toward the trough. The overall cooling load curve would then become more stable and fluctuate less, which improves energy efficiency.

3.1.2 Heating consumption

In Miami, the heating equipment is mainly used for dehumidification rather than to meet winter heating demand. Table 6 lists the silhouette coefficient for various numbers of clusters.

Silhouette coefficient of different cluster numbers of heating gas.

Figure 7 depicts the clustering effect for heating gas consumption. As with the cooling curves above, Figure 7 shows the energy expenditure pattern of the hotel’s dehumidification. The results show that the hotel heats only at night and in the morning, giving a U shape. This is because dehumidification is carried out only during periods of low activity. For instance, the hotel dehumidifies each room by heating before 8:00 in summer and before 9:00 in winter. Shortening the morning heating time could therefore significantly reduce the overall energy consumption level.

Clustering effect of heating consumption.


3.1.3 Water systems power consumption

Water systems power consumption mainly serves to heat water. In the small hotel, domestic hot water service is needed for the guest rooms. Table 7 shows the silhouette coefficient. Judging from the silhouette coefficient, the cluster number k is set to 1 for the investigation.

Silhouette coefficient of different cluster number of water systems gas.

Silhouette coefficient of different cluster number of pumps electricity.

Figure 8 shows the cluster result for water systems power consumption. As can be seen, the small hotel has only one energy expenditure pattern for this type of energy. The pattern has an M shape each day, with fluctuations, mainly because the water heater is constantly starting and stopping. The two energy consumption peaks occur primarily at noon and at night, which correspond to the periods when people are active. For example, the energy consumption level is significantly higher between 8:00 and 15:00 and between 20:00 and 23:00 than at other times. To balance the energy level, the small hotel could use water storage facilities to store heated water. This would move part of the energy peak into the trough and make the curve more stable.

Clustering effect of water systems power consumption.


3.1.4 Pumps electricity consumption

The pumps in the small hotel are mainly responsible for pumping water, including water suction and transportation. The power consumption of the pumps depends on the amount of water supplied. Table 8 shows that the optimal cluster number is 3.

Figure 9 shows the clustering effect for pumps electricity consumption. It can be observed that the two curve types remain similar, both presenting an M-shaped pattern. Generally speaking, two energy peaks appear at noon and at night because of people’s activity in public rooms and the demand for showers. Note that, apart from the difference between winter and daylight time, night showers take less time in winter than in summer. High pump electricity consumption is concentrated between 8:00 and 16:00, which is the main period of people’s activity in the small hotel. As with the water system, storage equipment could be used to store shower water and save pump electricity on the various floors.

Clustering effect of pumps electricity consumption.


3.1.5 Light facility consumption

Light facility consumption in the small hotel mainly relates to the interior lighting supply. Table 9 shows the silhouette coefficient for different cluster numbers. Here k is set to 3 for the cluster analysis.

Silhouette coefficient of different cluster number of light facility.

Figure 10 presents the lighting energy consumption patterns over a year. The results indicate that lighting power consumption follows an M shape, with the two peaks occurring primarily in the morning and at night. It should be noted that in cloudy weather the lights stay on throughout working hours, including daytime, as shown by Type 3. Lighting energy consumption therefore has considerable saving potential, especially on cloudy days, as there is no need to turn on lights in daytime. It is thus recommended to pay attention to lamp use during cloudy daytime hours.

3.1.6 Sum all consumption

In the sections above, time series cluster analysis was carried out on the subentry energy consumption of the small hotel. Beyond these individual energy types, building energy conservation also requires attention to total energy consumption. Table 10 shows the corresponding silhouette coefficient; 3 clusters are selected for the cluster analysis.

Silhouette coefficient of different cluster number of sum all.

Figure 11 depicts the clustering result for the total daily energy consumption curve of the small hotel. It can be observed that the main difference between the two types is seasonal. In general, Type 1 resembles the inverted-U shape of the cooling load curve, and Type 2 resembles the M shape of the electricity facility consumption plot. This is because the total energy load depends mainly on the air-conditioning equipment. Thus, energy saving in the small hotel should focus on cooling and heating consumption.

Clustering effect of light facility consumption.


Clustering result of sum all energy consumption.


3.2 Association analysis of energy consumption data

The cluster analysis above yielded the time series energy consumption patterns of the main energy types of small hotel buildings. To discover how energy is used in each hour, association analysis was adopted to find the relationships between the various energy types. The Apriori association algorithm employed in this research can pick up the internal connections between variations in the subentry energy types by focusing on each hour. Using the mined association rules, the total energy consumption can then be regulated more reasonably by changing the consumption of individual subentry energies. In line with the clustering results obtained above, the two main clusters, summer and winter, are used as the database for association rule analysis.
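As a rough illustration of how a rule's confidence is obtained, the sketch below scores a single antecedent-consequent pair. It covers only the rule-scoring step, not Apriori's candidate generation and pruning, and the item names and the five observations are hypothetical rather than taken from the study's data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once (non-zero support).
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical observations for one hour across five winter days:
# each set lists which energy items increased in that hour.
transactions = [
    {"heating_gas_up", "total_up"},
    {"heating_gas_up", "total_up"},
    {"heating_gas_up", "water_up", "total_up"},
    {"heating_gas_up"},
    {"water_up"},
]
conf = confidence(transactions, {"heating_gas_up"}, {"total_up"})
```

Here the antecedent appears on four of the five days and is accompanied by a rise in total energy on three of them, so the rule's confidence is 3/4.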

3.2.1 Association rules for energy consumption in winter

For each hour, the association rule analysis produces two kinds of itemset: an antecedent and a consequent. All mined rules are listed with their antecedent, consequent and confidence in Table 11.

Association rules found in winter.

It can be seen that heating gas consumption is the main influence on the total energy consumption of the entire building, mainly because this kind of energy is consumed in large quantities almost throughout the day. For instance, heating gas consumption is positively correlated with total energy in each hour. However, the confidence of this rule is lower for most of the day, because the two items usually follow a constant trend over that period, and with little variation to observe the confidence falls. For this reason, it is suggested to adjust heating gas consumption in order to reduce total energy consumption in the morning and evening, which will affect the energy intensity of the building in most cases.

The other energy types mainly affect total energy consumption from noon to the afternoon, as evidenced by the confidence between each subentry energy and the total. This is because most of the people inside the hotel begin to move around during this period, which follows from the function of the building. The increased consumption of, for example, the water system and cooling therefore significantly influences the overall energy change. Consequently, energy efficiency measures in this period should focus on cooling, water and pump energy consumption.

In general, hotel buildings should focus on heating gas consumption in the morning and evening in winter in order to improve energy efficiency. Around noon, as occupants become more active, consumption by systems such as the water system rises noticeably, increasing the total energy trend. Energy-saving measures at that time should therefore address the hotel building as a whole.

3.2.2 Association rules for energy consumption in summer

Table 12 presents the association rules in summer for the small hotel. It can be seen that total electricity is mainly related to cooling consumption. Before 10:00 and after 20:00 in particular, the change in cooling largely determines the change in total energy. At noon, by contrast, increases in various subentry energies lead to an increase in total energy usage.

Association rules found in summer.

However, the dominance of cooling diminishes after 10:00 as the other energies decrease. At the same time, the other energy types change steadily, producing a stable pattern of variation, because people gradually leave the hotel during this time. After 20:00, the number of people checking in to the hotel begins to increase, which in turn increases the consumption of various energy forms.

Generally speaking, the association rules in summer all show that total energy consumption depends mainly on cooling electricity consumption. Moreover, between 10:00 and 20:00 any subentry energy-saving measure can reach the energy-saving target, as most of the people in the hotel go out at this time and energy consumption of all kinds decreases. In short, the summer energy-saving strategy for hotel buildings should focus on these aspects and shift the peak to make the total energy consumption curve more stable.

Based on an accurate, official international energy consumption database, this study uses advanced artificial intelligence data mining methods to study time series data of hotel building energy consumption, from which several types of pattern were obtained. The figures above depict the types of energy consumption summarized by clustering. Moreover, the association rules identified the main factors affecting total energy consumption in different periods, which helps in analyzing the potential causes.

Nonetheless, although this study covers considerable ground, several limitations merit further discussion. In building energy consumption data mining, current research principally focuses on actual buildings that have been in use for a long time, so the mined rules suit only that building rather than the corresponding building type. In addition, this paper improves the time resolution of the analysis and finds regularities at the hourly level, which is finer than the coarse time scales used in the past.

In the future, the time dimension could be refined to minutes and seconds or extended to weeks and months, so that more of the information in the data can be mined. Future research can therefore analyze further along the time dimension. Furthermore, in terms of seasons, this investigation mainly analyzes summer and winter, while the transitional seasons of spring and autumn have many different characteristics. How to refine the clustering of seasons so as to separate out the characteristics of the transitional seasons is therefore also worth studying.

What is more, this study takes simulated data as the research object to discover common features of buildings, while some real buildings have specific energy consumption patterns. In such cases, the method presented in this study can be used for energy diagnosis of those buildings, and future studies need to be based on energy datasets observed from actual buildings.

To sum up, this study applied advanced artificial intelligence algorithms for time series data to building energy consumption through clustering and association analysis. As mentioned above, the proposed method combines clustering and association data mining to successfully obtain the energy information hidden in a large amount of data. In other words, energy consumption patterns are found through cluster analysis, and association rules account for the hourly variations in energy. The final results are as follows.

In summer, it is recommended to reduce the night-time cooling load by using outside air. In winter, given the active periods in the morning and at night, the small hotel could extend the opening hours of rooms such as the restaurant in order to move the peak toward the trough.

Lighting energy consumption has considerable saving potential, especially on cloudy days, as there is no need to turn on lights in daytime.

Shortening the morning heating time could significantly reduce the overall energy consumption level.

In general, hotel buildings should focus on heating gas consumption in the morning and evening in winter in order to improve energy efficiency. Around noon, as occupants become more active, consumption by systems such as the water system rises noticeably, increasing the total energy trend.

All association rules in summer show that total energy consumption depends mainly on cooling electricity consumption. In addition, depending on the time of day, energy-saving measures can be chosen according to the mined rules to ultimately save energy in hotel buildings.



Data Mining Case Studies & Benefits


  • Key Takeaways

Data mining has improved the decision-making process for over 80% of companies. (Source: Gartner).

Statista reports that global spending on robotic process automation (RPA) is projected to reach $98 billion by 2024, indicating a significant investment in automation technologies.

According to Grand View Research, the global data mining market will reach $16.9 billion in 2027.

Ethical Data Mining preserves individual rights and fosters trust.

A successful implementation requires defining clear goals, choosing data wisely, and constant adaptation.

Data mining case studies help businesses explore data for smart decision-making. It’s about finding valuable insights from big datasets. This is crucial for businesses in all industries as data guides strategic planning. By spotting patterns in data, businesses gain intelligence to innovate and stay competitive. Real examples show how data mining improves marketing and healthcare. Data mining isn’t just about analyzing data; it’s about using it wisely for meaningful changes.

The Importance of Data Mining for Modern Business:


Data mining has taken on a central role in modern business. Businesses today hold large volumes of data, and making informed decisions with it can be crucial to staying competitive. This article explores the many aspects of data mining and its impact on decisions.

  • Unraveling Data Landscape

Businesses generate a staggering amount of data, including customer interactions, market patterns, and internal operations. Decision-makers face an information overload without effective tools for sorting through all this data.

Data mining is a process that organizes and structures this vast amount of data and extracts patterns and insights from it. It acts as a compass to guide decision makers through the complex landscape of data.

  • Empowering Strategic Decision Making

Data mining is a powerful tool for strategic decision making. Businesses can predict future trends and market behavior by analyzing historical data. This insight allows businesses to better align their strategies with predicted shifts.

Data mining can provide the strategic insights required for successful decision making, whether it is launching a product, optimizing supply chain, or adjusting pricing strategies.

  • Customer-Centric Decision Making

Understanding and meeting the needs of customers is paramount in an era where customer-centricity reigns. Data mining is crucial in determining customer preferences, behaviors, and feedback.

This information allows businesses to customize products and services in order to meet the expectations of customers, increase satisfaction and build lasting relationships. With customer-centric insights, decision-makers can make choices that resonate with their target audiences and foster loyalty and brand advocacy.

Data Mining: Applications across industries

Data mining is transforming the way companies operate and make business decisions. This article explores the various applications of data-mining, highlighting case studies that illuminate its impact in the healthcare, retail, and finance sectors.

  • Healthcare Case Studies:


Data mining is a powerful tool in the healthcare industry. It can improve patient outcomes and treatment plans. Discover compelling case studies in which data mining played a crucial role in predicting patterns of disease, optimizing treatment and improving patient care. These examples, which range from early detection of health risks to personalized medicines, show the impact that data mining has had on the healthcare industry.


  • Retail Success stories:

Retail is at the forefront of leveraging data mining to enhance customer experiences and streamline operations. Discover success stories of how data mining empowered businesses to better understand consumer behavior, optimize their inventory management and create personalized marketing strategies.

These case studies, which range from e-commerce giants to brick-and-mortar shops, show how data mining can boost sales, improve customer satisfaction, and transform the retail landscape.

  • Financial Sector Examples:

Data mining is a valuable tool in the finance industry, where precision and risk assessment are key. Explore case studies that demonstrate how data mining can be used for fraud detection and risk assessment. These examples demonstrate how financial institutions use data mining to make better decisions, protect against fraud, and customize services to their clients’ needs.

  • Data Mining and Education:

Beyond healthcare, retail and finance, data mining has been used in the education sector to enhance learning. Learn how educational institutions use data mining to optimize learning outcomes, analyze student performance and personalize materials. These examples, ranging from adaptive learning platforms to predictive analytics, demonstrate the potential for data mining to revolutionize how we approach education.

  • Manufacturing efficiency:


Data mining is a powerful tool for streamlining manufacturing processes. Examine case studies that demonstrate how data mining can be used to improve supply chain management, predict maintenance requirements, and increase overall operational efficiency. These examples show how data-driven insights can lead to cost savings, increased productivity, and a competitive advantage in manufacturing.

Data mining is a key component in each of these applications. It unlocks insights, streamlines operations, and shapes the future of decisions. Data mining is transforming the landscapes of many industries, including healthcare, retail, education, finance, and manufacturing.

Data Mining Techniques

Data mining techniques help businesses gain an edge by extracting valuable insights and information from large datasets. This exploration will provide an overview of the most popular data mining methods, and back each one with insightful case studies.

  • Popular Data Mining Techniques

Clustering Analysis

The clustering technique involves grouping data points based on a set of criteria. This method is useful for detecting patterns in data sets and can be used to segment customers, detect anomalies, or recognize patterns. The case studies will show how clustering can be used to improve marketing strategies, streamline products, and increase overall operational efficiency.
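To make the grouping idea concrete, here is a minimal k-means sketch for customer segmentation. The customer figures are invented, and the choice of k-means with a naive "first k points" start is an assumption for illustration; production code would normally rely on a library implementation with better initialization.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic start
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a group is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return labels, centroids


# Hypothetical customers: (orders per year, average basket value)
customers = [(2, 20), (3, 25), (2, 22), (30, 200), (28, 210), (32, 190)]
labels, centroids = k_means(customers, k=2)
```

With this toy data the occasional low-spend shoppers and the frequent high-spend shoppers end up in separate segments, which is the kind of split a marketing team might then target differently.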

Association Rule Mining

Association rule mining reveals relationships between variables within large datasets. Market basket analysis, a common application, identifies patterns in products that co-occur in transactions. Real-world examples show how association rule mining is used in retail to improve product placements, increase sales, and enhance the customer experience.
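One common way to judge whether two products really go together is lift, which compares how often they are bought together with what would be expected if purchases were independent. The sketch below is a simplified illustration with made-up baskets; the function name `pair_lift` is hypothetical and only product pairs are considered.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Lift for every product pair that appears together at least once."""
    n = len(baskets)
    item_count = Counter(item for b in baskets for item in set(b))
    pair_count = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    return {
        pair: (count / n)
        / ((item_count[pair[0]] / n) * (item_count[pair[1]] / n))
        for pair, count in pair_count.items()
    }


# Hypothetical shopping baskets
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["milk"],
]
lift = pair_lift(baskets)
```

A lift above 1 (bread and butter here) suggests the products are bought together more often than chance would predict, while a lift below 1 (bread and milk) suggests the opposite.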

Decision Tree Analysis

The decision tree is a visual representation of the process of making decisions. This technique is a powerful tool for classification tasks. It helps businesses make decisions using a set of criteria. Through case studies, you will learn how decision tree analyses have been used for disease diagnosis in healthcare, for fraud detection, and for predictive maintenance in manufacturing.
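A decision tree is built by repeatedly choosing the question that best separates the classes. The sketch below shows a single such choice, using Gini impurity on one numeric feature. The transaction amounts and fraud flags are invented, and a real tree would repeat this step recursively over many features.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Threshold on one numeric feature with the lowest weighted impurity.

    Returns (threshold, impurity), or None if all values are identical.
    """
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical transactions: amount, and whether each was flagged as fraud
amounts = [20, 35, 50, 400, 650, 900]
is_fraud = [False, False, False, True, True, True]
threshold, impurity = best_split(amounts, is_fraud)
```

In this toy case the rule "amount greater than 50" separates the two classes perfectly, so the weighted impurity of the split is zero.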

Regression Analysis

Regression analysis is a way to explore the relationship between variables. This allows businesses to predict and understand how one variable affects another. Discover case studies that demonstrate how regression analysis is used to predict customer behavior, forecast sales trends, and optimize pricing strategies.
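For a single predictor, the simplest form of regression is a straight line fitted by least squares. The sketch below uses hypothetical price and sales figures to show how the fitted line supports a prediction; real pricing models would usually include more variables and a check of how well the line fits.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: unit price and units sold at that price
prices = [10, 12, 14, 16, 18]
units = [100, 90, 80, 70, 60]
slope, intercept = fit_line(prices, units)
predicted_units = slope * 15 + intercept  # forecast for an untested price
```

The negative slope says that, in this made-up data, each extra unit of price costs about five units of sales, which is the kind of relationship a pricing decision would weigh.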

Benefits and ROI:

Businesses are increasingly realizing the benefits of data mining in the current dynamic environment. The benefits are numerous and tangible, ranging from improved decision-making to increased operational efficiency. We’ll explore these benefits, and how businesses can leverage data mining to achieve significant gains.

  • Enhancing Decision Making

Data mining provides businesses with actionable insight derived from massive datasets. Analyzing patterns and trends allows organizations to make more informed decisions. This reduces uncertainty and increases the chances of success. There are many case studies that show how data mining has transformed the decision-making process of businesses in various sectors.

  • Operational Efficiency

Data mining is essential to achieving efficiency, which is the cornerstone of any successful business. Organizations can improve their efficiency by optimizing processes, identifying bottlenecks, and streamlining operations. These real-world examples show how businesses have made remarkable improvements in their operations, leading to savings and resource optimization.

  • Personalized Customer Experiences

Data mining has the ability to customize experiences for customers. Businesses can increase customer satisfaction and loyalty by analyzing the behavior and preferences of their customers. Discover case studies that show how data mining has been used to create engaging and personalized customer journeys.

  • Competitive Advantage

Gaining a competitive advantage is essential in today’s highly competitive environment. Data mining gives businesses insights into the market, competitor strategies, and customer expectations. These insights can give organizations a competitive edge and help them achieve success. Look at case studies that show how companies have outperformed their competitors by using data mining.

Calculating ROI and Benefits

To justify investments, businesses must also quantify their return on investment. Calculating ROI for data mining initiatives requires a thorough analysis of the costs, benefits, and long-term impacts. Let’s examine the complexities of ROI within the context of data-mining.

  • Cost-Benefit Analysis

Prior to focusing on ROI, companies must perform a cost-benefit assessment of their data mining projects. It involves comparing the costs associated with implementing data-mining tools, training staff, and maintaining infrastructure to the benefits anticipated, such as higher revenue, cost savings and better decision-making. Case studies from real-world situations provide insight into cost-benefit analysis.
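The basic ROI arithmetic is net benefit divided by cost. The sketch below applies it to invented figures; the cost and benefit categories simply mirror the ones named above and are not drawn from any real project.

```python
def roi(benefits, costs):
    """Return on investment as a fraction: (benefits - costs) / costs."""
    total_benefit = sum(benefits.values())
    total_cost = sum(costs.values())
    return (total_benefit - total_cost) / total_cost


# Hypothetical first-year figures for a data mining project
costs = {"tools": 40_000, "training": 15_000, "infrastructure": 25_000}
benefits = {"extra_revenue": 90_000, "cost_savings": 30_000}
result = roi(benefits, costs)  # 0.5, i.e. a 50% return
```

Intangible benefits do not fit neatly into a calculation like this, so they are usually reported alongside the figure rather than folded into it.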

  • Quantifying Tangible and intangible benefits

Data mining initiatives can yield tangible and intangible benefits. Quantifying tangible benefits such as an increase in sales or a reduction in operational costs is easier. Intangible benefits such as improved brand reputation or customer satisfaction are also important, but they may require a nuanced measurement approach. Examine case studies that quantify both types.

  • Long-term Impact Assessment

ROI calculations should not be restricted to immediate gains. Businesses need to assess the impact their data mining projects will have in the future. Consider factors like sustainability, scalability, and ongoing benefits. Case studies that demonstrate the success of data-mining strategies over time can provide valuable insight into long-term impact assessment.

  • Key Performance Indicators for ROI

Businesses must establish KPIs that are aligned with their goals in order to measure ROI. KPIs can be used to evaluate the success of data-mining initiatives, whether it is tracking sales growth, customer satisfaction rates, or operational efficiency. Explore case studies to learn how to select and monitor KPIs strategically for ROI measurement.

Data Mining Ethics

Data mining is a field where ethical considerations are crucial to ensuring transparent and responsible practices. It is important to carefully navigate the ethical landscape as organizations use data to extract valuable insights. This section examines ethical issues in data mining and highlights cases that demonstrate ethical practices.

  • Understanding Ethical Considerations

Data mining ethics revolves around privacy, consent, and responsible information use. Businesses are faced with the question of how they use and collect data. Ethics also includes the biases in data and the fairness of algorithms.

  • Balance Innovation and Privacy

Finding the right balance between privacy and innovation is a major ethical issue in data mining. To gain a market edge through data insights, organizations must walk a tightrope between innovation and privacy. Case studies will illuminate how companies have successfully balanced the two.

  • Transparency and informed consent

Transparency in the processes is another important aspect of ethical data mining. It ensures that individuals are informed and have given consent before their data is used. This subtopic explores the importance of transparency in data collection and processing, with case studies that highlight instances where organizations have established exemplary standards for obtaining informed consent.

Exploring Data Mining Ethics is crucial as data usage evolves. Businesses must balance innovation, privacy, and transparency while gaining informed consent. Real-world cases show how ethical data mining protects privacy and builds trust.

Implementing Data Mining is complex yet rewarding. This guide helps set goals, choose data sources, and use algorithms effectively. Challenges like data security and resistance to change are common but manageable.

Considering ethics while implementing data mining shows responsibility and opens new opportunities. Organizations prioritizing ethical practices become industry leaders, mitigating risks and achieving positive impacts on business, society, and technology. Ethics and implementation synergize in data mining, unlocking its true potential.

  • Q. What ethical considerations are important in data mining?

Privacy and consent are important ethical considerations for data mining.

  • Q. How can companies avoid common pitfalls when implementing data mining?

By ensuring the security of data, addressing cultural opposition, and encouraging continuous learning and adaptation.

  • Q. Why is transparency important in data mining?

Transparency and consent to use collected data ethically are key elements of building trust.

  • Q. What are the main steps to implement data mining in businesses?

Define your objectives, select data sources, select algorithms and monitor continuously.

  • Q. How can successful organizations use data mining to gain a strategic advantage?

By making informed decisions, improving operations, and staying ahead of the competition.


  • Open access
  • Published: 03 March 2022

Educational data mining: prediction of students' academic performance using machine learning algorithms

  • Mustafa Yağcı   ORCID: orcid.org/0000-0003-2911-3909 1  

Smart Learning Environments volume 9, Article number: 11 (2022)


Educational data mining has become an effective tool for exploring the hidden relationships in educational data and predicting students' academic achievements. This study proposes a new model based on machine learning algorithms to predict the final exam grades of undergraduate students, taking their midterm exam grades as the source data. The performances of the random forests, neural networks, support vector machines, logistic regression, Naïve Bayes, and k-nearest neighbour algorithms were calculated and compared to predict the final exam grades of the students. The dataset consisted of the academic achievement grades of 1854 students who took the Turkish Language-I course at a state university in Turkey during the fall semester of 2019–2020. The results show that the proposed model achieved a classification accuracy of 70–75%. The predictions were made using only three types of parameters: midterm exam grades, department data, and faculty data. Such data-driven studies are very important for establishing a learning analytics framework in higher education and for contributing to decision-making processes. Finally, this study contributes to the early prediction of students at high risk of failure and determines the most effective machine learning methods.

Introduction

The application of data mining methods in the field of education has attracted great attention in recent years. Data mining (DM) is the field of discovering new and potentially useful information or meaningful results from big data (Witten et al., 2011). It also aims to obtain new trends and new patterns from large datasets by using different classification algorithms (Baker & Inventado, 2014).

Educational data mining (EDM) is the use of traditional DM methods to solve problems related to education (Baker & Yacef, 2009 ; cited in Fernandes et al., 2019 ). EDM is the use of DM methods on educational data such as student information, educational records, exam results, student participation in class, and the frequency of students' asking questions. In recent years, EDM has become an effective tool used to identify hidden patterns in educational data, predict academic achievement, and improve the learning/teaching environment.

Learning analytics has gained a new dimension through the use of EDM (Waheed et al., 2020 ). Learning analytics covers the various aspects of collecting student information together, better understanding the learning environment by examining and analysing it, and revealing the best student/teacher performance (Long & Siemens, 2011 ). Learning analytics is the compilation, measurement and reporting of data about students and their contexts in order to understand and optimize learning and the environments in which it takes place. It also deals with the institutions developing new strategies.

Another dimension of learning analytics is predicting student academic performance, uncovering patterns of system access and navigational actions, and determining students who are potentially at risk of failing (Waheed et al., 2020 ). Learning management systems (LMS), student information systems (SIS), intelligent teaching systems (ITS), MOOCs, and other web-based education systems leave digital data that can be examined to evaluate students' possible behavior. Using EDM method, these data can be employed to analyse the activities of successful students and those who are at risk of failure, to develop corrective strategies based on student academic performance, and therefore to assist educators in the development of pedagogical methods (Casquero et al., 2016 ; Fidalgo-Blanco et al., 2015 ).

The data collected on educational processes offer new opportunities to improve the learning experience and to optimize users' interaction with technological platforms (Shorfuzzaman et al., 2019 ). The processing of educational data yields improvements in many areas such as predicting student behaviour, analytical learning, and new approaches to education policies (Capuano & Toti, 2019 ; Viberg et al., 2018 ). This comprehensive collection of data will not only allow education authorities to make data-based policies, but also form the basis of software to be developed with artificial intelligence on the learning process.

EDM enables educators to predict situations such as dropping out of school or declining interest in a course, analyse internal factors affecting student performance, and apply statistical techniques to predict students' academic performance. A variety of DM methods are employed to predict student performance and to identify slow learners and dropouts (Hardman et al., 2013; Kaur et al., 2015). Early prediction is a new phenomenon that includes assessment methods to support students by proposing appropriate corrective strategies and policies in this field (Waheed et al., 2020).

Especially during the pandemic period, learning management systems, quickly put into practice, have become an indispensable part of higher education. As students use these systems, the log records they produce have become ever more accessible (Macfadyen & Dawson, 2010; Kotsiantis et al., 2013; Saqr et al., 2017). Universities should now improve their capacity to use these data to predict academic success and ensure student progress (Bernacki et al., 2020).

As a result, EDM provides the educators with new information by discovering hidden patterns in educational data. Using this model, some aspects of the education system can be evaluated and improved to ensure the quality of education.

In various studies on EDM, e-learning systems have been successfully analysed (Lara et al., 2014 ). Some studies have also classified educational data (Chakraborty et al., 2016 ), while some have tried to predict student performance (Fernandes et al., 2019 ).

Asif et al. (2017) focused on two aspects of the performance of undergraduate students using DM methods. The first aspect is to predict the academic achievements of students at the end of a four-year study program. The second is to examine the development of students and combine it with predictive results. They divided the students into low-achievement and high-achievement groups. They found that it is important for educators to focus on a small number of courses indicating particularly good or poor performance in order to offer timely warnings, support underperforming students, and offer advice and opportunities to high-performing students. Cruz-Jesus et al. (2020) predicted student academic performance with 16 demographic features such as age, gender, class attendance, internet access, computer possession, and the number of courses taken. Random forest, logistic regression, k-nearest neighbours, and support vector machines were able to predict students' performance with accuracy ranging from 50 to 81%.

Fernandes et al. (2019) developed a model with the demographic characteristics of the students and the achievement grades obtained from the in-term activities. In that study, students' academic achievement was predicted with classification models based on Gradient Boosting Machine (GBM). The results showed that the best attributes for estimating achievement scores were the previous year's achievement scores and absenteeism. The authors found that demographic characteristics such as neighbourhood, school, and age information were also potential indicators of success or failure. In addition, they argued that this model could guide the development of new policies to prevent failure. Similarly, by using the student data requested during registration and environmental factors, Hoffait and Schyns (2017) identified the students with the potential to fail. They found that students with potential difficulties could be classified more precisely by using DM methods. Moreover, their approach makes it possible to rank the students by levels of risk. Rebai et al. (2020) proposed a machine learning-based model to identify the key factors affecting the academic performance of schools and to determine the relationship between these factors. Their regression trees showed that the most important factors associated with higher performance were school size, competition, class size, parental pressure, and gender proportions. In addition, according to the random forest algorithm results, the school size and the percentage of girls had a powerful impact on the predictive accuracy of the model.

Ahmad and Shahzadi (2018) proposed a machine learning-based model to answer the question of whether students were at risk regarding their academic performance. Using the students' learning skills, study habits, and academic interaction features, they made a prediction with a classification accuracy of 85%. The researchers concluded that the model they proposed could be used to identify academically unsuccessful students. Musso et al. (2020) proposed a machine learning model based on learning strategies, perception of social support, motivation, socio-demographics, health condition, and academic performance characteristics. With this model, they predicted academic performance and dropouts. They concluded that the predictive variable with the highest effect on predicting GPA was learning strategies, while the variable with the greatest effect on determining dropouts was background information.

Waheed et al. (2020) designed a model with artificial neural networks on students' records related to their navigation through the LMS. The results showed that demographics and student clickstream activities had a significant impact on student performance. Students who navigated through courses performed higher. Students' participation in the learning environment had nothing to do with their performance. However, they concluded that the deep learning model could be an important tool in the early prediction of student performance. Xu et al. (2019) determined the relationship between the internet usage behaviors of university students and their academic performance, and predicted students' performance with machine learning methods. The model they proposed predicted students' academic performance at a high level of accuracy. The results suggested that Internet connection frequency features were positively correlated with academic performance, whereas Internet traffic volume features were negatively correlated with academic performance. In addition, they concluded that internet usage features had an important role in students' academic performance. Bernacki et al. (2020) tried to find out whether the log records in the learning management system alone would be sufficient to predict achievement. They concluded that the behaviour-based prediction model successfully predicted 75% of those who would need to repeat a course. They also stated that, with this model, students who might be unsuccessful in the subsequent semesters could be identified and supported. Burgos et al. (2018) predicted the achievement grades that the students might get in the subsequent semesters and designed a tool for students who were likely to fail. They found that the number of unsuccessful students decreased by 14% compared to previous years. A comparative analysis of studies predicting academic achievement grades using machine learning methods is given in Table 1.

A review of previous research that aimed to predict academic achievement indicates that researchers have applied a range of machine learning algorithms, including multiple, probit and logistic regression, neural networks, and C4.5 and J48 decision trees. However, random forests (Zabriskie et al., 2019 ), genetic programming (Xing et al., 2015 ), and Naive Bayes algorithms (Ornelas & Ordonez, 2017 ) were used in recent studies. The prediction accuracy of these models reaches very high levels.

Accurate prediction of student academic performance requires a deep understanding of the factors and features that impact student results and student achievement (Alshanqiti & Namoun, 2020). For this purpose, Hellas et al. (2018) reviewed 357 articles on student performance detailing the impact of 29 features. These features were mainly related to psychomotor skills such as course and pre-course performance, student participation, student demographics such as gender, high school performance, and self-regulation. However, the dropout rates were mainly influenced by student motivation, habits, social and financial issues, lack of progress, and career transitions.

The literature review suggests that it is necessary to improve the quality of education by predicting the academic performance of students and supporting those who are in the risk group. In the literature, academic performance has been predicted with many different variables: various digital traces left by students on the internet (browsing, lesson time, percentage of participation) (Fernandes et al., 2019; Rubin et al., 2010; Waheed et al., 2020; Xu et al., 2019); students' demographic characteristics (gender, age, economic status, number of courses attended, internet access, etc.) (Bernacki et al., 2020; Rizvi et al., 2019; García-González & Skrita, 2019; Rebai et al., 2020; Cruz-Jesus et al., 2020; Aydemir, 2017); learning skills, study approaches, and study habits (Ahmad & Shahzadi, 2018); learning strategies, social support perception, motivation, socio-demography, health form, and academic performance characteristics (Costa-Mendes et al., 2020; Gök, 2017; Kılınç, 2015; Musso et al., 2020); and homework, projects, and quizzes (Kardaş & Güvenir, 2020). In almost all models developed in such studies, prediction accuracy ranges from 70 to 95%. However, collecting and processing such a variety of data both takes a lot of time and requires expert knowledge. Similarly, Hoffait and Schyns (2017) suggested that collecting so much data is difficult and that socio-economic data are unnecessary. Moreover, these demographic or socio-economic data may not always give the right idea of how to prevent failure (Bernacki et al., 2020).

This study concerns predicting students' academic achievement using grades only, with no demographic characteristics and no socio-economic data. It aimed to develop a new model based on machine learning algorithms to predict the final exam grades of undergraduate students from their midterm exam grades, faculty, and department.

For this purpose, classification algorithms with the highest performance in predicting students’ academic achievement were determined by using machine learning classification algorithms. The reason for choosing the Turkish Language-I course was that it is a compulsory course that all students enrolled in the university must take. Using this model, students’ final exam grades were predicted. These models will enable the development of pedagogical interventions and new policies to improve students' academic performance. In this way, the number of potentially unsuccessful students can be reduced following the assessments made after each midterm.

This section describes the details of the dataset, pre-processing techniques, and machine learning algorithms employed in this study.

Educational institutions regularly store all available data about students in electronic form. Data are stored in databases for processing. These data can be of many types and volumes, from students' demographics to their academic achievements. In this study, the data were taken from the Student Information System (SIS), where all student records are stored, at a state university in Turkey. From these records, the midterm exam grades, final exam grades, Faculty, and Department of 1854 students who took the Turkish Language-I course in the 2019–2020 fall semester were selected as the dataset. Table 2 shows the distribution of students according to the academic unit. The dataset is also provided as Additional file 1.

Midterm and final exam grades range from 0 to 100. In this system, the end-of-semester achievement grade is calculated by taking 40% of the midterm exam and 60% of the final exam. Students with an achievement grade below 60 are unsuccessful and those above 60 are successful. The midterm exam is usually held in the middle of the academic semester and the final exam is held at the end of the semester. There are approximately 9 weeks (2.5 months) from the midterm exam to the final exam. In other words, the final exam predictions leave a two-and-a-half-month period for corrective actions for students who are at risk of failing. The study therefore investigated how strongly a student's performance in the middle of the semester affects performance at the end of the semester.
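The grading rule above can be written out as a short sketch. The function names are hypothetical, and treating a grade of exactly 60 as successful is an assumption, since the text only specifies "below 60" and "above 60".

```python
def achievement_grade(midterm: float, final: float) -> float:
    """End-of-semester grade: 40% of the midterm plus 60% of the final."""
    return 0.4 * midterm + 0.6 * final


def is_successful(midterm: float, final: float, threshold: float = 60.0) -> bool:
    # Assumption: a grade of exactly 60 counts as successful.
    return achievement_grade(midterm, final) >= threshold


def final_needed_to_pass(midterm: float, threshold: float = 60.0) -> float:
    """Lowest final exam grade that still reaches the pass threshold."""
    needed = (threshold - 0.4 * midterm) / 0.6
    return max(0.0, needed)  # a very high midterm cannot require a negative final


if __name__ == "__main__":
    print(achievement_grade(50, 70))
    print(is_successful(80, 40))
    print(final_needed_to_pass(50))
```

Because the final exam carries 60% of the weight, a student with a midterm of 50 still needs roughly 67 on the final, which is the kind of gap the 2.5-month intervention window is meant to address.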

Data identification and collection

At this phase, it is determined from which source the data will be obtained, which features of the data will be used, and whether the collected data is suitable for the purpose. Feature selection involves decreasing the number of variables used to predict a particular outcome. The goal is to facilitate the interpretability of the model, reduce complexity, increase the computational efficiency of algorithms, and avoid overfitting.

Establishing DM model and implementation of algorithm

RF, NN, LR, SVM, NB and kNN were employed to predict students' academic performance. The prediction accuracy was evaluated using tenfold cross validation. The DM process serves two main purposes. The first purpose is to make predictions by analyzing the data in the database (predictive model). The second one is to describe behaviors (descriptive model). In predictive models, a model is created by using data with known results. Then, using this model, the result values are predicted for datasets whose results are unknown. In descriptive models, the patterns in the existing data are defined to make decisions.
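As an illustration of tenfold cross validation, the following standard-library sketch splits the data into ten folds, trains on nine, and tests on the held-out fold. It is not the Orange workflow used in the study: the lookup "classifier" is a deliberately simple stand-in for RF, NN, LR, SVM, NB, and kNN, and all function names and sample rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_lookup(rows):
    """Toy predictive model: most common final category per midterm category."""
    counts = defaultdict(Counter)
    for midterm_cat, final_cat in rows:
        counts[midterm_cat][final_cat] += 1
    overall = Counter(f for _, f in rows).most_common(1)[0][0]
    table = {m: c.most_common(1)[0][0] for m, c in counts.items()}
    # Fall back to the overall majority class for unseen midterm categories.
    return lambda m: table.get(m, overall)


def cross_val_accuracy(rows, k=10, seed=0):
    """Pooled accuracy: every row is predicted once, by a model that never saw it."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        predict = fit_lookup(train)
        correct += sum(predict(rows[i][0]) == rows[i][1] for i in test_idx)
    return correct / len(rows)


if __name__ == "__main__":
    # Hypothetical (midterm category, final category) pairs.
    sample = [("high", "high"), ("high", "mid"), ("mid", "mid"), ("low", "low")] * 10
    print(cross_val_accuracy(sample, k=10))
```

The key property is that each instance is scored by a model trained without it, so the reported accuracy estimates performance on students whose results are unknown.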

When the focus is on analysing the causes of success or failure, statistical methods such as logistic regression and time series can be employed (Ortiz & Dehon, 2008; Arias Ortiz & Dehon, 2013). However, when the focus is on forecasting, neural networks (Delen, 2010; Vandamme et al., 2007), support vector machines (Huang & Fang, 2013), decision trees (Delen, 2011; Nandeshwar et al., 2011), and random forests (Delen, 2010; Vandamme et al., 2007) are more efficient and give more accurate results. Statistical techniques aim to create a model that can successfully predict output values based on available input data. On the other hand, machine learning methods automatically create a model that matches the input data with the expected target values when a supervised optimization problem is given.

The performance of the model was measured by confusion matrix indicators. It is understood from the literature that there is no single classifier that works best for prediction results. Therefore, it is necessary to investigate which classifiers are better suited to the analysed data (Asif et al., 2017).

Experiments and results

The entire experimental phase was performed with Orange machine learning software. Orange is a powerful and easy-to-use component-based DM programming tool for expert data scientists as well as for data science beginners. In Orange, data analysis is done by stacking widgets into workflows. Each widget performs a data retrieval, data pre-processing, visualization, modelling, or evaluation task. A workflow is a series of actions performed on the platform to carry out a specific task. Comprehensive data analysis charts can be created by combining different components in a workflow. Figure 1 shows the workflow diagram designed.

Figure 1. The workflow of the designed model

The dataset included midterm exam grades, final exam grades, Faculty, and Department of 1854 students taking the Turkish Language-I course in the 2019–2020 Fall Semester. The entire dataset is provided as Additional file 1 . Table 3 shows part of the dataset.

In the dataset, students' midterm exam grades, final exam grades, faculty, and department information were determined as features. Each measure contains data associated with a student. Midterm exam and final exam grade variables were explained under the heading "dataset". The faculty variable represents Faculties in Kırşehir Ahi Evran University and the department variable represents departments in faculties. In the development of the model, the midterm, the faculty, and the department information were determined as the independent variable and the final was determined as the dependent variable. Table 4 shows the variable model.

After the variable model was determined, the midterm exam grades and final exam grades were categorized according to the equal-width discretization model. Table 5 shows the criteria used in converting midterm exam grades and final exam grades into the categorical format.
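Equal-width discretization can be sketched as follows. The grade range of 10 to 100 is an assumption, chosen only because four equal bins over that range give the cut points 32.5, 55, and 77.5 reported with the confusion matrices; the actual criteria are those in Table 5. Function names are hypothetical, and assigning a value that falls exactly on a cut point to the upper bin is also an assumption.

```python
def equal_width_edges(lo: float, hi: float, bins: int) -> list:
    """Interior cut points that split [lo, hi] into `bins` equal-width intervals."""
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def discretize(value: float, edges: list) -> int:
    """0-based bin index; a value exactly on a cut point goes to the upper bin."""
    return sum(value >= e for e in edges)


if __name__ == "__main__":
    # Assumed grade range of 10 to 100 split into four categories.
    edges = equal_width_edges(10, 100, 4)
    print(edges)
    for grade in (20, 48, 60, 91):
        print(grade, "-> category", discretize(grade, edges))
```

Converting both midterm and final grades this way turns the task into a four-class classification problem, which is why the later confusion matrices are 4 × 4.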

In Table 6 , the values in the final column are the actual values. The values in the RF, SVM, LR, KNN, NB, and NN columns are the values predicted by the proposed model. For example, according to Table 5 , std1’s actual final grade was in the range 55 to 77. While the predicted value of the RF, SVM, LR, NB, and NN models were in the range of, the predicted value of the kNN model was greater than 77.

Evaluation of the model performance

The performance of the model was evaluated with the confusion matrix, classification accuracy (CA), precision, recall, F-score (F1), and area under the ROC curve (AUC) metrics.

Confusion matrix

The confusion matrix shows the current situation in the dataset and the number of correct/incorrect predictions of the model. Table 7 shows the confusion matrix. The performance of the model is calculated by the number of correctly classified instances and incorrectly classified instances. The rows show the real numbers of the samples in the test set, and the columns represent the estimation of the model.

In Table 7, true positive (TP) and true negative (TN) show the number of correctly classified instances. False positive (FP) shows the number of instances predicted as 1 (positive) that should be in class 0 (negative). False negative (FN) shows the number of instances predicted as 0 (negative) that should be in class 1 (positive).

Table 8 shows the confusion matrix for the RF algorithm. In the 4 × 4 confusion matrix, the main diagonal shows the percentage of correctly predicted instances, and the off-diagonal elements show the percentage of incorrectly predicted instances.

Table 8 shows that 84.9% of those with an actual final grade greater than 77.5, 71.2% of those in the range 55–77.5, 65.4% of those in the range 32.5–55, and 60% of those with less than 32.5 were predicted correctly. The confusion matrices of the other algorithms are shown in Tables 9, 10, 11, 12, and 13.
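A percentage-based confusion matrix of this kind can be built by counting (actual, predicted) pairs and then dividing each row by the number of instances in that actual class. The sketch below assumes this row-wise normalization, which is consistent with the per-class percentages quoted above; the labels and data are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Counts with rows = actual class and columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def row_percentages(matrix):
    """Convert each row to percentages of that row's total (actual class size)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result


if __name__ == "__main__":
    labels = ["low", "high"]  # hypothetical two-class example
    actual = ["low", "low", "high", "high", "high", "high"]
    predicted = ["low", "high", "high", "high", "high", "low"]
    for label, row in zip(labels, row_percentages(confusion_matrix(actual, predicted, labels))):
        print(label, row)
```

With this normalization, each diagonal entry is the share of a class that was predicted correctly, so the diagonal can be read directly as per-class recall.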

Classification accuracy:  CA is the ratio of the correct predictions (TP + TN) to the total number of instances (TP + TN + FP + FN).

Precision: Precision is the ratio of the number of positive instances that are correctly classified to the total number of instances that are predicted positive. It takes a value in the range [0, 1].

Recall: Recall is the ratio of the number of correctly classified positive instances to the number of all instances whose actual class is positive. Recall is also called the true positive rate. It takes a value in the range [0, 1].

F-Criterion (F1): There is a trade-off between precision and recall. Therefore, the harmonic mean of the two criteria is calculated for more accurate and sensitive results. This is called the F-criterion.
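The four definitions above translate directly into code for the binary case. This is a minimal sketch with a hypothetical function name and hypothetical counts; how the study averaged these metrics across its four grade categories is not stated here, so the example stays with a single positive class.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """CA, precision, recall, and F1 from binary confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {"CA": accuracy, "precision": precision, "recall": recall, "F1": f1}


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study's tables.
    for name, value in classification_metrics(tp=40, tn=30, fp=10, fn=20).items():
        print(f"{name}: {value:.3f}")
```

The harmonic mean is pulled toward the smaller of precision and recall, so a model cannot reach a high F1 by maximizing only one of the two.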

Receiver operating characteristics (ROC) curve

The AUC-ROC curve is used to evaluate the performance of a classification model. AUC-ROC is a widely used metric for evaluating machine learning algorithms, especially on unbalanced datasets, and indicates how well the model predicts.

AUC: Area under the ROC curve

The larger the area covered, the better the machine learning algorithm is at distinguishing the given classes. The ideal AUC value is 1. The AUC, classification accuracy (CA), F-criterion (F1), precision, and recall values of the models are shown in Table 14.
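One way to see what AUC measures is its ranking interpretation: for a binary problem, it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes AUC this way for a hypothetical binary example; it is not how Orange computes the multi-class values in Table 14.

```python
def auc(scores, labels):
    """Binary AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical classifier scores and true labels (1 = positive class).
    print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to random ranking and 1 to perfect separation, which is why values above 0.8 indicate that the classes are distinguished well.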

The AUC values of the RF, NN, SVM, LR, NB, and kNN algorithms were 0.860, 0.863, 0.804, 0.826, 0.810, and 0.810, respectively. The classification accuracies of the RF, NN, SVM, LR, NB, and kNN algorithms were 0.746, 0.746, 0.735, 0.717, 0.713, and 0.699, respectively. According to these findings, for example, the RF algorithm achieved 74.6% accuracy. In other words, the predicted data agreed with the actual data at a high level: 74.6% of the samples were classified correctly.

Discussion and conclusion

This study proposes a new model based on machine learning algorithms to predict the final exam grades of undergraduate students, taking their midterm exam grades as the source data. The performances of the random forests, neural networks, support vector machines, logistic regression, Naïve Bayes, and k-nearest neighbour algorithms were calculated and compared to predict the final exam grades of the students. This study focused on two parameters. The first was the prediction of academic performance based on previous achievement grades. The second was the comparison of performance indicators of machine learning algorithms.

The results show that the proposed model achieved a classification accuracy of 70–75%. According to this result, students' midterm exam grades are an important predictor of their final exam grades. RF, NN, SVM, LR, NB, and kNN are algorithms with a very high accuracy rate that can be used to predict students' final exam grades. Furthermore, the predictions were made using only three types of parameters: midterm exam grades, department data, and faculty data. The results of this study were compared with the studies that predicted the academic achievement grades of students with various demographic and socio-economic variables. Hoffait and Schyns (2017) proposed a model that uses the academic achievement of students in previous years. With this model, they predicted whether students would be successful in the courses they would take in the new semester. They found that 12.2% of the students had a very high risk of failure, with a 90% confidence rate. Waheed et al. (2020) predicted the achievement of students with demographic and geographic characteristics. They found that these characteristics have a significant effect on students' academic performance, and they predicted the failure or success of the students with 85% accuracy. Xu et al. (2019) found that internet usage data can distinguish and predict students' academic performance. Costa-Mendes et al. (2020) and Cruz-Jesus et al. (2020) predicted the academic achievement of students in the light of income, age, employment, cultural level indicators, place of residence, and socio-economic information. Similarly, Babić (2017) predicted students' performance with an accuracy of 65% to 100% with artificial neural networks, classification trees, and support vector machines.

Another result of this study was that the RF, NN, and SVM algorithms had the highest classification accuracy, while kNN had the lowest. According to this result, the RF, NN, and SVM algorithms give more accurate results in predicting the academic achievement grades of students. The results were compared with those of studies in which machine learning algorithms were employed to predict academic performance according to various variables. For example, Hoffait and Schyns (2017) compared the performances of the LR, ANN, and RF algorithms in identifying students at high risk of academic failure based on their various demographic characteristics. They ranked the algorithms from the highest accuracy to the lowest as LR, ANN, and RF. On the other hand, Waheed et al. (2020) found that the SVM algorithm performed better than the LR algorithm. According to Xu et al. (2019), the algorithm with the highest performance is SVM, followed by the NN algorithm, and the decision tree is the algorithm with the lowest performance.

The proposed model predicted the final exam grades of students with 73% accuracy. This result suggests that the model can be used to predict academic achievement in the future. Predicting students' achievement grades in advance gives students the chance to review their working methods and improve their performance. The importance of the proposed method is clearer when one considers that there are approximately 2.5 months between the midterm exams and the final exams in higher education. Similarly, Bernacki et al. (2020) worked on an early warning model. They proposed a model that predicts students' academic achievement from their behavior data in the learning management system before the first exam. Their algorithm correctly identified 75% of students who failed to earn the grade of B or better needed to advance to the next course. Ahmad and Shahzadi (2018) predicted which students were at academic risk with 85% accuracy by evaluating their study habits, learning skills, and academic interaction features. Cruz-Jesus et al. (2020) predicted students' end-of-semester grades with 16 independent variables. They concluded that students could be given the opportunity of early intervention.
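The figures quoted above mix two different measures: overall accuracy and the share of failing students correctly identified (recall). A minimal sketch, using hypothetical labels rather than any study's data, shows how the two are computed from the same confusion counts and why they can differ.

```python
def confusion_counts(y_true, y_pred, positive="at_risk"):
    """Count true/false positives and negatives for one positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all students classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Share of truly at-risk students that the model flagged."""
    return tp / (tp + fn)


# Hypothetical cohort: 4 at-risk students out of 20 (illustrative only).
y_true = ["at_risk"] * 4 + ["ok"] * 16
# Hypothetical model output: flags 3 of the 4, plus 2 false alarms.
y_pred = ["at_risk"] * 3 + ["ok"] + ["at_risk"] * 2 + ["ok"] * 14

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
rec = recall(tp, fn)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

For an early warning system, recall on the at-risk class is usually the more relevant figure, because a missed at-risk student costs more than a false alarm.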

As a result, students' academic performance was predicted using different predictors, different algorithms, and different approaches. The results confirm that machine learning algorithms can be used to predict students' academic performance. More importantly, the prediction was made using only the midterm grade, faculty, and department parameters. Teaching staff can use the results of this research for the early recognition of students who have below-average or above-average academic motivation. Later, for example, as Babić (2017) points out, they can match students with below-average academic motivation with students with above-average academic motivation and encourage them to work in groups or on project work. In this way, students' motivation can be improved, and their active participation in learning can be ensured. In addition, such data-driven studies should assist higher education in establishing a learning analytics framework and contribute to decision-making processes.

Future research can be conducted by including other parameters as input variables and adding other machine learning algorithms to the modelling process. In addition, it is necessary to harness the effectiveness of DM methods to investigate students' learning behaviors, address their problems, optimize the educational environment, and enable data-driven decision making.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

  • Educational data mining
  • Random forests (RF)
  • Neural networks (NN)
  • Support vector machines (SVM)
  • Logistic regression (LR)
  • Naïve Bayes (NB)
  • K-nearest neighbour (kNN)
  • Decision trees
  • Artificial neural networks (ANN)
  • Extremely randomized trees
  • Regression trees
  • Multilayer perceptron neural network
  • Feed-forward neural network
  • Adaptive resonance theory mapping
  • Learning management systems
  • Student information systems
  • Intelligent teaching systems
  • Classification accuracy
  • Area under ROC curve
  • True positive
  • True negative
  • False positive
  • False negative
  • Receiver operating characteristics

Ahmad, Z., & Shahzadi, E. (2018). Prediction of students’ academic performance using artificial neural network. Bulletin of Education and Research, 40 (3), 157–164.

Alshanqiti, A., & Namoun, A. (2020). Predicting student performance and its influential factors using hybrid regression and multi-label classification. IEEE Access, 8 , 203827–203844. https://doi.org/10.1109/access.2020.3036572

Arias Ortiz, E., & Dehon, C. (2013). Roads to success in the Belgian French Community’s higher education system: predictors of dropout and degree completion at the Université Libre de Bruxelles. Research in Higher Education, 54 (6), 693–723. https://doi.org/10.1007/s11162-013-9290-y

Asif, R., Merceron, A., Ali, S. A., & Haider, N. G. (2017). Analyzing undergraduate students’ performance using educational data mining. Computers and Education, 113 , 177–194. https://doi.org/10.1016/j.compedu.2017.05.007

Aydemir, B. (2017). Predicting academic success of vocational high school students using data mining methods graduate . [Unpublished master’s thesis]. Pamukkale University Institute of Science.

Babić, I. D. (2017). Machine learning methods in predicting the student academic motivation. Croatian Operational Research Review, 8 (2), 443–461. https://doi.org/10.17535/crorr.2017.0028

Baker, R. S., & Inventado, P. S. (2014). Educational data mining and learning analytics. Learning analytics (pp. 61–75). Springer.

Baker, R. S., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1 (1), 3–17.

Bernacki, M. L., Chavez, M. M., & Uesbeck, P. M. (2020). Predicting achievement and providing support before STEM majors begin to fail. Computers & Education, 158 (August), 103999. https://doi.org/10.1016/j.compedu.2020.103999

Burgos, C., Campanario, M. L., De, D., Lara, J. A., Lizcano, D., & Martínez, M. A. (2018). Data mining for modeling students’ performance: A tutoring action plan to prevent academic dropout. Computers and Electrical Engineering, 66 (2018), 541–556. https://doi.org/10.1016/j.compeleceng.2017.03.005

Capuano, N., & Toti, D. (2019). Experimentation of a smart learning system for law based on knowledge discovery and cognitive computing. Computers in Human Behavior, 92 , 459–467. https://doi.org/10.1016/j.chb.2018.03.034

Casquero, O., Ovelar, R., Romo, J., Benito, M., & Alberdi, M. (2016). Students’ personal networks in virtual and personal learning environments: A case study in higher education using learning analytics approach. Interactive Learning Environments, 24 (1), 49–67. https://doi.org/10.1080/10494820.2013.817441

Chakraborty, B., Chakma, K., & Mukherjee, A. (2016). A density-based clustering algorithm and experiments on student dataset with noises using Rough set theory. In Proceedings of 2nd IEEE international conference on engineering and technology, ICETECH 2016 , March (pp. 431–436). https://doi.org/10.1109/ICETECH.2016.7569290

Costa-Mendes, R., Oliveira, T., Castelli, M., & Cruz-Jesus, F. (2020). A machine learning approximation of the 2015 Portuguese high school student grades: A hybrid approach. Education and Information Technologies, 26 , 1527–1547. https://doi.org/10.1007/s10639-020-10316-y

Cruz-Jesus, F., Castelli, M., Oliveira, T., Mendes, R., Nunes, C., Sa-Velho, M., & Rosa-Louro, A. (2020). Using artificial intelligence methods to assess academic achievement in public high schools of a European Union country. Heliyon. https://doi.org/10.1016/j.heliyon.2020.e04081

Delen, D. (2010). A comparative analysis of machine learning techniques for student retention management. Decision Support Systems, 49 (4), 498–506. https://doi.org/10.1016/j.dss.2010.06.003

Delen, D. (2011). Predicting student attrition with data mining methods. Journal of College Student Retention: Research, Theory and Practice, 13 (1), 17–35. https://doi.org/10.2190/CS.13.1.b

Fernandes, E., Holanda, M., Victorino, M., Borges, V., Carvalho, R., & Van Erven, G. (2019). Educational data mining: Predictive analysis of academic performance of public school students in the capital of Brazil. Journal of Business Research, 94 (February 2018), 335–343. https://doi.org/10.1016/j.jbusres.2018.02.012

Fidalgo-Blanco, Á., Sein-Echaluce, M. L., García-Peñalvo, F. J., & Conde, M. Á. (2015). Using Learning Analytics to improve teamwork assessment. Computers in Human Behavior, 47 , 149–156. https://doi.org/10.1016/j.chb.2014.11.050

García-González, J. D., & Skrita, A. (2019). Predicting academic performance based on students’ family environment: Evidence for Colombia using classification trees. Psychology, Society and Education, 11 (3), 299–311. https://doi.org/10.25115/psye.v11i3.2056

Gök, M. (2017). Predicting academic achievement with machine learning methods. Gazi University Journal of Science Part c: Design and Technology, 5 (3), 139–148.

Hardman, J., Paucar-Caceres, A., & Fielding, A. (2013). Predicting students’ progression in higher education by using the random forest algorithm. Systems Research and Behavioral Science, 30 (2), 194–203. https://doi.org/10.1002/sres.2130

Hellas, A., Ihantola, P., Petersen, A., Ajanovski, V. V., Gutica, M., Hynninen, T., Knutas, A., Leinonen, J., Messom, C., & Liao, S. N. (2018). Predicting academic performance: A systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education (pp. 175–199).

Hoffait, A., & Schyns, M. (2017). Early detection of university students with potential difficulties. Decision Support Systems, 101 (2017), 1–11. https://doi.org/10.1016/j.dss.2017.05.003

Huang, S., & Fang, N. (2013). Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models. Computers and Education, 61 (1), 133–145. https://doi.org/10.1016/j.compedu.2012.08.015

Kardaş, K., & Güvenir, A. (2020). Analysis of the effects of Quizzes, homeworks and projects on final exam with different machine learning techniques. EMO Journal of Scientific, 10 (1), 22–29.

Kaur, P., Singh, M., & Josan, G. S. (2015). Classification and prediction based data mining algorithms to predict slow learners in education sector. Procedia Computer Science, 57 , 500–508. https://doi.org/10.1016/j.procs.2015.07.372

Kılınç, Ç. (2015). Examining the effects on university student success by data mining techniques. [Unpublished master’s thesis]. Eskişehir Osmangazi University Institute of Science.

Kotsiantis, S., Tselios, N., Filippidi, A., & Komis, V. (2013). Using learning analytics to identify successful learners in a blended learning course. International Journal of Technology Enhanced Learning, 5 (2), 133–150. https://doi.org/10.1504/IJTEL.2013.059088

Lara, J. A., Lizcano, D., Martínez, M. A., Pazos, J., & Riera, T. (2014). A system for knowledge discovery in e-learning environments within the European Higher Education Area—Application to student data from Open University of Madrid, UDIMA. Computers and Education, 72 , 23–36. https://doi.org/10.1016/j.compedu.2013.10.009

Long, P., & Siemens, G. (2011). Penetrating the fog: Analytics in learning and education. Educause Review, 46 (5), 31–40.

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & Education, 54 (2), 588–599. https://doi.org/10.1016/j.compedu.2009.09.008

Musso, M. F., Hernández, C. F. R., & Cascallar, E. C. (2020). Predicting key educational outcomes in academic trajectories: A machine-learning approach. Higher Education, 80 (5), 875–894. https://doi.org/10.1007/s10734-020-00520-7

Nandeshwar, A., Menzies, T., & Nelson, A. (2011). Learning patterns of university student retention. Expert Systems with Applications, 38 (12), 14984–14996. https://doi.org/10.1016/j.eswa.2011.05.048

Ornelas, F., & Ordonez, C. (2017). Predicting student success: A naïve bayesian application to community college data. Technology, Knowledge and Learning, 22 (3), 299–315. https://doi.org/10.1007/s10758-017-9334-z

Ortiz, E. A., & Dehon, C. (2008). What are the factors of success at University? A case study in Belgium. Cesifo Economic Studies, 54 (2), 121–148. https://doi.org/10.1093/cesifo/ifn012

Rebai, S., Ben Yahia, F., & Essid, H. (2020). A graphically based machine learning approach to predict secondary schools performance in Tunisia. Socio-Economic Planning Sciences, 70 (August 2018), 100724. https://doi.org/10.1016/j.seps.2019.06.009

Rizvi, S., Rienties, B., & Ahmed, S. (2019). The role of demographics in online learning; A decision tree based approach. Computers & Education, 137 (August 2018), 32–47. https://doi.org/10.1016/j.compedu.2019.04.001

Rubin, B., Fernandes, R., Avgerinou, M. D., & Moore, J. (2010). The effect of learning management systems on student and faculty outcomes. The Internet and Higher Education, 13 (1–2), 82–83. https://doi.org/10.1016/j.iheduc.2009.10.008

Saqr, M., Fors, U., & Tedre, M. (2017). How learning analytics can early predict under-achieving students in a blended medical education course. Medical Teacher, 39 (7), 757–767. https://doi.org/10.1080/0142159X.2017.1309376

Shorfuzzaman, M., Hossain, M. S., Nazir, A., Muhammad, G., & Alamri, A. (2019). Harnessing the power of big data analytics in the cloud to support learning analytics in mobile learning environment. Computers in Human Behavior, 92 (February 2017), 578–588. https://doi.org/10.1016/j.chb.2018.07.002

Vandamme, J.-P., Meskens, N., & Superby, J.-F. (2007). Predicting academic performance by data mining methods. Education Economics, 15 (4), 405–419. https://doi.org/10.1080/09645290701409939

Viberg, O., Hatakka, M., Bälter, O., & Mavroudi, A. (2018). The current landscape of learning analytics in higher education. Computers in Human Behavior, 89 (July), 98–110. https://doi.org/10.1016/j.chb.2018.07.027

Waheed, H., Hassan, S. U., Aljohani, N. R., Hardman, J., Alelyani, S., & Nawaz, R. (2020). Predicting academic performance of students from VLE big data using deep learning models. Computers in Human Behavior, 104 (October 2019), 106189. https://doi.org/10.1016/j.chb.2019.106189

Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques (3rd ed.). Morgan Kaufmann.

Xing, W., Guo, R., Petakovic, E., & Goggins, S. (2015). Participation-based student final performance prediction model through interpretable Genetic Programming: Integrating learning analytics, educational data mining and theory. Computers in Human Behavior, 47 , 168–181.

Xu, X., Wang, J., Peng, H., & Wu, R. (2019). Prediction of academic performance associated with internet usage behaviors using machine learning algorithms. Computers in Human Behavior, 98 (January), 166–173. https://doi.org/10.1016/j.chb.2019.04.015

Zabriskie, C., Yang, J., DeVore, S., & Stewart, J. (2019). Using machine learning to predict physics course outcomes. Physical Review Physics Education Research, 15 (2), 020120. https://doi.org/10.1103/PhysRevPhysEducRes.15.020120

Acknowledgements

Not applicable.

Author information

Authors and affiliations

Kırşehir Ahi Evran University, Faculty of Engineering and Architecture, 40100, Kırşehir, Turkey

Mustafa Yağcı

Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mustafa Yağcı .

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Yağcı, M. Educational data mining: prediction of students' academic performance using machine learning algorithms. Smart Learn. Environ. 9 , 11 (2022). https://doi.org/10.1186/s40561-022-00192-z

Received : 15 November 2021

Accepted : 15 February 2022

Published : 03 March 2022

DOI : https://doi.org/10.1186/s40561-022-00192-z

  • Machine learning
  • Predicting achievement
  • Learning analytics
  • Early warning systems

Research on Intelligent Data Mining and Knowledge Discovery Method Based on Software Information System

Application of Sentinel-1 InSAR to monitor tailings dams and predict geotechnical instability: practical considerations based on case study insights

  • Original Paper
  • Open access
  • Published: 29 April 2024
  • Volume 83, article number 204 (2024)

  • Nahyan M. Rana   ORCID: orcid.org/0000-0002-7525-8431 1 , 2 ,
  • Keith B. Delaney 1 ,
  • Stephen G. Evans 1 ,
  • Evan Deane 3 ,
  • Andy Small 4 ,
  • Daniel A. M. Adria 5 ,
  • Scott McDougall 6 ,
  • Negar Ghahramani 7 &
  • W. Andy Take 8  

Tailings storage facilities (TSFs) impound mining waste behind dams to ensure public safety, but failure incidents have prompted calls for more robust monitoring programs. Satellite-based interferometric synthetic aperture radar (InSAR) has grown in popularity due to its ability to remotely detect millimeter-scale displacements in most urban and some natural terrains. However, there remains a limited understanding of whether InSAR can be as accurate or representative as on-the-ground instruments, whether failures can be predicted in advance using InSAR, and what variables govern the quality and reliability of InSAR results. To address these gaps, we analyze open-source, medium-resolution Sentinel-1 data to undertake a ground-truth assessment at a test site and a forensic analysis of five failure cases. We use a commercial software with an automated Persistent Scatterer (PS) workflow (SARScape Analytics) for all case study sites except one and a proprietary algorithm (SqueeSAR) with a dual PS and Distributed Scatterer (DS) algorithm for the ground-truth site and one forensic case. The main goal is to deliver practical insights regarding the influence of algorithm/satellite selection, environmental conditions, site activity, coherence thresholds, satellite-dam geometry, and failure modes. We conclude that Sentinel-1 InSAR can serve as a hazard-screening tool to help guide where to undertake targeted investigations; however, most potential failure modes may not exhibit InSAR-detectable accelerations that could assist with time-of-failure prediction in real time. As such, long-term monitoring programs should ideally be integrated with a combination of remote sensing and field instrumentation to best support engineering practice and judgment.

Introduction

Tailings storage facilities (TSFs) impound fine-grained, wet, often geochemically hazardous mine waste behind constructed dams in perpetuity for societal and environmental protection (Vick 1983 ; Blight 2010 ). TSFs can store considerable volumes of flowable material that, if released accidentally, could produce far-reaching and long-lasting consequences, as evidenced by a number of TSF failure incidents in recent years (e.g., Morgenstern et al. 2015 , 2016 ; Robertson et al. 2019 ; Rana et al. 2021 ). Such events highlight the importance of implementing proactive monitoring systems at TSF sites to ensure safe performance.

A key objective in TSF monitoring is to observe for potential signs of instability by analyzing spatiotemporal rates of displacement—a variable that is of special concern in scenarios involving creep deformation, static liquefaction, or foundation deformations. In industry practice, the displacement rate has conventionally been monitored by field observations, in-situ instrumentation (e.g., monitoring prisms, inclinometers, and extensometers), ground-based InSAR, ground-based or airborne Light Detection and Ranging (LiDAR), and/or photogrammetry. Another available technique is satellite InSAR, which has been utilized as a complementary, and potentially cost-effective, monitoring tool in mining practice (Hu et al. 2017 ; Raspini et al. 2022 ). The central focus of this article is on the utility of satellite InSAR for monitoring tailings dams and predicting instability.

By conducting interferometric analysis of SAR satellite images, one can measure millimeter-scale displacements in the line-of-sight (LOS) direction of the satellite or in two dimensions (east–west horizontal and up-down vertical) if two satellite orbit tracks in opposite directions are overlapping spatially and temporally (e.g., Hu et al. 2017 ; Mazzanti et al. 2021 ). The Sentinel-1 satellite, commenced in mid-2015 by the European Space Agency (ESA), has become a popular SAR sensor in displacement monitoring studies due to the open-source data release and the revisit times of 6 or 12 days in most of the world.
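The two-dimensional decomposition mentioned above can be sketched as a small linear solve. This is a minimal illustration under stated assumptions, not the workflow of any particular InSAR package: a right-looking sensor, LOS values taken as positive toward the satellite, north-south motion assumed negligible, and incidence and heading angles that are illustrative values rather than values from the source. All function names are hypothetical.

```python
import math


def los_unit_vector(incidence_deg, heading_deg):
    """Ground-to-satellite unit vector (east, north, up), right-looking SAR.

    heading_deg is the satellite flight direction, clockwise from north.
    """
    theta = math.radians(incidence_deg)
    alpha = math.radians(heading_deg)
    return (
        -math.sin(theta) * math.cos(alpha),
        math.sin(theta) * math.sin(alpha),
        math.cos(theta),
    )


def project_to_los(displacement_enu, unit_vector):
    """LOS displacement, positive toward the satellite."""
    return sum(d * u for d, u in zip(displacement_enu, unit_vector))


def decompose_east_up(d_asc, d_desc, u_asc, u_desc):
    """Solve for east and up motion from two LOS values (north ignored)."""
    a11, a12 = u_asc[0], u_asc[2]
    a21, a22 = u_desc[0], u_desc[2]
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-9:
        raise ValueError("viewing geometries are too similar to separate")
    east = (d_asc * a22 - a12 * d_desc) / det
    up = (a11 * d_desc - a21 * d_asc) / det
    return east, up


# Illustrative (assumed) geometry: 39 degree incidence on both tracks.
u_asc = los_unit_vector(39.0, -10.0)    # ascending track, heading roughly NNW
u_desc = los_unit_vector(39.0, 190.0)   # descending track, heading roughly SSW

# Hypothetical ground motion in mm: 5 east, 0 north, 12 downward.
true_motion = (5.0, 0.0, -12.0)
d_asc = project_to_los(true_motion, u_asc)
d_desc = project_to_los(true_motion, u_desc)

east, up = decompose_east_up(d_asc, d_desc, u_asc, u_desc)
print(f"east={east:.2f} mm, up={up:.2f} mm")
```

With only one track, a single LOS value cannot be split this way, which is why a purely horizontal or purely vertical interpretation of 1-D LOS results needs care.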

Among the numerous InSAR deformation analysis techniques (Aswathi et al. 2022 ), Persistent Scatterer (PS) is able to produce highly precise, long-term displacement time-series mainly for human-built structures such as bridges, roads, buildings, and dams (Ferretti et al. 2001 ; Crosetto et al. 2016 ). Small Baseline Subset (SBAS) InSAR is an alternative time-series approach that was designed to improve the spatial distribution and density of “Distributed Scatterer” (DS) observation points in vegetated study areas, albeit at reduced spatial resolution (Berardino et al. 2002 ; Casu et al. 2006 ).

A technical drawback of InSAR is the complicated and lengthy workflow that necessitates the use of computers and software with high data processing capacity. This obstacle, along with the conceptual complexity of advanced InSAR, has contributed to a limited archive of case studies on tailings dams. Hu et al. ( 2017 ) monitored displacements at the Kennecott TSF in the USA by integrating ENVISAT, ALOS Palsar-1, and Sentinel-1A data. Mazzanti et al. ( 2021 ) used over 400 Sentinel-1 images in the ascending and descending orbit direction to study displacements at the Zelazny Most TSF in Poland.

To date, three recent TSF breach cases have been forensically analyzed using InSAR: 2018 Cadia, Australia (Carla et al. 2019a ; Jefferies et al. 2019 ; Thomas et al. 2019 ; Hudson et al. 2021 ; Bayaraa et al. 2022 ), 2019 Feijao, Brazil (Gama et al. 2020 ; Holden et al. 2020 ; Rotta et al. 2020 ; Grebby et al. 2021 ), and 2022 Jiaokou, China (Duan et al. 2023 ; Su et al. 2024 ). The common conclusion in these case studies was that satellite InSAR can be an effective monitoring tool for TSFs exhibiting slow, long-term deformations, but awareness of limitations is needed as it relates to the oblique geometry of 1-D LOS displacement measurements, the difficulty in predicting instantaneous failure mechanisms, phase unwrapping errors, and loss of coherence. Mirmazloumi et al. ( 2023 ) also re-examined the Cadia and Feijao cases using a PS algorithm to test an early warning system based on machine learning.

While the Cadia case demonstrated a precursor acceleration phase that could have assisted in time-to-failure prediction (Carla et al. 2019a), such anomalous displacement patterns were not as readily apparent in the Feijao and Jiaokou cases. This leads to an incomplete understanding of whether tailings dam failures can indeed be predicted in advance using InSAR data alone or whether accurate InSAR-derived failure predictions can be achieved only under certain criteria/conditions (e.g., processing algorithm, satellite-dam geometry, failure mechanism). To build on previous advancements, the case study inventory needs to be expanded in order to explore the capabilities and limitations of InSAR when monitoring tailings dams in diverse site conditions and with different potential failure modes.

Goal and scope

Using Sentinel-1 data, this study helps address this research gap in two ways. First, we present a ground-truth assessment at a test site where InSAR results are compared to in-situ monitoring prism data. Second, we examine the precursor displacements in 5 TSF failure cases (2017–2019) selected from published databases (Islam and Murakami 2021 ; Rana et al. 2021 , 2022 ) based on the following criteria: (i) their variability in reported failure mechanisms and site characteristics and (ii) the spatial–temporal coverage of Sentinel-1 data over the sites. Of the 5 failure events, 3 are new case studies whereas the 2 others (Cadia and Feijao) have already been analyzed using InSAR in preceding studies, which allows us to compare our findings versus published results.

In all of the cases, it was possible to retrieve only the 1-D LOS displacements—i.e., spatially and temporally overlapping ascending and descending orbit tracks, which allow insights into east–west and vertical displacements, were not available. To process the Sentinel-1 InSAR data, we used two software/algorithms: (i) for the ground-truth site and 4 of 5 forensic case studies, a commercial software (SARScape Analytics) that offers an automated workflow for PS analysis; and (ii) for the ground-truth site and only 1 forensic case study, the proprietary algorithm SqueeSAR which is integrated with an advanced PS + DS technique (Ferretti et al. 2011 ). The use of multiple processing techniques helped demonstrate how the quality of InSAR results may vary depending on the adopted algorithm and the site conditions.

The ultimate goal of this study was to provide practical insights and considerations for engineers and mine owners who may be considering Sentinel-1 InSAR as a long-term monitoring tool for their TSFs. This goal is pursued by the following approach:

  • We analyze the accuracy of Sentinel-1 InSAR on a site-scale using multiple software/algorithms.
  • We assess if unstable locations and accelerations in precursor displacements can be detected by the present approach. This allows us to:
      • Identify high-deformation hotspots to match with the observed breach location and reported breach mechanism, which is important for hazard assessment;
      • Check if the failure was preceded by accelerating displacements, and if so, how many weeks in advance this trend was observed, which is important for risk management; and/or
      • Identify errors in the results due to the inherent limitations of the selected software, the limitations of Sentinel-1 data, or the LOS velocity threshold being exceeded.
  • We comment on the influence of dam-satellite geometry, environmental conditions, and failure modes on the quality, value, and interpretation of LOS displacement results.
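One common way to test whether accelerating displacements point to a failure time is the inverse-velocity approach, in which the reciprocal of the displacement rate is extrapolated linearly to zero. The text above does not state which method is applied, so the sketch below is an assumption for illustration only; the function names and the synthetic velocity series are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def inverse_velocity_failure_time(times_days, velocities):
    """Extrapolate 1/velocity to zero; return None if not accelerating."""
    if any(v <= 0 for v in velocities):
        raise ValueError("velocities must be positive magnitudes")
    inverse = [1.0 / v for v in velocities]
    slope, intercept = fit_line(times_days, inverse)
    if slope >= 0:
        return None  # inverse velocity is not decreasing: no acceleration
    return -intercept / slope


# Synthetic series at a 12-day revisit, built so failure occurs at day 60.
times = [0.0, 12.0, 24.0, 36.0, 48.0]
velocities = [100.0 / (60.0 - t) for t in times]  # mm/day, hypothetical

predicted = inverse_velocity_failure_time(times, velocities)
steady = inverse_velocity_failure_time([0.0, 12.0, 24.0], [2.0, 2.0, 2.0])
print(predicted, steady)
```

Real LOS time series are noisy and sampled only every 6 or 12 days, so a fit like this is sensitive to the chosen window and should be read as a screening indicator rather than a forecast.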

Background and approach

SAR data processing

Methods to process SAR data for displacement monitoring range from open-source software such as SNAP/SNAPHU (Chen and Zebker 2002 ), HyP3/MintPy (Yunjun et al. 2019 ) and EZ-InSAR (Hrysiewicz et al. 2023 ), to commercially available software such as GAMMA (Werner et al. 2000 ; Wang et al. 2020 ), SARPROZ (Perissin et al. 2011 ; Bakon et al. 2014 ), and SARScape (Gama et al. 2020 ), to company-specific proprietary algorithms such as APSIS (Sowter et al. 2016 ; Grebby et al. 2021 ) and SqueeSAR (Ferretti et al. 2011 ; Carla et al. 2019a ; Bischoff et al. 2020 ). To our knowledge, there are no scientific studies to date that compare the effectiveness of different processing algorithms/software specifically for tailings dam monitoring applications.

For the present study, we used ENVI SARScape Analytics (v. 5.6), which was developed by SARMap and is commercially distributed by NV5 Geospatial (formerly L3Harris Geospatial), to process Sentinel-1 data for the ground-truth test site and 4 of 5 forensic case studies. The software offers a streamlined workflow for PS-InSAR analysis, whereby each processing step is automated and the analysis runtime is reduced substantially. The Analytics package is a condensed, limited version of the entire SARScape software suite, which has been previously used to analyze the 2019 Feijao TSF failure using both the PS and SBAS techniques (Gama et al. 2020 ). The steps that are automated in the PS processing chain include co-registration, interferogram creation, coherence generation, height estimation, baseline refinement, noise filtering, and phase unwrapping. For a technical background on the standard PS algorithm, we refer to Ferretti et al. ( 2001 ), Crosetto et al. ( 2016 ), and references therein.

We explored the feasibility of alternative software such as SNAP/SNAPHU and SARPROZ, but the balanced cost- and time-saving features of SARScape Analytics were deemed most convenient for the comprehensive scope of this study. However, the automated approach of SARScape Analytics has inherent drawbacks: it is not possible to produce or extract individual interferograms, to modify any parameters or steps in the workflow (except the coherence), or to view or modify the reference location.

We downloaded open-source Sentinel-1 Single Look Complex (SLC) Interferometric Wide (IW) scenes from the Alaska Satellite Facility (ASF) web platform. Other input variables into the SARScape Analytics software were the geoid type (EGM2008), the base global digital elevation model (DEM), which we selected to be 30-m resolution SRTM-3 v4, and the area of interest in KML format (must be between 4 and 25 km²).

For the ground-truth site and one forensic case study (2018 Cieneguita, Mexico), the Sentinel-1 images were also processed using the SqueeSAR algorithm (Ferretti et al. 2011 ). The SqueeSAR processing was performed by TRE Altamira based on instructions on the study area and time-series duration provided by the lead author (N. Rana). SqueeSAR overcomes the limitations of alternative software packages by integrating both PS and DS points during analysis, thus enhancing the spatial density of point-cloud displacement data in most terrains. SqueeSAR has been used to study tailings dam failures (2018 Cadia, Australia), open-pit slope instabilities (e.g., Carla et al. 2019a ), urban deformation (e.g., Bischoff et al. 2017 , 2020 ), and natural landslides (e.g., Carla et al. 2019a ,b) and is best-suited to monitor displacement rates of < 1000 mm/year. The technical framework of SqueeSAR is described in Ferretti et al. ( 2011 ).
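An upper bound on measurable displacement rate can be illustrated with a widely used rule of thumb: motion between two consecutive acquisitions becomes ambiguous once it exceeds about a quarter of the radar wavelength. The sketch below applies that rule to Sentinel-1, assuming a nominal C-band centre frequency of 5.405 GHz. It is a back-of-envelope estimate, not the criterion used by SqueeSAR or SARScape Analytics, and the function name is hypothetical.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0
SENTINEL1_FREQUENCY_HZ = 5.405e9  # nominal C-band centre frequency (assumed)


def max_unambiguous_rate_mm_per_year(revisit_days,
                                     frequency_hz=SENTINEL1_FREQUENCY_HZ):
    """Quarter-wavelength LOS motion per revisit, scaled to one year."""
    wavelength_mm = SPEED_OF_LIGHT_M_S / frequency_hz * 1000.0
    max_step_mm = wavelength_mm / 4.0
    return max_step_mm * 365.25 / revisit_days


rate_6_day = max_unambiguous_rate_mm_per_year(6)
rate_12_day = max_unambiguous_rate_mm_per_year(12)
print(f"6-day revisit:  {rate_6_day:.0f} mm/year")
print(f"12-day revisit: {rate_12_day:.0f} mm/year")
```

Halving the revisit interval doubles the bound, which is one reason the 6-day coverage available in some regions matters for faster-moving sites.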

The comparison between SARScape Analytics and SqueeSAR in the ground-truth assessment is not intended to be a competition, but rather to demonstrate how the quality of InSAR results can differ depending on the adopted data processing technique, which is ultimately of practical value to engineers and mine owners.

For 3 case study sites, we filtered the PS data based on a minimum coherence of 0.70, whereas for the 2 sites where environmental conditions affected InSAR data availability, we reduced the minimum coherence to 0.57–0.65. Our thresholds are higher than the 0.45 value applied in the ISBAS analysis of Feijao by Grebby et al. (2021) and are comparable to the minimum temporal coherence of 0.60 adopted by Mazzanti et al. (2021) in their PS analysis of the Zelazny Most TSF.
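The coherence filtering step can be illustrated with a minimal sketch. The record layout, field names, and sample values below are hypothetical and do not reflect the export format of SARScape Analytics or SqueeSAR; only the thresholds (0.70 and 0.57) are taken from the text.

```python
from typing import Dict, List


def filter_by_coherence(
    points: List[Dict[str, float]], min_coherence: float
) -> List[Dict[str, float]]:
    """Keep only the points whose coherence is at or above the threshold."""
    return [p for p in points if p["coherence"] >= min_coherence]


# Hypothetical PS points: LOS velocity in mm/year and coherence (0 to 1).
ps_points = [
    {"los_velocity": -12.4, "coherence": 0.82},
    {"los_velocity": -3.1, "coherence": 0.66},
    {"los_velocity": 1.7, "coherence": 0.58},
    {"los_velocity": -25.0, "coherence": 0.71},
]

strict = filter_by_coherence(ps_points, 0.70)   # threshold used at 3 sites
relaxed = filter_by_coherence(ps_points, 0.57)  # lowest relaxed threshold
print(len(strict), len(relaxed))
```

Lowering the threshold retains more points at the cost of admitting noisier measurements, which is the trade-off behind the relaxed thresholds at the two environmentally affected sites.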

Ground-truth assessment

As a complementary lead-up to the forensic case studies, we undertook a ground-truth assessment at a tailings dam situated in a cold-climate setting. Key identifier details of the mine and TSF are kept confidential to adhere to the non-disclosure agreement signed with the mine owner. We compared Sentinel-1B PS-InSAR LOS cumulative displacement results to in-situ data captured via two monitoring prisms (MP5 and MP3) over the same study period. The MPs were installed in late 2018 along the tailings dam. MP5 is at the crest while MP3 is on the downstream slope, and both are located at the SE corner of the dam. The average accuracy of the MP horizontal and vertical displacement measurements is approximately ± 10 mm. The errors are mainly due to setup error of the total station, and the magnitude of the errors varies depending on how far the prism is located from the setup location.

The MP data indicates that this section of the dam crest has exhibited some settlement deformation mainly toward the upstream (western) direction, whereas the downstream (east-facing) slope of the dam has shown relatively stable behavior with minimal cumulative movement. The main objective here was to check if Sentinel-1 InSAR, as processed via both SARScape Analytics and SqueeSAR, shows reasonable consistency with the MP data and can adequately represent a tailings dam experiencing displacement rates between 0 and 50 mm/year.

The Sentinel-1B track used for analysis had an ascending (i.e., roughly south-to-north) orbit geometry and a LOS incidence angle of 31°. The tailings dam trends north–south, whereby the downstream slope of the dam faces eastward away from the satellite but is still exposed at an acute angle to the satellite’s LOS. On SARScape Analytics, we processed 36 Sentinel-1B images from May 2019 to October 2021, excluding the winter months from the processing stack because snow/ice is known to reduce the coherence of InSAR observations (Carla et al. 2019b ; Kim et al. 2022 ). The SqueeSAR processing involved a total of 42 Sentinel-1B images over roughly the same study duration, also excluding the winter images. We confirmed the start and end dates of the winter months by checking for snow/ice cover on the TSF on high-resolution, high-frequency RapidEye and PlanetScope satellite imagery for the site.

To facilitate a fair time-series comparison, we followed these steps:

The original MP datasets included results collected during the winter months, whereas Sentinel-1 images corresponding to winter months were excluded from the InSAR processing stack. To avoid temporal inconsistency, we baselined the cumulative displacement results from the Sentinel-1 and MP datasets to the start of the Spring season in each study year—that is, our time-series comparisons encompass the Spring-Fall seasons of 2019–2021.

The MP data was originally reported as horizontal-easting and vertical. To avoid geometric inconsistency, we converted/projected the MP data to the Sentinel-1 LOS (31° eastward) using the following formula: LOS-projected displacement = vertical displacement × cos(31°) − horizontal-easting displacement × sin(31°).

For the comparisons, we selected the InSAR data point closest to the corresponding MP. The comparisons were quantitatively assessed by calculating average differences in the displacement measurements between the different time-series datasets, as well as via the root mean square error (RMSE) calculated by the formula: \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(d_{A,i}-d_{B,i}\right)^{2}}\), where \(d_{A,i}\) and \(d_{B,i}\) represent the displacements from the InSAR or MP data at the \(i\)-th comparison observation and \(n\) is the number of comparison observations.
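The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the processing code used in the study: the function names and sample values are hypothetical, easting is taken as positive toward the east, and the "average difference" is assumed to be a mean absolute difference, which the text does not specify.

```python
import math
from typing import List, Sequence


def project_to_los(vertical_mm: float, easting_mm: float,
                   incidence_deg: float = 31.0) -> float:
    """LOS = vertical * cos(theta) - easting * sin(theta), as in the text."""
    theta = math.radians(incidence_deg)
    return vertical_mm * math.cos(theta) - easting_mm * math.sin(theta)


def rebaseline(cumulative_mm: Sequence[float]) -> List[float]:
    """Shift a seasonal series so that its first observation is zero."""
    first = cumulative_mm[0]
    return [d - first for d in cumulative_mm]


def mean_abs_difference(d_a: Sequence[float], d_b: Sequence[float]) -> float:
    """Assumed form of the 'average difference' between two series."""
    return sum(abs(a - b) for a, b in zip(d_a, d_b)) / len(d_a)


def rmse(d_a: Sequence[float], d_b: Sequence[float]) -> float:
    """Root mean square error between two equally long series."""
    if len(d_a) != len(d_b) or not d_a:
        raise ValueError("series must be non-empty and of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d_a, d_b)) / len(d_a))


# Hypothetical prism readings (mm): 10 mm of settlement, no easting movement.
los_settlement = project_to_los(vertical_mm=-10.0, easting_mm=0.0)

# Hypothetical seasonal series (mm), baselined to the first spring observation.
insar = rebaseline([3.0, 1.0, -2.0])
prism = rebaseline([5.0, 4.0, 2.0])
print(round(los_settlement, 2), mean_abs_difference(insar, prism), rmse(insar, prism))
```

With this sign convention, settlement and eastward movement both project to negative LOS values, i.e., movement away from an ascending, eastward-looking satellite.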

Forensic case studies

This section describes the approach to investigating the precursor LOS displacements in 5 tailings dams that experienced a breach in the period 2017–2019. The cases are listed in Table 1. We selected these cases after screening databases of TSF failures (e.g., Islam and Murakami 2021; Rana et al. 2021, 2022) based on two criteria. First, the cases encompass a variety of breach mechanisms, allowing us to check for distinct displacement patterns depending on the failure mechanism (e.g., internal erosion vs. liquefaction vs. subsidence). Second, the 5 cases occurred in 5 countries with diverse climatic, topographic, and land-use regimes. This allows us to check for environmental conditions that influence the quality and reliability of the InSAR results.

For some of the cases, the predisposal variables (i.e., the underlying causal variables that preconditioned the precarious stability of the TSF), trigger mechanisms (i.e., a natural or anthropogenic activity that caused the breach to occur at a particular location at a given time), and failure modes (i.e., the mechanism by which the breach and outflow occurred) are poorly described in existing literature. These underlying factors form the background story of each case; therefore, our knowledge of these factors has an influence on our judgment of the InSAR results. In Table  1 , we assign qualitative levels of knowledge uncertainty for each case:

“High uncertainty” means that the factor is virtually unknown due to lack of research.

“Medium uncertainty” means that there are news articles or satellite images that provide basic information on, or insights into, the factor.

“Low uncertainty” means that the factor has been well-documented in scientific material.

Details on the Sentinel-1 image processing stacks for each case are provided in Table  2 . All of the cases were processed using SARScape Analytics with the exception of Cieneguita which was processed using SqueeSAR due to the highly vegetated environment. Our main objective was to forensically check whether the breach location and breach timing could have been predicted in advance using Sentinel-1 InSAR. For each case study, we adopted the following consistent approach:

We presented the LOS velocity map illustrating the calculated mean displacement rates per annum across the site.

We identified hotspots of detected movements in the unstable section and, where applicable, in other sections of the dam exhibiting similar detected behavior.

We plotted the cumulative LOS displacement time-series for these hotspot locations.

We checked if accelerations in the time-series could be observed.
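One simple numerical way to screen a time-series for acceleration is to compare the fitted velocity over the most recent observations with the velocity over the earlier record. The sketch below is an illustrative assumption only, not the procedure used in this study; the window length, function names, and synthetic series are hypothetical.

```python
from typing import Sequence, Tuple

DAYS_PER_YEAR = 365.25


def least_squares_slope(t: Sequence[float], d: Sequence[float]) -> float:
    """Slope of the least-squares line through the points (t, d)."""
    n = len(t)
    mean_t = sum(t) / n
    mean_d = sum(d) / n
    num = sum((ti - mean_t) * (di - mean_d) for ti, di in zip(t, d))
    den = sum((ti - mean_t) ** 2 for ti in t)
    return num / den


def early_and_recent_velocity(
    t_days: Sequence[float], disp_mm: Sequence[float], recent_n: int = 6
) -> Tuple[float, float]:
    """Return (early, recent) LOS velocities in mm/year.

    'recent' is fitted to the last `recent_n` observations and 'early'
    to everything before them.
    """
    early = least_squares_slope(t_days[:-recent_n], disp_mm[:-recent_n])
    recent = least_squares_slope(t_days[-recent_n:], disp_mm[-recent_n:])
    return early * DAYS_PER_YEAR, recent * DAYS_PER_YEAR


# Synthetic series on a 12-day revisit: steady -0.1 mm/day, then accelerating.
t_days = [12.0 * i for i in range(20)]
disp_mm = [
    -0.1 * t - (0.002 * (t - 156.0) ** 2 if t > 156.0 else 0.0)
    for t in t_days
]

early_v, recent_v = early_and_recent_velocity(t_days, disp_mm)
print(round(early_v, 1), round(recent_v, 1))
```

A recent velocity that is markedly larger in magnitude than the early velocity flags a candidate acceleration, which would still need to be judged against measurement noise and site activity.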

Where precursor identifiers on the breach location and/or timing were not found, we provided explanations on the basis of the failure mechanism (if this information was available), the limitations of the processing algorithm, and/or the limitations of medium-resolution Sentinel-1 data. Out of our 5 case studies, only Cadia and Feijao have been forensically studied using InSAR (Carla et al. 2019a ; Thomas et al. 2019 ; Gama et al. 2020 ; Rotta et al. 2020 ; Holden et al. 2020 ; Grebby et al. 2021 ; Hudson et al. 2021 ; Bayaraa et al. 2022 ), which allows us to compare our findings versus previously published results and comment on how different InSAR processing approaches influence the outcomes.

Figure  1 shows the LOS velocity maps (i.e., calculated mean displacement rate per annum) from SARScape Analytics and SqueeSAR, enabling a side-by-side visual comparison. Both sets of InSAR data are filtered to show only points with coherence of 0.70 or greater. Figure  1 also includes a cross-sectional diagram to elucidate the geometric relationship between the satellite, the dam, and the MPs. The absence of InSAR data points in the TSF impoundment is due to surface wetness, indicating the presence of tailings or free water.

figure 1

Sentinel-1 InSAR line-of-sight (LOS) velocity maps of the anonymized ground-truth TSF test site based on A SARScape Analytics and B SqueeSAR. Both sets of InSAR data are filtered to show only points with a coherence of ≥ 0.70. The maps are annotated with the locations of the selected monitoring prisms (MPs), the surface of the exposed tailings (dashed outline), the embankment (solid outline), the mine ground surface, ponded water sites, and the adjacent dam. Negative (red) values indicate detected movements away from the satellite whereas positive (blue) values indicate detected movements toward the satellite. C Cross-sectional schematic illustrating the geometric relationship between the satellite, the tailings dam, and the MPs

An important distinction is apparent between the two sets of InSAR data: the point density. Figure  2 shows that SqueeSAR produced 9876 data points over the TSF (the extent shown in Fig.  2 A, B) whereas SARScape Analytics produced 4368. InSAR data coverage along the embankment was sparse, particularly for SARScape Analytics. This is attributed to (i) the inherent differences in the PS vs. PS + DS algorithms, (ii) on-site activity that likely reduced the stability of InSAR pixels, and/or (iii) the removal of winter images which could have contributed to loss of coherence and, consequently, loss of PS data for SARScape Analytics. The standard deviation in the SqueeSAR data was 5.5 mm/year compared to 6.2 mm/year for SARScape Analytics. Figure  2 also shows contrasting statistical distributions for the InSAR data when sorted by coherence; for SqueeSAR, the number of data points increases with increasing coherence, which is opposite to that of SARScape Analytics. As such, the mean coherence of SqueeSAR data was greater, implying a higher measurement precision on average.

figure 2

Comparison of A, B line-of-sight (LOS) velocities (mm/year) and C, D coherence values between SARScape Analytics and SqueeSAR results for the ground-truth test site

Figure  3 presents the cumulative LOS displacement time-series for the two InSAR datasets and the two MP datasets on the dam crest and downstream slope for the Spring-Fall seasons of 2019–2021. The standard deviation of incremental displacements in the SARScape Analytics dataset was calculated to be 3.6 mm, marginally greater than the standard deviation of 3.0 mm for SqueeSAR. In comparison, as stated earlier, the accuracy for the MP data at this site was approximately ± 10 mm.
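As a minimal sketch of how such a spread statistic can be obtained, the snippet below assumes that "incremental displacements" are the differences between consecutive cumulative values and that the sample standard deviation is used; neither choice is specified in the text, and the series is hypothetical.

```python
import statistics
from typing import Sequence


def incremental_std(cumulative_mm: Sequence[float]) -> float:
    """Sample standard deviation of consecutive displacement increments."""
    increments = [b - a for a, b in zip(cumulative_mm, cumulative_mm[1:])]
    return statistics.stdev(increments)


# Hypothetical cumulative LOS displacements (mm) at successive acquisitions.
cumulative_mm = [0.0, -2.0, -1.0, -5.0, -4.0]
spread = incremental_std(cumulative_mm)
print(round(spread, 2))
```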

figure 3

Ground-truth results comparing the Sentinel-1 line-of-sight (LOS) InSAR cumulative displacement data (processed on SARScape Analytics and SqueeSAR) to in-situ cumulative displacement data from monitoring prisms MP5, on the crest, and MP3, on the downstream slope, of the tailings dam. The study period encompasses the Spring-Fall seasons of 2019–2021 (i.e., winter images were excluded from the processing stack), with all sets of data being baselined to the start of each Spring-Fall study period

Table 3 presents statistical comparisons between these datasets. For the dam crest, the average differences ranged from 2.8 to 3.7 mm when comparing SARScape Analytics versus SqueeSAR results, from 4.4 to 5.0 mm when comparing SARScape Analytics versus MP5 data, and from 2.4 to 5.0 mm when comparing SqueeSAR versus MP5. The RMSE values ranged from 3.6 to 6.0 mm (SARScape Analytics vs. SqueeSAR), 5.1 to 6.6 mm (SARScape Analytics vs. MP5), and 3.1 to 6.7 mm (SqueeSAR vs. MP5). For the downstream slope, the ranges of average differences were 1.5–6.0, 3.0–3.6, and 3.4–5.1 mm, respectively, and the ranges of RMSE values were 1.8–7.1, 3.7–4.1, and 4.4–6.1 mm, respectively.

Our results indicate that, at this particular tailings dam, Sentinel-1 InSAR was able to reasonably represent the displacement rates (all below 50 mm/year) on both the dam crest and downstream slope within a maximum RMSE of 7 mm. However, the loss or sparsity of InSAR data points along the dam crest due to on-site activity and the removal of winter images precluded us from performing additional comparisons to other MPs that are installed along the dam. This issue could be of concern to mine owners using InSAR to monitor tailings dams in cold-climate regimes or highly active TSFs undergoing tailings deposition and dam raises.

This section presents the InSAR results for each forensic case study. The presentation of the InSAR results is preceded by a background review of the reported failure description. For all of the cases except Tonglvshan, we constructed time-lapse videos using between 60 and 120 PlanetScope (3 m resolution) optical satellite images to show the evolution of the TSFs until their breach. The open-access URL links to these time-lapse videos are listed in Table 4 in the Appendix.

2017 Tonglvshan, China

The Tonglvshan copper-iron TSF is located near Daye City in Hubei Province, China. On 12 March 2017, the northern section of the tailings dam experienced a breach with an approximate geometry of 50 m breadth × 250 m width × 12 m height (Fig. 4). The released volume was reportedly ~500,000 m³ (Zhuang et al. 2022) with an inundation area of 300,000 m² (Ghahramani et al. 2020). The failure reportedly occurred due to the long-term effects of weathering and gravity in the roof granite in the mine goaf underneath the TSF (Zhuang et al. 2022). This caused the upper strata to fail, which led to subsidence and brittle instability in the dam foundation. The incident led to 2 deaths and 6 injuries (Zhuang et al. 2022).

figure 4

A Pre-failure (13 February 2017) and B post-failure (13 April 2017) PlanetScope (3 m resolution) images of the 12 March 2017 Tonglvshan tailings dam breach in China

Figure  5 shows the results from our analysis of 49 Sentinel-1 images over a duration of 21 months. The satellite had an ascending orbit parallel to the length of the TSF with a LOS incidence angle of 33.8° toward E-NE. This implies that, at the northern section that experienced the breach, displacements in the downstream (N-NW) direction were along-track in relation to the satellite orbit (Fig.  5 B) and thus were poorly captured. However, LOS components of vertical, settlement- and subsidence-related deformation can be measured.

figure 5

Sentinel-1 PS-InSAR results, processed on SARScape Analytics with a minimum coherence of 0.57, for the 12 March 2017 Tonglvshan TSF breach in China. A Line-of-sight (LOS) velocity map, annotated with the northern breach section (1 data point) and western unfailed section (average of 5 data points) selected for time-series analysis. Negative (red) values indicate detected movements away from the satellite, positive (blue) values indicate detected movements toward the satellite, and green-yellow values indicate detected stable areas. B Cross-sectional schematics for the two sections illustrating the geometric relationship between the satellite, the tailings dam, and the PS points selected for time-series analysis (arrow indicates direction of InSAR-detected LOS movement).  C Cumulative LOS displacements over the time-series duration for the two sections

The LOS velocities in the infrastructure surrounding the TSF were generally close to 0 mm/year (i.e., stable). This contrasts with the deformation regime along the tailings dam, where anomalously high LOS velocities were detected at only a single data point in the northern section that eventually breached and at a few data points in the middle of the western wall, which remained stable. As shown in the time-series (Fig. 5C), the LOS deformation patterns were similar in both locations, suggesting a near-constant rate of movement away from the satellite (i.e., settlement and potential subsidence activity) with a total displacement of ~50 mm over the study duration. There was no discernible evidence of precursor acceleration at any section along the tailings dam.

However, it is worth reiterating that the unstable northern section was geometrically unfavorable in relation to the satellite’s LOS compared to the western wall that was directly facing the LOS. Moreover, given the loss of coherence, a few isolated data points are not sufficient to study the instability of a tailings dam with confidence. These issues inevitably influence our interpretation of the results for this case.

2018 Cadia, Australia

The Cadia gold mine in New South Wales, Australia, operates two large TSFs: the Northern TSF (NTSF) and the Southern TSF (STSF), both of which are upstream-constructed, stepped side-hill impoundments (Jefferies et al. 2019). In March 2018, a ~300 m long portion of the SW section of the NTSF failed (Fig. 6). We created a time-lapse video of the Cadia dam using 120 PlanetScope images between September 2016 and September 2018 (Table 4). The video shows that the failure process was two-staged: the initial breach occurred on 9 March 2018, and a secondary event occurred at the same location on 11 March 2018. Jefferies et al. (2019) identified the main failure cause to be a low-density, highly compressible Forest Reef Volcanics (FRV) layer in the foundation underneath the SW wall of the NTSF. This previously unidentified unit was strain-weakening; that is, the unit became brittle when subjected to high loads. Accelerated displacements over this foundation unit triggered static liquefaction in the loose, saturated tailings in the NTSF. The Cadia failure has been forensically examined by InSAR using proprietary processing software in several publications (Carla et al. 2019a; Jefferies et al. 2019; Thomas et al. 2019; Hudson et al. 2021; Bayaraa et al. 2022). These studies identified accelerating deformation in the 2–3 months preceding the dam collapse.

figure 6

A Pre-failure (11 June 2016) and B post-failure (12 September 2018) Google Earth Worldview-2 (0.5 m resolution) images of the 9 March 2018 Cadia tailings dam breach in Australia

Figure  7 shows our PS-InSAR results after processing 45 Sentinel-1 images over a 17-month duration. The satellite had a descending orbit track with a LOS incidence angle of 35°. Although the unstable dam face was obliquely exposed to the satellite’s LOS, the LOS components of vertical settlements along the dam were well-captured. The LOS velocity patterns showed that the vicinity of the unstable section, especially along the dam crest, represented a hotspot of anomalously high movements (Fig.  7 A). These observations resemble those seen in previous publications on the Cadia event, whereby the time-series (Fig.  7 C) shows the commencement of the acceleration phase in January 2018. However, there appear to be errors associated with the final two data points leading up to the failure. We observed a similar time-series pattern for all of the other high-velocity data points along the breach section. This issue may be caused by phase unwrapping errors due to exceedance of the maximum LOS velocity threshold of 28 mm over a 12-day revisit time, as previously suggested by Bayaraa et al. ( 2022 ). However, these errors were not observed in Carla et al. ( 2019a ) who used SqueeSAR and in Jefferies et al. ( 2019 ) who used the SBAS algorithm within the complete SARScape software package. Jefferies et al. ( 2019 ) also stated that SBAS is more appropriate to capture strong, non-linear accelerations. As such, the issue encountered here appears to reflect a key technical limitation of PS-InSAR when monitoring relatively fast, accelerating deformations, with major implications for the ability to make time-of-failure predictions.
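The 28 mm figure is close to half of the Sentinel-1 radar wavelength, which can be checked with a short calculation. The sketch assumes the Sentinel-1 C-band centre frequency of 5.405 GHz, a value that is not stated in this paper; the quarter-wavelength value is printed as well because it is often quoted as a more conservative limit between consecutive acquisitions.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0
SENTINEL1_FREQ_HZ = 5.405e9  # assumed C-band centre frequency
DAYS_PER_YEAR = 365.25

wavelength_mm = SPEED_OF_LIGHT_M_S / SENTINEL1_FREQ_HZ * 1000.0
half_wavelength_mm = wavelength_mm / 2.0
quarter_wavelength_mm = wavelength_mm / 4.0


def equivalent_annual_rate(displacement_mm: float, revisit_days: float) -> float:
    """Convert a per-revisit displacement into an equivalent rate in mm/year."""
    return displacement_mm * DAYS_PER_YEAR / revisit_days


annual_rate_12_day = equivalent_annual_rate(half_wavelength_mm, 12.0)
print(round(half_wavelength_mm, 1), round(quarter_wavelength_mm, 1),
      round(annual_rate_12_day))
```

Sustained LOS displacement approaching these per-revisit values becomes ambiguous in the wrapped phase, which is consistent with the unwrapping errors suspected in the final data points before the Cadia failure.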

figure 7

Sentinel-1 PS-InSAR results, processed on SARScape Analytics, for the 9 March 2018 Cadia TSF breach in Australia. A Line-of-sight (LOS) velocity map, annotated with the breach location and the PS point selected for time-series analysis. Negative (red) values indicate detected movements away from the satellite, positive (blue) values indicate detected movements toward the satellite, and green-yellow values indicate detected stable areas. B Cross-sectional schematic illustrating the geometric relationship between the satellite, the tailings dam, and the PS points. The small red arrows indicate the direction of LOS movement, in this case away from the satellite. C Cumulative LOS displacement time-series for the data point at the center of the breach section that exhibited the highest detected LOS velocity (− 25 mm/year)

2018 Cieneguita, Mexico

The Cieneguita gold-silver tailings dam in Chihuahua, Mexico, was breached on 4 June 2018 (Fig. 8). The total released volume was ~440,000 m³, including ~250,000 m³ of tailings and ~190,000 m³ of embankment and construction materials (Rana et al. 2021, 2022). According to historical satellite imagery on Google Earth, tailings deposition commenced at this site in mid-2013. The TSF was relatively small, covering an area of ~35,000 m² with a dam crest length of about 150 m around the date of failure. According to local reports, premonitory signs included extensive cracking along the downstream face of the dam about 4 months prior to failure, likely indicative of ongoing internal erosion and a weakened state of the embankment in response to rapid loading on the sloping impoundment (Rana et al. 2021, 2022). At least three mine workers were killed by the collapse. The tailings flow achieved a runout distance of 15 km along Canitas Creek (Ghahramani et al. 2020). We constructed a time-lapse video of the Cieneguita TSF using 60 PlanetScope images spanning the period January 2017 to July 2018 (Table 4). The video shows that the TSF was undergoing rapid depositional and construction activity.

figure 8

A Pre-failure (3 June 2018) and B post-failure (5 June 2018) PlanetScope (3 m resolution) images of the 4 June 2018 Cieneguita tailings dam breach in Mexico

SARScape Analytics was found to be ineffective in producing reliable InSAR results due to the forested terrain at this site. As such, SqueeSAR was used to process 39 Sentinel-1 images encompassing the 15-month study period, as shown in Fig.  9 . The SqueeSAR results still showed a complete absence of data points along the central (breached) portion of the embankment, mainly due to the construction activity. We observed three high-velocity (< − 20 mm/year) data points at the left edge/corner of the dam slope and plotted the average cumulative displacement time-series in Fig.  9 C. The total displacement recorded here was ~ 40 mm, without evidence of precursor acceleration. In this time-series, we also detected a “cyclic” pattern of displacements from September 2017 onwards. This pattern could be a reflection of on-site activity rather than geotechnical processes in the dam itself. However, it remains difficult to confirm this given the lack of knowledge on site-specific conditions and activities over the study period.

figure 9

Sentinel-1 PS + DS InSAR results, processed on SqueeSAR with a minimum coherence threshold of 0.60, for the 4 June 2018 Cieneguita TSF failure in Mexico. A Line-of-sight (LOS) velocity map, annotated with the PS points selected for time-series analysis. Negative (red) values indicate detected movements away from the satellite, positive (blue) values indicate detected movements toward the satellite, and green-yellow values indicate detected stable areas. B Cross-sectional schematic illustrating the geometric relationship between the satellite, the tailings dam, and the PS points. The small red arrow from the InSAR data point indicates the direction of LOS movement, in this case away from the satellite. C Average cumulative LOS displacement time-series for the selected data points

2019 Feijao, Brazil

The Feijao TSF is located near Brumadinho, Brazil. The dam collapsed on 25 January 2019, releasing 9.7 M m³ of tailings, equivalent to 75% of the total impounded volume (Fig. 10). The failure was predisposed by several factors (Robertson et al. 2019): (i) the application of the upstream raise method with a steep slope; (ii) the deposition of fine, weak tailings near the crest of the dam; (iii) a setback in construction, which caused the upper portions of the dam to overlie weaker, finer-grained tailings; (iv) the lack of effective horizontal drainage, groundwater seepage, and high rainfall that led to high internal water levels; (v) a loss of suction in the unsaturated portion of the tailings, leading to a sudden loss of strength; and (vi) high iron content in the tailings resulting in particle bonding via iron oxidation, causing brittle behavior in the tailings. These issues preconditioned the occurrence of static liquefaction on the date of failure, likely triggered by drilling activity on a metastable section of the dam (Arroyo and Gens 2021; Arenas et al. 2023). The tailings flow caused 272 deaths and had long-lasting environmental and socio-economic effects in the region, prompting major updates to global industry standards in tailings management (Global Tailings Review 2020). We constructed a time-lapse video of the Feijao TSF using 90 PlanetScope satellite images between June 2017 and February 2019 (Table 4), which confirms that the TSF was in an inactive state during this period.

figure 10

A Pre-failure (early January 2019) and B post-failure (early February 2019) Google Earth Worldview-2 (0.5 m resolution) images of the 25 January 2019 Feijao tailings dam breach in Brazil

A number of previous studies have presented InSAR investigations of the Feijao event (Gama et al. 2020 ; Holden et al. 2020 ; Grebby et al. 2021 ; Mirmazloumi et al. 2023 ). The differences in data interpretation and conclusions between these studies are summarized as follows:

Gama et al. ( 2020 ) used the SBAS and PS algorithms in SARScape (the complete software, not the automated and limited Analytics package that is used in this study) to process 26 Sentinel-1 images and detected a mild acceleration phase in the weeks preceding the collapse. The authors concluded that their confidence in their inverse-velocity prediction results was low due to the wide error distributions that were not centered around the actual failure date.

Holden et al. ( 2020 ) from 3vGeomatics Inc. used proprietary software to analyze Sentinel-1 (> 3 years of images over two orbit tracks), TerraSAR-X (~ 2 years of images), and COSMO-SkyMed data (30 images) and concluded that the precursor acceleration was not statistically significant or anomalous enough to have been a reliable warning sign. They used the findings of Robertson et al. ( 2019 ), who noted “no apparent signs of distress prior to failure,” as a geotechnical justification for this conclusion.

Grebby et al. ( 2021 ) processed 45 Sentinel-1 images from two satellite orbit tracks (both descending) using the ISBAS algorithm in the Punnet (now APSIS) software (Terra Motion Limited). Based on their inverse-velocity analysis of 4–5 data points that showed a prediction interval of ~ 40 days around the failure date, the authors concluded that the collapse was foreseeable.

Mirmazloumi et al. ( 2023 ) used a PS algorithm implemented in the Geomatics Division of the Centre Tecnològic de Telecomunicacions de Catalunya (CTTC) in Spain (Devanthery et al. 2014 ). They processed 68 Sentinel-1 images with the goal of testing a machine learning-based early warning system. The time-of-failure forecast capability for Feijao was found to have some promise albeit with the requirement of expert interpretation due to the low PS point density, given the forested landscape.
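Several of the studies above rely on inverse-velocity analysis. In its simplest linear form, the reciprocal of the displacement velocity is fitted with a straight line against time, and the time at which the line reaches zero is taken as the forecast time of failure. The sketch below illustrates only this generic idea on synthetic data; it is not the implementation of any of the cited studies, and the function names and values are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return slope, mean_y - slope * mean_x


def forecast_failure_time(
    t_days: Sequence[float], speed_mm_per_day: Sequence[float]
) -> float:
    """Time at which the fitted inverse velocity reaches zero.

    `speed_mm_per_day` holds velocity magnitudes (positive values).
    """
    inverse_velocity = [1.0 / v for v in speed_mm_per_day]
    slope, intercept = fit_line(t_days, inverse_velocity)
    if slope >= 0.0:
        raise ValueError("inverse velocity is not decreasing: no acceleration")
    return -intercept / slope


# Synthetic accelerating record whose inverse velocity falls linearly to zero
# at day 100 (inverse velocity = 0.5 * (100 - t), in days per mm).
t_days = [40.0, 52.0, 64.0, 76.0, 88.0]
speed_mm_per_day = [1.0 / (0.5 * (100.0 - t)) for t in t_days]

predicted_day = forecast_failure_time(t_days, speed_mm_per_day)
print(round(predicted_day, 1))
```

With real InSAR data the inverse-velocity points scatter around the trend, so the forecast is an interval rather than a single date, which is why the studies above report prediction intervals and error distributions.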

Figure  11 shows our PS results from the processing of 50 Sentinel-1 images over two orbit tracks (155 and 53, both descending) over a 19-month duration. Track 155 has a relatively high LOS incidence angle (45°) compared to Track 53 (32.5°) (Fig.  11 D). This implies that the PS results of Track 53 are more sensitive to vertical deformations, whereas Track 155 results are more sensitive to sub-horizontal displacements. The dam face was exposed to the satellite’s LOS at an oblique angle (Fig.  11 D); this makes it a non-ideal geometry to estimate the rate of movement in the downstream direction, which is almost parallel to the orbit track. It is worth noting that, by implementing a high coherence threshold (0.70) via the PS analysis on SARScape Analytics, there were no data points detected along the dam toe. A similar issue was encountered by Mirmazloumi et al. ( 2023 ), though their selected coherence threshold was not reported.
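The effect of the incidence angle on sensitivity can be made concrete with a short calculation. As a simplified sketch, it assumes that a unit vertical displacement projects onto the LOS by cos(θ) and that a unit horizontal displacement in the look direction projects by sin(θ); the function name is hypothetical.

```python
import math
from typing import Tuple


def los_sensitivity(incidence_deg: float) -> Tuple[float, float]:
    """Return (vertical, horizontal) LOS projection factors for an incidence angle."""
    theta = math.radians(incidence_deg)
    return math.cos(theta), math.sin(theta)


vertical_53, horizontal_53 = los_sensitivity(32.5)    # Track 53
vertical_155, horizontal_155 = los_sensitivity(45.0)  # Track 155

print(round(vertical_53, 2), round(horizontal_53, 2))
print(round(vertical_155, 2), round(horizontal_155, 2))
```

Under this simplification, Track 53 captures a larger share of vertical movement and Track 155 a larger share of horizontal movement in the look direction, in line with the interpretation above. Movement parallel to the orbit track remains poorly captured in either geometry.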

figure 11

Sentinel-1 PS-InSAR results, processed on SARScape Analytics with a minimum coherence threshold of 0.70, for 2 satellite orbit tracks over the site of the 25 January 2019 Feijao TSF failure in Brazil. A , B Line-of-sight (LOS) velocity map for Track 53 and Track 155, annotated with the data points selected for time-series analysis. Negative (red) values indicate detected movements away from the satellite, positive (blue) values indicate detected movements toward the satellite, and green-yellow values indicate detected stable areas. C Legend of the LOS velocity maps. D Cross-sectional schematic illustrating the geometric relationship between the satellites, the dam, and the PS points selected for time-series analysis. The small red arrows indicate the direction of LOS displacement, in this case away from the satellite. E Cumulative LOS displacement time-series for the selected data points from Track 53 and Track 155

The time-series results from both orbit tracks corresponding to the dam crest, shown in Fig.  11 E, show that the LOS cumulative displacements were in the order of 40 mm (away from the sensor) over the 19-month duration. This compares to averages of 27 mm (SBAS) and 35 mm (PS) over a 10-month duration reported by Gama et al. ( 2020 ), an average of 20 mm (ISBAS) over 17 months reported by Grebby et al. ( 2021 ), and averages of 40 mm (Track 53) and 60 mm (Track 155) over 27 months reported by Holden et al. ( 2020 ). None of our InSAR data points present any visually discernible evidence of precursor accelerations. Our conclusion is, therefore, consistent with that of Holden et al. ( 2020 )—i.e., although the dam was experiencing deformations, the failure date could not have been predicted in advance using InSAR data alone.

2019 Hindalco, India

Hindalco Industries operates a bauxite residue TSF near the village of Muri in Jharkhand, India. The TSF covers a surface area of ~300,000 m² and was dammed by ~5 m high gabion retaining walls with a perimeter of over 2 km. Historical satellite images on Google Earth show that a water pond covered the SW portion of the impoundment for several years until sometime in the period 2011–2014, when tailings were deposited into the pond. There were also two water storage ponds as distinct compartments in the TSF.

On 9 April 2019, about 600 m of the SW section of the gabion wall was breached (Fig.  12 ). The failed materials comprised the entire extent of the former pond area, as well as one of the water storage ponds. The flowslide was bounded by railway tracks toward the west and travelled for a few hundred meters southward along the margin of the tracks toward the village. The trigger mechanism of the event is unclear, but local reports have pointed to the poorly constructed gabion wall as a failure cause and to a potential undrained failure mechanism based on field observations, eyewitness accounts, and analysis of a publicly available video of the breach area (Kumar 2019 ; Rana et al. 2021 ). We created a time-lapse video of the Hindalco TSF using 120 PlanetScope images captured over the period September 2017 to September 2019 (Table 4 ).

Fig. 12 A Pre-failure (9 April 2019) and B post-failure (9 May 2019) satellite images of the 9 April 2019 Hindalco tailings dam breach in India. Image A is PlanetScope (3 m resolution) and image B is Google Earth Worldview-2 (0.5 m resolution)

Figure  13 shows the PS-InSAR results for the Hindalco case. We processed 52 Sentinel-1 images spanning 18.5 months. The satellite was on a descending orbit track with a LOS angle of 38.2°. Along the section of the gabion wall that breached (i.e., on the west and south side), the crest was exposed to the satellite’s LOS whereas the wall slopes were partly or completely hidden. For the generation of cumulative displacement time-series, we selected the PS points that exhibited the highest LOS velocities. In both the western and southern failed sections, these data points indicated a LOS cumulative displacement of ~ 40 mm over the study duration, without any notable indication of precursor accelerations.

Fig. 13 Sentinel-1 PS-InSAR results, processed on SARScape Analytics with a minimum coherence of 0.65, for the 9 April 2019 Hindalco TSF failure in India. A Line-of-sight (LOS) velocity map, annotated with the data points selected for time-series analysis. Negative (red) values indicate detected movements away from the satellite, positive (blue) values indicate detected movements toward the satellite, and green-yellow values indicate detected stable areas. B Cross-sectional schematics illustrating the geometric relationship between the satellite, the dam, and the PS points selected for time-series analysis. The small red arrows indicate the direction of detected LOS movement, in this case away from the satellite. C Cumulative LOS displacement time-series for the selected data points

Our analyses show that the quality or value of Sentinel-1 InSAR results for a tailings dam may be influenced by several variables and considerations with practical implications for monitoring accuracy and failure predictions. At the same time, we acknowledge that the methodology underpinning this study involved some limitations and user judgment, and future research will need to overcome these limitations to build on this work. All of these discussion points are presented in the following sub-sections.

Practical considerations

Environmental conditions

It is well-established that C-band InSAR data (e.g., Sentinel-1) and the PS technique are not well-suited to monitor areas with dense vegetation. This is because the short wavelength of C-band data (~ 6 cm) prevents signal penetration through wooded or forested sites, and because such sites are often characterized by temporal de-correlation (e.g., Crosetto et al. 2010 ). Therefore, this approach would be ineffective for monitoring most of the thousands of tailings dams in sub-tropical regions such as Brazil, China, India, and Mexico. One way to overcome this limitation is to use L-band data (see “ Selection of satellite data ”) and the SBAS algorithm to monitor TSFs in such regions.

Snow/ice cover also affects the quality of InSAR data, which necessitated the removal of winter images for our ground-truth test site. At the same time, removing a significant section of the processing stack leads to a long temporal baseline, which can cause phase unwrapping problems and underestimate real deformations (Pawluszek-Filipiak et al. 2023 ). Both conflicting issues can impact the monitoring performance of satellite InSAR for tailings dams in cold-climate regions (e.g., Canada, Nordic countries, Russia). A viable approach to bypass this limitation is to install artificial corner reflectors on tailings dams, which help concentrate InSAR measurements on select sections that require monitoring (Pawluszek-Filipiak et al. 2023 ).

Dam orientation in relation to satellite line-of-sight (LOS)

The satellite orbit direction and the LOS angle have an important influence on the detected magnitude of InSAR displacement results and the subsequent data interpretation. This effect is particularly applicable to TSFs that consist of multiple dams of variable orientations. However, LOS components of vertical deformation on the dam crest can still be well-detected irrespective of the satellite orbit direction, as observed in the Cadia case. The tracking of vertical versus horizontal deformation (when using only a single satellite rather than multiple overlapping satellites) is sensitive to the satellite’s LOS angle, whereby a smaller angle corresponds to a stronger sensitivity to vertical movements.
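As a rough illustration of this sensitivity, the sketch below projects a unit vertical movement and a unit horizontal movement (along the ground-range look direction) onto the LOS. It assumes that the LOS angle is measured from the vertical, so that the vertical factor is cos θ and the horizontal factor is sin θ. The function name `los_sensitivity` is hypothetical, and 38.2° is the LOS angle reported above for the Hindalco case.

```python
import math


def los_sensitivity(los_angle_deg):
    """Return (vertical, horizontal) LOS sensitivity factors.

    Assumption: the LOS angle is measured from the vertical, and the
    horizontal movement is taken along the ground-range look direction.
    """
    theta = math.radians(los_angle_deg)
    return math.cos(theta), math.sin(theta)


for angle in (30.0, 38.2, 45.0):
    vertical, horizontal = los_sensitivity(angle)
    print(f"LOS angle {angle:4.1f} deg: vertical {vertical:.2f}, horizontal {horizontal:.2f}")
```

Under this assumption, a smaller LOS angle gives a vertical factor closer to 1, which is consistent with the stronger sensitivity to vertical movements noted above.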

Where possible, the retrieval of 2-D InSAR data (vertical displacements and horizontal east–west displacements) is ideal. However, a major existing limitation is that, due to the polar orbit of SAR satellites, sub-horizontal displacements in the north–south direction cannot be retrieved.
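The 2-D retrieval can be sketched as a two-equation linear system that combines one ascending and one descending LOS measurement of the same point. This is a minimal sketch under stated assumptions, not the workflow of any particular software: positive LOS values mean movement toward the satellite (as in the figures of this study), the sensor is right-looking, the north–south component is neglected, and the function name `decompose_2d` and the input values are hypothetical.

```python
import math


def decompose_2d(d_asc, d_desc, theta_asc_deg, theta_desc_deg):
    """Solve for (vertical, east) displacement from ascending/descending LOS values.

    Assumed convention (positive LOS = toward the satellite, north-south neglected):
        d_asc  = U * cos(theta_asc)  - E * sin(theta_asc)
        d_desc = U * cos(theta_desc) + E * sin(theta_desc)
    """
    ta = math.radians(theta_asc_deg)
    td = math.radians(theta_desc_deg)
    # Determinant of the 2x2 system, equal to sin(ta + td).
    det = math.cos(ta) * math.sin(td) + math.sin(ta) * math.cos(td)
    up = (d_asc * math.sin(td) + d_desc * math.sin(ta)) / det
    east = (d_desc * math.cos(ta) - d_asc * math.cos(td)) / det
    return up, east


# Hypothetical example: both tracks record 30 mm of movement away from the satellite.
up, east = decompose_2d(-30.0, -30.0, 38.0, 38.0)
print(f"vertical {up:.1f} mm, east {east:.1f} mm")
```

In this example, equal LOS values on both tracks resolve to purely vertical movement that is larger than either LOS value, which illustrates why a single 1-D measurement can understate the true displacement.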

Selection of satellite data

Given that Sentinel-1 is currently the only open-source SAR imagery with near-global coverage, it remains the most popular option for InSAR researchers over alternative data sources such as TerraSAR-X, SAOCOM-1, ALOS PALSAR-1/2, and COSMO-SkyMed. However, our research has shown that the limitations of Sentinel-1 may have implications for effective long-term monitoring of tailings dams.

For instance, when monitoring a smaller-sized TSF, the resolution of Sentinel-1 data (20 × 5 m) may be too coarse, as a single pixel may cover a significant part of the tailings dam being studied. A potential solution to this is TerraSAR-X for which the spatial resolution can be 1–3 m depending on the imaging mode. To our knowledge, only Holden et al. ( 2020 ) have conducted a comparison between the two satellites for a TSF failure (Feijao), whereas Gama et al. ( 2022 ) and a few studies in other sectors (e.g., Bischoff et al. 2017 ; Colombo 2021 ; Wang et al. 2021 ) have commented on the higher point density and lower standard deviation offered by TerraSAR-X. As such, Sentinel-1 appears to be more appropriate for monitoring larger-scale hazards.

The positional accuracy of observation points can also vary depending on the satellite. Although the precision of displacement measurements is millimetric, the position of observation points is known with a meter-scale accuracy. According to general insights from SqueeSAR case studies, the approximate point-elevation accuracy is ± 1.5 m for TerraSAR-X compared to ± 8 m for Sentinel-1. Furthermore, the approximate north–south and east–west point-location accuracy is ± 1 m and ± 3 m, respectively, for TerraSAR-X, compared to ± 8 m and ± 12 m for Sentinel-1. These differences are particularly important considerations when monitoring smaller-sized TSFs.

To overcome the limitations of C-band InSAR (e.g., Sentinel-1), the use of L-band data (e.g., ALOS PALSAR-1/2 and SAOCOM-1) could be more effective when monitoring TSFs in forested/wooded terrains due to the longer signal wavelength of ~ 24 cm. An example of the application of L-band data for TSF monitoring is presented in Hu et al. ( 2017 ). The L-band and S-band (~ 12 cm wavelength) satellite NISAR is planned to be launched in 2024 with near-global coverage and a revisit interval of 12 days. Like Sentinel-1, the NISAR data will be made freely available. This will expand the scope and capabilities of InSAR monitoring of TSFs in diverse environmental settings and will enable case-study applications of multi-band InSAR data.

Selection of processing software/algorithm

Each InSAR data processing software is founded on algorithms that filter and convert raw radar satellite data into point-cloud displacement data. The strengths and limitations of these algorithms differ depending on each software. To our knowledge, the present study is the first to directly compare different processing algorithms for a TSF site. The commercial SARScape Analytics package allows automated data processing and enables faster runtimes, thus making it convenient for multi-site, regional-scale assessments or comprehensive case study investigations. However, the automated approach also prevents the user from checking interferogram quality, modifying filtering techniques, or assigning/locating the reference point—a critical parameter for InSAR processing. These limitations generally do not exist in advanced commercial software (e.g., the complete SARScape package) and proprietary algorithms (e.g., SqueeSAR).

The ground-truth application showed a major difference in the performances of SARScape Analytics in comparison to SqueeSAR. In SqueeSAR, the number of data points increases with greater coherence, which contrasts with the statistical distribution for SARScape Analytics. This may reflect the different data filtering techniques in both algorithms, and it resulted in a much greater point density for SqueeSAR over the TSF. Moreover, the Tonglvshan case highlighted the issue of loss of coherence in SARScape Analytics that led to only a single data point along the breach section of the dam.

Errors appear to have affected the final two data points of the time-series for Cadia. This issue was not encountered in Carlà et al. ( 2019a ), who used SqueeSAR, nor in Jefferies et al. ( 2019 ), who implemented SBAS in the complete SARScape package. However, in Bayaraa et al. ( 2022 ), the InSAR results that were processed using TerraMotion’s ISBAS algorithm were characterized by notable variability and by significant deviations from the finite-element modeled deformations. The authors attributed this to the exceedance of the maximum measurable deformation of Sentinel-1 InSAR (28 mm over a 12-day revisit time) in the tertiary deformation phase, and it is likely that these issues impacted our results as well. Jefferies et al. ( 2019 ) also stated that SBAS is more appropriate for capturing non-linear accelerating movements compared to PS.
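The 28 mm figure quoted above is close to half the Sentinel-1 radar wavelength. The short calculation below reproduces it, assuming (as an interpretation that is not stated in the source) that the limit corresponds to one full phase cycle of LOS movement between consecutive acquisitions. The constant names are illustrative; 5.405 GHz is the Sentinel-1 C-band centre frequency.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0
SENTINEL1_FREQ_HZ = 5.405e9  # Sentinel-1 C-band centre frequency

wavelength_mm = SPEED_OF_LIGHT_M_S / SENTINEL1_FREQ_HZ * 1000.0

# Assumption: the limit is one full phase cycle, i.e. half a wavelength of
# LOS movement (two-way path) between consecutive acquisitions.
max_los_per_revisit_mm = wavelength_mm / 2.0

revisit_days = 12
max_rate_mm_per_day = max_los_per_revisit_mm / revisit_days

print(f"wavelength: {wavelength_mm:.1f} mm")
print(f"max LOS movement per revisit: {max_los_per_revisit_mm:.1f} mm")
print(f"equivalent rate: {max_rate_mm_per_day:.2f} mm/day")
```

Under this assumption, LOS movement faster than roughly 2 mm/day over a 12-day interval would exceed the limit, which is plausible during a tertiary (rapidly accelerating) deformation phase.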

Lastly, a key judgment that influenced our InSAR data processing and the time-series analysis was our selection of the minimum coherence threshold: 0.70 for 3 sites, 0.65 for 1 site, and 0.57 for 1 site. The selections depended on the quality of InSAR data over the site and the need to filter out noise and obtain reliable time-series data. As stated in “ Background and approach ”, our coherence thresholds were greater than the 0.45 value applied in the ISBAS analysis by Grebby et al. ( 2021 ) and comparable to the 0.60 value selected in the PS analysis by . The average coherence of InSAR data ultimately depends on the environmental conditions and the choice of processing algorithm (e.g., PS or SBAS or PS + DS)—an example of which is demonstrated in the ground-truth case. It is worth noting that technical guidance on appropriate coherence thresholds is rather limited, particularly for InSAR applications to mine areas.
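To show how the choice of threshold changes the set of retained measurement points, the sketch below filters a small list of points at the thresholds mentioned above (0.45, 0.57, 0.65, and 0.70). The point records, field names, and the function `filter_by_coherence` are hypothetical and are not taken from the data or software used in this study.

```python
def filter_by_coherence(points, min_coherence):
    """Keep only measurement points whose coherence meets the threshold."""
    return [p for p in points if p["coherence"] >= min_coherence]


# Hypothetical measurement points (not data from this study).
points = [
    {"id": "P1", "coherence": 0.82, "velocity_mm_yr": -12.0},
    {"id": "P2", "coherence": 0.68, "velocity_mm_yr": -25.0},
    {"id": "P3", "coherence": 0.59, "velocity_mm_yr": -31.0},
    {"id": "P4", "coherence": 0.47, "velocity_mm_yr": 4.0},
]

for threshold in (0.45, 0.57, 0.65, 0.70):
    kept = filter_by_coherence(points, threshold)
    print(threshold, [p["id"] for p in kept])
```

A stricter threshold removes noisier points but can also remove the fastest-moving ones, which is the trade-off between noise filtering and point density described above.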

Implications for monitoring

In the ground-truth assessment, we observed that both processing software/algorithms were able to represent the deformation regime of 0–50 mm/year on a site-scale, both on the tailings dam crest and the downstream slope. The RMSE between the InSAR and MP data was calculated to be up to 7 mm. The SARScape Analytics results had a slightly higher standard deviation (3.6 mm) compared to SqueeSAR (3.0 mm). However, the loss of data points along the embankment due to on-site activity prevented additional time-series comparisons.
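For reference, an RMSE of this kind can be computed as in the sketch below. It assumes that the InSAR and monitoring prism (MP) series are already expressed along the same direction (e.g., the LOS) and sampled on the same dates; the function name `rmse` and the displacement values are hypothetical and are not the data from the ground-truth site.

```python
import math


def rmse(insar_mm, prism_mm):
    """Root-mean-square error between two equally sampled displacement series."""
    if len(insar_mm) != len(prism_mm) or not insar_mm:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(insar_mm, prism_mm)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical cumulative displacements in mm on matching dates.
insar_mm = [0.0, -4.0, -9.0, -13.0, -20.0]
prism_mm = [0.0, -6.0, -8.0, -16.0, -18.0]
print(f"RMSE: {rmse(insar_mm, prism_mm):.1f} mm")
```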

This study highlights the complementary role that satellite InSAR can play in long-term monitoring programs at TSF sites. InSAR can be a valuable “hazard-screening” tool for active mines containing multiple TSFs or a large TSF, for monitoring inactive or closed TSFs, or for monitoring legacy/abandoned TSFs where installing and maintaining in-situ instrumentation can pose practical challenges. However, some precautionary notes are as follows:

1-D InSAR results often do not represent the maximum rate of movement that the dam is actually experiencing, nor does the LOS represent the true 3-D direction toward which the maximum rate of movement is occurring. This is a situation where the availability of satellite data of overlapping orbits can be important in retrieving 2-D displacements (vertical and east–west horizontal), which was not possible for any of our case studies.

As stated earlier, sub-horizontal movements in the north–south direction tend to be poorly captured and potentially underestimated, due to the polar orbits of SAR satellites.

On-site mining activities (e.g., construction, dam raise, tailings deposition, drilling) can lead to loss of InSAR data in an active TSF and may produce InSAR data patterns that can be potentially misinterpreted without sufficient site-specific knowledge.

Routinely conducting ground-truth assessments with the support of in-situ data (e.g., geodetic, survey, instrumentation) provides value by verifying that the InSAR results are representative on a site-scale.

Prediction of instability (location and timing)

When attempting to predict TSF instability using InSAR, there are two components that require attention: breach location and breach timing. Our study suggests that the breach location is generally easier to predict than the failure date. When using proprietary algorithms such as SqueeSAR, it appears that time-of-failure prediction capabilities are enhanced, given that the issues encountered with Cadia when using SARScape Analytics in this study were not observed in Carlà et al. ( 2019a ).

In the cases of Tonglvshan and Hindalco, there were other hotspots of detected movements in addition to the breach location. Given that these case studies were founded on relatively poor background knowledge, explaining why the breach occurred where it did was challenging using InSAR data alone.

Feijao is an interesting case where several studies (including the present) have conducted forensic InSAR investigations using different processing algorithms and have obtained reasonably similar rates of precursor deformation, yet have arrived at different conclusions on whether the failure timing was foreseeable. This is because, in certain sections of the dam, the precursor deformation patterns indicated minor accelerations which were subject to user-specific judgment, unlike the Cadia case that exhibited anomalous accelerations in the 3 months preceding the breach.

A key lesson to draw here is that, from a geotechnical perspective, not all failure modes can be expected to exhibit obvious, InSAR-detectable signs of precursor distress for weeks prior to dam collapse. While foundation instabilities may involve sequential phases of creep movement and acceleration under high loading conditions, internal erosion (piping) and seepage are processes that cannot be reliably detected via InSAR. Previous studies have also reported that some TSF failure mechanisms occur without advance warning, either due to brittle collapse of the tailings structure (e.g., Feijao; Robertson et al. 2019 ) or sudden anthropogenic disturbances or localized triggers (Rana et al. 2021 ).

This is an important lesson for an engineer having to make decisions in real time based on InSAR data alone, without foreknowledge of the future failure. The main issue is that the satellite revisit interval remains too infrequent (6–12 days) for InSAR to be able to help identify the triggering mechanism or to capture the deformation behavior in the hours preceding a breach. This underscores the value of keeping continuous, accessible records of in-situ data and highlights why InSAR is a useful hazard-screening technology that can complement, but not substitute, on-the-ground observations. It is also worth acknowledging that the benefit of hindsight is an important factor in how the pre-failure InSAR data has been perceived in some forensic case studies, including the ones presented here.
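The user-specific judgment involved in reading precursor accelerations, noted above for Feijao and Cadia, can be made more explicit with a simple screening measure. The sketch below is not the method used in this study or in the cited works: it compares the least-squares LOS velocity over the most recent 90 days (an assumed window, chosen to mirror the 3-month period mentioned for Cadia) with the velocity over the earlier record. The function names and the synthetic 12-day series are hypothetical.

```python
def ls_slope(t, d):
    """Least-squares slope of displacement d against time t."""
    n = len(t)
    mean_t = sum(t) / n
    mean_d = sum(d) / n
    sxx = sum((x - mean_t) ** 2 for x in t)
    sxy = sum((x - mean_t) * (y - mean_d) for x, y in zip(t, d))
    return sxy / sxx


def velocity_ratio(days, disp_mm, recent_window_days=90):
    """Ratio of recent to earlier LOS velocity; values well above 1 suggest acceleration."""
    split = days[-1] - recent_window_days
    early = [(t, d) for t, d in zip(days, disp_mm) if t <= split]
    late = [(t, d) for t, d in zip(days, disp_mm) if t >= split]
    if len(early) < 2 or len(late) < 2:
        raise ValueError("need at least two acquisitions in each window")
    v_early = ls_slope(*zip(*early))
    v_late = ls_slope(*zip(*late))
    if v_early == 0:
        raise ValueError("earlier velocity is zero; ratio undefined")
    return v_late / v_early


# Synthetic 12-day series: steady -0.1 mm/day, then accelerating over the last 90 days.
days = list(range(0, 361, 12))
steady = [-0.1 * t for t in days]
accelerating = [-0.1 * t - 0.002 * max(0, t - 270) ** 2 for t in days]

print(f"steady ratio: {velocity_ratio(days, steady):.2f}")
print(f"accelerating ratio: {velocity_ratio(days, accelerating):.2f}")
```

With a 12-day revisit interval, only about eight acquisitions fall inside a 90-day window, so a ratio of this kind is sensitive to noise in individual data points and would still need to be interpreted alongside in-situ observations.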

Study limitations

This study significantly improves InSAR case history knowledge for tailings dams, thus addressing a critical research gap for practitioners. However, the insights presented herein are conditioned by certain limitations that underpin the adopted approach. We briefly acknowledge the most important limitations below, and we refer to “ Practical considerations ” and “ Implications for monitoring ” where the basis and implications of these limitations were discussed in greater detail:

Selection of the SARScape Analytics processing software. This included issues related to (i) the inability to assign or locate the reference point, (ii) errors when capturing tertiary-phase, rapid accelerations for Cadia, which were not encountered in previous studies, and (iii) the sole use of the PS technique for InSAR processing without comparison to the SBAS algorithm, which is more appropriate for monitoring vegetated areas and, according to Jefferies et al. ( 2019 ), for capturing accelerating movements.

Selection of the minimum coherence threshold. There is limited technical guidance on appropriate coherence thresholds for InSAR data processing over mine areas. We adopted best judgment based on previous studies (e.g., Grebby et al. 2021 ; Mazzanti et al. 2021 ) and applied a relatively high, strict coherence threshold to filter out noise and visualize and analyze reliable time-series data.

No comparisons between Sentinel-1 and other forms of satellite data. Such comparisons would have produced key insights into how different spatial resolutions and different signal wavelength bands influence InSAR data quality and accuracy for tailings dams in diverse site conditions.

Concluding remarks

This study explored the capabilities and limitations of satellite InSAR to monitor the geotechnical stability of tailings dams. This research is timely considering the increased reliance on remote sensing for geotechnical monitoring in the tailings management industry and the need for more case study applications to enhance technical knowledge of InSAR. The goal of this study was to generate practical insights and considerations chiefly from an engineer’s perspective. We used open-source, medium-resolution Sentinel-1 data to undertake a ground-truth assessment at a test site equipped with monitoring prism data and to conduct a forensic analysis of 5 failure cases. The methodology involved the use of a commercial software with an automated PS workflow (SARScape Analytics) for the ground-truth site and 4 of 5 failure cases and an advanced proprietary algorithm (SqueeSAR) implemented with a dual PS + DS technique for the ground-truth site and 1 failure case.

Based on the ground-truth site which has exhibited displacement rates of 0–50 mm/year, we find that Sentinel-1 InSAR can provide reasonable accuracy on a site-scale (for both the dam crest and downstream slope) with a maximum RMSE of 7 mm. In comparison to SARScape Analytics, SqueeSAR was shown to generate a greater point density and a higher average coherence.

Based on all of our forensic case studies, we conclude that Sentinel-1 InSAR can serve as a valuable hazard-screening tool in active mines with large TSFs or multiple TSFs, in mines with inactive or closed TSFs, and in legacy mines with abandoned TSFs, as it may help guide where to undertake targeted investigations. However, the benefit of hindsight is an important factor in how pre-failure InSAR data has been perceived in forensic case studies, including the ones presented herein. From a geotechnical hazard perspective, most potential failure modes associated with tailings dams may not exhibit InSAR-detectable accelerations in precursor deformation trends that could assist with real-time, time-of-failure prediction. Furthermore, the revisit interval of SAR satellites prevents detection of instantaneous failure mechanisms.

As such, long-term monitoring programs for tailings dams should ideally be integrated with a combination of remote sensing and field instrumentation to best support engineering practice and judgment. This study contributes to this effort by providing considerations on how InSAR data quality over tailings dam sites may be influenced by algorithm/satellite selection, environmental conditions, site activity, coherence thresholds, and satellite-dam geometry. Future research to build on this work could involve additional case-study comparisons between different forms of satellite data and between different processing algorithms/software.

Arenas A, Reid D, Fanni R, Smith K, Fourie A (2023) Numerical assessment of drilling-induced static liquefaction triggering of Feijão Dam I. In: Proceedings of the 10th Numerical Methods in Geotechnical Engineering, June 26–28, London, United Kingdom

Arroyo M, Gens A (2021) Computational analyses of dam i failure at the corrego de feijao mine in brumadinho (Final Report). Investigation commissioned by federal public prosecutor's office and vale S.A. https://www.cimne.com/vnews/m2381/11447/cimne-delivers-the-final-technical-report-on-the-brumadinho-disaster-to-the-brazilian-prosecutors-office . Accessed 6 Oct 2021

Aswathi J, Binojkumar RB, Oommen T, Bouali EH, Sajinkumar KS (2022) InSAR as a tool for monitoring hydropower projects: a review. Energy Geosci 3(2):160–171. https://doi.org/10.1016/j.engeos.2021.12.007

Bakon M, Perissin D, Lazecky M, Papco J (2014) Infrastructure non-linear deformation monitoring via satellite radar interferometry. Procedia Tech 16:294–300. https://doi.org/10.1016/j.protcy.2014.10.095

Bayaraa M, Sheil B, Rossi C (2022) InSAR and numerical modelling for tailings dam monitoring – the Cadia failure case study. Géotechnique 1–19. https://doi.org/10.1680/jgeot.21.00399

Berardino P, Fornaro G, Lanari R, Sansosti E (2002) A new algorithm for surface deformation monitoring based on small baseline differential SAR interferograms. IEEE Trans Geosci Remote Sens 40(11):2375–2383. https://doi.org/10.1109/TGRS.2002.803792

Bischoff CA, Ferretti A, Novali F, Uttini A, Giannico C, Meloni F (2020) Nationwide deformation monitoring with SqueeSAR® using Sentinel-1 data. Proc Int Assoc Hydrol Sci 382:31–37. https://doi.org/10.5194/piahs-382-31-2020

Bischoff CA, Basilico M, Ferretti A, Molinaro D, Giannico C, Ghail RC, Mason PJ (2017) A comparison between TerraSAR-X and Sentinel-1 PSInSAR data for infrastructure monitoring in London, UK. GRSG 28th International Annual Conference “Applied Geological Remote Sensing”

Blight GE (2010) Geotechnical engineering for mine waste storage facilities. CRC Press, London

Carlà T, Intrieri E, Raspini F, Bardi F, Farina P, Ferretti A, Colombo D, Novali F, Casagli N (2019a) Perspectives on the prediction of catastrophic slope failures from satellite InSAR. Sci Rep 9(1):1–9. https://doi.org/10.1038/s41598-019-50792-y

Carlà T, Tofani V, Lombardi L, Raspini F, Bianchini S, Bertolo D, Thuegaz P, Casagli N (2019b) Combination of GNSS, satellite InSAR, and GBInSAR remote sensing monitoring to improve the understanding of a large landslide in high alpine environment. Geomorph 335:62–75. https://doi.org/10.1016/j.geomorph.2019.03.014

Casu F, Manzo M, Lanari R (2006) A quantitative assessment of the SBAS algorithm performance for surface deformation retrieval from DInSAR data. Remote Sens Environ 102(3–4):195–210. https://doi.org/10.1016/j.rse.2006.01.023

Chen CW, Zebker HA (2002) Phase unwrapping for large SAR interferograms: Statistical segmentation and generalized network models. IEEE Trans Geosci Remote Sens 40(8):1709–1719. https://doi.org/10.1109/TGRS.2002.802453

Colombo D (2021) Why InSAR monitoring in mining should be “high resolution”. https://www.linkedin.com/pulse/why-insar-monitoring-mining-should-high-resolution-davide-colombo/. Accessed 10 Nov 2022

Crosetto M, Monserrat O, Iglesias R, Crippa B (2010) Persistent scatterer interferometry: potential limits and initial C- and X-band comparison. Photogramm Eng Remote Sens 76:1061–1069

Crosetto M, Monserrat O, Cuevas-González M, Devanthéry N, Crippa B (2016) Persistent scatterer interferometry: a review. ISPRS J Photogramm Remote Sens 115:78–89. https://doi.org/10.1016/j.isprsjprs.2015.10.011

Devanthéry N, Crosetto M, Monserrat O, Cuevas-González M, Crippa B (2014) An approach to persistent scatterer interferometry. Remote Sens 6(7):6662–6679. https://doi.org/10.3390/rs6076662

Duan H, Li Y, Jiang H, Li Q, Jiang W, Tian Y, Zhang J (2023) Retrospective monitoring of slope failure event of tailings dam using InSAR time-series observations. Nat Haz 117(3):2375–2391. https://doi.org/10.1007/s11069-023-05946-7

Ferretti A, Prati C, Rocca F (2001) Permanent scatterers in SAR interferometry. IEEE Trans Geosci Remote Sens 39(1):8–20. https://doi.org/10.1109/36.898661

Ferretti A, Fumagalli A, Novali F, Prati C, Rocca F, Rucci A (2011) A new algorithm for processing interferometric data-stacks: SqueeSAR. IEEE Trans Geosci Remote Sens 49(9):3460–3470. https://doi.org/10.1109/TGRS.2011.2124465

Gama FF, Cantone A, Mura JC (2022) Monitoring horizontal and vertical components of SAMARCO mine dikes deformations by DInSAR-SBAS using TerraSAR-X and sentinel-1 data. Mining 2(4):725–745. https://doi.org/10.3390/mining2040040

Gama F, Mura JC, Paradella W, de Oliveira CG (2020) Deformations prior to the brumadinho dam collapse revealed by Sentinel-1 InSAR data using SBAS and PSI techniques. Remote Sens 12(21):3664. https://doi.org/10.3390/rs12213664

Ghahramani N, Mitchell A, Rana NM, McDougall S, Evans SG, Take A (2020) Tailings-flow runout analysis: examining the applicability of a semi-physical area–volume relationship using a novel database. Nat Hazards Earth Syst Sci. https://doi.org/10.5194/nhess-2020-199

Global Tailings Review (2020) Global industry standard on tailings management. https://globaltailingsreview.org/. Accessed 1 Sept 2020

Grebby S, Sowter A, Gluyas J, Toll D, Gee D, Athab A, Girindran R (2021) Advanced analysis of satellite data reveals ground deformation precursors to the Brumadinho Tailings Dam collapse. Commun Earth Environ 2(1):1–9. https://doi.org/10.1038/s43247-020-00079-2

Holden D, Donegan S, Pon A (2020) Brumadinho dam InSAR study: analysis of TerraSAR-X, COSMO-SkyMed and Sentinel-1 images preceding the collapse. In: Dight P (ed) International symposium on slope stability in open pit mining and civil engineering, Proceedings, Australian Centre for Geomechanics, pp 293–306

Hrysiewicz A, Wang X, Holohan EP (2023) EZ-InSAR: an easy-to-use open-source toolbox for mapping ground surface deformation using satellite interferometric synthetic aperture radar. Earth Sci Inform 16:1929–1945. https://doi.org/10.1007/s12145-023-00973-1

Hu X, Oommen T, Lu Z, Wang T, Kim JW (2017) Consolidation settlement of Salt Lake County tailings impoundment revealed by time-series InSAR observations from multiple radar satellites. Remote Sens Environ 202:199–209. https://doi.org/10.1016/j.rse.2017.05.023

Hudson R, Sato S, Morin R, McParland MA (2021) Comparison of sentinel-1 and Radarsat-2 data for monitoring of tailings storage facilities. In: 13th European Conference on Synthetic Aperture Radar, online, pp 1–6

Islam K, Murakami S (2021) Global-scale impact analysis of mine tailings dam failures: 1915–2020. Glob Environ Change 70:102361. https://doi.org/10.1016/j.gloenvcha.2021.102361

Jefferies M, Morgenstern NR, Van Zyl DV, Wates J (2019) Report on NTSF embankment failure. Investigation report commissioned by Cadia Valley Operations for Ashurst Australia. https://www.newcrest.com/sites/default/files/2019-10/190417_Report%20on%20NTSF%20Embankment%20Failure%20at%20Cadia%20for%20Ashurst.pdf . Accessed 17 Apr 2019

Kim J, Coe JA, Lu Z, Avdievitch NN, Hults CP (2022) Spaceborne InSAR mapping of landslides and subsidence in rapidly deglaciating terrain, Glacier Bay National Park and Preserve and vicinity, Alaska and British Columbia. Remote Sens Environ 281:113231. https://doi.org/10.1016/j.rse.2022.113231

Kumar RM (2019) Muri hindalco red mud blasting(2). https://www.youtube.com/watch?v=8K63D70b4CU&ab_channel=R.Mkumar . Accessed 10 Apr 2019

Mazzanti P, Antonielli B, Sciortino A, Scancella S, Bozzano F (2021) Tracking deformation processes at the legnica glogow copper district (Poland) by Satellite InSAR - II: Żelazny Most Tailings Dam. Land 10(6):654. https://doi.org/10.3390/land10060654

Mirmazloumi SM, Wassie Y, Nava L, Cuevas-González M, Crosetto M, Monserrat O (2023) InSAR time series and LSTM model to support early warning detection tools of ground instabilities: mining site case studies. Bull Eng Geol Environ 82(10):374. https://doi.org/10.1007/s10064-023-03388-w

Morgenstern NR, Vick SG, Van Zyl D (2015) Report on mount polley tailings storage facility breach. In: Report of independent expert engineering investigation and review panel for the Government of british columbia and the williams lake and soda creek indian bands (Canada)

Morgenstern NR, Vick SG, Viotti CB, Watts BD (2016) Report on the immediate causes of the failure of the fundao dam. https://www.resolutionmineeis.us/sites/default/files/references/fundao-2016.pdf . Accessed 1 Dec 2019

Pawluszek-Filipiak K, Wielgocka N, Tondaś D, Borkowski A (2023) Monitoring nonlinear and fast deformation caused by underground mining exploitation using multi-temporal Sentinel-1 radar interferometry and corner reflectors: application, validation and processing obstacles. Int J Digital Earth 16(1):251–271

Perissin D, Wang Z, Wang T (2011) The SARPROZ InSAR tool for urban subsidence/manmade structure stability monitoring in China. In: 34th International Symposium on Remote Sensing of Environment, Proceedings, Sydney

Rana NM, Ghahramani N, Evans SG, McDougall S, Small A, Take WA (2021) Catastrophic mass flows resulting from tailings impoundment failures. Eng Geol 292:106262. https://doi.org/10.1016/j.enggeo.2021.106262

Rana NM, Ghahramani N, Evans SG, Small A, Skermer N, McDougall S, Take WA (2022) Global magnitude-frequency statistics of the failures and impacts of large water-retention dams and mine tailings impoundments. Earth-Sci Rev 232:104144. https://doi.org/10.1016/j.earscirev.2022.104144

Raspini F, Caleca F, Del Soldato M, Festa D, Confuorto P, Bianchini S (2022) Review of satellite radar interferometry for subsidence analysis. Earth-Sci Rev 235:104239. https://doi.org/10.1016/j.earscirev.2022.104239

Robertson PK, de Melo L, Williams DJ, Wilson GW (2019) Report of the expert panel on the technical causes of the failure of Feijao Dam I. Investigation report commissioned by Vale S.A. http://www.b1technicalinvestigation.com/ . Accessed 12 Dec 2019

Rotta LHS, Alcantara E, Park E, Negri RG, Lin YN, Bernardo N, Mendes TSG, Filho CRS (2020) The 2019 Brumadinho tailings dam collapse: possible cause and impacts of the worst human and environmental disaster in Brazil. Int J Appl Earth Obs Geoinf 90:102119. https://doi.org/10.1016/j.jag.2020.102119

Sowter A, Amat MBC, Cigna F, Marsh S, Athab A, Alshammari L (2016) Mexico City land subsidence in 2014–2015 with Sentinel-1 IW TOPS: results using the intermittent SBAS (ISBAS) technique. Int J Appl Earth Obs Geoinf 52:230–242. https://doi.org/10.1016/j.jag.2016.06.015

Su C, Mergili M, Rana NM, Zhang S, Dai C, Wang B, Han Y (2024) Failure analysis and flow dynamic modeling using a new slow-flow functionality: the 2022 Jiaokou (China) tailings dam breach. Landslides 21:379–391. https://doi.org/10.1007/s10346-023-02146-z

Thomas A, Edwards SJ, Engels J, McCormack H, Hopkins V, Holley R (2019) Earth observation data and satellite InSAR for the remote monitoring of tailings storage facilities: a case study of Cadia Mine, Australia. In: Paterson A, Fourie A, Reid D (eds) 22nd International Conference on Paste, Thickened and Filtered Tailings, Proceedings, Australia

Vick SG (1983) Planning, design, and analysis of tailings dams. John Wiley & Sons, New York

Wang L, Deng K, Zheng M (2020) Research on ground deformation monitoring method in mining areas using the probability integral model fusion D-InSAR, sub-band InSAR and offset-tracking. Int J Appl Earth Obs Geoinf 85:101981. https://doi.org/10.1016/j.jag.2019.101981

Wang Y, Bai Z, Zhang Y, Qin Y, Lin Y, Li Y, Shen W (2021) Using TerraSAR X-band and Sentinel-1 C-band SAR interferometry for deformation along Beijing-Tianjin intercity railway analysis. IEEE J Select Topics Appl Earth Obs Remote Sens 14:4832–4841. https://doi.org/10.1109/JSTARS.2021.3076244

Werner C, Wegmüller U, Strozzi T, Wiesmann A (2000) Gamma SAR and interferometric processing software. In: ERS-Envisat Symposium, Proceedings, Gothenburg

Yunjun Z, Fattahi H, Amelung F (2019) Small baseline InSAR time series analysis: unwrapping error correction and noise reduction. Comput Geosci 133:104331. https://doi.org/10.1016/j.cageo.2019.104331

Zhuang Y, Jin K, Cheng Q, Xing A, Luo H (2022) Experimental and numerical investigations of a catastrophic tailings dam break in Daye, Hubei China. Bull Eng Geol Environ 81(1):1–16. https://doi.org/10.1007/s10064-021-02491-0

Download references

Acknowledgements

This research was carried out within the CanBreach project, which comprises five industrial partners (Imperial Oil, Suncor Energy, BGC Engineering, Klohn Crippen Berger, and Golder Associates) and three Canadian research institutions (University of Waterloo, The University of British Columbia, and Queen’s University). The authors acknowledge the guidance and feedback provided by Jeanine Engelbrecht (BGC Engineering), Scott Martens (Teck Resources), Gideon Steyl (ATC Williams), and Joanna Chen (WSP Golder) during the preparation of this study, and thank the anonymous peer reviewers for their comments, which improved the quality of this manuscript. The authors also acknowledge the support of NV5 Geospatial (formerly L3Harris Geospatial), Sarmap, and TRE Altamira for the InSAR processing software/algorithms used in this study.

The first phase of the CanBreach project (2019–2023) was jointly funded by the listed industrial partners and a Collaborative Research and Development (CRD) grant (CRDPJ 533226–18) issued by the Natural Sciences and Engineering Research Council (NSERC) of Canada. The lead author (N. Rana) was supported by the Science Domestic Scholarship and the President’s Scholarship awarded by the University of Waterloo, the Queen Elizabeth II Graduate Scholarship in Science and Technology awarded by the Ontario Provincial Government in Canada, and the Gary Salmon Memorial Scholarship awarded by the Canadian Dam Association.

Author information

Authors and affiliations.

Department of Earth and Environmental Sciences, University of Waterloo, Waterloo, ON, N2L3G1, Canada

Nahyan M. Rana, Keith B. Delaney & Stephen G. Evans

Klohn Crippen Berger, Toronto, ON, M5H 1T1, Canada

Nahyan M. Rana

BGC Engineering, Edmonton, AB, T5J 4A1, Canada

Klohn Crippen Berger, Fredericton, NB, E3B 2L2, Canada

Knight Piésold, Vancouver, BC, V6C 2T8, Canada

Daniel A. M. Adria

Department of Earth, Ocean and Atmospheric Sciences, The University of British Columbia, Vancouver, BC, V6T 1Z4, Canada

Scott McDougall

WSP, Lakewood, CO, 80228, USA

Negar Ghahramani

Department of Civil Engineering, Queen’s University, Kingston, ON, K7L 3N6, Canada

W. Andy Take

Corresponding author

Correspondence to Nahyan M. Rana.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Rana, N.M., Delaney, K.B., Evans, S.G. et al. Application of Sentinel-1 InSAR to monitor tailings dams and predict geotechnical instability: practical considerations based on case study insights. Bull Eng Geol Environ 83, 204 (2024). https://doi.org/10.1007/s10064-024-03680-3

Received: 18 October 2023

Accepted: 07 April 2024

Published: 29 April 2024

DOI: https://doi.org/10.1007/s10064-024-03680-3

  • Tailings storage facility
  • Mining hazards
  • Remote sensing
  • Failure prediction
  • Risk management

IMAGES

  1. Top 10 Data Mining Algorithms Explained I DevTeam.Space

  2. Data Mining Algorithms

  3. Data Mining Process, Algorithms, and Software

  4. Top 10 Use Cases of AI in Data Mining Algorithms Explained

  5. Data Mining Algorithms

  6. Data Modeling in Data Science for Beginners

VIDEO

  1. Major Issues in Data Mining || Data Mining challenges

  2. Data Mining Algorithms

  3. Lecture 16: Data Mining CSE 2020 Fall

  4. Density Based Clustering

  5. Difference between Data Analytics and Data Science

  6. Lecture 15: Data Mining CSE 2020 Fall

COMMENTS

  1. Data mining tools -a case study for network intrusion detection

    Data mining tools use historical information to build a model that predicts customer behavior, e.g., which customers are likely to respond to a new product. Another example is intrusion detection in local systems or networks, where system and network activity is analyzed and processed by the data mining algorithms in these tools.

  2. Sampling-Based Data Mining Algorithms: Modern Techniques and Case Studies

    Abstract. Sampling a dataset for faster analysis and looking at it as a sample from an unknown distribution are two faces of the same coin. We discuss the use of modern techniques involving the Vapnik-Chervonenkis (VC) dimension to study the trade-off between sample size and accuracy of data mining results that can be obtained from a sample.

  3. Case studies

    The case studies use data mining algorithm implementations from CRAN packages. It is devoted to the classification of individuals described by socioeconomic Census attributes into income categories. The primary objective of the case study is to predict the number of violent crimes (per population) in US communities based on attributes ...

  4. TOP-10 DATA MINING CASE STUDIES

    Abstract. We report on the panel discussion held at the ICDM'10 conference on the top 10 data mining case studies in order to provide a snapshot of where and how data mining techniques have made significant real-world impact. The tasks covered by 10 case studies range from the detection of anomalies such as cancer, fraud, and system failures to ...

  5. Educational data mining using cluster analysis and decision tree

    Data mining. Data mining algorithms, such as classification and clustering, are applied to predict student success. 4. ... Križanić S, Tomičić-Pupek K. Process parameters discovery based on application of k-means algorithm—a real case experimental study. In: Central European Conference on Information and Intelligent Systems ...

  6. An Efficient Healthcare Data Mining Approach Using Apriori Algorithm: A

    There are several data mining algorithms that have been employed for data preprocessing, classification, and clustering. A comparison of the pros and cons of these techniques in the practical applications is also elaborated for their use in industry. ... "An Efficient Healthcare Data Mining Approach Using Apriori Algorithm: A Case Study of Eye ...

  7. Application of data mining algorithms for improving stress prediction

    Evaluates more widely implemented data mining algorithms on the same dataset in pursuing the objective to investigate the overall suitability of such adoption in estimating the stress levels of automobile drivers. ... and RF algorithms in predicting the stress levels of automobile drivers in Jordan, as a case study. Physiological data was ...

  8. PDF MobileMiner: A Real World Case Study of Data Mining in Mobile

    problems. However, very few data mining researchers have a chance to see a working data mining system on real mobile communication data. In this demo, we showcase our new system MobileMiner on a real mobile communication data set, which presents a case study of business solutions using state-of-the-art data mining techniques. MobileM-

  9. A Data Mining Approach for Inventory Forecasting: A Case Study of a

    In some prior studies, these algorithms have been widely adopted as classical methods for forecasting data with nonlinear behaviors. In one study conducted by Zadeh, data mining was applied to determine the quantity of drug to be kept in store. First, explorative network analysis was done to discover clique set and group members of drugs with ...

  10. 5 Data Mining Use Cases

    Read the PBS, LunaMetrics, and Google Analytics case study. 5. The Pegasus Group. Cyber attackers compromised and targeted the data mining system (DMS) of a major network client of The Pegasus Group and launched a distributed denial-of-service (DDoS) attack against 1,500 services. Under extreme time pressure, The Pegasus Group needed to find a ...

  11. energy efficiency solution based on time series data mining algorithm—a

    Abstract. This study aims to conduct data mining research on the time series energy consumption dataset of a small hotel. Earlier studies on data mining have shown that cluster and association analysis are commonly used methods, but they have not yet been investigated along the time series dimension.

  12. (PDF) Data Mining Algorithms: An Overview

    mining and the algorithms which are commonly used in data mining. 3. DATA MINING ALGORITHMS. A data mining algorithm is a set of heuristics and calculations that creates a data mining model from ...

  13. The Data Mining Approach: A Case Study

    Abstract: Data mining is a business-effective technology to provide customer experience enhancement and alleviate the process of decision-making along the digital transformation journey. The main goal of this research paper is to provide a case study - an analysis on the implementation of data mining techniques, in particular clustering techniques, and a theoretical analysis and research of ...

  14. Financial fraud detection applying data mining techniques: A

    The study shows that 34 data mining techniques were used to identify fraud throughout various financial applications. ... Agrawal et al. [48] suggested a model to identify credit card fraud, based on a case study, that combines HMM, behavior-based, and Genetic Algorithm (GA) techniques. The proposed model consists of three steps. First, the authors used ...

  15. Using Data Mining in Educational Administration: A Case Study on ...

    Pupil absenteeism remains a significant problem for schools across the globe with negative impacts on overall pupil performance being well-documented. Whilst all schools continue to emphasize good attendance, some schools still find it difficult to reach the required average attendance, which in the UK is 96%. A novel approach is proposed to help schools improve attendance that leverages the ...

  16. Data Mining in Healthcare: Applying Strategic Intelligence Techniques

    In recent years, data-mining algorithms have stood out for their usefulness in detecting and screening patients with potential adverse drug reactions and ... Yu K. Data mining process for predicting diabetes mellitus based model about other chronic diseases: A case study of the northwestern part of Nigeria. Healthc. Technol. Lett. 2019; 6 ...

  17. Data Mining Case Studies & Benefits

    A successful implementation requires defining clear goals, choosing data wisely, and constant adaptation. Data mining case studies help businesses explore data for smart decision-making. It's about finding valuable insights from big datasets. This is crucial for businesses in all industries as data guides strategic planning.

  18. Educational data mining: prediction of students' academic performance

    Educational data mining has become an effective tool for exploring the hidden relationships in educational data and predicting students' academic achievements. This study proposes a new model based on machine learning algorithms to predict the final exam grades of undergraduate students, taking their midterm exam grades as the source data. The performances of the random forests, nearest ...

  19. Impact of Data Mining Techniques in Predictive Modeling: A Case Study

    The results obtained from data mining are essentially used for making analysis and predictions. Different techniques have been used in data mining, such as association, clustering, classification, prediction, outlier detection, and regression. Prediction is the most significant data mining technique that employs a set of pre ...

  20. PDF The Effect of Clustering in the Apriori Data Mining Algorithm: A Case Study

    Ten most popular data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) are presented in [5]. They are listed as: C4.5, K-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top ten algorithms are among the most. a influential company's success [2].

  21. A CASE STUDY ON DATA MINING APPLICATIONS ON BANKING SECTOR

    International Journal of Computer Sciences and Engineering. Open Access. Research Paper Vol-6, Special Issue-8, Oct 2018 E-ISSN: 2347-2693. A CASE STUDY ON DATA MINING APPLICATIONS ON BANKING ...

  22. Data Mining with Clustering Algorithms to Reduce Packaging Costs: A

    In this study, a data mining model with three clustering algorithms was developed to modularize a packaging system by reducing the variety of packaging sizes. ... The results show that the packaging system modularized by the agglomerative hierarchical clustering algorithm is more cost-effective in this case compared with the ones modularized by ...

  23. Research on Intelligent Data Mining and Knowledge Discovery Method

    The data mining techniques introduced in this paper are typical and relatively novel, and many of them have been successfully applied to data mining applications. The K-nearest neighbor algorithm has an accuracy rate of 90% and a recall rate of 95%. The accuracy of SVM is 93%, and the recall rate is 94%. This paper provides a research ...

  24. PDF repositorium.sdum.uminho.pt

    repositorium.sdum.uminho.pt

  25. ch 4 midterm Flashcards

    Study with Quizlet and memorize flashcards containing terms like In data mining, classification models help in prediction., The data mining in cancer research case study explains that data mining methods are capable of extracting patterns and ________ hidden deep in large and complex medical databases., List five reasons for the growing popularity of data mining in the business world. and more.

  26. Estimation of missing weather variables using different data mining

    The availability of continuous weather data is essential in many applications such as the study of hydrology, glaciology, and modelling of extreme catastrophic events such as landslides, heavy precipitation, cloud bursts, and snow avalanches. Weather data are collected either manually or automatically, and for a variety of reasons it becomes difficult to maintain continuous records of these data.

  27. Application of Sentinel-1 InSAR to monitor tailings dams and ...

    To process the Sentinel-1 InSAR data, we used two software/algorithms: (i) for the ground-truth site and 4 of 5 forensic case studies, a commercial software (SARScape Analytics) that offers an automated workflow for PS analysis; and (ii) for the ground-truth site and only 1 forensic case study, the proprietary algorithm SqueeSAR which is ...
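
Item 23 in the list above quotes accuracy and recall percentages for two classifiers. As a reminder of how such figures are usually derived, the sketch below computes both metrics from the four cells of a binary confusion matrix using their standard definitions. The function names and the counts are hypothetical, chosen only so that the result lands on 90% and 95%; they are not taken from the cited paper, which may define its "accuracy rate" differently.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all predictions that are correct: (TP + TN) / total."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were found: TP / (TP + FN)."""
    positives = tp + fn
    return tp / positives if positives else 0.0


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts (assumption, not data from any study).
    tp, fp, tn, fn = 95, 15, 85, 5
    print(f"accuracy = {accuracy(tp, fp, tn, fn):.2%}")
    print(f"recall   = {recall(tp, fn):.2%}")
```

Note that the two metrics can diverge sharply on imbalanced data: a classifier that rarely predicts the positive class can still score high accuracy while its recall stays low, which is why snippets like item 23 report both.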