Selected Papers on Analysis of Algorithms

  1. Mathematical Analysis of Algorithms
  2. The Dangers of Computer Science Theory
  3. The Analysis of Algorithms
  4. Big Omicron and Big Omega and Big Theta
  5. Optimal Measurement Points for Program Frequency Counts
  6. Estimating the Efficiency of Backtrack Programs
  7. Ordered Hash Tables
  8. Activity in an Interleaved Memory
  9. An Analysis of Alpha-Beta Pruning
  10. Notes on Generalized Dedekind Sums
  11. The Distribution of Continued Fraction Approximations
  12. Evaluation of Porter's Constant
  13. The Subtractive Algorithm for Greatest Common Divisors
  14. Length of Strings for a Merge Sort
  15. The Average Height of Planted Plane Trees
  16. The Toilet Paper Problem
  17. An Analysis of Optimum Caching
  18. A Trivial Algorithm Whose Analysis Isn't
  19. Deletions That Preserve Randomness
  20. Analysis of a Simple Factorization Algorithm
  21. The Expected Linearity of a Simple Equivalence Algorithm
  22. Textbook Examples of Recursion
  23. An Exact Analysis of Stable Allocation
  24. Stable Husbands
  25. Shellsort With Three Increments
  26. The Average Time for Carry Propagation
  27. Linear Probing and Graphs
  28. A Terminological Proposal
  29. Postscript About NP-hard Problems
  30. An Experiment in Optimal Sorting
  31. Duality in Addition Chains
  32. Complexity Results for Bandwidth Minimization
  33. The Problem of Compatible Representatives
  34. The Complexity of Nonuniform Random Number Generation

Prof. Knuth maintains a page on this book, including the table of contents and errata.

Books by Donald E. Knuth at CSLI Publications:

  • Knuth par Knuth (in French)
  • Fantasia Apocalyptica Illustrated
  • Companion to the Papers of Donald Knuth
  • Selected Papers on Fun and Games
  • Selected Papers on Design of Algorithms
  • Selected Papers on Discrete Mathematics
  • Selected Papers on Computer Languages
  • Digital Typography
  • Selected Papers on Computer Science
  • Literate Programming
  • Things a Computer Scientist Rarely Talks About
  • Algorithmes (in French)
  • Éléments pour une histoire de l'informatique (in French)




Analysis of Algorithms and Complexity Theory

A section of Algorithms (ISSN 1999-4893).

Section Information

“Analysis of Algorithms and Complexity Theory” is a Section of the MDPI open-access journal Algorithms. It focuses on the design, formal analysis, and experimental evaluation of algorithms and algorithmic techniques for efficiently solving fundamental computational problems of a theoretical or practical nature. Research related to complexity aspects such as time–space tradeoffs, information-theoretic entropy, and lower bounds in various models of computation is also covered in this Section of the journal. We hereby invite original research articles devoted to any of the above (or related) areas and hope to receive many high-quality submissions.

Dr. Jesper Jansson
Section Editor-in-Chief

  • Theory of algorithms 
  • Sorting and searching algorithms 
  • Automata theory and formal languages 
  • Parameterized complexity 
  • Computational geometry 
  • Quantum algorithms

The following special issues within this section are currently open for submissions:

  • Numerical Optimization and Algorithms: 2nd Edition (Deadline: 15 September 2024)
  • Meta-Heuristics and Machine Learning in Modelling, Developing and Optimising Complex Systems (Deadline: 20 September 2024)
  • Geometric Algorithms and Applications (Deadline: 30 September 2024)
  • Algorithms for Complex Problems (Deadline: 31 October 2024)
  • Surveys in Algorithm Analysis and Complexity Theory, Part II (Deadline: 28 February 2025)

Topical Collection

The following topical collection within this section is currently open for submissions:

  • Feature Paper in Algorithms and Complexity Theory


Machine Learning: Algorithms, Real-World Applications and Research Directions

  • Review Article
  • Published: 22 March 2021
  • Volume 2, article number 160 (2021)

  • Iqbal H. Sarker (ORCID: orcid.org/0000-0003-1740-5517)


In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, this study’s key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for both academia and industry professionals as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.


Introduction

We live in the age of data, where everything around us is connected to a data source, and everything in our lives is digitally recorded [21, 103]. For instance, the current electronic world has a wealth of various kinds of data, such as the Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. The data can be structured, semi-structured, or unstructured, as discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”, and it is increasing day by day. Extracting insights from these data can be used to build various intelligent applications in the relevant domains. For instance, to build a data-driven automated and intelligent cybersecurity system, the relevant cybersecurity data can be used [105]; to build personalized context-aware smart mobile applications, the relevant mobile data can be used [103], and so on. Thus, data management tools and techniques that can extract insights or useful knowledge from data in a timely and intelligent way, and on which real-world applications are based, are urgently needed.

Fig. 1. The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding score.

Artificial intelligence (AI), particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [95]. ML usually provides systems with the ability to learn and improve from experience automatically without being specifically programmed and is generally regarded as one of the most popular technologies of the fourth industrial revolution (4IR or Industry 4.0) [103, 105]. “Industry 4.0” [114] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are the key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [75], discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”. The popularity of these learning approaches is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [4] over the last five years. The x-axis of the figure indicates the specific dates, and the y-axis shows the corresponding popularity score in the range of 0 (minimum) to 100 (maximum). According to Fig. 1, the popularity indication values for these learning types were low in 2015 and have been increasing since. These statistics motivate us to study machine learning in this paper, which can play an important role in the real world through Industry 4.0 automation.

In general, the effectiveness and the efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [41, 125]. Besides, deep learning, which originated from the artificial neural network and is part of a wider family of machine learning approaches, can be used to intelligently analyze data [96]. Thus, selecting a learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of different learning algorithms in a similar category may vary depending on the data characteristics [106]. Thus, it is important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, as explained briefly in Sect. “Applications of Machine Learning”.

Based on the importance and potential of “machine learning” to analyze the data mentioned above, in this paper we provide a comprehensive view of various types of machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques, and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for academia and industry professionals who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.

To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

To discuss the applicability of machine learning-based solutions in various real-world application domains.

To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section, after which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [103, 105]. Data can be of various forms, such as structured, semi-structured, or unstructured [41, 72]. Besides, “metadata” is another type that typically represents data about the data. In the following, we briefly discuss these types of data.

Structured: Structured data has a well-defined structure and conforms to a data model following a standard order; it is highly organized, easily accessed, and readily used by an entity or a computer program. Structured data are typically stored in well-defined schemas such as relational databases, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, geolocation, etc. are examples of structured data.

Unstructured: On the other hand, unstructured data has no pre-defined format or organization, making it much more difficult to capture, process, and analyze; it mostly contains text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.

Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc., are some examples of semi-structured data.

Metadata: It is not a normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply the material that can classify, measure, or even document something relative to an organization’s data properties, whereas metadata describes the relevant data information, giving it more significance for data users. A basic example of a document’s metadata might be the author, file size, date generated, keywords defining the document, etc.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [119], UNSW-NB15 [76], ISCX’12 [1], CIC-DDoS2019 [2], and Bot-IoT [59]; smartphone datasets such as phone call logs [84, 101], SMS logs [29], mobile application usage logs [137, 117], and mobile phone notification logs [73]; IoT data [16, 57, 62]; agriculture and e-commerce data [120, 138]; health data such as heart disease [92], diabetes mellitus [83, 134], and COVID-19 [43, 74]; and many more in various application domains. The data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract the insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, which are discussed in the following.

Types of Machine Learning Techniques

Machine learning algorithms are mainly divided into four categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [75], as shown in Fig. 2. In the following, we briefly discuss each type of learning technique and the scope of its applicability to solving real-world problems.

Fig. 2. Various types of machine learning techniques.

Supervised: Supervised learning is typically the machine learning task of learning a function that maps an input to an output based on sample input-output pairs [41]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [105], i.e., a task-driven approach. The most common supervised tasks are “classification”, which separates the data, and “regression”, which fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.

Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [41]. This is widely used for extracting generative features, identifying meaningful trends and structures, grouping results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.

Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [41, 105]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data can be rare in several contexts while unlabeled data are numerous, and this is where semi-supervised learning is useful [75]. The ultimate goal of a semi-supervised learning model is to provide a better prediction outcome than would be produced using the labeled data alone. Some application areas where semi-supervised learning is used include machine translation, fraud detection, data labeling, and text classification.

Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [52], i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use insights obtained from interaction with the environment to take actions that increase the reward or minimize the risk [75]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.
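
To make the task-driven versus data-driven distinction concrete, here is a minimal sketch (assuming Python with scikit-learn installed; the iris dataset and all parameter choices are illustrative, not taken from this paper) that trains a supervised classifier on labeled data and, separately, clusters the same feature matrix without using the labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # illustrative labeled dataset

# Supervised (task-driven): learn a mapping from inputs X to labels y.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised (data-driven): group the same inputs without seeing y.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("unsupervised cluster assignments:", labels[:10])
```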

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms, including classification analysis, regression analysis, data clustering, association rule learning, and feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model is shown in Fig. 3, where the model is trained from historical data in phase 1 and the outcome is generated for the new test data in phase 2.

Fig. 3. A general structure of a machine learning-based predictive model considering both the training and testing phases.

Classification Analysis

Classification is regarded as a supervised learning method in machine learning, referring to a problem of predictive modeling where a class label is predicted for a given example [41]. Mathematically, it maps a function (f) from input variables (X) to output variables (Y) as targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, i.e., “spam” and “not spam” in email service providers, can be a classification problem. In the following, we summarize the common classification problems.

Binary classification: This refers to classification tasks having two class labels, such as “true and false” or “yes and no” [41]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be the other class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered binary classification.

Multiclass classification: Traditionally, this refers to classification tasks having more than two class labels [41]. Unlike binary classification tasks, multiclass classification does not have the notion of normal and abnormal outcomes; instead, each example is classified as belonging to one of a range of specified classes. For example, classifying the various types of network attacks in the NSL-KDD [119] dataset can be a multiclass classification task, where the attack categories are classified into four class labels: DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.

Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification in which the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class at each hierarchical level, e.g., multi-level text classification. For instance, a Google News item can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification includes advanced machine learning algorithms that support predicting various mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [82].

Many classification algorithms have been proposed in the machine learning and data science literature [41, 125]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [51]. It works well and can be used for both binary and multi-class categories in many real-world situations, such as document or text classification, spam filtering, etc. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [94]. The key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [82]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [82].
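
As a brief illustration of the Gaussian variant mentioned above, the following sketch (assuming scikit-learn; the synthetic data are invented for this example) fits the classifier and inspects its probabilistic output:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative synthetic data, not from the paper.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# GaussianNB fits one Gaussian per feature per class, reflecting the
# feature-independence assumption discussed above.
nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
print("class probabilities for 3 samples:\n", nb.predict_proba(X_test[:3]))
```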

Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a linear decision boundary classifier created by fitting class-conditional densities to data and applying Bayes’ rule [51, 82]. This method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a reduction of dimensionality that minimizes the complexity of the model or reduces the resulting model’s computational costs. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [82]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.

Logistic regression (LR): Another common probabilistic, statistical model used to solve classification problems in machine learning is logistic regression (LR) [64]. Logistic regression typically uses a logistic function to estimate the probabilities, also referred to as the sigmoid function, which can be written as \(g(z) = \frac{1}{1 + e^{-z}}\) (Eq. 1). It can overfit high-dimensional datasets and works well when the dataset can be separated linearly. The regularization (L1 and L2) techniques [82] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.
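
To make the logistic (sigmoid) function of Eq. 1 concrete, the sketch below (assuming NumPy and scikit-learn; the data are synthetic and the parameters illustrative) computes the sigmoid directly and fits an L2-regularized logistic regression:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z)) from Eq. 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # -> approx [0.119, 0.5, 0.881]

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
# penalty="l2" applies the L2 regularization mentioned above as a way
# to avoid over-fitting; C controls its strength.
lr = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print("predicted probabilities:\n", lr.predict_proba(X[:2]))
```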

K-nearest neighbors (KNN): K-nearest neighbors (KNN) [9] is an “instance-based learning” or non-generalizing learning algorithm, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in an n-dimensional space. KNN classifies new data points based on similarity measures (e.g., the Euclidean distance function) [82]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is choosing the optimal number of neighbors to consider. KNN can be used for both classification and regression.

Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is the support vector machine (SVM) [56]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane that has the greatest distance from the nearest training data points in any class achieves a strong separation since, in general, the greater the margin, the lower the classifier’s generalization error. SVMs are effective in high-dimensional spaces and can behave differently based on different mathematical functions known as kernels. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in SVM classifiers [82]. However, when the dataset contains more noise, such as overlapping target classes, SVM does not perform well.
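
The following sketch (scikit-learn assumed; the dataset and the C/gamma values are illustrative) shows how the choice among the kernel functions named above is expressed in code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Compare the popular kernels discussed above on the same data.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    svm = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", round(svm.score(X_test, y_test), 3))
```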

Decision tree (DT): The decision tree (DT) [88] is a well-known non-parametric supervised learning method. DT learning methods are used for both classification and regression tasks [82]. ID3 [87], C4.5 [88], and CART [20] are well-known DT algorithms. Moreover, the recently proposed BehavDT [100] and IntrudTree [97] by Sarker et al. are effective in the relevant application domains, such as user behavior analytics and cybersecurity analytics, respectively. DT classifies instances by sorting them down the tree from the root to some leaf node, as shown in Fig. 4. Instances are classified by checking the attribute defined by each node, starting at the root node of the tree, and then moving down the tree branch corresponding to the attribute value. For splitting, the most popular criteria are “gini” for the Gini impurity, \(\mathrm{Gini}(E) = 1 - \sum_{i=1}^{c} p_i^{2}\), and “entropy” for the information gain, \(H(E) = -\sum_{i=1}^{c} p_i \log_2 p_i\), where \(p_i\) is the proportion of examples in class \(i\) [82].

Fig. 4. An example of a decision tree structure.

Fig. 5. An example of a random forest structure considering multiple decision trees.
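
As a sketch of the decision-tree splitting criteria just described (scikit-learn assumed, whose CART-style implementation stands in for the ID3/C4.5 variants cited above; the depth limit is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# "gini" (Gini impurity) and "entropy" (information gain) are the two
# splitting criteria given above.
for criterion in ("gini", "entropy"):
    dt = DecisionTreeClassifier(criterion=criterion, max_depth=2,
                                random_state=0).fit(X, y)
    print(f"--- criterion={criterion} ---")
    print(export_text(dt))  # textual view of the tree, root to leaves
```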

Random forest (RF): A random forest classifier [19] is well known as an ensemble classification technique used in the field of machine learning and data science in various application areas. This method uses “parallel ensembling”, which fits several decision tree classifiers in parallel on different dataset sub-samples, as shown in Fig. 5, and uses majority voting or averaging for the outcome or final result. It thus minimizes the over-fitting problem and increases prediction accuracy and control [82]. Therefore, an RF learning model with multiple decision trees is typically more accurate than a model based on a single decision tree [106]. To build a series of decision trees with controlled variation, it combines bootstrap aggregation (bagging) [18] and random feature selection [11]. It is adaptable to both classification and regression problems and fits well for both categorical and continuous values.
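
A minimal sketch of the parallel-ensembling idea (scikit-learn assumed; sample counts and the number of trees are illustrative) that compares a single tree against a bagged forest with random feature selection:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=3)

# 100 trees fit on bootstrap sub-samples with random feature selection
# at each split (max_features="sqrt"), combined by majority vote.
tree = DecisionTreeClassifier(random_state=3)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=3)
print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```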

Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Yoav Freund et al. [35] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful classifier by combining many poorly performing classifiers to obtain a classifier of high accuracy. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, but in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees, its base estimator [82], on binary classification problems; however, it is sensitive to noisy data and outliers.

Extreme gradient boosting (XGBoost): Gradient boosting, like random forests [19] above, is an ensemble learning approach that generates a final model from a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [41] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [82]. It computes second-order gradients of the loss function to minimize loss and applies advanced regularization (L1 and L2) [82], which reduces over-fitting and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.

Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [41] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word “stochastic” refers to random probability. This reduces the computational burden, particularly in high-dimensional optimization problems, allowing for faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function that calculates a variable’s degree of change in response to another variable’s changes. Mathematically, gradient descent is a convex function whose output is a partial derivative of a set of its input parameters. Let \(\alpha\) be the learning rate and \(J_i\) be the cost of the \(i\)th training example; then the stochastic gradient descent weight update at the \(j\)th iteration (Eq. 4) can be written as \(w_{j+1} := w_j - \alpha \frac{\partial J_i}{\partial w_j}\). In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [82]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations.
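
To make the Eq. (4) update rule concrete, the sketch below (NumPy assumed; the squared-error objective, learning rate, and data are all illustrative choices, not from the paper) performs the per-example weight update for a linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)   # weights to learn
alpha = 0.01      # learning rate (alpha in Eq. 4)

for epoch in range(20):
    for i in rng.permutation(len(X)):      # 'stochastic': one random example
        grad = (X[i] @ w - y[i]) * X[i]    # dJ_i/dw for squared error J_i
        w = w - alpha * grad               # the Eq. (4) weight update
print("learned weights:", w)               # should approach true_w
```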

Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with the ability to generate rules exist, such as Zero-R [125], One-R [47], decision trees [87, 88], DTNB [110], Ripple Down Rule learner (RIDOR) [125], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [126]. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret, the ability to handle high-dimensional data, simplicity and speed, good accuracy, and the capability to produce rules that are clear and understandable to humans [127, 128]. Decision-tree-based rules also provide significant accuracy in a prediction model for unseen test cases [106]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including its entities and their relationships.

Fig. 6. Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables.

Regression Analysis

Regression analysis includes several methods of machine learning that allow one to predict a continuous outcome variable (y) based on the value of one or more predictor variables (x) [41]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows an example of how classification differs from regression. Some overlaps are often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, etc., which are explained briefly in the following.

Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable (Y) and one or more independent variables (X), known as the regression line, using the best-fit straight line [41]. It is defined by

\(y = a + bx + e\) (Eq. 5)

where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable, y, as a linear function [41], defined by

\(y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + e\) (Eq. 6)

whereas simple linear regression (Eq. 5) has only one independent variable.
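
A small numeric sketch of Eqs. (5) and (6) (scikit-learn assumed; the data points are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression (Eq. 5): one predictor.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])  # roughly y = 2x
simple = LinearRegression().fit(x, y)
print("intercept a:", simple.intercept_, "slope b:", simple.coef_)

# Multiple linear regression (Eq. 6): two or more predictors.
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.0], [4.0, 2.0]])
y2 = np.array([3.0, 5.5, 6.1, 10.2])
multiple = LinearRegression().fit(X, y2)
print("coefficients b1, b2:", multiple.coef_)
```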

Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear, but is a polynomial of degree \(n\) in x [82]. The equation for polynomial regression is derived from the linear regression equation (polynomial regression of degree 1) and is defined as

\(y = b_0 + b_1 x + b_2 x^{2} + \cdots + b_n x^{n} + e\)

Here, y is the predicted/target output, \(b_0, b_1, \ldots, b_n\) are the regression coefficients, and x is the independent/input variable. In simple words, if the data are not distributed linearly but instead follow an \(n\)th-degree polynomial, then we use polynomial regression to get the desired output.

LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, due to their capability to prevent over-fitting and reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the L1 regularization technique [82], which uses shrinkage penalizing the “absolute value of magnitude of coefficients” (L1 penalty). As a result, LASSO tends to shrink coefficients to exactly zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L2 regularization [82], which penalizes the “squared magnitude of coefficients” (L2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient value to zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, while ridge regression is useful when a dataset has “multicollinearity”, i.e., predictors that are correlated with other predictors.
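
The contrasting shrinkage behavior described above can be seen directly in the fitted coefficients; this sketch (scikit-learn assumed; the alpha values and synthetic data are illustrative) counts how many coefficients each penalty drives to exactly zero:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 5 of 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=4)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1: can set coefficients to exact zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks but keeps them nonzero

print("lasso zero coefficients:", (lasso.coef_ == 0).sum(), "of 20")
print("ridge zero coefficients:", (ridge.coef_ == 0).sum(), "of 20")
```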

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [41]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [69], K-Medoids [80], CLARA [55], etc.

Density-based methods: To identify distinct groups or clusters, these methods use the concept that a cluster in the data space is a contiguous region of high point density isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered noise. Typical density-based clustering algorithms are DBSCAN [32], OPTICS [12], etc. Density-based methods typically struggle with clusters of similar density and with high-dimensionality data.

Hierarchical-based methods: Hierarchical clustering typically seeks to construct a hierarchy of clusters, i.e., a tree structure. Strategies for hierarchical clustering generally fall into two types: (i) agglomerative, a “bottom-up” approach in which each observation begins in its own cluster and pairs of clusters are merged while moving up the hierarchy, and (ii) divisive, a “top-down” approach in which all observations begin in one cluster and splits are performed recursively while moving down the hierarchy, as shown in Fig. 7. The BOTS technique proposed in our earlier work, Sarker et al. [102], is an example of a hierarchical, particularly bottom-up, clustering algorithm.

Grid-based methods: Grid-based clustering is especially suitable for dealing with massive datasets. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [122], CLIQUE [6], etc. are the standard algorithms of grid-based clustering.

Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on neural network learning [130]. For instance, GMM [89] is an example of a statistical learning method, and SOM [22, 96] is an example of a neural network learning method.

Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application- or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [121], CMWK-Means [27], etc.

Fig. 7. A graphical interpretation of the widely used hierarchical clustering (bottom-up and top-down) technique.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [41, 125]. In the following, we summarize the popular methods that are used widely in various application areas.

K-means clustering: K-means clustering [69] is a fast, robust, and simple algorithm that provides reliable results when datasets are well separated from each other. The data points are allocated to a cluster in such a way that the sum of the squared distances between the data points and the centroid is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest cluster while keeping the centroids as small as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [91] is a variant of K-means that is more robust to noise and outliers.
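
A minimal sketch of the library form of the centroid-assignment procedure (scikit-learn assumed; the blob data and k=3 are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs, the setting where k-means is most reliable.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=5)

# Multiple restarts (n_init) mitigate the sensitivity to the random
# initial cluster centers noted above.
km = KMeans(n_clusters=3, n_init=10, random_state=5).fit(X)
print("centroids:\n", km.cluster_centers_)
print("first ten assignments:", km.labels_[:10])
```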

Mean-shift clustering: Mean-shift clustering [37] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. It aims to discover “blobs” in a smooth distribution or density of samples [82]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points within a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing is an example application domain. Mean shift has the disadvantage of being computationally expensive. Moreover, in high-dimensional cases, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.

DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [32] is a base algorithm for density-based clustering that is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters for use in model building. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in vast volumes of data that are noisy and contain outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster, DBSCAN is efficient at finding high-density regions and is robust to outliers.

GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering approach. A Gaussian mixture model is a probabilistic model in which all the data points are assumed to be produced by a mixture of a finite number of Gaussian distributions with unknown parameters [82]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [82] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.

Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [115], complete linkage [116], BOTS [102], etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structured hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help in making better decisions in the relevant application areas.
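
A sketch of bottom-up merging and the dendrogram it produces (scikit-learn and SciPy assumed; the data and linkage choice are illustrative):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=6)

# Flat clusters obtained by cutting the bottom-up merge hierarchy.
agg = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
print("cluster labels:", agg.labels_)

# The full merge tree (dendrogram) that makes the hierarchy informative.
Z = linkage(X, method="complete")  # complete linkage, as cited above
dendrogram(Z, no_plot=True)        # set no_plot=False with matplotlib to draw
```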

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lower computational costs, and the avoidance of overfitting and redundancy by simplifying models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between them is that “feature selection” keeps a subset of the original features [97], while “feature extraction” creates brand-new ones [98]. In the following, we briefly discuss these techniques.

Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building machine learning and data science models. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of selected features in a problem domain is capable of minimizing the overfitting problem by simplifying and generalizing the model, as well as increasing the model’s accuracy [97]. Thus, “feature selection” [66, 99] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.

Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a means to reduce computational cost or training time. The aim of “feature extraction” [66, 99] is to reduce the number of features in a dataset by generating new ones from the existing features and then discarding the originals. The majority of the information found in the original set of features can then be summarized using this new, reduced set. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique that extracts a lower-dimensional space by creating brand-new components from the existing features in a dataset [98].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [41, 125]. In the following, we summarize the popular methods that are used widely in various application areas.

Variance threshold: A simple, basic approach to feature selection is the variance threshold [82]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. By default, it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the (X) features, not the (y) outputs, and can therefore be used for unsupervised learning.

Pearson correlation: Pearson’s correlation is another method for understanding a feature’s relation to the response variable and can be used for feature selection [99]. This method is also used for finding the association between features in a dataset. The resulting value lies in \([-1, 1]\), where \(-1\) means perfect negative correlation, \(+1\) means perfect positive correlation, and 0 means that the two variables have no linear correlation. If two random variables X and Y are given, then the correlation coefficient between X and Y is defined as [41]

\(r(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}\)

where \(\mathrm{cov}(X, Y)\) is the covariance and \(\sigma_X\), \(\sigma_Y\) are the standard deviations.

ANOVA: Analysis of variance (ANOVA) is a statistical tool used to test whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, and normally distributed variables. To statistically test the equality of means, the ANOVA method utilizes F-tests. For feature selection, the resulting ‘ANOVA F-value’ [82] of this test can be used, whereby certain features that are independent of the target variable can be omitted.

Chi square: The chi-square \({\chi}^2\) [82] statistic is an estimate of the difference between the observed and expected frequencies of a series of events or variables. The value of \({\chi}^2\) depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square \({\chi}^2\) test is commonly used for testing relationships between categorical variables. If \(O_i\) represents an observed value and \(E_i\) represents an expected value, then

\({\chi}^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}\)

Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [82] fits the model and removes the weakest feature repeatedly until the specified number of features remains. Features are ranked by the model’s coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to remove dependencies and collinearity in the model.
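
A sketch of recursive feature elimination (scikit-learn assumed; the logistic-regression estimator and the feature counts are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 15 features, of which only 4 are informative.
X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=7)

# Repeatedly fit the model and drop the weakest-ranked feature until
# n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4, step=1).fit(X, y)
print("selected feature mask:", rfe.support_)
print("feature ranking (1 = kept):", rfe.ranking_)
```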

Model-based selection: To reduce the dimensionality of the data, linear models penalized with L1 regularization can be used. Least absolute shrinkage and selection operator (LASSO) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [82]; such features can then be removed from the model. Thus, the penalized lasso regression method is often used in machine learning to select a subset of variables. The Extra Trees classifier [82] is an example of a tree-based estimator that can be used to compute impurity-based feature importances, which can then be used to discard irrelevant features.

Principal component analysis (PCA): Principal component analysis (PCA) is a well-known unsupervised learning approach in the field of machine learning and data science. PCA is a mathematical technique that transforms a set of correlated variables into a set of uncorrelated variables known as principal components [48, 81]. Figure 8 shows an example of the effect of PCA on various dimension spaces, where Fig. 8a shows the original features in 3D space, and Fig. 8b shows the created principal components PC1 and PC2 projected onto a 2D plane and a 1D line with the principal component PC1, respectively. Thus, PCA can be used as a feature extraction technique that reduces the dimensionality of datasets and helps to build an effective machine learning model [98]. Technically, PCA identifies the components with the highest eigenvalues of a covariance matrix and then uses those to project the data into a new subspace of equal or fewer dimensions [82].

Fig. 8. An example of principal component analysis (PCA) and the created principal components PC1 and PC2 in different dimension spaces.
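
A minimal sketch of PCA as a feature-extraction step (scikit-learn assumed; the 2-component choice mirrors the Fig. 8 illustration, and the standardization step is a common practical assumption):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # 4 original features
X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                  # project onto PC1 and PC2
X_2d = pca.fit_transform(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("first projected point:", X_2d[0])
```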

Association Rule Learning

Association rule learning is a rule-based machine learning approach for discovering interesting relationships between variables in large datasets, expressed as “IF-THEN” statements [7]. One example is that “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use the ‘support’ and ‘confidence’ parameters introduced in [7].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm's main downside is that it generates too many candidate itemsets, requiring more space and wasting a lot of effort, and it requires too many passes over the entire dataset to produce the rules. Another approach, SETM [ 49 ], exhibits good performance and stable behavior with respect to execution time; however, it suffers from the same flaw as the AIS algorithm.

Apriori: For generating association rules from a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These algorithms outperform AIS and SETM, mentioned above, due to the Apriori property of frequent itemsets [ 8 ]. The term 'Apriori' refers to having prior knowledge of frequent itemset properties. Apriori uses a "bottom-up" approach to generate the candidate itemsets. To reduce the search space, it uses the property that "all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent". Another approach, Predictive Apriori [ 108 ], can also generate rules; however, it can produce unexpected results because it combines support and confidence into a single measure. Apriori [ 8 ] is the most widely applied technique in mining association rules.
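A minimal Apriori sketch follows, using the mlxtend library (an assumption; this paper does not prescribe a library). The tiny transaction set and the thresholds are illustrative.

# A minimal Apriori rule-mining sketch (assumed: pandas + mlxtend installed).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

transactions = [["laptop", "antivirus"],
                ["laptop", "antivirus", "mouse"],
                ["laptop", "mouse"],
                ["antivirus"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets with support >= 0.5, then rules with confidence >= 0.6.
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])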

ECLAT: This technique, proposed by Zaki et al. [ 131 ], stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori algorithm [ 8 ], which represents data in a horizontal layout, ECLAT represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. It is better suited for small and medium datasets, whereas the Apriori algorithm is used for large datasets.

FP-Growth: Another common association rule learning technique, based on the frequent-pattern tree (FP-tree) proposed by Han et al. [ 42 ], is Frequent Pattern Growth, known as FP-Growth. The key difference from Apriori is that while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets, whereas the FP-Growth algorithm [ 42 ] avoids candidate generation and instead builds a tree using a 'divide and conquer' strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [ 133 ], and it may not fit into memory for massive datasets, making big data challenging to process as well. Another solution, RARM (Rapid Association Rule Mining), proposed by Das et al. [ 26 ], faces a related FP-tree issue [ 133 ].

ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, proposed in our earlier paper by Sarker et al. [ 104 ], that discovers interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features, and discovers a set of non-redundant association rules. It first constructs an association generation tree (AGT) in a top-down manner and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment, using feedback from its own actions and experiences. Unlike supervised learning, which learns from given sample data or examples, the RL method learns by interacting with the environment. The problem to be solved in reinforcement learning is defined as a Markov decision process (MDP) [ 86 ], i.e., it is all about making decisions sequentially. An RL problem typically includes four elements: agent, environment, rewards, and policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL infers optimal behavior from a model of the environment, built by performing actions and observing the results, including the next state and the immediate reward [ 85 ]. AlphaZero and AlphaGo [ 113 ] are examples of model-based approaches. A model-free approach, on the other hand, does not use the transition probability distribution or the reward function associated with the MDP. Q-learning, Deep Q Network, Monte Carlo control, and SARSA (State-Action-Reward-State-Action) are examples of model-free algorithms [ 52 ]. The explicit model of the environment, which is required for model-based RL but not for model-free methods, is the key difference between the two families. In the following, we discuss the popular RL algorithms.

Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a broad category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and generating draws from a probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.

Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent what action to take under what circumstances [ 52 ]. It does not need a model of the environment (hence the term "model-free"), and it can handle stochastic transitions and rewards without requiring adaptations. The 'Q' in Q-learning stands for quality, as the algorithm calculates the maximum expected reward for a given action in a given state.
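A minimal sketch of the tabular Q-learning update rule, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)), is given below; the tiny chain environment and the hyperparameters are made-up illustrations.

# A minimal tabular Q-learning sketch on a toy chain environment.
import random

n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Move along a chain; reaching the last state yields reward 1."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy action selection.
        a = random.randrange(n_actions) if random.random() < epsilon \
            else max(range(n_actions), key=lambda act: Q[s][act])
        s_next, r = step(s, a)
        # Model-free Q-learning update: no transition model is needed.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("learned Q-table:", Q)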

Deep Q-learning: Q-learning works well when the setting is reasonably simple. However, when the number of states and actions becomes large, deep learning can be used as a function approximator. The basic working step in deep Q-learning [ 52 ] is that the current state is fed into a neural network, which returns the Q-values of all possible actions as output.

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations research, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a broader family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture that combines several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning relative to traditional machine learning as the amount of data increases; however, performance may vary depending on the data characteristics and experimental setup.

Figure 9: Machine learning and deep learning performance in general with increasing amounts of data

The most common deep learning algorithms are the multilayer perceptron (MLP), the convolutional neural network (CNN, or ConvNet), and the long short-term memory recurrent neural network (LSTM-RNN) [ 96 ]. In the following, we discuss these types of deep learning methods, which can be used to build effective data-driven models for various purposes.

Figure 10: The structure of an artificial neural network model with multiple processing layers

MLP: The base architecture of deep learning, also known as the feed-forward artificial neural network, is the multilayer perceptron (MLP) [ 82 ]. A typical MLP is a fully connected network consisting of an input layer, one or more hidden layers, and an output layer, as shown in Fig. 10. Each node in one layer connects, with a certain weight, to every node in the following layer. MLP utilizes the "backpropagation" technique [ 41 ], the most fundamental building block of neural network training, to adjust the weight values internally while building the model. MLP is sensitive to feature scaling and allows a variety of hyperparameters to be tuned, such as the number of hidden layers, neurons, and iterations, which can result in a computationally costly model.
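A minimal MLP sketch follows; the use of scikit-learn's MLPClassifier, the synthetic dataset, and the two hidden layers of 64 and 32 neurons are illustrative assumptions.

# A minimal MLP sketch (assumed: scikit-learn, synthetic data, 64/32 hidden units).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# MLP is sensitive to feature scaling, so standardize first.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two hidden layers, trained internally with backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))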

CNN or ConvNet: The convolutional neural network (CNN) [ 65 ] enhances the design of the standard ANN with convolutional layers, pooling layers, and fully connected layers, as shown in Fig. 11. Because it takes advantage of the two-dimensional (2D) structure of the input data, it is broadly used in areas such as image and video recognition, image processing and classification, medical image analysis, and natural language processing. While CNN has a greater computational burden, it has the advantage of automatically detecting the important features without any manual intervention, and hence it is considered more powerful than a conventional ANN. A number of advanced CNN-based deep learning models can be used in the field, such as AlexNet [ 60 ], Xception [ 24 ], Inception [ 118 ], Visual Geometry Group (VGG) [ 44 ], and ResNet [ 45 ].

LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. Unlike normal feed-forward neural networks, LSTM has feedback connections. LSTM networks are well suited for analyzing and learning from sequential data, such as classifying, processing, and predicting time series, which differentiates them from other conventional networks. Thus, LSTM can be used whenever the data are in a sequential format, such as time series or sentences, and is commonly applied in areas such as time-series analysis, natural language processing, and speech recognition.
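A minimal Keras sketch of an LSTM for sequence classification is shown below (Keras is an assumption; the input shape of 100 timesteps by 8 features and the layer sizes are illustrative).

# A minimal LSTM sketch for binary sequence classification (assumed: TensorFlow/Keras).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 8)),   # (timesteps, features)
    tf.keras.layers.LSTM(32),                # recurrent layer with feedback connections
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()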

Figure 11: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers
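A minimal Keras sketch of the CNN structure in Fig. 11, stacked convolution and pooling layers followed by fully connected layers, is given below; the 28x28 grayscale input and layer sizes are assumptions.

# A minimal CNN sketch (assumed: TensorFlow/Keras, 28x28x1 input, 10 classes).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),                         # to fully connected layers
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # e.g., 10 output classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()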

In addition to these most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist in the area for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique widely used for dimensionality reduction and feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, typically the reuse of a pre-trained model on a new problem, is currently very popular because it allows deep neural networks to be trained with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is given in our earlier paper, Sarker et al. [ 96 ].
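As one concrete illustration of these models, a minimal Keras autoencoder sketch for dimensionality reduction follows; the 784-dimensional input and 32-dimensional bottleneck are assumptions.

# A minimal autoencoder (AE) sketch (assumed: TensorFlow/Keras, 784-dim input).
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
encoded = tf.keras.layers.Dense(32, activation="relu")(inputs)     # bottleneck code
decoded = tf.keras.layers.Dense(784, activation="sigmoid")(encoded)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# Trained to reconstruct its own input: autoencoder.fit(X, X, epochs=...)
autoencoder.summary()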

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its capability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict an unknown outcome [ 41 ]. Examples include identifying suspects or criminals after a crime has been committed, and detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms such as decision trees, support vector machines, and artificial neural networks [ 106 , 125 ] are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.

Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [ 114 ] and is typically the practice of protecting networks, systems, hardware, and data from digital attacks [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where "bad neighborhoods" are online, keep people safe while browsing, and secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be applied to large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building rule-based security systems [ 105 ]. Thus, we can say that the various learning techniques discussed in Sect. "Machine Learning Tasks and Algorithms" can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.

Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ]; it turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart homes, education, communication, transportation, retail, agriculture, healthcare, business, and many more [ 70 ]. The smart city is one of IoT's core fields of application, using technologies to enhance city services and residents' living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens' total energy usage over a particular period, and making context-aware and timely decisions for people are some of the tasks that can be solved using machine learning techniques according to current needs.

Traffic prediction and transportation: Transportation systems have become a crucial component of every country's economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO \(_2\) pollution, accidents, emergencies, and a decline in modern society's quality of life [ 40 ]. Thus, an intelligent transportation system that predicts future traffic is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [ 17 , 30 , 31 ]. For example, based on travel histories and the trends of travel along various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and in recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.

Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. In the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate mortality rates, and detect other anomalies [ 61 ]. They can also be used to better understand the virus's origin, to predict COVID-19 outbreaks, and for disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions so that the required arrangements can be made. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help in the fight against the COVID-19 virus and the pandemic, as well as in intelligent clinical decision-making in the healthcare domain.

E-commerce and product recommendations: Product recommendation is one of the best-known and most widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers' purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and the click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to retain existing customers while attracting new ones.

NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. NLP thus helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, and machine learning techniques can be used throughout. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text, drawn from blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment toward their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as "positive", "negative", or "neutral", along with more intense emotions such as very happy, happy, sad, very sad, angry, interested, or not interested.

Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, which can identify an object in a digital image. Common examples include labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media such as Facebook. Speech recognition [ 23 ], which typically uses sound and linguistic models, is also very popular, e.g., in Google Assistant, Cortana, Siri, and Alexa [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques, such as classification, feature selection, clustering, and sequence labeling methods, are used in the area.

Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to improve their decisions to adopt sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies such as the Internet of Things (IoT) and mobile technologies and devices [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.

User behavior analytics and context-aware smartphone applications: Context-awareness is a system's ability to capture knowledge about its surroundings at any moment and modify its behavior accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their ability to learn from contextual data [ 103 , 136 ]. Thus, developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior and support and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable for building various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making tools that intelligently assist mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful for capturing users' diverse behavioral activities by taking time-series data into account [ 102 ]. Classification methods can be used to predict future events in various contexts [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. "Machine Learning Tasks and Algorithms" can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in relevant domains, such as the cybersecurity, IoT, healthcare, and agriculture applications discussed in Sect. "Applications of Machine Learning", is not straightforward, although the current cyberspace enables the production of a huge amount of data at very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing them properly is important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working with real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless entries. The machine learning algorithms discussed in Sect. "Machine Learning Tasks and Algorithms" depend heavily on the quality and availability of the data used for training, and consequently so does the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to use the learning algorithms effectively in the associated application domain.

To analyze the data and extract insights, many machine learning algorithms exist, as summarized in Sect. "Machine Learning Tasks and Algorithms". Thus, selecting a learning algorithm that is suitable for the target application is challenging, because the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm would produce unexpected outcomes, leading to wasted effort as well as reduced model effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. "Machine Learning Tasks and Algorithms" can be used directly to solve many real-world issues in diverse domains, such as the cybersecurity, smart city, and healthcare applications summarized in Sect. "Applications of Machine Learning". However, hybrid learning models, e.g., ensembles of methods, modification or enhancement of existing learning techniques, or the design of new learning methods, could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and the corresponding applications depends mainly on both the data and the learning algorithms. If the data are unsuitable for learning, e.g., non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. Sophisticated learning algorithms must be trained with real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The challenges that were identified create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can serve as a reference guide for potential research and applications for academia, industry professionals, and decision-makers, from a technical point of view.

Canadian Institute for Cybersecurity, University of New Brunswick, ISCX dataset. http://www.unb.ca/cic/datasets/index.html/ (Accessed on 20 October 2019).

CIC-DDoS2019 [online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html/ (Accessed on 28 March 2020).

World Health Organization (WHO). http://www.who.int/ .

Google Trends. https://trends.google.com/trends/ , 2019.

Adnan N, Nordin Shahrina Md, Rahman I, Noor A. The effects of knowledge transfer on farmers decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998; 94–105

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM. 1993;22: 207–216

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Fast algorithms for mining association rules. In: Proceedings of the International Joint Conference on Very Large Data Bases, Santiago Chile. 1994; 1215: 487–499.

Aha DW, Kibler D, Albert M. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict covid-19 infection. Chaos Solit Fract. 2020;140:

Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record. 1999;28(2):49–60.

Anzai Y. Pattern recognition and machine learning. Elsevier; 2012.

Ardabili SF, Mosavi A, Ghamisi P, Ferdinand F, Varkonyi-Koczy AR, Reuter U, Rabczuk T, Atkinson PM. Covid-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, 2012; 37–49 .

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Boukerche A, Wang J. Machine learning-based traffic prediction models for intelligent transportation systems. Comput Netw. 2020;181

Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):43.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Chiu C-C, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018 pages 4774–4778. IEEE .

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

Cobuloglu H, Büyüktahtakın IE. A stochastic multi-criteria decision analysis for sustainable biomass crop selection. Expert Syst Appl. 2015;42(15–16):6065–74.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management, pages 474–481. ACM, 2001.

de Amorim RC. Constrained clustering with minkowski weighted k-means. In: 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI), pages 13–17. IEEE, 2012.

Dey AK. Understanding and using context. Person Ubiquit Comput. 2001;5(1):4–7.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Person Ubiquit Comput. 2006;10(4):255–68.

Essien A, Petrounias I, Sampaio P, Sampaio S. Improving urban traffic speed prediction using data source fusion and deep learning. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE. 2019: 1–8. .

Essien A, Petrounias I, Sampaio P, Sampaio S. A deep-learning model for urban traffic flow prediction with traffic events mined from twitter. In: World Wide Web, 2020: 1–24 .

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, Citeseer. 1996; 96: 148–156

Fujiyoshi H, Hirakawa T, Yamashita T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019;43(4):244–52.

Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inform Theory. 1975;21(1):32–40.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014: 2672–2680.

Guerrero-Ibáñez J, Zeadally S, Contreras-Castillo J. Sensor technologies for intelligent transportation systems. Sensors. 2018;18(4):1212.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, ACM. 2000;29: 1–12.

Harmon SA, Sanford TH, Sheng X, Turkbey EB, Roth H, Ziyue X, Yang D, Myronenko A, Anderson V, Amalou A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016: 770–778.

Hinton GE. A practical guide to training restricted boltzmann machines. In: Neural networks: Tricks of the trade. Springer. 2012; 599-619

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Edu Psychol. 1933;24(6):417.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the Eleventh International Conference on Data Engineering, IEEE. 1995: 25–33.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and covid-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 1995; 338–345

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Protect. 2018;117:408–25.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Radha Krishna MK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE. 2018; 1–6

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Fut Gen Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012: 1097–1105

Kushwaha S, Bahl S, Bagha AK, Parmar KS, Javaid M, Haleem A, Singh RP. Significant applications of machine learning for covid-19 pandemic. J Ind Integr Manag. 2020;5(4).

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review. Chaos Sol Fract. 2020:110059 .

LeCessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Liu H, Motoda H. Feature extraction, construction and selection: A data mining perspective, vol. 453. Springer Science & Business Media; 1998.

López G, Quesada L, Guerrero LA. Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces. In: International Conference on Applied Human Factors and Ergonomics, Springer. 2017; 241–250.

Liu B, HsuW, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967;volume 1, pages 281–297. Oakland, CA, USA.

Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

McCallum A. Information extraction: distilling structured data from unstructured text. Queue. 2005;3(9):48–57.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September, 2016; pp. 1223–1234. ACM, New York, USA. .

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Mohammed M, Khan MB, Bashier Mohammed BE. Machine learning: algorithms and applications. CRC Press; 2016.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), 2015;pages 1–6. IEEE .

Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Yujin O, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Otter DW, Medina JR , Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.

Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.

Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH, Abushark YB, Alsolami F, Khan A. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Abushark YB, Khan A. Contextpca: predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Sarker IH, Alqahtani H, Alsolami F, Khan A, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Alan C, Jun H, Khan AI, Abushark YB, Khaled S. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2019; 1–11.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp): Adjunct, Germany, pages 630–634. ACM, 2016.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J Oxf Univ UK. 2018;61(3):349–68.

Sarker IH, Hoque MM, MdK Uddin, Tawfeeq A. Mobile data science and intelligent apps: concepts, ai-based modeling and research directions. Mob Netw Appl, pages 1–19, 2020.

Sarker IH, Kayes ASM. Abc-ruleminer: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020; page 102762

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Watters P, Kayes ASM. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Sarker IH, Salah K. Appspred: predicting context-aware smartphone apps using random forest learning. Internet Things. 2019;8:

Scheffer T. Finding association rules that trade support optimally against confidence. Intell Data Anal. 2005;9(4):381–95.

Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A. A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res. 2020;119:

Shengli S, Ling CX. Hybrid cost-sensitive decision tree, knowledge discovery in databases. In: PKDD 2005, Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Computer Science, volume 3721, 2005.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for covid-19. J Big Data. 2021;8(1):1–54.

Gökhan S, Nevin Y. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. Mastering the game of go with deep neural networks and tree search. Nature. 2016;529(7587):484–9.

Ślusarczyk B. Industry 4.0: Are we ready? Polish J Manag Stud. 17, 2018.

Sneath Peter HA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948; 5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 13-17 September, pp. 389–400. ACM, New York, USA. 2014.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015; pages 1–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the KDD Cup 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications. IEEE. 2009: 1–6.

Tsagkias M. Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR Forum. volume 54. NY, USA: ACM New York; 2021. p. 1–23.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. Icml. 2001;1:577–84.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big data. 2016;3(1):9.

Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Wu C-C, Yen-Liang C, Yi-Hung L, Xiang-Yu Y. Decision tree induction with a constrained number of leaf nodes. Appl Intell. 2016;45(3):673–85.

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inform Syst. 2008;14(1):1–37.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M. Internet of things for smart cities. IEEE Internet Things J. 2014;1(1):22–32.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zheng Y, Rajasegarar S, Leckie C. Parking availability prediction for sensor-enabled car parks in smart cities. In: Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on. IEEE, 2015; pages 1–6.

Zhu H, Cao H, Chen E, Xiong H, Tian J. Exploiting enriched contextual information for mobile app classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012; pages 1617–1621

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), IEEE, 2020; pages 1–7

Zulkernain S, Madiraju P, Ahamed SI. A context aware interruption management system for mobile devices. In: Mobile Wireless Middleware, Operating Systems, and Applications. Springer. 2010; pages 221–234

Zulkernain S, Madiraju P, Ahamed S, Stamm K. A mobile intelligent interruption management system. J UCS. 2010;16(15):2060–80.

Author information

Authors and Affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349, Chattogram, Bangladesh

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

About this article

Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN COMPUT. SCI. 2 , 160 (2021). https://doi.org/10.1007/s42979-021-00592-x

Received : 27 January 2021

Accepted : 12 March 2021

Published : 22 March 2021

DOI : https://doi.org/10.1007/s42979-021-00592-x


Keywords

  • Machine learning
  • Deep learning
  • Artificial intelligence
  • Data science
  • Data-driven decision-making
  • Predictive analytics
  • Intelligent applications

Design and Analysis of Algorithms

Design and Analysis of Algorithms is a fundamental area of computer science that involves creating efficient solutions to computational problems and evaluating their performance. It focuses on designing algorithms that effectively address specific challenges and on analyzing their efficiency in terms of time and space complexity .

Complete Guide On Complexity Analysis

Table of Contents

  • What is meant by Algorithm Analysis?
  • Why Analysis of Algorithms is important?
  • Types of Algorithm Analysis
  • Basics on Analysis of Algorithms
  • Asymptotic Notations
  • Some Advanced Topics
  • Complexity Proofs

Basics on Analysis of Algorithms:

  • What is algorithm and why analysis of it is important?
  • Asymptotic Notation and Analysis (Based on input size) in Complexity Analysis of Algorithms
  • Worst, Average and Best Case Analysis of Algorithms
  • Types of Asymptotic Notations in Complexity Analysis of Algorithms
  • How to Analyse Loops for Complexity Analysis of Algorithms
  • How to analyse Complexity of Recurrence Relation
  • Introduction to Amortized Analysis

Asymptotic Notations:

  • Analysis of Algorithms | Big-O Analysis
  • Difference between Big-O, Big-Omega and Big-Theta (standard definitions are sketched below)
  • Examples of Big-O Analysis
  • Difference between Big-O Notation and Tilde
  • Analysis of Algorithms | Big-Ω (Big-Omega) Notation
  • Analysis of Algorithms | Big-Θ (Big-Theta) Notation
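For quick reference, the standard definitions behind these notations (stated here in conventional form, not quoted from the linked articles) are:

\[
\begin{aligned}
f(n) = O\left( g(n) \right) &\iff \exists\, c > 0,\ n_{0} : 0 \le f(n) \le c\, g(n) \ \text{for all } n \ge n_{0} \\
f(n) = \Omega\left( g(n) \right) &\iff \exists\, c > 0,\ n_{0} : f(n) \ge c\, g(n) \ \text{for all } n \ge n_{0} \\
f(n) = \Theta\left( g(n) \right) &\iff f(n) = O\left( g(n) \right) \text{ and } f(n) = \Omega\left( g(n) \right)
\end{aligned}
\]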

Some Advanced topics:

  • Types of Complexity Classes | P, NP, Co-NP, NP-hard and NP-complete
  • Can the Run-Time Complexity of a Comparison-Based Sorting Algorithm be Less than N log N?
  • Why does Accessing an Array Element take O(1) Time?
  • What is the Time Efficiency of the push(), pop(), isEmpty() and peek() Operations of Stacks?

Complexity Proofs:

  • Proof that the Clique Decision Problem is NP-Complete
  • Proof that Independent Set in Graph Theory is NP-Complete
  • Proof that a Problem Consisting of Clique and Independent Set is NP-Complete
  • Proof that Dense Subgraph is NP-Complete by Generalisation
  • Proof that Sparse Graph is NP-Complete



Algorithms & optimization

We perform fundamental research in algorithms, markets, optimization, and graph analysis, and use it to deliver solutions to challenges across Google’s business.


About the team

Our team comprises multiple overlapping research groups working on graph mining, large-scale optimization, and market algorithms. We collaborate closely with teams across Google, benefiting Ads, Search, YouTube, Play, Infrastructure, Geo, Social, Image Search, Cloud, and more. Alongside these collaborations, we perform research on the algorithmic foundations of machine learning, distributed optimization, economics, data mining, and data-driven optimization. Our researchers are involved both in long-term research efforts and in immediate applications of our technology.

Examples of recent research interests include online ad allocation problems, distributed algorithms for large-scale graph mining, mechanism design for advertising exchanges, and robust and dynamic pricing for ad auctions.

Team focus summaries

Large-scale optimization

Our mission is to develop large-scale, distributed, and data-driven optimization techniques and use them to improve the efficiency and robustness of infrastructure and machine learning systems at Google. We achieve such goals as increasing throughput and decreasing latency in distributed systems, or improving feature selection and parameter tuning in machine learning. To do this, we apply techniques from areas such as combinatorial optimization, online algorithms, and control theory. Our research is used in critical infrastructure that supports products such as Search and Cloud.

Understanding places

Our mission is to discover all the world’s places and to understand people’s interactions with those places. We accomplish this by using ML to develop deep understanding of user trajectories and actions in the physical world, and we apply that understanding to solve the recurrent hard problems in geolocation data analysis. This research has enabled many of the novel features that appear in Google geo applications such as Maps.

Structured information extraction

Our mission is to extract salient information from templated documents and web pages and then use that information to assist users. We focus our efforts on extracting data such as flight information from email, event data from the web, and product information from the web. This enables many features in products such as Google Now, Search, and Shopping.

Search and information retrieval

Our mission is to conduct research that enables new or more effective search capabilities. This includes developing a deeper understanding of correlations between documents and queries; modeling user attention and product satisfaction; developing Q&A models, particularly for the “next billion Internet users”; and developing effective personal search models even when Google engineers cannot inspect private user input data.

Medical knowledge and learning

Our mission is to offer a premier source of high-quality medical information along the user's entire online health journey. We provide relevant, targeted medical information to users by applying advanced ML to Google Search data. Examples of technologies created by this team include Symptom Search, Allergy Prediction, and other epidemiological applications.


Some of our people

  • Gagan Aggarwal (Algorithms and Theory; Data Mining and Modeling; Economics and Electronic Commerce)
  • David Applegate
  • Aaron Archer (Distributed Systems and Parallel Computing)
  • Ashwinkumar Badanidiyuru Varadaraja (Machine Intelligence)
  • Mohammadhossein Bateni
  • Michael Bendersky (Information Retrieval and the Web)
  • Kshipra Bhawalkar
  • Edith Cohen (Machine Learning)
  • Alessandro Epasto
  • Alejandra Estanislao
  • Andrei Z. Broder
  • Jon Feldman
  • Nadav Golbandi
  • Jeongwoo Ko (Natural Language Processing)
  • Marc Najork
  • Nitish Korula
  • Kostas Kollias
  • Silvio Lattanzi
  • Cheng Li
  • Mohammad Mahdian
  • Alex Fabrikant
  • Rich Washington
  • Qi Zhao
  • Andrew Tomkins (Human-Computer Interaction and Visualization)
  • Vidhya Navalpakkam (Machine Perception)
  • Bhargav Kanagal
  • Aranyak Mehta
  • Guillaume Chatelet (Hardware and Architecture; Software Engineering; Software Systems)
  • Sandeep Tata (Data Management)
  • Balasubramanian Sivan
  • Vahab S. Mirrokni
  • Yuan Wang
  • Xuanhui Wang
  • Renato Paes Leme
  • Bryan Perozzi
  • Morteza Zadimoghaddam
  • Fabien Viger
  • Tamas Sarlos
  • James B. Wendt


Open access · Published: 22 August 2024

Distribution network line loss analysis method based on improved clustering algorithm and isolated forest algorithm

Jian Li, Shuoyu Li, Wen Zhao, Jiajie Li, Ke Zhang & Zetao Jiang

Scientific Reports, volume 14, Article number: 19554 (2024)


Subjects: Computational science · Computer science

Long-standing losses in the distribution network stem from its backward management mode, and traditional methods for analyzing and calculating distribution network loss cannot adapt to the network's current development environment. To improve the accuracy of filling missing values in power load data, a particle swarm optimization algorithm is proposed to optimize the cluster centers of the clustering algorithm. Furthermore, the original isolated forest anomaly recognition algorithm is used to detect outliers in the load data, and the coefficient of variation of the load data is used to improve its recognition accuracy. Finally, this paper introduces a breadth-first method for calculating line loss in the context of big data. An example is provided using the distribution network of Yuxi City in Yunnan Province, and a simulation experiment is carried out. The findings reveal that, with partially missing data, the enhanced fuzzy C-means clustering algorithm had an average error of −6.35 kW and a standard deviation of 4.015 kW. With fuzzy abnormal samples, the area under the receiver operating characteristic curve of the improved isolated forest algorithm based on the coefficient of variation was 0.8586, the smallest decrease among the compared methods, and the refined analysis showed a feeder line loss rate of 7.62%. It is confirmed that the suggested technique can carry out distribution network line loss analysis quickly and accurately and can serve as a guide for managing distribution network line loss.


Introduction

The line loss (LL) rate is a crucial index for measuring the operating efficiency and economy of a power system. It represents the proportion of electrical energy lost, due to components such as resistances and inductances, during power transmission. The LL rate directly affects the safe and stable operation and the economic benefit of the power grid. Line loss analysis (LLA) of current distribution networks relies primarily on the expertise and professional judgement of specialists, which has a limited impact on improving the network's LL management level 1,2. Currently, research on loss reduction in medium voltage distribution networks (MVDN), both domestically and internationally, follows two general routes. The first is the study of power equipment, which aims to lower LL by producing more energy-saving equipment. The second, research on theoretical LL, focuses on LL after new energy sources are granted access to the distribution network, benchmark values for LL, LL management, and the causes of LL 3. In response to national calls for energy efficiency and emission reduction, the use of new energy technologies has gradually increased. In the case of electric power, the presence of numerous distributed photovoltaic and distributed hydroelectric power plants has changed the direction of the original power flows, posing a new challenge for the distribution network. Although the efficiency and accuracy of LLA have improved significantly over the past few years thanks to advances in power system technology and management, distribution network LLA still faces challenges arising from the complexity of the network structure, load imbalance, and other factors 4,5. Therefore, it is crucial to develop a technique that can carry out LLA for distribution networks rapidly and accurately. This research innovatively proposes a data cleaning model based on the combination of ensemble learning and optimal clustering, and remedies their shortcomings to enhance the model's accuracy and practicability. On this basis, a multi-level distribution network LL calculation model is constructed from a large volume of existing data to obtain fine-grained LL under different data scenarios. Finally, according to the loss characteristic indexes and the loss rate, the causes of loss are identified for different types of feeders, yielding the main reasons for high loss rates.

The innovations of this research are as follows: (1) A fuzzy C-means (FCM) clustering algorithm based on the randomly occurring distributed delayed particle swarm optimization (RODDPSO) algorithm is proposed, in which the cluster centers of the FCM algorithm are optimized to improve the accuracy of the final data filling. (2) Building on the original isolated forest anomaly recognition algorithm, the coefficient of variation of the load data is used to screen the abnormal subspace and reduce dimensionality, and the randomness of the algorithm is reduced by fixing the selection of cut points, improving the recognition accuracy for abnormal data. (3) An LL calculation model based on the forward-backward sweep method is established, which reflects a data-driven approach and provides real, reliable data support for distribution network research.

The contributions of this research are as follows: (1) To address missing load data in the distribution network, the PSO algorithm is used to optimize the cluster centers of the clustering algorithm, and particle randomness and variable inertia weights are added to prevent the PSO algorithm from falling into local optima, improving the accuracy of the final data filling. (2) An improved isolated forest algorithm based on the coefficient of variation is proposed to address the low accuracy of abnormal load data identification in the distribution network caused by high-dimensional data and algorithm instability. (3) By taking full advantage of multi-source data for data filling, the LL calculation model established in this paper is more accurate and richer than traditional methods.

The article develops the study in four parts. The first section summarizes current LLA research as well as the isolated forest algorithm (IFA) and FCM clustering algorithms for distribution networks. The second part presents the LLA modeling study for MVDNs, the third part validates the performance of the designed system, and the fourth part concludes.

The abbreviations used in this research and their full names are shown in Table 1.

Related works

In an effort to reduce the LL rate, many academics have studied LL, one of the primary indicators of power supply firms. W. Hu established an LL assessment system: the collected data were first subjected to image processing, and a reasonable LL interval calculation model was then established based on a convolutional neural network, from which a loss reduction strategy was formed. After verification, the system was shown to save electricity and improve economic efficiency 6. Zhang proposed an LL prediction method based on a multidimensional information matrix and a multidimensional attention mechanism for the problem of high energy loss in low-voltage distribution networks. First, the distribution network characteristics and seasonal trend parameters are selected; then the historical LL data are decomposed by an optimized variational mode decomposition method, and the model relationship between LL index deviation and LL deviation is constructed. Finally, the resulting data are fed into an LSTNet network with a dimensional attention mechanism. The results show that this method has a weak hysteresis effect and high prediction accuracy 7. Tang proposed a short-term LL prediction algorithm based on K-means-LightGBM. A data quality evaluation system was established on the Hadoop platform, the feature dimensions with high correlation were normalized, the samples were classified by the K-means clustering algorithm, and the model relationship between LL index deviation and LL deviation was constructed. Verification showed that the algorithm has higher accuracy and outperforms traditional algorithms 8. To diagnose abnormal LL accurately, Liu proposed a scheme based on hybrid clustering and long short-term memory for abnormal LL detection in distribution networks. In this method, samples are classified by the hybrid clustering method to detect abnormal feeders quickly, and the long short-term memory method predicts abnormal feeders and inspects the substations under their jurisdiction. It was verified that the method can detect LL quickly and effectively 9. To improve LL calculation methods and management tools for distribution networks, Zhang's team established a simulation and analysis model for distribution networks based on the IEEE 34-node system, considering the impact of distributed PV access. The results indicated that when the access capacity of distributed power is too large, the LL of the system increases 10.

Interference in the data collection, transmission, and storage processes makes it quite challenging to extract accurate and truly relevant information from power data. C. C. Yi et al. addressed the issues of local optima and erroneous identification in traditional FCM clustering methods: they utilized the t-SNE method for dimensionality reduction and initial cluster center selection, resulting in an improved FCM algorithm with significantly increased clustering accuracy 11. Ke et al. proposed a high-precision intelligent prediction method based on a back-propagation neural network and the FCM clustering algorithm to predict the adsorption efficiencies of heavy metals with different biochar properties, classifying the metal adsorption data with the FCM algorithm 12. S. Surono's team combined the Minkowski and Chebyshev distances as the similarity measure in FCM's clustering process, and used principal component analysis for dimensionality reduction, to address the issue that the FCM algorithm easily falls into locally optimal solutions. The results showed that the method improved clustering accuracy and optimized the FCM objective function 13. To improve the accuracy of abnormal driving behavior monitoring, Wang et al. designed a driver abnormal behavior warning method based on the isolated forest algorithm. Through the analysis of abnormal driving behavior, the XGBoost algorithm was used to extract its characteristics, and a detection model was established by constructing an isolated forest of abnormal driving behaviors. The results show that the method detects abnormal driving behavior with 98.6% accuracy 14. N. Pan et al. developed a parameter distribution model for feature decomposition and error-compensated correction of specimen seating attitude using a multi-modal elastically driven adaptive control method. The anomaly detection signals were processed by the IFA, the trajectory curve profile was extracted using a multi-scale alignment framework, and a parameter-sharing concatenated ternary deep learning model was established for feature tracking and data enhancement 15.

In summary, existing methods basically classify the samples and then use a neural network to construct the model relationship between LL index deviation and LL deviation. In LL calculation, however, the deviation coefficient of the LL index is used to calculate the LL rate, and the obtainable data are often scarce, so power flow calculation algorithms that demand large quantities of data are unsuitable. This paper therefore proposes a solution for missing and abnormal data that mitigates, to the greatest extent, the adverse impact of data quality problems on distribution network analysis and calculation. Then, with an accurate LL analysis and calculation method based on breadth-first forward-backward sweep, and by improving the clustering algorithm and the IFA, the LL rate in the distribution network and the causes of LL can be deduced quickly and effectively.

LLA modeling study of MVDN

The study establishes missing-value filling based on an improved clustering algorithm and outlier identification based on an improved isolation forest algorithm, so that outliers are identified and missing values filled to reduce the impact of dirty data on the subsequent LLA calculation. The cleaned data are then used to calculate the LL rate as a theoretical basis for distribution network planning. Finally, a feeder LL cause identification model based on feeder classification is designed to identify the causes of whole-feeder LL and eliminate irrelevant factors.

Missing data filling based on improved FCM algorithm

The data acquisition system of the distribution network is a complex system composed of various sensors, transformers, and software. It includes the dispatching system, the production management system, the measurement automation system, the distribution geographic information system, and the marketing system. These systems process data from a large number of different sources, and a large amount of dirty data is inevitably generated as data passes from tool to tool and from system to system 16,17.

To remove impurities from the collected data, data cleansing must be performed on the raw data; its principle is shown in Fig. 1. Using statistical learning, machine learning, deep learning, and other methods, together with preset cleaning rules and strategies, massive amounts of junk data are transformed into high-quality data that meet the requirements. The degree of data cleaning depends on the adaptive ability of the cleaning methods, rules, and strategies.

Figure 1. Basic framework of data preprocessing.

In the distribution system, dirty data mainly come from the dispatching system and the metering automation system, and are mainly generated in three links: collection, transmission, and storage. During data collection, the main cause of dirty data is equipment failure. During data transmission, unreliable connections between the data acquisition and transmission devices, aging of the transmission devices, unstable transmission signals, and signal interference are the main causes. During data storage, the storage module must convert the data after receiving the sensor signal, and anomalies easily arise during this conversion. Missing data, outlying (abnormal) data, inconsistent naming, illegal data, and duplicated data are the most common types of dirty data in the distribution system. After processing the original distribution network data with the method developed in this research, complete, legal, consistently named, high-quality data are obtained.

Missing data, as a kind of junk data, have a large impact on the original data. Clustering is one of the most widely used methods for filling in missing data based on similarities between the data, but it has drawbacks: the number of clusters depends on experience, the cluster centers are prone to falling into local extrema during iteration, and clustering accuracy suffers for high-dimensional data. The RODDPSO algorithm replaces the center self-renewal process of traditional FCM clustering with particle swarm optimization of the cluster centers. This addresses the tendency of the typical FCM clustering approach to slip into local extrema during iteration and yields more precise cluster centers for the historical data.

The choice of K in the k-means algorithm is difficult to get right, and the method struggles to converge on non-convex data sets. If the data categories are unbalanced, for example when the amount of data per category is severely skewed or the category variances differ, the clustering effect is poor; moreover, as an iterative method it can only reach a locally optimal solution 18. FCM takes into account the degree of membership of data points in clusters and has better global optimization performance. Compared with clustering algorithms such as k-means, FCM is insensitive to the initial center points and converges faster, making it suitable for large-scale data sets 19. For a sample dataset of size \(a \times b\), where \(a\) is the number of samples and \(b\) is the sample dimension, the FCM algorithm minimizes its objective function by continuously updating the cluster centers and membership degrees. Equation (1) gives the FCM objective.
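The equation body did not survive extraction; the standard FCM objective, consistent with the symbols defined below, is:

\[
\min J\left( U,V \right) = \sum_{i=1}^{a} \sum_{j=1}^{c} u_{ij}^{m} \left\| x_{i} - p_{j} \right\|^{2}, \qquad \text{s.t.} \; \sum_{j=1}^{c} u_{ij} = 1, \; i = 1,\ldots,a \tag{1}
\]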

In Eq. (1), \(U\) is the membership matrix, \(V\) denotes the matrix consisting of \(c\) cluster center vectors of dimension \(b\), \(m\) denotes the fuzzy factor, which generally takes the value 2, \(u_{ij}\) denotes the element of the membership matrix indicating the degree to which the \(i\)th sample belongs to the \(j\)th subclass, \(x_{i}\) denotes the data of the \(i\)th sample, \(p_{j}\) denotes the cluster center of the \(j\)th subclass, and \(\left\| {x_{i} - p_{j} } \right\|^{2}\) denotes the squared Euclidean distance between the two vectors. The membership matrix and cluster center matrix are then updated according to Eq. (2) until the termination condition is satisfied.
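A standard reconstruction of the update rules (the usual FCM alternating update, matching the symbols above; the original equation was lost in extraction) is:

\[
u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\left\| x_{i} - p_{j} \right\|}{\left\| x_{i} - p_{k} \right\|} \right)^{\frac{2}{m-1}} \right]^{-1}, \qquad p_{j} = \frac{\sum_{i=1}^{a} u_{ij}^{m} x_{i}}{\sum_{i=1}^{a} u_{ij}^{m}} \tag{2}
\]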

Since the particle swarm optimization (PSO) algorithm is also prone to local optima, the study proposes linearly varying inertia weights, whose computational expression is shown in Eq. (3).
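The equation body was lost in extraction; the conventional linearly decreasing inertia weight, consistent with the symbols defined below, is:

\[
\omega = \omega_{\max} - \frac{\left( \omega_{\max} - \omega_{\min} \right) t}{t_{\max}} \tag{3}
\]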

In Eq. (3), \(\omega\) denotes the inertia weight of the PSO algorithm, \(\omega_{\max }\) and \(\omega_{\min }\) denote the maximum and minimum values of the inertia weight, \(t\) denotes the current iteration number, and \(t_{\max }\) denotes the maximum number of iterations. The PSO algorithm needs a larger weight at the beginning of the iterations to speed up convergence, and a smaller weight at the later stage to prevent the algorithm from skipping past the optimum.

To improve the accuracy of the LL rate calculation, grid companies generally collect load data once every 15 min, so the daily load profile consists of 96 points; in the PSO algorithm, the dimensions of the particles and the cluster centers are therefore both 96. The RODDPSO algorithm differs from the conventional PSO algorithm in that it optimizes the cluster center matrix, so a particle in the RODDPSO algorithm is a three-dimensional array of size \(c \times 1 \times d\).

One particle holds the \(c\) cluster centers, and during the RODDPSO iteration the membership degree is updated in real time according to the position of the best cluster centers. By evaluating the FCM objective function as the fitness of the particles until the RODDPSO iteration ends, a more precise cluster center position and membership degree are obtained. The update formulas for particle velocity and position follow the particle swarm search behavior; introducing a randomly occurring distributed delay term into the velocity update yields Eq. (4).
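The equation itself was lost in extraction; a reconstruction consistent with the symbol definitions below and with published RODDPSO formulations (using \(p_{i}\) and \(p_{g}\) for the personal and global best positions, our notation) is:

\[
\begin{aligned}
v_{i}\left( t+1 \right) = {} & \omega v_{i}\left( t \right) + c_{1} r_{1} \left( p_{i}\left( t \right) - x_{i}\left( t \right) \right) + c_{2} r_{2} \left( p_{g}\left( t \right) - x_{i}\left( t \right) \right) \\
& + m_{1}\left( \xi \right) c_{3} r_{3} \sum_{\tau=1}^{N} \alpha\left( \tau \right) \left( p_{i}\left( t-\tau \right) - x_{i}\left( t \right) \right) \\
& + m_{g}\left( \xi \right) c_{4} r_{4} \sum_{\tau=1}^{N} \alpha\left( \tau \right) \left( p_{g}\left( t-\tau \right) - x_{i}\left( t \right) \right)
\end{aligned} \tag{4}
\]

with the position updated as \(x_{i}\left( t+1 \right) = x_{i}\left( t \right) + v_{i}\left( t+1 \right)\).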

In Eq. (4), \(t\) represents the current iteration number; \(c_{1}\) and \(c_{2}\) are the individual and social learning factors, respectively; \(c_{3}\) and \(c_{4}\) are the learning factors of the distributed delay terms, taking the values of \(c_{1}\) and \(c_{2}\), respectively; \(N\) indicates the upper limit of the distributed delay; \(\alpha \left( \tau \right)\) represents a vector of length \(N\) whose elements are selected from 0 to 1; \(r_{i} \left( {i = 1,2,3,4} \right)\) are random numbers uniformly distributed in [0, 1]; and \(m_{1} \left( \xi \right)\) and \(m_{g} \left( \xi \right)\) represent the distributed-delay intensity factors determined by the evolutionary state \(\xi\).

The membership degree measures the similarity between the sample data and the cluster centers, so it can be used to compensate for missing data in the daily load profile, and its accuracy affects the filling effect. In practice, missing data exhibit two common patterns: random gaps and long continuous gaps. Both affect the accuracy of the membership degree to some extent; in the random case the effect is negligible, but if data are missing for a long period, or even a whole day, there are no reliable data for the membership calculation. In that situation, the gap should be pre-filled based on the power data of the feeder and the regularity of the daily load variation of the transformer. Equation (5) expresses the mathematical relationship between the membership degrees of each cluster center and the missing data; the effect of multiple membership degrees on the missing data is considered from an overall perspective.
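A reconstruction consistent with the symbols defined below (each missing point is a membership-weighted combination of the cluster-center values; the original equation was lost in extraction):

\[
x_{ij} = \sum_{k=1}^{c} u_{ik}\, p_{kj} \tag{5}
\]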

In Eq. (5), \(x_{ij}\) denotes the \(j\)th dimension of the \(i\)th sample, \(u_{ik}\) denotes the membership of the \(i\)th sample in the \(k\)th cluster center, and \(p_{kj}\) denotes the \(j\)th dimension of the \(k\)th cluster center.

Figure 2 depicts the overall flow of the improved FCM algorithm, which fills in the missing data to produce a complete daily load curve. First, daily historical load data are extracted from the power supply company's data platform, the format is processed, and the degree of missing data is determined. If few values are missing, the number of categories can be determined directly; if many are missing, the data can be pre-filled with electricity consumption data. After classifying the historical load data and initializing the RODDPSO algorithm's parameters, the cluster centers are calculated, and the position and velocity equations of the particles are updated. If the iteration termination condition is satisfied, the result is output; otherwise these steps are repeated until it is.

Figure 2. Flow chart of the improved FCM algorithm.
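To make the fill step concrete, the following is a minimal sketch (not the authors' code; the function names and toy data are ours) of an Eq. (2)-style membership computation followed by Eq. (5)-style filling:

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # Membership of each sample in each cluster (Eq. (2)-style update).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fill_missing(day, centers, m=2.0):
    # day: length-96 daily load profile with np.nan at missing points.
    observed = ~np.isnan(day)
    # Memberships are computed on the observed dimensions only.
    u = fcm_memberships(day[observed][None, :], centers[:, observed], m)[0]
    filled = day.copy()
    # Eq. (5): missing points = membership-weighted sum of center values.
    filled[~observed] = u @ centers[:, ~observed]
    return filled

# Toy demo with two synthetic cluster centers (hypothetical data).
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 96)
centers = np.vstack([np.sin(t) + 2.0, np.cos(t) + 3.0])
day = centers[0] + rng.normal(0.0, 0.05, 96)
day[12:24] = np.nan            # 3 hours = 12 consecutive 15-min points
print(fill_missing(day, centers)[12:24].round(2))
```

In the paper's pipeline the centers would come from the RODDPSO-optimized FCM rather than being fixed as here.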

Improved IFA-based anomaly data identification

In addition to the frequent missing-data phenomenon in the raw data of an MVDN, data anomalies also occur frequently. Missing data can usually be identified with the naked eye, but identifying abnormal data manually is challenging. Furthermore, anomalous data can bias engineers' data interpretation and calculations, which can negatively impact the effectiveness and financial returns of power grid firms. The IFA is a popular outlier detection algorithm that isolates outliers from conventional observations by building multiple random isolation trees; the average number of splits required to isolate a given observation then serves as a measure of how anomalous it is. The IFA is particularly well suited to large data sets: it has linear time complexity and is computationally efficient thanks to subsampling 20. Existing data anomaly detection methods are mainly based on describing normal samples, delimiting the region of normal samples in feature space and regarding samples outside this region as abnormal 21. Their main disadvantage is that the anomaly detector optimizes only the description of the normal samples, not of the abnormal ones, which can cause a large number of false positives or detect only a small fraction of the anomalies.

The occurrence of abnormal data destroys, to a certain extent, the normal periodicity and continuity of the power load. To accurately understand the distribution of abnormal data and the amplitude of their changes, the distribution characteristics of the data must be described 22. Given the characteristics and actual conditions of power load data, neither central tendency measures nor distribution shape measures can accurately describe the distribution of abnormal data. Among dispersion measures, the range, standard deviation, and coefficient of variation all reflect the distribution of abnormal data, but when comparing different periods of the same population or different populations, the range and standard deviation lack comparability, whereas the coefficient of variation eliminates these defects and has a wider range of application 23. Therefore, this paper selects the coefficient of variation as the criterion for screening the abnormal subspace.

The study takes massive high-dimensional data sets as its research object. Combining the dispersion measure given by the coefficient of variation, an improved isolation forest algorithm based on the coefficient of variation (CV-iForest) is proposed to address the low reliability of the iForest algorithm on high-dimensional data sets and its high randomness. To clearly express the relationships among the parts of the CV-iForest anomaly detection model, its general framework is given in Fig. 3. The model consists of four layers: the input layer, the data pruning layer, the data mining layer, and the anomaly detection layer, responsible respectively for downscaling and pruning the data, extracting the anomaly candidate set, calculating the coefficient of variation, and outputting the anomaly detection results.

Figure 3. Improved iForest model mechanism diagram.

The core idea of the isolation forest is to cut the data repeatedly: because the density of abnormal data is much smaller than that of normal data clusters, abnormal data can be "isolated" with fewer cuts. In a binary tree structure, abnormal data end up at cuts closer to the root node, while normal data remain at deeper positions. To illustrate how samples are assigned outlier scores by the average path length from all trees to the root node, a diagram of the outlier score distribution is shown in Fig. 4.

Figure 4. Schematic diagram of the anomaly score.

The iForest algorithm has two key training parameters: the sub-sample size \(s\) and the number of isolation trees \(n\). Its learning process constructs isolation trees from the existing data and assembles them into an isolation forest. First, \(s\) subsamples are randomly selected from the dataset \(D\) of size \(k\) and dimension \(m\) to form the training sample \(D_{i} = \left\{ {d_{1} ,d_{2} ,d_{3} , \cdots ,d_{n} } \right\}\left( {s \le k} \right)\). A separation dimension is then randomly selected from the training sample \(D_{i}\), and a cut point \(C\) is selected between the minimum and maximum of this dimension; all data samples greater than or equal to \(C\) are placed in the right branch of the isolation tree and the rest in the left branch. This step is repeated until the samples cannot be cut further or the height limit of the tree is reached. Finally, the above steps are repeated to construct multiple isolation trees forming an isolation forest.

After training, since outliers are generally isolated in the first few rounds of splitting and their average path lengths are relatively short, an anomaly score is calculated from the average path length of a sample to determine whether it is anomalous.
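For orientation, a baseline isolation forest (not the paper's CV-iForest) can be run with scikit-learn; the data here are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Synthetic 96-dimensional "daily load" samples (hypothetical data):
normal = rng.normal(50.0, 5.0, size=(360, 96))
anomalous = rng.normal(50.0, 5.0, size=(5, 96))
anomalous[:, 40:50] += 60.0        # injected abnormal spikes
X = np.vstack([normal, anomalous])

# Baseline iForest: s = 256 sub-samples, n = 100 trees.
clf = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
clf.fit(X)
scores = -clf.score_samples(X)     # higher score = more anomalous
print(np.argsort(scores)[-5:])     # indices of the 5 most anomalous samples
```

The paper's CV-iForest additionally restricts the split dimensions to the high-coefficient-of-variation subspace and fixes the cut-point rule, as described below.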

Following that, based on the average path length \(c\left( s \right)\) of sample \(x\), the anomaly score of sample \(x\) may be derived; its computation expression is presented in Eq. (6).
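A standard reconstruction of the iForest anomaly score (with \(H(i)\) the harmonic number; the equation body was lost in extraction) is:

\[
s\left( x \right) = 2^{- \frac{E\left( pathL\left( x \right) \right)}{c\left( s \right)}}, \qquad c\left( s \right) = 2H\left( s-1 \right) - \frac{2\left( s-1 \right)}{s}, \quad H\left( i \right) \approx \ln i + 0.5772 \tag{6}
\]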

In Eq. (6), \(E\left( {pathL\left( x \right)} \right)\) is the average path length of sample \(x\) over the forest. Since the selection of dimensions and cut points in the training phase of the iForest algorithm is random, its global stability is poor; and for high-dimensional load data, some dimensional information remains unused after modeling, lowering its reliability. The study therefore improves the iForest algorithm with data dimensionality reduction and new isolation strategies.

In the data dimensionality reduction step, the coefficient of variation, being dimensionless, is calculated for each dimension in order to screen out the variables with high dispersion, i.e., to select the anomalous subspace. The coefficient of variation \(C_{V}\) is computed for every dimension of the dataset, and the anomalous subspace \(W = \left\{ {w_{1} ,w_{2} , \cdots ,w_{i} } \right\}\) is filtered out based on \(C_{V\;\min } \le C_{V} \le C_{V\;\max }\) (where \(i = a \times m\), \(a\) is a critical coefficient with values in the range [0, 1], and \(m\) is the dimension of dataset \(D\)). The coefficient of variation \(C_{V}\) is calculated by Eq. (7).
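A reconstruction from the symbol definitions below (the equation body was lost in extraction):

\[
C_{V} = \frac{\sigma}{\mu}, \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_{i} - \mu \right)^{2}} \tag{7}
\]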

In Eq. (7), \(\sigma\) denotes the standard deviation of the sample, \(\mu\) denotes the mean value of the sample, \(n\) denotes the number of data points in the sample, and \(x_{i}\) denotes the \(i\)th value in the sample. The anomalous subspace resulting from the dimensionality reduction is used as a new training set, and the isolation strategy is adjusted to improve the isolation effect of a single isolation tree during construction. The study proposes two isolation strategies. One randomly selects the isolation dimension from the anomalous subspace; given an isolation dimension, the midpoint of the largest interval between neighboring data on that dimension is used as the isolation point, placing the samples into the left and right branches of the isolation tree.

Computational and analytical model construction for MVDN LL

An MVDN differs from high-voltage transmission grids in that the conductance of conductors and transformers to ground can be ignored 24. As a result, simpler approaches are typically used to calculate LL, such as the power method, the equivalent resistance method, the maximum current method, and the root mean square current method. However, owing to limitations such as missing measurement data, these methods can yield a single, coarse LL result of low accuracy 25. The study therefore introduces a breadth-first forward-backward sweep LL calculation method suited to the big data context. The method is based on the actual feeder topology and the load data of each transformer; transformers are represented by equivalent impedances, and the forward-backward sweep is used to calculate the losses of lines and transformers 26. Given the complexity and numerous branches of current MVDN feeder topologies, the plain forward-backward sweep is inefficient, so topology identification is performed with the original algorithm and the network is stratified to enable layer-by-layer LL calculation. In the feeder topology, the connection relationships between nodes are represented by the node association matrix. The root node of the feeder forms the first layer of the node hierarchy matrix; the nodes connected to the root node but not yet written into the matrix are written into the second layer, and so on, until all nodes are written into the node hierarchy matrix 27 (a breadth-first layering; see the sketch below).
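The layering itself is a plain breadth-first traversal; a minimal sketch (our own illustration, with a hypothetical five-node feeder) is:

```python
from collections import deque

def node_layers(adjacency, root):
    # Breadth-first layering of a feeder topology: layer 1 holds the
    # root node, layer 2 the nodes adjacent to it, and so on.
    layers, seen, frontier = [], {root}, deque([root])
    while frontier:
        layers.append(list(frontier))
        nxt = deque()
        for node in frontier:
            for nb in adjacency.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    nxt.append(nb)
        frontier = nxt
    return layers

# Toy feeder (hypothetical): node 0 is the source/root.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
print(node_layers(adjacency, 0))   # [[0], [1, 2], [3], [4]]
```

The backward sweep then accumulates branch currents from the deepest layer toward the root, and the forward sweep propagates voltages back down.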

Taking the IEEE 31-node system as an example, whose node hierarchy is shown schematically in Fig. 5, the branch currents are calculated layer by layer by the backward sweep; the node injection currents and branch currents are computed as shown in Eq. (8).

Figure 5. IEEE 31-node layer diagram.
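A reconstruction consistent with the symbol definitions below (\(^{*}\) denotes complex conjugation and \(\mathrm{j}\) the imaginary unit; the equation body was lost in extraction):

\[
I_{j}^{k} = \left( \frac{P_{Load} + \mathrm{j}\, Q_{Load}}{V_{j}^{k-1}} \right)^{*}, \qquad I_{l}^{k} = I_{j}^{k} + \sum_{m \in M} I_{m}^{k} \tag{8}
\]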

In Eq. (8), \(I_{j}^{k}\) denotes the injected current of the end node \(j\) of branch \(l\), \(P_{Load}\) denotes the active power of the branch, \(Q_{Load}\) denotes the reactive power of the branch, and \(V_{j}^{k - 1}\) denotes the voltage at the end node \(j\) of branch \(l\) from the previous iteration. \(I_{l}^{k}\) denotes the branch current of branch \(l\), \(I_{m}^{k}\) denotes the current of \(m\), a lower branch of branch \(l\), and \(M\) denotes the set of all lower branches directly connected to node \(j\). Then, proceeding layer by layer from the first node to the last, the voltage of node \(j\) is calculated as shown in Eq. (9).
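The forward-sweep voltage update, reconstructed from the symbols below:

\[
V_{j}^{k} = V_{i}^{k} - Z_{l}\, I_{l}^{k} \tag{9}
\]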

In Eq. (9), \(V_{j}^{k}\) denotes the voltage of the end node \(j\) of branch \(l\), \(V_{i}^{k}\) denotes the voltage of the start node \(i\) of branch \(l\), and \(Z_{l}\) denotes the impedance of branch \(l\). The above calculation steps are repeated until the convergence condition \(\max \left| {V^{k} - V^{k - 1} } \right| \le \varepsilon\) is satisfied.

The computed voltages and currents of each branch can be used to determine the line's overall loss; the specific calculation is shown in Eq. (10).
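A reconstruction from the symbols below (the equation body was lost in extraction):

\[
\Delta P_{line} = \sum_{i=1}^{L} I_{i}^{2} R_{i} \tag{10}
\]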

In Eq. (10), \(L\) denotes the number of branches of the feeder, \(I_{i}\) denotes the magnitude of the current flowing through branch \(i\), and \(R_{i}\) denotes the branch resistance. To simplify the calculation, the transformer is treated as an element with only resistance and reactance, and its equivalent impedance is calculated using Eq. (11).
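The equation body was lost in extraction; a common textbook form, assuming the unit conventions stated below (this exact form is our assumption, not confirmed by the source), is:

\[
R_{T} = \frac{P_{k} V_{N}^{2}}{S_{N}^{2}} \times 10^{3}, \qquad X_{T} = \frac{V_{k}\%}{100} \cdot \frac{V_{N}^{2}}{S_{N}} \times 10^{3} \tag{11}
\]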

In Eq. (11), \(P_{k}\) indicates the transformer short-circuit loss in kW, \(V_{N}\) the rated voltage of the transformer in kV, \(S_{N}\) the rated capacity of the transformer in kVA, and \(V_{k} \%\) the impedance voltage percentage. Transformer loss comprises the variable loss brought by the equivalent resistance and the fixed loss, so the transformer loss can be expressed as Eq. (12).
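A reconstruction matching the description above (variable plus fixed loss):

\[
\Delta P_{T} = P_{0} + P_{R} \tag{12}
\]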

In Eq. (12), \(P_{0}\) denotes the fixed loss of the transformer and \(P_{R}\) denotes the loss in the transformer's equivalent resistance. The feeder LL rate is the proportion of line and transformer losses in the feeder to the power supplied, where the power supplied is the sum of the feeder's losses and its power sales. Its calculation is given in Eq. (13).
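A reconstruction consistent with the definitions below (losses accumulated over the sales duration, divided by losses plus sales), stated as an assumption:

\[
\lambda = \frac{\left( \Delta P_{line} + \Delta P_{T} \right) T}{\left( \Delta P_{line} + \Delta P_{T} \right) T + W_{es}} \times 100\% \tag{13}
\]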

In Eq. (13), \(\lambda\) denotes the feeder LL rate, \(T\) denotes the duration of feeder power sales, and \(W_{es}\) denotes the total power sales of the feeder. To identify the causes of feeder LL, the network loss index parameters are first collected and preprocessed. Then, based on the standardized data, logistic regression analysis is carried out on the whole grid to identify the important factors affecting grid loss and to exclude irrelevant ones. On this basis, logistic regression is used to identify the causes of feeder LL in different regions from the perspective of power supply zoning. New feeders are categorized, and their LL level is predicted with the logistic regression model of the category to which they belong, thereby identifying the causes of losses.

LLA model performance validation

The study calculates the LL of a power supply line in Yuxi City, Yunnan Province, as an example, and analyzes the feeder in detail. The experimental data are based on the Electricity Load Diagrams data set in the UCI database, which contains active load data of 370 users from 2017 to 2021, collected once every 15 min, so the data dimension is 96. All users' load data from 2017 were selected as the experimental data for the algorithm. Out of the 365 samples from 2017, the study randomly generated 37 abnormal samples (10% of the total). The abnormal samples are divided into two categories, obvious outliers and fuzzy outliers, to test the algorithm's effect.

Effectiveness of missing data filling

To study the filling accuracy of the data filling algorithm in two cases, partial missing and complete missing, the partially missing segment is placed in the first half of the daily load profile with a missing rate of 12.5%, i.e., the load information is missing continuously for three hours. Table 2 shows the basic information of this feeder.

The study uses the daily active load of one transformer on this feeder as an example. It clusters and analyzes the active load in two states using the FCM clustering method and, based on the ideal number of clusters, derives the results in Fig. 6, where the horizontal axis shows the 96 load collection points in one day and the vertical axis the instantaneous active power at a given moment. The cluster analysis reveals that while the two types of load profiles share similar trends, the second type exhibits significantly larger and more volatile peaks and troughs than the first. To verify the accuracy of the data filling algorithm with partial and with complete data missing, the study takes the active load of Yuxi City, Yunnan Province, in July 2020 as example data. In the original data, missing values are inserted in the front section of the daily load curve with a missing rate of 12.5%, i.e., three consecutive hours of load data are removed. To assess the feasibility of data filling methods, the commonly used measures are the root mean square error (RMSE), mean absolute error (MAE), and standard deviation (SD), comparing the filled values with the actual values. In general, a smaller RMSE and MAE indicate higher filling precision, and a smaller SD indicates smoother filling values, with no single filled value differing excessively.

Figure 6. Clustering effect of the historical active load curves.

To test the filling effect, the improved FCM (IFCM) algorithm was compared with the back-propagation neural network (BPNN) from literature 28, the conventional FCM procedure, and the cluster mean filling algorithm, each used to fill the two missing-data examples; the absolute error curves are given in Fig. 7. The improved FCM method has higher accuracy than the other algorithms, with an average error of −6.35 kW for partially missing data and −10.63 kW for completely missing data. Comparison with the basic FCM algorithm indicates that the cluster centers optimized by RODDPSO reflect the overall characteristics of the data more accurately. The filling accuracy of the BPNN is the lowest because the selected feeder lies in a rural area of a class D power supply zone, where load changes show no obvious regularity and are susceptible to environmental factors; without sufficient historical data, it is difficult to build the neural network.

Figure 7. Active power absolute error curves.

Three classical data filling methods, linear interpolation, mean filling, and mode filling, were selected for comparison. RMSE, MAE, and SD were used to evaluate the performance of the FCM algorithm, and 30 experimental results were statistically analyzed; the resulting evaluation indexes are shown in Fig. 8, where different letters indicate a significant difference in the same index (p < 0.05). In Fig. 8, the SDs of the improved FCM algorithm for the two missing-data cases are 4.015 kW and 10.156 kW, significantly different from the other three methods (p < 0.05), indicating smoother filling values. With partial data missing, the MAE and RMSE were 3.154 kW and 2.416 kW; with complete data missing, 5.635 kW and 3.529 kW, the smallest in both cases. From partial to complete missing, the error evaluation indexes of the improved FCM algorithm change the least, indicating that the robustness of the proposed algorithm is better than that of the other algorithms.

Figure 8. Error evaluation indexes.

Effects of abnormal data identification

The abnormal samples are split into two categories, obvious and fuzzy anomalies, to test the algorithm's efficacy. The HBOS, LOF, K-means, and iForest algorithms, all implemented in Python, are compared with the proposed CV-iForest algorithm to confirm its effectiveness and superiority. The study calculates the coefficient of variation for each dimension, ranks the coefficients, and selects the top \(a \times m\) dimensions as the anomalous subspace. In the obvious-anomaly case, compared with the data in the 23rd dimension, the data in the 78th dimension have more and more pronounced spikes and a larger coefficient of variation. Figure 9 shows the first two dimensions with the largest coefficients of variation; in the fuzzy-anomaly case, compared with the 23rd dimension, the 5th dimension has more and more pronounced spikes and a larger coefficient of variation.

Figure 9. Obvious abnormal samples.

Figure 10 displays the evaluation results, with the area under the curve (AUC) chosen as the index of detection accuracy. For abnormal samples with obvious outliers, the AUC of the CV-iForest algorithm is 0.9971, the highest among the algorithms, though its running speed is slower, still within an acceptable range. The LOF algorithm is the fastest, at 0.7189 s, with an AUC second only to CV-iForest, but it is not applicable to load data lacking a density attribute. The K-means algorithm runs the slowest, indicating that it is unsuitable for high-dimensional load datasets with large data volumes, and the HBOS algorithm has the lowest AUC, 0.6629, indicating that the load data do not satisfy that statistical method's distributional assumptions. In the fuzzy-anomaly case, the CV-iForest algorithm maintains the highest AUC, 0.8586, with the smallest decrease, while the LOF algorithm shows the second-best adaptability in abnormal scenarios. The K-means and iForest algorithms show the most significant decreases in AUC, and the HBOS algorithm has the lowest AUC in both scenarios, also with a noticeable decrease.

Figure 10. Algorithm operation effect.

The study evaluates the stability of the model on the Load Diagrams dataset by varying the outlier rate and the proportion of missing samples in comparison experiments. In Fig. 11a, model performance fluctuates as the outlier rate increases, and is best when the outlier rate is between 0 and 10%. In general, as the proportion of non-normal samples increases, the distribution across sample categories becomes more uniform, which favors a high-accuracy classification model. However, the iForest model exploits the sparsity of outliers: when outliers exceed 10% they are more likely to aggregate, increasing the number of cuts needed to isolate them and degrading the model. As shown in Fig. 11b, performance increases significantly as the proportion of base classifiers grows from 5 to 20%, suggesting a progressive rise in base classifier variability that increases the stability of the final model. Performance did not always improve as missing samples increased from 20 to 40%, which may be related to the creation of weak classifiers with accuracy below 0.5.

Figure 11. Analysis of experimental results of model stability.

LL calculation and cause identification

Taking a power supply line in a prefecture-level city in Yunnan Province as an example, the line loss is calculated and compared with the model proposed by Liang et al. 29; Table 3 shows the refined analysis results for this feeder. LL is the loss caused by factors such as the line material, and the LL ratio is the proportion of LL in the total loss; public distribution loss is the loss caused by public distribution transformers, and the public distribution loss ratio is its proportion of the total loss. The table shows that the LL rate of this feeder is 7.62%, which exceeds the LL rate standard for a class D power supply subdistrict, so the LL rate is unqualified. The LL reaches 58,159.89 kW and the loss of the public distribution substations 134.896 kW, with LL accounting for 99.77%, indicating that the loss of this feeder is borne essentially by the line, while the losses of dedicated distribution substations are borne by the users themselves. Owing to the large connected load, most of the distribution transformers exhibit low voltage, and 5 branch circuits are heavily overloaded. The method of Liang et al. belongs to the power flow class of algorithms; its main idea is to establish power flow equations for the station-area distribution network, and different calculation models must be established for different circuits according to their parameters, giving it low generality and a strong dependence on parameter quality. Although the individual indexes differ little, its overall performance remains lower than that of the model proposed in this study.

The study analyzed the five branches of this feeder where heavy overload exists; the results are shown in Table 4. Two branches of the feeder are severely overloaded because the branch conductor type does not match the load current; according to the wire cross-section requirements in the MVDN Planning Technical Guidelines, these two branches should be replaced with 240 mm² wires.

Conclusion

In accordance with the two common data quality problems in power load data, the study proposes a data cleaning model based on the improved FCM clustering algorithm and the IFA, in order to enhance the grid's energy utilization and encourage a change in the management style of the power supply company. A model for calculating and analyzing LL in the context of electric power big data is proposed, addressing the inapplicability of traditional distribution network LL calculation methods and flaws such as unclear identification of LL causes in distribution network feeders. The results show that the average error of the improved FCM algorithm is the smallest among the compared algorithms: −6.35 kW with a standard deviation of 4.015 kW when part of the data is missing, and −10.63 kW with a standard deviation of 10.156 kW when all data are missing. The MAE and RMSE were 3.154 kW and 2.416 kW with partial missing data, and 5.635 kW and 3.529 kW with complete missing data. From partial to complete missing, the error evaluation indexes of the improved FCM algorithm change the least, indicating that the proposed algorithm is relatively robust compared with the classical data filling methods. In the calculation of the original line loss data of a power supply line in a prefecture-level city in Yunnan Province, in contrast to the method of Liang et al., this study uses the RODDPSO and CV-iForest algorithms to handle abnormal or missing data: the nonlinear relationship between input and output can be found by learning only a small number of data samples, no upper and lower power constraints need to be established for nodes with abnormal or missing data, and the line loss results can be obtained quickly and accurately. The AUC of the CV-iForest algorithm is 0.9971 for abnormal samples with clear outliers and 0.8586 for fuzzy anomaly samples, with the smallest accuracy decrease in the fuzzy case. Further analysis shows that the loss rate of the feeder is 7.62% and that two branches are seriously overloaded; the power supply radius and bare conductor resistance are the key factors leading to the high loss rate. Although the RODDPSO and CV-iForest algorithms yield good results, they demand substantial computing resources, so lightweight processing with different neural networks could be considered to improve the efficiency and accuracy of the algorithm.

This research was supported by the Key Science and Technology Projects of China Southern Power Grid Corporation: "Research and application of LL analysis and diagnosis and loss reduction and carbon reduction technology for new power systems" [Project No. 035900KK52220006 (GDKJXM20220254)]. Papers arising from this project must carry the statement "China Southern Power Grid Corporation Science and Technology Project Funding [Project No. 035900KK52220006 (GDKJXM20220254)]".

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

1. Danjuma, M. U., Yusuf, B. & Yusuf, I. Reliability, availability, maintainability, and dependability analysis of cold standby series-parallel system. JCCE 1(4), 193–200 (2022).
2. Saeed, M., Ahmad, M. R. & Rahman, A. U. Refined Pythagorean fuzzy sets: Properties, set-theoretic operations and axiomatic results. JCCE 2(1), 10–16 (2022).
3. Choudhuri, S., Adeniye, S. & Sen, A. Distribution alignment using complement entropy objective and adaptive consensus-based label refinement for partial domain adaptation. AIA 1(1), 43–51 (2022).
4. Oslund, S., Washington, C. & So, A. Multiview robust adversarial stickers for arbitrary objects in the physical world. JCCE 1(4), 152–158 (2022).
5. Wang, X., Cheng, M. & Eaton, J. Fake node attacks on graph convolutional networks. JCCE 1(4), 165–173 (2022).
6. Hu, W., Guo, Q., Wang, W., Wang, W. H. & Song, S. H. Loss reduction strategy and evaluation system based on reasonable line loss interval of transformer area. Appl. Energ. 306(15), 123–133. https://doi.org/10.1016/j.apenergy.2021.118123 (2022).
7. Zhang, Z. Y., Yang, Y., Zhao, H. & Xiao, R. Prediction method of line loss rate in low-voltage distribution network based on multi-dimensional information matrix and dimensional attention mechanism-long-and short-term time-series network. IET Gener. Transm. Distrib. 16(20), 4187–4203. https://doi.org/10.1049/gtd2.12590 (2022).
8. Tang, Z. et al. Research on short-term low-voltage distribution network line loss prediction based on Kmeans-LightGBM. J. Circuit Syst. Comp. 31(13), 135–146. https://doi.org/10.1142/S0218126622502280 (2022).
9. Liu, K. Y., Jia, D. L., Kang, Z. J. & Luo, L. Anomaly detection method of distribution network line loss based on hybrid clustering and LSTM. J. Electr. Eng. Technol. 17(2), 1131–1141. https://doi.org/10.1007/s42835-021-00958-4 (2022).
10. Zhang, L. et al. Distribution network line loss calculation method considering distributed photovoltaic access. J. Phys. Conf. Ser. 2488(1), 63–72. https://doi.org/10.1088/1742-6596/2488/1/012057 (2023).
11. Yi, C. C., Tuo, S., Tu, S. & Zhang, W. T. Improved fuzzy C-means clustering algorithm based on t-SNE for terahertz spectral recognition. Infrared Phys. Technol. 117(9), 214–225. https://doi.org/10.1016/j.infrared.2021.103856 (2021).
12. Ke, B., Nguyen, H., Bui, X., Bui, H. & Nguyen-Thoi, T. Prediction of the sorption efficiency of heavy metal onto biochar using a robust combination of fuzzy C-means clustering and back-propagation neural network. J. Environ. Manage. 293(9), 214–225. https://doi.org/10.1016/j.jenvman.2021.112808 (2021).
13. Surono, S. & Putri, R. D. A. Optimization of fuzzy C-means clustering algorithm with combination of Minkowski and Chebyshev distance using principal component analysis. Int. J. Fuzzy Syst. 23(1), 139–144 (2021).
14. Wang, A. J. & Zhang, F. A driver abnormal behavior warning method based on isolated forest algorithm. ATS 3(12), 55–66 (2023).
15. Pan, N., Jiang, X., Pan, D. & Liu, Y. Study of the bullet rifling linear traces matching technology based on deep learning. J. Intell. Fuzzy Syst. 40(4), 16–22. https://doi.org/10.3233/JIFS-189617 (2021).
16. Long, X. M., Chen, Y. J. & Zhou, J. Development of AR experiment on electric-thermal effect by open framework with simulation-based asset and user-defined input. Artif. Intell. Appl. 1(1), 52–57 (2022).
17. Yastrebov, A., Kubus, L. & Poczeta, K. Multiobjective evolutionary algorithm IDEA and k-means clustering for modeling multidimensional medical data based on fuzzy cognitive maps. Nat. Comput. 22(3), 601–611 (2023).
18. Shi, H., Wang, P., Yang, X. & Yu, H. An improved mean imputation clustering algorithm for incomplete data. Neural Process. Lett. 54(5), 3537–3550 (2022).
19. Yang, Q. F. et al. HCDC: A novel hierarchical clustering algorithm based on density-distance cores for data sets with varying density. Inf. Syst. 114(5), 1–14 (2023).
20. Sebastian, B., Philipp-Jan, H. & Katharina, M. Randomized outlier detection with trees. JDSA 13(2), 91–104. https://doi.org/10.1007/s41060-020-00238-w (2022).
21. Shao, N. & Chen, Y. Abnormal data detection and identification method of distribution internet of things monitoring terminal based on spatiotemporal correlation. Energies 15(6), 2151–2164. https://doi.org/10.3390/en15062151 (2022).
22. Liang, J. F., Li, W., Zhao, Y. P., Zhou, Y. & Zou, Q. W. A risk identification method for abnormal key data in the whole process of production project. Int. J. Data Min. Bioin. 24(3), 1–3. https://doi.org/10.1504/IJDMB.2022.130345 (2022).
23. Wang, Y., Zhang, X. Y. & Liu, H. F. Intelligent identification of the line-transformer relationship in distribution networks based on GAN processing unbalanced data. Sustainability 14(14), 624–647. https://doi.org/10.3390/su14148611 (2022).
24. Fu, J. et al. A novel optimization strategy for line loss reduction in distribution networks with large penetration of distributed generation. Int. J. Elec. Power 150(8), 1091121–10911216 (2023).
25. Liu, X. Automatic routing of medium voltage distribution network based on load complementary characteristics and power supply unit division. Int. J. Elec. Power 133(2), 106467.1–106467.13. https://doi.org/10.1016/j.ijepes.2020.106467 (2021).
26. Liu, K. et al. Energy loss calculation of low voltage distribution area based on variational mode decomposition and least squares support vector machine. MPE 2021(33), 8530389.1–8530389.11. https://doi.org/10.1155/2021/8530389 (2021).
27. Dashtdar, M. et al. Improving voltage profile and reducing power losses based on reconfiguration and optimal placement of UPQC in the network by considering system reliability indices. Int. T. Electr. Energy 31(11), e13120.1–e13120.29. https://doi.org/10.1002/2050-7038.13120 (2021).
28. Min, Y. C., Chai, H. K., Huang, Y. F., Wei, D. C. & Jia, Y. P. Artificial intelligence generated synthetic datasets as the remedy for data scarcity in water quality index estimation. Water Resour. Manag. 37(15), 6183–6198. https://doi.org/10.1007/s11269-023-03650-6 (2023).
29. Liang, C. et al. Line loss interval algorithm for distribution network with DG based on linear optimization under abnormal or missing measurement data. Energies 15(11), 4158. https://doi.org/10.3390/en15114158 (2022).

Author information

Authors and Affiliations

Metrology Center, Guangdong Power Grid Co., Ltd., Guangzhou, 511545, China

Jian Li, Wen Zhao, Jiajie Li, Ke Zhang & Zetao Jiang

Power Supply Service, Dongguan Power Supply Bureau, Dongguan, 523576, China

Contributions

J. L. and S. L. collected the samples. W. Z. and J. L. analysed the data. K. Z. and Z. J. conducted the experiments and analysed the results. All authors discussed the results and wrote the manuscript.

Corresponding author

Correspondence to Jian Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article

Li, J., Li, S., Zhao, W. et al. Distribution network line loss analysis method based on improved clustering algorithm and isolated forest algorithm. Sci Rep 14, 19554 (2024). https://doi.org/10.1038/s41598-024-68366-y

Received: 25 December 2023

Accepted: 23 July 2024

Published: 22 August 2024

DOI: https://doi.org/10.1038/s41598-024-68366-y

Keywords

  • Fuzzy C-Means
  • Isolated forest algorithm
  • Medium voltage distribution networks
  • Line loss analysis
  • Data processing

Computer Science > Data Structures and Algorithms

Title: A Tighter Complexity Analysis of SparseGPT

Abstract: In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ for any $a \in [0, 1]$, where $\omega$ is the exponent of matrix multiplication. In particular, for the current $\omega \approx 2.371$ [Alman, Duan, Williams, Xu, Xu, Zhou 2024], our running times boil down to $O(d^{2.53})$. This running time is due to the analysis of the lazy update behavior in iterative maintenance problems, such as [Deng, Song, Weinstein 2022, Brand, Song, Zhou ICML 2024].
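
As a quick sanity check on the exponent (our back-of-envelope note, not part of the abstract): the middle term alone gives the claimed bound once $a \approx 0.53$, and the role of $a$ is to balance that term against the rectangular matrix multiplication term.

```latex
% Back-of-envelope check (ours): with \omega \approx 2.371,
%   d^{\omega} \approx d^{2.371}, and choosing a \approx 0.53 gives
%   d^{2+a+o(1)} = d^{2.53+o(1)};
% the result asserts the third term d^{1+\omega(1,1,a)-a} also stays
% within this bound at that a, so the total is O(d^{2.53}).
\[
  O\bigl(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a}\bigr)
  = O\bigl(d^{2.53}\bigr)
  \qquad \text{at } a \approx 0.53 .
\]
```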
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2408.12151 [cs.DS]

  • White Collar Crime
  • Criminal Law
  • Fraud Detection

Optimizing Hyperparameters for Fraud Detection: A Comparative Analysis of Machine Learning Algorithms

  • August 2024
  • In book: Artificial Intelligence, Big Data, IOT and Block Chain in Healthcare: From Concepts to Applications (pp.218-228)
  • Authors: Yousef Farhaoui (Université Moulay Ismail, Faculty of Sciences and Technics, Morocco) and Rejuwan Shamim (Maharishi University of Information Technology)

IMAGES

  1. (PDF) Design and Analysis of Algorithms
  2. Design and Analysis of Algorithms 2011-2012 M.Sc Computer Science
  3. Algorithm research paper
  4. Design and analysis of algorithm past question paper
  5. Research and Practice on Algorithm Analysis and Design Course Teaching
  6. Algorithm

COMMENTS

  1. Knuth: Selected Papers on Analysis of Algorithms

    Selected Papers on Analysis of Algorithms. by Donald E. Knuth (Stanford, California: Center for the Study of Language and Information, 2000), xvi+621 pp. (CSLI Lecture Notes, no. 102.) ISBN 1-57586-212-3 Printings made after 2006 have xvi+622 pp., because the index has gotten longer. This is the fourth in a series of eight volumes that contain ...

  2. (PDF) Design and Analysis of Algorithms

    This book "Design and Analysis of Algorithms", covering various algorithm and analyzing the real word problems. It delivers various types of algorithm and its problem solving techniques. It ...

  3. Comprehensive Study of Algorithms for the Analysis of Algorithms

    At the heart of modern computing lie algorithms, which enable the processing of vast datasets and complex analytical tasks. A comprehensive understanding of these concepts is paramount in computer science, finding application in software development, system optimization, and computational theory. We discuss 5 key techniques of analysis in this paper. The 5 key analysis techniques ...

  4. (PDF) Analysis and design of algorithms. A critical comparison of

    The paper presents an analytical exposition, a critical context, and an integrative conclusion on the six major textbooks on algorithm design and analysis. Algorithms form the heart of Computer ...

  5. PDF Design and Analysis of Algorithms

    978-1-108-49682-7, Design and Analysis of Algorithms, Sandeep Sen & Amit Kumar, frontmatter ... research in algorithms. Real-life applications and numerical problems are spread throughout ... ISBN 9781108721998 (paperback: alk. paper). Subjects: LCSH: Algorithms. Classification: LCC QA9.58 .S454 2019 | DDC 005.1 dc23

  6. Selected Papers on Analysis of Algorithms

    Analysis of Algorithms, which has grown to be a thriving international discipline, is the unifying theme underlying Knuth's well known books The Art of Computer Programming. More than 30 of the fundamental papers that helped to shape this field are reprinted and updated in the present collection, together with historical material that has not ...

  7. Design and analysis of algorithms reconsidered

    Abstract. The paper elucidates two views (models) of algorithmic problem solving. The first one is static; it is based on the identification of several principal dimensions of algorithmic ...

  8. Analysis of Algorithms and Complexity Theory

    ... "Analysis of Algorithms and ...

  9. Journal of Algorithms & Computational Technology: Sage Journals

    Journal of Algorithms & Computational Technology (JACT) is a peer-reviewed open access journal which focuses on the employment of mathematical and numerical methods and computational technology in the development of engineering solutions. This journal is a member of the Committee on Publication ...

  10. Machine Learning: Algorithms, Real-World Applications and Research

    To discuss the applicability of machine learning-based solutions in various real-world application domains. To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services. The rest of the paper is organized as follows.

  11. Automated Big-O analysis of algorithms

    Algorithm analysis is an important part of algorithm design. Traditionally, analysis of programming code or algorithms is theoretical and mathematical. This makes it a time consuming and manual process. This limits the scope and scale of undertaking such a task. There has always been an ever-growing need to automate this analysis. With mobile development taking the center stage, we need to ...

  12. A Comparative Analysis of Machine Learning Algorithms for

    Fig. 3: Precision (in percentage) of the algorithms applied on each dataset (Vraj Sheth et al., Procedia Computer Science 215 (2022), 422–431). 6. Conclusion: The prediction of classes is handled by a classification algorithm in this paper.

  13. PDF 1. Analysis of Algorithms

    Analysis of Algorithms (Knuth, 1960s). To analyze an algorithm: Develop a good implementation. Identify unknown quantities representing the basic operations. Determine the cost of each basic operation. Develop a realistic model for the input. Analyze the frequency of execution of the unknown quantities. Calculate the total running time: Σ (cost × frequency).

  14. Analysis of Searching Algorithms in Solving Modern Engineering Problems

    Many current engineering problems have been solved using artificial intelligence search algorithms. To conduct this research, we selected certain key algorithms that have served as the foundation for many other algorithms present today. This article exhibits and discusses the practical applications of A*, Breadth-First Search, Greedy, and Depth-First Search algorithms. We looked at several ...

  15. Design and Analysis of Algorithms

    Design and Analysis of Algorithms. Design and Analysis of Algorithms is a fundamental aspect of computer science that involves creating efficient solutions to computational problems and evaluating their performance. DSA focuses on designing algorithms that effectively address specific challenges and analyzing their efficiency in terms of time ...

  16. Analysis and Research of Sorting Algorithm in Data ...

    Abstract. In the process of learning the data structure, it is very necessary to master the sorting algorithm, and in the program design, the sorting algorithm is applied frequently. Based on the importance of sorting algorithms, this paper will carefully compare the characteristics of different algorithms, starting with the work efficiency ...

  17. Analysis of Algorithms Research Papers

    In this paper, we prove polynomial running time bounds for an Ant Colony Optimization (ACO) algorithm for the single-destination shortest path problem on directed acyclic graphs. More specifically, we show that the expected number of iterations required for an ACO-based algorithm with n ants is O((1/ρ)·n²·m·log n) for graphs with n nodes and m ...

  18. Analysis of Dijkstra's Algorithm and A* Algorithm in Shortest Path

    These two algorithms are often used in routing or road networks. This paper's objective is to compare those two algorithms in solving this shortest path problem. In this research, Dijkstra and A* have almost the same performance when solving town- or regional-scale maps, but A* is better for large-scale maps.

  19. (PDF) Comparative Analysis of Search Algorithms

    This paper aims to compare A*, Dijkstra, Bellmann-Ford, Floyd-Warshall, and best first search algorithms to solve a particular variant of the pathfinding problem based on the so-called paparazzi ...

  20. Algorithms & Optimization

    KDD 2015 Best Research Paper Award: Algorithms for Public-Private Social Networks ... Our mission is to build the most scalable library for graph algorithms and analysis and apply it to a multitude of Google products. We formalize data mining and machine learning challenges as graph problems and perform fundamental research in those fields ...

  21. Distribution network line loss analysis method based on ...

    The contributions of this research are as follows: (1) To solve the problem of missing charge data in the distribution network, the PSO algorithm is used to optimize the clustering center of the ...

  22. [2408.12151] A Tighter Complexity Analysis of SparseGPT

    Computer Science > Data Structures and Algorithms. arXiv:2408.12151 (cs) [Submitted on 22 Aug 2024] Title: A Tighter Complexity Analysis of SparseGPT. Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song. View a PDF of the paper titled A Tighter Complexity Analysis of SparseGPT, by Xiaoyu Li and 3 other authors.

  23. Promise into practice: Application of computer vision in empirical

    Social scientists increasingly use video data, but large-scale analysis of its content is often constrained by scarce manual coding resources. Upscaling may be possible with the application of automated coding procedures, which are being developed in the field of computer vision. Here, we introduce computer vision to social scientists, review the state-of-the-art in relevant subfields, and ...

  24. (PDF) Analysis of Dijkstra's Algorithm and A* Algorithm in Shortest

    Dijkstra's algorithm is one form of the greedy algorithm. It is a graph search algorithm used to solve the single-source shortest path problem on a graph that does not ...

  25. DC near‐area voltage stability constrained renewable energy integration

    Herein, first, a voltage stability-constrained minimum startup index and algorithm for conventional thermal power plants are proposed. Then, based on time series production simulation, a renewable energy integration capacity analysis algorithm is designed considering voltage stability and peak shaving constraints.

  26. Analysis of cell balancing of Li-ion batteries with dissipative and non

    Research paper. Analysis of cell balancing of Li-ion batteries with dissipative and non-dissipative systems for electric vehicle applications ... display, and other systems (e.g., CAN, UART, SPI). Kalman filtering estimates SOC and SOH, and model predictive control optimizes charging and discharging. Cell balancing ...

  27. (PDF) Comparative Analysis of Machine Learning Algorithms on Different

    This research aims at comparing different algorithms used in machine learning. Machine learning can be both experience- and explanation-based learning. In this study, the most popular algorithms were ...

  28. Accuracy of Speech Sound Analysis: Comparison of an Automatic

    Purpose: Automatic speech analysis (ASA) and automatic speech recognition systems are increasingly being used in the treatment of speech sound disorders (SSDs). ... The current research analyzes the feedback accuracy of a novel ASA algorithm (Amplio Learning Technologies), in comparison to clinician judgments. Method: A total of 3,584 consonant ...

  29. Study and Analysis of Decision Tree Based Classification Algorithms

    Analysis of the J48 algorithm for data mining. ... In this demo paper we present Docear's research paper recommender system. Docear is an academic literature suite to search, organize, and create ...

  30. Optimizing Hyperparameters for Fraud Detection: A Comparative Analysis

    This research investigates how the problem of money laundering (ML) can be detected in Saudi Arabia with supervised machine learning, specifically at two levels: the establishment-level means that ...