What is anomaly detection and how does it work
Anomaly detection is the mathematical search for deviations in numerical data, in both supervised and unsupervised settings, based on how far a particular value lies from the others or from the standard deviation of the sample. There are many methods for detecting anomalies, known as outlier detection algorithms, each with its own detection criteria and therefore its own use cases. The most common methods used to detect anomalies are:
General density-based methods: K-Nearest Neighbors (KNN), Local Outlier Factor (LOF), Isolation Forest, and other algorithms that can be applied in regression or classification scenarios. Each of these models expected behavior by following the regions of highest data-point density; points that fall a statistically significant distance outside these dense zones are flagged as anomalies. Most of these methods rely on distances between points, so it is important to normalize the units and scale of the dataset to get accurate results. For example, in KNN, data points can be weighted by 1/k, where k is the distance to the nearest neighbor, so points that lie close together carry more weight and have more influence on what counts as normal than distant points do; the algorithm marks points with a low 1/k value as outliers. This is suitable for normalized, unlabeled data, when there is no need or ability to use algorithms with more complex calculations.
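As a rough illustration of the density-based idea (the dataset and the n_neighbors and contamination values below are illustrative assumptions, not part of the article), scikit-learn's LocalOutlierFactor flags points whose local density is much lower than that of their neighbors:

```python
# A minimal sketch of density-based outlier detection; data and parameters are illustrative.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # dense cluster of "normal" points
outliers = rng.uniform(low=-6, high=6, size=(10, 2))     # scattered points outside it
X = np.vstack([normal, outliers])

# LOF compares the local density of each point to that of its k nearest
# neighbors; points in much sparser regions are labeled -1 (outlier).
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)               # 1 = inlier, -1 = outlier
print("flagged as anomalies:", np.where(labels == -1)[0])
```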
One-class support vector machine (SVM): SVMs are supervised learning algorithms that build a robust prediction model and are often used for classification. Given a training set of examples, each labeled as belonging to one of two categories, the system maps the examples to points in space so that the two categories are separated as widely as possible and builds criteria for assigning new examples to each category. A point that falls outside both categories is flagged as an outlier. In the absence of labeled data, an unsupervised variant can look for clustering among the examples to define the categories. This is suitable for working with two categories of data, when you need to find which data points lie outside each of them.
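A minimal sketch of the one-class variant, assuming scikit-learn's OneClassSVM and illustrative data and nu/gamma settings: the model is fit on "normal" examples only and then flags new points that fall outside the learned boundary.

```python
# One-class SVM sketch; training data and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(300, 2))  # examples of normal behavior
X_new = np.array([[0.1, -0.2],    # looks like the training distribution
                  [4.5, 5.0]])    # far outside it

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(X_train)
print(model.predict(X_new))       # 1 = inlier, -1 = outlier
```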
K-means clustering: an algorithm that combines the KNN-style reliance on the proximity of each data point to nearby points with the SVM-style focus on sorting points into categories. Each data point is assigned to a category based on its characteristics, and each category has a center point that serves as the prototype for the other points in the cluster. Every point is compared against these prototypes, and its distance to the nearest prototype acts as a measure of how much it differs from it; points with small distances lie close to a prototype and form a cluster. K-means clustering can detect anomalies by marking points that do not fit any of the established categories. This is suitable for scenarios with unlabeled data of many different kinds that needs to be organized around the learned prototypes.
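A minimal sketch under assumed settings (the cluster count and the percentile threshold below are illustrative, not prescribed by the article): fit K-means, measure each point's distance to its nearest prototype, and flag points whose distance is unusually large.

```python
# K-Means anomaly detection sketch; data, n_clusters, and threshold are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),   # cluster around (0, 0)
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),   # cluster around (5, 5)
    [[2.5, -4.0]],                                      # a point far from both prototypes
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
# transform() returns each point's distance to every cluster center;
# the minimum is the distance to the point's own prototype.
dist_to_prototype = kmeans.transform(X).min(axis=1)
threshold = np.percentile(dist_to_prototype, 99)        # illustrative cutoff
print("anomalies:", np.where(dist_to_prototype > threshold)[0])
```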
There are also more sophisticated algorithms for unsupervised anomaly detection and for multidimensional datasets. For example, Gaussian mixture models act as an alternative to K-Means, using Gaussian distributions instead of the standard deviation, while Bayesian methods use Bayesian probability to detect anomalies. Autoencoders can also be used: these are neural networks that learn encoded rules for the expected output given an input, so anything that falls outside these learned patterns is considered an anomaly. They are well suited to high-dimensional detection tasks.
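As one illustration of the Gaussian approach (the component count and likelihood cutoff below are assumptions for the sketch), scikit-learn's GaussianMixture can score how likely each point is under the fitted distribution and flag the least likely ones:

```python
# Gaussian mixture anomaly detection sketch; data and cutoff are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(300, 3)),   # normal data
    rng.uniform(low=-8, high=8, size=(5, 3)),        # injected anomalies
])

gmm = GaussianMixture(n_components=2, random_state=7).fit(X)
log_likelihood = gmm.score_samples(X)                # log-likelihood of each point
cutoff = np.percentile(log_likelihood, 2)            # lowest 2% are treated as anomalies
print("anomalies:", np.where(log_likelihood < cutoff)[0])
```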