Anomaly Detection#

An anomaly is something that differs from a norm: a deviation, an exception. In software engineering, an anomaly is a rare occurrence or event that does not fit the usual pattern and therefore seems suspicious.

Types#

  • Global outliers: When a data point takes on a value far outside the value range of every other data point in the dataset, it can be considered a global anomaly. In other words, it is a rare event.

  • Contextual outliers: An outlier is called contextual when its value does not match what we expect to observe for a similar data point in the same context. Contexts are usually temporal, so the same observation made at a different time may not be an outlier.

  • Collective outliers: Collective outliers are a subset of data points that, taken together, deviate from the normal behavior, even if each individual point looks unremarkable on its own.

Methods of Detecting Anomalies#

There are three categories of outlier detection, namely, supervised, semi-supervised, and unsupervised:

  • Supervised: Requires fully labeled training and testing datasets. An ordinary classifier is trained first and applied afterward. Here, the quality of the training dataset is very important, and a lot of manual work is involved, since somebody needs to collect and label the examples. Because of this, unsupervised methods are the ones mostly used in anomaly detection.

  • Semi-supervised: Uses training and test datasets, where the training data consists only of normal data without any outliers. A model of the normal class is learned, and outliers can be detected afterward by their deviation from that model.

  • Unsupervised: Does not require any labels; there is no distinction between a training and a test dataset. Data is scored solely based on intrinsic properties of the dataset.

There are three fundamental approaches to detecting anomalies:

  • By Density: Normal points occur in dense regions, while anomalies occur in sparse regions.

  • By Distance: A normal point is close to its neighbors, while an anomaly is far from its neighbors.

  • By Isolation: Isolation means separating an instance from the rest of the instances. Since anomalies are 'few and different', they are more susceptible to isolation.

Isolation Forest#

Isolation Forest is an unsupervised anomaly detection algorithm that uses an ensemble of randomized decision trees (similar to a random forest) under the hood to detect outliers in the dataset. The algorithm recursively splits the data until each observation is isolated from the others. Since anomalies usually lie away from the cluster of regular data points, they are easier to isolate, requiring fewer splits on average, than the regular data points.

../../_images/image181.PNG

Partitioning of an anomaly vs. a regular data point (Source)
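
A minimal sketch of how this could look with scikit-learn's IsolationForest (the synthetic data and parameter values here are illustrative assumptions, not from the original text):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Regular points clustered near the origin, plus a few scattered outliers
X_regular = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([X_regular, X_outliers])

# contamination is the assumed fraction of outliers in the data
clf = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = clf.fit_predict(X)        # 1 = inlier, -1 = outlier
scores = clf.decision_function(X)  # lower score = more anomalous
print("Detected outliers:", np.sum(labels == -1))
```

Anomalies tend to be isolated after fewer random splits, so they sit closer to the root of each tree, which is what the score reflects.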

Local Outlier Factor#

The Local Outlier Factor takes the density of data points into consideration to decide whether a point is an anomaly. It computes an anomaly score, the local outlier factor, that measures how isolated a point is with respect to its surrounding neighborhood, taking both the local and the global density into account.

../../_images/image191.PNG

Local Outlier Factor formulation (Source)
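
As a rough sketch (the synthetic data and parameters are assumptions), scikit-learn's LocalOutlierFactor can be applied like this:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([
    0.3 * rng.randn(100, 2),           # dense cluster of regular points
    rng.uniform(-4, 4, size=(10, 2)),  # sparse, scattered outliers
])

# n_neighbors controls the size of the local neighborhood being compared
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X)                 # 1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_  # larger = more anomalous
print("Detected outliers:", np.sum(labels == -1))
```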

Robust Covariance#

For Gaussian, independent features, simple statistical techniques can be employed to detect anomalies in the dataset. For a Gaussian/normal distribution, data points lying more than three standard deviations from the mean can be considered anomalies.
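
A minimal sketch of this three-standard-deviation rule on made-up one-dimensional data (all values below are illustrative):

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.normal(loc=10.0, scale=2.0, size=1000)
x[::200] += 15  # inject a few obvious anomalies

# Flag points lying more than three standard deviations from the mean
z = (x - x.mean()) / x.std()
print("Anomalous values:", x[np.abs(z) > 3])
```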

For a dataset in which all features are Gaussian in nature, this statistical approach can be generalized by fitting an elliptic envelope (an ellipsoid) that covers most of the regular data points; the data points that lie outside the envelope can be considered anomalies.
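
scikit-learn offers this idea as EllipticEnvelope; a short sketch on synthetic correlated Gaussian data (the data and parameters are assumptions):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
# Correlated Gaussian inliers plus a few far-away outliers
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=200),
    rng.uniform(-8, 8, size=(10, 2)),
])

# contamination: the assumed fraction of outliers in the data
ee = EllipticEnvelope(contamination=0.05, random_state=0)
labels = ee.fit_predict(X)  # 1 = inlier, -1 = outlier
print("Detected outliers:", np.sum(labels == -1))
```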

One Class SVM#

A regular SVM algorithm tries to find a hyperplane that best separates two classes of data points. In a one-class SVM we have only one class of data points, and the task is to learn a boundary (a hypersphere in feature space) that separates the cluster of regular data points from the anomalies.
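
A hedged sketch using scikit-learn's OneClassSVM, trained on normal data only (the data and the nu/gamma values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)  # only "normal" points at training time
X_test = np.vstack([
    0.3 * rng.randn(20, 2),           # more normal points
    rng.uniform(-4, 4, size=(5, 2)),  # a few anomalies
])

# nu upper-bounds the fraction of training errors (and support vectors)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_train)
print(ocsvm.predict(X_test))  # 1 = inlier, -1 = outlier
```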

One Class SVM (SGD)#

One-class SVM with SGD solves the linear One-Class SVM using Stochastic Gradient Descent. The implementation is meant to be used with a kernel approximation technique to obtain results similar to sklearn.svm.OneClassSVM which uses a Gaussian kernel by default.
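
A sketch of this pairing (assuming scikit-learn 1.0 or newer, which provides SGDOneClassSVM; data and parameters are illustrative): the Nystroem transformer approximates the RBF kernel so the linear SGD solver behaves like a kernelized one-class SVM while scaling to large datasets.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(500, 2)

# Approximate the RBF (Gaussian) kernel, then fit the linear one-class SVM
clf = make_pipeline(
    Nystroem(gamma=1.0, n_components=100, random_state=0),
    SGDOneClassSVM(nu=0.05, random_state=0),
)
clf.fit(X_train)
labels = clf.predict(X_train)  # 1 = inlier, -1 = outlier
print("Flagged as outliers:", np.sum(labels == -1))
```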

Cluster-based Local Outlier Factor (CBLOF)#

CBLOF calculates the outlier score based on a cluster-based local outlier factor. The anomaly score is computed as the distance of each instance to its cluster center, multiplied by the number of instances belonging to that cluster.
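
A tentative sketch using the third-party pyod library (pip install pyod), which provides a CBLOF implementation; the data and parameters below are assumptions:

```python
import numpy as np
from pyod.models.cblof import CBLOF

rng = np.random.RandomState(0)
X = np.vstack([
    0.3 * rng.randn(200, 2),           # regular points
    rng.uniform(-4, 4, size=(10, 2)),  # anomalies
])

# n_clusters: number of clusters used to model the "normal" regions
clf = CBLOF(n_clusters=8, contamination=0.05, random_state=0)
clf.fit(X)
labels = clf.labels_           # 0 = inlier, 1 = outlier
scores = clf.decision_scores_  # larger = more anomalous
print("Detected outliers:", np.sum(labels == 1))
```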

Histogram-based Outlier Detection (HBOS)#

HBOS assumes feature independence and calculates the degree of anomalousness by building histograms. In multivariate anomaly detection, a histogram can be computed for each individual feature, scored individually, and the scores combined at the end.
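
A brief sketch, again using the third-party pyod library's HBOS implementation (the data and parameters are illustrative assumptions):

```python
import numpy as np
from pyod.models.hbos import HBOS

rng = np.random.RandomState(0)
X = np.vstack([
    0.3 * rng.randn(200, 2),           # regular points
    rng.uniform(-4, 4, size=(10, 2)),  # anomalies
])

# n_bins: number of histogram bins built independently for each feature
clf = HBOS(n_bins=10, contamination=0.05)
clf.fit(X)
labels = clf.labels_           # 0 = inlier, 1 = outlier
scores = clf.decision_scores_  # combined per-feature histogram scores
print("Detected outliers:", np.sum(labels == 1))
```

Because each feature is scored from its own histogram, HBOS is fast, but it can miss anomalies that are only unusual in the interaction between features.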

KNN#

It is one of the simplest methods in anomaly detection: for a data point, its distance to its k-th nearest neighbor can be viewed as its outlier score.
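
A minimal sketch of this idea with scikit-learn's NearestNeighbors (the value of k, the data, and the 95% threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = np.vstack([
    0.3 * rng.randn(100, 2),          # regular points
    rng.uniform(-4, 4, size=(5, 2)),  # anomalies
])

k = 5
# k + 1 neighbors, because each point's nearest neighbor is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
scores = distances[:, -1]  # distance to the k-th nearest neighbor

# Flag, say, the 5% of points with the highest scores as outliers
threshold = np.quantile(scores, 0.95)
print("Outlier indices:", np.where(scores > threshold)[0])
```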

For univariate analysis, Isolation Forest can be used to detect outliers; it returns an anomaly score for each sample. Isolation Forest is a tree-based model in which partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum values of that feature. In multivariate anomaly detection, an outlier is a combined unusual score on at least two variables, and multiple options are available besides Isolation Forest, such as the methods described above (Local Outlier Factor, Robust Covariance, One-Class SVM, CBLOF, HBOS, and KNN).

Questions#