algo package

Submodules

algo.algorithm_utils module

class algo.algorithm_utils.PyTorchUtils(seed, gpu)

Bases: object

Abstract class for PyTorch-based deep learning detection algorithms.

property device
to_device(model)
to_var(t, **kwargs)
class algo.algorithm_utils.TensorflowUtils(seed, gpu)

Bases: object

Abstract class for TensorFlow-based deep learning detection algorithms.

property device
class algo.algorithm_utils.deepBase(module_name, name, seed, details=False)

Bases: object

Abstract class for deep learning based detection algorithms.

abstract fit(X)

Train the algorithm on the given dataset.

abstract predict(X)

Returns

anomaly score

algo.autoencoder module

class algo.autoencoder.AUTOENCODER(name: str = 'AutoEncoder', num_epochs: int = 10, batch_size: int = 20, lr: float = 0.001, hidden_size: int = 5, sequence_length: int = 30, train_gaussian_percentage: float = 0.25, seed: int = None, gpu: int = None, details=True, contamination=0.05)

Bases: pyodds.algo.base.Base, pyodds.algo.algorithm_utils.deepBase, pyodds.algo.algorithm_utils.PyTorchUtils

Auto Encoder (AE) is a type of neural network for learning useful data representations in an unsupervised manner. It can be used to detect outlying objects in the data by calculating reconstruction errors.

Parameters
  • name (str, optional (default='AutoEncoder')) – The name of the algorithm

  • num_epochs (int, optional (default=10)) – The number of epochs

  • batch_size (int, optional (default=20)) – The batch size

  • lr (float, optional (default=1e-3)) – The learning rate

  • hidden_size (int, optional (default=5)) – The size of the hidden layer

  • sequence_length (int, optional (default=30)) – The length of sequence

  • train_gaussian_percentage (float, optional (default=0.25)) – The fraction of training data held out to fit the Gaussian distribution of reconstruction errors

  • seed (int, optional (default=None)) – The random seed

  • contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

decision_function(X: pandas.core.frame.DataFrame) → numpy.array

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X: pandas.core.frame.DataFrame)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
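
Examples

A minimal usage sketch of the documented fit/predict interface; the pyodds.algo.autoencoder import path and the synthetic data are illustrative assumptions, not part of this reference.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.autoencoder import AUTOENCODER
>>> X = pd.DataFrame(np.random.randn(300, 4))  # synthetic data (assumption)
>>> clf = AUTOENCODER(num_epochs=2, sequence_length=30, contamination=0.05)
>>> clf.fit(X)
>>> scores = clf.decision_function(X)  # higher score = more anomalous
>>> labels = clf.predict(X)            # -1 for outliers, 1 for inliers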

class algo.autoencoder.AutoEncoderModule(n_features: int, sequence_length: int, hidden_size: int, seed: int, gpu: int)

Bases: torch.nn.modules.module.Module, pyodds.algo.algorithm_utils.PyTorchUtils

forward(ts_batch, return_latent: bool = False)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

algo.base module

class algo.base.Base

Bases: object

Abstract class for all outlier detection algorithms.

decision_function(X)

Predict raw anomaly scores of X using the fitted detector. The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.

Parameters

X – The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
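
Examples

A minimal sketch of the contract a concrete detector fulfills. The MeanDistanceDetector below is a hypothetical illustration written against this interface; it is not a class shipped with the package.

>>> import numpy as np
>>> from pyodds.algo.base import Base
>>> class MeanDistanceDetector(Base):
...     """Score each sample by its Euclidean distance to the training mean."""
...     def fit(self, X):
...         self.mean_ = np.asarray(X).mean(axis=0)
...         return self
...     def decision_function(self, X):
...         return np.linalg.norm(np.asarray(X) - self.mean_, axis=1)
...     def predict(self, X):
...         scores = self.decision_function(X)
...         threshold = np.percentile(scores, 90)  # i.e. contamination=0.1
...         return np.where(scores > threshold, -1, 1)
>>> X = np.random.randn(100, 3)
>>> labels = MeanDistanceDetector().fit(X).predict(X)  # -1 outlier, 1 inlier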

algo.cblof module

class algo.cblof.CBLOF(n_clusters=8, contamination=0.1, clustering_estimator=None, alpha=0.9, beta=5, use_weights=False, random_state=None, n_jobs=1)

Bases: pyodds.algo.base.Base

The CBLOF operator calculates the outlier score based on the cluster-based local outlier factor. CBLOF takes as input the data set and the cluster model that was generated by a clustering algorithm. It classifies the clusters into small clusters and large clusters using the parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster the point belongs to as well as the distance to the nearest large cluster. use_weights enables weighting of the outlier factor by cluster size, as proposed in the original publication; since this might lead to unexpected behavior (outliers close to small clusters are not found), it is disabled by default, and outlier scores are computed solely from the distance to the closest large cluster center.

Parameters
  • n_clusters (int, optional (default=8)) – The number of clusters to form as well as the number of centroids to generate.

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

  • clustering_estimator (Estimator, optional (default=None)) – The base clustering algorithm for performing data clustering. A valid clustering algorithm should be passed in. The estimator should have standard sklearn APIs, fit() and predict(). The estimator should have attributes labels_ and cluster_centers_. If cluster_centers_ is not in the attributes once the model is fit, it is calculated as the mean of the samples in a cluster. If not set, CBLOF uses KMeans for scalability. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  • alpha (float in (0.5, 1), optional (default=0.9)) – Coefficient for deciding small and large clusters. The ratio of the number of samples in large clusters to the number of samples in small clusters.

  • beta (int or float in (1,), optional (default=5)) – Coefficient for deciding small and large clusters. For a list of clusters sorted by size |C1|, |C2|, …, |Cn|, beta = |Ck|/|Ck-1|.

  • use_weights (bool, optional (default=False)) – If set to True, the sizes of clusters are used as weights in the outlier score calculation.

  • check_estimator (bool, optional (default=False)) – If set to True, check whether the base estimator is consistent with the sklearn standard.

decision_function(X)

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
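
Examples

A minimal usage sketch; sklearn's KMeans satisfies the documented clustering_estimator requirements (fit()/predict() plus labels_ and cluster_centers_). The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.cluster import KMeans
>>> from pyodds.algo.cblof import CBLOF
>>> X = pd.DataFrame(np.random.randn(500, 2))  # synthetic data (assumption)
>>> clf = CBLOF(n_clusters=8, alpha=0.9, beta=5,
...             clustering_estimator=KMeans(n_clusters=8))
>>> clf.fit(X)
>>> labels = clf.predict(X)  # -1 for outliers, 1 for inliers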

algo.cblof.pairwise_distances_no_broadcast(X, Y)

Utility function to calculate the row-wise Euclidean distance between two matrices. Unlike pairwise calculation, this function does not broadcast: for instance, if X and Y are both (4, 3) matrices, the function returns a distance vector of shape (4,) instead of (4, 4).

Parameters
  • X (array of shape (n_samples, n_features)) – First input samples

  • Y (array of shape (n_samples, n_features)) – Second input samples

Returns

distance – Row-wise euclidean distance of X and Y

Return type

array of shape (n_samples,)
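
Examples

A small sketch of the row-wise (non-broadcast) computation described above, written directly in NumPy to show the expected result:

>>> import numpy as np
>>> X = np.array([[0., 0., 0.], [1., 1., 1.]])
>>> Y = np.array([[0., 0., 1.], [1., 1., 1.]])
>>> np.sqrt(np.square(X - Y).sum(axis=1))  # one distance per row, shape (2,)
array([1., 0.])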

algo.dagmm module

Adapted from Daniel Stanley Tan (https://github.com/danieltan07/dagmm)

class algo.dagmm.DAGMM(num_epochs=10, lambda_energy=0.1, lambda_cov_diag=0.005, lr=0.001, batch_size=50, gmm_k=3, normal_percentile=80, sequence_length=30, autoencoder_type=<class 'pyodds.algo.autoencoder.AutoEncoderModule'>, autoencoder_args=None, hidden_size: int = 5, seed: int = None, gpu: int = None, details=True, contamination=0.05)

Bases: pyodds.algo.base.Base, pyodds.algo.algorithm_utils.deepBase, pyodds.algo.algorithm_utils.PyTorchUtils

Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection, Zong et al, 2018. Unsupervised anomaly detection on multi- or high-dimensional data is of great importance in both fundamental machine learning research and industrial applications, for which density estimation lies at the core. Although previous approaches based on dimensionality reduction followed by density estimation have made fruitful progress, they mainly suffer from decoupled model learning with inconsistent optimization goals and incapability of preserving essential information in the low-dimensional space. In this paper, we present a Deep Autoencoding Gaussian Mixture Model (DAGMM) for unsupervised anomaly detection. Our model utilizes a deep autoencoder to generate a low-dimensional representation and reconstruction error for each input data point, which is further fed into a Gaussian Mixture Model (GMM). Instead of using decoupled two-stage training and the standard Expectation-Maximization (EM) algorithm, DAGMM jointly optimizes the parameters of the deep autoencoder and the mixture model simultaneously in an end-to-end fashion, leveraging a separate estimation network to facilitate the parameter learning of the mixture model. The joint optimization, which well balances autoencoding reconstruction, density estimation of latent representation, and regularization, helps the autoencoder escape from less attractive local optima and further reduce reconstruction errors, avoiding the need of pre-training.

Parameters
  • num_epochs (int, optional (default=10)) – The number of epochs

  • lambda_energy (float, optional (default=0.1)) – The parameter to balance the energy in loss function

  • lambda_cov_diag (float, optional (default=0.005)) – The parameter to balance the covariance in loss function

  • lr (float, optional (default=1e-3)) – The speed of learning rate

  • batch_size (int, optional (default=50)) – The number of samples in one batch

  • gmm_k (int, optional (default=3)) – The number of clusters in the Gaussian Mixture model

  • sequence_length (int, optional (default=30)) – The length of sequence

  • hidden_size (int, optional (default=5)) – The size of hidden layer

  • seed (int, optional (default=None)) – The random seed

  • contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

class AutoEncoder

Bases: object

LSTM

alias of pyodds.algo.lstmencdec.LSTMEDModule

NN

alias of pyodds.algo.autoencoder.AutoEncoderModule

dagmm_step(input_data)
decision_function(X: pandas.core.frame.DataFrame)

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores. Using the learned mixture probability, mean and covariance for each component k, compute the energy on the given data.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X: pandas.core.frame.DataFrame)

Learn the mixture probability, mean and covariance for each component k. Store the computed energy based on the training data and the aforementioned parameters.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)

reset_grad()
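
Examples

A minimal usage sketch; the nested AutoEncoder class documented above exposes the NN (feed-forward) and LSTM (encoder-decoder) compression networks. The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.dagmm import DAGMM
>>> X = pd.DataFrame(np.random.randn(300, 4))  # synthetic data (assumption)
>>> clf = DAGMM(num_epochs=2, gmm_k=3, autoencoder_type=DAGMM.AutoEncoder.NN)
>>> clf.fit(X)  # jointly learns the autoencoder and GMM parameters
>>> scores = clf.decision_function(X)  # sample energy as anomaly score
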
class algo.dagmm.DAGMMModule(autoencoder, n_gmm, latent_dim, seed: int, gpu: int)

Bases: torch.nn.modules.module.Module, pyodds.algo.algorithm_utils.PyTorchUtils

Residual Block.

compute_energy(z, phi=None, mu=None, cov=None, size_average=True)
compute_gmm_params(z, gamma)
forward(x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

loss_function(x, x_hat, z, gamma, lambda_energy, lambda_cov_diag)
relative_euclidean_distance(a, b, dim=1)

algo.hbos module

class algo.hbos.HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)

Bases: pyodds.algo.base.Base

Histogram-based outlier detection (HBOS) is an efficient unsupervised method. It assumes feature independence and calculates the degree of outlyingness by building histograms. See :cite:`goldstein2012histogram` for details.

Parameters
  • n_bins (int, optional (default=10)) – The number of bins.

  • alpha (float in (0, 1), optional (default=0.1)) – The regularizer for preventing overflow.

  • tol (float in (0, 1), optional (default=0.5)) – The parameter to decide the flexibility while dealing with the samples falling outside the bins.

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

bin_edges_

The edges of the bins.

Type

numpy array of shape (n_bins + 1, n_features )

hist_

The density of each histogram.

Type

numpy array of shape (n_bins, n_features)

decision_scores_

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

Type

numpy array of shape (n_samples,)

threshold_

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

Type

float

labels_

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Type

int, either 0 or 1

decision_function(X)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
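
Examples

A minimal usage sketch; after fitting, the attributes documented above (decision_scores_, threshold_, labels_) describe the training data. The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.hbos import HBOS
>>> X = pd.DataFrame(np.random.randn(400, 3))  # synthetic data (assumption)
>>> clf = HBOS(n_bins=10, alpha=0.1, contamination=0.1)
>>> clf.fit(X)
>>> train_scores = clf.decision_scores_  # scores of the training data
>>> labels = clf.predict(X)              # -1 for outliers, 1 for inliers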

algo.hbos.invert_order(scores, method='multiplication')

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful while combining multiple detectors since their score order could be different.

Parameters
  • scores (list, array or numpy array with shape (n_samples,)) – The list of values to be inverted

  • method (str, optional (default='multiplication')) – Method used for order inversion. Valid methods are ‘multiplication’ (multiply by -1) and ‘subtraction’ (max(scores) - scores).

Returns

inverted_scores – The inverted list

Return type

numpy array of shape (n_samples,)

Examples

>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([ 0.6,  0.4,  0.2,  0. ,  0.5,  0.6])

algo.iforest module

class algo.iforest.IFOREST(n_estimators=100, max_samples='auto', contamination='legacy', max_features=1.0, bootstrap=False, n_jobs=None, behaviour='old', random_state=None, verbose=0, warm_start=False)

Bases: sklearn.ensemble.iforest.IsolationForest, pyodds.algo.base.Base

Isolation Forest algorithm. Returns the anomaly score of each sample using the IsolationForest algorithm. The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.

Parameters
  • n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.

  • max_samples (int or float, optional (default="auto")) –

    The number of samples to draw from X to train each base estimator.
    • If int, then draw max_samples samples.

    • If float, then draw max_samples * X.shape[0] samples.

    • If “auto”, then max_samples=min(256, n_samples).

    If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling).

  • contamination (float in (0., 0.5), optional (default=0.1)) –

    The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function. If ‘auto’, the decision function threshold is determined as in the original paper.

    Changed in version 0.20: The default value of contamination will change from 0.1 in 0.20 to ‘auto’ in 0.22.

  • max_features (int or float, optional (default=1.0)) –

    The number of features to draw from X to train each base estimator.
    • If int, then draw max_features features.

    • If float, then draw max_features * X.shape[1] features.

  • bootstrap (boolean, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.

  • n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

  • behaviour (str, default='old') –

    Behaviour of the decision_function which can be either ‘old’ or ‘new’. Passing behaviour='new' makes the decision_function change to match other anomaly detection algorithm APIs, which will be the default behaviour in the future. As explained in detail in the offset_ attribute documentation, the decision_function becomes dependent on the contamination parameter, in such a way that 0 becomes its natural threshold to detect outliers.

    New in version 0.20: behaviour is added in 0.20 for back-compatibility purposes.

    Deprecated since version 0.20: behaviour='old' is deprecated in 0.20 and will not be possible in 0.22.

    Deprecated since version 0.22: behaviour parameter will be deprecated in 0.22 and removed in 0.24.

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, optional (default=0)) – Controls the verbosity of the tree building process.

  • warm_start (bool, optional (default=False)) – When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest. See the Glossary. New in version 0.21.

estimators_

The collection of fitted sub-estimators.

Type

list of DecisionTreeClassifier

estimators_samples_

The subset of drawn samples (i.e., the in-bag samples) for each base estimator.

Type

list of arrays

max_samples_

The actual number of samples

Type

integer

offset_

Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. Assuming behaviour == ‘new’, offset_ is defined as follows. When the contamination parameter is set to “auto”, the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than “auto” is provided, the offset is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training. Assuming the behaviour parameter is set to ‘old’, we always have offset_ = -0.5, making the decision function independent from the contamination parameter.

Type

float

Notes

The implementation is based on an ensemble of ExtraTreeRegressor. The maximum depth of each tree is set to ceil(log_2(n)) where n is the number of samples used to build the tree (see (Liu et al., 2008) for more details).

References

1. Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on.

2. Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation-based anomaly detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
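
Examples

A minimal usage sketch; IFOREST follows the same fit/predict contract as the other detectors in this package, and its constructor mirrors sklearn's IsolationForest. The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.iforest import IFOREST
>>> X = pd.DataFrame(np.random.randn(500, 3))  # synthetic data (assumption)
>>> clf = IFOREST(n_estimators=100, contamination=0.1,
...               behaviour='new', random_state=0)
>>> clf.fit(X)
>>> labels = clf.predict(X)  # -1 for outliers, 1 for inliers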

algo.knn module

class algo.knn.KNN(contamination=0.1, n_neighbors=5, method='largest', radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, **kwargs)

Bases: pyodds.algo.base.Base

kNN class for outlier detection. For an observation, its distance to its kth nearest neighbor can be viewed as its outlier score, which in turn can be seen as a measure of local density. See :cite:`ramaswamy2000efficient,angiulli2002fast` for details.

Three kNN detectors are supported: ‘largest’ uses the distance to the kth neighbor as the outlier score; ‘mean’ uses the average distance to all k neighbors; ‘median’ uses the median distance to the k neighbors.

Parameters
  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

  • n_neighbors (int, optional (default = 5)) – Number of neighbors to use by default for k neighbors queries.

  • method (str, optional (default='largest')) –

    {‘largest’, ‘mean’, ‘median’}

    • ’largest’: use the distance to the kth neighbor as the outlier score

    • ’mean’: use the average distance to all k neighbors as the outlier score

    • ’median’: use the median of the distance to k neighbors as the outlier score

  • radius (float, optional (default = 1.0)) – Range of parameter space to use by default for radius_neighbors queries.

  • algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional) –

    Algorithm used to compute the nearest neighbors:

    • ’ball_tree’ will use BallTree

    • ’kd_tree’ will use KDTree

    • ’brute’ will use a brute-force search.

    • ’auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.

    Note: fitting on sparse input will override the setting of this parameter, using brute force.

  • leaf_size (int, optional (default = 30)) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

  • metric (string or callable, default 'minkowski') –

    metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

    If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

    Distance matrices are not supported.

    Valid values for metric are:

    • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

    • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

    See the documentation for scipy.spatial.distance for details on these metrics.

  • p (integer, optional (default = 2)) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances

  • metric_params (dict, optional (default = None)) – Additional keyword arguments for the metric function.

  • n_jobs (int, optional (default = 1)) – The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods.

decision_scores_

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

Type

numpy array of shape (n_samples,)

threshold_

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

Type

float

labels_

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Type

int, either 0 or 1

decision_function(X)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
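
Examples

A minimal sketch comparing the three documented scoring methods; the import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.knn import KNN
>>> X = pd.DataFrame(np.random.randn(300, 3))  # synthetic data (assumption)
>>> for method in ('largest', 'mean', 'median'):
...     clf = KNN(n_neighbors=5, method=method, contamination=0.1)
...     clf.fit(X)
...     scores = clf.decision_function(X)  # scores under each aggregation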

algo.lof module

class algo.lof.LOF(n_neighbors=20, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, contamination='legacy', novelty=False, n_jobs=None)

Bases: sklearn.neighbors.lof.LocalOutlierFactor, pyodds.algo.base.Base

Unsupervised Outlier Detection using Local Outlier Factor (LOF) The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.

Parameters
  • n_neighbors (int, optional (default=20)) – Number of neighbors to use by default for kneighbors() queries. If n_neighbors is larger than the number of samples provided, all samples will be used.

  • algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional) –

    Algorithm used to compute the nearest neighbors:

    • ’ball_tree’ will use BallTree

    • ’kd_tree’ will use KDTree

    • ’brute’ will use a brute-force search.

    • ’auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.

    Note: fitting on sparse input will override the setting of this parameter, using brute force.

  • leaf_size (int, optional (default=30)) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

  • metric (string or callable, default 'minkowski') –

    metric used for the distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used. If ‘precomputed’, the training input X is expected to be a distance matrix. If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string. Valid values for metric are:

    • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

    • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

    See the documentation for scipy.spatial.distance for details on these metrics: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

  • p (integer, optional (default=2)) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances(). When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

  • metric_params (dict, optional (default=None)) – Additional keyword arguments for the metric function.

  • contamination (float in (0., 0.5), optional (default=0.1)) –

    The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function. If “auto”, the decision function threshold is determined as in the original paper.

    Changed in version 0.20: The default value of contamination will change from 0.1 in 0.20 to ‘auto’ in 0.22.

  • novelty (boolean, default False) – By default, LocalOutlierFactor is only meant to be used for outlier detection (novelty=False). Set novelty to True if you want to use LocalOutlierFactor for novelty detection. In this case be aware that you should only use predict, decision_function and score_samples on new unseen data and not on the training set.

  • n_jobs (int or None, optional (default=None)) – The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. Affects only kneighbors() and kneighbors_graph() methods.

negative_outlier_factor_

The opposite LOF of the training samples. The higher, the more normal. Inliers tend to have a LOF score close to 1 (negative_outlier_factor_ close to -1), while outliers tend to have a larger LOF score. The local outlier factor (LOF) of a sample captures its supposed ‘degree of abnormality’. It is the average of the ratio of the local reachability density of a sample and those of its k-nearest neighbors.

Type

numpy array, shape (n_samples,)

n_neighbors_

The actual number of neighbors used for kneighbors() queries.

Type

integer

offset_

Offset used to obtain binary labels from the raw scores. Observations having a negative_outlier_factor smaller than offset_ are detected as abnormal. The offset is set to -1.5 (inliers score around -1), except when a contamination parameter different than “auto” is provided. In that case, the offset is defined in such a way we obtain the expected number of outliers in training.

Type

float

References

1. Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000, May). LOF: identifying density-based local outliers. In ACM SIGMOD Record.
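
Examples

A minimal usage sketch in the default outlier-detection mode (novelty=False), using the fit_predict and negative_outlier_factor_ members inherited from sklearn's LocalOutlierFactor. The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.lof import LOF
>>> X = pd.DataFrame(np.random.randn(300, 3))  # synthetic data (assumption)
>>> clf = LOF(n_neighbors=20, contamination=0.1)
>>> labels = clf.fit_predict(X)  # -1 for outliers, 1 for inliers
>>> scores = clf.negative_outlier_factor_  # the lower, the more abnormal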

algo.lstmad module

class algo.lstmad.LSTMAD(len_in=1, len_out=10, num_epochs=10, lr=0.001, batch_size=1, seed: int = None, gpu: int = None, details=True, contamination=0.05)

Bases: pyodds.algo.base.Base, pyodds.algo.algorithm_utils.deepBase, pyodds.algo.algorithm_utils.PyTorchUtils

Malhotra, Pankaj, et al. “Long short term memory networks for anomaly detection in time series.” Proceedings. Presses universitaires de Louvain, 2015.

Long Short Term Memory (LSTM) networks have been demonstrated to be particularly useful for learning sequences containing longer term patterns of unknown length, due to their ability to maintain long term memory. Stacking recurrent hidden layers in such networks also enables the learning of higher level temporal features, for faster learning with sparser representations. In this paper, we use stacked LSTM networks for anomaly/fault detection in time series. A network is trained on non-anomalous data and used as a predictor over a number of time steps. The resulting prediction errors are modeled as a multivariate Gaussian distribution, which is used to assess the likelihood of anomalous behavior.

Parameters
  • len_in (int, optional (default=1)) – The length of the input layer

  • len_out (int, optional (default=10)) – The length of the output layer

  • num_epochs (int, optional (default=10)) – The number of epochs

  • lr (float, optional (default=1e-3)) – The learning rate

  • seed (int, optional (default=None)) – The random seed

  • contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

decision_function(X)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
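
Examples

A minimal usage sketch on a synthetic series; per the parameters above, len_in and len_out set the input and prediction lengths. The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.lstmad import LSTMAD
>>> X = pd.DataFrame(np.sin(np.linspace(0, 20, 400)).reshape(-1, 1))  # synthetic series (assumption)
>>> clf = LSTMAD(len_in=1, len_out=10, num_epochs=2, contamination=0.05)
>>> clf.fit(X)  # train on (assumed) non-anomalous data
>>> scores = clf.decision_function(X)  # likelihood-based anomaly scores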

class algo.lstmad.LSTMSequence(d, batch_size: int, len_in=1, len_out=10)

Bases: torch.nn.modules.module.Module

forward(input_x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

algo.lstmencdec module

class algo.lstmencdec.LSTMED(name: str = 'LSTM-ED', num_epochs: int = 10, batch_size: int = 20, lr: float = 0.001, hidden_size: int = 5, sequence_length: int = 30, train_gaussian_percentage: float = 0.25, n_layers: tuple = (1, 1), use_bias: tuple = (True, True), dropout: tuple = (0, 0), seed: int = None, gpu: int = None, details=True, contamination=0.05)

Bases: pyodds.algo.base.Base, pyodds.algo.algorithm_utils.deepBase, pyodds.algo.algorithm_utils.PyTorchUtils

Malhotra, Pankaj, et al. “LSTM-based encoder-decoder for multi-sensor anomaly detection.” ICML, 2016.

Mechanical devices such as engines, vehicles, aircrafts, etc., are typically instrumented with numerous sensors to capture the behavior and health of the machine. However, there are often external factors or variables which are not captured by sensors leading to time-series which are inherently unpredictable. For instance, manual controls and/or unmonitored environmental conditions or load may lead to inherently unpredictable time-series. Detecting anomalies in such scenarios becomes challenging using standard approaches based on mathematical models that rely on stationarity, or prediction models that utilize prediction errors to detect anomalies. We propose a Long Short Term Memory Networks based Encoder-Decoder scheme for Anomaly Detection (EncDec-AD) that learns to reconstruct ‘normal’ time-series behavior, and thereafter uses reconstruction error to detect anomalies. We experiment with three publicly available quasi predictable time-series datasets: power demand, space shuttle, and ECG, and two real-world engine datasets with both predictive and unpredictable behavior.

Parameters
  • name (str, optional (default='LSTM-ED')) – The name of the algorithm

  • num_epochs (int, optional (default=10)) – The number of epochs

  • batch_size (int, optional (default=20)) – The batch size

  • lr (float, optional (default=1e-3)) – The learning rate

  • hidden_size (int, optional (default=5)) – The size of the hidden layer

  • sequence_length (int, optional (default=30)) – The length of sequence

  • train_gaussian_percentage (float, optional (default=0.25)) – The fraction of training data held out to fit the Gaussian distribution of reconstruction errors

  • n_layers (tuple, optional (default=(1, 1))) – The number of hidden layers

  • use_bias (tuple, optional (default=(True, True))) – Whether to use bias in the hidden layers

  • dropout (tuple, optional (default=(0, 0))) – Dropout rates in the hidden layers

  • seed (int, optional (default=None)) – The random seed

  • contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

decision_function(X: pandas.core.frame.DataFrame)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X: pandas.core.frame.DataFrame)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
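
Examples

A minimal usage sketch; the tuple-valued n_layers, use_bias and dropout presumably configure the encoder and decoder halves (an assumption based on the paired defaults). The import path and data are assumptions as well.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.lstmencdec import LSTMED
>>> X = pd.DataFrame(np.random.randn(300, 4))  # synthetic data (assumption)
>>> clf = LSTMED(num_epochs=2, sequence_length=30,
...              n_layers=(1, 1), dropout=(0, 0), contamination=0.05)
>>> clf.fit(X)
>>> labels = clf.predict(X)  # -1 for outliers, 1 for inliers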

class algo.lstmencdec.LSTMEDModule(n_features: int, hidden_size: int, n_layers: tuple, use_bias: tuple, dropout: tuple, seed: int, gpu: int)

Bases: torch.nn.modules.module.Module, pyodds.algo.algorithm_utils.PyTorchUtils

forward(ts_batch, return_latent: bool = False)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

algo.luminolFunc module

class algo.luminolFunc.luminolDet(contamination=0.1)

Bases: pyodds.algo.base.Base

Luminol is a lightweight Python library for time series data analysis. The two major functionalities it supports are anomaly detection and correlation. It can be used to investigate possible causes of anomalies.

Parameters

contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

decision_function(X)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
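
Examples

A minimal usage sketch; the import path and the synthetic series are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.luminolFunc import luminolDet
>>> X = pd.DataFrame(np.random.randn(200, 1))  # synthetic series (assumption)
>>> clf = luminolDet(contamination=0.1)
>>> clf.fit(X)
>>> scores = clf.decision_function(X)  # luminol anomaly scores
>>> labels = clf.predict(X)            # -1 for outliers, 1 for inliers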

algo.ocsvm module

class algo.ocsvm.OCSVM(kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, random_state=None)

Bases: sklearn.svm.classes.OneClassSVM, pyodds.algo.base.Base

Unsupervised Outlier Detection. Estimate the support of a high-dimensional distribution. The implementation is based on libsvm. Read more in the scikit-learn User Guide.

Parameters
  • kernel (string, optional (default='rbf')) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.

  • degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.

  • gamma (float, optional (default='auto')) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features, if gamma='scale' is passed then it uses 1 / (n_features * X.var()) as value of gamma. The current default of gamma, ‘auto’, will change to ‘scale’ in version 0.22. ‘auto_deprecated’, a deprecated version of ‘auto’ is used as a default indicating that no explicit value of gamma was passed.

  • coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.

  • tol (float, optional) – Tolerance for stopping criterion.

  • nu (float, optional) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.

  • shrinking (boolean, optional) – Whether to use the shrinking heuristic.

  • cache_size (float, optional) – Specify the size of the kernel cache (in MB).

  • verbose (bool, default: False) – Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.

  • max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.

  • random_state (int, RandomState instance or None, optional (default=None)) –

    Ignored.

    Deprecated since version 0.20: random_state has been deprecated in 0.20 and will be removed in 0.22.

support_

Indices of support vectors.

Type

array-like, shape = [n_SV]

support_vectors_

Support vectors.

Type

array-like, shape = [nSV, n_features]

dual_coef_

Coefficients of the support vectors in the decision function.

Type

array, shape = [1, n_SV]

coef_

Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel. coef_ is a readonly property derived from dual_coef_ and support_vectors_.

Type

array, shape = [1, n_features]

intercept_

Constant in the decision function.

Type

array, shape = [1,]

offset_

Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. The offset is the opposite of intercept_ and is provided for consistency with other outlier detection algorithms.

Type

float

Examples

>>> from sklearn.svm import OneClassSVM
>>> X = [[0], [0.44], [0.45], [0.46], [1]]
>>> clf = OneClassSVM(gamma='auto').fit(X)
>>> clf.predict(X)
array([-1,  1,  1,  1, -1])
>>> clf.score_samples(X)  
array([1.7798..., 2.0547..., 2.0556..., 2.0561..., 1.7332...])

algo.pca module

class algo.pca.PCA(n_components=None, n_selected_components=None, contamination=0.1, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, weighted=True, standardization=True)

Bases: pyodds.algo.base.Base

Principal component analysis (PCA) can be used in detecting outliers. PCA is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

In this procedure, covariance matrix of the data can be decomposed to orthogonal vectors, called eigenvectors, associated with eigenvalues. The eigenvectors with high eigenvalues capture most of the variance in the data.

Therefore, a low dimensional hyperplane constructed by k eigenvectors can capture most of the variance in the data. However, outliers are different from normal data points, which is more obvious on the hyperplane constructed by the eigenvectors with small eigenvalues.

Therefore, outlier scores can be obtained as the sum of the projected distance of a sample on all eigenvectors. See :cite:`shyu2003novel,aggarwal2015outlier` for details.

Score(X) = sum of the weighted Euclidean distances from each sample to the hyperplane constructed by the selected eigenvectors.

Parameters
  • n_components (int, float, None or string) –

    Number of components to keep. if n_components is not set all components are kept:

    n_components == min(n_samples, n_features)
    

    If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. n_components cannot be equal to n_features for svd_solver == ‘arpack’.

  • n_selected_components (int, optional (default=None)) – Number of selected principal components for calculating the outlier scores. It is not necessarily equal to the total number of the principal components. If not set, use all principal components.

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

  • copy (bool (default True)) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.

  • whiten (bool, optional (default False)) –

    When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

    Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

  • svd_solver (string {'auto', 'full', 'arpack', 'randomized'}) –

    • ’auto’: the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

    • ’full’: run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing.

    • ’arpack’: run SVD truncated to n_components calling the ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < X.shape[1].

    • ’randomized’: run randomized SVD by the method of Halko et al.

  • tol (float >= 0, optional (default .0)) – Tolerance for singular values computed by svd_solver == ‘arpack’.

  • iterated_power (int >= 0, or 'auto', (default 'auto')) – Number of iterations for the power method computed by svd_solver == ‘randomized’.

  • random_state (int, RandomState instance or None, optional (default None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.

  • weighted (bool, optional (default=True)) – If True, the eigenvalues are used in score computation. The eigenvectors with small eigenvalues come with more importance in the outlier score calculation.

  • standardization (bool, optional (default=True)) – If True, perform standardization first to convert data to zero mean and unit variance. See http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

components_

Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

Type

array, shape (n_components, n_features)

explained_variance_

The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

Type

array, shape (n_components,)

explained_variance_ratio_

Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0.

Type

array, shape (n_components,)

singular_values_

The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

Type

array, shape (n_components,)

mean_

Per-feature empirical mean, estimated from the training set.

Equal to X.mean(axis=0).

Type

array, shape (n_features,)

n_components_

The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or n_features if n_components is None.

Type

int

noise_variance_

The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.

Type

float

decision_scores_

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

Type

numpy array of shape (n_samples,)

threshold_

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

Type

float

labels_

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Type

int, either 0 or 1

decision_function(X)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

property explained_variance_

The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

Decorator for scikit-learn PCA attributes.

property explained_variance_ratio_

Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0.

Decorator for scikit-learn PCA attributes.

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

property mean_

Per-feature empirical mean, estimated from the training set.

Decorator for scikit-learn PCA attributes.

property noise_variance_

The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.

Decorator for scikit-learn PCA attributes.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from the `decision_function(X)’, and the threshold `contamination’. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)

property singular_values_

The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

Decorator for scikit-learn PCA attributes.
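
Examples

A minimal usage sketch; with weighted=True the projections onto small-eigenvalue eigenvectors dominate the score, as described above. The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.pca import PCA
>>> X = pd.DataFrame(np.random.randn(400, 5))  # synthetic data (assumption)
>>> clf = PCA(n_components=3, weighted=True, standardization=True,
...           contamination=0.1)
>>> clf.fit(X)
>>> scores = clf.decision_function(X)  # weighted projected-distance scores
>>> labels = clf.predict(X)            # -1 for outliers, 1 for inliers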

algo.robustcovariance module

class algo.robustcovariance.RCOV(store_precision=True, assume_centered=False, support_fraction=None, contamination=0.1, random_state=None)

Bases: sklearn.covariance.elliptic_envelope.EllipticEnvelope, pyodds.algo.base.Base

An object for detecting outliers in a Gaussian distributed dataset.

Parameters
  • store_precision (boolean, optional (default=True)) – Specify if the estimated precision is stored.

  • assume_centered (boolean, optional (default=False)) – If True, the support of robust location and covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful to work with data whose mean is significantly equal to zero but is not exactly zero. If False, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.

  • support_fraction (float in (0., 1.), optional (default=None)) – The proportion of points to be included in the support of the raw MCD estimate. If None, the minimum value of support_fraction will be used within the algorithm: [n_sample + n_features + 1] / 2.

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set.

  • random_state (int, RandomState instance or None, optional (default=None)) – The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

location_

Estimated robust location

Type

array-like, shape (n_features,)

covariance_

Estimated robust covariance matrix

Type

array-like, shape (n_features, n_features)

precision_

Estimated pseudo inverse matrix. (stored only if store_precision is True)

Type

array-like, shape (n_features, n_features)

support_

A mask of the observations that have been used to compute the robust estimates of location and shape.

Type

array-like, shape (n_samples,)

offset_

Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. The offset depends on the contamination parameter and is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training.

Type

float

Examples

>>> import numpy as np
>>> from sklearn.covariance import EllipticEnvelope
>>> true_cov = np.array([[.8, .3],
...                      [.3, .4]])
>>> X = np.random.RandomState(0).multivariate_normal(mean=[0, 0],
...                                                  cov=true_cov,
...                                                  size=500)
>>> cov = EllipticEnvelope(random_state=0).fit(X)
>>> # predict returns 1 for an inlier and -1 for an outlier
>>> cov.predict([[0, 0],
...              [3, 3]])
array([ 1, -1])
>>> cov.covariance_ 
array([[0.7411..., 0.2535...],
       [0.2535..., 0.3053...]])
>>> cov.location_
array([0.0813... , 0.0427...])

See also

EmpiricalCovariance, MinCovDet

Notes

Outlier detection from covariance estimation may break or not perform well in high-dimensional settings. In particular, one will always take care to work with n_samples > n_features ** 2.

References

1. Rousseeuw, P.J., Van Driessen, K. “A fast algorithm for the minimum covariance determinant estimator.” Technometrics 41(3), 212 (1999).

algo.sod module

class algo.sod.SOD(contamination=0.1, n_neighbors=20, ref_set=10, alpha=0.8)

Bases: pyodds.algo.base.Base

The subspace outlier detection (SOD) scheme aims to detect outliers in varying subspaces of a high-dimensional feature space. For each data object, SOD explores the axis-parallel subspace spanned by the data object’s neighbors and determines how much the object deviates from the neighbors in this subspace.

Parameters
  • n_neighbors (int, optional (default=20)) – Number of neighbors to use by default for k neighbors queries.

  • ref_set (int, optional (default=10)) – Specifies the number of shared nearest neighbors to create the reference set. Note that ref_set must be smaller than n_neighbors.

  • alpha (float in (0., 1.), optional (default=0.8)) – Specifies the lower limit for selecting subspaces. The default of 0.8 is suggested in the original paper.

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

decision_scores_

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

Type

numpy array of shape (n_samples,)

threshold_

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

Type

float

labels_

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Type

int, either 0 or 1

decision_function(X)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
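
Examples

A minimal usage sketch; note the documented constraint that ref_set must be smaller than n_neighbors. The import path and data are assumptions.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.sod import SOD
>>> X = pd.DataFrame(np.random.randn(300, 10))  # high-dimensional data (assumption)
>>> clf = SOD(n_neighbors=20, ref_set=10, alpha=0.8, contamination=0.1)
>>> clf.fit(X)
>>> labels = clf.predict(X)  # -1 for outliers, 1 for inliers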

algo.staticautoencoder module

class algo.staticautoencoder.StaticAutoEncoder(hidden_neurons=None, epoch=100, dropout_rate=0.2, contamination=0.1, regularizer_weight=0.1, activation='relu', kernel_regularizer=0.01, loss_function='mse', optimizer='adam')

Bases: pyodds.algo.base.Base

decision_function(X)

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters

X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)

Fit detector.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)

Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.

Parameters

X (dataframe of shape (n_samples, n_features)) – The input samples.

Returns

ranking – The outlierness of the input samples.

Return type

numpy array of shape (n_samples,)
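
Examples

A minimal usage sketch; the class body above is undocumented, so the reading of hidden_neurons as a list of layer widths for the encoder/decoder stack is an assumption, as are the import path and data.

>>> import numpy as np
>>> import pandas as pd
>>> from pyodds.algo.staticautoencoder import StaticAutoEncoder
>>> X = pd.DataFrame(np.random.randn(400, 8))  # synthetic data (assumption)
>>> clf = StaticAutoEncoder(hidden_neurons=[8, 4, 4, 8], epoch=10,
...                         dropout_rate=0.2, contamination=0.1)
>>> clf.fit(X)
>>> labels = clf.predict(X)  # -1 for outliers, 1 for inliers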

algo.staticautoencoder.l21shrink(epsilon, x)
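
This function carries no docstring above. In robust autoencoder implementations, l21shrink is commonly the proximal operator of the l2,1 norm: each column of x is shrunk toward zero by epsilon in proportion to its l2 norm, and columns whose norm is below epsilon are zeroed. The sketch below shows that common definition; it is an assumption about intent, not necessarily this package's exact code.

>>> import numpy as np
>>> def l21shrink_sketch(epsilon, x):
...     out = np.zeros_like(x, dtype=float)
...     norms = np.linalg.norm(x, axis=0)  # column-wise l2 norms
...     keep = norms > epsilon             # columns to shrink rather than zero out
...     out[:, keep] = x[:, keep] * (1 - epsilon / norms[keep])
...     return out
>>> l21shrink_sketch(1.0, np.array([[3.0, 0.1], [4.0, 0.1]]))
array([[2.4, 0. ],
       [3.2, 0. ]])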

Module contents