algo package¶
Submodules¶
algo.algorithm_utils module¶

class algo.algorithm_utils.PyTorchUtils(seed, gpu)¶
Bases: object
Abstract class for PyTorch-based deep learning detection algorithms.

property device¶

to_device(model)¶

to_var(t, **kwargs)¶

algo.autoencoder module¶

class algo.autoencoder.AUTOENCODER(name: str = 'AutoEncoder', num_epochs: int = 10, batch_size: int = 20, lr: float = 0.001, hidden_size: int = 5, sequence_length: int = 30, train_gaussian_percentage: float = 0.25, seed: int = None, gpu: int = None, details=True, contamination=0.05)¶
Bases: pyodds.algo.base.Base, pyodds.algo.algorithm_utils.deepBase, pyodds.algo.algorithm_utils.PyTorchUtils
An Auto Encoder (AE) is a type of neural network for learning useful data representations in an unsupervised manner. It can be used to detect outlying objects in the data by calculating the reconstruction errors.
 Parameters
name (str, optional (default='AutoEncoder')) – The name of the algorithm
num_epochs (int, optional (default=10)) – The number of epochs
batch_size (int, optional (default=20)) – The batch size
lr (float, optional (default=1e-3)) – The learning rate
hidden_size (int, optional (default=5)) – The size of the hidden layer
sequence_length (int, optional (default=30)) – The length of the sequence
train_gaussian_percentage (float, optional (default=0.25)) – The percentage of data used for Gaussian training
seed (int, optional (default=None)) – The random seed
contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

decision_function(X: pandas.core.frame.DataFrame) → numpy.array¶
Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit(X: pandas.core.frame.DataFrame)¶
Fit detector.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)¶
Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
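The reconstruction-error principle behind autoencoder-based detection can be illustrated without PyTorch by using PCA as a linear stand-in for the encoder/decoder pair. This is an analogy only, not the AUTOENCODER implementation, which trains a neural network:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
t = rng.randn(100, 1)
X = np.hstack([t, 2.0 * t]) + 0.05 * rng.randn(100, 2)  # inliers near a line
X = np.vstack([X, [[3.0, -3.0]]])                        # one point far off it

# "Encode" to 1 dimension and "decode" back; the reconstruction error
# plays the role of the autoencoder's anomaly score.
pca = PCA(n_components=1).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - X_hat, axis=1)  # higher = more anomalous
```

The point off the principal subspace reconstructs poorly and receives the largest error, which is exactly the signal AE-based detectors threshold.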

class algo.autoencoder.AutoEncoderModule(n_features: int, sequence_length: int, hidden_size: int, seed: int, gpu: int)¶
Bases: torch.nn.modules.module.Module, pyodds.algo.algorithm_utils.PyTorchUtils

forward(ts_batch, return_latent: bool = False)¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

algo.base module¶

class algo.base.Base¶
Bases: object
Abstract class for all outlier detection algorithms.

decision_function(X)¶
Predict raw anomaly scores of X using the fitted detector. The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.
 Parameters
X – The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit(X)¶
Fit detector.
 Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples.

predict(X)¶
Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
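The predict contract shared by all detectors can be sketched in a few lines of NumPy: rank the samples by their decision_function scores and label the top contamination fraction as outliers. This is an illustrative helper, not the pyodds implementation:

```python
import numpy as np

def predict_from_scores(scores, contamination=0.1):
    """Label the top `contamination` fraction of scores as outliers (-1)
    and the rest as inliers (1), mirroring the Base.predict contract."""
    threshold = np.percentile(scores, 100 * (1 - contamination))
    return np.where(scores > threshold, -1, 1)

scores = np.array([0.1, 0.2, 0.15, 0.9, 0.3, 0.25, 0.2, 0.95, 0.1, 0.3])
labels = predict_from_scores(scores, contamination=0.2)
# The two highest-scoring samples (0.9 and 0.95) are labeled -1.
```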

algo.cblof module¶

class algo.cblof.CBLOF(n_clusters=8, contamination=0.1, clustering_estimator=None, alpha=0.9, beta=5, use_weights=False, random_state=None, n_jobs=1)¶
Bases: pyodds.algo.base.Base
The CBLOF operator calculates the outlier score based on the cluster-based local outlier factor. CBLOF takes as input the data set and the cluster model generated by a clustering algorithm. It classifies the clusters into small clusters and large clusters using the parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster the point belongs to, as well as the distance to the nearest large cluster. Weighting the outlier factor by the sizes of the clusters, as proposed in the original publication, can be enabled; since this might lead to unexpected behavior (outliers close to small clusters are not found), it is disabled by default, and outlier scores are solely computed based on the distance to the closest large cluster center.
 Parameters
n_clusters (int, optional (default=8)) – The number of clusters to form as well as the number of centroids to generate.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
clustering_estimator (Estimator, optional (default=None)) – The base clustering algorithm for performing data clustering. A valid clustering algorithm should be passed in. The estimator should have standard sklearn APIs, fit() and predict(), and should have the attributes labels_ and cluster_centers_. If cluster_centers_ is not in the attributes once the model is fit, it is calculated as the mean of the samples in a cluster. If not set, CBLOF uses KMeans for scalability. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
alpha (float in (0.5, 1), optional (default=0.9)) – Coefficient for deciding small and large clusters. The ratio of the number of samples in large clusters to the number of samples in small clusters.
beta (int or float in (1,), optional (default=5)) – Coefficient for deciding small and large clusters. For a list of clusters sorted by size, |C1|, |C2|, …, |Cn|, beta = |Ck| / |Ck+1|.
use_weights (bool, optional (default=False)) – If set to True, the sizes of clusters are used as weights in the outlier score calculation.
check_estimator (bool, optional (default=False)) – If set to True, check whether the base estimator is consistent with the sklearn standard.

decision_function(X)¶
Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit(X)¶
Fit detector.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)¶
Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
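The large/small cluster split and the distance-based scoring can be sketched with scikit-learn's KMeans. This is a simplified illustration of the unweighted CBLOF idea (the alpha-based split only; beta and the exact pyodds logic are omitted, and `cblof_like_scores` is a hypothetical helper):

```python
import numpy as np
from sklearn.cluster import KMeans

def cblof_like_scores(X, n_clusters=2, alpha=0.9, random_state=0):
    """Cluster the data, mark the biggest clusters as 'large' until they
    cover at least `alpha` of all samples, then score each point by its
    distance to the nearest large-cluster center."""
    km = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=random_state).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    order = np.argsort(sizes)[::-1]                 # clusters by size, desc.
    covered = np.cumsum(sizes[order]) / len(X)
    n_large = int(np.searchsorted(covered, alpha)) + 1
    centers = km.cluster_centers_[order[:n_large]]
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1)                        # higher = more outlying

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), [[10.0, 10.0]]])   # dense blob + one outlier
scores = cblof_like_scores(X)
```

The isolated point is far from every large-cluster center and therefore receives the highest score.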

algo.cblof.pairwise_distances_no_broadcast(X, Y)¶
Utility function to calculate the row-wise Euclidean distance of two matrices. Unlike pairwise calculation, this function does not broadcast. For instance, if X and Y are both (4, 3) matrices, the function returns a distance vector of shape (4,) instead of (4, 4).
 Parameters
X (array of shape (n_samples, n_features)) – First input samples
Y (array of shape (n_samples, n_features)) – Second input samples
 Returns
distance – Row-wise Euclidean distance of X and Y
 Return type
array of shape (n_samples,)
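The behavior described above can be reproduced with a short NumPy sketch (an equivalent re-implementation for illustration, not the pyodds source; the name `row_wise_euclidean` is hypothetical):

```python
import numpy as np

def row_wise_euclidean(X, Y):
    """Row-wise Euclidean distance between two equal-shape matrices:
    returns a vector of shape (n_samples,), not an (n, n) distance matrix."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    assert X.shape == Y.shape, "inputs must have the same shape"
    return np.sqrt(np.sum((X - Y) ** 2, axis=1))

X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[3.0, 4.0], [1.0, 1.0]])
d = row_wise_euclidean(X, Y)  # array([5., 0.])
```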
algo.dagmm module¶
Adapted from Daniel Stanley Tan (https://github.com/danieltan07/dagmm)

class algo.dagmm.DAGMM(num_epochs=10, lambda_energy=0.1, lambda_cov_diag=0.005, lr=0.001, batch_size=50, gmm_k=3, normal_percentile=80, sequence_length=30, autoencoder_type=<class 'pyodds.algo.autoencoder.AutoEncoderModule'>, autoencoder_args=None, hidden_size: int = 5, seed: int = None, gpu: int = None, details=True, contamination=0.05)¶
Bases: pyodds.algo.base.Base, pyodds.algo.algorithm_utils.deepBase, pyodds.algo.algorithm_utils.PyTorchUtils
Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection, Zong et al., 2018. Unsupervised anomaly detection on multi- or high-dimensional data is of great importance in both fundamental machine learning research and industrial applications, for which density estimation lies at the core. Although previous approaches based on dimensionality reduction followed by density estimation have made fruitful progress, they mainly suffer from decoupled model learning with inconsistent optimization goals and incapability of preserving essential information in the low-dimensional space. In this paper, we present a Deep Autoencoding Gaussian Mixture Model (DAGMM) for unsupervised anomaly detection. Our model utilizes a deep autoencoder to generate a low-dimensional representation and reconstruction error for each input data point, which is further fed into a Gaussian Mixture Model (GMM). Instead of using decoupled two-stage training and the standard Expectation-Maximization (EM) algorithm, DAGMM jointly optimizes the parameters of the deep autoencoder and the mixture model simultaneously in an end-to-end fashion, leveraging a separate estimation network to facilitate the parameter learning of the mixture model. The joint optimization, which well balances autoencoding reconstruction, density estimation of latent representations, and regularization, helps the autoencoder escape from less attractive local optima and further reduce reconstruction errors, avoiding the need for pre-training.
 Parameters
num_epochs (int, optional (default=10)) – The number of epochs
lambda_energy (float, optional (default=0.1)) – The parameter to balance the energy term in the loss function
lambda_cov_diag (float, optional (default=0.005)) – The parameter to balance the covariance term in the loss function
lr (float, optional (default=1e-3)) – The learning rate
batch_size (int, optional (default=50)) – The number of samples in one batch
gmm_k (int, optional (default=3)) – The number of clusters in the Gaussian Mixture model
sequence_length (int, optional (default=30)) – The length of the sequence
hidden_size (int, optional (default=5)) – The size of the hidden layer
seed (int, optional (default=None)) – The random seed
contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

class AutoEncoder¶
Bases: object

LSTM¶
alias of pyodds.algo.lstmencdec.LSTMEDModule

NN¶
alias of pyodds.algo.autoencoder.AutoEncoderModule


dagmm_step(input_data)¶

decision_function(X: pandas.core.frame.DataFrame)¶
Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores. Using the learned mixture probability, mean and covariance for each component k, compute the energy on the given data.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit(X: pandas.core.frame.DataFrame)¶
Learn the mixture probability, mean and covariance for each component k. Store the computed energy based on the training data and the aforementioned parameters.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)¶
Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)

reset_grad()¶

class algo.dagmm.DAGMMModule(autoencoder, n_gmm, latent_dim, seed: int, gpu: int)¶
Bases: torch.nn.modules.module.Module, pyodds.algo.algorithm_utils.PyTorchUtils
Residual Block.

compute_energy(z, phi=None, mu=None, cov=None, size_average=True)¶

compute_gmm_params(z, gamma)¶

forward(x)¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

loss_function(x, x_hat, z, gamma, lambda_energy, lambda_cov_diag)¶

relative_euclidean_distance(a, b, dim=1)¶
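DAGMM's anomaly score is an "energy": the negative log-likelihood of a point's latent representation under the learned mixture. The scoring step alone can be sketched with scikit-learn's GaussianMixture. This omits the joint autoencoder/estimation-network training that defines DAGMM; the `Z_train` array is a hypothetical stand-in for latent codes of normal data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
Z_train = rng.randn(200, 2)  # stand-in for latent codes of normal data

# Learn mixture probability, mean and covariance for each component k.
gmm = GaussianMixture(n_components=3, random_state=0).fit(Z_train)

# Energy = negative log-likelihood; larger energy = more anomalous.
Z_test = np.vstack([rng.randn(20, 2), [[6.0, 6.0]]])
energy = -gmm.score_samples(Z_test)
```

A test point far from all fitted components has low likelihood and therefore the highest energy.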

algo.hbos module¶

class algo.hbos.HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)¶
Bases: pyodds.algo.base.Base
Histogram-based outlier detection (HBOS) is an efficient unsupervised method. It assumes feature independence and calculates the degree of outlyingness by building histograms. See :cite:`goldstein2012histogram` for details.
 Parameters
n_bins (int, optional (default=10)) – The number of bins.
alpha (float in (0, 1), optional (default=0.1)) – The regularizer for preventing overflow.
tol (float in (0, 1), optional (default=0.5)) – The parameter deciding the flexibility for dealing with samples falling outside the bins.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

bin_edges_¶
The edges of the bins.
 Type
numpy array of shape (n_bins + 1, n_features)

hist_¶
The density of each histogram.
 Type
numpy array of shape (n_bins, n_features)

decision_scores_¶
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
 Type
numpy array of shape (n_samples,)

threshold_¶
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
 Type
float

labels_¶
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
 Type
int, either 0 or 1

decision_function(X)¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit(X)¶
Fit detector.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)¶
Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
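The histogram scoring that HBOS performs can be sketched in NumPy: build one histogram per feature, convert bin densities into log-inverse scores, and sum across features under the independence assumption. This is an illustrative simplification (`hbos_like_scores` is a hypothetical helper; the pyodds version additionally uses tol to handle samples falling outside the bins):

```python
import numpy as np

def hbos_like_scores(X, n_bins=10, alpha=0.1):
    """Per-feature histogram densities turned into additive outlier scores."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # Map each value to its bin (inner edges give indices 0..n_bins-1).
        idx = np.digitize(X[:, j], edges[1:-1])
        scores += -np.log(hist[idx] + alpha)  # alpha regularizes sparse bins
    return scores  # higher = more outlying

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 1), [[8.0]]])  # 1-D blob + one outlier
scores = hbos_like_scores(X)
```

The outlier falls in a near-empty bin, so its summed log-inverse density is maximal.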

algo.hbos.invert_order(scores, method='multiplication')¶
Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful when combining multiple detectors, since their score orders could be different.
 Parameters
scores (list, array or numpy array with shape (n_samples,)) – The list of values to be inverted
method (str, optional (default='multiplication')) – Method used for order inversion. Valid methods are: ‘multiplication’: multiply by -1; ‘subtraction’: max(scores) - scores
 Returns
inverted_scores – The inverted list
 Return type
numpy array of shape (n_samples,)
Examples
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])
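A NumPy re-implementation matching the documented behavior (for illustration):

```python
import numpy as np

def invert_order(scores, method='multiplication'):
    """Invert score order so the smallest value becomes the largest."""
    scores = np.asarray(scores, dtype=float)
    if method == 'multiplication':
        return -1.0 * scores
    if method == 'subtraction':
        return np.max(scores) - scores
    raise ValueError("method must be 'multiplication' or 'subtraction'")

inv_mul = invert_order([0.1, 0.3, 0.5, 0.7, 0.2, 0.1])
inv_sub = invert_order([0.1, 0.3, 0.5, 0.7, 0.2, 0.1], method='subtraction')
```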
algo.iforest module¶

class algo.iforest.IFOREST(n_estimators=100, max_samples='auto', contamination='legacy', max_features=1.0, bootstrap=False, n_jobs=None, behaviour='old', random_state=None, verbose=0, warm_start=False)¶
Bases: sklearn.ensemble.iforest.IsolationForest, pyodds.algo.base.Base
Isolation Forest Algorithm. Return the anomaly score of each sample using the IsolationForest algorithm. The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
 Parameters
n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.
max_samples (int or float, optional (default="auto")) – The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples. If "auto", then max_samples=min(256, n_samples). If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling).
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function. If ‘auto’, the decision function threshold is determined as in the original paper. Changed in version 0.20: the default value of contamination will change from 0.1 in 0.20 to 'auto' in 0.22.
max_features (int or float, optional (default=1.0)) – The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
bootstrap (boolean, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the Glossary for more details.
behaviour (str, default='old') – Behaviour of the decision_function, which can be either ‘old’ or ‘new’. Passing behaviour='new' makes the decision_function change to match the API of other anomaly detection algorithms, which will be the default behaviour in the future. As explained in detail in the offset_ attribute documentation, the decision_function becomes dependent on the contamination parameter, in such a way that 0 becomes its natural threshold to detect outliers. New in version 0.20: behaviour is added in 0.20 for back-compatibility purposes. Deprecated since version 0.20: behaviour='old' is deprecated in 0.20 and will not be possible in 0.22. Deprecated since version 0.22: the behaviour parameter will be deprecated in 0.22 and removed in 0.24.
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
verbose (int, optional (default=0)) – Controls the verbosity of the tree building process.
warm_start (bool, optional (default=False)) – When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest. See the Glossary. New in version 0.21.

estimators_¶
The collection of fitted sub-estimators.
 Type
list of DecisionTreeClassifier

estimators_samples_¶
The subset of drawn samples (i.e., the in-bag samples) for each base estimator.
 Type
list of arrays

max_samples_¶
The actual number of samples.
 Type
integer

offset_¶
Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. Assuming behaviour == ‘new’, offset_ is defined as follows. When the contamination parameter is set to “auto”, the offset is equal to -0.5, as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than “auto” is provided, the offset is defined in such a way that we obtain the expected number of outliers (samples with decision function < 0) in training. Assuming the behaviour parameter is set to ‘old’, we always have offset_ = -0.5, making the decision function independent from the contamination parameter.
 Type
float
Notes
The implementation is based on an ensemble of ExtraTreeRegressor. The maximum depth of each tree is set to ceil(log_2(n)), where \(n\) is the number of samples used to build the tree (see (Liu et al., 2008) for more details).
References
 1
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on.
 2
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation-based anomaly detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
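Since IFOREST subclasses scikit-learn's IsolationForest, its core behavior can be demonstrated with scikit-learn directly. A sketch (note that recent scikit-learn releases have removed the deprecated behaviour parameter discussed above):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(100, 2), [[6.0, 6.0]]])  # dense blob + one outlier

clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = clf.fit_predict(X)      # -1 = outlier, 1 = inlier
scores = -clf.score_samples(X)   # higher = more anomalous
```

The isolated point sits at the end of short average path lengths and is both flagged and top-ranked.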
algo.knn module¶

class algo.knn.KNN(contamination=0.1, n_neighbors=5, method='largest', radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, **kwargs)¶
Bases: pyodds.algo.base.Base
kNN class for outlier detection. For an observation, its distance to its kth nearest neighbor can be viewed as its outlying score, which in turn can be viewed as a way to measure the density. See :cite:`ramaswamy2000efficient,angiulli2002fast` for details.
Three kNN detectors are supported:
largest: use the distance to the kth neighbor as the outlier score
mean: use the average of the distances to all k neighbors as the outlier score
median: use the median of the distances to the k neighbors as the outlier score
 Parameters
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
n_neighbors (int, optional (default = 5)) – Number of neighbors to use by default for k neighbors queries.
method (str, optional (default='largest')) –
{‘largest’, ‘mean’, ‘median’}
’largest’: use the distance to the kth neighbor as the outlier score
’mean’: use the average of all k neighbors as the outlier score
’median’: use the median of the distance to k neighbors as the outlier score
radius (float, optional (default = 1.0)) – Range of parameter space to use by default for radius_neighbors queries.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional) –
Algorithm used to compute the nearest neighbors:
’ball_tree’ will use BallTree
’kd_tree’ will use KDTree
’brute’ will use a brute-force search.
’auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size (int, optional (default = 30)) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
metric (string or callable, default 'minkowski') –
The metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.
Distance matrices are not supported.
Valid values for metric are:
from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
See the documentation for scipy.spatial.distance for details on these metrics.
p (integer, optional (default = 2)) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances
metric_params (dict, optional (default = None)) – Additional keyword arguments for the metric function.
n_jobs (int, optional (default = 1)) – The number of parallel jobs to run for the neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only the kneighbors and kneighbors_graph methods.

decision_scores_¶
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
 Type
numpy array of shape (n_samples,)

threshold_¶
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
 Type
float

labels_¶
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
 Type
int, either 0 or 1

decision_function(X)¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit(X)¶
Fit detector. y is optional for unsupervised methods.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.

predict(X)¶
Return outliers with -1 and inliers with 1, with the outlierness score calculated from decision_function(X) and the threshold contamination.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
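The three documented kNN scores can be sketched with scikit-learn's NearestNeighbors (an illustration of the scoring rules, not the pyodds implementation; `knn_scores` is a hypothetical helper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, n_neighbors=5, method='largest'):
    """Sketch of the three kNN outlier scores described above."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    dist = dist[:, 1:]               # drop each point's zero self-distance
    if method == 'largest':
        return dist[:, -1]           # distance to the kth neighbor
    if method == 'mean':
        return dist.mean(axis=1)     # mean of the k neighbor distances
    return np.median(dist, axis=1)   # 'median'

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(60, 2), [[7.0, 7.0]]])  # dense blob + one outlier
largest = knn_scores(X, method='largest')
```

All three variants rank the isolated point highest, since every one of its k neighbor distances is large.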
algo.lof module¶

class algo.lof.LOF(n_neighbors=20, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, contamination='legacy', novelty=False, n_jobs=None)¶
Bases: sklearn.neighbors.lof.LocalOutlierFactor, pyodds.algo.base.Base
Unsupervised Outlier Detection using the Local Outlier Factor (LOF). The anomaly score of each sample is called the Local Outlier Factor. It measures the local deviation of the density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.
 Parameters
n_neighbors (int, optional (default=20)) – Number of neighbors to use by default for kneighbors() queries. If n_neighbors is larger than the number of samples provided, all samples will be used.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional) – Algorithm used to compute the nearest neighbors: ‘ball_tree’ will use BallTree; ‘kd_tree’ will use KDTree; ‘brute’ will use a brute-force search; ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit() method. Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size (int, optional (default=30)) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
metric (string or callable, default 'minkowski') – The metric used for the distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used. If ‘precomputed’, the training input X is expected to be a distance matrix. If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string. Valid values for metric are: from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]; from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]. See the documentation for scipy.spatial.distance for details on these metrics: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
p (integer, optional (default=2)) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances(). When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric_params (dict, optional (default=None)) – Additional keyword arguments for the metric function.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting, this is used to define the threshold on the decision function. If “auto”, the decision function threshold is determined as in the original paper. Changed in version 0.20: the default value of contamination will change from 0.1 in 0.20 to 'auto' in 0.22.
novelty (boolean, default False) – By default, LocalOutlierFactor is only meant to be used for outlier detection (novelty=False). Set novelty to True if you want to use LocalOutlierFactor for novelty detection. In this case, be aware that you should only use predict, decision_function and score_samples on new unseen data and not on the training set.
n_jobs (int or None, optional (default=None)) – The number of parallel jobs to run for the neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the Glossary for more details. Affects only kneighbors() and kneighbors_graph() methods.

negative_outlier_factor_¶
The opposite LOF of the training samples. The higher, the more normal. Inliers tend to have a LOF score close to 1 (negative_outlier_factor_ close to -1), while outliers tend to have a larger LOF score. The local outlier factor (LOF) of a sample captures its supposed ‘degree of abnormality’. It is the average of the ratio of the local reachability density of a sample and those of its k-nearest neighbors.
 Type
numpy array, shape (n_samples,)

n_neighbors_
¶ The actual number of neighbors used for
kneighbors()
queries. Type
integer

offset_
¶ Offset used to obtain binary labels from the raw scores. Observations having a negative_outlier_factor smaller than offset_ are detected as abnormal. The offset is set to -1.5 (inliers score around -1), except when a contamination parameter different than “auto” is provided. In that case, the offset is defined in such a way we obtain the expected number of outliers in training.
 Type
float
References
 1
Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000, May). LOF: identifying density-based local outliers. In ACM SIGMOD Record.
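As a quick illustration of negative_outlier_factor_ and the -1/1 labeling, a minimal scikit-learn LocalOutlierFactor run on toy data (values chosen for illustration only):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# toy 1-D data: three clustered points and one obvious outlier
X = np.array([[0.0], [0.1], [0.2], [10.0]])
lof = LocalOutlierFactor(n_neighbors=2, contamination=0.25)
labels = lof.fit_predict(X)            # -1 for outliers, 1 for inliers
scores = lof.negative_outlier_factor_  # inliers near -1, outliers far below
```

The isolated point at 10.0 gets a label of -1 and a negative_outlier_factor_ far below that of the clustered points.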
algo.lstmad module¶

class
algo.lstmad.
LSTMAD
(len_in=1, len_out=10, num_epochs=10, lr=0.001, batch_size=1, seed: int = None, gpu: int = None, details=True, contamination=0.05)¶ Bases:
pyodds.algo.base.Base
,pyodds.algo.algorithm_utils.deepBase
,pyodds.algo.algorithm_utils.PyTorchUtils
Malhotra, Pankaj, et al. “Long short term memory networks for anomaly detection in time series.” Proceedings. Presses universitaires de Louvain, 2015.
Long Short Term Memory (LSTM) networks have been demonstrated to be particularly useful for learning sequences containing longer-term patterns of unknown length, due to their ability to maintain long-term memory. Stacking recurrent hidden layers in such networks also enables the learning of higher-level temporal features, for faster learning with sparser representations. In this paper, we use stacked LSTM networks for anomaly/fault detection in time series. A network is trained on non-anomalous data and used as a predictor over a number of time steps. The resulting prediction errors are modeled as a multivariate Gaussian distribution, which is used to assess the likelihood of anomalous behavior.
 Parameters
len_in (int, optional (default=1)) – The length of input layer
len_out (int, optional (default=10)) – The length of output layer
num_epochs (int, optional (default=10)) – The number of epochs
lr (float, optional (default=1e-3)) – The learning rate
seed (int, optional (default=None)) – The random seed
contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

decision_function
(X)¶ Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit
(X)¶ Fit detector. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)

predict
(X)¶ Return outliers with -1 and inliers with 1, with the outlierness score calculated from the `decision_function(X)`, and the threshold `contamination`. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
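The core scoring idea of this detector — model prediction errors on normal data as a Gaussian and flag low-likelihood steps — can be sketched in a few lines (toy error values, not produced by an actual model):

```python
import numpy as np

# hypothetical prediction errors; the predictor was trained on normal data,
# so the last step's large error signals an anomaly
errors = np.array([0.1, -0.2, 0.05, 0.0, 3.0])
mu = errors[:-1].mean()
sigma = errors[:-1].std() + 1e-8  # avoid division by zero
# negative log-likelihood (up to a constant) under the fitted Gaussian:
# larger score = more anomalous
scores = 0.5 * ((errors - mu) / sigma) ** 2
```

The step with the largest deviation from the fitted error distribution receives the largest anomaly score.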

class
algo.lstmad.
LSTMSequence
(d, batch_size: int, len_in=1, len_out=10)¶ Bases:
torch.nn.modules.module.Module

forward
(input_x)¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

algo.lstmencdec module¶

class
algo.lstmencdec.
LSTMED
(name: str = 'LSTMED', num_epochs: int = 10, batch_size: int = 20, lr: float = 0.001, hidden_size: int = 5, sequence_length: int = 30, train_gaussian_percentage: float = 0.25, n_layers: tuple = (1, 1), use_bias: tuple = (True, True), dropout: tuple = (0, 0), seed: int = None, gpu: int = None, details=True, contamination=0.05)¶ Bases:
pyodds.algo.base.Base
,pyodds.algo.algorithm_utils.deepBase
,pyodds.algo.algorithm_utils.PyTorchUtils
Malhotra, Pankaj, et al. “LSTM-based encoder-decoder for multi-sensor anomaly detection.” ICML, 2016.
Mechanical devices such as engines, vehicles, aircraft, etc., are typically instrumented with numerous sensors to capture the behavior and health of the machine. However, there are often external factors or variables which are not captured by sensors, leading to time series which are inherently unpredictable. For instance, manual controls and/or unmonitored environmental conditions or load may lead to inherently unpredictable time series. Detecting anomalies in such scenarios becomes challenging using standard approaches based on mathematical models that rely on stationarity, or prediction models that utilize prediction errors to detect anomalies. We propose a Long Short Term Memory Networks based Encoder-Decoder scheme for Anomaly Detection (EncDec-AD) that learns to reconstruct ‘normal’ time-series behavior, and thereafter uses reconstruction error to detect anomalies. We experiment with three publicly available quasi-predictable time-series datasets: power demand, space shuttle, and ECG, and two real-world engine datasets with both predictive and unpredictable behavior.
 Parameters
name (str, optional default='LSTMED') – The name of the algorithm
num_epochs (int, optional (default=10)) – The number of epochs
batch_size (int, optional (default=20)) – The batch size
lr (float, optional (default=1e-3)) – The learning rate
hidden_size (int, optional (default=5)) – The size of the hidden layer
sequence_length (int, optional (default=30)) – The length of sequence
train_gaussian_percentage (float, optional (default=0.25)) – The percentage for gaussian training
n_layers (tuple, optional (default=(1,1))) – The number of hidden layers
use_bias (tuple, optional (default=(True, True))) – Whether to use bias in hidden layers
dropout (tuple, optional (default=(0, 0))) – Dropout rates in hidden layers
seed (int, optional (default=None)) – The random seed
contamination (float in (0., 0.5), optional (default=0.05)) – The percentage of outliers

decision_function
(X: pandas.core.frame.DataFrame)¶ Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit
(X: pandas.core.frame.DataFrame)¶ Fit detector. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)

predict
(X)¶ Return outliers with -1 and inliers with 1, with the outlierness score calculated from the `decision_function(X)`, and the threshold `contamination`. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
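The reconstruction-error scoring and contamination-based thresholding that these methods describe can be sketched with plain numpy (toy data and hypothetical reconstructions, not from the actual model):

```python
import numpy as np

# hypothetical reconstructions from a trained encoder-decoder: the model
# reproduces 'normal' behavior, so the outlying row reconstructs poorly
X     = np.array([[1.0, 2.0], [1.1, 1.9], [9.0, -3.0]])
X_hat = np.array([[1.0, 2.0], [1.0, 2.0], [1.0,  2.0]])
scores = np.linalg.norm(X - X_hat, axis=1)           # larger = more anomalous
threshold = np.percentile(scores, 100 * (1 - 0.05))  # contamination = 0.05
pred = np.where(scores > threshold, -1, 1)           # -1 outlier, 1 inlier
```

The row farthest from its reconstruction exceeds the contamination threshold and is labeled -1.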

class
algo.lstmencdec.
LSTMEDModule
(n_features: int, hidden_size: int, n_layers: tuple, use_bias: tuple, dropout: tuple, seed: int, gpu: int)¶ Bases:
torch.nn.modules.module.Module
,pyodds.algo.algorithm_utils.PyTorchUtils

forward
(ts_batch, return_latent: bool = False)¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

algo.luminolFunc module¶

class
algo.luminolFunc.
luminolDet
(contamination=0.1)¶ Bases:
pyodds.algo.base.Base
Luminol is a lightweight Python library for time series data analysis. The two major functionalities it supports are anomaly detection and correlation. It can be used to investigate possible causes of anomalies.
 Parameters
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

decision_function
(X)¶ Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit
(X)¶ Fit detector. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)

predict
(X)¶ Return outliers with -1 and inliers with 1, with the outlierness score calculated from the `decision_function(X)`, and the threshold `contamination`. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
algo.ocsvm module¶

class
algo.ocsvm.
OCSVM
(kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, random_state=None)¶ Bases:
sklearn.svm.classes.OneClassSVM
,pyodds.algo.base.Base
Unsupervised Outlier Detection. Estimate the support of a high-dimensional distribution. The implementation is based on libsvm. Read more in the User Guide. :param kernel: Specifies the kernel type to be used in the algorithm.
It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.
 Parameters
degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
gamma (float, optional (default='auto')) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features; if
gamma='scale'
is passed then it uses 1 / (n_features * X.var()) as value of gamma. The current default of gamma, ‘auto’, will change to ‘scale’ in version 0.22. ‘auto_deprecated’, a deprecated version of ‘auto’, is used as a default indicating that no explicit value of gamma was passed.
coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
tol (float, optional) – Tolerance for stopping criterion.
nu (float, optional) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
shrinking (boolean, optional) – Whether to use the shrinking heuristic.
cache_size (float, optional) – Specify the size of the kernel cache (in MB).
verbose (bool, default: False) – Enable verbose output. Note that this setting takes advantage of a perprocess runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.
random_state (int, RandomState instance or None, optional (default=None)) –
Ignored.
.. deprecated:: 0.20
random_state has been deprecated in 0.20 and will be removed in 0.22.

support_
¶ Indices of support vectors.
 Type
arraylike, shape = [n_SV]

support_vectors_
¶ Support vectors.
 Type
arraylike, shape = [nSV, n_features]

dual_coef_
¶ Coefficients of the support vectors in the decision function.
 Type
array, shape = [1, n_SV]

coef_
¶ Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel. coef_ is a read-only property derived from dual_coef_ and support_vectors_.
 Type
array, shape = [1, n_features]

intercept_
¶ Constant in the decision function.
 Type
array, shape = [1,]

offset_
¶ Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. The offset is the opposite of intercept_ and is provided for consistency with other outlier detection algorithms.
 Type
float
Examples
>>> from sklearn.svm import OneClassSVM
>>> X = [[0], [0.44], [0.45], [0.46], [1]]
>>> clf = OneClassSVM(gamma='auto').fit(X)
>>> clf.predict(X)
array([-1,  1,  1,  1, -1])
>>> clf.score_samples(X)
array([1.7798..., 2.0547..., 2.0556..., 2.0561..., 1.7332...])
algo.pca module¶

class
algo.pca.
PCA
(n_components=None, n_selected_components=None, contamination=0.1, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, weighted=True, standardization=True)¶ Bases:
pyodds.algo.base.Base
Principal component analysis (PCA) can be used in detecting outliers. PCA is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
In this procedure, covariance matrix of the data can be decomposed to orthogonal vectors, called eigenvectors, associated with eigenvalues. The eigenvectors with high eigenvalues capture most of the variance in the data.
Therefore, a low dimensional hyperplane constructed by k eigenvectors can capture most of the variance in the data. However, outliers are different from normal data points, which is more obvious on the hyperplane constructed by the eigenvectors with small eigenvalues.
Therefore, outlier scores can be obtained as the sum of the projected distance of a sample on all eigenvectors. See :cite:`shyu2003novel,aggarwal2015outlier` for details.
Score(X) = Sum of weighted euclidean distance between each sample to the hyperplane constructed by the selected eigenvectors
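That score can be sketched directly with numpy (a simplified version, assuming centered data and omitting the standardization and component-selection options):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
Xc = X - X.mean(axis=0)                     # center the data
# eigen-decomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
# squared projection of each sample onto each eigenvector, weighted by
# 1/eigenvalue: directions of small variance dominate the outlier score
scores = ((Xc @ vecs) ** 2 / vals).sum(axis=1)
```

Samples that deviate along low-variance eigenvectors receive large scores even when their overall magnitude is unremarkable.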
 Parameters
n_components (int, float, None or string) –
Number of components to keep. if n_components is not set all components are kept:
n_components == min(n_samples, n_features)
if n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension; if
0 < n_components < 1
and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. n_components cannot be equal to n_features for svd_solver == ‘arpack’.
n_selected_components (int, optional (default=None)) – Number of selected principal components for calculating the outlier scores. It is not necessarily equal to the total number of the principal components. If not set, use all principal components.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
copy (bool (default True)) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whiten (bool, optional (default False)) –
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit componentwise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hardwired assumptions.
svd_solver (string {'auto', 'full', 'arpack', 'randomized'}) –
 auto :
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
 full :
run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
 arpack :
run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < X.shape[1]
 randomized :
run randomized SVD by the method of Halko et al.
tol (float >= 0, optional (default .0)) – Tolerance for singular values computed by svd_solver == ‘arpack’.
iterated_power (int >= 0, or 'auto', (default 'auto')) – Number of iterations for the power method computed by svd_solver == ‘randomized’.
random_state (int, RandomState instance or None, optional (default None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when
svd_solver
== ‘arpack’ or ‘randomized’.
weighted (bool, optional (default=True)) – If True, the eigenvalues are used in score computation. The eigenvectors with small eigenvalues come with more importance in outlier score calculation.
standardization (bool, optional (default=True)) – If True, perform standardization first to convert data to zero mean and unit variance. See http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

components_
¶ Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by
explained_variance_
. Type
array, shape (n_components, n_features)

explained_variance_
¶ The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
 Type
array, shape (n_components,)

explained_variance_ratio_
¶ Percentage of variance explained by each of the selected components.
If
n_components
is not set then all components are stored and the sum of explained variances is equal to 1.0. Type
array, shape (n_components,)

singular_values_
¶ The singular values corresponding to each of the selected components. The singular values are equal to the 2norms of the
n_components
variables in the lower-dimensional space. Type
array, shape (n_components,)

mean_
¶ Perfeature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
 Type
array, shape (n_features,)

n_components_
¶ The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or n_features if n_components is None.
 Type
int

noise_variance_
¶ The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
 Type
float

decision_scores_
¶ The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
 Type
numpy array of shape (n_samples,)

threshold_
¶ The threshold is based on
contamination
. It is then_samples * contamination
most abnormal samples indecision_scores_
. The threshold is calculated for generating binary outlier labels. Type
float

labels_
¶ The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying
threshold_
ondecision_scores_
. Type
int, either 0 or 1

decision_function
(X)¶ Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

property
explained_variance_
The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
Decorator for scikit-learn PCA attributes.

property
explained_variance_ratio_
Percentage of variance explained by each of the selected components.
If
n_components
is not set then all components are stored and the sum of explained variances is equal to 1.0.
Decorator for scikit-learn PCA attributes.

fit
(X)¶ Fit detector.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The input samples.

property
mean_
Perfeature empirical mean, estimated from the training set.
Decorator for scikit-learn PCA attributes.

property
noise_variance_
¶ The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
Decorator for scikit-learn PCA attributes.

predict
(X)¶ Return outliers with -1 and inliers with 1, with the outlierness score calculated from the `decision_function(X)`, and the threshold `contamination`. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)

property
singular_values_
The singular values corresponding to each of the selected components. The singular values are equal to the 2norms of the
n_components
variables in the lower-dimensional space.
Decorator for scikit-learn PCA attributes.
algo.robustcovariance module¶

class
algo.robustcovariance.
RCOV
(store_precision=True, assume_centered=False, support_fraction=None, contamination=0.1, random_state=None)¶ Bases:
sklearn.covariance.elliptic_envelope.EllipticEnvelope
,pyodds.algo.base.Base
An object for detecting outliers in a Gaussian distributed dataset.
 Parameters
store_precision (boolean, optional (default=True)) – Specify if the estimated precision is stored.
assume_centered (boolean, optional (default=False)) – If True, the support of robust location and covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful to work with data whose mean is significantly equal to zero but is not exactly zero. If False, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.
support_fraction (float in (0., 1.), optional (default=None)) – The proportion of points to be included in the support of the raw MCD estimate. If None, the minimum value of support_fraction will be used within the algorithm: [n_samples + n_features + 1] / 2.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
random_state (int, RandomState instance or None, optional (default=None)) – The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

location_
¶ Estimated robust location
 Type
arraylike, shape (n_features,)

covariance_
¶ Estimated robust covariance matrix
 Type
arraylike, shape (n_features, n_features)

precision_
¶ Estimated pseudo inverse matrix. (stored only if store_precision is True)
 Type
arraylike, shape (n_features, n_features)

support_
¶ A mask of the observations that have been used to compute the robust estimates of location and shape.
 Type
arraylike, shape (n_samples,)

offset_
¶ Offset used to define the decision function from the raw scores. We have the relation:
decision_function = score_samples - offset_
. The offset depends on the contamination parameter and is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training. Type
float
Examples
>>> import numpy as np
>>> from sklearn.covariance import EllipticEnvelope
>>> true_cov = np.array([[.8, .3],
...                      [.3, .4]])
>>> X = np.random.RandomState(0).multivariate_normal(mean=[0, 0],
...                                                  cov=true_cov,
...                                                  size=500)
>>> cov = EllipticEnvelope(random_state=0).fit(X)
>>> # predict returns 1 for an inlier and -1 for an outlier
>>> cov.predict([[0, 0],
...              [3, 3]])
array([ 1, -1])
>>> cov.covariance_
array([[0.7411..., 0.2535...],
       [0.2535..., 0.3053...]])
>>> cov.location_
array([0.0813..., 0.0427...])
See also
EmpiricalCovariance
,MinCovDet
Notes
Outlier detection from covariance estimation may break or not perform well in high-dimensional settings. In particular, one will always take care to work with
n_samples > n_features ** 2
.
References
 1
Rousseeuw, P.J., Van Driessen, K. “A fast algorithm for the minimum covariance determinant estimator” Technometrics 41(3), 212 (1999)
algo.sod module¶

class
algo.sod.
SOD
(contamination=0.1, n_neighbors=20, ref_set=10, alpha=0.8)¶ Bases:
pyodds.algo.base.Base
Subspace outlier detection (SOD) schema aims to detect outliers in varying subspaces of a high dimensional feature space. For each data object, SOD explores the axis-parallel subspace spanned by the data object’s neighbors and determines how much the object deviates from the neighbors in this subspace.
 Parameters
n_neighbors (int, optional (default=20)) – Number of neighbors to use by default for k neighbors queries.
ref_set (int, optional (default=10)) – Specifies the number of shared nearest neighbors to create the reference set. Note that ref_set must be smaller than n_neighbors.
alpha (float in (0., 1.), optional (default=0.8)) – Specifies the lower limit for selecting subspace. 0.8 is set as default as suggested in the original paper.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

decision_scores_
¶ The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
 Type
numpy array of shape (n_samples,)

threshold_
¶ The threshold is based on
contamination
. It is then_samples * contamination
most abnormal samples indecision_scores_
. The threshold is calculated for generating binary outlier labels. Type
float

labels_
¶ The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying
threshold_
ondecision_scores_
. Type
int, either 0 or 1

decision_function
(X)¶ Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit
(X)¶ Fit detector. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)

predict
(X)¶ Return outliers with -1 and inliers with 1, with the outlierness score calculated from the `decision_function(X)`, and the threshold `contamination`. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)
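The subspace step described above can be illustrated for a single query point. This is a simplified sketch on a toy reference set; the package's exact variance criterion and normalization may differ:

```python
import numpy as np

# hypothetical reference set (shared nearest neighbors) of one query point
ref = np.array([[1.0, 0.0], [1.1, 5.0], [0.9, -4.0]])
alpha = 0.8
var = ref.var(axis=0)
# keep only low-variance attributes: the axis-parallel subspace in which
# the reference points agree
subspace = var < alpha * var.mean()
query = np.array([5.0, 0.3])
deviation = (query - ref.mean(axis=0)) ** 2
# deviation measured only in the selected subspace, normalized by its size
score = np.sqrt((deviation * subspace).sum() / max(subspace.sum(), 1))
```

Here the reference points agree tightly on the first attribute but scatter on the second, so only the first attribute enters the subspace, and the query's deviation there drives its score.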
algo.staticautoencoder module¶

class
algo.staticautoencoder.
StaticAutoEncoder
(hidden_neurons=None, epoch=100, dropout_rate=0.2, contamination=0.1, regularizer_weight=0.1, activation='relu', kernel_regularizer=0.01, loss_function='mse', optimizer='adam')¶ Bases:
pyodds.algo.base.Base

decision_function
(X)¶ Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
 Parameters
X (dataframe of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
 Returns
anomaly_scores – The anomaly score of the input samples.
 Return type
numpy array of shape (n_samples,)

fit
(X)¶ Fit detector. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)

predict
(X)¶ Return outliers with -1 and inliers with 1, with the outlierness score calculated from the `decision_function(X)`, and the threshold `contamination`. :param X: The input samples. :type X: dataframe of shape (n_samples, n_features)
 Returns
ranking – The outlierness of the input samples.
 Return type
numpy array of shape (n_samples,)


algo.staticautoencoder.
l21shrink
(epsilon, x)¶
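The name suggests the l2,1 shrinkage operator used in robust autoencoders. A common form of that operator (an assumption — the package's exact convention may differ) zeroes columns whose l2 norm is at most epsilon and scales the remaining columns toward zero:

```python
import numpy as np

def l21_shrink_sketch(epsilon, x):
    """Column-wise l2,1 shrinkage (hypothetical reimplementation):
    columns with l2 norm <= epsilon are set to zero; the rest are
    scaled by (1 - epsilon / norm)."""
    out = np.zeros_like(x, dtype=float)
    norms = np.linalg.norm(x, axis=0)
    keep = norms > epsilon
    out[:, keep] = x[:, keep] * (1.0 - epsilon / norms[keep])
    return out

A = np.array([[3.0, 0.1], [4.0, 0.1]])
S = l21_shrink_sketch(1.0, A)  # first column shrunk, second column zeroed
```

The first column has norm 5, so it survives scaled by 0.8; the second column's norm is below epsilon, so it is zeroed.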