Detectors

A Detector is just a collection of hyperparameters. Each detector implements a fit and a transform method: fit learns a model from training data, and transform uses a learned model to calculate outlier scores for new data. Detectors typically do not classify samples into inliers and outliers; that is why a DeterministicDetector wrapper is used to convert the raw scores into binary labels.
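To make the step from raw scores to binary labels concrete, here is a minimal sketch in plain Julia. It is not how DeterministicDetector is implemented; the helper name classify_scores, the outlier_fraction keyword, and the "normal"/"outlier" labels are assumptions chosen for illustration. It learns a quantile threshold from the training scores and applies it to new scores.

```julia
using Statistics: quantile

# Hypothetical helper (not part of the package): label every score above the
# (1 - outlier_fraction) quantile of the training scores as an outlier.
function classify_scores(train_scores::AbstractVector{<:Real},
                         test_scores::AbstractVector{<:Real};
                         outlier_fraction::Real = 0.1)
    # The threshold is learned from the training scores only.
    threshold = quantile(train_scores, 1 - outlier_fraction)
    return [s > threshold ? "outlier" : "normal" for s in test_scores]
end

train_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.3, 5.0]
labels = classify_scores(train_scores, [0.2, 7.5])
```

Because higher scores mean higher outlierness, a single upper threshold is enough to split the samples into two classes.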

Neighbor-based

`ABODDetector`

# OutlierDetectionNeighbors.ABODDetector — Type.

ABODDetector(k = 5,
             metric = Euclidean(),
             algorithm = :kdtree,
             static = :auto,
             leafsize = 10,
             reorder = true,
             parallel = false,
             enhanced = false)

Determine outliers based on the angles between an instance and its nearest neighbors. This implements the FastABOD variant described in [1]: it uses the variance of the angles to an instance's nearest neighbors rather than to the whole dataset.

Notice: The scores are inverted, to conform to our notion that higher scores describe higher outlierness.
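The idea can be sketched with a brute-force neighbor search in plain Julia. This is an illustrative simplification, not the package's implementation: the function name fastabod_scores is hypothetical, the variance is unweighted, and duplicate points (which would cause a division by zero) are not handled. Columns are treated as observations, as in the examples on this page.

```julia
using LinearAlgebra: dot, norm
using Statistics: var

# Brute-force FastABOD-style sketch; requires k >= 3 so that the variance is
# computed over more than one pair of neighbors.
function fastabod_scores(X::AbstractMatrix{<:Real}, k::Integer)
    n = size(X, 2)
    scores = zeros(n)
    for a in 1:n
        # Indices of the k nearest neighbors of point a, excluding a itself.
        dists = [norm(X[:, j] - X[:, a]) for j in 1:n]
        neighbors = filter(!=(a), sortperm(dists))[1:k]
        # Distance-weighted angle term for every pair of neighbors.
        terms = Float64[]
        for i in 1:k-1, j in i+1:k
            ab = X[:, neighbors[i]] - X[:, a]
            ac = X[:, neighbors[j]] - X[:, a]
            push!(terms, dot(ab, ac) / (dot(ab, ab) * dot(ac, ac)))
        end
        # A low variance indicates an outlier, so the sign is flipped to make
        # higher scores mean higher outlierness.
        scores[a] = -var(terms)
    end
    return scores
end

# A 5x5 grid of inliers plus one distant point in the last column.
X = hcat([Float64[i, j] for i in 0:4 for j in 0:4]..., [100.0, 100.0])
scores = fastabod_scores(X, 5)
```

The distant point sees all of its neighbors under nearly the same angle, so its angle variance is close to zero and its inverted score is the largest.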

Parameters

k::Integer

Number of neighbors (must be greater than 0).

metric::Metric

This is one of the Metric types defined in the Distances.jl package. It is possible to define your own metrics by creating new types that are subtypes of Metric.

algorithm::Symbol

One of (:kdtree, :balltree). In a kdtree, points are recursively split into groups using hyper-planes. Therefore a KDTree only works with axis aligned metrics which are: Euclidean, Chebyshev, Minkowski and Cityblock. A brutetree linearly searches all points in a brute force fashion and works with any Metric. A balltree recursively splits points into groups bounded by hyper-spheres and works with any Metric.

static::Union{Bool, Symbol}

One of (true, false, :auto). Whether the input data for fitting and transform should be statically or dynamically allocated. If true, the data is statically allocated. If false, the data is dynamically allocated. If :auto, the data is dynamically allocated if the product of all dimensions except the last is greater than 100.

leafsize::Int

Determines at what number of points to stop splitting the tree further. There is a trade-off between traversing the tree and having to evaluate the metric function for an increasing number of points.

reorder::Bool

While building the tree, points that are close in distance are placed close in memory, which helps with cache locality. In this case, a copy of the original data is made so that the original data is left unmodified. This can have a significant impact on performance and is set to true by default.

parallel::Bool

Parallelize score and predict using all threads available. The number of threads can be set with the JULIA_NUM_THREADS environment variable. Note: fit is not parallel.

enhanced::Bool

When enhanced=true, it uses the enhanced ABOD (EABOD) adaptation proposed by [2].
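The :auto rule of the static parameter can be written down as a small predicate. The helper below is hypothetical and only mirrors the rule as stated in the parameter description; it is not taken from the package source.

```julia
# Hypothetical helper: decide whether data would be statically allocated.
function uses_static_allocation(static::Union{Bool,Symbol}, X::AbstractArray)
    # Explicit true/false is returned unchanged; other symbols raise an error.
    static === :auto || return static::Bool
    # Product of all dimensions except the last (the observation dimension).
    feature_size = prod(size(X)[1:end-1]; init = 1)
    # Dynamic allocation is used when the product is greater than 100.
    return feature_size <= 100
end

small = uses_static_allocation(:auto, rand(10, 100))   # 10 features per observation
large = uses_static_allocation(:auto, rand(101, 5))    # 101 features per observation
```

With 10 features per observation the data would be statically allocated, while 101 features per observation exceed the limit of 100 and fall back to dynamic allocation.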

Examples

using OutlierDetection: ABODDetector, fit, transform
detector = ABODDetector()
X = rand(10, 100)
model, result = fit(detector, X; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Kriegel, Hans-Peter; Schubert, Matthias; Zimek, Arthur (2008): Angle-based outlier detection in high-dimensional data.

[2] Li, Xiaojie; Lv, Jian Cheng; Cheng, Dongdong (2015): Angle-Based Outlier Detection Algorithm with More Stable Relationships.

`COFDetector`

# OutlierDetectionNeighbors.COFDetector — Type.

COFDetector(k = 5,
            metric = Euclidean(),
            algorithm = :kdtree,
            leafsize = 10,
            reorder = true,
            parallel = false)

Local outlier density based on chaining distance between graphs of neighbors, as described in [1].

Parameters

k::Integer

Number of neighbors (must be greater than 0).

metric::Metric

This is one of the Metric types defined in the Distances.jl package. It is possible to define your own metrics by creating new types that are subtypes of Metric.

algorithm::Symbol

One of (:kdtree, :balltree). In a kdtree, points are recursively split into groups using hyper-planes. Therefore a KDTree only works with axis aligned metrics which are: Euclidean, Chebyshev, Minkowski and Cityblock. A brutetree linearly searches all points in a brute force fashion and works with any Metric. A balltree recursively splits points into groups bounded by hyper-spheres and works with any Metric.

static::Union{Bool, Symbol}

One of (true, false, :auto). Whether the input data for fitting and transform should be statically or dynamically allocated. If true, the data is statically allocated. If false, the data is dynamically allocated. If :auto, the data is dynamically allocated if the product of all dimensions except the last is greater than 100.

leafsize::Int

Determines at what number of points to stop splitting the tree further. There is a trade-off between traversing the tree and having to evaluate the metric function for an increasing number of points.

reorder::Bool

While building the tree, points that are close in distance are placed close in memory, which helps with cache locality. In this case, a copy of the original data is made so that the original data is left unmodified. This can have a significant impact on performance and is set to true by default.

parallel::Bool

Parallelize score and predict using all threads available. The number of threads can be set with the JULIA_NUM_THREADS environment variable. Note: fit is not parallel.
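As a rough illustration of what a parallel flag of this kind can do, the sketch below scores each observation independently and optionally distributes the loop over the available threads with Threads.@threads. The function score_all and its arguments are hypothetical and not part of the package.

```julia
# Hypothetical sketch: apply score_one to every column, optionally threaded.
function score_all(score_one, X::AbstractMatrix; parallel::Bool = false)
    scores = Vector{Float64}(undef, size(X, 2))
    if parallel
        # Each iteration writes to its own slot, so no locking is needed.
        Threads.@threads for j in 1:size(X, 2)
            scores[j] = score_one(view(X, :, j))
        end
    else
        for j in 1:size(X, 2)
            scores[j] = score_one(view(X, :, j))
        end
    end
    return scores
end

X = reshape(collect(1.0:12.0), 3, 4)
serial = score_all(sum, X)
threaded = score_all(sum, X; parallel = true)
```

The threaded loop only helps when Julia was started with more than one thread, for example by setting JULIA_NUM_THREADS before launching; Threads.nthreads() reports the current count.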

Examples

using OutlierDetection: COFDetector, fit, transform
detector = COFDetector()
X = rand(10, 100)
model, result = fit(detector, X; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Tang, Jian; Chen, Zhixiang; Fu, Ada Wai-Chee; Cheung, David Wai-Lok (2002): Enhancing Effectiveness of Outlier Detections for Low Density Patterns.

`DNNDetector`

# OutlierDetectionNeighbors.DNNDetector — Type.

DNNDetector(d = 0,
            metric = Euclidean(),
            algorithm = :kdtree,
            leafsize = 10,
            reorder = true,
            parallel = false)

Anomaly score based on the number of neighbors in a hypersphere of radius d. Knorr et al. [1] directly converted the resulting outlier scores to labels, thus this implementation does not fully reflect the approach from the paper.
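A brute-force sketch of this idea in plain Julia follows. The function name dnn_scores is hypothetical, and turning the neighbor count into a score by negating it is an assumption made here so that higher scores mean higher outlierness; the package may scale the count differently.

```julia
using LinearAlgebra: norm

# Hypothetical sketch: count the training points within radius d of each
# test point and negate the count, so sparse regions get higher scores.
function dnn_scores(X_train::AbstractMatrix, X_test::AbstractMatrix, d::Real)
    return [-count(c -> norm(c - x) <= d, eachcol(X_train))
            for x in eachcol(X_test)]
end

X = [0.0 0.1 0.2 5.0]          # four one-dimensional observations
scores = dnn_scores(X, X, 0.5)
```

When the training data itself is scored, as here, every point counts itself as a neighbor, so the isolated point at 5.0 still has a count of one.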

Parameters

d::Real

The hypersphere radius used to calculate the global density of an instance.

metric::Metric

This is one of the Metric types defined in the Distances.jl package. It is possible to define your own metrics by creating new types that are subtypes of Metric.

algorithm::Symbol

One of (:kdtree, :balltree). In a kdtree, points are recursively split into groups using hyper-planes. Therefore a KDTree only works with axis aligned metrics which are: Euclidean, Chebyshev, Minkowski and Cityblock. A brutetree linearly searches all points in a brute force fashion and works with any Metric. A balltree recursively splits points into groups bounded by hyper-spheres and works with any Metric.

static::Union{Bool, Symbol}

One of (true, false, :auto). Whether the input data for fitting and transform should be statically or dynamically allocated. If true, the data is statically allocated. If false, the data is dynamically allocated. If :auto, the data is dynamically allocated if the product of all dimensions except the last is greater than 100.

leafsize::Int

Determines at what number of points to stop splitting the tree further. There is a trade-off between traversing the tree and having to evaluate the metric function for an increasing number of points.

reorder::Bool

While building the tree, points that are close in distance are placed close in memory, which helps with cache locality. In this case, a copy of the original data is made so that the original data is left unmodified. This can have a significant impact on performance and is set to true by default.

parallel::Bool

Parallelize score and predict using all threads available. The number of threads can be set with the JULIA_NUM_THREADS environment variable. Note: fit is not parallel.

Examples

using OutlierDetection: DNNDetector, fit, transform
detector = DNNDetector()
X = rand(10, 100)
model, result = fit(detector, X; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Knorr, Edwin M.; Ng, Raymond T. (1998): Algorithms for Mining Distance-Based Outliers in Large Datasets.

`KNNDetector`

# OutlierDetectionNeighbors.KNNDetector — Type.

KNNDetector(k=5,
            metric=Euclidean(),
            algorithm=:kdtree,
            leafsize=10,
            reorder=true,
            reduction=:maximum)

Calculate the anomaly score of an instance based on the distance to its k-nearest neighbors.

Parameters

k::Integer

Number of neighbors (must be greater than 0).

metric::Metric

This is one of the Metric types defined in the Distances.jl package. It is possible to define your own metrics by creating new types that are subtypes of Metric.

algorithm::Symbol

One of (:kdtree, :balltree). In a kdtree, points are recursively split into groups using hyper-planes. Therefore a KDTree only works with axis aligned metrics which are: Euclidean, Chebyshev, Minkowski and Cityblock. A brutetree linearly searches all points in a brute force fashion and works with any Metric. A balltree recursively splits points into groups bounded by hyper-spheres and works with any Metric.

static::Union{Bool, Symbol}

One of (true, false, :auto). Whether the input data for fitting and transform should be statically or dynamically allocated. If true, the data is statically allocated. If false, the data is dynamically allocated. If :auto, the data is dynamically allocated if the product of all dimensions except the last is greater than 100.

leafsize::Int

Determines at what number of points to stop splitting the tree further. There is a trade-off between traversing the tree and having to evaluate the metric function for an increasing number of points.

reorder::Bool

While building the tree, points that are close in distance are placed close in memory, which helps with cache locality. In this case, a copy of the original data is made so that the original data is left unmodified. This can have a significant impact on performance and is set to true by default.

parallel::Bool

Parallelize score and predict using all threads available. The number of threads can be set with the JULIA_NUM_THREADS environment variable. Note: fit is not parallel.

reduction::Symbol

One of (:maximum, :median, :mean). Using the distance to the k-th nearest neighbor (reduction=:maximum) was proposed by [1]. Angiulli et al. [2] proposed summing the distances, but mean is implemented instead for numerical stability.
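The effect of the three reductions can be illustrated with a brute-force search in plain Julia. The function knn_scores below is a hypothetical sketch, not the package's tree-based implementation.

```julia
using LinearAlgebra: norm
using Statistics: mean, median

# Hypothetical sketch: reduce the distances to the k nearest training points.
function knn_scores(X_train::AbstractMatrix, X_test::AbstractMatrix, k::Integer;
                    reduction::Symbol = :maximum)
    reduce_fn = reduction === :maximum ? maximum :
                reduction === :median  ? median  :
                reduction === :mean    ? mean    :
                throw(ArgumentError("unknown reduction: $reduction"))
    return map(eachcol(X_test)) do x
        # Sorted distances from the test point to every training point.
        dists = sort!([norm(c - x) for c in eachcol(X_train)])
        reduce_fn(dists[1:k])
    end
end

X_train = [0.0 1.0 3.0 10.0]       # four one-dimensional observations
X_test = reshape([0.0], 1, 1)      # one test observation
max_score = knn_scores(X_train, X_test, 3)
median_score = knn_scores(X_train, X_test, 3; reduction = :median)
mean_score = knn_scores(X_train, X_test, 3; reduction = :mean)
```

The three nearest distances here are 0, 1 and 3, so the reductions give 3 (maximum), 1 (median) and 4/3 (mean). The mean differs from the sum only by the constant factor k and therefore ranks observations identically.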

Examples

using OutlierDetection: KNNDetector, fit, transform
detector = KNNDetector()
X = rand(10, 100)
model, result = fit(detector, X; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Ramaswamy, Sridhar; Rastogi, Rajeev; Shim, Kyuseok (2000): Efficient Algorithms for Mining Outliers from Large Data Sets.

[2] Angiulli, Fabrizio; Pizzuti, Clara (2002): Fast Outlier Detection in High Dimensional Spaces.

`LOFDetector`

# OutlierDetectionNeighbors.LOFDetector — Type.

LOFDetector(k = 5,
            metric = Euclidean(),
            algorithm = :kdtree,
            leafsize = 10,
            reorder = true,
            parallel = false)

Calculate an anomaly score based on the density of an instance in comparison to its neighbors. This algorithm introduced the notion of local outliers and was developed by Breunig et al., see [1].
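The density comparison can be sketched with a brute-force computation of the local outlier factor on the training data. The function lof_scores is hypothetical; it ignores ties in the k-distance neighborhood and does not handle duplicate points, so it should be read as an illustration of the definitions in [1] rather than as the package's implementation.

```julia
using LinearAlgebra: norm
using Statistics: mean

# Hypothetical brute-force LOF sketch; columns of X are observations.
function lof_scores(X::AbstractMatrix{<:Real}, k::Integer)
    n = size(X, 2)
    D = [norm(X[:, i] - X[:, j]) for i in 1:n, j in 1:n]
    # k nearest neighbors of each point, excluding the point itself.
    nbrs = [filter(!=(i), sortperm(D[:, i]))[1:k] for i in 1:n]
    # Distance of each point to its k-th nearest neighbor.
    kdist = [D[nbrs[i][end], i] for i in 1:n]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = [1 / mean(max(kdist[o], D[o, i]) for o in nbrs[i]) for i in 1:n]
    # LOF: mean density of the neighbors relative to the point's own density.
    return [mean(lrd[o] for o in nbrs[i]) / lrd[i] for i in 1:n]
end

# A 5x5 grid of inliers plus one distant point in the last column.
X = hcat([Float64[i, j] for i in 0:4 for j in 0:4]..., [100.0, 100.0])
scores = lof_scores(X, 5)
```

Points inside the grid have roughly the same density as their neighbors and obtain scores near 1, while the distant point is far less dense than its neighbors and obtains a much larger score.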

Parameters

k::Integer

Number of neighbors (must be greater than 0).

metric::Metric

This is one of the Metric types defined in the Distances.jl package. It is possible to define your own metrics by creating new types that are subtypes of Metric.

algorithm::Symbol

One of (:kdtree, :balltree). In a kdtree, points are recursively split into groups using hyper-planes. Therefore a KDTree only works with axis aligned metrics which are: Euclidean, Chebyshev, Minkowski and Cityblock. A brutetree linearly searches all points in a brute force fashion and works with any Metric. A balltree recursively splits points into groups bounded by hyper-spheres and works with any Metric.

static::Union{Bool, Symbol}

One of (true, false, :auto). Whether the input data for fitting and transform should be statically or dynamically allocated. If true, the data is statically allocated. If false, the data is dynamically allocated. If :auto, the data is dynamically allocated if the product of all dimensions except the last is greater than 100.

leafsize::Int

Determines at what number of points to stop splitting the tree further. There is a trade-off between traversing the tree and having to evaluate the metric function for an increasing number of points.

reorder::Bool

While building the tree, points that are close in distance are placed close in memory, which helps with cache locality. In this case, a copy of the original data is made so that the original data is left unmodified. This can have a significant impact on performance and is set to true by default.

parallel::Bool

Parallelize score and predict using all threads available. The number of threads can be set with the JULIA_NUM_THREADS environment variable. Note: fit is not parallel.

Examples

using OutlierDetection: LOFDetector, fit, transform
detector = LOFDetector()
X = rand(10, 100)
model, result = fit(detector, X; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Breunig, Markus M.; Kriegel, Hans-Peter; Ng, Raymond T.; Sander, Jörg (2000): LOF: Identifying Density-Based Local Outliers.

Network-based

Warning

The neural-network detectors are experimental and subject to change.

`AEDetector`

# OutlierDetectionNetworks.AEDetector — Type.

AEDetector(encoder= Chain(),
           decoder = Chain(),
           batchsize= 32,
           epochs = 1,
           shuffle = false,
           partial = true,
           opt = Adam(),
           loss = mse)

Calculate the anomaly score of an instance based on the reconstruction loss of an autoencoder, see [1] for an explanation of autoencoders.
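The scoring step can be illustrated without any neural-network library. In the sketch below, the encoder and decoder are hand-written stand-ins for trained networks and the function reconstruction_scores is hypothetical; it computes the mean squared reconstruction error per observation (column).

```julia
# Hypothetical sketch: mean squared reconstruction error per column.
function reconstruction_scores(encoder, decoder, X::AbstractMatrix)
    X_hat = decoder(encoder(X))
    return vec(sum(abs2, X .- X_hat; dims = 1) ./ size(X, 1))
end

# Stand-ins for trained networks: the "latent state" keeps only the first
# feature, and the decoder reconstructs the second feature as zero.
encoder(X) = X[1:1, :]
decoder(Z) = vcat(Z, zero(Z))

X = [1.0 2.0 3.0;
     0.0 0.0 4.0]
scores = reconstruction_scores(encoder, decoder, X)
```

The first two observations are reconstructed exactly and score 0, whereas the third deviates in the feature the stand-in model cannot represent and scores 8.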

Parameters

encoder::Chain

Transforms the input data into a latent state with a fixed shape.

decoder::Chain

Transforms the latent state back into the shape of the input data.

batchsize::Integer

The number of samples to work through before updating the internal model parameters.

epochs::Integer

The number of complete passes over the training dataset.

shuffle::Bool

If shuffle=true, shuffles the observations each time iterations are re-started, else no shuffling is performed.

partial::Bool

If partial=false, drops the last mini-batch if it is smaller than the batchsize.

opt::Any

Any Flux-compatible optimizer, typically a struct that holds all the optimizer parameters along with a definition of apply! that defines how to apply the update rule associated with the optimizer.

loss::Function

The loss function used to calculate the reconstruction error, see https://fluxml.ai/Flux.jl/stable/models/losses/ for examples.
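How batchsize and partial interact can be shown with index ranges alone. The helper minibatches below is a hypothetical sketch built on Iterators.partition; it is not the data loader used by the package.

```julia
# Hypothetical sketch: split n observations into mini-batches of indices.
function minibatches(n::Integer, batchsize::Integer; partial::Bool = true)
    batches = collect(Iterators.partition(1:n, batchsize))
    # With partial=false, drop the last mini-batch if it is incomplete.
    if !partial && !isempty(batches) && length(last(batches)) < batchsize
        pop!(batches)
    end
    return batches
end

with_partial = minibatches(10, 4)
without_partial = minibatches(10, 4; partial = false)
```

For 10 observations and a batch size of 4, partial=true yields the batches 1:4, 5:8 and 9:10, whereas partial=false drops the incomplete batch 9:10.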

Examples

using OutlierDetection: AEDetector, fit, transform
detector = AEDetector()
X = rand(10, 100)
model, result = fit(detector, X; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Aggarwal, Charu C. (2017): Outlier Analysis.

`DSADDetector`

# OutlierDetectionNetworks.DSADDetector — Type.

DSADDetector(encoder = Chain(),
                decoder = Chain(),
                batchsize = 32,
                epochs = 1,
                shuffle = true,
                partial = false,
                opt = Adam(),
                loss = mse,
                eta = 1,
                eps = 1e-6,
                callback = _ -> () -> ())

Deep Semi-Supervised Anomaly detection technique based on the distance to a hypersphere center as described in [1].
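The role of the eta and eps parameters documented below can be illustrated with a sketch of the loss from [1], written on precomputed squared distances to the hypersphere center. The function dsad_loss and the label encoding (0 for unlabeled, +1 for labeled normal, -1 for labeled anomaly) are assumptions made for this illustration; adding eps inside the power for all labeled samples is a simplification and not necessarily what the package does.

```julia
# Hypothetical sketch of a DeepSAD-style loss.
# dist2: squared distances of the latent representations to the center.
# y:     0 = unlabeled, +1 = labeled normal, -1 = labeled anomaly (assumed).
function dsad_loss(dist2::AbstractVector{<:Real}, y::AbstractVector{<:Integer};
                   eta::Real = 1, eps::Real = 1e-6)
    terms = map(dist2, y) do d, label
        # Unlabeled: plain squared distance. Labeled: distance raised to the
        # label, so anomalies are penalized by the inverse distance.
        label == 0 ? d : eta * (d + eps)^label
    end
    return sum(terms) / length(terms)
end

loss = dsad_loss([4.0, 1.0, 0.25], [0, 1, -1]; eta = 2, eps = 0)
```

In this example the unlabeled sample contributes 4, the labeled normal sample 2 * 1 and the labeled anomaly 2 * (1 / 0.25) = 8, giving a mean of 14/3. A labeled anomaly that sits exactly at the center would produce a division by zero without eps.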

Parameters

encoder::Chain

Transforms the input data into a latent state with a fixed shape.

decoder::Chain

Transforms the latent state back into the shape of the input data.

batchsize::Integer

The number of samples to work through before updating the internal model parameters.

epochs::Integer

The number of complete passes over the training dataset.

shuffle::Bool

If shuffle=true, shuffles the observations each time iterations are re-started, else no shuffling is performed.

partial::Bool

If partial=false, drops the last mini-batch if it is smaller than the batchsize.

opt::Any

Any Flux-compatible optimizer, typically a struct that holds all the optimizer parameters along with a definition of apply! that defines how to apply the update rule associated with the optimizer.

loss::Function

The loss function used to calculate the reconstruction error, see https://fluxml.ai/Flux.jl/stable/models/losses/ for examples.

eta::Real

Weighting parameter for the labeled data: higher values of eta assign higher weight to labeled data in the SVDD loss function. For a sensitivity analysis of this parameter, see [1].

eps::Real

Because the inverse distance used in the SVDD loss can lead to division by zero, the parameter eps is added for numerical stability.

callback::Function

Experimental parameter that might change. A function to be called after the model parameters have been updated that can call Flux's callback helpers, see https://fluxml.ai/Flux.jl/stable/utilities/#Callback-Helpers-1.

Notice: The parameters batchsize, epochs, shuffle, partial, opt and callback can also be tuples of size 2, specifying the corresponding values for (1) pretraining and (2) training; otherwise the same values are used for pretraining and training.
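One way to picture this convention is a helper that expands a setting into a pair of values. The function split_phases is hypothetical and only illustrates the rule stated in the notice.

```julia
# Hypothetical helper: expand a setting into (pretraining, training) values.
split_phases(x::Tuple{Any,Any}) = x   # a tuple of size 2 is used as given
split_phases(x) = (x, x)              # a single value is used for both phases

pretrain_epochs, train_epochs = split_phases((10, 50))
pretrain_batch, train_batch = split_phases(32)
```

Under this reading, epochs = (10, 50) would mean 10 pretraining epochs followed by 50 training epochs, while batchsize = 32 applies to both phases.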

Examples

using OutlierDetection: DSADDetector, fit, transform
detector = DSADDetector()
X = rand(10, 100)
y = rand([-1,1], 100)
model, result = fit(detector, X, y; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Ruff, Lukas; Vandermeulen, Robert A.; Görnitz, Nico; Binder, Alexander; Müller, Emmanuel; Müller, Klaus-Robert; Kloft, Marius (2019): Deep Semi-Supervised Anomaly Detection.

`ESADDetector`

# OutlierDetectionNetworks.ESADDetector — Type.

ESADDetector(encoder = Chain(),
            decoder = Chain(),
            batchsize = 32,
            epochs = 1,
            shuffle = false,
            partial = true,
            opt = Adam(),
            λ1 = 1,
            λ2 = 1,
            noise = identity)

End-to-End semi-supervised anomaly detection algorithm similar to DeepSAD, but without the pretraining phase. The algorithm was published by Huang et al., see [1].

Parameters

encoder::Chain

Transforms the input data into a latent state with a fixed shape.

decoder::Chain

Transforms the latent state back into the shape of the input data.

batchsize::Integer

The number of samples to work through before updating the internal model parameters.

epochs::Integer

The number of complete passes over the training dataset.

shuffle::Bool

If shuffle=true, shuffles the observations each time iterations are re-started, else no shuffling is performed.

partial::Bool

If partial=false, drops the last mini-batch if it is smaller than the batchsize.

opt::Any

Any Flux-compatible optimizer, typically a struct that holds all the optimizer parameters along with a definition of apply! that defines how to apply the update rule associated with the optimizer.

λ1::Real

Weighting parameter of the norm loss, which minimizes the empirical variance and thus minimizes entropy.

λ2::Real

Weighting parameter of the assistant loss function to define the consistency between the two encoders.

noise::Function (AbstractArray{T} -> AbstractArray{T})

A function to be applied to a batch of input data to add noise, see [1] for an explanation.
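A function satisfying the AbstractArray{T} -> AbstractArray{T} contract could look like the following sketch, which adds Gaussian noise while preserving the shape and the floating-point element type of the batch. The constructor gaussian_noise and the standard deviation 0.1 are assumptions for illustration; the default shown in the signature above is identity, which adds no noise.

```julia
# Hypothetical noise function: additive Gaussian noise with standard
# deviation σ, keeping the element type of a floating-point batch.
gaussian_noise(σ::Real) =
    x -> x .+ convert(eltype(x), σ) .* randn(eltype(x), size(x))

noise = gaussian_noise(0.1)
batch = rand(Float32, 10, 32)
noisy = noise(batch)
```

A closure like this could then be passed through the noise keyword listed in the signature above.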

Examples

using OutlierDetection: ESADDetector, fit, transform
detector = ESADDetector()
X = rand(10, 100)
y = rand([-1,1], 100)
model, result = fit(detector, X, y; verbosity=0)
test_scores = transform(detector, model, X)

References

[1] Huang, Chaoqin; Ye, Fei; Zhang, Ya; Wang, Yan-Feng; Tian, Qi (2020): ESAD: End-to-end Deep Semi-supervised Anomaly Detection.

Python-based

Using PyCall, existing Python outlier detection algorithms can be integrated easily. Currently, almost every PyOD algorithm is integrated and can be used directly from Julia.

`ABODDetector`

# OutlierDetectionPython.ABODDetector — Type.

ABODDetector(n_neighbors = 5,
                method = "fast")

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.abod

`CBLOFDetector`

# OutlierDetectionPython.CBLOFDetector — Type.

CBLOFDetector(n_clusters = 8,
                 alpha = 0.9,
                 beta = 5,
                 use_weights = false,
                 random_state = nothing,
                 n_jobs = 1)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.cblof

`CDDetector`

# OutlierDetectionPython.CDDetector — Type.

CDDetector(whitening = true,
              rule_of_thumb = false)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.cd

`COFDetector`

# OutlierDetectionPython.COFDetector — Type.

COFDetector(n_neighbors = 5,
               method="fast")

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.cof

`COPODDetector`

# OutlierDetectionPython.COPODDetector — Type.

COPODDetector(n_jobs = 1)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.copod

`ECODDetector`

# OutlierDetectionPython.ECODDetector — Type.

ECODDetector(n_jobs = 1)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.ecod

`GMMDetector`

# OutlierDetectionPython.GMMDetector — Type.

GMMDetector(n_components=1,
               covariance_type="full",
               tol=0.001,
               reg_covar=1e-06,
               max_iter=100,
               n_init=1,
               init_params="kmeans",
               weights_init=nothing,
               means_init=nothing,
               precisions_init=nothing,
               random_state=nothing,
               warm_start=false)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.gmm

`HBOSDetector`

# OutlierDetectionPython.HBOSDetector — Type.

HBOSDetector(n_bins = 10,
                alpha = 0.1,
                tol = 0.5)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.hbos

`IForestDetector`

# OutlierDetectionPython.IForestDetector — Type.

IForestDetector(n_estimators = 100,
                   max_samples = "auto",
                   max_features = 1.0,
                   bootstrap = false,
                   random_state = nothing,
                   verbose = 0,
                   n_jobs = 1)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.iforest

`INNEDetector`

# OutlierDetectionPython.INNEDetector — Type.

INNEDetector(n_estimators=200,
                max_samples="auto",
                random_state=nothing)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.inne

`KDEDetector`

# OutlierDetectionPython.KDEDetector — Type.

KDEDetector(bandwidth=1.0,
               algorithm="auto",
               leaf_size=30,
               metric="minkowski",
               metric_params=nothing)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.kde

`KNNDetector`

# OutlierDetectionPython.KNNDetector — Type.

KNNDetector(n_neighbors = 5,
               method = "largest",
               radius = 1.0,
               algorithm = "auto",
               leaf_size = 30,
               metric = "minkowski",
               p = 2,
               metric_params = nothing,
               n_jobs = 1)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.knn

`LMDDDetector`

# OutlierDetectionPython.LMDDDetector — Type.

LMDDDetector(n_iter = 50,
                dis_measure = "aad",
                random_state = nothing)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.lmdd

`LODADetector`

# OutlierDetectionPython.LODADetector — Type.

LODADetector(n_bins = 10,
                n_random_cuts = 100)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.loda

`LOFDetector`

# OutlierDetectionPython.LOFDetector — Type.

LOFDetector(n_neighbors = 5,
               algorithm = "auto",
               leaf_size = 30,
               metric = "minkowski",
               p = 2,
               metric_params = nothing,
               n_jobs = 1,
               novelty = true)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.lof

`LOCIDetector`

# OutlierDetectionPython.LOCIDetector — Type.

LOCIDetector(alpha = 0.5,
                k = 3)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.loci

`MCDDetector`

# OutlierDetectionPython.MCDDetector — Type.

MCDDetector(store_precision = true,
               assume_centered = false,
               support_fraction = nothing,
               random_state = nothing)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.mcd

`OCSVMDetector`

# OutlierDetectionPython.OCSVMDetector — Type.

OCSVMDetector(kernel = "rbf",
                 degree = 3,
                 gamma = "auto",
                 coef0 = 0.0,
                 tol = 0.001,
                 nu = 0.5,
                 shrinking = true,
                 cache_size = 200,
                 verbose = false,
                 max_iter = -1)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.ocsvm

`PCADetector`

# OutlierDetectionPython.PCADetector — Type.

PCADetector(n_components = nothing,
               n_selected_components = nothing,
               copy = true,
               whiten = false,
               svd_solver = "auto",
               tol = 0.0,
               iterated_power = "auto",
               standardization = true,
               weighted = true,
               random_state = nothing)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.pca

`RODDetector`

# OutlierDetectionPython.RODDetector — Type.

RODDetector(parallel_execution = false)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.rod

`SODDetector`

# OutlierDetectionPython.SODDetector — Type.

SODDetector(n_neighbors = 5,
               ref_set = 10,
               alpha = 0.8)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.sod

`SOSDetector`

# OutlierDetectionPython.SOSDetector — Type.

SOSDetector(perplexity = 4.5,
               metric = "minkowski",
               eps = 1e-5)

https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.sos