Advanced Usage
The simple usage guide covered how to use and optimize an existing outlier detection model. Sometimes, however, it is necessary to combine the results of multiple models or to create entirely new ones.
Working with scores
An outlier detection model, whether supervised or unsupervised, typically assigns an outlier score to each datapoint. We further differentiate between outlier scores obtained during training and testing. Because both train and test scores are essential for further score processing, e.g. converting the scores to classes, we provide an augmented_transform that returns a tuple of train and test scores.
using MLJ, OutlierDetection
using OutlierDetectionData: ODDS
X, y = ODDS.load("annthyroid")
train, test = partition(eachindex(y), 0.5, shuffle=true, stratify=y, rng=0)
KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors verbosity=0
knn = KNN()
KNNDetector(
k = 5,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
static = :auto,
leafsize = 10,
reorder = true,
parallel = false,
reduction = :maximum)
Let's bind the detector to data and perform an augmented_transform.
mach = machine(knn, X, y)
fit!(mach, rows=train)
scores = augmented_transform(mach, rows=test)
scores_train, scores_test = scores
([0.015809329524050033, 0.01227884359375915, 0.0459156835950419, 0.020099952736262826, 0.013580868897091973, 0.021063000735887565, 0.014748030376968972, 0.012825447360618655, 0.03674629232997528, 0.005899999999999996 … 0.01025134137564445, 0.01916101249934356, 0.01497412434835507, 0.015076140089558737, 0.01764709607839205, 0.06715745751590065, 0.014039804129687852, 0.010630785483678995, 0.02923597783553682, 0.02754246902512554], [0.007383319036855991, 0.012256920494153502, 0.017696609844826204, 0.024054440338532098, 0.015375304875026054, 0.023503616742961086, 0.01673977598416418, 0.010000000000000009, 0.028750652166516153, 0.008564864272129474 … 0.012658597868642494, 0.010416544532617329, 0.017795867497820923, 0.04766550115125195, 0.012879689437249653, 0.021236292049225534, 0.013329906226226798, 0.03016661068134767, 0.006801698317332226, 0.10986355173577815])
We split the data into 50% train and 50% test, thus scores_train and scores_test should contain an equal number of scores.
scores_train
3600-element Vector{Float64}:
0.015809329524050033
0.01227884359375915
0.0459156835950419
0.020099952736262826
0.013580868897091973
0.021063000735887565
0.014748030376968972
0.012825447360618655
0.03674629232997528
0.005899999999999996
⋮
0.01916101249934356
0.01497412434835507
0.015076140089558737
0.01764709607839205
0.06715745751590065
0.014039804129687852
0.010630785483678995
0.02923597783553682
0.02754246902512554
scores_test
3600-element Vector{Float64}:
0.007383319036855991
0.012256920494153502
0.017696609844826204
0.024054440338532098
0.015375304875026054
0.023503616742961086
0.01673977598416418
0.010000000000000009
0.028750652166516153
0.008564864272129474
⋮
0.010416544532617329
0.017795867497820923
0.04766550115125195
0.012879689437249653
0.021236292049225534
0.013329906226226798
0.03016661068134767
0.006801698317332226
0.10986355173577815
OutlierDetection.jl provides many helper functions to work with scores; see the score helpers. The fundamental datatype for working with scores is a tuple of train/test scores, and all helper functions work with this datatype. An example of such a helper function is scale_minmax, which scales the scores to lie between 0 and 1 using min-max scaling.
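Conceptually, min-max scaling maps the smallest value to 0 and the largest to 1. A plain-Julia sketch of the idea on a single score vector (a simplification; the library function operates on the whole train/test tuple):

```julia
# Simplified sketch of min-max scaling for one score vector
minmax_scale(v) = (v .- minimum(v)) ./ (maximum(v) - minimum(v))

minmax_scale([2.0, 4.0, 6.0])  # [0.0, 0.5, 1.0]
```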
last(scores |> scale_minmax)
3600-element Vector{Float64}:
0.009915879968108363
0.02187720506179054
0.03522788490341164
0.05083196703693757
0.029530684664811027
0.04948007559615709
0.032879518717070275
0.01633802423674759
0.062357923124135045
0.0128157573577204
⋮
0.01736035333244564
0.0354714938783983
0.10878081177982465
0.023405672633850193
0.0439153596479345
0.02451064385949808
0.0658331231919836
0.008488402861413877
0.2614340621027081
Another example is classify_quantile, which transforms scores to classes. We display only the test results using the last element of the tuple.
last(scores |> classify_quantile(0.9))
3600-element Vector{String}:
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
⋮
"normal"
"normal"
"outlier"
"normal"
"normal"
"normal"
"normal"
"normal"
"outlier"
Sometimes it's also necessary to combine scores from multiple detectors, which can be achieved, for example, with combine_mean.
combine_mean(scores, scores) == scores
true
We can see that combine_mean can work with multiple train/test tuples and combine them into one final tuple. In this case the resulting tuple consists of the means of the individual train and test score vectors.
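To build intuition, mean-combination averages the train vectors and the test vectors elementwise across the tuples. A plain-Julia sketch (not the library implementation):

```julia
using Statistics: mean

# Two hypothetical train/test score tuples from different detectors
t1 = ([0.2, 0.4], [0.6, 0.8])
t2 = ([0.4, 0.6], [0.2, 0.4])

# Elementwise mean of the train vectors and of the test vectors
combined = (mean([first(t1), first(t2)]), mean([last(t1), last(t2)]))
# ≈ ([0.3, 0.5], [0.4, 0.6])
```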
Combining models
We typically want probabilistic or deterministic predictions instead of raw scores. Using a ProbabilisticDetector or DeterministicDetector, we can simply wrap a detector to enable such predictions. Both wrappers are designed to work with multiple models and combine them into one probabilistic or deterministic result. When using multiple models, we have to provide them as keyword arguments as follows.
knn = ProbabilisticDetector(knn1=KNN(k=5), knn2=KNN(k=10),
normalize=scale_minmax,
combine=combine_mean)
ProbabilisticUnsupervisedCompositeDetector(
normalize = OutlierDetection.scale_minmax,
combine = OutlierDetection.combine_mean,
knn1 = KNNDetector(
k = 5,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
static = :auto,
leafsize = 10,
reorder = true,
parallel = false,
reduction = :maximum),
knn2 = KNNDetector(
k = 10,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
static = :auto,
leafsize = 10,
reorder = true,
parallel = false,
reduction = :maximum))
As you can see, we additionally provided explicit arguments to normalize and combine, which take functions used for score normalization and combination. These are the defaults, so we could have left them unspecified and achieved the same result. The scores are always normalized before they are combined. Note that any function that maps a train/test score tuple to a score tuple with values in the range [0, 1] works for normalization; for example, if the scores already lie in [0, 1], we could simply pass the identity function. Let's see the predictions of the defined detector.
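For instance, assuming a detector whose raw scores already lie in [0, 1], a wrapper call could skip rescaling by passing identity (an illustrative sketch only; KNN scores are distances and do not satisfy this assumption):

```julia
# Hypothetical: pass `identity` as the normalization function when
# scores are already in [0, 1]; here only to illustrate the keyword.
knn_id = ProbabilisticDetector(knn1=KNN(k=5), knn2=KNN(k=10),
                               normalize=identity,
                               combine=combine_mean)
```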
mach = machine(knn, X, y)
fit!(mach, rows=train)
predict(mach, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.991, outlier=>0.00914)
UnivariateFinite{OrderedFactor{2}}(normal=>0.978, outlier=>0.0224)
UnivariateFinite{OrderedFactor{2}}(normal=>0.963, outlier=>0.0365)
UnivariateFinite{OrderedFactor{2}}(normal=>0.953, outlier=>0.0465)
UnivariateFinite{OrderedFactor{2}}(normal=>0.975, outlier=>0.0247)
UnivariateFinite{OrderedFactor{2}}(normal=>0.951, outlier=>0.0485)
UnivariateFinite{OrderedFactor{2}}(normal=>0.966, outlier=>0.0344)
UnivariateFinite{OrderedFactor{2}}(normal=>0.989, outlier=>0.0114)
UnivariateFinite{OrderedFactor{2}}(normal=>0.942, outlier=>0.0578)
UnivariateFinite{OrderedFactor{2}}(normal=>0.989, outlier=>0.0115)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.985, outlier=>0.0151)
UnivariateFinite{OrderedFactor{2}}(normal=>0.97, outlier=>0.03)
UnivariateFinite{OrderedFactor{2}}(normal=>0.894, outlier=>0.106)
UnivariateFinite{OrderedFactor{2}}(normal=>0.98, outlier=>0.0201)
UnivariateFinite{OrderedFactor{2}}(normal=>0.956, outlier=>0.0444)
UnivariateFinite{OrderedFactor{2}}(normal=>0.974, outlier=>0.0262)
UnivariateFinite{OrderedFactor{2}}(normal=>0.939, outlier=>0.0606)
UnivariateFinite{OrderedFactor{2}}(normal=>0.991, outlier=>0.00866)
UnivariateFinite{OrderedFactor{2}}(normal=>0.748, outlier=>0.252)
Pretty simple, huh?
Learning networks
Sometimes we need more flexibility to define outlier detection models. Unfortunately, MLJ's linear pipelines are not yet usable for outlier detection models, thus we need to define our learning networks manually. Let's, for example, create a machine that standardizes the input features before applying the detector.
Xs, ys = source(X), source(y)
Xstd = transform(machine(Standardizer(), Xs), Xs)
ŷ = predict(machine(knn, Xstd), Xstd)
knn_std = machine(ProbabilisticUnsupervisedDetector(), Xs, ys; predict=ŷ)
Machine{ProbabilisticUnsupervisedDetectorSurrogate,…} trained 0 times; does not cache data
model: MLJBase.ProbabilisticUnsupervisedDetectorSurrogate
args:
1: Source @499 ⏎ `Table{AbstractVector{Continuous}}`
2: Source @259 ⏎ `AbstractVector{OrderedFactor{2}}`
We can fit! and predict with the resulting model as usual.
fit!(knn_std, rows=train)
predict(knn_std, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.988, outlier=>0.0116)
UnivariateFinite{OrderedFactor{2}}(normal=>0.977, outlier=>0.0229)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0359)
UnivariateFinite{OrderedFactor{2}}(normal=>0.958, outlier=>0.0417)
UnivariateFinite{OrderedFactor{2}}(normal=>0.971, outlier=>0.029)
UnivariateFinite{OrderedFactor{2}}(normal=>0.945, outlier=>0.0551)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0362)
UnivariateFinite{OrderedFactor{2}}(normal=>0.997, outlier=>0.00341)
UnivariateFinite{OrderedFactor{2}}(normal=>0.937, outlier=>0.0629)
UnivariateFinite{OrderedFactor{2}}(normal=>0.975, outlier=>0.0247)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.986, outlier=>0.0142)
UnivariateFinite{OrderedFactor{2}}(normal=>0.955, outlier=>0.0451)
UnivariateFinite{OrderedFactor{2}}(normal=>0.888, outlier=>0.112)
UnivariateFinite{OrderedFactor{2}}(normal=>0.961, outlier=>0.0386)
UnivariateFinite{OrderedFactor{2}}(normal=>0.962, outlier=>0.0376)
UnivariateFinite{OrderedFactor{2}}(normal=>0.97, outlier=>0.0301)
UnivariateFinite{OrderedFactor{2}}(normal=>0.953, outlier=>0.0471)
UnivariateFinite{OrderedFactor{2}}(normal=>0.987, outlier=>0.0129)
UnivariateFinite{OrderedFactor{2}}(normal=>0.786, outlier=>0.214)
Note that we supplied labels ys to an unsupervised algorithm; this is not necessary if you just want to predict, but it is necessary if you want to evaluate the resulting learning network. We can easily export such a learning network as a model with @from_network.
@from_network knn_std mutable struct StandardizedKNN end
Furthermore, if the goal is to create a standalone model from a network, we could use empty sources (source()) for Xs and ys. The standalone model can be bound to data again like any other model.
knn_std = machine(StandardizedKNN(), X, y)
fit!(knn_std, rows=train)
predict(knn_std, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.988, outlier=>0.0116)
UnivariateFinite{OrderedFactor{2}}(normal=>0.977, outlier=>0.0229)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0359)
UnivariateFinite{OrderedFactor{2}}(normal=>0.958, outlier=>0.0417)
UnivariateFinite{OrderedFactor{2}}(normal=>0.971, outlier=>0.029)
UnivariateFinite{OrderedFactor{2}}(normal=>0.945, outlier=>0.0551)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0362)
UnivariateFinite{OrderedFactor{2}}(normal=>0.997, outlier=>0.00341)
UnivariateFinite{OrderedFactor{2}}(normal=>0.937, outlier=>0.0629)
UnivariateFinite{OrderedFactor{2}}(normal=>0.975, outlier=>0.0247)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.986, outlier=>0.0142)
UnivariateFinite{OrderedFactor{2}}(normal=>0.955, outlier=>0.0451)
UnivariateFinite{OrderedFactor{2}}(normal=>0.888, outlier=>0.112)
UnivariateFinite{OrderedFactor{2}}(normal=>0.961, outlier=>0.0386)
UnivariateFinite{OrderedFactor{2}}(normal=>0.962, outlier=>0.0376)
UnivariateFinite{OrderedFactor{2}}(normal=>0.97, outlier=>0.0301)
UnivariateFinite{OrderedFactor{2}}(normal=>0.953, outlier=>0.0471)
UnivariateFinite{OrderedFactor{2}}(normal=>0.987, outlier=>0.0129)
UnivariateFinite{OrderedFactor{2}}(normal=>0.786, outlier=>0.214)
There might be occasions where our ProbabilisticDetector or DeterministicDetector wrappers are not flexible enough. In such cases we can use augmented_transform directly in our learning networks together with a ProbabilisticTransformer or DeterministicTransformer, which take one or more train/test tuples as input and return probabilistic or deterministic predictions.
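Sketched as a learning network, this could look as follows (an untested outline; it assumes the transformer machine accepts the nodes returned by augmented_transform as arguments):

```julia
# Sketch: feed raw train/test score tuples from two detectors into a
# ProbabilisticTransformer for full control over the network.
Xs = source(X)
scores1 = augmented_transform(machine(KNN(k=5), Xs), Xs)
scores2 = augmented_transform(machine(KNN(k=10), Xs), Xs)
ŷ = transform(machine(ProbabilisticTransformer()), scores1, scores2)
```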
Implementing models
Learning networks let us flexibly create complex combinations of existing models. Sometimes, however, it's necessary to develop new outlier detection models for specific tasks. OutlierDetection.jl builds on top of MLJ and provides a simple interface defining how an outlier detection algorithm can be implemented. Let's first import the interface and the packages relevant to our new algorithm.
import OutlierDetectionInterface
const OD = OutlierDetectionInterface
using Statistics: mean
using LinearAlgebra: norm
Our proposed algorithm calculates a central point from the training data and defines an outlier as a point that lies far away from that center. The only hyperparameter is p, which specifies the p-norm used to calculate the distance. Using @detector, which replicates @mlj_model, we can define our detector struct with macro-generated keyword arguments and default values.
OD.@detector mutable struct SimpleDetector <: OD.UnsupervisedDetector
p::Float64 = 2
end
A DetectorModel then defines the learned parameters of our model. In this case the only learned parameter is the center.
struct SimpleModel <: OD.DetectorModel
center::AbstractArray{<:Real}
end
Let's further define a helper function to calculate the distance from the center.
function distances_from(center, vectors::AbstractMatrix, p)
deviations = vectors .- center
return [norm(deviations[:, i], p) for i in 1:size(deviations, 2)]
end
distances_from (generic function with 1 method)
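A quick sanity check of the helper on a tiny column-major matrix (columns are points):

```julia
using LinearAlgebra: norm

# Same helper as above, repeated so the snippet is self-contained
function distances_from(center, vectors::AbstractMatrix, p)
    deviations = vectors .- center
    return [norm(deviations[:, i], p) for i in 1:size(deviations, 2)]
end

# Two 2-d points as columns: (3, 4) and (0, 0), center at the origin
distances_from([0.0, 0.0], [3.0 0.0; 4.0 0.0], 2)  # [5.0, 0.0]
```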
Finally, we implement the two methods required for a detector, namely fit and transform. Please refer to the Key Concepts to learn more about the involved methods and types.
function OD.fit(detector::SimpleDetector, X::OD.Data; verbosity)::OD.Fit
center = mean(X, dims=2)
training_scores = distances_from(center, X, detector.p)
return SimpleModel(center), training_scores
end
function OD.transform(detector::SimpleDetector, model::SimpleModel, X::OD.Data)::OD.Scores
distances_from(model.center, X, detector.p)
end
Using a data-frontend, we can make sure that MLJ internally transforms input data to Data, which refers to column-major Julia arrays with the last dimension representing an example. Registering that frontend can be achieved with @default_frontend.
OD.@default_frontend SimpleDetector
Again, we can simply wrap our detector in a ProbabilisticDetector to enable probabilistic predictions.
sd = machine(ProbabilisticDetector(SimpleDetector()), X, y)
fit!(sd, rows=train)
predict(sd, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.903, outlier=>0.0972)
UnivariateFinite{OrderedFactor{2}}(normal=>0.874, outlier=>0.126)
UnivariateFinite{OrderedFactor{2}}(normal=>0.917, outlier=>0.0826)
UnivariateFinite{OrderedFactor{2}}(normal=>0.564, outlier=>0.436)
UnivariateFinite{OrderedFactor{2}}(normal=>0.67, outlier=>0.33)
UnivariateFinite{OrderedFactor{2}}(normal=>0.531, outlier=>0.469)
UnivariateFinite{OrderedFactor{2}}(normal=>0.832, outlier=>0.168)
UnivariateFinite{OrderedFactor{2}}(normal=>0.729, outlier=>0.271)
UnivariateFinite{OrderedFactor{2}}(normal=>0.677, outlier=>0.323)
UnivariateFinite{OrderedFactor{2}}(normal=>0.688, outlier=>0.312)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.874, outlier=>0.126)
UnivariateFinite{OrderedFactor{2}}(normal=>0.742, outlier=>0.258)
UnivariateFinite{OrderedFactor{2}}(normal=>0.816, outlier=>0.184)
UnivariateFinite{OrderedFactor{2}}(normal=>0.713, outlier=>0.287)
UnivariateFinite{OrderedFactor{2}}(normal=>0.564, outlier=>0.436)
UnivariateFinite{OrderedFactor{2}}(normal=>0.932, outlier=>0.0679)
UnivariateFinite{OrderedFactor{2}}(normal=>0.631, outlier=>0.369)
UnivariateFinite{OrderedFactor{2}}(normal=>0.526, outlier=>0.474)
UnivariateFinite{OrderedFactor{2}}(normal=>0.408, outlier=>0.592)
Remember: Your feedback and contributions are extremely welcome; join us on GitHub or in #outlierdetection on Slack and get involved.