Advanced Usage
The simple usage guide covered how to use and optimize an existing outlier detection model. Sometimes, however, it is necessary to combine the results of multiple models or to create entirely new ones.
Working with scores
An outlier detection model, whether supervised or unsupervised, typically assigns an outlier score to each datapoint. We further differentiate between outlier scores obtained during training and testing. Because both train and test scores are essential for further score processing, e.g. converting the scores to classes, we provide an augmented_transform that returns a tuple of train and test scores.
using MLJ, OutlierDetection
using OutlierDetectionData: ODDS
X, y = ODDS.load("annthyroid")
train, test = partition(eachindex(y), 0.5, shuffle=true, stratify=y, rng=0)
KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors verbosity=0
knn = KNN()
KNNDetector(
k = 5,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
static = :auto,
leafsize = 10,
reorder = true,
parallel = false,
reduction = :maximum)
Let's bind the detector to data and perform an augmented_transform.
mach = machine(knn, X, y)
fit!(mach, rows=train)
scores = augmented_transform(mach, rows=test)
scores_train, scores_test = scores
([0.015809329524050033, 0.01227884359375915, 0.0459156835950419, 0.020099952736262826, 0.013580868897091973, 0.021063000735887565, 0.014748030376968972, 0.012825447360618655, 0.03674629232997528, 0.005899999999999996 … 0.01025134137564445, 0.01916101249934356, 0.01497412434835507, 0.015076140089558737, 0.01764709607839205, 0.06715745751590065, 0.014039804129687852, 0.010630785483678995, 0.02923597783553682, 0.02754246902512554], [0.007383319036855991, 0.012256920494153502, 0.017696609844826204, 0.024054440338532098, 0.015375304875026054, 0.023503616742961086, 0.01673977598416418, 0.010000000000000009, 0.028750652166516153, 0.008564864272129474 … 0.012658597868642494, 0.010416544532617329, 0.017795867497820923, 0.04766550115125195, 0.012879689437249653, 0.021236292049225534, 0.013329906226226798, 0.03016661068134767, 0.006801698317332226, 0.10986355173577815])
We split the data into 50% train and 50% test, thus scores_train and scores_test should contain an equal number of scores.
scores_train
3600-element Vector{Float64}:
0.015809329524050033
0.01227884359375915
0.0459156835950419
0.020099952736262826
0.013580868897091973
0.021063000735887565
0.014748030376968972
0.012825447360618655
0.03674629232997528
0.005899999999999996
⋮
0.01916101249934356
0.01497412434835507
0.015076140089558737
0.01764709607839205
0.06715745751590065
0.014039804129687852
0.010630785483678995
0.02923597783553682
0.02754246902512554
scores_test
3600-element Vector{Float64}:
0.007383319036855991
0.012256920494153502
0.017696609844826204
0.024054440338532098
0.015375304875026054
0.023503616742961086
0.01673977598416418
0.010000000000000009
0.028750652166516153
0.008564864272129474
⋮
0.010416544532617329
0.017795867497820923
0.04766550115125195
0.012879689437249653
0.021236292049225534
0.013329906226226798
0.03016661068134767
0.006801698317332226
0.10986355173577815
OutlierDetection.jl provides many helper functions to work with scores; see the score helpers. The fundamental datatype for working with scores is a tuple of train/test scores, and all helper functions work with this datatype. An example of such a helper function is scale_minmax, which scales the scores to lie between 0 and 1 using min-max scaling.
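Conceptually, min-max scaling maps the smallest value to 0 and the largest to 1. A plain-Julia sketch of the idea on a single score vector (a simplification; the library function operates on the whole train/test tuple):

```julia
# Simplified sketch of min-max scaling for one score vector
minmax_scale(v) = (v .- minimum(v)) ./ (maximum(v) - minimum(v))

minmax_scale([2.0, 4.0, 6.0])  # [0.0, 0.5, 1.0]
```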
last(scores |> scale_minmax)
3600-element Vector{Float64}:
0.009915879968108363
0.02187720506179054
0.03522788490341164
0.05083196703693757
0.029530684664811027
0.04948007559615709
0.032879518717070275
0.01633802423674759
0.062357923124135045
0.0128157573577204
⋮
0.01736035333244564
0.0354714938783983
0.10878081177982465
0.023405672633850193
0.0439153596479345
0.02451064385949808
0.0658331231919836
0.008488402861413877
0.2614340621027081
Another example is classify_quantile, which transforms scores to classes. We display only the test results using the last element of the tuple.
last(scores |> classify_quantile(0.9))
3600-element Vector{String}:
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
"normal"
⋮
"normal"
"normal"
"outlier"
"normal"
"normal"
"normal"
"normal"
"normal"
"outlier"
Sometimes it's also necessary to combine scores from multiple detectors, which can be achieved, for example, with combine_mean.
combine_mean(scores, scores) == scores
true
We can see that combine_mean can work with multiple train/test tuples and combine them into one final tuple. In this case the resulting tuple consists of the means of the individual train and test score vectors.
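To build intuition, mean-combination averages the train vectors and the test vectors elementwise across the tuples. A plain-Julia sketch (not the library implementation):

```julia
using Statistics: mean

# Two hypothetical train/test score tuples from different detectors
t1 = ([0.2, 0.4], [0.6, 0.8])
t2 = ([0.4, 0.6], [0.2, 0.4])

# Elementwise mean of the train vectors and of the test vectors
combined = (mean([first(t1), first(t2)]), mean([last(t1), last(t2)]))
# ≈ ([0.3, 0.5], [0.4, 0.6])
```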
Combining models
We typically want probabilistic or deterministic predictions instead of raw scores. Using a ProbabilisticDetector or DeterministicDetector, we can simply wrap a detector to enable such predictions. Both wrappers are designed to work with multiple models and combine them into one probabilistic or deterministic result. When using multiple models, we have to provide them as keyword arguments as follows.
knn = ProbabilisticDetector(knn1=KNN(k=5), knn2=KNN(k=10),
normalize=scale_minmax,
combine=combine_mean)
ProbabilisticUnsupervisedCompositeDetector(
normalize = OutlierDetection.scale_minmax,
combine = OutlierDetection.combine_mean,
knn1 = KNNDetector(
k = 5,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
static = :auto,
leafsize = 10,
reorder = true,
parallel = false,
reduction = :maximum),
knn2 = KNNDetector(
k = 10,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
static = :auto,
leafsize = 10,
reorder = true,
parallel = false,
reduction = :maximum))
As you can see, we additionally provided explicit arguments to normalize and combine, which take functions used for score normalization and combination. These are the defaults, so we could have left them unspecified and achieved the same result. The scores are always normalized before they are combined. Note that any function that maps a train/test score tuple to a score tuple with values in the range [0, 1] works for normalization; for example, if the scores already lie in [0, 1], we could simply pass the identity function. Let's see the predictions of the defined detector.
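For instance, assuming a detector whose raw scores already lie in [0, 1], a wrapper call could skip rescaling by passing identity (an illustrative sketch only; KNN scores are distances and do not satisfy this assumption):

```julia
# Hypothetical: pass `identity` as the normalization function when
# scores are already in [0, 1]; here only to illustrate the keyword.
knn_id = ProbabilisticDetector(knn1=KNN(k=5), knn2=KNN(k=10),
                               normalize=identity,
                               combine=combine_mean)
```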
mach = machine(knn, X, y)
fit!(mach, rows=train)
predict(mach, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.991, outlier=>0.00914)
UnivariateFinite{OrderedFactor{2}}(normal=>0.978, outlier=>0.0224)
UnivariateFinite{OrderedFactor{2}}(normal=>0.963, outlier=>0.0365)
UnivariateFinite{OrderedFactor{2}}(normal=>0.953, outlier=>0.0465)
UnivariateFinite{OrderedFactor{2}}(normal=>0.975, outlier=>0.0247)
UnivariateFinite{OrderedFactor{2}}(normal=>0.951, outlier=>0.0485)
UnivariateFinite{OrderedFactor{2}}(normal=>0.966, outlier=>0.0344)
UnivariateFinite{OrderedFactor{2}}(normal=>0.989, outlier=>0.0114)
UnivariateFinite{OrderedFactor{2}}(normal=>0.942, outlier=>0.0578)
UnivariateFinite{OrderedFactor{2}}(normal=>0.989, outlier=>0.0115)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.985, outlier=>0.0151)
UnivariateFinite{OrderedFactor{2}}(normal=>0.97, outlier=>0.03)
UnivariateFinite{OrderedFactor{2}}(normal=>0.894, outlier=>0.106)
UnivariateFinite{OrderedFactor{2}}(normal=>0.98, outlier=>0.0201)
UnivariateFinite{OrderedFactor{2}}(normal=>0.956, outlier=>0.0444)
UnivariateFinite{OrderedFactor{2}}(normal=>0.974, outlier=>0.0262)
UnivariateFinite{OrderedFactor{2}}(normal=>0.939, outlier=>0.0606)
UnivariateFinite{OrderedFactor{2}}(normal=>0.991, outlier=>0.00866)
UnivariateFinite{OrderedFactor{2}}(normal=>0.748, outlier=>0.252)
Pretty simple, huh?
Learning networks
Sometimes we need more flexibility to define outlier detection models. Unfortunately, MLJ's linear pipelines are not yet usable for outlier detection models, thus we need to define our learning networks manually. Let's, for example, create a machine that standardizes the input features before applying the detector.
Xs, ys = source(X), source(y)
Xstd = transform(machine(Standardizer(), Xs), Xs)
ŷ = predict(machine(knn, Xstd), Xstd)
knn_std = machine(ProbabilisticUnsupervisedDetector(), Xs, ys; predict=ŷ)
Machine{ProbabilisticUnsupervisedDetectorSurrogate,…} trained 0 times; does not cache data
model: MLJBase.ProbabilisticUnsupervisedDetectorSurrogate
args:
1: Source @499 ⏎ `Table{AbstractVector{Continuous}}`
2: Source @259 ⏎ `AbstractVector{OrderedFactor{2}}`
We can fit! and predict with the resulting model as usual.
fit!(knn_std, rows=train)
predict(knn_std, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.988, outlier=>0.0116)
UnivariateFinite{OrderedFactor{2}}(normal=>0.977, outlier=>0.0229)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0359)
UnivariateFinite{OrderedFactor{2}}(normal=>0.958, outlier=>0.0417)
UnivariateFinite{OrderedFactor{2}}(normal=>0.971, outlier=>0.029)
UnivariateFinite{OrderedFactor{2}}(normal=>0.945, outlier=>0.0551)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0362)
UnivariateFinite{OrderedFactor{2}}(normal=>0.997, outlier=>0.00341)
UnivariateFinite{OrderedFactor{2}}(normal=>0.937, outlier=>0.0629)
UnivariateFinite{OrderedFactor{2}}(normal=>0.975, outlier=>0.0247)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.986, outlier=>0.0142)
UnivariateFinite{OrderedFactor{2}}(normal=>0.955, outlier=>0.0451)
UnivariateFinite{OrderedFactor{2}}(normal=>0.888, outlier=>0.112)
UnivariateFinite{OrderedFactor{2}}(normal=>0.961, outlier=>0.0386)
UnivariateFinite{OrderedFactor{2}}(normal=>0.962, outlier=>0.0376)
UnivariateFinite{OrderedFactor{2}}(normal=>0.97, outlier=>0.0301)
UnivariateFinite{OrderedFactor{2}}(normal=>0.953, outlier=>0.0471)
UnivariateFinite{OrderedFactor{2}}(normal=>0.987, outlier=>0.0129)
UnivariateFinite{OrderedFactor{2}}(normal=>0.786, outlier=>0.214)
Note that we supplied labels ys to an unsupervised algorithm; this is not necessary if you just want to predict, but it is necessary if you want to evaluate the resulting learning network. We can easily export such a learning network as a model with @from_network.
@from_network knn_std mutable struct StandardizedKNN end
Furthermore, if the goal is to create a standalone model from a network, we could use empty sources (source()) for Xs and ys. The standalone model can be bound to data again like any other model.
knn_std = machine(StandardizedKNN(), X, y)
fit!(knn_std, rows=train)
predict(knn_std, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.988, outlier=>0.0116)
UnivariateFinite{OrderedFactor{2}}(normal=>0.977, outlier=>0.0229)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0359)
UnivariateFinite{OrderedFactor{2}}(normal=>0.958, outlier=>0.0417)
UnivariateFinite{OrderedFactor{2}}(normal=>0.971, outlier=>0.029)
UnivariateFinite{OrderedFactor{2}}(normal=>0.945, outlier=>0.0551)
UnivariateFinite{OrderedFactor{2}}(normal=>0.964, outlier=>0.0362)
UnivariateFinite{OrderedFactor{2}}(normal=>0.997, outlier=>0.00341)
UnivariateFinite{OrderedFactor{2}}(normal=>0.937, outlier=>0.0629)
UnivariateFinite{OrderedFactor{2}}(normal=>0.975, outlier=>0.0247)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.986, outlier=>0.0142)
UnivariateFinite{OrderedFactor{2}}(normal=>0.955, outlier=>0.0451)
UnivariateFinite{OrderedFactor{2}}(normal=>0.888, outlier=>0.112)
UnivariateFinite{OrderedFactor{2}}(normal=>0.961, outlier=>0.0386)
UnivariateFinite{OrderedFactor{2}}(normal=>0.962, outlier=>0.0376)
UnivariateFinite{OrderedFactor{2}}(normal=>0.97, outlier=>0.0301)
UnivariateFinite{OrderedFactor{2}}(normal=>0.953, outlier=>0.0471)
UnivariateFinite{OrderedFactor{2}}(normal=>0.987, outlier=>0.0129)
UnivariateFinite{OrderedFactor{2}}(normal=>0.786, outlier=>0.214)
There might be occasions where our ProbabilisticDetector or DeterministicDetector wrappers are not flexible enough. In such cases we can use augmented_transform directly in our learning networks together with a ProbabilisticTransformer or DeterministicTransformer, which take one or more train/test tuples as input and return probabilistic or deterministic predictions.
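Sketched as a learning network, this could look as follows (an untested outline; it assumes the transformer machine accepts the nodes returned by augmented_transform as arguments):

```julia
# Sketch: feed raw train/test score tuples from two detectors into a
# ProbabilisticTransformer for full control over the network.
Xs = source(X)
scores1 = augmented_transform(machine(KNN(k=5), Xs), Xs)
scores2 = augmented_transform(machine(KNN(k=10), Xs), Xs)
ŷ = transform(machine(ProbabilisticTransformer()), scores1, scores2)
```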
Implementing models
Learning networks let us flexibly create complex combinations of existing models. Sometimes, however, it's necessary to develop new outlier detection models for specific tasks. OutlierDetection.jl builds on top of MLJ and provides a simple interface defining how an outlier detection algorithm can be implemented. Let's first import the interface and the packages relevant to our new algorithm.
import OutlierDetectionInterface
const OD = OutlierDetectionInterface
using Statistics: mean
using LinearAlgebra: norm
Our proposed algorithm calculates a central point from the training data and defines an outlier as a point that lies far away from that center. The only hyperparameter is p, which specifies the p-norm used to calculate the distance. Using @detector, which replicates @mlj_model, we can define our detector struct with macro-generated keyword arguments and default values.
OD.@detector mutable struct SimpleDetector <: OD.UnsupervisedDetector
p::Float64 = 2
end
A DetectorModel then defines the learned parameters of our model. In this case the only learned parameter is the center.
struct SimpleModel <: OD.DetectorModel
center::AbstractArray{<:Real}
end
Let's further define a helper function to calculate the distance from the center.
function distances_from(center, vectors::AbstractMatrix, p)
deviations = vectors .- center
return [norm(deviations[:, i], p) for i in 1:size(deviations, 2)]
end
distances_from (generic function with 1 method)
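A quick sanity check of the helper on a tiny column-major matrix (columns are points):

```julia
using LinearAlgebra: norm

# Same helper as above, repeated so the snippet is self-contained
function distances_from(center, vectors::AbstractMatrix, p)
    deviations = vectors .- center
    return [norm(deviations[:, i], p) for i in 1:size(deviations, 2)]
end

# Two 2-d points as columns: (3, 4) and (0, 0), center at the origin
distances_from([0.0, 0.0], [3.0 0.0; 4.0 0.0], 2)  # [5.0, 0.0]
```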
Finally, we implement the two methods required for a detector, namely fit and transform. Please refer to the Key Concepts to learn more about the involved methods and types.
function OD.fit(detector::SimpleDetector, X::OD.Data; verbosity)::OD.Fit
center = mean(X, dims=2)
training_scores = distances_from(center, X, detector.p)
return SimpleModel(center), training_scores
end
function OD.transform(detector::SimpleDetector, model::SimpleModel, X::OD.Data)::OD.Scores
distances_from(model.center, X, detector.p)
end
Using a data-frontend, we can make sure that MLJ internally transforms input data to Data, which refers to column-major Julia arrays with the last dimension representing an example. Registering that frontend can be achieved with @default_frontend.
OD.@default_frontend SimpleDetector
Again, we can simply wrap our detector in a ProbabilisticDetector to enable probabilistic predictions.
sd = machine(ProbabilisticDetector(SimpleDetector()), X, y)
fit!(sd, rows=train)
predict(sd, rows=test)
3600-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(normal=>0.903, outlier=>0.0972)
UnivariateFinite{OrderedFactor{2}}(normal=>0.874, outlier=>0.126)
UnivariateFinite{OrderedFactor{2}}(normal=>0.917, outlier=>0.0826)
UnivariateFinite{OrderedFactor{2}}(normal=>0.564, outlier=>0.436)
UnivariateFinite{OrderedFactor{2}}(normal=>0.67, outlier=>0.33)
UnivariateFinite{OrderedFactor{2}}(normal=>0.531, outlier=>0.469)
UnivariateFinite{OrderedFactor{2}}(normal=>0.832, outlier=>0.168)
UnivariateFinite{OrderedFactor{2}}(normal=>0.729, outlier=>0.271)
UnivariateFinite{OrderedFactor{2}}(normal=>0.677, outlier=>0.323)
UnivariateFinite{OrderedFactor{2}}(normal=>0.688, outlier=>0.312)
⋮
UnivariateFinite{OrderedFactor{2}}(normal=>0.874, outlier=>0.126)
UnivariateFinite{OrderedFactor{2}}(normal=>0.742, outlier=>0.258)
UnivariateFinite{OrderedFactor{2}}(normal=>0.816, outlier=>0.184)
UnivariateFinite{OrderedFactor{2}}(normal=>0.713, outlier=>0.287)
UnivariateFinite{OrderedFactor{2}}(normal=>0.564, outlier=>0.436)
UnivariateFinite{OrderedFactor{2}}(normal=>0.932, outlier=>0.0679)
UnivariateFinite{OrderedFactor{2}}(normal=>0.631, outlier=>0.369)
UnivariateFinite{OrderedFactor{2}}(normal=>0.526, outlier=>0.474)
UnivariateFinite{OrderedFactor{2}}(normal=>0.408, outlier=>0.592)
Remember: Your feedback and contributions are extremely welcome; join us on GitHub or in #outlierdetection on Slack and get involved.