Simple Usage

Let's import the necessary packages first.

using MLJ
using OutlierDetection
using OutlierDetectionData: ODDS
using StatisticalMeasures: area_under_curve

Loading data

We can list the available datasets in the imported ODDS dataset collection with ODDS.list.

ODDS.list()

27-element Vector{String}:
 "annthyroid"
 "arrhythmia"
 "breastw"
 "cardio"
 "cover"
 "glass"
 "http"
 "ionosphere"
 "letter"
 "lympho"
 ⋮
 "satimage-2"
 "shuttle"
 "smtp"
 "speech"
 "thyroid"
 "vertebral"
 "vowels"
 "wbc"
 "wine"

We can now load a dataset by specifying its name.

X, y = ODDS.load("annthyroid")

(7200×6 DataFrame
  Row │ x1       x2       x3       x4       x5       x6
      │ Float64  Float64  Float64  Float64  Float64  Float64
──────┼──────────────────────────────────────────────────────
    1 │    0.73  0.0006    0.015   0.12       0.082  0.146
    2 │    0.24  0.00025   0.03    0.143      0.133  0.108
    3 │    0.47  0.0019    0.024   0.102      0.131  0.078
    4 │    0.64  0.0009    0.017   0.077      0.09   0.085
    5 │    0.23  0.00025   0.026   0.139      0.09   0.153
    6 │    0.69  0.00025   0.016   0.086      0.07   0.123
    7 │    0.85  0.00025   0.023   0.128      0.104  0.121
    8 │    0.48  0.00208   0.02    0.086      0.078  0.11
  ⋮   │    ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 7194 │    0.7   0.0009    0.015   0.104      0.095  0.109
 7195 │    0.79  0.0049    0.0201  0.077      0.082  0.094
 7196 │    0.59  0.0025    0.0208  0.079      0.099  0.08
 7197 │    0.51  0.106     0.006   0.005      0.089  0.0055
 7198 │    0.51  0.00076   0.0201  0.09       0.067  0.134
 7199 │    0.35  0.0028    0.0201  0.09       0.089  0.101
 7200 │    0.73  0.00056   0.0201  0.081      0.09   0.09
                                            7185 rows omitted, CategoricalArrays.CategoricalValue{String, UInt32}["normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal"  …  "normal", "normal", "normal", "normal", "normal", "normal", "outlier", "normal", "normal", "normal"])

Data formats

Because OutlierDetection.jl is built upon MLJ, there are a few things to know about the data used in outlier detection tasks. A detector can typically be trained on continuous data X satisfying the Tables.jl interface. Often we use DataFrames.jl to create such tables. An important distinction is the difference between machine types and scientific types.

The machine type refers to the Julia type being used to represent the object (for instance, Float64).
The scientific type is one of the types defined in ScientificTypes.jl reflecting how the object should be interpreted (for instance, Continuous or Multiclass).
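The distinction can be seen on single values. The following is a minimal sketch, assuming ScientificTypes.jl is installed and loaded directly (MLJ re-exports the same functions); the variable names are only illustrative.

```julia
using ScientificTypes

x = 1.0
typeof(x)    # machine type: Float64
scitype(x)   # scientific type: Continuous

n = 3
typeof(n)    # machine type: Int64 on most systems
scitype(n)   # scientific type: Count

# coerce changes the representation so that the data has the requested scitype.
counts = [1, 2, 3]
as_continuous = coerce(counts, Continuous)
eltype(as_continuous)   # Float64
```

Integer columns are therefore interpreted as Count by default and would have to be coerced to Continuous before being passed to a detector that expects continuous input.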

We can examine the machine and scientific types of our loaded dataframe X with ScientificTypes.schema.

schema(X)

┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ x1    │ Continuous │ Float64 │
│ x2    │ Continuous │ Float64 │
│ x3    │ Continuous │ Float64 │
│ x4    │ Continuous │ Float64 │
│ x5    │ Continuous │ Float64 │
│ x6    │ Continuous │ Float64 │
└───────┴────────────┴─────────┘

Fortunately, our table contains only Continuous data, as expected. Labels in outlier detection are always encoded as categorical vectors with the classes "normal" and "outlier" and the scitype OrderedFactor{2}. Data with scitype OrderedFactor{2} is considered to have an intrinsic "positive" class, in our case "outlier". Measures such as true_positive assume that the second class in the ordering is the "positive" class. The helper to_categorical transforms a Vector{String} into a categorical vector and ensures that there are only two classes and that the positive class is "outlier". We don't need to coerce y to a categorical array in our example because load already returns categorical vectors.

to_categorical(["normal", "normal", "outlier"])

3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "normal"
 "normal"
 "outlier"
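To make the class ordering concrete, the following sketch builds a comparable label vector by hand with CategoricalArrays.jl. It is an illustration of the encoding described above, not the implementation of to_categorical, and the variable name labels is hypothetical.

```julia
using CategoricalArrays

# Ordered categorical vector with "normal" first and "outlier" second,
# so that "outlier" is the last (positive) class in the ordering.
labels = categorical(["normal", "normal", "outlier"];
                     ordered=true, levels=["normal", "outlier"])

isordered(labels)      # true
levels(labels)         # "normal" comes before "outlier"
```

Fixing the levels explicitly keeps the ordering stable even if a particular sample happens to contain no outliers.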

Loading models

With the data ready, we can list all available detectors in MLJ. By convention, a detector is named $(Name)Detector in MLJ, e.g. KNNDetector, so we can simply search for "Detector".

models("Detector")

28-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ABODDetector, package_name = OutlierDetectionNeighbors, ... )
 (name = ABODDetector, package_name = OutlierDetectionPython, ... )
 (name = CBLOFDetector, package_name = OutlierDetectionPython, ... )
 (name = CDDetector, package_name = OutlierDetectionPython, ... )
 (name = COFDetector, package_name = OutlierDetectionNeighbors, ... )
 (name = COFDetector, package_name = OutlierDetectionPython, ... )
 (name = COPODDetector, package_name = OutlierDetectionPython, ... )
 (name = DNNDetector, package_name = OutlierDetectionNeighbors, ... )
 (name = ECODDetector, package_name = OutlierDetectionPython, ... )
 (name = GMMDetector, package_name = OutlierDetectionPython, ... )
 ⋮
 (name = LOFDetector, package_name = OutlierDetectionNeighbors, ... )
 (name = LOFDetector, package_name = OutlierDetectionPython, ... )
 (name = MCDDetector, package_name = OutlierDetectionPython, ... )
 (name = OCSVMDetector, package_name = OutlierDetectionPython, ... )
 (name = OneClassSVM, package_name = LIBSVM, ... )
 (name = PCADetector, package_name = OutlierDetectionPython, ... )
 (name = RODDetector, package_name = OutlierDetectionPython, ... )
 (name = SODDetector, package_name = OutlierDetectionPython, ... )
 (name = SOSDetector, package_name = OutlierDetectionPython, ... )

Loading a detector of your choice is simple with @load or @iload; see loading model code. Because there are multiple detectors named KNNDetector, we specify the package explicitly.

KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors verbosity=0

OutlierDetectionNeighbors.KNNDetector

To enable later evaluation, we wrap a raw detector (which only defines transform returning raw outlier scores) in a ProbabilisticDetector; this enables us to predict outlier probabilities from the raw scores.

knn = ProbabilisticDetector(KNN())

ProbabilisticUnsupervisedCompositeDetector(
  normalize = OutlierDetection.scale_minmax, 
  combine = OutlierDetection.combine_mean, 
  detector = KNNDetector(
        k = 5, 
        metric = Distances.Euclidean(0.0), 
        algorithm = :kdtree, 
        static = :auto, 
        leafsize = 10, 
        reorder = true, 
        parallel = false, 
        reduction = :maximum))

Note that the call above uses the default parameters to instantiate both the OutlierDetectionNeighbors.KNNDetector and the ProbabilisticDetector, e.g. k = 5 and so on.
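The idea behind turning raw scores into probability-like values can be sketched with plain min-max scaling. This is a simplified illustration in base Julia, assuming the scaling is based on the training scores and clipped to [0, 1]; the helper minmax_scale is hypothetical and is not the library's scale_minmax.

```julia
# Hypothetical helper: scale test scores by the range of the training scores.
# Assumes the training scores are not all identical (otherwise hi - lo is zero).
function minmax_scale(train_scores, test_scores)
    lo, hi = extrema(train_scores)
    return clamp.((test_scores .- lo) ./ (hi - lo), 0.0, 1.0)
end

train_scores = [1.0, 2.0, 3.0, 5.0]
test_scores = [0.5, 3.0, 9.0]

probs = minmax_scale(train_scores, test_scores)
```

Scores below the training minimum map to 0, scores above the training maximum map to 1, and everything in between is scaled linearly.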

Model evaluation

We can now evaluate how such a model performs. By default, a probabilistic detector is evaluated using area_under_curve, but many other measures are available; see the list of measures. We use stratified five-fold cross-validation to evaluate our model, but other resampling strategies are possible as well.

cv = StratifiedCV(nfolds=5, shuffle=true, rng=0)
evaluate(knn, X, y; resampling=cv, measure=area_under_curve)

PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────────────┬───────────┬─────────────┐
│ measure          │ operation │ measurement │
├──────────────────┼───────────┼─────────────┤
│ AreaUnderCurve() │ predict   │ 0.745       │
└──────────────────┴───────────┴─────────────┘
┌────────────────────────────────────┬─────────┐
│ per_fold                           │ 1.96*SE │
├────────────────────────────────────┼─────────┤
│ [0.737, 0.754, 0.798, 0.74, 0.696] │ 0.0361  │
└────────────────────────────────────┴─────────┘
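To build intuition for the measure, the area under the ROC curve can be read as the probability that a randomly chosen outlier receives a higher score than a randomly chosen normal observation. The sketch below computes this pairwise in base Julia on toy data; the function pairwise_auc is a hypothetical helper for illustration and not how StatisticalMeasures.jl computes it.

```julia
# Hypothetical helper: fraction of (outlier, normal) pairs in which the
# outlier has the higher score; ties count as half a win.
function pairwise_auc(scores, is_outlier)
    pos = scores[is_outlier]
    neg = scores[.!is_outlier]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos, n in neg)
    return wins / (length(pos) * length(neg))
end

scores = [0.1, 0.4, 0.35, 0.8]
is_outlier = [false, false, true, true]

toy_auc = pairwise_auc(scores, is_outlier)   # 3 of 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random scoring and 1.0 to a perfect ranking, which puts the cross-validated result above into context.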

Model optimization

As previously mentioned, we used the default parameters to create our model. However, we typically don't know an appropriate number of neighbors (k) beforehand. Using MLJ's built-in model tuning, we can identify the best k for a given performance measure.

Let's first define a range of possible parameter values for k.

r = range(knn, :(detector.k), values=[1,2,3,4,5:5:100...])

NominalRange(detector.k = 1, 2, 3, ...)
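The values passed to range rely on Julia's splat operator to append a stepped range to the explicit values. The following base-Julia snippet shows what the vector expands to; the names ks and n_fits are only illustrative.

```julia
# 1 to 4 individually, then every fifth value from 5 to 100.
ks = [1, 2, 3, 4, 5:5:100...]

length(ks)     # number of candidate values for k
first(ks, 6)   # the first few candidates
last(ks)       # the largest candidate

# With five-fold cross-validation, an exhaustive search over these values
# would need one fit per candidate and fold.
n_fits = length(ks) * 5
```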

We can then use this range, or multiple ranges, to create a tuned model by additionally specifying a tuning strategy, which defines how the ranges are searched. In our case, we use a simple grid search to evaluate all the given parameter options.

t = TunedModel(model=knn, resampling=cv, tuning=Grid(), range=r, acceleration=CPUThreads(), measure=area_under_curve)

ProbabilisticTunedModel(
  model = ProbabilisticUnsupervisedCompositeDetector(
        normalize = OutlierDetection.scale_minmax, 
        combine = OutlierDetection.combine_mean, 
        detector = KNNDetector(k = 5, …)), 
  tuning = Grid(
        goal = nothing, 
        resolution = 10, 
        shuffle = true, 
        rng = Random.TaskLocalRNG()), 
  resampling = StratifiedCV(
        nfolds = 5, 
        shuffle = true, 
        rng = Random.MersenneTwister(0, (0, 11022, 10020, 380))), 
  measure = AreaUnderCurve(), 
  weights = nothing, 
  class_weights = nothing, 
  operation = nothing, 
  range = NominalRange(detector.k = 1, 2, 3, ...), 
  selection_heuristic = MLJTuning.NaiveSelection(nothing), 
  train_best = true, 
  repeats = 1, 
  n = nothing, 
  acceleration = ComputationalResources.CPUThreads{Int64}(1), 
  acceleration_resampling = ComputationalResources.CPU1{Nothing}(nothing), 
  check_measure = true, 
  cache = true, 
  compact_history = true, 
  logger = nothing)

We can again bind that model to data and fit it. Fitting a tuned model instigates a search for optimal model hyperparameters, within specified ranges, and then uses all supplied data to train the best model.

m = machine(t, X, y) |> fit!

trained Machine; does not cache data
  model: ProbabilisticTunedModel(model = ProbabilisticUnsupervisedCompositeDetector(normalize = scale_minmax, …), …)
  args: 
    1:  Source @759 ⏎ Table{AbstractVector{Continuous}}
    2:  Source @077 ⏎ AbstractVector{OrderedFactor{2}}

Using the machine's report, we can identify the best evaluation results.

report(m).best_history_entry

(model = ProbabilisticUnsupervisedCompositeDetector(normalize = scale_minmax, …),
 measure = [AreaUnderCurve()],
 measurement = [0.7692060934966346],
 per_fold = [[0.7132612938813606, 0.7523048986545702, 0.7647285653189007, 0.8143390987933906, 0.8013966108349508]],
 evaluation = CompactPerformanceEvaluation(0.769,),)
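As a quick consistency check, the aggregate measurement in this entry agrees with the mean of the per-fold values. The snippet below reproduces the arithmetic in base Julia with the numbers copied from the output above; it is only a worked check, under the assumption that the aggregate is the plain mean over folds.

```julia
using Statistics

# Per-fold values and aggregate copied from the best history entry above.
per_fold = [0.7132612938813606, 0.7523048986545702, 0.7647285653189007,
            0.8143390987933906, 0.8013966108349508]
measurement = 0.7692060934966346

fold_mean = mean(per_fold)
isapprox(fold_mean, measurement; atol=1e-12)   # true
```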

Additionally, we can easily extract the best identified model.

b = report(m).best_model

ProbabilisticUnsupervisedCompositeDetector(
  normalize = OutlierDetection.scale_minmax, 
  combine = OutlierDetection.combine_mean, 
  detector = KNNDetector(
        k = 1, 
        metric = Distances.Euclidean(0.0), 
        algorithm = :kdtree, 
        static = :auto, 
        leafsize = 10, 
        reorder = true, 
        parallel = false, 
        reduction = :maximum))

Let's evaluate the best model again to make sure it achieves the expected performance.

evaluate(b, X, y, resampling=cv, measure=area_under_curve)

PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────────────┬───────────┬─────────────┐
│ measure          │ operation │ measurement │
├──────────────────┼───────────┼─────────────┤
│ AreaUnderCurve() │ predict   │ 0.769       │
└──────────────────┴───────────┴─────────────┘
┌─────────────────────────────────────┬─────────┐
│ per_fold                            │ 1.96*SE │
├─────────────────────────────────────┼─────────┤
│ [0.713, 0.752, 0.765, 0.814, 0.801] │ 0.0395  │
└─────────────────────────────────────┴─────────┘

Model usage

Now that we have found the best model, we can use it to determine outliers in the data. Converting scores to classes can be achieved with a DeterministicDetector. Let's create some fake train/test indices and suppose we want to identify outliers in the test data.

train, test = partition(eachindex(y), 0.5, shuffle=true, stratify=y, rng=0)

([2549, 3479, 6603, 1019, 2000, 3894, 114, 3804, 5535, 5134  …  1248, 5397, 5735, 3722, 4621, 6645, 2546, 3332, 4756, 4642], [2164, 1398, 4230, 1364, 4118, 5372, 6312, 7020, 4093, 2984  …  2560, 1774, 4845, 6784, 2106, 3749, 6300, 509, 5484, 1134])

Let's determine the outlier_fraction in the training data, which we then use to derive a threshold for converting outlier scores into classes. Using classify_quantile, we can create a classification function based on quantiles of the training scores. In the following example, an observation is classified as an outlier if its score lies above the 1 - outlier_fraction quantile of the training scores.

threshold = classify_quantile(1 - outlier_fraction(y[train]))
final = machine(DeterministicDetector(b.detector, classify=threshold), X)
fit!(final, rows=train)

trained Machine; does not cache data
  model: DeterministicUnsupervisedCompositeDetector(normalize = scale_minmax, …)
  args: 
    1:  Source @991 ⏎ Table{AbstractVector{Continuous}}
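The thresholding idea can be sketched on toy data in base Julia. This is an illustration under simplifying assumptions (a strict "greater than" comparison and hand-made scores); the function classify and all variable names are hypothetical, and the snippet is not the implementation of classify_quantile.

```julia
using Statistics

# Toy training scores with one clear outlier among ten observations.
train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
train_labels = [fill("normal", 9); "outlier"]

# Fraction of outliers in the training labels.
fraction = count(==("outlier"), train_labels) / length(train_labels)

# Scores above the (1 - fraction) quantile of the training scores
# are treated as outliers.
threshold = quantile(train_scores, 1 - fraction)
classify(score) = score > threshold ? "outlier" : "normal"

test_scores = [0.3, 4.0, 1.0]
predicted = classify.(test_scores)
```

Because the threshold is derived from the training scores only, the share of test observations flagged as outliers can differ from the training outlier fraction.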

Using predict allows us to determine the outliers in the test data.

ŷ = predict(final, rows=test)

3600-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "outlier"
 "normal"
 "normal"
 ⋮
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "outlier"
 "normal"
 "normal"
 "normal"

Model persistence

Finally, we can store the model with MLJ.save.

MLJ.save("final.jlso", final)

When we load the model again, the machine is no longer bound to data, but we can still use it by supplying X explicitly.

final = machine("final.jlso")

trained Machine; does not cache data
  model: DeterministicUnsupervisedCompositeDetector(normalize = scale_minmax, …)
  args:

We can still use the machine to predict, even though it's not bound to data.

ŷ == predict(final, X[test, :])

true

If you would like to know how you can combine detectors or how to develop your own detectors, continue with the Advanced Usage guide.