# Key Concepts

This guide provides the knowledge necessary to work with OutlierDetection.jl and to understand the concepts behind the library's design.
!!! note
    Outlier detection is predominantly an unsupervised learning task that transforms each data point into an outlier score quantifying its level of "outlierness". This very general form of output retains all the information provided by a specific algorithm.
The key design choice of OutlierDetection.jl is promoting the usage of outlier scores, not labels. The main data type, a `Detector`, has to implement two methods: `fit` and `transform`.
- `Detector`: A `struct` defining the hyperparameters for an outlier detection algorithm, just like an estimator in scikit-learn or a model in MLJ. A detector is actually a subtype of `MLJModelInterface.Model`.
- `fit`: Learn a `DetectorModel` for a specific detector from input data `X` and labels `y` (if supervised), for example the weights of a neural network.
- `transform`: Using a detector and a learned model, transform unseen data into outlier scores.
Transforming the outlier scores into classes is seen as the last step of an outlier detection task. A wrapper or transformer turns scores into probabilities or labels, typically with two classes describing inliers (`"normal"`) and outliers (`"outlier"`).
A convention used in OutlierDetection.jl is that higher scores imply higher outlierness.
!!! note
    A peculiarity of working with outlier scores is the distinction between train scores and test scores. Train scores result from fitting a detector (`fit`), and test scores result from predicting unseen data (`transform`). Classifying an instance as an inlier or outlier always requires a comparison to the train scores.
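To see why train scores matter, here is a minimal sketch in plain Julia (not the library's API): a threshold separating inliers from outliers is derived from the scores observed during training, and test scores are compared against it. The score values and the 90th-percentile cutoff are illustrative assumptions.

```julia
using Statistics

# Hypothetical scores, e.g. produced by `fit` (train) and `transform` (test).
train_scores = [0.1, 0.15, 0.18, 0.2, 0.22, 0.25, 0.3, 0.9]
test_scores  = [0.2, 0.95]

# Derive the threshold from the *train* scores, e.g. flag the top 10%.
threshold = quantile(train_scores, 0.9)

# Higher scores imply higher outlierness (the library's convention).
labels = [s > threshold ? "outlier" : "normal" for s in test_scores]
# -> ["normal", "outlier"]
```

Without the train scores there would be no reference distribution to compare a single test score against.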
Let's see what the data types look like in a typical outlier detection task. We use the following naming conventions for the data we are working with:
- the input data: `OutlierDetectionInterface.Data`
- the raw scores: `OutlierDetectionInterface.Scores`
- the labels: `OutlierDetectionInterface.Labels`
One last, previously unmentioned structure is the `Fit` result, a `struct` that bundles the learned model and the train scores. Let's now look at how the methods defined by OutlierDetection.jl transform the mentioned data structures.
```julia
fit(::UnsupervisedDetector, ::Data; verbosity::Integer)::Fit
fit(::SupervisedDetector, ::Data, ::Labels; verbosity::Integer)::Fit
transform(::Detector, ::Fit, ::Data)::Scores
```
A new outlier detection algorithm can easily be implemented in OutlierDetection.jl by implementing the `fit` and `transform` methods above.
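To illustrate the `fit`/`transform` pattern, here is a self-contained sketch of a toy detector that scores observations by their distance to the training mean. The types `MeanDetector` and `MeanModel` are simplified stand-ins invented for this example; a real implementation would subtype the abstract types from OutlierDetectionInterface.jl and return its `Fit` result instead.

```julia
using Statistics

struct MeanDetector end    # hyperparameters of the algorithm would live here

struct MeanModel           # the learned "model": the column-wise mean
    center::Vector{Float64}
end

# Distance of one observation (a column) to the learned center.
dist(x, c) = sqrt(sum(abs2, x .- c))

# fit: learn a model from the data and bundle it with the train scores.
function fit(detector::MeanDetector, X::AbstractMatrix)
    center = vec(mean(X, dims=2))                 # columns are observations
    scores = [dist(x, center) for x in eachcol(X)]
    (model = MeanModel(center), scores = scores)
end

# transform: score unseen data using the learned model.
function transform(detector::MeanDetector, model::MeanModel, X::AbstractMatrix)
    [dist(x, model.center) for x in eachcol(X)]
end
```

Larger distances to the training mean yield larger scores, matching the convention that higher scores imply higher outlierness.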
!!! warning
    We expect the data to be formatted using the columns-as-observations convention for improved performance with Julia's column-major arrays.
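Concretely, under this convention a dataset of 100 observations with 3 features is stored as a 3×100 matrix, so each observation is a contiguous column (a sketch; the data values are arbitrary):

```julia
X = rand(3, 100)      # 3 features (rows) × 100 observations (columns)

size(X, 2)            # number of observations: 100
first_obs = X[:, 1]   # a single observation is a contiguous column slice
```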
## Integration with MLJ
One of the exciting features of OutlierDetection.jl is its interoperability with the rest of Julia's machine learning ecosystem. You might want to preprocess your data, cluster it, detect outliers, classify, and so forth.
OutlierDetection.jl defines an interface for MLJ such that the implemented OutlierDetection.jl detectors can be used directly with MLJ.
- A `Detector` is bound to data either through `machine(::UnsupervisedDetector, X)` or `machine(::SupervisedDetector, X, y)`.
- `fit(::Detector, X, [y]; verbosity)` becomes `fit!(machine)`, which calls `fit` under the hood.
- `transform(::Detector, ::Fit, X)` becomes `transform(machine)`, which calls `transform` under the hood.
Additionally, OutlierDetection.jl defines a data front-end for MLJ, which ensures that `fit` and `transform` are always called with Julia arrays in column-major format, even though `machine(::Detector, X, y)` also accepts data from any Tables.jl-compatible data source.
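Putting it together, a typical MLJ workflow might look as follows. This is a sketch: `KNNDetector` is used purely as an illustrative detector (it is provided by the separate OutlierDetectionNeighbors.jl package and is assumed to be loaded); any other detector works the same way.

```julia
using MLJ, OutlierDetection

X = rand(100, 3)             # any Tables.jl-compatible source also works

detector = KNNDetector()     # illustrative; substitute any detector
mach = machine(detector, X)  # bind the unsupervised detector to data
fit!(mach)                   # calls `fit` under the hood
scores = transform(mach, X)  # calls `transform` under the hood
```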
Take a look at our Simple Usage to learn more.