Getting Started

This example uses the OutlierDetection API to score how outlying each instance in the Thyroid Disease Dataset is. The dataset is part of the ODDS collection, and we load it with OutlierDetectionData.jl.

Import MLJ, OutlierDetection and OutlierDetectionData.

using MLJ
using OutlierDetection
using OutlierDetectionData: ODDS

Load the "thyroid" dataset from the ODDS collection.

X, y = ODDS.load("thyroid")
(3772×6 DataFrame
  Row │ x1        x2           x3         x4        x5        x6
      │ Float64   Float64      Float64    Float64   Float64   Float64
──────┼────────────────────────────────────────────────────────────────
    1 │ 0.774194  0.00113208   0.137571   0.275701  0.295775  0.236066
    2 │ 0.247312  0.000471698  0.279886   0.329439  0.535211  0.17377
    3 │ 0.494624  0.00358491   0.22296    0.233645  0.525822  0.12459
    4 │ 0.677419  0.00169811   0.156546   0.175234  0.333333  0.136066
    5 │ 0.236559  0.000471698  0.241935   0.320093  0.333333  0.247541
    6 │ 0.731183  0.000471698  0.147059   0.196262  0.239437  0.198361
    7 │ 0.903226  0.000471698  0.213472   0.294393  0.399061  0.195082
    8 │ 0.505376  0.00392453   0.185009   0.196262  0.276995  0.177049
  ⋮   │    ⋮           ⋮           ⋮         ⋮         ⋮         ⋮
 3766 │ 0.763441  0.00943396   0.190702   0.231308  0.323944  0.185246
 3767 │ 0.688172  0.000886792  0.0711575  0.35514   0.262911  0.331148
 3768 │ 0.817204  0.000113208  0.190702   0.287383  0.413146  0.188525
 3769 │ 0.430108  0.00245283   0.232448   0.287383  0.446009  0.17541
 3770 │ 0.935484  0.0245283    0.160342   0.28271   0.375587  0.2
 3771 │ 0.677419  0.0014717    0.190702   0.242991  0.323944  0.195082
 3772 │ 0.483871  0.00356604   0.190702   0.212617  0.338028  0.163934
                                                      3757 rows omitted, CategoricalArrays.CategoricalValue{String, UInt32}["normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal"  …  "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal"])

Create shuffled indices that split the data into 50% training data and 50% test data.

train, test = partition(eachindex(y), 0.5, shuffle=true, rng=0)
([2913, 2848, 707, 3243, 2580, 1308, 2321, 2373, 1876, 1063  …  1830, 2001, 812, 2964, 200, 1295, 3008, 1264, 3250, 893], [2757, 1446, 3184, 3035, 3682, 1489, 1391, 3379, 1272, 1499  …  3294, 1176, 1305, 276, 2305, 401, 3126, 922, 83, 3649])
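To make the split concrete, here is a minimal sketch of a shuffled 50/50 index split in plain Julia. The helper name `split_indices` is hypothetical and not part of MLJ, and the sketch does not reproduce the exact indices that `partition` returns above; it only illustrates the idea of shuffling the row indices and cutting them at a fraction.

```julia
using Random

# Hypothetical helper (not part of MLJ): shuffle the indices 1:n and
# cut the shuffled list at the given fraction.
function split_indices(n::Integer, fraction::Real; rng=Random.default_rng())
    shuffled = shuffle(rng, collect(1:n))
    cut = round(Int, fraction * n)
    return shuffled[1:cut], shuffled[cut+1:end]
end

# 3772 rows, as in the thyroid dataset shown above.
train_idx, test_idx = split_indices(3772, 0.5; rng=MersenneTwister(0))
```

Every row index ends up in exactly one of the two vectors, so the training and test sets never overlap.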

Load an OutlierDetectionNeighbors.KNNDetector and initialize it with k=10 neighbors.

KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors verbosity=0
knn = KNN(k=10)
KNNDetector(
  k = 10, 
  metric = Distances.Euclidean(0.0), 
  algorithm = :kdtree, 
  static = :auto, 
  leafsize = 10, 
  reorder = true, 
  parallel = false, 
  reduction = :maximum)
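The printout shows a Euclidean metric and `reduction = :maximum`. One common way to read such a k-nearest-neighbor score is the distance from a point to its k-th nearest reference point. The brute-force sketch below illustrates that idea under this assumption; `knn_scores` is a hypothetical name, the one-point-per-column layout is a choice made for the sketch, and the library's tree-based implementation may differ in details such as how a training point's distance to itself is handled.

```julia
# Hypothetical helper (not the library's implementation): score each query
# point by its Euclidean distance to the k-th nearest reference point.
# Points are stored one per column.
function knn_scores(reference::AbstractMatrix, queries::AbstractMatrix, k::Integer)
    scores = Vector{Float64}(undef, size(queries, 2))
    for j in axes(queries, 2)
        # Distances from query j to every reference point.
        dists = [sqrt(sum(abs2, reference[:, i] .- queries[:, j]))
                 for i in axes(reference, 2)]
        # The k-th smallest distance is the largest among the k nearest neighbors.
        scores[j] = partialsort(dists, k)
    end
    return scores
end

reference = [0.0 1.0 2.0 3.0]   # four one-dimensional points
queries = [0.5 10.0]            # one point inside the cluster, one far away
scores = knn_scores(reference, queries, 2)
```

The query far from the reference points receives a much larger score than the one inside the cluster, which is the behavior an outlier score should have.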

Bind a raw, probabilistic and deterministic detector to data using a machine.

knn_raw = machine(knn, X)
knn_probabilistic = machine(ProbabilisticDetector(knn), X)
knn_deterministic = machine(DeterministicDetector(knn), X)
untrained Machine; does not cache data
  model: DeterministicUnsupervisedCompositeDetector(normalize = scale_minmax, …)
  args: 
    1:  Source @922 ⏎ Table{AbstractVector{Continuous}}

Learn models from the training data.

fit!(knn_raw, rows=train)
fit!(knn_probabilistic, rows=train)
fit!(knn_deterministic, rows=train)
trained Machine; does not cache data
  model: DeterministicUnsupervisedCompositeDetector(normalize = scale_minmax, …)
  args: 
    1:  Source @922 ⏎ Table{AbstractVector{Continuous}}

Transform the data into raw outlier scores. The result is a tuple holding the training scores and the test scores.

transform(knn_raw, rows=test)
([0.01269328678242623, 0.11156667711676535, 0.05088217126019664, 0.0841671446633964, 0.032502201393999515, 0.03766161901423662, 0.1734321580021272, 0.06473704089745065, 0.06125773587607718, 0.0490542190240858  …  0.05576388438216412, 0.06427497252168496, 0.12223562779567965, 0.05180742182984218, 0.05661766603307932, 0.03708848055736689, 0.06657327635767228, 0.05110185241493615, 0.04823106409311639, 0.0635408342739163], [0.04150080722414154, 0.06332161576246768, 0.15584728188740374, 0.10211441619850055, 0.049028607009550355, 0.05535455202415265, 0.040781731999420305, 0.047055640453124284, 0.028744836441760478, 0.058699224498302595  …  0.03332858315541014, 0.06395147696918803, 0.04980165081635798, 0.09253572176985131, 0.036833962818158164, 0.031917866836413324, 0.0741331730353965, 0.11129509611654069, 0.34959776501224477, 0.04514282829421119])

Predict outlier probabilities based on the test data.

predict(knn_probabilistic, rows=test)
1886-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
 UnivariateFinite{OrderedFactor{2}}(normal=>0.96, outlier=>0.0403)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.931, outlier=>0.069)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.81, outlier=>0.19)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.88, outlier=>0.12)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.95, outlier=>0.0502)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.941, outlier=>0.0585)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.961, outlier=>0.0394)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.952, outlier=>0.0476)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.976, outlier=>0.0236)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.937, outlier=>0.0629)
 ⋮
 UnivariateFinite{OrderedFactor{2}}(normal=>0.93, outlier=>0.0698)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.949, outlier=>0.0512)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.893, outlier=>0.107)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.966, outlier=>0.0342)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.972, outlier=>0.0278)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.917, outlier=>0.0832)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.868, outlier=>0.132)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.555, outlier=>0.445)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.955, outlier=>0.0451)
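The machine printout above lists `normalize = scale_minmax`. A plausible reading is that raw scores are rescaled to the unit interval using the range of the training scores. The sketch below shows that idea only; `minmax_scale` is a hypothetical helper, and clamping out-of-range test scores is an assumption of this sketch rather than a confirmed detail of the library.

```julia
# Hypothetical helper (not the library's function): rescale scores to [0, 1]
# using the minimum and maximum of the training scores.
# Assumes the training scores are not all identical.
function minmax_scale(train_scores, test_scores)
    lo, hi = extrema(train_scores)
    # Test scores outside the training range are clamped to the ends.
    scale(s) = clamp((s - lo) / (hi - lo), 0.0, 1.0)
    return scale.(train_scores), scale.(test_scores)
end

train_p, test_p = minmax_scale([0.1, 0.2, 0.5], [0.05, 0.3, 0.9])
```

A test score below the smallest training score maps to 0, and one above the largest training score maps to 1.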

Predict outlier classes based on the test data.

predict(knn_deterministic, rows=test)
1886-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "normal"
 "normal"
 "outlier"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 ⋮
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "outlier"
 "normal"
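Turning scores into classes requires a threshold. The sketch below assumes a quantile-based rule: a test score is labeled "outlier" when it exceeds a chosen quantile of the training scores. The helper name `classify_by_quantile` and the 0.9 quantile are illustrative assumptions, not the documented behavior of `DeterministicDetector`.

```julia
using Statistics

# Hypothetical helper (not the library's function): label a score "outlier"
# when it exceeds the q-quantile of the training scores, otherwise "normal".
function classify_by_quantile(train_scores, test_scores; q=0.9)
    threshold = quantile(train_scores, q)
    return [s > threshold ? "outlier" : "normal" for s in test_scores]
end

# Training scores 0.0, 0.1, ..., 1.0 and two test scores.
labels = classify_by_quantile(collect(0.0:0.1:1.0), [0.2, 0.95])
```

With such a rule, the share of training points flagged as outliers is controlled directly by `q`, which is convenient when the expected outlier fraction is roughly known.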

Learn more

To learn more about the concepts in OutlierDetection.jl, check out the simple usage guide.