# Getting Started
This example demonstrates using the OutlierDetection API to determine the outlierness of instances in the Thyroid Disease Dataset, which is part of the ODDS collection. We use OutlierDetectionData.jl to load the dataset.
Import MLJ, OutlierDetection, and OutlierDetectionData.
```julia
using MLJ
using OutlierDetection
using OutlierDetectionData: ODDS
```
Load the "thyroid" dataset from the ODDS collection.
```julia
X, y = ODDS.load("thyroid")
```
```
(3772×6 DataFrame
  Row │ x1        x2           x3         x4        x5        x6
      │ Float64   Float64      Float64    Float64   Float64   Float64
──────┼────────────────────────────────────────────────────────────────
    1 │ 0.774194  0.00113208   0.137571   0.275701  0.295775  0.236066
    2 │ 0.247312  0.000471698  0.279886   0.329439  0.535211  0.17377
    3 │ 0.494624  0.00358491   0.22296    0.233645  0.525822  0.12459
    4 │ 0.677419  0.00169811   0.156546   0.175234  0.333333  0.136066
    5 │ 0.236559  0.000471698  0.241935   0.320093  0.333333  0.247541
    6 │ 0.731183  0.000471698  0.147059   0.196262  0.239437  0.198361
    7 │ 0.903226  0.000471698  0.213472   0.294393  0.399061  0.195082
    8 │ 0.505376  0.00392453   0.185009   0.196262  0.276995  0.177049
  ⋮   │    ⋮           ⋮           ⋮          ⋮         ⋮         ⋮
 3766 │ 0.763441  0.00943396   0.190702   0.231308  0.323944  0.185246
 3767 │ 0.688172  0.000886792  0.0711575  0.35514   0.262911  0.331148
 3768 │ 0.817204  0.000113208  0.190702   0.287383  0.413146  0.188525
 3769 │ 0.430108  0.00245283   0.232448   0.287383  0.446009  0.17541
 3770 │ 0.935484  0.0245283    0.160342   0.28271   0.375587  0.2
 3771 │ 0.677419  0.0014717    0.190702   0.242991  0.323944  0.195082
 3772 │ 0.483871  0.00356604   0.190702   0.212617  0.338028  0.163934
                                                 3757 rows omitted, CategoricalArrays.CategoricalValue{String, UInt32}["normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal" … "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal"])
```
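Before splitting the data, it can be useful to check how imbalanced the labels are. The following is a minimal sketch, assuming the labels are the strings "normal" and "outlier" as suggested by the output above.

```julia
# Count the instances labelled "outlier" (label names assumed from the output above)
n_outliers = count(==("outlier"), y)
println("outlier fraction: ", round(n_outliers / length(y), digits=3))
```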
Create indices to split the data into 50% training and 50% test data.
```julia
train, test = partition(eachindex(y), 0.5, shuffle=true, rng=0)
```
```
([2549, 3479, 2164, 1422, 1398, 1019, 2000, 2436, 2177, 1364 … 3088, 36, 3583, 2433, 3065, 2842, 2428, 2908, 2577, 2790], [1557, 2234, 131, 1119, 2588, 1117, 2654, 133, 2860, 3678 … 2692, 214, 3685, 1663, 1479, 80, 3716, 2006, 2404, 323])
```
Load an OutlierDetectionNeighbors.KNNDetector and initialize it with k=10 neighbors.
```julia
KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors verbosity=0
knn = KNN(k=10)
```
```
KNNDetector(
  k = 10,
  metric = Distances.Euclidean(0.0),
  algorithm = :kdtree,
  static = :auto,
  leafsize = 10,
  reorder = true,
  parallel = false,
  reduction = :maximum)
```
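The defaults printed above can be overridden when constructing the detector. As a hedged example (consult the OutlierDetectionNeighbors documentation for the supported options), the reduction of neighbor distances can presumably be switched from the maximum to the mean:

```julia
# Hypothetical variant: score by the mean distance to the k neighbors
# (assumes :mean is a valid value for the reduction hyperparameter).
knn_mean = KNN(k=10, reduction=:mean)
```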
Bind a raw, a probabilistic, and a deterministic detector to the data using machines. The ProbabilisticDetector and DeterministicDetector wrappers turn the detector's raw scores into outlier probabilities and class labels, respectively.
```julia
knn_raw = machine(knn, X)
knn_probabilistic = machine(ProbabilisticDetector(knn), X)
knn_deterministic = machine(DeterministicDetector(knn), X)
```
```
untrained Machine; does not cache data
  model: DeterministicUnsupervisedCompositeDetector(normalize = scale_minmax, …)
  args:
    1:	Source @666 ⏎ Table{AbstractVector{Continuous}}
```
Learn models from the training data.
```julia
fit!(knn_raw, rows=train)
fit!(knn_probabilistic, rows=train)
fit!(knn_deterministic, rows=train)
```
```
trained Machine; does not cache data
  model: DeterministicUnsupervisedCompositeDetector(normalize = scale_minmax, …)
  args:
    1:	Source @666 ⏎ Table{AbstractVector{Continuous}}
```
Transform the data into raw outlier scores.
```julia
transform(knn_raw, rows=test)
```
```
([0.24795057307189558, 0.04163784860104409, 0.047810818591195306, 0.04338317044306964, 0.03648438361620953, 0.19331044322801444, 0.0648797052935328, 0.0604534313203906, 0.0626014343074869, 0.03237110986551567 … 0.07920929039144071, 0.04271525372196582, 0.04312011181720217, 0.06576503980590592, 0.034570993610013256, 0.02275082665800582, 0.19975840930450825, 0.04980355758518007, 0.08103650519015003, 0.0617888096609636], [0.06557784520257778, 0.038629381892548616, 0.05363417429961932, 0.07577679935771788, 0.09061915033647523, 0.08253340507766346, 0.06196491761650549, 0.07179941593346231, 0.088190657057796, 0.09902804040590202 … 0.04075175091227025, 0.04930769979481979, 0.053086156239446514, 0.11687117998642389, 0.03791856851206514, 0.037222290553039654, 0.035630314609887685, 0.04012485974851572, 0.03178501550709559, 0.026266282084613526])
```
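Note that the raw transform returns a tuple of two score vectors. Assuming the ordering shown above (training scores first, then test scores), they can be destructured as follows; treat this as a sketch rather than a guaranteed API contract.

```julia
# Destructure the (train_scores, test_scores) tuple returned by the raw transform
# (ordering assumed from the output above).
scores_train, scores_test = transform(knn_raw, rows=test)
```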
Predict outlier probabilities based on the test data.
```julia
predict(knn_probabilistic, rows=test)
```
```
1886-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
 UnivariateFinite{OrderedFactor{2}}(normal=>0.932, outlier=>0.0682)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.965, outlier=>0.0347)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.947, outlier=>0.0534)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.919, outlier=>0.0809)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.901, outlier=>0.0994)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.911, outlier=>0.0893)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.936, outlier=>0.0637)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.924, outlier=>0.076)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.904, outlier=>0.0964)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.89, outlier=>0.11)
 ⋮
 UnivariateFinite{OrderedFactor{2}}(normal=>0.952, outlier=>0.048)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.947, outlier=>0.0527)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.868, outlier=>0.132)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.966, outlier=>0.0338)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.967, outlier=>0.0329)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.969, outlier=>0.031)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.963, outlier=>0.0366)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.974, outlier=>0.0262)
 UnivariateFinite{OrderedFactor{2}}(normal=>0.981, outlier=>0.0193)
```
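To work with the probabilities as plain numbers, the UnivariateFinite predictions can be evaluated with pdf, as is standard in MLJ:

```julia
# Extract the numeric probability assigned to the "outlier" class
# from each UnivariateFinite prediction.
probs = predict(knn_probabilistic, rows=test)
outlier_probs = pdf.(probs, "outlier")
```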
Predict outlier classes based on the test data.
```julia
predict(knn_deterministic, rows=test)
```
```
1886-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 ⋮
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
 "normal"
```
## Learn more
To learn more about the concepts in OutlierDetection.jl, check out the simple usage guide.