fit

pycafee.sample.outliers.Tukey.fit(self, x_exp, which=None, critical=None, details=None)

This function applies the Tukey method (Boxplot) for outlier detection [1].

Parameters
x_exp : numpy array

One-dimensional numpy array with at least 3 observations.

details : str, optional

The details parameter determines the amount of information presented about the test result.

  • If details = "short" (or None, i.e., the default), a simplified version of the test result is returned.

  • If details = "binary", the conclusion will be 1 (data has outlier) or 0 (data has no outlier).

which : str, optional

The value that should be evaluated as a possible outlier.

  • If it is None (default), the possible outlier is automatically inferred as the observation farthest from the mean.

  • If it is "max", the highest value is checked as a possible outlier.

  • If it is "min", the lowest value is checked as a possible outlier.
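The selection rules above can be sketched in a few lines. This is only an illustration of the documented behaviour, not the library's code: `select_candidate` is a hypothetical helper name, and ties (two values equally far from the mean) are resolved here by taking the first one, which is an assumption.

```python
from statistics import fmean


def select_candidate(x, which=None):
    """Hypothetical helper: pick the value to test, following the `which` rules."""
    values = [float(v) for v in x]
    if which is None:
        center = fmean(values)
        # Observation farthest from the mean (first one wins on ties: assumption).
        return max(values, key=lambda v: abs(v - center))
    if which == "max":
        return max(values)
    if which == "min":
        return min(values)
    raise ValueError('which must be None, "max" or "min"')


x = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
print(select_candidate(x))               # 5.4
print(select_candidate(x, which="min"))  # 4.4
```

For this sample the mean is 4.86, so 5.4 (distance 0.54) is farther from it than 4.4 (distance 0.46).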

critical : str, int or float, optional

The critical value of the test.

  • If critical="extreme" (or None, the default), the critical value is 3.0;

  • If critical="mild", the critical value is 1.5;

  • If it is a number, it must be greater than zero (0).
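A minimal sketch of how the critical parameter could be mapped to a number, assuming the rules listed above. The helper name `resolve_critical` and the error message are hypothetical and are not part of pycafee.

```python
def resolve_critical(critical=None):
    """Hypothetical helper: translate the `critical` argument into a float."""
    if critical is None or critical == "extreme":
        return 3.0
    if critical == "mild":
        return 1.5
    # Accept plain numbers only; bool is excluded because it subclasses int.
    is_number = isinstance(critical, (int, float)) and not isinstance(critical, bool)
    if is_number and critical > 0:
        return float(critical)
    raise ValueError('critical must be "extreme", "mild" or a number greater than zero')


print(resolve_critical())        # 3.0
print(resolve_critical("mild"))  # 1.5
print(resolve_critical(2))       # 2.0
```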

Returns
result : tuple with
interval : list of floats

The range where the data are not considered to be outliers.

critical : float

The critical value.

outlier : float or int

The value checked as a possible outlier.

conclusion : str or int

The test conclusion (e.g., possible outlier / no outliers).

Notes

The Tukey test for outlier detection checks whether the possible outlier lies within the interval adjacent to the data, i.e., the interval in which an observation is not considered an outlier. The lower limit of this interval is estimated through the following equation:

\[lower = Q_1 - C \times IR\]

where \(Q_1\) is the first quartile, \(IR\) is the interquartile range and \(C\) is the decision criterion value. The upper limit of the interval is estimated through the following equation:

\[upper = Q_3 + C \times IR\]

where \(Q_3\) is the third quartile. The interquartile range is estimated with interquartile_range function using method="tukey".
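The two limits can be sketched with the standard library alone. This example assumes that method="tukey" means Tukey's hinges (the medians of the lower and upper halves of the sorted data, with the overall median included in both halves when the sample size is odd); that reading is an assumption, and `tukey_fences` is a hypothetical name rather than a pycafee function.

```python
from statistics import median


def tukey_fences(x, c=3.0):
    """Hypothetical helper: return (lower, upper) = (Q1 - C*IR, Q3 + C*IR)."""
    data = sorted(float(v) for v in x)
    n = len(data)
    half = (n + 1) // 2           # size of each half; shares the median when n is odd
    q1 = median(data[:half])      # lower hinge (assumed definition of Q1)
    q3 = median(data[n - half:])  # upper hinge (assumed definition of Q3)
    ir = q3 - q1                  # interquartile range
    return q1 - c * ir, q3 + c * ir


x = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
print(tukey_fences(x, c=3.0))  # approximately (3.4, 6.2)
print(tukey_fences(x, c=1.5))  # approximately (4.0, 5.6)
```

Under this assumption the sketch gives limits close to 3.4 and 6.2 for C = 3.0, and close to 4.0 and 5.6 for C = 1.5, up to floating-point rounding.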

By default, the decision criterion value is 3.0, which implies checking for an extreme outlier. The test conclusion is reached by checking whether the possible outlier lies within the range where the data are not considered outliers:

if lower <= outlier <= upper:
    print("Data does not have an outlier")
else:
    print("Data has an outlier")
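As a worked illustration, take the sample 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9 with the default \(C = 3.0\). Assuming the quartiles are Tukey's hinges (medians of the lower and upper halves of the sorted data), \(Q_1 = 4.6\) and \(Q_3 = 5.0\), so:

\[IR = Q_3 - Q_1 = 5.0 - 4.6 = 0.4\]

\[lower = 4.6 - 3.0 \times 0.4 = 3.4\]

\[upper = 5.0 + 3.0 \times 0.4 = 6.2\]

The observation farthest from the mean (4.86) is 5.4. Since \(3.4 \leq 5.4 \leq 6.2\), it lies inside the interval and is not flagged as an outlier.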

References

[1] TUKEY, J. W. Exploratory Data Analysis. 1. ed. Reading: Addison-Wesley Publishing Company, Inc., 1977.

Examples

Looking for an extreme outlier

>>> from pycafee.sample.outliers import Tukey
>>> import numpy as np
>>> x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
>>> test = Tukey()
>>> result, conclusion = test.fit(x)
>>> print(result)
TukeyResult(Interval=[3.3999999999999986, 6.200000000000001], critical=3, Outlier=5.4)
>>> print(conclusion)
The dataset has no outliers

Looking for a mild outlier

>>> from pycafee.sample.outliers import Tukey
>>> import numpy as np
>>> x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
>>> test = Tukey()
>>> result, conclusion = test.fit(x, critical="mild")
>>> print(result)
TukeyResult(Interval=[3.999999999999999, 5.6000000000000005], Critical=1.5, Outlier=5.4)
>>> print(conclusion)
The dataset has no outliers