fit
- pycafee.sample.outliers.Tukey.fit(self, x_exp, which=None, critical=None, details=None)
This function applies the Tukey method (Boxplot) for outlier detection [1].
- Parameters
- x_exp
numpy array One-dimensional numpy array with at least 3 observations.
- details
str, optional The details parameter determines the amount of information presented about the hypothesis test.
If details = "short" (or None, i.e., the default), a simplified version of the test result is returned.
If details = "binary", the conclusion will be 1 (data has outlier) or 0 (data has no outlier).
- which
str, optional The value that should be evaluated as a possible outlier.
If it is None (default), the candidate outlier is automatically taken to be the observation farthest from the mean.
If it is "max", the highest value is checked as a possible outlier.
If it is "min", the lowest value is checked as a possible outlier.
- critical
str, int or float, optional The critical value of the test.
If critical="extreme", the critical value is 3.0 (default).
If critical="mild", the critical value is 1.5.
If a number, it must be higher than zero (0).
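The parameter rules above can be illustrated with a minimal, standalone sketch. The helpers pick_candidate and resolve_critical are hypothetical names introduced only for this illustration; they are not part of pycafee, and the library may resolve ties or validate inputs differently.

```python
from statistics import mean


def pick_candidate(data, which=None):
    """Hypothetical helper: choose the observation to be tested.

    Mirrors the documented behaviour of `which`; tie-breaking (first
    observation found) is an assumption of this sketch.
    """
    if which is None:
        center = mean(data)
        # Observation farthest from the sample mean.
        return max(data, key=lambda value: abs(value - center))
    if which == "max":
        return max(data)
    if which == "min":
        return min(data)
    raise ValueError('which must be None, "max" or "min"')


def resolve_critical(critical=None):
    """Hypothetical helper: map the `critical` argument to a number."""
    if critical is None or critical == "extreme":
        return 3.0
    if critical == "mild":
        return 1.5
    is_number = isinstance(critical, (int, float)) and not isinstance(critical, bool)
    if is_number and critical > 0:
        return float(critical)
    raise ValueError('critical must be "extreme", "mild" or a number higher than zero')


x = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
candidate_default = pick_candidate(x)        # farthest from the mean
candidate_min = pick_candidate(x, "min")     # lowest value
criterion_mild = resolve_critical("mild")
criterion_default = resolve_critical()
```

For this sample the mean is 4.86, so the default candidate is 5.4, which agrees with the Outlier value reported in the Examples section.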
- Returns
- result
tuple with
- interval
list of floats The range where the data are not considered to be outliers.
- critical
float The critical value.
- outlier
float or int The value checked as a possible outlier.
- conclusion
str or int The test conclusion (e.g., possible outlier / no outliers).
See also
Notes
The Tukey test for outlier detection checks whether the possible outlier lies within the interval adjacent to the data, i.e., the interval in which an observation is not considered an outlier. The lower limit of this range is estimated through the following equation:
\[lower = Q_1 - C \times IR\]
where \(Q_1\) is the first quartile, \(IR\) is the interquartile range and \(C\) is the decision criterion value. The upper limit of the decision range is estimated through the following equation:
\[upper = Q_3 + C \times IR\]
where \(Q_3\) is the third quartile. The interquartile range is estimated with the interquartile_range function using method="tukey".
By default, the decision criterion value is 3.0, which implies checking for an extreme outlier. The conclusion of the test is reached by checking whether the outlier is within the range where the data are not considered to be outliers:
    if lower <= outlier <= upper:
        Data does not have an outlier
    else:
        Data has an outlier
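The decision rule above can be sketched in a few lines of standard-library Python. This is an illustration only, not pycafee's implementation: tukey_hinges and tukey_fences are hypothetical names, and the quartiles are assumed to be Tukey's hinges (medians of the lower and upper halves, sharing the overall median when the sample size is odd), which may differ from what interquartile_range does with method="tukey".

```python
from statistics import median


def tukey_hinges(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption of this sketch: for an odd sample size the overall
    median is included in both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = (n + 1) // 2  # size of each half
    return median(ordered[:half]), median(ordered[n - half:])


def tukey_fences(data, c=3.0):
    """Return (lower, upper) = (Q1 - C * IR, Q3 + C * IR)."""
    q1, q3 = tukey_hinges(data)
    ir = q3 - q1  # interquartile range
    return q1 - c * ir, q3 + c * ir


x = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
candidate = 5.4  # observation farthest from the mean of x

lower, upper = tukey_fences(x, c=3.0)  # extreme criterion
has_outlier = not (lower <= candidate <= upper)
print(lower, upper, has_outlier)
```

With this sample the hinges are 4.6 and 5.0, so the interquartile range is 0.4 and the extreme-criterion interval is approximately [3.4, 6.2]; the candidate 5.4 falls inside it, so no outlier is flagged. Passing c=1.5 gives approximately [4.0, 5.6].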
References
- 1
TUKEY, J. W. Exploratory Data Analysis. 1. ed. Reading: Addison-Wesley Publishing Company, Inc., 1977.
Examples
Looking for an extreme outlier
>>> from pycafee.sample.outliers import Tukey
>>> import numpy as np
>>> x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
>>> test = Tukey()
>>> result, conclusion = test.fit(x)
>>> print(result)
TukeyResult(Interval=[3.3999999999999986, 6.200000000000001], critical=3, Outlier=5.4)
>>> print(conclusion)
The dataset has no outliers
Looking for a mild outlier
>>> from pycafee.sample.outliers import Tukey
>>> import numpy as np
>>> x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
>>> test = Tukey()
>>> result, conclusion = test.fit(x, critical="mild")
>>> print(result)
TukeyResult(Interval=[3.999999999999999, 5.6000000000000005], Critical=1.5, Outlier=5.4)
>>> print(conclusion)
The dataset has no outliers