fit
- pycafee.sample.outliers.Dixon.fit(self, x_exp, ratio=None, which=None, alfa=None, details=None)
This function applies the Dixon test to identify outliers in Normal data with few samples [1].
- Parameters
- x_exp
numpy array One-dimensional numpy array with at least 3 observations (the minimum varies with the ratio parameter).
- ratio
str or None, optional The ratio to be used. This parameter determines the equation and critical values that are used.
- alfa
float, optional The level of significance (ɑ). Default is None, which results in 0.05 (ɑ = 5%).
- details
str, optional The details parameter determines the amount of information presented about the hypothesis test.
If details = "short" (or None, i.e., the default), a simplified version of the test result is returned.
If details = "full", a detailed version of the hypothesis test result is returned.
If details = "binary", the conclusion will be 1 (\(H_0\) is rejected) or 0 (\(H_0\) is not rejected).
- which
str, optional The value that should be evaluated as a possible outlier.
If it is None (default), the outlier is automatically inferred as the farthest observation from the mean.
If it is "max", the highest value is checked as a possible outlier.
If it is "min", the lowest value is checked as a possible outlier.
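As a rough illustration of the which=None behaviour described above, the sketch below picks the extreme value that lies farthest from the sample mean. The helper name infer_which and the tie-breaking choice are assumptions made for this example; they are not part of pycafee.

```python
from statistics import fmean


def infer_which(x_exp):
    """Return "min" or "max" for the extreme that lies farthest from the mean.

    Hypothetical helper: it only illustrates the idea of the which=None
    option and is not the library's implementation.
    """
    mean = fmean(x_exp)
    lowest, highest = min(x_exp), max(x_exp)
    # Assumption of this sketch: an exact tie is resolved in favour of "max".
    if (highest - mean) >= (mean - lowest):
        return "max"
    return "min"


# Data taken from the Examples section of this page
print(infer_which([15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.68]))
print(infer_which([15.43, 15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.58]))
```

For the two data sets used in the Examples section, this selects the highest value (15.68) and the lowest value (15.43), respectively.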
- Returns
- result
tuple with
- statistic
float The test statistic.
- critical
float The critical value.
- alpha
float The significance level used.
- ratio
str The ratio used.
- conclusion
str The test conclusion (e.g., possible outlier / no outliers).
See also
Notes
The Dixon test for outlier detection has the following premise:
\(H_0:\) data does not have an outlier.
\(H_1:\) data has an outlier.
The conclusion of the test is based on the comparison between the critical value (at ɑ significance level) and the statistic of the test:
    if critical >= statistic:
        Data does not have an outlier
    else:
        Data has an outlier
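The decision rule can be sketched as a small function that mirrors the details="binary" output (1 when \(H_0\) is rejected, 0 otherwise). It follows the examples on this page, where a statistic higher than the critical value leads to rejecting \(H_0\). The name dixon_conclusion is hypothetical, and treating an exact tie as "not rejected" is an assumption of this sketch.

```python
def dixon_conclusion(statistic, critical):
    """Return 1 if H0 is rejected (possible outlier), otherwise 0.

    Hypothetical helper, not part of pycafee. A statistic strictly greater
    than the critical value rejects H0; handling of an exact tie is an
    assumption of this sketch.
    """
    return 1 if statistic > critical else 0


# Values reported in the Examples section of this page
print(dixon_conclusion(0.75, 0.568))  # statistic above critical: H0 rejected
print(dixon_conclusion(0.50, 0.615))  # statistic below critical: H0 not rejected
```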
Critical values are available for alpha equal to 0.20, 0.10, 0.05, 0.04, 0.02 and 0.01. These values are for the two-tailed Dixon distribution [2].
The minimum number of samples needed to apply the test varies depending on the ratio parameter, while the maximum number for all cases is 30. The available ranges are:
For ratio="r10" \(\rightarrow 3 \leq n \leq 30\);
For ratio="r11" \(\rightarrow 4 \leq n \leq 30\);
For ratio="r12" \(\rightarrow 5 \leq n \leq 30\);
For ratio="r20" \(\rightarrow 4 \leq n \leq 30\);
For ratio="r21" \(\rightarrow 5 \leq n \leq 30\);
For ratio="r22" \(\rightarrow 6 \leq n \leq 30\).
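The ranges listed above can be captured in a lookup table. The following is a minimal sketch of a sample-size check; the names MIN_SAMPLES, MAX_SAMPLES and check_sample_size are hypothetical and only illustrate the constraints, not how pycafee validates its input.

```python
# Minimum sample size per ratio, as listed above; the maximum is 30 for all ratios.
MIN_SAMPLES = {"r10": 3, "r11": 4, "r12": 5, "r20": 4, "r21": 5, "r22": 6}
MAX_SAMPLES = 30


def check_sample_size(n, ratio):
    """Raise ValueError if n is outside the allowed range for the given ratio.

    Hypothetical helper used only to illustrate the documented ranges.
    """
    if ratio not in MIN_SAMPLES:
        raise ValueError(f"unknown ratio: {ratio!r}")
    lower = MIN_SAMPLES[ratio]
    if not lower <= n <= MAX_SAMPLES:
        raise ValueError(
            f"ratio {ratio!r} requires {lower} <= n <= {MAX_SAMPLES}, got n = {n}"
        )
    return True


print(check_sample_size(7, "r10"))  # allowed: 3 <= 7 <= 30
try:
    check_sample_size(5, "r22")     # too few observations for r22
except ValueError as err:
    print(err)
```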
The ratio parameter determines which equation will be used to apply the test. If ratio=None (default), the general rule [2] is used to choose the ratio:
If \(3 \leq n \leq 7\), then ratio="r10" is used;
If \(8 \leq n \leq 10\), then ratio="r11" is used;
If \(11 \leq n \leq 13\), then ratio="r21" is used;
If \(14 \leq n \leq 30\), then ratio="r22" is used.
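The general rule can be written as a short function of the sample size. This is only a sketch: default_ratio is a hypothetical name, and n = 10 is assigned to "r11" here, following the usual statement of the rule in which "r21" starts at n = 11.

```python
def default_ratio(n):
    """Return the ratio selected by the general rule for a sample of size n.

    Hypothetical helper illustrating the rule; not part of pycafee.
    """
    if not 3 <= n <= 30:
        raise ValueError(f"the general rule covers 3 <= n <= 30, got n = {n}")
    if n <= 7:
        return "r10"
    if n <= 10:
        return "r11"
    if n <= 13:
        return "r21"
    return "r22"


# Sample sizes used in the Examples section of this page (7 and 8 observations)
print(default_ratio(7))
print(default_ratio(8))
```

With 7 and 8 observations this selects "r10" and "r11", which matches the ratio reported in each of the two examples on this page.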
The equations to calculate the test statistic (for the minimum or maximum values) depend on the ratio parameter, and are calculated as follows:
If ratio="r10":
\[r_{10, min} = \frac{x_2-x_1}{x_n-x_1} \; \; \; OR \; \; \; r_{10,max} = \frac{x_n-x_{n-1}}{x_n-x_1}\]
If ratio="r11":
\[r_{11, min} = \frac{x_2-x_{1}}{x_{n-1}-x_1} \; \; \; OR \; \; \; r_{11,max} = \frac{x_n-x_{n-1}}{x_{n}-x_2}\]
If ratio="r12":
\[r_{12, min} = \frac{x_2-x_{1}}{x_{n-2}-x_1} \; \; \; OR \; \; \; r_{12,max} = \frac{x_n-x_{n-1}}{x_{n}-x_3}\]
If ratio="r20":
\[r_{20, min} = \frac{x_3-x_{1}}{x_{n}-x_1} \; \; \; OR \; \; \; r_{20,max} = \frac{x_n-x_{n-2}}{x_{n}-x_1}\]
If ratio="r21":
\[r_{21, min} = \frac{x_3-x_{1}}{x_{n-1}-x_1} \; \; \; OR \; \; \; r_{21,max} = \frac{x_n-x_{n-2}}{x_{n}-x_2}\]
If ratio="r22":
\[r_{22, min} = \frac{x_3-x_{1}}{x_{n-2}-x_1} \; \; \; OR \; \; \; r_{22,max} = \frac{x_n-x_{n-2}}{x_{n}-x_3}\]
References
- 1
DIXON, W. J. Processing Data for Outliers. Biometrics, v. 9, n. 1, p. 74–89, 1953.
- 2
RORABACHER, D. B. Statistical Treatment for Rejection of Deviant Values: Critical Values of Dixon’s “Q” Parameter and Related Subrange Ratios at the 95% Confidence Level. Analytical Chemistry, v. 63, n. 2, p. 139–146, 1991.
Examples
Checking if the highest value (15.68) is a possible outlier:
>>> from pycafee.sample.dixon import Dixon
>>> import numpy as np
>>> x_exp = np.array([15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.68])
>>> test = Dixon()
>>> result, conclusion = test.fit(x_exp, which='max', details="full")
>>> print(result)
DixonResult(Statistic=0.7500000000000044, critical=0.568, alpha=0.05, ratio='r10')
>>> print(conclusion)
Since the test statistic value (0.75) is higher than the critical value (0.568), we have evidence to reject the null hypothesis that the sample does not contain outliers, and perhaps the upper value (15.68) is an outlier (with 95.0% confidence).
Checking if the lowest value (15.43) is a possible outlier:
>>> from pycafee.sample.dixon import Dixon
>>> import numpy as np
>>> x_exp = np.array([15.43, 15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.58])
>>> test = Dixon()
>>> result, conclusion = test.fit(x_exp, which='min', details="full")
>>> print(result)
DixonResult(Statistic=0.5000000000000089, critical=0.615, alpha=0.05, ratio='r11')
>>> print(conclusion)
Since the test statistic value (0.5) is lower than the critical value (0.615), we have no evidence to reject the null hypothesis that the sample does not contain outliers (with 95.0% confidence).
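The statistics reported in the two examples above can be cross-checked by hand from the ratio equations in the Notes. The sketch below is a plain-Python transcription of those equations; dixon_statistic is a hypothetical helper (not the pycafee implementation), it works on any sequence of numbers, and it does not look up critical values.

```python
VALID_RATIOS = ("r10", "r11", "r12", "r20", "r21", "r22")


def dixon_statistic(x_exp, ratio="r10", which="max"):
    """Compute the Dixon ratio r_ij for the lowest or highest observation.

    Hypothetical helper that transcribes the equations from the Notes.
    It raises ZeroDivisionError when the denominator range is zero.
    """
    if ratio not in VALID_RATIOS:
        raise ValueError(f"unknown ratio: {ratio!r}")
    if which not in ("min", "max"):
        raise ValueError('which must be "min" or "max"')
    # In r_ij, i is the offset used in the numerator and
    # j is the number of values trimmed from the opposite end.
    i, j = int(ratio[1]), int(ratio[2])
    x = sorted(x_exp)  # x[0] is x_1 and x[-1] is x_n
    if which == "min":
        # (x_{1+i} - x_1) / (x_{n-j} - x_1)
        return (x[i] - x[0]) / (x[-1 - j] - x[0])
    # (x_n - x_{n-i}) / (x_n - x_{1+j})
    return (x[-1] - x[-1 - i]) / (x[-1] - x[j])


high = [15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.68]
low = [15.43, 15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.58]
print(round(dixon_statistic(high, ratio="r10", which="max"), 4))
print(round(dixon_statistic(low, ratio="r11", which="min"), 4))
```

Rounded to four decimals, the two calls give 0.75 and 0.5, in agreement with the Statistic values printed by the examples.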