fit

pycafee.sample.outliers.Dixon.fit(self, x_exp, ratio=None, which=None, alfa=None, details=None)

This function applies the Dixon test to identify outliers in Normal data with few samples [1].

Parameters
x_exp : numpy array

One-dimensional numpy array with at least 3 observations. The exact minimum depends on the ratio parameter (see Notes).

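ratio : str or None, optional

The ratio to be used: "r10", "r11", "r12", "r20", "r21" or "r22". This parameter determines the equation and the critical values that are used. If None (default), the ratio is chosen from the sample size (see Notes).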

alfa : float, optional

The level of significance (α). Default is None, which results in 0.05 (α = 5%).

details : str, optional

The details parameter determines the amount of information presented about the hypothesis test.

  • If details = "short" (or None, i.e., the default), a simplified version of the test result is returned.

  • If details = "full", a detailed version of the hypothesis test result is returned.

  • If details = "binary", the conclusion will be 1 (\(H_0\) is rejected) or 0 (\(H_0\) is not rejected).

which : str, optional

The value that should be evaluated as a possible outlier.

  • If it is None (default), the candidate outlier is automatically taken to be the observation farthest from the mean.

  • If it is "max", the highest value is checked as a possible outlier.

  • If it is "min", the lowest value is checked as a possible outlier.
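As a rough illustration of the default behaviour (which=None), the sketch below picks the end of the sample that lies farthest from the mean. The function name `infer_which` is hypothetical and not part of pycafee. How the library resolves ties is not stated here, so the tie-break toward "max" is an assumption.

```python
from statistics import mean


def infer_which(x_exp):
    """Return "min" or "max", whichever extreme is farthest from the mean.

    Hypothetical helper for illustration only; not part of pycafee.
    """
    m = mean(x_exp)
    lowest, highest = min(x_exp), max(x_exp)
    # Assumption: a tie is resolved in favour of the highest value.
    return "min" if (m - lowest) > (highest - m) else "max"


print(infer_which([15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.68]))  # max
```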

Returns
result : tuple with

    statistic : float

    The test statistic.

    critical : float

    The critical value.

    alpha : float

    The significance level used.

    ratio : str

    The ratio used.

conclusion : str

The test conclusion (e.g., possible outlier or no outliers).

Notes

The Dixon test for outlier detection has the following hypotheses:

\(H_0:\) data does not have an outlier.

\(H_1:\) data has an outlier.

The conclusion of the test is based on the comparison between the critical value (at the α significance level) and the test statistic:

if statistic <= critical:
    Data does not have an outlier
else:
    Data has an outlier

There are critical values for alpha equal to 0.20, 0.10, 0.05, 0.04, 0.02 and 0.01. These values are for the two-tailed Dixon distribution [2].
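A minimal sketch of how the significance level could be resolved against this list. The helper `resolve_alpha` is hypothetical, and raising an error for a level without tabulated critical values is an assumption, not documented pycafee behaviour.

```python
# Significance levels with tabulated critical values, as listed above.
TABULATED_ALPHAS = (0.20, 0.10, 0.05, 0.04, 0.02, 0.01)


def resolve_alpha(alfa=None):
    """Return the significance level to use, defaulting to 0.05.

    Hypothetical helper; raising on untabulated values is an assumption.
    """
    if alfa is None:
        return 0.05
    if alfa not in TABULATED_ALPHAS:
        raise ValueError(f"no critical values tabulated for alfa={alfa}")
    return alfa


print(resolve_alpha())      # 0.05
print(resolve_alpha(0.01))  # 0.01
```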

The minimum number of samples needed to apply the test varies depending on the ratio parameter, while the maximum number for all cases is 30. The available ranges are:

  • For ratio="r10" \(\rightarrow 3 \leq n \leq 30\);

  • For ratio="r11" \(\rightarrow 4 \leq n \leq 30\);

  • For ratio="r12" \(\rightarrow 5 \leq n \leq 30\);

  • For ratio="r20" \(\rightarrow 4 \leq n \leq 30\);

  • For ratio="r21" \(\rightarrow 5 \leq n \leq 30\);

  • For ratio="r22" \(\rightarrow 6 \leq n \leq 30\).
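The ranges above can be captured in a small lookup table. This is only a sketch: `MIN_N`, `MAX_N` and `check_sample_size` are hypothetical names, not part of pycafee.

```python
# Minimum sample size per ratio, as listed above; the maximum is 30 in all cases.
MIN_N = {"r10": 3, "r11": 4, "r12": 5, "r20": 4, "r21": 5, "r22": 6}
MAX_N = 30


def check_sample_size(n, ratio):
    """Return True if n lies within the tabulated range for the given ratio."""
    if ratio not in MIN_N:
        raise ValueError(f"unknown ratio: {ratio!r}")
    return MIN_N[ratio] <= n <= MAX_N


print(check_sample_size(4, "r10"))  # True
print(check_sample_size(4, "r12"))  # False
```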

The ratio parameter determines which equation will be used to apply the test. If ratio=None (default), the general rule [2] is used to determine outliers:

  • If \(3 \leq n \leq 7\), then ratio="r10" is used;

  • If \(8 \leq n \leq 10\), then ratio="r11" is used;

  • If \(11 \leq n \leq 13\), then ratio="r21" is used;

  • If \(14 \leq n \leq 30\), then ratio="r22" is used.
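The general rule can be sketched as a simple lookup on the sample size, with n = 10 falling under r11. The function name `default_ratio` is hypothetical and not part of pycafee.

```python
def default_ratio(n):
    """Pick the ratio from the sample size n, following the general rule above.

    Hypothetical helper for illustration only; not part of pycafee.
    """
    if 3 <= n <= 7:
        return "r10"
    if 8 <= n <= 10:
        return "r11"
    if 11 <= n <= 13:
        return "r21"
    if 14 <= n <= 30:
        return "r22"
    raise ValueError("the rule covers 3 <= n <= 30 only")


print(default_ratio(7))  # r10
print(default_ratio(8))  # r11
```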

The equations used to calculate the test statistic (for the minimum or maximum value) depend on the ratio parameter. With the observations sorted in ascending order, \(x_1 \leq x_2 \leq \dots \leq x_n\), they are:

  • If ratio="r10":

\[r_{10,min} = \frac{x_2-x_1}{x_n-x_1} \quad \text{or} \quad r_{10,max} = \frac{x_n-x_{n-1}}{x_n-x_1}\]

  • If ratio="r11":

\[r_{11,min} = \frac{x_2-x_1}{x_{n-1}-x_1} \quad \text{or} \quad r_{11,max} = \frac{x_n-x_{n-1}}{x_n-x_2}\]

  • If ratio="r12":

\[r_{12,min} = \frac{x_2-x_1}{x_{n-2}-x_1} \quad \text{or} \quad r_{12,max} = \frac{x_n-x_{n-1}}{x_n-x_3}\]

  • If ratio="r20":

\[r_{20,min} = \frac{x_3-x_1}{x_n-x_1} \quad \text{or} \quad r_{20,max} = \frac{x_n-x_{n-2}}{x_n-x_1}\]

  • If ratio="r21":

\[r_{21,min} = \frac{x_3-x_1}{x_{n-1}-x_1} \quad \text{or} \quad r_{21,max} = \frac{x_n-x_{n-2}}{x_n-x_2}\]

  • If ratio="r22":

\[r_{22,min} = \frac{x_3-x_1}{x_{n-2}-x_1} \quad \text{or} \quad r_{22,max} = \frac{x_n-x_{n-2}}{x_n-x_3}\]
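The six pairs of equations share one pattern: for \(r_{ij}\), the numerator spans i positions from the suspect value, and the denominator drops j values from the opposite end. The sketch below uses that pattern on a sorted copy of the data. The function `dixon_statistic` is hypothetical and is not the pycafee implementation. It does not validate the sample size and assumes the denominator is non-zero.

```python
def dixon_statistic(x_exp, ratio="r10", which="max"):
    """Compute the Dixon ratio r_ij for the lowest or highest value.

    Hypothetical helper for illustration only; not part of pycafee.
    """
    x = sorted(x_exp)                    # x[0] is x_1, x[-1] is x_n
    i, j = int(ratio[1]), int(ratio[2])  # the two digits of "r_ij"
    if which == "min":
        numerator = x[i] - x[0]          # x_{1+i} - x_1
        denominator = x[-1 - j] - x[0]   # x_{n-j} - x_1
    else:
        numerator = x[-1] - x[-1 - i]    # x_n - x_{n-i}
        denominator = x[-1] - x[j]       # x_n - x_{1+j}
    return numerator / denominator


data = [15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.68]
print(round(dixon_statistic(data, "r10", "max"), 4))  # 0.75
```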

References

[1] DIXON, W. J. Processing Data for Outliers. Biometrics, v. 9, n. 1, p. 74–89, 1953.

[2] RORABACHER, D. B. Statistical Treatment for Rejection of Deviant Values: Critical Values of Dixon’s “Q” Parameter and Related Subrange Ratios at the 95% Confidence Level. Analytical Chemistry, v. 63, n. 2, p. 139–146, 1991.

Examples

Checking if the highest value (15.68) is a possible outlier:

>>> from pycafee.sample.outliers import Dixon
>>> import numpy as np
>>> x_exp = np.array([15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.68])
>>> test = Dixon()
>>> result, conclusion = test.fit(x_exp, which='max', details="full")
>>> print(result)
DixonResult(Statistic=0.7500000000000044, critical=0.568, alpha=0.05, ratio='r10')
>>> print(conclusion)
Since the test statistic value (0.75) is higher than the critical value (0.568), we have evidence to reject the null hypothesis that the sample does not contain outliers, and perhaps the upper value (15.68) is an outlier (with 95.0% confidence).

Checking if the lowest value (15.43) is a possible outlier:

>>> from pycafee.sample.outliers import Dixon
>>> import numpy as np
>>> x_exp = np.array([15.43, 15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.58])
>>> test = Dixon()
>>> result, conclusion = test.fit(x_exp, which='min', details="full")
>>> print(result)
DixonResult(Statistic=0.5000000000000089, critical=0.615, alpha=0.05, ratio='r11')
>>> print(conclusion)
Since the test statistic value (0.5) is lower than the critical value (0.615), we have no evidence to reject the null hypothesis that the sample does not contain outliers (with 95.0% confidence).
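Both results can be reproduced by hand from the r10 and r11 equations in the Notes. The sketch below uses plain Python and does not call pycafee. The critical values (0.568 and 0.615) are copied from the outputs above rather than looked up in a table.

```python
# First example: n = 7, ratio r10, highest value under test.
x = sorted([15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.68])
r10_max = (x[-1] - x[-2]) / (x[-1] - x[0])
print(round(r10_max, 4), r10_max > 0.568)  # 0.75 True  (H0 rejected)

# Second example: n = 8, ratio r11, lowest value under test.
y = sorted([15.43, 15.48, 15.51, 15.52, 15.52, 15.53, 15.53, 15.58])
r11_min = (y[1] - y[0]) / (y[-2] - y[0])
print(round(r11_min, 4), r11_min > 0.615)  # 0.5 False  (H0 not rejected)
```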