fit

pycafee.sample.outliers.Grubbs.fit(self, x_exp, kind=None, which=None, alfa=None, details=None)

This function applies the Grubbs test to identify outliers in small samples of normally distributed data [1].

Parameters
x_exp : numpy array

One-dimensional numpy array with at least 3 observations (the allowed sample size varies with kind).

kind : str, optional

The type of the test.

  • If kind="one" (or None), the test for a single outlier (\(G^{'}\)) is applied.

  • If kind="two", the test for one outlier on each side of the distribution (\(G^{''}\)) is applied.

  • If kind="three", the test for two outliers on the same side of the distribution (\(G^{'''}\)) is applied.

alfa : float, optional

The significance level: 0.10, 0.05 (default), or 0.01.

details : str, optional

The details parameter determines the amount of information presented about the hypothesis test.

  • If details = "short" (or None, i.e., the default), a simplified version of the test result is returned.

  • If details = "full", a detailed version of the hypothesis test result is returned.

  • If details = "binary", the conclusion will be 1 (\(H_0\) is rejected) or 0 (\(H_0\) is accepted).

which : str, optional

The value that should be evaluated as a possible outlier.

  • If which=None (default), the possible outlier is automatically inferred as the farthest observation from the mean.

  • If which="max", the highest value is checked as a possible outlier.

  • If which="min", the lowest value is checked as a possible outlier.

The which parameter has no effect when kind="two".

Returns
result : tuple with
statistic : float

The test statistic.

critical : float

The critical value.

alpha : float

The significance level used.

kind : str

The kind parameter used.

outlier : number or list of numbers

The value or values tested as possible outliers.

  • If kind="one", a single number is returned with the value tested.

  • If kind="two" or kind="three", a list with the two values tested is returned.

conclusion : str

The test conclusion (e.g., possible outlier or no outliers).

Notes

The Grubbs test is implemented in three variants. The first variant (\(G^{'}\)) checks whether the dataset has a single outlier (kind="one"), with the following hypotheses:

\(H_0:\) data does not have an outlier.

\(H_1:\) data has an outlier.

The conclusion of the test is based on the comparison between the critical value (at the ɑ significance level) and the test statistic:

if critical >= statistic:
    Data does not have an outlier
else:
    Data has a possible outlier

For this case, when the possible outlier is the smallest observation in the data set (which=="min"), the test statistic is computed as:

\[G^{'} = \frac{\overline{x}-x_1}{s}\]

and when the possible outlier is the highest observation in the data set (which=="max"), the test statistic is computed as:

\[G^{'} = \frac{x_n-\overline{x}}{s}\]

where \(\overline{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(x_1\) and \(x_n\) are the smallest and the largest observations, respectively.
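As a minimal sketch of the two equations above (not pycafee's own implementation), the statistic can be computed directly with NumPy. The helper name `grubbs_one_statistic` is hypothetical, the sample standard deviation is assumed to use `ddof=1`, and the tie-breaking rule used when `which` is None is an assumption.

```python
import numpy as np


def grubbs_one_statistic(x, which=None):
    """Return G' for the lowest ("min") or highest ("max") observation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    s = x.std(ddof=1)  # sample standard deviation (assumed ddof=1)
    if which is None:
        # Assumption: pick the observation farthest from the mean,
        # preferring the maximum when both sides are equally distant.
        which = "max" if (x.max() - mean) >= (mean - x.min()) else "min"
    if which == "max":
        return (x.max() - mean) / s
    if which == "min":
        return (mean - x.min()) / s
    raise ValueError('which must be None, "min" or "max"')


x = np.array([159, 153, 184, 153, 156, 150, 147])
g_one = grubbs_one_statistic(x)
```

For the first dataset used in the Examples section, this sketch gives approximately 2.1532, which agrees with the statistic reported there.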

The second variant (\(G^{''}\)) checks whether the dataset has two outliers (kind="two"), one on each side of the distribution (the smallest and the largest values), with the following hypotheses:

\(H_0:\) data does not have an outlier.

\(H_1:\) the minimum and the maximum values are possible outliers.

The conclusion of the test is based on the comparison between the critical value (at the ɑ significance level) and the test statistic:

if critical >= statistic:
    Data does not have an outlier
else:
    Data has possible outliers

For this case, the test statistic is computed as:

\[G^{''} = \frac{x_n-x_1}{s}\]
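The equation above is the sample range divided by the sample standard deviation. A minimal NumPy sketch follows; the helper name `grubbs_two_statistic` is hypothetical and `ddof=1` is assumed for the standard deviation.

```python
import numpy as np


def grubbs_two_statistic(x):
    """Return G'' = (x_n - x_1) / s for a one-dimensional sample."""
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)  # sample standard deviation (assumed ddof=1)
    return (x.max() - x.min()) / s


x = np.array([159, 153, 184, 153, 156, 150, 147, 140])
g_two = grubbs_two_statistic(x)
```

For the dataset used in the last example of the Examples section, this gives approximately 3.3896, which agrees with the statistic reported there.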

The third variant (\(G^{'''}\)) checks whether the dataset has two outliers on the same side of the distribution (kind="three"), with the following hypotheses:

\(H_0:\) data does not have an outlier.

\(H_1:\) data has two possible outliers.

The conclusion of the test is based on the comparison between the critical value (at the ɑ significance level) and the test statistic:

if critical < statistic:
    Data does not have an outlier
else:
    Data has two possible outliers

Note that the comparison in this third case is the reverse of what is usually done in hypothesis testing: the null hypothesis is rejected when the statistic is lower than the critical value.

For this case, when the two possible outliers are the smallest observations in the data set (which=="min"), the test statistic is computed as:

\[G^{'''} = \frac{(n-3)\times s^2_{2 \; lower}}{(n-1)\times s^2}\]

and when the two possible outliers are the highest observations in the data set (which=="max"), the test statistic is computed as:

\[G^{'''} = \frac{(n-3)\times s^2_{2 \; upper}}{(n-1)\times s^2}\]

where \(s^2\) is the variance of the full sample and \(s^2_{2 \; lower}\) (or \(s^2_{2 \; upper}\)) is the variance of the sample without its two smallest (or two highest) observations.
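The ratio above compares the variance of the sample without its two most extreme observations on one side to the variance of the full sample. A minimal NumPy sketch follows; the helper name `grubbs_three_statistic` is hypothetical, both variances are assumed to use `ddof=1`, and `which` must be given explicitly here because how the library infers it for this variant is not described.

```python
import numpy as np


def grubbs_three_statistic(x, which="max"):
    """Return G''' for the two highest ("max") or two lowest ("min") values."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if which == "max":
        trimmed = x[:-2]  # drop the two highest observations
    elif which == "min":
        trimmed = x[2:]  # drop the two lowest observations
    else:
        raise ValueError('which must be "min" or "max"')
    s2_trimmed = trimmed.var(ddof=1)  # variance without the two suspects
    s2_full = x.var(ddof=1)  # variance of the full sample
    return ((n - 3) * s2_trimmed) / ((n - 1) * s2_full)


x = np.array([159, 153, 184, 153, 156, 150, 147, 186])
g_three = grubbs_three_statistic(x, which="max")
```

For the dataset used with kind="three" in the Examples section, this gives approximately 0.0553, which agrees with the statistic reported there.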

Critical values are available for alpha equal to 0.10, 0.05, and 0.01. These values are for the two-tailed Grubbs distribution [2].
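Given a test statistic and the corresponding tabulated critical value, the decision step for the three variants can be sketched as below. The function name `grubbs_rejects_h0` is hypothetical, the direction of each comparison follows the outputs shown in the Examples section, and the handling of an exact tie between the statistic and the critical value is an assumption.

```python
def grubbs_rejects_h0(statistic, critical, kind="one"):
    """Return True when the null hypothesis (no outliers) is rejected."""
    if kind in ("one", "two"):
        # Large statistics point to outliers.
        return statistic > critical
    if kind == "three":
        # Reversed comparison: small statistics point to outliers.
        return statistic < critical
    raise ValueError('kind must be "one", "two" or "three"')


# Statistic and critical value pairs taken from the Examples section.
one_result = grubbs_rejects_h0(2.1532047136140045, 2.02, kind="one")
two_result = grubbs_rejects_h0(3.3896333493939195, 3.399, kind="two")
three_result = grubbs_rejects_h0(0.05528255528255528, 0.1101, kind="three")
```

With these inputs, the sketch flags possible outliers for kind="one" and kind="three" and none for kind="two", matching the conclusions printed in the examples.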

The minimum and maximum number of samples needed to apply the test vary depending on the kind parameter. The range for each option is as follows:

  • If kind="one": 3<=n<=30;

  • If kind="two": 3<=n<=20;

  • If kind="three": 4<=n<=30;
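The ranges listed above can be expressed as a small lookup table, which is convenient for validating a sample before running the test. This is only an illustrative sketch: the names `SAMPLE_SIZE_LIMITS` and `sample_size_is_valid` are hypothetical and are not part of pycafee.

```python
# Inclusive (minimum, maximum) sample sizes for each kind, as listed above.
SAMPLE_SIZE_LIMITS = {
    "one": (3, 30),
    "two": (3, 20),
    "three": (4, 30),
}


def sample_size_is_valid(n, kind="one"):
    """Return True when n observations are within the range for this kind."""
    if kind not in SAMPLE_SIZE_LIMITS:
        raise ValueError('kind must be "one", "two" or "three"')
    lower, upper = SAMPLE_SIZE_LIMITS[kind]
    return lower <= n <= upper


ok_for_one = sample_size_is_valid(7, kind="one")
too_many_for_two = sample_size_is_valid(25, kind="two")
too_few_for_three = sample_size_is_valid(3, kind="three")
```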

References

[1] GRUBBS, F. E. Sample Criteria for Testing Outlying Observations. The Annals of Mathematical Statistics, v. 21, n. 1, p. 27–58, 1950.

[2] GRUBBS, F. E.; BECK, G. Extension of Sample Sizes and Percentage Points for Significance Tests of Outlying Observations. Technometrics, v. 14, n. 4, p. 847–854, 1972.

Examples

Checking if the highest observation is a possible outlier at a 95% confidence level

>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x)
>>> print(result)
GrubbsResult(Statistic=2.1532047136140045, Critical=2.02, alpha=0.05, kind='one', outlier=184)
>>> print(conclusion)
The sample 184 perhaps be an outlier (95.0% confidence level)

Checking if the highest observation is a possible outlier at a 99% confidence level

>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x, alfa=0.01, details="full")
>>> print(result)
GrubbsResult(Statistic=2.1532047136140045, Critical=2.139, alpha=0.01, kind='one', outlier=184)
>>> print(conclusion)
Since the test statistic (2.153) is higher than the critical value (2.139), we have evidence to reject the null hypothesis, and perhaps sample 184 is an outlier (99.0% confidence level)

Checking if the two highest observations are possible outliers at a 95% confidence level

>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147, 186])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x, kind="three", details="full")
>>> print(result)
GrubbsResult(Statistic=0.05528255528255528, Critical=0.1101, alpha=0.05, kind='three', outlier=[184, 186])
>>> print(conclusion)
Since the test statistic (0.055) is lower than the critical value (0.11), we have evidence to reject the null hypothesis, and perhaps sample 184 and 186 are outliers (95.0% confidence level)

Checking if the highest and the lowest observations are possible outliers at a 95% confidence level

>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147, 140])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x, kind="two", details="full")
>>> print(result)
GrubbsResult(Statistic=3.3896333493939195, Critical=3.399, alpha=0.05, kind='two', outlier=[140, 184])
>>> print(conclusion)
As the test statistic (3.389) is lower than the critical value (3.399), we have no evidence to reject the null hypothesis that the sample does not contain outliers (95.0% confidence level)