fit
- pycafee.sample.outliers.Grubbs.fit(self, x_exp, kind=None, which=None, alfa=None, details=None)
This function applies the Grubbs test to identify outliers in small samples of normally distributed data [1].
- Parameters
- x_exp
numpy array One-dimensional numpy array with at least 3 observations (the allowed range depends on kind; see Notes).
- kind
str, optional The type of the test.
If kind="one" (or None), the test uses the \(G^{'}\) statistic (a single outlier).
If kind="two", the test uses the \(G^{''}\) statistic (two outliers, one on each side).
If kind="three", the test uses the \(G^{'''}\) statistic (two outliers on the same side).
- alfa
float The significance level (0.10, 0.05 (default) or 0.01).
- details
str, optional The details parameter determines the amount of information presented about the hypothesis test.
If details="short" (or None, i.e., the default), a simplified version of the test result is returned.
If details="full", a detailed version of the hypothesis test result is returned.
If details="binary", the conclusion will be 1 (\(H_0\) is rejected) or 0 (\(H_0\) is not rejected).
- which
str, optional The value that should be evaluated as a possible outlier.
If which=None (default), the possible outlier is automatically inferred as the farthest observation from the mean.
If which="max", the highest value is checked as a possible outlier.
If which="min", the lowest value is checked as a possible outlier.
The which parameter has no effect when kind="two".
- Returns
- result
tuple with
- statistic
float The test statistic.
- critical
float The critical value.
- alpha
float The significance level used.
- kind
str The kind parameter used.
- outlier
number or list of numbers The outlier tested.
If kind="one", a number is returned with the value tested.
If kind="two" or kind="three", a list with two numbers is returned with the two values tested.
- conclusion
str The test conclusion (e.g., possible outlier / no outliers).
See also
Notes
The Grubbs test is implemented in three variants. The first (\(G^{'}\)) checks whether the dataset has a single outlier (kind="one"), with the following hypotheses:
\(H_0:\) data does not have an outlier.
\(H_1:\) data has an outlier.
The conclusion of the test is based on the comparison between the critical value (at the \(\alpha\) significance level) and the statistic of the test:
if critical >= statistic:
    Data does not have an outlier
else:
    Data has a possible outlier
For kind="one", when the possible outlier is the smallest observation in the data set (which="min"), the test statistic is estimated through the following equation:
\[G^{'} = \frac{\overline{x}-x_1}{s}\]
and when the possible outlier is the highest observation in the data set (which="max"), the test statistic is estimated through the following equation:
\[G^{'} = \frac{x_n-\overline{x}}{s}\]
where \(\overline{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(x_1\) and \(x_n\) are the smallest and the largest observations.
The second implementation (\(G^{''}\)) checks whether the dataset has two outliers (kind="two"), one on each side of the distribution (the smallest and the largest values), with the following hypotheses:
\(H_0:\) data does not have an outlier.
\(H_1:\) the minimum and the maximum values are possible outliers.
The conclusion of the test is based on the comparison between the critical value (at the \(\alpha\) significance level) and the statistic of the test:
if critical >= statistic:
    Data does not have an outlier
else:
    Data has two possible outliers
For kind="two", the test statistic is estimated through the following equation:
\[G^{''} = \frac{x_n-x_1}{s}\]
The third implementation (\(G^{'''}\)) checks whether the dataset has two outliers on the same side of the distribution (kind="three"), with the following hypotheses:
\(H_0:\) data does not have an outlier.
\(H_1:\) data has two possible outliers.
The conclusion of the test is based on the comparison between the critical value (at the \(\alpha\) significance level) and the statistic of the test:
if critical <= statistic:
    Data does not have an outlier
else:
    Data has two possible outliers
Note that the comparison in this third case is the reverse of what is usually done in hypothesis testing: the null hypothesis is rejected when the statistic is lower than the critical value.
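The direction of the comparison for each kind can be summarised in a few lines. The sketch below is not pycafee's code: rejects_h0 is a hypothetical helper, the directions are chosen to agree with the outputs shown in the Examples section, and treating a tie (statistic equal to critical) as a non-rejection is an assumption.

```python
def rejects_h0(statistic, critical, kind="one"):
    """Return True when H0 (no outliers) is rejected.

    Hypothetical helper, not part of pycafee. Ties count as "not rejected",
    which is an assumption.
    """
    if kind in ("one", "two"):
        # Large statistics are evidence of outliers.
        return statistic > critical
    if kind == "three":
        # Reversed comparison: small ratios are evidence of outliers.
        return statistic < critical
    raise ValueError(f"unknown kind: {kind!r}")


# Statistic and critical values reported in the Examples section.
print(rejects_h0(2.1532047136140045, 2.02, kind="one"))       # True
print(rejects_h0(0.05528255528255528, 0.1101, kind="three"))  # True
print(rejects_h0(3.3896333493939195, 3.399, kind="two"))      # False
```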
For kind="three", when the two possible outliers are the smallest observations in the data set (which="min"), the test statistic is estimated through the following equation:
\[G^{'''} = \frac{(n-3)\times s^2_{2 \; lower}}{(n-1)\times s^2}\]
and when the two possible outliers are the highest observations in the data set (which="max"), the test statistic is estimated through the following equation:
\[G^{'''} = \frac{(n-3)\times s^2_{2 \; upper}}{(n-1)\times s^2}\]
where \(s^2\) is the variance of the full sample and \(s^2_{2 \; lower}\) (\(s^2_{2 \; upper}\)) is the variance of the sample without its two lowest (highest) observations.
There are critical values for alpha equal to 0.10, 0.05 and 0.01. These values are for the two-tailed Grubbs distribution [2].
The minimum and maximum sample sizes for which the test can be applied depend on the kind parameter. The range for each option is as follows:
If kind="one": 3 <= n <= 30;
If kind="two": 3 <= n <= 20;
If kind="three": 4 <= n <= 30.
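The three statistics and the sample-size ranges above can be reproduced with the standard library alone. This is a minimal sketch, not pycafee's implementation: grubbs_statistic, N_RANGE and _sum_sq are hypothetical names, and the rule used to pick a side when which=None (the side holding the observation farthest from the mean) is an assumption for kind="three". It relies on the fact that the ratio of variances weighted by (n-3) and (n-1) equals the ratio of the corresponding sums of squared deviations. Critical values are not included, since they come from the tables in [2].

```python
import math

# Sample-size ranges quoted in the Notes.
N_RANGE = {"one": (3, 30), "two": (3, 20), "three": (4, 30)}


def _sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def grubbs_statistic(x, kind="one", which=None):
    """Return (statistic, tested values). Hypothetical helper, not pycafee."""
    if kind not in N_RANGE:
        raise ValueError(f"unknown kind: {kind!r}")
    if which not in (None, "min", "max"):
        raise ValueError(f"unknown which: {which!r}")

    data = sorted(x)
    n = len(data)
    low, high = N_RANGE[kind]
    if not low <= n <= high:
        raise ValueError(f"kind={kind!r} needs {low} <= n <= {high}, got {n}")

    mean = sum(data) / n
    ss = _sum_sq(data)
    s = math.sqrt(ss / (n - 1))  # sample standard deviation (n - 1 divisor)

    if kind == "two":
        # G'': range over standard deviation; `which` is ignored.
        return (data[-1] - data[0]) / s, [data[0], data[-1]]

    if which is None:
        # Assumption: test the side holding the farthest observation.
        which = "max" if data[-1] - mean >= mean - data[0] else "min"

    if kind == "one":
        if which == "max":
            return (data[-1] - mean) / s, data[-1]
        return (mean - data[0]) / s, data[0]

    # kind == "three": drop the two values on the chosen side.
    if which == "max":
        kept, tested = data[:-2], data[-2:]
    else:
        kept, tested = data[2:], data[:2]
    # (n-3) * s2_kept / ((n-1) * s2_full) equals SS_kept / SS_full.
    return _sum_sq(kept) / ss, tested


base = [159, 153, 184, 153, 156, 150, 147]
for kind, sample in (("one", base), ("two", base + [140]), ("three", base + [186])):
    statistic, tested = grubbs_statistic(sample, kind=kind)
    print(kind, round(statistic, 4), tested)
```

The rounded values printed by the loop can be compared with the Statistic fields shown in the Examples section for the same data.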
References
- 1
GRUBBS, F. E. Sample Criteria for Testing Outlying Observations. The Annals of Mathematical Statistics, v. 21, n. 1, p. 27–58, 1950.
- 2
GRUBBS, F. E.; BECK, G. Extension of Sample Sizes and Percentage Points for Significance Tests of Outlying Observations. Technometrics, v. 14, n. 4, p. 847–854, 1972.
Examples
Checking whether the highest observation is a possible outlier at the 95% confidence level
>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x)
>>> print(result)
GrubbsResult(Statistic=2.1532047136140045, Critical=2.02, alpha=0.05, kind='one', outlier=184)
>>> print(conclusion)
The sample 184 perhaps be an outlier (95.0% confidence level)
Checking whether the highest observation is a possible outlier at the 99% confidence level
>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x, alfa=0.01, details="full")
>>> print(result)
GrubbsResult(Statistic=2.1532047136140045, Critical=2.139, alpha=0.01, kind='one', outlier=184)
>>> print(conclusion)
Since the test statistic (2.153) is higher than the critical value (2.139), we have evidence to reject the null hypothesis, and perhaps sample 184 is an outlier (99.0% confidence level)
Checking whether the two highest observations are possible outliers at the 95% confidence level
>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147, 186])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x, kind="three", details="full")
>>> print(result)
GrubbsResult(Statistic=0.05528255528255528, Critical=0.1101, alpha=0.05, kind='three', outlier=[184, 186])
>>> print(conclusion)
Since the test statistic (0.055) is lower than the critical value (0.11), we have evidence to reject the null hypothesis, and perhaps sample 184 and 186 are outliers (95.0% confidence level)
Checking whether the highest and the lowest observations are possible outliers at the 95% confidence level
>>> from pycafee.sample.outliers import Grubbs
>>> import numpy as np
>>> x = np.array([159, 153, 184, 153, 156, 150, 147, 140])
>>> test = Grubbs()
>>> result, conclusion = test.fit(x, kind="two", details="full")
>>> print(result)
GrubbsResult(Statistic=3.3896333493939195, Critical=3.399, alpha=0.05, kind='two', outlier=[140, 184])
>>> print(conclusion)
As the test statistic (3.389) is lower than the critical value (3.399), we have no evidence to reject the null hypothesis that the sample does not contain outliers (95.0% confidence level)