Coverage for src/ts_stat_tests/normality/algorithms.py: 100%

26 statements

coverage.py v7.13.2, created at 2026-02-01 09:48 +0000

# ============================================================================ #
#                                                                              #
#     Title: Normality Algorithms                                              #
#     Purpose: Algorithms for testing normality of data.                       #
#                                                                              #
# ============================================================================ #


# ---------------------------------------------------------------------------- #
#                                                                              #
#     Overview                                                              ####
#                                                                              #
# ---------------------------------------------------------------------------- #


# ---------------------------------------------------------------------------- #
#  Description                                                              ####
# ---------------------------------------------------------------------------- #


"""
!!! note "Summary"
    This module provides implementations of various statistical tests for assessing the normality of data distributions. These tests are essential in statistical analysis and time series forecasting, as many models assume that the underlying data follows a normal distribution.
"""

25 

26 

# ---------------------------------------------------------------------------- #
#                                                                              #
#     Setup                                                                 ####
#                                                                              #
# ---------------------------------------------------------------------------- #


# ---------------------------------------------------------------------------- #
#  Imports                                                                  ####
# ---------------------------------------------------------------------------- #


# ## Python StdLib Imports ----
from typing import Literal

# ## Python Third Party Imports ----
import numpy as np
from numpy.typing import ArrayLike
from scipy.stats import anderson as _ad, normaltest as _dp, shapiro as _sw
from scipy.stats._morestats import AndersonResult, ShapiroResult
from scipy.stats._stats_py import NormaltestResult
from statsmodels.stats.stattools import jarque_bera as _jb, omni_normtest as _ob
from typeguard import typechecked

50 

51 

# ---------------------------------------------------------------------------- #
#  Exports                                                                  ####
# ---------------------------------------------------------------------------- #


__all__: list[str] = ["jb", "ob", "sw", "dp", "ad"]


# ---------------------------------------------------------------------------- #
#  Constants                                                                ####
# ---------------------------------------------------------------------------- #


VALID_DP_NAN_POLICY_OPTIONS = Literal["propagate", "raise", "omit"]


VALID_AD_DIST_OPTIONS = Literal[
    "norm", "expon", "logistic", "gumbel", "gumbel_l", "gumbel_r", "extreme1", "weibull_min"
]


# ---------------------------------------------------------------------------- #
#                                                                              #
#     Algorithms                                                            ####
#                                                                              #
# ---------------------------------------------------------------------------- #

78 

79 

@typechecked
def jb(x: ArrayLike, axis: int = 0) -> tuple[np.float64, np.float64, np.float64, np.float64]:
    r"""
    !!! note "Summary"
        The Jarque-Bera test is a statistical test used to determine whether a dataset follows a normal distribution. In time series forecasting, the test can be used to evaluate whether the residuals of a model follow a normal distribution.

    ???+ abstract "Details"
        To apply the Jarque-Bera test to time series data, we first need to estimate the residuals of the forecasting model. The residuals represent the difference between the actual values of the time series and the values predicted by the model. We can then use the Jarque-Bera test to evaluate whether the residuals follow a normal distribution.

        The Jarque-Bera test is based on two statistics, skewness and kurtosis, which measure the degree of asymmetry and peakedness in the distribution of the residuals. The test compares the observed skewness and kurtosis of the residuals to the values expected for a normal distribution. If the observed values differ significantly from the expected values, the test rejects the null hypothesis that the residuals follow a normal distribution.

    Params:
        x (ArrayLike):
            Data to test for normality. Usually regression model residuals that have mean 0.
        axis (int):
            Axis to use if the data has more than 1 dimension.
            Default: `0`

    Raises:
        (ValueError):
            If the input data `x` is invalid.

    Returns:
        JB (float):
            The Jarque-Bera test statistic.
        JBpv (float):
            The p-value of the test statistic.
        skew (float):
            Estimated skewness of the data.
        kurtosis (float):
            Estimated kurtosis of the data.

    ???+ example "Examples"

        ```pycon {.py .python linenums="1" title="Setup"}
        >>> from ts_stat_tests.normality.algorithms import jb
        >>> from ts_stat_tests.utils.data import data_airline, data_noise
        >>> airline = data_airline.values
        >>> noise = data_noise

        ```

        ```pycon {.py .python linenums="1" title="Example 1: Using the airline dataset"}
        >>> jb_value, p_value, skew, kurt = jb(airline)
        >>> print(f"{jb_value:.4f}")
        8.9225

        ```

        ```pycon {.py .python linenums="1" title="Example 2: Using random noise"}
        >>> jb_value, p_value, skew, kurt = jb(noise)
        >>> print(f"{jb_value:.4f}")
        0.7478
        >>> print(f"{p_value:.4f}")
        0.6881
        >>> print(f"{skew:.4f}")
        -0.0554
        >>> print(f"{kurt:.4f}")
        3.0753

        ```

    ??? equation "Calculation"
        The Jarque-Bera test statistic is defined as:

        $$
        JB = \frac{n}{6} \left( S^2 + \frac{(K-3)^2}{4} \right)
        $$

        where:

        - $n$ is the sample size,
        - $S$ is the sample skewness, and
        - $K$ is the sample kurtosis.
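For intuition, the statistic can be recomputed by hand from the sample skewness and kurtosis and referred to its asymptotic $\chi^2_2$ distribution. This is a minimal sketch using only `numpy` and `scipy` (not the `statsmodels` implementation this module wraps), so small numerical differences from `jb()` are expected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=500)  # residual-like sample

n = x.size
S = stats.skew(x)                    # sample skewness
K = stats.kurtosis(x, fisher=False)  # sample (non-excess) kurtosis
JB = (n / 6) * (S**2 + (K - 3) ** 2 / 4)
p_value = stats.chi2.sf(JB, df=2)    # asymptotic chi-squared(2) tail probability
```

For a well-behaved normal sample, `JB` is typically small and `p_value` large, so the null hypothesis of normality is not rejected.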


    ??? note "Notes"
        Each output returned has 1 dimension fewer than the data.
        The Jarque-Bera test statistic tests the null hypothesis that the data are normally distributed against the alternative that the data follow some other distribution. It has an asymptotic $\chi^2_2$ distribution.

    ??? success "Credit"
        All credit goes to the [`statsmodels`](https://www.statsmodels.org) library.

    ??? question "References"
        - Jarque, C. M. and Bera, A. K. (1980), "Efficient tests for normality, homoscedasticity and serial independence of regression residuals", Economics Letters, 6(3), 255-259.

    ??? tip "See Also"
        - [`ob()`][ts_stat_tests.normality.algorithms.ob]
        - [`sw()`][ts_stat_tests.normality.algorithms.sw]
        - [`dp()`][ts_stat_tests.normality.algorithms.dp]
        - [`ad()`][ts_stat_tests.normality.algorithms.ad]
    """
    return _jb(resids=x, axis=axis)  # type: ignore[return-value]

172 

173 

@typechecked
def ob(x: ArrayLike, axis: int = 0) -> tuple[float, float]:
    r"""
    !!! note "Summary"
        The Omnibus test is a statistical test used to evaluate the normality of a dataset, including time series data. In time series forecasting, the Omnibus test can be used to assess whether the residuals of a model follow a normal distribution.

    ???+ abstract "Details"
        The Omnibus test uses a combination of skewness and kurtosis measures to assess whether the residuals follow a normal distribution. Skewness measures the degree of asymmetry in the distribution of the residuals, while kurtosis measures the degree of peakedness or flatness. If the residuals follow a normal distribution, their skewness and excess kurtosis should both be close to zero.

    Params:
        x (ArrayLike):
            Data to test for normality. Usually regression model residuals that have mean 0.
        axis (int):
            Axis to use if the data has more than 1 dimension.
            Default: `0`

    Raises:
        (ValueError):
            If the input data `x` is invalid.

    Returns:
        statistic (float):
            The Omnibus test statistic.
        pvalue (float):
            The p-value for the hypothesis test.

    ???+ example "Examples"

        ```pycon {.py .python linenums="1" title="Setup"}
        >>> from ts_stat_tests.normality.algorithms import ob
        >>> from ts_stat_tests.utils.data import data_airline, data_noise
        >>> airline = data_airline.values
        >>> noise = data_noise

        ```

        ```pycon {.py .python linenums="1" title="Example 1: Using the airline dataset"}
        >>> stat, p_val = ob(airline)
        >>> print(f"{stat:.4f}")
        8.6554

        ```

        ```pycon {.py .python linenums="1" title="Example 2: Using random noise"}
        >>> stat, p_val = ob(noise)
        >>> print(f"{stat:.4f}")
        0.8637

        ```

    ??? equation "Calculation"
        The D'Agostino $K^2$ test statistic is defined as:

        $$
        K^2 = Z_1(g_1)^2 + Z_2(g_2)^2
        $$

        where:

        - $Z_1(g_1)$ is the standard normal transformation of the skewness, and
        - $Z_2(g_2)$ is the standard normal transformation of the kurtosis.

    ??? note "Notes"
        The Omnibus test statistic tests the null hypothesis that the data are normally distributed against the alternative that the data follow some other distribution. It is based on D'Agostino's $K^2$ test statistic.

    ??? success "Credit"
        All credit goes to the [`statsmodels`](https://www.statsmodels.org) library.

    ??? question "References"
        - D'Agostino, R. B. and Pearson, E. S. (1973), "Tests for departure from normality", Biometrika, 60, 613-622.
        - D'Agostino, R. B. and Stephens, M. A. (1986), "Goodness-of-fit Techniques", New York: Marcel Dekker.

    ??? tip "See Also"
        - [`jb()`][ts_stat_tests.normality.algorithms.jb]
        - [`sw()`][ts_stat_tests.normality.algorithms.sw]
        - [`dp()`][ts_stat_tests.normality.algorithms.dp]
        - [`ad()`][ts_stat_tests.normality.algorithms.ad]
    """
    return _ob(resids=x, axis=axis)

253 

254 

@typechecked
def sw(x: ArrayLike) -> ShapiroResult:
    r"""
    !!! note "Summary"
        The Shapiro-Wilk test is a statistical test used to determine whether a dataset follows a normal distribution.

    ???+ abstract "Details"
        The Shapiro-Wilk test is based on the null hypothesis that the residuals of the forecasting model are normally distributed. The test calculates a statistic that compares the observed distribution of the residuals to the distribution expected under the null hypothesis of normality.

    Params:
        x (ArrayLike):
            Array of sample data.

    Raises:
        (ValueError):
            If the input data `x` is invalid.

    Returns:
        (ShapiroResult):
            A named tuple containing the test statistic and p-value:
            - statistic (float): The test statistic.
            - pvalue (float): The p-value for the hypothesis test.

    ???+ example "Examples"

        ```pycon {.py .python linenums="1" title="Setup"}
        >>> from ts_stat_tests.normality.algorithms import sw
        >>> from ts_stat_tests.utils.data import data_airline, data_noise
        >>> airline = data_airline.values
        >>> noise = data_noise

        ```

        ```pycon {.py .python linenums="1" title="Example 1: Using the airline dataset"}
        >>> stat, p_val = sw(airline)
        >>> print(f"{stat:.4f}")
        0.9520

        ```

        ```pycon {.py .python linenums="1" title="Example 2: Using random noise"}
        >>> stat, p_val = sw(noise)
        >>> print(f"{stat:.4f}")
        0.9985

        ```

    ??? equation "Calculation"
        The Shapiro-Wilk test statistic is defined as:

        $$
        W = \frac{\left( \sum_{i=1}^n a_i x_{(i)} \right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}
        $$

        where:

        - $x_{(i)}$ are the ordered sample values,
        - $\bar{x}$ is the sample mean, and
        - $a_i$ are constants generated from the means, variances and covariances of the order statistics of a sample of size $n$ from a normal distribution.
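The following sketch (assuming only `scipy` is available, rather than this module's wrapper) shows how $W$ behaves on data that is and is not normal; values of $W$ close to 1 support normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)  # deliberately non-normal

w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
# W near 1 for the normal sample; strong skew pulls W down
# and drives the p-value toward 0, rejecting normality
```

Comparing the two p-values against a chosen significance level (e.g. 0.05) gives the accept/reject decision for each sample.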


    ??? note "Notes"
        The algorithm used is described in Algorithm AS R94, Appl. Statist. (1995), but the censoring parameters described there are not implemented. For $N > 5000$ the $W$ test statistic is accurate but the p-value may not be.

    ??? success "Credit"
        All credit goes to the [`scipy`](https://docs.scipy.org/) library.

    ??? question "References"
        - Shapiro, S. S. and Wilk, M. B. (1965), "An analysis of variance test for normality (complete samples)", Biometrika, 52, 591-611.
        - Algorithm AS R94, Appl. Statist. (1995), Vol. 44, No. 4.

    ??? tip "See Also"
        - [`jb()`][ts_stat_tests.normality.algorithms.jb]
        - [`ob()`][ts_stat_tests.normality.algorithms.ob]
        - [`dp()`][ts_stat_tests.normality.algorithms.dp]
        - [`ad()`][ts_stat_tests.normality.algorithms.ad]
    """
    return _sw(x=x)

332 

333 

@typechecked
def dp(
    x: ArrayLike,
    axis: int = 0,
    nan_policy: VALID_DP_NAN_POLICY_OPTIONS = "propagate",
) -> NormaltestResult:
    r"""
    !!! note "Summary"
        The D'Agostino and Pearson test is a statistical test used to evaluate whether a dataset follows a normal distribution.

    ???+ abstract "Details"
        The D'Agostino and Pearson test uses a combination of skewness and kurtosis measures to assess whether the residuals follow a normal distribution. Skewness measures the degree of asymmetry in the distribution of the residuals, while kurtosis measures the degree of peakedness or flatness.

    Params:
        x (ArrayLike):
            The array containing the sample to be tested.
        axis (int):
            Axis along which to compute the test. If `None`, compute over the whole array `x`.
            Default: `0`
        nan_policy (VALID_DP_NAN_POLICY_OPTIONS):
            Defines how to handle an input that contains `nan` values.

            - `"propagate"`: returns `nan`
            - `"raise"`: raises an error
            - `"omit"`: performs the calculations ignoring `nan` values

            Default: `"propagate"`

    Raises:
        (ValueError):
            If the input data `x` is invalid.

    Returns:
        (NormaltestResult):
            A named tuple containing the test statistic and p-value:
            - statistic (float): The test statistic ($K^2$).
            - pvalue (float): A 2-sided chi-squared probability for the hypothesis test.

    ???+ example "Examples"

        ```pycon {.py .python linenums="1" title="Setup"}
        >>> from ts_stat_tests.normality.algorithms import dp
        >>> from ts_stat_tests.utils.data import data_airline, data_noise
        >>> airline = data_airline.values
        >>> noise = data_noise

        ```

        ```pycon {.py .python linenums="1" title="Example 1: Using the airline dataset"}
        >>> stat, p_val = dp(airline)
        >>> print(f"{stat:.4f}")
        8.6554

        ```

        ```pycon {.py .python linenums="1" title="Example 2: Using random noise"}
        >>> stat, p_val = dp(noise)
        >>> print(f"{stat:.4f}")
        0.8637

        ```

    ??? equation "Calculation"
        The D'Agostino $K^2$ test statistic is defined as:

        $$
        K^2 = Z_1(g_1)^2 + Z_2(g_2)^2
        $$

        where:

        - $Z_1(g_1)$ is the standard normal transformation of the skewness, and
        - $Z_2(g_2)$ is the standard normal transformation of the kurtosis.
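Because $K^2$ is simply the sum of the two squared z-scores, the statistic returned by `scipy.stats.normaltest` (which this function wraps) can be reproduced from scipy's skewness and kurtosis tests directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=300)

z1 = stats.skewtest(x).statistic      # Z_1(g_1)
z2 = stats.kurtosistest(x).statistic  # Z_2(g_2)
k2, p_value = stats.normaltest(x)

# K^2 equals the sum of the squared skewness and kurtosis z-scores
assert np.isclose(k2, z1**2 + z2**2)
```

This decomposition is also useful diagnostically: a large $|Z_1|$ points to asymmetry as the source of non-normality, while a large $|Z_2|$ points to heavy or light tails.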


    ??? note "Notes"
        This function is a wrapper around the `scipy.stats.normaltest` function.

    ??? success "Credit"
        All credit goes to the [`scipy`](https://docs.scipy.org/) library.

    ??? question "References"
        - D'Agostino, R. B. (1971), "An omnibus test of normality for moderate and large sample size", Biometrika, 58, 341-348.
        - D'Agostino, R. B. and Pearson, E. S. (1973), "Tests for departure from normality", Biometrika, 60, 613-622.

    ??? tip "See Also"
        - [`jb()`][ts_stat_tests.normality.algorithms.jb]
        - [`ob()`][ts_stat_tests.normality.algorithms.ob]
        - [`sw()`][ts_stat_tests.normality.algorithms.sw]
        - [`ad()`][ts_stat_tests.normality.algorithms.ad]
    """
    return _dp(a=x, axis=axis, nan_policy=nan_policy)

425 

426 

@typechecked
def ad(
    x: ArrayLike,
    dist: VALID_AD_DIST_OPTIONS = "norm",
) -> AndersonResult:
    r"""
    !!! note "Summary"
        The Anderson-Darling test is a statistical test used to evaluate whether a dataset follows a given distribution, most commonly the normal distribution.

    ???+ abstract "Details"
        The Anderson-Darling test tests the null hypothesis that a sample is drawn from a population that follows a particular distribution. For the Anderson-Darling test, the critical values depend on which distribution is being tested against.

    Params:
        x (ArrayLike):
            Array of sample data.
        dist (VALID_AD_DIST_OPTIONS):
            The type of distribution to test against.
            Default: `"norm"`

    Raises:
        (ValueError):
            If the input data `x` is invalid.

    Returns:
        (AndersonResult):
            A named tuple containing the test statistic, critical values, and significance levels:
            - statistic (float): The Anderson-Darling test statistic.
            - critical_values (list[float]): The critical values for this distribution.
            - significance_level (list[float]): The significance levels (in percent) for the corresponding critical values.

    ???+ example "Examples"

        ```pycon {.py .python linenums="1" title="Setup"}
        >>> from ts_stat_tests.normality.algorithms import ad
        >>> from ts_stat_tests.utils.data import data_airline, data_noise
        >>> airline = data_airline.values
        >>> noise = data_noise

        ```

        ```pycon {.py .python linenums="1" title="Example 1: Using the airline dataset"}
        >>> stat, cv, sl = ad(airline)
        >>> print(f"{stat:.4f}")
        1.8185

        ```

        ```pycon {.py .python linenums="1" title="Example 2: Using random normal data"}
        >>> stat, cv, sl = ad(noise)
        >>> print(f"{stat:.4f}")
        0.2325

        ```

    ??? equation "Calculation"
        The Anderson-Darling test statistic $A^2$ is defined as:

        $$
        A^2 = -n - \sum_{i=1}^n \frac{2i-1}{n} \left[ \ln(F(x_i)) + \ln(1 - F(x_{n-i+1})) \right]
        $$

        where:

        - $n$ is the sample size,
        - $F$ is the cumulative distribution function of the specified distribution, and
        - $x_i$ are the ordered sample values.
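Unlike the other tests in this module, `scipy.stats.anderson` (which this function wraps) returns critical values rather than a p-value, so the decision is made by comparing the statistic against the critical value at a chosen significance level. A minimal sketch, assuming `scipy` is available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)

result = stats.anderson(x, dist="norm")
# significance_level is expressed in percent; locate the 5% entry
idx = list(result.significance_level).index(5.0)
# reject normality at the 5% level if the statistic exceeds the critical value
reject_at_5pct = result.statistic > result.critical_values[idx]
```

For other `dist` choices the available significance levels differ (see the Notes below), so the `5.0` lookup here is specific to the normal case.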


    ??? note "Notes"
        Critical values are provided for the following significance levels:

        - normal/exponential: 15%, 10%, 5%, 2.5%, 1%
        - logistic: 25%, 10%, 5%, 2.5%, 1%, 0.5%
        - Gumbel: 25%, 10%, 5%, 2.5%, 1%

    ??? success "Credit"
        All credit goes to the [`scipy`](https://docs.scipy.org/) library.

    ??? question "References"
        - Stephens, M. A. (1974), "EDF Statistics for Goodness of Fit and Some Comparisons", Journal of the American Statistical Association, 69, 730-737.

    ??? tip "See Also"
        - [`jb()`][ts_stat_tests.normality.algorithms.jb]
        - [`ob()`][ts_stat_tests.normality.algorithms.ob]
        - [`sw()`][ts_stat_tests.normality.algorithms.sw]
        - [`dp()`][ts_stat_tests.normality.algorithms.dp]
    """
    return _ad(x=x, dist=dist)