Mathematical Statistics

Fundamentals of Hypothesis Testing

Samir Orujov, PhD

ADA University, School of Business

Information Communication Technologies Agency, Statistics Unit

2026-03-14

🎯 Learning Objectives

By the end of this lecture, you will be able to:

  • Define null and alternative hypotheses and formulate them for real financial decisions

  • Identify the components of a statistical test: test statistic, rejection region, and significance level

  • Distinguish between Type I and Type II errors and quantify their probabilities \(\alpha\) and \(\beta\)

  • Interpret p-values and use them to reach evidence-based conclusions

  • Explain the relationship between hypothesis tests and confidence intervals

📱 Attendance Check-in

📋 Overview

📚 Topics Covered Today

  • Elements of a Test – H₀, Hₐ, test statistic, rejection region

  • Type I & II Errors – \(\alpha\), \(\beta\), and the trade-off between them

  • Large-Sample Z Tests – means, proportions, differences

  • p-Values – attained significance and reporting results

  • CI Connection – duality between tests and confidence intervals

  • Case Study – Testing whether BIST100 daily returns have zero mean

📖 Why Hypothesis Testing?

🎯 Motivation

Statistical decisions drive billions of dollars in financial markets every day.

Finance Applications:

  • Is a trading strategy's mean return significantly > 0?
  • Has volatility changed after a market shock?
  • Do two asset classes have equal expected returns?
  • Is a fund manager's alpha statistically significant?

Regulatory Applications:

  • Does a telecom operator meet QoS thresholds?
  • Have consumer complaint rates changed after regulation?
  • Is price inflation statistically above the target band?
  • Are default rates equal across credit segments?

Key Question: How do we use sample data to make principled yes/no decisions about population parameters?

📖 The Logic of Hypothesis Testing

Statistical hypothesis testing follows a proof by contradiction logic:

  1. Assume the null hypothesis \(H_0\) is true (status quo / skeptical position)
  2. Collect data and compute a test statistic
  3. Ask: "How likely is this data if \(H_0\) were true?"
  4. If the data is very unlikely under \(H_0\) → Reject \(H_0\) in favour of \(H_a\)
  5. If the data is plausible under \(H_0\) → Fail to reject \(H_0\)

โš ๏ธ Failing to reject โ‰  proving \(H_0\) true!

๐Ÿ“ Definition: Null & Alternative Hypotheses

๐Ÿ“ Definition 10.1 โ€“ Hypotheses

The null hypothesis \(H_0\) is a specific statement about a population parameter that we assume true unless evidence convinces us otherwise.

The alternative (research) hypothesis \(H_a\) is what we seek evidence for.

Three forms of \(H_a\):

Type Form Financial Example
Two-tailed \(H_a: \mu \neq \mu_0\) Has average daily return changed?
Left-tailed \(H_a: \mu < \mu_0\) Has return fallen below benchmark?
Right-tailed \(H_a: \mu > \mu_0\) Is strategy return above zero?

Convention: \(H_0\) always contains the equality (=, ≤, or ≥)

๐Ÿ“ Definition: Test Statistic & Rejection Region

๐Ÿ“ Definition 10.2 โ€“ Test Components

A test statistic is a function of the sample data used to decide between \(H_0\) and \(H_a\).

The rejection region (RR) is the set of values of the test statistic for which \(H_0\) is rejected.

Decision Rule:

\[\text{If test statistic} \in RR \Rightarrow \text{Reject } H_0\]

\[\text{If test statistic} \notin RR \Rightarrow \text{Fail to Reject } H_0\]

Example: Testing \(H_0: \mu = 0\) (zero daily return) vs \(H_a: \mu > 0\)

\[Z = \frac{\bar{Y} - 0}{\sigma/\sqrt{n}}, \quad RR = \{z > z_\alpha\}\]
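As a minimal sketch, the decision rule can be coded directly in R (all numbers below are hypothetical, chosen only to illustrate the mechanics):

```r
# Hypothetical right-tailed test of H0: mu = 0 vs Ha: mu > 0 at alpha = 0.05
n     <- 100        # sample size
ybar  <- 0.0004     # observed mean daily return (made up)
sigma <- 0.002      # population sd, assumed known (made up)
alpha <- 0.05

z      <- (ybar - 0) / (sigma / sqrt(n))  # test statistic
z_crit <- qnorm(1 - alpha)                # z_alpha, approx. 1.645

reject <- z > z_crit   # TRUE: z falls in the rejection region, so reject H0
```

Here z = 2.0 exceeds 1.645, so this (hypothetical) sample would lead us to reject \(H_0\).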

๐Ÿ“ Type I and Type II Errors

โš ๏ธ The Two Ways to Be Wrong

\(H_0\) is True \(H_0\) is False
Reject \(H_0\) Type I Error (false positive) โœ… Correct
Fail to Reject \(H_0\) โœ… Correct Type II Error (false negative)

\[\alpha = P(\text{Type I Error}) = P(\text{Reject } H_0 \mid H_0 \text{ true})\]

\[\beta = P(\text{Type II Error}) = P(\text{Fail to reject } H_0 \mid H_0 \text{ false})\]

Financial interpretation:

  • Type I – Conclude a strategy works when it doesn't → lose money
  • Type II – Miss a profitable strategy that truly works → opportunity cost
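A quick Monte Carlo check (a sketch with hypothetical \(N(0,1)\) data) confirms that the rejection rate under a true \(H_0\) matches the nominal \(\alpha\):

```r
# Estimate the Type I error rate by simulating data under H0: mu = 0
set.seed(1)
alpha  <- 0.05
z_crit <- qnorm(1 - alpha)

reject <- replicate(10000, {
  y <- rnorm(50)                  # 50 observations generated under H0
  z <- mean(y) / (1 / sqrt(50))   # known sigma = 1
  z > z_crit                      # did this sample falsely reject H0?
})
mean(reject)   # close to 0.05, as designed
```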

📌 Example 1: Voting Probability Test

Problem: A candidate claims she will receive more than 50% of votes (\(p > 0.5\)). We survey \(n = 15\) voters (the polling example of Wackerly §10.2).

\(H_0: p = 0.5\) vs \(H_a: p < 0.5\). Use \(Y\) = number of supporters. \(RR = \{y \leq 2\}\).

Computing \(\alpha\):

\[\alpha = P(Y \leq 2 \mid p = 0.5) = \sum_{y=0}^{2} \binom{15}{y}(0.5)^{15}\]

\[= \binom{15}{0}(0.5)^{15} + \binom{15}{1}(0.5)^{15} + \binom{15}{2}(0.5)^{15} \approx 0.0000 + 0.0005 + 0.0032 \approx 0.004\]

📌 Example 1: Result & Interpretation

\(\alpha \approx 0.004\) → Only a 0.4% chance of falsely rejecting a candidate with true \(p = 0.5\)

Finance analogy: Testing if a fund's win-rate exceeds 50%. Low \(\alpha\) = conservative threshold for claiming skill.

📌 Example 2: The \(\alpha\)-\(\beta\) Trade-off

Enlarging the rejection region from \(RR = \{y \leq 2\}\) to \(RR^* = \{y \leq 5\}\):

Metric \(RR = \{y \leq 2\}\) \(RR^* = \{y \leq 5\}\)
\(\alpha\) 0.004 0.151
\(\beta\) (at \(p = 0.3\)) 0.873 0.278

Key insight (Wackerly §10.2):

\[\text{Enlarging RR} \Rightarrow \alpha \uparrow, \quad \beta \downarrow\]

\[\text{Shrinking RR} \Rightarrow \alpha \downarrow, \quad \beta \uparrow\]

There is no free lunch! To reduce both, you must increase sample size \(n\).
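These error probabilities can be computed exactly with R's binomial CDF; a sketch for the \(n = 15\) polling setup, evaluating \(\beta\) at the alternative \(p = 0.3\):

```r
# Exact alpha and beta for RR = {y <= c} in the binomial polling test
# alpha = P(Y <= c | p = 0.5),  beta = P(Y > c | p = 0.3)
n      <- 15
cutoff <- c(2, 5)   # RR = {y <= 2} and the enlarged RR* = {y <= 5}

alpha <- pbinom(cutoff, n, 0.5)      # P(reject | H0 true)
beta  <- 1 - pbinom(cutoff, n, 0.3)  # P(fail to reject | true p = 0.3)

round(data.frame(cutoff, alpha, beta), 3)
# Enlarging the rejection region raises alpha and lowers beta
```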

🧮 Large-Sample Z Test for \(\mu\)

Theorem 10.1 – One-Sample Z Test

For large \(n\), testing \(H_0: \mu = \mu_0\):

\[Z = \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \approx N(0,1) \text{ under } H_0\]

Rejection regions at significance level \(\alpha\):

\(H_a\) Rejection Region
\(\mu > \mu_0\) \(z > z_\alpha\)
\(\mu < \mu_0\) \(z < -z_\alpha\)
\(\mu \neq \mu_0\) \(|z| > z_{\alpha/2}\)

Intuition: \(Z\) measures how many standard errors \(\bar{Y}\) is from \(\mu_0\). Extreme values are evidence against \(H_0\).

🧮 Large-Sample Z Test for Proportion

Test for \(p\) (proportion)

For large \(n\), testing \(H_0: p = p_0\):

\[Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \approx N(0,1) \text{ under } H_0\]

Two-Sample Test for \(\mu_1 - \mu_2\):

\[Z = \frac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}\]

where \(D_0\) is the hypothesized difference (often 0).

Financial use: Are mean returns for large-cap and small-cap stocks equal?

\[H_0: \mu_{\text{large}} - \mu_{\text{small}} = 0\]

📖 p-Values: Attained Significance

📝 Definition 10.3 – p-Value

The p-value (attained significance level) is the smallest \(\alpha\) at which \(H_0\) would be rejected given the observed data.

\[p\text{-value} = P(\text{observed test statistic or more extreme} \mid H_0 \text{ true})\]

Computing p-values:

\(H_a\) p-value
\(\mu > \mu_0\) \(P(Z > z_{\text{obs}})\)
\(\mu < \mu_0\) \(P(Z < z_{\text{obs}})\)
\(\mu \neq \mu_0\) \(2 \cdot P(Z > |z_{\text{obs}}|)\)

Decision: Reject \(H_0\) if \(p\text{-value} < \alpha\) (e.g., 0.05 or 0.01)
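The three formulas can be wrapped in a small helper (a sketch; the function name is illustrative):

```r
# p-value for an observed z under each form of the alternative hypothesis
p_value <- function(z_obs, alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         greater   = 1 - pnorm(z_obs),          # Ha: mu > mu0
         less      = pnorm(z_obs),               # Ha: mu < mu0
         two.sided = 2 * (1 - pnorm(abs(z_obs))))# Ha: mu != mu0
}

p_value(1.89, "greater")    # about 0.029: reject at alpha = 0.05, one-tailed
p_value(1.89, "two.sided")  # about 0.059: fail to reject, two-tailed
```

Note how the same \(z\) can be significant one-tailed but not two-tailed, which is why \(H_a\) must be fixed before seeing the data.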

📖 Interpreting p-Values

⚠️ What p-values are NOT

  • The probability that \(H_0\) is true
  • The probability of making an error
  • A measure of practical significance

Common thresholds:

p-value Evidence against \(H_0\)
\(< 0.001\) Very strong
\(0.001\)–\(0.01\) Strong
\(0.01\)–\(0.05\) Moderate
\(0.05\)–\(0.10\) Weak / suggestive
\(> 0.10\) Insufficient

🔗 Tests ↔ Confidence Intervals

Theorem 10.2 – Duality (Wackerly §10.5)

A two-tailed test at level \(\alpha\) rejects \(H_0: \mu = \mu_0\) if and only if \(\mu_0\) falls outside the \((1-\alpha)\) confidence interval for \(\mu\).

Example: 95% CI for daily return: \((0.003, \; 0.021)\)

  • Test \(H_0: \mu = 0\) at \(\alpha = 0.05\): 0 is outside CI → Reject \(H_0\) ✅
  • Test \(H_0: \mu = 0.01\) at \(\alpha = 0.05\): 0.01 is inside CI → Fail to Reject ✅

Advantage of CIs over tests: CIs communicate magnitude of effect, not just reject/not reject.
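The duality can be checked mechanically; a sketch using the interval from the example above:

```r
# A two-tailed level-alpha test rejects H0: mu = mu0 exactly when
# mu0 lies outside the (1 - alpha) confidence interval
ci <- c(0.003, 0.021)   # 95% CI for the mean daily return (from the example)

rejects <- function(mu0, ci) mu0 < ci[1] | mu0 > ci[2]

rejects(0,    ci)  # TRUE:  0 is outside the CI, reject H0: mu = 0
rejects(0.01, ci)  # FALSE: 0.01 is inside the CI, fail to reject
```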

🎮 Interactive: Visualising Type I & Type II Errors

Adjust \(\mu_a\) (true mean) and significance \(\alpha\) to see the error probabilities change.

Blue shaded = α (Type I). Red shaded = β (Type II). Dashed line = critical value.

๐Ÿค Think-Pair-Share

๐Ÿ’ฌ Activity (4 minutes)

Scenario: An investment bank claims their algorithmic trading strategy achieves a Sharpe ratio of at least 1.0. A risk auditor tests this claim using 252 trading days of data and obtains:

\[\bar{Y} = 0.87, \quad s = 0.42, \quad n = 252\]

Questions:

  1. Formulate the appropriate \(H_0\) and \(H_a\)

  2. Compute the test statistic \(Z\)

  3. Find the p-value (one-tailed test)

  4. What is your conclusion at \(\alpha = 0.05\)?

  5. What Type II error could occur here, and what are its consequences?

✅ Think-Pair-Share: Solution

1. Hypotheses:

The auditor is testing whether the Sharpe ratio is below the claimed 1.0:

\[H_0: \mu_S \geq 1.0 \qquad H_a: \mu_S < 1.0\]

2. Test Statistic:

\[Z = \frac{\bar{Y} - \mu_0}{s/\sqrt{n}} = \frac{0.87 - 1.0}{0.42/\sqrt{252}} = \frac{-0.13}{0.02646} \approx -4.91\]

✅ Think-Pair-Share: Solution (Continued)

3. p-Value & Decision:

\[p\text{-value} = P(Z < -4.91) \approx 4.6 \times 10^{-7}\]

Since \(p \ll \alpha = 0.05\), we reject \(H_0\). Strong evidence the true Sharpe ratio is below 1.0.

4. Financial Interpretation:

The bank's claimed Sharpe ratio of 1.0 is statistically refuted by the 252-day audit data. The strategy is significantly underperforming its advertised risk-adjusted return, a potential misrepresentation to investors.

💡 Key Insight: A very large \(|Z|\) here is driven by the large \(n = 252\). Even a modest gap (0.13) becomes highly significant with a full year of daily data.
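The audit numbers can be verified in a few lines of R:

```r
# Left-tailed large-sample Z test of H0: mu_S >= 1.0 vs Ha: mu_S < 1.0
ybar <- 0.87; s <- 0.42; n <- 252; mu0 <- 1.0

se <- s / sqrt(n)          # about 0.0265
z  <- (ybar - mu0) / se    # about -4.91
p  <- pnorm(z)             # left-tail p-value, far below alpha = 0.05
```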

💰 Case Study: Testing Zero Mean Return

Code
library(tidyverse)
library(tidyquant)
library(knitr)

# Case study target: BIST100 daily returns; SPY is used here
# as a demonstration series for the same workflow
spy <- tq_get("SPY", from = "2022-01-01", to = "2023-12-31")

returns <- spy %>%
  tq_transmute(select = adjusted,
               mutate_fun = periodReturn,
               period = "daily",
               col_rename = "return")

n    <- nrow(returns)
ybar <- mean(returns$return)
s    <- sd(returns$return)
se   <- s / sqrt(n)
z    <- ybar / se
pval <- 2 * (1 - pnorm(abs(z)))

results <- data.frame(
  Statistic = c("n", "Mean Return", "Std Dev", "Std Error", "Z statistic", "p-value"),
  Value     = round(c(n, ybar, s, se, z, pval), 5)
)
kable(results, caption = "Two-Tailed Test: H₀: μ = 0")
Two-Tailed Test: H₀: μ = 0
Statistic Value
n 501.00000
Mean Return 0.00013
Std Dev 0.01229
Std Error 0.00055
Z statistic 0.23239
p-value 0.81623
Code
# Visualise the test
x_seq <- seq(-4, 4, length.out = 400)
df_norm <- data.frame(z = x_seq, density = dnorm(x_seq))
z_crit  <- qnorm(0.975)

ggplot(df_norm, aes(x = z, y = density)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_area(data = filter(df_norm, z >= z_crit),
            aes(x = z, y = density), fill = "red", alpha = 0.4) +
  geom_area(data = filter(df_norm, z <= -z_crit),
            aes(x = z, y = density), fill = "red", alpha = 0.4) +
  geom_vline(xintercept = z, color = "darkgreen",
             linewidth = 1.2, linetype = "dashed") +
  annotate("text", x = z + 0.2, y = 0.35,
           label = paste0("Z = ", round(z, 3)), hjust = 0, size = 4) +
  labs(title = "Z Test for Mean Daily Return (SPY 2022-2023)",
       subtitle = paste0("p-value = ", round(pval, 4),
                          " | α = 0.05 | Red zones = rejection regions"),
       x = "Z statistic", y = "Density") +
  theme_minimal(base_size = 12)

💰 Case Study: Key Findings

📊 Analysis Results

Test Setup:

  • \(H_0: \mu_{\text{daily}} = 0\)

  • \(H_a: \mu_{\text{daily}} \neq 0\)

  • \(n = 501\) trading days

  • \(\alpha = 0.05\)

  • Two-tailed Z test

Computed Values:

  • \(\bar{Y} \approx 0.00013\)

  • \(s \approx 0.012\)

  • \(Z \approx 0.23\)

  • p-value \(\approx 0.82\)

  • Fail to Reject \(H_0\)

Implications:

  1. Market Efficiency: Daily returns are not distinguishable from zero mean

  2. Practical vs Statistical: Small \(Z\) suggests insufficient signal

  3. Sample Matters: Longer history or different period may yield different result

๐Ÿ“ Quiz #1: Error Types

A central bank audit finds that an investment fundโ€™s reported average quarterly return is 2.5%. The regulator tests \(H_0: \mu = 2.5\%\) vs \(H_a: \mu < 2.5\%\) at \(\alpha = 0.05\). The test rejects \(H_0\). If the fund was actually performing at exactly 2.5%, what error was made?

  • Type I Error – falsely rejecting a true null hypothesis
  • Type II Error – failing to detect a true difference
  • No error – the test is always correct at α = 0.05
  • Power error – the sample was too small

๐Ÿ“ Quiz #2: p-Value Interpretation

A quantitative analyst tests whether a new factorโ€™s mean return is positive. She computes \(Z = 1.89\), giving \(p\text{-value} = 0.029\). Which conclusion is correct at \(\alpha = 0.05\)?

  • Reject \(H_0\); there is statistically significant evidence of a positive mean return
  • Fail to reject \(H_0\); no evidence of positive returns
  • The probability that the strategy is profitable is 97.1%
  • The probability that \(H_0\) is true is 2.9%

๐Ÿ“ Quiz #3: Test Statistic Setup

An analyst tests whether a mutual fundโ€™s average monthly excess return \(\mu\) equals zero. She has \(n = 60\) months, \(\bar{Y} = 0.008\), \(s = 0.030\). Which is the correct \(Z\) statistic?

  • \(Z = \dfrac{0.008 - 0}{0.030/\sqrt{60}} \approx 2.07\)
  • \(Z = 0.008 / 0.030 = 0.267\)
  • \(Z = (0.008 \times 60) / 0.030 = 16\)
  • \(Z = 0.030 / \sqrt{60} = 0.00387\)

๐Ÿ“ Quiz #4: CIโ€“Test Duality

For the test \(H_0: \mu = 0\) at \(\alpha = 0.05\) (two-tailed), the 95% confidence interval is computed as \((-0.002, \; 0.018)\). What is the correct decision?

  • Fail to reject \(H_0\); since 0 is inside the 95% CI, there is insufficient evidence to conclude \(\mu \neq 0\)
  • Reject \(H_0\); the interval does not contain the hypothesized value
  • Cannot determine without the test statistic
  • Reject \(H_0\); the interval is very narrow

๐Ÿ“ Summary

โœ… Key Takeaways

  • Hypotheses: \(H_0\) (null/status quo) vs \(H_a\) (research/alternative); always formulate before seeing data

  • Error Trade-off: \(\alpha\) (Type I, false positive) and \(\beta\) (Type II, false negative) are inversely related; reducing both requires larger \(n\)

  • Z Tests: For large samples, \(Z = (\bar{Y} - \mu_0)/(\sigma/\sqrt{n})\) follows \(N(0,1)\) under \(H_0\)

  • p-Values: Measure evidence against \(H_0\); reject \(H_0\) when \(p < \alpha\); do NOT interpret as probability \(H_0\) is true

  • CI Duality: A two-tailed test rejects \(H_0: \mu = \mu_0\) iff \(\mu_0\) lies outside the \((1-\alpha)\) confidence interval

📚 Practice Problems

📝 Homework Problems – Chapter 10

Problem 1 (Formulation): A telecom regulator claims that average download speed is at least 25 Mbps. You sample \(n = 100\) customers and find \(\bar{Y} = 23.4\) Mbps, \(s = 8.2\) Mbps. Formulate hypotheses, compute \(Z\), and find the p-value.

Problem 2 (Error Types): A credit risk manager sets \(\alpha = 0.01\) when testing whether default rates have increased. Explain the practical consequences of a Type I and Type II error in this context.

Problem 3 (p-Value): Compute and interpret p-values for \(H_a: \mu > 0\) when \(Z = 1.28\), \(Z = 1.96\), and \(Z = 2.58\).

Problem 4 (Financial Application): A portfolio manager tests whether her fundโ€™s Sharpe ratio exceeds 0.5. With 36 months of data, \(\bar{S} = 0.63\) and \(s_S = 0.28\). Test at \(\alpha = 0.05\) and construct a 90% one-sided confidence bound. Compare conclusions.

👋 Thank You!

📬 Contact Information:

Samir Orujov, PhD

Assistant Professor

School of Business, ADA University

📧 sorujov@ada.edu.az

🏢 Office: D312

⏰ Office Hours: By appointment

📅 Next Class:

Topic: Tests for Means and Proportions

Reading: Chapter 10, Sections 10.3, 10.4, 10.8

Preparation: Review the Standard Normal table; practice computing Z statistics

โฐ Reminders:

โœ… Complete Practice Problems 1โ€“4

โœ… Review Type I / Type II error concepts

โœ… Think about real-world decision contexts where each error is more costly

โœ… Work hard!

โ“ Questions?

๐Ÿ’ฌ Open Discussion

Key Topics for Discussion:

  • Why do courts use "beyond reasonable doubt" – which error type is being controlled?

  • In algorithmic trading, which error (Type I or II) is more costly? Does it depend on strategy?

  • How does increasing sample size affect the trade-off between \(\alpha\) and \(\beta\)?

  • Why might a very small p-value be statistically significant but practically unimportant?