Introduction to hypothesis testing
Before reading this guide, it is recommended that you read [Guide: Introduction to probability], [Guide: Mean, variance, and standard deviation], and [Guide: PMFs, PDFs, CDFs].
What is a hypothesis test?
Statistics is the science of collecting, organizing and interpreting data. Making statements about characteristics of entire populations is important in statistics, as it gives a complete picture of the data. However, in real life, taking data from the entire population is incredibly difficult and could be impossible; for instance, could you imagine trying to co-ordinate the entire population of the world to measure their height?
Instead, statisticians take samples of data and use tools to potentially extend an observation to the entire population. This leads to the idea of a hypothesis test, one of the most important and most useful tools in statistics. Hypothesis tests help you to use data from a sample to test whether or not it is reasonable to believe a certain statistical characteristic is true for a whole population.
For example, suppose the flagship branch of Cantor’s Confectionery has \(50\) customers per day on average over one week. Can they say that their daily average number of customers has probably increased compared to their previous average of \(45\) per day? This is the sort of scenario that a hypothesis test can assess: trying to infer a statistical characteristic of a population from a sample. Hypothesis testing is therefore used extensively in any field that requires statistical analysis, such as biology and ecology, psychology and neuroscience, geography, sociology, and even in medicine.
This guide will introduce the idea of a hypothesis test. First, the null and alternative hypotheses are introduced. Then, a list of tests is introduced to assess certain statistical qualities such as mean and variance. The level of confidence required to make a conclusion from the test is introduced via statistical significance, and the equivalent ideas of critical values and \(p\)-values are demonstrated. Finally, a fully worked example is given.
It’s important to say that this guide is intended to be an overview as to how hypothesis testing works, rather than exploring the particular details of each hypothesis test. For more on this, see our associated guides.
Setting up a hypothesis test
A statistical hypothesis (shortened to hypothesis in this guide) is a statement showing an idea, or something that could possibly be true, about the whole of a particular data set.
A hypothesis test allows you to say, with a defined level of certainty, whether or not a hypothesis can be rejected. This determines whether there is enough statistical evidence to show that the original hypothesis is unlikely to be true.
Generally speaking, when you define hypotheses in statistics, they follow a very specific format, and refer to the entire population being studied.
Null hypothesis (\(H_0\)): This hypothesis represents the ‘status quo’ or no effect. It is always a statement of equality.
Alternative hypothesis (\(H_1\)): This is the hypothesis that you are trying to test. It is always a statement of inequality.
Depending on the question, the alternative hypothesis \(H_1\) can either be one-tailed or two-tailed:
- One-tailed means you are testing whether the characteristic has changed in one specific direction (either increased or decreased).
- Two-tailed means you are testing whether the tested characteristic is equal to the comparative characteristic or not.
Here are some examples of hypotheses that require testing. In these examples, the Greek letter \(\mu\) (mu) is used for the population mean; see [Guide: Means, variance, and standard deviation] for more.
Cantor’s Confectionery wants to determine whether their waiting times in their corporate headquarters are longer than their target of \(15\) minutes per person.
You can set up a hypothesis test to evaluate this claim.
Null hypothesis (\(H_0\)): The average waiting time is equal to the target waiting time of \(15\) minutes. (Remember that the null hypothesis is always a statement of equality!)
Alternative hypothesis (\(H_1\)): The average waiting time is longer than the target waiting time of \(15\) minutes.
So the null hypothesis is \(H_0: \mu=15\) and the alternative hypothesis is \(H_1: \mu > 15\). This is a one-tailed alternative hypothesis.
Context: Cantor’s Confectionery wants to determine if a new sweet production method produces different average approval scores compared to their traditional method.
Null hypothesis (\(H_0\)): The mean approval score for the new method \(n\) is equal to the mean approval score for the traditional method \(t\). So the null hypothesis is \(H_0:\mu_n=\mu_t\).
Alternative hypothesis (\(H_1\)): The mean approval score for the new method \(n\) is not equal to the mean approval score for the traditional method \(t\), indicating a difference in performance. So the alternative hypothesis here is \(H_1: \mu_n\neq \mu_t\).
Context: Cantor’s Confectionery wants to test if the proportion \(P\) of defective products produced is lower than last year’s average of \(2\%\).
Null hypothesis (\(H_0\)): Here, the proportion would be equal to \(2\%\), which means your null hypothesis is \(H_0: P=0.02\).
Alternative hypothesis (\(H_1\)): The hypothesis that you want to test is that the proportion of defective products is less than \(2\%\), so your alternative hypothesis should be \(H_1: P<0.02\).
Test selection and statistic calculation
Choosing the appropriate test is really important, because each test statistic requires different inputs and which one applies depends on the data you have available. You can use this interactive flow chart to find which test to use in your situation. Once you have found the test you want to use, you can refer to its relevant guide to make sure you are doing the correct calculations.
#| '!! shinylive warning !!': |
#| shinylive does not work in self-contained HTML documents.
#| Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 340
library(shiny)
library(bslib)
# Define the decision tree structure (removed don't know options)
decision_tree <- list(
question = "Are you comparing means (μ) in your hypothesis test?",
yes = list(
question = "Do you have one sample or two? Click yes for one, no for two.",
yes = list(
question = "Brilliant, one sample. Do you know the population standard deviation σ?",
yes = list(
result = "You should use a Z-test."
),
no = list(
question = "Ok, so you don't know σ. Is your sample size n > 30?",
yes = list(
result = "You should use a Z-test."
),
no = list(
result = "You should use a t-test."
)
)
),
no = list(
question = "Ok, so you have two samples. Are they independent or are they paired? Click yes for independent, no for paired.",
yes = list(
question = "Ok, your two samples are independent. Do you know the population standard deviation σ for both samples?",
yes = list(
result = "You should use a two-sample Z-test."
),
no = list(
question = "Ok, so you don't know σ for both samples. Are both of your sample sizes n > 30?",
yes = list(
result = "You should use a two-sample Z-test."
),
no = list(
result = "You should use a two-sample t-test."
)
)
),
no = list(
result = "You should use a paired t-test."
)
)
),
no = list(
question = "Are you testing for variance?",
yes = list(
result = "You should use an F-test for variance."
),
no = list(
question = "Are you testing for goodness of fit?",
yes = list(
result = "You should use a chi-squared test for goodness of fit."
),
no = list(
question = "Are you testing for independence?",
yes = list(
result = "You should use a chi-squared test for independence."
),
no = list(
result = "Unfortunately this is not covered in the interactive figure; please start again."
)
)
)
)
)
# Define button color hex codes
button_colors <- list(
yes = "#3F6BB6", # Blue
no = "#DB4315", # Orange-red
back = "#9E9E9E", # Gray
start_over = "#FFCB00", # Yellow
result_bg = "#C0D6FF", # Light blue background for results
result_border = "#3F6BB6", # Darker blue border for results
result_text = "#000000" # Black text for results
)
ui <- page_fluid(
# Add some custom CSS to style buttons with hex colors
tags$head(
tags$style(HTML(
paste0(
".yes-btn { background-color: ", button_colors$yes, "; border-color: ", button_colors$yes, "; color: white; }",
".no-btn { background-color: ", button_colors$no, "; border-color: ", button_colors$no, "; color: white; }",
".back-btn { background-color: ", button_colors$back, "; border-color: ", button_colors$back, "; color: white; }",
".start-over-btn { background-color: ", button_colors$start_over, "; border-color: ", button_colors$start_over, "; color: white; }",
".result-box { background-color: ", button_colors$result_bg, "; border-color: ", button_colors$result_border, "; color: ", button_colors$result_text, "; border: 1px solid; padding: 15px; border-radius: 5px; }"
)
))
),
card(
card_header(
h2("Hypothesis testing interactive flowchart", class = "text-center")
),
card_body(
uiOutput("question_ui"),
uiOutput("result_ui"),
div(
class = "d-flex justify-content-between mt-4",
actionButton("back_btn", "Back", icon = icon("arrow-left"), class = "back-btn"),
actionButton("start_over_btn", "Start Over", icon = icon("refresh"), class = "start-over-btn")
)
)
)
)
server <- function(input, output, session) {
# Keep track of the path through the decision tree
path_history <- reactiveVal(list())
# Current node in the decision tree
current_node <- reactiveVal(decision_tree)
# Update the UI based on the current node
output$question_ui <- renderUI({
node <- current_node()
if (is.null(node) || !hasName(node, "question")) {
return(NULL)
}
div(
h4(node$question, class = "mb-4"),
div(
class = "d-flex justify-content-center gap-2",
actionButton("yes_btn", "Yes", class = "yes-btn"),
actionButton("no_btn", "No", class = "no-btn")
)
)
})
output$result_ui <- renderUI({
node <- current_node()
if (is.null(node) || !hasName(node, "result")) {
return(NULL)
}
div(
class = "result-box", # Using custom class instead of alert-success
h4("Recommendation:", class = "mb-3"),
p(node$result)
)
})
# Navigate to the next node based on user choice
observeEvent(input$yes_btn, {
node <- current_node()
if (!is.null(node$yes)) {
path_history(c(path_history(), list(node)))
current_node(node$yes)
}
})
observeEvent(input$no_btn, {
node <- current_node()
if (!is.null(node$no)) {
path_history(c(path_history(), list(node)))
current_node(node$no)
}
})
# Go back to the previous question
observeEvent(input$back_btn, {
history <- path_history()
if (length(history) > 0) {
# Set current node to the last node in history
current_node(history[[length(history)]])
# Remove the last node from history
path_history(history[-length(history)])
}
})
# Start over - reset to the root of the decision tree
observeEvent(input$start_over_btn, {
current_node(decision_tree)
path_history(list())
})
}
shinyApp(ui = ui, server = server)
Significance levels
A significance level \(\alpha\) (alpha) is the threshold, chosen before carrying out the test, at which you would reject the null hypothesis \(H_0\).
This significance level is the risk of rejecting a true null hypothesis that you are willing to accept, expressed as a probability or percentage. The smaller the \(\alpha\), the smaller the risk of wrongly rejecting a true null hypothesis.
Most commonly, \(\alpha\) is set to \(0.05\) \((5\%)\), \(0.01\) \((1\%)\) or \(0.10\) \((10\%)\), with \(\alpha=0.05\) (a significance level of \(5\%\)) being the usual choice. This means you are willing to accept a \(5\%\) risk of rejecting a true null hypothesis.
Rejecting a true null hypothesis is known as a Type I error, better known as a false positive. There is also a Type II error (a false negative), which is where you fail to reject the null hypothesis (resulting in a negative test) when in fact the alternative hypothesis is true. For more information on these, see [Guide: Errors in hypothesis testing].
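The link between \(\alpha\) and the Type I error can be checked with a small simulation. The following is a minimal sketch in R, assuming a two-tailed \(Z\)-test with known \(\sigma\); the seed, sample size and number of simulations are illustrative choices, not values from this guide.

```r
# Simulate many samples for which the null hypothesis H0: mu = 0 is TRUE,
# and count how often a two-tailed Z-test at level alpha rejects it anyway.
set.seed(1)      # illustrative seed so the run is repeatable
alpha <- 0.05
n     <- 30      # hypothetical sample size
sigma <- 1       # population standard deviation, assumed known
n_sim <- 10000   # number of simulated samples

reject <- replicate(n_sim, {
  x <- rnorm(n, mean = 0, sd = sigma)      # data generated under H0
  z <- (mean(x) - 0) / (sigma / sqrt(n))   # Z test statistic
  p <- 2 * pnorm(-abs(z))                  # two-tailed p-value
  p < alpha                                # TRUE means a Type I error
})

type1_rate <- mean(reject)
print(type1_rate)   # proportion of false positives; expected to be near alpha
```

The printed proportion should sit close to \(0.05\), since every rejection in this simulation is a false positive by construction.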
Critical values and \(p\)-values
You can use two different but equivalent methods to decide whether or not to reject your null hypothesis at the stated significance level. These methods depend on the probability distribution being used; see [Guide: PMFs, PDFs, CDFs] and [Overview: Probability distributions] for more.
A critical value is a boundary that defines the rejection region (or critical region) based on the significance level \(\alpha\).
For a one-tailed test, this critical region is at one end of the probability distribution with PDF \(f(x)\). The critical value is then the unique \(z\) such that
\(\displaystyle \int_{-\infty}^z f(x)\,\textrm{d}x = \alpha\) for a lower tail test
\(\displaystyle \int_{z}^\infty f(x)\,\textrm{d}x = \alpha\) for an upper tail test.
For a two-tailed test, this critical region occupies both ends of the probability distribution with PDF \(f(x)\). The critical values \(z_1,z_2\) are the two boundaries such that \[\int_{-\infty}^{z_1} f(x)\,\textrm{d}x = \frac{\alpha}{2}\quad\textsf{and}\quad\int_{z_2}^{\infty} f(x)\,\textrm{d}x = \frac{\alpha}{2}\] Notice that each tail has area \(\alpha/2\) in a two-tailed test; this is because the critical region is split in half between the two tails!
If your probability distribution is symmetric about zero (like the standard normal distribution), then \(z_1 = -z_2\) in a two-tailed test. If it is not symmetric (like the \(\chi^2\) distribution), then in general \(z_1 \neq -z_2\).
Finding critical values by hand for different significance levels is difficult. Instead, you can use statistical tables or a computer to find the critical value for the chosen test and then compare your test statistic to the critical value you find.
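As a sketch of the computer route, the integrals above can be inverted with the quantile functions in base R. The example assumes the standard normal distribution for the \(Z\) critical values and, purely for illustration, a \(\chi^2\) distribution with \(10\) degrees of freedom to show the non-symmetric case.

```r
alpha <- 0.05

# One-tailed tests on the standard normal distribution
z_lower <- qnorm(alpha)        # area alpha lies to the left of this value
z_upper <- qnorm(1 - alpha)    # area alpha lies to the right of this value

# Two-tailed test: area alpha / 2 in each tail
z_two <- qnorm(c(alpha / 2, 1 - alpha / 2))

# Non-symmetric case: chi-squared with 10 degrees of freedom (illustrative)
chi_two <- qchisq(c(alpha / 2, 1 - alpha / 2), df = 10)

print(round(c(z_lower, z_upper), 3))   # -1.645  1.645
print(round(z_two, 3))                 # -1.96  1.96
print(round(chi_two, 3))               # two positive values, not mirror images
```

The normal critical values are mirror images of each other, while the two \(\chi^2\) values are both positive and have to be found separately.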
It can be helpful to sketch or visualize the graph whilst doing this to make sure you are not missing any negatives and you are using the right critical region depending on if you are doing a one-tailed or a two-tailed test.
Once the critical value(s) and critical region have been located, you can then perform your test. This is usually expressed in the language of \(p\)-values:
The \(p\)-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the observed statistic. It is the area under the distribution curve beyond the test statistic.
The \(p\)-value will be different depending on whether the test is one-tailed or two-tailed.
If you have an upper-tailed test \[H_1: \mu>\mu_0\] (where \(\mu_0\) is the value you are testing your sample data against), then the \(p\)-value is the area to the right of the test statistic under the probability distribution curve.
If you have a lower-tailed test \[H_1: \mu<\mu_0\] the \(p\)-value is the area to the left of the test statistic under the probability distribution curve.
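In the case of a two-tailed test \[H_1:\mu\neq\mu_0\] the \(p\)-value is the total area in both tails beyond the test statistic. For a symmetric distribution such as the normal distribution, this is twice the area of one tail, so you would need to double the one-tailed area in this case.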
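The three cases can be sketched in R for a \(Z\) test statistic. This is a minimal illustration assuming the standard normal distribution; the function name `p_value_z`, its argument `test_type` and the value \(Z = 1.50\) are hypothetical choices made for this example. For the two-tailed case it takes twice the area of one tail beyond \(|Z|\), which is the same calculation used by the interactive figure in this guide.

```r
# p-value of a Z test statistic under the standard normal distribution.
# test_type is one of "two", "upper" or "lower" (hypothetical argument name).
p_value_z <- function(z, test_type = c("two", "upper", "lower")) {
  test_type <- match.arg(test_type)
  switch(test_type,
    lower = pnorm(z),                       # area to the left of z
    upper = pnorm(z, lower.tail = FALSE),   # area to the right of z
    two   = 2 * pnorm(-abs(z))              # both tails beyond |z|
  )
}

z <- 1.50   # illustrative test statistic
p_lower <- p_value_z(z, "lower")
p_upper <- p_value_z(z, "upper")
p_two   <- p_value_z(z, "two")

print(round(c(lower = p_lower, upper = p_upper, two = p_two), 4))
# lower 0.9332, upper 0.0668, two 0.1336
```

The same test statistic gives three different \(p\)-values, which is why the alternative hypothesis has to be fixed before the test is carried out.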
You would then compare the \(p\)-value to \(\alpha\). If the \(p\)-value is less than \(\alpha\), you reject \(H_0\). If the \(p\)-value is greater than or equal to \(\alpha\), you fail to reject \(H_0\).
Where the rejection region sits depends on whether you are performing a one-tailed or a two-tailed test. You can use the interactive figure below to investigate this behaviour with \(\alpha = 0.05\). For a lower one-tailed test, choose the lower one-tailed option and set \(Z = -1.645\); for an upper one-tailed test, choose the upper one-tailed option and set \(Z=1.645\). You can see that the whole \(5\%\) rejection region will be at one end of the distribution. This is because you are only testing for a change in one direction: whether the characteristic is less than (lower) or greater than (upper) the comparative characteristic. In the case of a two-tailed test, set \(Z = 1.96\). You can then see that the \(5\%\) will be split across both ends of the distribution, with \(2.5\%\) at the far left and \(2.5\%\) at the far right of the curve. This is because you are testing whether the test characteristic is either greater than or less than the comparative characteristic.
If your test statistic falls in the critical region defined by your test, you reject the null hypothesis \(H_0\).
#| '!! shinylive warning !!': |
#| shinylive does not work in self-contained HTML documents.
#| Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 620
library(shiny)
library(bslib)
library(ggplot2)
ui <- page_fluid(
title = "Example of lower, upper, two-tailed tests",
layout_columns(
col_widths = c(4, 8),
# Left column - Input parameters
card(
card_header("Input parameters"),
card_body(
numericInput("zscore", "Z-score", value = 1.96, step = 0.01),
radioButtons("test_type", "Test Type",
choices = list("Two-tailed" = "two",
"One-tailed (upper)" = "upper",
"One-tailed (lower)" = "lower"),
selected = "two"),
hr(),
helpText("This app demonstrates p-values in the normal distribution, for use in Z-testing (see Example 4).")
)
),
# Right column - Graphical representation
card(
card_header("Graphical representation"),
card_body(
plotOutput("density_plot", height = "300px")
)
)
),
# Results at the bottom
card(
card_header("Result"),
card_body(
verbatimTextOutput("pvalue_result")
)
)
)
server <- function(input, output, session) {
# Calculate p-value based on test type
p_value <- reactive({
z <- input$zscore
test <- input$test_type
if (test == "two") {
p <- 2 * pnorm(-abs(z))
result <- paste0("Two-tailed p-value: ", round(p, 6))
} else if (test == "upper") {
p <- pnorm(z, lower.tail = FALSE)
result <- paste0("Upper-tailed p-value: ", round(p, 6))
} else if (test == "lower") {
p <- pnorm(z, lower.tail = TRUE)
result <- paste0("Lower-tailed p-value: ", round(p, 6))
}
return(list(p = p, result = result, test = test))
})
# Display p-value
output$pvalue_result <- renderText({
p_value()$result
})
# Create density plot
output$density_plot <- renderPlot({
z <- input$zscore
test <- p_value()$test
# Generate x values for normal distribution
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)
df <- data.frame(x = x, y = y)
# Base plot
p <- ggplot(df, aes(x = x, y = y)) +
geom_line() +
labs(x = "Z-score", y = "Density") +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5)
# Add shaded area based on test type, using #3F6BB6 as the color
if (test == "two") {
# Two-tailed test: shade both tails
if (z > 0) {
p <- p +
geom_area(data = subset(df, x >= z), aes(x = x, y = y), fill = "#3F6BB6", alpha = 0.5) +
geom_area(data = subset(df, x <= -z), aes(x = x, y = y), fill = "#3F6BB6", alpha = 0.5) +
geom_vline(xintercept = z, color = "#3F6BB6") +
geom_vline(xintercept = -z, color = "#3F6BB6")
} else {
p <- p +
geom_area(data = subset(df, x <= z), aes(x = x, y = y), fill = "#3F6BB6", alpha = 0.5) +
geom_area(data = subset(df, x >= -z), aes(x = x, y = y), fill = "#3F6BB6", alpha = 0.5) +
geom_vline(xintercept = z, color = "#3F6BB6") +
geom_vline(xintercept = -z, color = "#3F6BB6")
}
} else if (test == "upper") {
# Upper-tailed test: shade area above z
p <- p +
geom_area(data = subset(df, x >= z), aes(x = x, y = y), fill = "#3F6BB6", alpha = 0.5) +
geom_vline(xintercept = z, color = "#3F6BB6")
} else if (test == "lower") {
# Lower-tailed test: shade area below z
p <- p +
geom_area(data = subset(df, x <= z), aes(x = x, y = y), fill = "#3F6BB6", alpha = 0.5) +
geom_vline(xintercept = z, color = "#3F6BB6")
}
return(p)
})
}
shinyApp(ui, server)
Forming a conclusion
To form a conclusion you then use the test to decide whether or not to reject your null hypothesis \(H_0\).
If the test statistic falls in the critical region, or if the \(p\)-value is less than \(\alpha\), then you would reject \(H_0\). This suggests there is enough evidence to support the alternative hypothesis.
If the test statistic does not fall in the critical region, or if the \(p\)-value is greater than or equal to \(\alpha\), this means there is not enough evidence to support the alternative hypothesis and you fail to reject \(H_0\).
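The equivalence of the two decision rules can be checked numerically. The sketch below assumes an upper one-tailed \(Z\)-test at \(\alpha = 0.05\); the four test statistics are hypothetical values chosen only for illustration.

```r
alpha  <- 0.05
z_crit <- qnorm(1 - alpha)   # upper-tail critical value, about 1.645

# Hypothetical upper-tailed Z test statistics, for illustration only
z <- c(0.50, 1.20, 1.70, 2.10)

in_critical_region <- z > z_crit                     # critical value method
p_values           <- pnorm(z, lower.tail = FALSE)   # area to the right of z
p_below_alpha      <- p_values < alpha               # p-value method

decision <- ifelse(p_below_alpha, "reject H0", "fail to reject H0")
print(data.frame(z, p_value = round(p_values, 4), in_critical_region, decision))

# Both methods give the same answer for every statistic
print(identical(in_critical_region, p_below_alpha))   # TRUE
```

Only the statistics \(1.70\) and \(2.10\) lie beyond the critical value, and these are exactly the ones whose \(p\)-value is below \(\alpha\).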
To complete the problem, you then must formally state your conclusion. It’s important that any conclusion is robust. You could use the following templates for help.
“I reject \(H_0\) as there is sufficient evidence to conclude that the mean is greater than \(50\).”
or
“I do not reject \(H_0\) as there is not sufficient evidence to conclude that the mean is greater than \(50\).”
Hypothesis tests are based on sample data, not on the entire population. This means you can never accept either hypothesis; it is always a rejection or a failure to reject. This is because you cannot definitively ‘prove’ the null hypothesis to be true.
Your local mathematical sweet shop, Cantor’s Confectionery, claims that the average weight of its popular Boole Bar is \(5\) grams. You suspect that the actual average weight is less than \(5\) grams, so you decide to perform a hypothesis test. You take a random sample of \(30\) Boole Bars and find that the sample mean weight is \(\bar{x} = 4.5\) grams with a sample standard deviation of \(s=1.5\) grams. Because of the amazing production methods of Cantor’s Confectionery, you can safely assume that the population standard deviation is \(\sigma = 1.5\) grams.
You want to test if the average weight of the Boole Bars is less than \(5\) grams at a significance level of \(5\%\). This means that \(\alpha = 0.05\).
State your hypotheses
Your null hypothesis is that the average weight of the bars is \(\mu_0 = 5\) grams and your alternative hypothesis is the average weight of bars is less than \(\mu_0 = 5\) grams. So \(H_0:\mu=5\) and \(H_1:\mu<5\).
Calculating the test statistic
In this case you have a sample of \(30\) bars and you know the population standard deviation \(\sigma = 1.5\). You are also testing the alternative hypothesis that \(\mu < 5\). This implies that you would want to use a lower one-tailed \(Z\)-test. This means calculating a test statistic according to a \(Z\)-test, which is outlined below.
First, you would need to calculate the standard error of the mean. This is \[SE = \frac{\sigma}{\sqrt{n}} = \frac{1.5}{\sqrt{30}} \approx 0.274\] Next, the test statistic (\(Z\)-score) is given by \[Z = \frac{\bar{x} - \mu_0}{SE} = \frac{4.5 - 5}{\frac{1.5}{\sqrt{30}}} \approx -1.826\] (See [Guide: Introduction to \(Z\)-testing] for more.)
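The arithmetic in this step can be reproduced in R as a quick check. The variable names are illustrative; the numbers are those of the Boole Bar example.

```r
x_bar <- 4.5   # sample mean weight in grams
mu_0  <- 5     # mean weight claimed under H0
sigma <- 1.5   # population standard deviation, assumed known
n     <- 30    # sample size

se <- sigma / sqrt(n)        # standard error of the mean
z  <- (x_bar - mu_0) / se    # Z test statistic

print(round(se, 3))   # 0.274
print(round(z, 3))    # -1.826
```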
Find the critical value or \(p\)-value
Since this is a lower one-tailed test, you need to find the critical value for \(\alpha=0.05\) from a \(Z\)-table. The critical \(Z\)-value for a significance level of \(0.05\) in a lower one-tailed test is \(-1.645\).
Alternatively, you can find the \(p\)-value associated with the test statistic \(Z=-1.826\). You can use [Calculator: \(Z\)-testing] to see that a \(Z\)-score of \(-1.826\) corresponds to a \(p\)-value of approximately \(0.033925\).
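If you have R available, both look-ups in this step can also be done with base functions. This is a sketch assuming the standard normal distribution and the rounded test statistic \(Z = -1.826\) from the previous step.

```r
alpha <- 0.05
z     <- -1.826          # test statistic from the previous step

z_crit <- qnorm(alpha)   # lower-tail critical value
p      <- pnorm(z)       # lower-tail p-value: area to the left of z

print(round(z_crit, 3))  # -1.645
print(round(p, 4))       # 0.0339
```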
Conclusion
If your test statistic is less than the critical value \(-1.645\), or equivalently if \(p< \alpha\), you can reject the null hypothesis \(H_0\). Otherwise, you fail to reject the null hypothesis.
In this case, \(Z=-1.826\), which is less than the critical value of \(-1.645\). Similarly \(p=0.033925\), which is smaller than \(\alpha=0.05\).
So you can reject the null hypothesis \(H_0\). There is enough evidence to conclude that the average weight of the Boole Bars is less than \(5\) grams. It looks like Cantor’s Confectionery need to get to work…
Quick check problems
- What would your hypotheses be when testing whether the average battery life of a laptop is less than 10 hours?
\(H_0:\mu>10\) \(H_1:\mu<10\)
\(H_0:\mu=10\) \(H_1:\mu\neq10\)
\(H_0:\mu=10\) \(H_1:\mu<10\)
Answer: .
- What \(\alpha\) corresponds to a \(10\%\) level of significance?
Answer: \(\alpha=\) .
- What test would you use to compare means when you have one sample and you know the population standard deviation?
Answer:.
- You are performing a two-tailed \(Z\)-test with \(\alpha=0.05\) and critical values of \(1.96\) and \(-1.96\). Your test statistic is \(-2.34\). What conclusion do you draw?
Answer:.
Further reading
For more questions on the subject, please go to Questions: Hypothesis testing.
For more information on how to perform a \(Z\)-test, please see [Guide: \(Z\)-testing].
For more information on type I and type II errors, please see [Guide: Errors in hypothesis testing].
Version history
v1.0: initial version created 12/24 by Ellie Trace as part of a University of St Andrews VIP project.