


Statistics LibreTexts

Hypothesis Testing


CO-6: Apply basic concepts of probability, random variation, and commonly used statistical probability distributions.

Learning Objectives

LO 6.26: Outline the logic and process of hypothesis testing.

LO 6.27: Explain what the p-value is and how it is used to draw conclusions.

Video: Hypothesis Testing (8:43)

Introduction

We are in the middle of the part of the course that has to do with inference for one variable.

So far, we talked about point estimation and learned how interval estimation enhances it by quantifying the magnitude of the estimation error (with a certain level of confidence) in the form of the margin of error. The result is the confidence interval — an interval that, with a certain confidence, we believe captures the unknown parameter.
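As a quick reminder of that form of inference, a 95% confidence interval for a population proportion, for example, takes the familiar form

\( \hat{p} \pm z^{*}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} \), where \( z^{*} \approx 1.96 \) for 95% confidence.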

We are now moving to the other kind of inference, hypothesis testing. We say "the other kind" because, unlike the inferential methods presented so far, whose goal was estimating an unknown parameter, the idea, logic, and goal of hypothesis testing are quite different.

In the first two parts of this section we will discuss the idea behind hypothesis testing, explain how it works, and introduce new terminology that emerges in this form of inference. The final two parts will be more specific and will discuss hypothesis testing for the population proportion (p) and the population mean (μ, mu).

If this is your first statistics course, you will need to spend considerable time on this topic as there are many new ideas. Many students find this process and its logic difficult to understand in the beginning.

In this section, we will use the hypothesis test for a population proportion to motivate our understanding of the process. We will conduct these tests manually. For all future hypothesis test procedures, including problems involving means, we will use software to obtain the results and focus on interpreting them in the context of our scenario.

General Idea and Logic of Hypothesis Testing

The purpose of this section is to gradually build your understanding about how statistical hypothesis testing works. We start by explaining the general logic behind the process of hypothesis testing. Once we are confident that you understand this logic, we will add some more details and terminology.

To start our discussion about the idea behind statistical hypothesis testing, consider the following example:

A case of suspected cheating on an exam is brought in front of the disciplinary committee at a certain university.

There are two opposing claims in this case:

  • The student’s claim: I did not cheat on the exam.
  • The instructor’s claim: The student did cheat on the exam.

Adhering to the principle "innocent until proven guilty," the committee asks the instructor for evidence to support his claim. The instructor explains that the exam had two versions, and shows the committee members that on three separate exam questions, the student's solutions used numbers that were given only in the other version of the exam.

The committee members all agree that it would be extremely unlikely to get evidence like that if the student’s claim of not cheating had been true. In other words, the committee members all agree that the instructor brought forward strong enough evidence to reject the student’s claim, and conclude that the student did cheat on the exam.

What does this example have to do with statistics?

While it is true that this story seems unrelated to statistics, it captures all the elements of hypothesis testing and the logic behind it. Before you read on to understand why, it would be useful to read the example again. Please do so now.

Statistical hypothesis testing is defined as:

  • Assessing evidence provided by the data against the null claim (the claim which is to be assumed true unless enough evidence exists to reject it).

Here is how the process of statistical hypothesis testing works:

  • We have two claims about what is going on in the population. Let's call them claim 1 (this will be the null claim or hypothesis) and claim 2 (this will be the alternative). Much like the story above, where the student's claim is challenged by the instructor's claim, the null claim 1 is challenged by the alternative claim 2. (For us, these claims are usually about the value of population parameter(s) or about the existence or nonexistence of a relationship between two variables in the population).
  • We choose a sample, collect relevant data and summarize them (this is similar to the instructor collecting evidence from the student’s exam). For statistical tests, this step will also involve checking any conditions or assumptions.
  • We figure out how likely it is to observe data like the data we obtained, if claim 1 is true. (Note that the wording “how likely …” implies that this step requires some kind of probability calculation). In the story, the committee members assessed how likely it is to observe evidence such as the instructor provided, had the student’s claim of not cheating been true.
  • If, after assuming claim 1 is true, we find that it would be extremely unlikely to observe data as strong as ours or stronger in favor of claim 2, then we have strong evidence against claim 1, and we reject it in favor of claim 2. Later we will see this corresponds to a small p-value.
  • If, after assuming claim 1 is true, we find that observing data as strong as ours or stronger in favor of claim 2 is NOT VERY UNLIKELY, then we do not have enough evidence against claim 1, and therefore we cannot reject it in favor of claim 2. Later we will see this corresponds to a p-value which is not small.

In our story, the committee decided that it would be extremely unlikely to find the evidence that the instructor provided had the student’s claim of not cheating been true. In other words, the members felt that it is extremely unlikely that it is just a coincidence (random chance) that the student used the numbers from the other version of the exam on three separate problems. The committee members therefore decided to reject the student’s claim and concluded that the student had, indeed, cheated on the exam. (Wouldn’t you conclude the same?)

Hopefully this example helped you understand the logic behind hypothesis testing.

Interactive Applet: Reasoning of a Statistical Test

To strengthen your understanding of the process of hypothesis testing and the logic behind it, let’s look at three statistical examples.

A recent study estimated that 20% of all college students in the United States smoke. The head of Health Services at Goodheart University (GU) suspects that the proportion of smokers may be lower at GU. In hopes of confirming her claim, the head of Health Services chooses a random sample of 400 Goodheart students, and finds that 70 of them are smokers.

Let’s analyze this example using the 4 steps outlined above:

  • claim 1: The proportion of smokers at Goodheart is 0.20.
  • claim 2: The proportion of smokers at Goodheart is less than 0.20.

Claim 1 basically says “nothing special goes on at Goodheart University; the proportion of smokers there is no different from the proportion in the entire country.” This claim is challenged by the head of Health Services, who suspects that the proportion of smokers at Goodheart is lower.

  • Choosing a sample and collecting data: A sample of n = 400 was chosen, and summarizing the data revealed that the sample proportion of smokers is p-hat = 70/400 = 0.175. While it is true that 0.175 is less than 0.20, it is not clear whether this is strong enough evidence against claim 1. We must account for sampling variation.
  • Assessment of evidence: In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves: How surprising is it to get a sample proportion as low as p-hat = 0.175 (or lower), assuming claim 1 is true? In other words, we need to find how likely it is that in a random sample of size n = 400 taken from a population where the proportion of smokers is p = 0.20 we'll get a sample proportion as low as p-hat = 0.175 (or lower). It turns out that the probability of getting a sample proportion as low as p-hat = 0.175 (or lower) in such a sample is roughly 0.106 (do not worry about how this was calculated at this point; however, if you think about it, hopefully you can see that the key is the sampling distribution of p-hat — a computational sketch follows this list).
  • Conclusion: Well, we found that if claim 1 were true there is a probability of 0.106 of observing data like that observed or more extreme. Now you have to decide… Do you think that a probability of 0.106 makes our data rare enough (surprising enough) under claim 1 so that the fact that we did observe it is enough evidence to reject claim 1? Or do you feel that a probability of 0.106 means that data like we observed are not very likely when claim 1 is true, but not unlikely enough to make such data sufficient evidence for rejecting claim 1? Basically, this is your decision. However, it would be nice to have some kind of guideline about what is generally considered surprising enough.
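For readers who want to peek ahead, here is a minimal computational sketch of that probability, using the normal approximation to the sampling distribution of p-hat (one standard way to arrive at a value close to the 0.106 quoted above):

```python
from scipy.stats import norm

p0 = 0.20          # proportion of smokers under claim 1
n = 400            # sample size
p_hat = 70 / 400   # observed sample proportion (0.175)

# Under claim 1, p-hat is approximately normal with mean p0
# and standard deviation sqrt(p0 * (1 - p0) / n).
se = (p0 * (1 - p0) / n) ** 0.5          # 0.02

# Probability of a sample proportion as low as 0.175 (or lower):
p_value = norm.cdf(p_hat, loc=p0, scale=se)
print(round(p_value, 3))                  # 0.106
```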

A certain prescription allergy medicine is supposed to contain an average of 245 parts per million (ppm) of a certain chemical. If the concentration is higher than 245 ppm, the drug will likely cause unpleasant side effects, and if the concentration is below 245 ppm, the drug may be ineffective. The manufacturer wants to check whether the mean concentration in a large shipment is the required 245 ppm or not. To this end, a random sample of 64 portions from the large shipment is tested, and it is found that the sample mean concentration is 250 ppm with a sample standard deviation of 12 ppm.

  • Claim 1: The mean concentration in the shipment is the required 245 ppm.
  • Claim 2: The mean concentration in the shipment is not the required 245 ppm.

Note that again, claim 1 basically says: “There is nothing unusual about this shipment, the mean concentration is the required 245 ppm.” This claim is challenged by the manufacturer, who wants to check whether that is, indeed, the case or not.

  • Choosing a sample and collecting data: A sample of n = 64 portions is chosen and after summarizing the data it is found that the sample mean concentration is x-bar = 250 and the sample standard deviation is s = 12. Is the fact that x-bar = 250 is different from 245 strong enough evidence to reject claim 1 and conclude that the mean concentration in the whole shipment is not the required 245? In other words, do the data provide strong enough evidence to reject claim 1?
  • Assessing the evidence: In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves the following question: If the mean concentration in the whole shipment were really the required 245 ppm (i.e., if claim 1 were true), how surprising would it be to observe a sample of 64 portions where the sample mean concentration is off by 5 ppm or more (as we did)? It turns out that it would be extremely unlikely to get such a result if the mean concentration were really the required 245. There is only a probability of 0.0007 (i.e., 7 in 10,000) of that happening. (Do not worry about how this was calculated at this point, but again, the key will be the sampling distribution.)
  • Making conclusions: Here, it is pretty clear that a sample like the one we observed or more extreme is VERY rare (or extremely unlikely) if the mean concentration in the shipment were really the required 245 ppm. The fact that we did observe such a sample therefore provides strong evidence against claim 1, so we reject it and conclude with very little doubt that the mean concentration in the shipment is not the required 245 ppm.
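As before, here is a sketch of that calculation, using the normal (z) approximation to the sampling distribution of x-bar. Depending on whether a z or a t distribution is used, the two-sided probability comes out around 0.0009 to 0.001, the same order of magnitude as the 0.0007 quoted above, and the conclusion is the same either way:

```python
from scipy.stats import norm

mu0 = 245      # required mean concentration (claim 1)
n = 64
x_bar = 250    # observed sample mean
s = 12         # sample standard deviation

# Standard error of the sample mean:
se = s / n ** 0.5                      # 12 / 8 = 1.5

# Probability that x-bar is off from 245 by 5 ppm or more,
# in either direction (a two-sided calculation):
z = (x_bar - mu0) / se                 # about 3.33
p_value = 2 * norm.sf(abs(z))
print(round(p_value, 4))               # about 0.0009
```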

Do you think that you’re getting it? Let’s make sure, and look at another example.

Is there a relationship between gender and combined scores (Math + Verbal) on the SAT exam?

Following a report on the College Board website, which showed that in 2003, males scored generally higher than females on the SAT exam, an educational researcher wanted to check whether this was also the case in her school district. The researcher chose random samples of 150 males and 150 females from her school district, collected data on their SAT performance, and found that the sample mean combined score was 1,025 for the males and 1,010 for the females.

Again, let’s see how the process of hypothesis testing works for this example:

  • Claim 1: Performance on the SAT is not related to gender (males and females score the same).
  • Claim 2: Performance on the SAT is related to gender – males score higher.

Note that again, claim 1 basically says: “There is nothing going on between the variables SAT and gender.” Claim 2 represents what the researcher wants to check, or suspects might actually be the case.

  • Choosing a sample and collecting data: Data were collected and summarized as given above. Is the fact that the sample mean score of males (1,025) is higher than the sample mean score of females (1,010) by 15 points strong enough information to reject claim 1 and conclude that in this researcher’s school district, males score higher on the SAT than females?
  • Assessment of evidence: In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves: If SAT scores are in fact not related to gender (claim 1 is true), how likely is it to get data like the data we observed, in which the difference between the males’ average and females’ average score is as high as 15 points or higher? It turns out that the probability of observing such a sample result if SAT score is not related to gender is approximately 0.29 (Again, do not worry about how this was calculated at this point).
  • Conclusion: Here, we have an example where observing a sample like the one we observed or more extreme is definitely not surprising (roughly 30% chance) if claim 1 were true (i.e., if indeed there is no difference in SAT scores between males and females). We therefore conclude that our data does not provide enough evidence for rejecting claim 1.
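The original summary table (including the sample standard deviations) does not survive in this copy, so the sketch below uses hypothetical standard deviations of 235 points for both groups — a plausible order of magnitude for combined SAT scores, chosen so that the calculation lands near the quoted 0.29. It illustrates the two-sample z calculation, not the researcher's actual data:

```python
from scipy.stats import norm

n_m = n_f = 150
mean_m, mean_f = 1025, 1010
s_m = s_f = 235   # HYPOTHETICAL standard deviations (not given in the source)

# Standard error of the difference between the two sample means:
se_diff = (s_m**2 / n_m + s_f**2 / n_f) ** 0.5

# One-sided probability of a difference of 15 points or more
# (males higher), assuming claim 1 (no difference) is true:
z = (mean_m - mean_f) / se_diff
p_value = norm.sf(z)
print(round(p_value, 2))   # about 0.29
```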
In general, every hypothesis test ends with one of only two possible conclusions:

  • "The data provide enough evidence to reject claim 1 and accept claim 2"; or
  • "The data do not provide enough evidence to reject claim 1."

In particular, note that in the second type of conclusion we did not say: "I accept claim 1," but only "I don't have enough evidence to reject claim 1." We will come back to this issue later, but this is a good place to make you aware of this subtle difference.

Hopefully by now, you understand the logic behind the statistical hypothesis testing process. Here is a summary:

A flow chart describing the process. First, we state Claim 1 and Claim 2; Claim 1 says "nothing special is going on" and is challenged by Claim 2. Second, we collect relevant data and summarize them. Third, we assess how surprising it would be to observe data like that observed if Claim 1 were true. Fourth, we draw conclusions in context.

Learn by Doing: Logic of Hypothesis Testing

Did I Get This?: Logic of Hypothesis Testing

Steps in Hypothesis Testing

Video: Steps in Hypothesis Testing (16:02)

Now that we understand the general idea of how statistical hypothesis testing works, let’s go back to each of the steps and delve slightly deeper, getting more details and learning some terminology.

Hypothesis Testing Step 1: State the Hypotheses

In all three examples, our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, Claim 1 is called the null hypothesis (denoted "Ho"), and Claim 2 plays the role of the alternative hypothesis (denoted "Ha"). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, Ha, usually represents what we want to check or what we suspect is really going on.

Let’s go back to our three examples and apply the new notation:

In example 1:

  • Ho: The proportion of smokers at GU is 0.20.
  • Ha: The proportion of smokers at GU is less than 0.20.

In example 2:

  • Ho: The mean concentration in the shipment is the required 245 ppm.
  • Ha: The mean concentration in the shipment is not the required 245 ppm.

In example 3:

  • Ho: Performance on the SAT is not related to gender (males and females score the same).
  • Ha: Performance on the SAT is related to gender – males score higher.

Learn by Doing: State the Hypotheses

Did I Get This?: State the Hypotheses

Hypothesis Testing Step 2: Collect Data, Check Conditions and Summarize Data

This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject Ho.

There is, however, one detail that we would like to add here. In this step we collect data and summarize it. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion (p-hat), sample mean (x-bar), and the sample standard deviation (s).

In practice, you go a step further and use these sample statistics to summarize the data with what's called a test statistic. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.

This step will also involve checking any conditions or assumptions required to use the test.
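As a small preview of what such a test statistic looks like (the details come with the specific tests), the one used for testing a population proportion measures how many standard errors the sample proportion falls from the null value:

\( z = \dfrac{\hat{p} - p_{0}}{\sqrt{\dfrac{p_{0}(1-p_{0})}{n}}} \)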

Hypothesis Testing Step 3: Assess the Evidence

As we saw, this is the step where we calculate how likely it is to get data like that observed (or more extreme) when Ho is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability.

  • If this probability is very small (see example 2), then that means that it would be very surprising to get data like that observed (or more extreme) if Ho were true. The fact that we did observe such data is therefore evidence against Ho, and we should reject it.
  • On the other hand, if this probability is not very small (see example 3), this means that observing data like that observed (or more extreme) is not very surprising if Ho were true. The fact that we observed such data does not provide evidence against Ho.

This crucial probability, therefore, has a special name. It is called the p-value of the test.

In our three examples, the p-values were given to you (and you were reassured that you didn’t need to worry about how these were derived yet):

  • Example 1: p-value = 0.106
  • Example 2: p-value = 0.0007
  • Example 3: p-value = 0.29

Obviously, the smaller the p-value, the more surprising it is to get data like ours (or more extreme) when Ho is true, and therefore, the stronger the evidence the data provide against Ho.

Looking at the three p-values of our three examples, we see that the data that we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against Ho.

  • Right now we will not go into specific details about p-value calculations, but just mention that since the p-value is the probability of getting data like those observed (or more extreme) when Ho is true, it would make sense that the calculation of the p-value will be based on the data summary, which, as we mentioned, is the test statistic. Indeed, this is the case. In practice, we will mostly use software to provide the p-value for us.

Hypothesis Testing Step 4: Making Conclusions

Since our statistical conclusion is based on how small the p-value is, or in other words, how surprising our data are when Ho is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when Ho is true, for us to conclude that we have enough evidence to reject Ho.

This cutoff exists, and because it is so important, it has a special name. It is called the significance level of the test and is usually denoted by the Greek letter α (alpha). The most commonly used significance level is α (alpha) = 0.05 (or 5%). This means that:

  • if the p-value < α (alpha) (usually 0.05), then the data we obtained are considered to be "rare (or surprising) enough" under the assumption that Ho is true, and we say that the data provide statistically significant evidence against Ho, so we reject Ho and thus accept Ha.
  • if the p-value > α (alpha) (usually 0.05), then our data are not considered to be "surprising enough" under the assumption that Ho is true, and we say that our data do not provide enough evidence to reject Ho (or, equivalently, that the data do not provide enough evidence to accept Ha).
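The decision rule is mechanical enough to write down directly. Here is a minimal sketch (alpha = 0.05 by convention; a p-value equal to alpha is treated as "fail to reject," matching the wording above):

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the standard decision rule for a hypothesis test."""
    if p_value < alpha:
        return "reject Ho (statistically significant evidence for Ha)"
    return "fail to reject Ho (not enough evidence for Ha)"

# The three examples above:
for p in (0.106, 0.0007, 0.29):
    print(p, "->", decide(p))
```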

Now that we have a cutoff to use, here are the appropriate conclusions for each of our examples based upon the p-values we were given.

In Example 1:

  • Using our cutoff of 0.05, we fail to reject Ho.
  • Conclusion: There IS NOT enough evidence that the proportion of smokers at GU is less than 0.20.
  • Still, we should consider: does the effect seen in the data suggest anything of practical importance in the direction of our alternative hypothesis?

In Example 2:

  • Using our cutoff of 0.05, we reject Ho.
  • Conclusion : There IS enough evidence that the mean concentration in the shipment is not the required 245 ppm.

In Example 3:

  • Using our cutoff of 0.05, we fail to reject Ho (p-value = 0.29 > 0.05).
  • Conclusion: There IS NOT enough evidence that males score higher on average than females on the SAT.

Notice that all of the above conclusions are written in terms of the alternative hypothesis and are given in the context of the situation. In no situation have we claimed the null hypothesis is true. Be very careful of this and other issues discussed in the following comments.

  • Although the significance level provides a good guideline for drawing our conclusions, it should not be treated as an incontrovertible truth. There is a lot of room for personal interpretation. What if your p-value is 0.052? You might want to stick to the rules and say "0.052 > 0.05 and therefore I don't have enough evidence to reject Ho," but you might decide that 0.052 is small enough for you to believe that Ho should be rejected. It should be noted that scientific journals do consider 0.05 to be the cutoff point: any p-value below the cutoff indicates enough evidence against Ho, and any p-value above it, or even equal to it, indicates there is not enough evidence against Ho, although a p-value between 0.05 and 0.10 is often reported as marginally statistically significant.
  • It is important to draw your conclusions in context. It is never enough to say: "p-value = …, and therefore I have enough evidence to reject Ho at the 0.05 significance level." You should always word your conclusion in terms of the data. Although we will use the terminology of "rejecting Ho" or "failing to reject Ho" – this is mostly due to the fact that we are instructing you in these concepts. In practice, this language is rarely used. We also suggest writing your conclusion in terms of the alternative hypothesis. Is there or is there not enough evidence that the alternative hypothesis is true?
  • Let’s go back to the issue of the nature of the two types of conclusions that I can make.
  • Either I reject Ho (when the p-value is smaller than the significance level)
  • or I cannot reject Ho (when the p-value is larger than the significance level).

As we mentioned earlier, note that the second conclusion does not imply that I accept Ho, but just that I don't have enough evidence to reject it. Saying (by mistake) "I don't have enough evidence to reject Ho so I accept it" indicates that the data provide evidence that Ho is true, which is not necessarily the case. Consider the following slightly artificial yet effective example:

An employer claims to subscribe to an “equal opportunity” policy, not hiring men any more often than women for managerial positions. Is this credible? You’re not sure, so you want to test the following two hypotheses:

  • Ho: The proportion of male managers hired is 0.5
  • Ha: The proportion of male managers hired is more than 0.5

Data: You choose at random three of the new managers who were hired in the last 5 years and find that all 3 are men.

Assessing Evidence: If the proportion of male managers hired is really 0.5 (Ho is true), then the probability that a random selection of three managers will yield three males is 0.5 * 0.5 * 0.5 = 0.125. This is the p-value (using the multiplication rule for independent events).

Conclusion: Using 0.05 as the significance level, you conclude that since the p-value = 0.125 > 0.05, the fact that the three randomly selected managers were all males is not enough evidence to reject the employer’s claim of subscribing to an equal opportunity policy (Ho).

However, the data (all three selected are males) definitely do NOT provide evidence to accept the employer's claim (Ho).
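The same p-value can be obtained from the binomial distribution (three independent hires, each male with probability 0.5 under Ho); a one-line check:

```python
from scipy.stats import binom

# P(all 3 hires are male | Ho: proportion of male hires = 0.5)
p_value = binom.sf(2, n=3, p=0.5)   # P(X >= 3) where X ~ Binomial(3, 0.5)
print(p_value)                       # 0.125
```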

Learn By Doing: Using p-values

Did I Get This?: Using p-values

Comment about wording: Another common wording in scientific journals is:

  • “The results are statistically significant” – when the p-value < α (alpha).
  • “The results are not statistically significant” – when the p-value > α (alpha).

Often you will see significance levels reported with additional description to indicate the degree of statistical significance. A general guideline (although not required in our course) is:

  • If 0.01 ≤ p-value < 0.05, then the results are (statistically) significant.
  • If 0.001 ≤ p-value < 0.01, then the results are highly statistically significant.
  • If p-value < 0.001, then the results are very highly statistically significant.
  • If p-value > 0.05, then the results are not statistically significant (NS).
  • If 0.05 ≤ p-value < 0.10, then the results are marginally statistically significant.
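Codified, the guideline looks like the sketch below. Note that the last two rules above overlap on the interval from 0.05 to 0.10; the sketch resolves this by treating that range as marginally significant:

```python
def significance_label(p: float) -> str:
    """Map a p-value to the descriptive label in the guideline above."""
    if p < 0.001:
        return "very highly statistically significant"
    if p < 0.01:
        return "highly statistically significant"
    if p < 0.05:
        return "(statistically) significant"
    if p < 0.10:
        return "marginally statistically significant"
    return "not statistically significant (NS)"

print(significance_label(0.0007))  # very highly statistically significant
print(significance_label(0.29))    # not statistically significant (NS)
```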

Let’s summarize

We learned quite a lot about hypothesis testing. We learned the logic behind it, what the key elements are, and what types of conclusions we can and cannot draw in hypothesis testing. Here is a quick recap:

Video: Hypothesis Testing Overview (2:20)

Here are a few more activities if you need some additional practice.

Did I Get This?: Hypothesis Testing Overview

  • Notice that the p-value is an example of a conditional probability. We calculate the probability of obtaining results like those of our data (or more extreme) GIVEN the null hypothesis is true. We could write P(Obtaining results like ours or more extreme | Ho is True).
  • We could write P(Obtaining a test statistic as or more extreme than ours | Ho is True).
  • In this case we are asking “Assuming the null hypothesis is true, how rare is it to observe something as or more extreme than what I have found in my data?”
  • If after assuming the null hypothesis is true, what we have found in our data is extremely rare (small p-value), this provides evidence to reject our assumption that Ho is true in favor of Ha.
  • The p-value can also be thought of as the probability, assuming the null hypothesis is true, that the result we have seen is solely due to random error (or random chance). We have already seen that statistics from samples collected from a population vary. There is random error or random chance involved when we sample from populations.

In this setting, if the p-value is very small, this implies, assuming the null hypothesis is true, that it is extremely unlikely that the results we have obtained would have happened due to random error alone, and thus our assumption (Ho) is rejected in favor of the alternative hypothesis (Ha).
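One way to make this definition concrete is simulation: repeatedly sample in a world where Ho is true and see how often results as extreme as ours appear. A minimal sketch for Example 1 (the smokers); the simulated value lands near 0.12, in the same ballpark as the 0.106 obtained from the normal approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p0 = 400, 0.20          # sample size and proportion under Ho
observed_p_hat = 0.175

# Draw many samples assuming Ho is true, then count how often the
# sample proportion comes out as low as (or lower than) the one observed.
sims = rng.binomial(n, p0, size=100_000) / n
p_value_est = np.mean(sims <= observed_p_hat)
print(round(p_value_est, 3))   # roughly 0.12
```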

  • It is EXTREMELY important that you find a definition of the p-value which makes sense to you. New students often need to contemplate this idea repeatedly through a variety of examples and explanations before becoming comfortable with this idea. It is one of the two most important concepts in statistics (the other being confidence intervals).
  • We infer that the alternative hypothesis is true ONLY by rejecting the null hypothesis.
  • A statistically significant result is one that has a very low probability of occurring if the null hypothesis is true.
  • Results which are statistically significant may or may not have practical significance and vice versa.

Error and Power

LO 6.28: Define a Type I and Type II error in general and in the context of specific scenarios.

LO 6.29: Explain the concept of the power of a statistical test including the relationship between power, sample size, and effect size.

Video: Errors and Power (12:03)

Type I and Type II Errors in Hypothesis Tests

We have not yet discussed the fact that we are not guaranteed to make the correct decision by this process of hypothesis testing. Maybe you are beginning to see that there is always some level of uncertainty in statistics.

Let’s think about what we know already and define the possible errors we can make in hypothesis testing. When we conduct a hypothesis test, we choose one of two possible conclusions based upon our data.

If the p-value is smaller than your pre-specified significance level (α, alpha), you reject the null hypothesis and either

  • You have made the correct decision since the null hypothesis is false
  • You have made an error ( Type I ) and rejected Ho when in fact Ho is true (your data happened to be a RARE EVENT under Ho)

If the p-value is greater than (or equal to) your chosen significance level (α, alpha), you fail to reject the null hypothesis and either

  • You have made the correct decision since the null hypothesis is true
  • You have made an error ( Type II ) and failed to reject Ho when in fact Ho is false (the alternative hypothesis, Ha, is true)

The following summarizes the four possible results which can be obtained from a hypothesis test. Notice the rows represent the decision made in the hypothesis test and the columns represent the (usually unknown) truth in reality.

(Figure: the four possible results of a hypothesis test)

  • Reject Ho when Ho is true → a Type I error
  • Reject Ho when Ho is false → a correct decision
  • Fail to reject Ho when Ho is true → a correct decision
  • Fail to reject Ho when Ho is false → a Type II error

Although the truth is unknown in practice – or we would not be conducting the test – we know it must be the case that either the null hypothesis is true or the null hypothesis is false. It is also the case that either decision we make in a hypothesis test can result in an incorrect conclusion!

A TYPE I Error occurs when we Reject Ho when, in fact, Ho is True. In this case, we mistakenly reject a true null hypothesis.

  • P(TYPE I Error) = P(Reject Ho | Ho is True) = α = alpha = Significance Level

A TYPE II Error occurs when we fail to Reject Ho when, in fact, Ho is False. In this case we fail to reject a false null hypothesis.

  • P(TYPE II Error) = P(Fail to Reject Ho | Ho is False) = β = beta

When our significance level is 5%, we are saying that we will allow ourselves to make a Type I error at most 5% of the time. In the long run, if we repeat the process, 5% of the time we will find a p-value < 0.05 when in fact the null hypothesis was true.

In this case, our data represent a rare occurrence which is unlikely to happen but is still possible. For example, suppose we toss a coin 10 times and obtain 10 heads, this is unlikely for a fair coin but not impossible. We might conclude the coin is unfair when in fact we simply saw a very rare event for this fair coin.
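As a quick check of the arithmetic in the coin example, a short computation (a sketch in Python using scipy, not required for this course) confirms just how rare 10 heads in 10 tosses of a fair coin is:

```python
from scipy.stats import binom

# P(10 heads in 10 tosses of a fair coin)
print(binom.pmf(10, 10, 0.5))  # 0.0009765625, about 1 in 1024
```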

Our testing procedure CONTROLS for the Type I error when we set a pre-determined value for the significance level.

Notice that these probabilities are conditional probabilities. This is one more reason why conditional probability is an important concept in statistics.

Unfortunately, calculating the probability of a Type II error requires us to know the truth about the population. In practice we can only calculate this probability using a series of “what if” calculations which depend upon the type of problem.

Comment: As you initially read through the examples below, focus on the broad concepts instead of the small details. It is not important to understand how to calculate these values yourself at this point.

  • Try to understand the pictures we present. Which pictures represent an assumed null hypothesis and which represent an alternative?
  • It may be useful to come back to this page (and the activities here) after you have reviewed the rest of the section on hypothesis testing and have worked a few problems yourself.

Interactive Applet: Statistical Significance

Here are two examples of using an older version of this applet. It looks slightly different but the same settings and options are available in the version above.

In both cases we will consider IQ scores.

Our null hypothesis is that the true mean is 100. Assume the standard deviation is 16 and we will specify a significance level of 5%.

In this example we will specify that the true mean is indeed 100 so that the null hypothesis is true. Most of the time (95%), when we generate a sample, we should fail to reject the null hypothesis since the null hypothesis is indeed true.

Here is one sample that results in a correct decision:

(Applet screenshot: a sample resulting in a correct decision when Ho is true)

In the sample above, we obtain an x-bar of 105, which is drawn on the distribution which assumes μ (mu) = 100 (the null hypothesis is true). Notice the sample is shown as blue dots along the x-axis and the shaded region shows for which values of x-bar we would reject the null hypothesis. In other words, we would reject Ho whenever the x-bar falls in the shaded region.

Enter the same values and generate samples until you obtain a Type I error (you falsely reject the null hypothesis). You should see something like this:

(Applet screenshot: a sample resulting in a Type I error; x-bar falls in the rejection region)

If you were to generate 100 samples, you should have around 5% where you rejected Ho. These would be samples which would result in a Type I error.
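If you would rather simulate than click, the sketch below (Python; we assume the applet is running a one-sided test of Ho: μ = 100 versus Ha: μ > 100 with n = 10 and σ = 16, so the exact rejection rule may differ slightly from the applet's) approximates the long-run Type I error rate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma, n, alpha = 100, 16, 10, 0.05

# Rejection cutoff for x-bar under the assumed one-sided test
cutoff = mu0 + norm.ppf(1 - alpha) * sigma / np.sqrt(n)

# Draw many samples from a population where Ho is TRUE (mu = 100)
xbars = rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1)
print((xbars > cutoff).mean())  # close to 0.05, the Type I error rate
```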

The previous example illustrates a correct decision and a Type I error when the null hypothesis is true. The next example illustrates a correct decision and Type II error when the null hypothesis is false. In this case, we must specify the true population mean.

Let’s suppose we are sampling from an honors program and that the true mean IQ for this population is 110. We do not know the probability of a Type II error without more detailed calculations.

Let’s start with a sample which results in a correct decision.

(Applet screenshot: a sample resulting in a correct decision when Ho is false)

In the sample above, we obtain an x-bar of 111, which is drawn on the distribution which assumes μ (mu) = 100 (the null distribution), even though here the true mean is 110.

Enter the same values and generate samples until you obtain a Type II error (you fail to reject the null hypothesis). You should see something like this:

(Applet screenshot: a sample resulting in a Type II error)

You should notice that in this case (when Ho is false), it is easier to obtain an incorrect decision (a Type II error) than it was in the case where Ho is true. If you generate 100 samples, you can approximate the probability of a Type II error.

We can find the probability of a Type II error by visualizing both the assumed distribution and the true distribution together. The image below is adapted from an applet we will use when we discuss the power of a statistical test.

(Figure: the distribution assumed under Ho and the true distribution shown together; the shaded area gives the probability of a Type II error)

There is a 37.4% chance that, in the long run, we will make a Type II error and fail to reject the null hypothesis when in fact the true mean IQ is 110 in the population from which we sample our 10 individuals.

Can you visualize what will happen if the true population mean is really 115 or 108? When will the Type II error increase? When will it decrease? We will look at this idea again when we discuss the concept of power in hypothesis tests.

  • It is important to note that there is a trade-off between the probability of a Type I and a Type II error. If we decrease the probability of one of these errors, the probability of the other will increase! The practical result of this is that if we require stronger evidence to reject the null hypothesis (smaller significance level = probability of a Type I error), we will increase the chance that we will be unable to reject the null hypothesis when in fact Ho is false (increases the probability of a Type II error).
  • When α (alpha) = 0.05 we obtained a Type II error probability of 0.374 = β = beta

(Applet screenshot: with α (alpha) = 0.05, β (beta) = 0.374)

  • When α (alpha) = 0.01 (smaller than before) we obtain a Type II error probability of 0.644 = β = beta (larger than before)

(Applet screenshot: with α (alpha) = 0.01, β (beta) = 0.644)

  • As the blue line in the picture moves farther right, the significance level (α, alpha) is decreasing and the Type II error probability is increasing.
  • As the blue line in the picture moves farther left, the significance level (α, alpha) is increasing and the Type II error probability is decreasing.
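These β (beta) values can also be computed directly rather than read off the applet. The sketch below (Python; again we assume a one-sided test with n = 10 and σ = 16) reproduces them approximately; the small differences from the applet's 0.374 and 0.644 presumably reflect its own settings or rounding:

```python
from math import sqrt
from scipy.stats import norm

mu0, mu_true, sigma, n = 100, 110, 16, 10
se = sigma / sqrt(n)

for alpha in (0.05, 0.01):
    cutoff = mu0 + norm.ppf(1 - alpha) * se   # rejection threshold for x-bar
    beta = norm.cdf((cutoff - mu_true) / se)  # P(fail to reject | mu = 110)
    print(alpha, round(beta, 3))
# alpha = 0.05 -> beta about 0.37; alpha = 0.01 -> beta about 0.64
```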

Let’s return to our very first example and define these two errors in context.

  • Ho = The student’s claim: I did not cheat on the exam.
  • Ha = The instructor’s claim: The student did cheat on the exam.

Adhering to the principle “innocent until proven guilty,” the committee asks the instructor for evidence to support his claim.

There are four possible outcomes of this process. There are two possible correct decisions:

  • The student did cheat on the exam and the instructor brings enough evidence to reject Ho and conclude the student did cheat on the exam. This is a CORRECT decision!
  • The student did not cheat on the exam and the instructor fails to provide enough evidence that the student did cheat on the exam. This is a CORRECT decision!

Both the correct decisions and the possible errors are fairly easy to understand but with the errors, you must be careful to identify and define the two types correctly.

TYPE I Error: Reject Ho when Ho is True

  • The student did not cheat on the exam but the instructor brings enough evidence to reject Ho and conclude the student cheated on the exam. This is a Type I Error.

TYPE II Error: Fail to Reject Ho when Ho is False

  • The student did cheat on the exam but the instructor fails to provide enough evidence that the student cheated on the exam. This is a Type II Error.

In most situations, including this one, it is more “acceptable” to have a Type II error than a Type I error. Although allowing a student who cheats to go unpunished might be considered a very bad problem, punishing a student for something he or she did not do is usually considered to be a more severe error. This is one reason we control for our Type I error in the process of hypothesis testing.

Did I Get This?: Type I and Type II Errors (in context)

  • The probabilities of Type I and Type II errors are closely related to the concepts of sensitivity and specificity that we discussed previously. Consider the following hypotheses:

Ho: The individual does not have diabetes (status quo, nothing special happening)

Ha: The individual does have diabetes (something is going on here)

In this setting:

When someone tests positive for diabetes we would reject the null hypothesis and conclude the person has diabetes (we may or may not be correct!).

When someone tests negative for diabetes we would fail to reject the null hypothesis so that we fail to conclude the person has diabetes (we may or may not be correct!)

Let’s take it one step further:

Sensitivity = P(Test + | Have Disease) which in this setting equals P(Reject Ho | Ho is False) = 1 – P(Fail to Reject Ho | Ho is False) = 1 – β = 1 – beta

Specificity = P(Test – | No Disease) which in this setting equals P(Fail to Reject Ho | Ho is True) = 1 – P(Reject Ho | Ho is True) = 1 – α = 1 – alpha

Notice that sensitivity and specificity relate to the probability of making a correct decision whereas α (alpha) and β (beta) relate to the probability of making an incorrect decision.

Usually α (alpha) = 0.05 so that the specificity listed above is 0.95 or 95%.

Next, we will see that the sensitivity listed above is the power of the hypothesis test!
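To make the correspondence concrete, here is a tiny sketch (Python, using the β from the IQ example above):

```python
alpha, beta = 0.05, 0.374  # values from the IQ example above

specificity = 1 - alpha  # P(fail to reject Ho | Ho is true)
sensitivity = 1 - beta   # P(reject Ho | Ho is false) = the power of the test
print(specificity, sensitivity)  # 0.95 and 0.626
```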

Reasons for a Type I Error in Practice

Assuming that you have obtained a quality sample:

  • The reason for a Type I error is random chance.
  • When a Type I error occurs, our observed data represented a rare event which indicated evidence in favor of the alternative hypothesis even though the null hypothesis was actually true.

Reasons for a Type II Error in Practice

Again, assuming that you have obtained a quality sample, now we have a few possibilities depending upon the true difference that exists.

  • The sample size is too small to detect an important difference. This is the worst case: you should have obtained a larger sample. In this situation, you may notice that the effect seen in the sample seems PRACTICALLY significant and yet the p-value is not small enough to reject the null hypothesis.
  • The sample size is reasonable for the important difference but the true difference (which might be somewhat meaningful or interesting) is smaller than your test was capable of detecting. This is tolerable as you were not interested in being able to detect this difference when you began your study. In this situation, you may notice that the effect seen in the sample seems to have some potential for practical significance.
  • The sample size is more than adequate, the difference that was not detected is meaningless in practice. This is not a problem at all and is in effect a “correct decision” since the difference you did not detect would have no practical meaning.
  • Note: We will discuss the idea of practical significance later in more detail.

Power of a Hypothesis Test

It is often the case that we truly wish to prove the alternative hypothesis. It is reasonable, then, that we would be interested in the probability of correctly rejecting the null hypothesis, that is, the probability of rejecting the null hypothesis when in fact the null hypothesis is false. This can also be thought of as the probability of being able to detect a (pre-specified) difference of interest to the researcher.

Let’s begin with a realistic example of how power can be described in a study.

In a clinical trial to study two medications for weight loss, we have an 80% chance to detect a difference in the weight loss between the two medications of 10 pounds. In other words, the power of the hypothesis test we will conduct is 80%.

In other words, if one medication comes from a population with an average weight loss of 25 pounds and the other comes from a population with an average weight loss of 15 pounds, we will have an 80% chance to detect that difference using the sample we have in our trial.

If we were to repeat this trial many times, 80% of the time we will be able to reject the null hypothesis (that there is no difference between the medications) and 20% of the time we will fail to reject the null hypothesis (and make a Type II error!).

The difference of 10 pounds in the previous example is often called the effect size . The measure of the effect differs depending on the particular test you are conducting but is always some measure related to the true effect in the population. In this example, it is the difference between two population means.

Recall the definition of a Type II error:

Notice that P(Reject Ho | Ho is False) = 1 – P(Fail to Reject Ho | Ho is False) = 1 – β = 1 – beta.

The POWER of a hypothesis test is the probability of rejecting the null hypothesis when the null hypothesis is false . This can also be stated as the probability of correctly rejecting the null hypothesis .

POWER = P(Reject Ho | Ho is False) = 1 – β = 1 – beta

Power is the test’s ability to correctly reject the null hypothesis. A test with high power has a good chance of being able to detect the difference of interest to us, if it exists .

As we mentioned on the bottom of the previous page, this can be thought of as the sensitivity of the hypothesis test if you imagine Ho = No disease and Ha = Disease.

Factors Affecting the Power of a Hypothesis Test

The power of a hypothesis test is affected by numerous quantities (similar to the margin of error in a confidence interval).

Assume that the null hypothesis is false for a given hypothesis test. All else being equal, we have the following:

  • Larger samples result in a greater chance to reject the null hypothesis which means an increase in the power of the hypothesis test.
  • If the effect size is larger, it will become easier for us to detect. This results in a greater chance to reject the null hypothesis which means an increase in the power of the hypothesis test. The effect size varies for each test and is usually closely related to the difference between the hypothesized value and the true value of the parameter under study.
  • From the relationship between the probability of a Type I and a Type II error (as α (alpha) decreases, β (beta) increases), we can see that as α (alpha) decreases, Power = 1 – β = 1 – beta also decreases.
  • There are other mathematical ways to change the power of a hypothesis test, such as changing the population standard deviation; however, these are not quantities that we can usually control so we will not discuss them here.

In practice, we specify a significance level and a desired power to detect a difference which will have practical meaning to us and this determines the sample size required for the experiment or study.
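For instance, for the one-proportion z-test we will study shortly, one common normal-approximation formula gives the required n for a one-sided test. The sketch below (Python; the function name and the illustrative values p 0 = 0.20 and true p = 0.15 are our own) shows the calculation:

```python
from math import sqrt, ceil
from scipy.stats import norm

def n_for_one_prop(p0, p1, alpha=0.05, power=0.80):
    """Approximate n for a one-sided, one-sample z-test for a proportion."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return ceil((num / abs(p1 - p0)) ** 2)

print(n_for_one_prop(0.20, 0.15))  # 368 with these illustrative values
```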

For most grants involving statistical analysis, power calculations must be completed to illustrate that the study will have a reasonable chance to detect an important effect. Otherwise, the money spent on the study could be wasted. The goal is usually to have a power close to 80%.

For example, if there is only a 5% chance to detect an important difference between two treatments in a clinical trial, this would result in a waste of time, effort, and money on the study since, when the alternative hypothesis is true, the chance a treatment effect can be found is very small.

  • In order to calculate the power of a hypothesis test, we must specify the “truth.” As we mentioned previously when discussing Type II errors, in practice we can only calculate this probability using a series of “what if” calculations which depend upon the type of problem.

The following activity involves working with an interactive applet to study power more carefully.

Learn by Doing: Power of Hypothesis Tests

The following reading is an excellent discussion about Type I and Type II errors.

(Optional) Outside Reading: A Good Discussion of Power (≈ 2500 words)

We will not be asking you to perform power calculations manually. You may be asked to use online calculators and applets. Most statistical software packages offer some ability to complete power calculations. There are also many online calculators for power and sample size, for example, Russ Lenth’s power and sample-size page.

Proportions (Introduction & Step 1)

CO-4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.

LO 4.33: In a given context, distinguish between situations involving a population proportion and a population mean and specify the correct null and alternative hypothesis for the scenario.

LO 4.34: Carry out a complete hypothesis test for a population proportion by hand.

Video: Proportions (Introduction & Step 1) (7:18)

Now that we understand the process of hypothesis testing and the logic behind it, we are ready to start learning about specific statistical tests (also known as significance tests).

The first test we are going to learn is the test about the population proportion (p).

This test is widely known as the “z-test for the population proportion (p).”

We will understand later where the “z-test” part is coming from.

This will be the only type of problem you will complete entirely “by-hand” in this course. Our goal is to use this example to give you the tools you need to understand how this process works. After working a few problems, you should review the earlier material again. You will likely need to review the terminology and concepts a few times before you fully understand the process.

In reality, you will often be conducting more complex statistical tests and allowing software to provide the p-value. In these settings it will be important to know what test to apply for a given situation and to be able to explain the results in context.

Review: Types of Variables

When we conduct a test about a population proportion, we are working with a categorical variable. Later in the course, after we have learned a variety of hypothesis tests, we will need to be able to identify which test is appropriate for which situation. Identifying the variable as categorical or quantitative is an important component of choosing an appropriate hypothesis test.

Learn by Doing: Review Types of Variables

One Sample Z-Test for a Population Proportion

In this part of our discussion on hypothesis testing, we will go into details that we did not go into before. More specifically, we will use this test to introduce the idea of a test statistic , and details about how p-values are calculated .

Let’s start by introducing the three examples, which will be the leading examples in our discussion. Each example is followed by a figure illustrating the information provided, as well as the question of interest.

A machine is known to produce 20% defective products, and is therefore sent for repair. After the machine is repaired, 400 products produced by the machine are chosen at random and 64 of them are found to be defective. Do the data provide enough evidence that the proportion of defective products produced by the machine (p) has been reduced as a result of the repair?

The following figure displays the information, as well as the question of interest:

The question of interest helps us formulate the null and alternative hypotheses in terms of p, the proportion of defective products produced by the machine following the repair:

  • Ho: p = 0.20 (No change; the repair did not help).
  • Ha: p < 0.20 (The repair was effective at reducing the proportion of defective parts).

There are rumors that students at a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 100 students from the college, 19 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is 0.157? (This number is reported by the Harvard School of Public Health.)

Again, the following figure displays the information as well as the question of interest:

As before, we can formulate the null and alternative hypotheses in terms of p, the proportion of students in the college who use marijuana:

  • Ho: p = 0.157 (same as among all college students in the country).
  • Ha: p > 0.157 (higher than the national figure).

Polls on certain topics are conducted routinely in order to monitor changes in the public’s opinions over time. One such topic is the death penalty. In 2003 a poll estimated that 64% of U.S. adults support the death penalty for a person convicted of murder. In a more recent poll, 675 out of 1,000 U.S. adults chosen at random were in favor of the death penalty for convicted murderers. Do the results of this poll provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers (p) changed between 2003 and the later poll?

Here is a figure that displays the information, as well as the question of interest:

Again, we can formulate the null and alternative hypotheses in term of p, the proportion of U.S. adults who support the death penalty for convicted murderers.

  • Ho: p = 0.64 (No change from 2003).
  • Ha: p ≠ 0.64 (Some change since 2003).

Learn by Doing: Proportions (Overview)

Did I Get This?: Proportions (Overview)

Recall that there are basically 4 steps in the process of hypothesis testing:

  • STEP 1: State the appropriate null and alternative hypotheses, Ho and Ha.
  • STEP 2: Obtain a random sample, collect relevant data, and check whether the data meet the conditions under which the test can be used . If the conditions are met, summarize the data using a test statistic.
  • STEP 3: Find the p-value of the test.
  • STEP 4: Based on the p-value, decide whether or not the results are statistically significant and draw your conclusions in context.
  • Note: In practice, we should always consider the practical significance of the results as well as the statistical significance.

We are now going to go through these steps as they apply to the hypothesis testing for the population proportion p. It should be noted that even though the details will be specific to this particular test, some of the ideas that we will add apply to hypothesis testing in general.

Step 1. Stating the Hypotheses

Here again are the three sets of hypotheses that are being tested in each of our three examples:

Has the proportion of defective products been reduced as a result of the repair?

Is the proportion of marijuana users in the college higher than the national figure?

Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?

The null hypothesis always takes the form:

  • Ho: p = some value

and the alternative hypothesis takes one of the following three forms:

  • Ha: p < that value (like in example 1) or
  • Ha: p > that value (like in example 2) or
  • Ha: p ≠ that value (like in example 3).

Note that it was quite clear from the context which form of the alternative hypothesis would be appropriate. The value that is specified in the null hypothesis is called the null value , and is generally denoted by p 0 . We can say, therefore, that in general the null hypothesis about the population proportion (p) would take the form:

  • Ho: p = p 0

We write Ho: p = p 0 to say that we are making the hypothesis that the population proportion has the value of p 0 . In other words, p is the unknown population proportion and p 0 is the number we think p might be for the given situation.

The alternative hypothesis takes one of the following three forms (depending on the context):

Ha: p < p 0 (one-sided)

Ha: p > p 0 (one-sided)

Ha: p ≠ p 0 (two-sided)

The first two possible forms of the alternatives (where the = sign in Ho is challenged by < or >) are called one-sided alternatives , and the third form of alternative (where the = sign in Ho is challenged by ≠) is called a two-sided alternative. To understand the intuition behind these names let’s go back to our examples.

Example 3 (death penalty) is a case where we have a two-sided alternative:

In this case, in order to reject Ho and accept Ha we will need to get a sample proportion of death penalty supporters which is very different from 0.64 in either direction, either much larger or much smaller than 0.64.

In example 2 (marijuana use) we have a one-sided alternative:

Here, in order to reject Ho and accept Ha we will need to get a sample proportion of marijuana users which is much higher than 0.157.

Similarly, in example 1 (defective products), where we are testing:

in order to reject Ho and accept Ha, we will need to get a sample proportion of defective products which is much smaller than 0.20.

Learn by Doing: State Hypotheses (Proportions)

Did I Get This?: State Hypotheses (Proportions)

Proportions (Step 2)

Video: Proportions (Step 2) (12:38)

Step 2. Collect Data, Check Conditions, and Summarize Data

After the hypotheses have been stated, the next step is to obtain a sample (on which the inference will be based), collect relevant data , and summarize them.

It is extremely important that our sample is representative of the population about which we want to draw conclusions. This is ensured when the sample is chosen at random. Beyond the practical issue of ensuring representativeness, choosing a random sample has theoretical importance that we will mention later.

In the case of hypothesis testing for the population proportion (p), we will collect data on the relevant categorical variable from the individuals in the sample and start by calculating the sample proportion p-hat (the natural quantity to calculate when the parameter of interest is p).

Let’s go back to our three examples and add this step to our figures.

As we mentioned earlier without going into details, when we summarize the data in hypothesis testing, we go a step beyond calculating the sample statistic and summarize the data with a test statistic . Every test has a test statistic, which to some degree captures the essence of the test. In fact, the p-value, which so far we have looked upon as “the king” (in the sense that everything is determined by it), is actually determined by (or derived from) the test statistic. We will now introduce the test statistic.

The test statistic is a measure of how far the sample proportion p-hat is from the null value p 0 , the value that the null hypothesis claims is the value of p. In other words, since p-hat is what the data estimates p to be, the test statistic can be viewed as a measure of the “distance” between what the data tells us about p and what the null hypothesis claims p to be.

Let’s use our examples to understand this:

The parameter of interest is p, the proportion of defective products following the repair.

The data estimate p to be p-hat = 0.16

The null hypothesis claims that p = 0.20

The data are therefore 0.04 (or 4 percentage points) below the null hypothesis value.

It is hard to evaluate whether this difference of 4 percentage points in defective products is enough evidence to say that the repair was effective at reducing the proportion of defective products, but clearly, the larger the difference, the more evidence against the null hypothesis. If, for example, our sample proportion of defective products had been 0.10 instead of 0.16, then cutting the proportion of defective products in half (from 20% to 10%) would be extremely strong evidence that the repair was effective.

The parameter of interest is p, the proportion of students in a college who use marijuana.

The data estimate p to be p-hat = 0.19

The null hypothesis claims that p = 0.157

The data are therefore 0.033 (or 3.3 percentage points) above the null hypothesis value.

The parameter of interest is p, the proportion of U.S. adults who support the death penalty for convicted murderers.

The data estimate p to be p-hat = 0.675

The null hypothesis claims that p = 0.64

There is a difference of 0.035 (or 3.5 percentage points) between the data and the null hypothesis value.

The problem with looking only at the difference between the sample proportion, p-hat, and the null value, p 0 is that we have not taken into account the variability of our estimator p-hat which, as we know from our study of sampling distributions, depends on the sample size.

For this reason, the test statistic cannot simply be the difference between p-hat and p 0 , but must be some form of that formula that accounts for the sample size. In other words, we need to somehow standardize the difference so that comparison between different situations will be possible. We are very close to revealing the test statistic, but before we construct it, let’s be reminded of the following two facts from probability:

Fact 1: When we take a random sample of size n from a population with population proportion p, then

the sample proportion \(\hat{p}\) has approximately a normal distribution with mean \(p\) and standard deviation \(\sqrt{\dfrac{p(1-p)}{n}}\) (provided that \(np \geq 10\) and \(n(1-p) \geq 10\)).

Fact 2: The z-score of any normal value (a value that comes from a normal distribution) is calculated by finding the difference between the value and the mean and then dividing that difference by the standard deviation (of the normal distribution associated with the value). The z-score represents how many standard deviations below or above the mean the value is.
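In symbols, for a value x from a normal distribution with mean μ (mu) and standard deviation σ (sigma):

\(z=\dfrac{x-\mu}{\sigma}\)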

Thus, our test statistic should be a measure of how far the sample proportion p-hat is from the null value p 0 relative to the variation of p-hat (as measured by the standard error of p-hat).

Recall that the standard error is the standard deviation of the sampling distribution for a given statistic. For p-hat, we know the following:

  • Center: \(p\)
  • Spread (standard error): \(\sqrt{\dfrac{p(1-p)}{n}}\)
  • Shape: approximately normal, provided \(np \geq 10\) and \(n(1-p) \geq 10\)

To find the p-value, we will need to determine how surprising our value is assuming the null hypothesis is true. We already have the tools needed for this process from our study of sampling distributions as represented in the table above.

If we assume the null hypothesis is true, we can specify that the center of the distribution of all possible values of p-hat from samples of size 400 would be 0.20 (our null value).

We can calculate the standard error, assuming p = 0.20 as

\(\sqrt{\dfrac{p_{0}\left(1-p_{0}\right)}{n}}=\sqrt{\dfrac{0.2(1-0.2)}{400}}=0.02\)

The following picture represents the sampling distribution of all possible values of p-hat of samples of size 400, assuming the true proportion p is 0.20 and our other requirements for the sampling distribution to be normal are met (we will review these during the next step).

(Figure: a normal curve representing the sampling distribution of p-hat assuming that p = p 0 . Marked on the horizontal axis are p 0 and a particular value of p-hat; z is the difference between p-hat and p 0 measured in standard deviations, with the sign of z indicating whether p-hat is below or above p 0 .)

In order to calculate probabilities for the picture above, we would need to find the z-score associated with our result.

This z-score is the test statistic ! In this example, the numerator of our z-score is the difference between p-hat (0.16) and the null value (0.20), which we found earlier to be -0.04. The denominator of our z-score is the standard error calculated above (0.02), and thus our test statistic is z = -0.04/0.02 = -2.

The sample proportion based upon this data is 2 standard errors below the null value.

Hopefully you now understand more about the reasons we need probability in statistics!!

Now we will formalize the definition and look at our remaining examples before moving on to the next step, which will be to determine if a normal distribution applies and calculate the p-value.

Test Statistic for Hypothesis Tests for One Proportion is:

\(z=\dfrac{\hat{p}-p_{0}}{\sqrt{\dfrac{p_{0}\left(1-p_{0}\right)}{n}}}\)

It represents the difference between the sample proportion and the null value, measured in standard deviations (standard error of p-hat).

The picture above is a representation of the sampling distribution of p-hat assuming p = p 0 . In other words, this is a model of how p-hat behaves if we are drawing random samples from a population for which Ho is true.

Notice the center of the sampling distribution is at p 0 , which is the hypothesized proportion given in the null hypothesis (Ho: p = p 0 .) We could also mark the axis in standard error units,

\(\sqrt{\dfrac{p_{0}\left(1-p_{0}\right)}{n}}\)

For example, if our null hypothesis claims that the proportion of U.S. adults supporting the death penalty is 0.64, then the sampling distribution is drawn as if the null is true. We draw a normal distribution centered at 0.64 (p 0 ) with a standard error dependent on sample size,

\(\sqrt{\dfrac{0.64(1-0.64)}{n}}\).

Important Comment:

  • Note that under the assumption that Ho is true (and if the conditions for the sampling distribution to be normal are satisfied) the test statistic follows a N(0,1) (standard normal) distribution. Another way to say the same thing which is quite common is: “The null distribution of the test statistic is N(0,1).”

By “null distribution,” we mean the distribution under the assumption that Ho is true. As we’ll see and stress again later, the null distribution of the test statistic is what the calculation of the p-value is based on.

Let’s go back to our remaining two examples and find the test statistic in each case:

Since the null hypothesis is Ho: p = 0.157, the standardized (z) score of p-hat = 0.19 is

\(z=\dfrac{0.19-0.157}{\sqrt{\dfrac{0.157(1-0.157)}{100}}} \approx 0.91\)

This is the value of the test statistic for this example.

We interpret this to mean that, assuming that Ho is true, the sample proportion p-hat = 0.19 is 0.91 standard errors above the null value (0.157).

Since the null hypothesis is Ho: p = 0.64, the standardized (z) score of p-hat = 0.675 is

\(z=\dfrac{0.675-0.64}{\sqrt{\dfrac{0.64(1-0.64)}{1000}}} \approx 2.31\)

We interpret this to mean that, assuming that Ho is true, the sample proportion p-hat = 0.675 is 2.31 standard errors above the null value (0.64).
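If you want to verify all three test statistics at once, here is a minimal sketch (Python; the helper name z_stat is our own):

```python
from math import sqrt

def z_stat(p_hat, p0, n):
    """Test statistic for the z-test for one population proportion."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

print(round(z_stat(64 / 400, 0.20, 400), 2))  # -2.0  (defective products)
print(round(z_stat(0.19, 0.157, 100), 2))     #  0.91 (marijuana use)
print(round(z_stat(0.675, 0.64, 1000), 2))    #  2.31 (death penalty)
```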

Learn by Doing: Proportions (Step 2)

Comments about the Test Statistic:

  • We mentioned earlier that to some degree, the test statistic captures the essence of the test. In this case, the test statistic measures the difference between p-hat and p 0 in standard errors. This is exactly what this test is about. Get data, and look at the discrepancy between what the data estimates p to be (represented by p-hat) and what Ho claims about p (represented by p 0 ).
  • You can think about this test statistic as a measure of evidence in the data against Ho. The larger the test statistic, the “further the data are from Ho” and therefore the more evidence the data provide against Ho.

Learn by Doing: Proportions (Step 2) Understanding the Test Statistic

Did I Get This?: Proportions (Step 2)

  • It should now be clear why this test is commonly known as the z-test for the population proportion . The name comes from the fact that it is based on a test statistic that is a z-score.
  • Recall fact 1 that we used for constructing the z-test statistic. Here is part of it again:

When we take a random sample of size n from a population with population proportion p 0 , the possible values of the sample proportion p-hat ( when certain conditions are met ) have approximately a normal distribution with a mean of p 0 … and a standard deviation of

\(\sqrt{\dfrac{p_{0}\left(1-p_{0}\right)}{n}}\)

This result provides the theoretical justification for constructing the test statistic the way we did, and therefore the assumptions under which this result holds (in bold, above) are the conditions that our data need to satisfy so that we can use this test. These two conditions are:

i. The sample has to be random.

ii. The conditions under which the sampling distribution of p-hat is normal are met. In other words:

\(n p_{0} \geq 10 \quad \text{and} \quad n\left(1-p_{0}\right) \geq 10\)

  • Here we will pause to say more about condition (i.) above, the need for a random sample. In the Probability Unit we discussed sampling plans based on probability (such as a simple random sample, cluster, or stratified sampling) that produce a non-biased sample, which can be safely used in order to make inferences about a population. We noted in the Probability Unit that, in practice, other (non-random) sampling techniques are sometimes used when random sampling is not feasible. It is important though, when these techniques are used, to be aware of the type of bias that they introduce, and thus the limitations of the conclusions that can be drawn from them. For our purpose here, we will focus on one such practice, the situation in which a sample is not really chosen randomly, but in the context of the categorical variable that is being studied, the sample is regarded as random. For example, say that you are interested in the proportion of students at a certain college who suffer from seasonal allergies. For that purpose, the students in a large engineering class could be considered as a random sample, since there is nothing about being in an engineering class that makes you more or less likely to suffer from seasonal allergies. Technically, the engineering class is a convenience sample, but it is treated as a random sample in the context of this categorical variable. On the other hand, if you are interested in the proportion of students in the college who have math anxiety, then the class of engineering students clearly could not possibly be viewed as a random sample, since engineering students probably have a much lower incidence of math anxiety than the college population overall.

Learn by Doing: Proportions (Step 2) Valid or Invalid Sampling?

Let’s check the conditions in our three examples.

i. The 400 products were chosen at random.

ii. n = 400, p 0 = 0.2 and therefore:

\(n p_{0}=400(0.2)=80 \geq 10\)

\(n\left(1-p_{0}\right)=400(1-0.2)=320 \geq 10\)

i. The 100 students were chosen at random.

ii. n = 100, p 0 = 0.157 and therefore:

\(n p_{0}=100(0.157)=15.7 \geq 10\)

\(n\left(1-p_{0}\right)=100(1-0.157)=84.3 \geq 10\)

i. The 1000 adults were chosen at random.

ii. n = 1000, p 0 = 0.64 and therefore:

\(n p_{0}=1000(0.64)=640 \geq 10\)

\(n\left(1-p_{0}\right)=1000(1-0.64)=360 \geq 10\)

Learn by Doing: Proportions (Step 2) Verify Conditions

Checking that our data satisfy the conditions under which the test can be reliably used is a very important part of the hypothesis testing process. Be sure to consider this for every hypothesis test you conduct in this course and certainly in practice.
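A quick way to run these checks for any problem is a two-line helper like the sketch below (Python; the function name is our own):

```python
def conditions_ok(n, p0):
    """Check the sample-size conditions n*p0 >= 10 and n*(1 - p0) >= 10."""
    return n * p0 >= 10 and n * (1 - p0) >= 10

for n, p0 in [(400, 0.20), (100, 0.157), (1000, 0.64)]:
    print(n, p0, conditions_ok(n, p0))  # True for all three of our examples
```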

The Four Steps in Hypothesis Testing

With respect to the z-test for the population proportion that we are currently discussing, we have:

Step 1: Completed

Step 2: Completed

Step 3: This is what we will work on next.

Proportions (Step 3)

Video: Proportions (Step 3) (14:46)

Calculators and Tables

Step 3. Finding the P-value of the Test

So far we’ve talked about the p-value at the intuitive level: understanding what it is (or what it measures) and how we use it to draw conclusions about the statistical significance of our results. We will now go more deeply into how the p-value is calculated.

It should be mentioned that eventually we will rely on technology to calculate the p-value for us (as well as the test statistic), but in order to make intelligent use of the output, it is important to first understand the details, and only then let the computer do the calculations for us. Again, our goal is to use this simple example to give you the tools you need to understand the process entirely. Let’s start.

Recall that so far we have said that the p-value is the probability of obtaining data like those observed assuming that Ho is true. Like the test statistic, the p-value is, therefore, a measure of the evidence against Ho. In the case of the test statistic, the larger it is in magnitude (positive or negative), the further p-hat is from p 0 , the more evidence we have against Ho. In the case of the p-value , it is the opposite; the smaller it is, the more unlikely it is to get data like those observed when Ho is true, the more evidence it is against Ho . One can actually draw conclusions in hypothesis testing just using the test statistic, and as we’ll see the p-value is, in a sense, just another way of looking at the test statistic. The reason that we actually take the extra step in this course and derive the p-value from the test statistic is that even though in this case (the test about the population proportion) and some other tests, the value of the test statistic has a very clear and intuitive interpretation, there are some tests where its value is not as easy to interpret. On the other hand, the p-value keeps its intuitive appeal across all statistical tests.

How is the p-value calculated?

Intuitively, the p-value is the probability of observing data like those observed assuming that Ho is true. Let’s be a bit more formal:

  • Since this is a probability question about the data , it makes sense that the calculation will involve the data summary, the test statistic.
  • What do we mean by “like” those observed? By “like” we mean “as extreme or even more extreme.”

Putting it all together, we get that in general:

The p-value is the probability of observing a test statistic as extreme as that observed (or even more extreme) assuming that the null hypothesis is true.

By “extreme” we mean extreme in the direction(s) of the alternative hypothesis.

Specifically , for the z-test for the population proportion:

  • If the alternative hypothesis is Ha: p < p 0 (less than) , then “extreme” means small or less than , and the p-value is: The probability of observing a test statistic as small as that observed or smaller if the null hypothesis is true.
  • If the alternative hypothesis is Ha: p > p 0 (greater than) , then “extreme” means large or greater than , and the p-value is: The probability of observing a test statistic as large as that observed or larger if the null hypothesis is true.
  • If the alternative is Ha: p ≠ p 0 (different from) , then “extreme” means extreme in either direction, small or large (i.e., large in magnitude), and the p-value therefore is: The probability of observing a test statistic as large in magnitude as that observed or larger if the null hypothesis is true. (Examples: if z = -2.5, the p-value is the probability of observing a test statistic as small as -2.5 or smaller, or as large as 2.5 or larger. If z = 1.5, the p-value is the probability of observing a test statistic as large as 1.5 or larger, or as small as -1.5 or smaller.)

OK, hopefully that makes (some) sense. But how do we actually calculate it?

Recall the important comment from our discussion about our test statistic,

\(z=\dfrac{\hat{p}-p_{0}}{\sqrt{\dfrac{p_{0}\left(1-p_{0}\right)}{n}}}\)

which said that when the null hypothesis is true (i.e., when p = p 0 ), the possible values of our test statistic follow a standard normal (N(0,1), denoted by Z) distribution. Therefore, the p-value calculations (which assume that Ho is true) are simply standard normal distribution calculations for the 3 possible alternative hypotheses.

Alternative Hypothesis is “Less Than”

The probability of observing a test statistic as small as that observed or smaller , assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.

Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since less than is to the left.

Alternative Hypothesis is “Greater Than”

The probability of observing a test statistic as large as that observed or larger , assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution

Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since greater than is to the right.

Alternative Hypothesis is “Not Equal To”

The probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.

This is often referred to as a two-tailed test, since we shaded in both directions.

Next, we will apply this to our three examples. But first, work through the following activities, which should help your understanding.

Learn by Doing: Proportions (Step 3)

Did I Get This?: Proportions (Step 3)

The p-value in this case is:

  • The probability of observing a test statistic as small as -2 or smaller, assuming that Ho is true.

OR (recalling what the test statistic actually means in this case),

  • The probability of observing a sample proportion that is 2 standard deviations or more below the null value (p 0 = 0.20), assuming that p 0 is the true population proportion.

OR, more specifically,

  • The probability of observing a sample proportion of 0.16 or lower in a random sample of size 400, when the true population proportion is p 0 =0.20

In either case, the p-value is found as shown in the following figure:

To find P(Z ≤ -2) we can use either the calculator or the table we learned to use in the probability unit for normal random variables. Eventually, after we understand the details, we will use software to run the test for us and the output will give us all the information we need. The p-value that the statistical software provides for this specific example is 0.023. The p-value tells us that it is pretty unlikely (probability of 0.023) to get data like those observed (test statistic of -2 or less) assuming that Ho is true.

  • The probability of observing a test statistic as large as 0.91 or larger, assuming that Ho is true.
  • The probability of observing a sample proportion that is 0.91 standard deviations or more above the null value (p 0 = 0.157), assuming that p 0 is the true population proportion.
  • The probability of observing a sample proportion of 0.19 or higher in a random sample of size 100, when the true population proportion is p 0 =0.157

Again, at this point we can use either the calculator or the table to find that the p-value is 0.182; this is P(Z ≥ 0.91).

The p-value tells us that it is not very surprising (probability of 0.182) to get data like those observed (which yield a test statistic of 0.91 or higher) assuming that the null hypothesis is true.

  • The probability of observing a test statistic as large as 2.31 (or larger) or as small as -2.31 (or smaller), assuming that Ho is true.
  • The probability of observing a sample proportion that is 2.31 standard deviations or more away from the null value (p 0 = 0.64), assuming that p 0 is the true population proportion.
  • The probability of observing a sample proportion as different as 0.675 is from 0.64, or even more different (i.e. as high as 0.675 or higher or as low as 0.605 or lower) in a random sample of size 1,000, when the true population proportion is p 0 = 0.64

Again, at this point we can use either the calculator or the table to find that the p-value is 0.021; this is P(Z ≤ -2.31) + P(Z ≥ 2.31) = 2*P(Z ≥ 2.31).

The p-value tells us that it is pretty unlikely (probability of 0.021) to get data like those observed (test statistic as high as 2.31 or higher or as low as -2.31 or lower) assuming that Ho is true.
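All three p-values come straight from the standard normal distribution. A short sketch (Python; norm.cdf gives a left-tail area and norm.sf a right-tail area) reproduces the values reported above:

```python
from math import sqrt
from scipy.stats import norm

def z(p_hat, p0, n):
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

print(round(norm.cdf(z(0.16, 0.20, 400)), 3))            # 0.023 (left-tailed)
print(round(norm.sf(z(0.19, 0.157, 100)), 3))            # 0.182 (right-tailed)
print(round(2 * norm.sf(abs(z(0.675, 0.64, 1000))), 3))  # 0.021 (two-tailed)
```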

  • We’ve just seen that finding p-values involves probability calculations about the value of the test statistic assuming that Ho is true. In this case, when Ho is true, the values of the test statistic follow a standard normal distribution (i.e., the sampling distribution of the test statistic when the null hypothesis is true is N(0,1)). Therefore, p-values correspond to areas (probabilities) under the standard normal curve.

Similarly, in any test , p-values are found using the sampling distribution of the test statistic when the null hypothesis is true (also known as the “null distribution” of the test statistic). In this case, it was relatively easy to argue that the null distribution of our test statistic is N(0,1). As we’ll see, in other tests, other distributions come up (like the t-distribution and the F-distribution), which we will just mention briefly, and rely heavily on the output of our statistical package for obtaining the p-values.

We’ve just completed our discussion about the p-value, and how it is calculated both in general and more specifically for the z-test for the population proportion. Let’s go back to the four-step process of hypothesis testing and see what we’ve covered and what still needs to be discussed.

With respect to the z-test for the population proportion:

Step 3: Completed

Step 4. This is what we will work on next.

Learn by Doing: Proportions (Step 3) Understanding P-values

Proportions (Step 4 & Summary)

Video: Proportions (Step 4 & Summary) (4:30)

Step 4. Drawing Conclusions Based on the P-Value

This last part of the four-step process of hypothesis testing is the same across all statistical tests, and actually, we’ve already said basically everything there is to say about it, but it can’t hurt to say it again.

The p-value is a measure of how much evidence the data present against Ho. The smaller the p-value, the more evidence the data present against Ho.

We already mentioned that what determines what constitutes enough evidence against Ho is the significance level (α, alpha), a cutoff point below which the p-value is considered small enough to reject Ho in favor of Ha. The most commonly used significance level is 0.05.

  • Conclusion: There IS enough evidence that Ha is True
  • Conclusion: There IS NOT enough evidence that Ha is True

Where instead of “Ha is True,” we write what this means in the words of the problem, that is, in the context of the current scenario.

It is important to mention again that this step has essentially two sub-steps:

(i) Based on the p-value, determine whether or not the results are statistically significant (i.e., the data present enough evidence to reject Ho).

(ii) State your conclusions in the context of the problem.

Note: We must always also consider whether the results have any practical significance, particularly if they are statistically significant, since a statistically significant result that has no practical use is essentially meaningless!

Let’s go back to our three examples and draw conclusions.

We found that the p-value for this test was 0.023.

Since 0.023 is small (in particular, 0.023 < 0.05), the data provide enough evidence to reject Ho.

Conclusion:

  • There IS enough evidence that the proportion of defective products is less than 20% after the repair .

The following figure is the complete story of this example, and includes all the steps we went through, starting from stating the hypotheses and ending with our conclusions:

We found that the p-value for this test was 0.182.

Since 0.182 is not small (in particular, 0.182 > 0.05), the data do not provide enough evidence to reject Ho.

  • There IS NOT enough evidence that the proportion of students at the college who use marijuana is higher than the national figure.

Here is the complete story of this example:

Learn by Doing: Proportions (Step 4)

We found that the p-value for this test was 0.021.

Since 0.021 is small (in particular, 0.021 < 0.05), the data provide enough evidence to reject Ho.

  • There IS enough evidence that the proportion of adults who support the death penalty for convicted murderers has changed since 2003.

Did I Get This?: Proportions (Step 4)

Many Students Wonder: Hypothesis Testing for the Population Proportion

Many students wonder why 5% is often selected as the significance level in hypothesis testing, and why 1% is the next most typical level. This is largely a matter of convenience and tradition.

When Ronald Fisher (one of the founders of modern statistics) published one of his tables, he used a mathematically convenient scale that included 5% and 1%. Later, these same 5% and 1% levels were used by other people, in part just because Fisher was so highly esteemed. But mostly these are arbitrary levels.

The idea of selecting some sort of relatively small cutoff was historically important in the development of statistics; but it’s important to remember that there is really a continuous range of increasing confidence towards the alternative hypothesis, not a single all-or-nothing value. There isn’t much meaningful difference, for instance, between a p-value of 0.049 and one of 0.051, and it would be foolish to declare one case definitely a “real” effect and the other definitely a “random” effect. In either case, the study results were roughly 5% likely by chance if there’s no actual effect.

Whether such a p-value is sufficient for us to reject a particular null hypothesis ultimately depends on the risk of making the wrong decision, and the extent to which the hypothesized effect might contradict our prior experience or previous studies.

Let’s Summarize!!

We have now completed going through the four steps of hypothesis testing, and in particular we learned how they are applied to the z-test for the population proportion. Here is a brief summary:

Step 1: State the hypotheses

State the null hypothesis: Ho: p = p 0

State the alternative hypothesis: Ha: p < p 0 , Ha: p > p 0 , or Ha: p ≠ p 0 ,

where the choice of the appropriate alternative (out of the three) is usually quite clear from the context of the problem. If you feel it is not clear, it is most likely a two-sided problem. Students are usually good at recognizing the “more than” and “less than” terminology, but differences can sometimes be more difficult to spot; sometimes this is because you have preconceived ideas of how you think it should be! Use only the information given in the problem.

Step 2: Obtain data, check conditions, and summarize data

Obtain data from a sample and:

(i) Check whether the data satisfy the conditions which allow you to use this test.

random sample (or at least a sample that can be considered random in context)

the conditions under which the sampling distribution of p-hat is normal are met

\(n p_{0} \geq 10 \quad \text{and} \quad n\left(1-p_{0}\right) \geq 10\)

(ii) Calculate the sample proportion p-hat, and summarize the data using the test statistic:

\(z=\dfrac{\hat{p}-p_{0}}{\sqrt{\dfrac{p_{0}\left(1-p_{0}\right)}{n}}}\)

( Recall: This standardized test statistic represents how many standard deviations above or below p 0 our sample proportion p-hat is.)

Step 3: Find the p-value of the test by using the test statistic as follows

IMPORTANT FACT: In all future tests, we will rely on software to obtain the p-value.

When the alternative hypothesis is “less than”, the p-value is the probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. In symbols: p-value \(=P(Z \leq z)\).

When the alternative hypothesis is “greater than”, the p-value is the probability of observing a test statistic as large as that observed or larger, again assuming that the values of the test statistic follow a standard normal distribution. In symbols: p-value \(=P(Z \geq z)\).

When the alternative hypothesis is “not equal to”, the p-value is the probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. In symbols: p-value \(=2 P(Z \geq|z|)\).

Step 4: Conclusion

Reach a conclusion first regarding the statistical significance of the results, and then determine what it means in the context of the problem.

If p-value ≤ 0.05 then WE REJECT Ho Conclusion: There IS enough evidence that Ha is True

If p-value > 0.05 then WE FAIL TO REJECT Ho Conclusion: There IS NOT enough evidence that Ha is True

Recall that: If the p-value is small (in particular, smaller than the significance level, which is usually 0.05), the results are statistically significant (in the sense that there is a statistically significant difference between what was observed in the sample and what was claimed in Ho), and so we reject Ho.

If the p-value is not small, we do not have enough statistical evidence to reject Ho, and so we continue to believe that Ho may be true. (Remember: in hypothesis testing we never “accept” Ho.)

Finally, in practice, we should always consider the practical significance of the results as well as the statistical significance.
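To tie the four steps together, here is a minimal computational sketch in Python (using scipy; the function, its name, and its defaults are our own illustration, not part of the course software). It reproduces the death-penalty example above, where the two-sided p-value was 0.021:

```python
# A sketch of the z-test for a population proportion (steps 2-4).
from scipy.stats import norm

def z_test_proportion(count, n, p0, alternative="two-sided"):
    # Step 2(i): check the normal-approximation conditions
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("normal approximation conditions are not met")
    # Step 2(ii): sample proportion and standardized test statistic
    p_hat = count / n
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    # Step 3: p-value under the standard normal null distribution
    if alternative == "less":
        p_value = norm.cdf(z)           # P(Z <= z)
    elif alternative == "greater":
        p_value = norm.sf(z)            # P(Z >= z)
    else:
        p_value = 2 * norm.sf(abs(z))   # 2 * P(Z >= |z|)
    return z, p_value

# Step 4: compare the p-value to the significance level (usually 0.05).
z, p = z_test_proportion(675, 1000, 0.64)
print(round(z, 2), round(p, 3))  # 2.31, 0.021 -- matching the example above
```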

Learn by Doing: Z-Test for a Population Proportion

What’s next?

Before we move on to the next test, we are going to use the z-test for proportions to bring up and illustrate a few more very important issues regarding hypothesis testing. This might also be a good time to review the concepts of Type I error, Type II error, and Power before continuing on.

More about Hypothesis Testing

CO-1: Describe the roles biostatistics serves in the discipline of public health.

LO 1.11: Recognize the distinction between statistical significance and practical significance.

LO 6.30: Use a confidence interval to determine the correct conclusion to the associated two-sided hypothesis test.

Video: More about Hypothesis Testing (18:25)

The issues regarding hypothesis testing that we will discuss are:

  • The effect of sample size on hypothesis testing.
  • Statistical significance vs. practical importance.
  • Hypothesis testing and confidence intervals—how are they related?

Let’s begin.

1. The Effect of Sample Size on Hypothesis Testing

We have already seen the effect that the sample size has on inference, when we discussed point and interval estimation for the population mean (μ, mu) and population proportion (p). Intuitively …

Larger sample sizes give us more information to pin down the true nature of the population. We can therefore expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. As a result, for the same level of confidence, we can report a smaller margin of error, and get a narrower confidence interval. What we’ve seen, then, is that larger sample size gives a boost to how much we trust our sample results.

In hypothesis testing, larger sample sizes have a similar effect. We have also discussed that the power of our test increases when the sample size increases, all else remaining the same. This means we have a better chance of detecting the difference between the true value and the null value with larger samples.

The following two examples will illustrate that a larger sample size provides more convincing evidence (the test has greater power), and how the evidence manifests itself in hypothesis testing. Let’s go back to our example 2 (marijuana use at a certain liberal arts college).

We do not have enough evidence to conclude that the proportion of students at the college who use marijuana is higher than the national figure.

Now, let’s increase the sample size.

There are rumors that students in a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is 0.157? (Reported by the Harvard School of Public Health).

Our results here are statistically significant . In other words, in example 2* the data provide enough evidence to reject Ho.

  • Conclusion: There is enough evidence that the proportion of marijuana users at the college is higher than among all U.S. students.
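For readers who want to verify these numbers, here is a small sketch (again using scipy; the variable names are ours) that runs the same one-sided test for both sample sizes:

```python
# Reproducing the two marijuana examples: the same sample proportion of
# 0.19 is tested against the null value 0.157 with n = 100 and n = 400.
from scipy.stats import norm

p0 = 0.157
for count, n in [(19, 100), (76, 400)]:
    p_hat = count / n                                # 0.19 in both cases
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    print(n, round(z, 2), round(norm.sf(z), 3))      # one-sided "greater"
# n = 100 gives a p-value of ~ 0.182 (not significant);
# n = 400 gives a p-value of ~ 0.035 (significant at the 0.05 level).
```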

What do we learn from this?

We see that sample results that are based on a larger sample carry more weight (have greater power).

In example 2, we saw that a sample proportion of 0.19 based on a sample of size of 100 was not enough evidence that the proportion of marijuana users in the college is higher than 0.157. Recall, from our general overview of hypothesis testing, that this conclusion (not having enough evidence to reject the null hypothesis) doesn’t mean the null hypothesis is necessarily true (so, we never “accept” the null); it only means that the particular study didn’t yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference.

However, in example 2*, we saw that when the sample proportion of 0.19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than 0.157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference.

The following activity will allow you to practice the ideas and terminology used in hypothesis testing when a result is not statistically significant.

Learn by Doing: Interpreting Non-significant Results

2. Statistical significance vs. practical importance.

Now, we will address the issue of statistical significance versus practical importance (which also involves issues of sample size).

The following activity will let you explore the effect of the sample size on the statistical significance of the results yourself, and more importantly will discuss issue 2: Statistical significance vs. practical importance.

Important Fact: In general, with a sufficiently large sample size you can make any result that has very little practical importance statistically significant! A large sample size alone does NOT make a “good” study!!

This suggests that when interpreting the results of a test, you should always think not only about the statistical significance of the results but also about their practical importance.
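The following sketch illustrates this point with made-up numbers: the null value 0.157 is borrowed from the marijuana example, but the fixed sample proportion of 0.16 and the sample sizes are purely hypothetical.

```python
# A practically trivial difference (0.16 vs. 0.157) becomes statistically
# significant once the sample size is large enough.
from scipy.stats import norm

p0, p_hat = 0.157, 0.16
for n in [1_000, 10_000, 100_000, 1_000_000]:
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    print(n, round(2 * norm.sf(abs(z)), 4))
# The two-sided p-value shrinks toward 0 as n grows, even though the
# difference of 0.003 is negligible in practical terms.
```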

Learn by Doing: Statistical vs. Practical Significance

3. Hypothesis Testing and Confidence Intervals

The last topic we want to discuss is the relationship between hypothesis testing and confidence intervals. Even though the flavor of these two forms of inference is different (confidence intervals estimate a parameter, and hypothesis testing assesses the evidence in the data against one claim and in favor of another), there is a strong link between them.

We will explain this link (using the z-test and confidence interval for the population proportion), and then explain how confidence intervals can be used after a test has been carried out.

Recall that a confidence interval gives us a set of plausible values for the unknown population parameter. We may therefore examine a confidence interval to informally decide if a proposed value of population proportion seems plausible.

For example, if a 95% confidence interval for p, the proportion of all U.S. adults already familiar with Viagra in May 1998, was (0.61, 0.67), then it seems clear that we should be able to reject a claim that only 50% of all U.S. adults were familiar with the drug, since based on the confidence interval, 0.50 is not one of the plausible values for p.

In fact, the information provided by a confidence interval can be formally related to the information provided by a hypothesis test. (Comment: the relationship is more straightforward for two-sided alternatives, and so we will not present results for the one-sided cases.)

Suppose we want to carry out the two-sided test:

  • Ho: p = p0
  • Ha: p ≠ p0

using a significance level of 0.05.

An alternative way to perform this test is to find a 95% confidence interval for p and check:

  • If p0 falls outside the confidence interval, reject Ho.
  • If p0 falls inside the confidence interval, do not reject Ho.

In other words,

  • If p0 is not one of the plausible values for p, we reject Ho.
  • If p0 is a plausible value for p, we cannot reject Ho.

(Comment: similarly, the results of a test using a significance level of 0.01 can be related to the 99% confidence interval.)
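Here is a minimal sketch of this confidence-interval approach in Python (the function is our own illustration; note that the interval uses p-hat in the standard error while the z-test uses p0, so the two approaches can disagree in borderline cases):

```python
# Two-sided test at level 0.05 via the 95% confidence interval:
# reject Ho exactly when p0 falls outside the interval.
from scipy.stats import norm

def test_via_ci(p_hat, n, p0, conf=0.95):
    z_star = norm.ppf(1 - (1 - conf) / 2)             # 1.96 for 95%
    margin = z_star * (p_hat * (1 - p_hat) / n) ** 0.5
    lo, hi = p_hat - margin, p_hat + margin
    reject = not (lo <= p0 <= hi)                     # p0 outside -> reject
    return (round(lo, 3), round(hi, 3)), reject

print(test_via_ci(0.675, 1000, 0.64))  # ((0.646, 0.704), True): reject Ho
```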

Let’s look at an example:

Recall example 3, where we wanted to know whether the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was 0.64.

We are testing:

  • Ho: p = 0.64
  • Ha: p ≠ 0.64

and recall that we took a sample of 1,000 U.S. adults, and the data told us that 675 supported the death penalty for convicted murderers (p-hat = 0.675).

A 95% confidence interval for p, the proportion of all U.S. adults who support the death penalty, is:

\(0.675 \pm 1.96 \sqrt{\dfrac{0.675(1-0.675)}{1000}} \approx 0.675 \pm 0.029=(0.646,0.704)\)

Since the 95% confidence interval for p does not include 0.64 as a plausible value for p, we can reject Ho and conclude (as we did before) that there is enough evidence that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003.

You and your roommate are arguing about whose turn it is to clean the apartment. Your roommate suggests that you settle this by tossing a coin and takes one out of a locked box he has on the shelf. Suspecting that the coin might not be fair, you decide to test it first. You toss the coin 80 times, thinking to yourself that if, indeed, the coin is fair, you should get around 40 heads. Instead you get 48 heads. You are puzzled. You are not sure whether getting 48 heads out of 80 is enough evidence to conclude that the coin is unbalanced, or whether this is a result that could have happened just by chance when the coin is fair.

Statistics can help you answer this question.

Let p be the true proportion (probability) of heads. We want to test whether the coin is fair or not.

  • Ho: p = 0.5 (the coin is fair).
  • Ha: p ≠ 0.5 (the coin is not fair).

The data we have are that out of n = 80 tosses, we got 48 heads, or that the sample proportion of heads is p-hat = 48/80 = 0.6.

A 95% confidence interval for p, the true proportion of heads for this coin, is:

\(0.6 \pm 1.96 \sqrt{\dfrac{0.6(1-0.6)}{80}} \approx 0.6 \pm 0.11=(0.49,0.71)\)

Since in this case 0.5 is one of the plausible values for p, we cannot reject Ho. In other words, the data do not provide enough evidence to conclude that the coin is not fair.

The context of the last example is a good opportunity to bring up an important point that was discussed earlier.

Even though we use 0.05 as a cutoff to guide our decision about whether the results are statistically significant, we should not treat it as inviolable and we should always add our own judgment. Let’s look at the last example again.

It turns out that the p-value of this test is 0.0734. In other words, it is maybe not extremely unlikely, but it is quite unlikely (probability of 0.0734) that when you toss a fair coin 80 times you’ll get a sample proportion of heads of 48/80 = 0.6 (or even more extreme). It is true that using the 0.05 significance level (cutoff), 0.0734 is not considered small enough to conclude that the coin is not fair. However, if you really don’t want to clean the apartment, the p-value might be small enough for you to ask your roommate to use a different coin, or to provide one yourself!
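As a quick check, the interval and the p-value quoted in this example can be reproduced with a few lines of scipy-based Python (a sketch, not course software):

```python
# Coin example: 48 heads in 80 tosses, testing Ho: p = 0.5.
from scipy.stats import norm

p_hat, n, p0 = 48 / 80, 80, 0.5

# 95% confidence interval for p
margin = 1.96 * (p_hat * (1 - p_hat) / n) ** 0.5
print(round(p_hat - margin, 2), round(p_hat + margin, 2))  # ~ (0.49, 0.71)

# Two-sided p-value for the z-test
z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
print(round(2 * norm.sf(abs(z)), 4))  # ~ 0.073, as quoted above
```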

Did I Get This?: Connection between Confidence Intervals and Hypothesis Tests

Did I Get This?: Hypothesis Tests for Proportions (Extra Practice)

Here is our final point on this subject:

When the data provide enough evidence to reject Ho, we can conclude (depending on the alternative hypothesis) that the population proportion is either less than, greater than, or not equal to the null value p0. However, we do not get a more informative statement about its actual value. It might be of interest, then, to follow the test with a 95% confidence interval that will give us more insight into the actual value of p.

In our example 3,

we concluded that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was 0.64. It is probably of interest not only to know that the proportion has changed, but also to estimate what it has changed to. We’ve calculated the 95% confidence interval for p on the previous page and found that it is (0.646, 0.704).

We can combine our conclusions from the test and the confidence interval and say:

Data provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, and we are 95% confident that it is now between 0.646 and 0.704 (i.e., between 64.6% and 70.4%).

Let’s look at our example 1 to see how a confidence interval following a test might be insightful in a different way.

Here is a summary of example 1: we tested Ho: p = 0.20 versus Ha: p < 0.20, where p is the proportion of defective products after the repair, and a random sample of 400 products gave a sample proportion of defective products of p-hat = 0.16.

We conclude that as a result of the repair, the proportion of defective products has been reduced to below 0.20 (which was the proportion prior to the repair). It is probably of great interest to the company not only to know that the proportion of defective products has been reduced, but also to estimate what it has been reduced to, to get a better sense of how effective the repair was. A 95% confidence interval for p in this case is:

\(0.16 \pm 1.96 \sqrt{\dfrac{0.16(1-0.16)}{400}} \approx 0.16 \pm 0.036=(0.124,0.196)\)

We can therefore say that the data provide evidence that the proportion of defective products has been reduced, and we are 95% confident that it has been reduced to somewhere between 12.4% and 19.6%. This is very useful information, since it tells us that even though the results were significant (i.e., the repair reduced the number of defective products), the repair might not have been effective enough, if it managed to reduce the number of defective products only to the range provided by the confidence interval. This, of course, ties back in to the idea of statistical significance vs. practical importance that we discussed earlier. Even though the results are statistically significant (Ho was rejected), practically speaking, the repair might still be considered ineffective.

Learn by Doing: Hypothesis Tests and Confidence Intervals

Even though this portion of the current section is about the z-test for population proportion, it is loaded with very important ideas that apply to hypothesis testing in general. We’ve already summarized the details that are specific to the z-test for proportions, so the purpose of this summary is to highlight the general ideas.

The process of hypothesis testing has four steps :

I. Stating the null and alternative hypotheses (Ho and Ha).

II. Obtaining a random sample (or at least one that can be considered random) and collecting data. Using the data:

Check that the conditions under which the test can be reliably used are met.

Summarize the data using a test statistic.

  • The test statistic is a measure of the evidence in the data against Ho. The larger the test statistic is in magnitude, the more evidence the data present against Ho.

III. Finding the p-value of the test. The p-value is the probability of getting data like those observed (or even more extreme) assuming that the null hypothesis is true, and is calculated using the null distribution of the test statistic. The p-value is a measure of the evidence against Ho. The smaller the p-value, the more evidence the data present against Ho.

IV. Making conclusions.

Conclusions about the statistical significance of the results:

If the p-value is small, the data present enough evidence to reject Ho (and accept Ha).

If the p-value is not small, the data do not provide enough evidence to reject Ho.

To help guide our decision, we use the significance level as a cutoff for what is considered a small p-value. The significance cutoff is usually set at 0.05.

Conclusions should then be provided in the context of the problem.

Additional Important Ideas about Hypothesis Testing

  • Results that are based on a larger sample carry more weight: for a given observed effect, the larger the sample, the more statistically significant the result becomes.
  • Even a very small and practically unimportant effect becomes statistically significant with a large enough sample size. The distinction between statistical significance and practical importance should therefore always be considered.
  • Confidence intervals can be used in order to carry out two-sided tests (95% confidence for the 0.05 significance level). If the null value is not included in the confidence interval (i.e., is not one of the plausible values for the parameter), we have enough evidence to reject Ho. Otherwise, we cannot reject Ho.
  • If the results are statistically significant, it might be of interest to follow up the tests with a confidence interval in order to get insight into the actual value of the parameter of interest.
  • It is important to be aware that there are two types of errors in hypothesis testing (Type I and Type II) and that the power of a statistical test is an important measure of how likely we are to be able to detect a difference of interest to us in a particular problem.

Means (All Steps)

NOTE: Beginning on this page, the Learn By Doing and Did I Get This activities are presented as interactive PDF files. The interactivity may not work on mobile devices or with certain PDF viewers. Use an official ADOBE product such as ADOBE READER.

If you have any issues with the Learn By Doing or Did I Get This interactive PDF files, you can view all of the questions and answers presented on this page in this document:

  • QUESTION/Answer (SPOILER ALERT!)

Tests About μ (mu) When σ (sigma) is Unknown – The t-test for a Population Mean

The t-distribution.

Video: Means (All Steps) (13:11)

So far we have talked about the logic behind hypothesis testing and then illustrated how this process proceeds in practice, using the z-test for the population proportion (p).

We are now moving on to discuss testing for the population mean (μ, mu), which is the parameter of interest when the variable of interest is quantitative.

A few comments about the structure of this section:

  • The basic groundwork for carrying out hypothesis tests has already been laid in our general discussion and in our presentation of tests about proportions.

Therefore we can easily modify the four steps to carry out tests about means instead, without going into all of the details again.

We will use this approach for all future tests, so be sure to go back to the general discussion and to the discussion of proportions to review the concepts in more detail.

  • In our discussion about confidence intervals for the population mean, we made the distinction between whether the population standard deviation, σ (sigma) was known or if we needed to estimate this value using the sample standard deviation, s .

In this section, we will only discuss the second case, as in most realistic settings we do not know the population standard deviation.

In this case we need to use the t- distribution instead of the standard normal distribution for the probability aspects of confidence intervals (choosing table values) and hypothesis tests (finding p-values).

  • Although we will discuss some theoretical or conceptual details for some of the analyses we will learn, from this point on we will rely on software to conduct tests and calculate confidence intervals for us , while we focus on understanding which methods are used for which situations and what the results say in context.

If you are interested in more information about the z-test, where we assume the population standard deviation σ (sigma) is known, you can review the Carnegie Mellon Open Learning Statistics Course (you will need to click “ENTER COURSE”).

Like any other tests, the t- test for the population mean follows the four-step process:

  • STEP 1: Stating the hypotheses Ho and Ha.
  • STEP 2: Collecting relevant data, checking that the data satisfy the conditions which allow us to use this test, and summarizing the data using a test statistic.
  • STEP 3: Finding the p-value of the test, the probability of obtaining data as extreme as those collected (or even more extreme, in the direction of the alternative hypothesis), assuming that the null hypothesis is true. In other words, how likely is it that the only reason for getting data like those observed is sampling variability (and not because Ho is not true)?
  • STEP 4: Drawing conclusions, assessing the statistical significance of the results based on the p-value, and stating our conclusions in context. (Do we or don’t we have evidence to reject Ho and accept Ha?)
  • Note: In practice, we should also always consider the practical significance of the results as well as the statistical significance.

We will now go through the four steps specifically for the t- test for the population mean and apply them to our two examples.

Only in a few cases is it reasonable to assume that the population standard deviation, σ (sigma), is known and so we will not cover hypothesis tests in this case. We discussed both cases for confidence intervals so that we could still calculate some confidence intervals by hand.

For this and all future tests we will rely on software to obtain our summary statistics, test statistics, and p-values for us.

The case where σ (sigma) is unknown is much more common in practice. What can we use to replace σ (sigma)? If you don’t know the population standard deviation, the best you can do is find the sample standard deviation, s, and use it instead of σ (sigma). (Note that this is exactly what we did when we discussed confidence intervals).

Is that it? Can we just use s instead of σ (sigma), and the rest is the same as the previous case? Unfortunately, it’s not that simple, but not very complicated either.

Here, when we use the sample standard deviation, s, as our estimate of σ (sigma) we can no longer use a normal distribution to find the cutoff for confidence intervals or the p-values for hypothesis tests.

Instead we must use the t- distribution (with n-1 degrees of freedom) to obtain the p-value for this test.

We discussed this issue for confidence intervals. We will talk more about the t- distribution after we discuss the details of this test for those who are interested in learning more.

It isn’t really necessary for us to understand this distribution but it is important that we use the correct distributions in practice via our software.

We will wait until UNIT 4B to look at how to accomplish this test in the software. For now focus on understanding the process and drawing the correct conclusions from the p-values given.

Now let’s go through the four steps in conducting the t- test for the population mean.

The null and alternative hypotheses for the t- test for the population mean (μ, mu) have exactly the same structure as the hypotheses for z-test for the population proportion (p):

The null hypothesis has the form:

  • Ho: μ = μ 0 (mu = mu_zero)

(where μ 0 (mu_zero) is often called the null value)

  • Ha: μ < μ 0 (mu < mu_zero) (one-sided)
  • Ha: μ > μ 0 (mu > mu_zero) (one-sided)
  • Ha: μ ≠ μ 0 (mu ≠ mu_zero) (two-sided)

where the choice of the appropriate alternative (out of the three) is usually quite clear from the context of the problem.

If you feel it is not clear, it is most likely a two-sided problem. Students are usually good at recognizing the “more than” and “less than” terminology, but differences can sometimes be more difficult to spot; this is sometimes because you have preconceived ideas of how you think it should be! You also cannot use the information from the sample to help you determine the hypotheses: we would not know our data when we originally asked the question.

Now try it yourself. Here are a few exercises on stating the hypotheses for tests for a population mean.

Learn by Doing: State the Hypotheses for a test for a population mean

Here are a few more activities for practice.

Did I Get This?: State the Hypotheses for a test for a population mean

When setting up hypotheses, be sure to use only the information in the research question. We cannot use our sample data to help us set up our hypotheses.

For this test, it is still important to correctly choose the alternative hypothesis as “less than”, “greater than”, or “different” although generally in practice two-sample tests are used.

Obtain data from a sample:

  • In this step we would obtain data from a sample. This is not something we do much of in courses but it is done very often in practice!

Check the conditions:

  • Then we check the conditions under which this test (the t- test for one population mean) can be safely carried out – which are:
  • The sample is random (or at least can be considered random in context).
  • We are in one of the three situations marked with a green check mark in the following table, which ensures that x-bar is at least approximately normal and that the test statistic using the sample standard deviation, s, therefore follows a t-distribution with n-1 degrees of freedom (proving this is beyond the scope of this course):
  • For large samples, we don’t need to check for normality in the population. We can rely on the sample size as the basis for the validity of using this test.
  • For small samples, we need to have data from a normal population in order for the p-values and confidence intervals to be valid.

In practice, for small samples, it can be very difficult to determine if the population is normal. Here is a simulation to give you a better understanding of the difficulties.

Video: Simulations – Are Samples from a Normal Population? (4:58)

Now try it yourself with a few activities.

Learn by Doing: Checking Conditions for Hypothesis Testing for the Population Mean

  • It is always a good idea to look at the data and get a sense of their pattern regardless of whether you actually need to do it in order to assess whether the conditions are met.
  • This idea of looking at the data is relevant to all tests in general. In the next module—inference for relationships—conducting exploratory data analysis before inference will be an integral part of the process.

Here are a few more problems for extra practice.

Did I Get This?: Checking Conditions for Hypothesis Testing for the Population Mean


Calculate Test Statistic

Assuming that the conditions are met, we calculate the sample mean x-bar and the sample standard deviation, s (which estimates σ (sigma)), and summarize the data with a test statistic.

The test statistic for the t -test for the population mean is:

\(t=\dfrac{\bar{x} - \mu_0}{s/ \sqrt{n}}\)

Recall that such a standardized test statistic represents how many standard deviations above or below μ 0 (mu_zero) our sample mean x-bar is.

Therefore our test statistic is a measure of how different our data are from what is claimed in the null hypothesis. This is an idea that we mentioned in the previous test as well.

Again we will rely on the p-value to determine how unusual our data would be if the null hypothesis is true.

As we mentioned, the test statistic in the t -test for a population mean does not follow a standard normal distribution. Rather, it follows another bell-shaped distribution called the t- distribution.

We will present the details of this distribution at the end for those interested but for now we will work on the process of the test.

Here are a few important facts.

  • In statistical language we say that the null distribution of our test statistic is the t- distribution with (n-1) degrees of freedom. In other words, when Ho is true (i.e., when μ = μ 0 (mu = mu_zero)), our test statistic has a t- distribution with (n-1) d.f., and this is the distribution under which we find p-values.
  • For a large sample size (n), the null distribution of the test statistic is approximately Z, so whether we use t (n – 1) or Z to calculate the p-values does not make a big difference. However, software will use the t -distribution regardless of the sample size and so will we.
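To see how much this matters in practice, here is a small sketch (using scipy; the chosen test statistic value of 2.0 and the sample sizes are illustrative) comparing two-sided p-values under t(n-1) and under Z:

```python
# Two-sided p-values for a test statistic of 2.0 under t(n-1) vs. Z.
from scipy.stats import norm, t

for n in [5, 15, 30, 100, 1000]:
    print(n, round(2 * t.sf(2.0, df=n - 1), 4))
print("Z:", round(2 * norm.sf(2.0), 4))
# With small n the t-based p-value is noticeably larger ("fatter tails");
# by n = 1000 it is essentially equal to the normal value of ~ 0.0455.
```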

Although we will not calculate p-values by hand for this test, we can still easily calculate the test statistic.

Try it yourself:

Learn by Doing: Calculate the Test Statistic for a Test for a Population Mean

From this point in this course and certainly in practice we will allow the software to calculate our test statistics and we will use the p-values provided to draw our conclusions.

We will use software to obtain the p-value for this (and all future) tests but here are the images illustrating how the p-value is calculated in each of the three cases corresponding to the three choices for our alternative hypothesis.

Note that due to the symmetry of the t distribution, for a given value of the test statistic t, the p-value for the two-sided test is twice as large as the p-value of either of the one-sided tests; this is the same relationship we saw when p-values were calculated under the Z distribution.

We will show some examples of p-values obtained from software in our examples. For now let’s continue our summary of the steps.

As usual, based on the p-value (and some significance level of choice) we assess the statistical significance of results, and draw our conclusions in context.

To review what we have said before:

If p-value ≤ 0.05 then WE REJECT Ho

If p-value > 0.05 then WE FAIL TO REJECT Ho

This step has essentially two sub-steps: (i) assess the statistical significance of the results, and (ii) state our conclusions in the context of the problem.

We are now ready to look at two examples.

A certain prescription medicine is supposed to contain an average of 250 parts per million (ppm) of a certain chemical. If the concentration is higher than this, the drug may cause harmful side effects; if it is lower, the drug may be ineffective.

The manufacturer runs a check to see if the mean concentration in a large shipment conforms to the target level of 250 ppm or not.

A simple random sample of 100 portions is tested, and the sample mean concentration is found to be 247 ppm with a sample standard deviation of 12 ppm.

Here is a figure that represents this example:

A large circle represents the population, which is the shipment. μ represents the concentration of the chemical. The question we want to answer is "is the mean concentration the required 250ppm or not? (Assume: SD = 12)." Selected from the population is a sample of size n=100, represented by a smaller circle. x-bar for this sample is 247.

1. The hypotheses being tested are:

  • Ho: μ = 250
  • Ha: μ ≠ 250
  • Where μ = population mean concentration (in parts per million) of the chemical in the entire shipment

2. The conditions that allow us to use the t-test are met since:

  • The sample is random
  • The sample size is large enough for the Central Limit Theorem to apply and ensure the normality of x-bar. We do not need normality of the population in order to be able to conduct this test for the population mean. We are in the 2nd column in the table below.
  • The test statistic is:

\(t=\dfrac{\bar{x}-\mu_{0}}{s / \sqrt{n}}=\dfrac{247-250}{12 / \sqrt{100}}=-2.5\)

  • The data (represented by the sample mean) are 2.5 standard errors below the null value.

3. Finding the p-value.

  • To find the p-value we use statistical software, and we calculate a p-value of 0.014.

4. Conclusions:

  • The p-value is small (0.014), indicating that at the 5% significance level, the results are significant.
  • We reject the null hypothesis.
  • There is enough evidence to conclude that the mean concentration in the entire shipment is not the required 250 ppm.
  • It is difficult to comment on the practical significance of this result without more understanding of the practical considerations of this problem.

Here is a summary:

  • The 95% confidence interval for μ (mu) can be used here in the same way as for proportions to conduct the two-sided test (checking whether the null value falls inside or outside the confidence interval) or following a t- test where Ho was rejected to get insight into the value of μ (mu).
  • We find the 95% confidence interval to be (244.619, 249.381). Since 250 is not in the interval, we know we would reject our null hypothesis that μ (mu) = 250. The confidence interval gives additional information: by accounting for estimation error, it estimates that the population mean is likely to be between 244.62 and 249.38. This is lower than the target concentration, and that information might help determine the seriousness and appropriate course of action in this situation.
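Both the test and the confidence interval in this example can be verified with a short scipy-based sketch (ours, not the course software), using the summary statistics n = 100, x-bar = 247, and s = 12:

```python
# Verifying the concentration example: t statistic, p-value, and 95% CI.
from scipy.stats import t

n, xbar, s, mu0 = 100, 247, 12, 250
tstat = (xbar - mu0) / (s / n ** 0.5)        # -2.5
p_value = 2 * t.sf(abs(tstat), df=n - 1)     # two-sided p-value ~ 0.014

t_star = t.ppf(0.975, df=n - 1)              # ~ 1.984 for df = 99
margin = t_star * (s / n ** 0.5)
print(tstat, round(p_value, 3),
      (round(xbar - margin, 3), round(xbar + margin, 3)))
# -2.5, 0.014, (244.619, 249.381) -- matching the results above
```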

In most situations in practice we use TWO-SIDED HYPOTHESIS TESTS, followed by confidence intervals to gain more insight.

For completeness in covering one-sample t-tests for a population mean, we still cover all three possible alternative hypotheses here. HOWEVER, this will be the last test for which we will do so.

A research study measured the pulse rates of 57 college men and found a mean pulse rate of 70 beats per minute with a standard deviation of 9.85 beats per minute.

Researchers want to know if the mean pulse rate for all college men is different from the current standard of 72 beats per minute.

  • The hypotheses being tested are:
  • Ho: μ = 72
  • Ha: μ ≠ 72
  • Where μ = population mean heart rate among college men
  • The conditions that allow us to use the t- test are met since:
  • The sample is random.
  • The sample size is large (n = 57) so we do not need normality of the population in order to be able to conduct this test for the population mean. We are in the 2nd column in the table below.

\(t=\dfrac{\bar{x}-\mu_{0}}{s / \sqrt{n}}=\dfrac{70-72}{9.85 / \sqrt{57}}=-1.53\)

  • The data (represented by the sample mean) are 1.53 estimated standard errors below the null value.
  • Recall that in general the p-value is calculated under the null distribution of the test statistic, which, in the t-test case, is t(n-1). In our case, in which n = 57, the p-value is calculated under the t(56) distribution. Using statistical software, we find that the p-value is 0.132.
  • Here is how we calculated the p-value: http://homepage.stat.uiowa.edu/~mbognar/applets/t.html

A t(56) curve, with the horizontal axis labeled at the t-scores -1.53 and 1.53. The area under the curve to the left of -1.53 plus the area to the right of 1.53 is the p-value.

4. Making conclusions.

  • The p-value (0.132) is not small, indicating that the results are not significant.
  • We fail to reject the null hypothesis.
  • There is not enough evidence to conclude that the mean pulse rate for all college men is different from the current standard of 72 beats per minute.
  • The results from this sample do not appear to have any practical significance either: a mean pulse rate of 70 is very similar to the hypothesized value of 72, relative to the variation expected in pulse rates.
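The same kind of sketch verifies this example's test statistic and p-value (again using scipy, with the summary statistics given above):

```python
# Pulse-rate example: n = 57, x-bar = 70, s = 9.85, testing Ho: mu = 72.
from scipy.stats import t

n, xbar, s, mu0 = 57, 70, 9.85, 72
tstat = (xbar - mu0) / (s / n ** 0.5)
p_value = 2 * t.sf(abs(tstat), df=n - 1)
print(round(tstat, 2), round(p_value, 3))  # t ~ -1.53, p-value ~ 0.13
```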

Now try a few yourself.

Learn by Doing: Hypothesis Testing for the Population Mean

From this point in this course and certainly in practice we will allow the software to calculate our test statistic and p-value and we will use the p-values provided to draw our conclusions.

That concludes our discussion of hypothesis tests in Unit 4A.

In the next unit we will continue to use both confidence intervals and hypothesis tests to investigate the relationship between two variables in the cases we covered in Unit 1 on exploratory data analysis: we will look at Case CQ, Case CC, and Case QQ.

Before moving on, we will discuss the details about the t- distribution as a general object.

We have seen that variables can be visually modeled by many different sorts of shapes, and we call these shapes distributions. Several distributions arise so frequently that they have been given special names, and they have been studied mathematically.

So far in the course, the only one we’ve named, for continuous quantitative variables, is the normal distribution, but there are others. One of them is called the t- distribution.

The t- distribution is another bell-shaped (unimodal and symmetric) distribution, like the normal distribution; and the center of the t- distribution is standardized at zero, like the center of the standard normal distribution.

Like all distributions that are used as probability models, the normal and the t- distribution are both scaled, so the total area under each of them is 1.

So how is the t-distribution fundamentally different from the normal distribution?

  • The spread .

The following picture illustrates the fundamental difference between the normal distribution and the t-distribution:

You can see in the picture that the t- distribution has slightly less area near the expected central value than the normal distribution does, and you can see that the t distribution has correspondingly more area in the “tails” than the normal distribution does. (It’s often said that the t- distribution has “fatter tails” or “heavier tails” than the normal distribution.)

This reflects the fact that the t- distribution has a larger spread than the normal distribution. The same total area of 1 is spread out over a slightly wider range on the t- distribution, making it a bit lower near the center compared to the normal distribution, and giving the t- distribution slightly more probability in the ‘tails’ compared to the normal distribution.

Therefore, the t- distribution ends up being the appropriate model in certain cases where there is more variability than would be predicted by the normal distribution. One of these cases is stock values, which have more variability (or “volatility,” to use the economic term) than would be predicted by the normal distribution.

There’s actually an entire family of t- distributions. They all have similar formulas (but the math is beyond the scope of this introductory course in statistics), and they all have slightly “fatter tails” than the normal distribution. But some are closer to normal than others.

The t- distributions that have higher “degrees of freedom” are closer to normal (degrees of freedom is a mathematical concept that we won’t study in this course, beyond merely mentioning it here). So, there’s a t- distribution “with one degree of freedom,” another t- distribution “with 2 degrees of freedom” which is slightly closer to normal, another t- distribution “with 3 degrees of freedom” which is a bit closer to normal than the previous ones, and so on.
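A quick numerical sketch (using scipy; the cutoff of 2 standard units is an arbitrary illustration) shows the tail areas shrinking toward the normal value as the degrees of freedom grow:

```python
# Tail area beyond 2 standard units for several t-distributions vs. normal.
from scipy.stats import norm, t

for df in [1, 2, 5, 10, 30]:
    print(df, round(t.sf(2.0, df=df), 4))
print("normal:", round(norm.sf(2.0), 4))
# The tail probability decreases toward the normal value as d.f. increases,
# illustrating that higher-d.f. t-distributions are closer to normal.
```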

The following picture illustrates this idea with just a couple of t- distributions (note that “degrees of freedom” is abbreviated “d.f.” on the picture):

The test statistic for our t-test for one population mean is a t -score which follows a t- distribution with (n – 1) degrees of freedom. Recall that each t- distribution is indexed according to “degrees of freedom.” Notice that, in the context of a test for a mean, the degrees of freedom depend on the sample size in the study.

Remember that we said that higher degrees of freedom indicate that the t- distribution is closer to normal. So in the context of a test for the mean, the larger the sample size , the higher the degrees of freedom, and the closer the t- distribution is to a normal z distribution .

As a result, in the context of a test for a mean, the effect of the t- distribution is most important for a study with a relatively small sample size .

We are now done introducing the t-distribution. What are the implications of all of this?

  • The null distribution of our t-test statistic is the t-distribution with (n-1) d.f. In other words, when Ho is true (i.e., when μ = μ 0 (mu = mu_zero)), our test statistic has a t-distribution with (n-1) d.f., and this is the distribution under which we find p-values.
  • For a large sample size (n), the null distribution of the test statistic is approximately Z, so whether we use t(n – 1) or Z to calculate the p-values does not make a big difference.

Enago Academy

Quick Guide to Biostatistics in Clinical Research: Hypothesis Testing


In this article series, we will be looking at some of the important concepts of biostatistics in clinical trials and clinical research. Statistics is frequently used to analyze quantitative research data. Clinical trials and clinical research both often rely on statistics. Clinical trials proceed through many phases. Contract Research Organizations (CROs) can be hired to conduct a clinical trial. Clinical trials are an important step in deciding if a treatment can be safely and effectively used in medical practice. Once the clinical trial phases are completed, biostatistics is used to analyze the results.

Research generally proceeds in an orderly fashion as shown below.

Research Process

Once you have identified the research question you need to answer, it is time to frame a good hypothesis. The hypothesis is the starting point for biostatistics and is usually based on a theory. Experiments are then designed to test the hypothesis. What is a hypothesis? A research hypothesis is a statement describing a relationship between two or more variables that can be tested. A good hypothesis will be clear, specific, objective, relevant to the research question, and free of moral judgments. Above all, a hypothesis must be testable.

A simple hypothesis would contain one predictor and one outcome variable. For instance, if your hypothesis was, “Chocolate consumption is linked to type II diabetes” the predictor would be whether or not a person eats chocolate and the outcome would be developing type II diabetes. A good hypothesis would also be specific. This means that it should be clear which subjects and research methodology will be used to test the hypothesis. An example of a specific hypothesis would be, “Adults who consume more than 20 grams of milk chocolate per day, as measured by a questionnaire over the course of 12 months, are more likely to develop type II diabetes than adults who consume less than 10 grams of milk chocolate per day.”

Null and Alternative Hypothesis

In statistics, the null hypothesis (H0) states that there is no relationship between the predictor and the outcome variable in the population being studied. For instance, “There is no relationship between a family history of depression and the probability that a person will attempt suicide.” The alternative hypothesis (H1) states that there is a relationship between the predictor (a history of depression) and the outcome (attempted suicide). It is impossible to prove a statement by making several observations, but it is possible to disprove a statement with a single observation. Seeing only red tulips is not proof that no other colors exist; however, seeing a single tulip that was not red would immediately prove that the statement “All tulips are red” is false. This is why statistics tests the null hypothesis, and why the alternative hypothesis cannot be tested directly.

The alternative hypothesis proposed in medical research may be one-tailed or two-tailed. A one-tailed alternative hypothesis would predict the direction of the effect. Clinical studies may have an alternative hypothesis that patients taking the study drug will have a lower cholesterol level than those taking a placebo. This is an example of a one-tailed hypothesis. A two-tailed alternative hypothesis would only state that there is an association without specifying a direction. An example would be, “Patients who take the study drug will have a significantly different cholesterol level than those patients taking a placebo”. The alternative hypothesis does not state if that level will be higher or lower in those taking the placebo.

The P-Value Approach to Test Hypothesis

Once the hypothesis has been designed, statistical tests help you to decide whether you should reject the null hypothesis or fail to reject it. Statistical tests determine the p-value associated with the research data. The p-value is the probability that one could have obtained the result by chance, assuming the null hypothesis (H0) was true. You must reject the null hypothesis if the p-value of the data falls below the predetermined level of statistical significance. Usually, the level of statistical significance is set at 0.05. If the p-value is less than 0.05, then you would reject the null hypothesis, which states that there is no relationship between the predictor and the outcome in the sample population.

However, if the p-value is greater than the predetermined level of significance, then there is no statistically significant association between the predictor and the outcome variable. This does not mean that there is no association between the predictor and the outcome in the population. It only means that the association observed in the sample is small enough that it could plausibly have occurred by random chance.

For example, null hypothesis (H0): patients who take the study drug after a heart attack do not have a better chance of avoiding a second heart attack over the next 24 months.

Suppose the data show that those who did not take the study drug were twice as likely to have a second heart attack, with a p-value of 0.08. This p-value indicates that, if the drug truly had no effect, there would be an 8% chance of seeing a result at least this extreme (people on the placebo being twice as likely to have a second heart attack) due to random chance alone.

The hypothesis is not a trivial part of the clinical research process. It is a key element in a good biostatistics plan regardless of the clinical trial phase. There are many other concepts that are important for analyzing data from clinical trials. In our next article in the series, we will examine hypothesis testing for one or many populations, as well as error types.


  • Open access
  • Published: 05 February 2020

Why do you need a biostatistician?

  • Antonia Zapf, Geraldine Rauch & Meinhard Kieser

BMC Medical Research Methodology, volume 20, Article number: 23 (2020)


The quality of medical research depends importantly, among other aspects, on valid statistical planning of the study, analysis of the data, and reporting of the results, which is usually guaranteed by a biostatistician. However, there are several related professions alongside the biostatistician, for example epidemiologists, medical informaticians and bioinformaticians. For medical experts, it is often not clear what the differences between these professions are and how the specific role of a biostatistician can be described. For physicians involved in medical research, this is problematic because false expectations often lead to frustration on both sides. Therefore, the aim of this article is to outline the tasks and responsibilities of biostatisticians in clinical trials as well as in other fields of application in medical research.


What is a biostatistician, what does he or she actually do, and what distinguishes him or her from, for example, an epidemiologist? If we asked our main cooperation partners, such as physicians or biologists, they probably could not give a satisfying answer. This is problematic because false expectations often lead to frustration on both sides. Therefore, in this article we want to clarify the tasks and responsibilities of biostatisticians.

There are some expressions which are often used interchangeably with the term ‘biostatistician’. Here, we will use the expression ‘(medical) biostatistics’ as a synonym for ‘medical biometry’ and ‘medical statistics’, and we will treat the term ‘biostatistician’ analogously.

In contrast to the clearly defined educational and professional career steps of a physician, there is no unique way of becoming a biostatistician. Only very few universities offer degree programs in biometry, which is why most people working as biostatisticians studied something related: methodological subjects such as mathematics or statistics, or application subjects such as medicine, psychology, or biology. So a biostatistician cannot be defined by his or her education, but must be defined by his or her expertise and competencies [ 1 ]. This corresponds to our definition of a biostatistician in this article. The International Biometric Society (IBS) provides a definition of biometrics as a ‘field of development of statistical and mathematical methods applicable in the biological sciences’ [ 2 ]. Here, we will focus on (human) medicine as the area of application, but the results can easily be transferred to the other biological sciences, such as agriculture or ecology. As mentioned above, there are some professions neighbouring biostatistics, and for many cooperation partners, the differences between biostatisticians, medical informaticians, bioinformaticians, and epidemiologists are not clear. According to the current representatives of these four disciplines within the German Association for Medical Informatics, Biometry and Epidemiology (GMDS) e. V.:

‘Medical biostatistics develops, implements, and uses statistical and mathematical methods to allow for a gain of knowledge from medical data.’ ‘Results are made accessible for the individual medical disciplines and for the public by statistically valid interpretations and suitable presentations’ (authors’ translation from [ 3 ]).

‘Medical informatics is the science of the systematic development, management, storage, processing, and provision of data, information and knowledge in medicine and healthcare’ (authors’ translation from [ 4 ]).

Bioinformatics is a science for ‘the research, development and application of computer-based methods used to answer biomolecular and biomedical research questions. Bioinformatics mainly focusses on models and algorithms for data on the molecular and cell-biological level’ [ 5 ].

‘Epidemiology deals with the spread and the course of diseases and the underlying factors in the public. Apart from conducting research into the causes of disease, epidemiology also investigates options of prevention’ (authors’ translation from [ 6 ]).

Another discipline is data science, which is a relatively new expression used in a multitude of different contexts. Often it is meant as a global summarizing term covering all of the above-mentioned fields. As there is no common agreement on what data science is, and as this term does not correspond to a uniquely defined profession, this expression will not be discussed in more detail.

The self-descriptions as stated above are rather general and not necessarily complete. Therefore, we will in the following describe the specific tasks and responsibilities of biostatisticians in different important application fields in more detail. This allows us to specify what cooperation partners may (or may not) expect from a biostatistician. Furthermore, clarification of the roles of all involved parties and their successful implementation in practice will overall lead to more efficient collaborations and higher quality.

Tasks and responsibilities of biostatisticians

There are many medical areas where biostatisticians can contribute to the general research progress. These fields of application and the related biostatistical methods are not strictly separated, but there are many overlaps and a classification of the related methodology can be done in various ways. We consider in the following the important application fields of clinical trials, systematic reviews and meta-analysis, observational and complex interventional studies, and statistical genetics to highlight the tasks and responsibilities of biostatisticians working in these areas.

Biostatisticians working in the area of clinical trials

The tasks of biostatisticians in clinical trials are not limited to the analysis of the data; there are many more responsibilities. It is a misguided view that biostatisticians are only required after the data have been collected. According to Lewis et al. (1996), statistical considerations are not only relevant for the analysis of data but also for the design of the trial [ 7 ]. This is not a personal view, but general consensus. It is demanded by the ethics committee and confirmed by the principal investigator and / or the sponsor when stating that the clinical trial will be conducted according to Good Clinical Practice (GCP). The corresponding guideline E6 from the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) explicitly states that statistical expertise should be utilized throughout all stages [ 8 ]. Section 5.4.1 of that guideline states: ‘The sponsor should utilize qualified individuals (e.g. biostatisticians, clinical pharmacologists, and physicians) as appropriate, throughout all stages of the trial process, from designing the protocol and CRFs [case report forms, AZ] and planning the analyses to analyzing and preparing interim and final clinical trial reports.’ Mansmann et al. [ 9 ] provided more specific guidance about good biometrical practice in medical research and the responsibilities of a biostatistician. There, the biostatistician is described as a person participating in the planning and the execution of a study, in the dissemination of the results, and in statistical refereeing. These are very general descriptions of the tasks and responsibilities of biostatisticians. In the following, we will explain the biostatistician’s mission in more detail, based on the guidance on good biometrical practice [ 9 ] and on the E9 guideline from the ICH about Statistical Principles for Clinical Trials [ 10 ].

In the initial phase of a medical research project, a biostatistician should actively participate in the assessment of the relevance and feasibility of the study. During the planning phase, the biostatistician should already be involved in the discussion of general study aspects, as outlined in more detail below. It is evident that the physician must provide the framework for this. However, the biostatistician can and should point out important biostatistical issues which will have a major influence on the whole construct of the study. An important part of the biostatistician’s work is therefore done long before a study can start. For example, the appropriate study population (special subgroups or healthy subjects in early phases versus large representative samples of the targeted patient population in confirmatory trials) and reasonable primary and secondary endpoints (e.g. suitable to the study aim, objectively measurable, clearly and uniquely defined) need to be identified. He or she should also make the physician aware of the potential problems with multiple or composite primary endpoints and with surrogate or categorised (especially dichotomized) variables. Another very important topic related to the general study design is blinding and randomisation as techniques to avoid bias. Moreover, the comparators or treatment arms must be specified, and it has to be defined how they are embedded in the general study design (for example, parallel or crossover). It also has to be specified whether the aim is to show superiority or non-inferiority of the new treatment and whether interim analyses are reasonable (group sequential designs). Moreover, the procedures for data capture and processing have to be discussed at this point. Only after all these planning aspects have been fixed can the biostatistician provide an elaborate sample size calculation.
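
As an illustration of this last step, the following minimal sketch (in Python, with purely hypothetical planning assumptions) shows how a sample size calculation for a two-arm superiority trial with a continuous primary endpoint might look once the effect size, significance level, and power have been agreed upon:

# Minimal sketch of a sample size calculation for a two-arm superiority
# trial with a continuous primary endpoint; the standardized effect size,
# significance level, and power below are hypothetical assumptions.
from statsmodels.stats.power import tt_ind_solve_power

effect_size = 0.5   # assumed standardized mean difference (Cohen's d)
alpha = 0.05        # two-sided significance level
power = 0.80        # desired power

# Solve for the per-group sample size of a two-sided two-sample t-test.
n_per_group = tt_ind_solve_power(effect_size=effect_size, alpha=alpha,
                                 power=power, alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.1f}")  # about 63.8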

During the ongoing study, the main tasks and responsibilities consist of biostatistical monitoring (for example, as part of a data safety monitoring board) and performing interim analyses (if planned). If any modifications of the study design are urgently required during the ongoing trial (for example, changes within an adaptive design, or early stopping after an interim analysis), the biostatistician has to be involved in the discussions and decisions, as otherwise the integrity of the study can be damaged.

The main data analysis is performed after all patients have been recruited and fully observed. However, the statistical methods applied in the data analysis must already be specified during the planning phase in the study protocol. The study protocol should be as detailed as possible, in particular with regard to the analysis of the primary endpoint(s). In addition, the statistical analysis plan (SAP), which must be finalized before the start of the data analysis, provides a document describing all details of the primary, secondary, and safety analyses. It also covers possible data transformations, the point and interval estimators applied, statistical tests, subgroup analyses, and the consideration of interactions and covariates. Furthermore, the data sets used (for example, intention-to-treat or per-protocol), the handling of missing values, and a possible adjustment for multiplicity should be described and discussed. Another important issue is how the integrity of the data and the validity of the statistical software can be guaranteed.
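
To make this concrete, the following minimal sketch (with hypothetical data and settings) shows the kind of pre-specified primary analysis an SAP might describe for a continuous endpoint: a point estimate, a 95% confidence interval, and a two-sided test for the difference in means between two groups:

# Minimal sketch of a pre-specified primary analysis: point estimate,
# 95% confidence interval, and two-sided pooled t-test for a difference
# in means. All data are hypothetical.
import numpy as np
from scipy import stats

treatment = np.array([142.0, 138.5, 135.2, 140.1, 137.8, 133.9])
control   = np.array([148.3, 145.0, 149.8, 143.6, 147.2, 146.1])

n1, n2 = len(treatment), len(control)
diff = treatment.mean() - control.mean()              # point estimate
sp2 = ((n1 - 1) * treatment.var(ddof=1)
       + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))                 # pooled standard error
tcrit = stats.t.ppf(0.975, n1 + n2 - 2)
ci = (diff - tcrit * se, diff + tcrit * se)           # 95% confidence interval
res = stats.ttest_ind(treatment, control)             # two-sided pooled t-test

print(f"Estimated difference: {diff:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}), p = {res.pvalue:.4f}")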

In a last step, after the finalization of the data analysis according to the SAP, the biostatistician contributes to reporting the results in the study report as well as in the related publications submitted to medical journals. He or she is responsible for the appropriate presentation and the correct interpretation of the results.

To sum up, in clinical studies the tasks and responsibilities of biostatisticians extend from the planning phase through the execution of the study to the data analysis and the publication of the results. In particular, careful study planning, in which the contribution of a biostatistician is indispensable, is essential to obtain valid study results.

Biostatisticians working in the area of systematic reviews and meta-analysis

To judge the level of evidence of medical research, different systems of evidence grading have been suggested. The recent grading system from the Oxford Centre for Evidence-Based Medicine (OCEBM) defines ten evidence levels. The highest level is a systematic review of high-quality studies, for the therapeutic as well as for the diagnostic and prognostic context [ 11 ]. The need for such reviews results from the huge number of articles in the medical literature, which has to be aggregated appropriately [ 12 ]. As Gopalakrishnan and Ganeshkumar describe, the aim of a systematic review is to ‘systematically search, critically appraise, and synthesize on a specific issue’ [ 13 ]. A meta-analysis, which additionally provides a quantitative summary, can be part of a systematic review if a reasonable number of individual studies are available. The tasks and responsibilities of biostatisticians in this field are described in the following. As in clinical trials, the biostatistician should already be involved during the planning phase of a systematic review/meta-analysis to discuss the design aspects and the feasibility. Besides the literature search and the collection of the study data (most often not available on an individual patient level), the assessment of the study quality and of the risk of bias are important topics. There are different tools for this assessment, like the GRADE approach (Grading of Recommendations, Assessment, Development and Evaluation) [ 14 ] or the QUADAS-2 tool (Quality Assessment of Diagnostic Accuracy Studies) for diagnostic meta-analyses [ 15 ]. A general description of these approaches can be found in the Cochrane Handbook [ 16 ]. The main task of biostatisticians in the field of systematic reviews is then to perform the meta-analysis itself, including the calculation of weighted summary measures, the creation of graphs, and the performance of subgroup and sensitivity analyses. As a last step, the biostatistician should again support the physicians in interpreting and publishing the results.
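
The core computation of a fixed-effect meta-analysis is a weighted average of the study-level estimates with inverse-variance weights. A minimal sketch (with hypothetical study estimates and standard errors):

# Minimal sketch of a fixed-effect (inverse-variance) meta-analysis.
# The study-level effect estimates (e.g. log odds ratios) and their
# standard errors are hypothetical.
import numpy as np

effects = np.array([-0.35, -0.20, -0.48, -0.10])  # per-study estimates
se      = np.array([ 0.15,  0.22,  0.30,  0.18])  # per-study standard errors

w = 1.0 / se**2                                   # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)          # weighted summary measure
pooled_se = np.sqrt(1.0 / np.sum(w))
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

print(f"Pooled estimate: {pooled:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")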

In summary, the tasks and responsibilities of biostatisticians in the field of systematic reviews and meta-analyses relate to the proper planning, the evaluation of the quality of the individual studies, the meta-analysis itself and the publication of the results.

Biostatisticians working in the area of observational and complex interventional studies

In observational studies, where confounding plays a major role, statistical modelling aims at incorporating, investigating, and exploiting relationships between variables using mathematical equations. Other important examples for the application of the related techniques are longitudinal data measured repeatedly in time for the same subject, or data with an inherent hierarchical structure, for example data of patients observed in different departments within various clinics. Valid conclusions from the analysis are only obtained if the functional relationship between the variables is correctly taken into account [ 17 ]. Another prominent task of statistical modelling is prediction, for example to forecast the future outcome of patients. Frequently, the relationship between the involved variables is complex. For example, patients may pass through several states between the start of observation and the outcome, and the transitions between these states as well as potential competing risks have to be adequately considered (see, for example, Hansen et al. [ 18 ]). Extrapolation is another field of growing interest where techniques of statistical modelling are indispensable. This process can be defined as ‘extending information and conclusions available from studies in one or more subgroups of the patient population (source population), or in related conditions or with related medicinal products, to make inferences for another subgroup of the population (target population), or condition or product’ [ 19 ]. For example, clinical trial data for adults may be used to assist the development of treatments for children [ 20 ]. Last but not least, statistical modelling may be of help in situations where data of different origin are to be synthesized to increase evidence, for example from randomized clinical trials, observational studies, and registries. These examples are far from exhaustive and illustrate the wide spectrum of potential data sources and applications. There are direct connections to the two working areas of biostatisticians described in the preceding subsections, and consequently there are substantial overlaps in the related tasks and responsibilities. As in the other working areas considered, the biostatistician is responsible for choosing a correct and efficient analysis method that includes all relevant information. Due to the complexity of statistical models, this point is especially challenging here. Furthermore, it is the task of the biostatistician to decide whether the data required to adequately map the underlying relationships are included in the available data set, whether data quality and completeness are sufficiently high to justify a reliable analysis, and to define appropriate methods for dealing with missing values. It is highly recommended to prepare an SAP not only for clinical trials (see the Biostatisticians working in the area of clinical trials section) but also for analyses using methods of statistical modelling.
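
For the hierarchical-data example mentioned above (patients nested within clinics), a random-intercept model is one standard choice [ 17 ]. A minimal sketch with simulated, hypothetical data:

# Minimal sketch of a random-intercept model for patients nested within
# clinics; the data are simulated and all variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_clinics, n_per_clinic = 10, 20
clinic = np.repeat(np.arange(n_clinics), n_per_clinic)
age = rng.normal(60, 8, n_clinics * n_per_clinic)
clinic_effect = rng.normal(0, 2.0, n_clinics)[clinic]  # between-clinic variation
outcome = 5.0 + 0.1 * age + clinic_effect + rng.normal(0, 1.0, len(age))
df = pd.DataFrame({"outcome": outcome, "age": age, "clinic": clinic})

# The random intercept per clinic accounts for the hierarchical structure.
model = smf.mixedlm("outcome ~ age", data=df, groups=df["clinic"])
print(model.fit().summary())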

Again, the biostatistician is responsible not only for a proper planning and conducting of the analyses but also for appropriate interpretation and presentation of the results. The particular challenge for biostatisticians in this area is to choose appropriate statistical models for the analysis of data with a complex structure.

Biostatisticians working in the area of statistical genetics

Biostatisticians working in the fields of genetics and genomics are often the persons responsible for the final integration of multidisciplinary expertise in mathematics, statistics, genetics, epidemiology, and bioinformatics, to cite only some common ingredients. Planning tasks include the design of research studies, which may pursue exploratory and/or confirmatory objectives. There is a broad range of possible study designs, which make use of well-differentiated modelling techniques. The generated data are often pre-processed by bioinformaticians before they reach the biostatistician. Pre-processing of sequencing data, for instance, usually comprises quality control of the sequenced reads, alignment to the human reference genome, and marking of duplicates prior to the identification of somatic mutations and indels. A good knowledge of the limitations of the applied pre-processing techniques is therefore often very helpful for the statistician. A strong background and a deep understanding of genetics and genomics, as well as interdisciplinary thinking, are a must for biostatisticians working in this area. These competences will become even more important in the future. For example, emerging fields of research like Mendelian randomization, where genetic variants are used as instruments to infer causality, will require an even stronger interaction between statistics and genetics.
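
As a very simple example of the kind of analysis performed after pre-processing, a single-variant association between genotype and disease status can be tested with a chi-square test on a contingency table; all counts below are hypothetical:

# Minimal sketch of a single-variant case-control association test:
# a chi-square test on a genotype-by-status table. Counts are hypothetical.
import numpy as np
from scipy import stats

#                 genotype:  AA    Aa   aa
table = np.array([[420, 380, 95],    # cases
                  [510, 330, 60]])   # controls

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3g}")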

In the field of statistical genetics, tasks and responsibilities relate in particular to study planning, critical review of pre-processing, and data analysis using appropriate statistical models.

Biostatistics mainly addresses the development, implementation, and application of statistical methods in the field of medical research [ 3 ]. Therefore, an understanding of the medical background and the clinical context of the research problem at hand is essential for biostatisticians [ 21 ]. Furthermore, specific professional expertise is indispensable, and soft-skill competencies are also very important. Regarding the professional expertise, the ICH E9 guideline states that a trial statistician should be qualified and experienced [ 10 ]. Qualification, which means biostatistical expertise, covers the methodological background (mathematics, statistics, and biostatistics), biostatistical application, the medical background, medical documentation, and statistical programming. The experience relates to consulting on, planning, conducting, and analysing medical studies. Jaki et al. [ 22 ] reviewed the training provided by existing medical statistics programmes and made recommendations for a curriculum for biostatisticians working in drug development. Regarding the soft skills of a biostatistician, some literature exists (for example [ 23 ] or [ 24 ]). Furthermore, Zapf et al. [ 1 ] summarize the professional expertise and the soft skills needed by a biostatistician according to the CanMEDS framework [ 25 ], which was developed to describe the required abilities of physicians (the original abbreviation ‘Canadian Medical Education Directions for Specialists’ is no longer in use).

In this article, we did not explicitly consider the emerging field of biomedical data science, which is applied in many different areas of medical research such as individualized medicine, omics research, and big data analysis. The tasks and responsibilities of biostatisticians working in this domain are not different from those reported above but in fact include all of the aspects mentioned [ 26 ].

There is evidently an overlap between the tasks and responsibilities of medical biostatisticians and those of neighbouring professions. However, each discipline has a different focus. Important application fields of biostatistics are clinical studies, systematic reviews/meta-analyses, observational and complex interventional studies, and statistical genetics.

In all fields of biostatistical activity, the working environment is diverse and multidisciplinary. Therefore, it is essential for fruitful, efficient, and high-quality collaborations to clearly define the tasks and responsibilities of the cooperating partners. In summary, the tasks and responsibilities of a biostatistician across all application areas cover active participation in proper planning, consultation during the entire study duration, data analysis using appropriate statistical methods, and the interpretation and suitable presentation of the results in reports and publications. These tasks are formulated similarly in the ICH E6 guideline on good clinical practice [ 8 ].

Availability of data and materials

Not applicable.

Abbreviations

CanMEDS: Canadian Medical Education Directions for Specialists

CRF: Case report form

GCP: Good Clinical Practice

GMDS: German Association for Medical Informatics, Biometry and Epidemiology

GRADE: Grading of Recommendations, Assessment, Development and Evaluation

IBS: International Biometric Society

ICH: International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use

OCEBM: Oxford Centre for Evidence-Based Medicine

QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies

SAP: Statistical analysis plan

References

1. Zapf A, Hübner M, Rauch G, Kieser M. What makes a biostatistician? Stat Med. 2018;38(4):695–701.

2. Homepage of the International Biometric Society. http://www.biometricsociety.org/about/definition-of-biometrics/ . Accessed 11 Nov 2019.

3. Homepage of the German Society of Medical Informatics, Biometry and Epidemiology (GMDS), Section Medical Biometry. http://www.gmds.de/fachbereiche/biometrie/index.php . Accessed 11 Nov 2019.

4. Homepage of the German Society of Medical Informatics, Biometry and Epidemiology (GMDS), Section Medical Informatics. https://gmds.de/aktivitaeten/medizinische-informatik/ . Accessed 11 Nov 2019.

5. Homepage of the professional group bioinformatics (FaBI). https://www.bioinformatik.de/en/bioinformatics.html . Accessed 11 Nov 2019.

6. Homepage of the German Society of Medical Informatics, Biometry and Epidemiology (GMDS), Section Epidemiology. https://gmds.de/aktivitaeten/epidemiologie/ . Accessed 11 Nov 2019.

7. Lewis JA. Editorial: statistics and statisticians in the regulation of medicines. J R Stat Soc Ser A. 1996;159(3):359–62.

8. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (1996). Guideline for good clinical practice E6 (R2). https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e-6-r2-guideline-good-clinical-practice-step-5_en.pdf . Accessed 11 Nov 2019.

9. Mansmann U, Jensen K, Dirschedl P. Good biometrical practice in medical research - guidelines and recommendations. Informatik, Biometrie und Epidemiologie in Medizin und Biologie. 2004;35:63–71.

10. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (1998). Statistical principles for clinical trials E9. https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e-9-statistical-principles-clinical-trials-step-5_en.pdf . Accessed 11 Nov 2019.

11. OCEBM. The Oxford 2011 levels of evidence: Oxford Centre for Evidence-Based Medicine; 2011. http://www.cebm.net/oxford-centre-evidence-based-medicine-levels-evidence-march-2009/ . Accessed 11 Nov 2019.

12. Mulrow CD. Systematic reviews: rationale for systematic reviews. BMJ. 1994;309:597–9.

13. Gopalakrishnan S, Ganeshkumar P. Systematic reviews and meta-analysis: understanding the best evidence in primary healthcare. J Fam Med Prim Care. 2013;2(1):9–14.

14. Schünemann H, Brożek J, Guyatt G, Oxman A, editors. GRADE handbook for grading quality of evidence and strength of recommendations. Updated October 2013: The GRADE Working Group; 2013. Available from https://gdt.gradepro.org/app/handbook/handbook.html . Accessed 11 Nov 2019.

15. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MMG, Sterne JAC, Bossuyt PMM, the QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529–36.

16. Higgins JPT, Green S, editors. Cochrane handbook for systematic reviews of interventions version 5.1.0 [updated March 2011]: The Cochrane Collaboration; 2011. Available from http://handbook.cochrane.org . Accessed 11 Nov 2019.

17. Snijders TAB, Bosker RJ. Multilevel analysis - an introduction to basic and advanced multilevel modeling. London: SAGE Publications; 1999.

18. Hansen BE, Thorogood J, Hermans J, Ploeg RJ, van Bockel JH, van Houwelingen JC. Multistate modelling of liver transplantation data. Stat Med. 1994;13:2517–29.

19. European Medicines Agency (2012). Concept paper on extrapolation of efficacy and safety in medicine development - EMA/129698/2012. http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2013/04/WC500142358.pdf . Accessed 11 Nov 2019.

20. Wadsworth I, Hampson LV, Jaki T. Extrapolation of efficacy and other data to support the development of new medicines for children: a systematic review of methods. Stat Meth Med Res. 2018;27(2):398–413.

21. Simon R. Challenges for biometry in 21st century oncology. In: Matsui S, Crowley J, editors. Frontiers of biostatistical methods and applications in clinical oncology. Singapore: Springer; 2017. Available from https://link.springer.com/chapter/10.1007%2F978-981-10-0126-0_1 . Accessed 25 Nov 2019.

22. Jaki T, Gordon A, Forster P, Bijnens L, Bornkamp B, Brannath W, Fontana R, Gasparini M, Hampson LV, Jacobs T, Jones B, Paoletti X, Posch M, Titman A, Vonk R, Koenig F. A proposal for a new PhD level curriculum on quantitative methods for drug development. Pharm Stat. 2018;17:593–606.

23. Lewis T. Statisticians in the pharmaceutical industry. In: Stonier PD, editor. Discovering new medicines. Chichester: Wiley; 1994. p. 153–63.

24. Chuang-Stein C, Bain R, Branson M, Burton C, Hoseyni C, Rockhold FW, Ruberg SJ, Zhang J. Statisticians in the pharmaceutical industry: the 21st century. Stat Biopharm Res. 2010;2(2):145–52.

25. Royal College of Physicians and Surgeons of Canada. CanMEDS: better standards, better physician, better care. http://www.royalcollege.ca/rcsite/canmeds/canmeds-framework-e . Accessed 11 Nov 2019.

26. Alarcón-Soto Y, Espasandín-Domínguez J, Guler I, Conde-Amboage M, Gude-Sampedro F, Langohr K, Cadarso-Suárez C, Gómez-Melis G (2019). Data science in biomedicine. arXiv:1909.04486v1. Available from https://arxiv.org/abs/1909.04486v1 . Accessed 25 Nov 2019.


Acknowledgements

There was no funding for this project.

Author information

Authors and affiliations

Department of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Martinistr. 52, 20246, Hamburg, Germany

Antonia Zapf

Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, 10117, Berlin, Germany

Geraldine Rauch

Institute of Medical Biometry and Informatics, Heidelberg University Hospital, Im Neuenheimer Feld 130.3, 69120, Heidelberg, Germany

Meinhard Kieser


Contributions

AZ drafted the work and all authors substantively revised it. All authors approved the final version and agreed both to be personally accountable for the author’s own contributions and to ensure that questions related to the accuracy or integrity of any part of the work are appropriately investigated, resolved, and the resolution documented in the literature.

Corresponding author

Correspondence to Antonia Zapf .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Zapf, A., Rauch, G. & Kieser, M. Why do you need a biostatistician? BMC Med Res Methodol 20, 23 (2020). https://doi.org/10.1186/s12874-020-0916-4


Received : 21 March 2019

Accepted : 28 January 2020

Published : 05 February 2020

DOI : https://doi.org/10.1186/s12874-020-0916-4


Keywords

  • Medical research
  • Biostatistician
  • Responsibilities

BMC Medical Research Methodology

ISSN: 1471-2288


Descriptive Biostatistics

  • First Online: 01 January 2012


  • Pedro J. Gutiérrez Diez PhD 4 ,
  • Irma H. Russo MD, FCAP, FASCP 5 &
  • Jose Russo MD, FCAP 6  


In this chapter we briefly describe the main methods and techniques in descriptive biostatistics, as well as their application to the analysis of biomedical questions. With special attention to the study of cancer, this chapter provides a general understanding of the nature and relevance of descriptive biostatistical methods in medicine and biology, explains the design behind a biostatistical descriptive analysis, and stresses the paramount importance of descriptive statistics as the initial stage of any biomedical investigation.


Author information

Authors and affiliations

University of Valladolid, Avda. Valle Esgueva 6, 47011, Valladolid, Spain

Pedro J. Gutiérrez Diez PhD

Fox Chase Cancer Center, Cottman Ave 333, 19111, Philadelphia, PA, USA

Irma H. Russo MD, FCAP, FASCP

Lab. Breast Cancer Research, Fox Chase Cancer Center, Burholme Avenue 7701, 19111, Philadelphia, PA, USA

Dr. Jose Russo MD, FCAP


Corresponding author

Correspondence to Pedro J. Gutiérrez Diez PhD .

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Gutiérrez Diez, P.J., Russo, I.H., Russo, J. (2012). Descriptive Biostatistics. In: The Evolution of the Use of Mathematics in Cancer Research. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-2397-3_2


DOI : https://doi.org/10.1007/978-1-4614-2397-3_2

Published : 17 February 2012

Publisher Name : Springer, Boston, MA

Print ISBN : 978-1-4614-2396-6

Online ISBN : 978-1-4614-2397-3

eBook Packages : Biomedical and Life Sciences Biomedical and Life Sciences (R0)



From the Front Row: Using biostatistics and P-value in public health research

Published on March 20, 2023

Joe Cavanaugh, professor and head of the University of Iowa Department of Biostatistics, is this week’s guest. He chats with Amy and Anya about the central role that biostatistics plays in public health and medical research and explains the concept of the p-value and its use in biostatistics.

Find our previous episodes on  Spotify ,  Apple Podcasts , and  SoundCloud .

Hello, everyone, and welcome back to From the Front Row. Behind a lot of public health evidence are numbers, and biostatisticians are the ones behind the scenes, interpreting those numbers every day.

One staple of many biostatistical tests is a number called the p-value. Most of us are taught that if the p-value is smaller than 0.05, you found something statistically significant. And if it’s larger than 0.05, your numbers were probably due to random chance. Because the p-value is used so widely in statistics, this concept has a huge impact on evidence-based decisions and research. But it turns out there is a lot of controversy behind the p-value. We have Dr. Cavanaugh, one of our biostatistics professors, on the show today to talk with us about just that.

Dr. Cavanaugh has published more than 160 peer-reviewed papers, and his research contributions span a wide range of fields, from cardiology to health services utilization to sports medicine to infectious disease, just to name a few. Outside of that, he’s an elected fellow of the American Statistical Association, an elected member of the International Statistical Institute, and has received several awards for teaching and mentoring. We’re glad to have him at the college and we’re very excited to have him on the show today to break down p-values for our audience.

I’m Amy Wu, joined today by Anya Morozov. If it’s your first time with us, welcome. We’re a student-run podcast that talks about major issues in public health and how they are relevant to anyone, both in and outside the field of public health. Welcome to the show, Dr. Cavanaugh.

Joe Cavanaugh:

It’s great to be here. Thank you for inviting me.

Of course. Before we get into the topic of today’s episode, can you tell us a little bit about your background and what folks can do with a biostatistics degree?

Yeah, absolutely. Most biostatisticians, they followed a rather non-linear path to the discipline. I went to a small STEM college as an undergraduate, Montana Tech in my hometown of Butte, Montana, and I received bachelor’s degrees in computer science and mathematics. And I found that I enjoyed computing, but I really wanted to program my own ideas.

I had a couple of programming jobs, one at a utility company and one at a national lab, but I was programming the ideas of engineers and physicists. And to me, the creative aspect of computing is the development of the algorithms. And as far as my mathematics degree goes, I really liked my math courses, but I realized I liked the more applied side of math.

So one of my undergraduate mentors suggested that I consider work in statistics, and I followed his advice and received my PhD in statistics from the University of California Davis. I spent the first 10 years of my academic career in a department of statistics at the University of Missouri. And then in 2003, I slightly changed course from stat to biostat, and moved here to the University of Iowa. So this is my 20th year as a Hawkeye.

What I liked about biostatistics is that it allows you to use your quantitative skills to help solve important practical problems. And the field has always been one where demand greatly exceeds supply, so the job market is excellent. And as far as the types of jobs that our graduates pursue, it’s pretty wide-ranging. We have some that work at pharmaceutical companies, biomedical research facilities, such as the Mayo Clinic, Fred Hutchinson Cancer Center, Memorial Sloan Kettering Cancer Center, high tech companies like Facebook and Google, and government agencies like the FDA, the NIH, and the CDC.

Awesome. Thanks for sharing. As a biostatistics student, I’m very excited to hear that there are a lot of job prospects out there.

Absolutely.

And I do enjoy the breadth of work that I have the potential of going into.

Anya Morozov:

Yeah, so we mentioned a little bit in the introduction about p-values. If they’re smaller than 0.05, generally you’ve found something statistically significant, and if they’re larger than 0.05, your findings might be due to random chance. But could you explain a little more and remind us what a p-value is for non-statisticians?

Yeah, absolutely. So in many statistical applications, you’re going to build a statistical model to investigate the possible association between an explanatory variable and an outcome of interest. So an example might be to investigate the association between influenza and flu vaccinations to determine the extent to which your risk is reduced if you’re vaccinated.

So to test this association, you formulate two hypotheses. There’s a null hypothesis, which assumes that the association or the so-called effect doesn’t exist. And then there’s an alternative hypothesis, which assumes that the association or effect does exist and is potentially important. So the p-value, it’s computed by assuming hypothetically that the null hypothesis is true, and then finding the probability of obtaining data similar to the data observed in your study under that assumption. So if this probability then is small, this might cause you to doubt the veracity of the null hypothesis and to view the alternative hypothesis as more credible.

So in more succinct terms, you can view the p-value perhaps as a type of conditional probability. It’s trying to address the question just what is the probability of the data, given that the null hypothesis is true. And to some extent, the p-value is founded on the notion of a proof by contradiction, because you’re assuming that the null hypothesis is true, and then you’re trying to determine whether or not the data discredits that assumption by getting a low probability.
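
A minimal simulation sketch (in Python, with hypothetical settings) illustrates this definition: when the null hypothesis is true, p-values are roughly uniformly distributed, so about 5% of tests fall below 0.05 by chance alone:

# Minimal sketch: under a true null hypothesis (no group difference),
# p-values are roughly uniform, so about 5% fall below 0.05 by chance.
# All settings are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pvals = []
for _ in range(10_000):
    a = rng.normal(0, 1, 50)  # both groups drawn from the same distribution
    b = rng.normal(0, 1, 50)
    pvals.append(stats.ttest_ind(a, b).pvalue)

pvals = np.array(pvals)
print(f"Fraction of p-values < 0.05: {(pvals < 0.05).mean():.3f}")  # ~0.05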

Yeah, so just also for the non-statisticians out there, a lower p-value would basically make you favor the alternative hypothesis over the null?

That’s correct. Yeah. And as you alluded to earlier, Amy, a common practice, which I’ll comment about in a bit, is to compare the p-value to a level of significance that is set at 0.05. 0.05 probability. So if the p-value is less than that, often you declare the result as being statistically significant. And if it’s greater than that, then you say that you don’t have the burden of proof met in order to reject the null in favor of the alternative. But that practice is problematic, so I think we’re going to get to that issue in just a bit.

Yeah, of course. So on that subject, could you give us a brief history of the controversy surrounding p-values? For example, last semester in your distinguished faculty lecture, you mentioned that the Basic and Applied Social Psychology journal decided to ban all p-values in 2015. So for non-statisticians, can you explain why they would do this, what are some of the pitfalls of p-values, et cetera?

Yeah, certainly. There’s a few different questions there that I’ll try to address, Amy. To begin with a bit of history about the p-value, it’s been around for about a century. It’s often credited to Karl Pearson, who introduced the concept around 1900, and it was popularized by Ronald Fisher in 1925. It was introduced in the context of hypothesis testing, which is a paradigm designed for very specific types of studies, namely randomized experiments. So in biostatistics, clinical trials would be an example of a randomized experiment.

But as it turns out, since then, p-values have become much more pervasively used and often, I would argue, in context for which they were not designed. And they’re often misapplied and misinterpreted and used to justify conclusions that really are not warranted.

So over the last decade or two, scientists have become more concerned with reproducibility. There’s been a lot of backlash against p-values, and some scientists have suggested that the best way of dealing with the problems caused by the p-value is to just banish it altogether. But from my perspective, that is neither a practical nor an ideal solution.

Having said that, there are many problems with the p-value. One of the most significant issues is based on the practice that you alluded to earlier, comparing the p-value to the 0.05 level of significance in deciding whether to reject the null hypothesis and declare the existence of an effect if the p-value is less than 0.05. Now, the reason this is a problem is that the p-value can assume any value between zero and one. So it’s a continuous measure that should be evaluated on a spectrum of evidence.

To illustrate that idea a bit further, there’s very little practical difference between a p-value of 0.04 and 0.06. So making completely different decisions based on these two p-values is not rational. To say that if you have a p-value of 0.04, that the effect is significant, that you should pay attention to it. But if you have a p-value of 0.06, you haven’t met the burden of proof, and therefore, you shouldn’t doubt the null hypothesis.

Now, one of the problems that has resulted from this binary decision-making, it’s a practice known as p-hacking. That’s the practice of repeatedly analyzing data using different analytic techniques to obtain a p-value that is less than 0.05. So you might come up with a variety of different models, and the first time you get a p-value for the effect of interest of 0.14, you’re not happy with that. You reformulate the model, you get a p-value of 0.08, and you’re still not happy. Reformulate it again, you get a p-value of 0.04, and then you say, “Okay, I finally achieved the burden of proof, that level of significance.” P-hacking is one of the reasons that many studies are not reproducible.
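
A minimal simulation sketch (hypothetical settings, with the repeated reanalyses simplified to independent looks at null data) shows why p-hacking undermines reproducibility: trying several analyses and keeping only the smallest p-value inflates the false positive rate well above the nominal 5%:

# Minimal sketch of p-hacking: running five "analyses" of null data and
# reporting only the smallest p-value inflates the false positive rate.
# All settings are hypothetical; the analyses are simplified to
# independent looks at data with no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_analyses = 5_000, 5
false_positives = 0
for _ in range(n_sims):
    ps = [stats.ttest_ind(rng.normal(0, 1, 40), rng.normal(0, 1, 40)).pvalue
          for _ in range(n_analyses)]
    if min(ps) < 0.05:       # report only the "significant" analysis
        false_positives += 1

print(f"False positive rate: {false_positives / n_sims:.3f}")  # ~0.23, not 0.05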

So another problem with the p-value is that it allows you to assess statistical significance, but not clinical or practical significance. To explain the difference between those two ideas, suppose that we have a treatment for hypertension that’s designed to reduce systolic blood pressure, so you conduct a clinical trial to try to assess the efficacy of the treatment. Now, the result is statistically significant if you can establish that the mean change in blood pressure is non-zero, but the result is clinically significant if the mean change is substantial enough to impact a person’s health. You could argue that that’s a higher bar to attain than statistical significance.

So as it turns out, one can obtain a small p-value that leads to statistical significance if a small change is estimated with a high degree of accuracy. In that setting, you’re quite sure that the change is non-zero, but you’re also quite sure that the change is minor. And a small p-value can arise when an effect is accurately estimated, but it’s estimated to be small. Small enough that it probably is not clinically important or practically meaningful.

So how can you get around that problem? Well, confidence intervals or Bayesian credible intervals, they’re more informative because they provide a range of plausible values for the effect of interest. And the center of the interval, it represents what you could think of as the most likely value for the effect. It’s the so-called point estimate of the effect. And the width of the interval reflects the accuracy of the effect estimate. And both the point estimate and its measure of accuracy are very important in coming up with an overall assessment of the effect of interest.

So the problem with the p-value is it’s taking these two important pieces of information, the point estimate and the measure of accuracy, often called the standard error, and conflating these two quantities by combining them into one number. And once you’ve collapsed those two quantities into one number, there’s no way of separating them out and determining what the two quantities are individually.
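
A minimal sketch (with hypothetical numbers) of the blood pressure example: with a very large sample, a clinically trivial half-point reduction yields a tiny p-value, while the confidence interval makes the triviality of the effect visible:

# Minimal sketch: a tiny, clinically unimportant effect (a 0.5-unit mean
# reduction in systolic blood pressure) becomes highly statistically
# significant with a large enough sample. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000
treated = rng.normal(139.5, 15, n)  # true mean reduced by only 0.5
control = rng.normal(140.0, 15, n)

res = stats.ttest_ind(treated, control)
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"p = {res.pvalue:.2e}")  # statistically significant
print(f"Estimated change: {diff:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
# The interval sits near -0.5: precisely estimated, but clinically minor.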

Yeah, so I had one follow-up question. If you could briefly explain what reproducibility is in the context of p-values and p-hacking.

Yeah, absolutely. A study is reproducible if you can conduct a very similar study with the same outcome of interest, the same explanatory factor of interest, and get a similar result. And because all studies are inherently flawed to various degrees, reproducibility is very important because it’s the aggregation of evidence over a variety of different studies that starts giving us a definitive understanding of a particular phenomenon.

So for instance, we now widely accept the fact that there are a lot of bad health conditions that are a result of smoking cigarettes. But 50 years ago, that was not widely known. And all cigarette smoking studies are observational. You can’t do a randomized experiment where you break subjects into two groups and say, “You’re going to smoke, and you’re not going to smoke.” But because we have found over time that the ill effects of smoking are reproducible in different observational studies, that preponderance of evidence over a variety of different studies has led us to the conclusion that you shouldn’t smoke.

So the problem that has resulted with reproducibility in some studies is you’ll have a paper, say, that’s published, and it declares an effect as being statistically significant, and then the authors will say, “This is a conclusive result.” And other authors may say, “Well, I think that this phenomenon is important enough to investigate in a separate study with a different database, different population of interest,” and they may find no evidence at all of the same effect. And you could imagine that one of the reasons why that could happen is if you have authors that are repeatedly analyzing the data in order to get a p-value that’s less than 0.05 and then they publish the result once they find the right analysis that will give them that result, then it’s going to lead to a study that is not reproducible. So that has become a major issue recently in science where you have a study that is investigating a very important phenomenon. Other investigators want to see if they can replicate the result, and they’re unable to do so.

Yeah. Well, thanks for extending on reproducibility. I also just wanted to retouch on clinical significance versus statistical significance. So what I’m hearing is that results can be statistically significant but not clinically relevant or important, say, to medical professionals, for example, or they could be both. So you’re kind of saying that p-values kind of can only speak to the former, where they’re only statistically significant and not-

That’s exactly correct, Amy. So you could imagine a situation where you have, say, a small effect where any physician would take a look at the effect and say, “Not worth taking that drug, because if it’s going to have such a minor impact on your health, that it’s just not worth it.” And you could imagine another study where you have a large effect, say in the hypertension example, where you have a drug that could potentially reduce your systolic blood pressure by as much as 20 to 30 points, which would be a major game changer.

Now, in either of those settings, if you have a highly accurate estimate, you’ll get a very small p-value, and so you’ll be able to establish statistical significance. In both of those settings, you can say fairly conclusively that the effect is non-zero. But in one case, it’s non-zero, but it’s very small and very close to zero. And in the other case, it’s non-zero and it’s substantial in magnitude. In the latter setting, you would have clinical or practical significance, whereas in the former setting, you would not. But in both of those settings, you would have statistical significance.

So it’s almost like the p-value is creating this incentive for folks in research to really strive for statistical significance, and the clinical or public health significance can become secondary to that if you’re focused so much on whether or not you’re hitting that 0.05 or whatever p-value you’ve set as statistically significant.

Yeah, that’s exactly right, Anya. And I will say that part of the problem is you’re incentivized, through the publication process, to have statistically significant results. So another problem that results in a lack of reproducibility is called publication bias. So there’s the feeling among editors of journals and reviewers of articles that if you don’t establish an effect that is statistically significant, then we haven’t learned anything. And yet a null result, a null finding can often be as informative or even more informative than a result that is statistically significant.

But if the journals are going to practice this tendency where they’re going to favor results that report statistically significant findings and ignore results that don’t report such findings, then basically you’re getting a very biased representation of a particular phenomenon. So you might have, for instance, a subtle effect. And in some studies, it’s showing up as statistically significant. In other studies, it’s not. And yet the only studies that are being published are those where you have statistical significance. So if you search the literature, you think, “Well, this effect is showing up consistently in a large variety of different studies,” because you haven’t seen all of the studies where it hasn’t shown up as being significant, due to the fact that those studies are not published.

So we just talked a little bit about the potential pitfalls of p-values. In your aforementioned lecture, you mentioned that the p-value is not going away. So are p-values still relevant?

Yeah, I definitely think that they are, Amy. They still have relevance in the settings for which they were designed, hypothesis testing in the context of randomized experiments such as a clinical trial that’s performed to assess the efficacy of a new drug by comparison to a placebo. But there’s a saying that you might have heard that goes along the following lines, when all you have is a hammer, every problem looks like a nail. And unfortunately, researchers often treat the p-value like a hammer and use it as a tool rather indiscriminately for problems where it is contextually inappropriate.

So from my perspective, p-values still have a place in statistics, a very prominent place, but they should probably be used much less pervasively. And in settings where a p-value is inappropriate, there are other inferential tools that are available. They’re not perhaps as widely known, but as statisticians, we should promote the use of those tools rather than always producing p-values because we think that is what is expected.

Yeah, so along those lines, p-values can be useful in some settings, but not all. You have done some work on one alternative to the p-value, called the discrepancy comparison probability, or DCP. Can you tell us a little bit about that alternative?

Yeah, yeah, I’d be happy to. To provide a little bit of background, the p-value is often used to test for the existence of an effect in the context of a statistical model. That’s how we typically see p-values used in research, especially observational studies.

Now, the model under the alternative hypothesis, it contains the effect of interest, and the model under the null hypothesis does not. And the model often contains other variables of interest as well. So as an example, think of a prognostic model that is formulated to predict the onset of heart disease for middle-aged individuals. Now, the effect of interest might be a measure of physical activity, because we know that if you’re physically active, that that should reduce your risk of future heart disease. But you’ll probably want to include other variables in the model that could impact this relationship, such as age, BMI, cholesterol level, blood pressure, sex, ethnicity.

Now, if the p-value is small, we reject the null model in favor of the alternative model, and we claim that there is an effect. But the problem in this context is that the p-value can only be defined and interpreted under the assumption that one or the other model represents truth, because that’s the hypothesis testing paradigm. You have a null hypothesis, in this case, a null model, an alternative hypothesis, in this case, an alternative model, and both represent incompatible states of nature. One represents the truth, and one does not. That’s the hypothesis testing paradigm, and it’s up to you to try to use the data in order to try to decide which of those two competing states of nature is the most credible.

So where do you run into problems when you’re comparing two models in a hypothesis testing setting? Well, models, they’re only approximations to reality. They don’t represent reality. So the entire paradigm of hypothesis testing and p-values is really misaligned with statistical modeling. There’s a quote that is a favorite among statisticians that is attributed to George Box, who was a very famous statistician. It goes like this, “All models are wrong, some are useful.” So to unpack that quote, all models are wrong because all models are approximations to reality. They don’t represent reality. Some are useful because some are sufficiently accurate approximations for the inferential purpose at hand.

So the discrepancy comparison probability, or the DCP, it represents the probability that the null model is closer to the truth, to reality, than the alternative model, or that the null model is less discrepant from the truth or reality than the alternative model. And importantly, the DCP, it doesn’t assume that either model represents the truth. So basically assessing the probability that the null model is a better approximation to the truth than the alternative model.

Now, like the p-value, the estimated DCP is going to be close to zero if the alternative model is markedly better than the null model. But unlike the p-value, the estimated DCP will be close to one if the null model is markedly better than the alternative model. So it tells you something if it’s small, if it’s close to zero. And it tells you something if it’s close to one, if it’s large.

So that actually points out another flaw with the p-value, and that is that a small p-value represents evidence against the null hypothesis, in favor of the alternative hypothesis. But as it turns out, a large p-value really doesn’t tell you anything, in that it represents an absence of evidence rather than evidence in favor of the null hypothesis. And you’ve heard that adage, absence of evidence is not evidence of absence. But often when researchers come up with a large p-value, say 0.5, they’ll say, “Well, this provides evidence that the null hypothesis is credible,” and that’s not the case.

But based on the way that the DCP is set up, it will lean towards one if the null model is a better approximation to reality than the alternative model, and it will lean towards zero if the alternative is a better approximation to reality. So it does provide evidence in support of either model, but again, only thinking of how well the models approximate the truth, not by trying to think that either model represents the truth.

So along the lines of the holistic philosophy behind the DCP, with these non-binary or interpretations along the spectrum or continuum, could you talk about the role of biostatisticians in interpreting these complex and nuanced medical or public health type problems?

Yeah, absolutely. One point that I’d like to make is that statistical methods require advanced training. And part of the problem with the misuse of p-values is that sophisticated analyses are often conducted by researchers without the appropriate training. So Amy, you’re working on a graduate degree in biostatistics, and Anya, you’re working on a graduate degree in epidemiology, where you have to learn a lot of biostatistics.

So epidemiologists who are well trained in modern statistical methods, and biostatisticians, they’re more aware of what a p-value can tell you and what it cannot tell you. Also, they’re likely more aware of alternative measures of statistical evidence that have been introduced during the past few decades. So if you’re working on a particular study and you’re convinced that a p-value is not the best measure of statistical evidence, that you might be aware, if you have training in more advanced methods, of some of these alternatives.

Now, having said this, scientific paradigms are hard to change. And because p-values are so predominant in biomedical and public health research, it will take time to change the culture so that p-values are used more sparingly and in contexts where they’re more appropriate. But from my perspective, biostatisticians and statisticians really need to be willing to push the envelope and use some of these more modern and sophisticated inferential tools, such as, say, Bayesian posterior probabilities, Bayes factors, likelihood ratio statistics, and information criteria, such as the Akaike information criterion and the Bayesian information criterion.

Now, these phrases probably sound unfamiliar to students who’ve had an introductory course in statistics but haven’t gone beyond that course. And in that introductory course, they’re likely to remember two constructs, the p-value and the confidence interval. But if you’ve had more advanced training in biostatistics, you’re likely more aware of some of these tools, and there may be settings that arise in your research where you feel like you should advocate in favor of using something other than the p-value in order to address the inferential question of interest. And if we, as statisticians, always default to the p-value because we think that’s what editors of journals and referees of articles are going to expect, then the culture is never going to change.
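
As a minimal sketch of what an information-criterion comparison looks like in practice (this is only an AIC/BIC illustration on simulated, hypothetical data, not an implementation of the DCP): fit the null and alternative models and compare the criteria, where lower values suggest a better approximation to the data-generating truth:

# Minimal sketch: comparing a null model (without the effect of interest)
# and an alternative model (with it) via AIC/BIC instead of a p-value.
# Data are simulated and hypothetical; this is not the DCP itself.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
activity = rng.normal(0, 1, n)          # effect of interest
age = rng.normal(0, 1, n)               # adjustment variable
risk = 1.0 + 0.3 * activity + 0.5 * age + rng.normal(0, 1, n)

X_null = sm.add_constant(age)                               # without activity
X_alt = sm.add_constant(np.column_stack([age, activity]))   # with activity

fit_null = sm.OLS(risk, X_null).fit()
fit_alt = sm.OLS(risk, X_alt).fit()

# Lower AIC/BIC indicates the better approximating model.
print(f"Null:        AIC = {fit_null.aic:.1f}, BIC = {fit_null.bic:.1f}")
print(f"Alternative: AIC = {fit_alt.aic:.1f}, BIC = {fit_alt.bic:.1f}")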

Well, I’m glad you’re here at the college and just generally advocating for more nuance in how we interpret results of studies. I also think it shows the importance of communication in any field, even biostatistics. I feel like that’s one where, traditionally, communication maybe isn’t seen as important to being able to do biostatistics, but you have to be able to communicate. If you are proposing a change like not using the p-value to a lab full of people who maybe aren’t biostatisticians, you have to be able to talk about these other methods and why they may be a better fit.

That’s very well said, Anya. In fact, I will often say to our students that there is this perception when you’re in graduate school that being really good at math or being really good at computing, that those are the most important skills for a biostatistician. And they are, without question, important skills, and those are the types of skills that will often allow you to get good grades in your coursework.

But the most important skill, from my perspective, is to have very good oral and written communication skills for exactly the reason that you mentioned, because everything that we do is collaborative, and you don’t want to talk to your collaborators as though they have the same background that you do. A physician is not going to talk to a patient the same way that they would talk to another physician with expertise in that area.

So it’s a real art for an epidemiologist or a biostatistician to be able to distill the essence of an inferential result and communicate what it can tell you and what it can’t tell you in such a way that your collaborators understand what you’ve done with their data and what conclusions are warranted, what conclusions are not warranted. And it’s not easy to do. It takes a lifetime of practice. So I completely agree with you. And I think perhaps if we all communicated better as biostatisticians, then perhaps that would be a step in the right direction as far as using more appropriate inferential tools. Because when a setting does arise where you feel that p-value isn’t appropriate, you can articulate the reasons why.

Very well said. Now we’ll move on to our last question on the show. This is one that we ask to all of our guests. It can be related to p-values, biostatistics, or just everyday life. But what is one thing you thought you knew but were later wrong about?

Yeah. Well, this was a fun question to think about, Anya, and so I wanted to provide an academic answer and a completely non-academic answer. I’ll start with the academic answer. And this does tie in with our discussion about p-values. I think that learning about the politics and the messiness of publishing scientific research was a real eye-opener for me. When I was young, I believed that good research was published, and bad research was not. Now I realized that most research is imperfect, and that the evaluation of research is highly subjective. Also, to publish research, you often need to sacrifice idealism for pragmatism. But I would claim that once you understand the rules of the game, and that includes the problems that are endemic to the publication process, you realize that you can still conduct and publish good work and do so with ethics and integrity, but you’ll often need to battle to defend your principles.

So that’s one thing that is academic, related to research, that I thought I knew, had sort of an idealistic oversimplified view, and then later found out that I was misguided.

So here’s my non-academic answer, and this just occurred to me this morning. I’m a big fan of football, both college and the NFL. My favorite college team is, of course, the Hawkeyes, and my favorite NFL team is the Buffalo Bills. The Buffalo Bills were really bad for a long time, and in 2018, they drafted a new quarterback, Josh Allen, and I thought they’d made a horrible mistake, that they’d wasted this high first round draft pick on someone who would be a complete bust. And now, as it turns out, Josh Allen is one of the best quarterbacks in the NFL, and the Bills are actually a good team. So I’ve never been so happy to be so wrong. That’s my fun answer.

All right. Well, thanks, Dr. Cavanaugh, for joining us for this episode. It was very helpful to hear you explain p-values, their history, their pitfalls, their future, and also novel biostatistical methods, like the DCP, which you have worked on. And we’re very lucky that you’ve been able to explain it from a biostatistician’s perspective to our non-statistician audience. So yeah, thank you.

Thank you for having me, Amy and Anya. It’s been a pleasure.

That’s it for our episode this week. Big thanks to Dr. Joe Cavanaugh for joining us today. This episode was hosted and written by Amy Wu and Anya Morozov, and edited and produced by Anya Morozov. You can learn more about the University of Iowa College of Public Health on Facebook. And our podcast is available on Spotify, Apple Podcasts, and SoundCloud.

If you enjoyed this episode and would like to help support the podcast, please share it with your colleagues, friends, or anyone interested in public health.

Have a suggestion for our team? You can reach us at [email protected].

This episode was brought to you by the University of Iowa College of Public Health. Until next week, stay healthy, stay curious, and take care.


Guidance for biostatisticians on their essential contributions to clinical and translational research protocol review

Jody D. Ciolino

1 Department of Preventive Medicine, Division of Biostatistics, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA

Cathie Spino

2 Department of Biostatistics, University of Michigan, Washington Heights, Ann Arbor, MI, USA

Walter T. Ambrosius

3 Department of Biostatistics and Data Science, Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA

Shokoufeh Khalatbari

4 Michigan Institute for Clinical & Health Research (MICHR), University of Michigan, Ann Arbor, MI, USA

Shari Messinger Cayetano

5 Department of Public Health, Division of Biostatistics, University of Miami, Miami, FL, USA

Jodi A. Lapidus

6 School of Public Health, Oregon Health & Science University, Portland, OR, USA

Paul J. Nietert

7 Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA

Robert A. Oster

8 Department of Medicine, Division of Preventive Medicine, University of Alabama at Birmingham, Birmingham, AL, USA

Susan M. Perkins

9 Department of Biostatistics, Indiana University, Indianapolis, IN, USA

Brad H. Pollock

10 Department of Public Health Sciences, UC Davis School of Medicine, Davis, CA, USA

Gina-Maria Pomann

11 Duke Biostatistics, Epidemiology and Research Design (BERD) Methods Core, Duke University, Durham, NC, USA

Lori Lyn Price

12 Tufts Clinical and Translational Science Institute, Tufts University, Boston, MA, USA

13 Institute of Clinical Research and Health Policy Studies, Tufts Medical Center, Boston, MA, USA

Todd W. Rice

14 Department of Medicine, Division of Allergy, Pulmonary, and Critical Care Medicine, Medical Director, Vanderbilt Human Research Protections Program, Vice-President for Clinical Trials Innovation and Operations, Nashville, TN, USA

Tor D. Tosteson

15 Department of Biomedical Data Science, Division of Biostatistics, Geisel School of Medicine at Dartmouth, Hanover, NH, USA

Christopher J. Lindsell

16 Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA

Heidi Spratt

17 Department of Preventive Medicine and Population Health, University of Texas Medical Branch, Galveston, TX, USA

Rigorous scientific review of research protocols is critical to making funding decisions, and to the protection of both human and non-human research participants. Given the increasing complexity of research designs and data analysis methods, quantitative experts, such as biostatisticians, play an essential role in evaluating the rigor and reproducibility of proposed methods. However, there is a common misconception that a statistician’s input is relevant only to sample size/power and statistical analysis sections of a protocol. The comprehensive nature of a biostatistical review coupled with limited guidance on key components of protocol review motivated this work. Members of the Biostatistics, Epidemiology, and Research Design Special Interest Group of the Association for Clinical and Translational Science used a consensus approach to identify the elements of research protocols that a biostatistician should consider in a review, and provide specific guidance on how each element should be reviewed. We present the resulting review framework as an educational tool and guideline for biostatisticians navigating review boards and panels. We briefly describe the approach to developing the framework, and we provide a comprehensive checklist and guidance on review of each protocol element. We posit that the biostatistical reviewer, through their breadth of engagement across multiple disciplines and experience with a range of research designs, can and should contribute significantly beyond review of the statistical analysis plan and sample size justification. Through careful scientific review, we hope to prevent excess resource expenditure and risk to humans and animals on poorly planned studies.

Introduction

Rigorous scientific review of research protocols is critical to making funding decisions [ 1 , 2 ], and to the protection of both human and non-human research participants [ 3 ]. Two pillars of ethical clinical and translational research include scientific validity and independent review of the proposed research [ 4 ]. As such, the review process often emphasizes the scientific approach and the study design, along with rigor and reproducibility of data collection and analysis. The criterion score labeled “Approach” has been shown to be the strongest predictor of the overall Impact Score and the likelihood of funding for research project grants (e.g., R01s) at the National Institutes of Health (NIH) [ 5 ]. Evidence also favors scientific review as a consequential component of institutional review of human participant research [ 3 ]. Given the increasing complexity of research designs and data analysis methods, quantitative experts, such as biostatisticians, often play an essential role in evaluating the rigor and reproducibility of proposed analytic methods. However, the structure and components of formal review can vary greatly when quantitative methodologists review research protocols prior to data collection, whether for Institutional Review Boards (IRBs), scientific review committees, or intramural and extramural grant review committees.

Protocol submitters and protocol reviewers often mistakenly view a statistician’s input as relevant only to sample size/power and statistical analysis sections of a protocol. Experienced reviewers know that to provide informative and actionable review of a research protocol from a biostatistical perspective requires a comprehensive view of the research strategy. This can be a daunting task for novice quantitative methodologists, yet to our knowledge, there is little guidance on the role and crucial components of biostatistical review of a protocol before data are collected.

Members of the Biostatistics, Epidemiology, and Research Design (BERD) Special Interest Group (SIG) of the Association for Clinical and Translational Science (ACTS) sought to develop this guidance. We used a consensus approach to identify the elements of research protocols that a biostatistician should consider in a review, and provide specific guidance on how each element should be reviewed. The resulting review framework can be used as an educational tool and guideline for biostatisticians navigating review boards and panels. This article briefly describes the approach to developing the framework, provides a comprehensive checklist, and guidance on review of each protocol element. We are disseminating this framework to better position biostatisticians to (1) advocate for research protocols that achieve the goal of answering their proposed study questions while minimizing risk to participants, and (2) serve as a steward of resources, with the ultimate goal of preventing the pursuit of uninformative or unnecessary research activities. We hope a consequence of this work will also be improved rigor and reproducibility of research protocols at the time of submission because protocol writers will also benefit from the guidance.

Approach to Developing Guidelines

In fall of 2017, the BERD SIG of the ACTS identified the considerable variation in the expectations for, and practice of, biostatistical review of research protocols as a modifiable barrier to effectively informing funding decisions, and to weighing risks and benefits for research participants. The BERD SIG comprises biostatisticians and epidemiologists with expertise in clinical and translational research at academic medical centers across the USA. Volunteers from this group formed a working group, consisting of all 16 coauthors of this article, to develop a checklist of items a quantitative methodologist should review in a research protocol (Table  1 ). The initial checklist focused on defining essential elements for reviewing a randomized controlled trial (RCT) as this is considered the most robust design in clinical research [ 6 ]. However, RCTs are not always feasible, practical, or scientifically or ethically justified, so elements for reviewing other important types of studies were added. Protocol elements essential for an RCT may be irrelevant to other types of studies, and vice versa.

Checklist guide of items to consider in biostatistical review of protocols

As the checklist was finalized, working group members were assigned to draft guidance describing review essentials pertaining to each item on the checklist. The expectation was that the text should describe the biostatistical review perspective arrived at during the group discussions that occurred during development of the checklist. Another assigned reviewer then revised each section. The consensus approach involved multiple iterations of review and revision, and the final text presented in this article reflects the consensus of the working group. The co-primary (Ciolino and Spino) and senior authors (Lindsell and Spratt) synthesized all feedback from revision and review to finalize this article and corresponding checklist. Consensus was reached when all coauthors agreed with the final article and checklist tool.

In the spring of 2019, a group of early career investigators (i.e., recipients of K awards) reviewed and provided comment on the checklist and article during a question-and-answer review lunch. Their feedback was that to maximize dissemination and impact beyond the statistical community, it would be more effective to emphasize why the statistical perspective matters for a protocol element rather than trying to justify one statistical argument or another. To obtain additional feedback to help focus the manuscript, we invited members and affiliates of the BERD SIG to rate the relative importance of each protocol element for different study designs; the results are summarized in the heat map in Fig.  1 and discussed in its caption. With this rich context and feedback, we finalized the guidance and checklist for presentation here.

Fig. 1.

Illustration of varying degrees of relevance for protocol items across common study types. This figure supplements the accompanying checklist of protocol items a biostatistical reviewer should consider in reviewing study protocols. The heat map illustrates the high-level summary view, among coauthors and other quantitative methodologists ( N = 20 respondents), of relevance for each checklist item. Individual respondents rated each item from 1 (most relevance) to 4 (no relevance/not applicable). Darker cells correspond to higher importance or relevance for a given item/study type, while lighter cells indicate less relevance or importance. If we use the randomized controlled trial (RCT) as a benchmark, we note that the majority of the checklist items are important to consider and review in a research protocol for this study type. The ordering of study types from left to right reflects the order in which respondents were presented these items when completing the survey. The dark column to the left illustrates this. As the study type strays from the RCT, we illustrate the varying degrees of relevance for each of these items. For example, a statistical reviewer should not put weight on things like interim analyses for several of these other study types (cohort studies, case-control, etc.), and the group determined that the use of validated instruments and minimizing bias in enrollment in animal studies are less relevant. On the other hand, the need for clear objectives and hypotheses is consistent throughout, no matter what the study type.

Objectives and Hypotheses

Objectives are articulated and consistent: specific, measurable, achievable, relevant, time-bound (i.e., SMART).

The first step in protocol review is understanding the research question. Objectives describe the explicit goal(s) of the study and should be clearly stated regardless of design. It is common for the objectives to be summarized in the form of “specific aims.” They should be presented in the context of the broader program of research, including a description of existing knowledge gaps and future directions. Objectives should be written so that they are easily understandable by all who read the protocol.

It is challenging to evaluate the rigor and impact of a study when objectives are diffuse. A common guide to writing objectives is the “SMART” approach [ 7 ]. That is, objectives should be specific as to exactly what will be accomplished. They should be measurable so that it can be determined whether the goals are accomplished. They should be achievable within the time, resource, and design constraints. They should be relevant to the scientific context and existing state of knowledge. Finally, objectives need to be tied to a specific time frame, often the duration of a project funding period. Biostatistical reviewers should evaluate objectives according to these criteria, as doing so positions them to properly evaluate the rest of the protocol.

Hypotheses Follow from Objectives

Hypotheses are statements of expected findings from the research outlined in the objectives. A study can have one or many hypotheses, or none at all. If a study is not designed to test the veracity of some assumed truth, it is not necessary – and often detrimental – to force a hypothesis statement. The biostatistical reviewer should appropriately temper criticism of studies that are “hypothesis generating” as opposed to formal statistical hypothesis testing. The observational, hypothesis generating loop of the scientific method provides an opportunity for the biostatistical reviewer to focus on evaluating the rigor and reproducibility of the proposed work in the absence of a formally testable hypothesis.

Statistical Hypothesis Tests Are Clear and Match Aims

When a hypothesis is appropriate, it should be stated in a testable framework using the data generated by the proposed study. The biostatistical reviewer should assess how the statistical approach relates to the hypothesis, as contextualized by the objectives. Our experience is that an objective with more than one or two key hypotheses has insufficient focus to allow for a rigorous, unbiased study design accompanied by a robust analytic approach. Inclusion of several supportive hypotheses is of less concern.

These same notions of cohesion between objective and analyses apply for preliminary and pilot studies. The objectives of preliminary studies should be clearly stated. They may seek to demonstrate a specific procedure can be performed, a specified number of subjects can be enrolled in a given time frame, or that a technology can be produced. A pilot study with an objective to estimate effect size should be redesigned with alternative objectives because the sample size often precludes estimating the effect size with meaningful precision [ 8 ]. The moniker of pilot study is often mistakenly used to justify an underpowered study (i.e., uninformative study) [ 9 ]. While it is important that pilot studies specify any hypothesis to be tested in a subsequent definitive study, in general they should seldom (if ever) propose to conduct statistical hypothesis tests [ 8 , 10 ]. In every case, the biostatistical reviewer should look for objectives that are specific to demonstration or estimation.

General Approach

General study design matches the objectives and hypotheses to address research questions.

Once a study’s purpose is clear, the next goal of a biostatistical review is to confirm the general approach (i.e., type of study) matches the objectives and is consistent with the hypotheses that will be tested. RCTs are generally accepted for confirming causal effects, but there are many situations where they are neither feasible nor ethically justified, and well-designed observational (non-experimental) studies are required. For example, RCTs to evaluate parachute use in preventing death and major trauma in a gravitational challenge do not exist because of clear ethical concerns. Between the experimental and observational approaches lies a class of studies called quasi-experimental studies that evaluate interventions or exposures without randomization, using design and analytical techniques such as instrumental variables (natural experiments) and propensity scores [ 11 , 12 ]. The biostatistical reviewer should consider the relevant merits and tradeoffs between the experimental, non-experimental, and quasi-experimental approaches and comment on the strength of evidence for answering the study question.

We highlighted a few possible design approaches in Fig.  1 . Within each, there are innumerable design options. For example, with the RCT design, there are crossover, factorial, dose-escalating, and cluster-randomized designs, and many more [ 6 ]. The biostatistical reviewer should acknowledge the balance between rigor and feasibility, noting that the most rigorous design may not be the most efficient, least invasive, ethical, or resource preserving.

Limitations on Conclusions that Can Be Drawn Are Evident and Clear

When using innovative designs, the biostatistical reviewer must consider whether the design was selected because it is most appropriate rather than for other factors, such as current trends and usage in the field. There are typically multiple designs available to answer similar questions, but the protocol must note the limitations of the design proposed and justify its choice over alternative strategies. As Friedman, Furberg, DeMets, et al. note, “There is no such thing as a perfect study” [ 6 ].

When a protocol requires novel or atypical designs, it is imperative that the biostatistician’s review carefully considers potential biases and the downstream analytic implications the designs may present. For example, a dose-finding study using response-adaptive randomization will not allow for conclusions to be drawn regarding drug efficacy in comparison to placebo using classical statistical methods. It will, however, allow for estimation of a maximum tolerated dose for use in later phase studies. This imposes additional responsibilities on the biostatistician to understand the state of the science within the field of application, the conclusions one can draw from the proposed research, and their impact on subsequent studies that build upon the knowledge gained.

Population and Sample

Degree of generalizability is obvious.

We must recognize that every sample will have limits to generalizability; that is, there will be inherent biases in study design and sampling. RCTs have limits to generalizability as they require specification of eligibility criteria to define the study sample. The more restrictive these criteria are, the less generalizable the inferences become. This concept of generalizability becomes particularly important as reviewers evaluate fully translational research that moves from “bench to bedside.” Basic science and animal studies (i.e., “bench research”) occur in comparatively controlled environments, usually on samples with minimal variability or heterogeneity. The generalizability of these pre-clinical findings to heterogeneous clinical populations is therefore limited. For this reason, an effect size observed in pre-clinical populations cannot be generalized to that which one would expect in a clinical population.

All sample selection procedures have advantages and disadvantages, which must be considered when assessing the feasibility, validity, and interpretation of study findings. Biases may be subtle, yet they can have important implications for the interpretability and generalizability of study findings. For example, a randomized, multicenter study conducted in urban health centers evaluating implementation of a primary care quality improvement strategy will likely not allow for generalizability to rural settings. We urge statistical reviewers to evaluate sampling procedures and watch for samples of convenience that may not be merited.

Inclusion and Exclusion Criteria Appropriate for State of Knowledge

No matter the type or phase of study, the protocol should describe how eligibility of study participants is determined. The notion that therapies and diseases have differing underlying mechanisms of action or progression in different populations (e.g., children vs. adults, males vs. females) often leads to increased restrictions on inclusion and exclusion criteria. While sometimes justified scientifically as it allows for a precise estimate of effect within a specialized population, the tradeoff is less generalizability and feasibility to complete enrollment. On the other hand, sample selection or eligibility criteria may be expansive and purposefully inclusive to maximize generalizability. The tradeoff is often increased variability and potential heterogeneity of effect within specific subgroups. Biostatistical reviewers should question eligibility decisions chosen purely for practical reasons and recognize the limits they place on a study’s generalizability, noting the potential future dilemma for managing patients who would not have completely satisfied a study’s inclusion and exclusion criteria.

Screening and Enrollment Processes Minimize Bias and Do Not Restrict Diversity

It is imperative that clinical and translational research be designed for diversity, equity, and inclusion. Aside from specification of eligibility criteria, the specific way researchers plan to identify, recruit, screen, and ultimately enroll study participants may be prone to biases. For example, reading level and language of the informed consent document may impact accessibility. The timing and location of recruitment activities also restrict access both in person and by mail or electronic communication. Using email or phone outreach to screen and identify patients will exclude those without easy access to technology or stable phone service. Some populations may prefer text messaging to phone calls; some may prefer messages from providers directly rather than participating study staff. Biostatistical reviewers should consider inclusive procedures and those appropriate for the target study population, as these choices have the potential to introduce bias and variability within the sample and can attenuate or amplify both effect sizes and study generalizability.

Measurements and Outcomes

Choice of measurements, especially the response variable, is justified and consistent with the objectives.

The statistical review should ensure that outcome measures are aligned with objectives and appropriately describe the response of the experiment at the unit of analysis of the study (e.g., participant, animal, cell). Outcomes should be clinically relevant, measured or scored on an appropriate scale, valid, objective, reliable, sensitive, specific, precise, and free from bias to the extent possible [ 13 ]. Statistically, the level of specification is important. As an example, risk of death can be assessed as the proportion of participants who die within a specified period of time (binary outcome), or as time to death (i.e., survival). These outcomes require different analytic approaches with consequences on statistical power and interpretation. Ideally, the outcome should provide the maximum possible statistical information. It is common for investigators to dichotomize continuous measurements (e.g., defining a treatment responder as a participant who achieves a certain change in the outcome rather than considering the continuous response of change from baseline). Information is lost when continuous and ordinal responses are replaced with binary or categorical outcomes, and this practice is generally discouraged [ 14 – 16 ]. If the biostatistical reviewer identifies such information loss, they should consider the resulting inefficiency (i.e., increased sample size required; loss of efficacy signal) in the context of the risks to human or animal subjects and the costs of the study.
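
To make the efficiency cost concrete, the following sketch compares per-group sample sizes for a continuous analysis and a dichotomized ("responder") analysis. It is not from the guidance itself: the standardized mean difference of 0.5, the median-split responder definition, and the use of statsmodels are assumptions of the example.

```python
# Efficiency cost of dichotomizing a continuous outcome: a minimal sketch.
from scipy.stats import norm
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

alpha, power = 0.05, 0.80

# Continuous analysis: two-sample t-test at standardized difference d = 0.5.
n_cont = TTestIndPower().solve_power(effect_size=0.5, alpha=alpha, power=power)

# Dichotomized analysis: "responder" = above the pooled median, so the
# group proportions are Phi(+0.25) and Phi(-0.25) under the same d = 0.5.
p1, p2 = norm.cdf(0.25), norm.cdf(-0.25)
h = proportion_effectsize(p1, p2)  # Cohen's h for comparing two proportions
n_bin = NormalIndPower().solve_power(effect_size=h, alpha=alpha, power=power)

print(f"per-group n, continuous: {n_cont:.0f}; dichotomized: {n_bin:.0f}")
# Roughly 64 vs 101: the dichotomized design needs about 60% more participants.
```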

Outcome deliberations become critical in the design of clinical trials because distinction between primary, secondary, and exploratory endpoints is important. Many considerations are the topic of a large body of literature [ 17 – 20 ]. The biostatistical reviewer should be aware that the Food and Drug Administration (FDA) differentiates among the decisions that are supported by primary, secondary, and exploratory endpoints, as described in the “Discussion of control of type I error (multiple comparisons) is present” section [ 18 ]. Results of primary and secondary endpoints must be reported in ClinicalTrials.gov; results of exploratory endpoints do not require reporting. Regardless of the dictates of ClinicalTrials.gov, limitations on the number of secondary endpoints are prudent, and all secondary endpoints should be explicitly detailed in the protocol [ 21 ]. It is important that the issue of multiplicity in endpoints is considered and clearly delineated in the sections on sample size and statistical analysis. Beyond multiplicity, what constitutes success of the trial needs to be clear. For example, if there are several co-primary endpoints, it should be noted whether success is achieved if any endpoint is positive, or only if all endpoints are met. It should also be clear whether secondary endpoints will be analyzed if the primary endpoint is not significant. When there are longitudinal measurements, the investigators should specify whether a single time point defines the primary endpoint or whether all time points are incorporated to define the trajectory of response as the primary endpoint. Although the description above focuses on clinical trials, all protocols should describe which outcome(s) is(are) the basis for sample size or integral to defining success of the study.

Composite outcomes deserve special attention in a biostatistical review of a protocol [ 22 – 24 ]. Composite outcomes combine several elements into a single variable. Examples include “days alive and out of hospital” or “death or recurrent myocardial infarction.” Investigators sometimes select composite outcomes because of a very low expected event rate on any one outcome. The biostatistical reviewer should carefully consider the information in the composite endpoint for appropriateness. An example of a composite outcome that would not be appropriate is death combined with lack of cognition (e.g., neurologically intact survival) when the causal pathway is divergent such that the treatment worsens mortality but improves cognitive outcome. Some patients will care about quality of life (i.e., improved cognition) over length of life, while others will not, which yields a situation where one cannot define a single utility function for the composite outcome.

Timing of Assessments and Measurements Is Clear and Standardized (Study Schedule or Visit Matrix Should Be Present)

Regardless of whether a measurement is an outcome, predictor, or other measure, a biostatistical reviewer should evaluate each measurement in terms of who, what, when, where, how, and why, as shown in Table  2 .

Aspects of measurement that should be considered in protocol evaluation

This information tends to be scattered throughout the protocol: who will be assessed is described in the eligibility criteria; the main predictor may be defined in interventions; what and when are given in the outcomes section; who will administer assessments, and where and when, are described in data collection methods; and how the assessments are summarized may be in the outcomes or the statistical methods section. This makes the evaluation of measurements challenging, but it is a vital component and should not be undervalued. A schedule of evaluations (Table  3 ), or visit matrix, can be a very valuable tool to summarize such information and help reviewers easily identify what measurements are being made by whom, how, and when.

Schedule of evaluations

Objectively Measured and Standardized

To remove potential sources of bias, from a statistical perspective each assessment should be as objective as possible. Thus, the protocol should mention measures of standardization as appropriate (e.g., central reading of images, central laboratory processing), and measures to ensure fidelity and quality control. For example, if outcomes come from a structured interview or clinical rating, there should be discussion of measuring interrater agreement. If the agreement is lower than expected, training or re-training procedures should be described.

If Based on Subjective or Patient Report, Use Validated Instruments as Appropriate

For patient-reported outcomes and questionnaires, the validity and reliability of the instrument is key. Validated instruments with good psychometric properties should be favored over unvalidated alternatives. The biostatistical reviewer should generally be cognizant of the repercussions of even small changes to the instrument, including reformatting or digitizing instruments.

Measurements Are of Maximum Feasible Resolution with No Unnecessary Categorization in Data Collection

Reviewers should assess whether appropriate data collection methods are used to improve the quality of the outcomes. For example, outcomes sometimes require derivation or scoring, such as body mass index (BMI). The reliability of BMI is improved if height and weight are collected and BMI is calculated in analysis programs. This reduces errors from unit conversions (e.g., between inches and centimeters) and from incorrect manual calculations. For each key variable, it should be clear how the data will be generated and recorded; when a case report form is provided for review this can be remarkably helpful.
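
A small illustration of deriving rather than collecting such a variable: a hypothetical analysis-program function computing BMI from the raw measurements (the unit conventions here are assumptions of the example).

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index (kg/m^2) derived from raw weight (kg) and height (cm)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

print(round(bmi(70.0, 175.0), 1))  # 22.9
```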

Ranges of Outcomes, Distributional Properties, and Handling in Analyses Are Clear

The distributional properties of outcomes will allow a statistical reviewer to determine whether the analytic strategy is sound. If a laboratory value is known to be highly skewed, or if the investigators use a count variable with many anticipated zero values, a standard t-test may not be appropriate; nonparametric tests or models assuming a Poisson or zero-inflated Poisson distribution may be better suited. These choices have implications for inferences and sample size calculations. Time-to-event variables are often confused with binary outcome variables: cumulative measures such as death by a certain time point (e.g., 12-month mortality rate) and time-to-death are two different outcomes that are incorrectly used interchangeably. The statistical reviewer should pay careful attention to situations where the outcome could feasibly be treated as binary (i.e., evaluated with relative risk, odds ratio, or risk difference) or as a time-to-event outcome (i.e., evaluated with hazard ratios).

Algorithms Used to Derive Variables or Score Outcome Assessments Are Justified (e.g., Citations, Clinical Meaning)

Just as validated surveys/questionnaires are preferred in study design, so are any algorithms that are being used to select participants, allocate interventions (or guide interventions), or derive outcomes. For example, there are several algorithms used to derive estimated glomerular filtration rate [ 25 – 27 ], a measure of kidney function, or percent predicted forced expiratory volume (FEV1), a measure of lung function [ 28 , 29 ]. Some studies use predictive enrichment, using algorithms to select participants for inclusion. Although a statistical reviewer may not be in a position to argue the scientific context of the algorithm, they can ensure that the choice of algorithm or its derivation is discussed in the protocol. This should include whether investigators plan to use unvalidated algorithms or scores. If the algorithms are described according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis criteria [ 30 ], it provides the biostatistical reviewer the necessary information to either advocate for or against the algorithm as a component of the study.

Measurement of Important/standard Explanatory Variables that Will Describe Sample or Address Confounding

In addition to outcomes, other measures such as predictors, confounders, effect modifiers, and other characteristics of the population (e.g., concomitant medications) should be listed. For example, if obesity is a confounder and will be included in analyses, the metric used to define obesity should be provided. These variables generally do not need as much detail as the outcome unless they are important in the analyses, or the lack of explanation may cause confusion.

Treatment Assignment

Minimization of biases (e.g., randomization and blinding).

Any study that evaluates the effect of a treatment or intervention must consider how participants are allocated to receive the intervention or the comparator. Randomized assignments serve as the ideal study design for minimizing known and unknown differences between study groups and evaluating causality. With this approach, the only experimental condition that differs in comparing interventions is the intervention itself. A biostatistical reviewer should consider the fundamental components of the randomization process to ensure that threats to causal inference are not inadvertently introduced.

Units and methods of randomization will depend upon the goals and nature of the study. Units of randomization may be, for example, animals, patients, or clinics. There is a wide range of methods for controlling balance across study arms [ 31 – 33 ]. Simple randomization and block randomization are straightforward, but techniques such as stratification, minimization, or adaptive randomization may be more appropriate. The choice of randomization approach, including details of the randomization process, should be considered by the biostatistical reviewer in the context of the study design. The algorithm used to generate the allocation sequence should be explained (e.g., stratified blocks, minimization, simple randomization, response-adaptive, use of clusters). However, the reviewer should not ask for details that would defeat the purpose of concealment (e.g., size of blocks).
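
As a concrete example of one common method, the sketch below generates a permuted-block sequence with 1:1 allocation and block size 4. The arm labels, block size, and seed are illustrative only; in practice the sequence would be generated, and the block size concealed, by a central system.

```python
# Permuted-block randomization: a minimal sketch (illustrative values).
import random

rng = random.Random(2021)  # seeded only to make the illustration reproducible

def permuted_block_sequence(n_blocks, block=("A", "A", "B", "B")):
    """Concatenate independently shuffled copies of a balanced block."""
    sequence = []
    for _ in range(n_blocks):
        b = list(block)
        rng.shuffle(b)   # random order within each block keeps arms balanced
        sequence.extend(b)
    return sequence

print(permuted_block_sequence(3))  # e.g. ['B', 'A', 'A', 'B', ...]
```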

The integrity of a trial and its randomization process can easily be compromised if the allocation sequence is not concealed properly, and thus the biostatistical reviewer should look for a description of the concealment process, such as use of a central telephone system or centralized web-based system. Concealment is not the same as blinding (sometimes also referred to as masking); concealment of the randomization sequence is intended to prevent selection bias prior to enrollment whereas blinding is intended to prevent biases arising after enrollment. Therefore, open-label and non-blinded randomized studies should also conceal the allocation sequence. If a pre-generated sequence is not used, the biostatistical reviewer should consider how real-time randomization is deployed, as might be required in a response-adaptive design, and whether it is feasible.

Beyond considerations of generating and concealing the randomization algorithm, the biostatistical reviewer should look for biases arising from the randomization process. As an example, if there is extended time between randomization and intervention, there is a high likelihood that the participant’s baseline status has changed, and they may no longer be eligible for treatment with the intervention. This can result in an increased number of patients dropping out of the study or not receiving their assigned intervention. Under the intent-to-treat principle, this results in bias toward the null.

The use of blinding can strengthen the rigor of a study even if the participant’s treating physician cannot be blinded in the traditional sense. For example, a blinded assessor of the primary endpoint can be used. Sometimes blinding of the participant or physician is not possible, such as when intervention is a behavioral therapy in comparison to a medication. It may also be necessary to break the blind during the study in emergency cases or for study oversight by a Data and Safety Monitoring Board. The biostatistical reviewer should consider whether sufficient blinding is in place to minimize bias, and whether the process for maintaining and breaking the blind is sufficient to prevent accidentally revealing treatment allocation to those who may be positioned to introduce bias.

Control Condition(s) Allow for Comparability or Minimization of Confounding

Randomization and blinding are used in prospective interventional trials to minimize bias and maximize the ability to conclude causation of the intervention. However, in some cases randomization may be unethical or infeasible and prospective observation is proposed to assess the treatment effect. In other cases, retrospective observational studies may be proposed. In such cases, there are several biases to which a biostatistical reviewer should be attuned. These include, but are not limited to, treatment selection bias, protopathic bias [ 34 ], confounding by severity, and confounding by indication. Approaches such as multiple regression methods or propensity score matching can be used to address measurable biases, and instrumental variable analysis can address unmeasured bias. The biostatistical reviewer should rightly discount an observational study of treatment effects when there is no attempt to mitigate the inherent biases, but should also support a protocol when appropriate methods are proposed.

Data Integrity and Data Management

Data capture and management is described.

Data integrity is critical to all research. A biostatistical reviewer should be concerned with how the investigator plans to maintain the accuracy of data as they are generated, collected, and curated. Data collection and storage procedures should be sufficiently described to ascertain the integrity of the primary measurements and, if appropriate, to adjudicate compliance with regulatory and scientific oversight requirements. The amount of detail required is often proportional to the size and complexity of the research.

For larger and more complex studies, and clinical trials in particular, a standalone Data Management Plan (DMP) might be used to augment a research protocol [ 35 , 36 ]. Whether or not the DMP is separate from the protocol, a biostatistical reviewer should consider details about who is responsible for creation and maintenance of the database; who will perform the data entry; and who will perform quality checks to ensure data integrity, how, and when. This work is often supported by a data management platform, a custom system used to manage electronic data from entry to creation of an analytic dataset. The wrong tool can undermine data integrity, and a biostatistical reviewer should examine the data management pathway to ensure the final dataset is an accurate representation of the collected data. Investigators should describe their chosen data management platform and how it will support the workflow and satisfy security requirements. The choice of platform should be scaled to the study needs and explained in the study protocol; for supporting complex clinical trials spanning multiple countries, the technical requirements of the data management platform can become extensive.

Studies often utilize data captured from multiple platforms that must eventually be merged with the clinical data for analysis. Examples include data from wearable mobile health technology (smart phones and wearable devices such as accelerometers and step counters), real-time data streams from inpatient data monitors, electronic health record data, and data from laboratory or imaging cores. If the protocol calls for multiple modes of data collection, the protocol should acknowledge the need for merging data sources and describe how data integrity will be assured during linkage (e.g., use of Globally Unique Identifiers to identify a single participant across multiple data sources, and reconciliation of such keys during the study).
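
A linkage-integrity check of the kind described might look like the sketch below; the data frames, identifier column, and values are all hypothetical.

```python
# Reconciling participant identifiers across two data sources: a sketch.
import pandas as pd

clinical = pd.DataFrame({"guid": ["A1", "A2", "A3"], "age": [54, 61, 47]})
wearable = pd.DataFrame({"guid": ["A1", "A3", "A4"], "steps": [5200, 8100, 6400]})

merged = clinical.merge(wearable, on="guid", how="outer", indicator=True)
# Rows found in only one source flag identifiers that need reconciliation.
print(merged[merged["_merge"] != "both"])
```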

Security and Control of Access to Study Data Are Discussed

While protection of privacy and confidentiality is traditionally the purview of privacy boards or IRBs, a biostatistical reviewer should ensure the protocol describes data security measures. Expected measures will include procedures for ensuring appropriate authentication for use, storage of data on secure servers (as opposed to a local computer’s hard disk or unencrypted flash drive), and accessibility of the data or data management system. In a clinical trial, for example, the data management system might need to be available 24 h a day, 7 days a week with appropriate backup systems in place that would pick up the workflow in the event of a major failure. Conversely, a chart review study might be supported sufficiently using a simple data capture form deployed in Research Electronic Data Capture [ 37 , 38 ]. Increasingly, data and applications are being maintained on cloud-based systems with high reliability; security requirements also apply to cloud-based data storage.

Data Validation, Errors, Query Resolution Processes Are Included

Missing and erroneous data can have a significant impact on the analysis and results, so the biostatistical reviewer should evaluate plans to minimize missing data and to check for inaccuracies. Beyond traditional clinical trial monitoring for regulatory and protocol compliance, the protocol should consider how to prevent data values outside of allowable ranges and data inconsistencies. A query identification and resolution process that includes range and consistency checks is recommended. For more complex studies, the biostatistical reviewer should expect the investigator to describe plans for minimizing missing [ 39 ] or low-quality data during study implementation, such as routine data quality reporting with corrective action processes. Inherent to more complex research protocols are the practical challenges that result in protocol deviations, including dose modifications, study visits that occurred outside of the prescribed time window, and missed assessments during a study visit. The biostatistical reviewer may expect the DMP to describe the approach to documenting such events and how they are to be considered in subsequent analyses.
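
Range and consistency checks of the sort recommended here are straightforward to script; the variables, plausibility limits, and dates below are invented for the example.

```python
# Query identification via range and consistency checks: a minimal sketch.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "sbp": [118, 402, 95],                    # systolic blood pressure (mmHg)
    "enroll_date": ["2021-01-01"] * 3,
    "visit_date": ["2021-01-04", "2021-01-11", "2020-12-28"],
})

# Range check: flag physiologically implausible blood pressures.
range_queries = df[(df["sbp"] < 60) | (df["sbp"] > 260)]

# Consistency check: a study visit should not precede enrollment.
dates = df[["visit_date", "enroll_date"]].apply(pd.to_datetime)
consistency_queries = df[dates["visit_date"] < dates["enroll_date"]]

print(range_queries)        # id 2: sbp = 402
print(consistency_queries)  # id 3: visit before enrollment
```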

Statistical Analysis Plan

Statistical analysis plans provide a reproducible roadmap of analysis that can be very valuable for all studies [ 40 , 41 ]. There are many components that could be included, and these will vary with the study type and design [ 42 – 44 ]. A true pilot or feasibility study may not require a statistical analysis plan with the detail that would typically be required for an RCT. The objectives of such studies are often not to answer a particular research question, but to determine feasibility of study conduct. Such studies should not involve the testing of hypotheses, but they often involve quantitative thresholds for enrollment to inform the larger studies. If the analysis plan calls for hypothesis testing, the biostatistical reviewer may rightfully reject the approach as inappropriate. For large epidemiological registries, formal hypotheses may not have been developed at the time of design. Unlike feasibility studies, however, the biostatistical reviewer should expect to see the research team’s general approach to developing and testing hypotheses or modeling the data. The following sections outline key components that should be considered by biostatistical reviewers. Most statistical inference, and the focus of this article, is frequentist. We note that many questions can be better answered using Bayesian inference. Almost all of the points of this article apply equally well to a Bayesian study.

Statistical Approach Is Consistent with Hypothesis and Objectives

Statistical analyses allow for inferences from study data to address the study objectives. If the analyses are misaligned, the research question they actually address, although potentially important, will not be the one the researcher originally sought to explore. The biostatistical reviewer should ensure alignment between the proposed analyses and (1) the research hypotheses, and (2) the study outcomes. For example, continuous outcomes should not be analyzed using statistical methods designed for the analysis of dichotomous outcomes, such as chi-squared tests or logistic regression. A misaligned analysis plan may have implications not just for missing the study objectives but for sample size considerations. A study that collects a continuous outcome but analyzes it as binary is generally less efficient and will require more experimental units. A biostatistical reviewer should identify such inefficiencies, particularly in studies that involve extensive resources or that involve risk to human subjects.

For studies with multiple objectives, the biostatistical reviewer should expect a detailed analysis plan for each primary outcome. Secondary outcomes should be described but might be grouped together based on outcome type. Exploratory outcomes might be discussed with fewer details.

A Plan for Describing the Dataset Is Given

Describing the study sample is the first step in any analysis, and it allows those evaluating results to determine generalizability. Thus, any analytic plan should call for a description of the study sample – e.g., baseline characteristics of patients, animals, cells – regardless of importance for the primary study goals. The description should include sampling, screening, and/or randomization methods, as in a Consolidated Standards of Reporting Trials (CONSORT) diagram [ 43 ] for clinical trials (see also Mathilde et al. [ 45 ] for other study types). When a study design involves repeated measurements on the same experimental unit (e.g., patient, animal, or cell), the biostatistical reviewer should consider each experimental unit’s contribution to the analysis at each time point.

Unit of Analysis Is Clearly Described for Each Analysis

Experimental units may be clinical sites, communities, groups of participants, individual participants, cells, tissue samples, or muscle fibers. They may vary by research aim within a single study. A biostatistical reviewer’s evaluation of the analytical approach and sample size estimates depends on experimental unit. In cluster-randomized trials, for example, the analytic unit may be the cluster (e.g., clinic or site) or unit within a cluster (e.g., patients). Common mistakes in cluster-randomized trials, or studies involving analytic units that are inherently correlated with one another, involve failure to specify the units of analyses and failure to adequately account for the intraclass (or sometimes termed intracluster) correlation coefficient (ICC) in both sample size calculations and analyses [ 46 , 47 ]. The biostatistical reviewer should consider whether the analytic strategy adequately accounts for potential correlation among experimental units in such studies.
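
A worked example of the standard design-effect adjustment, 1 + (m - 1) * ICC, is sketched below; the individually randomized sample size of 128, cluster size of 20, and ICC of 0.05 are assumed inputs.

```python
# Inflating a sample size for clustering via the design effect: a sketch.
import math

n_indiv, m, icc = 128, 20, 0.05      # assumed inputs
deff = 1 + (m - 1) * icc             # design effect
n_total = math.ceil(n_indiv * deff)  # participants needed under clustering
n_clusters = math.ceil(n_total / m)  # clusters needed at m per cluster

print(deff, n_total, n_clusters)     # 1.95, 250, 13
```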

Analysis Populations Clearly Described (e.g., Intention-to-Treat Set, Per Protocol Set, Full Analysis Set)

Non-adherence or data anomalies are inevitable in clinical research, but decisions to exclude participants or data points from analyses present two problems: they result in a smaller overall analytic sample size, and they introduce a potential source of bias. Many large, clinical trials employ the intention-to-treat principle. Under this principle, once participants are randomized, they are always included in the analysis, and participants are analyzed as originally assigned regardless of adherence. Study protocols should mention any plans to analyze participants according to this principle and any modifications of this principle. As may occur with a safety analysis for an experimental drug or treatment, if an analysis assigns patients to treatment arms based on what actually happened (i.e., a per protocol or as treated dataset), the biostatistical reviewer should ensure this is pre-defined. That is, the criteria that make a participant “adherent” should be clear, including how adherence will be measured (e.g., pill counts, diaries). The biostatistical reviewer should also assess handling of non-adherent participants. The analysis plan should discuss these ideas and describe the analytic dataset with these questions in mind. These same concepts can be applied to observational studies to reduce bias and variability in analyses while keeping true to the study’s aims. Whether it be in the context of the study population in an RCT or in determining causality in an observational context, the biostatistical reviewer should also consider whether causal inference methodology is an appropriate approach to addressing research aims.

Key Statistical Assumptions Are Addressed

Soundness of statistical analyses depends on several assumptions. It would be impractical to list all assumptions in a protocol, but the biostatistical reviewer should evaluate plans to check major assumptions, such as normality and independence, noting that sometimes specific assumptions may be relaxed depending upon the study scenario. On the other hand, certain proposed methods may require clear articulation of “out-of-ordinary” or strong assumptions for them to be truly valid in a given context (e.g., the many assumptions that surround causal inference methods). An example of an appropriate way to acknowledge and plan for addressing model assumptions is a high-level statement: “We will assess the data for normality [with the appropriate methods stated here], transform as needed, and analyze using either Student’s t-test or the nonparametric equivalent as appropriate [again stating at least one specific method here].”
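
The contingency quoted above could be implemented along the following lines; the data are simulated, and the use of Shapiro-Wilk with a 0.05 threshold is an assumption of the sketch, not a prescription.

```python
# Assess normality, then choose the parametric or nonparametric test: a sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
grp1 = rng.lognormal(size=30)        # simulated skewed outcome, group 1
grp2 = rng.lognormal(0.5, size=30)   # simulated skewed outcome, group 2

if min(stats.shapiro(grp1).pvalue, stats.shapiro(grp2).pvalue) > 0.05:
    result = stats.ttest_ind(grp1, grp2)     # normality not rejected
else:
    result = stats.mannwhitneyu(grp1, grp2)  # nonparametric equivalent
print(result)
```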

Alternative Approaches in the Event of Violations of Assumptions Are Present

While it is impossible to foresee all possible violations of assumptions, and thus to plan for every alternative approach that may be appropriate, the investigator should have a contingency plan where violations of assumptions are likely. A strong analysis plan will mention how it will be updated, using appropriate version control, to address each shift in approach; this documentation allows for the greatest transparency in any unexpected changes in analyses.

Discussion of Control of Type I Error (Multiple Comparisons) Is Present

When an analysis plan calls for many statistical tests, the probability of making a false positive conclusion increases simply due to chance. The biostatistical reviewer should balance the possibility of such type I errors with the strength and context of inferences expected. For clinical trials designed to bring a drug to market, controlling type I error is extremely important and both the FDA and the European Medicines Agency have issued guidance on how to handle this [ 18 ]. For purely exploratory studies, controlling the type I error may be less important, but the possibility of false discovery should be acknowledged.
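
Where formal multiplicity control is planned, standard adjustments can be applied to the family of p-values; the raw p-values below are hypothetical.

```python
# Bonferroni/Holm adjustment of a family of p-values: a minimal sketch.
from statsmodels.stats.multitest import multipletests

pvals = [0.003, 0.012, 0.041, 0.20]  # hypothetical raw p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(reject)          # which hypotheses survive Holm's step-down procedure
print(p_adj.round(3))  # adjusted p-values
```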

Description of Preventing and Handling Missing Data Is Given

Missing data are often inevitable, especially in human studies. Poor handling of the missing data problem can introduce significant biases. The analysis plan should discuss anticipated missing data rates, unacceptable rates of missing data, and the rates that would merit in-depth sensitivity analyses.

If data are missing completely at random, analyses are generally unbiased, but this is very rarely the case. Under missing at random and missing not at random scenarios, imputation or advanced statistical methodology may be proposed, and the statistical reviewer should expect these to be clearly explained. If there are sensitivity analyses involving imputations, these also require explanation. While it is impractical to anticipate all possible missingness scenarios a priori, the biostatistical reviewer should at minimum determine (1) whether the protocol mentions anticipated missing data rate(s), (2) whether the anticipated rate(s) seem reasonable given the scenario and study population, (3) whether any missingness assumptions are merited, and (4) whether the analysis plans for imputation to explore multiple scenarios, allowing for a true sensitivity analysis.
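
One way to probe sensitivity to imputation, sketched here with simulated data, is to compare an available-case summary with an imputed one; scikit-learn's MICE-style IterativeImputer is a convenience choice for the illustration, not a recommendation from this guidance.

```python
# Comparing available-case and imputed estimates: a minimal sketch.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(size=200) < 0.15, 2] = np.nan   # ~15% missingness in column 3

available_case = np.nanmean(X[:, 2])         # mean using observed values only
imputed = IterativeImputer(random_state=0).fit_transform(X)
print(available_case, imputed[:, 2].mean())  # compare the two estimates
```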

Interim Analyses and Statistical Stopping Guidelines Are Clear and Justified

The term interim analysis often signifies simple interim descriptive statistics to monitor accrual rates, process measures, and adverse events. A study protocol should pre-specify plans for interim data monitoring in this regard, but there are seldom statistical implications associated with these types of analyses. A biostatistical reviewer should pay more attention when the protocol calls for an interim analysis that involves hypothesis testing. This may occur in clinical trials or prospective studies that use interim data looks to make decisions about adapting study features (such as sample size) in some way, or to make decisions to stop a study for either futility or efficacy. If the study calls for stopping rules, the criteria should be pre-specified in the study protocol. These may be in the form of efficacy or safety boundaries, or futility thresholds [ 48 – 50 ]. The biostatistical reviewer should note that to control the type I error rate for stopping for benefit, an interim analysis for these purposes may necessitate a more conservative significance level upon final statistical analysis. The protocol should ideally state that no formal interim analyses will be conducted or explain the terms of such analyses to include the timing, the frequency or total number of interim “looks” planned, and approach to controlling type I and type II errors.
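
As one concrete example of how interim looks consume type I error, the Lan-DeMets O'Brien-Fleming-type spending function can be evaluated at assumed information fractions; computing the actual stopping boundaries would take a further step.

```python
# Cumulative alpha spent at each interim look (O'Brien-Fleming-type spending).
from scipy.stats import norm

alpha = 0.05
za = norm.ppf(1 - alpha / 2)
for t in (0.25, 0.50, 0.75, 1.00):           # assumed information fractions
    spent = 2 * (1 - norm.cdf(za / t**0.5))  # Lan-DeMets OBF-type function
    print(f"information {t:.2f}: cumulative alpha spent = {spent:.4f}")
# Very little alpha is spent early, preserving most of it for the final look.
```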

Sample Size Justification

Type I and II error rates present for all sample size calculations and corresponding statistical tests.

It is common for investigators to use conventional values of type I error rate ( α = 0.05) and power (80%). However, there may be situations when more emphasis is put on controlling the type II error or the type I error. Phase II studies often aim to determine whether to proceed to a phase III confirmatory study rather than to determine whether a drug is efficacious. In this case, a significance level of 0.20 might be acceptable. For a large, confirmatory study, investigators focus on controlling the type I error, and it may be appropriate to set desired power to 90% or the significance level to 0.01. The biostatistical reviewer should evaluate the selected power and significance levels used to justify the sample size, including decisions to deviate from convention.
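
The sensitivity of the sample size to these conventions is easy to demonstrate; the sketch below assumes a two-sample t-test with a standardized effect size of 0.5.

```python
# Per-arm sample size under different alpha/power conventions: a sketch.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for alpha, power in [(0.05, 0.80), (0.20, 0.80), (0.01, 0.90)]:
    n = solver.solve_power(effect_size=0.5, alpha=alpha, power=power)
    print(f"alpha = {alpha:.2f}, power = {power:.2f}: n per arm = {n:.0f}")
# A phase II screening convention (alpha = 0.20) needs far fewer participants
# than a confirmatory convention (alpha = 0.01, 90% power).
```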

Parameter Assumptions Are Clearly Stated and Justified (i.e., Based on Previous Research and Consideration of the Population Studied)

All power and sample size calculations require a priori assumptions and information. The more complicated the analysis, the more parameter assumptions the investigators must specify and justify. The statistical reviewer should be able to use the parameter assumptions provided in any proposed study to replicate the sample size and power calculations (at least approximately). The justification of the assumed parameter values (e.g., median time-to-death, variance estimates, correlation estimates, control proportions) should be supported by prior studies or literature.
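For instance, if a protocol states a control proportion of 0.30, an intervention proportion of 0.15, a two-sided alpha of 0.05, and 80% power (all hypothetical values here), the reviewer should be able to reproduce the quoted sample size, as in the sketch below.

    # A sketch of replicating a protocol's sample size from its stated
    # parameter assumptions; the proportions are hypothetical.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    h = proportion_effectsize(0.30, 0.15)   # Cohen's h for the two proportions
    n = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80)
    print(f"required n per group ~ {n:.0f}")  # should match the protocol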

Additionally, common issues in research such as attrition, loss to follow-up, and withdrawal from the study can greatly affect the final sample size. Sample size justifications usually account for these issues by inflating enrollment beyond the sample size required to achieve the desired significance level and power. This helps ensure that the analytic sample size(s), after accounting for attrition and loss, resemble the required sample size determined in the a priori calculations.
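The inflation itself is simple arithmetic. Assuming, hypothetically, that 128 analyzable participants are required and 15% attrition is anticipated:

    # A sketch of inflating enrollment for anticipated attrition;
    # both inputs are hypothetical.
    import math

    n_required = 128                 # analytic n from the power calculation
    attrition = 0.15                 # anticipated loss to follow-up
    n_enroll = math.ceil(n_required / (1 - attrition))
    print(n_enroll)                  # 151 participants to enroll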

Statistical Tests Used in Sample Size Calculations Match Those Presented in the Statistical Analysis Plan, or the Reasoning for Deviating Is Appropriately Justified

It is critical that the statistical methods assumed for computing power or sample size match, as closely as possible, those that are proposed in the statistical analysis plan. If the primary statistical analysis methods are based on continuous outcomes, such as using a two-sample t-test, then the sample size justification should also assume the use of the two-sample t-test. If the primary statistical analysis methods are based on categorical outcomes, such as using the chi-squared test to compare proportions, then the sample size justification should also assume the use of the chi-squared test. A mismatch between the power and sample size calculations and the statistical approach essentially renders the estimates uninformative. Even after making assumptions about how the data are likely to be analyzed in practice, a biostatistical reviewer may be unable to infer even the gross accuracy of the estimates. In general, it may be acceptable to plan for a complicated analysis (e.g., an analysis of covariance adjusting for baseline) while basing sample size considerations on a simpler statistical method (e.g., a t-test). However, when there is sufficient information to replicate the data generation mechanism, simulation presents a straightforward solution to understanding the effect of design decisions on the sample size and is desirable when the inputs can be justified.
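A minimal version of such a simulation is sketched below: data are generated under hypothetical assumptions (64 per group, mean difference 1.0, SD 2.0), the planned test is applied repeatedly, and the rejection rate estimates power. The same template extends to more complicated planned analyses by swapping in the corresponding model.

    # A minimal simulation sketch for power under an assumed data
    # generation mechanism; all parameter values are hypothetical.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n, mu_diff, sd, alpha, reps = 64, 1.0, 2.0, 0.05, 5000
    hits = 0
    for _ in range(reps):
        a = rng.normal(mu_diff, sd, n)   # intervention arm
        b = rng.normal(0.0, sd, n)       # control arm
        if stats.ttest_ind(a, b).pvalue < alpha:   # the planned analysis
            hits += 1
    print(f"simulated power ~ {hits / reps:.2f}")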

Minimum Clinically Important Differences or Required Precision Described

Beyond the statistical approach, the factors with the most influence on the sample size calculation are the minimally important difference between interventions (or change) and the variability of the primary outcome variable. When the minimally important difference is put in the context of variability, the “effect size” can be estimated, and this drives the sample size justification. Investigators may propose the minimally important difference based upon clinically meaningful differences or upon biologically useful differences. When a protocol proposes a minimally important difference arbitrarily or based on observations in preliminary studies, the biostatistical reviewer should expect some justification that this quantity is biologically relevant and that it will help advance knowledge in the specific research area.

As with the minimally important difference, the protocol should carefully justify the expected distribution of the primary outcome(s). Preliminary studies may provide estimates of the distribution, but they often include only small samples in very controlled settings, and these samples may not represent the heterogeneity of the population of interest. Estimates of effect sizes or variability from the published literature may also be suspect for multiple reasons, including the use of populations different from that of the present study and publication bias. When investigators rely on estimates from these studies, the result may be a planned sample size smaller than is actually required to find the minimally important difference.

The biostatistical reviewer should recognize that sample size calculations are inherently approximate but should expect the study investigators to be realistic in estimating the parameters used in the calculations. It is helpful when the investigator provides a table or figure displaying ranges for these parameters, the power, the significance level, and the resulting sample size. This can be especially important for less common or novel study designs that may require additional parameter assumptions and considerations, such as the use of the intraclass correlation coefficient (ICC) to inflate the sample size to account for clustering or site effects; special considerations for the selection of a priori estimates of standard deviations or proportions; clearly stated margins of non-inferiority (for non-inferiority studies) or equivalence (for equivalence studies) [51]; and an accounting of potential interaction effects of interest between confounding variables.
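Such a sensitivity table is easy to generate. The sketch below varies a hypothetical outcome SD around a fixed minimally important difference and then applies the standard design effect for clustering, 1 + (m - 1) * ICC, for an assumed cluster size and ICC.

    # A sketch of a sample size sensitivity table over assumed parameters,
    # with the standard design-effect inflation for clustering.
    # The MID, SD range, cluster size, and ICC are all hypothetical.
    import math
    from statsmodels.stats.power import TTestIndPower

    solver = TTestIndPower()
    mid, m, icc = 1.0, 10, 0.02          # MID, cluster size, intraclass corr.
    for sd in (1.5, 2.0, 2.5):
        n = solver.solve_power(effect_size=mid / sd, alpha=0.05, power=0.80)
        n_clustered = math.ceil(n * (1 + (m - 1) * icc))   # design effect
        print(f"SD={sd}: n per group={math.ceil(n)}, "
              f"inflated for clustering={n_clustered}")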

Powering for subgroup analysis

Funders and regulatory agencies increasingly require that investigators explore treatment effects within subgroups. It is impossible to power a study to detect the minimally important effect in every possible subgroup, but it might be reasonable to power the study to detect interactions between the treatment effect and selected subgrouping variables, as might be done when testing for heterogeneity of treatment effects. More often, subgroup analyses are considered exploratory. In this case, the sample size required to observe a minimally important difference within a subgroup is of less importance; however, one may expect some discussion of the magnitude of difference that might be observed within the subgroup. Biostatistical reviewers should expect to see additional considerations whenever subgroup analyses are planned.
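The cost of powering for interactions is easy to underestimate. In a balanced 2 x 2 design, the interaction contrast has twice the standard error of a main-effect contrast, so detecting an interaction as large as the main effect requires roughly four times the total sample size, as the normal-approximation sketch below (with hypothetical inputs) illustrates.

    # A sketch of the ~4x total sample size needed to detect an
    # interaction of the same magnitude as a main effect, using the
    # normal approximation; delta and sd are hypothetical.
    from scipy import stats

    alpha, power, delta, sd = 0.05, 0.80, 1.0, 2.0
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    n_main = 4 * (sd / delta) ** 2 * z ** 2    # total N, two-arm comparison
    n_inter = 16 * (sd / delta) ** 2 * z ** 2  # total N, 2x2 interaction test
    print(f"total N: main effect ~ {n_main:.0f}, interaction ~ {n_inter:.0f}")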

Reporting and Reproducibility

Plans for data sharing and archiving are present.

Many United States federal agencies, including the NIH, now require sharing of data upon completion of the research. The protocol should describe the approach to sharing data publicly in accordance with governing rules. This process can be challenging, as the de-identification of data is not trivial. Providing data in a manner that encourages secondary use requires attention to the processes for gaining access and for managing and supporting requests. The biostatistical reviewer is well positioned to comment on the investigators' plans for curating the final dataset for public use.

Version Control or a Means of Ensuring Rigor, Transparency, and Reproducibility in Any Processes Is Evident

Every change in research protocols, analysis plans, and datasets is an opportunity for error. Protocols that specify the process for version control and change management are generally more rigorous and reproducible than those that do not.

Plan to Report Results According to Guidelines or Law

With the increased emphasis on transparency of research, there is a growing mandate to register and report clinical research in open databases, such as ClinicalTrials.gov. While this is primarily a regulatory concern, a biostatistical reviewer should be cognizant of the effort required and the timelines imposed for such reporting and expect these to be reflected in the protocol timeline and, if appropriate, the budget.

The Biostatistical Reviewer’s Additional Responsibility

Berger and Matthews stated that “Biostatistics is the discipline concerned with how we ought to make decisions when analyzing biomedical data. It is the evolving discipline concerned with formulating explicit rules to compensate both for the fallibility of human intuition in general and for biases in study design in particular” [52]. As such, the core of biostatistics is trying to uncover the truth. While some scientists are implicitly biased toward believing the alternative hypothesis to be true, a biostatistician’s perspective is appropriately one of equipoise. For example, at the root of basic frequentist statistical hypothesis testing lies the assumption that the null hypothesis (which is often of least interest to investigators in a field) is true. This perspective may lead to viewing the biostatistician in a reviewer role as a skeptic, when in reality they are necessarily neutral. This neutrality makes the biostatistician’s perspective helpful and often imperative in protocol review. As an impartial reviewer, and in keeping with the foundations of a biostatistician’s education and training, it is the biostatistician’s responsibility to (1) ensure sound study design and analyses and (2) be critical and look for flaws in study design that may result in invalid findings. Other content-specific reviewers may tend toward overly enthusiastic review of a research study, given the scientific significance of the proposed research or the lack of viable treatment options for an understudied disease. The biostatistician reviewer thus often provides a viewpoint that is further removed and more impartial, with the responsibility to preserve scientific rigor and integrity for all study protocols, regardless of the significance of the research. We note that a complete, impartial review may not always warrant the same level of feedback to investigators. For example, investigators submitting a grant for review will benefit from a thorough written critique with guidance, whereas for an institution considering joining a multicenter protocol, the statistical review may simply be a go/no-go statement.

As biostatistical reviewers tend to possess both specialized quantitative training and collaborative experiences, exposing them to a broad range of research across multiple disciplines, we view the biostatistician reviewer as an essential voice in any protocol review process. Biostatisticians often engage collaboratively across multiple research domains throughout the study lifecycle, not just review. Given this breadth and depth of involvement, a biostatistician can contrast a proposed study with successful approaches encountered in other disciplines. The biostatistician thus inherits the responsibility to cross-fertilize important methodologies.

A biostatistical reviewer, with sound and constructive critique of study protocols prior to their implementation, has the potential to prevent issues such as poor-quality data abstraction from medical records, high rates of loss to follow-up, lack of separation between treatment groups, insufficient blinding, failure to cleanly capture primary endpoints, and overly optimistic accrual expectations, among other preventable issues. Protocol review offers a chance to predict many such failures, thereby preventing research waste and unnecessary risks.

In this article, we have discussed components of a study protocol that a biostatistical reviewer (and, indeed, all reviewers) should evaluate when assessing whether a proposed study will answer the scientific question at hand. We posit that the biostatistical reviewer, through their breadth of engagement across multiple disciplines and experience with a broad range of research designs, can and should contribute significantly beyond review of the statistical analysis plan and sample size justification. Through careful scientific review, including biostatistical review as we outline here, we hope to prevent excess resource expenditure and risk to humans and animals on poorly planned studies.

Acknowledgments

This study was supported by the following Clinical and Translational Science Awards from the National Center for Advancing Translational Science: UL1TR001422 (J.D.C.), UL1TR002240 (C.S., S.K.), UL1TR001420 (W.T.A.), UL1 TR001450 (P.J.N), UL1TR003096 (R.A.O.), UL1TR002529 (S.M.P.), UL1 TR000002 (B.H.P.), UL1TR002553 (G.P.), UL1TR002544 (L.L.P), UL1 TR002243 (T.W.R, C.J.L), UL1TR001086 (T.D.T), UL1TR001439 (H.M.S.). Other NIH grant support includes NIAMS grant P30 AR072582, NIGMS grant U54-GM104941, and NIDDK P30 DK123704 (P.J.N). Its contents are solely the responsibility of the authors and do not necessarily represent official views of the National Center for Advancing Translational Sciences or the National Institutes of Health.

Disclosures

The authors have no conflicts of interest to declare.
