Writing an Introduction for a Scientific Paper

Dr. Michelle Harris, Dr. Janet Batzli, Biocore.

This section provides guidelines on how to construct a solid introduction to a scientific paper, including background information, study question, biological rationale, hypothesis, and general approach. If the Introduction is done well, there should be no question in the reader’s mind about why and on what basis you have posed a specific hypothesis.

Broad Question: based on an initial observation (e.g., “I see a lot of guppies close to the shore. Do guppies like living in shallow water?”). This observation of the natural world may inspire you to investigate background literature, or your observation could be based on previous research by others or your own pilot study. Broad questions are not always included in your written text, but are essential for establishing the direction of your research.

Background Information: key issues, concepts, terminology, and definitions needed to understand the biological rationale for the experiment. It often includes a summary of findings from previous, relevant studies. Remember to cite references, be concise, and only include relevant information given your audience and your experimental design. Concisely summarized background information leads to the identification of specific scientific knowledge gaps that still exist. (e.g., “No studies to date have examined whether guppies do indeed spend more time in shallow water.”)

Testable Question: these questions are much more focused than the initial broad question, are specific to the knowledge gap identified, and can be addressed with data. (e.g., “Do guppies spend different amounts of time in water <1 meter deep as compared to their time in water that is >1 meter deep?”)

Biological Rationale: describes the purpose of your experiment, distilling what is known and what is not known in order to define the knowledge gap that you are addressing. The “BR” provides the logic for your hypothesis and experimental approach, describing the biological mechanism and assumptions that explain why your hypothesis should be true.

The biological rationale is based on your interpretation of the scientific literature, your personal observations, and the underlying assumptions you are making about how you think the system works. If you have written your biological rationale well, your reader should see your hypothesis in your introduction section and say to themselves, “Of course, this hypothesis seems very logical based on the rationale presented.”

  • A thorough rationale defines your assumptions about the system that have not been revealed in scientific literature or from previous systematic observation. These assumptions drive the direction of your specific hypothesis or general predictions.
  • Defining the rationale is probably the most critical task for a writer, as it tells your reader why your research is biologically meaningful. It may help to think about the rationale as an answer to the questions— how is this investigation related to what we know, what assumptions am I making about what we don’t yet know, AND how will this experiment add to our knowledge? *There may or may not be broader implications for your study; be careful not to overstate these (see note on social justifications below).
  • Expect to spend time and mental effort on this. You may have to do considerable digging into the scientific literature to define how your experiment fits into what is already known and why it is relevant to pursue.
  • Be open to the possibility that as you work with and think about your data, you may develop a deeper, more accurate understanding of the experimental system. You may find the original rationale needs to be revised to reflect your new, more sophisticated understanding.
  • As you progress through Biocore and upper level biology courses, your rationale should become more focused and matched with the level of study, e.g., the cellular, biochemical, or physiological mechanisms that underlie the rationale. Achieving this type of understanding takes effort, but it will lead to better communication of your science.

***Special note on avoiding social justifications: You should not overemphasize the relevance of your experiment and the possible connections to large-scale processes. Be realistic and logical; do not overgeneralize or state grand implications that are not sensible given the structure of your experimental system. Not all science is easily applied to improving the human condition. Performing an investigation just for the sake of adding to our scientific knowledge (“pure or basic science”) is just as important as applied science. In fact, basic science often provides the foundation for applied studies.

Hypothesis / Predictions: specific prediction(s) that you will test during your experiment. For manipulative experiments, the hypothesis should include the independent variable (what you manipulate), the dependent variable(s) (what you measure), the organism or system, the direction of your results, and the comparison to be made.

We hypothesized that Daphnia reared in warm water will have a greater sexual mating response.

(The dependent variable “sexual response” has not been defined enough to be able to make this hypothesis testable or falsifiable. In addition, no comparison has been specified— greater sexual mating response as compared to what?)

We hypothesized that Daphnia (organism) reared in warm water temperatures ranging from 25-28°C (independent variable) would produce greater (direction) numbers of male offspring and females carrying haploid egg sacs (dependent variables) than Daphnia reared in cooler water temperatures of 18-22°C.

If you are doing a systematic observation, your hypothesis presents a variable or set of variables that you predict are important for helping you characterize the system as a whole, or predict differences between components/areas of the system that help you explain how the system functions or changes over time.

We hypothesize that the frequency and extent of algal blooms in Lake Mendota over the last 10 years cause fish kills and impose a human health risk.

(The variables “frequency and extent of algal blooms,” “fish kills” and “human health risk” have not been defined enough to make this hypothesis testable or falsifiable. How do you measure algal blooms? Although implied, the hypothesis should express the predicted direction of the expected results [e.g., higher frequency associated with greater kills]. Note that cause and effect cannot be implied without a controlled, manipulative experiment.)

We hypothesize that increasing cell densities of algae in Lake Mendota over the last 10 years are correlated with (1) increased numbers of dead fish washed up on Madison beaches and (2) increased numbers of reported hospital/clinical visits following full-body exposure to lake water.

Experimental Approach: briefly gives the reader a general sense of the experiment, the type of data it will yield, and the kind of conclusions you expect to obtain from the data. Do not confuse the experimental approach with the experimental protocol. The experimental protocol consists of the detailed step-by-step procedures and techniques used during the experiment that are to be reported in the Methods and Materials section.

Some Final Tips on Writing an Introduction

  • As you progress through the Biocore sequence, for instance, from the organismal level of Biocore 301/302 to the cellular level in Biocore 303/304, we expect the contents of your “Introduction” paragraphs to reflect the level of your coursework and previous writing experience. For example, in Biocore 304 (Cell Biology Lab) the biological rationale should draw upon assumptions we are making about cellular and biochemical processes.
  • Be Concise yet Specific: Remember to be concise and only include relevant information given your audience and your experimental design. As you write, keep asking, “Is this necessary information or is this irrelevant detail?” For example, if you are writing a paper claiming that a certain compound is a competitive inhibitor to the enzyme alkaline phosphatase and acts by binding to the active site, you need to explain (briefly) Michaelis-Menten kinetics and the meaning and significance of Km and Vmax (a reminder of the equation follows this list). This explanation is not necessary if you are reporting the dependence of enzyme activity on pH, because you do not need to measure Km and Vmax to get an estimate of enzyme activity.
  • Another example: if you are writing a paper reporting an increase in Daphnia magna heart rate upon exposure to caffeine, you need not describe the reproductive cycle of D. magna unless it is germane to your results and discussion. Be specific and concrete, especially when making introductory or summary statements.
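For reference, the Michaelis-Menten relationship mentioned above can be summarized in a single equation. This is the standard textbook form, shown here only as a reminder of what Km and Vmax mean:

v = \frac{V_{\max}\,[S]}{K_m + [S]}

where v is the initial reaction rate, [S] is the substrate concentration, Vmax is the maximum rate reached at saturating substrate, and Km is the substrate concentration at which v = Vmax/2. A competitive inhibitor raises the apparent Km while leaving Vmax unchanged, which is why these two parameters matter for the inhibitor example above.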

Where Do You Discuss Pilot Studies? Many times it is important to do pilot studies to help you get familiar with your experimental system or to improve your experimental design. If your pilot study influences your biological rationale or hypothesis, you need to describe it in your Introduction. If your pilot study simply informs the logistics or techniques, but does not influence your rationale, then the description of your pilot study belongs in the Materials and Methods section.  

Example Introduction from an Intro Ecology Lab:

         Researchers studying global warming predict an increase in average global temperature of 1.3°C in the next 10 years (Seetwo 2003). Daphnia are small zooplankton that live in freshwater inland lakes. They are filter-feeding crustaceans with a transparent exoskeleton that allows easy observation of heart rate and digestive function. Thomas et al. (2001) found that Daphnia heart rate increases significantly in higher water temperatures. Daphnia are also thought to switch their mode of reproduction from asexual to sexual in response to extreme temperatures. Gender is not mediated by genetics, but by the environment. Therefore, Daphnia reproduction may be sensitive to increased temperatures resulting from global warming (maybe a question?) and may serve as a good environmental indicator for global climate change.

         In this experiment we hypothesized that Daphnia reared in warm water will switch from an asexual to a sexual mode of reproduction. In order to prove this hypothesis correct we observed Daphnia grown in warm and cold water and counted the number of males observed after 10 days.

Comments:

Background information

·       Good to recognize Daphnia as a model organism from which some general conclusions can be made about the quality of the environment; however, no attempt is made to connect increased lake temperatures and gender. Make this link early on to increase focus.

·       Connection to global warming is too far-reaching. The first sentence gives the impression that global warming is the topic of this paper. Changes associated with global warming are not well known and therefore little can be concluded about the use of Daphnia as an indicator species.

·       Information about heart rate is unnecessary because heart rate is not being tested in this experiment.

Rationale

·       Rationale is missing; how is this study related to what we know about D. magna survivorship and reproduction as related to water temperature, and how will this experiment contribute to our knowledge of the system?

·       Think about the ecosystem in which this organism lives and the context. Under what conditions would D. magna be in a body of water with elevated temperatures?

Hypothesis

·       Not falsifiable; variables need to be better defined (state temperatures or range tested rather than “warm” or “cold”) and predict direction and magnitude of change in number of males after 10 days.

·       It is unclear what comparison will be made or what the control is

·       What dependent variable will be measured to determine “switch” in mode of reproduction (what criteria are definitive for switch?)

Approach

·       Hypotheses cannot be “proven” correct. They are either supported or rejected.

Introduction

         Daphnia are small zooplankton found in freshwater inland lakes and are thought to switch their mode of reproduction from asexual to sexual in response to extreme temperatures (Mitchell 1999). Lakes containing Daphnia have an average summer surface temperature of 20°C (Harper 1995), but this may increase by more than 15% when exposed to warm water effluent from power plants, paper mills, and chemical industry (Baker et al. 2000). Could an increase in lake temperature caused by industrial thermal pollution affect the survivorship and reproduction of Daphnia?

         The sex of Daphnia is mediated by the environment rather than genetics. Under optimal environmental conditions, Daphnia populations consist of asexually reproducing females. When the environment shifts, Daphnia may be cued to reproduce sexually, resulting in the production of male offspring and females carrying haploid eggs in sacs called ephippia (Mitchell 1999).

         The purpose of this laboratory study is to examine the effects of increased water temperature on Daphnia survivorship and reproduction. This study will help us characterize the magnitude of environmental change required to induce the onset of the sexual life cycle in Daphnia. Because Daphnia are known to be a sensitive environmental indicator species (Baker et al. 2000) and share similar structural and physiological features with many aquatic species, they serve as a good model for examining the effects of increasing water temperature on reproduction in a variety of aquatic invertebrates.

         We hypothesized that Daphnia populations reared in water temperatures ranging from 24-26°C would have lower survivorship, a higher male/female ratio among the offspring, and more female offspring carrying ephippia as compared with Daphnia grown in water temperatures of 20-22°C. To test this hypothesis we reared Daphnia populations in tanks containing water at either 24 +/- 2°C or 20 +/- 2°C. Over 10 days, we monitored survivorship, determined the sex of the offspring, and counted the number of female offspring containing ephippia.

Comments:

Background information

·       Opening paragraph provides good focus immediately. The study organism, gender switching response, and temperature influence are mentioned in the first sentence. Although it does a good job documenting average lake water temperature and changes due to industrial run-off, it fails to make an argument that the 15% increase in lake temperature could be considered “extreme” temperature change.

·       The study question is nicely embedded within relevant, well-cited background information. Alternatively, it could be stated as the first sentence in the introduction, or after all background information has been discussed before the hypothesis.

Rationale

·       Good. Well-defined purpose for the study: to examine the degree of environmental change necessary to induce the Daphnia sexual life cycle.

How will introductions be evaluated? The following is part of the rubric we will be using to evaluate your papers.

 

The rubric uses a five-point scale: 0 = inadequate (C, D or F); 1 = adequate (BC); 2 = good (B); 3 = very good (AB); 4 = excellent (A).

Introduction (BIG PICTURE): Did the Intro convey why the experiment was performed and what it was designed to test?

0 (inadequate): Introduction provides little to no relevant information. (This often results in a hypothesis that “comes out of nowhere.”)

1 (adequate): Many key components are very weak or missing; those stated are unclear and/or are not stated concisely. Weak/missing components make it difficult to follow the rest of the paper. e.g., background information is not focused on a specific question and minimal biological rationale is presented, such that the hypothesis isn’t entirely logical.

2 (good): Covers most key components but could be done much more logically, clearly, and/or concisely. e.g., biological rationale not fully developed but still supports the hypothesis. Remaining components are done reasonably well, though there is still room for improvement.

3 (very good): Concisely and clearly covers all but one key component (with the exception of rationale; see the description for 2 = good), OR clearly covers all key components but could be a little more concise and/or clear. e.g., has done a reasonably nice job with the Intro but fails to state the approach, OR has done a nice job with the Intro but has also included some irrelevant background information.

4 (excellent): Clearly, concisely, and logically presents all key components: relevant and correctly cited background information, question, biological rationale, hypothesis, approach.

What is the Scientific Method: How does it work and why is it important?

The scientific method is a systematic process involving steps like defining questions, forming hypotheses, conducting experiments, and analyzing data. It minimizes biases and enables replicable research, leading to groundbreaking discoveries like Einstein's theory of relativity, penicillin, and the structure of DNA. This ongoing approach promotes reason, evidence, and the pursuit of truth in science.

Updated on November 18, 2023


Beginning in elementary school, we are exposed to the scientific method and taught how to put it into practice. As a tool for learning, it prepares children to think logically and use reasoning when seeking answers to questions.

Rather than jumping to conclusions, the scientific method gives us a recipe for exploring the world through observation and trial and error. We use it regularly, sometimes knowingly in academics or research, and sometimes subconsciously in our daily lives.

In this article we will refresh our memories on the particulars of the scientific method, discussing where it comes from, which elements comprise it, and how it is put into practice. Then, we will consider the importance of the scientific method, who uses it and under what circumstances.

What is the scientific method?

The scientific method is a dynamic process that involves objectively investigating questions through observation and experimentation. Applicable to all scientific disciplines, this systematic approach to answering questions is more accurately described as a flexible set of principles than as a fixed series of steps.

The following representations of the scientific method illustrate how it can be both condensed into broad categories and also expanded to reveal more and more details of the process. These graphics capture the adaptability that makes this concept universally valuable as it is relevant and accessible not only across age groups and educational levels but also within various contexts.

[Figure: a graph of the scientific method]

Steps in the scientific method

While the scientific method is versatile in form and function, it encompasses a collection of principles that create a logical progression to the process of problem solving:

  • Define a question: Constructing a clear and precise problem statement that identifies the main question or goal of the investigation is the first step. The wording must lend itself to experimentation by posing a question that is both testable and measurable.
  • Gather information and resources: Researching the topic in question to find out what is already known and what types of related questions others are asking is the next step in this process. This background information is vital to gaining a full understanding of the subject and in determining the best design for experiments.
  • Form a hypothesis: Composing a concise statement that identifies specific variables and potential results, which can then be tested, is a crucial step that must be completed before any experimentation. An imperfection in the composition of a hypothesis can result in weaknesses to the entire design of an experiment.
  • Perform the experiments: Testing the hypothesis by performing replicable experiments and collecting the resultant data is another fundamental step of the scientific method. By controlling some elements of an experiment while purposely manipulating others, researchers establish cause and effect relationships.
  • Analyze the data: Interpreting the experimental process and results by recognizing trends in the data is a necessary step for comprehending its meaning and supporting the conclusions. Drawing inferences through this systematic process lends substantive evidence for either supporting or rejecting the hypothesis (a minimal illustration follows this list).
  • Report the results: Sharing the outcomes of an experiment, through an essay, presentation, graphic, or journal article, is often regarded as a final step in this process. Detailing the project's design, methods, and results not only promotes transparency and replicability but also adds to the body of knowledge for future research.
  • Retest the hypothesis: Repeating experiments to see if a hypothesis holds up in all cases is a step that is manifested through varying scenarios. Sometimes a researcher immediately checks their own work or replicates it at a future time, or another researcher will repeat the experiments to further test the hypothesis.
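To make the analysis step concrete, the sketch below shows one common way the “analyze the data” idea plays out in practice: comparing a control group to a treatment group with a significance test. It is a minimal, hypothetical example written in Python using the scipy library; the variable names, measurements, and the 0.05 threshold are illustrative assumptions, not taken from any particular study.

    # Minimal sketch: comparing a control group and a treatment group.
    # All numbers below are made-up illustrative values.
    from scipy import stats

    control = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]    # hypothetical control measurements
    treatment = [4.9, 5.1, 4.7, 5.3, 4.8, 5.0]  # hypothetical treatment measurements

    # Welch's t-test compares the two group means without assuming equal variances.
    result = stats.ttest_ind(treatment, control, equal_var=False)

    alpha = 0.05  # conventional significance threshold (an assumption, not a rule)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
    if result.pvalue < alpha:
        print("The observed difference is unlikely under the null hypothesis; the data support a real difference.")
    else:
        print("The data do not provide strong evidence against the null hypothesis.")

A single test like this evaluates only one comparison; the later steps above (reporting and retesting) are what give such a conclusion weight.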

[Figure: a chart of the scientific method]

Where did the scientific method come from?

Oftentimes, ancient peoples attempted to answer questions about the unknown by:

  • Making simple observations
  • Discussing the possibilities with others deemed worthy of a debate
  • Drawing conclusions based on dominant opinions and preexisting beliefs

For example, take Greek and Roman mythology. Myths were used to explain everything from the seasons and stars to the sun and death itself.

However, as societies began to grow through advancements in agriculture and language, ancient civilizations like Egypt and Babylonia shifted to a more rational analysis for understanding the natural world. They increasingly employed empirical methods of observation and experimentation that would one day evolve into the scientific method . 

In the 4th century BCE, Aristotle, considered the Father of Science by many, suggested these elements, which closely resemble the contemporary scientific method, as part of his approach for conducting science:

  • Study what others have written about the subject.
  • Look for the general consensus about the subject.
  • Perform a systematic study of everything even partially related to the topic.

[Figure: a pyramid of the scientific method]

By continuing to emphasize systematic observation and controlled experiments, scholars such as Al-Kindi and Ibn al-Haytham helped expand this concept throughout the Islamic Golden Age . 

In his 1620 treatise, Novum Organum , Sir Francis Bacon codified the scientific method, arguing not only that hypotheses must be tested through experiments but also that the results must be replicated to establish a truth. Coming at the height of the Scientific Revolution, this text made the scientific method accessible to European thinkers like Galileo and Isaac Newton who then put the method into practice.

As science modernized in the 19th century, the scientific method became more formalized, leading to significant breakthroughs in fields such as evolution and germ theory. Today, it continues to evolve, underpinning scientific progress in diverse areas like quantum mechanics, genetics, and artificial intelligence.

Why is the scientific method important?

The history of the scientific method illustrates how the concept developed out of a need to find objective answers to scientific questions by overcoming biases based on fear, religion, power, and cultural norms. This still holds true today.

By implementing this standardized approach to conducting experiments, the impacts of researchers’ personal opinions and preconceived notions are minimized. The organized manner of the scientific method prevents these and other mistakes while promoting the replicability and transparency necessary for solid scientific research.

The importance of the scientific method is best observed through its successes, for example: 

  • “Albert Einstein stands out among modern physicists as the scientist who not only formulated a theory of revolutionary significance but also had the genius to reflect in a conscious and technical way on the scientific method he was using.” Devising a hypothesis based on the prevailing understanding of Newtonian physics eventually led Einstein to the theory of general relativity.
  • Howard Florey: “Perhaps the most useful lesson which has come out of the work on penicillin has been the demonstration that success in this field depends on the development and coordinated use of technical methods.” After discovering a mold that prevented the growth of Staphylococcus bacteria, Dr. Alexander Fleming designed experiments to identify and reproduce it in the lab, thus leading to the development of penicillin.
  • James D. Watson: “Every time you understand something, religion becomes less likely. Only with the discovery of the double helix and the ensuing genetic revolution have we had grounds for thinking that the powers held traditionally to be the exclusive property of the gods might one day be ours. . . .” By using wire models to conceive a structure for DNA, Watson and Crick crafted a testable hypothesis that combined base-pairing chemistry, X-ray diffraction images, and current research in atomic physics, resulting in the discovery of DNA’s double helix structure.

Final thoughts

As the cases exemplify, the scientific method is never truly completed, but rather started and restarted. It gave these researchers a structured process that was easily replicated, modified, and built upon. 

While the scientific method may “end” in one context, it never literally ends. When a hypothesis, design, methods, and experiments are revisited, the scientific method simply picks up where it left off. Each time a researcher builds upon previous knowledge, the scientific method is restored with the pieces of past efforts.

By guiding researchers towards objective results based on transparency and reproducibility, the scientific method acts as a defense against bias, superstition, and preconceived notions. As we embrace the scientific method's enduring principles, we ensure that our quest for knowledge remains firmly rooted in reason, evidence, and the pursuit of truth.

The AJE Team



Scientific Method

Science is an enormously successful human enterprise. The study of scientific method is the attempt to discern the activities by which that success is achieved. Among the activities often identified as characteristic of science are systematic observation and experimentation, inductive and deductive reasoning, and the formation and testing of hypotheses and theories. How these are carried out in detail can vary greatly, but characteristics like these have been looked to as a way of demarcating scientific activity from non-science, where only enterprises which employ some canonical form of scientific method or methods should be considered science (see also the entry on science and pseudo-science ). Others have questioned whether there is anything like a fixed toolkit of methods which is common across science and only science. Some reject privileging one view of method as part of rejecting broader views about the nature of science, such as naturalism (Dupré 2004); some reject any restriction in principle (pluralism).

Scientific method should be distinguished from the aims and products of science, such as knowledge, predictions, or control. Methods are the means by which those goals are achieved. Scientific method should also be distinguished from meta-methodology, which includes the values and justifications behind a particular characterization of scientific method (i.e., a methodology) — values such as objectivity, reproducibility, simplicity, or past successes. Methodological rules are proposed to govern method and it is a meta-methodological question whether methods obeying those rules satisfy given values. Finally, method is distinct, to some degree, from the detailed and contextual practices through which methods are implemented. The latter might range over: specific laboratory techniques; mathematical formalisms or other specialized languages used in descriptions and reasoning; technological or other material means; ways of communicating and sharing results, whether with other scientists or with the public at large; or the conventions, habits, enforced customs, and institutional controls over how and what science is carried out.

While it is important to recognize these distinctions, their boundaries are fuzzy. Hence, accounts of method cannot be entirely divorced from their methodological and meta-methodological motivations or justifications. Moreover, each aspect plays a crucial role in identifying methods. Disputes about method have therefore played out at the detail, rule, and meta-rule levels. Changes in beliefs about the certainty or fallibility of scientific knowledge, for instance (which is a meta-methodological consideration of what we can hope for methods to deliver), have meant different emphases on deductive and inductive reasoning, or on the relative importance attached to reasoning over observation (i.e., differences over particular methods). Beliefs about the role of science in society will affect the place one gives to values in scientific method.

The issue which has shaped debates over scientific method the most in the last half century is the question of how pluralist we need to be about method. Unificationists continue to hold out for one method essential to science; nihilism is a form of radical pluralism, which considers the effectiveness of any methodological prescription to be so context sensitive as to render it not explanatory on its own. Some middle degree of pluralism regarding the methods embodied in scientific practice seems appropriate. But the details of scientific practice vary with time and place, from institution to institution, across scientists and their subjects of investigation. How significant are the variations for understanding science and its success? How much can method be abstracted from practice? This entry describes some of the attempts to characterize scientific method or methods, as well as arguments for a more context-sensitive approach to methods embedded in actual scientific practices.

1. Overview and organizing themes


This entry could have been given the title Scientific Methods and gone on to fill volumes, or it could have been extremely short, consisting of a brief summary rejection of the idea that there is any such thing as a unique Scientific Method at all. Both unhappy prospects are due to the fact that scientific activity varies so much across disciplines, times, places, and scientists that any account which manages to unify it all will either consist of overwhelming descriptive detail, or trivial generalizations.

The choice of scope for the present entry is more optimistic, taking a cue from the recent movement in philosophy of science toward a greater attention to practice: to what scientists actually do. This “turn to practice” can be seen as the latest form of studies of methods in science, insofar as it represents an attempt at understanding scientific activity, but through accounts that are neither meant to be universal and unified, nor singular and narrowly descriptive. To some extent, different scientists at different times and places can be said to be using the same method even though, in practice, the details are different.

Whether the context in which methods are carried out is relevant, or to what extent, will depend largely on what one takes the aims of science to be and what one’s own aims are. For most of the history of scientific methodology the assumption has been that the most important output of science is knowledge and so the aim of methodology should be to discover those methods by which scientific knowledge is generated.

Science was seen to embody the most successful form of reasoning (but which form?) to the most certain knowledge claims (but how certain?) on the basis of systematically collected evidence (but what counts as evidence, and should the evidence of the senses take precedence, or rational insight?) Section 2 surveys some of the history, pointing to two major themes. One theme is seeking the right balance between observation and reasoning (and the attendant forms of reasoning which employ them); the other is how certain scientific knowledge is or can be.

Section 3 turns to 20th century debates on scientific method. In the second half of the 20th century the epistemic privilege of science faced several challenges and many philosophers of science abandoned the reconstruction of the logic of scientific method. Views changed significantly regarding which functions of science ought to be captured and why. For some, the success of science was better identified with social or cultural features. Historical and sociological turns in the philosophy of science were made, with a demand that greater attention be paid to the non-epistemic aspects of science, such as sociological, institutional, material, and political factors. Even outside of those movements there was an increased specialization in the philosophy of science, with more and more focus on specific fields within science. The combined upshot was very few philosophers arguing any longer for a grand unified methodology of science. Sections 3 and 4 survey the main positions on scientific method in 20th century philosophy of science, focusing on where they differ in their preference for confirmation or falsification or for waiving the idea of a special scientific method altogether.

In recent decades, attention has primarily been paid to scientific activities traditionally falling under the rubric of method, such as experimental design and general laboratory practice, the use of statistics, the construction and use of models and diagrams, interdisciplinary collaboration, and science communication. Sections 4–6 attempt to construct a map of the current domains of the study of methods in science.

As these sections illustrate, the question of method is still central to the discourse about science. Scientific method remains a topic for education, for science policy, and for scientists. It arises in the public domain where the demarcation or status of science is at issue. Some philosophers have recently returned, therefore, to the question of what it is that makes science a unique cultural product. This entry will close with some of these recent attempts at discerning and encapsulating the activities by which scientific knowledge is achieved.

2. Historical review: Aristotle to Mill

Attempting a history of scientific method compounds the vast scope of the topic. This section briefly surveys the background to modern methodological debates. What can be called the classical view goes back to antiquity, and represents a point of departure for later divergences. [ 1 ]

We begin with a point made by Laudan (1968) in his historical survey of scientific method:

Perhaps the most serious inhibition to the emergence of the history of theories of scientific method as a respectable area of study has been the tendency to conflate it with the general history of epistemology, thereby assuming that the narrative categories and classificatory pigeon-holes applied to the latter are also basic to the former. (1968: 5)

To see knowledge about the natural world as falling under knowledge more generally is an understandable conflation. Histories of theories of method would naturally employ the same narrative categories and classificatory pigeon holes. An important theme of the history of epistemology, for example, is the unification of knowledge, a theme reflected in the question of the unification of method in science. Those who have identified differences in kinds of knowledge have often likewise identified different methods for achieving that kind of knowledge (see the entry on the unity of science ).

Different views on what is known, how it is known, and what can be known are connected. Plato distinguished the realms of things into the visible and the intelligible ( The Republic , 510a, in Cooper 1997). Only the latter, the Forms, could be objects of knowledge. The intelligible truths could be known with the certainty of geometry and deductive reasoning. What could be observed of the material world, however, was by definition imperfect and deceptive, not ideal. The Platonic way of knowledge therefore emphasized reasoning as a method, downplaying the importance of observation. Aristotle disagreed, locating the Forms in the natural world as the fundamental principles to be discovered through the inquiry into nature ( Metaphysics Z , in Barnes 1984).

Aristotle is recognized as giving the earliest systematic treatise on the nature of scientific inquiry in the western tradition, one which embraced observation and reasoning about the natural world. In the Prior and Posterior Analytics , Aristotle reflects first on the aims and then the methods of inquiry into nature. A number of features can be found which are still considered by most to be essential to science. For Aristotle, empiricism, careful observation (but passive observation, not controlled experiment), is the starting point. The aim is not merely recording of facts, though. For Aristotle, science ( epistêmê ) is a body of properly arranged knowledge or learning—the empirical facts, but also their ordering and display are of crucial importance. The aims of discovery, ordering, and display of facts partly determine the methods required of successful scientific inquiry. Also determinant is the nature of the knowledge being sought, and the explanatory causes proper to that kind of knowledge (see the discussion of the four causes in the entry on Aristotle on causality ).

In addition to careful observation, then, scientific method requires a logic as a system of reasoning for properly arranging, but also inferring beyond, what is known by observation. Methods of reasoning may include induction, prediction, or analogy, among others. Aristotle’s system (along with his catalogue of fallacious reasoning) was collected under the title the Organon. This title would be echoed in later works on scientific reasoning, such as Novum Organum by Francis Bacon, and Novum Organon Renovatum by William Whewell (see below). In Aristotle’s Organon reasoning is divided primarily into two forms, a rough division which persists into modern times. The division, known most commonly today as deductive versus inductive method, appears in other eras and methodologies as analysis/synthesis, non-ampliative/ampliative, or even confirmation/verification. The basic idea is there are two “directions” to proceed in our methods of inquiry: one away from what is observed, to the more fundamental, general, and encompassing principles; the other, from the fundamental and general to instances or implications of principles.

The basic aim and method of inquiry identified here can be seen as a theme running throughout the next two millennia of reflection on the correct way to seek after knowledge: carefully observe nature and then seek rules or principles which explain or predict its operation. The Aristotelian corpus provided the framework for a commentary tradition on scientific method independent of science itself (cosmos versus physics). During the medieval period, figures such as Albertus Magnus (1206–1280), Thomas Aquinas (1225–1274), Robert Grosseteste (1175–1253), Roger Bacon (1214/1220–1292), William of Ockham (1287–1347), Andreas Vesalius (1514–1564), and Giacomo Zabarella (1533–1589) all worked to clarify the kind of knowledge obtainable by observation and induction, the source of justification of induction, and best rules for its application. [ 2 ] Many of their contributions we now think of as essential to science (see also Laudan 1968). As Aristotle and Plato had employed a framework of reasoning either “to the forms” or “away from the forms”, medieval thinkers employed directions away from the phenomena or back to the phenomena. In analysis, a phenomenon was examined to discover its basic explanatory principles; in synthesis, explanations of a phenomenon were constructed from first principles.

During the Scientific Revolution these various strands of argument, experiment, and reason were forged into a dominant epistemic authority. The 16th–18th centuries were a period of not only dramatic advance in knowledge about the operation of the natural world—advances in mechanical, medical, biological, political, economic explanations—but also of self-awareness of the revolutionary changes taking place, and intense reflection on the source and legitimation of the method by which the advances were made. The struggle to establish the new authority included methodological moves. The Book of Nature, according to the metaphor of Galileo Galilei (1564–1642) or Francis Bacon (1561–1626), was written in the language of mathematics, of geometry and number. This motivated an emphasis on mathematical description and mechanical explanation as important aspects of scientific method. Through figures such as Henry More and Ralph Cudworth, a neo-Platonic emphasis on the importance of metaphysical reflection on nature behind appearances, particularly regarding the spiritual as a complement to the purely mechanical, remained an important methodological thread of the Scientific Revolution (see the entries on Cambridge platonists; Boyle; Henry More; Galileo).

In Novum Organum (1620), Bacon was critical of the Aristotelian method for leaping from particulars to universals too quickly. The syllogistic form of reasoning readily mixed those two types of propositions. Bacon aimed at the invention of new arts, principles, and directions. His method would be grounded in methodical collection of observations, coupled with correction of our senses (and particularly, directions for the avoidance of the Idols, as he called them, kinds of systematic errors to which naïve observers are prone.) The community of scientists could then climb, by a careful, gradual and unbroken ascent, to reliable general claims.

Bacon’s method has been criticized as impractical and too inflexible for the practicing scientist. Whewell would later criticize Bacon for paying too little attention to the practices of scientists. It is hard to find convincing examples of Bacon’s method being put into practice in the history of science, but there are a few who have been held up as real examples of 17th century scientific, inductive method, even if not in the rigid Baconian mold: figures such as Robert Boyle (1627–1691) and William Harvey (1578–1657) (see the entry on Bacon).

It is to Isaac Newton (1642–1727), however, that historians of science and methodologists have paid greatest attention. Given the enormous success of his Principia Mathematica and Opticks , this is understandable. The study of Newton’s method has had two main thrusts: the implicit method of the experiments and reasoning presented in the Opticks, and the explicit methodological rules given as the Rules for Philosophising (the Regulae) in Book III of the Principia . [ 3 ] Newton’s law of gravitation, the linchpin of his new cosmology, broke with explanatory conventions of natural philosophy, first for apparently proposing action at a distance, but more generally for not providing “true”, physical causes. The argument for his System of the World ( Principia , Book III) was based on phenomena, not reasoned first principles. This was viewed (mainly on the continent) as insufficient for proper natural philosophy. The Regulae counter this objection, re-defining the aims of natural philosophy by re-defining the method natural philosophers should follow. (See the entry on Newton’s philosophy .)

To his list of methodological prescriptions should be added Newton’s famous phrase “hypotheses non fingo” (commonly translated as “I frame no hypotheses”). The scientist was not to invent systems but infer explanations from observations, as Bacon had advocated. This would come to be known as inductivism. In the century after Newton, significant clarifications of the Newtonian method were made. Colin Maclaurin (1698–1746), for instance, reconstructed the essential structure of the method as having complementary analysis and synthesis phases, one proceeding away from the phenomena in generalization, the other from the general propositions to derive explanations of new phenomena. Denis Diderot (1713–1784) and editors of the Encyclopédie did much to consolidate and popularize Newtonianism, as did Francesco Algarotti (1712–1764). The emphasis was often the same, as much on the character of the scientist as on their process, a character which is still commonly assumed. The scientist is humble in the face of nature, not beholden to dogma, obeys only his eyes, and follows the truth wherever it leads. It was certainly Voltaire (1694–1778) and du Châtelet (1706–1749) who were most influential in propagating the latter vision of the scientist and their craft, with Newton as hero. Scientific method became a revolutionary force of the Enlightenment. (See also the entries on Newton, Leibniz, Descartes, Boyle, Hume, enlightenment, as well as Shank 2008 for a historical overview.)

Not all 18th century reflections on scientific method were so celebratory. Famous also are George Berkeley’s (1685–1753) attack on the mathematics of the new science, as well as the over-emphasis of Newtonians on observation; and David Hume’s (1711–1776) undermining of the warrant offered for scientific claims by inductive justification (see the entries on: George Berkeley; David Hume; Hume’s Newtonianism and Anti-Newtonianism). Hume’s problem of induction motivated Immanuel Kant (1724–1804) to seek new foundations for empirical method, though as an epistemic reconstruction, not as any set of practical guidelines for scientists. Both Hume and Kant influenced the methodological reflections of the next century, such as the debate between Mill and Whewell over the certainty of inductive inferences in science.

The debate between John Stuart Mill (1806–1873) and William Whewell (1794–1866) has become the canonical methodological debate of the 19th century. Although often characterized as a debate between inductivism and hypothetico-deductivism, the role of the two methods on each side is actually more complex. On the hypothetico-deductive account, scientists work to come up with hypotheses from which true observational consequences can be deduced—hence, hypothetico-deductive. Because Whewell emphasizes both hypotheses and deduction in his account of method, he can be seen as a convenient foil to the inductivism of Mill. However, equally if not more important to Whewell’s portrayal of scientific method is what he calls the “fundamental antithesis”. Knowledge is a product of the objective (what we see in the world around us) and subjective (the contributions of our mind to how we perceive and understand what we experience, which he called the Fundamental Ideas). Both elements are essential according to Whewell, and he was therefore critical of Kant for too much focus on the subjective, and John Locke (1632–1704) and Mill for too much focus on the senses. Whewell’s fundamental ideas can be discipline relative. An idea can be fundamental even if it is necessary for knowledge only within a given scientific discipline (e.g., chemical affinity for chemistry). This distinguishes fundamental ideas from the forms and categories of intuition of Kant. (See the entry on Whewell.)

Clarifying fundamental ideas would therefore be an essential part of scientific method and scientific progress. Whewell called this process “Discoverer’s Induction”. It was induction, following Bacon or Newton, but Whewell sought to revive Bacon’s account by emphasising the role of ideas in the clear and careful formulation of inductive hypotheses. Whewell’s induction is not merely the collecting of objective facts. The subjective plays a role through what Whewell calls the Colligation of Facts, a creative act of the scientist, the invention of a theory. A theory is then confirmed by testing, where more facts are brought under the theory, called the Consilience of Inductions. Whewell felt that this was the method by which the true laws of nature could be discovered: clarification of fundamental concepts, clever invention of explanations, and careful testing. Mill, in his critique of Whewell, and others who have cast Whewell as a fore-runner of the hypothetico-deductivist view, seem to have under-estimated the importance of this discovery phase in Whewell’s understanding of method (Snyder 1997a,b, 1999). Down-playing the discovery phase would come to characterize methodology of the early 20th century (see section 3).

Mill, in his System of Logic, put forward a narrower view of induction as the essence of scientific method. For Mill, induction is the search first for regularities among events. Among those regularities, some will continue to hold for further observations, eventually gaining the status of laws. One can also look for regularities among the laws discovered in a domain, i.e., for a law of laws. Which “law law” will hold is time and discipline dependent and open to revision. One example is the Law of Universal Causation, and Mill put forward specific methods for identifying causes—now commonly known as Mill’s methods. These five methods look for circumstances which are common among the phenomena of interest, those which are absent when the phenomena are, or those for which both vary together. Mill’s methods are still seen as capturing basic intuitions about experimental methods for finding the relevant explanatory factors (System of Logic (1843), see the entry on Mill). The methods advocated by Whewell and Mill, in the end, look similar. Both involve inductive generalization to covering laws. They differ dramatically, however, with respect to the necessity of the knowledge arrived at; that is, at the meta-methodological level (see the entries on Whewell and Mill).

3. Logic of method and critical responses

The quantum and relativistic revolutions in physics in the early 20th century had a profound effect on methodology. Conceptual foundations of both theories were taken to show the defeasibility of even the most seemingly secure intuitions about space, time and bodies. Certainty of knowledge about the natural world was therefore recognized as unattainable. Instead a renewed empiricism was sought which rendered science fallible but still rationally justifiable.

Analyses of the reasoning of scientists emerged, according to which the aspects of scientific method which were of primary importance were the means of testing and confirming of theories. A distinction in methodology was made between the contexts of discovery and justification. The distinction could be used as a wedge between the particularities of where and how theories or hypotheses are arrived at, on the one hand, and the underlying reasoning scientists use (whether or not they are aware of it) when assessing theories and judging their adequacy on the basis of the available evidence. By and large, for most of the 20th century, philosophy of science focused on the second context, although philosophers differed on whether to focus on confirmation or refutation as well as on the many details of how confirmation or refutation could or could not be brought about. By the mid-20th century these attempts at defining the method of justification and the context distinction itself came under pressure. During the same period, philosophy of science developed rapidly, and from section 4 this entry will therefore shift from a primarily historical treatment of the scientific method towards a primarily thematic one.

Advances in logic and probability held out promise of the possibility of elaborate reconstructions of scientific theories and empirical method, the best example being Rudolf Carnap’s The Logical Structure of the World (1928). Carnap attempted to show that a scientific theory could be reconstructed as a formal axiomatic system—that is, a logic. That system could refer to the world because some of its basic sentences could be interpreted as observations or operations which one could perform to test them. The rest of the theoretical system, including sentences using theoretical or unobservable terms (like electron or force) would then either be meaningful because they could be reduced to observations, or they had purely logical meanings (called analytic, like mathematical identities). This has been referred to as the verifiability criterion of meaning. According to the criterion, any statement not either analytic or verifiable was strictly meaningless. Although the view was endorsed by Carnap in 1928, he would later come to see it as too restrictive (Carnap 1956). Another familiar version of this idea is operationalism of Percy William Bridgman. In The Logic of Modern Physics (1927) Bridgman asserted that every physical concept could be defined in terms of the operations one would perform to verify the application of that concept. Making good on the operationalisation of a concept even as simple as length, however, can easily become enormously complex (for measuring very small lengths, for instance) or impractical (measuring large distances like light years.)

Carl Hempel’s (1950, 1951) criticisms of the verifiability criterion of meaning had enormous influence. He pointed out that universal generalizations, such as most scientific laws, were not strictly meaningful on the criterion. Verifiability and operationalism both seemed too restrictive to capture standard scientific aims and practice. The tenuous connection between these reconstructions and actual scientific practice was criticized in another way. In both approaches, scientific methods are instead recast in methodological roles. Measurements, for example, were looked to as ways of giving meanings to terms. The aim of the philosopher of science was not to understand the methods per se , but to use them to reconstruct theories, their meanings, and their relation to the world. When scientists perform these operations, however, they will not report that they are doing them to give meaning to terms in a formal axiomatic system. This disconnect between methodology and the details of actual scientific practice would seem to violate the empiricism the Logical Positivists and Bridgman were committed to. The view that methodology should correspond to practice (to some extent) has been called historicism, or intuitionism. We turn to these criticisms and responses in section 3.4 . [ 4 ]

Positivism also had to contend with the recognition that a purely inductivist approach, along the lines of Bacon-Newton-Mill, was untenable. There was no pure observation, for starters. All observation was theory laden. Theory is required to make any observation, therefore not all theory can be derived from observation alone. (See the entry on theory and observation in science .) Even granting an observational basis, Hume had already pointed out that one could not deductively justify inductive conclusions without begging the question by presuming the success of the inductive method. Likewise, positivist attempts at analyzing how a generalization can be confirmed by observations of its instances were subject to a number of criticisms. Goodman (1965) and Hempel (1965) both point to paradoxes inherent in standard accounts of confirmation. Recent attempts at explaining how observations can serve to confirm a scientific theory are discussed in section 4 below.

The standard starting point for a non-inductive analysis of the logic of confirmation is known as the Hypothetico-Deductive (H-D) method. In its simplest form, a sentence of a theory which expresses some hypothesis is confirmed by its true consequences. As noted in section 2, this method had been advanced by Whewell in the 19th century, as well as Nicod (1924) and others in the 20th century. Often, Hempel’s (1966) description of the H-D method, illustrated by the case of Semmelweis’ inferential procedures in establishing the cause of childbed fever, has been presented as a key account of H-D as well as a foil for criticism of the H-D account of confirmation (see, for example, Lipton’s (2004) discussion of inference to the best explanation; also the entry on confirmation). Hempel described Semmelweis’ procedure as examining various hypotheses explaining the cause of childbed fever. Some hypotheses conflicted with observable facts and could be rejected as false immediately. Others needed to be tested experimentally by deducing which observable events should follow if the hypothesis were true (what Hempel called the test implications of the hypothesis), then conducting an experiment and observing whether or not the test implications occurred. If the experiment showed the test implication to be false, the hypothesis could be rejected. If the experiment showed the test implications to be true, however, this did not prove the hypothesis true. The confirmation of a test implication does not verify a hypothesis, though Hempel did allow that “it provides at least some support, some corroboration or confirmation for it” (Hempel 1966: 8). The degree of this support then depends on the quantity, variety and precision of the supporting evidence.

Another approach that took off from the difficulties with inductive inference was Karl Popper’s critical rationalism or falsificationism (Popper 1959, 1963). Falsification is deductive and similar to H-D in that it involves scientists deducing observational consequences from the hypothesis under test. For Popper, however, the important point was not the degree of confirmation that successful prediction offered to a hypothesis. The crucial thing was the logical asymmetry between confirmation, based on inductive inference, and falsification, which can be based on a deductive inference. (This simple opposition was later questioned by Lakatos, among others. See the entry on historicist theories of scientific rationality.)

Popper stressed that, regardless of the amount of confirming evidence, we can never be certain that a hypothesis is true without committing the fallacy of affirming the consequent. Instead, Popper introduced the notion of corroboration as a measure for how well a theory or hypothesis has survived previous testing—but without implying that this is also a measure for the probability that it is true.
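
The asymmetry Popper relied on can be set out schematically. The following is a minimal illustrative sketch in standard logical notation (not Popper’s own formalism), with H a hypothesis and O an observational test implication deduced from it:

\[ (H \rightarrow O) \wedge \neg O \;\models\; \neg H \qquad \text{(falsification: modus tollens, deductively valid)} \]
\[ (H \rightarrow O) \wedge O \;\not\models\; H \qquad \text{(“verification”: affirming the consequent, invalid)} \]

However many test implications are observed to be true, the second schema never becomes a valid inference; at best, on Popper’s view, the hypothesis that survives such tests is corroborated.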

Popper was also motivated by his doubts about the scientific status of theories like the Marxist theory of history or psycho-analysis, and so wanted to demarcate between science and pseudo-science. Popper saw this as an importantly different distinction from that between science and metaphysics. The latter demarcation was the primary concern of many logical empiricists. Popper used the idea of falsification to draw a line instead between pseudo-science and proper science. Science was science because its method involved subjecting theories to rigorous tests which offered a high probability of failing and thus refuting the theory.

A commitment to the risk of failure was important. Avoiding falsification could be done all too easily. If a consequence of a theory is inconsistent with observations, an exception can be added by introducing auxiliary hypotheses designed explicitly to save the theory, so-called ad hoc modifications. This Popper saw done in pseudo-science where ad hoc theories appeared capable of explaining anything in their field of application. In contrast, science is risky. If observations showed the predictions from a theory to be wrong, the theory would be refuted. Hence, scientific hypotheses must be falsifiable. Not only must there exist some possible observation statement which could falsify the hypothesis or theory were it observed (Popper called these the hypothesis’ potential falsifiers), but it is also crucial to the Popperian scientific method that such falsifications be sincerely attempted on a regular basis.

The more potential falsifiers of a hypothesis, the more falsifiable it would be, and the more the hypothesis claimed. Conversely, hypotheses without falsifiers claimed very little or nothing at all. Originally, Popper thought that this meant the introduction of ad hoc hypotheses only to save a theory should not be countenanced as good scientific method. These would undermine the falsifiability of a theory. However, Popper later came to recognize that the introduction of modifications (immunizations, he called them) was often an important part of scientific development. Responding to surprising or apparently falsifying observations often generated important new scientific insights. Popper’s own example was the observed motion of Uranus which originally did not agree with Newtonian predictions. The ad hoc hypothesis of an outer planet explained the disagreement and led to further falsifiable predictions. Popper sought to reconcile these views by blurring the distinction between the falsifiable and the unfalsifiable, speaking instead of degrees of testability (Popper 1985: 41f.).

From the 1960s on, sustained meta-methodological criticism emerged that drove philosophical focus away from scientific method. A brief look at those criticisms follows, with recommendations for further reading at the end of the entry.

Thomas Kuhn’s The Structure of Scientific Revolutions (1962) begins with a well-known shot across the bow for philosophers of science:

History, if viewed as a repository for more than anecdote or chronology, could produce a decisive transformation in the image of science by which we are now possessed. (1962: 1)

The image Kuhn thought needed transforming was the a-historical, rational reconstruction sought by many of the Logical Positivists, though Carnap and other positivists were actually quite sympathetic to Kuhn’s views. (See the entry on the Vienna Circle.) Kuhn shares with others of his contemporaries, such as Feyerabend and Lakatos, a commitment to a more empirical approach to philosophy of science. Namely, the history of science provides important data, and necessary checks, for philosophy of science, including any theory of scientific method.

The history of science reveals, according to Kuhn, that scientific development occurs in alternating phases. During normal science, the members of the scientific community adhere to the paradigm in place. Their commitment to the paradigm means a commitment to the puzzles to be solved and the acceptable ways of solving them. Confidence in the paradigm remains so long as steady progress is made in solving the shared puzzles. Method in this normal phase operates within a disciplinary matrix (Kuhn’s later concept of a paradigm) which includes standards for problem solving, and defines the range of problems to which the method should be applied. An important part of a disciplinary matrix is the set of values which provide the norms and aims for scientific method. The main values that Kuhn identifies are prediction, problem solving, simplicity, consistency, and plausibility.

An important by-product of normal science is the accumulation of puzzles which cannot be solved with the resources of the current paradigm. Once accumulation of these anomalies has reached some critical mass, it can trigger a communal shift to a new paradigm and a new phase of normal science. Importantly, the values that provide the norms and aims for scientific method may have transformed in the meantime. Method may therefore be relative to discipline, time or place.

Feyerabend, too, took the aim of science to be progress, but argued that any methodological prescription would only stifle that progress (Feyerabend 1988). His arguments are grounded in re-examining accepted “myths” about the history of science. Heroes of science, like Galileo, are shown to be just as reliant on rhetoric and persuasion as they are on reason and demonstration. Others, like Aristotle, are shown to be far more reasonable and far-reaching in their outlooks than they are given credit for. As a consequence, the only rule that could provide what he took to be sufficient freedom was the vacuous “anything goes”. More generally, even the methodological restriction that science is the best way to pursue knowledge, and to increase knowledge, is too restrictive. Feyerabend suggested instead that science might, in fact, be a threat to a free society, because it and its myth had become so dominant (Feyerabend 1978).

An even more fundamental kind of criticism was offered by several sociologists of science from the 1970s onwards who rejected the methodology of providing philosophical accounts of the rational development of science alongside sociological accounts of its irrational mistakes. Instead, they adhered to a symmetry thesis on which any causal explanation of how scientific knowledge is established needs to be symmetrical in explaining truth and falsity, rationality and irrationality, success and mistakes, by the same causal factors (see, e.g., Barnes and Bloor 1982, Bloor 1991). Movements in the Sociology of Science, like the Strong Programme, or in the social dimensions and causes of knowledge more generally, led to extended and close examination of detailed case studies in contemporary science and its history. (See the entries on the social dimensions of scientific knowledge and social epistemology.) Well-known examinations by Latour and Woolgar (1979/1986), Knorr-Cetina (1981), Pickering (1984), and Shapin and Schaffer (1985) seem to bear out that it was social ideologies (on a macro-scale) or individual interactions and circumstances (on a micro-scale) which were the primary causal factors in determining which beliefs gained the status of scientific knowledge. As they saw it, therefore, explanatory appeals to scientific method were not empirically grounded.

A late, and largely unexpected, criticism of scientific method came from within science itself. Beginning in the early 2000s, a number of scientists attempting to replicate the results of published experiments could not do so. There may be a close conceptual connection between reproducibility and method. For example, if reproducibility means that the same scientific methods ought to produce the same result, and all scientific results ought to be reproducible, then whatever it takes to reproduce a scientific result ought to be called scientific method. Space limits us to the observation that, insofar as reproducibility is a desired outcome of proper scientific method, it is not strictly a part of scientific method. (See the entry on reproducibility of scientific results.)

By the close of the 20th century the search for the scientific method was flagging. Nola and Sankey (2000b) could introduce their volume on method by remarking that “For some, the whole idea of a theory of scientific method is yester-year’s debate …”.

Despite the many difficulties that philosophers encountered in trying to provide a clear methodology of confirmation (or refutation), important progress has nonetheless been made on understanding how observation can provide evidence for a given theory. Work in statistics has been crucial for understanding how theories can be tested empirically, and in recent decades a huge literature has developed that attempts to recast confirmation in Bayesian terms. Here these developments can be covered only briefly, and we refer to the entry on confirmation for further details and references.

Statistics has come to play an increasingly important role in the methodology of the experimental sciences from the 19th century onwards. At that time, statistics and probability theory took on a methodological role as an analysis of inductive inference, and attempts to ground the rationality of induction in the axioms of probability theory have continued throughout the 20th century and into the present. Developments in the theory of statistics itself, meanwhile, have had a direct and immense influence on the experimental method, including methods for measuring the uncertainty of observations such as the Method of Least Squares developed by Legendre and Gauss in the early 19th century, criteria for the rejection of outliers proposed by Peirce by the mid-19th century, and the significance tests developed by Gosset (a.k.a. “Student”), Fisher, Neyman & Pearson and others in the 1920s and 1930s (see, e.g., Swijtink 1987 for a brief historical overview; and also the entry on C.S. Peirce).
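
To give a concrete sense of one of these methods: in the Method of Least Squares, the parameters of a model are chosen so as to minimize the sum of squared deviations between the observations and the model’s predictions. In modern notation (a schematic statement, not Legendre’s or Gauss’s original formulation):

\[ \hat{\theta} \;=\; \arg\min_{\theta} \sum_{i=1}^{n} \bigl( y_i - f(x_i; \theta) \bigr)^2 \]

where the y_i are the observed values, f(x_i; θ) the values the model predicts given parameters θ, and the remaining residuals provide a measure of the uncertainty of the fit.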

These developments within statistics then in turn led to a reflective discussion among both statisticians and philosophers of science on how to perceive the process of hypothesis testing: whether it was a rigorous statistical inference that could provide a numerical expression of the degree of confidence in the tested hypothesis, or if it should be seen as a decision between different courses of action that also involved a value component. This led to a major controversy between Fisher on the one side and Neyman and Pearson on the other (see especially Fisher 1955, Neyman 1956 and Pearson 1955, and for analyses of the controversy, e.g., Howie 2002, Marks 2000, Lenhard 2006). On Fisher’s view, hypothesis testing was a methodology for deciding when to accept or reject a statistical hypothesis, namely that a hypothesis should be rejected by evidence if this evidence would be unlikely relative to other possible outcomes, given the hypothesis were true. In contrast, on Neyman and Pearson’s view, the consequence of error also had to play a role when deciding between hypotheses. Introducing the distinction between the error of rejecting a true hypothesis (type I error) and accepting a false hypothesis (type II error), they argued that the consequences of error determine whether it is more important to avoid rejecting a true hypothesis or to avoid accepting a false one. Hence, Fisher aimed for a theory of inductive inference that enabled a numerical expression of confidence in a hypothesis. To him, the important point was the search for truth, not utility. In contrast, the Neyman-Pearson approach provided a strategy of inductive behaviour for deciding between different courses of action. Here, the important point was not whether a hypothesis was true, but whether one should act as if it was.
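
The distinction between the two kinds of error can be stated compactly in the notation of later textbooks (a schematic gloss, not Neyman and Pearson’s own formulation), writing H0 for the hypothesis under test:

\[ \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) \qquad \text{(type I error rate)} \]
\[ \beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false}) \qquad \text{(type II error rate)} \]

On the Neyman–Pearson approach, one first fixes an acceptable α and then seeks a test that minimizes β (equivalently, maximizes the power 1 − β), the balance between the two reflecting the relative costs of the two kinds of error.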

Similar discussions are found in the philosophical literature. On the one side, Churchman (1948) and Rudner (1953) argued that because scientific hypotheses can never be completely verified, a complete analysis of the methods of scientific inference includes ethical judgments in which the scientists must decide whether the evidence is sufficiently strong, or the probability sufficiently high, to warrant the acceptance of the hypothesis; this decision will in turn depend on the importance of making a mistake in accepting or rejecting the hypothesis. Others, such as Jeffrey (1956) and Levi (1960), disagreed and instead defended a value-neutral view of science on which scientists should bracket their attitudes, preferences, temperament, and values when assessing the correctness of their inferences. For more details on this value-free ideal in the philosophy of science and its historical development, see Douglas (2009) and Howard (2003). For a broad set of case studies examining the role of values in science, see e.g. Elliott & Richards 2017.

In recent decades, philosophical discussions of the evaluation of probabilistic hypotheses by statistical inference have largely focused on Bayesianism, which understands probability as a measure of a person’s degree of belief in an event, given the available information, and frequentism, which instead understands probability as a long-run frequency of a repeatable event. Hence, for Bayesians probabilities refer to a state of knowledge, whereas for frequentists probabilities refer to frequencies of events (see, e.g., Sober 2008, chapter 1 for a detailed introduction to Bayesianism and frequentism as well as to likelihoodism). Bayesianism aims at providing a quantifiable, algorithmic representation of belief revision, where belief revision is a function of prior beliefs (i.e., background knowledge) and incoming evidence. Bayesianism employs a rule based on Bayes’ theorem, a theorem of the probability calculus which relates conditional probabilities. The probability that a particular hypothesis is true is interpreted as a degree of belief, or credence, of the scientist. There will also be a probability and a degree of belief that a hypothesis will be true conditional on a piece of evidence (an observation, say) being true. Bayesianism prescribes that it is rational for the scientist to update their belief in the hypothesis to that conditional probability should it turn out that the evidence is, in fact, observed (see, e.g., Sprenger & Hartmann 2019 for a comprehensive treatment of Bayesian philosophy of science). Originating in the work of Neyman and Pearson, frequentism aims at providing the tools for reducing long-run error rates, such as the error-statistical approach developed by Mayo (1996) that focuses on how experimenters can avoid both type I and type II errors by building up a repertoire of procedures that detect errors if and only if they are present. Both Bayesianism and frequentism have developed over time; they are interpreted in different ways by their various proponents, and their relations to earlier criticisms of attempts at defining scientific method are seen differently by proponents and critics. The literature, surveys, reviews and criticism in this area are vast and the reader is referred to the entries on Bayesian epistemology and confirmation.
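
The updating rule at issue can be stated compactly. For a hypothesis H and evidence E, Bayes’ theorem relates the relevant conditional probabilities:

\[ P(H \mid E) \;=\; \frac{P(E \mid H)\, P(H)}{P(E)} \]

The simple conditionalization rule then says that upon learning E (and nothing more), the scientist’s new degree of belief in H should equal the old conditional probability P(H | E). (This is a minimal schematic statement; see the entries on Bayesian epistemology and confirmation referred to above for refinements.)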

5. Method in Practice

Attention to scientific practice, as we have seen, is not itself new. However, the turn to practice in the philosophy of science of late can be seen as a correction to the pessimism with respect to method in philosophy of science in later parts of the 20th century, and as an attempted reconciliation between sociological and rationalist explanations of scientific knowledge. Much of this work sees method as detailed and context-specific problem-solving procedures, and methodological analyses as at the same time descriptive, critical and advisory (see Nickles 1987 for an exposition of this view). The following subsections survey some of these practice-focused approaches; here we turn fully to topics rather than chronology.

A problem with the distinction between the contexts of discovery and justification that figured so prominently in philosophy of science in the first half of the 20th century (see section 2) is that no such distinction can be clearly seen in scientific activity (see Arabatzis 2006). Thus, in recent decades, it has been recognized that the study of conceptual innovation and change should not be confined to psychology and sociology of science, since such innovation and change are also important aspects of scientific practice which philosophy of science should address (see also the entry on scientific discovery). Looking for the practices that drive conceptual innovation has led philosophers to examine both the reasoning practices of scientists and the wide realm of experimental practices that are not directed narrowly at testing hypotheses, that is, exploratory experimentation.

Examining the reasoning practices of historical and contemporary scientists, Nersessian (2008) has argued that new scientific concepts are constructed as solutions to specific problems by systematic reasoning, and that analogy, visual representation, and thought-experimentation are among the important reasoning practices employed. These ubiquitous forms of reasoning are reliable—but also fallible—methods of conceptual development and change. On her account, model-based reasoning consists of cycles of construction, simulation, evaluation and adaptation of models that serve as interim interpretations of the target problem to be solved. Often, this process will lead to modifications or extensions, and a new cycle of simulation and evaluation. However, Nersessian also emphasizes that

creative model-based reasoning cannot be applied as a simple recipe, is not always productive of solutions, and even its most exemplary usages can lead to incorrect solutions. (Nersessian 2008: 11)

Thus, while on the one hand she agrees with many previous philosophers that there is no logic of discovery, on the other hand she holds that discoveries can derive from reasoned processes, such that a large and integral part of scientific practice is

the creation of concepts through which to comprehend, structure, and communicate about physical phenomena …. (Nersessian 1987: 11)

Similarly, work on heuristics for discovery and theory construction by scholars such as Darden (1991) and Bechtel & Richardson (1993) presents science as problem solving and investigates scientific problem solving as a special case of problem-solving in general. Drawing largely on cases from the biological sciences, much of their focus has been on reasoning strategies for the generation, evaluation, and revision of mechanistic explanations of complex systems.

Addressing another aspect of the context distinction, namely the traditional view that the primary role of experiments is to test theoretical hypotheses according to the H-D model, other philosophers of science have argued for additional roles that experiments can play. The notion of exploratory experimentation was introduced to describe experiments driven by the desire to obtain empirical regularities and to develop concepts and classifications in which these regularities can be described (Steinle 1997, 2002; Burian 1997; Waters 2007). However, the difference between theory-driven experimentation and exploratory experimentation should not be seen as a sharp distinction. Theory-driven experiments are not always directed at testing hypotheses, but may also be directed at various kinds of fact-gathering, such as determining numerical parameters. Vice versa, exploratory experiments are usually informed by theory in various ways and are therefore not theory-free. Instead, in exploratory experiments phenomena are investigated without first limiting the possible outcomes of the experiment on the basis of extant theory about the phenomena.

The development of high-throughput instrumentation in molecular biology and neighbouring fields has given rise to a special type of exploratory experimentation that collects and analyses very large amounts of data. These new ‘omics’ disciplines are often said to represent a break with the ideal of hypothesis-driven science (Burian 2007; Elliott 2007; Waters 2007; O’Malley 2007) and are instead described as data-driven research (Leonelli 2012; Strasser 2012) or as a special kind of “convenience experimentation” in which many experiments are done simply because they are extraordinarily convenient to perform (Krohs 2012).

5.2 Computer methods and ‘new ways’ of doing science

The field of omics just described is possible because of the ability of computers to process, in a reasonable amount of time, the huge quantities of data required. Computers allow for more elaborate experimentation (higher speed, better filtering, more variables, sophisticated coordination and control), but also, through modelling and simulations, might constitute a form of experimentation themselves. Here, too, we can pose a version of the general question of method versus practice: does the practice of using computers fundamentally change scientific method, or merely provide a more efficient means of implementing standard methods?

Because computers can be used to automate measurements, quantifications, calculations, and statistical analyses where, for practical reasons, these operations cannot be otherwise carried out, many of the steps involved in reaching a conclusion on the basis of an experiment are now made inside a “black box”, without the direct involvement or awareness of a human. This has epistemological implications, regarding what we can know, and how we can know it. To have confidence in the results, computer methods are therefore subjected to tests of verification and validation.

The distinction between verification and validation is easiest to characterize in the case of computer simulations. In a typical computer simulation scenario computers are used to numerically integrate differential equations for which no analytic solution is available. The equations are part of the model the scientist uses to represent a phenomenon or system under investigation. Verifying a computer simulation means checking that the equations of the model are being correctly approximated. Validating a simulation means checking that the equations of the model are adequate for the inferences one wants to make on the basis of that model.
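
The difference can be illustrated with a deliberately simple sketch (a hypothetical toy example, not drawn from the literature cited here): a model of exponential decay, dx/dt = −kx, integrated numerically by Euler’s method. Verification asks whether the program correctly approximates the model’s equation, which in this toy case can be checked against the known analytic solution; validation asks, separately, whether exponential decay is an adequate model of the target system at all, something no amount of checking the code against its own equations can settle.

import math

def euler_decay(x0, k, dt, t_end):
    """Integrate dx/dt = -k*x with the explicit Euler method; return x(t_end)."""
    n_steps = int(round(t_end / dt))
    x = x0
    for _ in range(n_steps):
        x += dt * (-k * x)  # Euler step: x_{n+1} = x_n + dt * f(x_n)
    return x

# Verification: compare the numerical result with the analytic solution
# x(t) = x0 * exp(-k*t). The discrepancy should shrink as dt is refined,
# indicating that the model's equations are being correctly approximated.
x0, k, t_end = 1.0, 0.5, 4.0
exact = x0 * math.exp(-k * t_end)
for dt in (0.1, 0.01, 0.001):
    approx = euler_decay(x0, k, dt, t_end)
    print(f"dt={dt}: numerical={approx:.6f}, exact={exact:.6f}, error={abs(approx - exact):.2e}")

# Validation, by contrast, would compare the model's predictions with
# measurements of the actual system being simulated, not with the model's
# own analytic solution.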

A number of issues related to computer simulations have been raised. The identification of validation and verification as the testing methods has been criticized. Oreskes et al. (1994) raise concerns that “validation”, because it suggests deductive inference, might lead to over-confidence in the results of simulations. The distinction itself is probably too clean, since actual practice in the testing of simulations mixes and moves back and forth between the two (Weissert 1997; Parker 2008a; Winsberg 2010). Computer simulations do seem to have a non-inductive character, given that the principles by which they operate are built in by the programmers, and any results of the simulation follow from those in-built principles in such a way that those results could, in principle, be deduced from the program code and its inputs. The status of simulations as experiments has therefore been examined (Kaufmann and Smarr 1993; Humphreys 1995; Hughes 1999; Norton and Suppe 2001). This literature considers the epistemology of these experiments: what we can learn by simulation, and also the kinds of justifications which can be given in applying that knowledge to the “real” world (Mayo 1996; Parker 2008b). As pointed out, part of the advantage of computer simulation derives from the fact that huge numbers of calculations can be carried out without requiring direct observation by the experimenter/simulator. At the same time, many of these calculations are approximations to the calculations which would be performed first-hand in an ideal situation. Both factors introduce uncertainties into the inferences drawn from what is observed in the simulation.

For many of the reasons described above, computer simulations do not seem to belong clearly to either the experimental or theoretical domain. Rather, they seem to crucially involve aspects of both. This has led some authors, such as Fox Keller (2003: 200), to argue that we ought to consider computer simulation a “qualitatively different way of doing science”. The literature in general tends to follow Kaufmann and Smarr (1993) in referring to computer simulation as a “third way” for scientific methodology (theoretical reasoning and experimental practice being the first two ways). It should also be noted that the debates around these issues have tended to focus on the form of computer simulation typical in the physical sciences, where models are based on dynamical equations. Other forms of simulation might not have the same problems, or have problems of their own (see the entry on computer simulations in science).

In recent years, the rapid development of machine learning techniques has prompted some scholars to suggest that the scientific method has become “obsolete” (Anderson 2008, Carrol and Goodstein 2009). This has resulted in an intense debate on the relative merits of data-driven and hypothesis-driven research (for samples, see e.g. Mazzocchi 2015 or Succi and Coveney 2018). For a detailed treatment of this topic, we refer to the entry on scientific research and big data.

6. Discourse on scientific method

Despite philosophical disagreements, the idea of the scientific method still figures prominently in contemporary discourse on many different topics, both within science and in society at large. Often, reference to scientific method is used in ways that either convey the legend of a single, universal method characteristic of all science, or grant a particular method or set of methods privileged status as a special ‘gold standard’, often with reference to particular philosophers to vindicate the claims. Discourse on scientific method also typically arises when there is a need to distinguish between science and other activities, or for justifying the special status conveyed to science. In these areas, the philosophical attempts at identifying a set of methods characteristic for scientific endeavors are closely related to the philosophy of science’s classical problem of demarcation (see the entry on science and pseudo-science) and to the philosophical analysis of the social dimension of scientific knowledge and the role of science in democratic society.

One of the settings in which the legend of a single, universal scientific method has been particularly strong is science education (see, e.g., Bauer 1992; McComas 1996; Wivagg & Allchin 2002). [5] Often, ‘the scientific method’ is presented in textbooks and educational web pages as a fixed four- or five-step procedure starting from observations and description of a phenomenon, progressing through the formulation of a hypothesis which explains the phenomenon, the design and conduct of experiments to test the hypothesis, and the analysis of the results, and ending with the drawing of a conclusion. Such references to a universal scientific method can be found in educational material at all levels of science education (Blachowicz 2009), and numerous studies have shown that the idea of a general and universal scientific method often forms part of both students’ and teachers’ conceptions of science (see, e.g., Aikenhead 1987; Osborne et al. 2003). In response, it has been argued that science education needs to focus more on teaching about the nature of science, although views have differed on whether this is best done through student-led investigations, contemporary cases, or historical cases (Allchin, Andersen & Nielsen 2014).

Although the legend is occasionally phrased with reference to the H-D method, its important historical roots in science education are the American philosopher and psychologist Dewey’s account of inquiry in How We Think (1910) and the British mathematician Karl Pearson’s account of science in The Grammar of Science (1892). On Dewey’s account, inquiry is divided into the five steps of

(i) a felt difficulty, (ii) its location and definition, (iii) suggestion of a possible solution, (iv) development by reasoning of the bearing of the suggestions, (v) further observation and experiment leading to its acceptance or rejection. (Dewey 1910: 72)

Similarly, on Pearson’s account, scientific investigations start with measurement of data and observation of their correlation and sequence from which scientific laws can be discovered with the aid of creative imagination. These laws have to be subject to criticism, and their final acceptance will have equal validity for “all normally constituted minds”. Both Dewey’s and Pearson’s accounts should be seen as generalized abstractions of inquiry and not restricted to the realm of science—although both Dewey and Pearson referred to their respective accounts as ‘the scientific method’.

Occasionally, scientists make sweeping statements about a simple and distinct scientific method, as exemplified by Feynman’s simplified version of a conjectures and refutations method presented, for example, in the last of his 1964 Cornell Messenger lectures. [6] However, just as often scientists have come to the same conclusion as recent philosophy of science: that there is no unique, easily described scientific method. For example, the physicist and Nobel Laureate Weinberg described in the paper “The Methods of Science … And Those By Which We Live” (1995) how

The fact that the standards of scientific success shift with time does not only make the philosophy of science difficult; it also raises problems for the public understanding of science. We do not have a fixed scientific method to rally around and defend. (1995: 8)

Interview studies with scientists on their conception of method show that scientists often find it hard to figure out whether available evidence confirms their hypothesis, and that there are no direct translations between general ideas about method and specific strategies to guide how research is conducted (Schickore & Hangel 2019, Hangel & Schickore 2017).

Reference to the scientific method has also often been used to argue for the scientific nature or special status of a particular activity. Philosophical positions that argue for a simple and unique scientific method as a criterion of demarcation, such as Popperian falsification, have often attracted practitioners who felt that they had a need to defend their domain of practice. For example, references to conjectures and refutations as the scientific method are abundant in much of the literature on complementary and alternative medicine (CAM)—alongside the competing position that CAM, as an alternative to conventional biomedicine, needs to develop its own methodology different from that of science.

Also within mainstream science, reference to the scientific method is used in arguments regarding the internal hierarchy of disciplines and domains. A frequently seen argument is that research based on the H-D method is superior to research based on induction from observations because in deductive inferences the conclusion follows necessarily from the premises. (See, e.g., Parascandola 1998 for an analysis of how this argument has been made to downgrade epidemiology compared to the laboratory sciences.) Similarly, based on an examination of the practices of major funding institutions such as the National Institutes of Health (NIH), the National Science Foundation (NSF) and the Biotechnology and Biological Sciences Research Council (BBSRC) in the UK, O’Malley et al. (2009) have argued that funding agencies seem to have a tendency to adhere to the view that the primary activity of science is to test hypotheses, while descriptive and exploratory research are seen as merely preparatory activities that are valuable only insofar as they fuel hypothesis-driven research.

In some areas of science, scholarly publications are structured in a way that may convey the impression of a neat and linear process of inquiry from stating a question, through devising the methods by which to answer it and collecting the data, to drawing a conclusion from the analysis of the data. For example, the codified format of publications in most biomedical journals known as the IMRAD format (Introduction, Methods, Results, and Discussion) is explicitly described by the journal editors as “not an arbitrary publication format but rather a direct reflection of the process of scientific discovery” (see the so-called “Vancouver Recommendations”, ICMJE 2013: 11). However, scientific publications do not in general reflect the process by which the reported scientific results were produced. For example, under the provocative title “Is the scientific paper a fraud?”, Medawar argued that scientific papers generally misrepresent how the results have been produced (Medawar 1963/1996). Similar views have been advanced by philosophers, historians and sociologists of science (Gilbert 1976; Holmes 1987; Knorr-Cetina 1981; Schickore 2008; Suppe 1998) who have argued that scientists’ experimental practices are messy and often do not follow any recognizable pattern. Publications of research results, they argue, are retrospective reconstructions of these activities that often do not preserve the temporal order or the logic of these activities, but are instead often constructed in order to screen off potential criticism (see Schickore 2008 for a review of this work).

Philosophical positions on the scientific method have also made it into the court room, especially in the US where judges have drawn on philosophy of science in deciding when to confer special status on scientific expert testimony. A key case is Daubert v. Merrell Dow Pharmaceuticals (92–102, 509 U.S. 579, 1993). In this case, the Supreme Court argued in its 1993 ruling that trial judges must ensure that expert testimony is reliable, and that in doing this the court must look at the expert’s methodology to determine whether the proffered evidence is actually scientific knowledge. Further, referring to works of Popper and Hempel the court stated that

ordinarily, a key question to be answered in determining whether a theory or technique is scientific knowledge … is whether it can be (and has been) tested. (Justice Blackmun, Daubert v. Merrell Dow Pharmaceuticals; see Other Internet Resources for a link to the opinion)

But as argued by Haack (2005a,b, 2010) and by Foster & Huber (1999), by equating the question of whether a piece of testimony is reliable with the question of whether it is scientific as indicated by a special methodology, the court was producing an inconsistent mixture of Popper’s and Hempel’s philosophies, and this has since led to considerable confusion in subsequent case rulings that drew on the Daubert case (see Haack 2010 for a detailed exposition).

The difficulties around identifying the methods of science are also reflected in the difficulties of identifying scientific misconduct in the form of improper application of the method or methods of science. One of the first and most influential attempts at defining misconduct in science was the US definition from 1989 that defined misconduct as

fabrication, falsification, plagiarism, or other practices that seriously deviate from those that are commonly accepted within the scientific community . (Code of Federal Regulations, part 50, subpart A., August 8, 1989, italics added)

However, the “other practices that seriously deviate” clause was heavily criticized because it could be used to suppress creative or novel science. For example, the National Academy of Sciences stated in its report Responsible Science (1992) that it

wishes to discourage the possibility that a misconduct complaint could be lodged against scientists based solely on their use of novel or unorthodox research methods. (NAS: 27)

This clause was therefore later removed from the definition. For an entry into the key philosophical literature on conduct in science, see Shamoo & Resnik (2009).

The question of the source of the success of science has been at the core of philosophy since the beginning of modern science. If viewed as a matter of epistemology more generally, scientific method is a part of the entire history of philosophy. Over that time, science and whatever methods its practitioners may employ have changed dramatically. Today, many philosophers have taken up the banners of pluralism or of practice to focus on what are, in effect, fine-grained and contextually limited examinations of scientific method. Others hope to shift perspectives in order to provide a renewed general account of what characterizes the activity we call science.

One such perspective has been offered recently by Hoyningen-Huene (2008, 2013), who argues from the history of philosophy of science that after three lengthy phases of characterizing science by its method, we are now in a phase where the belief in the existence of a positive scientific method has eroded and what has been left to characterize science is only its fallibility. First was a phase from Plato and Aristotle up until the 17th century in which the specificity of scientific knowledge was seen in its absolute certainty established by proof from evident axioms; next was a phase up to the mid-19th century in which the means to establish the certainty of scientific knowledge had been generalized to include inductive procedures as well. In the third phase, which lasted until the last decades of the 20th century, it was recognized that empirical knowledge was fallible, but it was still granted a special status due to its distinctive mode of production. But now in the fourth phase, according to Hoyningen-Huene, historical and philosophical studies have shown how “scientific methods with the characteristics as posited in the second and third phase do not exist” (2008: 168) and there is no longer any consensus among philosophers and historians of science about the nature of science. For Hoyningen-Huene, this is too negative a stance, and he therefore urges that the question about the nature of science be taken up anew. His own answer to this question is that “scientific knowledge differs from other kinds of knowledge, especially everyday knowledge, primarily by being more systematic” (Hoyningen-Huene 2013: 14). Systematicity can have several different dimensions: among them are more systematic descriptions, explanations, predictions, defense of knowledge claims, epistemic connectedness, ideal of completeness, knowledge generation, representation of knowledge and critical discourse. Hence, what characterizes science is the greater care in excluding possible alternative explanations, the more detailed elaboration with respect to data on which predictions are based, the greater care in detecting and eliminating sources of error, the more articulate connections to other pieces of knowledge, etc. On this position, what characterizes science is not that the methods employed are unique to science, but that the methods are more carefully employed.

Another, similar approach has been offered by Haack (2003). Like Hoyningen-Huene, she sets off from a dissatisfaction with the recent clash between what she calls Old Deferentialism and New Cynicism. The Old Deferentialist position is that science progressed inductively by accumulating true theories confirmed by empirical evidence or deductively by testing conjectures against basic statements; while the New Cynics’ position is that science has no epistemic authority and no uniquely rational method and is merely politics. Haack insists that contrary to the views of the New Cynics, there are objective epistemic standards, and there is something epistemologically special about science, even though the Old Deferentialists pictured this in a wrong way. Instead, she offers a new Critical Commonsensist account on which standards of good, strong, supportive evidence and well-conducted, honest, thorough and imaginative inquiry are not exclusive to the sciences, but the standards by which we judge all inquirers. In this sense, science does not differ in kind from other kinds of inquiry, but it may differ in the degree to which it requires broad and detailed background knowledge and a familiarity with a technical vocabulary that only specialists may possess.

  • Aikenhead, G.S., 1987, “High-school graduates’ beliefs about science-technology-society. III. Characteristics and limitations of scientific knowledge”, Science Education , 71(4): 459–487.
  • Allchin, D., H.M. Andersen and K. Nielsen, 2014, “Complementary Approaches to Teaching Nature of Science: Integrating Student Inquiry, Historical Cases, and Contemporary Cases in Classroom Practice”, Science Education , 98: 461–486.
  • Anderson, C., 2008, “The end of theory: The data deluge makes the scientific method obsolete”, Wired magazine , 16(7): 16–07
  • Arabatzis, T., 2006, “On the inextricability of the context of discovery and the context of justification”, in Revisiting Discovery and Justification , J. Schickore and F. Steinle (eds.), Dordrecht: Springer, pp. 215–230.
  • Barnes, J. (ed.), 1984, The Complete Works of Aristotle, Vols I and II , Princeton: Princeton University Press.
  • Barnes, B. and D. Bloor, 1982, “Relativism, Rationalism, and the Sociology of Knowledge”, in Rationality and Relativism , M. Hollis and S. Lukes (eds.), Cambridge: MIT Press, pp. 1–20.
  • Bauer, H.H., 1992, Scientific Literacy and the Myth of the Scientific Method , Urbana: University of Illinois Press.
  • Bechtel, W. and R.C. Richardson, 1993, Discovering complexity , Princeton, NJ: Princeton University Press.
  • Berkeley, G., 1734, The Analyst in De Motu and The Analyst: A Modern Edition with Introductions and Commentary , D. Jesseph (trans. and ed.), Dordrecht: Kluwer Academic Publishers, 1992.
  • Blachowicz, J., 2009, “How science textbooks treat scientific method: A philosopher’s perspective”, The British Journal for the Philosophy of Science , 60(2): 303–344.
  • Bloor, D., 1991, Knowledge and Social Imagery , Chicago: University of Chicago Press, 2nd edition.
  • Boyle, R., 1682, New experiments physico-mechanical, touching the air , Printed by Miles Flesher for Richard Davis, bookseller in Oxford.
  • Bridgman, P.W., 1927, The Logic of Modern Physics , New York: Macmillan.
  • –––, 1956, “The Methodological Character of Theoretical Concepts”, in The Foundations of Science and the Concepts of Science and Psychology , Herbert Feigl and Michael Scriven (eds.), Minnesota: University of Minneapolis Press, pp. 38–76.
  • Burian, R., 1997, “Exploratory Experimentation and the Role of Histochemical Techniques in the Work of Jean Brachet, 1938–1952”, History and Philosophy of the Life Sciences , 19(1): 27–45.
  • –––, 2007, “On microRNA and the need for exploratory experimentation in post-genomic molecular biology”, History and Philosophy of the Life Sciences , 29(3): 285–311.
  • Carnap, R., 1928, Der logische Aufbau der Welt , Berlin: Bernary, transl. by R.A. George, The Logical Structure of the World , Berkeley: University of California Press, 1967.
  • –––, 1956, “The methodological character of theoretical concepts”, Minnesota studies in the philosophy of science , 1: 38–76.
  • Carrol, S., and D. Goodstein, 2009, “Defining the scientific method”, Nature Methods , 6: 237.
  • Churchman, C.W., 1948, “Science, Pragmatics, Induction”, Philosophy of Science , 15(3): 249–268.
  • Cooper, J. (ed.), 1997, Plato: Complete Works , Indianapolis: Hackett.
  • Darden, L., 1991, Theory Change in Science: Strategies from Mendelian Genetics , Oxford: Oxford University Press
  • Dewey, J., 1910, How we think , New York: Dover Publications (reprinted 1997).
  • Douglas, H., 2009, Science, Policy, and the Value-Free Ideal , Pittsburgh: University of Pittsburgh Press.
  • Dupré, J., 2004, “Miracle of Monism ”, in Naturalism in Question , Mario De Caro and David Macarthur (eds.), Cambridge, MA: Harvard University Press, pp. 36–58.
  • Elliott, K.C., 2007, “Varieties of exploratory experimentation in nanotoxicology”, History and Philosophy of the Life Sciences , 29(3): 311–334.
  • Elliott, K. C., and T. Richards (eds.), 2017, Exploring inductive risk: Case studies of values in science , Oxford: Oxford University Press.
  • Falcon, Andrea, 2005, Aristotle and the science of nature: Unity without uniformity , Cambridge: Cambridge University Press.
  • Feyerabend, P., 1978, Science in a Free Society , London: New Left Books
  • –––, 1988, Against Method , London: Verso, 2nd edition.
  • Fisher, R.A., 1955, “Statistical Methods and Scientific Induction”, Journal of The Royal Statistical Society. Series B (Methodological) , 17(1): 69–78.
  • Foster, K. and P.W. Huber, 1999, Judging Science. Scientific Knowledge and the Federal Courts , Cambridge: MIT Press.
  • Fox Keller, E., 2003, “Models, Simulation, and ‘computer experiments’”, in The Philosophy of Scientific Experimentation , H. Radder (ed.), Pittsburgh: Pittsburgh University Press, 198–215.
  • Gilbert, G., 1976, “The transformation of research findings into scientific knowledge”, Social Studies of Science , 6: 281–306.
  • Gimbel, S., 2011, Exploring the Scientific Method , Chicago: University of Chicago Press.
  • Goodman, N., 1965, Fact , Fiction, and Forecast , Indianapolis: Bobbs-Merrill.
  • Haack, S., 1995, “Science is neither sacred nor a confidence trick”, Foundations of Science , 1(3): 323–335.
  • –––, 2003, Defending science—within reason , Amherst: Prometheus.
  • –––, 2005a, “Disentangling Daubert: an epistemological study in theory and practice”, Journal of Philosophy, Science and Law , 5, Haack 2005a available online . doi:10.5840/jpsl2005513
  • –––, 2005b, “Trial and error: The Supreme Court’s philosophy of science”, American Journal of Public Health , 95: S66-S73.
  • –––, 2010, “Federal Philosophy of Science: A Deconstruction-and a Reconstruction”, NYUJL & Liberty , 5: 394.
  • Hangel, N. and J. Schickore, 2017, “Scientists’ conceptions of good research practice”, Perspectives on Science , 25(6): 766–791
  • Harper, W.L., 2011, Isaac Newton’s Scientific Method: Turning Data into Evidence about Gravity and Cosmology , Oxford: Oxford University Press.
  • Hempel, C., 1950, “Problems and Changes in the Empiricist Criterion of Meaning”, Revue Internationale de Philosophie , 41(11): 41–63.
  • –––, 1951, “The Concept of Cognitive Significance: A Reconsideration”, Proceedings of the American Academy of Arts and Sciences , 80(1): 61–77.
  • –––, 1965, Aspects of scientific explanation and other essays in the philosophy of science , New York–London: Free Press.
  • –––, 1966, Philosophy of Natural Science , Englewood Cliffs: Prentice-Hall.
  • Holmes, F.L., 1987, “Scientific writing and scientific discovery”, Isis , 78(2): 220–235.
  • Howard, D., 2003, “Two left turns make a right: On the curious political career of North American philosophy of science at midcentury”, in Logical Empiricism in North America , G.L. Hardcastle & A.W. Richardson (eds.), Minneapolis: University of Minnesota Press, pp. 25–93.
  • Hoyningen-Huene, P., 2008, “Systematicity: The nature of science”, Philosophia , 36(2): 167–180.
  • –––, 2013, Systematicity. The Nature of Science , Oxford: Oxford University Press.
  • Howie, D., 2002, Interpreting probability: Controversies and developments in the early twentieth century , Cambridge: Cambridge University Press.
  • Hughes, R., 1999, “The Ising Model, Computer Simulation, and Universal Physics”, in Models as Mediators , M. Morgan and M. Morrison (eds.), Cambridge: Cambridge University Press, pp. 97–145
  • Hume, D., 1739, A Treatise of Human Nature , D. Fate Norton and M.J. Norton (eds.), Oxford: Oxford University Press, 2000.
  • Humphreys, P., 1995, “Computational science and scientific method”, Minds and Machines , 5(1): 499–512.
  • ICMJE, 2013, “Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals”, International Committee of Medical Journal Editors, available online , accessed August 13 2014
  • Jeffrey, R.C., 1956, “Valuation and Acceptance of Scientific Hypotheses”, Philosophy of Science , 23(3): 237–246.
  • Kaufmann, W.J., and L.L. Smarr, 1993, Supercomputing and the Transformation of Science , New York: Scientific American Library.
  • Knorr-Cetina, K., 1981, The Manufacture of Knowledge , Oxford: Pergamon Press.
  • Krohs, U., 2012, “Convenience experimentation”, Studies in History and Philosophy of Biological and Biomedical Sciences , 43: 52–57.
  • Kuhn, T.S., 1962, The Structure of Scientific Revolutions , Chicago: University of Chicago Press
  • Latour, B. and S. Woolgar, 1986, Laboratory Life: The Construction of Scientific Facts , Princeton: Princeton University Press, 2nd edition.
  • Laudan, L., 1968, “Theories of scientific method from Plato to Mach”, History of Science , 7(1): 1–63.
  • Lenhard, J., 2006, “Models and statistical inference: The controversy between Fisher and Neyman-Pearson”, The British Journal for the Philosophy of Science , 57(1): 69–91.
  • Leonelli, S., 2012, “Making Sense of Data-Driven Research in the Biological and the Biomedical Sciences”, Studies in the History and Philosophy of the Biological and Biomedical Sciences , 43(1): 1–3.
  • Levi, I., 1960, “Must the scientist make value judgments?”, Journal of Philosophy , 57(11): 345–357.
  • Lindley, D., 1991, Theory Change in Science: Strategies from Mendelian Genetics , Oxford: Oxford University Press.
  • Lipton, P., 2004, Inference to the Best Explanation , London: Routledge, 2nd edition.
  • Marks, H.M., 2000, The progress of experiment: science and therapeutic reform in the United States, 1900–1990 , Cambridge: Cambridge University Press.
  • Mazzocchi, F., 2015, “Could Big Data be the end of theory in science?”, EMBO reports , 16: 1250–1255.
  • Mayo, D.G., 1996, Error and the Growth of Experimental Knowledge , Chicago: University of Chicago Press.
  • McComas, W.F., 1996, “Ten myths of science: Reexamining what we think we know about the nature of science”, School Science and Mathematics , 96(1): 10–16.
  • Medawar, P.B., 1963/1996, “Is the scientific paper a fraud”, in The Strange Case of the Spotted Mouse and Other Classic Essays on Science , Oxford: Oxford University Press, 33–39.
  • Mill, J.S., 1963, Collected Works of John Stuart Mill , J. M. Robson (ed.), Toronto: University of Toronto Press
  • NAS, 1992, Responsible Science: Ensuring the integrity of the research process , Washington DC: National Academy Press.
  • Nersessian, N.J., 1987, “A cognitive-historical approach to meaning in scientific theories”, in The process of science , N. Nersessian (ed.), Berlin: Springer, pp. 161–177.
  • –––, 2008, Creating Scientific Concepts , Cambridge: MIT Press.
  • Newton, I., 1726, Philosophiae naturalis Principia Mathematica (3rd edition), in The Principia: Mathematical Principles of Natural Philosophy: A New Translation , I.B. Cohen and A. Whitman (trans.), Berkeley: University of California Press, 1999.
  • –––, 1704, Opticks or A Treatise of the Reflections, Refractions, Inflections & Colors of Light , New York: Dover Publications, 1952.
  • Neyman, J., 1956, “Note on an Article by Sir Ronald Fisher”, Journal of the Royal Statistical Society. Series B (Methodological) , 18: 288–294.
  • Nickles, T., 1987, “Methodology, heuristics, and rationality”, in Rational changes in science: Essays on Scientific Reasoning , J.C. Pitt (ed.), Berlin: Springer, pp. 103–132.
  • Nicod, J., 1924, Le problème logique de l’induction , Paris: Alcan. (Engl. transl. “The Logical Problem of Induction”, in Foundations of Geometry and Induction , London: Routledge, 2000.)
  • Nola, R. and H. Sankey, 2000a, “A selective survey of theories of scientific method”, in Nola and Sankey 2000b: 1–65.
  • –––, 2000b, After Popper, Kuhn and Feyerabend. Recent Issues in Theories of Scientific Method , London: Springer.
  • –––, 2007, Theories of Scientific Method , Stocksfield: Acumen.
  • Norton, S., and F. Suppe, 2001, “Why atmospheric modeling is good science”, in Changing the Atmosphere: Expert Knowledge and Environmental Governance , C. Miller and P. Edwards (eds.), Cambridge, MA: MIT Press, 88–133.
  • O’Malley, M., 2007, “Exploratory experimentation and scientific practice: Metagenomics and the proteorhodopsin case”, History and Philosophy of the Life Sciences , 29(3): 337–360.
  • O’Malley, M., C. Haufe, K. Elliot, and R. Burian, 2009, “Philosophies of Funding”, Cell , 138: 611–615.
  • Oreskes, N., K. Shrader-Frechette, and K. Belitz, 1994, “Verification, Validation and Confirmation of Numerical Models in the Earth Sciences”, Science , 263(5147): 641–646.
  • Osborne, J., S. Simon, and S. Collins, 2003, “Attitudes towards science: a review of the literature and its implications”, International Journal of Science Education , 25(9): 1049–1079.
  • Parascandola, M., 1998, “Epidemiology—2nd-Rate Science”, Public Health Reports , 113(4): 312–320.
  • Parker, W., 2008a, “Franklin, Holmes and the Epistemology of Computer Simulation”, International Studies in the Philosophy of Science , 22(2): 165–83.
  • –––, 2008b, “Computer Simulation through an Error-Statistical Lens”, Synthese , 163(3): 371–84.
  • Pearson, K., 1892, The Grammar of Science , London: J.M. Dent and Sons, 1951.
  • Pearson, E.S., 1955, “Statistical Concepts in Their Relation to Reality”, Journal of the Royal Statistical Society , B, 17: 204–207.
  • Pickering, A., 1984, Constructing Quarks: A Sociological History of Particle Physics , Edinburgh: Edinburgh University Press.
  • Popper, K.R., 1959, The Logic of Scientific Discovery , London: Routledge, 2002
  • –––, 1963, Conjectures and Refutations , London: Routledge, 2002.
  • –––, 1985, Unended Quest: An Intellectual Autobiography , La Salle: Open Court Publishing Co.
  • Rudner, R., 1953, “The Scientist Qua Scientist Making Value Judgments”, Philosophy of Science , 20(1): 1–6.
  • Rudolph, J.L., 2005, “Epistemology for the masses: The origin of ‘The Scientific Method’ in American Schools”, History of Education Quarterly , 45(3): 341–376
  • Schickore, J., 2008, “Doing science, writing science”, Philosophy of Science , 75: 323–343.
  • Schickore, J. and N. Hangel, 2019, “‘It might be this, it should be that…’ uncertainty and doubt in day-to-day science practice”, European Journal for Philosophy of Science , 9(2): 31. doi:10.1007/s13194-019-0253-9
  • Shamoo, A.E. and D.B. Resnik, 2009, Responsible Conduct of Research , Oxford: Oxford University Press.
  • Shank, J.B., 2008, The Newton Wars and the Beginning of the French Enlightenment , Chicago: The University of Chicago Press.
  • Shapin, S. and S. Schaffer, 1985, Leviathan and the air-pump , Princeton: Princeton University Press.
  • Smith, G.E., 2002, “The Methodology of the Principia”, in The Cambridge Companion to Newton , I.B. Cohen and G.E. Smith (eds.), Cambridge: Cambridge University Press, 138–173.
  • Snyder, L.J., 1997a, “Discoverers’ Induction”, Philosophy of Science , 64: 580–604.
  • –––, 1997b, “The Mill-Whewell Debate: Much Ado About Induction”, Perspectives on Science , 5: 159–198.
  • –––, 1999, “Renovating the Novum Organum: Bacon, Whewell and Induction”, Studies in History and Philosophy of Science , 30: 531–557.
  • Sober, E., 2008, Evidence and Evolution. The logic behind the science , Cambridge: Cambridge University Press
  • Sprenger, J. and S. Hartmann, 2019, Bayesian philosophy of science , Oxford: Oxford University Press.
  • Steinle, F., 1997, “Entering New Fields: Exploratory Uses of Experimentation”, Philosophy of Science (Proceedings), 64: S65–S74.
  • –––, 2002, “Experiments in History and Philosophy of Science”, Perspectives on Science , 10(4): 408–432.
  • Strasser, B.J., 2012, “Data-driven sciences: From wonder cabinets to electronic databases”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 85–87.
  • Succi, S. and P.V. Coveney, 2018, “Big data: the end of the scientific method?”, Philosophical Transactions of the Royal Society A , 377: 20180145. doi:10.1098/rsta.2018.0145
  • Suppe, F., 1998, “The Structure of a Scientific Paper”, Philosophy of Science , 65(3): 381–405.
  • Swijtink, Z.G., 1987, “The objectification of observation: Measurement and statistical methods in the nineteenth century”, in The probabilistic revolution. Ideas in History, Vol. 1 , L. Kruger (ed.), Cambridge MA: MIT Press, pp. 261–285.
  • Waters, C.K., 2007, “The nature and context of exploratory experimentation: An introduction to three case studies of exploratory research”, History and Philosophy of the Life Sciences , 29(3): 275–284.
  • Weinberg, S., 1995, “The methods of science… and those by which we live”, Academic Questions , 8(2): 7–13.
  • Weissert, T., 1997, The Genesis of Simulation in Dynamics: Pursuing the Fermi-Pasta-Ulam Problem , New York: Springer Verlag.
  • William H., 1628, Exercitatio Anatomica de Motu Cordis et Sanguinis in Animalibus , in On the Motion of the Heart and Blood in Animals , R. Willis (trans.), Buffalo: Prometheus Books, 1993.
  • Winsberg, E., 2010, Science in the Age of Computer Simulation , Chicago: University of Chicago Press.
  • Wivagg, D. & D. Allchin, 2002, “The Dogma of the Scientific Method”, The American Biology Teacher , 64(9): 645–646
Writing the Scientific Paper

When you write about scientific topics to specialists in a particular scientific field, we call that scientific writing. (When you write to non-specialists about scientific topics, we call that science writing.)

The scientific paper has developed over the past three centuries into a tool to communicate the results of scientific inquiry. The main audience for scientific papers is extremely specialized. The purpose of these papers is twofold: to present information so that it is easy to retrieve, and to present enough information that the reader can duplicate the scientific study. A standard format with six main parts helps readers to find expected information and analysis:

  • Title--subject and what aspect of the subject was studied.
  • Abstract--summary of paper: The main reason for the study, the primary results, the main conclusions
  • Introduction-- why the study was undertaken
  • Methods and Materials-- how the study was undertaken
  • Results-- what was found
  • Discussion-- why these results could be significant (what the reasons might be for the patterns found or not found)

There are many ways to approach the writing of a scientific paper, and no one way is right. Many people, however, find that drafting chunks in this order works best: Results, Discussion, Introduction, Materials & Methods, Abstract, and, finally, Title.

Title

The title should be very limited and specific. Really, it should be a pithy summary of the article's main focus.

  • "Renal disease susceptibility and hypertension are under independent genetic control in the fawn hooded rat"
  • "Territory size in Lincoln's Sparrows ( Melospiza lincolnii )"
  • "Replacement of deciduous first premolars and dental eruption in archaeocete whales"
  • "The Radio-Frequency Single-Electron Transistor (RF-SET): A Fast and Ultrasensitive Electrometer"

Abstract

This is a summary of your article. Generally between 50-100 words, it should state the goals, results, and the main conclusions of your study. You should list the parameters of your study (when and where was it conducted, if applicable; your sample size; the specific species, proteins, genes, etc., studied). Think of the process of writing the abstract as taking one or two sentences from each of your sections (an introductory sentence, a sentence stating the specific question addressed, a sentence listing your main techniques or procedures, two or three sentences describing your results, and one sentence describing your main conclusion).

Example One

Hypertension, diabetes and hyperlipidemia are risk factors for life-threatening complications such as end-stage renal disease, coronary artery disease and stroke. Why some patients develop complications is unclear, but susceptibility genes may be involved. To test this notion, we studied crosses involving the fawn-hooded rat, an animal model of hypertension that develops chronic renal failure. Here, we report the localization of two genes, Rf-1 and Rf-2, responsible for about half of the genetic variation in key indices of renal impairment. In addition, we localize a gene, Bpfh-1, responsible for about 26% of the genetic variation in blood pressure. Rf-1 strongly affects the risk of renal impairment, but has no significant effect on blood pressure. Our results show that susceptibility to a complication of hypertension is under at least partially independent genetic control from susceptibility to hypertension itself.

Brown, Donna M., A.P. Provoost, M.J. Daly, E.S. Lander, & H.J. Jacob. 1996. "Renal disease susceptibility and hypertension are under independent genetic control in the fawn-hooded rat." Nature Genetics, 12(1):44-51.

Example Two

We studied survival of 220 calves of radiocollared moose ( Alces alces ) from parturition to the end of July in southcentral Alaska from 1994 to 1997. Prior studies established that predation by brown bears ( Ursus arctos ) was the primary cause of mortality of moose calves in the region. Our objectives were to characterize vulnerability of moose calves to predation as influenced by age, date, snow depths, and previous reproductive success of the mother. We also tested the hypothesis that survival of twin moose calves was independent and identical to that of single calves. Survival of moose calves from parturition through July was 0.27 ± 0.03 SE, and their daily rate of mortality declined at a near constant rate with age in that period. Mean annual survival was 0.22 ± 0.03 SE. Previous winter's snow depths or survival of the mother's previous calf was not related to neonatal survival. Selection for early parturition was evidenced in the 4 years of study by a 6.3% increase in the hazard of death with each daily increase in parturition date. Although there was no significant difference in survival of twin and single moose calves, most twins that died disappeared together during the first 15 days after birth and independently thereafter, suggesting that predators usually killed both when encountered up to that age.

Key words: Alaska, Alces alces , calf survival, moose, Nelchina, parturition synchrony, predation

Testa, J.W., E.F. Becker, & G.R. Lee. 2000. "Temporal patterns in the survival of twin and single moose (Alces alces) calves in southcentral Alaska." Journal of Mammalogy, 81(1):162-168.

Example Three

We monitored breeding phenology and population levels of Rana yavapaiensis by use of repeated egg mass censuses and visual encounter surveys at Agua Caliente Canyon near Tucson, Arizona, from 1994 to 1996. Adult counts fluctuated erratically within each year of the study but annual means remained similar. Juvenile counts peaked during the fall recruitment season and fell to near zero by early spring. Rana yavapaiensis deposited eggs in two distinct annual episodes, one in spring (March-May) and a much smaller one in fall (September-October). Larvae from the spring deposition period completed metamorphosis in early summer. Over the two years of study, 96.6% of egg masses successfully produced larvae. Egg masses were deposited during periods of predictable, moderate stream flow, but not during seasonal periods when flash flooding or drought were likely to affect eggs or larvae. Breeding phenology of Rana yavapaiensis is particularly well suited for life in desert streams with natural flow regimes which include frequent flash flooding and drought at predictable times. The exotic predators of R. yavapaiensis are less able to cope with fluctuating conditions. Unaltered stream flow regimes that allow natural fluctuations in stream discharge may provide refugia for this declining ranid frog from exotic predators by excluding those exotic species that are unable to cope with brief flash flooding and habitat drying.

Sartorius, Shawn S., and Philip C. Rosen. 2000. "Breeding phenology of the lowland leopard frog (Rana yavapaiensis)." Southwestern Naturalist, 45(3): 267-273.

Introduction

The introduction is where you sketch out the background of your study, including why you have investigated the question that you have and how it relates to earlier research that has been done in the field. It may help to think of an introduction as a telescoping focus, where you begin with the broader context and gradually narrow to the specific problem addressed by the report. A typical (and very useful) construction of an introduction proceeds as follows:

"Echimyid rodents of the genus Proechimys (spiny rats) often are the most abundant and widespread lowland forest rodents throughout much of their range in the Neotropics (Eisenberg 1989). Recent studies suggested that these rodents play an important role in forest dynamics through their activities as seed predators and dispersers of seeds (Adler and Kestrell 1998; Asquith et al 1997; Forget 1991; Hoch and Adler 1997)." (Lambert and Adler, p. 70)

"Our laboratory has been involved in the analysis of the HLA class II genes and their association with autoimmune disorders such as insulin-dependent diabetes mellitus. As part of this work, the laboratory handles a large number of blood samples. In an effort to minimize the expense and urgency of transportation of frozen or liquid blood samples, we have designed a protocol that will preserve the integrity of lymphocyte DNA and enable the transport and storage of samples at ambient temperatures." (Torrance, MacLeod & Hache, p. 64)

"Despite the ubiquity and abundance of P. semispinosus , only two previous studies have assessed habitat use, with both showing a generalized habitat use. [brief summary of these studies]." (Lambert and Adler, p. 70)

"Although very good results have been obtained using polymerase chain reaction (PCR) amplification of DNA extracted from dried blood spots on filter paper (1,4,5,8,9), this preservation method yields limited amounts of DNA and is susceptible to contamination." (Torrance, MacLeod & Hache, p. 64)

"No attempt has been made to quantitatively describe microhabitat characteristics with which this species may be associated. Thus, specific structural features of secondary forests that may promote abundance of spiny rats remains unknown. Such information is essential to understand the role of spiny rats in Neotropical forests, particularly with regard to forest regeneration via interactions with seeds." (Lambert and Adler, p. 71)

"As an alternative, we have been investigating the use of lyophilization ("freeze-drying") of whole blood as a method to preserve sufficient amounts of genomic DNA to perform PCR and Southern Blot analysis." (Torrance, MacLeod & Hache, p. 64)

"We present an analysis of microhabitat use by P. semispinosus in tropical moist forests in central Panama." (Lambert and Adler, p. 71)

"In this report, we summarize our analysis of genomic DNA extracted from lyophilized whole blood." (Torrance, MacLeod & Hache, p. 64)

Methods and Materials

In this section you describe how you performed your study. You need to provide enough information here for the reader to duplicate your experiment. However, be reasonable about who the reader is. Assume that he or she is someone familiar with the basic practices of your field.

It's helpful to both writer and reader to organize this section chronologically: that is, describe each procedure in the order it was performed. For example, DNA-extraction, purification, amplification, assay, detection. Or, study area, study population, sampling technique, variables studied, analysis method.

Include in this section:

  • study design: procedures should be listed and described, or the reader should be referred to papers that have already described the procedure used
  • particular techniques used and why, if relevant
  • modifications of any techniques; be sure to describe the modification
  • specialized equipment, including brand-names
  • temporal, spatial, and historical description of study area and studied population
  • assumptions underlying the study
  • statistical methods, including software programs

Example description of activity

Chromosomal DNA was denatured for the first cycle by incubating the slides in 70% deionized formamide; 2x standard saline citrate (SSC) at 70°C for 2 min, followed by 70% ethanol at -20°C and then 90% and 100% ethanol at room temperature, followed by air drying. (Rouwendal et al., p. 79)

Example description of assumptions

We considered seeds left in the petri dish to be unharvested and those scattered singly on the surface of a tile to be scattered and also unharvested. We considered seeds in cheek pouches to be harvested but not cached, those stored in the nestbox to be larderhoarded, and those buried in caching sites within the arena to be scatterhoarded. (Krupa and Geluso, p. 99)

Examples of use of specialized equipment

  • Oligonucleotide primers were prepared using the Applied Biosystems Model 318A (Foster City, CA) DNA Synthesizer according to the manufacturers' instructions. (Rouwendal et al ., p.78)
  • We first visually reviewed the complete song sample of an individual using spectrograms produced on a Princeton Applied Research Real Time Spectrum Analyzer (model 4512). (Peters et al ., p. 937)

Example of use of a certain technique

Frogs were monitored using visual encounter transects (Crump and Scott, 1994). (Sartorius and Rosen, p. 269)

Example description of statistical analysis

We used Wilcoxon rank-sum tests for all comparisons of pre-experimental scores and for all comparisons of hue, saturation, and brightness scores between various groups of birds ... All P-values are two-tailed unless otherwise noted. (Brawner et al., p. 955)
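To make a description like this concrete, here is a minimal Python sketch, using scipy and entirely made-up hue scores (not the Brawner et al. data), of how a two-tailed Wilcoxon rank-sum comparison between two groups of birds might be run:

```python
# A hedged sketch with hypothetical data: compare hue scores from two
# made-up groups of birds using a Wilcoxon rank-sum test. The reported
# P-value is two-sided by default, matching the convention quoted above.
from scipy.stats import ranksums

control_hue   = [12.1, 13.4, 11.8, 12.9, 13.0]   # hypothetical control-group scores
treatment_hue = [14.2, 15.1, 13.9, 14.8, 15.5]   # hypothetical treatment-group scores

result = ranksums(control_hue, treatment_hue)
print(f"rank-sum statistic = {result.statistic:.2f}, two-tailed P = {result.pvalue:.3f}")
```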

Results

This section presents the facts--what was found in the course of this investigation. Detailed data--measurements, counts, percentages, patterns--usually appear in tables, figures, and graphs, and the text of the section draws attention to the key data and relationships among data. Three rules of thumb will help you with this section:

  • present results clearly and logically
  • avoid excess verbiage
  • consider providing a one-sentence summary at the beginning of each paragraph if you think it will help your reader understand your data

Remember to use tables and figures effectively. But don't expect these to stand alone.

Some examples of well-organized and easy-to-follow results:

  • Size of the aquatic habitat at Agua Caliente Canyon varied dramatically throughout the year. The site contained three rockbound tinajas (bedrock pools) that did not dry during this study. During periods of high stream discharge seven more seasonal pools and intermittent stretches of riffle became available. Perennial and seasonal pool levels remained stable from late February through early May. Between mid-May and mid-July seasonal pools dried until they disappeared. Perennial pools shrank in surface area from a range of 30-60 m² to 3-5 m². (Sartorius and Rosen, Sept. 2000: 269)

  • A similar test result is obtained with a primer derived from the human β-satellite... This primer (AGTGCAGAGATATGTCACAATG-CCCC: Oligo 435) labels 6 sites in the PRINS reaction: the chromosomes 1, one pair of acrocentrics and, more weakly, the chromosomes 9 (Fig. 2a). After 10 cycles of PCR-IS, the number of sites labeled has doubled (Fig. 2b); after 20 cycles, the number of sites labeled is the same but the signals are stronger (Fig. 2c) (Rouwendal et al., July 93:80).

Notice how the second sample points out what is important in the accompanying figure. It makes us aware of relationships that we may not have noticed quickly otherwise and that will be important to the discussion.

Related Information: Use Tables and Figures Effectively

Do not repeat all of the information in the text that appears in a table, but do summarize it. For example, if you present a table of temperature measurements taken at various times, describe the general pattern of temperature change and refer to the table.

"The temperature of the solution increased rapidly at first, going from 50º to 80º in the first three minutes (Table 1)."

You don't want to list every single measurement in the text ("After one minute, the temperature had risen to 55º. After two minutes, it had risen to 58º," etc.). There is no hard and fast rule about when to report all measurements in the text and when to put the measurements in a table and refer to them, but use your common sense. Remember that readers have all that data in the accompanying tables and figures, so your task in this section is to highlight key data, changes, or relationships.
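As a small illustration of the same principle, here is a hypothetical Python sketch that computes only the summary values you would state in the text (the readings are invented, not the actual Table 1 data); the full series stays in the table:

```python
# Hedged sketch: from a list of hypothetical (minute, °C) readings, compute
# only the values worth stating in the text and refer the reader to the table.
readings = [(0, 50), (1, 55), (2, 58), (3, 80), (4, 81), (5, 82)]  # invented data

start_temp = readings[0][1]
temp_at_3min = dict(readings)[3]
max_temp = max(temp for _, temp in readings)

print(f"The temperature of the solution increased rapidly at first, "
      f"going from {start_temp}° to {temp_at_3min}° in the first three minutes (Table 1); "
      f"it peaked at {max_temp}°.")
```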

Discussion

In this section you discuss your results. What aspect you choose to focus on depends on your results and on the main questions addressed by them. For example, if you were testing a new technique, you will want to discuss how useful this technique is: how well did it work, what are the benefits and drawbacks, etc. If you are presenting data that appear to refute or support earlier research, you will want to analyze both your own data and the earlier data--what conditions are different? how much difference is due to a change in the study design, and how much to a new property in the study subject? You may discuss the implication of your research--particularly if it has a direct bearing on a practical issue, such as conservation or public health.

This section centers on speculation. However, this does not free you to present wild and haphazard guesses. Focus your discussion around a particular question or hypothesis. Use subheadings to organize your thoughts, if necessary.

This section depends on a logical organization so readers can see the connection between your study question and your results. One typical approach is to make a list of all the ideas that you will discuss and to work out the logical relationships between them--what idea is most important? or, what point is most clearly made by your data? what ideas are subordinate to the main idea? what are the connections between ideas?

Achieving the Scientific Voice

These ten tips will help you match your style to that of most scientific publications.

  • Develop a precise vocabulary: read the literature to become fluent in, or at least familiar with, the sort of language that is standard for describing what you're trying to describe.
  • Once you've labeled an activity, a condition, or a period of time, use that label consistently throughout the paper. Consistency is more important than creativity.
  • Define your terms and your assumptions.
  • Include all the information the reader needs to interpret your data.
  • Remember, the key to all scientific discourse is that it be reproducible. Have you presented enough information clearly enough that the reader could reproduce your experiment, your research, or your investigation?
  • When describing an activity, break it down into elements that can be described and labeled, and then present them in the order they occurred.
  • When you use numbers, use them effectively. Don't present them so that they cause more work for the reader.
  • Include details before conclusions, but only include those details you have been able to observe by the methods you have described. Do not include your feelings, attitudes, impressions, or opinions.
  • Research your format and citations: do these match what has been used in current, relevant journals?
  • Run a spellcheck and proofread carefully. Read your paper out loud, and/or have a friend look over it for misspelled words, missing words, etc.

Applying the Principles, Example 1

The following example needs more precise information. Look at the original and revised paragraphs to see how revising with these guidelines in mind can make the text clearer and more informative:

Before: Each male sang a definite number of songs while singing. They start with a whistle and then go from there. Each new song is always different, but made up an overall repertoire that was completed before starting over again. In 16 cases (84%), no new songs were sung after the first 20, even though we counted about 44 songs for each bird.
After: Each male used a discrete number of song types in his singing. Each song began with an introductory whistle, followed by a distinctive, complex series of fluty warbles (Fig. 1). Successive songs were always different, and five of the 19 males presented their entire song repertoire before repeating any of their song types (i.e., the first 10 recorded songs revealed the entire repertoire of 10 song types). Each song type recurred in long sequences of singing, so that we could be confident that we had recorded the entire repertoire of commonly used songs by each male. For 16 of the 19 males, no new song types were encountered after the first 20 songs, even though we analyzed an average of 44 songs/male (range 30-59).

Applying the Principles, Example 2

In this set of examples, even a few changes in wording result in a more precise second version. Look at the original and revised paragraphs to see how revising with these guidelines in mind can make the text clearer and more informative:

Before: The study area was on Mt. Cain and Maquilla Peak in British Columbia, Canada. The study area is about 12,000 ha of coastal montane forest. The area is both managed and unmanaged and ranges from 600-1650m. The most common trees present are mountain hemlock ( Tsuga mertensiana ), western hemlock ( Tsuga heterophylla ), yellow cedar ( Chamaecyparis nootkatensis ), and amabilis fir ( Abies amabilis ).
After: The study took place on Mt. Cain and Maquilla Peak (50°13'N, 126°18'W), Vancouver Island, British Columbia. The study area encompassed 11,800 ha of coastal montane forest. The landscape consisted of managed and unmanaged stands of coastal montane forest, 600-1650 m in elevation. The dominant tree species included mountain hemlock ( Tsuga mertensiana ), western hemlock ( Tsuga heterophylla ), yellow cedar ( Chamaecyparis nootkatensis ), and amabilis fir ( Abies amabilis ).

Two Tips for Sentence Clarity

Although you will want to consider more detailed stylistic revisions as you become more comfortable with scientific writing, two tips can get you started:

First, the verb should follow the subject as soon as possible.

Really Hard to Read: "The smallest of the URF's (URFA6L), a 207-nucleotide (nt) reading frame overlapping out of phase the NH2-terminal portion of the adenosinetriphosphatase (ATPase) subunit 6 gene has been identified as the animal equivalent of the recently discovered yeast H+-ATPase subunit 8 gene."

Less Hard to Read: "The smallest of the URF's is URFA6L, a 207-nucleotide (nt) reading frame overlapping out of phase the NH2-terminal portion of the adenosinetriphosphatase (ATPase) subunit 6 gene; it has been identified as the animal equivalent of the recently discovered yeast H+-ATPase subunit 8 gene."

Second, place familiar information first in a clause, a sentence, or a paragraph, and put the new and unfamiliar information later.

More confusing : The epidermis, the dermis, and the subcutaneous layer are the three layers of the skin. A layer of dead skin cells makes up the epidermis, which forms the body's shield against the world. Blood vessels, carrying nourishment, and nerve endings, which relay information about the outside world, are found in the dermis. Sweat glands and fat cells make up the third layer, the subcutaneous layer.

Less confusing : The skin consists of three layers: the epidermis, the dermis, and the subcutaneous layer. The epidermis is made up of dead skin cells, and forms a protective shield between the body and the world. The dermis contains the blood vessels and nerve endings that nourish the skin and make it receptive to outside stimuli. The subcutaneous layer contains the sweat glands and fat cells which perform other functions of the skin.

Bibliography

  • Scientific Writing for Graduate Students. F. P. Woodford. Bethesda, MD: Council of Biology Editors, 1968. [A manual on the teaching of writing to graduate students--very clear and direct.]
  • Scientific Style and Format. Council of Biology Editors. Cambridge: Cambridge University Press, 1994.
  • "The science of scientific writing." George Gopen and Judith Swan. The American Scientist, Vol. 78, Nov.-Dec. 1990. pp. 550-558.
  • "What's right about scientific writing." Alan Gross and Joseph Harmon. The Scientist, Dec. 6, 1999. pp. 20-21.
  • "A Quick Fix for Figure Legends and Table Headings." Donald Kroodsma. The Auk, 117(4): 1081-1083, 2000.

Wortman-Wunder, Emily, & Kate Kiefer. (1998). Writing the Scientific Paper. Writing@CSU . Colorado State University. https://writing.colostate.edu/resources/writing/guides/.


Scientific method (Encyclopedia Britannica)

Scientific method: mathematical and experimental technique employed in the sciences. More specifically, it is the technique used in the construction and testing of a scientific hypothesis.

The process of observing, asking questions, and seeking answers through tests and experiments is not unique to any one field of science. In fact, the scientific method is applied broadly in science, across many different fields. Many empirical sciences, especially the social sciences, use mathematical tools borrowed from probability theory and statistics, together with outgrowths of these, such as decision theory, game theory, utility theory, and operations research. Philosophers of science have addressed general methodological problems, such as the nature of scientific explanation and the justification of induction.


The scientific method is critical to the development of scientific theories, which explain empirical (experiential) laws in a scientifically rational manner. In a typical application of the scientific method, a researcher develops a hypothesis, tests it through various means, and then modifies the hypothesis on the basis of the outcome of the tests and experiments. The modified hypothesis is then retested, further modified, and tested again, until it becomes consistent with observed phenomena and testing outcomes. In this way, hypotheses serve as tools by which scientists gather data. From that data and the many different scientific investigations undertaken to explore hypotheses, scientists are able to develop broad general explanations, or scientific theories.

See also Mill’s methods; hypothetico-deductive method.

How to Write an Essay Introduction | 4 Steps & Examples

Published on February 4, 2019 by Shona McCombes. Revised on July 23, 2023.

A good introduction paragraph is an essential part of any academic essay. It sets up your argument and tells the reader what to expect.

The main goals of an introduction are to:

  • Catch your reader’s attention.
  • Give background on your topic.
  • Present your thesis statement—the central point of your essay.

This introduction example is taken from our interactive essay example on the history of Braille.

The invention of Braille was a major turning point in the history of disability. The writing system of raised dots used by visually impaired people was developed by Louis Braille in nineteenth-century France. In a society that did not value disabled people in general, blindness was particularly stigmatized, and lack of access to reading and writing was a significant barrier to social participation. The idea of tactile reading was not entirely new, but existing methods based on sighted systems were difficult to learn and use. As the first writing system designed for blind people’s needs, Braille was a groundbreaking new accessibility tool. It not only provided practical benefits, but also helped change the cultural status of blindness. This essay begins by discussing the situation of blind people in nineteenth-century Europe. It then describes the invention of Braille and the gradual process of its acceptance within blind education. Subsequently, it explores the wide-ranging effects of this invention on blind people’s social and cultural lives.


Table of contents

  • Step 1: Hook your reader
  • Step 2: Give background information
  • Step 3: Present your thesis statement
  • Step 4: Map your essay’s structure
  • Step 5: Check and revise
  • More examples of essay introductions
  • Frequently asked questions about the essay introduction

Step 1: Hook your reader

Your first sentence sets the tone for the whole essay, so spend some time on writing an effective hook.

Avoid long, dense sentences—start with something clear, concise and catchy that will spark your reader’s curiosity.

The hook should lead the reader into your essay, giving a sense of the topic you’re writing about and why it’s interesting. Avoid overly broad claims or plain statements of fact.

Examples: Writing a good hook

Take a look at these examples of weak hooks and learn how to improve them.

  • Braille was an extremely important invention.
  • The invention of Braille was a major turning point in the history of disability.

The first sentence is a dry fact; the second sentence is more interesting, making a bold claim about exactly  why the topic is important.

  • The internet is defined as “a global computer network providing a variety of information and communication facilities.”
  • The spread of the internet has had a world-changing effect, not least on the world of education.

Avoid using a dictionary definition as your hook, especially if it’s an obvious term that everyone knows. The improved example here is still broad, but it gives us a much clearer sense of what the essay will be about.

  • Mary Shelley’s  Frankenstein is a famous book from the nineteenth century.
  • Mary Shelley’s Frankenstein is often read as a crude cautionary tale about the dangers of scientific advancement.

Instead of just stating a fact that the reader already knows, the improved hook here tells us about the mainstream interpretation of the book, implying that this essay will offer a different interpretation.


Step 2: Give background information

Next, give your reader the context they need to understand your topic and argument. Depending on the subject of your essay, this might include:

  • Historical, geographical, or social context
  • An outline of the debate you’re addressing
  • A summary of relevant theories or research about the topic
  • Definitions of key terms

The information here should be broad but clearly focused and relevant to your argument. Don’t give too much detail—you can mention points that you will return to later, but save your evidence and interpretation for the main body of the essay.

How much space you need for background depends on your topic and the scope of your essay. In our Braille example, we take a few sentences to introduce the topic and sketch the social context that the essay will address (see the example introduction quoted above).

Step 3: Present your thesis statement

Now it’s time to narrow your focus and show exactly what you want to say about the topic. This is your thesis statement—a sentence or two that sums up your overall argument.

This is the most important part of your introduction. A  good thesis isn’t just a statement of fact, but a claim that requires evidence and explanation.

The goal is to clearly convey your own position in a debate or your central point about a topic.

Step 4: Map your essay’s structure

Particularly in longer essays, it’s helpful to end the introduction by signposting what will be covered in each part. Keep it concise and give your reader a clear sense of the direction your argument will take.


Step 5: Check and revise

As you research and write, your argument might change focus or direction as you learn more.

For this reason, it’s often a good idea to wait until later in the writing process before you write the introduction paragraph—it can even be the very last thing you write.

When you’ve finished writing the essay body and conclusion, you should return to the introduction and check that it matches the content of the essay.

It’s especially important to make sure your thesis statement accurately represents what you do in the essay. If your argument has gone in a different direction than planned, tweak your thesis statement to match what you actually say.

To polish your writing, you can use something like a paraphrasing tool.

You can use the checklist below to make sure your introduction does everything it’s supposed to.

Checklist: Essay introduction

My first sentence is engaging and relevant.

I have introduced the topic with necessary background information.

I have defined any important terms.

My thesis statement clearly presents my main point or argument.

Everything in the introduction is relevant to the main body of the essay.



More examples of essay introductions

This introduction to an argumentative essay sets up the debate about the internet and education, and then clearly states the position the essay will argue for.

The spread of the internet has had a world-changing effect, not least on the world of education. The use of the internet in academic contexts is on the rise, and its role in learning is hotly debated. For many teachers who did not grow up with this technology, its effects seem alarming and potentially harmful. This concern, while understandable, is misguided. The negatives of internet use are outweighed by its critical benefits for students and educators—as a uniquely comprehensive and accessible information source; a means of exposure to and engagement with different perspectives; and a highly flexible learning environment.

This introduction to a short expository essay leads into the topic (the invention of the printing press) and states the main point the essay will explain (the effect of this invention on European society).

In many ways, the invention of the printing press marked the end of the Middle Ages. The medieval period in Europe is often remembered as a time of intellectual and political stagnation. Prior to the Renaissance, the average person had very limited access to books and was unlikely to be literate. The invention of the printing press in the 15th century allowed for much less restricted circulation of information in Europe, paving the way for the Reformation.

This introduction to a literary analysis essay, about Mary Shelley’s Frankenstein, starts by describing a simplistic popular view of the story, and then states how the author will give a more complex analysis of the text’s literary devices.

Mary Shelley’s Frankenstein is often read as a crude cautionary tale. Arguably the first science fiction novel, its plot can be read as a warning about the dangers of scientific advancement unrestrained by ethical considerations. In this reading, and in popular culture representations of the character as a “mad scientist”, Victor Frankenstein represents the callous, arrogant ambition of modern science. However, far from providing a stable image of the character, Shelley uses shifting narrative perspectives to gradually transform our impression of Frankenstein, portraying him in an increasingly negative light as the novel goes on. While he initially appears to be a naive but sympathetic idealist, after the creature’s narrative Frankenstein begins to resemble—even in his own telling—the thoughtlessly cruel figure the creature represents him as.


Frequently asked questions about the essay introduction

Your essay introduction should include three main things, in this order:

  • An opening hook to catch the reader’s attention.
  • Relevant background information that the reader needs to know.
  • A thesis statement that presents your main point or argument.

The length of each part depends on the length and complexity of your essay.

The “hook” is the first sentence of your essay introduction. It should lead the reader into your essay, giving a sense of why it’s interesting.

To write a good hook, avoid overly broad statements or long, dense sentences. Try to start with something clear, concise and catchy that will spark your reader’s curiosity.

A thesis statement is a sentence that sums up the central point of your paper or essay. Everything else you write should relate to this key idea.

The thesis statement is essential in any academic essay or research paper for two main reasons:

  • It gives your writing direction and focus.
  • It gives the reader a concise summary of your main point.

Without a clear thesis statement, an essay can end up rambling and unfocused, leaving your reader unsure of exactly what you want to say.

The structure of an essay is divided into an introduction that presents your topic and thesis statement, a body containing your in-depth analysis and arguments, and a conclusion wrapping up your ideas.

The structure of the body is flexible, but you should always spend some time thinking about how you can organize your essay to best serve your ideas.


McCombes, S. (2023, July 23). How to Write an Essay Introduction | 4 Steps & Examples. Scribbr. Retrieved August 21, 2024, from https://www.scribbr.com/academic-essay/introduction/


Scientific Method: Role and Importance Essay


The scientific method is a problem-solving strategy that is at the heart of biology and other sciences. It includes five steps: making an observation, asking a question, forming a hypothesis or a testable explanation, making and testing a prediction, and iterating. In the final, iterative step, the results are used to make new predictions. The scientific method is almost always an iterative process: rather than a straight line, it is a cycle, in which the outcome of one round of questioning generates feedback that improves the next round.

Science is an activity that involves the logical explanation, prediction, and control of empirical phenomena. The principles of reasoning applicable to this endeavor are referred to as scientific reasoning (Cowles, 2020); they include topics such as experimental design, hypothesis testing, and data interpretation. All sciences, including the social sciences, follow the scientific method (Cowles, 2020). Scientists in different domains ask different questions and perform different tests, but they share a common approach to finding logical, evidence-based answers.

Scientific reasoning is fundamental to all types of scientific study, not simply institutional research. Scientists do employ specific ideas that non-scientists do not have to use in everyday life, but many reasoning principles are useful outside science as well. Even people who are not scientists must use sound reasoning to understand, anticipate, and manage the events around them. When people want to start a career, preserve their finances, or improve their health, they need to acquire evidence to determine the most effective way of achieving their goals. Good scientific thinking skills come in handy in all of these situations.

Experiments, surveys, case studies, descriptive studies, and non-descriptive studies are all forms of research used in the scientific method. In an experiment, a researcher manipulates certain factors in a controlled environment and assesses their impact on other variables (Black, 2018). Descriptive research focuses on the nature of the relationship between the variables being studied rather than on cause and effect. A case study is a detailed examination of a single instance in which something unexpected has occurred. This is normally done with a single individual in extreme or exceptional instances. Large groups of individuals are polled to answer questions about certain topics in surveys. Correlational approaches are used in non-descriptive investigations to anticipate the link between two or more variables.

Lau and Chan (2017) describe how to assess the validity of a theory or hypothesis using the hypothetical-deductive method (HD method). The HD method is highly useful for testing theories or hypotheses and is sometimes referred to as “the scientific procedure.” That label is not quite right, because science cannot employ only one approach; nevertheless, the HD method is critical, since it is one of the most fundamental approaches used in many scientific disciplines, including economics, physics, and biochemistry. Its application can be broken down into four stages: identifying a testable hypothesis, generating predictions from the hypothesis, using experiments to check the predictions, and evaluating the outcome (Cowles, 2020). If the tested predictions turn out to be correct, the hypothesis is confirmed; if they turn out to be incorrect, it is disconfirmed.
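The staged logic can be pictured with a short, purely schematic Python sketch. This is not drawn from Lau and Chan; the hypothesis, prediction, and "experiment" below are invented placeholders meant only to show how the four stages fit together:

```python
# Purely schematic sketch of the HD stages: a hypothesis yields a prediction,
# an experiment checks it, and the outcome confirms or disconfirms the
# hypothesis. All inputs are made-up placeholders, not real data.
def hd_method(derive_prediction, run_experiment):
    # stage 1 (identifying the testable hypothesis) happens before this call
    prediction = derive_prediction()      # stage 2: generate a prediction
    observation = run_experiment()        # stage 3: check it experimentally
    return observation == prediction      # stage 4: confirm or disconfirm

confirmed = hd_method(
    derive_prediction=lambda: "fewer bacterial colonies on spice-treated plates",
    run_experiment=lambda: "fewer bacterial colonies on spice-treated plates",  # pretend outcome
)
print("hypothesis confirmed" if confirmed else "hypothesis disconfirmed")
```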

The HD method instructs us on how to test a hypothesis, and each scientific theory must be testable.

If a theory cannot be tested, one cannot gather evidence to show whether it is likely to be true, and in that case it cannot be considered scientific knowledge. Consider the claim that there are ghosts that people cannot see or communicate with and that cannot be detected directly or indirectly. This hypothesis is defined in such a way that testing is impossible. It could still be true, and there could be such ghosts, but people would never know; thus, it cannot be considered a scientific hypothesis. In general, confirming a theory's predictions raises the likelihood that the theory is right, but it does not establish definitively that the theory is right. Moreover, a hypothesis frequently generates a prediction only when combined with additional auxiliary assumptions, so when a prediction fails, the theory itself may still be valid and an auxiliary assumption may be at fault.

When a theory makes a faulty prediction, it can be difficult to determine whether the theory should be rejected or whether the auxiliary assumptions are flawed. Nineteenth-century astronomers, for example, found that Newtonian physics could not adequately explain the orbit of Mercury; in this case Newtonian physics was at fault, and relativity is required for a more accurate prediction of the orbit. By contrast, when astronomers discovered Uranus in 1781 and found that its orbit did not match Newtonian predictions, they concluded that the discrepancy could be explained if Uranus was being affected by another planet, and Neptune was discovered as a result.

There have been several instances where I made assumptions about an important issue without regard to evidence. Once, while preparing work on the distribution of power in the workplace and its relation to gender, I assumed that, possibly because of typically feminine traits, women are less likely than men to project a strong image of power. In fact, such a hypothesis needs to be tested, and it is testable. For example, I could first define what is meant by feminine traits by collecting data from biological and psychological sources. I could then gather information on which factors or behavior patterns contribute to establishing power in the workplace. If I found a correlation between feminine character traits, communication style, and behavioral patterns on the one hand and the distribution of power in the workplace on the other, I could confirm my hypothesis.
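For example, a minimal Python sketch of that correlation check might look like the following; the trait and power scores are entirely hypothetical and only illustrate the shape of the test:

```python
# Hedged sketch: correlate a hypothetical "feminine traits" score with a
# hypothetical "workplace power" score for a small invented sample.
from scipy.stats import pearsonr

trait_scores = [3.2, 4.1, 2.8, 4.5, 3.9, 2.5, 4.8, 3.1]   # invented trait ratings
power_scores = [2.9, 2.1, 3.5, 1.8, 2.4, 3.8, 1.5, 3.0]   # invented power ratings

r, p_value = pearsonr(trait_scores, power_scores)
print(f"r = {r:.2f}, two-tailed P = {p_value:.3f}")
# A significant negative correlation would be consistent with the hypothesis;
# a weak or positive correlation would count against it.
```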

Thus, applying the scientific method can help to improve critical reasoning by drawing on the tools of scientific reasoning. By supporting a hypothesis with evidence from scientific research and statistical data, one can make a claim more credible and objective. The scientific method is essential for the creation of scientific theories that explain information and ideas in a scientifically rational manner. In a typical application, a researcher makes a hypothesis, tests it using various methods, and then alters it based on the results of the tests and experiments. The revised hypothesis is then retested, further changed, and retested until it matches observable events and testing results. Hypotheses serve as tools for scientists to collect data in this way. From that evidence and the many scientific experiments conducted to investigate such possibilities, scientists can build broad general explanations, or scientific theories. In conclusion, the scientific method is an essential approach to examining hypotheses; by using its tools, inferences become rational and objective.

Black, M. (2018). Critical thinking: An introduction to logic and scientific method. Pickle Partners Publishing.

Cowles, H. M. (2020). The Scientific Method. Harvard University Press.

Lau, J., & Chan, J. (2017). Scientific methodology: Tutorials 1-9.

IvyPanda. (2023, March 14). Scientific Method: Role and Importance. https://ivypanda.com/essays/scientific-method-role-and-importance/

Scientific Method Example


The scientific method is a series of steps that scientific investigators follow to answer specific questions about the natural world. Scientists use the scientific method to make observations, formulate hypotheses, and conduct scientific experiments.

A scientific inquiry starts with an observation. Then, the formulation of a question about what has been observed follows. Next, the scientist will proceed through the remaining steps of the scientific method to end at a conclusion.

The six steps of the scientific method are as follows:

Observation

The first step of the scientific method involves making an observation about something that interests you. Taking an interest in your scientific discovery is important, for example, if you are doing a science project, because you will want to work on something that holds your attention. Your observation can be of anything from plant movement to animal behavior, as long as it is something you want to know more about. This step is when you will come up with an idea if you are working on a science project.

Question

Once you have made your observation, you must formulate a question about what you observed. Your question should summarize what it is you are trying to discover or accomplish in your experiment. When stating your question, be as specific as possible. For example, if you are doing a project on plants, you may want to know how plants interact with microbes. Your question could be: Do plant spices inhibit bacterial growth?

Hypothesis

The hypothesis is a key component of the scientific process. A hypothesis is an idea that is suggested as an explanation for a natural event, a particular experience, or a specific condition that can be tested through definable experimentation. It states the purpose of your experiment, the variables used, and the predicted outcome of your experiment. It is important to note that a hypothesis must be testable. That means that you should be able to test your hypothesis through experimentation. Your hypothesis must either be supported or falsified by your experiment. An example of a good hypothesis is: If there is a relation between listening to music and heart rate, then listening to music will cause a person's resting heart rate to either increase or decrease.
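As an illustration, here is a hedged Python sketch of how the music-and-heart-rate hypothesis above might be checked with a paired, two-sided comparison; all heart-rate values are made up, and the design (measuring the same people with and without music) is one possible choice, not the only one:

```python
# Hedged sketch: paired comparison of invented resting heart rates (beats per
# minute) for the same five people, measured in silence and while listening to
# music. A two-sided test matches the "either increase or decrease" wording.
from scipy.stats import ttest_rel

hr_silence = [68, 72, 75, 64, 70]   # hypothetical resting heart rates without music
hr_music   = [66, 70, 76, 61, 67]   # hypothetical resting heart rates with music

result = ttest_rel(hr_silence, hr_music)
print(f"t = {result.statistic:.2f}, two-tailed P = {result.pvalue:.3f}")
```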

Experiment

Once you have developed a hypothesis, you must design and conduct an experiment that will test it. You should develop a procedure that states clearly how you plan to conduct your experiment. It is important you include and identify a controlled variable or dependent variable in your procedure. Controls allow us to test a single variable in an experiment because they are unchanged. We can then make observations and comparisons between our controls and our independent variables (things that change in the experiment) to develop an accurate conclusion.
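One common way to keep everything except the independent variable the same, sketched here with hypothetical participant IDs, is to assign subjects to the control and experimental conditions at random:

```python
# Hedged sketch: randomly split invented participants into a control group and
# an experimental group so that only the independent variable (e.g., listening
# to music) differs systematically between the two groups.
import random

participants = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]
random.shuffle(participants)

half = len(participants) // 2
control_group = participants[:half]        # control condition (no music)
experimental_group = participants[half:]   # experimental condition (music)

print("Control group:      ", control_group)
print("Experimental group: ", experimental_group)
```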

Results

The results are where you report what happened in the experiment. That includes detailing all observations and data made during your experiment. Most people find it easier to visualize the data by charting or graphing the information.

Conclusion

Developing a conclusion is the final step of the scientific method. This is where you analyze the results from the experiment and reach a determination about the hypothesis. Did the experiment support or reject your hypothesis? If your hypothesis was supported, great. If not, repeat the experiment or think of ways to improve your procedure.


Scientific Method Steps in Psychology Research

Steps, Uses, and Key Terms


How do researchers investigate psychological phenomena? They utilize a process known as the scientific method to study different aspects of how people think and behave.

When conducting research, the scientific method steps to follow are:

  • Observe what you want to investigate
  • Ask a research question and make predictions
  • Test the hypothesis and collect data
  • Examine the results and draw conclusions
  • Report and share the results 

This process not only allows scientists to investigate and understand different psychological phenomena but also provides researchers and others a way to share and discuss the results of their studies.

Generally, there are five main steps in the scientific method, although some may break down this process into six or seven steps. An additional step in the process can also include developing new research questions based on your findings.

What Is the Scientific Method?

What is the scientific method and how is it used in psychology?

The scientific method consists of five steps. It is essentially a step-by-step process that researchers can follow to determine if there is some type of relationship between two or more variables.

By knowing the steps of the scientific method, you can better understand the process researchers go through to arrive at conclusions about human behavior.

Scientific Method Steps

While research studies can vary, these are the basic steps that psychologists and scientists use when investigating human behavior.

The following are the scientific method steps:

Step 1. Make an Observation

Before a researcher can begin, they must choose a topic to study. Once an area of interest has been chosen, the researchers must then conduct a thorough review of the existing literature on the subject. This review will provide valuable information about what has already been learned about the topic and what questions remain to be answered.

A literature review might involve looking at a considerable amount of written material from both books and academic journals dating back decades.

The relevant information collected by the researcher will be presented in the introduction section of the final published study results. This background material will also help the researcher with the first major step in conducting a psychology study: formulating a hypothesis.

Step 2. Ask a Question

Once a researcher has observed something and gained some background information on the topic, the next step is to ask a question. The researcher will form a hypothesis, which is an educated guess about the relationship between two or more variables.

For example, a researcher might ask a question about the relationship between sleep and academic performance: Do students who get more sleep perform better on tests at school?

In order to formulate a good hypothesis, it is important to think about different questions you might have about a particular topic.

You should also consider how you could investigate the causes. Falsifiability is an important part of any valid hypothesis. In other words, if a hypothesis was false, there needs to be a way for scientists to demonstrate that it is false.

Step 3. Test Your Hypothesis and Collect Data

Once you have a solid hypothesis, the next step of the scientific method is to put this hunch to the test by collecting data. The exact methods used to investigate a hypothesis depend on exactly what is being studied. There are two basic forms of research that a psychologist might utilize: descriptive research or experimental research.

Descriptive research is typically used when it would be difficult or even impossible to manipulate the variables in question. Examples of descriptive research include case studies, naturalistic observation , and correlation studies. Phone surveys that are often used by marketers are one example of descriptive research.

Correlational studies are quite common in psychology research. While they do not allow researchers to determine cause-and-effect, they do make it possible to spot relationships between different variables and to measure the strength of those relationships. 

Experimental research is used to explore cause-and-effect relationships between two or more variables. This type of research involves systematically manipulating an independent variable and then measuring the effect that it has on a defined dependent variable .

One of the major advantages of this method is that it allows researchers to actually determine if changes in one variable actually cause changes in another.

While psychology experiments can be quite complex, even a simple experiment allows researchers to determine cause-and-effect relationships between variables. Most simple experiments use a control group (those who do not receive the treatment) and an experimental group (those who do receive the treatment).

Step 4. Examine the Results and Draw Conclusions

Once a researcher has designed the study and collected the data, it is time to examine this information. Using statistics, researchers can summarize the data, analyze the results, and draw conclusions based on this evidence.

So how does a researcher decide what the results of a study mean? Not only can statistical analysis support (or refute) the researcher’s hypothesis; it can also be used to determine if the findings are statistically significant.

When results are said to be statistically significant, it means that it is unlikely that these results are due to chance.
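As a hedged illustration of what this step can look like in practice (not taken from any particular study discussed here), the short Python sketch below simulates scores for a hypothetical control group and experimental group and runs an independent-samples t test; the group sizes, means, and the p < .05 cutoff are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: exam scores for a control group (no treatment)
# and an experimental group (treatment), both on a 0-100 scale.
control = rng.normal(loc=72, scale=10, size=40)
experimental = rng.normal(loc=78, scale=10, size=40)

# Independent-samples t test: is the observed difference in means
# unlikely to have arisen by chance alone?
t_stat, p_value = stats.ttest_ind(experimental, control)

print(f"mean difference = {experimental.mean() - control.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A common (but conventional) threshold is to call the result
# statistically significant when p < .05.
print("statistically significant" if p_value < 0.05 else "not statistically significant")
```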

Based on these observations, researchers must then determine what the results mean. In some cases, an experiment will support a hypothesis, but in other cases, it will fail to support the hypothesis.

So what happens if the results of a psychology experiment do not support the researcher's hypothesis? Does this mean that the study was worthless?

Just because the findings fail to support the hypothesis does not mean that the research is not useful or informative. In fact, such research plays an important role in helping scientists develop new questions and hypotheses to explore in the future.

After conclusions have been drawn, the next step is to share the results with the rest of the scientific community. This is an important part of the process because it contributes to the overall knowledge base and can help other scientists find new research avenues to explore.

Step 5. Report the Results

The final step in a psychology study is to report the findings. This is often done by writing up a description of the study and publishing the article in an academic or professional journal. The results of psychological studies can be seen in peer-reviewed journals such as Psychological Bulletin, the Journal of Social Psychology, Developmental Psychology, and many others.

The structure of a journal article follows a specified format that has been outlined by the American Psychological Association (APA). In these articles, researchers:

  • Provide a brief history and background on previous research
  • Present their hypothesis
  • Identify who participated in the study and how they were selected
  • Provide operational definitions for each variable
  • Describe the measures and procedures that were used to collect data
  • Explain how the information collected was analyzed
  • Discuss what the results mean

Why is such a detailed record of a psychological study so important? By clearly explaining the steps and procedures used throughout the study, other researchers can then replicate the results. The editorial process employed by academic and professional journals ensures that each article that is submitted undergoes a thorough peer review, which helps ensure that the study is scientifically sound.

Once published, the study becomes another piece of the existing puzzle of our knowledge base on that topic.

Here is a review of some key terms and definitions you should be familiar with when working through the scientific method steps:

  • Falsifiable : The variables can be measured so that if a hypothesis is false, it can be proven false
  • Hypothesis : An educated guess about the possible relationship between two or more variables
  • Variable : A factor or element that can change in observable and measurable ways
  • Operational definition : A full description of exactly how variables are defined, how they will be manipulated, and how they will be measured
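As a purely illustrative sketch (the variable names, measures, and cutoff below are invented, not drawn from a real study), operational definitions for the earlier sleep-and-test-performance question might be captured like this:

```python
from dataclasses import dataclass

@dataclass
class StudyVariables:
    """Operational definitions for a hypothetical sleep/performance study."""
    # Independent variable: average nightly sleep during the week before the
    # exam, in hours, taken from a self-report sleep diary.
    sleep_hours: float
    # Dependent variable: score on a standardized 100-point course exam.
    exam_score: float

def sleep_group(participant: StudyVariables) -> str:
    """Classify participants with an explicit, measurable cutoff (>= 8 hours)."""
    return "adequate sleep" if participant.sleep_hours >= 8 else "short sleep"

print(sleep_group(StudyVariables(sleep_hours=7.5, exam_score=82)))
```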

Uses for the Scientific Method

The goals of psychological studies are to describe, explain, predict, and perhaps influence mental processes or behaviors. In order to do this, psychologists utilize the scientific method to conduct psychological research. The scientific method is a set of principles and procedures used by researchers to develop questions, collect data, and reach conclusions.

Goals of Scientific Research in Psychology

Researchers seek not only to describe behaviors and explain why these behaviors occur; they also strive to create research that can be used to predict and even change human behavior.

Psychologists and other social scientists regularly propose explanations for human behavior. On a more informal level, people make judgments about the intentions, motivations , and actions of others on a daily basis.

While the everyday judgments we make about human behavior are subjective and anecdotal, researchers use the scientific method to study psychology in an objective and systematic way. The results of these studies are often reported in popular media, which leads many to wonder just how or why researchers arrived at the conclusions they did.


Writing a Scientific Paper


Writing a "good" methods section

"methods checklist" from: how to write a good scientific paper. chris a. mack. spie. 2018..


The purpose is to provide enough detail that a competent worker could repeat the experiment. Many of your readers will skip this section because they already know from the Introduction the general methods you used. However, careful writing of this section is important because, for your results to have scientific merit, they must be reproducible. Otherwise, your paper does not represent good science.

  • Exact technical specifications and quantities and source or method of preparation
  • Describe equipment used and provide illustrations where relevant.
  • Chronological presentation (but related methods described together)
  • Questions about "how" and "how much" are answered for the reader and not left for them to puzzle over
  • Discuss statistical methods only if unusual or advanced
  • When a large number of components are used, prepare tables for the benefit of the reader
  • Do not state the action without stating the agent of the action
  • Describe how the results were generated with sufficient detail so that an independent researcher (working in the same field) could reproduce the results sufficiently to allow validation of the conclusions.
  • Can the reader assess internal validity (conclusions are supported by the results presented)?
  • Can the reader assess external validity (conclusions are properly generalized beyond these specific results)?
  • Has the chosen method been justified?
  • Are data analysis and statistical approaches justified, with assumptions and biases considered?
  • Avoid: including results in the Method section; including extraneous details (unnecessary to enable reproducibility or judge validity); treating the method as a chronological history of events; unneeded references to commercial products; references to “proprietary” products or processes unavailable to the reader. 


Linking essay-writing tests using many-facet models and neural automated essay scoring

Original Manuscript | Open access | Published: 20 August 2024




Masaki Uto (ORCID: orcid.org/0000-0002-9330-5158) and Kota Aramaki


For essay-writing tests, challenges arise when scores assigned to essays are influenced by the characteristics of raters, such as rater severity and consistency. Item response theory (IRT) models incorporating rater parameters have been developed to tackle this issue, exemplified by the many-facet Rasch models. These IRT models enable the estimation of examinees’ abilities while accounting for the impact of rater characteristics, thereby enhancing the accuracy of ability measurement. However, difficulties can arise when different groups of examinees are evaluated by different sets of raters. In such cases, test linking is essential for unifying the scale of model parameters estimated for individual examinee–rater groups. Traditional test-linking methods typically require administrators to design groups in which either examinees or raters are partially shared. However, this is often impractical in real-world testing scenarios. To address this, we introduce a novel method for linking the parameters of IRT models with rater parameters that uses neural automated essay scoring technology. Our experimental results indicate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.


Introduction

The growing demand for assessing higher-order skills, such as logical reasoning and expressive capabilities, has led to increased interest in essay-writing assessments (Abosalem, 2016 ; Bernardin et al., 2016 ; Liu et al., 2014 ; Rosen & Tager, 2014 ; Schendel & Tolmie, 2017 ). In these assessments, human raters assess the written responses of examinees to specific writing tasks. However, a major limitation of these assessments is the strong influence that rater characteristics, including severity and consistency, have on the accuracy of ability measurement (Bernardin et al., 2016 ; Eckes, 2005 , 2023 ; Kassim, 2011 ; Myford & Wolfe, 2003 ). Several item response theory (IRT) models that incorporate parameters representing rater characteristics have been proposed to mitigate this issue (Eckes, 2023 ; Myford & Wolfe, 2003 ; Uto & Ueno, 2018 ).

The most prominent among them are many-facet Rasch models (MFRMs) (Linacre, 1989 ), and various extensions of MFRMs have been proposed to date (Patz & Junker, 1999 ; Patz et al., 2002 ; Uto & Ueno, 2018 , 2020 ). These IRT models have the advantage of being able to estimate examinee ability while accounting for rater effects, making them more accurate than simple scoring methods based on point totals or averages.

However, difficulties can arise when essays from different groups of examinees are evaluated by different sets of raters, a scenario often encountered in real-world testing. For instance, in academic settings such as university admissions, individual departments may use different pools of raters to assess essays from specific applicant pools. Similarly, in the context of large-scale standardized tests, different sets of raters may be allocated to various test dates or locations. Thus, when applying IRT models with rater parameters to account for such real-world testing cases while also ensuring that ability estimates are comparable across groups of examinees and raters, test linking becomes essential for unifying the scale of model parameters estimated for each group.

Conventional test-linking methods generally require some overlap of examinees or raters across the groups being linked (Eckes, 2023 ; Engelhard, 1997 ; Ilhan, 2016 ; Linacre, 2014 ; Uto, 2021a ). For example, linear linking based on common examinees, a popular linking method, estimates the IRT parameters for shared examinees using data from each group. These estimates are then used to build a linear regression model, which adjusts the parameter scales across groups. However, the design of such overlapping groups can often be impractical in real-world testing environments.

To facilitate test linking in these challenging environments, we introduce a novel method that leverages neural automated essay scoring (AES) technology. Specifically, we employ a cutting-edge deep neural AES method (Uto & Okano, 2021 ) that can predict IRT-based abilities from examinees’ essays. The central concept of our linking method is to construct an AES model using the ability estimates of examinees in a reference group, along with their essays, and then to apply this model to predict the abilities of examinees in other groups. An important point is that the AES model is trained to predict examinee abilities on the scale established by the reference group. This implies that the trained AES model can predict the abilities of examinees in other groups on the ability scale established by the reference group. Therefore, we use the predicted abilities to calculate the linking coefficients required for linear linking and to perform a test linking. In this study, we conducted experiments based on real-world data to demonstrate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.

It should be noted that previous studies have attempted to employ AES technologies for test linking (Almond, 2014 ; Olgar, 2015 ), but their focus has primarily been on linking tests with varied writing tasks or a mixture of essay tasks and objective items, while overlooking the influence of rater characteristics. This differs from the specific scenarios and goals that our study aims to address. To the best of our knowledge, this is the first study that employs AES technologies to link IRT models incorporating rater parameters for writing assessments without the need for common examinees and raters.

Setting and data

In this study, we assume scenarios in which two groups of examinees respond to the same writing task and their written essays are assessed by two distinct sets of raters following the same scoring rubric. We refer to one group as the reference group , which serves as the basis for the scale, and the other as the focal group , whose scale we aim to align with that of the reference group.

Let \(u^{\text {ref}}_{jr}\) be the score assigned by rater \(r \in \mathcal {R}^{\text {ref}}\) to the essay of examinee \(j \in \mathcal {J}^{\text {ref}}\), where \(\mathcal {R}^{\text {ref}}\) and \(\mathcal {J}^{\text {ref}}\) denote the sets of raters and examinees in the reference group, respectively. Then, a collection of scores for the reference group can be defined as

$$\textbf{U}^{\text {ref}} = \left\{ u^{\text {ref}}_{jr} \in \mathcal {K} \cup \{-1\} \;\middle|\; j \in \mathcal {J}^{\text {ref}},\, r \in \mathcal {R}^{\text {ref}} \right\},$$

where \(\mathcal{K} = \{1,\ldots ,K\}\) represents the rating categories, and \(-1\) indicates missing data.

Similarly, a collection of scores for the focal group can be defined as

$$\textbf{U}^{\text {foc}} = \left\{ u^{\text {foc}}_{jr} \in \mathcal {K} \cup \{-1\} \;\middle|\; j \in \mathcal {J}^{\text {foc}},\, r \in \mathcal {R}^{\text {foc}} \right\},$$

where \(u^{\text {foc}}_{jr}\) indicates the score assigned by rater \(r \in \mathcal {R}^{\text {foc}}\) to the essay of examinee \(j \in \mathcal {J}^{\text {foc}}\), and \(\mathcal {R}^{\text {foc}}\) and \(\mathcal {J}^{\text {foc}}\) represent the sets of raters and examinees in the focal group, respectively.

The primary objective of this study is to apply IRT models with rater parameters to the two sets of data, \(\textbf{U}^{\text {ref}}\) and \(\textbf{U}^{\text {foc}}\) , and to establish IRT parameter linking without shared examinees and raters: \(\mathcal {J}^{\text {ref}} \cap \mathcal {J}^{\text {foc}} = \emptyset \) and \(\mathcal {R}^{\text {ref}} \cap \mathcal {R}^{\text {foc}} = \emptyset \) . More specifically, we seek to align the scale derived from \(\textbf{U}^{\text {foc}}\) with that of \(\textbf{U}^{\text {ref}}\) .
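To make this data layout concrete, the small sketch below stores each group's scores as an examinee-by-rater matrix in which entries take values in \(\mathcal{K}\) and \(-1\) marks essays that a rater did not score; all numbers are invented for illustration.

```python
import numpy as np

K = 5  # rating categories 1..K; -1 marks a missing (unassigned) rating

# Hypothetical reference-group scores: 4 examinees x 3 raters.
U_ref = np.array([
    [ 4, -1,  3],
    [ 2,  3, -1],
    [-1,  5,  4],
    [ 3,  3,  3],
])

# Hypothetical focal-group scores: 3 examinees x 2 different raters
# (no examinees or raters shared with the reference group).
U_foc = np.array([
    [ 1,  2],
    [-1,  4],
    [ 3, -1],
])

# Mask the missing entries, e.g., to inspect each rater's observed mean score.
ref_observed = np.ma.masked_equal(U_ref, -1)
print(ref_observed.mean(axis=0))
```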

Item response theory

IRT (Lord, 1980 ), a test theory grounded in mathematical models, has recently gained widespread use in various testing situations due to the growing prevalence of computer-based testing. In objective testing contexts, IRT makes use of latent variable models, commonly referred to as IRT models. Traditional IRT models, such as the Rasch model and the two-parameter logistic model, give the probability of an examinee’s response to a test item as a probabilistic function influenced by both the examinee’s latent ability and the item’s characteristic parameters, such as difficulty and discrimination. These IRT parameters can be estimated from a dataset consisting of examinees’ responses to test items.

However, traditional IRT models are not directly applicable to essay-writing test data, where the examinees’ responses to test items are assessed by multiple human raters. Extended IRT models with rater parameters have been proposed to address this issue (Eckes, 2023 ; Jin and Wang, 2018 ; Linacre, 1989 ; Shin et al., 2019 ; Uto, 2023 ; Wilson & Hoskens, 2001 ).

Many-facet Rasch models and their extensions

The MFRM (Linacre, 1989) is the most commonly used IRT model that incorporates rater parameters. Although several variants of the MFRM exist (Eckes, 2023; Myford & Wolfe, 2004), the most representative model defines the probability that the essay of examinee j for a given test item (either a writing task or prompt) i receives a score of k from rater r as

$$P_{ijrk} = \frac{\exp \sum_{m=1}^{k}\left[ D\left( \theta_j - \beta_i - \beta_r - d_m \right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l}\left[ D\left( \theta_j - \beta_i - \beta_r - d_m \right) \right]},$$

where \(\theta _j\) is the latent ability of examinee j, \(\beta _{i}\) represents the difficulty of item i, \(\beta _{r}\) represents the severity of rater r, and \(d_{m}\) is a step parameter denoting the difficulty of transitioning between scores \(m-1\) and m. \(D = 1.7\) is a scaling constant used to minimize the difference between the normal and logistic distribution functions. For model identification, \(\sum _{i} \beta _{i} = 0\), \(d_1 = 0\), \(\sum _{m = 2}^{K} d_{m} = 0\), and a normal distribution for the ability \(\theta _j\) are assumed.

Another popular MFRM is one in which \(d_{m}\) is replaced with \(d_{rm}\) , a rater-specific step parameter denoting the severity of rater r when transitioning from score  \(m-1\) to m . This model is often used to investigate variations in rating scale criteria among raters caused by differences in the central tendency, extreme response tendency, and range restriction among raters (Eckes, 2023 ; Myford & Wolfe, 2004 ; Qiu et al., 2022 ; Uto, 2021a ).

A recent extension of the MFRM is the generalized many-facet model (GMFM) (Uto & Ueno, 2020), which incorporates parameters denoting rater consistency and item discrimination. GMFM defines the probability \(P_{ijrk}\) as

$$P_{ijrk} = \frac{\exp \sum_{m=1}^{k}\left[ D\alpha_i \alpha_r\left( \theta_j - \beta_i - \beta_r - d_{rm} \right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l}\left[ D\alpha_i \alpha_r\left( \theta_j - \beta_i - \beta_r - d_{rm} \right) \right]},$$

where \(\alpha _i\) indicates the discrimination power of item i, and \(\alpha _r\) indicates the consistency of rater r. For model identification, \(\prod _{i} \alpha _{i} = 1\), \(\sum _{i} \beta _{i} = 0\), \(d_{r1} = 0\), \(\sum _{m = 2}^{K} d_{rm} = 0\), and a normal distribution for the ability \(\theta _j\) are assumed.

In this study, we seek to apply the aforementioned IRT models to data involving a single test item, as detailed in the Setting and data section. When there is only one test item, the item parameters in the above equations become superfluous and can be omitted. Consequently, the equations for these models can be simplified as follows.

MFRM:

$$P_{jrk} = \frac{\exp \sum_{m=1}^{k}\left[ D\left( \theta_j - \beta_r - d_m \right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l}\left[ D\left( \theta_j - \beta_r - d_m \right) \right]} \qquad (5)$$

MFRM with rater-specific step parameters (referred to as MFRM with RSS in the subsequent sections):

$$P_{jrk} = \frac{\exp \sum_{m=1}^{k}\left[ D\left( \theta_j - \beta_r - d_{rm} \right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l}\left[ D\left( \theta_j - \beta_r - d_{rm} \right) \right]} \qquad (6)$$

GMFM:

$$P_{jrk} = \frac{\exp \sum_{m=1}^{k}\left[ D\alpha_r\left( \theta_j - \beta_r - d_{rm} \right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l}\left[ D\alpha_r\left( \theta_j - \beta_r - d_{rm} \right) \right]} \qquad (7)$$
Note that the GMFM can simultaneously capture the following typical characteristics of raters, whereas the MFRM and MFRM with RSS can only consider a subset of these characteristics.

Severity : This refers to the tendency of some raters to systematically assign higher or lower scores compared with other raters regardless of the actual performance of the examinee. This tendency is quantified by the parameter \(\beta _r\) .

Consistency : This is the extent to which raters maintain their scoring criteria consistently over time and across different examinees. Consistent raters exhibit stable scoring patterns, which make their evaluations more reliable and predictable. In contrast, inconsistent raters show varying scoring tendencies. This characteristic is represented by the parameter \(\alpha _r\) .

Range Restriction : This describes the limited variability in scores assigned by a rater. Central tendency and extreme response tendency are special cases of range restriction. This characteristic is represented by the parameter \(d_{rm}\) .

For details on how these characteristics are represented in the GMFM, see the article (Uto & Ueno, 2020 ).

Based on the above, it is evident that both the MFRM and MFRM with RSS are special cases of the GMFM. Specifically, the GMFM with constant rater consistency corresponds to the MFRM with RSS. Moreover, the MFRM with RSS that assumes no differences in the range restriction characteristic among raters aligns with the MFRM.
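The following sketch computes the single-item GMFM category probabilities defined above for one examinee-rater pair (the MFRM and MFRM with RSS follow by fixing \(\alpha_r = 1\) and/or using shared step parameters); the parameter values are arbitrary and chosen only to satisfy the identification constraints.

```python
import numpy as np

D = 1.7  # scaling constant

def gmfm_probs(theta, alpha_r, beta_r, d_r):
    """Single-item GMFM category probabilities for one examinee-rater pair.

    theta   : examinee ability
    alpha_r : rater consistency (set to 1.0 to recover the MFRM with RSS)
    beta_r  : rater severity
    d_r     : rater-specific step parameters d_{r1..rK}, with d_{r1} = 0
    """
    # Cumulative sums of the kernel D * alpha_r * (theta - beta_r - d_{rm}), m = 1..k
    kernel = D * alpha_r * (theta - beta_r - np.asarray(d_r, dtype=float))
    numerators = np.exp(np.cumsum(kernel))
    return numerators / numerators.sum()

# Arbitrary example with K = 5 categories; the steps sum to zero over m >= 2.
d_r = [0.0, -1.0, -0.3, 0.4, 0.9]
probs = gmfm_probs(theta=0.5, alpha_r=1.2, beta_r=-0.2, d_r=d_r)
print(np.round(probs, 3), probs.sum())  # probabilities over score categories 1..5
```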

When the aforementioned IRT models are applied to datasets from multiple groups composed of different examinees and raters, such as \(\textbf{U}^{\text {ref}}\) and \(\textbf{U}^{\text {foc}}\), the scales of the estimated parameters generally differ among them. This discrepancy arises because IRT permits arbitrary scaling of parameters for each independent dataset. An exception occurs when it is feasible to assume equality in between-test distributions of examinee abilities and rater parameters (Linacre, 2014). However, real-world testing conditions may not always satisfy this assumption. Therefore, if the aim is to compare parameter estimates between different groups, test linking is generally required to unify the scale of model parameters estimated from each individual group's dataset.

One widely used approach for test linking is linear linking . In the context of the essay-writing test considered in this study, implementing linear linking necessitates designing two groups so that there is some overlap in examinees between them. With this design, IRT parameters for the shared examinees are estimated individually for each group. These estimates are then used to construct a linear regression model for aligning the parameter scales across groups, thereby rendering them comparable. We now introduce the mean and sigma method  (Kolen & Brennan, 2014 ; Marco, 1977 ), a popular method for linear linking, and illustrate the procedures for parameter linking specifically for the GMFM, as defined in Eq.  7 , because both the MFRM and the MFRM with RSS can be regarded as special cases of the GMFM, as explained earlier.

To elucidate this, let us assume that the datasets corresponding to the reference and focal groups, denoted as \(\textbf{U}^{\text {ref}}\) and \(\textbf{U}^{\text {foc}}\), contain overlapping sets of examinees. Furthermore, let us assume that \(\hat{\varvec{\theta }}^{\text {foc}}\), \(\hat{\varvec{\alpha }}^{\text {foc}}\), \(\hat{\varvec{\beta }}^{\text {foc}}\), and \(\hat{\varvec{d}}^{\text {foc}}\) are the GMFM parameters estimated from \(\textbf{U}^{\text {foc}}\). The mean and sigma method aims to transform these parameters linearly so that their scale aligns with those estimated from \(\textbf{U}^{\text {ref}}\). This transformation is guided by the equations

$$\tilde{\theta }^{\text {foc}}_j = A\,\hat{\theta }^{\text {foc}}_j + K, \quad \tilde{\alpha }^{\text {foc}}_r = \hat{\alpha }^{\text {foc}}_r / A, \quad \tilde{\beta }^{\text {foc}}_r = A\,\hat{\beta }^{\text {foc}}_r + K, \quad \tilde{d}^{\text {foc}}_{rm} = A\,\hat{d}^{\text {foc}}_{rm}, \qquad (8)$$

where \(\tilde{\varvec{\theta }}^{\text {foc}}\), \(\tilde{\varvec{\alpha }}^{\text {foc}}\), \(\tilde{\varvec{\beta }}^{\text {foc}}\), and \(\tilde{\varvec{d}}^{\text {foc}}\) represent the scale-transformed parameters for the focal group. The linking coefficients are defined as

$$A = \frac{\sigma ^{\text {ref}}}{\sigma ^{\text {foc}}}, \qquad K = \mu ^{\text {ref}} - A\,\mu ^{\text {foc}},$$

where \({\mu }^{\text {ref}}\) and \({\sigma }^{\text {ref}}\) represent the mean and standard deviation (SD) of the common examinees' ability values estimated from \(\textbf{U}^{\text {ref}}\), and \({\mu }^{\text {foc}}\) and \({\sigma }^{\text {foc}}\) represent those values obtained from \(\textbf{U}^{\text {foc}}\).
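A minimal sketch of this mean and sigma linking for the single-item GMFM, assuming the common examinees' ability estimates from both groups are already available (the arrays below are hypothetical):

```python
import numpy as np

def mean_sigma_coefficients(theta_common_ref, theta_common_foc):
    """Linking coefficients A and K from the common examinees' ability estimates."""
    A = np.std(theta_common_ref) / np.std(theta_common_foc)
    K = np.mean(theta_common_ref) - A * np.mean(theta_common_foc)
    return A, K

def transform_focal_parameters(A, K, theta, alpha, beta, d):
    """Rescale focal-group GMFM parameter estimates onto the reference scale."""
    return (A * np.asarray(theta) + K,   # examinee abilities
            np.asarray(alpha) / A,       # rater consistencies
            A * np.asarray(beta) + K,    # rater severities
            A * np.asarray(d))           # rater-specific step parameters

# Hypothetical ability estimates of the same common examinees from each group.
theta_ref_common = np.array([0.8, -0.2, 1.1, 0.3])
theta_foc_common = np.array([1.5, 0.1, 1.9, 0.7])
A, K = mean_sigma_coefficients(theta_ref_common, theta_foc_common)
print(round(A, 3), round(K, 3))
```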

This linear linking method is applicable when there are common examinees across different groups. However, as discussed in the introduction, arranging for multiple groups with partially overlapping examinees (and/or raters) can often be impractical in real-world testing environments. To address this limitation, we aim to facilitate test linking without the need for common examinees and raters by leveraging AES technology.

Automated essay scoring models

Many AES methods have been developed over recent decades and can be broadly categorized into either feature-engineering or automatic feature extraction approaches (Hussein et al., 2019 ; Ke & Ng, 2019 ). The feature-engineering approach predicts essay scores using either a regression or classification model that employs manually designed features, such as essay length and the number of spelling errors (Amorim et al., 2018 ; Dascalu et al., 2017 ; Nguyen & Litman, 2018 ; Shermis & Burstein, 2002 ). The advantages of this approach include greater interpretability and explainability. However, it generally requires considerable effort in developing effective features to achieve high scoring accuracy for various datasets. Automatic feature extraction approaches based on deep neural networks (DNNs) have recently attracted attention as a means of eliminating the need for feature engineering. Many DNN-based AES models have been proposed in the last decade and have achieved state-of-the-art accuracy (Alikaniotis et al., 2016 ; Dasgupta et al., 2018 ; Farag et al., 2018 ; Jin et al., 2018 ; Mesgar & Strube, 2018 ; Mim et al., 2019 ; Nadeem et al., 2019 ; Ridley et al., 2021 ; Taghipour & Ng, 2016 ; Uto, 2021b ; Wang et al., 2018 ). In the next section, we introduce the most widely used DNN-based AES model, which utilizes Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019 ).

BERT-based AES model

BERT, a pre-trained language model developed by Google’s AI language team, achieved state-of-the-art performance in various natural language processing (NLP) tasks in 2019 (Devlin et al., 2019 ). Since then, it has frequently been applied to AES (Rodriguez et al., 2019 ) and automated short-answer grading (Liu et al., 2019 ; Lun et al., 2020 ; Sung et al., 2019 ) and has demonstrated high accuracy.

BERT is structured as a multilayer bidirectional transformer network, where the transformer is a neural network architecture designed to handle ordered sequences of data using an attention mechanism. See Ref. (Vaswani et al., 2017 ) for details of transformers.

BERT undergoes training in two distinct phases, pretraining and fine-tuning . The pretraining phase utilizes massive volumes of unlabeled text data and is conducted through two unsupervised learning tasks, specifically, masked language modeling and next-sentence prediction . Masked language modeling predicts the identities of words that have been masked out of the input text, while next-sentence prediction predicts whether two given sentences are adjacent.

Fine-tuning is required to adapt a pre-trained BERT model for a specific NLP task, including AES. This entails retraining the BERT model using a task-specific supervised dataset after initializing the model parameters with pre-trained values and augmenting the model with task-specific output layers. For AES applications, a special token, [CLS], is added at the beginning of each input. BERT then condenses the entire input text into a fixed-length real-valued hidden vector referred to as the distributed text representation , which corresponds to the output of the special token [CLS] (Devlin et al., 2019). AES scores can thus be derived by feeding the distributed text representation into a linear layer with sigmoid activation, as depicted in Fig. 1. More formally, let \( \varvec{h} \) be the distributed text representation. The linear layer with sigmoid activation is defined as \(\sigma (\varvec{W}\varvec{h}+b)\), where \(\varvec{W}\) is a weight matrix and \(b\) is a bias, both learned during the fine-tuning process. The sigmoid function \(\sigma ()\) maps its input to a value between 0 and 1. Therefore, the model is trained to minimize an error loss function between the predicted scores and the gold-standard scores, which are normalized to the [0, 1] range. Moreover, score prediction using the trained model is performed by linearly rescaling the predicted scores back to the original score range.

Figure 1. BERT-based AES model architecture. \(w_{jt}\) is the t-th word in the essay of examinee j, \(n_j\) is the number of words in the essay, and \(\hat{y}_{j}\) represents the predicted score from the model.
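A compact sketch of this architecture using the Hugging Face transformers library is shown below; the pretrained model name, maximum sequence length, and example essay are illustrative assumptions rather than the exact configuration used in the study.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertEssayScorer(nn.Module):
    """BERT encoder followed by a linear layer with sigmoid activation.

    The [CLS] representation of the input essay (the distributed text
    representation h) is mapped to a score in [0, 1], which can later be
    rescaled to the original rubric range.
    """

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0]          # [CLS] vector
        return torch.sigmoid(self.head(h)).squeeze(-1)

# Usage sketch with a hypothetical essay and a gold score normalized to [0, 1].
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertEssayScorer()
batch = tokenizer(["An example essay text ..."], return_tensors="pt",
                  truncation=True, padding=True, max_length=512)
pred = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(pred, torch.tensor([0.75]))
```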

Problems with AES model training

As mentioned above, to employ BERT-based and other DNN-based AES models, they must be trained or fine-tuned using a large dataset of essays that have been graded by human raters. Typically, the mean-squared error (MSE) between the predicted and the gold-standard scores serves as the loss function for model training. Specifically, let \(y_{j}\) be the normalized gold-standard score for the j-th examinee's essay, and let \(\hat{y}_{j}\) be the predicted score from the model. The MSE loss function is then defined as

$$\mathcal{L}_{\text{MSE}} = \frac{1}{J} \sum_{j=1}^{J} \left( y_{j} - \hat{y}_{j} \right)^{2},$$

where J denotes the number of examinees, which is equivalent to the number of essays, in the training dataset.

Here, note that a large-scale training dataset is often created by assigning a few raters from a pool of potential raters to each essay to reduce the scoring burden and to increase scoring reliability. In such cases, the gold-standard score for each essay is commonly determined by averaging the scores given by multiple raters assigned to that essay. However, as discussed in earlier sections, these straightforward average scores are highly sensitive to rater characteristics. When training data includes rater bias effects, an AES model trained on that data can show decreased performance as a result of inheriting these biases (Amorim et al., 2018 ; Huang et al., 2019 ; Li et al., 2020 ; Wind et al., 2018 ). An AES method that uses IRT has been proposed to address this issue (Uto & Okano, 2021 ).
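A toy, entirely synthetic illustration of this problem: two essays of identical quality graded by lenient versus severe raters receive different average scores, and an AES model trained on those averages would inherit that bias. The score model below is a crude placeholder, not an IRT model.

```python
import numpy as np

rng = np.random.default_rng(1)

ability = 0.0                                    # same latent ability for both essays
lenient_severities = rng.normal(-0.8, 0.2, 3)    # negative severity = lenient raters
severe_severities  = rng.normal( 0.8, 0.2, 3)

def observed_scores(ability, severities):
    """Crude placeholder: score = ability - severity + noise, mapped to a 1-5 scale."""
    raw = ability - severities + rng.normal(0, 0.1, size=severities.shape)
    return np.clip(np.round(3 + raw), 1, 5)

print(observed_scores(ability, lenient_severities).mean())  # inflated average
print(observed_scores(ability, severe_severities).mean())   # deflated average
```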

AES method using IRT

The main idea behind the AES method using IRT (Uto & Okano, 2021 ) is to train an AES model using the ability value \(\theta _j\) estimated by IRT models with rater parameters, such as MFRM and its extensions, from the data given by multiple raters for each essay, instead of a simple average score. Specifically, AES model training in this method occurs in two steps, as outlined in Fig.  2 .

1. Estimate the IRT-based abilities \(\varvec{\theta }\) from a score dataset, which includes scores given to essays by multiple raters.

2. Train an AES model given the ability estimates as the gold-standard scores. Specifically, the MSE loss function for training is defined as

$$\mathcal{L}_{\text{MSE}} = \frac{1}{J} \sum_{j=1}^{J} \left( \theta_{j} - \hat{\theta }_{j} \right)^{2},$$

where \(\hat{\theta }_j\) represents the AES's predicted ability of the j-th examinee, and \(\theta _{j}\) is the gold-standard ability for the examinee obtained from Step 1. Note that the gold-standard scores are rescaled into the range [0, 1] by applying a linear transformation from the logit range \([-3, 3]\) to [0, 1]. See the original paper (Uto & Okano, 2021) for details.

Figure 2. Architecture of a BERT-based AES model that uses IRT.

A trained AES model based on this method will not reflect bias effects because IRT-based abilities \(\varvec{\theta }\) are estimated while removing rater bias effects.

In the prediction phase, the score for an essay from examinee \(j^{\prime }\) is calculated in two steps.

1. Predict the IRT-based ability \(\theta _{j^{\prime }}\) for the examinee using the trained AES model, and then linearly rescale it to the logit range \([-3, 3]\).

2. Calculate the expected score \(\mathbb {E}_{r,k}\left[ P_{j^{\prime }rk}\right] \), which corresponds to an unbiased original-scaled score, given \(\theta _{j'}\) and the rater parameters. This is used as the predicted essay score in this method.
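A small sketch of this prediction phase under the single-item GMFM: an AES output in [0, 1] is rescaled back to the logit range \([-3, 3]\), and the expected rubric score is averaged over a set of raters. The rater parameter values and the AES output are arbitrary, and `gmfm_probs` repeats the category-probability helper sketched earlier.

```python
import numpy as np

D = 1.7

def gmfm_probs(theta, alpha_r, beta_r, d_r):
    """Single-item GMFM category probabilities (as in the earlier sketch)."""
    kernel = D * alpha_r * (theta - beta_r - np.asarray(d_r, dtype=float))
    num = np.exp(np.cumsum(kernel))
    return num / num.sum()

def rescale_to_logit(pred_01, low=-3.0, high=3.0):
    """Step 1: map an AES output in [0, 1] back to the logit range [-3, 3]."""
    return low + (high - low) * pred_01

def expected_score(theta, raters, scores=np.arange(1, 6)):
    """Step 2: expected rubric score over raters, given theta and rater parameters."""
    return float(np.mean([gmfm_probs(theta, a, b, d) @ scores for a, b, d in raters]))

# Hypothetical rater parameters (alpha_r, beta_r, d_r) and a hypothetical AES output.
raters = [(1.0,  0.3, [0.0, -1.0, -0.3, 0.4, 0.9]),
          (0.8, -0.4, [0.0, -0.5,  0.0, 0.1, 0.4])]
theta_hat = rescale_to_logit(0.62)
print(round(expected_score(theta_hat, raters), 2))
```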

This method originally aimed to train an AES model while mitigating the impact of varying rater characteristics present in the training data. A key feature, however, is its ability to predict an examinee’s IRT-based ability from their essay texts. Our linking approach leverages this feature to enable test linking without requiring common examinees and raters.

Figure 3. Outline of our proposed method, steps 1 and 2.

Figure 4. Outline of our proposed method, steps 3–6.

Proposed method

The core idea behind our method is to develop an AES model that predicts examinee ability using score and essay data from the reference group, and then to use this model to predict the abilities of examinees in the focal group. These predictions are then used to estimate the linking coefficients for a linear linking. An outline of our method is illustrated in Figs.  3 and 4 . The detailed steps involved in the procedure are as follows.

1. Estimate the IRT model parameters from the reference group's data \(\textbf{U}^{\text {ref}}\) to obtain \(\hat{\varvec{\theta }}^{\text {ref}}\), the ability estimates of the examinees in the reference group.

2. Use the ability estimates \(\hat{\varvec{\theta }}^{\text {ref}}\) and the essays written by the examinees in the reference group to train the AES model that predicts examinee ability.

3. Use the trained AES model to predict the abilities of examinees in the focal group by inputting their essays. We designate these AES-predicted abilities as \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\) from here on. An important point to note is that the AES model is trained to predict ability values on the parameter scale aligned with the reference group's data, meaning that the predicted abilities for examinees in the focal group follow the same scale.

4. Estimate the IRT model parameters from the focal group's data \(\textbf{U}^{\text {foc}}\).

5. Calculate the linking coefficients A and K using the AES-predicted abilities \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\) and the IRT-based ability estimates \(\hat{\varvec{\theta }}^{\text {foc}}\) for examinees in the focal group as follows:

$$A = \frac{\sigma ^{\text {foc}}_{\text {pred}}}{\sigma ^{\text {foc}}}, \qquad K = \mu ^{\text {foc}}_{\text {pred}} - A\,\mu ^{\text {foc}},$$

where \({\mu }^{\text {foc}}_{\text {pred}}\) and \({\sigma }^{\text {foc}}_{\text {pred}}\) represent the mean and the SD of the AES-predicted abilities \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\), respectively. Furthermore, \({\mu }^{\text {foc}}\) and \({\sigma }^{\text {foc}}\) represent the corresponding values for the IRT-based ability estimates \(\hat{\varvec{\theta }}^{\text {foc}}\).

6. Apply linear linking based on the mean and sigma method given in Eq. 8 using the above linking coefficients and the parameter estimates for the focal group obtained in Step 4. This procedure yields parameter estimates for the focal group that are aligned with the scale of the parameters of the reference group.

As described in Step 3, the AES model used in our method is trained to predict examinee abilities on the scale derived from the reference data \(\textbf{U}^{\text {ref}}\) . Therefore, the abilities predicted by the trained AES model for the examinees in the focal group, denoted as \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\) , also follow the ability scale derived from the reference data. Consequently, by using the AES-predicted abilities, we can infer the differences in the ability distribution between the reference and focal groups. This enables us to estimate the linking coefficients, which then allows us to perform linear linking based on the mean and sigma method. Thus, our method allows for test linking without the need for common examinees and raters.
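Putting the six steps together, a high-level sketch of the procedure might look like the following. The helpers `estimate_irt`, `train_aes`, and `predict_ability` stand in for the MCMC-based estimation and BERT-based AES routines described above; they are hypothetical placeholders, not real library functions.

```python
import numpy as np

def link_without_common_examinees(U_ref, essays_ref, U_foc, essays_foc,
                                  estimate_irt, train_aes, predict_ability):
    """Sketch of the proposed AES-based linking procedure (Steps 1-6).

    estimate_irt(U)               -> dict of GMFM estimates {"theta", "alpha", "beta", "d"}
    train_aes(essays, abilities)  -> trained AES model
    predict_ability(model, texts) -> predicted abilities on the model's training scale
    """
    ref = estimate_irt(U_ref)                          # Step 1: reference scale
    aes = train_aes(essays_ref, ref["theta"])          # Step 2: train AES on reference group
    theta_foc_pred = predict_ability(aes, essays_foc)  # Step 3: focal abilities, reference scale
    foc = estimate_irt(U_foc)                          # Step 4: focal estimates, arbitrary scale

    # Step 5: linking coefficients from predicted vs. estimated focal abilities.
    A = np.std(theta_foc_pred) / np.std(foc["theta"])
    K = np.mean(theta_foc_pred) - A * np.mean(foc["theta"])

    # Step 6: mean and sigma transformation onto the reference scale.
    return {"theta": A * foc["theta"] + K,
            "alpha": foc["alpha"] / A,
            "beta":  A * foc["beta"] + K,
            "d":     A * foc["d"]}
```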

It is important to note that the current AES model for predicting examinees’ abilities does not necessarily offer sufficient prediction accuracy for individual ability estimates. This implies that their direct use in mid- to high-stakes assessments could be problematic. Therefore, we focus solely on the mean and SD values of the ability distribution based on predicted abilities, rather than using individual predicted ability values. Our underlying assumption is that these AES models can provide valuable insights into differences in the ability distribution across various groups, even though the individual predictions might be somewhat inaccurate, thereby substantiating their utility for test linking.

Experiments

In this section, we provide an overview of the experiments we conducted using actual data to evaluate the effectiveness of our method.

Actual data

We used the dataset previously collected in Uto and Okano ( 2021 ). It consists of essays written in English by 1805 students from grades 7 to 10 along with scores from 38 raters for these essays. The essays originally came from the ASAP (Automated Student Assessment Prize) dataset, which is a well-known benchmark dataset for AES studies. The raters were native English speakers recruited from Amazon Mechanical Turk (AMT), a popular crowdsourcing platform. To alleviate the scoring burden, only a few raters were assigned to each essay, rather than having all raters evaluate every essay. Rater assignment was conducted based on a systematic links design  (Shin et al., 2019 ; Uto, 2021a ; Wind & Jones, 2019 ) to achieve IRT-scale linking. Consequently, each rater evaluated approximately 195 essays, and each essay was graded by four raters on average. The raters were asked to grade the essays using a holistic rubric with five rating categories, which is identical to the one used in the original ASAP dataset. The raters were provided no training before the scoring process began. The average Pearson correlation between the scores from AMT raters and the ground-truth scores included in the original ASAP dataset was 0.70 with an SD of 0.09. The minimum and maximum correlations were 0.37 and 0.81, respectively. Furthermore, we also calculated the intraclass correlation coefficient (ICC) between the scores from each AMT rater and the ground-truth scores. The average ICC was 0.60 with an SD of 0.15, and the minimum and maximum ICCs were 0.29 and 0.79, respectively. The calculation of the correlation coefficients and ICC for each AMT rater excluded essays that the AMT rater did not assess. Furthermore, because the ground-truth scores were given as the total scores from two raters, we divided them by two in order to align the score scale with the AMT raters’ scores.

For further analysis, we also evaluated the ICC among the AMT raters as their interrater reliability. In this analysis, missing value imputation was required because all essays were evaluated by a subset of AMT raters. Thus, we first applied multiple imputation with predictive mean matching to the AMT raters’ score dataset. In this process, we generated five imputed datasets. For each imputed dataset, we calculated the ICC among all AMT raters. Finally, we aggregated the ICC values from each imputed dataset to calculate the mean ICC and its SD. The results revealed a mean ICC of 0.43 with an SD of 0.01.

These results suggest that the reliability of raters is not necessarily high. This variability in scoring behavior among raters underscores the importance of applying IRT models with rater parameters. For further details of the dataset see Uto and Okano ( 2021 ).

Experimental procedures

Using this dataset, we conducted the following experiment for three IRT models with rater parameters, MFRM, MFRM with RSS, and GMFM, defined by Eqs.  5 , 6 , and 7 , respectively.

1. We estimated the IRT parameters from the dataset using the No-U-Turn sampler-based Markov chain Monte Carlo (MCMC) algorithm, given the prior distributions \(\theta _j, \beta _r, d_m, d_{rm} \sim N(0, 1)\) and \(\alpha _r \sim LN(0, 0.5)\), following the previous work (Uto & Ueno, 2020). Here, \( N(\cdot , \cdot )\) and \(LN(\cdot , \cdot )\) indicate normal and log-normal distributions with mean and SD values, respectively. The expected a posteriori (EAP) estimator was used to obtain point estimates.

2. We then separated the dataset randomly into two groups, the reference group and the focal group, ensuring no overlap of examinees and raters between them. In this separation, we selected examinees and raters in each group to ensure distinct distributions of examinee abilities and rater severities. Various separation patterns were tested and are listed in Table 1. For example, condition 1 in Table 1 means that the reference group comprised randomly selected high-ability examinees and low-severity raters, while the focal group comprised low-ability examinees and high-severity raters. Condition 2 provided a similar separation but controlled for narrower variance in rater severity in the focal group. Details of the group creation procedures can be found in Appendix A.

3. Using the obtained data for the reference and focal groups, we conducted test linking using our method, the details of which are given in the Proposed method section. In this step, the IRT parameter estimations were carried out using the same MCMC algorithm as in Step 1.

4. We calculated the Root Mean Squared Error (RMSE) between the IRT parameters for the focal group, which were linked using our proposed method, and their gold-standard parameters. In this context, the gold-standard parameters were obtained by transforming the scale of the parameters estimated from the entire dataset in Step 1 so that it aligned with that of the reference group. Specifically, we estimated the IRT parameters using data from the reference group and collected those estimated from the entire dataset in Step 1. Then, using the examinees in the reference group as common examinees, we applied linear linking based on the mean and sigma method to adjust the scale of the parameters estimated from the entire dataset to match that of the reference group.

5. For comparison, we also calculated the RMSE between the focal group's IRT parameters, obtained without applying the proposed linking, and their gold-standard parameters. This functions as the worst-case baseline against which the results of the proposed method are compared. Additionally, we examined other baselines that use linear linking based on common examinees. For these baselines, we randomly selected five or ten examinees from the reference group who were assigned scores by at least two of the focal group's raters in the entire dataset. The scores given to these selected examinees by the focal group's raters were then merged with the focal group's data, where the added examinees worked as common examinees between the reference and focal groups. Using this data, we examined linear linking using common examinees. Specifically, we estimated the IRT parameters from the data of the focal group with common examinees and applied linear linking based on the mean and sigma method, using the ability estimates of the common examinees to align its scale with that of the reference group. Finally, we calculated the RMSE between the linked parameter estimates for the examinees and raters belonging only to the original focal group and their gold-standard parameters. Note that this common-examinee approach operates under more advantageous conditions compared with the proposed linking method because it can utilize larger samples for estimating the parameters of raters in the focal group.

6. We repeated Steps 2–5 ten times for each data separation condition and calculated the average RMSE for four cases: one in which our proposed linking method was applied, one without linking, and two in which linear linking using five and ten common examinees, respectively, was applied.
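For reference, the evaluation metric is simply the root mean squared error between the linked parameter estimates and their gold-standard values, averaged over the repetitions; a trivial sketch with placeholder numbers (not results from the study) is shown below.

```python
import numpy as np

def rmse(estimates, gold):
    """Root mean squared error between linked estimates and gold-standard values."""
    estimates, gold = np.asarray(estimates, float), np.asarray(gold, float)
    return float(np.sqrt(np.mean((estimates - gold) ** 2)))

# Placeholder example: RMSE for one parameter vector in one repetition,
# then averaged over repetitions (here the same toy values repeated).
per_repetition = [rmse([0.9, -0.3, 1.2], [1.0, -0.2, 1.1]) for _ in range(10)]
print(round(float(np.mean(per_repetition)), 3))
```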

The parameter estimation program utilized in Steps 1, 4, and 5 was implemented using RStan (Stan Development Team, 2018). The EAP estimates were calculated as the mean of the parameter samples obtained from 2,000 to 5,000 periods using three independent chains. The AES model was developed in Python, leveraging the PyTorch library. For the AES model training in Step 3, we randomly selected \(90\%\) of the data from the reference group to serve as the training set, with the remaining \(10\%\) designated as the development set. We limited the maximum number of steps for training the AES model to 800 and set the maximum number of epochs to 800 divided by the number of mini-batches. Additionally, we employed early stopping based on the performance on the development set. The AdamW optimization algorithm was used, and the mini-batch size was set to 8.

MCMC statistics and model fitting

Before delving into the results of the aforementioned experiments, we provide some statistics related to the MCMC-based parameter estimation. Specifically, we computed the Gelman–Rubin statistic \(\hat{R}\)  (Gelman et al., 2013 ; Gelman & Rubin, 1992 ), a well-established diagnostic index for convergence, as well as the effective sample size (ESS) and the number of divergent transitions for each IRT model during the parameter estimation phase in Step 1. Across all models, the \(\hat{R}\) statistics were below 1.1 for all parameters, indicating convergence of the MCMC runs. Furthermore, as shown in the first row of Table  2 , our ESS values for all parameters in all models exceeded the criterion of 400, which is considered sufficiently large according to Zitzmann and Hecht ( 2019 ). We also observed no divergent transitions in any of the cases. These results support the validity of the MCMC-based parameter estimation.

Furthermore, we evaluated the model – data fit for each IRT model during the parameter estimation step in Step 1. To assess this fit, we employed the posterior predictive p  value ( PPP -value) (Gelman et al., 2013 ), a commonly used metric for evaluating the model–data fit in Bayesian frameworks (Nering & Ostini, 2010 ; van der Linden, 2016 ). Specifically, we calculated the PPP -value using an averaged standardized residual, a conventional metric for IRT model fit in non-Bayesian settings, as a discrepancy function, similar to the approach in Nering and Ostini ( 2010 ); Tran ( 2020 ); Uto and Okano ( 2021 ). A well-fitted model yields a PPP -value close to 0.5, while poorly fitted models exhibit extreme low or high values, such as those below 0.05 or above 0.95. Additionally, we calculated two information criteria, the widely applicable information criterion (WAIC) (Watanabe, 2010 ) and the widely applicable Bayesian information criterion (WBIC) (Watanabe, 2013 ). The model that minimizes these criteria is considered optimal.

The last three rows in Table 2 show the results. We can see that the PPP-value for GMFM is close to 0.5, indicating a good fit to the data. In contrast, the other models exhibit high values, suggesting a poor fit to the data. Furthermore, among the three IRT models evaluated, GMFM exhibits the lowest WAIC and WBIC values. These findings suggest that GMFM offers the best fit to the data, corroborating previous work that investigated the same dataset using IRT models (Uto & Okano, 2021). We provide further discussion about the model fit in the Analysis of rater characteristics section given later.

According to these results, the following section focuses on the results for GMFM. Note that we also include the results for MFRM and MFRM with RSS in Appendix  B , along with the open practices statement.

Effectiveness of our proposed linking method

The results of the aforementioned experiments for GMFM are shown in Table  3 . In the table, the Unlinked row represents the average RMSE between the focal group’s IRT parameters without applying our linking method and their gold-standard parameters. Similarly, the Linked by proposed method row represents the average RMSE between the focal group’s IRT parameters after applying our linking method and their gold-standard parameters. The rows labeled Linked by five/ten common examinees represent the results for linear linking using common examinees.

A comparison of the results from the unlinked condition and the proposed method reveals that the proposed method improved the RMSEs for the ability and rater severity parameters, namely, \(\theta _j\) and \(\beta _r\) , which we intentionally varied between the reference and focal groups. The degree of improvement is notably substantial when the distributional differences between the reference and focal groups are large, as is the case in Conditions 1–5. On the other hand, for Conditions 6–8, where the distributional differences are relatively minor, the improvements are also smaller in comparison. This is because the RMSEs for the unlinked parameters are already lower in these conditions than in Conditions 1–5. Nonetheless, it is worth emphasizing that the RMSEs after employing our linking method are exceptionally low in Conditions 6–8.

Furthermore, the table indicates that the RMSEs for the step parameters and rater consistency parameters, namely, \(d_{rm}\) and \(\alpha _r\) , also improved in many cases, while the impact of applying our linking method is relatively small for these parameters compared with the ability and rater severity parameters. This is because we did not intentionally vary their distribution between the reference and focal groups, and thus their distribution differences were smaller than those for the ability and rater severity parameters, as shown in the next section.

Comparing the results from the proposed method and linear linking using five common examinees, we observe that the proposed method generally exhibits lower RMSE values for the ability \(\theta _j\) and the rater severity parameters \(\beta _r\) , except for conditions 2–3. Furthermore, when comparing the proposed method with linear linking using ten common examinees, it achieves superior performance in conditions 4–8 and slightly lower performance in conditions 1–3 for \(\theta _j\) and \(\beta _r\) , while the differences are more minor overall than those observed when comparing the proposed method with the condition of five common examinees. Note that the reasons why the proposed method tends to show lower performance for conditions 1–3 are as follows.

  • The proposed method utilizes fewer samples to estimate the rater parameters compared with the linear linking method using common examinees.

  • In situations where distributional differences between the reference and focal groups are relatively large, as in conditions 1–3, constructing an accurate AES model for the focal group becomes challenging due to the limited overlap in the ability value range. We elaborate on this point in the next section.

Furthermore, in terms of the rater consistency parameter \(\alpha _r\) and the step parameter \(d_{rm}\) , the proposed method typically shows lower RMSE values compared with linear linking using common examinees. We attribute this to the fact that the performance of the linking method using common examinees is highly dependent on the choice of common examinees, which can sometimes result in significant errors in these parameters. This issue is also further discussed in the next section.

These results suggest that our method can perform linking with comparable accuracy to linear linking using few common examinees, even in the absence of common examinees and raters. Additionally, as reported in Tables  15 and 16 in Appendix  B , both MFRM and MFRM with RSS also exhibit a similar tendency, further validating the effectiveness of our approach regardless of the IRT models employed.

Detailed analysis

Analysis of parameter scale transformation using the proposed method

In this section, we detail how our method transforms the parameter scale. To demonstrate this, we first summarize the mean and SD values of the gold-standard parameters for both the reference and focal groups in Table  4 . The values in the table are averages calculated from ten repetitions of the experimental procedures. The table shows that the mean and SD values of both examinee ability and rater severity vary significantly between the reference and focal groups following our intended settings, as outlined in Table  1 . Additionally, the mean and SD values for the rater consistency parameter \(\alpha _r\) and the rater-specific step parameters \(d_{rm}\) also differ slightly between the groups, although we did not intentionally alter them.

Second, the averaged values of the means and SDs of the parameters, estimated solely from either the reference or the focal group’s data over ten repetitions, are presented in Table  5 . The table reveals that the estimated parameters for both groups align with a normal distribution centered at nearly zero, despite the actual ability distributions differing between the groups. This phenomenon arises because IRT permits arbitrary scaling of parameters for each independent dataset, as mentioned in the Linking section. This leads to differences in the parameter scale for the focal group compared with their gold-standard values, thereby highlighting the need for parameter linking.

Next, the first two rows of Table  6 display the mean and SD values of the ability estimates for the focal group’s examinees, as predicted by the BERT-based AES model. In the table, the RMSE row indicates the RMSE between the AES-predicted ability values and the gold-standard ability values for the focal groups. The Linking Coefficients row presents the linking coefficients calculated based on the AES-predicted abilities. As with the abovementioned tables, these values are also averages over ten experimental repetitions. According to the table, for Conditions 6–8, where the distributional differences between the groups are relatively minor, both the mean and SD estimates align closely with those of the gold-standard parameters. In contrast, for Conditions 1–5, where the distributional differences are more pronounced, the mean and SD estimates tend to deviate from the gold-standard values, highlighting the challenges of parameter linking under such conditions.

In addition, as indicated in the RMSE row, the AES-predicted abilities may lack accuracy under specific conditions, such as Conditions 1, 2, and 3. This inaccuracy could arise because the AES model, trained on the reference group’s data, could not cover the ability range of the focal group due to significant differences in the ability distribution between the groups. Note that even in cases where the mean and SD estimates are relatively inaccurate, these values are closer to the gold-standard ones than those estimated solely from the focal group’s data. This leads to meaningful linking coefficients, which transform the focal group’s parameters toward the scale of their gold-standard values.
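
To make the transformation concrete, the following is a minimal Python sketch, under our assumptions, of how linking coefficients could be computed with the mean and sigma method from the AES-predicted abilities and then applied to the focal group’s parameters. The function and variable names are illustrative, and the rescaling rules for the severity, consistency, and step parameters follow common conventions of linear IRT linking; they may differ in detail from the exact formulation given in the Proposed method section.

```python
import numpy as np

def mean_sigma_coefficients(aes_pred_abilities, focal_est_abilities):
    """Linking coefficients A and K from the mean and sigma method.

    aes_pred_abilities  : abilities of the focal group's examinees predicted by the
        AES model trained on the reference group (i.e., on the reference scale).
    focal_est_abilities : abilities of the same examinees estimated solely from the
        focal group's data (i.e., on the focal group's arbitrary scale).
    """
    A = np.std(aes_pred_abilities) / np.std(focal_est_abilities)
    K = np.mean(aes_pred_abilities) - A * np.mean(focal_est_abilities)
    return A, K

def apply_linking(A, K, theta, beta, alpha, d):
    """Transform focal-group parameter estimates onto the reference scale.

    Location-type parameters (theta, beta) are mapped to A*x + K, step parameters
    are multiplied by A, and the consistency (slope-type) parameter is divided by A;
    these are common conventions of linear IRT linking and may differ in detail
    from the exact rules used in the paper.
    """
    return A * theta + K, A * beta + K, alpha / A, A * d

# Hypothetical usage with simulated values.
rng = np.random.default_rng(0)
aes_pred = rng.normal(0.5, 1.2, size=200)    # AES-predicted focal-group abilities
focal_est = rng.normal(0.0, 1.0, size=200)   # focal-group IRT ability estimates
A, K = mean_sigma_coefficients(aes_pred, focal_est)
print(f"A = {A:.3f}, K = {K:.3f}")
```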

Finally, Table 7 displays the averaged values of the means and SDs of the focal group’s parameters obtained through our linking method over ten repetitions. Note that the mean and SD values of the ability estimates are the same as those reported in Table 6 because the proposed method is designed to align them. The table indicates that the differences in the mean and SD values between the proposed method and the gold-standard condition, shown in Table 4, tend to be smaller than those between the unlinked condition, shown in Table 5, and the gold standard. To verify this point more precisely, Table 8 shows the average absolute differences in the mean and SD values of the focal group’s parameters between the proposed method and the gold-standard condition, as well as between the unlinked condition and the gold standard. These values were calculated by averaging the absolute differences obtained from each of the ten repetitions, rather than by taking simple absolute differences between the averaged values reported in Tables 4 and 7. The table shows that the proposed linking method tends to yield smaller differences than the unlinked condition, especially for \(\theta _j\) and \(\beta _r\). This tendency is pronounced in Conditions 6–8, in which the distributional differences between the focal and reference groups are relatively small. These trends are consistent with the cases for which our method showed high linking performance, as detailed in the previous section.

In summary, the above analyses suggest that although the AES model’s predictions may not always be perfectly accurate, they can offer valuable insights into scale differences between the reference and focal groups, thereby facilitating successful IRT parameter linking without common examinees and raters.

We now present the distributions of examinee ability and rater severity for the focal group, comparing their gold-standard values with those before and after applying the linking method. Figures 5, 6, 7, 8, 9, 10, 11, and 12 show illustrative examples for the eight data-splitting conditions. The gray bars depict the distributions of the gold-standard parameters, the blue bars represent the parameters estimated from the focal group’s data, the red bars represent the parameters obtained using our linking method, and the green bars indicate the ability distribution predicted by the BERT-based AES. The upper part of each figure presents results for examinee ability \(\theta _j\) and the lower part presents those for rater severity \(\beta _r\).

The blue bars in these figures reveal that the parameters estimated from the focal group’s data exhibit distributions with different locations and/or scales compared with their gold-standard values. Meanwhile, the red bars reveal that the distributions of the parameters obtained through our linking method tend to align closely with those of the gold-standard parameters. This is attributed to the fact that the ability distributions for the focal group given by the BERT-based AES model, as depicted by the green bars, were informative for performing linear linking.

Analysis of the linking method based on common examinees

For a detailed analysis of the linking method based on common examinees, Table  9 reports the averaged values of means and SDs of the focal groups’ parameter estimates obtained by the linking method based on five and ten common examinees for each condition. Furthermore, Table  10 shows the average absolute differences between these values and those from the gold standard condition. Table  10 shows that an increase in the number of common examinees tends to lower the average absolute differences, which is a reasonable trend. Furthermore, comparing the results with those of the proposed method reported in Table  8 , the proposed method tends to achieve smaller absolute differences in conditions 4–8 for \(\theta _j\) and \(\beta _r\) , which is consistent with the tendency of the linking performance discussed in the “Effectiveness of our proposed linking method” section.

Note that although the mean and SD values in Table  9 are close to those of the gold-standard parameters shown in Table  4 , this does not imply that linear linking based on five or ten common examinees achieves high linking accuracy for each repetition. To explain this, Table  11 shows the means of the gold-standard ability values for the focal group and their estimates obtained from the proposed method and the linking method based on ten common examinees, for each of ten repetitions under condition 8. This table also shows the absolute differences between the estimated ability means and the corresponding gold-standard means.

Figure 5: Example of ability and rater severity distributions for the focal group under data-splitting condition 1

Figure 6: Example of ability and rater severity distributions for the focal group under data-splitting condition 2

Figure 7: Example of ability and rater severity distributions for the focal group under data-splitting condition 3

Figure 8: Example of ability and rater severity distributions for the focal group under data-splitting condition 4

Figure 9: Example of ability and rater severity distributions for the focal group under data-splitting condition 5

Figure 10: Example of ability and rater severity distributions for the focal group under data-splitting condition 6

Figure 11: Example of ability and rater severity distributions for the focal group under data-splitting condition 7

Figure 12: Example of ability and rater severity distributions for the focal group under data-splitting condition 8

The table shows that the results of the proposed method are relatively stable, consistently yielding small absolute differences in every repetition. In contrast, the results of linear linking based on ten common examinees vary considerably across repetitions, producing large absolute differences in some repetitions. Consequently, the average absolute difference is smaller for the proposed method than for linear linking based on ten common examinees. However, in terms of the absolute difference between the averaged ability means, linear linking based on ten common examinees shows a smaller difference (\(|0.38-0.33| = 0.05\)) than the proposed method (\(|0.38-0.46| = 0.08\)). This occurs because the results of linear linking based on ten common examinees fluctuate around the ten-repetition average of the gold standard, so the positive and negative differences cancel out; it does not imply that this linking method achieves high accuracy in each repetition. Thus, it is more reasonable to evaluate linking accuracy using the average of the absolute differences calculated for each of the ten repetitions, as reported in Tables 8 and 10.
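
The following toy calculation, using made-up numbers rather than the actual values in Table 11, illustrates why the average of per-repetition absolute differences is the more informative summary.

```python
# Hypothetical per-repetition ability means (gold standard vs. two linking results);
# these numbers are illustrative only.
gold     = [0.30, 0.46, 0.38, 0.42, 0.34]
common10 = [0.10, 0.70, 0.15, 0.65, 0.30]   # fluctuates around the gold-standard average
proposed = [0.38, 0.52, 0.45, 0.50, 0.43]   # consistently close, with a small offset

def diff_of_averages(est, ref):
    """Absolute difference between the averaged means; positive and negative errors cancel."""
    return abs(sum(est) / len(est) - sum(ref) / len(ref))

def average_of_abs_diffs(est, ref):
    """Average of per-repetition absolute differences; errors do not cancel."""
    return sum(abs(e - r) for e, r in zip(est, ref)) / len(ref)

# The common-examinee-style series looks perfect on averaged means but has large
# per-repetition errors; the proposed-style series is modest but stable on both measures.
print(diff_of_averages(common10, gold), average_of_abs_diffs(common10, gold))  # ~0.00, ~0.19
print(diff_of_averages(proposed, gold), average_of_abs_diffs(proposed, gold))  # ~0.08, ~0.08
```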

This greater variability in the performance of linking based on common examinees also relates to the tendency of the proposed method to show lower RMSE values for the rater consistency parameter \(\alpha _r\) and the step parameters \(d_{rm}\), as mentioned in the “Effectiveness of our proposed linking method” section. There, we attributed this tendency to the strong dependence of linear linking on the selection of common examinees, which can sometimes lead to large errors in these parameters.

To confirm this point, Table  12 displays the SD of RMSEs calculated from ten repetitions of the experimental procedures for both the proposed method and linear linking using ten common examinees. The table indicates that the linking method using common examinees tends to exhibit larger SD values overall, suggesting that this linking method sometimes becomes inaccurate, as we also exemplified in Table  11 . This variability also implies that the estimation of the linking coefficient can be unstable.

Furthermore, the tendency toward larger SD values in the common-examinee approach is particularly pronounced for the step parameters of the extreme categories, namely \(d_{r2}\) and \(d_{r5}\). We consider that this stems from the instability of the linking coefficients combined with the fact that the step parameters for the extreme categories tend to have large absolute values (see Table 13 for detailed estimates). Linear linking multiplies the step parameters by a linking coefficient A, and applying an inappropriate linking coefficient to a parameter with a large absolute value has a more substantial impact than applying it to a smaller value. For example, if A is misestimated as 1.2 rather than 1.0, a step parameter of 3.0 is shifted by 0.6, whereas a step parameter of 0.5 is shifted by only 0.1. We conclude that this is why the RMSEs of the step parameters in the common-examinee approach deteriorated relative to those of the proposed method. The same reasoning applies to the rater consistency parameter, given that it takes positive values with a mean greater than one. See Table 13 for details.

Prerequisites of the proposed method

As demonstrated thus far, the proposed method can perform IRT parameter linking without the need for common examinees and raters. As outlined in the Introduction section, certain testing scenarios may encounter challenges or incur significant costs in assembling common examinees or raters. Our method provides a viable solution in these situations. However, it does come with specific prerequisites and inherent costs.

The prerequisites of our proposed method are as follows.

The same essay writing task is offered to both the reference and focal groups, and the written essays for it are scored by different groups of raters using the same rubric.

Raters function consistently across both the reference and focal groups, and the established scales can be adjusted through linear transformations. This implies that there are no systematic differences in scoring that are correlated with the groups but unrelated to the measured construct, such as differential rater functioning (Leckie & Baird, 2011; Myford & Wolfe, 2009; Uto, 2023; Wind & Guo, 2019).

The ability ranges of the reference and focal groups must overlap to some extent, because the ability prediction accuracy of the AES model decreases as the difference in the ability distributions between the groups increases, as discussed in the Detailed analysis section. This is a limitation of the approach that future studies will need to overcome.

The reference group must include a sufficient number of examinees to train the AES model using their essays as training data.

Related to the fourth point, we conducted an additional experiment to investigate the number of samples required to train AES models. In this experiment, we assessed the ability prediction accuracy of the BERT-based AES model used in this study by varying the number of training samples. The detailed experimental procedures are outlined below.

Estimate the ability of all 1805 examinees from the entire dataset based on the GMFM.

Randomly split the examinees into 80% (1444) and 20% (361) groups. The 20% subset, consisting of examinees’ essays and their ability estimates, was used as test data to evaluate the ability prediction accuracy of the AES model trained through the following steps.

Further divide the 80% subset into 80% (1155) and 20% (289) groups. The essays and ability estimates of the 80% portion serve as the training data, while those of the 20% portion serve as development data for selecting the optimal epoch.

Train the BERT-based AES model using the training data and select the optimal epoch that minimizes the RMSE between the predicted and gold-standard ability values for the development set.

Use the trained AES model at the optimal epoch to evaluate the RMSE between the predicted and gold-standard ability values for the test data.

Randomly sample 50, 100, 200, 300, 500, 750, and 1000 examinees from the training data created in Step 3.

Train the AES model using each sampled set as training data, and select the optimal epoch using the same development data as before.

Use the trained AES model to evaluate the RMSE for the same test data as before.

Repeat Steps 2–8 five times and calculate the average RMSE for the test data.
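
As a rough illustration of this procedure, the following Python sketch outlines Steps 2–3 and 6–9 under our assumptions; the full-sample training in Steps 4–5 is omitted for brevity. The callable train_and_predict is a hypothetical stand-in for the BERT-based AES training and prediction pipeline (the actual implementation is available in the repository listed under Code availability); it is assumed to train a model, select the epoch minimizing the development-set RMSE, and return predicted abilities for the test essays.

```python
import numpy as np

def rmse(pred, gold):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(gold)) ** 2)))

def sample_size_experiment(essays, abilities, train_and_predict, n_repeats=5,
                           sizes=(50, 100, 200, 300, 500, 750, 1000), seed=0):
    """Average test RMSE of an AES model for varying numbers of training samples.

    train_and_predict(train_essays, train_abilities, dev_essays, dev_abilities,
    test_essays) is a caller-supplied stand-in for the BERT-based AES pipeline.
    """
    rng = np.random.default_rng(seed)
    abilities = np.asarray(abilities)
    n = len(essays)
    results = {size: [] for size in sizes}
    for _ in range(n_repeats):
        # Step 2: 80/20 split into a training pool and a held-out test set.
        idx = rng.permutation(n)
        test_idx, pool_idx = idx[: n // 5], idx[n // 5:]
        # Step 3: split the pool 80/20 into training and development sets.
        dev_idx, train_idx = pool_idx[: len(pool_idx) // 5], pool_idx[len(pool_idx) // 5:]
        # Steps 6-8: subsample the training set, train, and evaluate on the test set.
        for size in sizes:
            sub = rng.choice(train_idx, size=size, replace=False)
            pred = train_and_predict([essays[i] for i in sub], abilities[sub],
                                     [essays[i] for i in dev_idx], abilities[dev_idx],
                                     [essays[i] for i in test_idx])
            results[size].append(rmse(pred, abilities[test_idx]))
    # Step 9: average the RMSE over repetitions for each sample size.
    return {size: float(np.mean(v)) for size, v in results.items()}
```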

Figure 13: Relationship between the number of training samples and the ability prediction accuracy of AES

Figure 14: Item response curves of four representative raters found in experiments using actual data

Figure 13 displays the results. The horizontal axis represents the number of training samples, and the vertical axis shows the RMSE values. Each point shows the average RMSE, with error bars indicating the SD range. The results demonstrate that larger sample sizes improve the accuracy of the AES model. Moreover, while the RMSE decreases rapidly at small sample sizes, the improvements tend to plateau beyond 500 samples. This suggests that, for this dataset, approximately 500 samples would be sufficient to train the AES model with reasonable accuracy. However, note that the required number of samples may vary depending on the essay task. A detailed analysis of the relationship between the required number of samples and the characteristics of essay writing tasks is planned for future work.

An inherent cost of the proposed method is the computational expense required to construct the BERT-based AES model. Specifically, a computer with a reasonably powerful GPU is necessary to train the AES model efficiently. In this study, for example, we used an NVIDIA Tesla T4 GPU on Google Colaboratory. To quantify this expense, we measured the computation times and costs for the above experiment under the condition with 1155 training samples. Training the AES model with 1155 samples, including evaluating the RMSE for the development set of 289 essays in each epoch, took approximately 10 min in total. Predicting the abilities of 361 examinees from their essays using the trained model took about 10 s. The computational units consumed on Google Colaboratory for both training and inference amounted to 0.44, which corresponds to approximately $0.044. These costs and times are far smaller than those required for human scoring.

Analysis of rater characteristics

The MCMC statistics and model fitting section demonstrated that the GMFM provides a better fit to the actual data compared with the MFRM and MFRM with RSS. To explain this, Table  13 shows the rater parameters estimated by the GMFM using the entire dataset. Additionally, Fig.  14 illustrates the item response curves (IRCs) for raters 3, 16, 31, and 34, where the horizontal axis represents the ability \(\theta _j\) , and the vertical axis depicts the response probability for each category.

The table and figure reveal that the raters exhibit diverse and distinctive characteristics in terms of severity, consistency, and range restriction. For instance, Rater 3 shows nearly average values for all parameters, indicating standard rating characteristics. In contrast, Rater 16 exhibits a pronounced extreme response tendency, as evidenced by a higher \(d_{r2}\) value and a lower \(d_{r5}\) value. Rater 31 is characterized by low severity, generally preferring higher scores (four and five). Rater 34 exhibits a low consistency value \(\alpha _r\), which results in minimal variation in response probabilities among categories, indicating that this rater is likely to assign different ratings to essays of similar quality.
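
For readers who wish to reproduce such curves, the sketch below computes per-category response probabilities for a single rater under a generalized-partial-credit-style formulation with severity \(\beta _r\), consistency \(\alpha _r\), and step parameters \(d_{rm}\). This is an illustrative approximation consistent with the parameters discussed here, not a reproduction of the exact GMFM equation from the Item Response Theory section, and the parameter values are hypothetical rather than the estimates in Table 13.

```python
import numpy as np

def category_probabilities(theta, alpha_r, beta_r, d_r):
    """Response probabilities over K categories for one rater at ability theta.

    Assumes a generalized-partial-credit-style model:
        P(k) proportional to exp( sum_{m<=k} alpha_r * (theta - beta_r - d_r[m]) ),
    with d_r[0] fixed at 0. This is an illustrative stand-in for the GMFM.
    """
    d_r = np.asarray(d_r, dtype=float)
    cumulative = np.cumsum(alpha_r * (theta - beta_r - d_r))
    probs = np.exp(cumulative - cumulative.max())  # stabilize before normalizing
    return probs / probs.sum()

# Hypothetical rater parameters for a five-category rubric (d_r1 fixed to 0).
alpha_r, beta_r = 1.0, 0.2
d_r = [0.0, -1.5, -0.5, 0.5, 1.5]
for theta in np.linspace(-4, 4, 9):
    print(theta, np.round(category_probabilities(theta, alpha_r, beta_r, d_r), 3))
```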

As detailed in the Item Response Theory section, the GMFM can capture these variations in rater severity, consistency, and range restriction simultaneously, whereas the MFRM and MFRM with RSS can each consider only a subset of them. We infer that this capability, together with the wide variety of rater characteristics, contributed to the superior model fit of the GMFM compared with the other models.

Note that the proposed method is useful for facilitating linking not only for the GMFM but also for the MFRM and MFRM with RSS, even though their model fits were relatively worse, as mentioned earlier and shown in Appendix B.

Effect of using crowd workers as raters

As detailed in the Actual data section, we used scores given by untrained, non-expert crowd workers instead of expert raters. A concern with using crowd workers as raters without adequate training is the potential for greater variability in rating characteristics compared with expert raters. This variability is evidenced by the diverse correlations between the raters’ scores and the ground-truth scores, reported in the Actual data section, and by the wide variety of rater parameters discussed above. These observations suggest the importance of the following two strategies for ensuring reliable essay scoring when employing crowd workers as raters.

Assigning a larger number of raters to each essay than would typically be used with expert raters.

Estimating standardized essay scores while accounting for differences in rater characteristics, for example through the use of IRT models that incorporate rater parameters, as we did in this study.

Conclusion

In this study, we propose a novel IRT-based linking method for essay-writing tests that uses AES technology to enable parameter linking based on IRT models with rater parameters across multiple groups in which neither examinees nor raters are shared. Specifically, we use a deep neural AES method capable of predicting IRT-based examinee abilities from their essays. The core idea of our approach is to train an AES model to predict examinee abilities using data from a reference group and then apply this model to predict the abilities of examinees in the focal group. These predictions are used to estimate the linking coefficients required for linear linking. Experimental results with real data demonstrate that our method accomplishes test linking with accuracy comparable to that of linear linking using a few common examinees.

In our experiments, we compared the linking performance of the proposed method with linear linking based on the mean and sigma method using only five or ten common examinees. However, such a small number of common examinees is generally insufficient for accurate linear linking and thus leads to unstable estimation of linking coefficients, as discussed in the “Analysis of the linking method based on common examinees” section. Although this study concluded that our method could perform linking with accuracy comparable to that of linear linking using few common examinees, further detailed evaluations of our method involving comparisons with various conventional linking methods using different numbers of common examinees and raters will be the target of future work.

Additionally, our experimental results suggest that although the AES model may not provide sufficient predictive accuracy for individual examinee abilities, it does tend to yield reasonable mean and SD values for the ability distribution of focal groups. This lends credence to our assumption stated in the Proposed method section that AES models incorporating IRT can offer valuable insights into differences in ability distribution across various groups, thereby validating their utility for test linking. This result also supports the use of the mean and sigma method for linking. While concurrent calibration, another common linking method, requires highly accurate individual AES-predicted abilities to serve as anchor values, linear linking through the mean and sigma method necessitates only the mean and SD of the ability distribution. Given that the AES model can provide accurate estimates for these statistics, successful linking can be achieved, as shown in our experiments.

A limitation of this study is that our method is designed for test situations where a single essay writing item is administered to multiple groups, each comprising different examinees and raters. Consequently, the method is not directly applicable for linking multiple tests that offer different items. Developing an extension of our approach to accommodate such test situations is one direction for future research. Another involves evaluating the effectiveness of our method using other datasets. To the best of our knowledge, there are no open datasets that include examinee essays along with scores from multiple assigned raters. Therefore, we plan to develop additional datasets and to conduct further evaluations. Further investigation of the impact of the AES model’s accuracy on linking performance is also warranted.

Availability of data and materials

The data and materials from our experiments are available at https://github.com/AI-Behaviormetrics/LinkingIRTbyAES.git . This includes all experimental results and a sample dataset.

Code availability

The source code for our linking method, developed in R and Python, is available in the same GitHub repository.

The original paper referred to this model as the generalized MFRM. However, in this paper, we refer to it as GMFM because it does not strictly belong to the family of Rasch models.

https://pytorch.org/

Abosalem, Y. (2016). Assessment techniques and students’ higher-order thinking skills. International Journal of Secondary Education, 4 (1), 1–11. https://doi.org/10.11648/j.ijsedu.20160401.11

Alikaniotis, D., Yannakoudakis, H., & Rei, M. (2016). Automatic text scoring using neural networks. Proceedings of the annual meeting of the association for computational linguistics (pp. 715–725).

Almond, R. G. (2014). Using automated essay scores as an anchor when equating constructed response writing tests. International Journal of Testing, 14 (1), 73–91. https://doi.org/10.1080/15305058.2013.816309

Amorim, E., Cançado, M., & Veloso, A. (2018). Automated essay scoring in the presence of biased ratings. Proceedings of the annual conference of the north american chapter of the association for computational linguistics (pp. 229–237).

Bernardin, H. J., Thomason, S., Buckley, M. R., & Kane, J. S. (2016). Rater rating-level bias and accuracy in performance appraisals: The impact of rater personality, performance management competence, and rater accountability. Human Resource Management, 55 (2), 321–340. https://doi.org/10.1002/hrm.21678

Dascalu, M., Westera, W., Ruseti, S., Trausan-Matu, S., & Kurvers, H. (2017). ReaderBench learns Dutch: Building a comprehensive automated essay scoring system for Dutch language. Proceedings of the international conference on artificial intelligence in education (pp. 52–63).

Dasgupta, T., Naskar, A., Dey, L., & Saha, R. (2018). Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. Proceedings of the workshop on natural language processing techniques for educational applications (pp. 93–102).

Devlin, J., Chang, M. -W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the annual conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 4171–4186).

Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2 (3), 197–221. https://doi.org/10.1207/s15434311laq0203_2

Eckes, T. (2023). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments . Peter Lang Pub. Inc.

Engelhard, G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1 (1), 19–33.

Farag, Y., Yannakoudakis, H., & Briscoe, T. (2018). Neural automated essay scoring and coherence modeling for adversarially crafted input. Proceedings of the annual conference of the north American chapter of the association for computational linguistics (pp. 263–271).

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., & Rubin, D. (2013). Bayesian data analysis (3rd ed.). Taylor & Francis.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7 (4), 457–472. https://doi.org/10.1214/ss/1177011136

Huang, J., Qu, L., Jia, R., & Zhao, B. (2019). O2U-Net: A simple noisy label detection approach for deep neural networks. Proceedings of the IEEE international conference on computer vision .

Hussein, M. A., Hassan, H. A., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5 , e208. https://doi.org/10.7717/peerj-cs.208

Ilhan, M. (2016). A comparison of the results of many-facet Rasch analyses based on crossed and judge pair designs. Educational Sciences: Theory and Practice, 16 (2), 579–601. https://doi.org/10.12738/estp.2016.2.0390

Jin, C., He, B., Hui, K., & Sun, L. (2018). TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. Proceedings of the annual meeting of the association for computational linguistics (pp. 1088–1097).

Jin, K. Y., & Wang, W. C. (2018). A new facets model for rater’s centrality/extremity response style. Journal of Educational Measurement, 55 (4), 543–563. https://doi.org/10.1111/jedm.12191

Kassim, N. L. A. (2011). Judging behaviour and rater errors: An application of the many-facet Rasch model. GEMA Online Journal of Language Studies, 11 (3), 179–197.

Ke, Z., & Ng, V. (2019). Automated essay scoring: A survey of the state of the art. Proceedings of the international joint conference on artificial intelligence (pp. 6300–6308).

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking . New York: Springer.

Leckie, G., & Baird, J. A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48 (4), 399–418. https://doi.org/10.1111/j.1745-3984.2011.00152.x

Li, S., Ge, S., Hua, Y., Zhang, C., Wen, H., Liu, T., & Wang, W. (2020). Coupled-view deep classifier learning from multiple noisy annotators. Proceedings of the association for the advancement of artificial intelligence (vol. 34, pp. 4667–4674).

Linacre, J. M. (1989). Many-faceted Rasch measurement . MESA Press.

Linacre, J. M. (2014). A user’s guide to FACETS Rasch-model computer programs .

Liu, O. L., Frankel, L., & Roohr, K. C. (2014). Assessing critical thinking in higher education: Current state and directions for next-generation assessment. ETS Research Report Series, 2014 (1), 1–23. https://doi.org/10.1002/ets2.12009

Liu, T., Ding, W., Wang, Z., Tang, J., Huang, G. Y., & Liu, Z. (2019). Automatic short answer grading via multiway attention networks. Proceedings of the international conference on artificial intelligence in education (pp. 169–173).

Lord, F. (1980). Applications of item response theory to practical testing problems . Routledge.

Lun, J., Zhu, J., Tang, Y., & Yang, M. (2020). Multiple data augmentation strategies for improving performance on automatic short answer scoring. Proceedings of the association for the advancement of artificial intelligence (vol. 34, pp. 13389–13396).

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14 (2), 139–160.

Mesgar, M., & Strube, M. (2018). A neural local coherence model for text quality assessment. Proceedings of the conference on empirical methods in natural language processing (pp. 4328–4339).

Mim, F. S., Inoue, N., Reisert, P., Ouchi, H., & Inui, K. (2019). Unsupervised learning of discourse-aware text representation for essay scoring. Proceedings of the annual meeting of the association for computational linguistics: Student research workshop (pp. 378–385).

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4 (4), 386–422.

Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5 (2), 189–227.

Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46 (4), 371–389. https://doi.org/10.1111/j.1745-3984.2009.00088.x

Nadeem, F., Nguyen, H., Liu, Y., & Ostendorf, M. (2019). Automated essay scoring with discourse-aware neural models. Proceedings of the workshop on innovative use of NLP for building educational applications (pp. 484–493).

Nering, M. L., & Ostini, R. (2010). Handbook of polytomous item response theory models . Evanston, IL, USA: Routledge.

Nguyen, H. V., & Litman, D. J. (2018). Argument mining for improving the automated scoring of persuasive essays. Proceedings of the association for the advancement of artificial intelligence (Vol. 32).

Olgar, S. (2015). The integration of automated essay scoring systems into the equating process for mixed-format tests [Doctoral dissertation, The Florida State University].

Patz, R. J., & Junker, B. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24 (4), 342–366. https://doi.org/10.3102/10769986024004342

Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27 (4), 341–384. https://doi.org/10.3102/10769986027004341

Qiu, X. L., Chiu, M. M., Wang, W. C., & Chen, P. H. (2022). A new item response theory model for rater centrality using a hierarchical rater model approach. Behavior Research Methods, 54 , 1854–1868. https://doi.org/10.3758/s13428-021-01699-y

Ridley, R., He, L., Dai, X. Y., Huang, S., & Chen, J. (2021). Automated cross-prompt scoring of essay traits. Proceedings of the association for the advancement of artificial intelligence (vol. 35, pp. 13745–13753).

Rodriguez, P. U., Jafari, A., & Ormerod, C. M. (2019). Language models and automated essay scoring. https://doi.org/10.48550/arXiv.1909.09482 . arXiv:1909.09482

Rosen, Y., & Tager, M. (2014). Making student thinking visible through a concept map in computer-based assessment of critical thinking. Journal of Educational Computing Research, 50 (2), 249–270. https://doi.org/10.2190/EC.50.2.f

Schendel, R., & Tolmie, A. (2017). Beyond translation: adapting a performance-task-based assessment of critical thinking ability for use in Rwanda. Assessment & Evaluation in Higher Education, 42 (5), 673–689. https://doi.org/10.1080/02602938.2016.1177484

Shermis, M. D., & Burstein, J. C. (2002). Automated essay scoring: A cross-disciplinary perspective . Routledge.

Shin, H. J., Rabe-Hesketh, S., & Wilson, M. (2019). Trifactor models for Multiple-Ratings data. Multivariate Behavioral Research, 54 (3), 360–381. https://doi.org/10.1080/00273171.2018.1530091

Stan Development Team. (2018). RStan: the R interface to stan . R package version 2.17.3.

Sung, C., Dhamecha, T. I., & Mukhi, N. (2019). Improving short answer grading using transformer-based pre-training. Proceedings of the international conference on artificial intelligence in education (pp. 469–481).

Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. Proceedings of the conference on empirical methods in natural language processing (pp. 1882–1891).

Tran, T. D. (2020). Bayesian analysis of multivariate longitudinal data using latent structures with applications to medical data. (Doctoral dissertation, KU Leuven).

Uto, M. (2021a). Accuracy of performance-test linking based on a many-facet Rasch model. Behavior Research Methods, 53 , 1440–1454. https://doi.org/10.3758/s13428-020-01498-x

Uto, M. (2021b). A review of deep-neural automated essay scoring models. Behaviormetrika, 48 , 459–484. https://doi.org/10.1007/s41237-021-00142-y

Uto, M. (2023). A Bayesian many-facet Rasch model with Markov modeling for rater severity drift. Behavior Research Methods, 55 , 3910–3928. https://doi.org/10.3758/s13428-022-01997-z

Uto, M., & Okano, M. (2021). Learning automated essay scoring models using item-response-theory-based scores to decrease effects of rater biases. IEEE Transactions on Learning Technologies, 14 (6), 763–776. https://doi.org/10.1109/TLT.2022.3145352

Uto, M., & Ueno, M. (2018). Empirical comparison of item response theory models with rater’s parameters. Heliyon, 4(5), e00622. https://doi.org/10.1016/j.heliyon.2018.e00622

Uto, M., & Ueno, M. (2020). A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo. Behaviormetrika, 47, 469–496. https://doi.org/10.1007/s41237-020-00115-7

van der Linden, W. J. (2016). Handbook of item response theory, volume two: Statistical tools . Boca Raton, FL, USA: CRC Press.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems (pp. 5998–6008).

Wang, Y., Wei, Z., Zhou, Y., & Huang, X. (2018). Automatic essay scoring incorporating rating schema via reinforcement learning. Proceedings of the conference on empirical methods in natural language processing (pp. 791–797).

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11 , 3571–3594. https://doi.org/10.48550/arXiv.1004.2316

Watanabe, S. (2013). A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14 (1), 867–897. https://doi.org/10.48550/arXiv.1208.6338

Wilson, M., & Hoskens, M. (2001). The rater bundle model. Journal of Educational and Behavioral Statistics, 26 (3), 283–306. https://doi.org/10.3102/10769986026003283

Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79 (5), 962–987. https://doi.org/10.1177/0013164419834613

Wind, S. A., & Jones, E. (2019). The effects of incomplete rating designs in combination with rater effects. Journal of Educational Measurement, 56 (1), 76–100. https://doi.org/10.1111/jedm.12201

Wind, S. A., Wolfe, E. W., Engelhard, G., Jr., Foltz, P., & Rosenstein, M. (2018). The influence of rater effects in training sets on the psychometric quality of automated scoring for writing assessments. International Journal of Testing, 18(1), 27–49. https://doi.org/10.1080/15305058.2017.1361426

Zitzmann, S., & Hecht, M. (2019). Going beyond convergence in Bayesian estimation: Why precision matters too and how to assess it. Structural Equation Modeling: A Multidisciplinary Journal, 26 (4), 646–661. https://doi.org/10.1080/10705511.2018.1545232

This work was supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers 19H05663, 21H00898, and 23K17585.

Author information

Authors and Affiliations

The University of Electro-Communications, Tokyo, Japan

Masaki Uto & Kota Aramaki

Corresponding author

Correspondence to Masaki Uto .

Ethics declarations

Conflict of interest.

The authors declare that they have no conflicts of interest.

Ethics approval

Not applicable

Consent to participate

Consent for publication.

All authors agreed to publish the article.

Open Practices Statement

All results presented from our experiments for all models, including MFRM, MFRM with RSS, and GMFM, as well as the results for each repetition, are available for download at https://github.com/AI-Behaviormetrics/LinkingIRTbyAES.git . This repository also includes programs for performing our linking method, along with a sample dataset. These programs were developed using R and Python, along with RStan and PyTorch. Please refer to the README file for information on program usage and data format details.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Data splitting procedures

In this appendix, we explain the detailed procedures used to construct the reference group and the focal group while aiming to ensure distinct distributions of examinee abilities and rater severities, as outlined in experimental Procedure 2 in the Experimental procedures section.

Let \(\mu ^{\text {all}}_\theta \) and \(\sigma ^{\text {all}}_\theta \) be the mean and SD of the examinees’ abilities estimated from the entire dataset in Procedure 1 of the Experimental procedures section. Similarly, let \(\mu ^{\text {all}}_\beta \) and \(\sigma ^{\text {all}}_\beta \) be the mean and SD of the rater severity parameter estimated from the entire dataset. Using these values, we set target mean and SD values of abilities and severities for both the reference and focal groups. Specifically, let \(\acute{\mu }^{\text {ref}}_{\theta }\) and \(\acute{\sigma }^{\text {ref}}_{\theta }\) denote the target mean and SD for the abilities of examinees in the reference group, and \(\acute{\mu }^{\text {ref}}_{\beta }\) and \(\acute{\sigma }^{\text {ref}}_{\beta }\) be those for the rater severities in the reference group. Similarly, let \(\acute{\mu }^{\text {foc}}_{\theta }\) , \(\acute{\sigma }^{\text {foc}}_{\theta }\) , \(\acute{\mu }^{\text {foc}}_{\beta }\) , and \(\acute{\sigma }^{\text {foc}}_{\beta }\) represent the target mean and SD for the examinee abilities and rater severities in the focal group. Each of the eight conditions in Table 1 uses these target values, as summarized in Table  14 .

Given these target means and SDs, we constructed the reference and focal groups for each condition through the following procedure.

Prepare the entire set of examinees and raters along with their ability and severity estimates. Specifically, let \(\hat{\varvec{\theta }}\) and \(\hat{\varvec{\beta }}\) be the collections of ability and severity estimates, respectively.

Randomly sample a value from the normal distribution \(N(\acute{\mu }^{\text {ref}}_\theta , \acute{\sigma }^{\text {ref}}_\theta )\) , and choose an examinee with \(\hat{\theta }_j \in \hat{\varvec{\theta }}\) nearest to the sampled value. Add the examinee to the reference group, and remove it from the remaining pool of examinee candidates \(\hat{\varvec{\theta }}\) .

Similarly, randomly sample a value from \(N(\acute{\mu }^{\text {ref}}_\beta ,\acute{\sigma }^{\text {ref}}_\beta )\), and choose a rater with \(\hat{\beta }_r \in \hat{\varvec{\beta }}\) nearest to the sampled value. Then, add the rater to the reference group, and remove it from the remaining pool of rater candidates \(\hat{\varvec{\beta }}\).

Repeat Steps 2 and 3 for the focal group, using \(N(\acute{\mu }^{\text {foc}}_\theta , \acute{\sigma }^{\text {foc}}_\theta )\) and \(N(\acute{\mu }^{\text {foc}}_\beta ,\acute{\sigma }^{\text {foc}}_\beta )\) as the sampling distributions.

Continue to repeat Steps 2, 3, and 4 until the pools \(\hat{\varvec{\theta }}\) and \(\hat{\varvec{\beta }}\) are empty.

Given the examinees and raters in each group, create the data for the reference group \(\textbf{U}^{\text {ref}}\) and the focal group \(\textbf{U}^{\text {foc}}\) .

Remove examinees from each group, as well as their data, if they have received scores from only one rater, thereby ensuring that each examinee is graded by at least two raters.
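
A minimal Python sketch of Steps 1–5 of this procedure, under illustrative assumptions, is given below; the variable names are hypothetical, and Steps 6–7 (building the response data for each group and removing examinees rated by fewer than two raters) are omitted.

```python
import numpy as np

def nearest_pop(pool, value):
    """Remove and return the id whose estimate is nearest to the sampled value."""
    best = min(pool, key=lambda i: abs(pool[i] - value))
    pool.pop(best)
    return best

def split_groups(theta_hat, beta_hat, targets, seed=0):
    """Assign examinees and raters to reference/focal groups by target distributions.

    theta_hat, beta_hat : dicts mapping examinee/rater ids to ability/severity estimates.
    targets             : dict with keys 'ref' and 'foc', each mapping to a dict with
                          keys 'mu_theta', 'sd_theta', 'mu_beta', 'sd_beta'.
    """
    rng = np.random.default_rng(seed)
    theta_pool, beta_pool = dict(theta_hat), dict(beta_hat)
    groups = {"ref": {"examinees": [], "raters": []},
              "foc": {"examinees": [], "raters": []}}
    # Steps 2-5: alternately draw target values for each group and assign the
    # nearest remaining examinee and rater until both candidate pools are empty.
    while theta_pool or beta_pool:
        for g in ("ref", "foc"):
            t = targets[g]
            if theta_pool:
                v = rng.normal(t["mu_theta"], t["sd_theta"])
                groups[g]["examinees"].append(nearest_pop(theta_pool, v))
            if beta_pool:
                v = rng.normal(t["mu_beta"], t["sd_beta"])
                groups[g]["raters"].append(nearest_pop(beta_pool, v))
    return groups
```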

Appendix B: Experimental results for MFRM and MFRM with RSS

The experiments discussed in the main text focus on the results obtained from GMFM, as this model demonstrated the best fit to the dataset. However, it is important to note that our linking method is not restricted to GMFM and can also be applied to other models, including MFRM and MFRM with RSS. Experiments involving these models were carried out in the manner described in the Experimental procedures section, and the results are shown in Tables  15 and 16 . These tables reveal trends similar to those observed for GMFM, validating the effectiveness of our linking method under the MFRM and MFRM with RSS as well.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Uto, M., Aramaki, K. Linking essay-writing tests using many-facet models and neural automated essay scoring. Behav Res (2024). https://doi.org/10.3758/s13428-024-02485-2

Accepted : 26 July 2024

Published : 20 August 2024

DOI : https://doi.org/10.3758/s13428-024-02485-2

Keywords

  • Writing assessment
  • Many-facet Rasch models
  • IRT linking
  • Automated essay scoring
  • Educational measurement