LL.M. Program

5005 Wasserstein Hall (WCC), 1585 Massachusetts Avenue, Cambridge, MA 02138

The LL.M. (Master of Laws) program is a one-year degree program that typically includes 180 students from some 65 countries. The Graduate Program is interested in attracting intellectually curious and thoughtful candidates from a variety of legal systems and backgrounds and with various career plans. Harvard’s LL.M. students include lawyers working in firms, government officials, law professors, judges, diplomats, human rights activists, doctoral students, business people, and others. The diversity of the participants in the LL.M. program contributes significantly to the educational experience of all students at the School.


Master of Laws (LLM)

Take full advantage of NYU Law's extraordinarily wide range of courses to design an individualized curriculum that matches your intellectual and professional interests.


NYU Law's  more than 100 full-time faculty members  are among the top scholars in their fields and teach a diverse array of courses. Expect these foremost experts to be ready to connect with you individually, advise you about the curriculum, supervise your research and writing, and guide you to opportunities that will maximize your experience in the program.

You'll find several professors—not only one or two—in corporate law, constitutional law, criminal law, environmental law, intellectual property, international law, legal philosophy, and taxation, among other areas. These influential groups of scholars often collaborate on each other's research, contribute significantly to the evolving academic discussion in the field, and comment on policy and proposed regulations.

View Our Experts by Topic

Design Your Own LLM

You will choose from 300+ courses to plan a curriculum that meets your intellectual and professional interests. You can specialize in one or two areas or take a broad range of classes. You will also have the chance to write a paper in close consultation with a professor, or to expand a typical research assignment into a master’s thesis. Experienced graduate student advisors will assist you in choosing courses to meet your goals.

View Degree Requirements

Real-World Training

Gain hands-on experience and the lawyering toolkit you'll need for practice. For example, the Graduate Lawyering Program is designed to help foreign-trained students learn US legal skills.

Our simulation courses have you perform legal tasks and conduct mock trials or negotiations. You can also apply for clinics and externships, which involve fieldwork. And in our transactional classes, you'll study the deals that shape New York business; top lawyers who took part in them often help teach the class.

Advanced Certificate in Law and Business

The  Advanced Certificate in Law and Business  from NYU's Stern School of Business gives you tools to understand the finance and accounting underlying transactions. You can complete it with your LLM degree.

Intellectual Life

Our centers and institutes convene a full calendar of events and provide opportunities for you to gain expertise and practical training. Student groups and journals are a great way to connect with JDs and other LLMs and to be part of smaller communities within the larger Law School.

Centers and Institutes

  • Center for Human Rights and Global Justice (CHRGJ)
  • Center for Transnational Litigation, Arbitration, and Commercial Law
  • Engelberg Center on Innovation Law and Policy
  • Frank J. Guarini Center on Environmental, Energy and Land Use Law
  • Guarini Institute: Global Law and Tech
  • Information Law Institute (ILI)
  • Institute for Corporate Governance & Finance
  • Institute for International Law & Justice (IILJ)
  • Pollack Center for Law and Business
  • US-Asia Law Institute (USALI)

View All Centers & Institutes

Centers' Opportunities for Students

  • CHRGJ's Transitional Justice Leadership Program
  • CHRGJ's Human Rights Scholarship Program
  • IILJ's Salzburg Cutler Fellowship
  • ILI's Privacy Research Group
  • Pollack Center's Student Research Fellowship
  • USALI's Student Scholars Program

Student Groups

  • Africa Law Association
  • Asia Law Society
  • Christian Legal Fellowship
  • Intellectual Property and Entertainment Law Society
  • International Arbitration Association
  • International Law Society
  • Jewish Law Students Association
  • Law and Business Association
  • Muslim Law Students Association
  • OUTLaw
  • Student Lawyer Athletic Program
  • Women of Color Collective

Journals

  • Environmental Law Journal
  • Journal of IP and Entertainment Law
  • Journal of International Law and Politics
  • Journal of Law and Business
  • Journal of Law and Liberty
  • Review of Law and Social Change

Career Resources

Get ready for your next career move as you prepare to join NYU Law's network of 40,000+ alumni:

  • The Office of Career Services supports your private sector job search.
  • The Public Interest Law Center assists with your future public service career.
  • Apply for post-graduate fellowships for LLMs in human rights or international finance and development.
  • Explore the fully funded JSD program, research fellowships at some of our centers and institutes, and the Law School's academic career fellowships.
  • Learn more about bar exams and admission to practice in the US.

Meet the 2024-25 Faculty Director

Robert Howse, Lloyd C. Nelson Professor of International Law

Robert Howse's teaching and research focus on international economic law (trade, investment, and finance) and legal and political philosophy. He is a co-founder and co-convener of the New York City Area Working Group on International Economic Law and serves on the American Bar Association Working Group on Investment Treaties.  Read more about Professor Howse


LLM Program


LLM Program At a Glance


About the LLM Program

The Michigan Law LLM program began more than 130 years ago and continues to flourish today. Unique among its peers, our program stands out for many reasons.

In our LLM program, you will:

  • develop meaningful and lasting relationships with your classmates
  • be fully immersed in the US legal system by taking classes with JD  students
  • learn from our faculty of globally renowned scholars and practitioners
  • become part of a community of more than 50,000 graduate and undergraduate students from six continents
  • reside in a small city that is widely recognized for its high quality of living in the heart of the United States
  • have easy access to restaurants, cafes, shops, theaters, museums, athletic venues, and parks

Do you have questions? We have answers!

We’ve asked our current and former LLM students to offer an insider’s view on some of the questions we hear most from prospective students. Browse popular questions, read student profiles, or submit your own. 

Ask an LLM  Student

Why Choose Michigan Law’s LLM Program?

We offer a general LLM program that gives you the freedom to tailor your learning experience. You will design your own focus, and our flexible curriculum empowers you to select from nearly all of the Law School’s courses to advance your professional and personal goals.

Some LLM students aim to enhance their knowledge in a single area of law, and so they select almost all their courses in their chosen field. Others select their courses based on a professional goal, such as taking a bar exam in the United States. Many students take classes related to topics of personal interest or subjects they have never studied before. Our LLM program easily supports any of these objectives.

Areas of Interest

Course Catalog

Mini-seminars Address Unique Topics in Intimate Settings

As an LLM student, you’ll have the option to participate in mini-seminars, which are small classes that typically meet in professors’ homes. Our mini-seminar topics have ranged from explorations of the law behind book bans, to portrayals of lawyers in film, to the technology trends redefining the practice of law. The only limit is our faculty’s imaginations.

About our Mini-seminars

Interdisciplinary Opportunities from Across the University

By studying at one of the world’s leading public research institutions, you will have a wealth of interdisciplinary opportunities at your fingertips. We encourage you to take courses from across the University of Michigan’s multitude of other top-ranked programs.

In our innovative Problem Solving Initiative ( PSI ) courses, you’ll become part of a multidisciplinary team learning to address real-world challenges. Working with graduate students from other units on campus—such as business, engineering, public policy, and public health—our PSI students tackle issues related to autonomous vehicles, climate change, social media, human trafficking, and other societal concerns.

Taking Non-law Classes

About the Problem Solving Initiative

We purposely keep our LLM class small to maximize the quality of each student’s experience and engagement in life at the Law School. Because each LLM student is carefully selected, the cohort is extremely bright, inquisitive, and diverse. You won’t be just a number; we will get to know you very well.

With only a few dozen participants in the LLM program, you will become friends with your classmates and receive individualized mentoring and support (ranging from academic and professional advising to winter apparel suggestions) from faculty and staff. You will develop meaningful bonds that will last long after your LLM  year.

LLM students attend class with our 1,000 JD students, who come from nearly every US state and more than 15 countries. By learning and studying with JD students, you will broaden your social and professional networks and be fully integrated into life at the Law School.

Your immersion into the Michigan Law community begins before you arrive on campus. During the summer before the academic year begins, we match each LLM student with a Michigan Law Ambassador, a second- or third-year JD student who acts as a mentor.

Outside the classroom, you have the option to join more than 70 student organizations at Michigan Law and more than 1,700 University-sponsored student organizations. By joining student organizations, you’ll connect with other students who share similar identities, affiliations, and interests.

Student Life and Community

Law School Student Activities

With an interdisciplinary mindset and a genuine love of teaching, our faculty are outstanding scholars, practitioners, and mentors with expertise in a wide variety of legal areas. Many have lived, studied, or worked outside the United States, and they are fluent in more than a dozen languages.

Thanks to our small student-to-faculty ratio, you will participate in rich classroom discussions. Most of our upper-class courses contain fewer than 49 students, so you will actively engage with your professors and classmates.

We pair each LLM student with a faculty mentor who offers advice on course offerings and resources at the Law School. If you would like to dive into an area of legal interest, you have the option to earn academic credit while pursuing independent research under the direct supervision of your professors.

Many of our faculty live within a 15-minute drive of campus, which allows them to spend a lot of time at the Law School, even when they aren’t teaching a class. Our faculty value connections with students in and out of the classroom, and they host seminars at their homes and invite students to group lunches. In fact, you’re welcome to invite faculty for lunches at the Lawyers Club (an on-campus dormitory for law students), and the Law School will cover the cost of lunch.

Our Faculty      

Better Know a Professor

Michigan Law has long been held in high esteem among the international legal community.

In 1878, the first students from Japan earned law degrees from the University of Michigan. The Law School established one of the first LLM programs in the United States, granting its first LLM degrees in 1890.

The Law Library became the first depository for European Union documents at an American university in 1957, and Ann Arbor is the birthplace of the American Society of Comparative Law .

We foster the development of international legal scholars and lawyers by providing our students opportunities to present research in the Salzburg Cutler Fellows Program and participate in the Overseas-Trained LLM Student Interview Program every winter.

The University of Michigan fosters a global campus, as evidenced by the more than 8,000 international students from more than 100 countries who enroll at the University every year. U-M consistently ranks among the top 15 universities in the United States in hosting international students.

Global Opportunities      

Center for International and Comparative Law

Graduates join the ranks of more than 22,000 Michigan Law alumni around the globe, becoming lifelong members of the Law School community.

Our alumni’s expertise and experience span six continents, and they:

  • serve as members of the judiciary and government (international, federal, and local)
  • teach and research at renowned law schools
  • advocate for public interest and non-governmental organizations
  • practice at elite law firms and businesses

Students enjoy a once-in-a-lifetime experience in Ann Arbor, a small city of about 120,000 permanent residents. 

Consistently regarded as one of the top places to live in the country, Ann Arbor is a charming college town with many businesses and services geared toward students’ needs.

The Law School is just steps away from downtown Ann Arbor, the center of the city. With cultural activities (theater, dance, music, films, museums), more than 400 restaurants, an eclectic mix of independent shops, and sports venues (including the largest stadium in the United States), Ann Arbor has something to offer everyone.

Local businesses, such as Zingerman’s Deli, attract fiercely loyal residents and visitors, and Ann Arbor has a dynamic startup culture. Global companies housed in Ann Arbor include the headquarters of Domino’s Pizza, a Toyota research facility, and offices for Google and Thomson Reuters.

Affectionately nicknamed “Tree Town,” Ann Arbor is home to more than 160 parks. The Huron River bisects the University of Michigan campus, providing a scenic backdrop for walking, running, biking, kayaking, and tubing.

Detroit, the largest city in Michigan, is within a 45-minute drive of Ann Arbor. With spectacular architecture, world-class museums and theaters, a lively restaurant scene, and professional sports teams, Detroit offers all the attractions of a major US city. The Detroit Metropolitan Airport, a 30-minute drive from Ann Arbor, serves as an international hub with direct connections to more than 140 destinations in North America, South America, Europe, and Asia.

Ann Arbor is in Michigan, a northern US state that borders Canada. Consisting of two peninsulas surrounded by the magnificent Great Lakes, Michigan has the longest freshwater coastline in the United States. The Mackinac Bridge, one of the longest suspension bridges in the world, connects the Lower Peninsula (commonly referred to as “the Mitten”) and the Upper Peninsula (“the UP ,” as Michiganders like to call it).

Explore Ann Arbor

Experience Pure Michigan

LLM Degree Requirements

  • Earn at least 24 credits. At least 18 of these credits must be earned in Michigan Law School courses.  
  • Satisfy the constitutional law requirement. Successfully pass either Introduction to Constitutional Law and American Legal Process (Law 631, for LLM students only) or Introduction to Constitutional Law (Law 540, the JD required course).
  • Satisfy the research requirement. Successfully complete a qualifying seminar or course, or earn two credits of Independent Research (Law 900).
  • We encourage students not only to consider courses in their area of legal interest, but also to take courses that expand the way they think about the law and legal problems. 
  • Students should also consider the professor teaching the course. Many students select classes for the opportunity to engage with specific professors, not only for the course topic, but for the excitement of their intellectual approach to legal studies.

Degree Requirements

Apply Now for Michigan Law’s LLM  Program

The Michigan LLM is a full-time program, and all students begin classes in late August and graduate in early May. LLM students are permitted to enroll in most Law School courses, including several clinics. We offer two courses that are exclusive to LLM students (a constitutional law course and a research and writing course ). Both courses are optional, though generally recommended.

Email Us      Apply Now


“I chose Michigan Law for its small class size, which fosters rich interactions and meaningful connections among peers. The close-knit environment is the ideal setting for building networks and forging lasting relationships.”

Rotimi Adejoorin, LLM ’24, Lagos, Nigeria

LLM Program FAQs

We have a full-time, residential LLM program that begins in August and ends the following May. 

The academic year is split into two semesters: the fall semester is from August to December, and the winter semester is from January to May.

As an LLM student at Michigan, you can take almost all the classes the Law School offers. Although classes are subject to change every year, we have extensive breadth and depth in our faculty’s expertise and course offerings. In a typical year, students choose from more than 200 courses, and so you will find classes suited to your interests.

In the fall semester, we offer two classes for LLM students only: (1) Introduction to Constitutional Law and American Legal Process and (2) Research and Analysis in American Law. Neither of these classes is mandatory for an LLM degree, but we strongly recommend them for LLM students with prior legal training in civil law jurisdictions.

Apart from the two LLM -only classes, all other courses are with JD students, and LLM students participate in classes on the same level as JD students. Our LLM curriculum is demanding, and LLM and JD students are subject to the same grading curve.

This academic rigor leads to great rewards: our LLM students are immersed in life at the Law School, and they establish bonds with their classmates and professors through challenging, thought-provoking, and respectful discussions. By fully engaging in the classroom, you will develop enriching connections with the entire Law School community.

Even though most Law School classes are available to LLM students, there are a couple of restrictions. While LLM students can participate in many of the Law School’s transactional clinics, enrollment in litigation-based clinics is generally limited to LLM students who previously earned a JD from a law school in the United States. And although LLM students may not complete externships (fieldwork for academic credit), they have access to most other experiential learning opportunities, such as Problem Solving Initiative courses, practice simulations, and the pro bono program.

Class Schedule

All LLM students can count a total of six credits in non-law, graduate-level courses at the University of Michigan toward their degree. With the University of Michigan’s status as one of the world’s leading research institutions, you have tremendous options for non-law classes, ranging from business, art, and the physical sciences to information and technology, social work, music, and public policy, to name just a few.

Explore U-M’s Graduate Programs

Although international LLM students are eligible to work part-time in on-campus positions, we discourage LLM students from doing significant work outside of classes, especially during the first semester. 

Our LLM curriculum is rigorous, and most students undergo an adjustment period with the course load and highly interactive US teaching style. We advise you to minimize external pressures and responsibilities as much as you can to ease the transition.

LLM students have a range of advising resources during their time at Michigan Law. 

The Law School’s Center for International and Comparative Law provides comprehensive resources for LLM students, including academic and professional advising, extracurricular programming, and connections to other parts of the University. In August, the Center for International and Comparative Law hosts a weeklong Orientation for all LLM students immediately before the start of classes.

During the summer before the academic year begins, we match each incoming LLM student with a faculty mentor and a Michigan Law Ambassador who is a second- or third-year JD student. Faculty mentors and Michigan Law Ambassadors provide insight into life at the Law School and in Ann Arbor and connect LLM students with other members of the community.

Approximately half of our LLM class elects to take a state bar exam in the United States after they earn their degree at Michigan Law. However, please be aware that an LLM degree—whether from Michigan Law or any other US law school—does not automatically qualify an internationally educated student to sit for a bar exam in the United States.

Each US state has a bar admission agency that sets the requirements for obtaining a license to practice law, and state bar admission agencies often have strict requirements for LLM students whose prior legal training was outside the US . For instance, some state bar admission agencies require the evaluation of a student’s previous legal education to determine their eligibility to sit for the bar exam.

New York and California are the most common US bar exams for our LLM graduates to take, and we provide informational programs and individualized advice to our LLM students who wish to take a bar exam in the United States. However, bar admission requirements are subject to change every year, and so it is imperative that you research the bar admission requirements of the states you’re interested in.

To do so, we advise you to contact the bar admission agencies for the specific US state(s) you’re interested in. The National Conference of Bar Examiners ( NCBE ) provides information about bar admission requirements in each state and a directory of state bar admission agencies, and you do not need to wait until after you begin an LLM program to contact bar admission agencies.

To apply to sit for a bar exam in the United States, you may need to submit academic records and other documentation to a state bar admission agency shortly after you begin your LLM year. It’s often easier to arrange for the delivery of these documents while you’re in your home country or country of legal education—in other words, we recommend collecting any required documentation before you begin your LLM  year.

National Conference of Bar Examiners

Directory of State Bar Admission Agencies

The Law School’s Office of Career Planning ( OCP ) has a team of attorney counselors—including an attorney counselor who focuses on advising LLM students—that provides individualized guidance to students. OCP hosts specialized group seminars and programs that emphasize orientation to the legal employment market in the United States, the development of professional résumés and cover letters, and interviewing and networking skills enhancement. All LLM students have access to a career library and online databases and resources.

Each winter, Michigan Law participates in the Overseas-Trained LLM Student Interview Program, where LLM students can interview for positions with international law firms. OCP also compiles and distributes LLM students’ résumés to employers who have indicated interest in hiring internationally trained students for temporary or permanent positions.

We provide excellent support to our LLM students to explore postgraduate opportunities, whether in the United States or elsewhere. In fact, we have a dedicated career counselor at the Law School who works with LLM students. Michigan Law graduates are employed across the US and the world, reflecting the Law School’s extensive national and global reach.

As you consider employment options, please note that LLM programs for internationally trained students are well-suited for individuals who plan to practice law outside the United States. The vast majority of our LLM graduates work and live outside the United States, and a US LLM is often crucial for professional advancement in other countries.

Legal employers in the United States typically have a strong preference for JD graduates, and so employment opportunities in the United States for internationally trained LLM graduates are limited, regardless of the school an LLM student attends. (In other words, this is true of all US law schools and is not specific to Michigan Law.) In addition, caps on H-1B visas (the most common work authorization for attorneys) and rising demand for visa sponsorship have caused legal employers to be increasingly cautious about sponsoring international employees.

Because there can be significant challenges in obtaining employment in the United States, we advise you to be proactive in networking and start applying for jobs early in your LLM year. You will put yourself in the best position to find opportunities if you use a variety of career development resources and make the most of all your networks.

During the winter, we experience our fair share of cold, snow, ice, and wind—and sometimes a mix of all four! For students who are accustomed to mild or warm climates all year, we recognize winter may seem intimidating, as many regions in the US (not just Michigan!) are subject to freezing temperatures in the winter. However, nothing beats the majesty of our Law Quad on a snowy day.

As seasoned (pun intended) veterans of winters in Ann Arbor, we can provide tips to manage—and enjoy—cold weather. Suitable clothing goes a long way in wintry weather, and we’re ardent proponents of dressing in layers; even with cold temperatures outside, the inside of the Law School is often toasty. Waterproof and water-resistant coats, gloves, and boots with good traction are invaluable, and hats and scarves are useful to retain body heat.

Although appropriate apparel is key to enjoying the winter, you don’t need to bring any with you at the start of the academic year. Ann Arbor has a number of stores with winter clothing, and so you can get all your necessities here.

Average Temperatures in Ann Arbor

One of the benefits of Ann Arbor is that LLM students can choose from a variety of housing options, including those administered by the University and others managed by independent landlords.

The Lawyers Club is an “on-campus” dormitory (managed by the University of Michigan) exclusively for law students. It typically houses about half of the LLM and first-year JD classes and some second- and third-year JD students. The Lawyers Club is part of the Law Quadrangle, which includes Michigan Law’s academic buildings. The Lawyers Club offers furnished single rooms with private or semi-private bathrooms, and a meal plan is included in the residence fee.

Another on-campus housing option is Munger Graduate Residences, which is open to all University of Michigan graduate students. Munger provides an interdisciplinary experience designed to foster community across academic departments. Residents live in a furnished 6- or 7-bedroom suite with single bedrooms and private bathrooms. Munger is located two blocks (about a 5-minute walk) from the Law School.

Northwood Community Apartments are a popular on-campus housing option for U-M students with partners and families, but single students are also welcome to live there. Northwood offers furnished and unfurnished apartment options, and it is a free, 30-minute bus ride away from the Law School.

About half the LLM class and the majority of JD students live “off campus” (renting from landlords who are not affiliated with the University of Michigan), with most living in apartments and houses within walking distance of the Law School. The University’s Off-Campus Housing Office has a website that includes listings of apartments, rooms, and co-ops. It has a roommate matching service and a list of landlords and management companies who have met certain criteria for inclusion. We also provide resources for LLM students to conduct a housing search.

Although some JD students have cars in Ann Arbor, it is rare for LLM students to have them. 

A car is not necessary for you to get around campus or Ann Arbor, as you have access to free public transportation options. For transit on campus, the University of Michigan has a bus system that is free to the public. In the greater Ann Arbor area, the Ann Arbor Area Transportation Authority (TheRide) bus system is free for all University of Michigan students.

If you would like to use a car for errands, the University of Michigan has a partnership with Zipcar, where students can rent a car by the hour or day. Rideshare services such as Uber and Lyft are also readily available.

U-M Bus System

Ann Arbor Area Transportation Authority

In each LLM class, we have multiple students with children. Ann Arbor is a particularly wonderful city for students with families, as it has excellent public schools, libraries, parks, and cultural activities.

The University offers a variety of resources for students with children. Many students with families live in U-M’s Northwood Community Apartments. Northwood hosts social and recreational events for residents, which provide a convenient way to meet other students and their families.

The University also has children’s centers, a list of child care resources, and a posting board for families to indicate their needs for short-term child care assistance. Other options for child care include the Campus Child Care Homes Network and Kids Kare at Home Backup Child Care.

Ann Arbor Public Schools

Ann Arbor District Library

Child Care Resources

News and Events


Professors Arato, Bennoune in Spotlight at American Society of International Law Annual Meeting


Excellence in Pro Bono Service Awards: 2024 Michigan Law Honorees


Curtis Mack, LLM ’73, Recognized with U-M Volunteer Leadership Award


Meet Michigan Law’s LLM Class of 2024


Michigan Law Recognizes Outstanding Student Papers in Constitutional, International Law

Welcome to the LLM Class of 2023


Research LLM

Osgoode’s Research LLM is a full-time, research-intensive program that is ideal for students who want to pursue a specific area of legal study in depth, including those who are considering a PhD. Students conduct their research under the supervision of an Osgoode faculty member .

The Research LLM does not qualify students to practise law in Canada. Students interested in practising law should review the licensing rules of the Law Society of the province in which they intend to practise.

Program Requirements

  • Graduate Seminar I: Legal Research (GS LAW 6610)
  • One study group
  • Elective courses
  • A major written research work (thesis or major research paper)

The Graduate Seminar is the core course for the Graduate Program in Law. Designed to complement other courses, the seminar provides a venue for developing critical assessments of the law and facilitating students’ progress on their own research, papers and dissertation proposals. The seminar also provides students with an intellectual community and introduces them to Osgoode research resources.

One Study Group

Students participating in study groups read and discuss a significant number of articles with their groups each week. The groups are not structured as courses but as venues for reflection and discourse. LLM students must participate in one study group. They can choose among five options, depending on their research interests:

  • Regulation and Governance
  • Law and Economic Relations
  • Theoretical Perspectives in Legal Research
  • Law and Social Justice
  • Law in a Global Context

Elective Courses

Research LLM students can fulfil their elective requirements through:

  • a variety of graduate courses in law
  • integrated courses with the JD program
  • independent study
  • courses in other programs

Major Written Research Work

A major paper is at the core of the Research LLM program. Most students complete a thesis, but students may also choose to submit a major research paper and complete additional coursework.

All theses and major research papers should contain an analysis of scholarship on the student’s chosen topic and the results of the student’s research, based on primary sources, in the form of a sustained argument. They should include the standard scholarly apparatus of footnotes and a bibliography, prepared in accordance with the McGill Guide to Legal Citations.

Thesis Option

  • Length: 100-125 pages
  • Evaluation and defence: Students must succeed in an oral defence of their thesis before an examination committee.
  • Additional notes: Some students choose to fulfill the program’s thesis requirement with a Portfolio Thesis: one or two published articles (depending on length and scope) developed during their time in the Osgoode graduate degree, submitted in lieu of a traditional thesis.

Major Research Paper (MRP) Option

  • Length: 60-70 pages; additional elective courses are required to complete the LLM.
  • Evaluation and defence: MRPs are evaluated by the student’s supervisor and by one other member of the Graduate Program chosen by the supervisor in consultation with the Graduate Program Director. In exceptional circumstances, the second examiner may be a member of another Graduate Program at York University or another university.
  • Additional notes: The MRP is an original piece of scholarly work equivalent to an article of publishable quality for a reputable law journal. It is typically more substantial than a research paper for a regular course, but less substantial than a thesis.

Additional Courses

Students entering the Research LLM without an LLB or JD may be required to take additional courses on the advice of their supervisor. Completing this extra coursework during their program can be helpful to students whose research relates to fields of law in which they do not have extensive background. The Graduate Program Director determines whether students must pursue additional courses in order to fulfill the requirements of the LLM.

Time to Completion

Both the Thesis and MRP options should be completed in three or four terms. Generally, students take courses in the fall and winter terms, conduct their research in the winter term and write the Thesis or MRP in the summer term. Graduate students must register in each term (fall, winter, summer) from the start of their program to completion.

Residency Requirement

Students must be located so that they can make progress on all program requirements that require their presence on campus.

More Detail:


The University of Chicago Law School

Master of Laws (LLM)

Program Info

On behalf of our Graduate Studies Committee, welcome to the University of Chicago Law School!

The University of Chicago Law School uniquely offers the combination of a small (70–80 students) and diverse (more than 25 nationalities) LLM program with a  real sense of community  among our students. The rigorous and elite academic atmosphere at the Law School is part  of the experience both inside and outside the classroom, and is enhanced by our urban location in one of the  great cities  of the world.

I encourage you to explore our website, learn more about our LLM program and visit us virtually .

If you have any questions, please reach out to us by email at [email protected]. We are more than happy to help or to set up one-on-one conversations with members of our admissions team.

Justin Swinsick, Senior Director of Graduate Programs

Why come to the United States for an LLM program? What's special about the LLM experience at the University of Chicago? What advice would LLM graduates give prospective students? Four graduates of the University of Chicago Law School's LLM program share their insights.

Patrícia Mendonça de Almeida, Miguel Bernardo, Florence Jaeger, and Tuvshintuguldur Maralkhuu, members of the LLM Class of 2024, are profiled in the Law School’s ‘Meet the Class’ series.

Experience the UChicago campus with 360º photos and videos

In this interview, Justin shares his thoughts on the value of an international LLM, common mistakes to avoid while applying, employment opportunities in the US for the international LLM graduate, and a whole lot more.

Having the chance to completely take yourself out of your legal system and your cultural comfort zone and to force yourself to compare and contrast your system with that of the U.S. will make you a better lawyer no matter where you work.

The University of Edinburgh


Postgraduate study

Law LLM by Research

Awards: LLM by Research

Study modes: Full-time, Part-time

Funding opportunities

Programme website: Law


Research profile

The Edinburgh Law School is a vibrant, collegial and enriching community of legal, sociolegal and criminology researchers and offers an excellent setting for doctoral research.

We are ranked 3rd in the UK for law for the quality and breadth of our research by Research Professional, based on the 2021 Research Excellence Framework (REF2021).

Our doctoral researchers are key to the School’s research activities and we work hard to ensure that they are fully engaged with staff and projects across all of our legal disciplines.

You will find opportunities in the following fields:

  • company and commercial law
  • comparative law
  • constitutional and administrative law
  • criminal law
  • criminology and criminal justice
  • environmental law
  • European law, policy and institutions
  • European private law
  • evidence and procedure
  • gender and sexuality
  • human rights law
  • information technology law
  • intellectual property law
  • international law
  • legal theory
  • medical law and ethics
  • obligations (contract, delict, unjustified enrichment)
  • property, trusts and successions
  • Roman law and legal history
  • socio-legal studies

Programme structure

The framework of the LLM by Research allows you time and intellectual space to work in your chosen field, and to refine and develop this initial phase of the project for future doctoral work.

The programme does not have formal coursework elements, other than initial training seminars alongside PhD students.

This makes the LLM by Research a particularly attractive option for those wishing to undertake postgraduate research on a part-time basis, while pursuing legal practice or other employment.

Find out more about compulsory and optional courses

We link to the latest information available. Please note that this may be for a previous academic year and should be considered indicative.

Award            Title  Duration  Study mode
LLM by Research  Law    1 Year    Full-time
LLM by Research  Law    2 Years   Part-time

Training and support

Postgraduate researchers enjoy full access to the University’s research skills training which the Law School complements with a tailored research and wider skills programme.

The training programme in Semester One (six seminars) includes workshops on research design, writing and research ethics.

  • Find out more about training and support on the LLM by Research

Postgraduate researchers are able to draw upon a fantastic range of resources and facilities to support their research.

The Law School has one of the most significant academic law libraries in the UK which offers outstanding digital resources alongside a world-leading print collection (almost 60,000 items including a unique collection for Scots law research).

You will also have access to the University’s Main Library which has one of the largest and most important collections in Britain, as well as the legal collection of the National Library of Scotland.

Entry requirements

These entry requirements are for the 2024/25 academic year and requirements for future academic years may differ. Entry requirements for the 2025/26 academic year will be published on 1 Oct 2024.

A UK 2:1 honours degree, or its international equivalent, in law, or a social science subject.

Entry to this programme is competitive. Meeting minimum requirements for consideration does not guarantee an offer of study.

International qualifications

Check whether your international qualifications meet our general entry requirements:

  • Entry requirements by country
  • English language requirements

Regardless of your nationality or country of residence, you must demonstrate a level of English language competency at a level that will enable you to succeed in your studies.

English language tests

We accept the following English language qualifications at the grades specified:

  • IELTS Academic: total 7.0 with at least 7.0 in writing and 6.5 in all other components. We do not accept IELTS One Skill Retake to meet our English language requirements.
  • TOEFL-iBT (including Home Edition): total 100 with at least 25 in writing and 23 in all other components.
  • C1 Advanced ( CAE ) / C2 Proficiency ( CPE ): total 185 with at least 185 in writing and 176 in all other components.
  • Trinity ISE : ISE III with passes in all four components.
  • PTE Academic: total 70 with at least 70 in writing and 62 in all other components.

Your English language qualification must be no more than three and a half years old from the start date of the programme you are applying to study, unless you are using IELTS , TOEFL, Trinity ISE or PTE , in which case it must be no more than two years old.

Degrees taught and assessed in English

We also accept an undergraduate or postgraduate degree that has been taught and assessed in English in a majority English speaking country, as defined by UK Visas and Immigration:

  • UKVI list of majority English speaking countries

We also accept a degree that has been taught and assessed in English from a university on our list of approved universities in non-majority English speaking countries (non-MESC).

  • Approved universities in non-MESC

If you are not a national of a majority English speaking country, then your degree must be no more than five years old* at the beginning of your programme of study. (*Revised 05 March 2024 to extend degree validity to five years.)

Find out more about our language requirements:

Fees and costs

Scholarships and funding

Featured funding

  • School of Law funding opportunities

Other funding opportunities

Search for scholarships and funding opportunities:

  • Search for funding

Further information

  • Postgraduate Research Office
  • Phone: +44 (0)131 650 2022
  • Contact: [email protected]
  • School of Law (Postgraduate Research Office)
  • Old College
  • South Bridge
  • Central Campus
  • Programme: Law
  • School: Law
  • College: Arts, Humanities & Social Sciences

Select your programme and preferred start date to begin your application.

LLM by Research Law - 1 Year (Full-time)

LLM by Research Law - 2 Years (Part-time)

Application deadlines

Programme start date: 6 January 2025
Application deadline: 29 September 2024

We encourage you to apply at least one month prior to entry so that we have enough time to process your application. If you are also applying for funding or will require a visa then we strongly recommend you apply as early as possible.

  • How to apply

You must submit two references with your application.

Find out more about the general application process for postgraduate programmes:


Where Will Postgraduate Study in Law Lead You?

The Master of Laws (Research) equips students for careers in advanced research, policy development, public service, tertiary teaching or professional leadership. It will enable you to acquire and develop sophisticated research and analysis skills, honed through work on a topic of your choice that expands legal thinking and understanding.

The Master of Laws is up to two years full-time and four years part-time and is awarded on the basis of a supervised thesis of up to 50,000 words. The thesis must make a substantial contribution to the knowledge of the subject concerned. Students are also required to undertake the compulsory research-support coursework unit, LAWS6077 Legal Research 1.

Subject areas

Entry, fees, funding and how to apply

English language proficiency

For academic requirements check the ‘Admission requirements’ section on this page.

How to apply

Please apply by 15 September for commencement on 1 March and 15 March for commencement on 1 July. If your application cannot be assessed in time for commencement, it will be considered for the next possible start date.

Starting date

Research Period 2: 1 March and Research Period 3: 1 July


Research areas

Master of Laws researchers perform original research in an area of law or regulation involving legal or interdisciplinary methodologies under the supervision of a member of the University of Sydney Law School who is an expert in the subject matter. 

Learn more about  Sydney Law School research

What you'll study

The Master of Laws (Research) is awarded on the basis of a supervised thesis of a maximum 50,000 words. The thesis must make a substantial contribution to the knowledge of the subject concerned. Students are also required to complete the compulsory research-support coursework unit, LAWS6077 Legal Research 1 within the first 12 months of their candidature.

This research degree includes some coursework curriculum to support research success. Masters students will complete 6 credit points of coursework .

Unit of study code: LAWS6077
Unit of study name: Legal Research 1
Course: Doctor of Philosophy, Doctor of Juridical Studies, Master of Laws (Research), Master of Criminology (Research)
Course stage: Year 1
Advice: Semester 1

There is no separate tuition fee for the coursework units of study you will undertake; it is included in the tuition fee for the course.

See the 'Your Fee' section for fee information. Additional non-tuition course costs vary depending on the units of study.

You will be able to see and enrol in any of the units available, subject to capacity constraints and your own background. Note that your faculty may elect to make certain units compulsory for a given degree.

Applying for admission

To apply for admission to a Master of Laws (Research) degree, you must submit a formal application for admission.

Expression of Interest (Optional)

While you are not required to submit an Expression of Interest before applying, Sydney Law School recommends that you do so before submitting a formal application, especially if:

  • you are seeking funding assistance;
  • you have not identified a potential supervisor; or
  • you are an international applicant.

Submitting an Expression of Interest will allow the School to support you in presenting a formal application and provide you with feedback on whether your application is likely to succeed.

The Expression of Interest form includes information about your intended research topic, academic and professional qualifications, and publications.

To allow the School to consider your information and provide you appropriate and timely guidance, applicants are encouraged to submit an Expression of Interest as early as possible and no later than:

Expression of Interest deadline   Application deadline   Commencement
30 June                           15 September*          1 March
31 December                       15 March*              1 July

*Note: If you intend to apply for an Australian Government Research Training Program (RTP) scholarship, please submit a full admission application by the relevant  RTP scholarship closing date . 

Formal Application for Admission

To apply for a Master of Laws (Research) degree, you will submit a formal application through the University's Online Application portal.

You must ensure that all required supporting documents are submitted with your application, including the following documents requested by Sydney Law School:

  • expression of interest acceptance (if you submitted one); otherwise, please include evidence of consultation with, or comments from, potential supervisors. The nomination of supervisors is determined by the Law Postgraduate Research Education Committee.
  • full research proposal (approximately 10 pages) which outlines:
    - aims of the proposed research thesis
    - background to the research, including a brief reference to the relevant literature and law (including case law where appropriate)
    - a clear statement of the area to be researched
    - rationale for the research and a statement of why it is significant
    - working hypotheses or research questions
    - research methodology, including theoretical and empirical considerations for the research
    - a statement indicating how you will be able to sufficiently fund your proposed fieldwork or overseas study/research, and why this work is essential for completion of your thesis
  • motivation statement
  • time availability statement
  • curriculum vitae
  • list of publications (if available)
  • timeline for completion of the thesis and the compulsory unit of study, LAWS6077 Legal Research 1
  • two referee statements in support of your application (in addition to the referee forms)

Before you apply, please check the University of Sydney’s eligibility criteria for admission to a research program at Apply for Postgraduate Research .

Apply Now

Scholarships

To be considered for an RTP scholarship, you must select “Yes” in the “Scholarship Details” field on your application form and apply by the relevant RTP scholarship closing date. Information about the Sydney Law School Postgraduate Research Scholarships is available here.

Completion requirement

To qualify for the award of Master of Laws, a student must complete the unit of study LAWS6077 Legal Research 1 within the first 12 months of their candidature and a thesis in the approved subject with an upper limit of 50,000 words. The thesis must satisfy the examiners that it is a substantial contribution to the subject concerned. Thesis submission requirements and the examination procedure are set out in the Academic Board resolutions for this course and the Higher Degree by Research (HDR) Rule 2011.

Admission requirement

Admission to candidature for the Master of Laws (LLM) requires an honours degree with first- or upper second-class honours. Applications for admission to candidature for the Master of Laws (LLM) by thesis are assessed on the basis of: the suitability and sufficiency of merit of the applicant's prior qualification (Bachelor of Laws, Juris Doctor or equivalent); the suitability of the proposed topic; and the availability of appropriate supervision.

Careers & future study

Career pathways

The Master of Laws by Research (LLM) at the University of Sydney Law School is a pathway to a number of careers, including tertiary education, policy development, advanced research, and specialisation for employment in government, inter-governmental and international organisations, and civil society organisations. You will conduct a research project that makes a substantial and original contribution to knowledge and will have a highly developed knowledge base, with strong written, oral, and critical analytical skills. The Master of Laws by Research is also an excellent starting point for further postgraduate study in the doctoral (PhD) program.


The course information on this website applies only to future students. Current students should refer to faculty handbooks for current or past course information.


Ways of studying an LLM program – the research option


Many LLM programs are completed by coursework only; others by a combination of coursework, exams and a thesis. A minority of programs offer the opportunity to skip the coursework entirely and complete the degree via a thesis. An LLM by Research is intended to develop the student's legal research and writing skills by directing them towards planning and executing a large piece of academic research – usually around 30,000-40,000 words – on their chosen field of law. Although this dissertation will be completed under specialist supervision, the student will be expected to demonstrate the ability to work independently. The LLM by Research will develop the student’s ability to present legal arguments by drawing on various legal sources and other academic literature. Although the thesis is the main form of assessment for the qualification, many universities also offer the opportunity to participate in taught courses, giving you the chance to broaden your horizons across legal fields.


Advantages of studying an LLM by Research

Studying an LLM program by Research is a great option if you want to continue your legal education to postgraduate level, especially if you are considering going on to doctoral research in law (a PhD). It is also a good option if you want to continue working while studying part time for your Master of Laws.

Being part of a research community and meeting eminent researchers, thereby gaining invaluable skills and experience, are other benefits of choosing the research option. As well as developing your research skills, you will also develop other transferable skills that will aid your legal and/or academic career.

Another advantage of a research-only program is that you may be able to do most of your work elsewhere – wherever you have a suitable library or internet connection, for instance. Although many programs have formal residency requirements, they are often not enforced. Make sure you check your eligibility to study, as recent UK Border Agency rules can affect overseas students in the UK, making them eligible to apply only for full-time study.

Applying for an LLM by Research

Although all universities have different application procedures, if you are applying to do an LLM program by Research, you will have to submit a strong research proposal. This should include the title of your proposed research, a concise introduction, the intended methodology, the benefits of the research to the wider community, an overall summary, and details of any supporting supervisors.

There are several factors you will need to consider when choosing where to do your LLM by Research. Obviously the institution’s reputation, specialist fields, and attached professors/specialist researchers will all play an important part in helping you make your decision. Other factors to consider are the funding opportunities available at the law school and, of course, its location.

Almost all of the law schools offering LLMs by Research can be found in current or former British Commonwealth countries: Australia, Britain, Canada, Ireland, New Zealand and South Africa.

In the United States, although a handful of schools offer so-called ‘LLMs by Research’, the typical program, such as that at the University of Michigan, requires one semester of coursework and one of research and writing. The University of Wisconsin is nearly unique in offering a degree that does not require coursework. Unlike most other types of LLM programs, LLMs by Research often allow students to start at different times of year. The University of Bristol, for example, is not unusual in allowing students to start in January, April or October. There are several other LLM by Research programs available in the UK, for example at the Schools of Law at the University of Edinburgh and the University of Glasgow, as well as the Warwick School of Law.


University of Aberdeen

LLM By Research


Introduction

The LLM by Research is a Master’s degree path that allows you to improve your research skills and conduct independent investigation on any legal topic of your choice.

Study Information


The main requirement is to write a dissertation of up to 40,000 words. The programme is particularly suitable for students who aim to pursue a PhD later on, or who are looking for an alternative to a traditional taught LLM.

The Law School welcomes students who wish to pursue research degrees in law, and is able to provide supervision in a wide range of subjects. You will receive advanced training in research skills and methods. Moreover, you will be assigned two supervisors, who will guide and advise you over the year.

There is an active research community, with students from many parts of the world registered for a PhD or a Master's degree by research. LLM by Research students are given access to dedicated hot-desk space.

The specialist law library is housed in the same building as the Law School's teaching and staff rooms. The library contains not only a comprehensive collection of Scots and English sources but also a substantial collection of European Union, Commonwealth and American materials. There is a particularly fine collection of books in the field of Roman and Roman-Dutch law. The University is a European Documentation Centre, and that collection is housed in the Law Library.

This LLM program is offered on both a full-time (12 months) and a part-time (24 months) basis, making it flexible enough to accommodate the needs of working students.

You can start the LLM by Research in September or January of each year.

You may also be interested to learn about our PhD in Law .

Our Research

Legal research at the School of Law at the University of Aberdeen is cutting-edge and first class. Our faculty publish with leading publishing houses such as Cambridge University Press, Oxford University Press, and MIT Press, and in a wide array of premier law journals across Africa, Australia, East Asia, Europe, and North America, and they publish in many different languages.

Law research in the School is integrated across five distinct Research Centres:

  • Centre for Commercial Law
  • Centre for Constitutional and Public International Law
  • Centre for Energy Law
  • Centre for Private International Law
  • Centre for Scots Law

Entry Requirements

  • A first or upper-second class Honours degree in law or a relevant discipline; or any international equivalent
  • “Postgraduate Higher” English requirements: https://www.abdn.ac.uk/study/international/requirements-pg-266.php

The application process is free; no payment is required. To apply, please upload the following documents:

  • Your undergraduate and postgraduate transcripts;
  • Your graduation certificates or diplomas;
  • A Research Proposal;
  • Your personal statement explaining why you want to study for an LLM by Research (maximum two A4 pages);
  • Your letters of recommendation.

In addition, you can also submit:

  • Examples of previous academic writing, which could include previous dissertations, theses, or published research articles.
  • Other evidence of your experience of research and writing.

Your application should be accompanied by a research proposal. This tells us what you want to research and why the topic is important. Please visit our School of Law webpages for details of what to include in your research proposal.

Please be aware that your research proposal may be passed through originality checking software.

International Applicants

Additional details for international applicants, including country-specific information, are available here.

Fees and Funding

Fees - PGResearch 2022-23 (1).pdf (abdn.ac.uk)

More information about fee status, living costs, and work allowances for international students is available here.

Our Funding Database

View all funding options in our Funding Database .

Top 10 UK Law School

We are ranked Top 10 in the UK for Law by the Times and Sunday Times Good University Guide 2024.

There are many opportunities at the University of Aberdeen to develop your knowledge, gain experience and build a competitive set of skills to enhance your employability. This is essential for your future career success. The Careers and Employability Service can help you to plan your career and support your choices throughout your time with us, from first to final year – and beyond.

  • More information on the Careers and Employability Service

What our Alumni Say

Dr Eunice Pinn

Aberdeen was one of the few universities in the UK with a School of Law that included a focus on the marine environment. I had no formal legal training prior to starting my LLM, but had been involved in the practical implementation of the legal requirements for nature conservation for many years. The support from my supervisor was fantastic and still continues - I am now an Honorary Senior Lecturer for the School of Law.


MIT News | Massachusetts Institute of Technology

LLMs develop their own understanding of reality as their language abilities improve

Image caption: A cartoon robot inspects a pile of wingdings with a magnifying glass, piecing together a jigsaw puzzle of a robot moving to different locations.

Ask a large language model (LLM) like GPT-4 to smell a rain-soaked campsite, and it’ll politely decline. Ask the same system to describe that scent to you, and it’ll wax poetic about “an air thick with anticipation” and “a scent that is both fresh and earthy,” despite having neither prior experience with rain nor a nose to help it make such observations. One possible explanation for this phenomenon is that the LLM is simply mimicking the text present in its vast training data, rather than working with any real understanding of rain or smell. But does the lack of eyes mean that language models can’t ever “understand” that a lion is “larger” than a house cat? Philosophers and scientists alike have long considered the ability to assign meaning to language a hallmark of human intelligence — and pondered what essential ingredients enable us to do so.

Peering into this enigma, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have uncovered intriguing results suggesting that language models may develop their own understanding of reality as a way to improve their generative abilities. The team first developed a set of small Karel puzzles, which consisted of coming up with instructions to control a robot in a simulated environment. They then trained an LLM on the solutions, but without demonstrating how the solutions actually worked. Finally, using a machine learning technique called “probing,” they looked inside the model’s “thought process” as it generates new solutions.

After training on over 1 million random puzzles, they found that the model spontaneously developed its own conception of the underlying simulation, despite never being exposed to this reality during training. Such findings call into question our intuitions about what types of information are necessary for learning linguistic meaning — and whether LLMs may someday understand language at a deeper level than they do today.

“At the start of these experiments, the language model generated random instructions that didn’t work. By the time we completed training, our language model generated correct instructions at a rate of 92.4 percent,” says MIT electrical engineering and computer science (EECS) PhD student and CSAIL affiliate Charles Jin, who is the lead author of a new paper on the work. “This was a very exciting moment for us because we thought that if your language model could complete a task with that level of accuracy, we might expect it to understand the meanings within the language as well. This gave us a starting point to explore whether LLMs do in fact understand text, and now we see that they’re capable of much more than just blindly stitching words together.”

Inside the mind of an LLM

The probe helped Jin witness this progress firsthand. Its role was to interpret what the LLM thought the instructions meant, unveiling that the LLM developed its own internal simulation of how the robot moves in response to each instruction. As the model’s ability to solve puzzles improved, these conceptions also became more accurate, indicating that the LLM was starting to understand the instructions. Before long, the model was consistently putting the pieces together correctly to form working instructions. Jin notes that the LLM’s understanding of language develops in phases, much like how a child learns speech in multiple steps. Starting off, it’s like a baby babbling: repetitive and mostly unintelligible. Then, the language model acquires syntax, or the rules of the language. This enables it to generate instructions that might look like genuine solutions, but they still don’t work.
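To make the probing step concrete, here is a minimal, illustrative sketch (not the authors' code) of how a small probe could be trained on a frozen LLM's hidden activations to predict the simulator's ground-truth robot state. All names (LinearProbe, hidden_states, robot_states) are hypothetical; the point is only that the probe, not the LLM, is the model being fit.

```python
# Minimal probing sketch: fit a linear classifier on captured LLM activations.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps an LLM hidden state to a distribution over robot states."""
    def __init__(self, hidden_dim: int, num_robot_states: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_robot_states)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.classifier(hidden_state)

def train_probe(hidden_states, robot_states, hidden_dim, num_robot_states,
                epochs=10, lr=1e-3):
    """hidden_states: (N, hidden_dim) activations captured mid-generation.
    robot_states: (N,) integer labels for the simulator's ground-truth state."""
    probe = LinearProbe(hidden_dim, num_robot_states)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(hidden_states), robot_states)
        loss.backward()  # only the probe is updated; the LLM stays frozen
        opt.step()
    return probe
```

If such a probe predicts the robot's state accurately from the activations alone, that is evidence the state is represented somewhere inside the model.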

The LLM’s instructions gradually improve, though. Once the model acquires meaning, it starts to churn out instructions that correctly implement the requested specifications, like a child forming coherent sentences.

Separating the method from the model: A “Bizarro World”

The probe was only intended to “go inside the brain of an LLM” as Jin characterizes it, but there was a remote possibility that it also did some of the thinking for the model. The researchers wanted to ensure that their model understood the instructions independently of the probe, instead of the probe inferring the robot’s movements from the LLM’s grasp of syntax.

“Imagine you have a pile of data that encodes the LM’s thought process,” suggests Jin. “The probe is like a forensics analyst: You hand this pile of data to the analyst and say, ‘Here’s how the robot moves, now try and find the robot’s movements in the pile of data.’ The analyst later tells you that they know what’s going on with the robot in the pile of data. But what if the pile of data actually just encodes the raw instructions, and the analyst has figured out some clever way to extract the instructions and follow them accordingly? Then the language model hasn't really learned what the instructions mean at all.”

To disentangle their roles, the researchers flipped the meanings of the instructions for a new probe. In this “Bizarro World,” as Jin calls it, directions like “up” now meant “down” within the instructions moving the robot across its grid.

“If the probe is translating instructions to robot positions, it should be able to translate the instructions according to the bizarro meanings equally well,” says Jin. “But if the probe is actually finding encodings of the original robot movements in the language model’s thought process, then it should struggle to extract the bizarro robot movements from the original thought process.”

As it turned out, the new probe experienced translation errors, unable to interpret a language model that had different meanings of the instructions. This meant the original semantics were embedded within the language model, indicating that the LLM understood what instructions were needed independently of the original probing classifier.

“This research directly targets a central question in modern artificial intelligence: are the surprising capabilities of large language models due simply to statistical correlations at scale, or do large language models develop a meaningful understanding of the reality that they are asked to work with? This research indicates that the LLM develops an internal model of the simulated reality, even though it was never trained to develop this model,” says Martin Rinard, an MIT professor in EECS, CSAIL member, and senior author on the paper.
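As a rough illustration of this control, the sketch below flips the instruction semantics, re-derives the ground-truth robot states under the flipped meanings, and compares probe accuracy. It is an assumption-laden sketch, not the experiment's actual code: the FLIP mapping, the simulator function, and the 0.2 accuracy-drop threshold are all invented for illustration.

```python
# "Bizarro World" control, sketched: retrain a probe against flipped semantics
# and see whether it matches the original probe's accuracy.
FLIP = {"up": "down", "down": "up", "left": "right", "right": "left"}

def bizarro_labels(instructions, simulator):
    """Re-simulate each instruction sequence with flipped meanings to get the
    ground-truth robot states a 'bizarro' probe should predict.
    `simulator` is a hypothetical function mapping an instruction list -> state."""
    flipped = [[FLIP.get(tok, tok) for tok in seq] for seq in instructions]
    return [simulator(seq) for seq in flipped]

def compare_probes(acc_original: float, acc_bizarro: float) -> str:
    # A large drop suggests the original robot semantics are encoded in the
    # LLM's activations rather than being reconstructed by the probe itself.
    return ("semantics likely in the LLM"
            if acc_original - acc_bizarro > 0.2 else "inconclusive")
```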

This experiment further supported the team’s analysis that language models can develop a deeper understanding of language. Still, Jin acknowledges a few limitations to their paper: They used a very simple programming language and a relatively small model to glean their insights. In an  upcoming work , they’ll look to use a more general setting. While Jin’s latest research doesn’t outline how to make the language model learn meaning faster, he believes future work can build on these insights to improve how language models are trained.

“An intriguing open question is whether the LLM is actually using its internal model of reality to reason about that reality as it solves the robot navigation problem,” says Rinard. “While our results are consistent with the LLM using the model in this way, our experiments are not designed to answer this next question.”

“There is a lot of debate these days about whether LLMs are actually ‘understanding’ language or rather if their success can be attributed to what is essentially tricks and heuristics that come from slurping up large volumes of text,” says Ellie Pavlick, assistant professor of computer science and linguistics at Brown University, who was not involved in the paper. “These questions lie at the heart of how we build AI and what we expect to be inherent possibilities or limitations of our technology. This is a nice paper that looks at this question in a controlled way — the authors exploit the fact that computer code, like natural language, has both syntax and semantics, but unlike natural language, the semantics can be directly observed and manipulated for experimental purposes. The experimental design is elegant, and their findings are optimistic, suggesting that maybe LLMs can learn something deeper about what language ‘means.’”

Jin and Rinard’s paper was supported, in part, by grants from the U.S. Defense Advanced Research Projects Agency (DARPA). 





  • Open access
  • Published: 16 November 2023

A study of generative large language model for medical research and healthcare

  • Cheng Peng   ORCID: orcid.org/0000-0002-1994-893X 1 ,
  • Xi Yang 1 , 2 ,
  • Aokun Chen 1 , 2 ,
  • Kaleb E. Smith 3 ,
  • Nima PourNejatian 3 ,
  • Anthony B. Costa 3 ,
  • Cheryl Martin 3 ,
  • Mona G. Flores   ORCID: orcid.org/0000-0002-7362-3044 3 ,
  • Ying Zhang   ORCID: orcid.org/0000-0003-4210-2104 4 ,
  • Tanja Magoc 5 ,
  • Gloria Lipori   ORCID: orcid.org/0000-0001-5616-2701 5 , 6 ,
  • Duane A. Mitchell   ORCID: orcid.org/0000-0001-6049-213X 6 ,
  • Naykky S. Ospina 7 ,
  • Mustafa M. Ahmed 8 ,
  • William R. Hogan   ORCID: orcid.org/0000-0002-9881-1017 1 ,
  • Elizabeth A. Shenkman   ORCID: orcid.org/0000-0003-4903-1804 1 ,
  • Yi Guo   ORCID: orcid.org/0000-0003-0587-4105 1 , 2 ,
  • Jiang Bian   ORCID: orcid.org/0000-0002-2238-5429 1 , 2 &
  • Yonghui Wu   ORCID: orcid.org/0000-0002-6780-6135 1 , 2  

npj Digital Medicine, volume 6, Article number: 210 (2023)


Subjects: Health care; Translational research

There is enormous enthusiasm about, and there are significant concerns over, applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which were not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical natural language processing. We apply GatorTronGPT to generate 20 billion words of synthetic text. Synthetic NLP models trained using synthetic text generated by GatorTronGPT outperform models trained using real-world clinical text. A physicians’ Turing test using a 1 (worst) to 9 (best) scale shows that there are no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human), and that physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.


Introduction

Generative large language models (LLMs) such as ChatGPT 1 have surprised the world by answering questions conversationally and generating textual content such as emails, articles, and even computer code, triggering enormous enthusiasm for applying LLMs to healthcare 2, 3, 4. People are enthusiastic about the potential of LLMs to facilitate documentation of patient reports (e.g., a progress report) 3, 4, improve diagnostic accuracy 5, and assist in various aspects of clinical care 6, 7, while at the same time raising concerns about hallucinations and fabrications 7, 8, bias and stereotypes 9, and risks to patient privacy and ethics 10. Yet this enthusiasm and these concerns are based on ChatGPT, which is not designed for healthcare use 1. Until now, it has been unclear how this disruptive technology can help medical research and potentially improve the quality of healthcare.

A language model is a statistical distribution used in natural language processing (NLP) to formulate the probability of a sequence of words, or of the next word in a sequence. Surprisingly, when it is used as a learning objective to train a specific neural network architecture named the transformer, and when the model size is very large, such as billions or hundreds of billions of parameters, important artificial intelligence (AI) abilities emerge. For example, LLMs can learn knowledge from one task and apply it to another task (i.e., transfer learning), learn from very few labeled samples (i.e., few-shot learning), and learn without human-labeled samples (i.e., zero-shot learning) 11, 12, 13. An LLM pretrained using a decoder-only transformer such as GPT-3 is known as a generative LLM, as it can generate human-like text. The conversational ability of LLMs is achieved using prompt-based text generation 14, the key technology guiding LLMs to generate reasonable answers and contextual content.

This study aims to develop a generative LLM using real-world clinical text and evaluate its utility for medical research and healthcare. We train GatorTronGPT using 82 billion words of de-identified clinical text 15 from University of Florida (UF) Health and 195 billion diverse English words from the Pile 16 dataset. We train GatorTronGPT from scratch using the GPT-3 17 architecture. We formulate biomedical relation extraction and question answering using a unified text generation architecture 18 to evaluate how GatorTronGPT could benefit medical research using 6 benchmark datasets. To examine the utility of text generation in the clinical domain, we apply GatorTronGPT to generate 20 billion words of synthetic clinical text, which are used to train synthetic NLP models using BERT 19 architecture, denoted as GatorTronS (‘S’ stands for synthetic). We compare GatorTronS models with GatorTron 15 , a clinical NLP model trained using real-world 90 billion words of text, to test the hypothesis that generative clinical LLMs can be used to generate synthetic clinical text for medical research. To test if LLMs could be used in healthcare, two internal medicine subspecialists from endocrinology (NSO) and cardiology (MMA) manually evaluate clinical paragraphs written by GatorTronGPT compared with real-world paragraphs written by UF Health physicians. Figure 1 shows an overview of the study design. This study provides valuable insights into the opportunities and challenges of LLMs for medical research and healthcare.

Figure 1. a Train GatorTronGPT from scratch using the GPT-3 architecture with up to 20 billion parameters. b Solve biomedical relation extraction and question answering using a unified p-tuning-based text generation architecture. c Apply GatorTronGPT to generate 20 billion words of synthetic clinical text, which was used to train the synthetic natural language processing model GatorTronS. d Turing evaluation of 30 paragraphs of text written by GatorTronGPT mixed with 30 real-world paragraphs written by UF Health physicians. TrM: transformer unit; B: billion.

Training of GatorTronGPT from scratch

Training the 5-billion-parameter GatorTronGPT model took approximately 6 days, and the 20-billion-parameter model took about 20 days, on 560 A100 80 GB GPUs across 70 NVIDIA DGX nodes using the NVIDIA SuperPOD reference cluster architecture. Figure 2 shows the training and validation loss. Table 1 compares GatorTronGPT with GatorTronS and GatorTron on model architecture, training dataset, and parameter size, and notes whether each model is a generative LLM, to help differentiate the three LLMs.

Figure 2. a Training loss. b Validation loss.

GatorTronGPT for biomedical natural language processing

Table 2a compares GatorTronGPT with four existing biomedical transformer models on end-to-end relation extraction of drug-drug interaction, chemical-disease relation, and drug-target interaction. GatorTronGPT outperformed all existing models, with the best F1-score of 0.500, 0.494, and 0.419, respectively. GatorTronGPT improved state-of-the-art by 3–10% compared with the second-best BioGPT 18 model. We consistently observed performance improvement when scaling up the size of GatorTronGPT. Table 2b compares GatorTronGPT with six existing biomedical transformers using three benchmark datasets for biomedical question answering. The GatorTronGPT model with 20 billion parameters tied with BioLinkBERT on the MedQA dataset achieving the best performance of 0.451. GatorTronGPT also achieved the second-best performance of 0.776 for the PubMedQA dataset compared with the best performance of 0.782 from BioGPT. The performance of GatorTronGPT on the MedMCQA dataset was lower than a much larger LLM, Galactica, with 120 billion parameters.

Evaluation of GatorTronS

Tables 3 and 4 compare GatorTronS trained with different sizes of synthetic clinical text against ClinicalBERT and GatorTron 15. For clinical concept extraction, GatorTronS trained using 20 billion and 5 billion words of synthetic clinical text achieved the best F1-scores on the three benchmark datasets. GatorTronS outperformed the original GatorTron model by >1% in F1-score on all three benchmark datasets. For medical relation extraction, GatorTronS trained using 10 billion words of synthetic clinical text achieved the best F1-score of 0.962 on the 2018 n2c2 challenge benchmark dataset, which is comparable with the original GatorTron model (0.960). For semantic textual similarity and natural language inference, GatorTronS achieved the best evaluation scores, outperforming the original GatorTron by >1%. For question answering using the emrQA dataset, GatorTronS outperformed the original GatorTron model trained using real-world clinical text by >1%. These comparisons show that a minimum of 5 billion words of synthetic clinical text is required to train a synthetic model with performance comparable to GatorTron, a transformer trained using 82 billion words of real-world UF Health clinical text. Figure 3 compares GatorTronS models trained with different sizes of synthetic text using line plots. We observed consistent performance improvements on all eight datasets when increasing the size of synthetic text from 1 billion to 5 billion words. The improvements are not consistent when increasing the data size from 5 billion to 20 billion words.

Figure 3. Comparison of GatorTronS models trained with different sizes of synthetic text. B: billion words of text.

Physicians’ Turing test

The Turing test results show that, on average, less than half (49.2%) of the clinical notes were identified correctly, including 36.7% of the synthetic notes and 61.7% of the human notes (Table 5a). Among the 30 synthetic notes written by GatorTronGPT, 9 (30.0%) and 13 (43.4%) were correctly labeled as ‘AI’ by the two physicians, respectively. Among the 30 human notes written by physicians, 17 (56.7%) and 20 (66.7%) were correctly labeled as ‘Human’, respectively. Given that GatorTronGPT was judged to be human in more than 30% of instances (the criterion from the Turing test) 20, GatorTronGPT passed the Turing test (p < 0.001). Table 5b summarizes the means and standard deviations of linguistic readability and of clinical relevance and consistency. Statistical tests show that there is no significant difference between notes written by GatorTronGPT and by human physicians in either linguistic readability (p = 0.22) or clinical relevance and consistency (p = 0.91). Table 5c shows two examples written by GatorTronGPT; more examples are provided in Supplementary Table S1. Percent agreement and interrater reliability were found to be good or excellent, as summarized in Supplementary Tables S2 and S3.

Discussion

This study develops a generative clinical LLM, GatorTronGPT, using the GPT-3 architecture 13 with 277 billion words of mixed clinical and English text. GatorTronGPT achieves state-of-the-art performance on four of six biomedical NLP benchmark datasets. Our previous GatorTron 15 model, trained using an encoder-only BERT architecture with 8.9 billion parameters, also achieved state-of-the-art performance on six clinical NLP benchmark datasets. The two studies demonstrate the benefit of LLMs for biomedical and clinical research. GatorTronGPT can generate synthetic clinical text for developing synthetic clinical NLP models (i.e., GatorTronS), which achieve better or comparable performance to GatorTron, an NLP model trained using real-world clinical text, demonstrating the utility of synthetic clinical text generation. The physicians’ Turing test shows that GatorTronGPT can generate clinical text with linguistic readability and clinical relevance comparable to real-world clinical notes. This study provides valuable insights into the opportunities and challenges of generative LLMs for medical research and healthcare.

We discover an important utility of synthetic clinical text generation. To date, there has been a gap in accessing and sharing large-scale clinical text and clinical LLMs due to the sensitive nature of clinical text and the fact that automatic de-identification systems cannot remove 100% of protected health information (PHI). Not surprisingly, a recent study 21 on clinical foundation models points out that most LLMs in the medical domain are trained using “small, narrowly-scoped” clinical datasets with limited note types (e.g., MIMIC 22) or “broad, public” biomedical literature (e.g., PubMed) that offers limited insight into healthcare. Generative LLMs can provide large-scale synthetic clinical text to fill this gap. We compare the synthetic text with real-world clinical text to examine why GatorTronS, a transformer model trained using a much smaller (e.g., 5 billion words) synthetic clinical text corpus, could achieve better or comparable performance to GatorTron 15, a transformer model trained using a much larger (90 billion words) real-world clinical text corpus. We identify potential reasons including (1) real-world clinical text has significant redundancies, which is a well-known characteristic of clinical narratives 23, and (2) GatorTronGPT generates more diverse synthetic clinical text. We randomly sample a subset of real-world clinical notes with a number of words comparable to the synthetic text (i.e., 20 billion words) to compare the coverage of unigrams (i.e., individual tokens) and bigrams (i.e., two consecutive tokens). The comparison shows that the synthetic text generated by GatorTronGPT contains remarkably more diverse unigrams (40.43 million : 4.82 million; ratios are reported as “synthetic” : “real notes”) and bigrams (416.35 million : 62.51 million); the synthetic text also has higher entropy than the real-world clinical text (4.97 : 4.95). Supplementary Table S4 provides detailed comparison results and examples. A previous study 24 reported that, by augmenting real-world clinical training data with additional human-annotated synthetic text generated by a smaller generative LLM, GPT-2, NLP models can achieve better performance. Our study further demonstrates that, without additional human annotation or augmentation of training data, a larger clinical GPT-3 model can generate synthetic clinical text to train synthetic NLP models that outperform NLP models trained using real-world clinical text. Text generation using generative LLMs could mitigate the risk of exposing patient privacy and improve access to and sharing of large-scale clinical text and NLP models, thus enabling the next generation of clinical text analytics using synthetic clinical text.

Generative LLMs aspire to become a “Unified Field Theory” that unifies most fundamental NLP tasks under a single model architecture. It might still be too early to judge whether LLMs will become the one and only foundation model 12 for NLP, but it looks like we are closer than ever. Generative LLMs have the potential to impact medical research in many ways. In addition to the performance improvements demonstrated in this study, generative LLMs provide a unified solution using prompt-based text generation 25, which leads to a new paradigm of “one model for all NLP tasks” and has better few-shot learning and transfer learning ability to deliver portable clinical NLP systems 13, 26. The evaluation of GatorTronGPT shows that clinical LLMs can be used to generate clinically relevant content, with the potential to help document 3 and code patient information in EHR systems, thus reducing the onerous documentation burden on clinicians 27, 28, 29. The prompt-based text generation of LLMs can potentially help compose treatment plans by integrating instructions from clinical guidelines and patients’ historical records in EHRs. The conversational ability of LLMs provides opportunities to develop intelligent electronic health record (EHR) systems with human-like communication 2, in which healthcare providers, patients, and other stakeholders can interact. Industry stakeholders such as Epic and Nuance have been reported to be exploring these potentials 30, 31.

Our Turing test focuses on (1) linguistic readability, (2) clinical relevance, and (3) physicians’ ability to differentiate synthetic from human notes. The statistical tests show that there are no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human notes) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human notes). Further, physicians cannot differentiate them (p < 0.001), suggesting the potential utility of GatorTronGPT for text generation in healthcare. The two physician evaluators find that the texts written by GatorTronGPT generally lack clinical logic, indicating that more research and development are needed to make this technology mature enough for healthcare. Our Turing test focuses on statistical differences, not utility in real-world clinical practice, which should be examined in future studies when this technology matures. A recent study 32 examined an LLM developed at New York University, NYUTron, alongside our previously developed GatorTron 15 for prediction of readmission, in-hospital mortality, comorbidity, length of stay, and insurance denial, demonstrating the potential utility of LLMs in healthcare.

While LLMs are promising for healthcare applications, much more research and development are needed to achieve this goal. Current general-purpose LLMs are designed for conversation as chatbots outside of healthcare. Therefore, the current use of ChatGPT for healthcare is more like a typical case of intended use versus actual use as described in medical device regulation 33. Domain-specific LLMs are needed for clinical applications. Due to noisy data and the probabilistic nature of text generation, LLMs are prone to confabulation or hallucination, which is dangerous for healthcare. In this study, we adopted robust decoding strategies (e.g., nucleus sampling) to alleviate potential off-target text generation. Researchers are exploring solutions such as reinforcement learning from human feedback (RLHF) 34 to reduce hallucinations, but this remains an unsolved limitation of current LLMs. Future studies should explore strategies to keep hallucinations to a minimum to ensure the safety of using LLMs in healthcare. The security and risks of LLMs must also be carefully examined in healthcare settings. We applied a de-identification system to remove PHI from UF Health notes before training GatorTronGPT; future studies should carefully examine whether GatorTronGPT carries a residual risk of emitting PHI and should quantify the potential risk of re-identifying real-world patients. Synthetic data, though generated by AI models, may still mirror the characteristics of its source material (e.g., UF Health clinical notes). For example, ChatGPT has been reported to have accidentally leaked sensitive business data from a private company 35. In addition, people are increasingly aware of the potential bias of AI applications in healthcare. Bias inherited from the original training data may be imitated and sometimes even amplified by AI models, which may cause systematic bias against specific patient groups 36. Future studies should explore strategies to mitigate potential bias and ensure the fairness of LLM applications. As with any medical AI application, it is necessary to carefully examine this disruptive new technology to guide its application and make it an “approved” AI-enabled medical tool 37.

Methods

We developed GatorTronGPT using 82 billion words of de-identified clinical text 15 from the University of Florida (UF) Health and 195 billion diverse English words from the Pile 16 dataset. We trained GatorTronGPT from scratch using the GPT-3 17 architecture (the architecture used by ChatGPT). We formulated biomedical relation extraction and question answering using a unified text generation architecture 18 and evaluated GatorTronGPT using 6 biomedical benchmark datasets. To examine the utility of text generation, we applied GatorTronGPT to generate 20 billion words of synthetic clinical text, which were used to train synthetic NLP models, denoted as GatorTronS (“S” stands for synthetic). We compared GatorTronS with GatorTron 15, a clinical NLP model trained with the same architecture but using real-world clinical text. To test whether LLMs could generate text for healthcare settings, two internal medicine subspecialists from endocrinology (NSO) and cardiology (MMA) manually evaluated 60 clinical paragraphs comprising 30 paragraphs written by GatorTronGPT randomly mixed with 30 real-world paragraphs written by UF Health physicians. Figure 1 shows an overview of the study design.

Data source

This study used 82 billion words of clinical narratives from the UF Health Integrated Data Repository (IDR) and 195 billion words of diverse English text from the Pile 16 corpus. The study was approved by the University of Florida Institutional Review Board under IRB202102223; the need for patient consent was waived. At UF Health, we collected approximately 290 million clinical notes from 2011–2021 from over 126 departments, approximately 2 million patients, and 50 million encounters from inpatient, outpatient, and emergency settings 15. We merged the UF Health clinical corpus with the Pile 16 dataset to generate a large corpus of 277 billion words. We performed minimal preprocessing for the Pile dataset and applied a de-identification system to remove the 18 PHI categories defined in the Health Insurance Portability and Accountability Act (HIPAA) from the UF Health notes.

Preprocessing and de-identification of clinical text

Following our previous study 15, we performed a minimal preprocessing procedure. First, we removed all empty notes and notes with fewer than 10 characters, and then performed deduplication at the note level using an exact string match strategy. Next, we used an internally developed preprocessing tool (https://github.com/uf-hobi-informatics-lab/NLPreprocessing) to normalize the clinical text. The normalization consists of three steps: (1) unifying all text into UTF-8 encoding, removing illegal UTF-8 strings, and removing HTML/XML tags if any; (2) sentence boundary detection, where we normalize the clinical notes into sentences; and (3) word tokenization, where we used heuristic rules to separate punctuation and special symbols (e.g., slashes, parentheses) from words (e.g., converting “(HbA1c)” to “( HbA1c )” and “excision/chemo” to “excision / chemo”) and to fix concatenations (e.g., missing white space, such as converting “CancerScreening” to “Cancer Screening”). After preprocessing, we performed another deduplication at the sentence level using the exact string match strategy.
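The following is a minimal sketch of the three-step normalization plus deduplication described above, assuming plain-text notes as input. It is illustrative only; the authors' actual tool is the NLPreprocessing repository linked above, and the regular expressions here are deliberate simplifications.

```python
# Sketch of the preprocessing pipeline: dedup, encoding cleanup, sentence
# splitting, and tokenization normalization.
import re

def clean_encoding(text: str) -> str:
    """Step 1: force UTF-8 and strip HTML/XML tags."""
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    return re.sub(r"<[^>]+>", " ", text)

def split_sentences(text: str):
    """Step 2: naive sentence boundary detection (real systems use rules/ML)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence: str) -> str:
    """Step 3: separate punctuation from words, e.g. '(HbA1c)' -> '( HbA1c )'."""
    sentence = re.sub(r"([()/\[\],;:])", r" \1 ", sentence)
    # split missing-whitespace concatenations like 'CancerScreening'
    sentence = re.sub(r"([a-z])([A-Z])", r"\1 \2", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

def preprocess(notes):
    """Drop empty/short notes, deduplicate, normalize, then deduplicate sentences."""
    notes = list(dict.fromkeys(n for n in notes if n and len(n) >= 10))
    sentences = [tokenize(s) for n in notes for s in split_sentences(clean_encoding(n))]
    return list(dict.fromkeys(sentences))
```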

To de-identify the UF Health clinical notes, we adopted an internally developed de-identification system, which consists of an LSTM-CRFs-based model and a postprocessing module that replaces system-detected protected health information (PHI) entities with dummy strings (e.g., replacing patients’ names with [**NAME**]). We adopted the safe-harbor method to identify the 18 PHI categories defined in the Health Insurance Portability and Accountability Act (HIPAA). The LSTM-CRFs model for PHI detection was trained using the publicly available 2014 i2b2 de-identification datasets and an internal dataset of over 1100 clinical notes from UF Health annotated for PHI removal (named the UF-deid-dataset; not publicly available due to IRB restrictions). After three years of continuous customization and improvement at UF Health, the current model achieves an overall F1 score of 97.98% (precision of 96.27% and recall of 99.76%) on the UF-deid-dataset test set, which means our de-identification system can remove 99.76% of all PHI. Detailed information about the development of the de-identification system can be found in our previous paper 38.
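As an illustration of the postprocessing step only (the LSTM-CRFs PHI detector itself is assumed and not shown), the sketch below replaces detected PHI spans with dummy strings such as [**NAME**]. The note, the spans, and the helper function are invented for illustration.

```python
# Sketch of de-identification postprocessing: mask detector-reported PHI spans.
from typing import List, Tuple

def mask_phi(text: str, phi_spans: List[Tuple[int, int, str]]) -> str:
    """phi_spans: (start, end, category) character offsets produced by the
    detector, e.g. (4, 12, 'NAME'). Spans are replaced right-to-left so
    earlier offsets stay valid."""
    for start, end, category in sorted(phi_spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[**{category}**]" + text[end:]
    return text

# Example with hypothetical detector output:
note = "Mr. John Doe was seen on 01/02/2020 for chest pain."
spans = [(4, 12, "NAME"), (25, 35, "DATE")]
print(mask_phi(note, spans))  # -> "Mr. [**NAME**] was seen on [**DATE**] for chest pain."
```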

Train GatorTronGPT from scratch

We trained GatorTronGPT with 5 billion parameters and with 20 billion parameters, and determined the number of layers, hidden sizes, and number of attention heads according to the guidelines for optimal depth-to-width parameter allocation proposed by ref. 39, as well as our previous experience in developing GatorTron 15. The 5-billion-parameter model has 24 layers, a hidden size of 4096, and 32 attention heads; the 20-billion-parameter model has 44 layers, a hidden size of 6144, and 48 attention heads. We trained the 5-billion-parameter model using 2-way tensor model parallelism with a batch size of 1120 and a learning rate of 1.200E-05. We trained the 20-billion-parameter model using 8-way tensor model parallelism with a batch size of 560 and a learning rate of 1.000E-05. We adopted a dropout rate of 0.1. We inherited the GPT-3 architecture implemented in Megatron-LM 40 and trained the GatorTronGPT models from scratch with the default GPT-3 loss function 13. We used a total of 560 NVIDIA DGX A100 GPUs from 70 SuperPOD nodes in UF’s HiPerGator-AI cluster to train GatorTronGPT, leveraging both data-level and model-level parallelism implemented by the Megatron-LM package 40 (see https://github.com/NVIDIA/Megatron-LM for more details). We monitored training progress by training loss and validation loss using 3% of the data and stopped training when there was no further improvement.
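For reference, the two reported configurations can be restated as a small Python dictionary. The key names are illustrative rather than Megatron-LM's exact argument names; only the values come from the paragraph above.

```python
# Reported GatorTronGPT configurations, restated for quick comparison.
GATORTRON_GPT_CONFIGS = {
    "5B": {
        "num_layers": 24,
        "hidden_size": 4096,
        "num_attention_heads": 32,
        "tensor_model_parallel_size": 2,
        "global_batch_size": 1120,
        "learning_rate": 1.2e-5,
        "dropout": 0.1,
    },
    "20B": {
        "num_layers": 44,
        "hidden_size": 6144,
        "num_attention_heads": 48,
        "tensor_model_parallel_size": 8,
        "global_batch_size": 560,
        "learning_rate": 1.0e-5,
        "dropout": 0.1,
    },
}
```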

GatorTronGPT for biomedical relation extraction and question answering

End-to-end relation extraction is an NLP task that identifies triplets <concept1, concept2, relation> from biomedical text. Question answering identifies the answer to a given question within a given context. Following previous studies 18, 41, we approached the two tasks using a unified prompt-based text generation architecture. Specifically, we adopted a fixed-LLM prompt-tuning strategy 42 that attaches a continuous embedding (i.e., virtual tokens) to the input sequence [virtual tokens; x; y] as a soft prompt to control text generation; the LLM itself was not changed during training. We provide details in the Supplement.

End-to-end biomedical relation extraction

We compared the two GatorTronGPT models with four existing transformer models including GPT-2 43 , REBEL, REBEL-pt 25 , and BioGPT 18 on three biomedical tasks for end-to-end relation extraction using three benchmark datasets including drug-drug interaction 44 (DDI), BioCreative V chemical-disease relation 45 (BC5CDR), and drug-target interaction 46 (KD-DTI).

GPT-2

GPT-2 was trained using text data from 8 million webpages and has 1.5 billion parameters, a scale-up of the first-generation GPT model. The GPT model outperformed previous transformer models on 9 out of 12 NLP tasks, and the GPT-2 model further demonstrated text generation ability, which laid the foundation for complex NLP tasks such as machine reading comprehension and question answering.

REBEL and REBEL-pt

REBEL is a transformer model based on the BART architecture designed for end-to-end relation extraction using sequence-to-sequence modeling, and it outperformed previous relation extraction models based on classification. REBEL-pt is an enhanced version of REBEL, further fine-tuned using triplets derived from Wikipedia hyperlinks.

BioGPT

BioGPT is a domain-specific generative transformer-based LLM developed using the GPT-2 architecture and PubMed biomedical literature; it achieved good performance on biomedical NLP tasks including relation extraction and question answering.

Following the previous study 18, we formulated both biomedical relation extraction and question answering as prompt-based text generation and applied prompt-tuning (p-tuning) algorithms. We concatenate learnable soft prompts (also called virtual prompt embeddings) with the word embeddings from the context (i.e., the input sentence). The sample sequence is constructed as [prompt, context, relation], where the prompt is generated using an LSTM model and the relation is the gold-standard label including the head entity, tail entity, and their relation type. During inference, the context and the prompt are used as the input to our GatorTronGPT model, which is conditioned on them to generate the relations. We converted the original relation triplets into a sequence representation. For example, there is an “agonist” relation between the drug “Igmesine” and the target “Opioid receptor sigma 1”, which was converted to: “the relation between [ Igmesine ] and [ Opioid receptor sigma 1 ] is [ agonist ]”. Thus, relation extraction can be solved as text generation; see the sketch below. During inference, we converted the generated text back into triplets for evaluation. We fine-tuned and evaluated GatorTronGPT on the end-to-end relation extraction task across four biomedical datasets: BC5CDR (chemical–disease relation extraction), KD-DTI (drug–target interaction extraction), DDI (drug–drug interaction extraction), and 2018 n2c2 (drug–ADE relation extraction). Precision, recall, and F1 score were used for evaluation.
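A minimal sketch of the linearize-and-parse-back idea referenced above follows; the exact template and parsing logic used by the authors may differ, and the helper functions are invented for illustration.

```python
# Sketch: convert a relation triplet to a text template and recover it again.
import re

def triplet_to_text(head: str, tail: str, relation: str) -> str:
    return f"the relation between [ {head} ] and [ {tail} ] is [ {relation} ]"

def text_to_triplet(generated: str):
    """Recover (head, tail, relation) from generated text for evaluation."""
    m = re.search(r"the relation between \[ (.+?) \] and \[ (.+?) \] is \[ (.+?) \]",
                  generated)
    return m.groups() if m else None

# Example from the paragraph above:
s = triplet_to_text("Igmesine", "Opioid receptor sigma 1", "agonist")
assert text_to_triplet(s) == ("Igmesine", "Opioid receptor sigma 1", "agonist")
```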

Biomedical question answering

We compared GatorTronGPT with six existing transformer models using three widely used benchmark datasets: PubMedQA 47, a biomedical question answering dataset collected from PubMed abstracts, which requires answering questions with ‘yes/no/maybe’; MedMCQA 48, a large-scale multiple-choice question answering dataset designed to address real-world medical entrance exam questions covering 2400 healthcare topics and 21 medical subjects; and MedQA-USMLE 49, a multiple-choice dataset collected from professional medical board exams. These datasets have been widely used to evaluate LLMs 18, 47, 48, 49.

Given a question, a context, and candidate answers, we concatenated the context and the candidate answers into a source sequence and composed the target sequence as “the answer to the question given possible options is: ” followed by the answer (e.g., “C”). We adopted soft prompts instead of hard prompts (manually designed clear-text phrases) in p-tuning. Specifically, we used a randomly initialized continuous embedding as the soft prompt, which was fine-tuned during training. For the PubMedQA dataset, we explored the provided artificially generated text data: we automatically labeled the generated text using our p-tuning model developed on the training set and experimented with feeding different proportions of auto-labeled data back into training. The best performance was achieved using 5% of the auto-labeled artificially generated text data. For p-tuning, we used the implementation in NVIDIA NeMo 50, which is optimized for LLMs. We used the following parameters in our p-tuning: a global batch size of 32, 15 virtual tokens for p-tuning, an encoder MLP with an encoder hidden size of 2048, a maximum sequence length of 4096 for PubMedQA (long abstracts) and 2048 for MedMCQA and MedQA-USMLE, and a fused Adam optimizer with a learning rate of 1e-4, a weight decay of 0.01, betas of 0.9 and 0.98, and a cosine annealing scheduler monitoring validation loss with a 50-step warm-up. For example, below is a prompt we used for MedQA-USMLE.

{“taskname”: “usmle-qa”, “prompt”: “QUESTION: A 23-year-old man comes to the physician for evaluation of decreased hearing, dizziness, and ringing in his right ear for the past 6 months. Physical examination shows multiple soft, yellow plaques and papules on his arms, chest, and back. There is sensorineural hearing loss and weakness of facial muscles bilaterally. His gait is unsteady. An MRI of the brain shows a 3-cm mass near the right internal auditory meatus and a 2-cm mass at the left cerebellopontine angle. The abnormal cells in these masses are most likely derived from which of the following embryological structures?\nMULTIPLE CHOICES: (A) Neural tube\n(B) Surface ectoderm\n(C) Neural crest\n(D) Notochord\nTARGET: the answer to the question given possible options is: “, “answer”: “C”}
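Below is a hedged sketch of how a prompt record in the format shown above might be assembled programmatically. Only the field names ("taskname", "prompt", "answer") come from the example; the helper function and the toy question are invented for illustration.

```python
# Sketch: build a MedQA-style prompt record in the format shown above.
import json

def build_usmle_record(question: str, choices: dict, answer_key: str,
                       taskname: str = "usmle-qa") -> str:
    options = "\n".join(f"({k}) {v}" for k, v in sorted(choices.items()))
    prompt = (f"QUESTION: {question}\nMULTIPLE CHOICES: {options}\n"
              "TARGET: the answer to the question given possible options is: ")
    return json.dumps({"taskname": taskname, "prompt": prompt, "answer": answer_key})

# Usage with a toy item (not from the benchmark):
record = build_usmle_record(
    "Which vitamin deficiency causes scurvy?",
    {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    "B",
)
print(record)
```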

GatorTronGPT for synthetic clinical text generation

We sought to test the hypothesis that LLMs can generate synthetic clinical text to train synthetic NLP models useful for medical research. We applied GatorTronGPT to generate synthetic clinical text according to a set of seeds without any fine-tuning, which is a typical zero-shot learning setting. Then, using the generated synthetic clinical text, we trained synthetic transformer-based NLP models using our previous BERT-based GatorTron architecture 15 , denoted as GatorTronS (‘S’ stands for synthetic). We trained GatorTronS models using different sizes of synthetic clinical text and compared them with the original GatorTron model trained using UF Health clinical text. To make it comparable, we trained GatorTronS using the same architecture and number of parameters (i.e., 345 million) as GatorTron 15 . We provide detailed information in the Supplement.

Synthetic clinical text generation

Following previous studies 51, we approached synthetic clinical text generation using an iterative sampling algorithm and applied top-p (i.e., nucleus) sampling and temperature sampling to balance the diversity and quality of text generation 51. We treated synthetic clinical text generation as an open-ended text-to-text generation task 52, 53, where the generated clinical text is restricted by the context (e.g., the prompts). Specifically, given a sequence of \(m\) tokens \(X_{pre} = x_1 x_2 \ldots x_m\) as the input context, the task is to generate the next \(n\) continuation tokens \(X_{cont} = x_{m+1} x_{m+2} \ldots x_{m+n}\) until reaching the maximum length of 512 tokens. We generate text by iteratively sampling from the pre-trained language model GatorTronGPT one token at a time, conditioning on the preceding context:

\[P(X_{cont} \mid X_{pre}) = \prod_{i=m+1}^{m+n} P(x_i \mid x_1 \ldots x_{i-1}),\]

where \(P(x_i \mid x_1 \ldots x_{i-1})\) is the next-token distribution. We adopt top-p (nucleus) sampling 54 to select the next word from the smallest vocabulary subset whose cumulative probability exceeds a predefined threshold \(p\):

\[\sum_{x \in V^{(p)}} P(x \mid x_1 \ldots x_{i-1}) \ge p,\]

where \(V^{(p)}\) is the top-p vocabulary used to sample the next word. This approach dynamically adapts the number of words considered at each step based on their probabilities, balancing the diversity and coherence of the generated text.

We set the parameter of top-p sampling at 0.9 and the parameter for temperature sampling at 1.2 according to our empirical assessment. We sampled the beginning 15 tokens from all sections of the de-identified notes from the MIMIC III database 22 and generated approximately 8 million prompts. We also tried several random seeds in GatorTronGPT to generate multiple documents from one prompt. We controlled GatorTronGPT to generate a maximum length of 512 tokens.
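The sampling step can be sketched as follows, using the reported temperature of 1.2 and top-p of 0.9. This is an illustrative implementation of standard temperature and nucleus sampling, not the authors' code; the function name and the assumption of a 1-D logits tensor are mine.

```python
# Sketch of one step of temperature + top-p (nucleus) sampling.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.2,
                      top_p: float = 0.9) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose cumulative probability >= top_p
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(top_p)).item()) + 1
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1).item()
    return int(sorted_ids[choice].item())

# Usage: repeatedly call sample_next_token on the model's logits, appending the
# chosen token to the prompt, until the 512-token maximum length is reached.
```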

Synthetic NLP model development

We applied GatorTronGPT to generate different sizes of synthetic clinical text including 1 billion, 5 billion, 10 billion, and 20 billion words of clinical text and developed corresponding synthetic NLP models, denoted as GatorTronS. Following our previous study 15 , we trained GatorTronS using the same architecture of GatorTron – a BERT architecture with 345 million parameters.

Comparison with existing transformer models

We compared GatorTronS models with ClinicalBERT 55 —an existing clinical transformer model and GatorTron 15 , the current largest clinical transformer model trained using >90 billion words of text, using 5 clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference, and question answering.

Turing test of text generation for healthcare settings

We randomly sampled 30 narrative sections from real-world UF Health clinical notes, including “past medical history”, “history of present illness”, “assessment/plan”, and “chief complaint”. For each of the 30 sections, we extracted the beginning 15 tokens as a seed for GatorTronGPT to generate a synthetic paragraph of up to 512 tokens. We cut the 30 real-world clinical sections to 512 tokens, removed all formatting information, and randomly mixed them with the 30 synthetic sections written by GatorTronGPT. Two UF Health physicians (NSO, MMA) manually reviewed the 60 paragraphs of notes to evaluate (1) linguistic readability on a 1 (worst) to 9 (best) scale, (2) clinical relevance and consistency on a 1 to 9 scale, and (3) whether each paragraph was written by a human physician or by GatorTronGPT. Percent agreement and Gwet’s AC1 were calculated to evaluate interrater reliability 56.
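As a hedged sketch of the agreement statistics mentioned above, the functions below compute percent agreement and Gwet's AC1 for the two-rater, two-category case ("Human" vs "AI"). The formula assumed here is the standard two-rater binary form, AC1 = (Pa - Pe)/(1 - Pe) with Pe = 2*pi*(1 - pi), where pi is the average of the two raters' marginal proportions; the example labels are invented, not the study's data.

```python
# Sketch: percent agreement and Gwet's AC1 for two raters, two categories.
def percent_agreement(rater1, rater2):
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

def gwet_ac1(rater1, rater2, positive="AI"):
    pa = percent_agreement(rater1, rater2)
    pi = (sum(r == positive for r in rater1) / len(rater1)
          + sum(r == positive for r in rater2) / len(rater2)) / 2
    pe = 2 * pi * (1 - pi)          # chance agreement under Gwet's model
    return (pa - pe) / (1 - pe)

# Toy usage (hypothetical labels):
r1 = ["AI", "Human", "AI", "Human", "Human", "AI"]
r2 = ["AI", "Human", "Human", "Human", "Human", "AI"]
print(percent_agreement(r1, r2), gwet_ac1(r1, r2))
```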

Data availability

The benchmark datasets that support the findings of this study are available from the official websites of natural language processing challenges with Data Use Agreements. More specifically: 1. i2b2 2010, 2012 datasets and n2c2 2018, 2019 datasets: https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ . 2. MedNLI dataset: https://physionet.org/content/mednli/1.0.0/ . 3. emrQA dataset: https://github.com/panushri25/emrQA#download-dataset . 4. The Pile dataset: https://pile.eleuther.ai/ . 5. UF Health IDR clinical notes are not open to the public due to patient privacy information. The GatorTronS, and GatorTron models are available as open-source resources. The synthetic clinical transformer model, GatorTronS, is available from: https://huggingface.co/UFNLP/gatortronS . The GatorTron model trained using real-world clinical text is available: https://huggingface.co/UFNLP/gatortron-base .

Code availability

The computer codes to train GatorTronGPT models are available from: https://github.com/NVIDIA/Megatron-LM/blob/main/pretrain_gpt.py . The scripts used for data preprocessing, vocabulary training and other utilities are available from: https://github.com/uf-hobi-informatics-lab/GatorTronGPT . The computer codes to train GatorTronS models are available from: https://github.com/NVIDIA/Megatron-LM and https://github.com/NVIDIA/NeMo . The computer codes for preprocessing of text data are available from: https://github.com/uf-hobi-informatics-lab/NLPreprocessing .

Introducing ChatGPT. https://openai.com/blog/chatgpt .

Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388 , 1233–1239 (2023).


Patel, S. B. & Lam, K. ChatGPT: the future of discharge summaries? Lancet Digit Health 5 , e107–e108 (2023).


Ali, S. R., Dobbs, T. D., Hutchings, H. A. & Whitaker, I. S. Using ChatGPT to write patient clinic letters. Lancet Digit Health 5 , e179–e181 (2023).

Hirosawa, T. et al. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. Int. J. Environ. Res. Public Health 20 , 3378 (2023).

Grünebaum, A., Chervenak, J., Pollet, S. L., Katz, A. & Chervenak, F. A. The Exciting Potential for ChatGPT in Obstetrics and Gynecology. Am. J. Obstet. Gynecol . https://doi.org/10.1016/j.ajog.2023.03.009 (2023).

Cascella, M., Montomoli, J., Bellini, V. & Bignami, E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J. Med. Syst. 47 , 33 (2023).


Azamfirei, R., Kudchadkar, S. R. & Fackler, J. Large language models and the perils of their hallucinations. Crit. Care 27 , 120 (2023).

Straw, I. & Callison-Burch, C. Artificial Intelligence in mental health and the biases of language based models. PLoS One 15 , e0240376 (2020).


Li, H. et al. Ethics of large language models in medicine and medical research. Lancet Digital Health https://doi.org/10.1016/S2589-7500(23)00083-3 (2023).

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 35, 22199–22213 (2022).

Bommasani, R. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).

Brown, T., Mann, B. & Ryder, N. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33 , 1877–1901 (2020).


Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55 , 1–35 (2023).


Yang, X. et al. A large language model for electronic health records. NPJ Digit. Med. 5 , 194 (2022).

Gao, L. et al. The Pile: an 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 (2020).

Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30 , 681–694 (2020).


Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform . 23 , bbac409 (2022).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (Association for Computational Linguistics, 2019). https://doi.org/10.18653/v1/N19-1423 .

Mohammed, M., Khan, M. B. & Bashier, E. B. M. Machine Learning (CRC Press, 2016). https://doi.org/10.1201/9781315371658 .

Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. 6 , 135 (2023).

Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 160035 (2016).

Searle, T., Ibrahim, Z., Teo, J. & Dobson, R. Estimating redundancy in clinical text. J. Biomed. Inform. 124 , 103938 (2021).

Li, J. et al. Are synthetic clinical notes useful for real natural language processing tasks: a case study on clinical entity recognition. J. Am. Med. Inform. Assoc. 28 , 2193–2201 (2021).

Huguet Cabot, P.-L. & Navigli, R. REBEL: relation extraction by end-to-end language generation. in Findings of the Association for Computational Linguistics: EMNLP 2021 2370–2381 (Association for Computational Linguistics, 2021). https://doi.org/10.18653/v1/2021.findings-emnlp.204 .

Peng, C. et al. Clinical concept and relation extraction using prompt-based machine reading comprehension. J. Am. Med. Inform. Assoc . https://doi.org/10.1093/jamia/ocad107 (2023).

Gaffney, A. et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern. Med. 182 , 564–566 (2022).

Downing, N. L., Bates, D. W. & Longhurst, C. A. Physician burnout in the electronic health record era: are we ignoring the real cause? Ann. Intern. Med. 169 , 50 (2018).

Kroth, P. J. et al. Association of electronic health record design and use factors with clinician stress and burnout. JAMA Netw. Open 2 , e199609 (2019).

Diaz, N. Epic to use Microsoft’s GPT-4 in EHRs. https://www.beckershospitalreview.com/ehrs/epic-to-use-microsofts-open-ai-in-ehrs.html .

Trang, B. ‘We’re getting much more aggressive’: Microsoft’s Nuance adds GPT-4 AI to its medical note-taking tool. https://www.statnews.com/2023/03/20/microsoft-nuance-gpt4-dax-chatgpt/ .

Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619 , 357–362 (2023).

Kleesiek, J., Wu, Y., Stiglic, G., Egger, J. & Bian, J. An opinion on ChatGPT in health care-written by humans only. J. Nucl. Med . https://doi.org/10.2967/jnumed.123.265687 (2023).

Ouyang, L. et al. Training language models to follow instructions with human feedback. arXiv [cs.CL] (2022).

Ray, S. Samsung bans ChatGPT among employees after sensitive code leak. Forbes Magazine (2023).

Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356 , 183–186 (2017).

Center for Devices & Radiological Health. Artificial Intelligence and Machine Learning in Software as a Medical Device. U.S. Food and Drug Administration https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device .

Yang, X. et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med. Inform. Decis. Mak. 19 , 232 (2019).

Levine, Y., Wies, N., Sharir, O., Bata, H. & Shashua, A. The depth-to-width interplay in self-attention. arXiv [cs.LG] (2020).

Shoeybi, M. et al. Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv [cs.CL] (2019).

Li, X. L. & Liang, P. Prefix-tuning: optimizing continuous prompts for generation. in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) 4582–4597 (Association for Computational Linguistics, 2021). https://doi.org/10.18653/v1/2021.acl-long.353 .

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys. 59 , 1–35 (2023).

Radford A., Wu J., Child R., Luan D. & Amodei D. Language models are unsupervised multitask learners. OpenAI, 1 , (2019)

Herrero-Zazo, M., Segura-Bedmar, I., Martínez, P. & Declerck, T. The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. J. Biomed. Inform. 46, 914–920 (2013).

Li, J. et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxf.) 2016 , baw068 (2016).

Hou, Y. et al. Discovering drug–target interaction knowledge from biomedical literature. Bioinformatics 38 , 5100–5107 (2022).

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Association for Computational Linguistics, 2019). https://doi.org/10.18653/v1/d19-1259 .

Singhal, K. et al. Large language models encode clinical knowledge. arXiv [cs.CL] (2022).

Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).

NeMo: a toolkit for conversational AI (NVIDIA GitHub).

Holtzman A., Buys J., Forbes M. & Choi Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 (2019).

Clark, E., Ji, Y. & Smith, N. A. Neural text generation in stories using entity representations as context. in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) 2250–2260 (Association for Computational Linguistics, 2018). https://doi.org/10.18653/v1/N18-1204 .

Celikyilmaz, A., Clark, E. & Gao, J. Evaluation of text generation: a survey. arXiv preprint arXiv:2006.14799 (2020).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 (2019).

Huang, K., Altosaar, J. & Ranganath, R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019).

Wongpakaran, N., Wongpakaran, T., Wedding, D. & Gwet, K. L. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med. Res. Methodol. 13 , 61 (2013).


Acknowledgements

This study was partially supported by a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-2018C3-14754), a grant from the National Cancer Institute, 1R01CA246418, grants from the National Institute on Aging, NIA R56AG069880 and 1R01AG080624, and the Cancer Informatics and eHealth core jointly supported by the UF Health Cancer Center and the UF Clinical and Translational Science Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding institutions. We would like to thank the UF Research Computing team, led by Dr. Erik Deumens, for providing computing power through UF HiperGator-AI cluster.

Author information

Authors and affiliations

Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA

Cheng Peng, Xi Yang, Aokun Chen, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian & Yonghui Wu

Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA

Xi Yang, Aokun Chen, Yi Guo, Jiang Bian & Yonghui Wu

NVIDIA, Santa Clara, CA, USA

Kaleb E. Smith, Nima PourNejatian, Anthony B. Costa, Cheryl Martin & Mona G. Flores

Research Computing, University of Florida, Gainesville, FL, USA

Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA

Tanja Magoc & Gloria Lipori

Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA

Gloria Lipori & Duane A. Mitchell

Division of Endocrinology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA

Naykky S. Ospina

Division of Cardiovascular Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA

Mustafa M. Ahmed


Contributions

Y.W., J.B., X.Y., N.P., A.B.C., and M.G.F. were responsible for the overall design, development, and evaluation of this study. X.Y., C.P., A.C., and K.E.S. had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Y.G. and Y.W. designed the Turing evaluation of synthetic clinical text generated by GatorTronGPT. N.S.O. and M.M.A. are the two human physicians who performed the Turing test. Y.W., X.Y., K.E.S., C.P., Y.G., and J.B. did the bulk of the writing; W.H., E.A.S., D.A.M., T.M., C.A.H., A.B.C., and G.L. also contributed to the writing and editing of this manuscript. All authors reviewed the manuscript critically for scientific content, and all authors gave final approval of the manuscript for publication.

Corresponding author

Correspondence to Yonghui Wu .

Ethics declarations

Competing interests.

K.E.S., N.P.N., A.B.C., C.M., and M.G.F. are employed by NVIDIA. There are no other competing financial or non-financial interests. The work presented in this study was conducted exclusively within the University of Florida Health.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xi Yang finished this work when he was a full-time employee at the University of Florida.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Peng, C., Yang, X., Chen, A. et al. A study of generative large language model for medical research and healthcare. npj Digit. Med. 6 , 210 (2023). https://doi.org/10.1038/s41746-023-00958-w


Received : 05 June 2023

Accepted : 01 November 2023

Published : 16 November 2023

DOI : https://doi.org/10.1038/s41746-023-00958-w



LL.M. in the United States (USA)

For many international law applicants, doing an LL.M. in the United States is the ultimate goal. The country has some of the biggest and best law schools in the world, and many of them offer LL.M. programs.

Those pursuing an LL.M. in the United States do so for a number of reasons. Some law professionals use the degree to learn more about American law, which can be valuable not only in the US, but around the world as well—especially for those working in firms that do business with US companies.


But beyond US law, LL.M.s in the USA might also cover various other topics. Indeed, when studying in the US, students can opt to do specialized master's programs in a number of different fields, including Business Law, Tax Law, Alternative Dispute Resolution, Human Rights Law, International Law, and many more.

In terms of location, students who want to pursue an LL.M. in the USA have a huge range of options. Many students opt to study in large, well-known cities, such as New York, Chicago, and San Francisco. But of course, the United States is a large country, with law schools in all corners, in locations such as Los Angeles, Seattle, Miami, and Buffalo, and countless cities in between.

Graduates from LL.M. programs in the United States also have a wide variety of career options. Some international LL.M. graduates opt to stay in the USA, many taking advantage of the country's Post-Completion Optional Practical Training (OPT) program. Others decide to return to their home countries to leverage their new skills.

See below for a list of all LL.M. programs in the USA.

  • 100 most popular schools in USA

Most popular states: California , District of Columbia , Florida , Massachusetts , New York , Texas

All LL.M. Programs in the USA

Showing 1–15 of 176 results.

Columbia

Full-Time: Master of Laws (LL.M.), Executive Master of Laws (LL.M.) in Global Business Law more…

Part-Time: Executive Master of Laws (LL.M.) in Global Business Law more…

By Research: J.S.D. more…

Dual Degree: JD / LL.M. (Frankfurt), JD / LL.M. (London), JD / Master in French Law, JD / Master in Economic Law or LL.M. in Transnational Arbitration an... more…

Georgetown

Full-Time: General Studies LL.M., Environmental and Energy Law LL.M., National and Global Health Law LL.M., International Business & Economic Law LL.M.... more…

Distance Learning: Executive LL.M. Securities and Financial Regulation, Online Taxation LL.M., Master of Studies in Law (M.S.L.) in Taxation more…

Dual Degree: Global Health Law & Governance LL.M., LL.M. Dual Degree, J.D. / LL.M. more…

Loyola Los Angeles

Full-Time: American Law LL.M., LL.M. in Civil Litigation and Advocacy, LL.M. in Criminal Justice, LL.M. in Cybersecurity & Data Privacy, LL.M. in Enter... more…

Distance Learning: Online Master of Laws (Tax LLM) more…

Penn Carey Law

Full-Time: LL.M., LLCM (Masters in Comparative Law), Master in Law (ML) more…

Part-Time: Artificial Intelligence, Industry, and the Law Certificate Program, Regulatory Analysis and Decision-Making Certificate Program, U.S. Corpor... more…

By Research: SJD Program more…

Dual Degree: LL.M. / MFS (Master's in International Finance and Law), JD / LL.M. more…

Berkeley Law

Full-Time: LL.M. Traditional Track, LL.M. Executive Track (Two Summer), LL.M. Executive Track (Remote + Summer), J.S.D. more…

Distance Learning: LL.M. Executive Track (Remote + Summer) more…

UCLA

Full-Time: General LL.M., Doctor of Juridical Science (S.J.D.), Master of Legal Studies (M.L.S.) more…

USC Gould

Full-Time: Master of Laws (LL.M.), Two-Year Extended LL.M., Master of Laws in Alternative Dispute Resolution (LL.M. in ADR), Master of Laws (LL.M.) in... more…

Distance Learning: Master of Laws (LL.M.), Online Master of Studies in Law (MSL), Online Certificates in Business Law, Compliance, Entertainment Law & Industry... more…

Fordham Law

Full-Time: LL.M. in Banking, Corporate, and Finance Law, LL.M. in Corporate Compliance, LL.M. in Fashion Law, LL.M. in Intellectual Property and Inform... more…

Distance Learning: Online LL.M. in U.S. Law more…

By Research: Doctor of Juridical Science (S.J.D.) more…

Yeshiva - Cardozo School of Law

Full-Time: General Studies LL.M., Comparative Legal Thought LL.M., Dispute Resolution and Advocacy LL.M., Intellectual Property Law LL.M. more…

Houston Law Center (UHLC)

Full-Time: LL.M. in U.S. Law, Energy, Environment, and Natural Resources Law LL.M., Health Law LL.M., Intellectual Property & Information Law LL.M., In... more…

UChicago Law

Full-Time: Master of Laws (LL.M.), JSD, Master of Legal Studies (MLS) more…

Miami Law

Full-Time: LL.M. in Entertainment, Arts and Sports Law, LL.M. in Estate Planning, White & Case International Arbitration LL.M., LL.M. in International... more…

Distance Learning: LL.M. in Real Estate / Property Development, LL.M. in Taxation of Cross-Border Investment more…

Texas Law

Full-Time: Master of Laws (LL.M.) more…

BU Law

Full-Time: LL.M. in American Law (for non-US lawyers), LL.M. in Banking and Financial Law, LL.M. in Intellectual Property Law, LL.M. in Taxation, Maste... more…

Arizona (UA Law)

Full-Time: International Trade and Business Law LL.M., Indigenous Peoples Law and Policy LL.M., General LL.M., Master of Legal Studies (MLS) more…

Distance Learning: Master of Legal Studies (MLS) Online more…

By Research: Indigenous Peoples Law and Policy SJD, International Trade and Business Law SJD more…


LLM Research

Committed to your wellbeing

 LLM Research has conducted a variety of clinical studies in adult and pediatric patients and has contributed to the development of new therapies for healthy participants and patients struggling with diseases in multiple therapeutic areas including but not limited to pulmonology, gastroenterology, gynecology, oncology, hematology, hepatology, endocrinology, dermatology, and psychiatry. LLM Research conducts Phase I, II, III, and IV clinical trials for both large and small pharmaceutical companies and Contract Research Organizations. Our research center has earned a reputation for excellence. Our physician investigators are Board Certified in Internal Medicine, Pediatrics, Dermatology, Gynecology, and Psychiatry.


Community Partnerships

At LLM we are serious about bringing options home to you and your family! LLM has partnerships with local home health organizations capable of performing research activities in your own home, no matter where that is. Let nothing stop you from getting the treatment you and your loved ones deserve.


Top Enrolling Research Site

In 2021, LLM was recognized as a top-enrolling research site for a COVID-19 vaccine trial. Not only do we plan on continuing to fight COVID-19; we at LLM Research also look forward to bringing other vaccine options to our community.

Do you have a family member or friend who needs treatment options? Let us know; we have them covered too!



August 13, 2024


Large language models pose a risk to society and need tighter regulation, say researchers

by University of Oxford


Leading experts in regulation and ethics at the Oxford Internet Institute have identified a new type of harm created by LLMs, which they believe poses long-term risks to democratic societies and needs to be addressed by creating a new legal duty for LLM providers.

In their paper "Do large language models have a legal duty to tell the truth?", published in Royal Society Open Science, the Oxford researchers set out how LLMs produce responses that are plausible, helpful and confident but contain factual inaccuracies, misleading references and biased information. They term this problematic phenomenon 'careless speech,' which they believe causes long-term harm to science, education and society.

Lead author Professor Sandra Wachter, Professor of Technology and Regulation at the Oxford Internet Institute, says, "LLMs pose a unique risk to science, education, democracy, and society that current legal frameworks did not anticipate. This is what we call 'careless speech,' or speech that lacks appropriate care for truth.

"Spreading careless speech causes subtle, immaterial harms that are difficult to measure over time. It leads to the erosion of truth, knowledge and shared history and can have serious consequences for evidence-based policy-making in areas where details and truth matter such as health care, finance, climate change, media, the legal profession, and education.

"In our new paper, we aim to address this gap by analyzing the feasibility of creating a new legal duty requiring LLM providers to create AI models that, put simply, will 'tell the truth."'

This phenomenon of 'careless speech' is further complicated by human feedback that often favors outputs aligned with users' personal biases, and by annotations that train models to generate 'assertive-sounding outputs,' among other factors unrelated to advancing truthful outputs.

Associate Professor and Research Associate Dr. Chris Russell, Oxford Internet Institute said, "While LLMs are built so that using them feels like a conversation with an honest and accurate assistant, the similarity is only skin deep, and these models are not designed to give truthful or reliable answers. The apparent truthfulness of outputs is a 'happy statistical accident' that cannot be relied on."

To better understand the legal restrictions faced when using LLMs, the researchers carried out a comprehensive analysis, assessing the existence of truth-telling obligations in current legal frameworks such as the Artificial Intelligence Act, the Digital Services Act, the Product Liability Directive and the Artificial Intelligence Liability Directive.

They find that current legal obligations tend to be limited to specific sectors, professions or state institutions and rarely apply to the private sector.

Commenting on the findings, Director of Research, Associate Professor Brent Mittelstadt said, "Existing regulations provide weak regulatory mechanisms to mitigate careless speech and will only be applicable to LLM providers in a very limited range of cases.

"Nevertheless, in their attempts to eliminate 'hallucinations' in LLMs, companies are placing significant guardrails and limitation on these models. This creates a substantial risk of further centralizing power in a few large tech companies to decide which topics are appropriate to discuss or off limits, which information sources are reliable, and ultimately what is true."

The Oxford academics argue that LLM providers should better align their models with truth through open, democratic processes. They propose the creation of a legal duty for LLM providers to create models that prioritize the truthfulness of outputs above other factors like persuasiveness, helpfulness or profitability.

Among other things, this would mean being open about the training data they use and the limitations of their models, explaining how they fine-tune models through practices such as reinforcement learning from human feedback or prompt constraints, and building in fact checking and confidence scoring functions into outputs.

Professor Wachter concludes, "Current governance incentives focus on reducing the liability of developers and operators and on maximizing profit, rather than making the technology more truthful. Our proposed approach aims to minimize the risk of careless speech and long-term adverse societal impact while redirecting development towards public governance of truth in LLMs."

Journal information: Royal Society Open Science

Provided by University of Oxford


Research Masters Degree (LLM)


  • Programme name: Research Masters Degree (LLM)
  • Programme code: 6CB N01 – 6CB N11
  • Campuses: Mahikeng and Potchefstroom
  • Delivery mode
  • Programme leader: Dr Nelson Kekana

  • Introduction

The degree is awarded on the strength of a dissertation of approximately 40,000 words, which is examined by an internal and an external examiner. The topic must fall within the law, justice, and sustainability focus, and the faculty must have sufficient expertise to provide effective study supervision.

  • Duration (minimum and maximum duration)

For full-time students, the study period is at least one year and at most three years. For part-time students, the minimum study duration is one year and the maximum is four years. If a student has not completed the study within the maximum allowed duration, the student's studies may be terminated. Admission requirements: students must have met all the requirements for the LLB degree set by this University or any other South African university. If a previous qualification was obtained in a foreign country, an evaluation certificate issued by the South African Qualifications Authority (SAQA) must be submitted. An average of 60% for the final year of the LLB degree (or a similarly recognized four-year degree) and a sub-minimum of 65% for the research project (where applicable) are required. An applicant must furnish a four-page concept proposal (link) with the application form as proof of his/her research skills.

  • Allocation of supervisors or promoters

Students applying for a research master's must consult with possible supervisors during the application process. The Faculty Board may, in exceptional circumstances, approve the appointment of a co-supervisor or assistant supervisor based on relevant expertise. The supervision agreement form (link) must be completed and signed by you and the agreed supervisor and submitted with your application.

  • Faculty-specific requirement for a Research Masters Degree

a) If there is not sufficient capacity with regards to supervision for a programme in an academic year, the Director: Postgraduate Programmes may decide not to offer the programme in question in that year.

b) Research Masters degree students must (in consultation with their supervisor) submit the research proposal for a dissertation within six months after the final date of registration (and no later than 31 October) in their first year of registration.

c) Students work under a supervisor approved by the Director: Postgraduate Programmes and the Faculty Board.

d) A student is required to complete a research discussion within six months after the approval of the research proposal. The research discussion should cover a major and two ancillary subjects prescribed in consultation with the Director: Postgraduate Programmes for the specific study, before the student is permitted to write the research dissertation. The evaluation of the student takes place before an appointed panel generally consisting of the Director: Postgraduate Programmes, the Director: Research Unit (ex officio), a research professor and one internal member with expertise in the field of study, as well as one external member with expertise outside the University. The appointment of the research discussion panel and the assessment procedure are conducted in accordance with the procedure approved by the Faculty Board.

e) Students are required to attend compulsory seminars of the Research Methodology programme arranged during the academic year. Permission for absence is granted only by the programme leader on good grounds.

  • Examination

a) The suggested guideline for the length of a dissertation is 40,000 words (including content and footnotes, excluding the bibliography). Any substantial digression from this guideline is subject to the prior approval of the Director: Postgraduate Programmes before submission of the dissertation for examination. The Director: Postgraduate Programmes will determine whether the length of the dissertation is appropriate in the particular case. Students must comply with the prescribed faculty reference style.

b) Students must comply with the requirements of General Academic Rule 4.10.

c) The Turnitin or similar report which is generated must be submitted with the dissertation.

d) The dissertation must be language edited, and a certificate issued by a competent language editor must be attached to the thesis.

e) The research dissertation is assessed according to Academic Rule 4.11 by at least two examiners, of whom at least one must be an external examiner who is not attached to the University. The final mark of the research dissertation is the average of the examiners' marks. If there is any ambiguity in an examiner's report, or if there is a material difference (the marks awarded by the examiners differ by more than 15%) in the final result recommended by the examiners, the procedure approved by the Faculty Board will determine the final result of the student. The general provisions relating to the assessment of the dissertation and the guidelines to examiners and/or arbitrators are followed in accordance with faculty guidelines.

f) A research dissertation may only be referred back to a candidate once and, after revision, be submitted once for re-examination within a period of one year. Refer to General Academic Rules 4.11.7.3 and 4.11.7.4.

g) A student's studies may be terminated if he/she fails to comply with the requirements laid down by the faculty, or exceeds the maximum duration of the study period as determined by the faculty and has received a letter of warning. Refer to General Academic Rule 1.18 regarding the termination of studies.

h) A student who is dissatisfied with any substantive aspect of the guidance provided by a supervisor can raise such matters in writing with the Director: Postgraduate Programmes. The matter will be dealt with in accordance with the procedure prescribed in the General Academic Rules and the Manual for Postgraduate Studies. The director must respond in writing to the student before a research dissertation is submitted for examination.

  • Qualification outcomes

On completion of this programme the student should be able to demonstrate:

a) A comprehensive and systematic knowledge base in a specific field of study and the ability to apply that knowledge.

b) A coherent and critical understanding of the methodology of the specific field of study, so as to rigorously critique and evaluate current research in the field and participate in scholarly debates and research relating to theory and practice.

c) An ability to use advanced information-retrieval and processing skills to identify, critically analyse and synthesise information relevant to complex and/or real-world problems, cases and issues in the specific field of study, where applicable debating solutions from theoretical and research perspectives published in current literature and presenting the information to specialist and non-specialist audiences using IT effectively.

d) The ability to critically evaluate and apply the ethics, values, rules, norms, and regulations pertaining to the specific field of study.

  • Curricula Master of Laws

6CB N01 – Criminal and Procedural Law (CPLM 871, MC/PC, 180 credits)
6CB N02 – Mercantile Law (MCLM 871, MC/PC, 180 credits)
6CB N03 – Public Law and Legal Philosophy (PPLM 871, MC/PC, 180 credits)
6CB N04 – Private and Customary Law (PVLM 871, MC/PC, 180 credits)
6CB N05 – International Aspects of Law (LVIA 871, MC/PC, 180 credits)
6CB N06 – Perspectives on Law (LVEP 871, MC/PC, 180 credits)
6CB N07 – Trade and Business Law (LVTB 871, MC/PC, 180 credits)
6CB N08 – Private Law (LVPR 871, MC/PC, 180 credits)
6CB N09 – Constitutional Law (LVCL 871, MC/PC, 180 credits)
6CB N10 – Formal Law (LVFL 871, MC/PC, 180 credits)
6CB N11 – Legal Profession (LVLP 871, MC/PC, 180 credits)

A tutorial on open-source large language models for behavioral science

  • Original Manuscript
  • Open access
  • Published: 15 August 2024


  • Zak Hussain 1 , 2 ,
  • Marcel Binz 3 , 4 ,
  • Rui Mata 1 &
  • Dirk U. Wulff 1 , 2  


Large language models (LLMs) have the potential to revolutionize behavioral science by accelerating and improving the research cycle, from conceptualization to data analysis. Unlike closed-source solutions, open-source frameworks for LLMs can enable transparency, reproducibility, and adherence to data protection standards, which gives them a crucial advantage for use in behavioral science. To help researchers harness the promise of LLMs, this tutorial offers a primer on the open-source Hugging Face ecosystem and demonstrates several applications that advance conceptual and empirical work in behavioral science, including feature extraction, fine-tuning of models for prediction, and generation of behavioral responses. Executable code is made available at github.com/Zak-Hussain/LLM4BeSci.git . Finally, the tutorial discusses challenges faced by research with (open-source) LLMs related to interpretability and safety and offers a perspective on future research at the intersection of language modeling and behavioral science.


Introduction

Large language models (LLMs) – machine learning systems trained on vast amounts of text and other inputs – are increasingly being used in science (Van Noorden & Perkel, 2023 ), and significantly advancing the capacity to analyze and generate meaningful linguistic information. These models are poised to change the scientific workflow in numerous ways and are already used across all aspects of the research cycle, from conceptualization to data analysis. For example, in psychology (Demszky et al., 2023 ) and related disciplines (Korinek, 2023 ), LLMs are being used to automate research processes, predict human judgments, and run in-silico behavioral experiments.

Figure 1. Overview of the LLM processing pipeline: the sequence of operations performed on an input prompt, including how it is tokenized and then processed by the transformer architecture to produce a set of output probabilities.

Scientific applications of LLMs require high levels of transparency and reproducibility (Bockting et al., 2023 ). In addition, many applications in behavioral science involve sensitive information (e.g., personal or health data) or target vulnerable populations (e.g., children) and thus require specific data protection protocols. Open-source frameworks that provide full transparency and respect privacy requirements are therefore indispensable for applications of LLMs in behavioral science.

We aim to help advance the responsible use of LLMs in behavioral science by providing a comprehensive tutorial on applications using an open-source framework that maximizes transparency, reproducibility, and data privacy. Specifically, we provide a primer on the Hugging Face ecosystem, covering several applications of LLMs, including conceptual clarification, prediction of behavioral outcomes, and generation of human-like responses. Our target audience consists of behavioral researchers with a basic knowledge of programming principles who are interested in adding LLMs to their workflows. We hope that this resource will help researchers in psychology and related disciplines to adopt LLMs for a wide range of tasks, whilst also maintaining an appreciation of the subtle complexities of drawing scientific conclusions from such flexible and opaque models.

In what follows, we first provide a short primer on transformer-based language models. Second, we consider applications of LLMs in behavioral science and introduce the Hugging Face ecosystem and associated Python libraries. Third, we present three areas of application – feature extraction, fine-tuning, and text generation – and present several use cases in behavioral research. Finally, we address some advantages and limitations of current open-source approaches and consider future directions at the intersection of LLMs and behavioral research.

A primer on transformer-based language models

Today’s LLMs are based on the transformer architecture (Vaswani et al., 2017 ), which is a family of neural network models that draw on a common set of building blocks. In this section, we first introduce these building blocks in sequence (see Fig. 1 ) before discussing important architectural variants and ways of applying LLMs in behavioral science.

When text input is fed into an LLM, it is first broken up into digestible pieces known as tokens. This decomposition is carried out by the LLM’s tokenizer, which is a model in its own right. For instance, the (uncased) WordPiece tokenizer (Wu et al., 2016 ) underlying the popular class of BERT models (Bidirectional Encoder Representations from Transformers; Devlin et al., 2018 ) breaks up the sentence "Open-source LLMs rock." into "[CLS]", "open", "-", "source", "ll", "##ms", "rock", ".", and "[SEP]". This example illustrates that tokens are often words, but not always. They can also be punctuation ("-" and "."), subwords ("ll" and "##ms"), and special tokens ("[CLS]" – shorthand for "classification" – and "[SEP]" – for "separator"). Tokenizers can differ between models and include lower- and upper-case tokens.

Figure 2. Transformer attention block: the input embeddings are passed to multiple attention heads, each performing a series of operations to generate queries, keys, and values, leading ultimately to contextualized embeddings.

There are several principles behind tokenization. First, including punctuation helps the LLM represent the logical structure of sentences and produce text that contains sentence clauses or multiple sentences. Second, the use of subwords significantly reduces the number of tokens the LLM must learn by assigning stand-alone tokens only to frequent words and constructing all other words using subword tokens. Note that subword tokens that do not start a word begin with "##" to signify that they follow the previous token without a space. Whether a word is assigned a stand-alone token or decomposed into subwords depends on the specific algorithm and the text corpus used by the algorithm to identify the most effective set of tokens. Which words the tokenizer assigns stand-alone tokens to has been found to have important implications for what an LLM learns (Ali et al., 2023 ). Third, placing special tokens at the beginning ("[CLS]") and end ("[SEP]") of texts enables the LLM to predict the first word and the end of a text and to learn numerical representations that can stand in as a representation of the entire text.
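Using the Hugging Face transformers library introduced below, the tokenization of the example sentence can be reproduced in a few lines. This is a minimal sketch assuming the bert-base-uncased checkpoint, whose uncased WordPiece tokenizer behaves as described above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # uncased WordPiece tokenizer

# Encode the sentence; the special [CLS] and [SEP] tokens are added automatically
ids = tokenizer("Open-source LLMs rock.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'open', '-', 'source', 'll', '##ms', 'rock', '.', '[SEP]']
```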

Before we continue, we should note that consistent with much of the literature on large language models (e.g., Devlin et al., 2018 ), our use of the term "token" does not follow the distinction between types and tokens in philosophy and linguistics, where tokens refer to specific instances or occurrences of types (Wetzel, 2018 ). Thus, here and in the literature on LLMs, the term "token" refers to both types and tokens.

Input embeddings

After tokenization, each token is assigned a numeric vector. Prior to training, this simply consists of random numbers. In the case of certain BERT and GPT-2 models (i.e., DistilBERT, BERT-Base, and GPT-2 Small), these vectors consist of 768 numbers (dimensions). The embedding vectors mark the starting point of the neural network and are learned during training, where their values are adjusted iteratively to assist the model in achieving its objective. After training, the input embeddings reflect, in an abstract fashion, a context-general meaning of each token relative to the other tokens.

In contrast to other approaches to semantic representation (Li, 2022 ; Siew et al., 2019 ), such as those treating words or tokens as nodes in a network, vector-based embeddings have two key advantages. First, they permit LLMs to be more efficient: The number of embedding dimensions is typically at least an order of magnitude smaller than the number of tokens, resulting in substantially fewer parameters being needed to represent the relationships between tokens. For instance, the WordPiece tokenizer used by BERT encompasses about 30,000 tokens, which is roughly 40 times more than the number of embedding dimensions (i.e., 768). Second, they assist the model in performing generalization: The embedding vectors encode the relationships between tokens to one another, such that the model shows similar behavior for similar tokens.

Input embeddings do not reflect the location of tokens within a given input or context. To capture order, most LLMs, such as GPT-2 and successors, include another component called positional encoding. This is a second token embedding that represents the relative position of tokens using a combination of sine and cosine functions. The positional encoding is typically added to the input embedding to form embeddings that reflect both the context-general meaning of a token and its position in the input to the model.
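To make this concrete, the following sketch looks up the (context-general) input embeddings for the example sentence from a pre-trained DistilBERT model. It is illustrative only; the 768-dimensional size matches the models mentioned above, and the choice of checkpoint is ours.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

ids = tokenizer("Open-source LLMs rock.", return_tensors="pt")["input_ids"]
embedding_layer = model.get_input_embeddings()     # lookup table: ~30,000 tokens x 768 dimensions
with torch.no_grad():
    input_embeddings = embedding_layer(ids)        # shape: (1, n_tokens, 768)
print(input_embeddings.shape)
```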

Attention blocks

The attention block is the central building block of transformer models and is what distinguishes them from other neural network-based language model architectures, such as Word2Vec (e.g., Mikolov et al., 2013 ) and recurrent neural network-based language models (Graves, 2012 ). The purpose of the attention block is to produce embeddings that represent the in-context meaning of tokens. For example, consider again the sentence "Open-source LLMs rock." After input embedding and positional encoding, each token in the sentence is represented using a context-general embedding vector. These context-general embeddings reflect the meaning of tokens broadly, not considering the specific context in which they occur. However, the meaning of tokens can vary considerably across contexts: consider, for example, the polysemous "rock." The transformer architecture uses the attention mechanism to capture these context-specific meanings.

The components of the attention block are illustrated in Fig. 2 . It begins with the token embeddings, which, in the case of the first attention block, are the sum of the input embeddings and the positional encoding, normalized to have a zero mean and unit variance. Entering the attention mechanism, these embeddings are transformed by a linear layer into three new, lower-dimensional embeddings called queries , keys , and values . This transformation can be likened to a method – principal component analysis – known to behavioral researchers in that it produces output variables that are lower dimensional linear combinations of the input variables. The names of these three smaller embeddings suggest that they can be likened to a retrieval, where a query is compared to keys to identify a value matching the query. Although this analogy should not be taken too literally, it does reflect how the queries , keys , and values collaborate to produce contextualized embeddings for each token. Specifically, the queries and keys combine to determine how to recombine the values to build contextualized embeddings.

Computationally, the attention mechanism works as follows. First, the dot product of each pair of queries and keys is computed, forming a square matrix of attention scores. The attention scores are then normalized by \(\sqrt{d_k}\) to account for the dimensionality \(d_k\) of the keys. These normalized attention scores are next fed row-wise into the softmax function \(e^{x_{i}}/\sum _j e^{x_{j}}\), where each \(x_i\) is a scalar attention score on key i, to produce attention weights \(w_{ij}\). Finally, the attention weights of each row, which now sum to 1, are used to produce a weighted sum \(\sum _j w_{ij} \, v_j\) of the values v of all tokens. These weighted sums are the new contextualized embeddings for each token. This attention mechanism can also be represented using the following matrix notation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q, K, and V are matrices respectively containing the queries, keys, and values, and T denotes the matrix transpose. This process of generating contextualized embeddings by mixing values as prescribed by the queries and keys does not per se produce useful contextualized embeddings. This is only achieved in training, which allows the model to figure out how to construct queries, keys, and values such that the contextualized embeddings help it achieve the model objective, such as predicting the next unit in a sequence (see Section "Model heads and training"). In other words, the model uses the attention mechanism to learn how context can help it solve a prediction problem.
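The following sketch implements the attention equation above for a single head using PyTorch. The toy dimensions (nine tokens, \(d_k = 64\)) are ours and purely illustrative.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in the matrix notation above."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # square matrix of normalized attention scores
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted sums of the values

# Toy example: nine tokens with 64-dimensional queries, keys, and values
Q, K, V = (torch.randn(9, 64) for _ in range(3))
contextualized = attention(Q, K, V)                 # shape: (9, 64)
```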

The next step in the attention block is to gather the contextualized embeddings from the multiple attention mechanisms that are executed in parallel across several attention heads. Running multiple attention mechanisms in parallel permits the model to produce different contextualized embeddings – based on different queries, keys, and values – that may focus on different kinds of relations between tokens. However, it should be noted that these relations are typically not human-interpretable (e.g., Vig, 2019; Vig & Belinkov, 2019). Figure 2 shows, for illustrative purposes, four attention heads; however, current models usually have significantly more. The contextualized embeddings from the four attention heads are concatenated such that the final embedding for each token is a combination of the embeddings produced by the different heads. This final contextual embedding has the same dimensionality as the input embeddings (e.g., 768).

The attention block further processes the contextualized embeddings in several steps. First, they are passed through a linear layer. They are then added to the initial embeddings that entered the attention heads. This addition is called a skip connection and is thought to help stabilize the model during training. After the skip connection, the embeddings are passed through a larger layer with a nonlinear activation function that plays two important roles. First, through its nonlinearity, it provides considerable flexibility in processing and recombining the contextualized token embeddings. Second, through its larger number of nodes, it provides a number of weights (parameters) that enhance the network’s memory capacity. In the case of BERT models, this nonlinear layer is four times larger than the token embeddings, implying an upscaling to 3072 dimensions and over two million weights connecting the two layers. After the large, nonlinear layer, the token embeddings are scaled back down to the standard embedding size using a linear layer, so as to match the required input size for the next model block, and passed through another skip connection, which adds back the embeddings from before the nonlinear layer, again to provide stability.
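The following sketch illustrates this post-attention processing for BERT-sized dimensions (assuming PyTorch; layer normalization and dropout are omitted for brevity, and the tensors are random stand-ins for the actual embeddings):

import torch
import torch.nn as nn

d_model, d_ff = 768, 3072                 # the feed-forward layer is four times larger

x = torch.randn(9, d_model)               # embeddings that entered the attention heads
attn_out = torch.randn(9, d_model)        # concatenated output of the attention heads

proj = nn.Linear(d_model, d_model)        # linear layer applied to the attention output
up = nn.Linear(d_model, d_ff)             # upscaling to the large nonlinear layer
down = nn.Linear(d_ff, d_model)           # scaling back down to the embedding size
act = nn.GELU()                           # nonlinear activation

h = x + proj(attn_out)                    # first skip connection
out = h + down(act(up(h)))                # nonlinear layer plus second skip connection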

Ultimately, the attention block produces embeddings for each token that are the same size as the initial embeddings but are now contextualized such that each token’s embedding is a complex recombination with the other tokens’ embeddings. Following the first attention block, most transformer models add additional attention blocks that take the contextualized embeddings from the previous block as input. As a result, transformer models produce several layers of abstraction where the final contextualized embeddings are combinations of recombinations (for more technical introductions to the attention mechanism, see Prince, 2023 ; Tunstall et al., 2022 ; Sanderson, 2019 ).

Fig. 3  Overview of pre-training modes and transformer families. The figure illustrates two pre-training modes (masked language modeling, causal language modeling) and the associated architecture families (encoder, decoder)

Model heads and training

The final component of the model is the model head. This produces the model output, which can be adjusted to numerous tasks. During pre-training – that is, the initial phase of training a language model on a large corpus by learning linguistic patterns and relationships within the data – the model head usually performs token classification. This means predicting one or more of the tokens in the model’s vocabulary based on the model input. There are two dominant approaches to pre-training LLMs through token classification: masked language modeling and causal language modeling . Of course, given the availability of high-quality, pre-trained LLMs, we believe most behavioral scientists will have little reason to train their own LLM from scratch. Furthermore, the computational resources and technical expertise required to do so can be prohibitive. However, discussing these two modes of training will both help to illustrate the role of a model head (see Fig. 3) and provide some necessary background for making informed decisions about which kind of pre-trained model to use for the task at hand (see Section “ Choosing between models ”).

In masked language modeling , a special "[MASK]" token is inserted into the token sequence in place of one or more randomly selected tokens. For instance, in our example sentence, replacing the word "rock" would result in the token sequence "[CLS]", "open", "-", "source", "ll", "##ms", "[MASK]", ".", "[SEP]" as model input. As with any other token, the model will produce an embedding for the "[MASK]" token that reflects its context. This contextual information can thus be leveraged by the model head to predict the masked token. To achieve this, the model head takes the "[MASK]" token’s final contextualized embedding, passes it first through a final hidden layer, which can be linear or nonlinear, and then into the final linear layer with a softmax output that has as many output nodes as there are tokens in the model’s vocabulary. As in the attention mechanism, the softmax function produces values that sum to 1, meaning they can be interpreted as probabilities. The model head uses these probabilities to predict which token is behind the "[MASK]" token. Before pre-training, these predictions will be no better than chance. However, during training, the model will incrementally adjust its parameters ("weights") in a direction that produces higher probabilities for tokens that are behind the "[MASK]" token and lower probabilities for those that are not. In our example sentence, this would mean learning to assign a higher probability for the target, "rock", and other tokens plausibly fitting the context, such as "impress" or "excel", and a lower probability for unfitting tokens, such as "cat", "taste", or ",".
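Pre-trained masked language models can be queried for such predictions directly. The following minimal sketch uses the Hugging Face pipeline API (introduced later in this tutorial) with DistilBERT; the example sentence mirrors the one used above, and the top-ranked fillers will depend on the model:

from transformers import pipeline

# Load a pre-trained BERT-style encoder for masked-token prediction
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

# The tokenizer's "[MASK]" token stands in for the word to be predicted
preds = fill_mask("Open-source LLMs [MASK].")
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))   # candidate token and its probability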

The second and now more popular way to pre-train LLMs is known as causal language modeling (or autoregressive modeling). In this mode of training, the model head also performs token classification. However, instead of predicting a masked token from its complete surrounding context, the model is trained to predict the next token in a sequence based only on the tokens that preceded it (i.e., it does not get access to future tokens). To perform this kind of training, causal language models use different tokenizers without a "[MASK]" token. Models trained using causal language modeling also implement a different type of attention mechanism that manually sets all attention to future tokens to zero. This means that each token’s contextualized embedding is composed only of its own value vector and the values of the tokens preceding it. To predict the next token, the model head selects the contextualized embedding of the last available token of the input (which is used analogously to the "[MASK]" token in masked language modeling). For example, to predict the token "rock" based on preceding tokens, the model head would select the contextualized embedding of the "##ms" token, including information about the preceding tokens and itself, and predict which of all possible tokens is likely to follow. After training, the model will assign high probabilities to suitable tokens, such as "enable", "offer", or the target token "rock", but not to unsuitable tokens. Importantly, these will not be the same tokens predicted in masked language modeling, due to the difference in the information accessible to the model. Specifically, not being able to look into the future, a model trained through causal language modeling would likely assign some probability to the token "," to mark the end of a sentence clause. This is an unlikely prediction for a model trained through masked language modeling, which would have access to the period token "." later in that sequence.
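A corresponding sketch for causal language modeling, here using GPT-2 as a small, openly available decoder model (the prompt and variable names are ours), extracts the probability distribution over the next token from the last position in the sequence:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Open-source LLMs", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # shape: (1, n_tokens, vocab_size)

# The distribution at the last position predicts the next token
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())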

Finally, it is important to note that token classification, as employed in masked and causal language modeling, is not the only pre-training objective, nor are masked and causal language modeling the only modes of pre-training. One other pre-training mode for transformer models is next-sentence prediction . In this mode, the model head is set up to predict a single numerical value between 0 and 1, reflecting the probability with which two input sentences occurred adjacently in the training data. Next-sentence prediction is used in the pre-training of some BERT-style models, typically in addition to masked language modeling. Next-sentence prediction illustrates that the transformer model head can be adjusted to predict different data types. This flexibility is exploited frequently, for instance, in the fine-tuning of LLMs to perform specific tasks based on smaller, task-related datasets (see Section “ Fine-tuning ”).

An overview of model types

Since the inception of the transformer architecture (Vaswani et al., 2017 ), many model variants have emerged that differ in important ways, including the architecture family (i.e., encoder , decoder , or encoder–decoder ), model size, stage of training reached by the model, and openness.

Concerning the architecture family, it is helpful to distinguish the encoder , decoder , and encoder–decoder architectures (see Fig. 3). The encoder architecture is characterized by bidirectional attention, pre-training through masked language modeling (and, sometimes, next-sentence prediction), and the use of special tokens such as "[CLS]", "[SEP]", and "[MASK]". The goal of the encoder architecture is to produce accurate contextualized embeddings, including for the special tokens. Prominent examples following the encoder architecture are the models of the BERT family (e.g., DistilBERT (Sanh et al., 2019) or RoBERTa (Liu et al., 2019)), and the Instructor models (Su et al., 2022). The decoder architecture, on the other hand, is characterized by causal attention and pre-training through causal language modeling. The goal of the decoder architecture is to generate text via next-token prediction. Prominent examples of the decoder architecture are the GPT (OpenAI, 2023) and LLaMA model families (Touvron et al., 2023). Finally, the encoder–decoder architecture is characterized by a combination of the two, and is the original transformer architecture as proposed in Vaswani et al. (2017). It is trained with next-token prediction: the input text is first fed to the encoder, whose last hidden state is then passed as input to the decoder, which predicts the next token. Prominent examples of the encoder–decoder architecture are the BART (Lewis et al., 2019) and T5 (Raffel et al., 2020) models.

A second key differentiating factor between LLMs is size. Size is often measured in terms of the number of weights in a model, which can vary between a few hundred million (e.g., most BERT models) and several hundred billion (e.g., Megatron-Turing NLG; Smith et al., 2022) – or even the trillion weights supposedly reached by OpenAI’s GPT-4 model. Although the number of weights plays a large role in determining a model’s capacity to learn from the training data, how the weights are distributed throughout the various model components also matters (Kaplan et al., 2020). The size of LLMs can also differ in a more functional way, by allowing for different context sizes. The context size is the maximum number of tokens in a sequence that the attention mechanism can evaluate at any given time, and it determines the complexity of connections between tokens that the model can consider. This is important for applications such as few-shot learning (see Section “ Applications of LLMs in behavioral science ”). From a practical perspective, context size also determines the amount of random-access memory (RAM) needed to run a model. For large decoder models such as LLaMA-2 (70 billion weights), RAM requirements can be as high as 300 gigabytes, resulting in a need for expensive, high-performance graphical processing units (GPUs).

A third differentiating factor is the stage of training reached by the model. First, there are pre-trained models , which have been trained on a large corpus of text using masked or causal language modeling. The text corpus typically includes websites, which in turn include blogs, news articles, Wikipedia, social media platforms (e.g., Reddit), and other sources of text (e.g., books, academic articles). Larger pre-trained models are sometimes also called foundation models (Bommasani et al., 2023 ), emphasizing their purpose as a basis for task-specific training. Second, there are fine-tuned models , which are pre-trained models that have been further trained on task-specific data to selectively increase their performance on certain tasks. These can be basic tasks, such as token classification or prediction of numerical variables, or more complex tasks, such as named-entity recognition or question answering.

A fourth differentiating factor that concerns the set of fine-tuned models and has played an especially important role in the recent surge in the popularity of LLMs is whether the model is a "chat" model. Specific fine-tuning regimes exist to make pre-trained models better suited to high-quality, assistant-style interactions through a chat interface. These include training steps with explicit human input such as supervised fine-tuning , reinforcement learning from human feedback (Christiano et al., 2017; Touvron et al., 2023), and direct preference optimization (Rafailov et al., 2024). For instance, in supervised fine-tuning, human "annotators" generate prompts and appropriate assistant-style responses to those prompts, such that the model may learn via "imitation" to become a good assistant. Reinforcement learning from human feedback is a more complex, multistage procedure in which human annotators indicate their preferences between model outputs according to specific criteria (e.g., safety or helpfulness) to build a "preference dataset" (stage 1; Touvron et al., 2023). This dataset is then used to train a reward model that learns the annotators’ preferences (stage 2), which in turn provides feedback on the outputs of the LLM in much vaster quantities than would be possible with human annotations alone. This enables the LLM to learn via reinforcement to become a better assistant. Such fine-tuning can be seen as an example of how LLMs can be tailored to specific behavioral applications. A prominent example of a chat model is ChatGPT, but open-source chat models also exist, including LLaMA-2 Chat (Touvron et al., 2023) or Falcon Chat (TII, 2023).

A fifth and final differentiating factor is openness. LLMs differ in terms of how much information is available about the training data, training method, or architecture (see Bommasani et al., 2023) and how openly available the models themselves are. Some models, such as GPT-4, are only available through remote user interfaces, whereas others are mostly (e.g., LLaMA) or fully (e.g., BERT) open-source. Most open-source models can be accessed and employed via the Hugging Face ecosystem introduced in this tutorial; relative to closed-source models, they offer significant advantages in terms of transparency and reproducibility.

Applications of LLMs in behavioral science

LLMs can be employed in several ways in behavioral science. The most basic but often effective mode is feature extraction (e.g., Aka & Bhatia, 2022 ; Binz & Schulz, 2023a ; Hussain et al., 2023 ; Wulff & Mata, 2023 ). Feature extraction sends text input through the model and records contextualized embeddings, typically at the final layer. The resulting embedding vectors can then be utilized in many ways. For example, the use cases presented in the next sections demonstrate how feature vectors can be used to predict the similarity between personality constructs, choices in reinforcement learning tasks, or numerical judgments such as risk or health perception ratings. Feature extraction is commonly performed using encoder models with bidirectional attention (Muennighoff et al., 2022 ), which allows them to better utilize all available information during pre-training. However, when the goal is to predict the next element in a sequence, decoder models can be equally or more effective, also because current decoder models are significantly larger and trained on more data than encoder models.

Another class of applications utilizes the model’s ability to predict outcomes with no or minimal supervision. The use of transformer models without any kind of task-specific training to predict verbal or numerical outcomes is called zero-shot learning. This approach can be used to generate categorical and numerical predictions – for example, to predict the category of a news article or the sentiment of a social media post (e.g., Widmann & Wich, 2023 ; Pelicon et al., 2020 ; Gilardi et al., 2023 ). Zero-shot learning can be performed using all three types of transformer architectures, with some differences in implementation. Few-shot learning is an extension of zero-shot learning where minimal supervision is provided in the model input – for instance, in the form of a handful of input–output pairs. In classifying a social media post’s sentiment, this would imply including pairs of the post’s text and known sentiment in the model input. Few-shot learning can, in principle, be used with any model type; however, it is most commonly employed using modern decoder models, which tend to contain more parameters and show better performance on few-shot tasks (Brown et al., 2020 ). Off-the-shelf, large-scale decoder models, such as GPT-4, provide good zero-shot and few-shot performance (Törnberg, 2023 ; Rathje et al., 2023 ).
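As a minimal illustration of zero-shot classification with an off-the-shelf model (the checkpoint and example post are ours; any natural-language-inference-based checkpoint on the Hub could be substituted):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

post = "Markets tumbled again today, and I can't see things improving any time soon."
result = classifier(post, candidate_labels=["positive", "negative", "neutral"])
print(result["labels"][0], result["scores"][0])   # highest-scoring label and its score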

However, fine-tuning smaller language models on a specific task can result in equally good or even better performance relative to the zero- or few-shot performance of larger models (Wang et al., 2022 ; Chae & Davidson, 2023 ; Rosenbusch et al., 2023 ). Unlike zero- or few-shot learning – both of which leave model parameters unchanged during the "learning" phase – fine-tuning involves explicit updating of model weights with the goal of improving task-specific performance. As a result, fine-tuning is an important strategy for applications in behavioral science (e.g., Demszky et al., 2023 ).

Fig. 4  The main components of the Hugging Face ecosystem. The figure is adapted from Tunstall et al. (2022)

The final class of application is text generation . This application is specific to encoder-decoder and decoder-only models. Text generation can be used to perform various tasks including summarization, question answering, or simply free text generation. Some examples of the use of free text generation include comparing reasoning abilities in LLMs and humans (Yax et al., 2023 ; Webb et al., 2023 ) and simulating the responses of study participants (Argyle et al., 2023 ). These simulations need not be constrained only to predicting human behavior, but could also be used to suggest explanations for why people behaved a certain way (Bhatia, 2023 ).

Choosing between models

Given the various model types available and their differing applications in behavioral science, one natural question that follows is which model to use for the application at hand. In view of the constantly developing architectural landscape and the important role played by task-specific properties, it is difficult to predict model success. Nevertheless, some heuristics tend to hold.

First, larger models tend to perform better than smaller ones (Kaplan et al., 2020 ). On the other hand, they also demand more computational resources. Modern large language models typically exceed the capacity of personal computers, requiring access to either high-performance clusters or cloud computing services. These requirements can be alleviated with quantization (e.g., Ma et al., 2024 ; Frantar et al., 2022 ), reducing a model’s numerical precision, or knowledge distillation (Hinton et al., 2015 ), transferring a model’s representations into a smaller model. Although these approaches usually come with some loss in performance, a quantized version of a larger model will, for instance, often outperform its smaller, full-precision equivalents. Ultimately, choices concerning model size must factor in these kinds of computational resource considerations.

Second, decoder models are more suited than encoder models to tasks where tokens must be predicted in sequence – that is, in the order that the tokens would be written (see, e.g., Section Extracting token probability and perplexity ). This is due to their causal language modeling pre-training objective, which prevents them from "cheating" by having access to future tokens when predicting the present token. On the flip side, when representations of the context (before and) after a given token of interest are desired (see, e.g., sections Relating personality measures and Predicting health perception ), encoder models typically outperform similarly sized decoder models. This is often the case for feature extraction tasks unless the goal is to generate predictions that are more sequential in nature (see, e.g., Section “ Predicting repeated choice ”), in which case a decoder model may be better suited.

Third, chat models that have been fine-tuned with assistant-oriented regimes, such as reinforcement learning from human feedback, are often better at producing coherent human-like output. This makes them more useful for a variety of tasks, such as text labeling, information retrieval, and text generation. The latter includes applications such as generating study materials or simulating human participants in behavioral experiments (see Section “ Text generation ”).

In addition to these heuristics, several empirical benchmarks exist that can help with selecting the right model for a task. There are separate benchmarks for various tasks, covering both feature extraction (e.g., huggingface.co/spaces/mteb/leaderboard ) and text output (e.g., huggingface.co/spaces/lmsys/chatbot-arena-leaderboard ). An overview of benchmark results for open models accessible through Hugging Face can be accessed at huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard . Generally, it is useful to select the model whose benchmark performance is high for tasks similar to the task at hand. Of course, benchmark performance may fail to generalize.

The Hugging Face ecosystem

The Hugging Face ecosystem has two main components: an online hub ("the Hub", huggingface.co/docs/hub ) and a set of Python libraries (as illustrated in Fig. 4, huggingface.co/docs ). Hosting over 300,000 models and 60,000 datasets, the Hub constitutes an impressive community-driven effort to democratize language modeling, as well as other types of modeling, such as computer vision. Furthermore, thanks to Hugging Face’s Python libraries, many of the steps traditionally required to implement LLMs have become considerably more accessible, often requiring just a few lines of code each. These steps typically include processing the data, initializing the model, and applying the model to a specific task. This section introduces some of the most important of these libraries, such as datasets, tokenizers, transformers, and accelerate, and outlines how they can help with these tasks.

The first step in almost any language modeling pipeline is data processing. This usually involves data loading, cleaning, and reshaping. Because natural language processing (NLP) datasets can sometimes be in the order of tens or even hundreds of gigabytes, they cannot always be loaded into RAM or stored on the hard drive. The datasets library addresses this issue by enabling users to convert their data to Apache Arrow format, allowing it to be flexibly and efficiently read from the hard drive or streamed online, thus preventing RAM or hard drive storage overload.

Once the dataset has been loaded in datasets.Dataset format, it must be tokenized before it can be fed into a model. It is essential that the method of tokenization is matched to the pre-trained model. Hugging Face’s tokenizers library and higher-level API alternatives from the transformers library make it easy to initialize the appropriate tokenizer using the tokenizer’s .from_pretrained() method. This can be done by passing the model checkpoint – a unique character string that identifies the model on the Hugging Face Hub (see huggingface.co/models ).
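For example (a minimal sketch using the DistilBERT checkpoint employed later in this tutorial):

from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"            # model checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print(tokenizer("open-source llms rock."))        # token ids and attention mask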

Hugging Face models can be loaded in a single line using the transformers.AutoModel.from_pretrained() method, and placed on the graphics processing unit (GPU) if a compatible GPU is available. This can speed up model inference and fine-tuning to such an extent that it may make an otherwise infeasible task feasible. Training and inference can be further optimized by distributing across multiple GPUs with accelerate. In the examples that follow, accelerate works in the background of the transformers library via arguments such as device_map="auto" to automatically optimize the distribution of resources across processing units to allow easy upscaling to larger models.
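A minimal sketch of model loading and device placement (the MPS check assumes a recent PyTorch version):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")

# Use a CUDA or Apple MPS GPU if available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
model = model.to(device)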

It is important to note that the Hugging Face ecosystem is dependent on deep learning libraries such as PyTorch ( pytorch.org ) or TensorFlow ( tensorflow.org ), and interacts with popular Python libraries such as NumPy ( numpy.org ) and Pandas ( pandas.pydata.org ). Furthermore, it interfaces with several other Python libraries, including SentenceTransformers ( sbert.net/docs/quickstart.html ), offering a high-level API to obtain embeddings from over 500 models hosted on the Hugging Face Hub in addition to providing its own pre-trained models.

In what follows, we demonstrate how the Hugging Face ecosystem can be used for three types of applications, namely, feature extraction, fine-tuning, and text generation, by presenting several use cases in behavioral research with accompanying code. Comprehensive and richly commented code is available in notebook format at github.com/Zak-Hussain/LLM4BeSci.git , a GitHub repository with instructions for running the code online in a Google Colab environment. The repository also provides a means of keeping the code base for this tutorial up to date. Keep in mind that the Hugging Face ecosystem is in active development, making it likely that specific aspects of the code presented in this paper will be deprecated by the time of reading. We plan to regularly update the GitHub repository and respond to update requests, which can be submitted as GitHub issues at github.com/Zak-Hussain/LLM4BeSci/issues/new . For further information on Hugging Face, we suggest the Hugging Face textbook by Tunstall et al. ( 2022 ).

Feature extraction

Relating personality measures

Feature extraction from LLMs is already being leveraged in diverse ways to assist research in personality psychology (e.g., Abdurahman et al., 2023 ; Cutler & Condon, 2023 ; Wulff & Mata, 2023 ). In this example, we show how LLMs can be used to predict the relationship between existing personality measures (i.e., personality items and constructs). Specifically, we walk the reader through an analysis pipeline emulating the work of Wulff & Mata ( 2023 ) that used feature extraction to generate item and construct embeddings – representations of psychometric items and constructs in a vector space obtained from LLMs – and then applied these vectors in downstream tasks, such as similarity comparison and clustering, to both validate different models and tackle the lack of conceptual clarity in personality psychology (Wulff & Mata, 2023 ). See github.com/Zak-Hussain/LLM4BeSci.git to run this example in a Google Colab environment.

The example begins by loading the relevant data, in this case, data concerning the IPIP-NEO-300 five-factor personality inventory (Goldberg et al., 2006), into a pandas.DataFrame , neo_items (see Table 1). The goal in this example will be to obtain item embeddings – representations of personality items and constructs – using feature extraction, so that these embeddings can be used to estimate the similarity between items and constructs. The similarity between items and constructs can ultimately be used to uncover the structure of psychological constructs and to inform the labeling of these measures (Wulff & Mata, 2023). The data set used in the example has three columns: ’item’ (the personality item description), ’construct’ (the personality construct to which the item belongs), and ’factor’ (the Big Five personality factor to which the construct belongs). Once the data has been loaded (and any necessary cleaning and reshaping performed), they are converted into a datasets.Dataset object for efficient storage using the from_pandas() method.
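A minimal sketch of this loading and conversion step (the two example rows are illustrative stand-ins for the full set of IPIP-NEO-300 items):

import pandas as pd
from datasets import Dataset

# neo_items: one row per personality item (illustrative rows only)
neo_items = pd.DataFrame({
    "item": ["Go straight for the goal.", "Worry about things."],
    "construct": ["Achievement-striving", "Anxiety"],
    "factor": ["Conscientiousness", "Neuroticism"],
})

dat = Dataset.from_pandas(neo_items)   # efficient Arrow-backed storage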

The text input is now ready for tokenization. As mentioned earlier (see Section “ Tokenizers ”), the tokenizer must be consistent with the model to be used downstream. As such, a model checkpoint ( model_ckpt ) must first be defined. In our example, we use a lightweight version of BERT (’distilbert-base-uncased’) to ensure ease of storing and running the model on most hardware setups. However, it could easily be replaced by a larger model from the Hugging Face Hub, hardware limitations permitting. With the model_ckpt specified, the model tokenizer can be loaded with AutoTokenizer.from_pretrained(model_ckpt) .

Tokenization is performed efficiently across groups of inputs, called batches, by mapping dat with a user-defined batch_tokenizer function. This takes two important arguments: padding and truncation. padding is used to fill up sequences with zeros to match the length of the longest sequence in the batch, thus ensuring that all sequences in the batch have the same length. This is essential for training and inference with deep learning models, which operate on fixed-size input tensors. Tensors are a generalization of vectors, including vectors, matrices, and higher-order arrays. truncation is the process of cutting off the end of a sequence to ensure that it fits within the model’s maximum context size. In the case of BERT models, this is 512 tokens. It is worth noting that alternative strategies exist if the to-be-dropped part of the sequence is thought to contain useful information, including splitting up the sequence into digestible batches and averaging the embeddings across batches.

As output, the batch_tokenizer returns a Python dictionary with two keys: ’input_ids’ and ’attention_mask’ . ’input_ids’ maps to a list of integers uniquely representing each token in the sequence. ’attention_mask’ , which is not to be confused with the learned attention weights, maps to a list of ones and zeros, where the ones stand in place of each token and the zeros pad the sequence to match the longest in the batch due to padding. For instance, the personality item ’Go straight for the goal.’ tokenizes to:

[Listing (figure a): the tokenizer output for this item, a dictionary with ’input_ids’ and ’attention_mask’ entries]

with the ’input_ids’ referring to the tokens "[CLS]", "go", "straight", "for", "the", "goal", ".", and "[SEP]". The final pre-processing step involves converting the data to the PyTorch ( torch ) format such that they can be passed to the model.
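A sketch of the tokenization step just described (assuming dat and tokenizer from above; the exact function and argument names may differ from the accompanying notebooks):

def batch_tokenizer(batch):
    # Pad to the longest sequence in the batch and truncate to the model's context size
    return tokenizer(batch["item"], padding=True, truncation=True)

dat = dat.map(batch_tokenizer, batched=True)

# Convert the model inputs to PyTorch tensors
dat.set_format("torch", columns=["input_ids", "attention_mask"])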

The data are now ready to be fed into the model. Model architecture and pre-trained weights are loaded in a single line with AutoModel.from_pretrained(model_ckpt) . The code next detects whether a Compute Unified Device Architecture (CUDA)-compatible or Apple Metal Performance Shader (MPS)-compatible GPU is available using PyTorch and, if so, sets the device to the GPU. Otherwise, the CPU is used. The model is then moved to the device with the to() method.

Data can be more efficiently fed into the model in batches and this is done automatically using the Dataset.map() method. To extract the features (i.e., the numerical vector representations of the inputs), the researcher must define a function extract_features() . It takes batches of dat as input and selects the columns containing the model inputs by checking whether the column name is referenced in a list accessed via tokenizer.model_input_names . In this case, the list contains two model inputs: ’input_ids’ and ’attention_mask’ . These are then input to the model with gradient calculations disabled by torch.no_grad() . This is done for efficiency reasons: By default, torch models build a computational graph in the background in order to perform gradient descent. Because feature extraction only performs a forward pass through the model (i.e., there is no weight updating), no computational graph is required.

Finally, the extract_features() function extracts the activations of the last layer of the model via model(**inputs).last_hidden_state . This returns a tensor of shape (batch_size, n_tokens, hidden_dim) – in this case (8, 16, 768) – due to being passed through dat.map() with batch_size=8 , padding=True , and the number of embedding dimensions in the model being 768. It is worth mentioning that because padding=True pads to the length of the longest sequence in the batch, n_tokens will not always equal 16. From this tensor, the first token features at position 0 are selected, moved back onto the CPU, and converted to a NumPy array per the requirement of Dataset.map() . The first token is the special "[CLS]" token, inserted at the beginning of each input. It is known to produce a holistic representation of the content of the entire input and is therefore a common target in feature extraction (Tunstall et al., 2022). However, it should be noted that feature extraction strategies other than those focusing on the "[CLS]" token exist, such as taking the average of all output representations (known as "mean-pooling"; see, e.g., Reimers & Gurevych, 2019).

The function is then applied using the dat.map() method, which runs the items in batches of eight through the function. Depending on the researcher’s RAM constraints, the batch_size could be increased. Finally, the features are converted into a pandas.DataFrame for later downstream use.

In our example, the application of item embeddings involves similarity comparison between personality items. The similarity between the items is evaluated by passing dat[’hidden_state’] , containing the features extracted for each item, to sklearn’s cosine_similarity() function. This function computes the cosine similarity, a measure commonly used to evaluate the similarity between embedding vectors, for each pair of items and returns a square NumPy array of pairwise item similarities.
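Putting these steps together, a minimal sketch of the extraction and similarity computation might look as follows (assuming model, tokenizer, dat, and device from above; function and column names follow the prose but may differ from the accompanying notebooks):

import numpy as np
import torch
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def extract_features(batch):
    # Keep only the columns the model expects (input_ids, attention_mask)
    inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    with torch.no_grad():                          # forward pass only; no gradients needed
        last_hidden_state = model(**inputs).last_hidden_state
    # Position 0 holds the "[CLS]" token, used as a holistic representation of the input
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}

dat = dat.map(extract_features, batched=True, batch_size=8)

features = pd.DataFrame(np.array(dat["hidden_state"]))   # one 768-dimensional row per item
similarities = cosine_similarity(features)                # square matrix of pairwise similarities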


Before we discuss the actual use of the cosine similarities obtained, it should be noted that there are higher-level API alternatives to feature extraction, such as the SentenceTransformers library ( sbert.net/docs/quickstart.html ) and the Hugging Face transformers.pipeline abstraction. There are, however, two reasons for introducing feature extraction as we do above. First, even if the researcher opts for the higher-level API in their own work, the lower-level code can give them a better understanding of what is happening in the background. For instance, it highlights the importance of tokenization, allows researchers to easily inspect which inputs get fed into the model ( ’input_ids’ and ’attention_mask’ ), and indicates which features get extracted (the last hidden state). This background understanding can be crucial for debugging code and knowing how to appropriately adjust it to the research context. Second, and almost by definition, the lower-level API has the advantage of greater customizability. That being said, due to their simplicity, higher-level APIs will often be the best option for researchers wishing to implement their own feature extractor. For the sake of brevity and consistency, we demonstrate how this can be done with the higher-level transformers.pipeline API, and refer readers to the SentenceTransformers library for an alternative approach.

The pipeline object, pipeline , achieves the same results as the example above but with considerably less code. pipeline takes ’feature-extraction’ as a positional argument upon initialization. In the same line, the desired model and corresponding tokenizer can be loaded by specifying model=model_ckpt and tokenizer=model_ckpt as arguments. Rather than moving the data and model onto the GPU with the to() method, this is all done in the background by setting device=device . PyTorch is specified as the chosen deep-learning framework for running the model with framework=’pt’ . Finally, feature extraction is run by passing the personality items as a list of strings to the now initialized feature_extractor , with tokenization options such as padding and truncation provided as a dictionary argument via tokenize_kwargs . The feature extraction returns a list of tensors with each tensor corresponding to a sample (i.e., personality item), and with the "[CLS]" features accessible at index [0, 0] .
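A sketch of this higher-level route (assuming model_ckpt, device, and the neo_items data from above):

from transformers import pipeline

feature_extractor = pipeline(
    "feature-extraction",
    model=model_ckpt,
    tokenizer=model_ckpt,
    device=device,
    framework="pt",
)

items = neo_items["item"].tolist()        # personality items as a list of strings
features = feature_extractor(items, tokenize_kwargs={"padding": True, "truncation": True})

cls_features = [f[0][0] for f in features]   # "[CLS]" features at index [0, 0] for each item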


Fig. 5  Personality psychology application. (A) Correlations between predicted versus observed item similarities and (B) predicted versus observed construct similarities based on embeddings from DistilBERT, Instructor (instructor-xl), Cohere (Cohere-embed-english-v3.0), and ada (text-embedding-ada-002) models. (C) Pairwise controlled manifold approximation projection (Wang et al., 2021) of Instructor-XL construct-level features. Colors reflect the Big Five personality factor to which the construct belongs. Error bars represent 95% confidence intervals

Following Wulff & Mata (2023), the cosine similarities between features can be compared with observed correlations between participant responses at both the item and the construct level. With ’distilbert-base-uncased’, a relatively simple baseline model, the semantic content of the personality items – as captured by the extracted features – is correlated with the absolute observed similarities with a Pearson \(r=.14\) . By aggregating item features and item correlations to the construct level, this correlation increases to \(r=.32\) (see Fig. 5, panels A and B). As further shown in Wulff & Mata (2023), these correlations are higher for more recent, larger models. For instance, hkunlp/instructor-xl is a considerably larger alternative to DistilBERT that contains 1.2 billion as opposed to 67 million parameters. At the time of writing, it is at the top end of the Massive Text Embedding Benchmark (MTEB) Leaderboard ( huggingface.co/spaces/mteb/leaderboard ), and is also openly available through the Hugging Face Hub. As such, it presents a promising alternative to DistilBERT. Indeed, it achieves considerably higher item-wise ( \(r=.39\) ) and construct-wise ( \(r=.56\) ) correlations, which are comparable to those produced by top-of-the-range paid-API-access embedding models from Cohere ( Cohere , Cohere-embed-english-v3.0) and OpenAI ( ada , text-embedding-ada-002). Finally, as illustrated in Fig. 5C, plotting a two-dimensional projection of the similarities between constructs reveals that the placement of constructs largely recovers the Big Five personality factors to which the constructs are assigned. Overall, this example highlights that item embeddings generated through feature extraction can accurately capture the empirical correlations between personality items and constructs and the overall structure of human personality, although performance differs across models.

More generally, we should point out that LLMs are not only capable of reproducing known facts about personality psychology. Their ability to capture the conceptual relationships between items, constructs, and their labels has been exploited by Wulff & Mata ( 2023 ) to produce a mapping of personality constructs and associated labels that increases parsimony and reduces jingle–jangle fallacies ; that is, the proliferation of measures that have been given similar labels, yet capture different constructs (jingle fallacies), as well as measures that have received different labels, yet capture the same construct (jangle fallacies). Consequently, feature extraction can be a powerful tool in contributing to conceptual clarity in this field.

Predicting health perception

In this section, we move into the domain of prediction, with the goal of showing how behavioral researchers can use LLMs to predict human judgments and decisions using the now-familiar feature extraction approach coupled with regression modeling (or other predictive modeling approaches). We use the example of predicting health perceptions following the approach of Aka & Bhatia (2022). We believe that predictive applications of LLMs such as this present a promising means of both tracking real-world perceptions and behaviors (e.g., Hussain et al., 2023), as well as enabling (in-silico) testing of potential interventions for improving communication between, for instance, health experts or policymakers and the general public (Aka & Bhatia, 2022). See github.com/Zak-Hussain/LLM4BeSci.git to run this example in a Google Colab environment.

The dataset used here has two columns: ’text’ and ’labels’ (Table 2; Aka & Bhatia, 2022). ’text’ is composed of short descriptions of various health states extracted from the U.K. National Health Service’s web page. ’labels’ contains average numerical ratings of the severity of these health states from 782 participants, with higher ratings indicating less severe states. The goal is to build a model that predicts people’s perception of the severity of the health states presented.

As demonstrated in the last section, the hidden state representation of each health description in the ’text’ column can be extracted using ’distilbert-base-uncased’. These features can then be converted to a pandas.DataFrame , and used as predictors in a regression model to predict the health ratings. So as not to repeat code, this section begins with these features already extracted.

Model performance is evaluated out-of-sample. In the simplest case, this involves splitting the data into a training and a test set using sklearn’s train_test_split , with 80% used for training and 20% for testing (as determined by test_size=.2 ). random_state=42 is used for reproducibility.

It is important to remember that the extracted features are high-dimensional. In this case, there are 768 predictors, as determined by the number of hidden dimensions in ’distilbert-base-uncased’. With only 621 samples in the training set, there are more predictors than samples. In such a case, an ordinary least squares regression solution cannot be identified. To address this issue, and more general issues associated with high-dimensional data such as over-fitting and multicollinearity, researchers commonly employ regularization. In this case, sklearn’s RidgeCV is used, where the regularization penalty ( alpha ) is automatically tuned using cross-validation within the training set.


Once the regression model has been initialized using RidgeCV() and assigned as regr , it is then fitted to the standardized training data with regr.fit() . Standardization is commonly performed in regularized regression, such as ridge regression, to ensure that all predictors are given equal weight in the regularization penalty. Finally, model performance is evaluated with regr.score() .
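A minimal sketch of this split-and-fit pipeline (assuming features is the DataFrame of extracted DistilBERT features and labels holds the mean severity ratings; the explicit standardization step reflects the description above):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=.2, random_state=42
)

# Standardize predictors so all features contribute equally to the regularization penalty
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

regr = RidgeCV()                        # ridge regression with cross-validated alpha
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))       # out-of-sample R^2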

Figure 6 A shows a high alignment of predicted and observed health state ratings, implying that much of the variance in health state ratings can be explained by the DistilBERT features.

Fig. 6  Predicting health perception. (A) Predicted versus observed health ratings using DistilBERT. (B) Comparing the performance of DistilBERT, Instructor (instructor-xl), Cohere (Cohere-embed-english-v3.0), and ada (text-embedding-ada-002), with tenfold cross-validation using ridge regression. Error bars reflect \(\pm 1\) SD

For the purposes of keeping the tutorial code simple, out-of-sample performance was measured using a single train-test split with a single model; ideally, performance would be evaluated across multiple splits to ensure the robustness of the results. Using ridge regression with automatic (nested) tuning of the regularization penalty hyper-parameter (Varma & Simon, 2006 ), feature extraction with DistilBERT on average explains over half of the variance in health perceptions ( \(R^2=.58\) ). This performance, although considerable, is inferior to that of larger models. As Fig. 6 B shows, the open-source Instructor model (instructor-xl), as well as the proprietary Cohere (Cohere-embed-english-v3.0) and ada (text-embedding-ada-002) models benefit from their larger size, all showing a relatively similar increase in performance relative to DistilBERT, with the best-performing model achieving \(R^2=.70\) . These results could potentially be improved further by using different prediction algorithms, including nonlinear algorithms, or more fine-grained hyper-parameter tuning.

Overall, the analysis shows that a simple feature extraction procedure using LLMs, coupled with regression-based modeling, can achieve impressive performance when tasked with predicting human judgments.

Predicting repeated choice

The previous examples have demonstrated that LLM feature extraction can be used to predict aspects of human psychology and judgments. But what about more complex human behavior? This section demonstrates that the feature extraction pipeline can also be applied to decisions in a repeated choice paradigm that involves sequential cognitive reasoning. Moreover, it shows that only minor code changes are needed to employ some of the largest and most advanced LLMs currently available for these purposes. See github.com/Zak-Hussain/LLM4BeSci.git to run this example in a Google Colab environment.

The experimental data in this section come from a paradigm known as the horizon task (Wilson et al., 2014). In this task, participants repeatedly choose between two options. Upon selecting an option, they receive a probabilistic reward. Each game starts with four observational trials, in which the examples are predetermined by a computer program, followed by either one or six choices. Participants are instructed to accumulate as much reward as possible over the experiment. The data considered here are combined from two previous studies (Wilson et al., 2014; Feng et al., 2021) in which 60 participants each played 320 games, making a total of 67,200 choices.

In line with earlier work (Binz & Schulz, 2023a; Coda-Forno et al., 2023), the model inputs, also known as prompts, are designed as follows (see also Fig. 7A). Each prompt contains information about a single game, starting with a list of previous choices and outcomes in the game. Following a brief instruction, the text continues "Q: Which machine do you choose? A: Machine", missing only the number of the selected machine (1 or 2). Choices and outcomes are sequentially added to the list as the LLM interacts with the task.

Fig. 7  Predicting repeated choices. (A) Example prompt for the text-based version of the horizon task. (B) Accuracy of predicting human choices on the test data for different models

Once the prompts have been loaded in a pandas.DataFrame , the same out-of-sample testing pipeline used in the previous section on health judgments can be applied. The only change necessary is to replace sklearn’s RidgeCV with LogisticRegressionCV to account for the binary criterion (i.e., machine 1 or 2). Analogous to RidgeCV , LogisticRegressionCV also performs L2-regularization with automatic regularization parameter tuning. For the sake of brevity, we omit this code here.

With the model ready, it can be fitted using the features extracted from the ’distilbert-base-uncased’ model using a training portion of the data. The resulting model achieved a considerable accuracy of \(78.9\%\) on the test set. However, this performance can be improved using larger models. This time, we consider decoder models, which, based on currently available models, are better suited for the task at hand in two ways. First, current decoder models are larger in size than encoder models and can thus potentially capture more sophisticated reasoning (Brown et al., 2020). Second, their design may better match the sequential nature of the task, which mirrors the next-token prediction setting.

Fortunately, the Hugging Face interface makes it possible to use some of the most advanced models with only minor code modifications. In line with the goal of sharing reproducible code that can run in a Google Colab environment, we use a smaller Llama 2 model in the analysis example (Touvron et al., 2023). Llama 2 is a family of large-scale decoder models developed by Meta that are available in different sizes, training stages, and formats through the Hugging Face Hub. The specific model used in the example is ’TheBloke/Llama-2-13B-GPTQ’ , which has been created by quantizing the 13 billion parameter model to 4 bits to reduce its RAM footprint without a significant loss in accuracy (see, e.g., Frantar et al., 2022). With 13 billion parameters, the model is two orders of magnitude larger than the ’distilbert-base-uncased’ model used thus far.

A few additional minor modifications are needed to use the Llama 2 model. First, because ’TheBloke/Llama-2-13B-GPTQ’ is a different class of model, that is, a decoder causal language model, there is no "[CLS]" token that could be used to stand in for the entire input. Instead, in causal language models, the most important token to generate predictions is the final token in the input, that is, the token "machine", which is the only token whose contextualized embeddings are determined by those of all other tokens. The hidden state for the final token in the sequence can be extracted using last_hidden_state[:,-1] . Second, loading this model requires the more specific AutoModelForCausalLM.from_pretrained() method as opposed to the generic AutoModel.from_pretrained() method. Within the function, device_map="auto" , which is available only for certain larger models, is used to allocate the model automatically to GPU or CPU resources, making the explicit casting of tensors via .to(device) obsolete, and revision=’main’ is used to indicate the specific model branch. (See https://huggingface.co/TheBloke/Llama-2-13B-GPTQ for the list of options on the corresponding model card.)
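A sketch of the modified loading and extraction steps (running the GPTQ-quantized checkpoint may require additional packages such as auto-gptq; the remaining details mirror the earlier feature extraction sketch and are not verbatim from the accompanying notebooks):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_ckpt = "TheBloke/Llama-2-13B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForCausalLM.from_pretrained(model_ckpt, device_map="auto", revision="main")

def extract_features(batch):
    inputs = {k: v.to(model.device) for k, v in batch.items() if k in tokenizer.model_input_names}
    with torch.no_grad():
        # Request the hidden states and take the final layer (i.e., the last hidden state)
        last_hidden_state = model(**inputs, output_hidden_states=True).hidden_states[-1]
    # For a causal model, the final token's embedding summarizes the entire prompt
    # (assuming the prompts are not right-padded)
    return {"hidden_state": last_hidden_state[:, -1].cpu().numpy()}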


Using the 13B Llama 2 model with these settings required only around 16 GB of GPU memory, implying that it can be replicated on many standard personal computers. Predicting repeated choice using the Llama 2 features resulted in a substantial increase in accuracy over the DistilBERT model to \(82.9\%\) (see Fig. 7B). This performance is on a par with that of a psychological model using handcrafted features such as the difference in estimated rewards or the difference in uncertainty, as described by Gershman (2018) and Binz & Schulz (2022), which achieved a test accuracy of \(82.6\%\) .

Overall, the results presented in this section demonstrate that LLMs can produce feature vectors that make it possible to predict complex decision-making behavior at a level comparable with that of current psychological models. Given that considerably larger models are available and that the model in the example may have suffered from quantization, it is plausible that the ceiling for state-of-the-art LLMs is considerably higher. With such strong performances, these results hint at behavioral applications of LLMs that go beyond predicting observed data, such as investigating how changes to the experimental stimuli or instructions may impact participant behavior. Such changes may be used, for instance, as in-silico pilot studies to inform future experimental designs, or to test specific hypotheses that were not testable with the original experimental data.

Fine-tuning

Feature extraction can be a powerful approach for predicting human judgments and behavior, especially when labeled task-specific data are scarce or if the researcher does not have access to strong GPUs. However, as the extracted features are domain-general, they may not be optimal for all tasks. An alternative approach is fine-tuning, sometimes referred to as transfer learning . During fine-tuning, all model weights are typically optimized for the given task (for an alternative strategy, see, e.g., Hu et al., 2022 ).

This section again uses data from Aka & Bhatia (2022), as illustrated in Table 2. See github.com/Zak-Hussain/LLM4BeSci.git to run this example in a Google Colab environment. So as not to repeat code, we begin with dat already tokenized, and with the device already set to the GPU. The code starts by setting torch.manual_seed(42) . This helps to ensure the reproducibility of stochastic processes such as stochastic gradient descent and its variants, including the ’adamw_torch’ optimizer used for fine-tuning the model.

The data are again split into a training and a test set. Because the plan is to use the Hugging Face ecosystem downstream as well, the data are kept in the native Hugging Face format using the inbuilt datasets.Dataset.train_test_split() method. Mirroring sklearn’s train_test_split() , a test set size and "random state" can be set, this time with test_size and seed as arguments. The method returns the split data sets in the form of a datasets.DatasetDict object, which behaves much like a regular Python dictionary. Technically, it inherits from dict , and has the keys ’train’ and ’test’ that map to the respective data sets.

Next, ’distilbert-base-uncased’ is initialized as the model_ckpt , but this time with transformers.AutoModelForSequenceClassification , as opposed to the basic transformers.AutoModel from earlier. This loads the model with a "classification head" attached. Despite its name, this head can also be used for regression tasks such as the one at hand by specifying num_labels=1 as an argument in the from_pretrained() method.

Having initialized the model, it is now time to set up the training loop with a transformers TrainingArguments object. This involves specifying the name of the output directory where model predictions and checkpoints will be saved ( output_dir ), the batch size for training and evaluation ( per_device_train_batch_size , per_device_eval_batch_size ), and the frequency with which training and evaluation metrics are recorded ( logging_strategy , evaluation_strategy ). Because neural networks usually need to be trained for multiple iterations over the training set, the number of iterations, also called epochs , can be specified with num_train_epochs . It is important to note that more advanced strategies exist to automatically halt training when certain test-performance-optimizing heuristics are triggered (e.g., early stopping ). However, for the purposes of this tutorial, we stick to manually specifying num_train_epochs .

An evaluation function, compute_metrics() , is defined to evaluate model performance with metrics other than the model’s loss, which is automatically logged. This takes a transformers.EvalPrediction as input, which can be unpacked into the model’s predictions of the health perception ratings ( preds ) and the actual ratings ( labels ). An object to compute the explained variance ( \(R^2\) ) is then loaded using Hugging Face’s evaluate library, using evaluate.load("r_squared") , and the \(R^2\) is computed with the object’s compute() method using the preds and labels .

Training is carried out by the transformers.Trainer . In this case, the trainer is initialized with the model ( model=model ), the data set to be used for training the model ( train_dataset=dat[’train’] ), the data set to be used for evaluating the model ( eval_dataset=dat[’test’] ), and the function to evaluate the model’s performance on the test set ( compute_metrics=compute_metrics ). Model training is then initiated by running trainer.train() .
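A condensed sketch of this training setup (batch sizes and the output directory are illustrative; the remaining argument values follow the prose, and dat is the tokenized DatasetDict from above):

import evaluate
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1        # a single output node, i.e., regression
)

training_args = TrainingArguments(
    output_dir="health-perception-model",          # where predictions and checkpoints are saved
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    optim="adamw_torch",
)

r_squared = evaluate.load("r_squared")

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    return {"r_squared": r_squared.compute(predictions=preds.flatten(), references=labels)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dat["train"],
    eval_dataset=dat["test"],
    compute_metrics=compute_metrics,
)
trainer.train()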


In our example, after ten epochs, test performance plateaued at around \(R^2=.50\) . This is slightly below the \(R^2=.54\) achieved using feature extraction. As before, repeated train–test splits would need to be run using, for instance, (repeated) k-fold cross-validation, to achieve a more reliable test performance estimate. For the sake of brevity, the code for k-fold cross-validation is not included in this tutorial; however, the pattern of results remains the same. Thus, using our approach, feature extraction outperforms fine-tuning when it comes to predicting health perceptions.

This result highlights that fine-tuning may not always be the dominant choice when it comes to using LLMs for predicting human judgments and behavior. Factors such as data quantity and quality as well as the availability of computational resources will play a large role in determining which approach makes the most sense. It is also worth noting that a host of fine-tuned models are available for download from the Hugging Face Hub. As such, depending on the task at hand, the researcher may find that fine-tuned models for specific tasks or domains are already available at huggingface.co/models . Of course, the data used for fine-tuning may still differ considerably from the data that the researcher wishes to apply the fine-tuned model to. As such, it is always important to compare fine-tuned models to their pre-trained alternatives.

Text generation

Perhaps one of the major features contributing to the proliferation of LLMs is their ability to generate convincing, human-like text in response to prompts. In behavioral science, this capacity has made it possible to perform a vast battery of in-silico behavioral experiments (e.g., Yax et al., 2023 ; Binz & Schulz, 2023b ): As long as the experiment can be converted into a text-based format – setting multimodal models aside for present purposes – the model can "participate" in it.

Following Yax et al. ( 2023 ), this section draws on the example of the cognitive reflection test (CRT) to demonstrate how this can be done (Frederick, 2005 ). The three-question CRT is designed to measure a person’s ability to inhibit intuitive but incorrect responses in favour of deliberation. Here, we focus on how researchers can run experiments on LLMs in a replicable manner on their own hardware or using freely available cloud computing resources. See github.com/Zak-Hussain/LLM4BeSci.git to run this example in a Google Colab environment.

We again use a quantized 13-billion-parameter version of Llama 2 (Touvron et al., 2023). As for predicting repeated choice, despite the vast increase in model size, the code for running the model uses some familiar Hugging Face components: a transformers.pipeline is used, and the model_ckpt is defined and passed as an argument to model and tokenizer when initializing the pipeline. Likewise, device_map='auto' is used for efficient upscaling to larger models.

However, as the goal is to have the model respond to the CRT questions with text, there are two main differences from predicting repeated choice. First, a transformers.pipeline for 'text-generation' is used, which gets the model to generate responses with minimal code. Second, a chat-based version of Llama 2, 'TheBloke/Llama-2-13B-chat-GPTQ', is employed (TheBloke, 2022). This model version has been further trained with techniques such as supervised fine-tuning and reinforcement learning from human feedback to align it more closely with the preferences of humans who wish to use it as an AI assistant (Touvron et al., 2023).

The final arguments in the pipeline are specifically to do with text generation. max_new_tokens=512 means that the model can produce a maximum of 512 tokens in response to the prompt. do_sample=False prevents the model from performing random sampling from the softmax output distribution over its vocabulary. This forces the model to employ a strategy known as greedy search decoding , whereby the model output at each timestep is simply the token with the highest probability in the output distribution. This can be contrasted with text-generation techniques involving sampling, which seek to improve the diversity of model outputs by adding some randomness to the generation process. However, this added diversity can come at the expense of coherence and reproducibility (Tunstall et al., 2022 ).
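The pipeline initialization described in the last three paragraphs might therefore look roughly as follows; note that running a GPTQ-quantized checkpoint additionally assumes that the required quantization dependencies (e.g., optimum) are installed.

from transformers import pipeline

model_ckpt = "TheBloke/Llama-2-13B-chat-GPTQ"

generator = pipeline(
    "text-generation",       # task: generate text in response to a prompt
    model=model_ckpt,
    tokenizer=model_ckpt,
    device_map="auto",       # spread the model across available devices
    max_new_tokens=512,      # cap on the number of newly generated tokens
    do_sample=False,         # greedy search decoding: always pick the most probable token
)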

The CRT comprises three questions, which are assigned to prompt. In order to create a more realistic experimental context for the model, the code uses the Llama 2 chat-specific prompting template recommended on the 'TheBloke/Llama-2-13B-chat-GPTQ' model card page (TheBloke, 2022):

figure i

In this case, the system prompt, which is the broader context given to the model to help guide it, is a general description of the psychological study along with the instructions for participants, whereas the prompt contains the CRT questions. Both together form the prompt_template.
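For illustration, the template has roughly the following shape; the system prompt wording and the single CRT question shown here are placeholders rather than the exact text used in the tutorial.

system_prompt = (
    "You are a participant in a psychological study. "
    "Please answer the following questions to the best of your ability."
)
prompt = (
    "Q1: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"  # the remaining CRT questions would follow here
)

prompt_template = f"""[INST] <<SYS>>
{system_prompt}
<</SYS>>
{prompt} [/INST]
"""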

Generating the output is then as straightforward as passing the prompt_template to generator, which returns the output of the model as a list. The generated tokens can be accessed with the zero index, which returns a dictionary with a single key: 'generated_text'. Accessing the key’s value returns a string composed of the text generated by the model with the input (prompt_template) prepended. In order to return only the output generated, the string is sliced such that the printed output begins after the last character of the prompt_template.
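In code, this amounts to something like the following sketch, using the generator and prompt_template defined above.

output = generator(prompt_template)            # list with one dictionary per generated sequence
generated_text = output[0]["generated_text"]   # prompt plus generated continuation
print(generated_text[len(prompt_template):])   # slice off the prompt, keeping only the response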

figure j

The model output is the following. It answers two out of the three questions correctly.

figure k

On average, human participants answer between one and two questions correctly, suggesting that LLMs can answer problems from the CRT at or above human level. However, one should be careful not to draw strong conclusions about the models’ reasoning capabilities from these results. It is possible that examples of the CRT are in Llama 2’s pre-training set, with (in)correct answers included. Similar to findings on CRT pre-exposure among humans (Haigh, 2016), this may inflate the model’s performance and distort conclusions about its ability to reason (Mitchell, 2023).

The main purpose of this example is to show that open-access LLMs such as Llama 2 can be run on freely available hardware and used as stand-in "participants" in behavioral experiments, with the potential to generate insights about the experiment, the model, and perhaps human psychology as well.

Token probability and perplexity extraction

This tutorial has introduced three common LLM use cases: feature extraction, fine-tuning, and text generation. Although these are perhaps more well-known use cases, they are not exhaustive. For instance, an additional use of LLMs is to deploy them on the very tasks for which they were pre-trained; namely, to assign probabilities to tokens. During training, the model is incentivized to assign high probabilities to tokens and token sequences that are common in the training data and vice versa for those that are uncommon. As a result, the probabilities produced by a trained model can be used to detect text sequences that (the model has learned) are uncommon. Measures based on token and token sequence probabilities have thus been used to, for instance, investigate how language models capture grammatical gender (An et al., 2019 ) and to predict human reading times (e.g., Merkx & Frank, 2020 ).

The present example demonstrates how the log probabilities extracted from GPT-2 can be used to predict teachers’ text readability ratings. Specifically, these are teachers’ ratings of how difficult student readers would find certain text excerpts, obtained from the CommonLit Ease of Readability (CLEAR) corpus (Table 3 ; Crossley et al., 2023 ).

A common metric to evaluate the probability of a sequence is called perplexity (Jelinek et al., 1977). Given a sequence of tokens \(X=(x_{0}, x_{1}, ..., x_{t})\), perplexity can be defined as

\(\text{PPL}(X) = \exp \left\{ -\frac{1}{t} \sum _{i=1}^{t} \log p(x_{i} \vert x_{<i}) \right\},\)

with \(p(x_{i} \vert x_{<i})\) being the probability assigned to the \(i\)-th token in the sequence given the preceding tokens. Perplexity is thus the inverse geometric mean of sequentially produced token probabilities. As perplexity assumes sequentially produced tokens, it is not well-defined for masked language modeling. In our example, we therefore rely on GPT-2, which is a decoder model trained using causal language modeling and a direct precursor to today’s GPT-4.

The code begins by reading the 'Excerpt' and 'BT_easiness' columns of the CLEAR corpus into a pandas.DataFrame. It then loads a perplexity metric object using the evaluate library's load() method. This object has a convenient compute() method, which allows the researcher to specify the model from which to compute the perplexity (model_id='openai-community/gpt2' in this case), the text sequences to compute perplexity on (clear['Excerpt']), and the batch size (8). The device defaults to the GPU if a CUDA-compatible GPU is available. The method returns a dictionary containing a list of perplexity scores for each sample ('perplexities') and the average of those scores ('mean_perplexity').
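A minimal sketch of this step is shown below; the file name CLEAR.csv is an assumption, whereas the column names, model identifier, and batch size follow the description above.

import evaluate
import pandas as pd

clear = pd.read_csv("CLEAR.csv")[["Excerpt", "BT_easiness"]]

perplexity = evaluate.load("perplexity", module_type="metric")
results = perplexity.compute(
    model_id="openai-community/gpt2",        # score sequences with GPT-2
    predictions=clear["Excerpt"].tolist(),   # text excerpts to evaluate
    batch_size=8,
)
clear["perplexity"] = results["perplexities"]  # per-excerpt perplexity scores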

figure l

To evaluate whether perplexity can predict readability, we feed perplexity as a feature into an ordinary least squares regression. Using 1000 randomly sampled excerpts from CLEAR, the perplexity model achieves a tenfold cross-validated \(R^2=.20\) (SD = .05), implying that perplexity can account for a significant portion of the variance in text readability. The predictive accuracy of perplexity can be compared to an established predictor of readability, the Flesch Reading Ease (Flesch, 1948), which is calculated from average sentence and word lengths. Flesch Reading Ease achieves an \(R^2=.28\) (SD = .04), which is only slightly higher than the performance of perplexity. Moreover, combining the two features leads to a considerable increase in predictive accuracy (\(R^2=.44\), SD = .05), indicating the added value of perplexity for the prediction of readability.
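The regression analysis described here can be sketched as follows; the clear data frame with a perplexity column is assumed from the previous step, and a Flesch Reading Ease column would be added analogously to combine both predictors.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = clear[["perplexity"]]          # add a "flesch" column here to combine both predictors
y = clear["BT_easiness"]

scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print(f"R^2 = {scores.mean():.2f} (SD = {scores.std():.2f})")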

Compared to the other use cases presented above, token probability and perplexity measures have been employed less frequently in behavioral science research. Nevertheless, these measures show promise in at least two respects. First, they can be used to predict human perceptions and evaluations of text, such as the readability of study vignettes or the surprisingness of statements. Second, they can be employed to evaluate the likelihood of human responses to text, such as responses to open-text format items.

Open questions and future directions

This tutorial has introduced the Hugging Face ecosystem as an open-source approach to using LLMs in behavioral science. This section discusses the advantages and limitations of (open-source) language modeling in behavioral science, the broader societal-level risks posed by (open-source) LLMs, and future directions for behavioral science research with LLMs.

Good science practices with LLMs

The goal of this tutorial has been to make the responsible use of LLMs for behavioral science more accessible. However, using LLMs responsibly includes following good science practices. This requires more than the ability to implement LLMs in code; it also necessitates a substantive understanding of what that code is doing and an appreciation of the complex and nuanced theoretical questions concerning LLM capabilities. We hope that through Section “ A primer on transformer-based language models ”, and the explanations accompanying the code in this tutorial, readers will come away with a thorough understanding of what this code is doing. Nevertheless, a few cautionary points are in order.

The first concerns hyper-parameters. Like most popular statistical libraries in behavioral science, Hugging Face libraries enable control over a vast array of hyper-parameters, and many of these hyper-parameters have default settings. Although these defaults often make code more readable, they can also lead to complacency. Against this tendency, we stress the importance of making active decisions between the universes of possible hyper-parameter settings. In cases where substantive justifications are lacking, we encourage the use of multiverse analyses (i.e., with different plausible hyper-parameter setups; Steegen et al., 2016 ), computational resources permitting.

The second concerns performance evaluation. In addition to repeated out-of-sample testing, we emphasize the importance of other evaluation and sanity-checking strategies, such as testing meaningful synthetic baseline models. This could include randomly shuffling the pairings between extracted features and labels to identify possible data leakages – a pernicious problem in machine learning more broadly (see, e.g., Kaufman et al., 2012 ). Alternatively, it could mean artificially constructing perfectly predictive fine-tuning features to ensure model fitting is working properly.
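As a small illustration of the label-shuffling idea, assuming features and labels arrays from an earlier analysis step, one could proceed roughly as follows.

import numpy as np

rng = np.random.default_rng(seed=42)
shuffled_labels = rng.permutation(labels)   # destroy the feature-label pairings
# Refit the model on (features, shuffled_labels); if test performance does not drop
# to chance level, this points to data leakage somewhere in the pipeline.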

The third and final point is more theoretical. The recent proliferation of LLMs has revived long-running debates over whether LLMs possess capacities such as true "understanding" ("thinking", "reasoning", etc.; see, e.g., Mitchell & Krakauer, 2023 ). Although we believe that the validity of the applications described in this tutorial does not depend on whether LLMs truly possess such capacities, we anticipate that the broader conclusions drawn from scientific studies involving LLMs will often be mediated by researchers’ background beliefs concerning these questions. As such, we would like to highlight the existence of a vast scientific and philosophical literature pertaining to the (evaluation of) language model capabilities (e.g. Mitchell & Krakauer, 2023 ; Turing, 1950 ; Bender et al., 2021 ; Searle, 1980 ; Günther et al., 2019 ) – a literature that has by no means reached a consensus – to guard against snap judgments or uncritical default positions.

In their broader form, these words of caution do not concern LLMs alone but are part of good science practice more generally. However, the complexity, opacity, and quirkiness of neural network models can exacerbate these issues in ways that require special attention.

Open-source LLMs and open behavioral science

Behavioral science is going through an open science revolution guided by principles such as transparency, accessibility, and sharing (Vicente-Saez & Martinez-Fuentes, 2018). Open-source and open-access language modeling frameworks such as Hugging Face are closely aligned with these principles. For instance, with Hugging Face, all analysis steps – from data preprocessing to model validation – are in principle accessible to fellow scientists wishing to better understand and perhaps reproduce what others have done. Likewise, models fine-tuned using the transformers library can easily be shared on the Hugging Face Hub, making it easier for researchers to build on and benefit from the work of their peers. With over 300,000 models and 60,000 datasets, Hugging Face stands as an exemplary case of the power of sharing in research and beyond.

Hugging Face also supports reproducibility. Features such as the ability to set seeds help improve the reproducibility of nondeterministic processes such as gradient descent. Likewise, because models are saved to the hard drive (instructions for locating locally saved models are regularly updated at stackoverflow.com/questions/61798573/where-does-hugging-faces-transformers-save-models), the precise version of the model used for the analysis can be permanently saved for future reproductions. This stands in contrast to less open alternatives such as the OpenAI API, which, at the time of writing, does not provide the ability to access the same version of a model indefinitely after model updates.
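For instance, a reproducibility-minded script might fix the random seed and keep a permanent local copy of the fine-tuned model; the directory name below is an arbitrary assumption, and model and tokenizer are assumed from the fine-tuning example.

from transformers import set_seed

set_seed(42)                                           # seed Python, NumPy, and PyTorch RNGs
model.save_pretrained("models/finetuned-model")        # store weights and configuration locally
tokenizer.save_pretrained("models/finetuned-model")    # store the matching tokenizer files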

Open-source and open-access language modeling frameworks also have considerable advantages when it comes to data privacy: Because models can be saved and run locally, sensitive data can remain on the appropriate hardware, and researchers can be sure that the creators of the model will not have access to it. This is crucial in behavioral science, where ensuring the privacy of participant data is paramount from an ethical and legal perspective.

Open-source language modeling frameworks also help mitigate an important disadvantage of LLMs: their poor interpretability. Interpretability is a general problem for neural network models, whose complexity and abstract representational nature have earned them the label "black-box" models (Alishahi et al., 2019 ). In behavioral science, the limited interpretability of LLMs can hinder a researcher’s ability to draw strong theoretical conclusions. For instance, being able to interpret the internal states of the model presented in the section on Text Generation would help clarify whether it had simply "memorized" answers to the CRT or actually "reasoned" them through. However, a commitment to openness – in this case, transparency about the data used to train the model – could also help resolve this uncertainty by revealing whether the CRT even featured in the training data at all. In general, interpretability is worsened when researchers are not given information about important details concerning the model’s pre-training data, tokenizer, architecture, weights, and fine-tuning regimes. Of course, even when these details are known, it can remain a mystery why a model performs well on some tasks but not on others. This point is exemplified by the existence of emergent abilities in LLMs: abilities that arise from model upscaling, but whose arrival is incredibly difficult to predict, even for the developers who trained the model (Wei et al., 2022 ).

(Open-source) LLMs and society

Although we believe that open-source and open-access language modeling has its advantages for research, making LLMs publicly accessible also comes with considerable risks. LLMs are, after all, powerful tools, and in the hands of bad actors, they could be used to do serious harm (e.g., spreading mis- and disinformation; Weidinger et al., 2022 ). Increasing access to LLMs will also have environmental impacts (Strubell et al., 2019 ), especially when more researchers have the ability not only to use these models for inference but also to train them via ecosystems such as Hugging Face.

Furthermore, it is important to be aware of the broader risks that present and future LLMs may pose to society, especially if they are poorly aligned with people’s preferences and values (Bockting et al., 2023 ; Russell, 2019 ). Concerns such as these have motivated research programs into AI alignment from leading AI companies (Leike & Sutskever, 2023 ; Leike et al., 2018 ). They have also led to the use of more explicit human behavioral data for fine-tuning LLMs (e.g., via explicit human feedback on model outputs) to achieve closer alignment with human preferences. As has been pointed out (Irving & Askell, 2019 ; Russell, 2019 ), this endeavor presents a unique opportunity for behavioral scientists, who of course have expertise in collecting high-quality human data.

Future direction for LLMs and behavioral science

This insight points to an interesting future direction at the intersection of behavioral science and language modeling. Although this tutorial has focused on how LLMs can be used as tools for behavioral science research (LLMs \(\rightarrow\) behavioral science), an inverse relationship also holds promise: the use of behavioral science methods to build more interpretable, human-aligned LLMs (behavioral science \(\rightarrow\) LLMs). For instance, as foreshadowed in the section on Text Generation, behavioral experiments can be run on LLMs to provide greater insight into the models’ capabilities (e.g., Binz & Schulz, 2023b; Yax et al., 2023). Likewise, interpretability techniques such as probing can be combined with psycholinguistic methods to assess how aligned the internal representations of LLMs are to people’s semantic representations (Cassani et al., 2023; Sucholutsky et al., 2023; Aeschbach et al., 2024). This relationship is mutually reinforcing (behavioral science \(\rightarrow\) LLMs \(\rightarrow\) behavioral science): LLMs that are more aligned with people may also serve as more plausible and predictive psychological models (e.g., Binz & Schulz, 2023a), and LLMs that are more interpretable may allow for deeper theoretical insights – into their psychological plausibility, but also into human cognition and behavior itself.

LLMs hold immense promise for behavioral science, and open-source frameworks play a crucial role in promoting transparency, reproducibility, and data protection. This tutorial has provided a practical overview of using the open-source Hugging Face ecosystem, covering feature extraction, fine-tuning, and text generation, in order to empower researchers to harness the potential of LLMs for behavioral applications. While acknowledging the current limitations of this approach, we hope that these efforts can help catalyze LLM utilization and offer novel insights and opportunities for future research in behavioral science.

Open practices statement

The data used in all sections are available at github.com/Zak-Hussain/LLM4BeSci.git. Aside from the specific results for Cohere (Cohere-embed-english-v3.0) and ada (text-embedding-ada-002) – these models are behind a paywall – all results can be reproduced using freely available software. None of the analyses were preregistered, as they are presented for demonstration purposes only.

Availability of data and materials

The datasets analyzed for the current tutorial are available in the LLM4BeSci repository, https://github.com/Zak-Hussain/LLM4BeSci.git .

Code availability

The code used for the current tutorial is available in the LLM4BeSci repository, https://github.com/Zak-Hussain/LLM4BeSci.git .

Abdurahman, A., Vu, H., Zou, W., Ungar, L., & Bhatia, S. (2023). A deep learning approach to personality assessment: Generalizing across items and expanding the reach of survey-based research. Journal of Personality and Social Psychology , Advance online publication https://doi.org/10.1037/pspp0000480

Aeschbach, S., Mata, R., & Wulff, D. U. (2024). Mapping the Mind With Free Associations: A Tutorial Using the R Package associator. PsyArXiv. https://doi.org/10.31234/osf.io/ra87s

Aka, A., & Bhatia, S. (2022). Machine learning models for predicting, understanding, and influencing health perception. Journal of the Association for Consumer Research , 7 (2), 142–153. https://doi.org/10.1086/718456

Ali, M., Fromm, M., Thellmann, K., Rutmann, R., Lübbering, M., Leveling, J., ..., Flores-Herr, N. (2023). Tokenizer Choice For LLM Training: Negligible or Crucial? arXiv. https://arxiv.org/abs/2310.08754

Alishahi, A., Chrupała, G., & Linzen, T. (2019). Analyzing and interpreting neural networks for NLP: A report on the first BlackboxNLP workshop. Natural Language Engineering , 25 (4), 543–557. https://doi.org/10.1017/S135132491900024X

An, A., Qian, P., Wilcox, E., Levy, R. (2019). Representation of constituents in neural language models: Coordination phrase as a case study. arXiv preprint arXiv:1909.04625

Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis , 31 (3), 337–351. https://doi.org/10.1017/pan.2023.2

Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , 610–623

Bhatia, S. (2023). Exploring the Sources of Variance in Risky Decision Making with Large Language Models. https://doi.org/10.31234/osf.io/3hrnc

Binz, M., & Schulz, E. (2022). Modeling human exploration through resource-rational reinforcement learning. Advances in Neural Information Processing Systems , 35 , 31755–31768.

Binz, M., & Schulz, E. (2023a). Turning large language models into cognitive models. arXiv , https://arxiv.org/abs/2306.03917

Binz, M., & Schulz, E. (2023b). Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120 (6), e2218523120. https://doi.org/10.1073/pnas.2218523120

Bockting, C. L., Van Dis, E. A. M., Van Rooij, R., Zuidema, W., & Bollen, J. (2023). Living guidelines for generative AI: Why scientists must oversee its use. Nature, 622 (7984), 693–696. https://doi.org/10.1038/d41586-023-03266-1

Bommasani, R., Klyman, K., Longpre, S., Kapoor, S., Maslej, N., Xiong, B., ..., Liang, P. (2023). The Foundation Model Transparency Index. arXiv , https://arxiv.org/abs/2310.12941

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems , 33 , 1877–1901.

Cassani, G., Guenther, F., Attanasio, G., Bianchi, F., & Marelli, M. (2023). Meaning Modulations and Stability in Large Language Models: An Analysis of BERT Embeddings for Psycholinguistic Research. PsyArXiv , https://doi.org/10.31234/osf.io/b45ys

Chae, Y., & Davidson, T. (2023). Large language models for text classification: From zero-shot learning to fine-tuning. OSF. https://osf.io/5t6xz/

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 4299–4307.

Coda-Forno, J., Binz, M., Akata, Z., Botvinick, M., Wang, J. X., & Schulz, E. (2023). Meta-in-context learning in large language models. arXiv. https://arxiv.org/abs/2305.12907

Crossley, S., Heintz, A., Choi, J. S., Batchelor, J., Karimi, M., & Malatinszky, A. (2023). A large-scaled corpus for assessing text readability. Behavior Research Methods , 55 (2), 491–507.

Cutler, A., & Condon, D. M. (2023). Deep lexical hypothesis: Identifying personality structure in natural language. Journal of Personality and Social Psychology , 125 (1), 173–197. https://doi.org/10.1037/pspp0000443

Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., & Pennebaker, J. W. (2023). Using large language models in psychology. Nature Reviews Psychology , 2 , 688–701. https://doi.org/10.1038/s44159-023-00241-5

Devlin, J., Chang, M.W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv , https://arxiv.org/abs/1810.04805

Feng, S. F., Wang, S., Zarnescu, S., & Wilson, R. C. (2021). The dynamics of explore-exploit decisions reveal a signal-to-noise mechanism for random exploration. Scientific Reports , 11 (1), 3077. https://doi.org/10.1038/s41598-021-82530-8

Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology , 32 (3), 221.

Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv , https://arxiv.org/abs/2210.17323

Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives , 19 (4), 25–42. https://doi.org/10.1257/089533005775196732

Gershman, S. J. (2018). Deconstructing the human algorithms for exploration. Cognition , 173 , 34–42. https://doi.org/10.1016/j.cognition.2017.12.014

Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences , 120 (30), e2305016120. https://doi.org/10.1073/pnas.2305016120

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. G. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality , 40 (1), 84–96. https://doi.org/10.1016/j.jrp.2005.08.007

Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Springer.

Günther, F., Rinaldi, L., & Marelli, M. (2019). Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspectives on Psychological Science , 14 (6), 1006–1033.

Haigh, M. (2016). Has the standard Cognitive Reflection Test become a victim of its own success? Advances in Cognitive Psychology , 12 (3), 145–149. https://doi.org/10.5709/acp-0193-5

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ..., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations , https://openreview.net/forum?id=nZeVKeeFYf9

Hussain, Z., Mata, R., & Wulff, D.U. (2023). Novel embeddings improve the prediction of risk perception. PsyArXiv , https://doi.org/10.31234/osf.io/yrjfb

Irving, G., & Askell, A. (2019). AI safety needs social scientists. Distill. https://doi.org/10.23915/distill.00014

Jelinek, F., Mercer, R. L., Bahl, L. R., & Baker, J. K. (1977). Perplexity-a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America , 62 (S1), S63–S63.

Kajonius, P. J., & Johnson, J. A. (2019). Assessing the structure of the Five Factor Model of Personality (IPIP-NEO-120) in the public domain. Europe’s Journal of Psychology , 15 (2), 260–275. https://doi.org/10.5964/ejop.v15i2.1671

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ..., & Amodei, D. (2020). Scaling laws for neural language models. arXiv. https://arxiv.org/abs/2001.08361

Kaufman, S., Rosset, S., Perlich, C., & Stitelman, O. (2012). Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data (TKDD) , 6 (4), 1–21.

Korinek, A. (2023). Language Models and Cognitive Automation for Economic Research. NBER Working Paper Series , (30957) https://doi.org/10.3386/w30957

Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv , https://arxiv.org/abs/1811.07871

Leike, J., & Sutskever, I. (2023). Introducing Superalignment. OpenAI. https://openai.com/blog/introducing-superalignment

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ..., Zettlemoyer, L. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv , https://arxiv.org/abs/1910.13461

Li, H. (2022). Language models: past, present, and future. Communications of the ACM , 65 (7), 56–63. https://doi.org/10.1145/3490443

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ..., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv. https://arxiv.org/abs/1907.11692

Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., ..., & Wei, F. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv preprint arXiv:2402.17764

Merkx, D., & Frank, S. L. (2020). Human sentence processing: Recurrence or attention? arXiv preprint arXiv:2005.09471

Mikolov, T., Chen, K., Corrado, G. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

Mitchell, M. (2023). How do we know how smart AI systems are?. Science , 381 (6654), eadj5957

Mitchell, M., & Krakauer, D. C. (2023). The debate over understanding in AI’s large language models. Proceedings of the National Academy of Sciences , 120 (13), e2215907120.

Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2022). MTEB: Massive Text Embedding Benchmark. arXiv: https://arxiv.org/abs/2210.07316

OpenAI (2023). GPT-4 Technical Report. arXiv. https://openai.com/research/gpt-4

Pelicon, A., Pranjić, M., Miljković, D., Škrlj, B., & Pollak, S. (2020). Zero-shot learning for cross-lingual news sentiment classification. Applied Sciences , 10 (17), 5993. https://doi.org/10.3390/app10175993

Prince, S.J. (2023). Understanding Deep Learning . MIT press

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems , 36

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., & Matena, ..., Liu, P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research , 21 (1), 5485–5551. https://doi.org/10.5555/3455716.3455856

Rathje, S., Mirea, D.M., Sucholutsky, I., Marjieh, R., & Robertson, C. (2023). GPT is an effective tool for multilingual psychological text analysis. PsyArXiv , https://osf.io/preprints/psyarxiv/sekf5/

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing , https://arxiv.org/abs/1908.10084

Rosenbusch, H., Stevenson, C. E., & Van Der Maas, H. L. J. (2023). How Accurate are GPT-3’s Hypotheses About Social Science Phenomena? Digital Society , 2 , 26. https://doi.org/10.1007/s44206-023-00054-2

Russell, S. (2019). Human compatible: Artificial intelligence and the problem of control . Penguin

Sanderson, G. (2019). Neural Networks https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv. https://arxiv.org/abs/1910.01108

Searle, J. R. (1980). Minds, brains, and programs. Behavioral and brain sciences , 3 (3), 417–424.

Siew, C. S., Wulff, D. U., Beckage, N. M., & Kenett, Y. N. (2019). Cognitive network science: A review of research on cognition through the lens of network representations, processes, and dynamics. Complexity , 2019 , 2108423. https://doi.org/10.1155/2019/2108423

Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., ..., & Catanzaro, B. (2022). Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv. https://arxiv.org/abs/2201.11990

Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science , 11 (5), 702–712.

Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv. https://arxiv.org/abs/1906.02243

Su, H., Shi, W., Kasai, J., Wang, Y., Hu, Y., Ostendorf, M., ..., & Yu, T. (2022). One embedder, any task: Instruction-finetuned text embeddings. arXiv. https://arxiv.org/abs/2212.09741

Sucholutsky, I., Muttenthaler, L., Weller, A., Peng, A., Bobu, A., Kim, B., ..., & Griffiths, T. L. (2023). Getting aligned on representational alignment. arXiv. https://arxiv.org/abs/2310.13018

TheBloke (2022). Llama-2-7b-Chat-GPTQ. https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ

TII (2023). Falcon-40B-Instruct: A 40B parameters causal decoder-only model [Accessed: 2023-11-16] https://huggingface.co/tiiuae/falcon-40b-instruct

Törnberg, P. (2023). ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning. arXiv. https://arxiv.org/abs/2304.06588

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ..., & Bhosale, S. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

Tunstall, L., Von Werra, L., & Wolf, T. (2022). Natural language processing with transformers . O’Reilly.

Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433

Van Noorden, R., & Perkel, J. M. (2023). AI and science: what 1,600 researchers think. AI and science. Nature, 621 (7980), 672–675. https://doi.org/10.1038/d41586-023-02980-0

Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7 (1), 1–8. https://doi.org/10.1186/1471-2105-7-91

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems , 30 , 5998–6008.

Vicente-Saez, R., & Martinez-Fuentes, C. (2018). Open Science now: A systematic literature review for an integrated definition. Journal of Business Research , 88 , 428–436. https://doi.org/10.1016/j.jbusres.2017.12.043

Vig, J. (2019). A multiscale visualization of attention in the transformer model. arXiv. https://arxiv.org/abs/1906.05714

Vig, J., & Belinkov, Y. (2019). Analyzing the structure of attention in a transformer language model. arXiv. https://arxiv.org/abs/1906.04284

Wang, T., Roberts, A., Hesslow, D., Le Scao, T., Chung, H.W., Beltagy, I., Launay, J., & Raffel, C. (2022). What language model architecture and pretraining objective works best for zero-shot generalization? In: K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th International Conference on Machine Learning (pp. 22964–22984) https://proceedings.mlr.press/v162/wang22u.html

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding how dimension reduction tools work: an empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization. The Journal of Machine Learning Research , 22 (1), 9129–9201 https://dl.acm.org/doi/abs/10.5555/3546258.3546459

Webb, T., Holyoak, K. J., & Lu, H. (2023). Emergent analogical reasoning in large language models. Nature Human Behaviour , 7 (9), 1526–1541. https://doi.org/10.1038/s41562-023-01659-w

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., ..., & Fedus, W. (2022). Emergent abilities of large language models. arXiv. https://arxiv.org/abs/2206.07682

Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.S., Mellor, J., ..., & Gabriel, I. (2022). Taxonomy of risks posed by language models. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency 214–229 https://doi.org/10.1145/3531146.3533088

Wetzel, L. (2018). Types and Tokens. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Fall 2018) . Metaphysics Research Lab: Stanford University.

Widmann, T., & Wich, M. (2023). Creating and Comparing dictionary, word embedding, and transformer-based models to measure discrete emotions in German political text. Political Analysis , 31 (4), 626–641. https://doi.org/10.1017/pan.2022.15

Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A., & Cohen, J. D. (2014). Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General , 143 (6), 2074–2081. https://doi.org/10.1037/a0038199

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ..., & Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv. https://arxiv.org/abs/1609.08144

Wulff, D.U., & Mata, R. (2023). Automated jingle–jangle detection: Using embeddings to tackle taxonomic incommensurability. https://doi.org/10.31234/osf.io/9h7aw

Yax, N., Anlló, H., & Palminteri, S. (2023). Studying and improving reasoning in humans and machines. arXiv. https://arxiv.org/abs/2309.12485


Acknowledgements

We thank Susannah Goss and Laura Wiles for editing the manuscript. We also thank Ada Aka and Sudeep Bhatia for making their health perception data available for our use. This work was supported by grants from the Swiss National Science Foundation to Rui Mata (204700) and Dirk U. Wulff (197315).

Open Access funding enabled and organized by Projekt DEAL. Swiss National Science Foundation grant (197315) to Dirk U. Wulff; Swiss National Science Foundation grant (204700) to Rui Mata.

Author information

Authors and Affiliations

University of Basel, Basel, Switzerland

Zak Hussain, Rui Mata & Dirk U. Wulff

Max Planck Institute for Human Development, Berlin, Germany

Zak Hussain & Dirk U. Wulff

Max Planck Institute for Biological Cybernetics, Tübingen, Germany

Marcel Binz

Helmholtz Center for Computational Health, Neuherberg, Germany


Contributions

Conceptualization: ZH, DUW. Methodology: ZH, MB, DUW. Formal analysis: ZH, MB, DUW. Writing - original draft preparation: ZH (sections Relating personality measures, Predicting health perception, The Hugging Face ecosystem, Open questions and future directions), MB (Predicting repeated choice), RM (Introduction), DUW (A primer on transformer-based language models). Writing - review & editing: ZH, MB, RM, DUW.

Corresponding author

Correspondence to Zak Hussain .

Ethics declarations

Conflict of interest/competing interests.

The authors declare that they have no competing interests.

Ethics approval

Not applicable.

Consent to participate

Consent for publication

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Hussain, Z., Binz, M., Mata, R. et al. A tutorial on open-source large language models for behavioral science. Behav Res (2024). https://doi.org/10.3758/s13428-024-02455-8


Accepted: 27 May 2024

Published: 15 August 2024

DOI: https://doi.org/10.3758/s13428-024-02455-8


Keywords

  • Large language models
  • Behavioral science
  • Hugging Face

An LLM-based AI agent to assist with smart deployments across business operations

The AI agent empowering higher autonomous level Catalyst is building an advanced LLM-based technical architecture designed to simplify the integration of AI into a variety of CSP operational processes


Commercial context

As CSPs worldwide aim to integrate AI into their operations and business services, it's crucial that these deployments are strategically planned and adaptable to long-term challenges and varied use cases. Yet many AI deployments remain siloed within specific applications or suffer from limited efficiency – and this fragmentation creates difficulties in assessing return on investment (ROI) and hinders staff engagement with AI technologies.

This highlights the need for a more unified and strategic approach to AI adoption, which could serve as an industry-wide standard for integrations. The emergence of large language models (LLMs) has introduced a new opportunity in the form of ‘AI agents’. An LLM-based agent can assist CSPs in integrating various AI deployments, even when facing highly specific, complex, and unique requirements. If such an agent were available industry-wide, CSPs could significantly enhance operational efficiency, improve customer experiences, and unlock new revenue streams.

How the solution works

Establishing a framework for such an AI agent has been the mission of the AI agent empowering higher autonomous level Catalyst. The project group has developed an advanced LLM-based technical architecture designed to simplify integration into a variety of CSP operational processes. This framework is underpinned by TM Forum’s Value Operations Framework, which includes its model and execution system, and other assets, including AIOps, DT4DI, and AOMM. These help ensure that the solution can be applied – and adapt – to various business scenarios.

Part of the project team’s main focus was the creation of a fine-tuned LLM that can be combined with digital twins and domain-specific knowledge – something critical to enabling the AI agent to comprehend and operate effectively within the telco environment. These AI agents can be applied across multiple value streams within the telecom operational cycle and customer experience lifecycle, enhancing everything from provisioning and installation to after-sales service and incident management.

Wider application and value

The implementation of this LLM-based AI solution holds considerable promise for CSPs across the globe. For instance, China Telecom plans to use the agent to significantly enhance its AI and cloud network operations. It’s also expected to support 35 new beyond-connectivity products for over 190 million wireline broadband subscribers, assist with the automation of 75% of cloud network incident handling, and improve the efficiency of high-risk command audits by 50%.

Moreover, China Telecom expects it will reduce hardware-related on-site visits for home broadband installations and maintenance by 50%. The Chinese CSP also expects an increase in the number of base stations maintained per capita by 21%, and a cut in nationwide operations and maintenance (O&M) costs by 200 million RMB annually. There are considerable environmental benefits too, with the CSP claiming that this will help achieve an annual energy saving of 760 million kWh across 5 million base stations and 90 million kWh across 2,500 IDC facilities, while boosting the autonomous operation level to L4+ in certain cloud network scenarios.

Similarly, HKT's use of the agent for VVIP user experience assurance has yielded impressive results, with issue handling efficiency improved by 85%, reduced processing times from hours to minutes, and halving the bad customer experience indicator from 2.7% to 1.3%. Overall fault resolution efficiency has seen a 30% improvement, leading to annual O&M cost savings of 15 million RMB and achieving higher customer satisfaction ratings than competitors.

According to Project Leader, China Telecom’s Zhang Le, “the broader implications of this technology are vast, and provide CSPs with a strong framework to innovate and transform their operations. By harnessing the power of LLM-based AI agents, CSPs can achieve significant cost savings, enhance operational efficiency, and deliver superior customer experiences, all the while contributing to sustainability goals through substantial energy savings too.”



Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections.

Dicklesworthstone/llm_aided_ocr


LLM-Aided OCR Project

Introduction

The LLM-Aided OCR Project is an advanced system designed to significantly enhance the quality of Optical Character Recognition (OCR) output. By leveraging cutting-edge natural language processing techniques and large language models (LLMs), this project transforms raw OCR text into highly accurate, well-formatted, and readable documents.

Example Outputs

To see what the LLM-Aided OCR Project can do, check out these example outputs:

  • Original PDF
  • Raw OCR Output
  • LLM-Corrected Markdown Output

Features

  • PDF to image conversion
  • OCR using Tesseract
  • Advanced error correction using LLMs (local or API-based)
  • Smart text chunking for efficient processing
  • Markdown formatting option
  • Header and page number suppression (optional)
  • Quality assessment of the final output
  • Support for both local LLMs and cloud-based API providers (OpenAI, Anthropic)
  • Asynchronous processing for improved performance
  • Detailed logging for process tracking and debugging
  • GPU acceleration for local LLM inference

Detailed Technical Overview

PDF Processing and OCR

PDF to Image Conversion

  • Function: convert_pdf_to_images()
  • Uses pdf2image library to convert PDF pages into images
  • Supports processing a subset of pages with max_pages and skip_first_n_pages parameters
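A rough sketch of this conversion step (not the repository's exact implementation) could look like this:

from pdf2image import convert_from_path

def convert_pdf_to_images(pdf_path, max_pages=0, skip_first_n_pages=0):
    # Convert a subset of PDF pages to PIL images, optionally skipping leading pages
    first_page = skip_first_n_pages + 1
    last_page = first_page + max_pages - 1 if max_pages else None
    return convert_from_path(pdf_path, first_page=first_page, last_page=last_page)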

OCR Processing

  • Function: ocr_image()
  • Utilizes pytesseract for text extraction
  • Converts image to grayscale
  • Applies binary thresholding using Otsu's method
  • Performs dilation to enhance text clarity
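The preprocessing and OCR steps listed above could be sketched roughly as follows (again, an illustrative approximation rather than the repository's exact code):

import cv2
import numpy as np
import pytesseract
from PIL import Image

def ocr_image(image: Image.Image) -> str:
    gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)          # grayscale conversion
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)    # Otsu binarization
    dilated = cv2.dilate(binary, np.ones((2, 2), np.uint8))           # slight dilation to thicken strokes
    return pytesseract.image_to_string(dilated)                       # Tesseract text extraction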

Text Processing Pipeline

Chunk Creation

  • The process_document() function splits the full text into manageable chunks
  • Uses sentence boundaries for natural splits
  • Implements an overlap between chunks to maintain context
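A simplified version of sentence-based chunking with overlap might look like this; the chunk size and overlap values are illustrative assumptions.

import re

def create_chunks(text: str, max_chars: int = 4000, overlap_sentences: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    for sentence in sentences:
        if current and sum(len(s) for s in current) + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry a small overlap into the next chunk
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks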

Error Correction and Formatting

  • Core function: process_chunk()
  • a. Error Correction:
  • Uses LLM to fix OCR-induced errors
  • Maintains original structure and content
  • b. Markdown Formatting (optional):
  • Converts text to proper markdown format
  • Handles headings, lists, emphasis, and more

Duplicate Content Removal

  • Implemented within the markdown formatting step
  • Identifies and removes exact or near-exact repeated paragraphs
  • Preserves unique content and ensures text flow

Header and Page Number Suppression (Optional)

  • Can be configured to remove or distinctly format headers, footers, and page numbers

LLM Integration

Flexible LLM Support

  • Supports both local LLMs and cloud-based API providers (OpenAI, Anthropic)
  • Configurable through environment variables

Local LLM Handling

  • Function: generate_completion_from_local_llm()
  • Uses llama_cpp library for local LLM inference
  • Supports custom grammars for structured output

API-based LLM Handling

  • Functions: generate_completion_from_claude() and generate_completion_from_openai()
  • Implements proper error handling and retry logic
  • Manages token limits and adjusts request sizes dynamically

Asynchronous Processing

  • Uses asyncio for concurrent processing of chunks when using API-based LLMs
  • Maintains order of processed chunks for coherent final output

Token Management

Token Estimation

  • Function: estimate_tokens()
  • Uses model-specific tokenizers when available
  • Falls back to approximate_tokens() for quick estimation

Dynamic Token Adjustment

  • Adjusts max_tokens parameter based on prompt length and model limits
  • Implements TOKEN_BUFFER and TOKEN_CUSHION for safe token management

Quality Assessment

  • Function: assess_output_quality()
  • Compares original OCR text with processed output
  • Uses LLM to provide a quality score and explanation

Logging and Error Handling

  • Comprehensive logging throughout the codebase
  • Detailed error messages and stack traces for debugging
  • Suppresses HTTP request logs to reduce noise

Configuration and Customization

The project uses a .env file for easy configuration. Key settings include:

  • LLM selection (local or API-based)
  • API provider selection
  • Model selection for different providers
  • Token limits and buffer sizes
  • Markdown formatting options

Output and File Handling

  • Raw OCR Output : Saved as {base_name}__raw_ocr_output.txt
  • LLM Corrected Output : Saved as {base_name}_llm_corrected.md or .txt

The script generates detailed logs of the entire process, including timing information and quality assessments.

Requirements

  • Python 3.12+
  • Tesseract OCR engine
  • PDF2Image library
  • PyTesseract
  • OpenAI API (optional)
  • Anthropic API (optional)
  • Local LLM support (optional, requires compatible GGUF model)

Installation

  • Install Pyenv and Python 3.12 (if needed):
  • Set up the project:

Install Tesseract OCR engine (if not already installed):

  • For Ubuntu: sudo apt-get install tesseract-ocr
  • For macOS: brew install tesseract
  • For Windows: Download and install from GitHub

Set up your environment variables in a .env file:

Place your PDF file in the project directory.

Update the input_pdf_file_path variable in the main() function with your PDF filename.

Run the script:

The script will generate several output files, including the final post-processed text.

How It Works

The LLM-Aided OCR project employs a multi-step process to transform raw OCR output into high-quality, readable text:

PDF Conversion : Converts input PDF into images using pdf2image .

OCR : Applies Tesseract OCR to extract text from images.

Text Chunking : Splits the raw OCR output into manageable chunks for processing.

Error Correction : Each chunk undergoes LLM-based processing to correct OCR errors and improve readability.

Markdown Formatting (Optional): Reformats the corrected text into clean, consistent Markdown.

Quality Assessment : An LLM-based evaluation compares the final output quality to the original OCR text.

Code Optimization

  • Concurrent Processing : When using API-based models, chunks are processed concurrently to improve speed.
  • Context Preservation : Each chunk includes a small overlap with the previous chunk to maintain context.
  • Adaptive Token Management : The system dynamically adjusts the number of tokens used for LLM requests based on input size and model constraints.

Configuration

The project uses a .env file for configuration. Key settings include:

  • USE_LOCAL_LLM : Set to True to use a local LLM, False for API-based LLMs.
  • API_PROVIDER : Choose between "OPENAI" or "CLAUDE".
  • OPENAI_API_KEY , ANTHROPIC_API_KEY : API keys for respective services.
  • CLAUDE_MODEL_STRING , OPENAI_COMPLETION_MODEL : Specify the model to use for each provider.
  • LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS : Set the context size for local LLMs.

Output Files

The script generates several output files:

  • {base_name}__raw_ocr_output.txt : Raw OCR output from Tesseract.
  • {base_name}_llm_corrected.md : Final LLM-corrected and formatted text.

Limitations and Future Improvements

  • The system's performance is heavily dependent on the quality of the LLM used.
  • Processing very large documents can be time-consuming and may require significant computational resources.

Contributing

Contributions to this project are welcome! Please fork the repository and submit a pull request with your proposed changes.

This project is licensed under the MIT License.



Computer Science > Computation and Language

Title: A Comprehensive Overview of Large Language Models

Abstract: Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research.
Subjects: Computation and Language (cs.CL)


COMMENTS

  1. LL.M. Program

    The LL.M. (Master of Laws) program is a one-year degree program that typically includes 180 students from some 65 countries. The Graduate Program is interested in attracting intellectually curious and thoughtful candidates from a variety of legal systems and backgrounds and with various career plans. Harvard's LL.M. students include lawyers working in firms, government officials, […]

  2. Master of Laws (LLM)

    Design Your Own LLM. You will choose from 300+ courses to plan a curriculum that meets your intellectual and professional interests. You can choose to specialize in one or two areas, or take a broad range of classes. You also will have the chance to write a paper in close consultation with a professor, or expand a typical research assignment into a master's thesis.

  3. LLM Program

    The Michigan LLM is a full-time program, and all students begin classes in late August and graduate in early May. LLM students are permitted to enroll in most Law School courses, including several clinics. We offer two courses that are exclusive to LLM students (a constitutional law course and a research and writing course).

  4. Research LLM

    Osgoode's Research LLM is a full-time, research-intensive program that is ideal for students who want to pursue a specific area of legal study in depth, including those who are considering a PhD. Students conduct their research under the supervision of an Osgoode faculty member. The Research LLM does not qualify students to practise law in ...

  5. Master of Laws (LLM)

    On behalf of our Graduate Studies Committee, welcome to the University of Chicago Law School! The University of Chicago Law School uniquely offers the combination of a small (70-80 students) and diverse (more than 25 nationalities) LLM program with a real sense of community among our students. The rigorous and elite academic atmosphere at the Law School is part of the experience both inside ...

  6. Law LLM by Research

    This makes the LLM by Research a particularly attractive option for those wishing to undertake postgraduate research on a part-time basis, while pursuing legal practice or other employment. Find out more about compulsory and optional courses. We link to the latest information available. Please note that this may be for a previous academic year ...

  7. Master of Laws (Research)

    The Master of Laws by Research (LLM) at the University of Sydney Law School is a pathway to a number of careers, including tertiary education, policy development, advanced research, and specialisation for employment in government, inter-governmental and international organisations, and civil society organisations. ...

  8. LLM Researcher and Scientist Roadmap: A Guide to Mastering Large

    Additionally, the article delves into advanced optimization techniques, covering quantization and inference optimization. By navigating through the detailed Table of Contents, readers gain a thorough understanding of the essential components involved in LLM research, empowering them to embark on a journey toward expertise in the field.

  9. LLM By research

    An LLM by Research is intended to develop the student's legal research and writing skills by directing them towards planning and executing a large piece of academic research - usually around 30,000-40,000 words - on their chosen field of law. Although this dissertation will be completed under specialist supervision, the student will be ...

  10. Topics, Authors, and Institutions in Large Language Model Research

    Abstract. Large language models (LLMs) are dramatically influencing AI research, spurring discussions on what has changed so far and how to shape the field's future. To clarify such questions, we analyze a new dataset of 16,979 LLM-related arXiv papers, focusing on recent trends in 2023 vs. 2018-2022. First, we study disciplinary shifts: LLM ...

  11. LLM By Research

    LLM by Research students are given access to dedicated hot desk space. The specialist law library is housed in the same building as the Law School's teaching and staff rooms. The library contains not only a comprehensive collection of Scots and English sources but also a substantial collection of European Union, Commonwealth and American ...

  12. EXPRESS: AI-Human Hybrids for Marketing Research: Leveraging LLMs as

    For quantitative research, the LLM correctly picks the answer direction and valence, with the quality of synthetic data significantly improving through few-shot learning and retrieval-augmented generation. The authors demonstrate the value of the AI-human hybrid by collaborating with a Fortune 500 food company and replicating a 2019 qualitative ...

  13. (PDF) A Comprehensive Overview of Large Language Models

    LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is ...

  14. Understanding LLMs: A Comprehensive Overview from Training to Inference

    Additionally, some research efforts introduce specialized data from professional domains, such as code or scientific data, to enhance LLM capabilities in those fields. Leveraging diverse sources of text data for LLM training can significantly enhance the model's generalization capabilities.

  15. LLM Research

    Our Goals. To work as a partner with our patients in conducting research that benefits the practice of medicine and overall patient care. To provide the best mix of expertise and service to achieve our research goals. To continuously improve our methods and knowledge to provide innovative solutions to our patients' needs.

  16. LLMs develop their own understanding of reality as their language

    Ask a large language model (LLM) like GPT-4 to smell a rain-soaked campsite, and it'll politely decline. ... This research indicates that the LLM develops an internal model of the simulated reality, even though it was never trained to develop this model," says Martin Rinard, an MIT professor in EECS, CSAIL member, and senior author on the ...

  17. A study of generative large language model for medical research and

    Specifically, we adopted a fixed-LLM prompt-tuning strategy [42] to attach a continuous embedding (i.e., virtual tokens) to the input sequence [virtual tokens; x; y] as a soft prompt to control the ... (A minimal code sketch of this soft-prompt idea appears after this list.)

  18. A Comprehensive Overview of Large Language Models

    LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field.

  19. Top 176 LLM Programs in the United States 2024

    Georgetown Law, in Washington, D.C., has one of the most well-established graduate programs in the United States and offers an unparalleled opportunity for lawyers to broaden and deepen their understanding of law through advanced study. Our LL.M., S.J.D. and Certificate students come from more than 60 countries and close to 150 different law ...

  20. About

    LLM Research has conducted a variety of clinical studies in adult and pediatric patients and has contributed to the development of new therapies for healthy participants and patients struggling with diseases in multiple therapeutic areas including but not limited to pulmonology, gastroenterology, gynecology, oncology, hematology, hepatology, endocrinology, dermatology, and psychiatry.

  21. LLM Research

    An LLB degree, a bachelor's degree with honours, or an appropriate postgraduate diploma. Your application should include a CV, an academic record (transcript) and a short proposal, also known as an expression of interest. The short proposal is a document of 7 - 10 pages that outlines the applicant's proposed research and the research ...

  22. Large language models pose a risk to society and need tighter

    Associate Professor and Research Associate Dr. Chris Russell of the Oxford Internet Institute said, "While LLMs are built so that using them feels like a conversation with an honest and accurate ...

  23. Large language model

    A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training ...

  24. Research Masters Degree (LLM)

    The degree is awarded based on the strength of a dissertation of approximately 40,000 words, which is examined by an internal and an external examiner. The topic must fall within the law, justice, and sustainability focus. The faculty must have sufficient expertise to provide effective study supervision.

  25. Experiments reveal LLMs develop their own understanding of reality as

    This research indicates that the LLM develops an internal model of the simulated reality, even though it was never trained to develop this model," says Martin Rinard, an MIT professor in EECS ...

  26. Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

    This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to ...

  27. A tutorial on open-source large language models for ...

    Large language models (LLMs) have the potential to revolutionize behavioral science by accelerating and improving the research cycle, from conceptualization to data analysis. Unlike closed-source solutions, open-source frameworks for LLMs can enable transparency, reproducibility, and adherence to data protection standards, which gives them a crucial advantage for use in behavioral science. To ...

  28. An LLM-based AI agent to assist with smart deployments across business

    An LLM-based agent can assist CSPs in integrating various AI deployments, even when facing highly specific, complex, and unique requirements. If such an agent were available industry-wide, CSPs could significantly enhance operational efficiency, improve customer experiences, and unlock new revenue streams.

  29. Dicklesworthstone/llm_aided_ocr

    The LLM-Aided OCR Project is an advanced system designed to significantly enhance the quality of Optical Character Recognition (OCR) output. By leveraging cutting-edge natural language processing techniques and large language models (LLMs), this project transforms raw OCR text into highly accurate, well-formatted, and readable documents. (A simplified sketch of this kind of correction loop appears after this list.)

  30. [2307.06435] A Comprehensive Overview of Large Language Models

    Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics ...
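
Item 17 above describes fixed-LLM prompt tuning: the base model stays frozen, and only a short sequence of continuous "virtual token" embeddings is trained and prepended to the input, so the model effectively sees [virtual tokens; x; y]. The following is a minimal PyTorch sketch of that idea, not code from the cited study; the class name, the toy linear stand-in for the language model, and the dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepend trainable virtual-token embeddings to a frozen model's inputs."""

    def __init__(self, frozen_lm: nn.Module, embed: nn.Embedding, n_virtual: int = 20):
        super().__init__()
        self.lm = frozen_lm
        self.embed = embed
        # the LLM and its token embeddings are kept fixed
        for p in list(self.lm.parameters()) + list(self.embed.parameters()):
            p.requires_grad = False
        d_model = embed.embedding_dim
        # the only trainable parameters: one continuous embedding per virtual token
        self.soft_prompt = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tokens = self.embed(input_ids)                                   # (batch, seq, d_model)
        prompt = self.soft_prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.lm(torch.cat([prompt, tokens], dim=1))               # [virtual tokens; x; y]

# Toy usage: a linear layer stands in for the frozen language model.
vocab_size, d_model = 1000, 64
toy_lm = nn.Linear(d_model, vocab_size)
model = SoftPromptWrapper(toy_lm, nn.Embedding(vocab_size, d_model), n_virtual=8)
logits = model(torch.randint(0, vocab_size, (2, 10)))
print(logits.shape)   # torch.Size([2, 18, 1000]): 8 virtual tokens + 10 real tokens

In an actual prompt-tuning run, only the soft_prompt parameter would be handed to the optimizer, and the attention mask would be extended to cover the prepended virtual tokens.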
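Item 29 above summarizes the LLM-Aided OCR project, which post-processes raw OCR output with a language model. The sketch below shows only the general shape of such a correction loop, chunking text and asking a model to fix recognition errors; it is not the repository's actual pipeline, and the prompt wording, chunk size, and model name are illustrative assumptions (the OpenAI Python client, openai>=1.0, is assumed to be installed with an API key configured).

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CORRECTION_PROMPT = (
    "The following text came from OCR and may contain recognition errors, "
    "broken hyphenation, and stray line breaks. Return a corrected, readable "
    "version without adding or removing content:\n\n"
)

def correct_ocr_text(raw_text: str, chunk_chars: int = 4000, model: str = "gpt-4o-mini") -> str:
    """Correct OCR output chunk by chunk and stitch the results back together."""
    chunks = [raw_text[i:i + chunk_chars] for i in range(0, len(raw_text), chunk_chars)]
    fixed = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": CORRECTION_PROMPT + chunk}],
        )
        fixed.append(response.choices[0].message.content)
    return "\n".join(fixed)

Splitting on fixed character counts is the simplest possible chunking; a real pipeline would more likely split on sentence or page boundaries so that corrections do not straddle chunk edges.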