PERSUADE 2.0 Academic Writing Corpus
Education & Learning Analytics
Free
About
This dataset, known as PERSUADE 2.0, is a collection of over 25,000 argumentative essays written by students in the 6th to 12th grades in the United States. It was developed by The Learning Agency and Vanderbilt University. Building upon the PERSUADE 1.0 corpus, this version adds holistic essay scores and proficiency scores for individual argumentative and discourse elements within each essay. Unlike its predecessor, PERSUADE 2.0 includes all essays, not just a training set. The essays were produced in response to 15 different prompts across two writing tasks: independent and source-based. The dataset also includes detailed individual and demographic information for each student writer, along with the initial annotations for argumentative and discourse elements from the prior version. Each essay underwent a rigorous human annotation process for argumentative and discourse elements and their interrelationships, using a double-blind rating system with full adjudication by expert raters. The annotation rubric was developed in-house and refined through feedback from teacher panels and a research advisory board comprising experts in writing, discourse processing, linguistics, and machine learning. The discourse elements are adapted from established argumentative frameworks.
Columns
The dataset contains several key files. persuade_2.0_human_scores_demo_id_github.csv is a primary source for scores and demographic data, and persuade_corpus_1.0.csv contains discourse element annotations.
From persuade_2.0_human_scores_demo_id_github.csv:
- essay_id_comp: A unique identifier for each essay, used in Feedback Prize competitions.
- full_text: The complete text of the student's essay.
- holistic_essay_score: The overall score awarded to the essay (on a scale from 1.00 to 6.00).
- word_count: The total number of words in the essay's full text.
- prompt_name: A brief name identifying the essay prompt.
- task: Indicates whether the essay was an 'Independent' or 'Text dependent' writing task.
- assignment: The full text of the writing prompt given to the student.
- source_text: The title or reference of the source material used for text-dependent tasks.
- gender: The declared gender of the student (Female or Male).
- grade_level: The student's academic grade level (e.g., 8, 10).
- ell_status: Indicates if the student is an English Language Learner ('Yes' or 'No').
- race_ethnicity: The student's self-reported race or ethnicity (e.g., White, Hispanic/Latino).
- economically_disadvantaged: Status indicating if the student is economically disadvantaged.
- student_disability_status: Status indicating if the student is identified as having a disability.
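As a minimal sketch, the columns above can be read with Python's standard csv module and checked for empty values. The two sample rows are invented for illustration and include only a subset of the columns. The helper name missing_rate is hypothetical, and treating an empty string as a missing value is an assumption about how the CSV encodes absent entries.

```python
import csv
import io

# Invented two-row sample with a subset of the documented columns.
# For the real file, replace io.StringIO(SAMPLE) with
# open("persuade_2.0_human_scores_demo_id_github.csv", newline="", encoding="utf-8").
SAMPLE = """essay_id_comp,holistic_essay_score,task,source_text,grade_level
A1,4,Independent,,8
B2,3,Text dependent,Example source title,
"""


def missing_rate(rows, column):
    """Return the fraction of rows whose value for `column` is empty.

    Assumption: a missing value is stored as an empty (or blank) string.
    """
    if not rows:
        return 0.0
    empty = sum(1 for row in rows if not (row.get(column) or "").strip())
    return empty / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rates = {column: missing_rate(rows, column) for column in rows[0]}

for column, rate in rates.items():
    print(f"{column}: {rate:.0%} missing")
```

On the real file, the same loop gives a per-column view of which fields are fully populated and which are sparse.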
From persuade_corpus_1.0.csv:
- essay_id_comp: The essay ID.
- competition_set: Indicates if the essay was part of the training or test set in the Feedback Prize.
- full_text: The complete text of the essay.
- discourse_id: An ID for each identified discourse element.
- discourse_start: The character position where the discourse element begins in the essay.
- discourse_end: The character position where the discourse element ends in the essay.
- discourse_text: The actual text of the discourse element.
- discourse_type: The human annotation for the type of discourse element, including:
  - Lead: An introduction designed to capture attention and point towards the thesis.
  - Position: An opinion or conclusion on the main question.
  - Claim: A statement supporting the main position.
  - Counterclaim: A statement that refutes another claim or provides an opposing reason to the position.
  - Rebuttal: A statement that refutes a counterclaim.
  - Evidence: Ideas or examples that support claims, counterclaims, rebuttals, or the position.
  - Concluding Summary: A statement that restates the position and claims at the end.
  - Unannotated: Segments not identified as specific discourse elements.
- discourse_type_num: A numerical representation for the discourse element within the essay.
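The character offsets can be used to recover each discourse element from the essay text. The sketch below uses two invented rows and assumes that discourse_start and discourse_end are zero-based offsets with an exclusive end. This description does not confirm that convention, so the comparison against discourse_text is worth running on the real file before relying on the offsets. The helper name slice_element is hypothetical.

```python
import csv
import io

# Invented rows that mimic the annotation file layout (subset of columns).
SAMPLE = '''essay_id_comp,full_text,discourse_start,discourse_end,discourse_text,discourse_type
A1,"Schools should start later. Students need sleep.",0,27,"Schools should start later.",Position
A1,"Schools should start later. Students need sleep.",28,48,"Students need sleep.",Claim
'''


def slice_element(row):
    """Cut the discourse element out of full_text using its offsets.

    Assumption: offsets are zero-based with an exclusive end. They are
    parsed through float() in case the file stores them as "27.0".
    """
    start = int(float(row["discourse_start"]))
    end = int(float(row["discourse_end"]))
    return row["full_text"][start:end]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Rows where the sliced span disagrees with the stored discourse_text.
mismatches = [
    row for row in rows
    if slice_element(row).strip() != row["discourse_text"].strip()
]

# Group (discourse_type, text) pairs by essay, in file order.
elements_by_essay = {}
for row in rows:
    elements_by_essay.setdefault(row["essay_id_comp"], []).append(
        (row["discourse_type"], slice_element(row))
    )

print(len(mismatches), elements_by_essay)
```

A non-empty mismatches list on real data would suggest a different offset convention (for example, an inclusive end) rather than an annotation error.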
Distribution
The PERSUADE 2.0 corpus consists of over 25,000 argumentative essays. The primary data file, persuade_2.0_human_scores_demo_id_github.csv, is approximately 75.96 MB in size. The dataset is provided in CSV format. Many columns, such as essay_id_comp, full_text, holistic_essay_score, word_count, prompt_name, task, assignment, gender, and race_ethnicity, have valid entries for all of the roughly 26,000 records. Some demographic fields, such as grade_level and ell_status, have a small percentage of missing values (around 4-5%), while economically_disadvantaged and student_disability_status have about 20% missing. source_text is missing for approximately 50% of entries, corresponding to independent writing tasks where no source text was provided.
Usage
This dataset is well-suited for a variety of applications, particularly in educational research, natural language processing (NLP), and machine learning. Ideal uses include:
- Developing and evaluating automated essay scoring models.
- Research into argumentation mining and the identification of argumentative structures in student writing.
- Analysing patterns in student writing proficiency across different demographic groups.
- Studying the effectiveness of various writing prompts and tasks.
- Training machine learning models to classify or extract argumentative and discourse elements from text.
- Exploring the relationship between student demographics and writing performance.
- Educational technology development focused on writing feedback and instruction.
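For the demographic analyses listed above, one simple starting point is the mean holistic score per group. The sketch below uses invented rows. The helper name mean_score_by is hypothetical, and skipping rows with an empty group or score is an assumption about how missing entries should be handled, not a recommendation from the dataset authors.

```python
import csv
import io
from collections import defaultdict

# Invented rows with a subset of the documented columns.
SAMPLE = """essay_id_comp,holistic_essay_score,ell_status
A1,4,No
B2,3,Yes
C3,5,No
D4,2,
"""


def mean_score_by(rows, group_column):
    """Return {group value: mean holistic_essay_score}.

    Rows with an empty group value or an empty score are skipped
    (an assumption about how to treat missing data).
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [score sum, row count]
    for row in rows:
        group = (row.get(group_column) or "").strip()
        score = (row.get("holistic_essay_score") or "").strip()
        if not group or not score:
            continue
        totals[group][0] += float(score)
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
means = mean_score_by(rows, "ell_status")
print(means)
```

The same function can be called with gender, grade_level, or race_ethnicity as the grouping column. Raw group means do not control for prompt or task, so differences between groups should be read with that in mind.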
Coverage
The dataset's geographic scope is focused on students in the United States. It covers 6th-12th grade students. The demographic information includes:
- Gender: Male and Female students are almost equally represented.
- Grade Level: Data is available for six different grade levels.
- English Language Learner Status: Information is available for both ELL and non-ELL students, with the majority being non-ELL.
- Race/Ethnicity: Six categories are represented, with 'White' and 'Hispanic/Latino' being the most prevalent.
- Economic Status: Data differentiates between economically disadvantaged and non-disadvantaged students.
- Disability Status: It indicates whether a student is identified as having a disability.
The time range of essay collection is not specified, and no further updates to the dataset are expected.
License
CC BY-NC-SA 4.0
Who Can Use It
- Researchers in educational assessment, linguistics, discourse analysis, and natural language processing seeking to understand and model argumentative writing.
- Educators and curriculum developers interested in student writing development, assessment rubrics, and the teaching of argumentation.
- Data scientists and machine learning engineers working on text classification, regression tasks for scoring, and information extraction from educational texts.
- Developers of AI tools for writing support and feedback systems.
Dataset Name Suggestions
- Student Argumentative Essays 6-12
- PERSUADE 2.0 Academic Writing Corpus
- US K-12 Argumentation Dataset
- Annotated Student Persuasive Essays
- Student Essay Discourse Elements
Attributes
Original Data Source: PERSUADE 2.0 Academic Writing Corpus