Opendatabay APP

Educational Writing Assessment Dataset

Education & Learning Analytics

Tags and Keywords

Education

Text

NLP

Regression

Transfer Learning

Free

About

This dataset is a collection of student essays labelled with both overall and analytic language proficiency scores. It includes approximately 6,500 unique writing samples, each scored on cohesion, syntax, vocabulary, phraseology, grammar, and conventions. It also provides demographic information about the English Language Learner (ELL) writers, including economic status, gender, grade level (8 to 12), and race/ethnicity. The dataset supports research in corpus linguistics and Natural Language Processing (NLP), particularly the development and evaluation of models that assess overall language proficiency and more granular writing skills.

Columns

The dataset contains the following labelled columns:
  • text_id: A unique identifier for each essay.
  • full_text: The complete content of the student essay.
  • grade: The student's academic grade level at the time of writing, typically ranging from 8 to 12.
  • prompt: The specific topic or writing assignment given to the student.
  • Overall: The overall holistic essay score, typically ranging from 1.00 to 5.00.
  • Cohesion: A score indicating the essay's logical flow and connection of ideas, typically ranging from 1.00 to 5.00.
  • Syntax: A score reflecting the grammatical structure and sentence variety used, typically ranging from 1.00 to 5.00.
  • Vocabulary: A score for the range, precision, and appropriate use of words, typically ranging from 1.00 to 5.00.
  • Phraseology: A score assessing the use of natural and idiomatic phrases, typically ranging from 1.00 to 5.00.
  • Grammar: A score for adherence to grammatical rules, typically ranging from 1.00 to 5.00.
  • Conventions: A score related to spelling, punctuation, and capitalisation.
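
As a rough illustration of how the columns listed above might be checked after download, the sketch below reads a CSV with Python's standard `csv` module, confirms the expected headers are present, and flags scores outside the typical 1.00 to 5.00 range. The two-row sample is made up and stands in for the real file; `load_essays` and `out_of_range` are hypothetical helper names. Conventions is parsed but not range-checked, because the listing does not state a range for it.

```python
import csv
import io

# Column names as listed in this section.
EXPECTED_COLUMNS = [
    "text_id", "full_text", "grade", "prompt", "Overall", "Cohesion",
    "Syntax", "Vocabulary", "Phraseology", "Grammar", "Conventions",
]
# Score columns for which the listing states a typical 1.00-5.00 range.
RANGED_SCORES = ["Overall", "Cohesion", "Syntax", "Vocabulary", "Phraseology", "Grammar"]


def load_essays(handle):
    """Read essay rows from an open CSV handle and convert scores to floats."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        for col in RANGED_SCORES + ["Conventions"]:
            row[col] = float(row[col])
        rows.append(row)
    return rows


def out_of_range(rows, low=1.0, high=5.0):
    """Return (text_id, column, value) for each score outside [low, high]."""
    return [
        (row["text_id"], col, row[col])
        for row in rows
        for col in RANGED_SCORES
        if not low <= row[col] <= high
    ]


# Made-up sample rows (assumption): the second row has an Overall score above 5.
SAMPLE_CSV = (
    "text_id,full_text,grade,prompt,Overall,Cohesion,Syntax,Vocabulary,"
    "Phraseology,Grammar,Conventions\n"
    "a1,An example essay.,9,Distance learning,3.5,3.0,3.5,4.0,3.0,3.5,3.0\n"
    "a2,Another example.,11,Success and failure,5.5,4.0,4.0,4.5,4.0,4.0,4.0\n"
)

rows = load_essays(io.StringIO(SAMPLE_CSV))
problems = out_of_range(rows)
print(problems)
```

With the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.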

Distribution

The dataset is provided as labelled text data, typically in CSV format. It contains 6,482 unique records, one per writing sample. The file size is not specified.

Usage

This dataset is ideally suited for:
  • Developing and training NLP models for automated essay scoring and feedback generation.
  • Researching language proficiency assessment and the linguistic features indicative of different proficiency levels.
  • Studying the writing development of English Language Learners across various demographic groups.
  • Creating educational tools for writing instruction and evaluation.
  • Corpus linguistic analysis of student writing patterns and errors.
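
To make the automated essay scoring use case more concrete, here is a minimal baseline sketch, assuming essay length alone is used to predict the Overall score. It fits a one-variable least-squares line and reports the root mean squared error. The essays and scores are made up and deliberately linear, so the error comes out near zero; real essays would not behave this way, and a practical system would use far richer text features. `word_count`, `fit_line`, and `rmse` are hypothetical helper names.

```python
import math


def word_count(text):
    """Count whitespace-separated tokens in an essay."""
    return len(text.split())


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(preds, targets):
    """Root mean squared error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


# Made-up (full_text, Overall) pairs (assumption), not taken from the dataset.
essays = [
    ("word " * 50, 2.0),
    ("word " * 100, 3.0),
    ("word " * 150, 4.0),
    ("word " * 200, 5.0),
]

xs = [word_count(text) for text, _ in essays]
ys = [score for _, score in essays]

slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]
error = rmse(preds, ys)
print(slope, intercept, error)
```

A length-only baseline like this is mainly useful as a floor: any model trained on the essay text should be expected to beat it on held-out data.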

Coverage

The dataset's coverage is global, comprising writing samples from English Language Learners. It includes demographic details such as economic status, gender, grade level (8-12), and race/ethnicity, allowing for studies on diverse groups. The essays are based on various prompts, including 'Distance learning' and 'Success and failure'.
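
Group-level comparisons of this kind can be sketched as a simple mean-by-group calculation. The example below uses only the `grade` and `prompt` columns, since the names of the demographic columns are not given in this listing; the three rows are made up, and `mean_score_by` is a hypothetical helper name.

```python
from collections import defaultdict


def mean_score_by(rows, key, score="Overall"):
    """Average a score column within each distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[score]))
    return {value: sum(scores) / len(scores) for value, scores in groups.items()}


# Made-up rows (assumption) using column names from the Columns section.
rows = [
    {"grade": "8", "prompt": "Distance learning", "Overall": "3.0"},
    {"grade": "8", "prompt": "Success and failure", "Overall": "4.0"},
    {"grade": "12", "prompt": "Distance learning", "Overall": "3.5"},
]

by_grade = mean_score_by(rows, "grade")
by_prompt = mean_score_by(rows, "prompt")
print(by_grade)
print(by_prompt)
```

The same function could be pointed at any other categorical column in the file once its actual name is known.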

License

CC-BY-NC.

Who Can Use It

This dataset is particularly valuable for:
  • Academics and researchers in fields like computational linguistics, education, and second language acquisition.
  • Data scientists and machine learning engineers working on natural language processing tasks related to text classification and regression.
  • Educational institutions and assessment organisations aiming to improve automated writing evaluation systems.
  • Developers of AI and LLM solutions for language learning and educational technology.

Dataset Name Suggestions

  • ELLIPSE Student Writing Proficiency Corpus
  • English Language Learner Essay Assessment
  • Student Writing Proficiency Scores Dataset
  • Annotated ELL Essay Corpus for NLP
  • Educational Writing Assessment Dataset

Attributes

Original Data Source: ELLIPSE Corpus

Listing Stats

LISTED

24/06/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0
