Opendatabay APP

English Language Origin Dataset

Data Science and Analytics

Tags and Keywords

Text, NLP, English, Mining

Free

About

This dataset is designed to support the development of models that distinguish native from non-native English writing. It targets the subtle yet detectable differences in grammar, diction, and common phrasing between the two. The collection was built by segmenting numerous writing tasks from non-native speakers and adding thousands of native English phrases extracted from reputable sources such as the BBC and the New York Times, producing a labelled dataset suited to natural language processing (NLP) applications. The objective is to enable automated detection of whether a sentence or paragraph was written by a native or a non-native writer.

Columns

The dataset is structured to provide sentences alongside their corresponding labels, indicating whether the text is from a native or non-native speaker.
  • Text/Sentence: Contains the English sentence or paragraph.
  • Label: Specifies whether the accompanying text is 'Native' or 'Non-Native'.
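
The two-column layout described above can be read with Python's standard `csv` module. The sketch below is illustrative only: the header names `text` and `label` are assumptions (the listing does not state the exact column headers), and the sample rows are made up rather than taken from the dataset.

```python
import csv
import io

# Assumed header names; the real CSV may use different ones.
TEXT_COLUMN = "text"
LABEL_COLUMN = "label"
VALID_LABELS = {"Native", "Non-Native"}


def read_labelled_rows(handle):
    """Return (text, label) pairs, skipping rows with empty text or unknown labels."""
    rows = []
    for record in csv.DictReader(handle):
        text = (record.get(TEXT_COLUMN) or "").strip()
        label = (record.get(LABEL_COLUMN) or "").strip()
        if text and label in VALID_LABELS:
            rows.append((text, label))
    return rows


# Made-up rows for illustration only; they are not taken from the dataset.
sample_csv = io.StringIO(
    "text,label\n"
    '"The committee has postponed the vote.",Native\n'
    '"I am agree with this opinion.",Non-Native\n'
    '"",Native\n'
    '"Unlabelled example.",Unknown\n'
)
rows = read_labelled_rows(sample_csv)
print(rows)
```

With a real download, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`, and the two column constants adjusted to match the actual header row.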

Distribution

The dataset is provided as a CSV file. The total number of rows is not stated, but the file aggregates a large set of non-native writing tasks and thousands of native English phrases. The original non-native and native English texts are also uploaded alongside the CSV.
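
Because the row count is not published, it may be worth checking the size and class balance after download. A minimal sketch, assuming the labels have already been read into a list; the function name `summarise_labels` and the sample labels are hypothetical.

```python
from collections import Counter


def summarise_labels(labels):
    """Return the total row count, per-label counts, and each label's share."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Avoid dividing by zero when the input is empty.
    shares = {label: count / total for label, count in counts.items()} if total else {}
    return total, counts, shares


# Made-up labels for illustration only.
labels = ["Native", "Non-Native", "Non-Native", "Native", "Non-Native"]
total, counts, shares = summarise_labels(labels)
print(total, dict(counts), shares)
```

A strongly skewed share for one label would suggest stratified sampling or class weighting before training.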

Usage

This dataset suits natural language processing and machine learning work. Its primary use case is training models to detect whether a sentence or paragraph originates from a native or non-native English speaker. This could be beneficial for:
  • Developing AI models for language assessment.
  • Linguistic research on second language acquisition.
  • Creating tools for content analysis and categorisation based on writing style.
  • Educational platforms aiming to provide feedback on written English.
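
As a rough illustration of the primary use case, the sketch below trains a tiny multinomial Naive Bayes classifier on word counts using only the standard library. It is a toy baseline under stated assumptions, not the dataset author's method: the class name `NaiveBayes`, the tokenizer, and the four training sentences are all made up for the example, and a real project would more likely use an established library with a held-out evaluation set.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences for illustration only.
train = [
    ("The committee has postponed the vote until next week.", "Native"),
    ("She has been working here since last spring.", "Native"),
    ("I am agree with this opinion very much.", "Non-Native"),
    ("He explained me the problem since two hours.", "Non-Native"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
prediction_a = model.predict("I am agree with the teacher.")
prediction_b = model.predict("The committee postponed the vote.")
print(prediction_a, prediction_b)
```

Four sentences are far too few to learn anything general; the point is only to show the shape of a text-to-label pipeline that the full CSV could feed.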

Coverage

The dataset's regional coverage is global. No time range is given for the original data collection; the dataset was listed on 26/06/2025. The demographic scope covers both native English speakers (drawn from sources such as the BBC and the New York Times) and a variety of non-native English speakers whose writing tasks were aggregated.

License

CC0

Who Can Use It

This dataset is primarily intended for professionals and enthusiasts in data science and analytics, especially those working with NLP techniques.
  • Data Scientists and NLP Engineers: To train machine learning models for language origin detection.
  • Linguists and Researchers: For studies on language variations and second language characteristics.
  • Developers: To integrate language detection capabilities into applications such as writing assistance tools or educational software.

Dataset Name Suggestions

  • English Language Origin Dataset
  • Native vs. Non-Native English Classifier
  • NLP Language Attribution Data
  • Global English Styles
  • Bilingualism Detection Dataset

Attributes

Original Data Source: Native or non Native

Listing Stats

Views: 0
Downloads: 0
Listed: 26/06/2025
Region: Global
Quality (Universal Data Quality Score, UDQS): 5 / 5
Version: 1.0
