English Language Origin Dataset
Data Science and Analytics
Free
About
This dataset is designed to support the development of models that distinguish native from non-native English writing. It targets the subtle yet detectable differences in grammar, diction, and common phrasing between the two. The collection was built by segmenting numerous writing tasks from non-native speakers and adding thousands of native English phrases extracted from reputable sources such as the BBC and the New York Times, producing a labelled dataset suited to natural language processing (NLP) applications. The objective is to enable automated detection of whether a sentence or paragraph was written by a native or a non-native writer.
Columns
The dataset is structured to provide sentences alongside their corresponding labels, indicating whether the text is from a native or non-native speaker.
- Text/Sentence: Contains the English sentence or paragraph.
- Label: Specifies whether the accompanying text is 'Native' or 'Non-Native'.
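As a rough illustration of this two-column layout, the sketch below reads text/label pairs with Python's standard `csv` module and counts the labels. The header names `text` and `label`, the helper `load_rows`, and the three sample rows are assumptions made up for the example; the actual CSV header and contents may differ.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; they are not taken from the dataset.
# The header names "text" and "label" are assumptions.
SAMPLE_CSV = """text,label
"The committee will publish its findings next week.",Native
"I am agree with this opinion very much.",Non-Native
"He have visited many country in last year.",Non-Native
"""

ALLOWED_LABELS = {"Native", "Non-Native"}


def load_rows(handle, text_col="text", label_col="label"):
    """Read (text, label) pairs, skipping rows with empty text or unknown labels."""
    rows = []
    for record in csv.DictReader(handle):
        text = (record.get(text_col) or "").strip()
        label = (record.get(label_col) or "").strip()
        if text and label in ALLOWED_LABELS:
            rows.append((text, label))
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

For the real file, the same function could be given an open file handle in place of the in-memory sample, with `text_col` and `label_col` set to whatever the header actually uses.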
Distribution
The dataset is provided as a CSV file. The total number of rows is not stated, but the file combines a large collection of non-native writing tasks with thousands of native English phrases. The original non-native and native English texts are also uploaded alongside the CSV.
Usage
This dataset is ideal for a variety of applications, particularly in the field of natural language processing and machine learning. Its primary use case is for training models to detect whether a sentence or paragraph originates from a native or non-native English speaker. This could be beneficial for:
- Developing AI models for language assessment.
- Linguistic research on second language acquisition.
- Creating tools for content analysis and categorisation based on writing style.
- Educational platforms aiming to provide feedback on written English.
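The primary use case, training a model to label text as native or non-native, can be sketched with a very small bag-of-words Naive Bayes classifier. This is a toy baseline written with the standard library only, not the method used by the dataset's authors; the class `NaiveBayes`, the `tokenize` helper, and the four training sentences are made up for illustration. A real project would more likely rely on an established machine learning library and a held-out test split.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences; they are not taken from the dataset.
train_texts = [
    "The committee will publish its findings next week.",
    "She has visited many countries over the past year.",
    "I am agree with this opinion very much.",
    "He have visited many country in last year.",
]
train_labels = ["Native", "Native", "Non-Native", "Non-Native"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("I am agree with you"))
print(model.predict("The committee will publish findings"))
```

Word counts alone capture vocabulary more than grammar, so a baseline like this mainly serves as a reference point for richer features such as character n-grams or part-of-speech patterns.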
Coverage
The dataset's regional coverage is global. No time range is given for the original data collection; the dataset was listed on 26/06/2025. The demographic scope covers both native English speakers (drawing on sources such as the BBC and the New York Times) and a variety of non-native English speakers whose work comes from aggregated writing tasks.
License
CC0
Who Can Use It
This dataset is primarily intended for professionals and enthusiasts in data science and analytics, especially those working with NLP techniques.
- Data Scientists and NLP Engineers: To train machine learning models for language origin detection.
- Linguists and Researchers: For studies on language variations and second language characteristics.
- Developers: To integrate language detection capabilities into applications such as writing assistance tools or educational software.
Dataset Name Suggestions
- English Language Origin Dataset
- Native vs. Non-Native English Classifier
- NLP Language Attribution Data
- Global English Styles
- Bilingualism Detection Dataset
Attributes
Original Data Source: Native or non Native