Opendatabay APP

Language Difficulty Control Dataset

Education & Learning Analytics

Tags and Keywords

  • NLP
  • English
  • Text Classification
  • Text-to-text Generation
  • Sentence Similarity

Free

About

This dataset is a collection of English synonyms and figures of speech, created to address the scarcity of such grouped data. It was initially developed to support research in Sentence Restructuring with User-Controlled Difficulty using NLP. The synonyms are extracted from WordNet database version 3.0 and cover over 100,000 unique lemmas and an equivalent number of unique synonym values. Each synonym is paired with a word usage count derived from the Google Ngrams Viewer (1800-2018, English Language Corpus), so synonyms can be compared by how often they appear in real text. The dataset is free and suited to education, learning analytics, and a range of Natural Language Processing (NLP) applications.

Columns

  • ID: A unique identifier for each entry.
  • part_of_speech: Indicates the grammatical classification of the word (e.g., noun, verb).
  • synonyms: A dictionary in which each key is a synonym of the main word and each value is a numerical metric associated with that synonym. The number of synonyms varies from row to row, so the rows are of uneven width (the file is scaled horizontally).
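
The listing does not show how the synonyms dictionary is serialised in the file. As a minimal sketch, assuming each cell holds a Python-style dictionary literal and that the numerical metric is a usage count, a cell could be parsed and ranked as follows. The sample cell and its numbers are made up for illustration and are not taken from the dataset.

```python
import ast


def parse_synonyms(cell: str) -> dict:
    """Parse one synonyms cell into a {synonym: metric} dictionary.

    Assumption: the cell is a Python-style dict literal. If the real file
    uses JSON or another encoding, swap in the matching parser.
    """
    parsed = ast.literal_eval(cell)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary of synonym -> metric")
    return {str(word): value for word, value in parsed.items()}


def rank_by_usage(synonyms: dict) -> list:
    """Return the synonyms ordered from highest to lowest metric."""
    return sorted(synonyms, key=synonyms.get, reverse=True)


# Hypothetical cell; the words and counts are invented for this example.
sample_cell = "{'big': 900, 'large': 1200, 'sizable': 15}"
parsed = parse_synonyms(sample_cell)
ranked = rank_by_usage(parsed)
print(ranked)
```

Because the number of synonyms varies by row, working with a dictionary per row avoids assuming a fixed number of columns.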

Distribution

The dataset comprises 9238 rows of key-value pairs representing synonyms and their associated word counts, as observed in the label counts. It includes more than 100,000 distinct lemmas and synonym values, which gives it broad lexical coverage. Specific file format details are not provided, but the data is structured to support analysis of synonyms and their historical word usage.

Usage

This dataset is particularly well-suited for:
  • Natural Language Processing (NLP) tasks, including text-to-text generation and classification.
  • Research focused on sentence restructuring and controlling text difficulty.
  • Linguistic analysis and exploring word relationships.
  • Educational applications and learning analytics, providing insights into English vocabulary and usage.
  • Analysing word frequency and historical usage trends from 1800 to 2018.
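
To illustrate the difficulty-control use case, the sketch below picks a synonym according to a requested difficulty level. It assumes that rarer words (lower usage counts) are harder, which is a common heuristic and not necessarily the method used in the original research. The function names, the `lexicon` structure, and all counts are hypothetical.

```python
def pick_synonym(synonyms: dict, difficulty: float) -> str:
    """Choose a synonym for a difficulty in [0, 1].

    0.0 selects the most frequently used synonym (assumed easiest),
    1.0 selects the least frequently used one (assumed hardest).
    """
    if not synonyms:
        raise ValueError("no synonyms to choose from")
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be between 0 and 1")
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(synonyms, key=lambda word: (-synonyms[word], word))
    index = round(difficulty * (len(ranked) - 1))
    return ranked[index]


def restyle(sentence: str, lexicon: dict, difficulty: float) -> str:
    """Replace each word found in the lexicon with a synonym of the
    requested difficulty; other words are left unchanged."""
    words = sentence.split()
    swapped = [
        pick_synonym(lexicon[word], difficulty) if word in lexicon else word
        for word in words
    ]
    return " ".join(swapped)


# Hypothetical lexicon: main word -> {synonym: usage count}.
lexicon = {"big": {"big": 900, "large": 1200, "sizable": 15}}

easy = restyle("a big house", lexicon, 0.0)
hard = restyle("a big house", lexicon, 1.0)
print(easy)
print(hard)
```

A real pipeline would also need to respect part of speech and word sense, which this whitespace-based substitution ignores.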

Coverage

The dataset's scope is global, focusing on the English language. It provides word usage data spanning from 1800 to 2018, sourced from Corpus 26 of the Google Ngrams Viewer. The lexical content, including synonyms, is derived from WordNet database version 3.0.

License

CC0

Who Can Use It

  • NLP Researchers and Developers: For creating and refining language models, text generation systems, and text classification tools.
  • Academics: Especially those in linguistics, computational linguistics, and digital humanities, for studies on lexical semantics and historical language use.
  • Educators and Curriculum Designers: To develop resources for vocabulary building, understanding figures of speech, and analysing literary texts.
  • Data Scientists: Interested in text mining, natural language understanding, and building applications that require synonym information or word frequency analysis.

Dataset Name Suggestions

  • English Synonyms and Word Usage Metrics
  • WordNet-Derived Lexical Database for NLP
  • Historical English Synonyms and Figures of Speech
  • Language Difficulty Control Dataset

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0

Download Dataset in CSV Format