Language Difficulty Control Dataset
Education & Learning Analytics
About
This dataset is a collection of English synonyms and figures of speech, created to address the scarcity of such grouped data. It was originally developed to support research on Sentence Restructuring with User-Controlled Difficulty using NLP. The synonyms are extracted from WordNet database version 3.0, with more than 100,000 unique lemmas and a similar number of unique synonym values. Word usage counts derived from the Google Ngrams Viewer (1800-2018, English Language Corpus) are attached to the synonyms, grounding them in real-world usage. The dataset is free and suited to education, learning analytics, and a range of Natural Language Processing (NLP) applications.
Columns
- ID: A unique identifier for each entry.
- part_of_speech: Indicates the grammatical classification of the word (e.g., noun, verb).
- synonyms: A dictionary in which each key is a synonym of the main word and each value is a numerical metric associated with that synonym. The number of synonyms varies from row to row, so the file scales horizontally rather than having a fixed width.
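The synonyms column described above can be read with a few lines of Python. This is a minimal sketch, not the dataset's documented loading procedure: it assumes each cell is stored as the text form of a Python dictionary (the file format is not specified), and the helper names and the sample cell are made up for illustration.

```python
import ast


def parse_synonyms(raw):
    """Parse one synonyms cell into a dict mapping synonym -> numeric value.

    Assumption: the cell is either already a dict or the text form of one.
    """
    if isinstance(raw, dict):
        return dict(raw)
    parsed = ast.literal_eval(raw)  # safe evaluation of a literal, no code execution
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary of synonym -> value")
    return parsed


def rank_by_value(synonyms):
    """Return (synonym, value) pairs sorted from highest to lowest value."""
    return sorted(synonyms.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical cell contents; these numbers are not taken from the dataset.
sample_cell = "{'large': 300, 'big': 500, 'sizable': 20}"
ranked = rank_by_value(parse_synonyms(sample_cell))
```

Because rows hold a varying number of synonyms, parsing each cell into its own dictionary avoids forcing the data into a fixed number of columns.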
Distribution
The dataset contains 9238 entries, as observed in the label counts, each holding key-value pairs of synonyms and their associated word counts. It includes more than 100,000 distinct lemmas and synonym values. The file format is not specified, but the structure supports analysis of synonyms and their historical word usage.
Usage
This dataset is particularly well-suited for:
- Natural Language Processing (NLP) tasks, including text-to-text generation and classification.
- Research focused on sentence restructuring and controlling text difficulty.
- Linguistic analysis and exploring word relationships.
- Educational applications and learning analytics, providing insights into English vocabulary and usage.
- Analysing word frequency and historical usage trends from 1800 to 2018.
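As an illustration of the difficulty-control use case listed above, the sketch below picks a replacement word from a synonym dictionary according to a target difficulty. It rests on an assumption that is not stated in the dataset description: that rarer words (lower usage counts) are harder. The function name `pick_synonym`, the 0-to-1 difficulty scale, and the sample counts are all hypothetical.

```python
def pick_synonym(counts, difficulty):
    """Choose a synonym for a target difficulty between 0.0 and 1.0.

    Assumption: a lower usage count means a harder word, so 0.0 selects
    the most common synonym and 1.0 selects the rarest one.
    """
    if not counts:
        raise ValueError("counts must contain at least one synonym")
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be between 0.0 and 1.0")
    # Most common first; ties are broken alphabetically for stable results.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    index = round(difficulty * (len(ordered) - 1))
    return ordered[index]


# Hypothetical usage counts; these numbers are not taken from the dataset.
example_counts = {"big": 500, "large": 300, "sizable": 20}
easy_word = pick_synonym(example_counts, 0.0)
hard_word = pick_synonym(example_counts, 1.0)
```

Mapping the difficulty onto a rank rather than a raw count keeps the choice independent of the very different scales that usage counts can take across words.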
Coverage
The dataset's scope is global, focusing on the English language. It provides word usage data spanning from 1800 to 2018, sourced from Corpus 26 of the Google Ngrams Viewer. The lexical content, including synonyms, is derived from WordNet database version 3.0.
License
CC0
Who Can Use It
- NLP Researchers and Developers: For creating and refining language models, text generation systems, and text classification tools.
- Academics: Especially those in linguistics, computational linguistics, and digital humanities, for studies on lexical semantics and historical language use.
- Educators and Curriculum Designers: To develop resources for vocabulary building, understanding figures of speech, and analysing literary texts.
- Data Scientists: Interested in text mining, natural language understanding, and building applications that require synonym information or word frequency analysis.
Dataset Name Suggestions
- English Synonyms and Word Usage Metrics
- WordNet-Derived Lexical Database for NLP
- Historical English Synonyms and Figures of Speech
- Language Difficulty Control Dataset
Attributes
Original Data Source: English Synonyms Dataset with Figures of Speech