Deep Learning Word Difficulty Dataset

Data Science and Analytics

Tags and Keywords

Computer, Games, Text, NLP, Languages

Free

About

This dataset provides an indicator of word complexity, a key input for text-simplification systems. It comes from research on deep learning-based models for predicting word difficulty, formulated as a binary classification problem. A key objective of that work was to measure difficulty without relying on the frequency of previously acquired words. The research also evaluates a character-level convolutional neural network and compares its efficiency with that of other models. The dataset contains the data instances used to train and evaluate traditional machine learning models for this task.
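To make the idea of a character-level model concrete, here is a minimal sketch of how a word could be turned into the fixed-size input that a one-dimensional convolution typically consumes. It is not taken from the associated paper: the alphabet, the padding scheme, and the names `encode_word`, `one_hot`, and `MAX_LEN` are all assumptions chosen for illustration.

```python
import string

# Assumed alphabet: lowercase ASCII letters. Index 0 is reserved for padding
# and for any character outside the alphabet.
ALPHABET = string.ascii_lowercase
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 16  # hypothetical fixed input length, not a value from the paper


def encode_word(word, max_len=MAX_LEN):
    """Map a word to a fixed-length list of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in word.lower()[:max_len]]
    # Right-pad with zeros so every word has the same length.
    return indices + [0] * (max_len - len(indices))


def one_hot(indices, vocab_size=len(ALPHABET) + 1):
    """Expand indices into one-hot rows (one row per character position)."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in indices]
```

Calling `one_hot(encode_word("cab"))` yields a 16 x 27 matrix under these assumptions; a convolutional model would slide its filters along the 16 character positions.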

Columns

  • Word Length: The length of the word.
  • Freq_HAL: Hyperspace Analogue to Language Frequency Norms.
  • Log_Freq_HAL: The log value of Freq_HAL.
  • I_Mean_RT: The Mean Response Time Score.
  • I_Zscore: This score determines the difficulty of the word, ranging between 0 (SIMPLE) and 1 (DIFFICULT). Further details on obtaining the difficulty label from this score are available in the associated research paper.
  • I_SD: The I_SD Score.
  • Obs: Observations count.
  • I_Mean_Accuracy: The accuracy score.
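
As a rough sketch of how these columns might be handled in code, the snippet below converts one CSV row into typed fields and derives a binary label from I_Zscore. The header names are assumed to match the list above exactly, and the 0.5 threshold is only a placeholder; the actual rule for obtaining the label is described in the associated paper. `WordRecord`, `parse_row`, and `difficulty_label` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WordRecord:
    word_length: int
    freq_hal: float
    log_freq_hal: float
    i_mean_rt: float
    i_zscore: float
    i_sd: float
    obs: int
    i_mean_accuracy: float


def _num(text):
    # Tolerate thousands separators such as "1,234" (an assumption about the file).
    return float(text.replace(",", ""))


def parse_row(row):
    """Convert one CSV row (header -> string value) into typed fields."""
    return WordRecord(
        word_length=int(_num(row["Word Length"])),
        freq_hal=_num(row["Freq_HAL"]),
        log_freq_hal=_num(row["Log_Freq_HAL"]),
        i_mean_rt=_num(row["I_Mean_RT"]),
        i_zscore=_num(row["I_Zscore"]),
        i_sd=_num(row["I_SD"]),
        obs=int(_num(row["Obs"])),
        i_mean_accuracy=_num(row["I_Mean_Accuracy"]),
    )


def difficulty_label(score, threshold=0.5):
    # threshold=0.5 is a placeholder, not the rule used in the paper.
    return "DIFFICULT" if score >= threshold else "SIMPLE"
```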

Distribution

The data is provided in CSV format and comprises 40,481 data instances. A sample file will be uploaded to the platform separately. The dataset is structured to facilitate analysis of word difficulty metrics.
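
A quick sanity check after downloading could look like the sketch below, which reads the CSV with Python's standard `csv` module and compares the row count with the 40,481 instances stated here. The filename in the usage comment is hypothetical, as are the helper names `load_rows` and `row_count_matches`.

```python
import csv
import io

EXPECTED_ROWS = 40_481  # instance count stated in the listing


def load_rows(handle):
    """Read all rows from an open CSV file into a list of dicts."""
    return list(csv.DictReader(handle))


def row_count_matches(rows, expected=EXPECTED_ROWS):
    """Return True when the number of data rows equals the expected count."""
    return len(rows) == expected


# Usage with the downloaded file (the filename is an assumption):
# with open("word_difficulty.csv", newline="", encoding="utf-8") as f:
#     rows = load_rows(f)
#     print(row_count_matches(rows))

# Self-contained demo on a tiny in-memory CSV with placeholder values.
demo_rows = load_rows(io.StringIO("Obs,I_Zscore\n34,0.2\n30,0.8\n"))
```

Because `load_rows` accepts any file-like object, the same function works on the real file and on small in-memory fixtures.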

Usage

This dataset is suited to training and evaluating machine learning models for word difficulty prediction. It is particularly useful for developing and improving text-simplification systems and for research into linguistic complexity. More broadly, it can support work on questions in computational linguistics and natural language processing.
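
As one possible starting point for such a model, the sketch below trains a small logistic regression with batch gradient descent in pure Python. It is a generic baseline, not the method from the associated research; the choice of features (for example, standardised word length and log frequency), the learning rate, and the epoch count are all assumptions.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (DIFFICULT) or 0 (SIMPLE) for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0
```

Features should be standardised before training so that a single learning rate suits all of them; for real experiments a library implementation would normally replace this hand-written loop.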

Coverage

The dataset's region coverage is global. It was listed on 17th June 2025. There are no specific notes on demographic scope, but it is applicable wherever text analysis and simplification are relevant.

License

CC BY

Who Can Use It

This dataset is intended for data scientists, machine learning engineers, researchers, and developers working in areas such as natural language processing (NLP), computational linguistics, and AI. It is especially valuable for those developing or improving text-simplification systems and anyone needing a quantitative measure of word complexity.

Dataset Name Suggestions

  • Word Difficulty Prediction Dataset
  • Linguistic Complexity Scores
  • Text Difficulty Metrics
  • NLP Word Complexity
  • Deep Learning Word Difficulty

Attributes

Original Data Source: WORD DIFFICULTY PREDICTION

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 17/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
