Deep Learning Word Difficulty Dataset
Data Science and Analytics
Free
About
This dataset provides an indicator of word complexity, an essential input for text-simplification systems. It supports research into deep learning-based models for predicting word difficulty, formulated as a binary classification problem (SIMPLE versus DIFFICULT). A key objective of that research was to remove reliance on the frequency of previously acquired words when measuring difficulty. The associated research paper also analyses a convolutional neural network-based prediction model that operates at the character level and compares its efficiency with that of other models. The data instances provided here were used to train and evaluate traditional machine learning models for this task.
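As a rough illustration of what "operating at the character level" can mean, the sketch below turns a word into a fixed-size one-hot matrix, the kind of input a character-level convolutional network typically consumes. The alphabet, the catch-all slot for unknown characters, and the maximum length of 20 are assumptions made for this example; they are not taken from the dataset or the associated paper.

```python
import string

# Assumed alphabet: lowercase ASCII letters plus one catch-all slot.
ALPHABET = string.ascii_lowercase
UNKNOWN_INDEX = len(ALPHABET)  # index 26 for characters outside the alphabet
MAX_LEN = 20                   # hypothetical fixed input width


def encode_word(word, max_len=MAX_LEN):
    """Return a max_len x 27 one-hot matrix (list of lists) for `word`.

    Words longer than max_len are truncated; shorter words are padded
    with all-zero rows.
    """
    matrix = [[0] * (len(ALPHABET) + 1) for _ in range(max_len)]
    for pos, ch in enumerate(word.lower()[:max_len]):
        idx = ALPHABET.find(ch)
        matrix[pos][idx if idx >= 0 else UNKNOWN_INDEX] = 1
    return matrix


encoded = encode_word("cat")
print(len(encoded), len(encoded[0]))  # rows and columns of the input matrix
```

A convolutional layer would then slide filters along the rows of this matrix, so the model sees spelling patterns directly instead of relying on word frequency.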
Columns
- Word Length: The length of the word.
- Freq_HAL: Hyperspace Analogue to Language Frequency Norms.
- Log_Freq_HAL: The log value of Freq_HAL.
- I_Mean_RT: The Mean Response Time Score.
- I_Zscore: This score determines the difficulty of the word, ranging between 0 (SIMPLE) and 1 (DIFFICULT). Further details on obtaining the difficulty label from this score are available in the associated research paper.
- I_SD: The standard deviation of the response times.
- Obs: Observations count.
- I_Mean_Accuracy: The accuracy score.
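The columns above could be read and turned into binary labels along the following lines. This is a minimal sketch under stated assumptions: the header names are assumed to match the column names listed here exactly, the two sample rows are synthetic and not taken from the dataset, and the 0.5 cut-off on I_Zscore is a hypothetical placeholder, since the actual labelling rule is described in the associated research paper.

```python
import csv
import io

# Synthetic rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = """Word Length,Freq_HAL,Log_Freq_HAL,I_Mean_RT,I_Zscore,I_SD,Obs,I_Mean_Accuracy
3,100000,11.51,550.0,0.10,120.0,34,1.00
12,25,3.22,910.0,0.85,300.0,30,0.70
"""

THRESHOLD = 0.5  # hypothetical cut-off; the paper defines the real rule


def load_rows(handle, threshold=THRESHOLD):
    """Parse CSV rows and attach a binary label (0 = SIMPLE, 1 = DIFFICULT)."""
    rows = []
    for raw in csv.DictReader(handle):
        zscore = float(raw["I_Zscore"])
        rows.append({
            "length": int(raw["Word Length"]),
            "log_freq": float(raw["Log_Freq_HAL"]),
            "zscore": zscore,
            "label": 1 if zscore >= threshold else 0,
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print([row["label"] for row in rows])
```

To use the real file, pass an open file handle (for example `open(path, newline="")`, with `path` pointing to the downloaded CSV) in place of the `io.StringIO` wrapper.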
Distribution
The data is provided in CSV format and comprises 40,481 data instances. A sample file will be uploaded separately to the platform. The dataset is structured to facilitate analysis of word difficulty metrics.
Usage
This dataset is ideal for training and evaluating machine learning models focused on word difficulty prediction. It is particularly useful for developing and enhancing text-simplification systems and for research into linguistic complexity. The data can be utilised by the global data science community to answer various questions related to computational linguistics and natural language processing.
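A baseline for this kind of binary prediction task might look like the sketch below: a small logistic regression trained with stochastic gradient descent in plain Python. Logistic regression is chosen here only as a familiar example of a traditional model; the listing does not say which models were used. The feature pairs (scaled word length, scaled log frequency) and their labels are synthetic values invented for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, lr=0.1, epochs=500):
    """Fit weights and bias with per-sample gradient descent on log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (DIFFICULT) or 0 (SIMPLE) for one feature vector."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Synthetic (scaled word length, scaled log frequency) pairs, not real data.
features = [[0.2, 0.9], [0.3, 0.8], [0.25, 0.85],
            [0.8, 0.2], [0.9, 0.1], [0.7, 0.3]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

On real data the features would need to be scaled to comparable ranges first, and the model should be evaluated on held-out rows, not on the rows it was trained on.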
Coverage
The dataset's region coverage is global. It was listed on 17th June 2025. There are no specific notes on demographic scope, but it is applicable wherever text analysis and simplification are relevant.
License
CC BY
Who Can Use It
This dataset is intended for data scientists, machine learning engineers, researchers, and developers working in areas such as natural language processing (NLP), computational linguistics, and AI. It is especially valuable for those developing or improving text-simplification systems and anyone needing a quantitative measure of word complexity.
Dataset Name Suggestions
- Word Difficulty Prediction Dataset
- Linguistic Complexity Scores
- Text Difficulty Metrics
- NLP Word Complexity
- Deep Learning Word Difficulty
Attributes
Original Data Source: WORD DIFFICULTY PREDICTION