Deep Learning Word Difficulty Dataset

Data Science and Analytics

Tags and Keywords

Computer, Games, Text, NLP, Languages

About

This dataset provides an indicator of word complexity, an essential input for text-simplification systems. The associated research explores deep learning-based models for predicting word difficulty, formulated as a binary classification problem, with a key objective of removing the reliance on the frequency of previously acquired words as the measure of difficulty. It also analyses a convolutional neural network-based prediction model that operates at the character level and compares its efficiency with that of other models. The dataset contains the data instances used to train and evaluate traditional machine learning models for this task.
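
A character-level model needs each word turned into a fixed-size numeric input. The listing does not describe the encoding used in the research, so the following is only a minimal sketch of one common choice, one-hot encoding. The alphabet, the maximum length of 16, and the zero-padding scheme are all assumptions made for illustration.

```python
from typing import List

# Assumed alphabet and maximum length; neither is specified by the listing.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
MAX_LEN = 16
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_word(word: str, max_len: int = MAX_LEN) -> List[List[int]]:
    """Return a max_len x len(ALPHABET) one-hot matrix for a word.

    Characters outside the alphabet become all-zero rows, words longer than
    max_len are truncated, and shorter words are padded with all-zero rows.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(word.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


encoded = encode_word("cat")
print(len(encoded), len(encoded[0]))  # rows and columns of the matrix
print(encoded[0].index(1))            # alphabet index of the first character
```

A matrix like this could then be fed to a one-dimensional convolution over the character positions, which is the general shape of a character-level CNN.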

Columns

Word Length: The length of the word.
Freq_HAL: Hyperspace Analogue to Language Frequency Norms.
Log_Freq_HAL: The log value of Freq_HAL.
I_Mean_RT: The mean response time (RT) score.
I_Zscore: This score determines the difficulty of the word, ranging between 0 (SIMPLE) and 1 (DIFFICULT). Further details on obtaining the difficulty label from this score are available in the associated research paper.
I_SD: The SD score (likely the standard deviation of the response times).
Obs: Observations count.
I_Mean_Accuracy: The mean accuracy score.
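
Two of the columns are derived quantities, and a short sketch may make them concrete. The code below shows how a binary label could be obtained from I_Zscore and how Log_Freq_HAL relates to Freq_HAL. The 0.5 cut-off is a hypothetical placeholder, since the actual rule is defined in the research paper, and the natural logarithm is an assumption because the listing only says "log".

```python
import math

# Hypothetical cut-off: the listing defers to the research paper for the
# actual rule, so 0.5 is only a placeholder.
ASSUMED_THRESHOLD = 0.5


def difficulty_label(i_zscore: float, threshold: float = ASSUMED_THRESHOLD) -> int:
    """Map a score in [0, 1] to 0 (SIMPLE) or 1 (DIFFICULT)."""
    if not 0.0 <= i_zscore <= 1.0:
        raise ValueError(f"score outside [0, 1]: {i_zscore}")
    return 1 if i_zscore >= threshold else 0


def log_freq_hal(freq_hal: float) -> float:
    """Natural log of Freq_HAL; the base is an assumption."""
    if freq_hal <= 0:
        raise ValueError("logarithm is undefined for non-positive frequencies")
    return math.log(freq_hal)


print(difficulty_label(0.2), difficulty_label(0.8))
print(round(log_freq_hal(1000), 3))
```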

Distribution

The data is provided in CSV format and comprises 40,481 data instances, one per word, with the difficulty metrics listed under Columns. A sample file will be uploaded to the platform separately.
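
As a rough illustration of working with the CSV, the sketch below reads rows with Python's standard csv module and checks the header before use. The header names are assumed from the Columns section (the "Word" and "Length" names in particular are guesses), and the two sample rows contain made-up placeholder values so that the example runs without the real file.

```python
import csv
import io

# Header names are assumed from the Columns section; the real file may
# differ (the word and length column names are guesses).
EXPECTED_COLUMNS = [
    "Word", "Length", "Freq_HAL", "Log_Freq_HAL", "I_Mean_RT",
    "I_Zscore", "I_SD", "Obs", "I_Mean_Accuracy",
]
EXPECTED_ROWS = 40_481  # instance count stated in the listing

# Placeholder rows with made-up values, only so the sketch runs offline.
SAMPLE_CSV = """Word,Length,Freq_HAL,Log_Freq_HAL,I_Mean_RT,I_Zscore,I_SD,Obs,I_Mean_Accuracy
cat,3,1000,6.908,550.0,0.10,100.0,30,1.0
zyzzyva,7,2,0.693,900.0,0.90,300.0,20,0.5
"""


def load_rows(handle):
    """Read rows from an open CSV handle after checking the expected header."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows loaded; the full file should have", EXPECTED_ROWS)
```

With the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved; comparing `len(rows)` with `EXPECTED_ROWS` is a quick completeness check.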

Usage

This dataset is suited to training and evaluating machine learning models for word difficulty prediction. It is particularly useful for developing and improving text-simplification systems and for research into linguistic complexity, and it can support wider work by the data science community in computational linguistics and natural language processing.
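
To show what training and evaluating a model on binary difficulty labels might look like in the simplest case, here is a sketch of a single-feature baseline that picks a word-length cut-off on training data and reports accuracy on held-out data. It is not the method from the research; the (length, label) pairs are made up for illustration, and a real experiment would use the dataset's rows and a proper split.

```python
from typing import List, Tuple

# Made-up (length, label) pairs for illustration; 1 = DIFFICULT, 0 = SIMPLE.
TRAIN: List[Tuple[int, int]] = [(3, 0), (4, 0), (5, 0), (9, 1), (11, 1), (12, 1)]
TEST: List[Tuple[int, int]] = [(2, 0), (6, 0), (10, 1), (4, 1)]


def accuracy(data: List[Tuple[int, int]], cutoff: int) -> float:
    """Fraction of pairs where the prediction (length >= cutoff) matches the label."""
    hits = sum(1 for length, label in data if int(length >= cutoff) == label)
    return hits / len(data)


def fit_cutoff(data: List[Tuple[int, int]]) -> int:
    """Pick the length cut-off with the highest training accuracy."""
    candidates = sorted({length for length, _ in data})
    return max(candidates, key=lambda c: accuracy(data, c))


cutoff = fit_cutoff(TRAIN)
print("cutoff:", cutoff)
print("test accuracy:", accuracy(TEST, cutoff))
```

A baseline like this gives a reference point: a learned model is only worth its cost if it clearly beats such a simple rule on held-out words.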

Coverage

The dataset's region coverage is global. It was listed on 17th June 2025. There are no specific notes on demographic scope, but it is applicable wherever text analysis and simplification are relevant.

License

CC BY

Who Can Use It

This dataset is intended for data scientists, machine learning engineers, researchers, and developers working in areas such as natural language processing (NLP), computational linguistics, and AI. It is especially valuable for those developing or improving text-simplification systems and anyone needing a quantitative measure of word complexity.

Dataset Name Suggestions

Word Difficulty Prediction Dataset
Linguistic Complexity Scores
Text Difficulty Metrics
NLP Word Complexity
Deep Learning Word Difficulty

Attributes

Original Data Source: WORD DIFFICULTY PREDICTION

Listing Stats

Listed: 17/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
Price: Free
