StackOverflow Question Insights

Education & Learning Analytics

Tags and Keywords

Earth, Nature, Computer, Science, Education, Internet, Programming, Classification, NLP

About

This dataset provides more than 20,000 question titles sourced directly from the Stack Overflow API and forms the basis for the "Stackoverflow Question Classification Challenge". It is designed to support machine learning work, particularly Natural Language Processing (NLP) and classification tasks. Stack Overflow is where programmers ask for help with coding problems, so the titles reflect real questions from a collaborative learning environment. The main purpose of the dataset is to support predictive models, such as classifiers that determine the programming language (label) of a question from its title alone, or models that predict a question's score. The titles can also be used to explore similarity between coding questions.

Columns

The dataset is structured with the following columns:

title: The full text of the question title.
id_stack: The unique identifier assigned to each question on Stack Overflow. This column contains 28,128 unique values out of 28,174 total entries, so a small number of identifiers appear more than once.
tags: The tags associated with the question. Approximately 27% of entries in this column are null, 1% are 'jquery', and the remaining 72% are other tags.
views: The total number of times the question has been viewed. This column has 28,174 total values.
score: The question's overall score, derived from upvotes and downvotes. This column also has 28,174 total values.
done: A boolean field indicating whether the question has been marked as answered (True) or not (False). Approximately 99% of entries in this column are null, and the recorded counts of both True and False are zero.
label: The programming language of the question. This column is entirely null (100%) in the provided sample, so no language labels are available there.
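Because several columns are mostly or entirely null and id_stack contains repeats, it may be worth profiling the file before modelling. The following is a minimal sketch using only the Python standard library. The rows are invented for illustration, empty strings are assumed to stand in for nulls, and the helper names `null_share` and `duplicate_ids` are hypothetical.

```python
from collections import Counter

# Invented rows shaped like the columns listed above.
# Assumption: a null value shows up as an empty string after parsing.
rows = [
    {"id_stack": "101", "tags": "python", "done": "", "label": ""},
    {"id_stack": "102", "tags": "", "done": "", "label": ""},
    {"id_stack": "101", "tags": "python", "done": "", "label": ""},
    {"id_stack": "103", "tags": "jquery", "done": "", "label": ""},
]

def null_share(rows, column):
    """Fraction of rows whose value in `column` is missing or blank."""
    missing = sum(1 for r in rows if not (r.get(column) or "").strip())
    return missing / len(rows)

def duplicate_ids(rows, column="id_stack"):
    """Identifiers that occur more than once, mapped to their counts."""
    counts = Counter(r[column] for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

for col in ("tags", "done", "label"):
    print(col, null_share(rows, col))
print(duplicate_ids(rows))
```

On the real file, the same two helpers could be used to check the null percentages and the id_stack uniqueness figures quoted above.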

Distribution

The dataset is provided as a CSV file that uses a semicolon (;) as the delimiter. The total row count is not stated explicitly, but the id_stack, views, and score columns each contain 28,174 values, which can be taken as the record count. This is consistent with the description of more than 20,000 question titles.
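Since the delimiter is a semicolon rather than a comma, a reader has to be told so explicitly. The sketch below parses a small inline stand-in with Python's built-in `csv` module. The sample rows are invented, the header order is assumed to follow the Columns section, and `read_questions` is a hypothetical helper name. It also assumes views and score are stored as integer text.

```python
import csv
import io

# Invented two-row stand-in for the real file (header order assumed).
SAMPLE = """title;id_stack;tags;views;score;done;label
How to reverse a list in Python?;101;python;1500;12;;
jQuery click handler fires twice;102;;300;-1;;
"""

def read_questions(handle):
    """Parse a semicolon-delimited CSV stream into a list of dicts."""
    reader = csv.DictReader(handle, delimiter=";")
    rows = []
    for row in reader:
        # Every field arrives as text; convert the two numeric columns.
        row["views"] = int(row["views"])
        row["score"] = int(row["score"])
        rows.append(row)
    return rows

questions = read_questions(io.StringIO(SAMPLE))
print(len(questions), questions[0]["title"])
```

For the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was extracted; the actual file name is not given in this listing.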

Usage

This dataset is ideal for a variety of analytical and machine learning applications, including:

Building Classification Models: Developing models to predict the programming language (label) based solely on the question title.
Score Prediction: Creating algorithms to forecast a question's popularity or score using its title as an input.
Similarity Analysis: Identifying and grouping similar coding questions by analysing their titles.
Natural Language Processing (NLP): Conducting research and developing applications in text analysis, topic modelling, and information retrieval within the programming domain.
Educational Analytics: Gaining insights into common learning difficulties and question patterns in programming education.
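As one illustration of the classification use case, the sketch below trains a small multinomial Naive Bayes model on question titles using only the standard library. This is an assumed baseline, not the challenge's method. The training pairs are invented, and since the label column is null in the provided sample, real labels would have to come from a populated version of the data or be derived another way (for example from tags). The names `tokenize` and `TitleNaiveBayes` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(title):
    """Lowercase a title and split it into word-like tokens (keeps + and #)."""
    return re.findall(r"[a-z0-9+#]+", title.lower())

class TitleNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over title tokens."""

    def fit(self, titles, labels):
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for title, label in zip(titles, labels):
            tokens = tokenize(title)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, title):
        tokens = tokenize(title)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.token_counts[label]
            # Add-one smoothing: every vocabulary token gets one extra count.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training pairs, for illustration only.
titles = [
    "How to reverse a list in python",
    "python dict comprehension syntax",
    "jquery click event fires twice",
    "select element by id with jquery",
]
labels = ["python", "python", "javascript", "javascript"]

model = TitleNaiveBayes().fit(titles, labels)
print(model.predict("sort a python list"))
```

A bag-of-words model like this ignores word order, which is usually acceptable for short titles; the same token counts could also feed a simple similarity measure for the similarity analysis use case.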

Coverage

The dataset's scope is global, drawing questions from the Stack Overflow platform. No time range for data collection is given; the dataset was listed on 26 June 2025. The tags, done, and label columns have a high percentage of null values, which may limit analyses that depend on those fields.

License

CC0

Who Can Use It

This dataset is suitable for:

Data Scientists and Machine Learning Engineers: For training and evaluating NLP models, classification algorithms, and predictive systems.
Academic Researchers: Those studying programming education, online learning communities, or text analysis.
Developers and Programmers: Interested in understanding trends in Stack Overflow questions or building tools related to question categorisation.
Students: For academic projects involving data analysis, machine learning, and natural language processing.

Dataset Name Suggestions

StackOverflow Question Classifier Data
Coding Question Title Analytics
StackOverflow NLP Dataset
Programming Language Prediction Challenge
StackOverflow Question Insights

Attributes

Original Data Source: Stackoverflow Question Classification Challenge

Listing Stats

Listed: 26/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
