StackOverflow Question Insights
Education & Learning Analytics
Free
About
This dataset provides a collection of more than 20,000 question titles sourced directly from the Stack Overflow API, forming the basis for the "Stackoverflow Question Classification Challenge". It is designed to support machine learning work, particularly Natural Language Processing (NLP) and classification tasks. The titles reflect a common situation in programming, seeking help with a complex problem, and the collaborative learning environment of Stack Overflow. The primary purpose of the dataset is to support the development of predictive models, such as classifiers that determine the programming language (label) of a question solely from its title, or algorithms that predict a question's score. It is also a useful resource for exploring similarities between coding question titles.
Columns
The dataset is structured with the following columns:
- title: Represents the full text of the question title.
- id_stack: The identifier assigned to each question on Stack Overflow. This column contains 28,128 unique values out of 28,174 total entries, so 46 entries repeat an identifier that already appears.
- tags: Describes the tags associated with the question. Notably, approximately 27% of entries in this column are null, with 1% related to 'jquery' and the remaining 72% comprising other diverse tags.
- views: Indicates the total number of times the question has been viewed. This column has 28,174 total values.
- score: Reflects the question's overall score, derived from upvotes and downvotes. This column also has 28,174 total values.
- done: A boolean field indicating whether the question has been marked as answered (True) or not (False). It is important to note that approximately 99% of entries in this column are null, with recorded true and false counts being zero.
- label: Specifies the programming language used in the question. This column is entirely null (100%), indicating no specific language labels are present in the provided sample.
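The column statistics above (null shares, unique identifier counts) can be re-checked locally. The following is a minimal sketch using only the Python standard library. It assumes the file is semicolon-delimited, as the listing states, and that the header row uses the column names listed above. The function name profile_columns, the file name questions.csv, and the sample rows are hypothetical and are not part of the dataset.

```python
import csv
import io


def profile_columns(rows, id_column="id_stack"):
    """Return the row count, per-column null fractions, and distinct id count."""
    rows = list(rows)
    total = len(rows)
    columns = rows[0].keys() if rows else []
    # A cell counts as null when it is missing, empty, or only whitespace.
    null_fraction = {
        col: sum(1 for r in rows if not (r[col] or "").strip()) / total
        for col in columns
    }
    unique_ids = len({r[id_column] for r in rows})
    return {"rows": total, "null_fraction": null_fraction, "unique_ids": unique_ids}


# Tiny made-up sample in the assumed semicolon-delimited layout (not real data).
SAMPLE = """title;id_stack;tags;views;score;done;label
How to parse JSON in Python?;101;python;120;3;;
Center a div with CSS;102;;45;1;;
Center a div with CSS;102;;45;1;;
"""

# For the real file, replace io.StringIO(SAMPLE) with
# open("questions.csv", newline="", encoding="utf-8"); that file name is hypothetical.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter=";")
report = profile_columns(reader)
print(report["rows"], report["unique_ids"])
print(report["null_fraction"])
```

Comparing rows with unique_ids shows how many entries repeat an identifier, and the null fractions indicate which columns (tags, done, label) are usable for a given task.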
Distribution
The dataset is provided as a CSV file that uses a semicolon (;) as the delimiter. A total row count is not explicitly stated for the entire dataset, but the id_stack, views, and score columns each contain 28,174 values, which can be taken as the record count. The dataset is built upon more than 20,000 question titles.
Usage
This dataset is ideal for a variety of analytical and machine learning applications, including:
- Building Classification Models: Developing models to predict the programming language (label) based solely on the question title.
- Score Prediction: Creating algorithms to forecast a question's popularity or score using its title as an input.
- Similarity Analysis: Identifying and grouping similar coding questions by analysing their titles.
- Natural Language Processing (NLP): Conducting research and developing applications in text analysis, topic modelling, and information retrieval within the programming domain.
- Educational Analytics: Gaining insights into common learning difficulties and question patterns in programming education.
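As an illustration of the similarity analysis use case, the sketch below ranks titles by bag-of-words cosine similarity, using only the Python standard library. This is one simple baseline chosen for illustration, not a method prescribed by the dataset. The function names and the example titles are hypothetical and are not taken from the data.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into tokens of letters, digits, '+' and '#'."""
    # Keeping '+' and '#' lets language names such as "c++" and "c#" survive.
    return re.findall(r"[a-z0-9+#]+", title.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query, titles):
    """Return (title, score) for the title closest to the query."""
    q = Counter(tokenize(query))
    scored = [(t, cosine_similarity(q, Counter(tokenize(t)))) for t in titles]
    return max(scored, key=lambda pair: pair[1])


# Made-up titles for illustration (not taken from the dataset).
titles = [
    "How to read a CSV file in Python",
    "Center a div horizontally with CSS",
    "Sort a list of dictionaries in Python",
]
best, score = most_similar("read csv file python", titles)
print(best, round(score, 3))
```

Raw token counts treat every word equally, so common words such as "how" or "in" dilute the score. Weighting schemes such as TF-IDF or learned embeddings are typical next steps for a collection of this size.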
Coverage
The dataset's scope is global, drawing questions from the Stack Overflow platform. No specific time range for data collection is provided; the dataset was listed on its platform on 26 June 2025. Data availability is limited for some fields: the tags, done, and label columns contain a high percentage of null values, which may influence analytical outcomes for these fields.
License
CC0
Who Can Use It
This dataset is suitable for:
- Data Scientists and Machine Learning Engineers: For training and evaluating NLP models, classification algorithms, and predictive systems.
- Academic Researchers: For studying programming education, online learning communities, or text analysis.
- Developers and Programmers: Interested in understanding trends in Stack Overflow questions or building tools related to question categorisation.
- Students: For academic projects involving data analysis, machine learning, and natural language processing.
Dataset Name Suggestions
- StackOverflow Question Classifier Data
- Coding Question Title Analytics
- StackOverflow NLP Dataset
- Programming Language Prediction Challenge
- StackOverflow Question Insights
Attributes
Original Data Source: Stackoverflow Question Classification Challenge