Competitive Programming Problem-Editorial Data

Website Analytics & User Experience

Tags and Keywords

  • Computer
  • Programming
  • NLP
  • Deep
  • Retrieval/ranking
  • Sentence

Free

About

Publicly available datasets that combine problem statements, editorials, and source code are scarce, which limits machine learning research on competitive programming. This dataset addresses that gap with a collection of over 7000 competitive programming problems, each accompanied by an editorial solution, source code, and other relevant metadata.

The initial version of the dataset was featured in the paper "Matching Problem Statements to Editorials in Competitive Programming" (ICALT 2024). A subsequent revision, seven times larger, was published in "Domain Adaptation for Automated Tag Prediction in Competitive Programming" (AIAI 2025).

Competitive programming is a demanding endeavour that requires strong proficiency in computer science concepts and advanced problem-solving skills. The dataset aims to support the development of new algorithms and techniques that select or generate suitable editorial explanations for a given problem more efficiently and more accurately.

Columns

The dataset features a full index of competitive programming problem statements and associated materials, including the following columns:
  • problem_link: A direct link to the problem on the Codeforces platform.
  • problem_id: The unique identifier for the problem.
  • problem_idx: An index specific to the problem.
  • short_id: A shortened identifier for the problem.
  • contest_number: The identifier of the contest where the problem appeared.
  • problem_name: The title or name of the competitive programming problem.
  • problem_statement: The detailed description of the problem presented to contestants.
  • problem_solution: The corresponding source code solution to the problem.
  • problem_input: Example input data for the problem.
  • problem_output: Expected output data for the given example input.
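
As a minimal sketch of how the columns above might be checked when reading a CSV export, the snippet below uses only the Python standard library. The helper names (`read_problems`, `make_sample_csv`) and the sample row, including its link and identifier values, are made up for illustration and are not taken from the dataset; a real data file would be opened with `open(...)` in place of the in-memory sample.

```python
import csv
import io

# Column names as listed in this dataset description.
EXPECTED_COLUMNS = [
    "problem_link", "problem_id", "problem_idx", "short_id", "contest_number",
    "problem_name", "problem_statement", "problem_solution",
    "problem_input", "problem_output",
]


def read_problems(handle):
    """Read all rows from a CSV file object, failing if a listed column is absent."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def make_sample_csv():
    """Build a one-row in-memory CSV. Every value here is a made-up placeholder."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(EXPECTED_COLUMNS)
    writer.writerow([
        "https://example.invalid/problem/1A",     # hypothetical link
        "1", "A", "1A", "1",                      # hypothetical identifiers
        "Sample Sum",
        "Given two integers, print their sum.",
        "print(sum(map(int, input().split())))",  # contains a comma, so it gets quoted
        "1 2",
        "3",
    ])
    buffer.seek(0)
    return buffer


rows = read_problems(make_sample_csv())
print(rows[0]["problem_name"], "|", rows[0]["problem_solution"])
```

Writing the sample with `csv.writer` rather than joining strings by hand matters here because source code and problem statements routinely contain commas and newlines, which need proper quoting.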

Distribution

This dataset contains a full index of 7185 competitive programming problem statements. This second revision is seven times larger than the initial publication. Most attributes (problem links, IDs, indices, short IDs, contest numbers, problem names, statements, solutions, inputs, and outputs) have 7185 unique values. Data files are typically provided in CSV format.
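
The uniqueness figures above can be re-checked on a loaded copy of the data. The following is a rough sketch, assuming the rows are already available as a list of dictionaries keyed by column name; the function names and the three sample rows are hypothetical and serve only to show the shape of the check.

```python
from collections import Counter


def unique_counts(rows, columns):
    """Return {column: number of distinct values} for a list of dict rows."""
    return {col: len({row[col] for row in rows}) for col in columns}


def duplicated_values(rows, column):
    """Return the values that occur more than once in the given column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up rows; the real dataset is described as having 7185 entries.
sample_rows = [
    {"problem_id": "1", "short_id": "1A", "contest_number": "1"},
    {"problem_id": "2", "short_id": "1B", "contest_number": "1"},
    {"problem_id": "3", "short_id": "2A", "contest_number": "2"},
]

print(unique_counts(sample_rows, ["problem_id", "short_id", "contest_number"]))
print(duplicated_values(sample_rows, "contest_number"))
```

A column whose distinct count is lower than the row count is not necessarily an error, since several problems can share one contest number.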

Usage

This dataset is ideally suited for various applications within machine learning and computer science, including:
  • Developing and evaluating new algorithms for competitive programming tasks.
  • Improving the efficiency and accuracy of automated editorial selection or generation.
  • Research into automated tag prediction for competitive programming problems.
  • Applications in Natural Language Processing (NLP) related to understanding problem statements and editorials.
  • Deep Learning model training for code analysis and problem-solving.
  • Research focused on Retrieval/Ranking systems for educational or programming content.
  • Studies involving Sentence Similarity in programming contexts.
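
To make the retrieval/ranking and sentence-similarity use cases concrete, here is a deliberately simple baseline: it ranks candidate texts against a problem statement by cosine similarity over word counts. This is an illustrative sketch only, not the method of the papers cited above; the function names and the example texts are invented for this listing, and a real experiment would more likely use learned sentence embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(text))), index)
        for index, text in enumerate(candidates)
    ]
    # Sort by descending score; ties fall back to the original order.
    return [index for _, index in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Invented example texts standing in for a problem statement and editorials.
statement = "Find the shortest path between two nodes in a weighted graph."
editorials = [
    "Sort the array and use two pointers to count pairs.",
    "Run Dijkstra on the weighted graph to get the shortest path.",
    "Use a prefix sum over the array.",
]

print(rank_candidates(statement, editorials))
```

Word-count cosine is sensitive to common words such as "the"; removing stop words or applying TF-IDF weighting would be a natural next step before comparing against stronger models.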

Coverage

The dataset's coverage is global, making it suitable for research and applications worldwide. No specific time range or demographic scope is indicated in the available information.

License

CC BY 4.0

Who Can Use It

This dataset is valuable for:
  • Machine Learning Researchers: To train and test models for tasks such as problem classification, solution generation, or editorial summarisation.
  • Competitive Programming Enthusiasts/Developers: To analyse problem trends, study solutions, or develop tools for problem-solving assistance.
  • Educators: To create teaching materials or automated tutoring systems based on real-world programming challenges.
  • Data Scientists: To explore relationships between problem statements, solutions, and performance metrics in competitive programming.

Dataset Name Suggestions

  • Codeforces Competitive Programming Dataset
  • Competitive Programming Problem-Editorial Data
  • Codeforces ML Challenge Dataset
  • Problem Solving Code Dataset

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 17/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
