Global Bug Report Translation Dataset

Data Science and Analytics

Tags and Keywords

Computer Science, Classification, NLP, Translation, Multilingual

Free

About

This dataset provides a collection of multilingual bug reports sourced from various open-source repositories. It is designed to support research and development in natural language processing, particularly for tasks such as bug triaging and cross-language analysis. The dataset includes original bug report content along with multiple machine-generated translations and associated metadata.

Columns

number: A distinct identifier for each bug report.
labels: Tags assigned to the bug report, which include its status.
created_at: The exact date and time when the bug report was generated.
body: The primary textual content that describes the reported bug.
state_reason: The reason the bug report is in its current state, such as 'completed' or 'not_planned'.
title: The heading or summary of the bug report.
state: Signifies whether the bug report is currently open or has been closed.
translation: A version of the bug report's main content translated into another language.
src_lang: The detected original language of the bug report.
gpt_translation: A translation of the bug report generated using GPT models.
gpt_src_lang: The source language identified by GPT for its translation.
deepL_translation: A translation of the bug report provided by DeepL.
deepL_src_lang: The source language identified by DeepL for its translation.
aws_translation: A translation of the bug report provided by AWS Translate.
aws_src_lang: The source language identified by AWS Translate for its translation.
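A quick way to confirm that a downloaded copy matches the column list above is to compare its header against the expected names. The following is a minimal sketch using only the Python standard library; the function name and the in-memory sample are hypothetical, and the sample stands in for the real file.

```python
import csv
import io

# Column names exactly as listed in the dataset description.
EXPECTED_COLUMNS = [
    "number", "labels", "created_at", "body", "state_reason", "title", "state",
    "translation", "src_lang", "gpt_translation", "gpt_src_lang",
    "deepL_translation", "deepL_src_lang", "aws_translation", "aws_src_lang",
]


def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in EXPECTED_COLUMNS if name not in header]


# Hypothetical sample with only four of the fifteen columns.
sample = io.StringIO("number,title,body,state\n1,Crash on start,App crashes,closed\n")
print(missing_columns(sample))
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved; the UTF-8 encoding is an assumption.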

Distribution

The dataset consists of bug reports together with their translations, labels, and metadata, and is a large dataset provided in CSV format. The listing does not state a total row count. Entries span multiple years with varying counts per period, with up to 200 reports in a single time interval. For the labels column, the listing reports counts ranging from approximately 44,229 to 216,803 unique values. For the state_reason column, it reports 1,237 unique values, with 'completed' accounting for 72% of entries and 'not_planned' for 28%. By source language, Chinese Simplified (zh-CN) makes up 49% of reports, Portuguese (pt) 12%, and all other languages the remaining 39%.
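The percentage breakdowns quoted for state_reason and src_lang can be recomputed from a downloaded copy. The sketch below assumes the rows have already been read into dictionaries (for example with `csv.DictReader`); the helper name and the four sample rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def value_shares(rows, column):
    """Return {value: percentage} over the non-empty values of `column`."""
    counts = Counter(row[column] for row in rows if row.get(column))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100 * n / total for value, n in counts.items()}


# Hypothetical rows for illustration only.
rows = [
    {"state_reason": "completed", "src_lang": "zh-CN"},
    {"state_reason": "completed", "src_lang": "zh-CN"},
    {"state_reason": "completed", "src_lang": "pt"},
    {"state_reason": "not_planned", "src_lang": "en"},
]

print(value_shares(rows, "state_reason"))
print(value_shares(rows, "src_lang"))
```

Empty values are skipped, so the percentages are shares of the populated entries rather than of all rows.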

Usage

This dataset is ideal for:

Multilingual Natural Language Processing (NLP): Analysing bug reports across various languages.
Machine Translation Benchmarking: Comparing the effectiveness and quality of translations produced by models such as GPT, DeepL, and AWS Translate.
Software Bug Triage Automation: Developing NLP models to automatically categorise and prioritise software bug reports.
Cross-Language Information Retrieval: Enhancing search and retrieval capabilities for bug reports that are not in English.
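As a starting point for the machine translation benchmarking use case, the three engine outputs for each report can be compared pairwise. The sketch below uses character-level similarity from Python's `difflib` as a rough agreement signal. This is an assumption made for illustration, not a translation quality metric and not a method described by the dataset; the function name and sample row are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Engine translation columns as named in the dataset description.
ENGINE_COLUMNS = ["gpt_translation", "deepL_translation", "aws_translation"]


def pairwise_agreement(row):
    """Return {(col_a, col_b): similarity in [0, 1]} for each engine pair.

    Pairs where either translation is missing or empty are skipped.
    """
    scores = {}
    for col_a, col_b in combinations(ENGINE_COLUMNS, 2):
        text_a = row.get(col_a) or ""
        text_b = row.get(col_b) or ""
        if text_a and text_b:
            scores[(col_a, col_b)] = SequenceMatcher(None, text_a, text_b).ratio()
    return scores


# Hypothetical row for illustration only.
row = {
    "gpt_translation": "The app crashes on startup.",
    "deepL_translation": "The app crashes on startup.",
    "aws_translation": "Application crashes when starting.",
}

for pair, score in pairwise_agreement(row).items():
    print(pair, round(score, 2))
```

Low agreement between engines does not show which one is wrong, but it can help select reports worth manual review.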

Coverage

The dataset covers bug reports created between 22 February 2018 and 21 June 2024. Its scope is global, with content in languages including English, Chinese Simplified, Portuguese, and Russian, which supports analysis across different linguistic contexts.
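The stated time range can be used as a sanity check on created_at values. The listing does not specify the timestamp format, so the sketch below assumes ISO 8601 strings (optionally ending in "Z") and treats values without a time zone as UTC; both are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timezone

# Coverage window taken from the listing: 22 February 2018 to 21 June 2024.
COVERAGE_START = datetime(2018, 2, 22, tzinfo=timezone.utc)
COVERAGE_END = datetime(2024, 6, 21, 23, 59, 59, tzinfo=timezone.utc)


def parse_created_at(value):
    """Parse an ISO 8601 timestamp; the format is an assumption, not documented."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def in_coverage(value):
    """Return True if the timestamp falls inside the stated coverage window."""
    return COVERAGE_START <= parse_created_at(value) <= COVERAGE_END


print(in_coverage("2018-02-22T00:00:00Z"))
print(in_coverage("2017-12-31T23:59:59Z"))
```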

License

CC-BY-SA

Who Can Use It

This dataset is beneficial for:

Data Scientists: For developing and testing machine learning models for text analysis.
NLP Researchers: To explore multilingual text processing, machine translation, and information extraction.
Software Engineers and Developers: Interested in building automated bug tracking or triaging systems.
Organisations: Looking to improve their handling of international bug reports and customer feedback.

Dataset Name Suggestions

Multilingual Bug Reports with Translations
Global Bug Report Translation Dataset
NLP Bug Report Corpus
Software Issue Translation Data

Attributes

Original Data Source: Multilingual Bug Reports with Translations

Listing Stats

Listed: 21/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
