Global Bug Report Translation Dataset
Data Science and Analytics
Free
About
This dataset provides a collection of multilingual bug reports sourced from various open-source repositories. It is designed to support research and development in natural language processing, particularly for tasks such as bug triaging and cross-language analysis. The dataset includes original bug report content along with multiple machine-generated translations and associated metadata.
Columns
- number: A distinct identifier for each bug report.
- labels: Tags assigned to the bug report, which include its status.
- created_at: The exact date and time when the bug report was generated.
- body: The primary textual content that describes the reported bug.
- state_reason: The reason recorded for the bug report's current state, such as 'completed' or 'not_planned'.
- title: The heading or summary of the bug report.
- state: Signifies whether the bug report is currently open or has been closed.
- translation: A version of the bug report's main content translated into another language.
- src_lang: The detected original language of the bug report.
- gpt_translation: A translation of the bug report generated using GPT models.
- gpt_src_lang: The source language identified by GPT for its translation.
- deepL_translation: A translation of the bug report provided by DeepL.
- deepL_src_lang: The source language identified by DeepL for its translation.
- aws_translation: A translation of the bug report provided by AWS Translate.
- aws_src_lang: The source language identified by AWS Translate for its translation.
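As a minimal sketch of how the columns above might be read, the following uses only the Python standard library. It assumes the file is a CSV with a header row matching the listed column names and that created_at is an ISO 8601 timestamp; neither is confirmed by the listing. The sample rows and the load_reports helper are hypothetical and only illustrate the shape of the data.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row sample using a subset of the column names listed above.
SAMPLE_CSV = """number,title,state,state_reason,created_at,src_lang,body,translation
101,启动时崩溃,closed,completed,2018-02-22T10:15:00Z,zh-CN,应用在启动时崩溃,The app crashes on startup
102,Erro ao salvar,open,,2024-06-21T08:00:00Z,pt,Falha ao salvar o arquivo,Failure when saving the file
"""


def load_reports(handle):
    """Read bug reports from a CSV file object into a list of dicts."""
    reports = []
    for row in csv.DictReader(handle):
        row["number"] = int(row["number"])
        # Assumption: ISO 8601 timestamps. The trailing "Z" is rewritten so
        # that older Python versions can parse it with fromisoformat.
        row["created_at"] = datetime.fromisoformat(
            row["created_at"].replace("Z", "+00:00")
        )
        # An empty state_reason is treated as missing.
        row["state_reason"] = row["state_reason"] or None
        reports.append(row)
    return reports


reports = load_reports(io.StringIO(SAMPLE_CSV))
for report in reports:
    print(report["number"], report["src_lang"], report["created_at"].date())
```

For the real file, the same function could be given a handle from open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.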
Distribution
The dataset is structured as bug reports, including their translations, labels, and metadata. While specific row counts are not provided, the dataset spans multiple years with varying counts of entries per period, suggesting a substantial volume of data. For instance, time ranges show counts up to 200 reports within specific intervals. Data on labels indicates a wide range of counts, from approximately 44,229 to 216,803 unique values. The state_reason column has 1237 unique values, with 'completed' representing 72% and 'not_planned' 28% of entries. Language distribution indicates that Chinese Simplified (zh-CN) makes up 49% of the source languages, Portuguese (pt) 12%, and other languages account for the remaining 39%. The dataset is large, typically provided in CSV format.
Usage
This dataset is ideal for:
- Multilingual Natural Language Processing (NLP): Analysing bug reports across various languages.
- Machine Translation Benchmarking: Comparing the effectiveness and quality of translations produced by models such as GPT, DeepL, and AWS Translate.
- Software Bug Triage Automation: Developing NLP models to automatically categorise and prioritise software bug reports.
- Cross-Language Information Retrieval: Enhancing search and retrieval capabilities for bug reports that are not in English.
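For the translation benchmarking use case, one rough starting point is to measure how closely the three engine outputs agree with each other for a single report. The sketch below uses difflib.SequenceMatcher from the Python standard library as a crude character-level similarity. This is an assumption chosen for illustration, not a translation quality metric and not a method described by the dataset. The pairwise_agreement helper and the sample texts are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Column names taken from the Columns section of this listing.
ENGINE_COLUMNS = ["gpt_translation", "deepL_translation", "aws_translation"]


def pairwise_agreement(report):
    """Return a similarity ratio in [0, 1] for each pair of engine outputs."""
    scores = {}
    for first, second in combinations(ENGINE_COLUMNS, 2):
        matcher = SequenceMatcher(None, report[first], report[second])
        scores[(first, second)] = matcher.ratio()
    return scores


# Hypothetical report with made-up translations.
sample_report = {
    "gpt_translation": "The app crashes on startup",
    "deepL_translation": "The app crashes on startup",
    "aws_translation": "Application crashes at launch",
}

agreement = pairwise_agreement(sample_report)
for pair, score in agreement.items():
    print(pair, round(score, 2))
```

High agreement only shows that the engines produced similar text. Judging which translation is better would need reference translations or human review, neither of which the listing mentions.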
Coverage
The dataset covers a time range from 22 February 2018 to 21 June 2024. It is global in its regional scope, including content in various languages such as English, Chinese Simplified, Portuguese, and Russian, among others. The dataset's multilingual nature supports analysis across different linguistic contexts.
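The stated coverage can be checked against loaded rows with a short sketch like the one below. It assumes each report is a dict with a src_lang string and a created_at date; the helper names and the sample rows are hypothetical, and the two boundary dates come from the coverage statement above.

```python
from collections import Counter
from datetime import date

# Boundaries taken from the coverage statement in this listing.
COVERAGE_START = date(2018, 2, 22)
COVERAGE_END = date(2024, 6, 21)


def language_shares(reports):
    """Return the fraction of reports for each source language."""
    counts = Counter(report["src_lang"] for report in reports)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in counts.items()}


def in_coverage(created):
    """Check whether a date falls inside the stated coverage window."""
    return COVERAGE_START <= created <= COVERAGE_END


# Hypothetical rows, not taken from the dataset.
sample_reports = [
    {"src_lang": "zh-CN", "created_at": date(2018, 2, 22)},
    {"src_lang": "zh-CN", "created_at": date(2021, 5, 1)},
    {"src_lang": "pt", "created_at": date(2024, 6, 21)},
    {"src_lang": "ru", "created_at": date(2023, 1, 9)},
]

shares = language_shares(sample_reports)
outside = [r for r in sample_reports if not in_coverage(r["created_at"])]
print(shares, len(outside))
```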
License
CC-BY-SA
Who Can Use It
This dataset is beneficial for:
- Data Scientists: Developing and testing machine learning models for text analysis.
- NLP Researchers: Exploring multilingual text processing, machine translation, and information extraction.
- Software Engineers and Developers: Building automated bug tracking or triaging systems.
- Organisations: Improving their handling of international bug reports and customer feedback.
Dataset Name Suggestions
- Multilingual Bug Reports with Translations
- Global Bug Report Translation Dataset
- NLP Bug Report Corpus
- Software Issue Translation Data
Attributes
Original Data Source: Multilingual Bug Reports with Translations