Translated Text Inference Dataset

Data Science and Analytics

Tags and Keywords

Computer

Science

Biology

NLP

Free

About

This dataset provides augmented, translated text for training and testing natural language inference (NLI) models, specifically for the "Contradictory, my dear Watson" competition. It was created with a data augmentation technique and is intended to reduce processing time when training K-fold XLM-RoBERTa models. The augmentation doubled the original competition data from 12,120 rows to 24,240 rows.

Columns

  • id: Unique identifiers for each entry in the dataset.
  • premise: The initial statement or piece of text provided for evaluation.
  • hypothesis: The statement or claim that is assessed in relation to the premise.
  • lang_abv: A two-letter abbreviation indicating the language of the text (e.g., 'en' for English, 'es' for Spanish).
  • language: The full name of the language for the text entries (e.g., 'English', 'Spanish').
  • label: The class of the relationship between the premise and the hypothesis (in the competition: entailment, neutral, or contradiction).
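
The column list above can be checked when a file is loaded. The following is a minimal sketch using only the Python standard library; the function name `load_rows`, the `expect_label` flag, and the two sample rows (ids, text, and label values) are hypothetical and only illustrate the layout described in this listing.

```python
import csv
import io

# Column names as described in this listing.
EXPECTED_COLUMNS = ["id", "premise", "hypothesis", "lang_abv", "language", "label"]


def load_rows(fileobj, expect_label=True):
    """Read rows from an open CSV file and check the header against the listing."""
    reader = csv.DictReader(fileobj)
    expected = EXPECTED_COLUMNS if expect_label else EXPECTED_COLUMNS[:-1]
    missing = [c for c in expected if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)  # each row is a dict of strings


# Hypothetical two-row sample; real ids, text, and label encoding may differ.
SAMPLE = io.StringIO(
    "id,premise,hypothesis,lang_abv,language,label\n"
    "a1,The cat sleeps.,The cat is awake.,en,English,2\n"
    "b2,El perro corre.,Un animal se mueve.,es,Spanish,0\n"
)
rows = load_rows(SAMPLE)
print(rows[0]["lang_abv"], rows[0]["label"])
```

For the real files, the same function could be called with `open("train_augmented.csv", newline="", encoding="utf-8")`, assuming the files are UTF-8 encoded, and with `expect_label=False` for test.csv.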

Distribution

The dataset is provided in CSV format as two files, train_augmented.csv and test.csv. The training file, train_augmented.csv, contains 24,240 rows, double the 12,120 rows of the original competition data. The test.csv file contains the same columns except for label.
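
The two claims in this section (a doubled row count, and a test file without labels) are easy to verify after download. A small sketch, with hypothetical helper names; the row counts are the ones stated in this listing.

```python
ORIGINAL_ROWS = 12_120   # original competition data, per this listing
AUGMENTED_ROWS = 24_240  # train_augmented.csv, per this listing


def augmentation_factor(original, augmented):
    """Ratio of augmented to original row counts."""
    return augmented / original


def differs_only_by_label(train_columns, test_columns):
    """True when the test header equals the train header with 'label' removed."""
    return [c for c in train_columns if c != "label"] == list(test_columns)


print(augmentation_factor(ORIGINAL_ROWS, AUGMENTED_ROWS))
```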

Usage

This dataset is suited to training and testing natural language inference models. It is particularly useful for participants in the "Contradictory, my dear Watson" competition and for those developing or fine-tuning K-fold XLM-RoBERTa models. Its structure supports contradiction detection and other forms of text relationship analysis.
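
As an illustration of the K-fold setup mentioned here, the sketch below assigns each training row to a fold so that every fold holds a similar share of each label. The listing does not state the number of folds or whether the split is stratified, so `n_folds=5`, the stratification, and the function name `stratified_folds` are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Return one fold index (0..n_folds-1) per row, balanced within each label."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    folds = [0] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled rows of this label round-robin across the folds.
        for position, i in enumerate(indices):
            folds[i] = position % n_folds
    return folds


# Toy labels only; real values come from the label column.
toy_labels = [0, 1, 2] * 10
toy_folds = stratified_folds(toy_labels)
```

Rows with fold index k would serve as the validation split in round k, and the remaining rows as the training split.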

Coverage

Regional coverage is global. By language, English accounts for 57% of the text entries, Spanish for 3%, and a variety of other languages for the remaining 40%. No time range or demographic detail beyond the language distribution is provided.
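
The language shares quoted here can be recomputed from the language column. A minimal sketch; the function name `language_shares` is hypothetical, and the toy input uses "Other" only as a placeholder for the remaining languages, not as a value found in the data.

```python
from collections import Counter


def language_shares(languages):
    """Map each language to its percentage of rows, most frequent first."""
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.most_common()}


# Toy column mirroring the proportions quoted in this listing.
toy_column = ["English"] * 57 + ["Spanish"] * 3 + ["Other"] * 40
shares = language_shares(toy_column)
```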

License

CC0

Who Can Use It

This dataset is valuable for data scientists, machine learning engineers, and researchers working on Natural Language Processing (NLP) tasks. It is especially relevant for individuals involved in natural language inference challenges, text classification, and the development of models for detecting contradictions within textual data.

Dataset Name Suggestions

  • Translated Text Inference Dataset
  • Augmented Contradiction Detection Data
  • Multilingual NLI Training Set
  • XLM-Roberta Text Analysis Dataset

Attributes

Original Data Source: Translated Dataset Augmentation

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • UDQS quality score: 5 / 5
  • Version: 1.0
