Translated Text Inference Dataset
Data Science and Analytics
Free
About
This dataset provides augmented, translated text data for training and testing natural language inference (NLI) models, specifically for the "Contradictory, my dear Watson" competition. It was created with a data augmentation technique to reduce processing time when training K-Fold XLM-RoBERTa models. The augmentation doubled the original competition data from 12,120 entries to 24,240 rows, giving a larger resource for model development.
Columns
- id: Unique identifiers for each entry in the dataset.
- premise: The initial statement or piece of text provided for evaluation.
- hypothesis: The statement or claim that is assessed in relation to the premise.
- lang_abv: A two-letter abbreviation indicating the language of the text (e.g., 'en' for English, 'es' for Spanish).
- language: The full name of the language for the text entries (e.g., 'English', 'Spanish').
- label: A categorical indicator of the relationship between the premise and the hypothesis, for example contradiction. NLI tasks conventionally use three classes: entailment, neutral, and contradiction.
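As a rough illustration of the column layout above, the following sketch reads rows with Python's standard csv module and reports any expected column that is missing. The sample rows and their integer label values are made up for illustration and are not taken from the dataset, and the helper name missing_columns is hypothetical.

```python
import csv
import io

# Column names as listed above; the test file is described as lacking `label`.
EXPECTED_COLUMNS = ["id", "premise", "hypothesis", "lang_abv", "language", "label"]

# Made-up sample rows for illustration only (not real dataset entries).
SAMPLE_CSV = """id,premise,hypothesis,lang_abv,language,label
a1,The cat sleeps on the mat.,An animal is resting.,en,English,0
b2,El perro corre.,El perro duerme.,es,Spanish,2
"""


def missing_columns(fieldnames, require_label=True):
    """Return the expected column names that are absent from a CSV header."""
    expected = EXPECTED_COLUMNS if require_label else EXPECTED_COLUMNS[:-1]
    present = set(fieldnames or [])
    return [name for name in expected if name not in present]


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)

# An empty list means every expected column is present.
print(missing_columns(reader.fieldnames))
print(rows[0]["premise"], "->", rows[0]["hypothesis"])
```

To check a real file, the same reader can be built from open("train_augmented.csv", newline="", encoding="utf-8") instead of the in-memory string, with require_label=False for test.csv.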
Distribution
The dataset is provided in CSV format and includes the train_augmented.csv and test.csv files. The training file, train_augmented.csv, contains 24,240 rows, which is double the number of entries in the original competition data. The test.csv file includes the relevant columns but without the labels.
Usage
This dataset is suited to training and testing natural language inference models. It is particularly useful for participants in the "Contradictory, my dear Watson" competition and for those developing or fine-tuning K-Fold XLM-RoBERTa models. Its structure supports tasks such as contradiction detection and other forms of text relationship analysis.
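The K-Fold setup mentioned above could be sketched as follows. This is a minimal, standard-library illustration of stratified fold assignment, not the method used to build or train on this dataset; the function name stratified_folds, the fold count, the seed, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Assign each example a fold index, keeping label proportions roughly equal."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


# Toy labels standing in for the `label` column (values are illustrative).
labels = [0, 1, 2] * 10
fold_of = stratified_folds(labels, n_folds=5)

for fold in range(5):
    train_idx = [i for i, f in enumerate(fold_of) if f != fold]
    valid_idx = [i for i, f in enumerate(fold_of) if f == fold]
    # A model would be trained on train_idx and evaluated on valid_idx here.
    print(fold, len(train_idx), len(valid_idx))
```

If the augmented rows are translations of the original rows, it may be worth keeping each original and its translation in the same fold to avoid leakage between training and validation splits; the listing does not say how the rows correspond, so that grouping is left out of this sketch.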
Coverage
Regional coverage is global. Linguistically, the dataset includes English (57% of the data), Spanish (3%), and a variety of other languages that make up the remaining 40% of the text entries. Time ranges and demographic details beyond the language distribution are not provided.
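A language breakdown like the one above could be recomputed from the language column along these lines. This is a small sketch with a hypothetical helper, language_shares, and toy values that do not reflect the real proportions.

```python
from collections import Counter


def language_shares(languages):
    """Return each language's share of rows as a percentage, largest first."""
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: 100 * n / total for lang, n in counts.most_common()}


# Toy values standing in for the `language` column (not real proportions).
sample = ["English"] * 6 + ["Spanish"] * 1 + ["French"] * 3

for lang, share in language_shares(sample).items():
    print(f"{lang}: {share:.0f}%")
```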
License
CC0
Who Can Use It
This dataset is valuable for data scientists, machine learning engineers, and researchers working on Natural Language Processing (NLP) tasks. It is especially relevant for individuals involved in natural language inference challenges, text classification, and the development of models for detecting contradictions within textual data.
Dataset Name Suggestions
- Translated Text Inference Dataset
- Augmented Contradiction Detection Data
- Multilingual NLI Training Set
- XLM-Roberta Text Analysis Dataset
Attributes
Original Data Source: Translated Dataset Augmentation