Turkish Natural Language Inference Dataset
Education & Learning Analytics
About
The NLI-TR dataset comprises two datasets, SNLI-TR and MNLI-TR, and is intended to support natural language inference (NLI) research in Turkish within the natural language processing (NLP) and machine learning communities. Both datasets consist of curated NLI data translated into Turkish from the original English sources. Each record pairs a Turkish premise with a Turkish hypothesis and is labelled to indicate whether the premise entails, contradicts, or is neutral towards the hypothesis. The resource lets researchers develop automated models for making inferences on Turkish text. It also supports the study of cross-lingual generalisation, that is, how models trained on data in one language perform when applied to another. Related tasks it can support range from sentence paraphrasing and classification to question answering.
Columns
The dataset records typically include the following columns:
- premise: This column contains sentences written in Turkish. These sentences have been translated from the English sources used for the original SNLI and MNLI datasets. It serves as the contextual information or the initial statement from which an inference is to be made.
- hypothesis: This column also contains sentences in Turkish, translated from the English SNLI and MNLI datasets. It represents the conclusion or the statement whose relationship to the premise is being assessed.
- label: This column gives the relationship between the premise and the hypothesis. Possible values are:
  - 'entailment': The hypothesis logically follows from the premise.
  - 'contradiction': The hypothesis directly contradicts the premise.
  - 'neutral': The hypothesis neither follows from nor contradicts the premise.
- domain: An optional column assigned by some authors, primarily used when inferences are made between sentences across different semantic domains, such as weather, sports, or finance.
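As a rough illustration of the column layout described above, the following sketch reads records with Python's standard csv module and checks that each one carries the expected fields and one of the three label values. The sample rows and the helper name load_nli_rows are invented for illustration. The actual files may use a different column order, a different label encoding (for example integers), or extra columns.

```python
import csv
import io

# Label strings as described in this listing; the real files may encode them differently.
VALID_LABELS = {"entailment", "contradiction", "neutral"}
REQUIRED_COLUMNS = {"premise", "hypothesis", "label"}

# Hypothetical sample rows, made up for illustration (not taken from the dataset).
SAMPLE_CSV = """premise,hypothesis,label
Bir adam parkta koşuyor.,Bir adam dışarıda.,entailment
Bir adam parkta koşuyor.,Bir adam evde uyuyor.,contradiction
Bir adam parkta koşuyor.,Adam maratona hazırlanıyor.,neutral
"""


def load_nli_rows(handle):
    """Read NLI records from an open CSV handle and validate columns and labels."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unexpected label {row['label']!r}")
        rows.append(row)
    return rows


rows = load_nli_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["label"])
```

To try this on a downloaded file, the same helper could be given a handle opened with encoding="utf-8" and newline="", assuming the file has a header row with these column names.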
Distribution
The data is typically provided in CSV file format. It includes both training and validation sets to support model development and evaluation. The key files mentioned are:
- SNLI_tr_train.csv: for training models.
- snli_tr_validation: for testing or validating model accuracy on unseen data.
- multinli_tr_validation_{matched / mismatched}.csv: for additional validation on more complex scenarios.
- multinli_tr_train.csv: Turkish sentences with their corresponding labels. The dataset is considered large-scale; this file alone contains approximately 392,700 records.
Usage
This dataset is ideal for various applications and use cases in NLP and machine learning:
- Developing Natural Language Inference (NLI)-based question answering systems for the Turkish language.
- Training sentiment analysis algorithms to discern sentiment in Turkish text.
- Building machine learning chatbots that use NLI to understand conversational context and respond appropriately in Turkish.
- Conducting general NLI research in Turkish.
- Investigating cross-lingual generalisation capabilities of NLP models.
- Tasks such as sentence paraphrasing, classification, and other NLP techniques applied to Turkish text.
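For the research and classification use cases listed above, a common first step is to map the three label strings to integer ids and compare any model against a trivial baseline. The sketch below is a minimal example under stated assumptions: the label-to-id mapping, the helper names, and the toy label lists are all hypothetical, and a real experiment would use the dataset's own training and validation splits.

```python
from collections import Counter

# Hypothetical mapping; the id order is an arbitrary choice, not defined by the dataset.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def encode_labels(labels):
    """Convert label strings to integer ids (raises KeyError on unknown labels)."""
    return [LABEL_TO_ID[label] for label in labels]


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_baseline(train_ids, n_examples):
    """Predict the most frequent training label for every example."""
    most_common_id, _ = Counter(train_ids).most_common(1)[0]
    return [most_common_id] * n_examples


# Toy labels invented for illustration (not taken from the dataset).
train_ids = encode_labels(["entailment", "neutral", "entailment", "contradiction"])
validation_ids = encode_labels(["entailment", "contradiction", "entailment", "neutral"])

baseline_predictions = majority_baseline(train_ids, len(validation_ids))
baseline_accuracy = accuracy(validation_ids, baseline_predictions)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A trained NLI model should score clearly above this baseline on the validation files; if it does not, the label encoding or data loading is worth checking first.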
Coverage
The dataset covers the Turkish language and is suitable for use globally. Because the data has been translated from English sources, it is also useful for cross-lingual studies. The available sources do not detail a specific time range or demographic scope for the data collection.
License
CC0
Who Can Use It
The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:
- The natural language processing (NLP) community.
- The machine learning community.
- Seasoned and budding researchers looking to delve into NLI tasks.
- Developers aiming to create automated models for Turkish language inference.
- Academics and practitioners exploring the cross-lingual generalisation capabilities of models.
- Anyone working on NLP tasks in Turkish, such as sentence paraphrasing, text classification, or question answering.
Dataset Name Suggestions
- NLI-TR (Turkish NLI Research)
- Turkish Natural Language Inference Dataset
- SNLI-TR and MNLI-TR Turkish Data
- Turkish Textual Entailment Data
Attributes
Original Data Source: NLI-TR (Turkish NLI Research)