Opendatabay APP

Turkish Natural Language Inference Dataset

Education & Learning Analytics

Tags and Keywords

Education

NLP

Text

Languages

Sampling

Free

About

The NLI-TR dataset comprises two datasets, SNLI-TR and MNLI-TR, built to support natural language inference (NLI) research in Turkish. Both consist of natural language inference data translated into Turkish from the original English SNLI and MNLI sources. Each record pairs a Turkish premise with a Turkish hypothesis and is labelled to indicate whether the premise entails, contradicts, or is neutral towards the hypothesis. The resource lets researchers develop automated models for inference on Turkish text and study cross-lingual generalisation, that is, how models trained on data in one language perform when applied to another. It also supports related tasks such as sentence paraphrasing, classification, and question answering.

Columns

The dataset records typically include the following columns:
  • premise: This column contains sentences written in Turkish. These sentences have been translated from the English sources used for the original SNLI and MNLI datasets. It serves as the contextual information or the initial statement from which an inference is to be made.
  • hypothesis: This column also contains sentences in Turkish, translated from the English SNLI and MNLI datasets. It represents the conclusion or the statement whose relationship to the premise is being assessed.
  • label: This column assigns a relationship between the premise and hypothesis. Possible values include:
    • 'entailment': The hypothesis logically follows from the premise.
    • 'contradiction': The hypothesis directly contradicts the premise.
    • 'neutral': The premise neither entails nor contradicts the hypothesis; the hypothesis may or may not be true given the premise.
  • domain: An optional column assigned by some authors, primarily used when inferences are made between sentences across different semantic domains, such as weather, sports, or finance.
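
As a rough illustration of the column layout above, the following Python sketch parses records and converts the three label strings to integers. The numeric mapping, the function name `encode_records`, and the sample rows are assumptions made for this example; the listing does not prescribe an encoding, and the sample sentences are not taken from the dataset. Rows with any other label value are skipped, in case the files carry unlabelled pairs.

```python
import csv
import io

# Assumed label-to-integer mapping; the listing names the three labels
# but does not prescribe a numeric encoding.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def encode_records(csv_text):
    """Parse CSV text into (premise, hypothesis, label_id) tuples.

    Rows whose label is not one of the three listed values are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    encoded = []
    for row in reader:
        label = (row.get("label") or "").strip().lower()
        if label not in LABEL_TO_ID:
            continue
        encoded.append((row["premise"], row["hypothesis"], LABEL_TO_ID[label]))
    return encoded


# Hypothetical sample rows, written for illustration only.
SAMPLE = (
    "premise,hypothesis,label\n"
    "Bir adam gitar çalıyor.,Bir adam müzik yapıyor.,entailment\n"
    "Bir adam gitar çalıyor.,Adam uyuyor.,contradiction\n"
    "Bir adam gitar çalıyor.,Adam bir konserde.,neutral\n"
    "Bir kadın koşuyor.,Kadın parkta.,-\n"
)

if __name__ == "__main__":
    for premise, hypothesis, label_id in encode_records(SAMPLE):
        print(label_id, premise, "=>", hypothesis)
```

The optional domain column is ignored here; it could be carried through as a fourth tuple element where it is present.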

Distribution

The data is typically provided in CSV format and includes training and validation sets to support model development and evaluation. Key files mentioned are SNLI_tr_train.csv for training models, snli_tr_validation for validating model accuracy on unseen data, and multinli_tr_validation_{matched / mismatched}.csv for additional validation on more complex scenarios. The multinli_tr_train.csv file contains Turkish sentence pairs with their corresponding labels and holds approximately 392,700 records, which makes the dataset large-scale.
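
Before training on a file of this size, it can help to check how the labels are distributed. The sketch below streams one CSV split and counts its labels. It assumes the file is UTF-8 encoded, has a header row, and contains a `label` column as described under Columns; the function names are hypothetical.

```python
import csv
from collections import Counter


def label_distribution(path, label_column="label"):
    """Count how often each label value occurs in one CSV split."""
    counts = Counter()
    # newline="" is what the csv module recommends for file objects.
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[row[label_column]] += 1
    return counts


def label_shares(path, label_column="label"):
    """Return each label's share of the rows as a fraction of the total."""
    counts = label_distribution(path, label_column)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

For example, `label_shares("multinli_tr_train.csv")` would report the class balance of the MNLI-TR training split, assuming that file is in the working directory. Reading row by row keeps memory use flat regardless of file size.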

Usage

This dataset is ideal for various applications and use cases in NLP and machine learning:
  • Developing Natural Language Inference (NLI)-based question answering systems for the Turkish language.
  • Supporting sentiment analysis work on Turkish text, for example through transfer learning; the columns described above carry no sentiment labels.
  • Building Machine Learning Chatbots that leverage NLI to understand conversational context and respond appropriately in Turkish.
  • Conducting general NLI research in Turkish.
  • Investigating cross-lingual generalisation capabilities of NLP models.
  • Tasks such as sentence paraphrasing, classification, and other NLP techniques applied to Turkish text.
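
Whichever of these uses is pursued, a trivial baseline gives a floor against which a trained model can be compared. The sketch below predicts the most frequent training label for every validation pair and reports accuracy. It is a generic baseline rather than a method described in the listing, and the toy label lists are invented for illustration.

```python
from collections import Counter


def majority_label(train_labels):
    """Return the most frequent label in the training split."""
    if not train_labels:
        raise ValueError("train_labels must not be empty")
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    train = ["entailment", "entailment", "neutral", "contradiction"]
    validation = ["entailment", "neutral", "entailment", "contradiction"]

    baseline = majority_label(train)
    predictions = [baseline] * len(validation)
    print(baseline, accuracy(predictions, validation))
```

With three roughly balanced classes, such a baseline would land near one third, so a model's validation accuracy is only meaningful relative to that figure.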

Coverage

The dataset covers the Turkish language and is listed with a global region. Because the data was translated from English sources, it is also suited to cross-lingual studies. The available sources do not specify a time range or demographic scope for the data collection.

License

CC0

Who Can Use It

The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:
  • The natural language processing (NLP) community.
  • The machine learning community.
  • Seasoned and budding researchers looking to delve into NLI tasks.
  • Developers aiming to create automated models for Turkish language inference.
  • Academics and practitioners exploring the cross-lingual generalisation capabilities of models.
  • Anyone working on NLP tasks in Turkish, such as sentence paraphrasing, text classification, or question answering.

Dataset Name Suggestions

  • NLI-TR (Turkish NLI Research)
  • Turkish Natural Language Inference Dataset
  • SNLI-TR and MNLI-TR Turkish Data
  • Turkish Textual Entailment Data

Attributes

Original Data Source: NLI-TR (Turkish NLI Research)

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

17/06/2025

REGION

GLOBAL

UDQS (UNIVERSAL DATA QUALITY SCORE)

5 / 5

VERSION

1.0
