Textual Relatedness Classifier Dataset

Education & Learning Analytics

Tags and Keywords

Online

Communities

NLP

Binary

Classification

Textual Relatedness Classifier Dataset on the Opendatabay data marketplace

Free

About

This dataset supports the task of determining whether two given sentences originate from the same article [1]. It comprises sentence pairs randomly sampled from across Wikipedia [1]. It is intended as a resource for Natural Language Processing (NLP) tasks, specifically binary classification [1], and it suits training and evaluating machine learning models that need to identify topical similarity or relatedness between text segments [1].

Columns

  • id: A unique identifier for each sentence pair [2].
  • sent1: The first sentence in the pair [2].
  • sent2: The second sentence in the pair [2].
  • same_source: A binary field indicating whether sent1 and sent2 came from the same article (1) or different articles (0) [2, 3].
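
The column layout above can be read with Python's standard csv module. The following is a minimal sketch, not the publisher's loading code: the sample rows are invented for illustration, and the SentencePair and read_pairs names are hypothetical. It assumes a CSV file with a header row that matches the column names listed above.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical sample rows in the documented column layout; not real dataset content.
SAMPLE_CSV = """id,sent1,sent2,same_source
0,"The river flows north through the valley.","Its basin covers several provinces.",1
1,"The river flows north through the valley.","The album was released in 1987.",0
"""


@dataclass
class SentencePair:
    id: str
    sent1: str
    sent2: str
    same_source: int  # 1 = same article, 0 = different articles


def read_pairs(handle):
    """Parse rows from a CSV file-like object into SentencePair records."""
    pairs = []
    for row in csv.DictReader(handle):
        label = int(row["same_source"])
        if label not in (0, 1):
            raise ValueError(f"unexpected same_source value: {label}")
        pairs.append(SentencePair(row["id"], row["sent1"], row["sent2"], label))
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE_CSV))
print(len(pairs), pairs[0].same_source, pairs[1].same_source)
```

To read a downloaded file instead of the inline sample, pass a handle opened with newline="" and an explicit encoding, as the csv module documentation recommends.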

Distribution

The dataset contains over 129,000 sentence pairs [2-5]. The same_source labels are well balanced, with approximately 64,515 pairs from different articles and 64,641 pairs from the same article [3]. The source information does not state the file format explicitly; such datasets are typically provided in CSV format [6], and this listing offers the download as CSV.
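
The balance claim can be checked with a few lines of arithmetic. This sketch uses the two counts reported in the listing (given there as approximate) and a hypothetical helper, label_balance, that computes the same fractions from any list of labels.

```python
from collections import Counter


def label_balance(labels):
    """Return {label: fraction of rows} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return {label: n / total for label, n in counts.items()}


# Counts as reported in the listing (stated there as approximate).
reported = {0: 64_515, 1: 64_641}
total = sum(reported.values())
same_fraction = reported[1] / total

print(total)                    # 129156
print(round(same_fraction, 4))  # 0.5005
```

With the classes this close to an even split, a model that always predicts one label scores about 50% accuracy, so that is the floor any trained classifier should clear.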

Usage

This dataset is well-suited for a variety of applications, including:
  • Training NLP models for text classification and semantic similarity [1].
  • Developing algorithms to identify related content or detect plagiarism.
  • Researching and improving techniques for topic modelling and document clustering.
  • Educational purposes in machine learning and data science curricula [1].
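
As a starting point for the classification use case, a trivial word-overlap baseline can be sketched in a few lines. This is an illustrative technique chosen here, not a method described by the dataset's publisher: it scores a pair by the Jaccard similarity of its word sets and predicts same_source = 1 above a threshold. The function names and the 0.1 threshold are assumptions, and the example sentences are invented.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(sent1, sent2):
    """Jaccard similarity of the two sentences' token sets (0.0 to 1.0)."""
    a, b = tokens(sent1), tokens(sent2)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_same_source(sent1, sent2, threshold=0.1):
    """Predict 1 (same article) if overlap reaches the threshold, else 0.

    The default threshold is an arbitrary assumption; tune it on held-out pairs.
    """
    return 1 if jaccard(sent1, sent2) >= threshold else 0


related = predict_same_source(
    "The river flows north through the valley.",
    "The valley floods when the river rises.",
)
unrelated = predict_same_source(
    "The river flows north through the valley.",
    "An album was released in 1987.",
)
print(related, unrelated)
```

Common function words inflate the overlap score, so this baseline is weak by design. Its value is as a reference point that learned models trained on the sentence pairs should outperform.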

Coverage

The sentences in this dataset are drawn from a random sample of Wikipedia articles [1]. The dataset is not restricted to a particular geographic region, time range, or demographic focus; its coverage is as broad as Wikipedia's content [1]. Efforts were made to remove data containing controversial words or hate speech; however, given the dataset's size, some such material may remain, since it reflects language found on Wikipedia [1].

License

CC-BY-NC

Who Can Use It

This dataset is particularly useful for:
  • AI and Machine Learning Practitioners: For developing and testing models on text relatedness and classification [1].
  • Data Scientists: To explore text data and build predictive models.
  • Researchers: In the fields of NLP, information retrieval, and computational linguistics.
  • Students and Educators: As a practical dataset for learning about text processing and classification algorithms [1].

Dataset Name Suggestions

  • Are Two Sentences of the Same Topic?
  • Wikipedia Sentence Pair Similarity Dataset
  • Textual Relatedness Classifier Dataset
  • Sentence Source Verification Data

Attributes

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

26/06/2025

REGION

GLOBAL

UDQS QUALITY

5 / 5

VERSION

1.0

Download Dataset in CSV Format