Textual Relatedness Classifier Dataset

Education & Learning Analytics

Tags and Keywords

Online, Communities, NLP, Binary, Classification

Free

About

This dataset supports the task of determining whether two given sentences originate from the same article [1]. It comprises sentence pairs randomly sampled from across Wikipedia [1]. It is intended for Natural Language Processing (NLP) tasks, specifically binary classification problems [1], and suits training and evaluating machine learning models that need to identify topical similarity or relatedness between textual segments [1].

Columns

id: A unique identifier for each sentence pair [2].
sent1: The first sentence in the pair [2].
sent2: The second sentence in the pair [2].
same_source: A binary field indicating whether sent1 and sent2 came from the same article (1) or different articles (0) [2, 3].
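
As a minimal sketch of how the four columns above might be read and checked, the snippet below parses a CSV with Python's standard csv module. It assumes the file has a header row using exactly the column names listed here; the SentencePair class, the load_pairs helper, and the sample rows are hypothetical and written for illustration, not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass

# Column names as described in the listing.
EXPECTED_COLUMNS = ("id", "sent1", "sent2", "same_source")


@dataclass(frozen=True)
class SentencePair:
    """Hypothetical container for one row of the dataset."""
    id: str
    sent1: str
    sent2: str
    same_source: int  # 1 = same article, 0 = different articles


def load_pairs(fileobj):
    """Parse sentence pairs from an open CSV file object with a header row."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    pairs = []
    for row in reader:
        label = int(row["same_source"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label {label!r} for id {row['id']}")
        pairs.append(SentencePair(row["id"], row["sent1"], row["sent2"], label))
    return pairs


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE_CSV = """id,sent1,sent2,same_source
0,"The river flows north.","It empties into the sea.",1
1,"The river flows north.","The band released two albums.",0
"""

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
print(len(pairs), pairs[0].same_source, pairs[1].same_source)
```

To read a downloaded file instead, pass an object opened with open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved; the UTF-8 encoding is an assumption.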

Distribution

The dataset contains over 129,000 sentence pairs [2-5]. The label distribution for same_source is close to an even split, with approximately 64,515 pairs from different articles and 64,641 pairs from the same article [3]. The source information does not state the exact file format, although such datasets are typically provided in CSV format [6]; this listing offers the dataset as a CSV download.
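
The arithmetic behind the balance claim can be sketched as follows. The counts are the ones quoted above, treated here as exact for the sake of the calculation; the variable names are illustrative. The share of the larger class is also the accuracy that a trivial model predicting one label for every pair would reach, which gives a floor for comparing real models.

```python
# Label counts as quoted in the listing (0 = different articles, 1 = same article).
counts = {0: 64_515, 1: 64_641}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the more frequent label.
majority_baseline = max(shares.values())

print(total)
print({label: round(share, 4) for label, share in shares.items()})
print(round(majority_baseline, 4))
```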

Usage

This dataset is well-suited for a variety of applications, including:

Training NLP models for text classification and semantic similarity [1].
Developing algorithms to identify related content or detect plagiarism.
Researching and improving techniques for topic modelling and document clustering.
Educational purposes in machine learning and data science curricula [1].
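
For the classification use case, a very simple starting point is a word-overlap baseline: score each pair by the Jaccard similarity of its token sets and predict "same article" above a threshold. This is a sketch of one possible baseline, not a method described by the dataset's source; the function names, the 0.2 threshold, and the toy rows are all assumptions chosen for illustration.

```python
import re


def tokens(text):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def predict_same_source(sent1, sent2, threshold=0.2):
    """Predict 1 (same article) when token overlap reaches the threshold."""
    return 1 if jaccard(sent1, sent2) >= threshold else 0


def accuracy(rows, threshold=0.2):
    """Fraction of (sent1, sent2, label) rows predicted correctly."""
    correct = sum(
        predict_same_source(s1, s2, threshold) == label for s1, s2, label in rows
    )
    return correct / len(rows)


# Made-up rows for illustration only; they are not taken from the dataset.
toy_rows = [
    ("The river flows north through the valley.", "The river empties into the sea.", 1),
    ("The river flows north through the valley.", "She released two albums in 1999.", 0),
]
print(accuracy(toy_rows))
```

Common words such as "the" inflate the overlap score, so in practice the threshold would need tuning on held-out pairs, and a learned model is likely to do better than this heuristic.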

Coverage

The sentences within this dataset are drawn from a random sample of Wikipedia articles [1]. The dataset does not specify a particular geographic region, time range, or specific demographic focus, as it aims for broad coverage inherent in Wikipedia's content [1]. Efforts were made to remove data containing controversial words or hate speech; however, due to the dataset's size, some such material may still be present as it reflects language found on Wikipedia [1].
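
Because some unwanted material may remain, users who need stricter filtering could screen pairs against their own term list before training. The snippet below is a minimal sketch of that idea; the screen_pairs helper, the dictionary row layout, and the placeholder term are hypothetical, and the listing does not describe how the original filtering was done.

```python
import re


def contains_blocked_term(text, blocked):
    """Return True if any whole word of text is in the lowercased blocked set."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return not words.isdisjoint(blocked)


def screen_pairs(rows, blocklist):
    """Split rows into (kept, flagged) using a user-supplied list of terms.

    Each row is assumed to be a dict with "sent1" and "sent2" keys.
    """
    blocked = {term.lower() for term in blocklist}
    kept, flagged = [], []
    for row in rows:
        if contains_blocked_term(row["sent1"], blocked) or contains_blocked_term(
            row["sent2"], blocked
        ):
            flagged.append(row)
        else:
            kept.append(row)
    return kept, flagged


# Made-up rows and a placeholder term, for illustration only.
rows = [
    {"sent1": "A calm sentence.", "sent2": "Another calm one."},
    {"sent1": "This has PLACEHOLDER in it.", "sent2": "Fine."},
]
kept, flagged = screen_pairs(rows, ["placeholder"])
print(len(kept), len(flagged))
```

Whole-word matching avoids flagging innocent words that merely contain a blocked term as a substring, at the cost of missing inflected or misspelled variants.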

License

CC-BY-NC

Who Can Use It

This dataset is particularly useful for:

AI and Machine Learning Practitioners: For developing and testing models on text relatedness and classification [1].
Data Scientists: To explore text data and build predictive models.
Researchers: In the fields of NLP, information retrieval, and computational linguistics.
Students and Educators: As a practical dataset for learning about text processing and classification algorithms [1].

Dataset Name Suggestions

Are Two Sentences of the Same Topic?
Wikipedia Sentence Pair Similarity Dataset
Textual Relatedness Classifier Dataset
Sentence Source Verification Data

Attributes

Original Data Source: Are Two Sentences of the Same Topic?

Listing Stats

LISTED: 26/06/2025
REGION: GLOBAL
QUALITY: 5 / 5
VERSION: 1.0

Download Dataset in CSV Format
