Textual Relatedness Classifier Dataset
Education & Learning Analytics
Free
About
This dataset is designed to help determine whether two given sentences originate from the same article [1]. It comprises sentence pairs randomly sampled from across Wikipedia [1]. Its primary purpose is to support Natural Language Processing (NLP) tasks framed as binary classification [1]. It suits training and evaluating machine learning models that need to identify topical similarity or relatedness between text segments [1].
Columns
- id: A unique identifier for each sentence pair [2].
- sent1: The first sentence in the pair [2].
- sent2: The second sentence in the pair [2].
- same_source: A binary field indicating whether sent1 and sent2 came from the same article (1) or different articles (0) [2, 3].
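The column layout above can be checked when the data is loaded. The following is a minimal sketch, assuming the data arrives as CSV with a header row; the two sample rows and the `load_pairs` helper are hypothetical and exist only to illustrate the documented columns.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample in the documented column layout. The real file
# name and exact format are not stated in the listing.
SAMPLE_CSV = """id,sent1,sent2,same_source
0,"The river flows north.","It empties into the sea.",1
1,"The river flows north.","The band released two albums.",0
"""

EXPECTED_COLUMNS = ["id", "sent1", "sent2", "same_source"]


def load_pairs(handle):
    """Read sentence pairs, checking the header and the binary label."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    pairs = []
    for row in reader:
        label = int(row["same_source"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label {label} for id {row['id']}")
        pairs.append(
            {
                "id": row["id"],
                "sent1": row["sent1"],
                "sent2": row["sent2"],
                "same_source": label,
            }
        )
    return pairs


pairs = load_pairs(io.StringIO(SAMPLE_CSV))

# Count how many pairs carry each label (1 = same article, 0 = different).
label_counts = Counter(pair["same_source"] for pair in pairs)
```

To read an actual file, an open file handle can be passed to `load_pairs` in place of the `io.StringIO` wrapper.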
Distribution
The dataset is substantial, containing over 129,000 sentence pairs [2-5]. The label distribution for same_source is well-balanced, with approximately 64,515 pairs from different articles and 64,641 pairs from the same article [3]. While the exact file format is not specified in the initial information, such datasets are typically provided in CSV format [6].
Usage
This dataset is well-suited for a variety of applications, including:
- Training NLP models for text classification and semantic similarity [1].
- Developing algorithms to identify related content or detect plagiarism.
- Researching and improving techniques for topic modelling and document clustering.
- Educational purposes in machine learning and data science curricula [1].
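As a starting point for the classification use case, a trivial baseline can help put later model results in context. The sketch below is not part of the dataset or its source; it assumes a simple word-overlap (Jaccard) score with an arbitrary threshold, and the example pairs, function names, and threshold value are all hypothetical.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(sent1, sent2):
    """Share of distinct tokens that the two sentences have in common."""
    t1, t2 = tokens(sent1), tokens(sent2)
    if not t1 and not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)


def predict_same_source(sent1, sent2, threshold=0.1):
    """Predict 1 (same article) when overlap reaches the assumed threshold."""
    return 1 if jaccard(sent1, sent2) >= threshold else 0


def accuracy(rows, threshold=0.1):
    """Fraction of (sent1, sent2, same_source) rows predicted correctly."""
    correct = sum(
        predict_same_source(s1, s2, threshold) == label for s1, s2, label in rows
    )
    return correct / len(rows)


# Hypothetical pairs for illustration only; they are not taken from the dataset.
toy_rows = [
    ("The river flows north through the valley.",
     "The valley floods when the river rises.", 1),
    ("The river flows north through the valley.",
     "Her second album sold poorly.", 0),
]

toy_accuracy = accuracy(toy_rows)
```

Common words such as "the" inflate the overlap score, so a baseline like this would likely need stop-word removal and a threshold tuned on held-out data before its accuracy on the real pairs means much. With labels that are close to evenly split, about 50% is the accuracy to beat.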
Coverage
The sentences within this dataset are drawn from a random sample of Wikipedia articles [1]. The dataset does not specify a particular geographic region, time range, or specific demographic focus, as it aims for broad coverage inherent in Wikipedia's content [1]. Efforts were made to remove data containing controversial words or hate speech; however, due to the dataset's size, some such material may still be present as it reflects language found on Wikipedia [1].
License
CC-BY-NC
Who Can Use It
This dataset is particularly useful for:
- AI and Machine Learning Practitioners: For developing and testing models on text relatedness and classification [1].
- Data Scientists: To explore text data and build predictive models.
- Researchers: In the fields of NLP, information retrieval, and computational linguistics.
- Students and Educators: As a practical dataset for learning about text processing and classification algorithms [1].
Dataset Name Suggestions
- Are Two Sentences of the Same Topic?
- Wikipedia Sentence Pair Similarity Dataset
- Textual Relatedness Classifier Dataset
- Sentence Source Verification Data
Attributes
Original Data Source: Are Two Sentences of the Same Topic?