Scholarly Contribution Binary Data
Data Science and Analytics
Free
About
This dataset is designed for binary classification tasks in Natural Language Processing (NLP) research. It is an excerpt of annotated data from scholarly NLP articles whose contributions have been structured for integration into Knowledge Graph infrastructures such as the Open Research Knowledge Graph. The annotations include contribution sentences, scientific terms and their relations extracted from these sentences, and semantic triples. The triples are organised under information units such as Research Problem, Approach, Model, Code, and Dataset. The primary purpose of the dataset is to train models that classify each statement in a research paper as either contributing or non-contributing.
Columns
- contents: The text of a statement extracted from a scholarly article.
- label: A binary value, '0' or '1', classifying the corresponding statement. In this dataset, '0' represents a contributing statement in a research paper, while '1' represents a non-contributing statement.
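The two columns can be read with Python's standard `csv` module. The following is a minimal sketch, assuming the file has a header row with exactly the column names listed above; the sample rows and the `read_statements` helper are invented for illustration and are not part of the dataset.

```python
import csv
import io

# Invented sample standing in for the real CSV file (assumption: a header
# row with the two column names described in the listing).
SAMPLE_CSV = '''contents,label
"We propose a new parsing model.",0
"The rest of the paper is organised as follows.",1
'''

# Label meaning as stated in the column description above.
LABEL_NAMES = {"0": "contributing", "1": "non-contributing"}


def read_statements(handle):
    """Return a list of (text, label) pairs, validating columns and labels."""
    reader = csv.DictReader(handle)
    missing = {"contents", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for record in reader:
        label = record["label"].strip()
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label value: {label!r}")
        rows.append((record["contents"], int(label)))
    return rows


rows = read_statements(io.StringIO(SAMPLE_CSV))
for text, label in rows:
    print(LABEL_NAMES[str(label)], "|", text)
```

To read a real file instead, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the compiled CSV was saved.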
Distribution
The dataset is typically provided as a CSV file. It is derived from NLP scholarly articles and is structured for binary classification. It contains 55,201 unique statements: 50,137 carry one label (presumably '0') and 5,064 carry the other (presumably '1'), which together account for the full total. No further breakdown of rows or records is available beyond these label counts. A script is needed to compile the data for use.
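The label counts above imply a strong class imbalance, which matters when training and evaluating a classifier. The sketch below checks the arithmetic and derives inverse-frequency class weights using the common heuristic n / (k * n_c), where n is the total number of statements, k the number of classes, and n_c the count for class c. The keys `majority` and `minority` are placeholders, since the listing only presumes which label each count belongs to.

```python
# Label counts as reported in the listing; which label ('0' or '1') each
# count belongs to is only presumed there, so neutral keys are used here.
counts = {"majority": 50_137, "minority": 5_064}

total = sum(counts.values())
assert total == 55_201  # the two counts account for every statement

# Share of each class in the dataset.
shares = {name: n / total for name, n in counts.items()}

# Inverse-frequency ("balanced") weights: n / (k * n_c).
weights = {name: total / (len(counts) * n) for name, n in counts.items()}

for name in counts:
    print(f"{name}: share={shares[name]:.1%}, weight={weights[name]:.2f}")
```

Roughly nine in ten statements fall in the larger class, so plain accuracy is a weak metric here; per-class precision, recall, and F1 are more informative.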
Usage
This dataset suits a range of machine learning and NLP applications, including:
- Training and evaluating binary classification models to distinguish between contributing and non-contributing statements in academic texts.
- Developing information extraction systems focused on scholarly content.
- Populating and enhancing Knowledge Graph infrastructures with structured research contributions.
- Conducting NLP research related to argument mining, discourse analysis, or summarisation of scientific articles.
- Creating tools for automated literature review or scientific knowledge organisation.
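As an illustration of the first use case, the following is a minimal baseline sketch: a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only, plus a per-class F1 function. Naive Bayes is chosen here purely as a simple stand-in and is not a method attributed to the dataset's authors. The four training sentences are invented toy data, and the names `train_nb`, `predict_nb`, and `f1_for` are hypothetical helpers.

```python
import math
from collections import Counter, defaultdict

# Invented toy rows (text, label); 0 = contributing, 1 = non-contributing,
# following the label meaning given in the Columns section.
TOY_ROWS = [
    ("we propose a new model for parsing", 0),
    ("our approach achieves state of the art results", 0),
    ("the paper is organized as follows", 1),
    ("we thank the reviewers for comments", 1),
]


def tokenize(text):
    """Lowercase whitespace tokenizer; real text would need more care."""
    return text.lower().split()


def train_nb(rows):
    """Collect per-class document and word counts for Naive Bayes."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in rows:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def predict_nb(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def f1_for(label, gold, pred):
    """F1 score for one class, useful when the classes are imbalanced."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


model = train_nb(TOY_ROWS)
prediction = predict_nb(model, "we propose a new approach")
print("predicted label:", prediction)
```

On the real data, the rows would come from the compiled CSV, with a held-out split for evaluation and F1 reported for each label rather than accuracy alone.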
Coverage
The dataset is not restricted to any geographical region. It is derived from Natural Language Processing scholarly articles and focuses on the structuring of their contributions. No time range or demographic scope is specified for the data. The dataset was listed on 27 June 2025.
License
CC-BY
Who Can Use It
This dataset is intended for:
- Data scientists and machine learning engineers looking to build and train text classification models.
- NLP researchers and academics interested in automated knowledge extraction from scientific literature or the construction of knowledge graphs.
- Organisations and developers aiming to create applications that analyse and summarise research papers.
- Students and educators studying text classification, information retrieval, or knowledge representation in NLP.
Dataset Name Suggestions
- NLP Contribution Classifier
- Research Paper Contribution Classification
- Scholarly Contribution Binary Data
- Article Contribution Identifier
- SemEval 2021 Contribution Dataset
Attributes
Original Data Source: Contribution Graph (Binary Classification)