CL-Scisumm Research Dataset

Data Science and Analytics

Tags and Keywords

Earth and Nature, Beginner, Text, NLP, Deep Learning, PyTorch

Free

About

This dataset is intended for training scientific paper summarisation models that make use of citations, and it supports research into supervised methods. With over 1,000 papers, it is significantly larger than earlier datasets for this task. The data was originally distributed as XML and has been parsed into a single CSV file for easier use. It provides enough material to develop and test summarisation techniques for academic documents.

Columns

  • text: The full text of a scientific research paper. The column has 1,009 unique values.
  • summary: The human-annotated gold summary of the corresponding paper. The column also has 1,009 unique values.

Distribution

The dataset is provided as a CSV file converted from the original XML. It contains roughly 1,000 papers (the column statistics report 1,009 unique values in each column). The structure is simple: one column holds the paper text and the other holds its summary.
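The two-column layout described above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the function name `load_pairs` and the sample rows are made up for illustration, and the real file name is not given in this listing. The raised field-size limit is a precaution, on the assumption that a full paper may not fit in the `csv` module's default limit of 131072 characters per field.

```python
import csv
import io

# Assumption: full paper texts may exceed the csv module's default
# per-field limit (131072 characters), so raise it before reading.
csv.field_size_limit(10_000_000)

EXPECTED_COLUMNS = ("text", "summary")


def load_pairs(source):
    """Read (text, summary) pairs from an open text-mode file object."""
    reader = csv.DictReader(source)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [(row["text"], row["summary"]) for row in reader]


# Tiny in-memory stand-in for the real file (invented rows), so the
# sketch runs on its own.
sample = io.StringIO(
    "text,summary\n"
    '"We present a parser, evaluated on two corpora.","A parser is presented."\n'
    '"This paper studies alignment.","Alignment is studied."\n'
)
pairs = load_pairs(sample)
```

With the downloaded file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved; the UTF-8 encoding is also an assumption.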

Usage

This dataset is suited to training scientific paper summarisation models, especially those that can make use of citation information. It supports research into supervised machine learning methods for text summarisation, and it can be used to develop models that aim to outperform existing benchmarks in scientific document summarisation.
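Supervised training usually needs a held-out set, and the listing does not define one. A minimal sketch of a reproducible train/validation split over (text, summary) pairs follows; the function name `split_pairs`, the validation fraction, and the seed are illustrative choices rather than anything prescribed by the dataset.

```python
import random


def split_pairs(pairs, val_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and return (train, validation) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation example, even for tiny inputs.
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder pairs standing in for the real (text, summary) rows.
pairs = [(f"paper {i}", f"summary {i}") for i in range(20)]
train, val = split_pairs(pairs, val_fraction=0.25)
```

Fixing the seed keeps the split identical between runs, which matters when comparing models on a dataset of only about 1,000 examples.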

Coverage

The dataset is global in scope and focuses on papers in computational linguistics and natural language processing (NLP). The listing does not give a time range for the papers, though the underlying project has been organised since 2014.

License

CC-BY-SA

Who Can Use It

This dataset is suitable for:
  • Researchers and Academics: For developing and evaluating new summarisation algorithms and models for scientific literature.
  • Data Scientists and NLP Engineers: For building practical applications that require automated summarisation of research papers.
  • Machine Learning Practitioners: For exploring advanced deep learning architectures for sequence-to-sequence tasks in academic contexts.

Dataset Name Suggestions

  • ScisummNet Corpus
  • Scientific Document Summarisation Dataset
  • Academic Paper NLP Summaries
  • CL-Scisumm Research Dataset

Attributes

Original Data Source: ScisummNet Corpus

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 22/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
