CL-Scisumm Research Dataset
Data Science and Analytics
Free
About
This dataset is intended for training scientific paper summarisation models that make use of citations, and it supports research into supervised methods. With over 1,000 papers, it is significantly larger than previous datasets of its kind. The data was originally distributed in XML and has been parsed into a single CSV file for easier use, making it suitable for developing and testing summarisation techniques for academic documents.
Columns
- text: The full text of a scientific research paper. This column has 1,009 unique values.
- summary: The human-annotated gold summary for the corresponding paper. This column also has 1,009 unique values.
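As a minimal sketch of how the two columns might be read, the snippet below uses only Python's standard library. The helper name `load_pairs` and the sample rows are illustrative assumptions, not part of the dataset. The field-size limit is raised because a full paper can exceed the `csv` module's default per-field limit.

```python
import csv
import io

# Full paper texts can exceed the csv module's default per-field limit
# (131072 characters), so raise it before reading.
csv.field_size_limit(10_000_000)


def load_pairs(handle):
    """Read (text, summary) pairs from an open CSV file handle.

    Hypothetical helper: it only assumes the two column names
    documented above, `text` and `summary`.
    """
    reader = csv.DictReader(handle)
    missing = {"text", "summary"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [(row["text"], row["summary"]) for row in reader]


# Tiny in-memory stand-in for the real file (made-up rows, not dataset content).
sample = io.StringIO(
    "text,summary\n"
    '"We present a parser, evaluated on two treebanks.","A parser is presented."\n'
    '"This paper studies alignment.","Alignment is studied."\n'
)
pairs = load_pairs(sample)
print(len(pairs), "pairs loaded")
```

With the actual file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved; the encoding is an assumption.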
Distribution
The dataset is provided as a single CSV file, converted from its original XML structure. It contains roughly 1,000 papers (1,009 unique values in each column). The structure is simple: one column holds the paper text and the other holds its corresponding summary.
Usage
This dataset is suited to training scientific paper summarisation models, especially those that leverage citation information. It supports research into supervised machine learning methods for text summarisation and can be used to develop models that aim to outperform existing benchmarks for scientific document summarisation.
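For supervised experiments, a held-out validation set is usually needed, and this listing does not describe a predefined split. The sketch below shows one possible deterministic split. The function name `split_pairs`, the 10% validation fraction, and the seed are all illustrative assumptions, and the placeholder pairs stand in for the real (text, summary) rows.

```python
import random


def split_pairs(pairs, val_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, validation) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A seeded Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder pairs standing in for the dataset rows.
pairs = [(f"paper {i}", f"summary {i}") for i in range(1009)]
train, val = split_pairs(pairs)
print(len(train), "training pairs,", len(val), "validation pairs")
```

Fixing the seed means that results from different models can be compared on the same validation papers.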
Coverage
Geographic coverage is global. The dataset focuses specifically on papers in the computational linguistics and natural language processing (NLP) domains. No specific time range is given for the papers, though the underlying project has been organised since 2014.
License
CC-BY-SA
Who Can Use It
This dataset is suitable for:
- Researchers and Academics: For developing and evaluating new summarisation algorithms and models for scientific literature.
- Data Scientists and NLP Engineers: For building practical applications that require automated summarisation of research papers.
- Machine Learning Practitioners: For exploring advanced deep learning architectures for sequence-to-sequence tasks in academic contexts.
Attributes
Original Data Source: ScisummNet Corpus