Academic Paper Abstract Summarisation Dataset
Data Science and Analytics
Free
About
This dataset supports scientific document summarisation, specifically abstractive summarisation. It comprises 5,400 TLDRs (Too Long; Didn't Read summaries) derived from over 3,200 scientific papers. The summaries include both TLDRs written by the papers' authors and TLDRs created by experts using a novel annotation protocol, which is designed to yield high-quality summaries with minimal annotation effort. The SciTLDR-A version uses only the abstract of each paper as the source text. It is intended for developing and evaluating natural language processing models that generate succinct summaries of academic content.
Columns
- source: This column contains the abstract of the paper, with each sentence presented on a separate line.
- source_labels: A binary value (0 or 1), where '1' indicates an oracle sentence, signifying its importance for summarisation.
- rouge_scores: Precomputed ROUGE scores for each sentence of the source, usable as a baseline for future research and model evaluation.
- paper_id: The arXiv paper ID, which uniquely identifies each scientific document.
- target: This column provides multiple summaries for each paper, with each summary presented on a separate line.
- title: The title of the scientific paper.
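As a rough illustration of the column layout above, the following Python sketch parses one row into lists. The sample row is invented for illustration and is not taken from the dataset. The serialisation of source_labels (0/1 digits separated by arbitrary punctuation) is an assumption, as is the helper name parse_row; rouge_scores is left untouched because its storage format is not described here.

```python
import csv
import io
import re

# Hypothetical one-row sample mimicking the columns described above.
# All values are invented for illustration.
SAMPLE = (
    "source,source_labels,rouge_scores,paper_id,target,title\n"
    '"We study task X.\nWe propose method Y.","[0, 1]","[0.1, 0.6]",abc123,'
    '"Method Y for task X.\nA new approach to X.",Example Title\n'
)


def parse_row(row):
    """Split the multi-line text columns of one CSV row into Python lists."""
    # One sentence per line in `source`, one summary per line in `target`.
    sentences = [s.strip() for s in row["source"].splitlines() if s.strip()]
    summaries = [t.strip() for t in row["target"].splitlines() if t.strip()]
    # Assumption: labels are stored as integer 0/1 digits with arbitrary
    # separators, for example "[0, 1]" or "0 1".
    labels = [int(d) for d in re.findall(r"[01]", row["source_labels"])]
    return {
        "paper_id": row["paper_id"],
        "title": row["title"],
        "sentences": sentences,
        "labels": labels,
        "summaries": summaries,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
record = parse_row(rows[0])
print(record["sentences"])
print(record["labels"])
print(record["summaries"])
```

If the real files serialise these columns differently, only the three parsing lines inside parse_row should need to change.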
Distribution
The dataset is provided in CSV format, with an approximately 60/20/20 split for training, development (validation), and testing. The SciTLDR-A subset contains 1,992 entries for training, 618 for validation, and 619 for testing, which is 3,229 entries in total. Total row counts for the dataset beyond this subset are not detailed. Sample files are updated separately on the platform.
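The split sizes quoted for SciTLDR-A can be checked against the stated 60/20/20 proportions with a few lines of Python. This is only arithmetic on the figures given above; the dictionary keys are illustrative names.

```python
# Split sizes for the SciTLDR-A subset, as stated in this listing.
splits = {"train": 1992, "validation": 618, "test": 619}

total = sum(splits.values())
# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)
print(shares)
```

The shares come out close to, but not exactly, 60/20/20.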
Usage
This dataset is ideally suited for tasks related to summarisation, particularly within the domain of scientific documents. It can be utilised for:
- Developing and training abstractive summarisation models for academic papers.
- Research into natural language processing and generation techniques.
- Evaluating the performance of summarisation algorithms using the provided precomputed ROUGE scores.
- Exploring novel annotation protocols for creating high-quality summary datasets.
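The evaluation use case above can be sketched as follows. This is a minimal example, assuming whitespace tokenisation and a simplified unigram-overlap F1 in the spirit of ROUGE-1; it is not the official ROUGE implementation and will not reproduce the precomputed rouge_scores column. Taking the best score across the multiple target summaries is one common convention for multi-reference data, used here as an assumption. All function names and example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified unigram-overlap F1 between two strings (ROUGE-1 style)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_rouge1(candidate, references):
    """Score a candidate against several references and keep the best."""
    return max(rouge1_f1(candidate, ref) for ref in references)


def oracle_sentences(sentences, labels):
    """Return the source sentences whose source_labels entry is 1."""
    return [s for s, flag in zip(sentences, labels) if flag == 1]


# Invented example values, not taken from the dataset.
sentences = ["we study task x", "we propose method y for task x"]
labels = [0, 1]
references = ["method y for task x", "a new approach"]

oracle = oracle_sentences(sentences, labels)
score = best_rouge1(" ".join(oracle), references)
print(oracle, round(score, 4))
```

Scoring the oracle sentences this way gives a simple extractive reference point against which an abstractive model's output can be compared.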
Coverage
The dataset's regional coverage is global, and all content is in English. It was listed on 24th June 2025, and is currently at Version 1.0. Given its focus on scientific abstracts, demographic scope is not applicable.
License
CC0
Who Can Use It
This dataset is beneficial for a wide range of users, including:
- Data scientists and machine learning engineers working on NLP tasks, especially summarisation.
- Academic researchers and students in artificial intelligence, computer science, and linguistics.
- Developers building applications that require automated scientific document summarisation.
- Anyone interested in the methodology of creating high-quality, expert-derived summaries.
Dataset Name Suggestions
- Scientific Abstract TLDR Summaries (SciTLDR-A)
- Academic Paper Abstract Summarisation Dataset
- Expert-Annotated Scientific Summaries
- Multi-Target Scientific TLDRs
Attributes
Original Data Source: Scientific Document Summarization (SciTLDR-A)