Academic Paper Abstract Summarisation Dataset

Data Science and Analytics

Tags and Keywords

Earth

Nature

Computer

Science

Tabular

NLP

Research

Transformers

Free

About

This dataset is a focused collection for scientific document summarisation, specifically abstractive summarisation. It comprises 5,400 TLDRs (Too Long; Didn't Read summaries) derived from over 3,200 scientific papers. The summaries include both author-written TLDRs and expert-derived TLDRs, the latter collected with a novel annotation protocol designed to produce high-quality summaries with minimal annotation effort. The SciTLDR-A version uses only the paper's abstract as the source text. The dataset is suited to developing and evaluating natural language processing models that generate succinct summaries of academic content.

Columns

source: This column contains the abstract of the paper, with each sentence presented on a separate line.
source_labels: A binary value (0 or 1), where '1' indicates an oracle sentence, signifying its importance for summarisation.
rouge_scores: Precomputed ROUGE baseline scores are included for each sentence, offering a foundational metric for future research and model evaluation.
paper_id: The arXiv paper ID, which uniquely identifies each scientific document.
target: This column provides multiple summaries for each paper, with each summary presented on a separate line.
title: The title of the scientific paper.
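
As a rough illustration of how these columns might be read, the sketch below parses one made-up row with Python's standard library. The cell encodings are assumptions rather than documented facts about the file: `source` and `target` are taken to be newline-separated text, as described above, while `source_labels` and `rouge_scores` are assumed to be stored as list literals such as `[0, 1]`. The sample row and the `parse_row` helper are hypothetical.

```python
import ast
import csv
import io

# Hypothetical one-row sample. The cell encodings below are assumptions,
# not the documented layout of the real CSV file.
SAMPLE_CSV = (
    "source,source_labels,rouge_scores,paper_id,target,title\n"
    '"First sentence.\nSecond sentence.","[0, 1]","[0.1, 0.4]",demo-id,'
    '"Short summary A.\nShort summary B.",Demo title\n'
)


def parse_row(row):
    """Turn one CSV row (a dict of strings) into structured fields."""
    sentences = row["source"].split("\n")            # one abstract sentence per line
    labels = ast.literal_eval(row["source_labels"])  # assumed list literal of 0/1
    scores = ast.literal_eval(row["rouge_scores"])   # assumed list literal of floats
    if not (len(sentences) == len(labels) == len(scores)):
        raise ValueError("sentence, label and score counts differ")
    return {
        "paper_id": row["paper_id"],
        "title": row["title"],
        "sentences": sentences,
        # Oracle sentences are the ones labelled 1.
        "oracle": [s for s, flag in zip(sentences, labels) if flag == 1],
        "rouge_scores": scores,
        "targets": row["target"].split("\n"),        # one summary per line
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["oracle"], rows[0]["targets"])
```

If the real file encodes its list-valued cells differently, only the three parsing lines inside `parse_row` would need to change.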

Distribution

The dataset is provided in CSV format and follows a roughly 60/20/20 split for training, development (validation), and testing. The SciTLDR-A subset contains 1,992 entries for training, 618 for validation, and 619 for testing, or 3,229 entries in total. Row counts for the dataset beyond this subset are not detailed. Sample files are updated separately on the platform.
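
The split figures quoted for SciTLDR-A can be checked with a few lines of arithmetic. The sketch below uses only the three counts given above; the variable names are illustrative.

```python
# Split sizes for the SciTLDR-A subset, as quoted in the listing.
splits = {"train": 1992, "validation": 618, "test": 619}

total = sum(splits.values())

# Percentage share of each split, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)   # 3229
print(shares)  # {'train': 61.7, 'validation': 19.1, 'test': 19.2}
```

The actual proportions are therefore close to, but not exactly, 60/20/20.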

Usage

This dataset is suited to summarisation tasks, particularly on scientific documents. It can be used for:

Developing and training abstractive summarisation models for academic papers.
Research into natural language processing and generation techniques.
Evaluating the performance of summarisation algorithms using the provided precomputed ROUGE scores.
Exploring novel annotation protocols for creating high-quality summary datasets.
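
To give a feel for the evaluation use case, here is a minimal sketch of a simplified ROUGE-1 F1 score computed against several reference summaries. It is not the official ROUGE toolkit and is not necessarily how the precomputed `rouge_scores` column was produced: it uses plain whitespace tokenisation with no stemming, and taking the best score across references is just one common way to handle multiple targets. The function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: unigram overlap, lower-cased, whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_rouge1(candidate, references):
    """Score a candidate against several references and keep the best match."""
    return max(rouge1_f1(candidate, ref) for ref in references)


score = best_rouge1("the model works", ["the model fails", "nothing"])
print(round(score, 3))
```

For reportable results, an established ROUGE implementation would be the safer choice; this sketch only shows the shape of the computation.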

Coverage

The dataset's regional coverage is global, and all content is in English. It was listed on 24th June 2025, and is currently at Version 1.0. Given its focus on scientific abstracts, demographic scope is not applicable.

License

CC0

Who Can Use It

This dataset is beneficial for a wide range of users, including:

Data scientists and machine learning engineers working on NLP tasks, especially summarisation.
Academic researchers and students in artificial intelligence, computer science, and linguistics.
Developers building applications that require automated scientific document summarisation.
Anyone interested in the methodology of creating high-quality, expert-derived summaries.

Dataset Name Suggestions

Scientific Abstract TLDR Summaries (SciTLDR-A)
Academic Paper Abstract Summarisation Dataset
Expert-Annotated Scientific Summaries
Multi-Target Scientific TLDRs

Attributes

Original Data Source: Scientific Document Summarization (SciTLDR-A)

Listing Stats

Listed: 24/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
