AI Summarisation Model Evaluation Dataset

Data Science and Analytics

Tags and Keywords: Earth, Data, NLP, Text, Summarisation, AI, Models

Free

About

This dataset provides a corpus for natural language processing tasks, designed for building text summarisation tools and for validating OpenAI's reward models. It contains text summaries sourced from the TL;DR, CNN, and Daily Mail datasets, together with supplementary information: the choices workers made during the summarisation process, batch details that distinguish groups of worker-generated summaries, and dataset split attributes. Users can train summarisation systems on real-world data and compare model output directly against human-generated summaries.

Columns

  • info: Provides contextual information about the original text to be summarised, including an ID, title, site, and the full article content.
  • summary: Contains the generated summaries of text from the source datasets.
  • worker: Denotes the specific worker who produced a given summary, useful for analysing worker-specific trends or biases.
  • batch: Indicates the batch identifier for summaries, helping to differentiate groups of summaries created by workers.
  • split: Specifies the dataset attribute split (e.g., training, validation) for machine learning tasks.
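As a rough illustration of how these columns might be read, the sketch below parses a CSV with the five listed columns using only the Python standard library. It assumes the info cell is stored as a JSON object or a Python-literal dict, and the key names inside it (id, title, site, article) are guesses based on the description above, not confirmed by the listing. The sample row is made up.

```python
import ast
import csv
import io
import json

EXPECTED_COLUMNS = ["info", "summary", "worker", "batch", "split"]


def parse_info(raw):
    """Decode an info cell into a dict (assumes JSON or a Python-literal dict)."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = ast.literal_eval(raw)
    if not isinstance(value, dict):
        raise ValueError("info cell did not decode to a mapping")
    return value


def read_rows(handle):
    """Yield one dict per CSV row, with the info column decoded."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        row["info"] = parse_info(row["info"])
        yield row


# Made-up row for illustration only; not taken from the dataset.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(EXPECTED_COLUMNS)
writer.writerow([
    json.dumps({"id": "x1", "title": "Example", "site": "example", "article": "Long text."}),
    "Short text.", "worker_a", "batch_0", "train",
])
sample.seek(0)

rows = list(read_rows(sample))
print(rows[0]["info"]["title"], rows[0]["split"])
```

If the real files encode info differently, only parse_info would need to change.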

Distribution

The dataset is provided primarily as CSV files, with separate files for training, validation, and testing, such as train.csv, validation.csv, and axis_test.csv. The total number of rows across the files is not stated.
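Since row counts are not given, one way to obtain them after downloading is to count the records in each file. The following is a minimal sketch, assuming the three file names mentioned above sit in a single local directory and are UTF-8 encoded with one header row; the directory path is a placeholder.

```python
import csv
from pathlib import Path

# File names as mentioned in the listing; other files may exist.
SPLIT_FILES = ("train.csv", "validation.csv", "axis_test.csv")


def count_rows(data_dir, filenames=SPLIT_FILES):
    """Return {filename: record count}, or None for files that are absent."""
    counts = {}
    for name in filenames:
        path = Path(data_dir) / name
        if not path.exists():
            counts[name] = None
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            counts[name] = sum(1 for _ in reader)
    return counts


if __name__ == "__main__":
    # "." is a placeholder for wherever the files were saved.
    print(count_rows("."))
```

Using csv.reader rather than counting raw lines matters here because article text inside quoted fields may contain line breaks.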

Usage

This dataset is ideal for:
  • Training natural language processing models to automatically generate text summaries.
  • Evaluating OpenAI's reward model for natural language processing, aiming to enhance its accuracy and performance.
  • Analysing worker and batch information to identify trends that might indicate bias or other issues impacting summarisation accuracy.
  • Developing machine learning models that evaluate the quality of generated summaries.
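The worker and batch analysis mentioned above could start with something as simple as comparing summary lengths per group. The sketch below is one possible approach, not a method described by the listing: it assumes rows are dicts with summary, worker, and batch keys, and uses whitespace-separated word count as a crude length measure. The example rows are made up.

```python
from collections import defaultdict
from statistics import mean


def summary_length_by(rows, key):
    """Group rows by `key` and report count and mean summary length in words."""
    lengths = defaultdict(list)
    for row in rows:
        lengths[row[key]].append(len(row["summary"].split()))
    return {
        group: {"count": len(values), "mean_words": mean(values)}
        for group, values in lengths.items()
    }


# Made-up rows for illustration only; not taken from the dataset.
example_rows = [
    {"worker": "w1", "batch": "b1", "summary": "one two three"},
    {"worker": "w1", "batch": "b2", "summary": "one"},
    {"worker": "w2", "batch": "b1", "summary": "one two"},
]

by_worker = summary_length_by(example_rows, "worker")
by_batch = summary_length_by(example_rows, "batch")
print(by_worker)
print(by_batch)
```

Large differences between groups would only be a prompt for closer inspection; length alone does not indicate bias or quality.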

Coverage

The dataset's content is derived from existing sources, namely the TL;DR, CNN, and Daily Mail datasets, providing broad topical coverage. Its geographic scope is global. No time range is stated for the original articles; the dataset itself was listed on 11/06/2025. No demographic notes are provided.

License

CC0

Who Can Use It

  • Data scientists and machine learning engineers developing and refining NLP models.
  • AI researchers focusing on text summarisation and generative AI.
  • Developers looking to integrate high-quality summarisation capabilities into their applications.
  • Academics and students studying natural language processing and model evaluation.

Dataset Name Suggestions

  • OpenAI Text Summarisation Corpus
  • AI Summarisation Model Evaluation Dataset
  • NLP Human-Generated Summaries
  • Machine Learning Summarisation Benchmark
  • Text Summary Reward Model Data

Attributes

Original Data Source: OpenAI Summarization Corpus

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 11/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0