OpenAI Summarization Corpus
Data Science and Analytics
Free
About
This dataset is a corpus for natural language processing tasks, specifically for building text summarization tools and validating reward models from OpenAI. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with additional information: the choices made by workers when summarizing the text, batch information that distinguishes the different summaries created by workers, and the dataset split attribute. Together, this data lets users train natural language processing systems on real-world data to produce concise, reliable summaries of long-form text. It also lets developers explore summarization research while measuring their results directly against human-generated ones.
How to use the dataset
This dataset provides a corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets, intended for training and evaluating summarization models. It contains separate training and validation data.
To use this dataset for summarization tasks:
Find the text you want to summarize by looking at the info column entries in the two .csv files (train and validation).
Pick a summary using the choice column of either .csv file, according to whether you want to select by worker or by batch.
Review the corresponding summaries column for alternatives with similar content but different wording or style, in case you prefer one of them over the entry originally chosen by the worker or batch.
Check the split, worker, and batch columns for more context on each choice before settling on a summary, judging it by how accurately and clearly it reflects the content.
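The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the dataset's documented schema: it assumes each summaries cell is a JSON list of objects with a "text" field and that choice is an integer index into that list. The sample rows, the helper names (make_sample_csv, chosen_summary, summarize_rows), and all values are hypothetical, so check the real .csv files and adjust the parsing to match.

```python
import csv
import io
import json

# Column names taken from the description above; cell formats are assumptions.
FIELDS = ["info", "summaries", "choice", "worker", "batch", "split"]


def make_sample_csv():
    """Build a tiny in-memory stand-in for the train or validation .csv (hypothetical values)."""
    rows = [
        {
            "info": json.dumps({"post": "Long post about moving to a new city."}),
            "summaries": json.dumps([
                {"text": "Moved to a new city."},
                {"text": "Moving was stressful but worth it."},
            ]),
            "choice": 1,
            "worker": "worker_a",
            "batch": "batch_1",
            "split": "train",
        },
        {
            "info": json.dumps({"post": "Long post about adopting a dog."}),
            "summaries": json.dumps([
                {"text": "Adopted a dog and love it."},
                {"text": "Dogs."},
            ]),
            "choice": 0,
            "worker": "worker_b",
            "batch": "batch_1",
            "split": "train",
        },
    ]
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    buf.seek(0)
    return buf


def chosen_summary(row):
    """Return the text of the summary that the choice column points at."""
    candidates = json.loads(row["summaries"])
    return candidates[int(row["choice"])]["text"]


def summarize_rows(handle):
    """Collect the chosen summary, the alternatives, and the context columns per row."""
    results = []
    for row in csv.DictReader(handle):
        candidates = [c["text"] for c in json.loads(row["summaries"])]
        chosen = chosen_summary(row)
        results.append({
            "worker": row["worker"],
            "batch": row["batch"],
            "split": row["split"],
            "chosen": chosen,
            "alternatives": [c for c in candidates if c != chosen],
        })
    return results


results = summarize_rows(make_sample_csv())
for r in results:
    print(r["worker"], r["batch"], r["split"], "->", r["chosen"])
```

To run this against the actual files, replace make_sample_csv() with an open file handle for the train or validation .csv, and swap json.loads for whatever parser matches how the cells are really serialized.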
Research Ideas
Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
Analyzing the worker and batch information to identify trends among workers or batches that could indicate bias or other issues affecting summarization accuracy.
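As a rough starting point for the worker and batch analysis, the sketch below computes how often each choice index was selected per worker and per batch. The records are hypothetical (worker, batch, choice) tuples, and choice_rates is an illustrative helper rather than part of the dataset. In practice the values would come from the worker, batch, and choice columns of the .csv files.

```python
from collections import Counter, defaultdict

# Hypothetical (worker, batch, choice) records standing in for real rows.
records = [
    ("worker_a", "batch_1", 0),
    ("worker_a", "batch_1", 1),
    ("worker_a", "batch_2", 1),
    ("worker_b", "batch_1", 0),
    ("worker_b", "batch_2", 0),
]


def choice_rates(records, key_index):
    """Return, for each group, the fraction of rows in which each choice index was selected.

    key_index selects the grouping field: 0 for worker, 1 for batch.
    """
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[key_index]][rec[2]] += 1

    rates = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        rates[key] = {choice: n / total for choice, n in counter.items()}
    return rates


by_worker = choice_rates(records, 0)
by_batch = choice_rates(records, 1)

for worker, rates in sorted(by_worker.items()):
    print(worker, rates)
for batch, rates in sorted(by_batch.items()):
    print(batch, rates)
```

A worker or batch whose rates lean heavily toward one position is only a hint worth investigating, not evidence of bias on its own, since the skew may reflect which summaries happened to be shown.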
Original Data Source: OpenAI Summarization Corpus