Gutenberg Text and Audio Book Dataset

Tags and Keywords

Text, Literature, NLP, Audio, Books

About

This dataset is a corpus of over 15,000 book texts, with their authors and titles, scraped from the Project Gutenberg website by a custom script that parsed all bookshelves. It is suited to natural language processing, literary analysis, and text mining. Some entries are audiobooks, which also allows for audio data analysis.

Columns

Title: The title of the book.
Author: The author of the book.
Link: The URL to the book's page on the Project Gutenberg website.
ID: The unique identifier for the book, extracted from the URL.
Bookshelf: The category or bookshelf the book is associated with on the Project Gutenberg website.
Text: The full, cleaned text content of the book.
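
The ID column is described as being extracted from the book's URL. A minimal sketch of how such an extraction might look is shown here. It assumes links follow the public Project Gutenberg pattern `https://www.gutenberg.org/ebooks/<number>`; the listing does not state the exact link format, and the function name `book_id_from_link` is hypothetical.

```python
import re
from typing import Optional

# Assumption: links look like https://www.gutenberg.org/ebooks/<number>.
# The listing does not document the exact format of the Link column.
_EBOOK_ID = re.compile(r"/ebooks/(\d+)")


def book_id_from_link(link: str) -> Optional[int]:
    """Return the numeric book ID found in a link, or None if there is none."""
    match = _EBOOK_ID.search(link)
    return int(match.group(1)) if match else None


print(book_id_from_link("https://www.gutenberg.org/ebooks/1342"))  # 1342
print(book_id_from_link("https://www.gutenberg.org/about/"))       # None
```

Returning None for links that do not match keeps malformed rows visible instead of silently assigning them a wrong ID.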

Distribution

The data is distributed as a CSV file containing the book texts and their metadata. The file is approximately 5 GB in size.
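
A file of roughly 5 GB is often easier to stream row by row than to load in one go. The sketch below uses Python's standard `csv` module to count books per bookshelf without holding the full texts in memory. It assumes the CSV header uses the column names listed above exactly as written (for example `Bookshelf`), which the listing does not confirm; the helper names and the file name in the comment are hypothetical, and the sample rows are placeholders rather than real entries.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Full book texts can exceed the csv module's default field size limit,
# so raise it before reading.
csv.field_size_limit(2**31 - 1)


def iter_books(source: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one row at a time so the whole file never sits in memory."""
    yield from csv.DictReader(source)


def count_by_bookshelf(source: TextIO) -> Dict[str, int]:
    """Count rows per value of the (assumed) 'Bookshelf' column."""
    counts: Dict[str, int] = {}
    for row in iter_books(source):
        shelf = row["Bookshelf"]
        counts[shelf] = counts.get(shelf, 0) + 1
    return counts


# Tiny in-memory stand-in for the real file (placeholder data only).
sample = io.StringIO(
    "Title,Author,Link,ID,Bookshelf,Text\n"
    'Book A,Author A,https://example.org/1,1,Shelf X,"First line\nsecond line"\n'
    "Book B,Author B,https://example.org/2,2,Shelf X,Some text\n"
    "Book C,Author C,https://example.org/3,3,Shelf Y,More text\n"
)
print(count_by_bookshelf(sample))  # {'Shelf X': 2, 'Shelf Y': 1}

# With the real file (hypothetical name), open it with newline="" so that
# line breaks inside quoted text fields are preserved:
# with open("gutenberg_books.csv", newline="", encoding="utf-8") as f:
#     print(count_by_bookshelf(f))
```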

Usage

This dataset is ideal for a wide range of applications, including:

Natural Language Processing (NLP) research and model training.
Literary analysis and computational linguistics.
Text mining and information retrieval tasks.
Developing applications that involve large text corpora.
Audio data analysis for entries that are audiobooks.
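
As a small illustration of the text mining use case, the following sketch counts the most frequent words in a single book text. The tokenisation rule (lowercase ASCII letters and apostrophes) is a simplifying assumption for demonstration, not part of the dataset, and `top_words` is a hypothetical helper; in practice the input would be the value of one row's Text column.

```python
import re
from collections import Counter
from typing import List, Tuple

# Simplified tokeniser: runs of lowercase ASCII letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def top_words(text: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent tokens in text with their counts."""
    return Counter(TOKEN.findall(text.lower())).most_common(n)


print(top_words("The cat saw the dog. The dog ran.", n=2))
# [('the', 3), ('dog', 2)]
```

For non-English books this tokeniser would drop accented characters, so a Unicode-aware pattern or a dedicated NLP library would be a better fit there.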

Coverage

Coverage is global: the content is sourced from the Project Gutenberg digital library and spans a wide period of literary history. Books on the website that have not yet been categorised are not included. The data is not expected to be updated.

License

CC BY-NC-SA 4.0

Who Can Use It

Data Scientists and NLP Researchers: For training language models and performing text analysis.
Academics and Students: For literary studies and digital humanities research.
Software Developers: For building applications that require a large library of literary texts.

Dataset Name Suggestions

Project Gutenberg 15k Book Text Corpus
Gutenberg Literary Text Collection
Multilingual Book Text Corpus for NLP
Gutenberg Text and Audio Book Dataset

Attributes

Original Data Source: Gutenberg Text and Audio Book Dataset

Listing Stats

Listed: 17/09/2025
Region: Global
Quality: 5 / 5
Version: 1.0
