Project Gutenberg Book Corpus

Education & Learning Analytics

Tags and Keywords

Text

Literature

NLP

Audio

Education

Books

Free

About

This dataset is a collection of over 15,000 book texts, complete with their authors and titles. It was compiled by scraping the Project Gutenberg website, specifically by parsing its bookshelves. The dataset includes metadata such as titles, authors, categories (bookshelves), and download links for the book texts. Project Gutenberg books that have not been assigned to a bookshelf are not included. The dataset also retains audiobooks, which is useful for anyone interested in audio data alongside text.

Columns

The dataset primarily includes the following columns:
  • Title: The title of the book.
  • Author: The author of the book.
  • Link: The direct download link for the book's text.
  • Bookshelf: The category or genre assigned to the book on Project Gutenberg.
  • Text Data: The actual text content of the books, which can be downloaded using a provided script.
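As a rough illustration of how these columns might be used, the sketch below counts books per bookshelf from a metadata file. It assumes the CSV header uses the column names exactly as listed above (Title, Author, Link, Bookshelf), which has not been checked against the actual file; the function name and the sample rows are invented for the example.

```python
import csv
import io
from collections import Counter


def count_books_per_bookshelf(csv_file):
    """Return a Counter mapping bookshelf name to number of metadata rows."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Assumption: the header is spelled "Bookshelf", as in the listing.
        counts[row["Bookshelf"].strip()] += 1
    return counts


# Tiny in-memory stand-in for gutenberg_metadata.csv (rows are invented).
sample = io.StringIO(
    "Title,Author,Link,Bookshelf\n"
    "Book A,Author One,http://example.org/a.txt,Adventure\n"
    "Book B,Author Two,http://example.org/b.txt,Adventure\n"
    "Book C,Author Three,http://example.org/c.txt,Poetry\n"
)
shelf_counts = count_books_per_bookshelf(sample)
print(shelf_counts.most_common())
```

With the real file, the same function could be called on a handle opened with open("gutenberg_metadata.csv", newline="", encoding="utf-8"); the UTF-8 encoding is also an assumption.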

Distribution

The metadata is distributed in a gutenberg_metadata.csv file. The full text of each book can be downloaded with the gutenberg_download.py script, which saves the results to a CSV file. This final CSV file, containing the book texts, authors, titles, and categories, is approximately 5 GB in size and holds more than 15,000 unique book texts.
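The contents of gutenberg_download.py are not shown in this listing, so the following is only a minimal sketch of what a metadata-to-corpus step could look like, not the script itself. The function names, the output column names, and the policy of skipping failed downloads are all assumptions made for illustration. The download function is passed in as a parameter so the logic can be tried without network access.

```python
import csv
import io
import urllib.request


def fetch_text(url, timeout=30):
    """Download one book as text; undecodable bytes are replaced, not raised."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def build_corpus(metadata_file, output_file, fetch=fetch_text):
    """Fetch the text behind each metadata row and write a combined CSV.

    Returns the number of books written. Rows whose download fails are
    skipped (an assumed policy, chosen to keep the sketch short).
    """
    reader = csv.DictReader(metadata_file)
    fields = ["Title", "Author", "Bookshelf", "Text"]  # assumed output layout
    writer = csv.DictWriter(output_file, fieldnames=fields)
    writer.writeheader()
    written = 0
    for row in reader:
        try:
            text = fetch(row["Link"])
        except OSError:  # urllib's URLError and timeouts derive from OSError
            continue
        writer.writerow({
            "Title": row["Title"],
            "Author": row["Author"],
            "Bookshelf": row["Bookshelf"],
            "Text": text,
        })
        written += 1
    return written


def fake_fetch(url):
    """Offline stand-in for fetch_text, used only in this demonstration."""
    if url.endswith("missing.txt"):
        raise OSError("simulated download failure")
    return "Full text of " + url


metadata = io.StringIO(
    "Title,Author,Link,Bookshelf\n"
    "Book A,Author One,http://example.org/a.txt,Adventure\n"
    "Book B,Author Two,http://example.org/missing.txt,Poetry\n"
    "Book C,Author Three,http://example.org/c.txt,Poetry\n"
)
output = io.StringIO()
written = build_corpus(metadata, output, fetch=fake_fetch)
rows = list(csv.DictReader(io.StringIO(output.getvalue(), newline="")))
print(written, [r["Title"] for r in rows])
```

When reading a corpus file of this kind back with Python's csv module, full book texts can exceed the module's default field size limit; csv.field_size_limit can be called with a larger value before parsing.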

Usage

This dataset is ideal for various applications in education and learning analytics. Specific use cases include:
  • Natural Language Processing (NLP) tasks, such as text analysis, topic modelling, and language understanding.
  • Literature studies and computational humanities research.
  • Developing and training AI and Machine Learning models on large text corpora.
  • Working with audio data, as some books are included as audiobooks.
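As a small example of the text-analysis use case, the sketch below counts the most frequent words in a piece of text. The tokenisation rule (lowercase letters and apostrophes only) and the function name are simplifying assumptions; a real project would more likely use a dedicated NLP library.

```python
import re
from collections import Counter

# Assumed tokenisation: runs of lowercase ASCII letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def top_words(text, n=3):
    """Return the n most frequent tokens in text as (word, count) pairs."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(tokens).most_common(n)


sample_text = "The whale, the sea, and the ship. The sea was calm."
result = top_words(sample_text, n=2)
print(result)  # [('the', 4), ('sea', 2)]
```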

Coverage

The dataset's region coverage is global, reflecting the diverse origins of books within Project Gutenberg. It covers only books that have been categorised on the Project Gutenberg website; uncategorised books are not included. No specific time range or demographic scope is detailed in the available information.

License

CC-BY-SA

Who Can Use It

This dataset is suitable for:
  • Researchers and academics focusing on text analysis, literary studies, or digital humanities.
  • Data scientists and machine learning engineers building and testing NLP models.
  • Students undertaking projects in linguistics, computer science, or library science.
  • Developers creating applications that require a large corpus of literary texts.

Dataset Name Suggestions

  • Project Gutenberg Book Corpus
  • Digital Literature Collection
  • Classic Book Text Dataset
  • Historical Text Library

Attributes

Original Data Source: 15000 Gutenberg Books

Listing Stats

VIEWS

3

DOWNLOADS

0

LISTED

11/06/2025

REGION

GLOBAL

UDQS (Universal Data Quality Score) QUALITY

5 / 5

VERSION

1.0
