Gutenberg Text and Audio Book Dataset
Free
About
A corpus of over 15,000 book texts with their authors and titles, scraped from the Project Gutenberg website by a custom script that parsed all bookshelves. The dataset is suited to natural language processing, literary analysis, and text mining. Some entries are audiobooks, which also opens the door to audio data analysis.
Columns
- Title: The title of the book.
- Author: The author of the book.
- Link: The URL to the book's page on the Project Gutenberg website.
- ID: The unique identifier for the book, extracted from the URL.
- Bookshelf: The category or bookshelf the book is associated with on the Project Gutenberg website.
- Text: The full, cleaned text content of the book.
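The ID column is described as being extracted from the URL. A minimal sketch of how such an extraction could work is given here, assuming the Link values follow the usual Project Gutenberg pattern ending in /ebooks/<number>. The function name extract_book_id is hypothetical, and the script actually used to build the dataset may differ.

```python
import re
from typing import Optional


def extract_book_id(link: str) -> Optional[int]:
    """Return the numeric book ID from a Gutenberg-style URL, or None.

    Assumption: links look like https://www.gutenberg.org/ebooks/1342.
    """
    match = re.search(r"/ebooks/(\d+)", link)
    return int(match.group(1)) if match else None
```

Returning None instead of raising keeps a bulk pass over the Link column from stopping at the first malformed URL.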
Distribution
The data is distributed as a single CSV file containing the book texts along with their metadata. The file is approximately 5 GB in size.
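A file of this size may not fit comfortably in memory, so reading it row by row is one option. The sketch below uses only the Python standard library and assumes the CSV has a header row whose names match the Columns section exactly (for example "Bookshelf"). The helper names iter_books and count_by_bookshelf are illustrative, not part of the dataset.

```python
import csv
from collections import Counter
from typing import Dict, Iterator, TextIO

# Full book texts far exceed the csv module's default per-field limit
# (131072 characters), so raise it before reading any rows.
csv.field_size_limit(2**31 - 1)


def iter_books(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one row at a time as a dict keyed by the header names."""
    yield from csv.DictReader(handle)


def count_by_bookshelf(handle: TextIO) -> Counter:
    """Count rows per bookshelf without holding the texts in memory.

    Assumption: the header contains a column named "Bookshelf".
    """
    counts: Counter = Counter()
    for row in iter_books(handle):
        counts[row["Bookshelf"]] += 1
    return counts
```

When opening the real file, pass newline="" and an explicit encoding (UTF-8 is assumed here) to open(), so that line breaks inside quoted Text fields are parsed correctly.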
Usage
This dataset is ideal for a wide range of applications, including:
- Natural Language Processing (NLP) research and model training.
- Literary analysis and computational linguistics.
- Text mining and information retrieval tasks.
- Developing applications that involve large text corpora.
- Audio data analysis for entries that are audiobooks.
Coverage
The content is global in scope, sourced from the Project Gutenberg digital library, and spans a long period of literary history. Books on the website that have not yet been categorised into a bookshelf are not included. The data is not expected to be updated.
License
CC BY-NC-SA 4.0
Who Can Use It
- Data Scientists and NLP Researchers: For training language models and performing text analysis.
- Academics and Students: For literary studies and digital humanities research.
- Software Developers: For building applications that require a large library of literary texts.
Dataset Name Suggestions
- Project Gutenberg 15k Book Text Corpus
- Gutenberg Literary Text Collection
- Multilingual Book Text Corpus for NLP
- Gutenberg Text and Audio Book Dataset
Attributes
Original Data Source: Gutenberg Text and Audio Book Dataset