Project Gutenberg Book Corpus
Education & Learning Analytics
Free
About
This dataset is a collection of over 15,000 book texts, complete with their authors and titles. It was compiled by scraping the Project Gutenberg website, specifically by parsing its bookshelves. The dataset includes metadata such as titles, authors, categories (bookshelves), and download links for the book texts. Project Gutenberg books that have not been assigned to a bookshelf are not included. Notably, the dataset also retains audiobooks, offering flexibility for users interested in audio data alongside text.
Columns
The dataset primarily includes the following columns:
- Title: The title of the book.
- Author: The author of the book.
- Link: The direct download link for the book's text.
- Bookshelf: The category or genre assigned to the book on Project Gutenberg.
- Text Data: The actual text content of the books, which can be downloaded using a provided script.
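As a rough illustration of how the metadata columns above might be used, the sketch below counts books per bookshelf with Python's standard csv module. The header names (Title, Author, Link, Bookshelf) are assumed to match the column list, and the sample rows are hypothetical placeholders rather than real entries; the actual file may use different header spellings.

```python
import csv
import io
from collections import Counter


def count_by_bookshelf(csv_file, column="Bookshelf"):
    """Count metadata rows per bookshelf value (column name is an assumption)."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Hypothetical sample rows standing in for gutenberg_metadata.csv.
SAMPLE_METADATA = """Title,Author,Link,Bookshelf
Example Book A,Author One,http://example.org/a.txt,Adventure
Example Book B,Author Two,http://example.org/b.txt,Adventure
Example Book C,Author Three,http://example.org/c.txt,Poetry
"""

counts = count_by_bookshelf(io.StringIO(SAMPLE_METADATA))
print(counts.most_common())
```

To run this against the real metadata file, the in-memory sample could be swapped for a file opened with open("gutenberg_metadata.csv", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.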
Distribution
The dataset's metadata is initially available in a gutenberg_metadata.csv file. The full text data for each book can be downloaded using a gutenberg_download.py script, which then saves the results into a CSV file. This final CSV file, containing the book texts, authors, titles, and categories, is approximately 5 GB in size. The corpus features more than 15,000 unique book texts.
Usage
This dataset is ideal for various applications in education and learning analytics. Specific use cases include:
- Natural Language Processing (NLP) tasks, such as text analysis, topic modelling, and language understanding.
- Literature studies and computational humanities research.
- Developing and training AI and Machine Learning models on large text corpora.
- Working with audio data, as some books are included as audiobooks.
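For the text analysis use cases above, a file of roughly 5 GB is usually easier to process one row at a time than to load whole. The following is a minimal sketch of that approach, not the dataset author's own tooling: it streams rows with the standard csv module and tallies word frequencies. The column name Text and the sample rows are assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter

# Whole books exceed the csv module's default per-field limit (131072
# characters), so raise it before reading rows that hold full texts.
csv.field_size_limit(2**31 - 1)

WORD_RE = re.compile(r"[a-z']+")


def top_words(csv_file, text_column="Text", n=2):
    """Stream rows one by one and return the n most common lowercase words."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts.update(WORD_RE.findall(row[text_column].lower()))
    return counts.most_common(n)


# Hypothetical rows standing in for the final CSV; "Text" is an assumed header.
SAMPLE_CORPUS = (
    "Title,Author,Bookshelf,Text\n"
    'Example A,Author One,Poetry,"the sea and the sky"\n'
    'Example B,Author Two,Adventure,"the road and the hills"\n'
)

print(top_words(io.StringIO(SAMPLE_CORPUS)))
```

Because the reader yields one row at a time, memory use is bounded by the longest single book rather than by the size of the whole file.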
Coverage
The dataset has global regional coverage, reflecting the diverse origins of books within Project Gutenberg. It includes only books that have been categorised on the Project Gutenberg website; uncategorised books are not included. No specific time range or demographic scope is detailed in the available information.
License
CC-BY-SA
Who Can Use It
This dataset is suitable for:
- Researchers and academics focusing on text analysis, literary studies, or digital humanities.
- Data scientists and machine learning engineers building and testing NLP models.
- Students undertaking projects in linguistics, computer science, or library science.
- Developers creating applications that require a large corpus of literary texts.
Dataset Name Suggestions
- Project Gutenberg Book Corpus
- Digital Literature Collection
- Classic Book Text Dataset
- Historical Text Library
Attributes
Original Data Source: 15000 Gutenberg Books