Goodreads and Google Books Data
Education & Learning Analytics
Free
About
This dataset was initially developed as a foundational example for a recommender system article. It addresses shortcomings found in other available datasets, such as missing book descriptions, mixed languages without proper indicators, or unusual delimiters, by providing a curated and cleaned collection of book information [1]. The dataset is derived from ISBNs sourced from Soumik's Goodreads-books dataset, with additional book details extracted via the Google Books API [2]. It is suitable for exploratory data analysis, clustering books by topic or category, and building content-based recommendation engines utilising various fields from book descriptions [2].
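The extraction code itself is not part of this listing, so the following is only a sketch of how an ISBN lookup against the Google Books API might be turned into a row of this dataset. It uses the public volumes endpoint with an `isbn:` query and reads fields from the `volumeInfo` object of the first result. The helper names (`build_lookup_url`, `extract_fields`) and the sample response are hypothetical, and the mapping to column names is an assumption rather than the author's documented method. No network request is made.

```python
from urllib.parse import urlencode

# Public Google Books volumes endpoint; the query uses the "isbn:" keyword.
GOOGLE_BOOKS_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def build_lookup_url(isbn):
    """Return the lookup URL for one ISBN (hypothetical helper)."""
    return GOOGLE_BOOKS_ENDPOINT + "?" + urlencode({"q": "isbn:" + isbn})


def extract_fields(response):
    """Map the first volume in a decoded JSON response to dataset-style fields.

    Returns None when the API found nothing, which mirrors the note in this
    listing that many ISBNs did not return valid results.
    """
    items = response.get("items") or []
    if not items:
        return None
    info = items[0].get("volumeInfo", {})
    return {
        "title": info.get("title"),
        "subtitle": info.get("subtitle"),
        # Multi-valued fields are joined with a semicolon, as in books.csv.
        "authors": ";".join(info.get("authors", [])),
        "categories": ";".join(info.get("categories", [])),
        "thumbnail": info.get("imageLinks", {}).get("thumbnail"),
        "description": info.get("description"),
        # publishedDate may be "YYYY", "YYYY-MM" or "YYYY-MM-DD"; keep the year.
        "published_year": (info.get("publishedDate") or "")[:4] or None,
        "num_pages": info.get("pageCount"),
    }


# Made-up response in the shape the API returns; not real data.
SAMPLE_RESPONSE = {
    "totalItems": 1,
    "items": [{
        "volumeInfo": {
            "title": "Example Title",
            "authors": ["Author One", "Author Two"],
            "categories": ["Fiction"],
            "publishedDate": "2001-05-01",
            "pageCount": 320,
        }
    }],
}

row = extract_fields(SAMPLE_RESPONSE)
```

The ratings fields (average_rating, ratings_count) are described in this listing as Goodreads values, so they are left out of the sketch.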
Columns
- isbn13: The 13-digit International Standard Book Number, with 6810 valid entries [3].
- isbn10: The 10-digit International Standard Book Number, also with 6810 valid unique entries [3].
- title: The primary title of the book, with 6398 unique titles across 6810 valid entries [3].
- subtitle: An optional secondary title, valid for 35% of entries (2381), with 2009 unique subtitles and 65% missing values [3, 4].
- authors: The author(s) of the book, separated by a semicolon, 99% valid with 3780 unique authors [4].
- categories: Categorisations of the book, separated by a semicolon, 99% valid with 567 unique categories. "Fiction" is the most frequent category [4].
- thumbnail: A URL linking to the book's thumbnail image, 95% valid with 6481 unique URLs [4, 5].
- description: A text description of the book, 96% valid with 6474 unique descriptions [5].
- published_year: The year of the book's publication, 100% valid. Publication years range from 1853 to 2019, with a mean year of 2000 [5, 6].
- average_rating: The average Goodreads rating for the book, 99% valid. Ratings range from 0 to 5, with a mean of 3.93 [6, 7].
- num_pages: The number of pages in the book, 99% valid. Page counts range from 0 to 3342, with a mean of 348 [7].
- ratings_count: The total number of Goodreads ratings received by the book, 99% valid. Rating counts range from 0 to 5.63 million, with a mean of 21,100 [8].
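The column list above implies three parsing concerns: semicolon-separated values in authors and categories, missing values in several fields, and numeric columns. A minimal sketch of a loader using only the standard library follows. The two sample rows are invented for illustration, and the function names (`parse_row`, `load_books`, `valid_fraction`) are hypothetical. Numeric values are read via `float` first on the assumption that the file may store years or counts as "2004.0".

```python
import csv
import io

# Invented two-row sample in the 12-column layout described above.
SAMPLE_CSV = """isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
9780000000001,0000000001,Example Title,,Author One;Author Two,Fiction,,A short description.,2001,3.9,320,1500
9780000000002,0000000002,Another Title,A Subtitle,Author Three,,,,1999,,,
"""

MULTI_VALUE_COLUMNS = ("authors", "categories")
NUMERIC_COLUMNS = {
    "published_year": int,
    "average_rating": float,
    "num_pages": int,
    "ratings_count": int,
}


def parse_row(row):
    """Convert one raw CSV row into typed values; empty cells become None or []."""
    parsed = {}
    for key, value in row.items():
        value = (value or "").strip()
        if key in MULTI_VALUE_COLUMNS:
            # Split "A;B" into ["A", "B"]; an empty cell yields an empty list.
            parsed[key] = [part.strip() for part in value.split(";") if part.strip()]
        elif key in NUMERIC_COLUMNS:
            parsed[key] = NUMERIC_COLUMNS[key](float(value)) if value else None
        else:
            parsed[key] = value or None
    return parsed


def load_books(handle):
    """Read every row from an open text handle."""
    return [parse_row(row) for row in csv.DictReader(handle)]


def valid_fraction(books, column):
    """Share of rows with a non-empty value in the given column."""
    filled = sum(1 for book in books if book[column] not in (None, []))
    return filled / len(books) if books else 0.0


books = load_books(io.StringIO(SAMPLE_CSV))
```

For the real file, the same loader could be called as `load_books(open("books.csv", newline="", encoding="utf-8"))`; the UTF-8 encoding is an assumption. `valid_fraction(books, "subtitle")` then gives a figure comparable to the validity percentages quoted per column.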
Distribution
The dataset is provided in a CSV file named books.csv, with a file size of 4.14 MB [3]. It comprises 12 distinct columns [3]. The dataset contains 6810 records or rows [3].
Usage
This dataset is ideal for:
- Conducting exploratory data analysis on book attributes [2].
- Clustering books based on their topics or categories [2].
- Developing content-based recommendation engines by leveraging book descriptions and other textual fields [2].
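As an illustration of the last use case, the sketch below ranks books by the similarity of their descriptions, using TF-IDF weights and cosine similarity written with the standard library only. It is one simple approach among many, not the method of the original article. The three (title, description) pairs are invented, and `tokenize`, `tfidf_vectors`, `cosine` and `recommend` are hypothetical names.

```python
import math
import re
from collections import Counter

# Invented (title, description) pairs; real input would come from books.csv.
BOOKS = [
    ("Space Voyage", "A crew travels through space to a distant planet."),
    ("Planet Fall", "Explorers land on a distant planet and survive in space."),
    ("Garden Cooking", "Recipes for cooking vegetables from the garden."),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, books, top_n=2):
    """Return the titles whose descriptions are most similar to the given one."""
    titles = [t for t, _ in books]
    vectors = tfidf_vectors([d for _, d in books])
    idx = titles.index(title)
    scored = [
        (cosine(vectors[idx], vectors[j]), titles[j])
        for j in range(len(books))
        if j != idx
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:top_n]]


closest = recommend("Space Voyage", BOOKS, top_n=1)
```

On the full dataset, rows with a missing description would need to be dropped or given a fallback such as the title or categories, since an empty description produces an empty vector and a similarity of zero to everything.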
Coverage
The dataset primarily covers books published between 1853 and 2019 [6]. The data originates from Goodreads (via Soumik's dataset) and the Google Books API, suggesting a broad scope of titles available through these platforms, although many ISBNs from the original source did not return valid results from the API [2, 9].
License
CC0: Public Domain
Who Can Use It
- Data scientists and analysts for academic research or practical application in data exploration and pattern recognition [2].
- Machine learning engineers interested in building and testing recommendation systems for books [2].
- Researchers studying literary trends, publication patterns, or reader behaviour over time [2].
Dataset Name Suggestions
- 7k Books Dataset
- Curated Books for Recommendations
- Goodreads and Google Books Data
- Book Recommendation Data
- Digital Library Dataset
Attributes
Original Data Source: Goodreads and Google Books Data