Opendatabay APP

Goodreads and Google Books Data

Education & Learning Analytics

Tags and Keywords

Books

Goodreads

API

Ratings

Publishing

Free

About

This dataset was initially developed as a foundational example for a recommender system article. It addresses shortcomings found in other available datasets, such as missing book descriptions, mixed languages without proper indicators, or unusual delimiters, by providing a curated and cleaned collection of book information [1]. The dataset is derived from ISBNs sourced from Soumik's Goodreads-books dataset, with additional book details extracted via the Google Books API [2]. It is suitable for exploratory data analysis, clustering books by topic or category, and building content-based recommendation engines utilising various fields from book descriptions [2].

Columns

  • isbn13: The 13-digit International Standard Book Number, with 6810 valid entries [3].
  • isbn10: The 10-digit International Standard Book Number, also with 6810 valid unique entries [3].
  • title: The primary title of the book, with 6398 unique titles across 6810 valid entries [3].
  • subtitle: An optional secondary title, valid for 35% of entries (2381), with 2009 unique subtitles and 65% missing values [3, 4].
  • authors: The author(s) of the book, separated by a semicolon, 99% valid with 3780 unique authors [4].
  • categories: Categorisations of the book, separated by a semicolon, 99% valid with 567 unique categories. "Fiction" is the most frequent category [4].
  • thumbnail: A URL linking to the book's thumbnail image, 95% valid with 6481 unique URLs [4, 5].
  • description: A text description of the book, 96% valid with 6474 unique descriptions [5].
  • published_year: The year of the book's publication, 100% valid. Publication years range from 1853 to 2019, with a mean year of 2000 [5, 6].
  • average_rating: The average Goodreads rating for the book, 99% valid. Ratings range from 0 to 5, with a mean of 3.93 [6, 7].
  • num_pages: The number of pages in the book, 99% valid. Page counts range from 0 to 3342, with a mean of 348 [7].
  • ratings_count: The total number of Goodreads ratings received by the book, 99% valid. Rating counts range from 0 to 5.63 million, with a mean of 21,100 [8].
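The column notes above imply a little parsing work: authors and categories are semicolon-separated, and several numeric fields have missing values. The following is a minimal sketch of how a row might be normalised, using only the Python standard library. The two sample rows are made up for illustration and are not taken from books.csv; the helper names (split_multi, to_number, is_valid_isbn13, parse_row) are hypothetical, and the tolerance for values such as "1999.0" is an assumption about how numbers may be written in the file.

```python
import csv
import io

# Two made-up rows for illustration only; they are not taken from books.csv.
SAMPLE = (
    "isbn13,isbn10,title,subtitle,authors,categories,thumbnail,"
    "description,published_year,average_rating,num_pages,ratings_count\n"
    "9780306406157,0306406152,Example Title,,Ann Author;Bob Writer,"
    "Fiction;Drama,,A short description.,1999,4.1,320,1500\n"
    "9780000000002,0000000000,Another Title,A Subtitle,Cy Scribe,,,,,,,\n"
)


def split_multi(value):
    """Split a semicolon-separated cell into a list, dropping empty items."""
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def to_number(value, cast):
    """Convert a CSV cell to a number; an empty cell becomes None."""
    value = (value or "").strip()
    return cast(value) if value else None


def to_int(value):
    """Parse integers, tolerating a float-style spelling such as '1999.0'."""
    return int(float(value))


def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(isbn))
    return total % 10 == 0


def parse_row(row):
    """Return a cleaned copy of one csv.DictReader row."""
    return {
        "isbn13": row["isbn13"],
        "title": row["title"],
        "subtitle": row["subtitle"] or None,
        "authors": split_multi(row["authors"]),
        "categories": split_multi(row["categories"]),
        "published_year": to_number(row["published_year"], to_int),
        "average_rating": to_number(row["average_rating"], float),
        "num_pages": to_number(row["num_pages"], to_int),
        "ratings_count": to_number(row["ratings_count"], to_int),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Missing cells are kept as None or an empty list rather than filled in, so downstream analysis can decide how to treat them.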

Distribution

The dataset is distributed as a single CSV file, books.csv (4.14 MB), containing 6810 rows and 12 columns [3].
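A downloaded copy can be checked against the figures stated here. The sketch below assumes the file is UTF-8 encoded and that its header uses exactly the column names listed in the Columns section; neither is confirmed by the listing. The function name check_books_csv is hypothetical.

```python
import csv

# Column names as given in the Columns section of this listing.
EXPECTED_COLUMNS = [
    "isbn13", "isbn10", "title", "subtitle", "authors", "categories",
    "thumbnail", "description", "published_year", "average_rating",
    "num_pages", "ratings_count",
]


def check_books_csv(path, expected_rows=6810):
    """Compare a CSV file's header and row count with the listed figures."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return {
        "missing_columns": sorted(set(EXPECTED_COLUMNS) - set(header)),
        "unexpected_columns": sorted(set(header) - set(EXPECTED_COLUMNS)),
        "row_count": row_count,
        "rows_ok": row_count == expected_rows,
    }
```

Calling check_books_csv("books.csv") returns a small report; an empty missing_columns list and rows_ok set to True would indicate that the copy matches the description above.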

Usage

This dataset is ideal for:
  • Conducting exploratory data analysis on book attributes [2].
  • Clustering books based on their topics or categories [2].
  • Developing content-based recommendation engines by leveraging book descriptions and other textual fields [2].
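To make the last use case concrete, here is a minimal sketch of a content-based recommender that compares descriptions by word overlap (bag-of-words vectors with cosine similarity). It is an illustration under simple assumptions, not the method used in the article the dataset was built for: the toy titles and descriptions are made up, the function names are hypothetical, and a practical system would likely add TF-IDF weighting and stop-word removal.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(descriptions, query_title, top_n=3):
    """Rank other titles by description similarity to query_title."""
    vectors = {t: Counter(tokenize(d)) for t, d in descriptions.items()}
    query = vectors[query_title]
    scored = [
        (cosine(query, vec), title)
        for title, vec in vectors.items()
        if title != query_title
    ]
    # Highest similarity first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]


# Made-up toy data standing in for the title and description columns.
toy_books = {
    "Space Saga": "a crew explores deep space aboard a starship",
    "Star Voyage": "a starship crew travels through deep space",
    "Garden Guide": "how to grow vegetables in a small garden",
}
suggestions = recommend(toy_books, "Space Saga", top_n=1)
```

With the real file, the descriptions mapping would be built from the title and description columns, skipping rows whose description is missing.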

Coverage

The dataset covers books published between 1853 and 2019 [6]. The data originates from Goodreads (via Soumik's dataset) and the Google Books API. Titles are drawn broadly from these platforms, although many ISBNs from the original source did not return valid results from the API [2, 9].
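For readers who want to see what an ISBN lookup of this kind might look like, the sketch below queries the public Google Books volumes endpoint and maps a response onto some of this dataset's columns. It is not the dataset author's actual pipeline, which the listing does not describe. The function names are hypothetical, and the mapping from volumeInfo fields to column names is an assumption. A lookup that returns no items corresponds to the "no valid result" case mentioned above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_lookup_url(isbn):
    """Build a Google Books volumes query that searches by ISBN."""
    return API_URL + "?" + urllib.parse.urlencode({"q": f"isbn:{isbn}"})


def extract_fields(payload):
    """Map the first matching volume to dataset-style fields, or None."""
    items = payload.get("items") or []
    if not items:
        return None  # The API returned no valid result for this ISBN.
    info = items[0].get("volumeInfo", {})
    return {
        "title": info.get("title"),
        "subtitle": info.get("subtitle"),
        # Assumption: list fields are joined with semicolons, matching
        # the separator described for the authors and categories columns.
        "authors": ";".join(info.get("authors", [])),
        "categories": ";".join(info.get("categories", [])),
        "description": info.get("description"),
        "thumbnail": info.get("imageLinks", {}).get("thumbnail"),
    }


def fetch_volume(isbn, timeout=10):
    """Perform the HTTP request (needs network access) and parse the result."""
    with urllib.request.urlopen(build_lookup_url(isbn), timeout=timeout) as resp:
        return extract_fields(json.load(resp))
```

Keeping the URL builder and the response parser separate from the network call makes both easy to test offline with a saved response.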

License

CC0: Public Domain

Who Can Use It

  • Data scientists and analysts for academic research or practical application in data exploration and pattern recognition [2].
  • Machine learning engineers interested in building and testing recommendation systems for books [2].
  • Researchers studying literary trends, publication patterns, or reader behaviour over time [2].

Dataset Name Suggestions

  • 7k Books Dataset
  • Curated Books for Recommendations
  • Goodreads and Google Books Data
  • Book Recommendation Data
  • Digital Library Dataset

Attributes

Original Data Source: Goodreads and Google Books Data

Listing Stats

LISTED

03/08/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
