r/Books Community Discussions and Metadata

Reddit & Forum Data

Tags and Keywords

Reddit, Books, NLP, Sentiment, Literature

r/Books Community Discussions and Metadata Dataset on Opendatabay data marketplace

Free

About

This collection captures posts and comments from the r/Books community on Reddit, supporting analysis of literary discussions, user sentiment, and reading trends. Posts and comments share a single table, so interpreting the file depends on understanding the row structure: rows with a null 'title' are comments, while rows with a populated 'title' are posts. The 'post_id' column is the relational link that connects each comment to its parent post. With these two rules, researchers can reconstruct conversation threads and analyse how community responses relate to the original submissions.
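The row rules described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not an official loader: the sample rows are invented, only a subset of the columns is shown, and it assumes null values appear as empty fields in the CSV.

```python
import csv
import io
from collections import defaultdict


def split_posts_and_comments(rows):
    """Separate post rows (populated title) from comment rows (empty title)."""
    posts = {}
    comments = defaultdict(list)
    for row in rows:
        # Assumption: nulls are stored as empty fields, which csv reads as "".
        if row["title"]:
            posts[row["post_id"]] = row
        else:
            comments[row["post_id"]].append(row)
    return posts, comments


# Invented sample data using a subset of the documented columns.
SAMPLE = """post_id,comment_id,author,title,text
p1,,alice,Favourite read this year?,What did everyone enjoy?
p1,c1,bob,,I mostly read short stories.
p1,c2,carol,,Same here!
p2,,dave,Weekly thread,Share your current book.
"""

posts, comments = split_posts_and_comments(csv.DictReader(io.StringIO(SAMPLE)))
for post_id, post in posts.items():
    # Each thread is a post plus the comments that share its post_id.
    print(post["title"], "->", len(comments[post_id]), "comment(s)")
```

Grouping comments under their 'post_id' is enough to rebuild a flat thread; the listing does not describe a column linking replies to other comments, so nested reply trees are out of scope for this sketch.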

Columns

  • register_index: Unique identifier for each entry in the file.
  • post_id: Identification number for a post; links comments to their parent post.
  • comment_id: Identification number for a comment within a post (null for post rows).
  • author: Username or display name of the contributor.
  • datetime: Date and time when the post or comment was created.
  • title: Headline or title of the post (null for comment rows).
  • url: Hyperlink associated with the post.
  • score: Net rating of the post based on community upvotes and downvotes.
  • comments: Total count of comments on the post (null for comment rows).
  • text: The actual content body of the post or comment.
  • author_post_karma: The author's karma score specifically derived from their contributions to r/Books.
  • tag: Categorisation tag applied to the post (e.g., WeeklyThread).
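As a rough illustration of how these columns might be typed after reading the CSV, the sketch below converts a raw row into a small record. The `Entry` class and `parse_entry` helper are hypothetical names, only five of the twelve columns are modelled, and it assumes 'score' and 'comments' are stored as plain integers with empty fields standing in for nulls.

```python
from dataclasses import dataclass
from typing import Optional


def _to_int(value: str) -> Optional[int]:
    """Convert a CSV field to int, treating an empty field as null."""
    return int(value) if value.strip() else None


@dataclass
class Entry:
    post_id: str
    comment_id: Optional[str]  # None for post rows
    title: Optional[str]       # None for comment rows
    score: Optional[int]
    comments: Optional[int]    # None for comment rows

    @property
    def is_post(self) -> bool:
        return self.title is not None


def parse_entry(row: dict) -> Entry:
    """Build a typed Entry from a row of raw strings (hypothetical helper)."""
    return Entry(
        post_id=row["post_id"],
        comment_id=row["comment_id"] or None,
        title=row["title"] or None,
        score=_to_int(row["score"]),
        comments=_to_int(row["comments"]),
    )


# Invented rows for demonstration only.
post = parse_entry({"post_id": "p1", "comment_id": "", "title": "Hello",
                    "score": "12", "comments": "3"})
reply = parse_entry({"post_id": "p1", "comment_id": "c1", "title": "",
                     "score": "-2", "comments": ""})
print(post.is_post, reply.is_post)
```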

Distribution

The dataset is provided as a CSV file of approximately 332.47 MB. It contains roughly 993,000 unique records across 12 columns. Posts and comments are stored in a single table and are distinguished by which fields are populated: 'title' and 'comments' are null for comment rows, and 'comment_id' is null for post rows.
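Because the file is a few hundred megabytes, it can be convenient to stream it row by row rather than load it all at once. The sketch below checks the header against the 12 documented column names and counts posts and comments in a single pass. It assumes the header uses exactly these names and that nulls are empty fields; the demo runs on a tiny in-memory file built for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "register_index", "post_id", "comment_id", "author", "datetime", "title",
    "url", "score", "comments", "text", "author_post_karma", "tag",
]


def summarise(handle):
    """Stream a CSV handle once; return (missing_columns, n_posts, n_comments)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    n_posts = n_comments = 0
    if not missing:
        for row in reader:
            # A populated title marks a post; an empty one marks a comment.
            if row["title"]:
                n_posts += 1
            else:
                n_comments += 1
    return missing, n_posts, n_comments


# Build a two-row in-memory file; unspecified fields are written as empty.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
writer.writeheader()
writer.writerow({"register_index": "0", "post_id": "p1", "title": "A post"})
writer.writerow({"register_index": "1", "post_id": "p1", "comment_id": "c1"})
buffer.seek(0)
print(summarise(buffer))
```

For the real download, the same function could be given a handle opened with `newline=""` and an explicit encoding, so that line breaks inside quoted text fields are parsed correctly; the file name and encoding are not stated in the listing.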

Usage

  • Natural Language Processing (NLP): Training models for text classification, topic modelling, and sentiment analysis.
  • Social Network Analysis: Mapping interaction patterns between authors and commenters to understand community dynamics.
  • Literary Trend Analysis: Tracking the popularity of specific book titles or genres over time.
  • User Behaviour Modelling: Analysing activity patterns, such as posting times and engagement rates (scores/comments).
  • Text Pre-processing: Serving as a rich corpus for cleaning and tokenisation exercises.
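As one small example of the activity-pattern use case, the sketch below counts entries per hour of day. The listing does not state the timestamp format, so this assumes ISO-style strings such as "2023-05-26 09:15:00" that `datetime.fromisoformat` can parse; the sample rows are invented.

```python
from collections import Counter
from datetime import datetime


def activity_by_hour(rows):
    """Count entries per hour of day (0-23) using the 'datetime' column."""
    counts = Counter()
    for row in rows:
        # Assumption: timestamps are ISO-formatted strings.
        counts[datetime.fromisoformat(row["datetime"]).hour] += 1
    return counts


# Invented rows for demonstration only.
sample_rows = [
    {"datetime": "2023-05-26 09:15:00"},
    {"datetime": "2023-05-26 09:47:30"},
    {"datetime": "2024-10-04 21:05:10"},
]
hourly = activity_by_hour(sample_rows)
for hour in sorted(hourly):
    print(f"{hour:02d}:00", hourly[hour])
```

If the real timestamps use another format, `datetime.strptime` with a matching format string would be the natural substitute.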

Coverage

  • Temporal Range: The data spans from 26 May 2023 to 04 October 2024.
  • Geographic Scope: Global (Internet-based community).
  • Demographic Scope: Users of the r/Books subreddit.
  • Content Availability: Includes titles, body text, and metadata for both initial posts and subsequent user comments.

License

CC0: Public Domain

Who Can Use It

  • Data Scientists: For developing and testing machine learning models on unstructured text.
  • Linguists: For studying internet slang, literary discourse, and language evolution.
  • Marketing Analysts: For gauging public sentiment towards specific authors or publications.
  • Sociologists: For researching online community behaviours and social interactions.
  • Students/Beginners: For practising data cleaning, exploration, and visualisation techniques.

Dataset Name Suggestions

  • r/Books Community Discussions and Metadata
  • Reddit Literature Corpus: Posts & Comments 2023-2024
  • r/Books Interaction Log and Text Data
  • The Literary Reddit Archive

Attributes

Listing Stats

  • Views: 4
  • Downloads: 0
  • Listed: 07/12/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
