
Higher Education Social Media Analytics

Education & Learning Analytics

Tags and Keywords

Education, Online, Communities, Universities, Colleges, Text, NLP, United States


Free

About

This dataset captures insights into college life by collecting data from the subreddits associated with the 10 leading American colleges, as ranked by Forbes in 2019 [1]. It contains posts and comments made on these Reddit college boards up to 21 February 2022 [1]. The data was collected using SocialGrep and includes a sentiment score for each comment [1, 2]. It offers a view of student communities and their discussions that supports textual analysis and the study of online interactions [1]. To preserve user anonymity and prevent targeted harassment, usernames have been excluded from the data [1].

Columns

  • type: Denotes the type of the data point [2].
  • id: A unique Base-36 ID for each comment [2].
  • subreddit.id: A unique Base-36 ID for the subreddit associated with the comment [2].
  • subreddit.name: The human-readable name of the comment's subreddit [2]. For example, "upenn" accounts for roughly 27% of comments and "stanford" for roughly 20%, while all other subreddits are grouped as "Other" (roughly 54%) [3].
  • subreddit.nsfw: A boolean indicating whether the comment's subreddit is NSFW (Not Safe For Work). All 338,231 entries are marked as 'false' [2, 3].
  • created_utc: The timestamp indicating when the comment was created in UTC [2].
  • permalink: The direct link to the comment on Reddit [2].
  • body: The main text content of the comment. Some comments are labelled as '[deleted]' (9%) or '[removed]' (1%) [2, 3].
  • sentiment: The analysed sentiment score for the comment, ranging from -1.00 to 1.00 [2-5]. Scores are spread across the whole range, with a large cluster around 0.00 (54,278 comments) and sizeable counts in the most positive bins (0.92 to 0.96 with 12,206 comments, 0.96 to 1.00 with 14,204 comments) [4, 5].
  • score: The comment's score [2, 5, 6]. Scores range from -179 to 387. Most comments fall into two histogram bins: -9.20 to 2.12 (198,359 comments) and 2.12 to 13.44 (118,335 comments) [5, 6].
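
The column descriptions above can be turned into a small loading step. The following is a minimal sketch, not the publisher's own tooling. It assumes that the CSV header uses exactly the column names listed above, that created_utc holds Unix epoch seconds, and that sentiment may be blank for placeholder comments. The two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from datetime import datetime, timezone

# Placeholder bodies described in the "body" column notes.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def clean_rows(csv_file):
    """Yield parsed comment rows, skipping '[deleted]' and '[removed]' bodies."""
    for row in csv.DictReader(csv_file):
        if row["body"].strip() in PLACEHOLDERS:
            continue
        yield {
            "id": row["id"],
            # Base-36 IDs use digits 0-9 and letters a-z; int() decodes them.
            "id_int": int(row["id"], 36),
            "subreddit": row["subreddit.name"],
            # Assumption: created_utc is stored as Unix epoch seconds.
            "created": datetime.fromtimestamp(int(row["created_utc"]), tz=timezone.utc),
            "body": row["body"],
            # Assumption: sentiment can be empty, so fall back to None.
            "sentiment": float(row["sentiment"]) if row["sentiment"] else None,
            "score": int(row["score"]),
        }


# Hypothetical sample rows (invented values, illustrative only).
SAMPLE = """type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,permalink,body,sentiment,score
comment,abc123,2qh1a,upenn,false,1645401600,https://example.invalid/1,Great advice,0.62,3
comment,abc124,2qh1a,upenn,false,1645401700,https://example.invalid/2,[deleted],,1
"""

rows = list(clean_rows(io.StringIO(SAMPLE)))
```

With a real download, the same function could be passed an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the extracted CSV.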

Distribution

The dataset is structured as a table containing 338,231 comments [2, 3]. Data files are typically provided in CSV format [7]. The dataset includes a variety of subreddit names, with 'upenn' and 'stanford' being notable examples [3]. The 'body' column indicates that a percentage of comments are either '[deleted]' or '[removed]' [3]. Sentiment scores are distributed across a full spectrum from -1.00 to 1.00, while comment scores range from -179 to 387 [3-6].
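
The bin boundaries quoted for sentiment and score are consistent with splitting each range into 50 equal-width bins: (387 - (-179)) / 50 = 11.32 for score, and (1.00 - (-1.00)) / 50 = 0.04 for sentiment. The listing does not state the bin count, so treat 50 as an inference from the quoted edges. A short sketch of that arithmetic, with hypothetical helper names:

```python
def bin_edges(lo, hi, n_bins=50):
    """Return the n_bins + 1 edges of equal-width bins covering [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value, lo, hi, n_bins=50):
    """Return the zero-based bin that value falls into; hi goes in the last bin."""
    if not lo <= value <= hi:
        raise ValueError("value outside the histogram range")
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)


# Assumption: 50 bins, inferred from the bin edges quoted in the listing.
score_edges = bin_edges(-179, 387)
sentiment_edges = bin_edges(-1.0, 1.0)

# Bins 15 and 16 of the score histogram span -9.20 to 2.12 and 2.12 to 13.44.
print([round(edge, 2) for edge in score_edges[15:18]])
```

Under this reading, a comment with a score of 0 lands in the -9.20 to 2.12 bin, which matches the listing's statement that this bin holds the largest share of comments.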

Usage

This dataset is ideal for:
  • Education and Learning Analytics: Understanding student discussions, trends, and campus life as reflected on Reddit [1].
  • Natural Language Processing (NLP) Research: Analysing textual data, sentiment analysis models, and topic modelling related to higher education [1].
  • Social Media Trend Analysis: Identifying popular topics, concerns, and overall sentiment within university communities [1].
  • Comparative Studies: Comparing online behaviour and sentiment across different top-tier universities [1].
  • Academic Research: Exploring various facets of online student communities and their digital interactions [1].
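
As an illustration of the comparative use case, the sketch below averages the sentiment column per subreddit. It is one simple way to compare communities, not a method prescribed by the dataset. The function name and the sample rows are hypothetical, and rows with an empty sentiment value are assumed to be skippable.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_subreddit(rows):
    """Return {subreddit name: mean sentiment}, ignoring rows without a score."""
    grouped = defaultdict(list)
    for row in rows:
        value = row.get("sentiment")
        if value in (None, ""):
            continue  # Assumption: missing sentiment is stored as an empty field.
        grouped[row["subreddit.name"]].append(float(value))
    return {name: mean(values) for name, values in grouped.items()}


# Hypothetical rows in the shape csv.DictReader would produce (invented values).
sample = [
    {"subreddit.name": "upenn", "sentiment": "0.5"},
    {"subreddit.name": "upenn", "sentiment": "-0.5"},
    {"subreddit.name": "stanford", "sentiment": "0.25"},
    {"subreddit.name": "stanford", "sentiment": ""},
]

result = mean_sentiment_by_subreddit(sample)
```

A plain mean hides how polarised a community is, so a fuller comparison would also look at the spread of scores and the number of comments behind each average.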

Coverage

  • Geographic Scope: Focused on American colleges, specifically the 10 best US colleges according to the 2019 Forbes list [1].
  • Time Range: Data collected up to 21 February 2022 [1].
  • Demographic Scope: Captures interactions and discussions within the online student communities of the specified universities [1].
  • Data Availability Notes: Usernames are excluded to preserve anonymity [1].

License

CC-BY

Who Can Use It

  • Data Scientists and Researchers: For building and testing NLP models, conducting sentiment analysis, and exploring social dynamics in online communities [1].
  • Academics and Educators: To gain insights into student life, academic discussions, and common sentiments among university students [1].
  • Market Researchers and Strategists: To understand the online footprint and perception of different universities and student demographics.
  • Developers: For creating applications that leverage social media text data for insights.

Dataset Name Suggestions

  • US University Reddit Comments
  • College Community Sentiment Data
  • American Campus Reddit Discourse
  • Higher Education Social Media Analytics

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 21/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
