Opendatabay APP

ChatGPT Social Media Insights Dataset

Social Media and Networking

Tags and Keywords

Online

Text

Social

NLP

Free

About

This dataset is a daily collection of tweets containing keywords such as "ChatGPT", "GPT3", or "GPT4". It was designed as a source of social media data for analysis, particularly Natural Language Processing (NLP) and sentiment analysis. Collection began on 3rd April 2023, with approximately 1,000 tweets added daily. Tweets were extracted 24-72 hours after creation so that engagement metrics such as likes and retweets had time to accumulate. Updates ceased on 13th May 2023 because changes to the Twitter (X) API conditions introduced a cost for its use. The tweets are in various languages and were selected randomly throughout the day, with basic filters applied to discard sensitive content and spam.

Columns

  • tweet_id: An integer serving as a unique identifier for each tweet. Older tweets typically have smaller IDs.
  • tweet_created: A timestamp indicating the exact time the tweet was published.
  • tweet_extracted: A UTC timestamp recording when the ETL (Extract, Transform, Load) pipeline pulled the tweet and its associated metadata (e.g., likes count, retweets count).
  • text: A string containing the raw text content of the tweet payload.
  • lang: A string providing the short name for the language of the tweet's text.
  • user_id: An integer representing the author's unique user ID on Twitter.
  • user_name: A string displaying the author's public name on Twitter.
  • user_username: A string showing the author's Twitter account username (e.g., @example).
  • user_location: A string detailing the author's publicly stated location.
  • user_description: A string containing the author's public profile biography.
  • user_created: A timestamp indicating when the user's Twitter account was created.
  • user_followers_count: An integer showing the number of followers the author's account had at the moment the tweet was extracted.
  • user_following_count: An integer indicating the number of accounts the author was following at the moment of tweet extraction.
  • user_tweet_count: An integer representing the total number of tweets the author had published at the time of tweet extraction.
  • user_verified: A boolean value (True/False) indicating if the user is verified (i.e., has a blue tick).
  • source: This column was intended to show the device or application used to publish the tweet, but it currently contains only NaN (missing) values.
  • retweet_count: An integer displaying the number of times the tweet had been retweeted at the moment of extraction.
  • like_count: An integer showing the number of likes the tweet had received at the moment of extraction.
  • reply_count: An integer indicating the number of reply messages to the tweet.
  • impression_count: An integer representing the number of times the tweet had been seen at the moment of extraction.
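
As a rough illustration of how these columns might be typed after reading the CSV, the sketch below converts a few fields and measures the gap between tweet_created and tweet_extracted. The two sample rows are hypothetical, only a subset of columns is shown, and the timestamp layout (ISO 8601 with a UTC offset) is an assumption that the listing does not confirm.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row sample with a subset of the documented columns.
# The timestamp layout is an assumption; adjust it if the real file differs.
SAMPLE_CSV = """tweet_id,tweet_created,tweet_extracted,lang,user_verified,source,like_count
1644543000000000001,2023-04-08 03:30:00+00:00,2023-04-09 15:30:00+00:00,en,False,,12
1644543000000000002,2023-04-08 04:00:00+00:00,2023-04-11 10:00:00+00:00,ja,True,,3
"""


def parse_row(row):
    """Convert the string fields of one CSV row to Python types."""
    created = datetime.fromisoformat(row["tweet_created"])
    extracted = datetime.fromisoformat(row["tweet_extracted"])
    return {
        "tweet_id": int(row["tweet_id"]),
        "tweet_created": created,
        "tweet_extracted": extracted,
        "lang": row["lang"],
        "user_verified": row["user_verified"] == "True",
        "like_count": int(row["like_count"]),
        # Hours between publication and extraction.
        "delay_hours": (extracted - created).total_seconds() / 3600,
        # The source column is skipped: the listing says it holds no values.
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for r in rows:
    # Flag rows whose delay falls inside the stated 24-72 hour window.
    in_window = 24 <= r["delay_hours"] <= 72
    print(r["tweet_id"], r["delay_hours"], in_window)
```

With the real file, io.StringIO(SAMPLE_CSV) would be replaced by an opened file handle. The window check is only a sanity test against the collection description.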

Distribution

The dataset is provided as a CSV file generated from a Pandas DataFrame. Each row contains a tweet's text and metadata, along with information about its author. Collection started on 3rd April 2023, adding approximately 1,000 tweets per day, and stopped on 13th May 2023. No total row count is available, but individual segments are substantial: for example, 43,000 tweets were collected between 22nd September 2022 and 12th May 2023, and daily additions of 1,000 to 7,000 tweets are noted for the period from 8th April 2023 to 14th May 2023. The dataset contains over 25,000 unique tweet IDs, over 37,000 unique user IDs, and over 38,000 unique user locations.
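
Because no total row count is published, it may be worth recomputing the figures above from the file itself. The sketch below counts rows per extraction day and distinct tweet IDs, user IDs, and locations. The sample rows are hypothetical, and the assumption that tweet_extracted begins with a YYYY-MM-DD date is not confirmed by the listing.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the last two rows are deliberate duplicates.
SAMPLE_CSV = """tweet_id,tweet_extracted,user_id,user_location
101,2023-04-08 10:00:00+00:00,1,Tokyo
102,2023-04-08 18:30:00+00:00,2,
103,2023-04-09 09:15:00+00:00,1,Tokyo
103,2023-04-09 09:15:00+00:00,1,Tokyo
"""


def summarise(csv_file):
    """Count rows per extraction day and distinct IDs and locations."""
    per_day = Counter()
    tweet_ids, user_ids, locations = set(), set(), set()
    for row in csv.DictReader(csv_file):
        # Assumes the timestamp starts with YYYY-MM-DD (first 10 characters).
        per_day[row["tweet_extracted"][:10]] += 1
        tweet_ids.add(row["tweet_id"])
        user_ids.add(row["user_id"])
        if row["user_location"]:  # an empty string is treated as missing
            locations.add(row["user_location"])
    return {
        "rows": sum(per_day.values()),
        "rows_per_day": dict(per_day),
        "unique_tweet_ids": len(tweet_ids),
        "unique_user_ids": len(user_ids),
        "unique_locations": len(locations),
    }


summary = summarise(io.StringIO(SAMPLE_CSV))
print(summary)
```

Comparing "rows" with "unique_tweet_ids" is a quick way to spot duplicated tweets before any further analysis.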

Usage

This dataset is ideal for various data analysis and visualisation applications. It is particularly well-suited for Natural Language Processing (NLP) techniques, including sentiment analysis, to understand public opinion and trends related to ChatGPT, GPT3, and GPT4. Researchers can use it for social media listening, trend tracking, and studying the evolution of discussions around large language models.
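
For trend tracking, one simple starting point is to count how many tweets mention each keyword per day. The sketch below is a minimal version of that idea. The tweets are hypothetical, and the regular expressions (which also accept spellings such as "GPT-4" or "Chat GPT") are an assumption about how the keywords appear in the text, not the filter used to build the dataset.

```python
import re
from collections import Counter, defaultdict

# Assumed spellings for each keyword; adjust to taste.
PATTERNS = {
    "ChatGPT": re.compile(r"chat\s?gpt", re.IGNORECASE),
    "GPT3": re.compile(r"gpt[\s-]?3", re.IGNORECASE),
    "GPT4": re.compile(r"gpt[\s-]?4", re.IGNORECASE),
}

# Hypothetical (day, text) pairs standing in for tweet_created and text.
tweets = [
    ("2023-04-08", "ChatGPT wrote my cover letter"),
    ("2023-04-08", "Is GPT-4 really better than GPT3?"),
    ("2023-04-09", "chatgpt vs gpt4: a thread"),
    ("2023-04-09", "Nothing to see here"),
]


def mentions_per_day(pairs):
    """Count, for each day, the tweets that mention each keyword."""
    by_day = defaultdict(Counter)
    for day, text in pairs:
        for keyword, pattern in PATTERNS.items():
            if pattern.search(text):
                by_day[day][keyword] += 1  # one count per tweet per keyword
    return by_day


mentions_by_day = mentions_per_day(tweets)
for day in sorted(mentions_by_day):
    print(day, dict(mentions_by_day[day]))
```

A sentiment model could then be applied to the tweets in each keyword group, but that step depends on the chosen library and is left out here.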

Coverage

The dataset primarily covers tweets from 3rd April 2023 to 13th May 2023, with some older tweets included, particularly from September 2022. Tweets are in any language and were randomly selected from around the world. English (en) tweets constitute approximately 48% of the dataset, Japanese (ja) tweets about 23%, and other languages about 30%. User locations vary widely: a significant portion (41%) are null, 1% are from Japan, and the remaining 59% are from various other global locations. These percentages are approximate, so each set does not sum to exactly 100%.
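
The language shares quoted above (48%, 23%, 30%) add up to 101%, which is consistent with each figure having been rounded independently, although the listing does not say how they were computed. The sketch below shows the effect with hypothetical counts for 1,000 tweets; these are not the real language counts.

```python
from collections import Counter


def rounded_shares(values):
    """Return each distinct value's share of the total as a whole percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: round(100 * n / total) for key, n in counts.items()}


# Hypothetical language codes for 1,000 tweets; not the real counts.
langs = ["en"] * 476 + ["ja"] * 226 + ["other"] * 298
shares = rounded_shares(langs)
print(shares, sum(shares.values()))
```

The same function can be applied to the lang column of the real file to recompute the shares, keeping one decimal place if exact totals matter.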

License

CC0

Who Can Use It

  • Data Analysts: For exploring social media trends and user engagement related to AI.
  • Researchers: Studying the public reception, discussion patterns, and sentiment around large language models.
  • Machine Learning Engineers: Developing and testing NLP models for sentiment analysis, topic modelling, or text classification.
  • Marketing Professionals: Gaining insights into public perception and brand mentions of AI technologies.
  • Students: For academic projects involving social media data analysis.

Dataset Name Suggestions

  • Daily GPT Tweets Collection
  • AI Language Model Tweets
  • ChatGPT Social Media Insights
  • GPT Public Discourse Archive
  • Twitter Data on Large Language Models

Attributes

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

17/06/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0
