Opendatabay APP

High-Star Python Repositories Data

Software and Technology

Tags and Keywords

Computer Science, Programming, Tabular, Data Visualization, NLP, Python, Clustering

Free

About

This dataset collects public GitHub repositories that are primarily written in Python and have 500 or more stars. Collected on 5th May 2022, it contains 9,031 unique repositories. It is intended for analysis: generating descriptive statistics, visualising data, applying Natural Language Processing (NLP) techniques to repository descriptions, clustering repositories by topic, and identifying valuable open-source projects.

Columns

  • full_name: The complete name of the repository.
  • repo_lang: The programming language or languages used in the repository; a value may combine several languages (for example, 'Python' together with 'Shell').
  • repo_topics: Keywords or categories assigned to the repository.
  • created_at: The date on which the repository was originally created.
  • description: A textual summary or outline of the repository's purpose and contents.
  • forks_count: The total number of times the repository has been forked.
  • open_issues_count: The current number of open issues associated with the repository.
  • repo_size: The size of the repository.
  • repo_stargazers_count: The total number of stars received by the repository.
  • repo_subscribers_count: The total number of subscribers to the repository.
  • repo_watchers_count: The total number of users watching the repository.
  • git_url: The Git URL for cloning or accessing the repository.
  • html_url: The HTML URL linking directly to the repository's page on GitHub.
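
As a minimal sketch of how these columns might be read, the following uses only Python's standard library to check that the expected columns are present and to convert the count columns to integers. The two sample rows, their values, and the list-style text shown for repo_lang and repo_topics are hypothetical stand-ins, not rows from the actual file, and the helper name load_repositories is introduced here for illustration only.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "full_name", "repo_lang", "repo_topics", "created_at", "description",
    "forks_count", "open_issues_count", "repo_size", "repo_stargazers_count",
    "repo_subscribers_count", "repo_watchers_count", "git_url", "html_url",
]
NUMERIC_COLUMNS = [
    "forks_count", "open_issues_count", "repo_size", "repo_stargazers_count",
    "repo_subscribers_count", "repo_watchers_count",
]


def load_repositories(handle):
    """Read rows from an open CSV file and convert the count columns to int."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        for column in NUMERIC_COLUMNS:
            row[column] = int(row[column])
        rows.append(row)
    return rows


# Hypothetical sample data; the real file's cell formats may differ.
SAMPLE_CSV = """\
full_name,repo_lang,repo_topics,created_at,description,forks_count,open_issues_count,repo_size,repo_stargazers_count,repo_subscribers_count,repo_watchers_count,git_url,html_url
example/alpha,"['Python']","['nlp', 'pytorch']",2019-03-01T00:00:00Z,An example NLP toolkit,120,15,2048,900,40,900,git://github.com/example/alpha.git,https://github.com/example/alpha
example/beta,"['Python', 'Shell']",[],2020-07-15T00:00:00Z,A command-line helper,30,2,512,650,12,650,git://github.com/example/beta.git,https://github.com/example/beta
"""

rows = load_repositories(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["full_name"], row["repo_stargazers_count"])
```

With the real CSV, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`, where the path is wherever the download was saved.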

Distribution

The dataset is typically provided as a CSV file and contains 9,031 unique repositories. The values in key columns are distributed as follows:
  • repo_lang: There are 9,004 unique language values. 'Python' alone accounts for 34% of entries, the combination of 'Python' and 'Shell' for 10%, and other combinations for the remaining 56%.
  • repo_topics: There are 8,995 unique topic values. Approximately 38% of entries have no topics specified, and 61% have other topics.
  • forks_count: Most repositories (8,954) have between 0 and 4,578 forks, with counts extending up to 45,784.
  • open_issues_count: The majority (8,959) have between 0 and 858 open issues, with some reaching up to 8,583.
  • repo_size: Most repositories (9,024) range from 0 to 10,488,024 units in size, with some reaching up to 104,880,243 units.
  • repo_stargazers_count: The majority (8,661) have between 500 and 7,798 stars, with counts extending up to 73,500.
  • repo_subscribers_count: Most repositories (8,770) have between 1 and 343 subscribers, with counts extending up to 3,425.
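
The ranges quoted above appear consistent with the first of ten equal-width histogram bins (for example, 4,578 is roughly one tenth of 45,784), although the listing does not say how they were computed. Under that assumption, a minimal sketch of the calculation looks like this; the function name and the sample fork counts are hypothetical.

```python
def first_bin_summary(values, bins=10):
    """Return (low, upper, count) for the first of `bins` equal-width bins.

    The first bin is treated as the half-open interval [low, upper).
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:
        # All values are identical, so every value sits in the single bin.
        return low, high, len(values)
    upper = low + width
    count = sum(1 for value in values if value < upper)
    return low, upper, count


# Hypothetical fork counts; only the maximum (45,784) comes from the listing.
sample_forks = [0, 12, 300, 4000, 5000, 45784]
low, upper, count = first_bin_summary(sample_forks)
print(f"{count} of {len(sample_forks)} values fall in [{low}, {upper:.0f})")
```

Applied to a full column, the count returned for the first bin could be compared against the figures in the list above to check whether the equal-width assumption holds.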

Usage

This dataset is ideal for:
  • Generating descriptive statistics: Analyse the characteristics and trends of Python-based GitHub repositories.
  • Data visualisation: Create visualisations to represent aspects such as repository popularity, growth, and activity.
  • Natural Language Processing (NLP): Extract insights and patterns from the 'description' field of repositories.
  • Clustering by topics: Group similar repositories based on their assigned topics, aiding in discovery and categorisation.
  • Finding hidden gems of open-source projects: Identify valuable projects that might not be widely known, based on various metrics.
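
As one possible starting point for grouping repositories by topic, the sketch below builds a topic-to-repositories index and computes Jaccard similarity between topic sets, which a clustering step could use as a distance measure. It assumes that each repo_topics cell holds a Python-style list literal such as "['nlp', 'pytorch']"; the listing does not document the cell format, so the parser may need adjusting. All function names and sample rows are hypothetical.

```python
import ast
from collections import defaultdict


def parse_topics(raw):
    """Parse one repo_topics cell, assuming a Python-style list literal."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # Empty or unparseable cells are treated as "no topics".
    if isinstance(value, (list, tuple)):
        return [str(topic) for topic in value]
    return []


def group_by_topic(rows):
    """Map each topic to the full_name of every repository that carries it."""
    index = defaultdict(list)
    for row in rows:
        for topic in parse_topics(row["repo_topics"]):
            index[topic].append(row["full_name"])
    return dict(index)


def jaccard(a, b):
    """Jaccard similarity of two topic sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical rows standing in for records from the dataset.
sample_rows = [
    {"full_name": "example/alpha", "repo_topics": "['nlp', 'pytorch']"},
    {"full_name": "example/beta", "repo_topics": "['nlp']"},
    {"full_name": "example/gamma", "repo_topics": "[]"},
]

topic_index = group_by_topic(sample_rows)
similarity = jaccard(
    parse_topics(sample_rows[0]["repo_topics"]),
    parse_topics(sample_rows[1]["repo_topics"]),
)
print(topic_index)
print(similarity)
```

Repositories with no topics, which the distribution notes put at roughly 38% of entries, simply do not appear in the index; they would need a different signal, such as the description text, to be grouped.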

Coverage

The dataset's geographic scope is global, covering Python repositories from across GitHub. The data was collected on 5th May 2022 and is a snapshot of repository metrics and activity on that date. No notes are provided on data availability for particular demographic groups or for time periods other than the collection date.

License

CC0

Who Can Use It

This dataset is suitable for a wide range of users, including:
  • Data analysts: To explore and summarise trends within the open-source software ecosystem.
  • Researchers: For academic studies on software development, community dynamics, or programming language popularity.
  • Machine learning engineers: To train models for tasks such as repository classification, recommendation systems, or sentiment analysis on project descriptions.
  • Developers: To discover relevant and popular open-source projects for collaboration, learning, or integration into their own work.

Dataset Name Suggestions

  • Python GitHub Repositories (500+ Stars)
  • Popular Python GitHub Projects
  • GitHub Starred Python Repositories
  • High-Star Python Repositories Data
  • Python Open-Source Projects (GitHub)

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0