Global GitHub Repository Data
Free
About
This dataset provides details on over 215,000 popular GitHub repositories, each with more than 167 stars. For every repository it records the name, description, creation and update times, size, popularity metrics such as stars and forks, and boolean flags for features such as issue trackers, wikis, and discussions. The data was collected through the GitHub search API by iterating queries over star ranges so that every qualifying repository was captured. The collection is well suited to studying the trends and characteristics of widely used open-source projects.
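The star-range iteration described above can be sketched as follows. This is only an illustration, not the collector actually used: the GitHub search API returns at most 1,000 results per query, so a crawl has to split the star axis into windows small enough to stay under that cap. The function name, the window size, and the bounds below are assumptions chosen for the example.

```python
def star_range_queries(min_stars, max_stars, step):
    """Yield GitHub search qualifiers covering [min_stars, max_stars]
    in consecutive, non-overlapping windows of at most `step` stars."""
    low = min_stars
    while low <= max_stars:
        high = min(low + step - 1, max_stars)
        # "stars:LOW..HIGH" is the search qualifier for an inclusive range.
        yield f"stars:{low}..{high}"
        low = high + 1


# "More than 167 stars" means the first window starts at 168.
# The upper bound and step here are illustrative only.
queries = list(star_range_queries(168, 187, 10))
print(queries)  # ['stars:168..177', 'stars:178..187']
```

In a real crawl each qualifier would be sent as the `q` parameter of a repository search request, and any window that still reports more than 1,000 matches would need to be split further.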
Columns
- Name: The name of the GitHub repository.
- Description: A brief textual summary of the repository's purpose or focus.
- URL: The unique web address linking to the GitHub repository.
- Created At: The date and time the repository was first created on GitHub, in ISO 8601 format.
- Updated At: The date and time of the repository's most recent modification, in ISO 8601 format.
- Homepage: The URL to any associated homepage or landing page for the repository.
- Size: The total storage space used by the repository's files and data, in bytes.
- Stars: The count of stars or likes a repository has received from GitHub users, indicating its popularity.
- Forks: The number of times the repository has been forked by other GitHub users.
- Issues: The total count of open issues within the repository.
- Watchers: The number of GitHub users monitoring the repository for updates.
- Language: The primary programming language used in the repository.
- License: Information about the software license, typically using a license identifier.
- Topics: A list of tags or topics associated with the repository for discovery.
- Has Issues: A boolean value indicating if an issue tracker is enabled (true in this dataset).
- Has Projects: A boolean value indicating if GitHub Projects are used for task management.
- Has Downloads: A boolean value indicating if downloadable files or assets are offered.
- Has Wiki: A boolean value indicating if the repository has an associated wiki for documentation.
- Has Pages: A boolean value indicating if GitHub Pages are enabled for a website.
- Has Discussions: A boolean value indicating if GitHub Discussions are enabled for community interaction.
- Is Fork: A boolean value indicating if the repository is a fork of another (false in this dataset).
- Is Archived: A boolean value indicating if the repository is archived (read-only and no longer actively maintained).
- Is Template: A boolean value indicating if the repository is configured as a template.
- Default Branch: The name of the repository's default branch.
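Because a CSV reader returns every field as text, the columns above usually need converting before analysis. The sketch below shows one way to do that, under two assumptions that are not confirmed by this page: that the CSV header names match the column names listed here, and that booleans are written as "True"/"False". The sample row is made up for illustration and is not taken from the dataset.

```python
from datetime import datetime

# Assumed header names, taken from the column list above.
INT_COLUMNS = ("Size", "Stars", "Forks", "Issues", "Watchers")
BOOL_COLUMNS = (
    "Has Issues", "Has Projects", "Has Downloads", "Has Wiki", "Has Pages",
    "Has Discussions", "Is Fork", "Is Archived", "Is Template",
)
TIME_COLUMNS = ("Created At", "Updated At")


def parse_timestamp(text):
    # Older Python versions reject a trailing "Z" in fromisoformat,
    # so map it to an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def parse_row(row):
    """Return a copy of a CSV row (dict of strings) with typed values."""
    typed = dict(row)
    for col in INT_COLUMNS:
        typed[col] = int(row[col])
    for col in BOOL_COLUMNS:
        # Assumption: booleans are serialized as "True"/"False".
        typed[col] = row[col].strip().lower() == "true"
    for col in TIME_COLUMNS:
        typed[col] = parse_timestamp(row[col])
    return typed


# Made-up row for illustration only.
sample = {
    "Name": "example-repo",
    "Size": "1024", "Stars": "500", "Forks": "40",
    "Issues": "3", "Watchers": "500",
    "Created At": "2015-03-01T12:00:00Z",
    "Updated At": "2023-09-01T08:30:00Z",
    **{col: "True" for col in BOOL_COLUMNS},
    "Is Fork": "False",
}
typed = parse_row(sample)
print(typed["Stars"], typed["Is Fork"], typed["Created At"].year)
```

Text columns such as Description, Homepage, and License are left untouched, since their exact encoding in the file is not specified here.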
Distribution
The dataset is typically provided as a CSV file named repositories.csv, with a file size of 70.27 MB. It comprises 215,029 records and contains 24 distinct columns.
Usage
This dataset is ideal for:
- Exploratory data analysis: Delving into the characteristics and trends of GitHub repositories.
- Research: Studying software engineering practices, open-source project dynamics, and community engagement.
- Trend analysis: Identifying popular programming languages, project features, and development patterns.
- Benchmarking: Comparing repository metrics such as stars, forks, and issues across projects.
This dataset may not be used for spamming purposes, including selling GitHub users' personal information to recruiters, headhunters, or job boards.
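As a small example of the trend analysis mentioned above, the sketch below counts repositories per primary language using only the standard library. It assumes the CSV has a header column named "Language" and that repositories without a detected language have an empty value; both are assumptions about the file. The inline sample stands in for the real file and contains made-up rows.

```python
import csv
import io
from collections import Counter


def top_languages(csv_file, n=3):
    """Return the n most common primary languages as (language, count) pairs."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the language field is empty.
    counts = Counter(row["Language"] for row in reader if row["Language"])
    return counts.most_common(n)


# Made-up rows standing in for repositories.csv.
sample = io.StringIO(
    "Name,Language,Stars\n"
    "a,Python,300\n"
    "b,Go,250\n"
    "c,Python,900\n"
    "d,,180\n"
)
print(top_languages(sample))  # [('Python', 2), ('Go', 1)]
```

Against the real file, the same function could be called on a handle opened with `open("repositories.csv", newline="", encoding="utf-8")`.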
Coverage
The dataset covers a wide time range for repository creation, from 29 October 2007 to 24 September 2023. The last update timestamps for repositories span from 11 July 2019 to 26 September 2023. The scope is global, reflecting the worldwide reach of GitHub. No specific geographic or demographic details about the repository creators or contributors are provided.
License
CC0: Public Domain
Who Can Use It
- Data scientists and analysts: For researching open-source trends, performing statistical analysis on project popularity, and identifying key features of successful repositories.
- Software developers: To explore influential projects, understand best practices, or discover popular libraries and frameworks.
- Academics and researchers: For studies on software evolution, collaborative development, and the dynamics of online developer communities.
- Students: As a resource for learning about data analysis, programming languages, and version control systems.
Dataset Name Suggestions
- GitHub Repository Stars & Metrics
- Popular GitHub Projects Archive
- The GitHub Open Source Index
- Developer Project Insights
- Global GitHub Repository Data
Attributes
Original Data Source: Global GitHub Repository Data