Open Source Project Metrics Snapshot
Data Science and Analytics
About
This dataset is a snapshot of the GitHub ecosystem, capturing metadata for more than 40,000 repositories as of 2022. The data supports the study of trends in programming languages, project popularity, repository management, and the wider open-source landscape. It includes two files: one listing repository topics, and a larger one detailing the metrics and attributes of the repositories themselves.
Columns
The main repository file contains 22 columns. Key columns include:
- topic: The related topic area on GitHub.
- name: The name assigned to the repository.
- owner: The user or entity that owns the repository.
- owner_type: Indicates whether the owner is a "User" or an "Organization".
- full_name: The repository identifier in the format {owner}/{name}.
- description: A short explanation of the repository's purpose.
- license: The associated software license (e.g., "mit", "mpl-2.0").
- size: The repository file size, measured in bytes.
- language: The primary programming language used.
- tags: A list of tags applied to the repository by the user.
- open_issues: The count of unresolved issues.
- forks: The number of times the repository has been forked.
- stars: The total number of users who have starred the repository.
- watchers: The number of users observing the repository.
- is_archived: A boolean flag indicating if the repository is unmaintained.
- is_forked: A boolean flag indicating if the repository originated as a fork.
- created_at, updated_at: Date and time stamps identifying creation and last update, in ISO format.
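Because every CSV value arrives as a string, the boolean, numeric, and timestamp columns listed above usually need converting before analysis. The following is a minimal sketch using only the Python standard library. It assumes booleans are serialized as "True"/"False", timestamps are ISO 8601 with an optional trailing "Z", and the numeric columns have no empty cells; none of this is confirmed by the listing, so check it against the actual file. The sample row is made up for illustration.

```python
import csv
import io
from datetime import datetime

BOOL_COLUMNS = ("is_archived", "is_forked")
INT_COLUMNS = ("size", "open_issues", "forks", "stars", "watchers")
DATE_COLUMNS = ("created_at", "updated_at")


def parse_row(row):
    """Convert one CSV row (all strings) into typed Python values."""
    parsed = dict(row)
    for col in BOOL_COLUMNS:
        # Assumption: booleans are serialized as "True"/"False" (any case).
        parsed[col] = row[col].strip().lower() == "true"
    for col in INT_COLUMNS:
        # Assumption: numeric cells are never empty.
        parsed[col] = int(row[col])
    for col in DATE_COLUMNS:
        # Assumption: ISO 8601 with an optional trailing "Z" for UTC.
        parsed[col] = datetime.fromisoformat(row[col].replace("Z", "+00:00"))
    return parsed


def load_repositories(handle):
    """Read typed rows from an open CSV file or any file-like object."""
    return [parse_row(row) for row in csv.DictReader(handle)]


# Made-up sample row for illustration only; not taken from the dataset.
SAMPLE = io.StringIO(
    "name,owner,size,open_issues,forks,stars,watchers,"
    "is_archived,is_forked,created_at,updated_at\n"
    "demo,example-org,1024,3,10,250,250,False,True,"
    "2019-05-01T12:00:00Z,2022-08-15T09:30:00Z\n"
)
repos = load_repositories(SAMPLE)
print(repos[0]["stars"], repos[0]["is_forked"], repos[0]["created_at"].year)
```

To run it against the real file, pass an open handle instead of the sample, for example `open("repositories.csv", newline="", encoding="utf-8")`.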
Distribution
The data files are typically delivered in CSV format. The repositories.csv file is approximately 14.73 MB and includes records for over 40,000 repositories. The dataset contains both string/text data and numerical metrics, as well as boolean indicators for various features (e.g., has_wiki, has_pages). Updates to this resource are expected annually.
Usage
This resource is ideally suited for:
- Analysing software development metrics and trends.
- Machine learning projects focused on predicting repository popularity (stars, forks).
- Studying the adoption rates of different programming languages over time.
- Researching the characteristics of successful or archived open-source projects.
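As one illustration of the language-adoption use case, the sketch below counts repositories per creation year and primary language. It assumes rows are dictionaries with string values, as produced by `csv.DictReader`, and that `created_at` begins with a four-digit year. The helper name `languages_by_year` and the sample rows are hypothetical. Since the data is a single 2022 snapshot, creation-year counts are only a rough proxy for adoption over time.

```python
from collections import Counter


def languages_by_year(rows):
    """Count repositories per (creation year, primary language)."""
    counts = Counter()
    for row in rows:
        language = row["language"] or "Unknown"  # empty cell -> "Unknown"
        year = int(row["created_at"][:4])  # ISO timestamps start with YYYY
        counts[(year, language)] += 1
    return counts


# Made-up rows for illustration only; not taken from the dataset.
sample_rows = [
    {"language": "Python", "created_at": "2015-03-02T10:00:00Z"},
    {"language": "Python", "created_at": "2015-11-20T08:15:00Z"},
    {"language": "Rust", "created_at": "2019-06-01T00:00:00Z"},
    {"language": "", "created_at": "2019-07-04T12:30:00Z"},
]
counts = languages_by_year(sample_rows)
for (year, language), n in sorted(counts.items()):
    print(year, language, n)
```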
Coverage
The scope is focused entirely on GitHub topics and repository data scraped in 2022. The temporal coverage of the included projects varies, with created_at and updated_at fields detailing the lifecycle of individual repositories. The dataset covers general computer science, internet-related, and categorical text domains.
License
CC0: Public Domain
Who Can Use It
- Data Scientists: For training models that forecast project success or measure community engagement.
- Academic Researchers: To perform quantitative studies on open-source ecosystems and developer behaviour.
- Technology Analysts: For benchmarking industry adoption of programming languages and specific open-source tools.
- Developers: To identify high-activity repositories related to specific topics or technologies.
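For the developer use case, a simple filter and sort is often enough. The sketch below returns the most-starred, non-archived repositories for one topic. It uses star count as a stand-in for activity, which is an assumption rather than something the dataset defines, and it expects rows whose `stars` and `is_archived` values have already been converted to an integer and a boolean. The helper name `top_repositories` and the sample rows are hypothetical.

```python
def top_repositories(rows, topic, limit=5):
    """Return the most-starred, non-archived repositories for one topic."""
    matches = [
        row for row in rows
        if row["topic"].lower() == topic.lower() and not row["is_archived"]
    ]
    return sorted(matches, key=lambda row: row["stars"], reverse=True)[:limit]


# Made-up rows for illustration only; not taken from the dataset.
rows = [
    {"full_name": "example-org/alpha", "topic": "machine-learning",
     "stars": 120, "is_archived": False},
    {"full_name": "example-user/beta", "topic": "machine-learning",
     "stars": 900, "is_archived": False},
    {"full_name": "example-org/gamma", "topic": "machine-learning",
     "stars": 5000, "is_archived": True},
    {"full_name": "example-user/delta", "topic": "Docker",
     "stars": 300, "is_archived": False},
]
top = top_repositories(rows, "machine-learning", limit=2)
names = [row["full_name"] for row in top]
print(names)
```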
Dataset Name Suggestions
GitHub Topics and Repositories 2022
Open Source Project Metrics Snapshot
GitHub Repository Metadata
Ecosystem of GitHub Projects
Attributes
Original Data Source: Open Source Project Metrics Snapshot