Opendatabay APP

Open Source Project Metrics Snapshot

Data Science and Analytics

Tags and Keywords

GitHub

Repository

Topics

Metrics

Open-source

Free

About

This dataset is a snapshot of the GitHub ecosystem, capturing metadata for more than 40,000 repositories as of 2022. It supports the study of trends in programming languages, project popularity, repository management, and the wider open-source landscape. It includes two files: one listing repository topics and a larger file describing the metrics and attributes of the repositories themselves.

Columns

The main repository file contains 22 columns. Key columns include:
  • topic: The related topic area on GitHub.
  • name: The name assigned to the repository.
  • owner: The user or entity that owns the repository.
  • owner_type: Indicates whether the owner is a "User" or an "Organization".
  • full_name: The repository identifier in the format {owner}/{name}.
  • description: A short explanation of the repository's purpose.
  • license: The associated software license (e.g., "mit", "mpl-2.0").
  • size: The repository file size, measured in bytes.
  • language: The primary programming language used.
  • tags: A list of tags applied to the repository by the user.
  • open_issues: The count of unresolved issues.
  • forks: The number of times the repository has been forked.
  • stars: The total number of users who have starred the repository.
  • watchers: The number of users observing the repository.
  • is_archived: A boolean flag indicating whether the repository has been archived (read-only and no longer actively maintained).
  • is_forked: A boolean flag indicating if the repository originated as a fork.
  • created_at, updated_at: Date and time stamps identifying creation and last update, in ISO format.
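As a minimal sketch of how these columns might be read into typed values, the example below uses only the Python standard library. The sample rows are invented for illustration, and two details are assumptions rather than facts from the listing: that booleans are serialised as "True"/"False" and that timestamps may end in "Z". The tags column is left out because its serialisation is not described.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows; the values are invented for illustration only.
SAMPLE_CSV = """name,owner,owner_type,full_name,license,language,open_issues,forks,stars,watchers,is_archived,is_forked,created_at,updated_at
demo-lib,alice,User,alice/demo-lib,mit,Python,3,12,150,150,False,False,2019-05-01T10:00:00Z,2022-08-15T09:30:00Z
old-tool,acme,Organization,acme/old-tool,mpl-2.0,Go,0,4,40,40,True,False,2016-01-20T08:00:00Z,2018-03-02T17:45:00Z
"""

INT_COLUMNS = ("open_issues", "forks", "stars", "watchers")
BOOL_COLUMNS = ("is_archived", "is_forked")
DATE_COLUMNS = ("created_at", "updated_at")


def parse_iso(value):
    """Parse an ISO timestamp; a trailing 'Z' is rewritten for Python < 3.11."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def load_repositories(handle):
    """Read repository rows and convert metric, flag, and date columns."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        for col in BOOL_COLUMNS:
            # Assumption: booleans appear as the text "True" or "False".
            row[col] = row[col].strip().lower() == "true"
        for col in DATE_COLUMNS:
            row[col] = parse_iso(row[col])
        rows.append(row)
    return rows


repos = load_repositories(io.StringIO(SAMPLE_CSV))
for repo in repos:
    print(repo["full_name"], repo["stars"], repo["is_archived"], repo["created_at"].year)
```

To run this against the real file, replace the in-memory sample with an open file handle for repositories.csv and adjust the column tuples if the actual header differs.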

Distribution

The data files are typically delivered in CSV format. The repositories.csv file is approximately 14.73 MB and includes records for over 40,000 repositories. The dataset contains both string/text data and numerical metrics, as well as boolean indicators for various features (e.g., has_wiki, has_pages). Updates to this resource are expected annually.
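Because the file mixes text, numeric, and boolean columns, a quick profiling pass can help before analysis. The sketch below labels each column by inspecting its values. The sample data is hypothetical (only the column names has_wiki and has_pages come from the description above), and the "True"/"False" encoding of booleans is an assumption.

```python
import csv
import io

# Hypothetical sample; the values are invented for illustration only.
SAMPLE_CSV = """name,language,stars,has_wiki,has_pages
demo-lib,Python,150,True,False
old-tool,Go,40,False,False
"""


def classify(values):
    """Label a column 'boolean', 'numeric', or 'text' from its non-empty values."""
    present = [v.strip() for v in values if v.strip()]
    if not present:
        return "text"
    if all(v.lower() in ("true", "false") for v in present):
        return "boolean"
    try:
        for v in present:
            float(v)
    except ValueError:
        return "text"
    return "numeric"


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
column_kinds = {
    name: classify([row[name] for row in rows]) for name in reader.fieldnames
}
print(column_kinds)
```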

Usage

This resource is ideally suited for:
  • Analysing software development metrics and trends.
  • Machine learning projects focused on predicting repository popularity (stars, forks).
  • Studying the adoption rates of different programming languages over time.
  • Researching the characteristics of successful or archived open-source projects.
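One of the uses above, language adoption over time, can be approximated by counting repositories per creation year and language. The following is a rough sketch on invented sample rows, not a definitive method: in a single 2022 snapshot, creation year is only a proxy for adoption, since deleted or renamed repositories are not represented.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the values are invented for illustration only.
SAMPLE_CSV = """full_name,language,stars,created_at
alice/demo-lib,Python,150,2019-05-01T10:00:00Z
bob/cli-kit,Go,75,2019-11-12T14:20:00Z
carol/ml-utils,Python,900,2020-02-03T07:15:00Z
acme/old-tool,Go,40,2016-01-20T08:00:00Z
"""


def language_counts_by_year(handle):
    """Count repositories per (creation year, primary language)."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # ISO timestamps begin with the four-digit year.
        year = int(row["created_at"][:4])
        counts[(year, row["language"])] += 1
    return counts


counts = language_counts_by_year(io.StringIO(SAMPLE_CSV))
for (year, language), n in sorted(counts.items()):
    print(year, language, n)
```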

Coverage

The scope is limited to GitHub topics and repository data scraped in 2022. Temporal coverage varies by project: the created_at and updated_at fields record when each repository was created and last updated. The subject matter spans general computer science and internet-related domains, and much of the data is categorical text.
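The lifecycle of a repository can be estimated from its two timestamps. The sketch below computes the number of whole days between created_at and updated_at. The timestamp values are invented, and the trailing "Z" format is an assumption about how the ISO dates are written.

```python
from datetime import datetime


def parse_iso(value):
    """Parse an ISO timestamp; a trailing 'Z' is rewritten for Python < 3.11."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def active_days(created_at, updated_at):
    """Whole days between a repository's creation and its last update."""
    return (parse_iso(updated_at) - parse_iso(created_at)).days


# Hypothetical timestamps for a single repository.
days = active_days("2016-01-20T08:00:00Z", "2018-03-02T17:45:00Z")
print(days)
```

Note that updated_at reflects the last recorded update as of the 2022 scrape, so this value is a lower bound on the lifespan of projects that were still active at that time.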

License

CC0: Public Domain

Who Can Use It

  • Data Scientists: For training models that forecast project success or measure community engagement.
  • Academic Researchers: To perform quantitative studies on open-source ecosystems and developer behaviour.
  • Technology Analysts: For benchmarking industry adoption of programming languages and specific open-source tools.
  • Developers: To identify high-activity repositories related to specific topics or technologies.
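For the developer use case, a simple way to shortlist repositories for a topic is to filter by the topic column and rank by stars. This is a sketch under stated assumptions: "high-activity" is approximated here as non-archived repositories with the most stars, the sample rows are invented, and top_repositories is a hypothetical helper rather than part of any published tooling.

```python
import csv
import io

# Hypothetical sample rows; the values are invented for illustration only.
SAMPLE_CSV = """topic,full_name,stars,forks,is_archived
machine-learning,carol/ml-utils,900,120,False
machine-learning,dave/tiny-nn,45,3,False
machine-learning,acme/old-ml,2000,300,True
cli,bob/cli-kit,75,9,False
"""


def top_repositories(handle, topic, limit=10):
    """Return full names of non-archived repositories for a topic, most stars first."""
    matches = []
    for row in csv.DictReader(handle):
        if row["topic"] != topic:
            continue
        # Assumption: booleans appear as the text "True" or "False".
        if row["is_archived"].strip().lower() == "true":
            continue
        matches.append((int(row["stars"]), row["full_name"]))
    matches.sort(reverse=True)
    return [full_name for _, full_name in matches[:limit]]


top = top_repositories(io.StringIO(SAMPLE_CSV), "machine-learning")
print(top)
```

Ranking by stars is only one possible choice; forks, open_issues, or updated_at could be substituted depending on what "activity" should mean for a given analysis.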

Dataset Name Suggestions

  • GitHub Topics and Repositories 2022
  • Open Source Project Metrics Snapshot
  • GitHub Repository Metadata
  • Ecosystem of GitHub Projects

Attributes

Listing Stats

VIEWS

5

DOWNLOADS

0

LISTED

04/10/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0
