COVID-19 Academic Publications & Metadata

Patient Health Records & Digital Health

Tags and Keywords

Coronavirus

NLP

Research

Metadata

Pandemic

COVID-19 Academic Publications & Metadata Dataset on Opendatabay data marketplace

Free

About

This dataset is a large collection of metadata and abstracts for research articles related to COVID-19. It is designed to support work in Natural Language Processing (NLP) and data analysis. Researchers can mine the text to generate insights, track developments in the literature, and help the medical community understand the virus.

Columns

  • sha: The SHA hash of the paper, serving as a unique identifier.
  • source_x: The source from which the article was obtained (e.g., PMC, CZI).
  • title: The official title of the research paper.
  • doi: The Digital Object Identifier for the paper.
  • pmcid: The PubMed Central ID associated with the paper.
  • pubmed_id: The PubMed ID of the paper.
  • license: The specific license under which the paper is published.
  • abstract: A summary text of the research paper's contents.
  • publish_time: The date the paper was published.
  • authors: A list of the researchers who authored the paper.
  • journal: The name of the journal in which the paper appeared.
  • Microsoft Academic Paper ID: The identifier used by Microsoft Academic.
  • WHO #Covidence: The WHO Covidence ID.
  • has_full_text: A boolean indicator specifying whether the full text of the article is available.

Distribution

This dataset is distributed as a single CSV file of approximately 49.84 MB. It comprises 29,500 entries organised into 14 columns. Missing values exist across several columns: for instance, 'sha' is missing in approximately 41% of records and 'WHO #Covidence' in 96%. The file structure is flat, so it is compatible with most data analysis tools.
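Because several columns are sparsely populated, it can help to measure the missing-value rate per column before any analysis. The following is a minimal sketch using only the Python standard library. The `SAMPLE_CSV` text and its rows are made up for illustration and the helper name `missing_fraction` is hypothetical; with the real download you would open the CSV file instead, and it is an assumption here that missing values appear as empty cells.

```python
import csv
import io

# Tiny in-memory stand-in for the real CSV; the rows are invented for illustration.
SAMPLE_CSV = """sha,title,abstract,WHO #Covidence
abc123,Paper A,Some abstract,
,Paper B,,
def456,Paper C,Another abstract,#42
,Paper D,Text,
"""


def missing_fraction(rows, columns):
    """Return {column: fraction of rows whose value is empty or absent}."""
    rows = list(rows)
    if not rows:
        return {c: 0.0 for c in columns}
    return {
        c: sum(1 for r in rows if not (r.get(c) or "").strip()) / len(rows)
        for c in columns
    }


# With the real file, replace io.StringIO(...) with
# open(path, newline="", encoding="utf-8"), where path is your local copy.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
columns = reader.fieldnames
fractions = missing_fraction(reader, columns)

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} missing")
```

On the real file, the figures for 'sha' and 'WHO #Covidence' should land near the 41% and 96% quoted above if empty cells are indeed how missing values are encoded.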

Usage

This data is ideal for researchers, data scientists, and analysts aiming to:
  • Analyse trends and publication rates in COVID-19 research.
  • Study the impact of the virus across various scientific fields.
  • Perform Natural Language Processing (NLP) on paper abstracts and titles to extract semantic meaning.
  • Investigate authorship networks and collaboration patterns within the medical research community.
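Two of the uses above, publication trends and simple text mining of abstracts, can be sketched with the standard library alone. The records below are invented stand-ins for CSV rows, and the helper names `papers_per_year` and `term_counts` are hypothetical. The sketch assumes that `publish_time` contains a four-digit year somewhere in the value; the exact date format in the file is not stated in this listing, so rows without a recognisable year are skipped.

```python
import re
from collections import Counter

# Made-up records standing in for rows of the CSV (only the fields used here).
records = [
    {"publish_time": "2020-03-12", "abstract": "Coronavirus transmission in hospitals."},
    {"publish_time": "2020", "abstract": "Modelling coronavirus spread."},
    {"publish_time": "2014-07-01", "abstract": "MERS coronavirus receptor binding."},
    {"publish_time": "", "abstract": ""},
]

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # first four-digit year in the value
WORD_RE = re.compile(r"[a-z]+")


def papers_per_year(rows):
    """Count rows per publication year, skipping rows with no recognisable year."""
    counts = Counter()
    for row in rows:
        match = YEAR_RE.search(row.get("publish_time") or "")
        if match:
            counts[int(match.group())] += 1
    return counts


def term_counts(rows, min_length=4):
    """Count lowercase alphabetic tokens of at least min_length in abstracts."""
    counts = Counter()
    for row in rows:
        words = WORD_RE.findall((row.get("abstract") or "").lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts


by_year = papers_per_year(records)
terms = term_counts(records)

for year in sorted(by_year):
    print(year, by_year[year])
print(terms.most_common(3))
```

The length threshold is a crude substitute for a stop-word list; a real NLP pipeline would normally use a proper tokeniser and vocabulary filtering.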

Coverage

The dataset covers publications from 2006 to 2020, with most of the content concentrated around the COVID-19 outbreak period. Its scope is global, aggregating scientific literature from major repositories such as PMC and CZI. Topics span Earth and Nature, Health Conditions, Science and Technology, and Public Health, giving a broad multidisciplinary perspective.

License

CC BY-SA 4.0

Who Can Use It

  • Medical Researchers: To review historical and current findings on coronaviruses.
  • Data Scientists: To train NLP models and improve information retrieval systems.
  • Policy Analysts: To understand the velocity and focus of scientific output during the pandemic.
  • Academic Institutions: For educational purposes in data mining and bibliometrics.

Dataset Name Suggestions

  • CORD-19: COVID-19 Open Research Compendium
  • Global Coronavirus Scientific Literature Metadata
  • Pandemic Research & NLP Abstracts Dataset
  • COVID-19 Academic Publications & Metadata

Attributes

Listing Stats

  • Views: 2
  • Downloads: 0
  • Listed: 04/12/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
