COVID-19 Academic Publications & Metadata
Free
About
This dataset is a large collection of research articles related to COVID-19. It provides metadata and abstracts for scientific papers and is designed to support work in Natural Language Processing (NLP) and data analysis. Researchers can utilise it to generate insights, track developments, and help the medical community understand the virus by mining the text at scale.
Columns
- sha: The SHA hash of the paper, serving as a unique identifier.
- source_x: The source from which the article was obtained (e.g., PMC, CZI).
- title: The official title of the research paper.
- doi: The Digital Object Identifier for the paper.
- pmcid: The PubMed Central ID associated with the paper.
- pubmed_id: The PubMed ID of the paper.
- license: The specific license under which the paper is published.
- abstract: A summary text of the research paper's contents.
- publish_time: The date the paper was published.
- authors: A list of the researchers who authored the paper.
- journal: The name of the journal in which the paper appeared.
- Microsoft Academic Paper ID: The identifier used by Microsoft Academic.
- WHO #Covidence: The WHO Covidence ID.
- has_full_text: A boolean indicator specifying whether the full text of the article is available.
Distribution
This dataset is distributed as a single CSV file of approximately 49.84 MB. It comprises 29,500 entries organised into 14 columns. Missing values occur in several columns: for instance, 'sha' is missing in approximately 41% of records and 'WHO #Covidence' in 96%. The file structure is flat, so it can be read by most data analysis tools.
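Because several columns are sparsely populated, it can help to measure the share of empty values per column before analysis. The following is a minimal sketch using only the Python standard library. The function name `missing_share` and the file name `metadata.csv` are assumptions for illustration, and the in-memory sample uses placeholder values rather than real records.

```python
import csv
import io
from collections import Counter


def missing_share(csv_file):
    """Return {column: fraction of rows with an empty value} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    missing = Counter()
    total = 0
    for row in reader:
        total += 1
        for col in columns:
            value = row.get(col)
            # Treat absent fields and whitespace-only strings as missing.
            if value is None or value.strip() == "":
                missing[col] += 1
    if total == 0:
        return {col: 0.0 for col in columns}
    return {col: missing[col] / total for col in columns}


# Toy stand-in for the real file (placeholder values, not actual records).
sample = io.StringIO(
    "sha,title,WHO #Covidence\n"
    "abc123,Paper A,\n"
    ",Paper B,\n"
    "def456,Paper C,#42\n"
    ",Paper D,\n"
)
shares = missing_share(sample)
for col, share in shares.items():
    print(f"{col}: {share:.0%} missing")

# With the downloaded file (hypothetical name), the call would look like:
# with open("metadata.csv", newline="", encoding="utf-8") as f:
#     shares = missing_share(f)
```

Reading row by row keeps memory use low, which is sufficient for a file of this size and avoids any third-party dependency.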
Usage
This data is ideal for researchers, data scientists, and analysts aiming to:
- Analyse trends and publication rates in COVID-19 research.
- Study the impact of the virus across various scientific fields.
- Perform Natural Language Processing (NLP) on paper abstracts and titles to extract semantic meaning.
- Investigate authorship networks and collaboration patterns within the medical research community.
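Two of the tasks above, tracking publication rates and extracting frequent terms from abstracts, can be sketched with the standard library alone. This is an illustrative starting point, not a full NLP pipeline. It assumes that `publish_time` contains a four-digit year somewhere in its value (the exact date format is not documented here), and the helper names, the stopword list, and the sample rows are all hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Assumption: publish_time holds a four-digit year such as "2020" or "2020-03-15".
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
WORD_RE = re.compile(r"[a-z]+(?:-[a-z0-9]+)*")
# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "from", "that", "this", "was", "were", "are"}


def papers_per_year(rows):
    """Count rows per publication year, skipping rows without a recognisable year."""
    counts = Counter()
    for row in rows:
        match = YEAR_RE.search(row.get("publish_time") or "")
        if match:
            counts[int(match.group(0))] += 1
    return counts


def top_terms(rows, n=5):
    """Return the n most frequent non-stopword terms across all abstracts."""
    counts = Counter()
    for row in rows:
        text = (row.get("abstract") or "").lower()
        counts.update(
            word for word in WORD_RE.findall(text)
            if len(word) > 2 and word not in STOPWORDS
        )
    return counts.most_common(n)


# Placeholder rows for demonstration only (not actual records).
sample = io.StringIO(
    "publish_time,abstract\n"
    "2020-03-15,Viral transmission in hospital settings\n"
    "2020-01-02,Transmission dynamics of viral infection\n"
    "2006,Vaccine response study\n"
    ",No date recorded\n"
)
rows = list(csv.DictReader(sample))
per_year = papers_per_year(rows)
terms = top_terms(rows, n=2)
print(sorted(per_year.items()))
print(terms)
```

Raw term counts are only a first pass. Stemming, phrase detection, or embedding-based methods would be natural next steps for extracting semantic meaning.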
Coverage
The dataset covers publications from 2006 to 2020, with most of the content concentrated around the COVID-19 outbreak period. It is global in scope, aggregating scientific literature from major sources such as PMC and CZI. Topics span Earth and Nature, Health Conditions, Science and Technology, and Public Health, giving a broad multidisciplinary perspective.
License
CC BY-SA 4.0
Who Can Use It
- Medical Researchers: To review historical and current findings on coronaviruses.
- Data Scientists: To train NLP models and improve information retrieval systems.
- Policy Analysts: To understand the velocity and focus of scientific output during the pandemic.
- Academic Institutions: For educational purposes in data mining and bibliometrics.
Dataset Name Suggestions
- CORD-19: COVID-19 Open Research Compendium
- Global Coronavirus Scientific Literature Metadata
- Pandemic Research & NLP Abstracts Dataset
- COVID-19 Academic Publications & Metadata
Attributes
Original Data Source: COVID-19 Academic Publications & Metadata
