Opendatabay APP

Biomedical COVID-19 Claims Dataset

Health Information Systems & Technology

Tags and Keywords

NLP

Tabular

Public

Coronavirus

Binary

Free

About

The COVID-19 Numerical Claims Open Research Dataset (CONCORD) is an open-source collection of numerical claims extracted from academic papers on COVID-19. It contains approximately 203,000 numerical claims drawn from over 57,000 scientific research articles published between January 2020 and May 2022. The claims were extracted from full-text articles using a white-box, weakly supervised model, with the CORD-19 repository serving as the raw dataset. Numerical entities make claims more credible and provide fine-grained, tangible information, which is particularly valuable in the biomedical domain.

Columns

The dataset features the following columns:
  • claim_uid: A unique identifier for each individual numerical claim.
  • cord_uid: An identifier for the research paper from which the claims were extracted, similar to those found in CORD-19.
  • title: A string field representing the title of the research paper.
  • doi: A string field for the Digital Object Identifier (DOI) of the paper.
  • numerical_claims: A string field containing the numerical claim sentence itself.
  • publish_time: A datetime field indicating the published date of the paper in yyyy-mm-dd format. Note that this field may not always be accurate, as some publishers denote unknown dates with future dates (e.g., yyyy-12-31).
  • authors: A string field listing the paper's authors, each in 'Last, First Middle' format, separated by semicolons.
  • journal: A string field for the journal in which the paper was published. Journal strings are not normalised (e.g., "BMJ" and "British Medical Journal" may both exist). This field can be empty if unknown.
  • country: A string field indicating the author's country. Country strings are not normalised (e.g., "USA" and "United States of America" may both exist). This field can be empty if unknown.
  • institution: A string field for the author's institute of affiliation. This field can be empty if unknown.
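The column notes above can be turned into a small parsing routine. The following is a minimal sketch using only the Python standard library. It assumes the CSV header matches the column names listed above; the sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io
from datetime import datetime

# Made-up sample row for illustration; the header is assumed to match the columns above.
SAMPLE_CSV = '''claim_uid,cord_uid,title,doi,numerical_claims,publish_time,authors,journal,country,institution
c1,abc123,Example title,10.0000/example,"Fever was reported in 45% of patients.",2020-12-31,"Doe, Jane A; Roe, Richard",BMJ,USA,
'''

def parse_row(row):
    """Convert the raw string fields of one CSV record into typed values."""
    published = datetime.strptime(row["publish_time"], "%Y-%m-%d").date()
    return {
        "claim_uid": row["claim_uid"],
        "cord_uid": row["cord_uid"],
        "claim": row["numerical_claims"],
        "published": published,
        # A 31 December date may be a publisher placeholder for an unknown date.
        "date_is_suspect": (published.month, published.day) == (12, 31),
        # Authors are semicolon-separated; each name itself contains a comma.
        "authors": [a.strip() for a in row["authors"].split(";") if a.strip()],
        # Empty strings mean "unknown" for these three fields.
        "journal": row["journal"] or None,
        "country": row["country"] or None,
        "institution": row["institution"] or None,
    }

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

Flagging year-end dates rather than dropping them keeps the record usable while marking its date as unreliable. Genuine 31 December publications are flagged too, so the flag is a hint, not a verdict.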

Distribution

This dataset is provided as a CSV file. It comprises approximately 203,000 unique numerical claims, one per record, in a tabular structure suitable for analytical processing.

Usage

This dataset is ideally suited for various applications, including:
  • Biomedical research and analysis.
  • Public health studies and insights into COVID-19.
  • Natural Language Processing (NLP) tasks, such as information extraction and entity recognition.
  • Binary classification problems within text analysis.
  • Developing and training AI and Large Language Models (LLMs) that require fine-grained numerical information from scientific literature.
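As an illustration of the information-extraction use case, the sketch below pulls numeric mentions out of a claim sentence with a regular expression. The pattern and the example sentence are assumptions made for illustration only. This is not the weakly supervised model used to build the dataset, and it does not handle negative numbers, ranges, or spelled-out numbers.

```python
import re

# Matches integers, thousands-separated integers, decimals, and an optional
# trailing percent sign. The lookbehind skips digits that are glued to a word
# or hyphen, so the "19" in "COVID-19" is not reported as a number.
NUMBER = re.compile(r"(?<![\w-])\d+(?:,\d{3})*(?:\.\d+)?%?")

def extract_numbers(claim):
    """Return the numeric mentions found in a claim sentence, in order."""
    return NUMBER.findall(claim)

# Made-up sentence in the style of the numerical_claims column.
example = "In COVID-19 patients, fever occurred in 45.5% of 1,200 cases."
found = extract_numbers(example)
```

For this made-up sentence, `found` holds the strings `45.5%` and `1,200`.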

Coverage

The dataset covers scientific research articles published between January 2020 and May 2022. Values in publish_time nevertheless range from 2020-01-01 to 2022-12-31, which is consistent with some publishers recording unknown dates as year-end placeholders (see the publish_time column note). Although the dataset is global in scope, author affiliations are concentrated in a few countries: 16% are from the USA and 10% from China. The most common journals are Int J Environ Res Public Health (5%) and PLoS One (4%), with the majority falling under other journals. About 1% of authors are affiliated with the University of California; the remainder are distributed among other institutions or have no known affiliation.
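Shares like those quoted above depend on how unnormalised strings are grouped, since "USA" and "United States of America" may both occur. The sketch below shows one way to compute country shares after mapping known aliases to a single label. The alias map and the sample values are hypothetical and would need extending after inspecting the actual data.

```python
from collections import Counter

# Hypothetical alias map; keys are lower-cased raw strings.
COUNTRY_ALIASES = {
    "usa": "USA",
    "united states": "USA",
    "united states of america": "USA",
}

def normalise_country(raw):
    """Map a raw country string to a canonical label; empty means unknown."""
    cleaned = raw.strip()
    if not cleaned:
        return "Unknown"
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)

def country_shares(countries):
    """Return each canonical country's fraction of all records."""
    counts = Counter(normalise_country(c) for c in countries)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Made-up values standing in for the country column.
sample = ["USA", "United States of America", "China", "", "Italy"]
shares = country_shares(sample)
```

The same approach applies to the journal column, where "BMJ" and "British Medical Journal" may both appear.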

License

CC BY 4.0

Who Can Use It

This dataset is valuable for a wide array of users, including:
  • Researchers and academics in medical sciences, public health, and epidemiology, seeking quantitative evidence from COVID-19 literature.
  • Data scientists and AI developers focusing on extracting structured information from unstructured text, especially in the health domain.
  • Healthcare professionals and policymakers who need specific, numerically supported insights into the pandemic's various aspects.
  • Anyone requiring tangible and credible numerical information from the biomedical research field to inform their models or analyses.

Dataset Name Suggestions

  • COVID-19 Scientific Numerical Claims
  • CONCORD: COVID-19 Research Data
  • Pandemic Research Numerical Insights
  • Biomedical COVID-19 Claims Dataset
  • Academic COVID-19 Numeric Data

Attributes

Listing Stats

VIEWS

1

DOWNLOADS

0

LISTED

17/06/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
