Biomedical COVID-19 Claims Dataset
Health Information Systems & Technology
Free
About
The COVID-19 Numerical Claims Open Research Dataset (CONCORD) is an open-source collection of numerical claims extracted from academic papers on COVID-19 research. It contains approximately 203,000 numerical claims drawn from over 57,000 scientific research articles published between January 2020 and May 2022. The claims were extracted from full-text articles using a white-box, weakly supervised model, with the CORD-19 repository serving as the raw dataset. Numerical entities lend claims credibility and provide fine-grained, tangible information, which is particularly useful in the biomedical domain.
Columns
The dataset features the following columns:
- claim_uid: A unique identifier for each individual numerical claim.
- cord_uid: An identifier for the research paper from which the claims were extracted, similar to those found in CORD-19.
- title: A string field representing the title of the research paper.
- doi: A string field for the Digital Object Identifier (DOI) of the paper.
- numerical_claims: A string field containing the numerical claim sentence itself.
- publish_time: A datetime field indicating the published date of the paper in yyyy-mm-dd format. Note that this field may not always be accurate, as some publishers denote unknown dates with future dates (e.g., yyyy-12-31).
- authors: A list of strings, where each string represents an author of the paper in 'Last, First Middle' format, semicolon-separated.
- journal: A string field for the journal in which the paper was published. Journal strings are not normalised (e.g., "BMJ" and "British Medical Journal" may both exist). This field can be empty if unknown.
- country: A string field indicating the author's country. Country strings are not normalised (e.g., "USA" and "United States of America" may both exist). This field can be empty if unknown.
- institution: A string field for the author's institute of affiliation. This field can be empty if unknown.
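As a rough illustration of how these columns might be read, the sketch below parses one CSV row with Python's standard library. The sample row is made up, the `parse_row` helper and the `maybe_placeholder_date` flag are hypothetical names, and treating every 31 December date as a possible placeholder is only a heuristic assumption based on the note on publish_time.

```python
import csv
import io
from datetime import date


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed fields."""
    # authors: semicolon-separated names in 'Last, First Middle' format
    authors = [a.strip() for a in row["authors"].split(";") if a.strip()]
    # publish_time: yyyy-mm-dd; fall back to None when empty or malformed
    try:
        published = date.fromisoformat(row["publish_time"])
    except ValueError:
        published = None
    # Heuristic (assumption): 31 December may be a publisher placeholder date
    placeholder = published is not None and (published.month, published.day) == (12, 31)
    return {
        "claim_uid": row["claim_uid"],
        "cord_uid": row["cord_uid"],
        "authors": authors,
        "publish_time": published,
        "maybe_placeholder_date": placeholder,
        "journal": row["journal"] or None,  # empty string means unknown
    }


# Made-up sample data using the column names listed above
SAMPLE = """claim_uid,cord_uid,title,doi,numerical_claims,publish_time,authors,journal,country,institution
c1,p1,Example title,10.0000/example,"Made-up claim: 12% of 50 cases.",2021-12-31,"Doe, Jane A; Roe, Richard",,,
"""

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

For the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle; the file name depends on how the dataset is downloaded.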
Distribution
This dataset is typically provided as a CSV file. It comprises approximately 203,000 unique numerical claims, one per record, in a tabular structure suitable for analytical processing.
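Because the country strings are not normalised, a simple tally over the tabular records can split one country across several spellings. The following is a minimal sketch of one way to fold aliases together before computing shares; the `COUNTRY_ALIASES` table, the function names, and the sample rows are all hypothetical, and a real alias table would need to be built after inspecting the actual values.

```python
from collections import Counter

# Hypothetical alias table (assumption); extend after inspecting the real data
COUNTRY_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def normalise_country(raw):
    """Map a raw country string to a canonical label; empty means unknown."""
    key = raw.strip().lower()
    if not key:
        return "Unknown"
    # Unlisted values are kept as-is (only whitespace is trimmed)
    return COUNTRY_ALIASES.get(key, raw.strip())


def country_shares(rows):
    """Return the fraction of records per normalised country."""
    counts = Counter(normalise_country(r["country"]) for r in rows)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}


# Made-up rows for illustration only
rows = [
    {"country": "USA"},
    {"country": "United States of America"},
    {"country": "China"},
    {"country": ""},
]
shares = country_shares(rows)
```

The same approach would apply to the journal column, which the column descriptions also flag as not normalised.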
Usage
This dataset is ideally suited for various applications, including:
- Biomedical research and analysis.
- Public health studies and insights into COVID-19.
- Natural Language Processing (NLP) tasks, such as information extraction and entity recognition.
- Binary classification problems within text analysis.
- Developing and training AI and Large Language Models (LLMs) that require fine-grained numerical information from scientific literature.
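As a small example of the information-extraction use case, the sketch below pulls numeric tokens out of a claim sentence with a regular expression. This is not the weakly supervised model used to build the dataset; the pattern, the `extract_numbers` helper, and the sample sentence are illustrative assumptions only.

```python
import re

# Integers, thousands-separated numbers, and decimals, with an optional percent sign
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?(?:\s?%)?")


def extract_numbers(sentence):
    """Return the numeric tokens found in a sentence, in order of appearance."""
    return [m.group(0) for m in NUMBER_RE.finditer(sentence)]


# Made-up sentence in the style of a numerical claim
found = extract_numbers("The made-up rate was 12.5% among 1,200 patients.")
```

A pattern this simple also matches digits inside names such as "COVID-19", so real use would need extra filtering.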
Coverage
The dataset covers scientific research articles published between January 2020 and May 2022. The publish_time values span 2020-01-01 to 2022-12-31; dates past May 2022 are likely the placeholder dates that some publishers use for unknown publication dates. While the dataset is global in scope, author affiliations are unevenly distributed geographically: 16% are from the USA and 10% from China, among others. Journal representation includes Int J Environ Res Public Health (5%) and PLoS One (4%), with the majority falling under other journals. Institution data indicates that 1% of authors are affiliated with the University of California, with the remainder distributed among other institutions or unknown.
License
CC BY 4.0
Who Can Use It
This dataset is valuable for a wide array of users, including:
- Researchers and academics in medical sciences, public health, and epidemiology, seeking quantitative evidence from COVID-19 literature.
- Data scientists and AI developers focusing on extracting structured information from unstructured text, especially in the health domain.
- Healthcare professionals and policymakers who need specific, numerically supported insights into the pandemic's various aspects.
- Anyone requiring tangible and credible numerical information from the biomedical research field to inform their models or analyses.
Dataset Name Suggestions
- COVID-19 Scientific Numerical Claims
- CONCORD: COVID-19 Research Data
- Pandemic Research Numerical Insights
- Biomedical COVID-19 Claims Dataset
- Academic COVID-19 Numeric Data
Attributes
Original Data Source: COVID-19 Numerical Claims Open Research Dataset