CORD-19 Study Design & Metadata
Education & Learning Analytics
Free
About
This dataset provides training data and pre-trained models for identifying study designs and other metadata within scientific articles. It enables the detection of fields such as sample size, sampling methods, and study statistics like risk factors. It is particularly useful for Natural Language Processing (NLP) applications and machine learning tasks related to scientific literature.
Columns
The dataset includes two primary files:
design.csv
- label: An integer representing the study design type, with categories including systematic review, randomised control trial, prospective observational, modelling, and others.
- CORD-19 sha id: A unique identifier for the specific scientific study from the CORD-19 dataset.
attribute.csv
- label: An integer indicating the type of metadata attribute, such as statistic, sampling method, or sample size.
- text: The actual sentence from the CORD-19 dataset that has been labelled.
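As a minimal sketch of how either file might be read, the snippet below parses CSV text with Python's standard csv module and counts rows per label. The header names (label, text), the integer label values, and the sample sentences are assumptions made for illustration only; check the actual headers and label codes in design.csv and attribute.csv before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample mimicking attribute.csv.
# The real header names and label codes may differ.
SAMPLE_ATTRIBUTE_CSV = """label,text
0,"The odds ratio for severe disease was 2.4 (95% CI 1.3-4.1)."
1,"Participants were recruited by convenience sampling."
2,"A total of 312 patients were enrolled."
2,"We included 58 confirmed cases."
"""


def read_labelled_rows(handle, label_field="label"):
    """Yield each CSV row as a dict, with the label converted to int."""
    for row in csv.DictReader(handle):
        row[label_field] = int(row[label_field])
        yield row


def label_counts(rows, label_field="label"):
    """Count how many rows carry each integer label."""
    return Counter(row[label_field] for row in rows)


rows = list(read_labelled_rows(io.StringIO(SAMPLE_ATTRIBUTE_CSV)))
counts = label_counts(rows)
```

To read a real file instead of the in-memory sample, pass an open file handle, for example open("attribute.csv", newline="", encoding="utf-8"), to read_labelled_rows. The per-label counts give a quick view of class balance before any training.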
Distribution
The training data is provided as CSV files. There are two training sets, each comprising over 1,000 labelled items: articles for design and sentences for attribute. Binary pre-trained models are also included for both the attribute and design classifications.
Usage
This dataset is ideally suited for:
- Automated detection of study designs in research articles.
- Extraction of key metadata such as sample size, sampling methodologies, and relevant statistics.
- Developing and training AI and machine learning models for text analysis in scientific publications.
- Natural Language Processing (NLP) tasks focused on structuring information from academic papers, particularly in the context of coronavirus research.
- ETL of study metadata, using the accompanying code repository.
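To make the sentence-classification use case concrete, here is a small baseline sketch: a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. It is not the pre-trained model shipped with the dataset, whose format and architecture are not described here. The class name NaiveBayes, the helper tokenize, the toy sentences, and the integer labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data; labels 1 and 2 are placeholder codes, not the real ones.
texts = [
    "A total of 312 patients were enrolled.",
    "We included 58 confirmed cases.",
    "Participants were recruited by convenience sampling.",
    "Random sampling was used to select households.",
]
labels = [2, 2, 1, 1]

model = NaiveBayes().fit(texts, labels)
predicted = model.predict("Stratified random sampling was applied.")
```

With the real attribute.csv, the text and label columns would replace the toy lists. A baseline like this is mainly useful as a sanity check against which stronger models can be compared.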
Coverage
The data originates from the CORD-19 dataset, which focuses on coronavirus research. It is designed for global application. Time ranges and demographics for the underlying articles are not specified, though the articles are drawn from contemporary scientific literature.
License
CC BY-SA
Who Can Use It
This resource is valuable for:
- Data scientists and machine learning engineers developing solutions for scientific text mining.
- Researchers in fields such as education, learning analytics, and public health.
- Academics and students focusing on NLP, information extraction, or bibliometrics.
- Anyone interested in applying AI to analyse and categorise medical and scientific literature, especially concerning infectious diseases.
Dataset Name Suggestions
- CORD-19 Study Design & Metadata
- Scientific Article Study Classifier
- Research Paper Metadata Extraction Dataset
- NLP for Medical Study Design
Attributes
Original Data Source: CORD-19 Study Design