CORD-19 Study Design & Metadata

Education & Learning Analytics

Tags and Keywords

Earth

Education

NLP

Coronavirus

Free

About

This dataset provides training data and pre-trained models for identifying study designs and other metadata within scientific articles. It supports detecting fields such as sample size, sampling method, and study statistics (for example, risk factors). It is particularly useful for Natural Language Processing (NLP) applications and machine learning tasks related to scientific literature.

Columns

The dataset includes two primary files:
  • design.csv:
    • label: An integer representing the study design type, with categories including systematic review, randomised controlled trial, prospective observational, modelling, and others.
    • CORD-19 sha id: A unique identifier for the specific scientific study from the CORD-19 dataset.
  • attribute.csv:
    • label: An integer indicating the type of metadata attribute, such as statistic, sampling method, or sample size.
    • text: The actual sentence from the CORD-19 dataset that has been labelled.
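
As a rough illustration of the layout described above, the following sketch reads attribute.csv-style rows with Python's standard csv module and tallies how many sentences carry each label. The header names `label` and `text` are assumed from the column descriptions and may not match the real file headers. The sample rows and their integer labels are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def load_labelled_rows(fileobj, label_col="label", text_col="text"):
    """Read (label, text) pairs from a CSV file object.

    The column names are assumptions based on the listing's description.
    """
    reader = csv.DictReader(fileobj)
    return [(int(row[label_col]), row[text_col]) for row in reader]


def label_counts(rows):
    """Count how many rows carry each integer label."""
    return Counter(label for label, _ in rows)


# Invented sample in the assumed attribute.csv layout (not real dataset rows).
# The integer values are placeholders; the listing does not document the
# mapping from integers to attribute types.
SAMPLE_CSV = io.StringIO(
    "label,text\n"
    "0,The odds ratio was 2.1 for patients over 60.\n"
    "1,Participants were recruited by convenience sampling.\n"
    "2,A total of 120 patients were enrolled.\n"
    "2,We analysed 58 confirmed cases.\n"
)

rows = load_labelled_rows(SAMPLE_CSV)
counts = label_counts(rows)
print(counts)
```

To run the same functions on the downloaded file, the in-memory sample could be swapped for `open("attribute.csv", newline="", encoding="utf-8")`, assuming the file uses that name and a UTF-8 encoding.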

Distribution

The training data is provided as two CSV files, design.csv (labelled articles) and attribute.csv (labelled sentences), each containing over 1,000 labelled entries. Binary pre-trained models are also included for both the attribute and design classifications.

Usage

This dataset is ideally suited for:
  • Automated detection of study designs in research articles.
  • Extraction of key metadata such as sample size, sampling methodologies, and relevant statistics.
  • Developing and training AI and machine learning models for text analysis in scientific publications.
  • Natural Language Processing (NLP) tasks focused on structuring information from academic papers, particularly in the context of coronavirus research.
  • ETL (extract, transform, load) processing of article metadata, using the accompanying code repository.
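
The model-training use case above can be sketched with a small baseline. The example below is a minimal multinomial naive Bayes sentence classifier in pure Python, assuming (label, text) pairs shaped like attribute.csv. It is not the pre-trained model shipped with the dataset, whose architecture the listing does not describe. The class name, training sentences, and integer labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training sentences; labels 1 and 2 are placeholder integers
# standing in for "sampling method" and "sample size".
train_texts = [
    "A total of 120 patients were enrolled.",
    "We analysed 58 confirmed cases.",
    "Participants were recruited by convenience sampling.",
    "Households were selected by random sampling.",
]
train_labels = [2, 2, 1, 1]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("Subjects were chosen using stratified sampling."))
print(clf.predict("In total 45 patients were included."))
```

With over 1,000 labelled sentences available, a held-out split would be needed to judge whether such a baseline is useful. Four training sentences only demonstrate the mechanics.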

Coverage

The data is drawn from the CORD-19 dataset of coronavirus research and is global in scope. The listing does not specify the time range or demographics of the underlying articles, beyond noting that they are contemporary scientific literature.

License

CC BY-SA

Who Can Use It

This resource is valuable for:
  • Data scientists and machine learning engineers developing solutions for scientific text mining.
  • Researchers in fields such as education, learning analytics, and public health.
  • Academics and students focusing on NLP, information extraction, or bibliometrics.
  • Anyone interested in applying AI to analyse and categorise medical and scientific literature, especially concerning infectious diseases.

Dataset Name Suggestions

  • CORD-19 Study Design & Metadata
  • Scientific Article Study Classifier
  • Research Paper Metadata Extraction Dataset
  • NLP for Medical Study Design

Attributes

Original Data Source: CORD-19 Study Design

Listing Stats

VIEWS: 0

DOWNLOADS: 0

LISTED: 17/06/2025

REGION: GLOBAL

UDQS (Universal Data Quality Score) QUALITY: 5 / 5

VERSION: 1.0