Opendatabay APP

Structured COVID-19 Trial Eligibility Data

Patient Health Records & Digital Health

Tags and Keywords

Covid-19, Trials, Eligibility, Criteria, Health

Free

About

This resource contains structured data derived from the eligibility criteria of COVID-19 clinical trials. It gives researchers knowledge extracted from the typically unstructured text of trial descriptions, focusing on key medical and procedural entities. Each entry identifies one entity within a trial's criteria, maps it to a standardised vocabulary using concepts and domains (such as Condition, Drug, or Measurement), and normalises any associated temporal and numerical attributes. Each entry also records whether the criterion is an inclusion or an exclusion requirement for participation.

Columns

The dataset comprises 13 columns detailing the extracted information (two of the entries below each describe a pair of columns):
  • nct_id: The unique identifier for the clinical trial as recorded on clinicaltrials.gov.
  • entity_source_text: The precise segment of the original eligibility criteria text containing the medical or procedural entity (e.g., "pregnant").
  • concept_id: The identifier used within the standardised vocabulary for the mapped entity.
  • concept_name: The standardised name corresponding to the concept ID (e.g., "Disease caused by severe acute respiratory syndrome coronavirus 2").
  • domain: The type or category of the concept, such as 'Condition', 'Drug', or 'Measurement'.
  • start_index/end_index: The character positions indicating where the entity begins and ends within the full criteria source text.
  • temporal_source_text: The text snippet from the criteria that indicates a time frame or duration (e.g., "history of"). This is frequently null.
  • days: The temporal attribute normalised into a number of days.
  • numeric_source_text: The text snippet describing a numerical attribute associated with the entity (e.g., "positive"). This is often null.
  • numeric_att_min/numeric_att_max: The lower and upper bounds of the normalised numerical attribute.
  • is_exclusion: A binary flag where '1' signifies an exclusion criterion and '0' signifies an inclusion criterion.
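
As a rough illustration of the column list above, the following sketch reads rows into a typed record using only the Python standard library. Several things here are assumptions rather than facts about the file: that the header uses exactly these 13 names, that nulls appear as empty cells, and that is_exclusion is stored as the text 1 or 0. The sample rows, the NCT00000000 identifier, and the concept_id value 0 are placeholders for illustration, not real records.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# The 13 column names from the listing; their order in the real file is an assumption.
EXPECTED_COLUMNS = [
    "nct_id", "entity_source_text", "concept_id", "concept_name", "domain",
    "start_index", "end_index", "temporal_source_text", "days",
    "numeric_source_text", "numeric_att_min", "numeric_att_max", "is_exclusion",
]


@dataclass
class CriterionEntity:
    nct_id: str
    entity_source_text: str
    concept_id: str
    concept_name: str
    domain: str
    start_index: int
    end_index: int
    temporal_source_text: Optional[str]
    days: Optional[float]
    numeric_source_text: Optional[str]
    numeric_att_min: Optional[float]
    numeric_att_max: Optional[float]
    is_exclusion: bool


def _opt_text(value: Optional[str]) -> Optional[str]:
    """Treat missing or blank cells as null (assumes nulls are empty cells)."""
    text = (value or "").strip()
    return text or None


def _opt_float(value: Optional[str]) -> Optional[float]:
    text = _opt_text(value)
    return float(text) if text is not None else None


def read_entities(handle) -> List[CriterionEntity]:
    """Parse CSV rows into CriterionEntity records, checking the header first."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    entities = []
    for row in reader:
        entities.append(CriterionEntity(
            nct_id=row["nct_id"],
            entity_source_text=row["entity_source_text"],
            concept_id=row["concept_id"],
            concept_name=row["concept_name"],
            domain=row["domain"],
            # int(float(...)) tolerates indices written as "11" or "11.0".
            start_index=int(float(row["start_index"])),
            end_index=int(float(row["end_index"])),
            temporal_source_text=_opt_text(row["temporal_source_text"]),
            days=_opt_float(row["days"]),
            numeric_source_text=_opt_text(row["numeric_source_text"]),
            numeric_att_min=_opt_float(row["numeric_att_min"]),
            numeric_att_max=_opt_float(row["numeric_att_max"]),
            is_exclusion=row["is_exclusion"].strip() == "1",
        ))
    return entities


# Placeholder rows for illustration only; they are not taken from the dataset.
SAMPLE_CSV = ",".join(EXPECTED_COLUMNS) + "\n" + (
    "NCT00000000,pregnant,0,Placeholder concept name,Condition,11,19,,,,,,1\n"
    "NCT00000000,COVID-19,0,Disease caused by severe acute respiratory "
    "syndrome coronavirus 2,Condition,0,8,,,positive,,,0\n"
)

entities = read_entities(io.StringIO(SAMPLE_CSV))
for entity in entities:
    kind = "exclusion" if entity.is_exclusion else "inclusion"
    print(entity.nct_id, entity.entity_source_text, entity.domain, kind)
```

To run this against the downloaded file, the StringIO sample could be swapped for an open file handle, for example open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.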

Distribution

The data is provided as a single table in CSV format. It contains approximately 10,200 valid records of extracted entities and attributes, and the file size is 1.19 MB. 'Condition' is the most frequent domain, accounting for over half of all entries.
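
The domain breakdown mentioned above could be checked with a short count over the domain column. This is a minimal sketch, assuming each row is available as a dictionary with a domain key (as csv.DictReader would produce); the function name domain_shares and the sample rows are made up for illustration.

```python
from collections import Counter


def domain_shares(rows):
    """Return {domain: fraction of rows}, ordered from most to least frequent."""
    counts = Counter(row["domain"] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: n / total for domain, n in counts.most_common()}


# Made-up rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {"domain": "Condition"},
    {"domain": "Condition"},
    {"domain": "Condition"},
    {"domain": "Drug"},
    {"domain": "Measurement"},
]

shares = domain_shares(sample_rows)
for domain, share in shares.items():
    print(f"{domain}: {share:.0%}")
```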

Usage

The data supports several research and development applications, including:
  • Developing and evaluating Natural Language Processing (NLP) models designed to automatically structure and extract complex clinical information from text.
  • Analysing patterns in eligibility requirements across global COVID-19 clinical trials.
  • Studying the ratio and types of inclusion versus exclusion criteria used in pandemic-related research; in this dataset, exclusion criteria substantially outnumber inclusion criteria.
  • Research in medical informatics focusing on standardisation and interoperability of trial data.
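
One way to approach the inclusion versus exclusion analysis listed above is to tally the is_exclusion flag per trial and overall. The sketch below assumes rows are dictionaries keyed by column name and that the flag is stored as 1 or 0; the helper names and the NCT identifiers in the sample are hypothetical placeholders.

```python
from collections import defaultdict


def is_exclusion(row):
    """Interpret the flag, assuming it is stored as 1 (exclusion) or 0 (inclusion)."""
    return str(row["is_exclusion"]).strip() == "1"


def criteria_counts_by_trial(rows):
    """Count inclusion and exclusion entities for each nct_id."""
    counts = defaultdict(lambda: {"inclusion": 0, "exclusion": 0})
    for row in rows:
        kind = "exclusion" if is_exclusion(row) else "inclusion"
        counts[row["nct_id"]][kind] += 1
    return dict(counts)


def overall_exclusion_share(rows):
    """Fraction of all entities that belong to exclusion criteria."""
    flags = [is_exclusion(row) for row in rows]
    return sum(flags) / len(flags) if flags else 0.0


# Made-up rows with placeholder trial IDs, for illustration only.
sample_rows = [
    {"nct_id": "NCT00000001", "is_exclusion": "1"},
    {"nct_id": "NCT00000001", "is_exclusion": "1"},
    {"nct_id": "NCT00000001", "is_exclusion": "0"},
    {"nct_id": "NCT00000002", "is_exclusion": "0"},
]

per_trial = criteria_counts_by_trial(sample_rows)
share = overall_exclusion_share(sample_rows)
print(per_trial)
print(f"exclusion share: {share:.0%}")
```

Note that these counts are per extracted entity, not per criterion sentence, since a single criterion can contain several entities.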

Coverage

The data covers clinical trials identified by IDs from clinicaltrials.gov. The concepts relate to COVID-19 trial enrolment, including the disease itself and commonly related conditions, procedures, measurements, and demographics such as pregnancy status. The structured semantic tags span the range of medical domains relevant to trial enrolment.

License

CC0: Public Domain

Who Can Use It

  • Biomedical Researchers: To gain quick insights into population characteristics and restrictions in COVID-19 trials.
  • Data Scientists and Machine Learning Engineers: To train models for entity recognition, semantic tagging, and attribute normalisation in clinical documents.
  • Health Informatics Specialists: To explore methods for standardising clinical trial metadata and criteria.
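
For the entity recognition use case above, the start_index and end_index columns could be turned into span annotations once the full criteria text for a trial is available. The listing does not say whether end_index is exclusive or inclusive, and the criteria text itself is not one of the 13 columns, so this sketch assumes the text has been obtained separately and simply tries both conventions. The function name to_span and the sample sentence are hypothetical.

```python
def to_span(criteria_text, row):
    """Return (start, stop, domain) with an exclusive stop, or None if misaligned.

    Tries end_index as an exclusive bound first, then as an inclusive bound,
    and keeps whichever slice reproduces entity_source_text exactly.
    """
    start = int(row["start_index"])
    end = int(row["end_index"])
    surface = row["entity_source_text"]
    for stop in (end, end + 1):
        if criteria_text[start:stop] == surface:
            return (start, stop, row["domain"])
    return None


# Made-up criteria sentence and rows, for illustration only.
criteria_text = "Exclusion: pregnant or breastfeeding women"
exclusive_row = {
    "entity_source_text": "pregnant", "domain": "Condition",
    "start_index": "11", "end_index": "19",
}
inclusive_row = dict(exclusive_row, end_index="18")
misaligned_row = dict(exclusive_row, start_index="0", end_index="8")

print(to_span(criteria_text, exclusive_row))
print(to_span(criteria_text, inclusive_row))
print(to_span(criteria_text, misaligned_row))
```

Rows that return None are worth inspecting before training, since they may indicate a different offset convention or a criteria text that has changed since extraction.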

Dataset Name Suggestions

  • COVID-19 Clinical Trial Semantic Eligibility Criteria
  • Structured COVID-19 Trial Eligibility Data
  • Normalized Clinical Trial Inclusion and Exclusion Attributes

Listing Stats

  • Views: 3
  • Downloads: 0
  • Listed: 16/10/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
