Opendatabay APP

Premium Synthetic Pharma EHR + Genomics + PK/PD Dataset (10K Patients)

Synthetic Data Generation

Tags and Keywords

Synthetic Data, EHR, Electronic Health Records, Genomics, PK/PD, Pharma R&D, Biotech AI, ICD-10, ATC Codes, MedDRA, Clinical Trial Simulation, Precision Medicine, Pharmacovigilance, Safety Signal Detection, Laboratory, Longitudinal, GDPR Compliant, HIPAA Compatible, Medical, Drug Development


£79

About

This dataset contains 10,000 fully synthetic pharma‑grade EHR records combining demographics, encounters, diagnoses, medications with PK/PD parameters, and key laboratory values in a single, analysis‑ready table. Its purpose is to provide realistic but privacy‑safe clinical data for pharma R&D, biotech AI development, and healthcare analytics without using any real patient data.

Dataset Features

patient_id: Unique synthetic patient identifier linking all records for the same individual.
region: World region (e.g. Europe_CEE, Europe_West, Middle East, Africa, East Asia, South America, North America).
country: Patient’s country of residence.
city: Synthetic city category (CapitalCity, MetroTown, ProvCity, HillTown, CoastalCity).
birth_year: Year of birth of the patient.
sex: Biological sex (Male, Female, Other).
ethnicity: Categorical ethnicity group (GroupA–GroupD).
education_level: Highest completed education (None, Primary, Secondary, Tertiary, Graduate).
occupation: Broad occupation class (Engineer, Student, Farmer, Clerk, HealthWorker, Teacher, Retired, Unemployed, etc.).
income_usd: Annual income in USD.
insurance_type: Coverage type (Public, Private, Mixed, None).
registration_date: Date when the patient was first registered in the system.
encounter_id_x: Unique identifier of the main encounter.
encounter_type: Encounter setting (outpatient, inpatient, emergency, rehab).
start_date_x: Encounter start date.
end_date_x: Encounter end date.
site_id: Synthetic site / hospital identifier.
diagnosis_id: Internal diagnosis record identifier.
encounter_id_y: Linked encounter ID for diagnosis/medication context.
icd10: Primary diagnosis in ICD‑10 format (e.g. I10, E11, C50, J45).
severity: Disease severity category (mild, moderate, severe).
medication_id: Synthetic medication record identifier.
atc: Drug code in the ATC classification (e.g. A10BA02, C07AB02, M01AE01).
dose_mg: Prescribed dose in milligrams.
start_date_y: Medication start date.
end_date_y: Medication end date.
adherence_pct: Percentage adherence to the prescribed regimen.
clearance_L_per_h: Drug clearance in liters per hour (PK parameter).
Vd_L: Volume of distribution in liters (PK parameter).
t_half_h: Drug half‑life in hours (PK parameter).
lab_id: Laboratory record identifier.
date: Date of the laboratory measurement.
glucose_mg_dL: Blood glucose level.
ldl_mg_dL: LDL cholesterol level.
crp_mg_L: C‑reactive protein level (inflammation marker).
hb_g_dL: Hemoglobin concentration.
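The three PK columns (clearance_L_per_h, Vd_L, t_half_h) are linked in the standard one-compartment model by t1/2 = ln(2) × Vd / CL. The sketch below computes the half-life implied by the other two columns so it can be compared with the recorded value. The listing does not say whether the synthetic values follow this model, so treat it as a plausibility check only; the function name and the sample row are hypothetical.

```python
import math


def implied_half_life_h(vd_l: float, clearance_l_per_h: float) -> float:
    """Half-life implied by a one-compartment model: t1/2 = ln(2) * Vd / CL."""
    if clearance_l_per_h <= 0:
        raise ValueError("clearance must be positive")
    return math.log(2) * vd_l / clearance_l_per_h


# Made-up row for illustration only; not taken from the dataset.
row = {"Vd_L": 40.0, "clearance_L_per_h": 4.0, "t_half_h": 7.0}

implied = implied_half_life_h(row["Vd_L"], row["clearance_L_per_h"])
relative_gap = abs(implied - row["t_half_h"]) / row["t_half_h"]
print(f"implied t1/2 = {implied:.2f} h, relative gap = {relative_gap:.1%}")
```

A large relative gap across many rows would suggest the PK parameters were sampled independently rather than derived from one another.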

Distribution

Data format: CSV (comma‑separated values) with header row.
Data volume: Full dataset 10,000 rows × 36+ columns; attached demo file ehr_1000_demo.csv contains 1,000 preview records with identical structure.
Structure: Each row represents one encounter–diagnosis–medication–lab combination for a given synthetic patient, with demographics and PK/PD parameters flattened into a single wide table.
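Because the table is flattened, a patient may appear in more than one row. A minimal sketch of reading the CSV with Python's standard library and grouping rows by patient_id follows. The sample rows, the identifier formats, and the ISO date format are assumptions made for illustration; the listing does not document them.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Made-up sample with a subset of the listed columns. To read the demo file,
# replace the StringIO object with open("ehr_1000_demo.csv", newline="").
SAMPLE_CSV = """patient_id,encounter_id_x,start_date_x,icd10,atc,dose_mg
P0001,E0001,2015-03-02,I10,C07AB02,50
P0001,E0002,2016-07-19,E11,A10BA02,500
P0002,E0003,2019-11-05,J45,M01AE01,400
"""

rows_by_patient = defaultdict(list)
for record in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # Assumes ISO 8601 dates (YYYY-MM-DD); the listing does not state the format.
    record["start_date_x"] = date.fromisoformat(record["start_date_x"])
    record["dose_mg"] = float(record["dose_mg"])
    rows_by_patient[record["patient_id"]].append(record)

for patient_id, records in rows_by_patient.items():
    first_visit = min(r["start_date_x"] for r in records)
    print(patient_id, len(records), "rows, first encounter", first_visit)
```

Grouping first makes it easy to separate patient-level attributes (demographics) from row-level ones (encounters, medications, labs) before any aggregation.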

Usage

This dataset is ideal for many applications:
Application: Training and validating ML models for diagnosis prediction, treatment response, adherence scoring, and risk stratification on realistic EHR‑like data.
Application: Pharma and biotech research, including clinical trial simulation, protocol design, drug portfolio analytics, and PK/PD modeling.
Application: Testing hospital data platforms, ETL pipelines, and clinical data warehouses using fully synthetic, privacy‑safe EHR data.
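For the ML use case, rows belonging to the same patient should stay on one side of a train/test split; otherwise repeated demographics can leak between the two sets. The following is a sketch of a patient-level split using only the standard library. The helper name split_by_patient and the sample rows are hypothetical, and it assumes a patient can appear in more than one row.

```python
import random


def split_by_patient(rows, test_fraction=0.2, seed=0):
    """Split rows into train/test so that no patient_id appears in both."""
    patient_ids = sorted({row["patient_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in rows if r["patient_id"] not in test_ids]
    test = [r for r in rows if r["patient_id"] in test_ids]
    return train, test


# Made-up rows: 10 patients with 3 rows each.
rows = [
    {"patient_id": f"P{i:04d}", "adherence_pct": 80 + i}
    for i in range(10)
    for _ in range(3)
]
train, test = split_by_patient(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed keeps the split reproducible across runs, which helps when comparing models trained on the same data.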

Coverage

Geographical coverage: Multiple regions (Europe_CEE, Europe_West, Middle East, Africa, East Asia, South America, North America), with country and synthetic city categories for each patient.
Time range: Registrations, encounters, medications, and labs span approximately 2012–2022, reflecting contemporary clinical practice patterns.
Demographic coverage: Wide age range (birth years roughly 1920–2020), all listed sex categories (Male, Female, Other), diverse ethnicity groups, varied occupations, income levels, and insurance types.
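The stated ranges can be turned into simple sanity checks after loading. This sketch derives an approximate age from birth_year and an encounter date, and tests whether a date falls inside the approximate 2012–2022 window. The helper names and sample values are hypothetical, and the year bounds are taken from the "approximately" figures above, so a few out-of-range rows would not necessarily indicate an error.

```python
from datetime import date


def age_at(birth_year: int, on: date) -> int:
    """Approximate age in years; only the birth year is available."""
    return on.year - birth_year


def in_stated_range(d: date, first_year: int = 2012, last_year: int = 2022) -> bool:
    """True if the date falls within the listing's approximate time range."""
    return first_year <= d.year <= last_year


# Made-up values for illustration only.
encounter_start = date(2018, 6, 14)
birth_year = 1956
print(age_at(birth_year, encounter_start), in_stated_range(encounter_start))
```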

License

Proprietary

Who Can Use It

Proprietary commercial synthetic healthcare dataset; allowed for internal research, analytics, and AI development, while resale, redistribution, or public sharing of the raw data is restricted.
  • Data Scientists: For training machine learning models.
  • Researchers: For academic or scientific studies.
  • Businesses: For analysis, insights, or AI development.


Listing Stats

Views: 4
Downloads: 0
Listed: 02/12/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
