Premium Synthetic Pharma EHR + Genomics + PK/PD Dataset (10K Patients)
Synthetic Data Generation
£79
About
This dataset contains 10,000 fully synthetic pharma‑grade EHR records combining demographics, encounters, diagnoses, medications with PK/PD parameters, and key laboratory values in a single, analysis‑ready table. Its purpose is to provide realistic but privacy‑safe clinical data for pharma R&D, biotech AI development, and healthcare analytics without using any real patient data.
Dataset Features
patient_id: Unique synthetic patient identifier linking all records for the same individual.
region: World region (e.g. Europe_CEE, Europe_West, Middle East, Africa, East Asia, South America, North America).
country: Patient’s country of residence.
city: Synthetic city category (CapitalCity, MetroTown, ProvCity, HillTown, CoastalCity).
birth_year: Year of birth of the patient.
sex: Biological sex (Male, Female, Other).
ethnicity: Categorical ethnicity group (GroupA–GroupD).
education_level: Highest completed education (None, Primary, Secondary, Tertiary, Graduate).
occupation: Broad occupation class (Engineer, Student, Farmer, Clerk, HealthWorker, Teacher, Retired, Unemployed, etc.).
income_usd: Annual income in USD.
insurance_type: Coverage type (Public, Private, Mixed, None).
registration_date: Date when the patient was first registered in the system.
encounter_id_x: Unique identifier of the main encounter.
encounter_type: Encounter setting (outpatient, inpatient, emergency, rehab).
start_date_x: Encounter start date.
end_date_x: Encounter end date.
site_id: Synthetic site / hospital identifier.
diagnosis_id: Internal diagnosis record identifier.
encounter_id_y: Linked encounter ID for diagnosis/medication context.
icd10: Primary diagnosis in ICD‑10 format (e.g. I10, E11, C50, J45).
severity: Disease severity category (mild, moderate, severe).
medication_id: Synthetic medication record identifier.
atc: Drug code in the ATC classification (e.g. A10BA02, C07AB02, M01AE01).
dose_mg: Prescribed dose in milligrams.
start_date_y: Medication start date.
end_date_y: Medication end date.
adherence_pct: Percentage adherence to the prescribed regimen.
clearance_L_per_h: Drug clearance in liters per hour (PK parameter).
Vd_L: Volume of distribution in liters (PK parameter).
t_half_h: Drug half‑life in hours (PK parameter).
lab_id: Laboratory record identifier.
date: Date of the laboratory measurement.
glucose_mg_dL: Blood glucose level.
ldl_mg_dL: LDL cholesterol level.
crp_mg_L: C‑reactive protein level (inflammation marker).
hb_g_dL: Hemoglobin concentration.
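The field list above can be turned into a simple header check before any analysis. The following is a minimal sketch, assuming the CSV header uses exactly the field names listed above (the listing does not confirm spelling or order, so order is ignored here). `EXPECTED_COLUMNS` and `check_header` are illustrative names, not part of the dataset.

```python
import csv
import io

# Column names copied from the feature list; exact header spelling is an assumption.
EXPECTED_COLUMNS = [
    "patient_id", "region", "country", "city", "birth_year", "sex",
    "ethnicity", "education_level", "occupation", "income_usd",
    "insurance_type", "registration_date", "encounter_id_x",
    "encounter_type", "start_date_x", "end_date_x", "site_id",
    "diagnosis_id", "encounter_id_y", "icd10", "severity",
    "medication_id", "atc", "dose_mg", "start_date_y", "end_date_y",
    "adherence_pct", "clearance_L_per_h", "Vd_L", "t_half_h",
    "lab_id", "date", "glucose_mg_dL", "ldl_mg_dL", "crp_mg_L", "hb_g_dL",
]


def check_header(csv_file):
    """Return (missing, unexpected) column names for an open CSV file."""
    header = next(csv.reader(csv_file))
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# In-memory stand-in for a real file; "extra_col" is a made-up column name.
sample = io.StringIO("patient_id,region,extra_col\n")
missing, unexpected = check_header(sample)
print(len(missing), unexpected)
```

For the real file, the same function could be called on `open("ehr_1000_demo.csv", newline="")`.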
Distribution
Data format: CSV (comma‑separated values) with header row.
Data volume: Full dataset 10,000 rows × 36+ columns; attached demo file ehr_1000_demo.csv contains 1,000 preview records with identical structure.
Structure: Each row represents one encounter–diagnosis–medication–lab combination for a given synthetic patient, with demographics and PK/PD parameters flattened into a single wide table.
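Because each row is described as one encounter–diagnosis–medication–lab combination, it may be worth checking whether a patient appears in more than one row before treating rows as independent samples. The sketch below counts rows per `patient_id` using only the standard library. The in-memory sample rows and the `P001`-style identifiers are made up for illustration, and `rows_per_patient` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def rows_per_patient(csv_file):
    """Count how many rows each patient_id contributes."""
    reader = csv.DictReader(csv_file)
    return Counter(row["patient_id"] for row in reader)


# Tiny stand-in for the real CSV; values and ID format are invented.
sample = io.StringIO(
    "patient_id,icd10,atc\n"
    "P001,I10,C07AB02\n"
    "P001,E11,A10BA02\n"
    "P002,J45,M01AE01\n"
)
counts = rows_per_patient(sample)

# Patients contributing more than one row.
repeated = {pid: n for pid, n in counts.items() if n > 1}
print(repeated)
```

If `repeated` is non-empty on the real data, splitting train and test sets by patient rather than by row avoids the same patient landing on both sides.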
Usage
The dataset is suited to the following applications:
Application: Training and validating ML models for diagnosis prediction, treatment response, adherence scoring, and risk stratification on realistic EHR‑like data.
Application: Pharma and biotech research, including clinical trial simulation, protocol design, drug portfolio analytics, and PK/PD modeling.
Application: Testing hospital data platforms, ETL pipelines, and clinical data warehouses using fully synthetic, privacy‑safe EHR data.
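For the PK/PD modeling use case, one quick sanity check is whether the three PK columns agree with each other. Under a one-compartment model with first-order elimination, half-life follows from clearance and volume of distribution as t_half = ln(2) × Vd / CL. The sketch below applies that textbook relation to a row. Whether the synthetic generator enforces this relation is not stated in the listing, so a mismatch is not necessarily an error. The function names and the 5% tolerance are illustrative assumptions.

```python
import math


def expected_half_life_h(clearance_l_per_h, vd_l):
    """Half-life implied by a one-compartment model: ln(2) * Vd / CL."""
    if clearance_l_per_h <= 0:
        raise ValueError("clearance must be positive")
    return math.log(2) * vd_l / clearance_l_per_h


def half_life_is_consistent(row, rel_tol=0.05):
    """Compare the reported t_half_h with the value implied by CL and Vd."""
    implied = expected_half_life_h(
        float(row["clearance_L_per_h"]), float(row["Vd_L"])
    )
    return math.isclose(float(row["t_half_h"]), implied, rel_tol=rel_tol)


# Made-up example values, not taken from the dataset.
example = {"clearance_L_per_h": "5.0", "Vd_L": "50.0", "t_half_h": "6.9"}
print(half_life_is_consistent(example))
```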
Coverage
Geographical coverage: Multiple regions (Europe_CEE, Europe_West, Middle East, Africa, East Asia, South America, North America), with country and synthetic city categories for each patient.
Time range: Registrations, encounters, medications, and labs span approximately 2012–2022, reflecting contemporary clinical practice patterns.
Demographic coverage: Wide age range (birth years roughly 1920–2020), all sex categories (Male, Female, Other), diverse ethnicity groups, varied occupations, income levels, and insurance types.
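The stated coverage can be checked row by row. The sketch below flags values outside the quoted ranges. It assumes dates are stored as ISO `YYYY-MM-DD` strings, which the listing does not confirm, and it treats the approximate ranges (2012–2022 and 1920–2020) as hard bounds purely for illustration. `out_of_range_fields` is a hypothetical helper name.

```python
from datetime import date

# Ranges quoted in the coverage notes; both are described there as approximate.
BIRTH_YEAR_RANGE = (1920, 2020)
DATE_RANGE = (date(2012, 1, 1), date(2022, 12, 31))
DATE_COLUMNS = [
    "registration_date", "start_date_x", "end_date_x",
    "start_date_y", "end_date_y", "date",
]


def out_of_range_fields(row):
    """List the columns of one row whose values fall outside the stated ranges."""
    flagged = []
    if not BIRTH_YEAR_RANGE[0] <= int(row["birth_year"]) <= BIRTH_YEAR_RANGE[1]:
        flagged.append("birth_year")
    for col in DATE_COLUMNS:
        value = date.fromisoformat(row[col])  # assumes YYYY-MM-DD strings
        if not DATE_RANGE[0] <= value <= DATE_RANGE[1]:
            flagged.append(col)
    return flagged


# Made-up row for illustration only.
example_row = {"birth_year": "1975", **{col: "2018-06-01" for col in DATE_COLUMNS}}
print(out_of_range_fields(example_row))
```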
License
Proprietary
Who Can Use It
Proprietary commercial synthetic healthcare dataset; allowed for internal research, analytics, and AI development, while resale, redistribution, or public sharing of the raw data is restricted.
- Data Scientists: For training machine learning models.
- Researchers: For academic or scientific studies.
- Businesses: For analysis, insights, or AI development.
