The team had everything. IRB approval after four months of applications and revisions. Institutional support. A compelling research question: predicting 30-day readmissions in heart failure patients. A solid model architecture. And then, eleven weeks in, they discovered that the Kaggle dataset they'd built everything on carried a Creative Commons Non-Commercial license. Their product was commercial. Their investors expected commercial returns. Every model they had was trained on data they had no right to use commercially.

They tried Synthea next. Open-source, Apache-licensed, no restrictions. They generated 500,000 synthetic patients, ran their feature engineering pipeline, and built a readmission model that showed promising validation metrics. Then one of their cardiologists looked at the training data more carefully. Where were the CPT procedure codes? Where was the billing data? Where were the E&M visit levels? The revenue cycle data that their hospital partner needed to integrate the model into their actual workflow was simply absent. Synthea doesn't generate it.

Back to searching. This time for MIMIC-III, the gold-standard ICU dataset from Beth Israel Deaconess Medical Center and MIT. Freely available — but behind a PhysioNet Data Use Agreement that explicitly restricts the data to non-commercial research. Again: not usable for a commercial product.

That team's experience is not an edge case. It's the standard arc of a healthcare ML project's first data crisis. IRB approval creates a false sense of forward momentum. Free datasets that look promising turn out to carry commercial restrictions. Open-source generators produce data that's missing critical clinical and billing fields. And through all of it, the calendar keeps moving and the runway keeps shrinking.

The problem isn't that healthcare data doesn't exist. It's that the data that exists is either inaccessible, improperly licensed, structurally incomplete, or all three simultaneously. And every one of these failures is invisible until you're deep enough into a project that turning back feels impossible.

This guide is a systematic walkthrough of everything that makes a healthcare dataset actually usable for machine learning. Not theoretically usable. Not usable for academic research. Actually usable — for a commercial product, in production, by a team with real timelines and real stakeholders.

The Lifecycle of a Healthcare ML Project — and Where Data Problems Strike

Healthcare ML projects fail at data problems in predictable ways. Understanding the lifecycle lets you anticipate where the landmines are before you step on them.

Phase 1: Problem Definition and Data Requirements (Weeks 1–4)

This phase feels data-independent — you're defining the clinical question, identifying stakeholders, sketching model architecture. But the data decisions you make here, or fail to make, determine whether your project survives. Teams that don't specify data requirements with precision during problem definition spend months on a model that can't be trained because the required features don't exist in any available dataset.

The specific failure: a team building a length-of-stay prediction model specifies "clinical data" as their data requirement. They get six months in before realizing that LOS prediction requires not just clinical features but admission source, insurance type, bed availability proxies, and surgical scheduling data — none of which are in the "clinical dataset" they sourced. The clinical question drove the feature requirements; the feature requirements drove the data requirements; and the data requirements weren't specified until too late.

Phase 2: Data Discovery and Acquisition (Weeks 4–16)

This is where most projects lose the most time. Data discovery in healthcare is not a search problem — it's a compliance, legal, and negotiation problem. A dataset that looks right from a clinical content perspective may fail on licensing, schema structure, volume, or access timeline.

The specific failure mode: a team identifies a dataset that meets their clinical requirements, begins the IRB and DUA process, and discovers six to eighteen months later that the data was transferred with restrictions that make it unusable for their specific application. By this point, they've written code that assumes specific data formats and built an entire feature engineering pipeline around a schema they can no longer legally use.

Phase 3: Exploratory Data Analysis and Feature Engineering (Weeks 12–24)

EDA in healthcare has failure modes unique to the domain. The missing data isn't missing at random. The outliers aren't errors — they're real extreme values that exist in clinical populations. The apparent class imbalance in outcome data isn't a data quality problem — it's a reflection of actual event rates. Teams that don't understand these characteristics will make wrong decisions during feature engineering and produce models that fail silently in deployment.

The specific failure: a team discovers that 23% of lab values are missing across their training set. They impute with population means. What they don't realize is that the missingness is correlated with clinical severity — sicker patients in ICU settings had fewer routine labs ordered because ordering priorities shifted to critical monitoring. Mean imputation destroys the missingness signal. When the model is deployed, it performs well on the patients it was trained to handle and systematically misclassifies the sickest patients — exactly the ones that matter most.

Phase 4: Model Training and Validation (Weeks 20–36)

Even when the data is right, train/test splitting in healthcare data has domain-specific failure modes that don't exist in other ML domains. Random splitting creates temporal leakage. Site-level splitting creates distribution shift. Outcome-stratified splitting ignores the temporal structure of clinical prediction. The model that looks great on validation metrics may be substantially less capable than the metrics suggest.

Phase 5: Deployment and Monitoring (Weeks 32+)

The data problems that weren't caught in validation surface here: demographic biases that produce differential performance across patient subgroups, seasonal effects that shift the input distribution, institutional norms that differ between training sites and deployment sites. A model trained primarily on data from a large academic medical center may perform differently when deployed in a rural critical access hospital — even if the clinical outcome is the same.

Why Healthcare Data Is Structurally Different from Every Other ML Domain

Machine learning engineers who've worked in other domains — financial services, e-commerce, computer vision, NLP — often underestimate how structurally different healthcare data is. It's not just that the data is sensitive. It's that the structure of healthcare data violates assumptions that hold in virtually every other ML domain.

Temporal Dependencies Are Clinically Meaningful, Not Statistical Artifacts

In most ML domains, temporal ordering of training examples is a technical consideration — you split chronologically to avoid leakage. In healthcare, the temporal structure of the data carries clinical information that must be preserved to build meaningful features. A patient's creatinine trending upward over three visits tells a different clinical story than three independent creatinine measurements at the same values. The trajectory matters, not just the level.

This means that healthcare models require temporal feature engineering that preserves visit ordering, accounts for irregular time intervals between encounters, and captures trajectories rather than point-in-time values. A dataset that doesn't preserve encounter timestamps — or that provides dates only at the year level, as many de-identified datasets do — makes temporal feature engineering impossible or unreliable.
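A minimal sketch of a trajectory feature follows, assuming each lab result arrives as a (date, value) pair. The function name `trend_per_day` and the sample creatinine values are illustrative, not taken from any particular dataset. It fits a least-squares slope against elapsed days, so irregular gaps between encounters are handled naturally.

```python
from datetime import date

def trend_per_day(observations):
    """Least-squares slope of value against time, in units per day.

    observations: list of (date, value) pairs in any order.
    Returns None when there are fewer than two distinct dates.
    """
    obs = sorted(observations)
    if len(obs) < 2:
        return None
    t0 = obs[0][0]
    xs = [(d - t0).days for d, _ in obs]  # irregular gaps are kept as-is
    ys = [v for _, v in obs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx

# Two hypothetical patients with the same average creatinine (1.5 mg/dL).
rising = [(date(2023, 1, 10), 1.0), (date(2023, 3, 1), 1.5), (date(2023, 3, 20), 2.0)]
flat = [(date(2023, 1, 10), 1.5), (date(2023, 3, 1), 1.5), (date(2023, 3, 20), 1.5)]
```

A point-in-time mean cannot tell these two patients apart; the slope can. With year-level dates only, every `xs` entry collapses to the same value and the function returns None, which is the failure described above.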

ICD-10's Hierarchical Structure Creates Non-Obvious Feature Relationships

ICD-10 codes are organized in a hierarchy: chapters, blocks, categories, and subcategories. E11 is type 2 diabetes mellitus. E11.6 is type 2 diabetes mellitus with other specified complications. E11.65 is type 2 diabetes mellitus with hyperglycemia. The relationship between these codes is hierarchical — a patient with E11.65 also has E11.6 and E11, but not the reverse.

A model that treats ICD-10 codes as independent categorical variables loses all hierarchical information. A patient with E11.65 and a patient with E11.9 (type 2 diabetes without complications) are both "diabetic" at the category level (E11) but clinically very different. Feature engineering should account for the ICD hierarchy, for example by adding rolled-up three-character category features alongside the full codes or by using code embeddings that capture clinical relatedness. Without that, the model will miss signal that experienced clinicians find obvious.
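One simple way to expose the hierarchy to a model is to emit a feature for every level of each code. The sketch below assumes codes are formatted like "E11.65"; the helper names are hypothetical, and it only walks from the three-character category down, not up to blocks or chapters.

```python
def icd10_ancestors(code):
    """Return the category and each progressively longer prefix of a code.

    "E11.65" -> ["E11", "E11.6", "E11.65"]
    """
    compact = code.replace(".", "").upper()
    if len(compact) < 3:
        raise ValueError(f"not an ICD-10 code: {code!r}")
    levels = []
    for end in range(3, len(compact) + 1):
        prefix = compact[:end]
        # Re-insert the dot after the three-character category.
        levels.append(prefix if end == 3 else prefix[:3] + "." + prefix[3:])
    return levels

def hierarchy_features(codes):
    """Sorted, de-duplicated hierarchy levels for all codes a patient carries."""
    return sorted({level for code in codes for level in icd10_ancestors(code)})
```

With this encoding, E11.65 and E11.9 share the E11 feature while keeping their own distinguishing features, so the model can learn both the shared and the specific signal.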

Beyond hierarchy, ICD-10 co-occurrence patterns carry massive clinical signal. Heart failure (I50.x) and chronic kidney disease (N18.x) co-occur at high rates — cardiorenal syndrome is a well-characterized clinical entity. Atrial fibrillation (I48.x) and heart failure co-occur at rates far higher than would be expected by chance. A dataset whose ICD co-occurrence patterns don't reflect real clinical co-morbidity relationships — because it was generated without attention to these relationships — will produce models that miss interactions that any cardiologist takes for granted.
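Co-occurrence can be checked with a lift statistic: the observed joint rate divided by the rate expected if the two diagnoses were independent. This is a rough sketch; the function name and the eight-patient toy cohort are made up for illustration, and a real check would need a reference lift from a real population to compare against.

```python
def cooccurrence_lift(patients, prefix_a, prefix_b):
    """Observed joint rate divided by the rate expected under independence.

    patients: list of per-patient collections of ICD-10 codes.
    Returns None when either code family never appears.
    """
    n = n_a = n_b = n_ab = 0
    for codes in patients:
        has_a = any(c.startswith(prefix_a) for c in codes)
        has_b = any(c.startswith(prefix_b) for c in codes)
        n += 1
        n_a += has_a
        n_b += has_b
        n_ab += has_a and has_b
    if n_a == 0 or n_b == 0:
        return None
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# Toy cohort (fabricated for illustration only).
cohort = [
    {"I50.9", "N18.3"}, {"I50.22", "N18.4"}, {"I50.9"}, {"N18.3"},
    {"J18.9"}, {"J44.1"}, {"E11.9"}, {"I10"},
]
hf_ckd_lift = cooccurrence_lift(cohort, "I50", "N18")
```

A lift close to 1 for pairs such as heart failure and CKD suggests the generator sampled diagnoses independently, which is the pattern to be suspicious of.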

Missingness Is Not at Random — It's Clinically Structured

In most ML datasets, missing values are a data quality issue. A product review is missing because the user didn't provide it. A transaction record is missing because the system was down. These are Missing Completely at Random (MCAR) scenarios, or at worst Missing at Random (MAR) — missing conditional on observed variables.

Healthcare data is dominated by Missing Not at Random (MNAR) patterns. A hemoglobin A1c lab is missing not because the system failed to record it but because the physician didn't order it — perhaps because the patient is acutely ill and A1c is not the clinical priority, or because the patient is known to be well-controlled and annual monitoring is deemed sufficient, or because the patient doesn't have diabetes and A1c was never relevant. The missingness pattern itself is clinically informative.

A model that imputes missing lab values without accounting for why they're missing will systematically destroy clinically relevant information. The right approach is to treat the missingness indicator as a feature — to let the model learn that "this lab was not ordered" carries signal independent of what the lab value would have been. This requires a dataset that preserves the absence of records rather than simply encoding them as null values.
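The indicator approach can be sketched in a few lines. The row layout (a list of dicts with one key per lab) and the `_measured` suffix are assumptions made for this example; the point is that the flag is created before any imputation touches the values.

```python
def add_missingness_indicators(rows, lab_names):
    """Return new rows with a `<lab>_measured` flag next to each lab value.

    rows: list of dicts; a lab that was never ordered is absent or None.
    The value is left as None so that imputation stays a separate,
    explicit step that cannot erase the indicator.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        for lab in lab_names:
            value = row.get(lab)
            new_row[lab] = value
            new_row[f"{lab}_measured"] = 0 if value is None else 1
        out.append(new_row)
    return out

# Hypothetical rows: p2 has an explicit null, p3 has no A1c record at all.
rows = [
    {"patient_id": "p1", "hba1c": 7.2},
    {"patient_id": "p2", "hba1c": None},
    {"patient_id": "p3"},
]
flagged = add_missingness_indicators(rows, ["hba1c"])
```

Whatever imputation follows, the model still sees which patients never had the lab ordered.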

Comorbidity Correlations Create Multi-Dimensional Target Leakage Risk

In healthcare, the presence of certain diagnoses is strongly predictive of others. A patient with heart failure is significantly more likely to also have hypertension, CKD, and type 2 diabetes. A patient with COPD is significantly more likely to also have pulmonary hypertension. A patient with cirrhosis is significantly more likely to have hepatic encephalopathy.

This creates a subtle but serious risk in model development: when your training dataset doesn't reflect realistic comorbidity correlations, your model learns the wrong feature relationships. A model trained on a dataset where heart failure patients have lower-than-realistic rates of CKD will learn that creatinine elevation is less predictive of adverse outcomes in heart failure patients than it actually is. When that model encounters real patients — where the comorbidity correlation is present — it miscalibrates on exactly the patients at highest risk.

The ICD-10 Distribution Problem: Why Your Dataset Needs the Right Tail

Most healthcare ML teams focus on whether their dataset has the diagnoses they need. The more important question is whether the distribution of those diagnoses reflects the population they're trying to model.

Consider the ICD-10 distribution in a real inpatient population. The most common primary diagnoses are dominated by a handful of conditions: heart failure (I50.x), sepsis (A41.x), pneumonia (J18.x), COPD exacerbation (J44.x), atrial fibrillation (I48.x), acute myocardial infarction (I21.x). These conditions make up a substantial fraction of all hospitalizations. But the tail of the distribution — the rarer diagnoses that appear in lower frequencies — is critically important for several reasons.

First, rare diagnoses are often the ones associated with the highest clinical stakes. A patient admitted with acute liver failure (K72.0x) is at much higher risk than a patient admitted with simple pneumonia. If your readmission prediction model has very few training examples of patients with rare high-acuity diagnoses, it will produce poorly calibrated predictions for exactly the patients who need the most accurate prediction.

Second, rare diagnosis codes often serve as the differentiating features that separate high-risk patients from medium-risk patients who have similar common comorbidities. A patient with heart failure, CKD, and diabetes is not unusual. A patient with heart failure, CKD, diabetes, AND hereditary hemorrhagic telangiectasia (I78.0) is unusual — and that rare additional diagnosis may significantly change their clinical trajectory and their risk of adverse events.

Third, the relationship between common and rare diagnoses in your training data determines whether your model can generalize. A dataset that perfectly models the most common ICD-10 codes but has unrealistic frequency or co-occurrence patterns for less common codes will produce a model that performs well on the most common patients and poorly on the rare, complex patients who are disproportionately represented in adverse outcome events.

The right question isn't "does my dataset have ICD-10 codes?" It's "does my dataset have an ICD-10 distribution that reflects the population I'm modeling, including realistic frequency of rare diagnoses and realistic co-occurrence patterns between common and rare codes?" A dataset where every patient has one clean primary diagnosis, with no realistic comorbidity complexity, will produce a model that fails in the real world.

When evaluating a dataset, ask for the ICD-10 code frequency distribution and compare it to the inpatient utilization data that CMS publishes, which includes annual diagnosis frequency statistics for Medicare and Medicaid populations. A dataset whose top-20 diagnoses match the CMS distribution reasonably well is a promising sign. A dataset where the distribution looks implausibly uniform, with uncommon codes appearing at the same frequency as common ones, is a red flag indicating that the data was generated without attention to epidemiological realism.
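Both checks can be approximated with two small functions, sketched here under the assumption that code frequencies are available as a dict of counts. The function names are hypothetical, and the reference counts must come from a real published source; none are supplied here.

```python
import math
from collections import Counter

def normalized_entropy(code_counts):
    """Shannon entropy divided by its maximum; 1.0 means perfectly uniform."""
    total = sum(code_counts.values())
    probs = [c / total for c in code_counts.values() if c > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

def top_k_overlap(candidate_counts, reference_counts, k=20):
    """Share of the reference top-k codes that are also in the candidate top-k."""
    def top(counts):
        return {code for code, _ in Counter(counts).most_common(k)}
    ref_top = top(reference_counts)
    if not ref_top:
        return 0.0
    return len(top(candidate_counts) & ref_top) / len(ref_top)
```

A normalized entropy near 1.0 is the "implausibly uniform" signature, while a low top-k overlap against the reference means the head of the distribution is wrong. Neither number is a pass/fail threshold on its own.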

The Commercial Licensing Trap in "Free" Healthcare Datasets

The most expensive healthcare datasets in ML are often the free ones. Not because of what they cost to acquire, but because of what they cost in time and legal exposure when teams discover their licensing restrictions after building products on top of them.

PhysioNet and MIMIC-III

MIMIC-III (Medical Information Mart for Intensive Care) is one of the most widely used clinical datasets in healthcare ML research. It contains de-identified data from over 40,000 ICU admissions at Beth Israel Deaconess Medical Center, covering laboratory results, medications, vital signs, clinical notes, imaging reports, and billing codes. It is genuinely excellent data.

It is also governed by the PhysioNet Credentialed Health Data Use Agreement, which limits use of the data to scientific research. Most commercial legal teams read that restriction as excluding any revenue-generating product that was trained on or validated using MIMIC data, so confirm the current agreement text with counsel before relying on it.

This matters because MIMIC-III has been used in dozens of published academic papers that describe commercial ML approaches. The gap between "published academic research" and "commercially deployed product" is exactly where many companies find themselves when they try to productionize a model that was developed in an academic context. The research team used MIMIC legally. The commercial team cannot.

Kaggle Healthcare Datasets

Kaggle hosts hundreds of healthcare datasets. A significant fraction of them are licensed under Creative Commons licenses, most commonly CC BY-NC 4.0 (Attribution-NonCommercial) or CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike). The Non-Commercial clause in these licenses is explicit: the data may not be used in ways "primarily intended for or directed towards commercial advantage or monetary compensation."

The application of this clause to ML training is legally uncertain but practically risky. If you train a commercial ML model on CC BY-NC data, you are potentially violating the license regardless of whether the training data appears directly in the product outputs. The safest interpretation, and the one most commercial legal teams will advise, is that training a commercial model on NC-licensed data constitutes commercial use of that data.

Many ML engineers treat Kaggle licensing as a formality because enforcement is rare and the community norm is permissive. In healthcare, this is a dangerous mindset. Healthcare ML products operate in a regulated industry where trust is paramount. A licensing controversy — even one that isn't litigated — can damage customer relationships, create regulatory complications, and undermine the credibility of a company trying to operate in a trust-sensitive domain.

The CMS Public Data Trap

Federal agencies publish extensive public datasets: CMS Medicare claims data (accessed through the Research Data Assistance Center, ResDAC), AHRQ's HCUP databases, and the CDC's National Health Interview Survey. These are genuinely public data, not restricted by Non-Commercial clauses. But they carry their own access challenges: the Medicare Limited Data Set requires a Data Use Agreement with CMS that takes months to execute, and HCUP requires its own data use agreement, with some state databases available only from the state data organization directly. The timeline for legitimate access to these datasets rivals the IRB process.

The practical implication: if your project timeline requires training data within weeks rather than months, none of these public options will serve you. The access timelines are incompatible with commercial development cadences.

What Synthea Actually Generates vs. What ML Models Need

Synthea is the open-source synthetic patient generator maintained by The MITRE Corporation. It is genuinely excellent software — well-maintained, FHIR-native, Apache-licensed, and capable of generating large volumes of synthetic patient data quickly. For learning FHIR resource structure, testing FHIR parsers, and building basic data pipelines, it's an excellent tool.

For building production-grade ML models, particularly those involving billing, revenue cycle, or utilization management use cases, Synthea has significant gaps that are rarely documented in the marketing materials or the tutorials that most developers first encounter.

CPT Codes Are Absent

Current Procedural Terminology (CPT) codes are the standard for procedure billing in the United States. Every physician visit, every surgical procedure, every diagnostic test generates CPT codes that drive reimbursement. CPT codes are also critical features for ML models in revenue cycle, prior authorization, utilization management, and cost prediction applications.

Synthea does not generate CPT codes. This is not an oversight — CPT codes are proprietary AMA intellectual property, and including them in an open-source software product creates licensing complications. The result is that any model that needs CPT codes as features — which includes the majority of payer-side ML applications and a significant fraction of provider-side applications — cannot be built on Synthea data alone.

E&M Coding Levels Are Not Modeled

Evaluation and Management (E&M) codes define the complexity level of physician office visits, from straightforward (99202, 99211) to highly complex (99205, 99215). E&M codes drive physician billing, but they also capture clinical complexity in a way that structured diagnosis codes alone don't. A model predicting resource utilization or hospitalization risk benefits significantly from E&M level as a feature.

Synthea doesn't generate realistic E&M code distributions because modeling realistic E&M coding requires understanding the documentation complexity rules, the medical decision-making scoring criteria, and the clinical context that drives coding decisions. This is domain knowledge that a general-purpose synthetic patient generator doesn't incorporate.

Financial and Revenue Cycle Data Is Minimal

Real healthcare datasets contain rich financial data: claim amounts, allowed amounts, patient responsibility, payment amounts, denial codes, adjustment codes, remittance data. This financial layer is essential for any model operating in the revenue cycle domain — predicting claim denials, identifying coding errors, forecasting collections, modeling prior authorization outcomes.

Synthea's financial data modeling is limited. The coverage and claim data Synthea generates provides a basic structure but lacks the richness of real-world billing complexity: modifier codes that change reimbursement rates, bundling rules that affect claim adjudication, payer-specific contract terms that produce different allowed amounts for the same procedure at the same facility.

Comorbidity Co-occurrence Patterns Are Simplified

Synthea generates patients by running clinical modules that model the progression of individual conditions. The interaction between conditions — how the presence of one condition affects the probability and trajectory of others — is modeled at a high level but doesn't capture the full complexity of real clinical co-morbidity patterns. A Synthea-generated diabetic patient with heart failure may not have the same distribution of additional comorbidities as a real diabetic heart failure patient in a Medicare population.

For models where comorbidity interactions are important features — which includes most risk stratification, readmission prediction, and hospitalization prediction models — this gap matters. The model may train successfully on Synthea data and then encounter systematic calibration errors when deployed against real patient populations with more complex comorbidity patterns.

Schema Integrity: The Database Architecture Your Model Actually Needs

Most healthcare datasets that ML engineers first encounter are flat files — one row per encounter, or one row per patient. This format is convenient for initial exploration but structurally inadequate for most real ML use cases. Building longitudinal features, constructing patient trajectories, and engineering comorbidity indexes all require a properly normalized relational schema with enforced foreign key relationships.

The Required Table Structure

A minimum-viable relational schema for healthcare ML includes linked tables for patients, encounters, diagnoses, procedures, medications, labs, vitals, and claims.

Foreign Key Integrity and Encounter-to-Claim Linkage

Having these tables is necessary but not sufficient. The foreign key relationships between them must be enforced and consistent. A diagnosis record that references an encounter_id that doesn't exist in the encounters table is not just bad data quality — it will silently corrupt any feature engineering that attempts to join across tables.

Encounter-to-claim linkage is particularly important for payer and revenue cycle ML applications. The clinical encounter generates the claim. The claim may be adjudicated differently by different payers. The linkage between clinical encounter and financial claim is the foundation for any model that predicts revenue cycle outcomes. If your dataset can't link encounter records to claim records with consistent, enforced foreign keys, you can't build this class of model.
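An orphan check is cheap to run before any feature engineering. The sketch below uses Python's built-in sqlite3 module with a deliberately tiny, hypothetical schema (the table and column names are assumptions, not a standard). Foreign key enforcement is switched off explicitly to mimic a dataset delivered without enforced constraints.

```python
import sqlite3

SCHEMA = """
CREATE TABLE patients   (patient_id TEXT PRIMARY KEY);
CREATE TABLE encounters (encounter_id TEXT PRIMARY KEY,
                         patient_id TEXT NOT NULL REFERENCES patients(patient_id));
CREATE TABLE diagnoses  (diagnosis_id INTEGER PRIMARY KEY,
                         encounter_id TEXT NOT NULL REFERENCES encounters(encounter_id),
                         icd10_code TEXT NOT NULL);
CREATE TABLE claims     (claim_id TEXT PRIMARY KEY,
                         encounter_id TEXT NOT NULL REFERENCES encounters(encounter_id));
"""

def orphan_count(conn, child, parent, key):
    """Count child rows whose key has no matching parent row.

    Table and column names are interpolated directly, so pass only
    trusted identifiers.
    """
    sql = (f"SELECT COUNT(*) FROM {child} c LEFT JOIN {parent} p "
           f"ON c.{key} = p.{key} WHERE p.{key} IS NULL")
    return conn.execute(sql).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = OFF")  # constraints declared but not enforced
conn.executescript(SCHEMA)
conn.execute("INSERT INTO patients VALUES ('p1')")
conn.execute("INSERT INTO encounters VALUES ('e1', 'p1')")
conn.execute("INSERT INTO diagnoses (encounter_id, icd10_code) VALUES ('e1', 'I50.9')")
# This row points at an encounter that does not exist and is accepted silently.
conn.execute("INSERT INTO diagnoses (encounter_id, icd10_code) VALUES ('e404', 'N18.3')")
conn.execute("INSERT INTO claims VALUES ('c1', 'e1')")

orphan_diagnoses = orphan_count(conn, "diagnoses", "encounters", "encounter_id")
orphan_claims = orphan_count(conn, "claims", "encounters", "encounter_id")
```

An inner join from diagnoses to encounters would drop the orphan row without any warning, which is why the count is worth computing explicitly for every parent-child pair.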

Temporal Sequencing of Records

All records that have a clinical timestamp — labs, vitals, medications, procedures — must have timestamps that are internally consistent and correctly sequenced. Lab values timestamped after the encounter discharge don't make clinical sense. Medication administration timestamped before the order doesn't make clinical sense. These temporal inconsistencies, which are common in poorly generated synthetic datasets, corrupt temporal feature engineering in ways that are difficult to detect and may only manifest as subtle model performance degradation rather than explicit errors.
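The two inconsistencies named above can be screened for directly. This is a minimal sketch that assumes records are dicts with the field names shown; both the field names and the sample values are hypothetical.

```python
from datetime import datetime

def temporal_violations(encounter, labs, med_events):
    """Return descriptions of records that are out of clinical sequence.

    encounter: dict with a "discharge" datetime.
    labs: list of dicts with "name" and "collected".
    med_events: list of dicts with "drug", "ordered", "administered".
    """
    problems = []
    for lab in labs:
        if lab["collected"] > encounter["discharge"]:
            problems.append(f"lab {lab['name']} collected after discharge")
    for med in med_events:
        if med["administered"] < med["ordered"]:
            problems.append(f"{med['drug']} administered before it was ordered")
    return problems

encounter = {"admit": datetime(2023, 5, 1, 8, 0), "discharge": datetime(2023, 5, 4, 14, 0)}
labs = [
    {"name": "creatinine", "collected": datetime(2023, 5, 2, 6, 0)},
    {"name": "lactate", "collected": datetime(2023, 5, 5, 9, 0)},
]
meds = [
    {"drug": "furosemide", "ordered": datetime(2023, 5, 1, 10, 0),
     "administered": datetime(2023, 5, 1, 9, 30)},
]
found = temporal_violations(encounter, labs, meds)
```

Reporting the violation rate per table, rather than silently dropping the rows, makes it easier to judge whether a dataset is usable at all.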

Feature Engineering Challenges Unique to Healthcare

Comorbidity Index Calculation

The Charlson Comorbidity Index (CCI) and the Elixhauser Comorbidity Index are the two most commonly used comorbidity scoring systems in healthcare ML. Both map ICD diagnosis codes to comorbidity categories that capture the overall burden of chronic disease in a patient. In its commonly used 17-category form, CCI assigns each category a weight between 1 and 6; the summed score predicts mortality risk. Elixhauser covers 31 comorbidity categories and is generally considered more sensitive for predicting hospitalization outcomes.

Calculating these indexes from ICD-10 codes requires a complete mapping from ICD-10 to Elixhauser/Charlson categories, a dataset with sufficient secondary diagnoses to capture all relevant comorbidities, and a schema that lets you aggregate across all encounters in a patient's history, not just the index admission. A dataset where each encounter only carries a primary diagnosis — or where secondary diagnoses are consistently missing — cannot support reliable comorbidity index calculation.
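The aggregation requirement is easiest to see in code. The sketch below scores a patient from a deliberately incomplete, illustrative subset of Charlson categories; it is not a validated ICD-10 mapping and should not be used for real scoring. The names `CHARLSON_SUBSET` and `charlson_score` are hypothetical.

```python
# Illustrative subset only: (category, weight, ICD-10 code prefixes).
# A production mapping covers all categories and far more codes.
CHARLSON_SUBSET = [
    ("myocardial_infarction", 1, ("I21", "I22")),
    ("congestive_heart_failure", 1, ("I50",)),
    ("chronic_pulmonary_disease", 1, ("J44",)),
    ("renal_disease", 2, ("N18",)),
    ("metastatic_solid_tumor", 6, ("C77", "C78", "C79", "C80")),
]

def charlson_score(encounters):
    """Sum category weights over every diagnosis in a patient's history.

    encounters: list of lists of ICD-10 codes, one inner list per encounter.
    Each category counts once, however many times it is coded.
    """
    all_codes = {code for codes in encounters for code in codes}
    score = 0
    for _category, weight, prefixes in CHARLSON_SUBSET:
        if any(code.startswith(prefixes) for code in all_codes):
            score += weight
    return score

# Hypothetical patient: the index admission is a pneumonia with no
# comorbidities coded, but earlier encounters carry heart failure and CKD.
history = [["I50.9", "E11.9"], ["N18.3", "I50.9"], ["J18.9"]]
index_only = [["J18.9"]]
```

Scoring only the index admission gives 0 for this patient, while the full history gives 3, which is the gap that a primary-diagnosis-only dataset hides.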

HCC Risk Scores

Hierarchical Condition Categories (HCCs) are CMS's risk adjustment model for Medicare Advantage and other value-based payment programs. HCC risk scores predict expected healthcare costs for a patient population, adjusting for the demographic and clinical complexity of covered beneficiaries. For any ML application in the payer or value-based care space, HCC coding is a critical feature.

HCC categories are mapped from ICD-10 codes but have their own hierarchy: within a group of related HCCs, only the most severe category a patient qualifies for contributes to the risk score, so the hierarchy has to be applied before scores are summed. Building HCC features requires a dataset with realistic ICD-10 distributions at realistic encounter frequencies for each patient. A dataset where patients only have two or three ICD codes per year will produce artificially low HCC scores relative to real Medicare beneficiaries, who average significantly more coded diagnoses per year.

Readmission Prediction Features

The canonical CMS all-cause 30-day readmission metric — used in the Hospital Readmissions Reduction Program since 2012 — has generated an enormous body of literature on predictive features. The best-performing readmission models use features that span multiple data domains: demographics (age, sex, insurance type), clinical complexity (Charlson score, primary diagnosis category, number of chronic conditions), utilization history (number of prior admissions in past 12 months, prior ED visits), care transitions (discharge disposition, discharge to SNF vs. home, follow-up appointment scheduled), and medication count (polypharmacy defined as 5+ chronic medications).

A dataset that supports readmission model development needs longitudinal history — not just the index admission, but the prior encounter history that contributes to utilization features. It needs discharge disposition codes (patient discharged to home, to SNF, to LTAC, to hospice, against medical advice). It needs medication lists comprehensive enough to calculate polypharmacy indicators. And it needs the 30-day follow-up window — ideally including data on whether the patient was readmitted — to construct the training labels.
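Label construction is where the follow-up window requirement becomes concrete. The sketch below is a simplified all-cause label under stated assumptions: it ignores planned readmissions and transfers, treats a same-day admission as not a readmission, and uses the latest admission date as a crude end-of-data marker. The field names are hypothetical.

```python
from datetime import date, timedelta

def label_readmissions(admissions, window_days=30):
    """Return one label per admission, in input order.

    admissions: list of dicts with "patient_id", "admit", "discharge" (dates).
    1 = readmitted within the window, 0 = not readmitted and the window is
    fully observed, None = window runs past the end of the data.
    """
    end_of_data = max(a["admit"] for a in admissions)
    by_patient = {}
    for idx, adm in enumerate(admissions):
        by_patient.setdefault(adm["patient_id"], []).append((adm["admit"], idx))
    labels = [None] * len(admissions)
    for stays in by_patient.values():
        stays.sort()  # chronological order within each patient
        for pos, (_, idx) in enumerate(stays):
            discharge = admissions[idx]["discharge"]
            deadline = discharge + timedelta(days=window_days)
            later_admits = [admit for admit, _ in stays[pos + 1:]]
            if any(discharge < admit <= deadline for admit in later_admits):
                labels[idx] = 1
            elif deadline <= end_of_data:
                labels[idx] = 0
            # otherwise the outcome is censored and stays None
    return labels

admissions = [
    {"patient_id": "p1", "admit": date(2023, 1, 1), "discharge": date(2023, 1, 5)},
    {"patient_id": "p1", "admit": date(2023, 1, 20), "discharge": date(2023, 1, 25)},
    {"patient_id": "p2", "admit": date(2023, 2, 1), "discharge": date(2023, 2, 3)},
    {"patient_id": "p2", "admit": date(2023, 6, 1), "discharge": date(2023, 6, 4)},
]
labels = label_readmissions(admissions)
```

Keeping censored admissions as None, rather than labelling them 0, avoids teaching the model that recent discharges never bounce back.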

The Train/Test Split Problem in Healthcare: Temporal Leakage and Site-Specific Overfitting

Random train/test splitting is the default in most ML frameworks. In healthcare, it's almost always wrong. Random splitting at the encounter level puts records from the same patient in both the training and test sets. A model can effectively "memorize" a patient's history from their training encounters and then produce inflated performance metrics when it encounters that patient's test encounters. This is patient-level leakage, and it becomes temporal leakage whenever a patient's later encounters land in training while their earlier ones land in test: the model uses future information to predict outcomes it should be predicting blindly.

The correct approach for most healthcare ML use cases is temporal splitting: all encounters before a cutoff date go to training, all encounters after the cutoff go to test. This simulates the deployment scenario where the model was trained on historical data and is being evaluated on future patients it hasn't seen. Temporal splitting often produces substantially lower test performance than random splitting — which is exactly the point. It reveals actual expected performance, not inflated performance from leakage.
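A cutoff date alone still lets one patient appear on both sides if they have encounters before and after it. The sketch below combines the temporal cutoff with a patient-level guard; it is one conservative design among several, and the function and field names are hypothetical.

```python
from datetime import date

def temporal_patient_split(encounters, cutoff):
    """Split encounters at `cutoff`, keeping each patient on one side only.

    encounters: list of dicts with "patient_id" and "admit" (date).
    Patients first seen before the cutoff go to train, and only their
    pre-cutoff encounters are kept. Patients first seen on or after the
    cutoff go to test.
    """
    first_seen = {}
    for enc in encounters:
        pid = enc["patient_id"]
        if pid not in first_seen or enc["admit"] < first_seen[pid]:
            first_seen[pid] = enc["admit"]
    train, test = [], []
    for enc in encounters:
        if first_seen[enc["patient_id"]] < cutoff:
            if enc["admit"] < cutoff:
                train.append(enc)
            # post-cutoff encounters of training patients are dropped
        else:
            test.append(enc)
    return train, test

encounters = [
    {"patient_id": "p1", "admit": date(2022, 3, 1)},
    {"patient_id": "p1", "admit": date(2023, 2, 1)},
    {"patient_id": "p2", "admit": date(2023, 3, 1)},
    {"patient_id": "p3", "admit": date(2022, 11, 1)},
]
train, test = temporal_patient_split(encounters, date(2023, 1, 1))
```

Dropping returning patients from the test set measures performance on genuinely unseen patients. If the deployed model will mostly score returning patients, keeping their post-cutoff encounters in test may be the more representative choice.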

Site-specific overfitting is a second distinct problem. A model trained on data from a single healthcare system — even a large one — may learn institutional patterns that don't generalize. A specific hospital's coding practices, order set preferences, discharge timing conventions, or patient population characteristics may all be reflected in the model's learned weights. When the model is deployed at a different institution with different norms, it encounters distribution shift that degrades performance in ways that the validation metrics didn't predict.

Training on data from multiple sites — or at minimum, validating on held-out data from a different site than the training data — is the appropriate safeguard. A synthetic dataset that models multiple healthcare system types (academic medical center, community hospital, rural critical access hospital, federally qualified health center) is substantially more valuable for building generalizable models than one that models a single institution type.
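Holding out one site at a time is straightforward to set up. This sketch assumes every record carries a site identifier under a key named `site_id`, which is a hypothetical field name.

```python
def leave_one_site_out(records, site_key="site_id"):
    """Yield (held_out_site, train_records, test_records) for every site."""
    sites = sorted({r[site_key] for r in records})
    for site in sites:
        train = [r for r in records if r[site_key] != site]
        test = [r for r in records if r[site_key] == site]
        yield site, train, test

# Hypothetical records from three facility types.
records = [
    {"site_id": "academic", "encounter_id": "e1"},
    {"site_id": "academic", "encounter_id": "e2"},
    {"site_id": "community", "encounter_id": "e3"},
    {"site_id": "rural_cah", "encounter_id": "e4"},
]
folds = list(leave_one_site_out(records))
```

The spread of the metric across folds is often more informative than its average: a model that does well on two sites and poorly on the third has learned something site-specific.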

Specific Model Types and Their Data Requirements

30-Day Readmission Models

Required data elements: longitudinal encounter history (minimum 12 months prior to index admission), primary and secondary ICD-10 diagnoses with POA flags, discharge disposition codes, medication count at discharge, lab values at admission and discharge, prior ED utilization, insurance type, demographic features. Required volume: a minimum of 5,000 readmission events in the training set to support reliable gradient boosting models; 10,000+ for deep learning approaches. Required label: 30-day all-cause readmission indicator, which requires linking back to encounter records in the 30 days post-discharge.

Length of Stay Prediction

Required data elements: admission source (ED, direct, transfer), admission diagnosis (primary ICD), acuity at presentation (vital sign abnormalities, initial lab values), age, insurance type, historical LOS for similar cases. Target variable: continuous (days) or categorical (0-3 days, 4-7 days, 8-14 days, 15+). Distribution considerations: LOS distributions are heavily right-skewed; a dataset with a realistic LOS distribution will have a substantial proportion of 1-2 day stays and a meaningful tail of very long stays (30+, 60+, 90+ days). A dataset where LOS is uniformly distributed or normally distributed doesn't reflect clinical reality and will produce miscalibrated LOS models.
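The shape of a length-of-stay distribution can be sanity-checked before any modeling. The sketch below computes the mean, median, and population skewness using only the standard library; the sample values are invented for illustration.

```python
import statistics

def los_shape_summary(los_days):
    """Mean, median, and population skewness of a length-of-stay sample."""
    n = len(los_days)
    mean = statistics.fmean(los_days)
    median = statistics.median(los_days)
    sd = statistics.pstdev(los_days)
    skew = 0.0 if sd == 0 else sum((x - mean) ** 3 for x in los_days) / (n * sd ** 3)
    return {"mean": mean, "median": median, "skewness": skew}

# Made-up samples: one right-skewed with a long tail, one symmetric.
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12, 35, 90]
symmetric = [1, 2, 3, 4, 5, 6, 7]
skewed_summary = los_shape_summary(right_skewed)
symmetric_summary = los_shape_summary(symmetric)
```

A mean well above the median together with clearly positive skewness is the expected pattern. Skewness near zero on an inpatient LOS column suggests the values were drawn from a symmetric or uniform distribution.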

Sepsis Early Warning Models

Required data elements: timestamped vital signs (temperature, heart rate, respiratory rate, mean arterial pressure) at sub-hourly resolution during the hospital stay, Sequential Organ Failure Assessment (SOFA) score components (creatinine, bilirubin, platelet count, PaO2/FiO2 ratio, GCS score, vasopressor requirements), blood culture orders and results, lactate values, antibiotic administration timestamps. The Sepsis-3 definition requires these elements explicitly.

Required volume: sepsis events are common enough in ICU datasets but rare in outpatient or general inpatient datasets. A general inpatient dataset may have 2-5% sepsis prevalence; an ICU-focused dataset will have higher rates. The model needs sufficient sepsis cases to train reliably — a dataset with fewer than 1,000 sepsis events in the training set will produce models with poorly calibrated sensitivity/specificity tradeoffs.
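The volume requirement translates into a simple back-of-the-envelope calculation. The sketch below estimates how many admissions a dataset needs so that the expected number of training-set sepsis events reaches the 1,000-event floor; it uses the 2-5% prevalence range quoted above, and the 80% training fraction is an assumption for illustration.

```python
import math
from fractions import Fraction


def admissions_needed(target_events, prevalence, train_fraction="1"):
    """Admissions required for the expected training-set event count to reach
    target_events. Rates are passed as decimal strings to avoid float rounding."""
    rate = Fraction(str(prevalence)) * Fraction(str(train_fraction))
    return math.ceil(target_events / rate)


low_prevalence = admissions_needed(1000, "0.02")        # 2% prevalence
high_prevalence = admissions_needed(1000, "0.05")       # 5% prevalence
with_split = admissions_needed(1000, "0.02", "0.8")     # 2% prevalence, 80% train split
```

These are expected counts only; the realized number of events in any given split will vary around them.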

Prior Authorization Prediction Models

Required data elements: CPT procedure codes (which Synthea doesn't generate), diagnosis codes supporting medical necessity, insurance type and specific payer, historical approval/denial rates for the same procedure/diagnosis combination, provider specialty, facility type. Prior auth models are fundamentally payer-side models — they need the financial data layer that clinical-only datasets don't provide. A dataset without CPT codes and claim-level data cannot support prior authorization prediction modeling.
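Before evaluating a dataset in depth, it can help to diff its columns against the fields a model type needs. The sketch below does that for prior authorization; the column names are hypothetical stand-ins for the data elements listed above, not a real schema.

```python
# Hypothetical column names standing in for the required data elements.
REQUIRED_FIELDS = {
    "prior_authorization": {
        "cpt_code",
        "icd10_code",
        "insurance_type",
        "payer_id",
        "auth_decision",
        "provider_specialty",
        "facility_type",
    },
}


def missing_fields(model_type, available_columns):
    """Return the required fields that the dataset does not provide, sorted."""
    return sorted(REQUIRED_FIELDS[model_type] - set(available_columns))


# A clinical-only dataset with no procedure or claim layer.
clinical_only = ["patient_id", "icd10_code", "insurance_type", "facility_type"]
gaps = missing_fields("prior_authorization", clinical_only)
```

A non-empty result for a field like `cpt_code` is a hard stop for this model type, regardless of how good the rest of the dataset looks.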

Evaluating Synthetic Healthcare Data: A 15-Point Quality Checklist

Not all synthetic healthcare datasets are created equal. Before committing to a synthetic dataset for ML development, evaluate it against these criteria:

Synthetic Healthcare Dataset Quality Checklist

  1. Commercial license with no NC restriction. Confirm the license explicitly permits commercial use, including training commercial ML models and deploying those models in commercial products. Get this in writing before evaluating content.
  2. ICD-10 frequency distribution matches real population data. Request a code frequency report and compare the top-50 codes to CMS's inpatient diagnosis frequency statistics. Major divergences indicate unrealistic generation.
  3. Realistic comorbidity co-occurrence patterns. Spot-check specific pairs: heart failure + CKD, diabetes + peripheral neuropathy, COPD + pulmonary hypertension. If the co-occurrence rates don't reflect known clinical epidemiology, the data won't support comorbidity-dependent feature engineering.
  4. CPT codes present and with realistic distributions. Confirm CPT codes are included and that the distribution of code families (E&M vs. surgical vs. diagnostic) reflects realistic care patterns for the indicated population.
  5. Relational schema with enforced foreign keys. Test a sample: select 100 diagnosis records and verify that every encounter_id references a valid encounter. Do the same for all foreign key relationships. Referential integrity failures indicate schema quality problems.
  6. Temporal consistency within patient records. Check that lab collection timestamps precede discharge timestamps, that medication order timestamps precede administration timestamps, and that encounters don't overlap. Temporal inconsistencies corrupt longitudinal feature engineering.
  7. Lab value distributions calibrated to real population statistics. Request summary statistics (mean, standard deviation, percentile distribution) for common labs: creatinine, HbA1c, hemoglobin, sodium, potassium. Compare to published population reference ranges and known disease-stratified distributions.
  8. Realistic missing data patterns. Inspect missing rates by variable and by patient subgroup. Missing rates that are uniform across all variables or identical across patient subgroups suggest the data was generated with random missingness rather than clinically structured missingness.
  9. Sufficient rare outcome volume. For your specific model target, count the training events. If you're building a readmission model, how many 30-day readmissions are in the dataset? If fewer than 5,000, you'll need a larger dataset or upsampling strategies that may introduce their own problems.
  10. Demographic diversity reflecting the target population. Check distribution of age, sex, race/ethnicity, insurance type, and geographic region. A dataset skewed toward one demographic group will produce models that underperform for underrepresented groups in deployment.
  11. Multiple healthcare setting types. Does the dataset include data from inpatient, outpatient, ED, and specialty settings? A dataset that only models inpatient encounters can't support longitudinal models that track patients across care settings.
  12. Financial/claim data with realistic denial rates. If your model needs claim-level data, confirm that denial rates, claim adjustment codes, and payment amounts are present and have realistic distributions. A dataset where 100% of claims are paid at full billed charges doesn't reflect the reality of claim adjudication.
  13. Multiple export formats available. Confirm that the dataset is available in formats compatible with your entire team's workflow: FHIR R4 JSON for integration engineers, Parquet for data engineers, CSV for analysts, HL7 v2 for teams working with legacy integration environments.
  14. Data dictionary and schema documentation. Every column should have a documented definition, including the coding system it uses, valid value ranges, and how missing values are encoded. Undocumented fields create ambiguity that propagates errors through feature engineering.
  15. Vendor support and SLA for updates. ICD-10 codes are updated annually. CPT codes are updated annually. A synthetic dataset vendor should commit to maintaining currency with coding updates and have a process for notifying customers of schema changes.
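Two of the checklist items, referential integrity and temporal consistency, lend themselves to quick automated checks. The sketch below is one possible shape for them, assuming hypothetical table rows held as dictionaries with `encounter_id`, `patient_id`, `admit`, and `discharge` fields. The overlap check compares consecutive stays per patient, which is enough to detect that an overlap exists but does not enumerate every overlapping pair.

```python
from datetime import datetime


def orphaned_rows(child_rows, parent_rows, key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_keys = {p[key] for p in parent_rows}
    return [c for c in child_rows if c[key] not in parent_keys]


def overlapping_encounters(encounters):
    """Return (earlier_id, later_id) pairs where a patient's next stay
    begins before the previous stay's discharge."""
    by_patient = {}
    for e in encounters:
        by_patient.setdefault(e["patient_id"], []).append(e)
    overlaps = []
    for stays in by_patient.values():
        stays.sort(key=lambda e: e["admit"])
        for prev, nxt in zip(stays, stays[1:]):
            if nxt["admit"] < prev["discharge"]:
                overlaps.append((prev["encounter_id"], nxt["encounter_id"]))
    return overlaps


# Hypothetical toy tables.
encounters = [
    {"encounter_id": "E1", "patient_id": "P1",
     "admit": datetime(2023, 1, 1, 8), "discharge": datetime(2023, 1, 5, 12)},
    {"encounter_id": "E2", "patient_id": "P1",
     "admit": datetime(2023, 1, 4, 9), "discharge": datetime(2023, 1, 6, 10)},
    {"encounter_id": "E3", "patient_id": "P2",
     "admit": datetime(2023, 2, 1, 7), "discharge": datetime(2023, 2, 2, 15)},
]
diagnoses = [
    {"diagnosis_id": "D1", "encounter_id": "E1"},
    {"diagnosis_id": "D2", "encounter_id": "E9"},  # references a missing encounter
]

orphans = orphaned_rows(diagnoses, encounters, "encounter_id")
overlaps = overlapping_encounters(encounters)
```

Any non-empty result from either check on a vendor sample is worth raising before purchase rather than discovering during feature engineering.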

Evaluation Strategies: How to Know If Synthetic Data Is Good Enough for Your Use Case

The technical question of whether a synthetic dataset is adequate for your ML use case can be addressed systematically, rather than through intuition or hopeful assumption. Here are the evaluation strategies that experienced healthcare ML teams use.

Train on Synthetic, Validate on Real

The most rigorous evaluation is also the most practical: train your model on synthetic data, then validate it on a small held-out sample of real data. The gap between the model's performance on held-out synthetic data and its performance on the real sample tells you how well the synthetic data captured the signal structure that matters for your use case. A small gap (less than 0.05 AUC, i.e. 5 points on a 0-100 scale, for most classification problems) suggests the synthetic data is adequate for development. A large gap suggests domain-specific limitations in the synthetic data that need to be addressed.
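The gap itself is straightforward to compute once the model has scored both a held-out synthetic set and the real sample. The sketch below uses a plain pairwise AUC implementation to stay dependency-free; the labels and scores are invented toy values, and a library routine would normally be used on real data.

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a random positive
    outscores a random negative, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores from one model trained on synthetic data.
synthetic_auc = auc([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])  # held-out synthetic set
real_auc = auc([0, 0, 1, 1], [0.20, 0.60, 0.50, 0.55])       # small real sample

gap = synthetic_auc - real_auc
adequate_for_development = gap < 0.05
```

With a real sample this small, the AUC estimate is noisy, so a confidence interval on the real-data AUC is worth reporting alongside the gap.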

Statistical Distribution Comparison

Compare the marginal distributions of key features (age, common diagnoses, lab values, LOS) between your synthetic dataset and available real population statistics. Use published CMS utilization data, HCUP statistics, and condition-specific epidemiological data as reference points. Significant divergence in marginal distributions — even if the correlation structure is preserved — will affect model calibration.
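For categorical features such as diagnosis codes, one simple divergence measure is total variation distance between the synthetic and reference frequency tables. The sketch below is illustrative: the counts are invented, not actual CMS or HCUP figures, and other measures (for example a Kolmogorov-Smirnov statistic for continuous labs) may suit other features better.

```python
def total_variation_distance(p_counts, q_counts):
    """Half the summed absolute difference between two normalized
    frequency tables: 0.0 means identical, 1.0 means disjoint."""
    categories = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    return 0.5 * sum(
        abs(p_counts.get(c, 0) / p_total - q_counts.get(c, 0) / q_total)
        for c in categories
    )


# Hypothetical code counts; the reference numbers are made up for illustration.
synthetic_counts = {"I50.9": 50, "E11.9": 30, "J44.1": 20}
reference_counts = {"I50.9": 40, "E11.9": 40, "J44.1": 20}

tvd = total_variation_distance(synthetic_counts, reference_counts)
```

What counts as "significant" divergence depends on the use case, so any threshold applied to this number is a judgment call rather than a standard.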

Clinical Expert Review

Generate a random sample of 50-100 patient records from the synthetic dataset and have a clinician review them for face validity. An experienced hospitalist or specialist should be able to look at a patient's clinical record and find it plausible — the right conditions occurring together in a patient of the right age, with appropriate treatment patterns, with lab values that make clinical sense given the diagnoses. If clinicians consistently identify records as implausible, the dataset will likely produce models that fail when they encounter real patients.
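Drawing the review sample with a fixed seed keeps the exercise reproducible, so a second clinician can be handed the same records. A minimal sketch, assuming patient identifiers are available as a list (the `sample_for_review` helper and its defaults are hypothetical):

```python
import random


def sample_for_review(patient_ids, n=50, seed=0):
    """Draw a reproducible random sample of patient IDs without replacement."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    return rng.sample(ids, min(n, len(ids)))


# Hypothetical ID list.
all_ids = [f"P{i:05d}" for i in range(10_000)]
review_batch = sample_for_review(all_ids, n=50, seed=42)
```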

The PatientDatasets Difference: ML-Grade from Day One

The team that had spent weeks on the Kaggle license problem, then the Synthea CPT gap, eventually found PatientDatasets. What struck them first wasn't the content — it was the license. Explicitly commercial. No NC clause. No DUA to negotiate. And then the content: CPT codes present and with realistic distributions. A relational schema with documented foreign keys. Lab values with distributions they could verify against published population statistics. ICD-10 co-occurrence patterns that a clinician on their team reviewed and called "the most realistic synthetic data I've seen."

They had a working readmission model in six weeks: what had taken three months to build on the wrong data, they rebuilt in half that time on the right data. The difference wasn't the model architecture or the feature engineering approach. It was the data.

PatientDatasets provides healthcare datasets specifically engineered for ML use cases. Every dataset ships with a commercial license — not academic research only, not non-commercial only, but a license that explicitly permits training commercial ML models and deploying them in commercial products.

The clinical content reflects real epidemiology: ICD-10 distributions calibrated to CMS inpatient utilization data, comorbidity co-occurrence patterns based on published clinical research, lab value distributions matched to known population statistics. CPT codes are included. E&M visit levels are modeled. Financial and claim data is present with realistic adjudication patterns.

The schema is relational, with enforced foreign key integrity across patient, encounter, diagnosis, procedure, medication, lab, vital, and claim tables. Temporal sequencing is preserved and internally consistent. Missing data patterns reflect clinically structured missingness rather than uniform random missingness. Multiple export formats (including FHIR R4 JSON, Parquet, normalized CSV, and HL7 v2) mean your entire team can work with the data in the format that fits their workflow.

Healthcare Data That's Actually Ready for ML

Commercially licensed synthetic patient datasets across 60+ specialties. Relational schema with enforced foreign keys, realistic ICD-10 and CPT code distributions, lab values calibrated to real population statistics, clinical notes, and full financial data. Seven export formats including FHIR R4 and Parquet. Available today — no IRB, no DUA, no waiting. Get your free sample dataset and see the difference quality data makes.


The Real Question to Ask Before You Start

Before you evaluate any healthcare dataset for ML use, answer these questions — in this order, because each one is a hard stop if the answer is wrong:

  1. Does the license explicitly permit my intended commercial use? Not "research use" — your specific, commercial, product-building use. If the answer is no or uncertain, stop here and find a different dataset.
  2. Does the schema include the specific data types my model requires? Not "does it have clinical data" — does it have CPT codes, claim data, discharge disposition, or whatever specific elements your feature engineering depends on? List every required field before you evaluate a dataset.
  3. Are the distributions realistic for the population I'm modeling? Not "does it have ICD-10 codes" — do the frequency distributions, co-occurrence patterns, and lab value ranges match what I'd expect from clinical epidemiology?
  4. Is there sufficient volume of my specific outcome of interest? Not "how many patients" — how many patients with the specific adverse outcome, rare diagnosis, or clinical event that my model needs to learn from?
  5. Does the schema support longitudinal feature engineering? Not "does it have timestamps" — are the timestamps consistent, correctly sequenced, and preserved at the granularity I need to build temporal features?
  6. Can I build a train/test split that simulates real deployment conditions? Is there sufficient temporal range and multi-site diversity to support temporal splitting and out-of-distribution validation?
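The temporal split in question 6 can be sketched as a cutoff on discharge date: train on everything before the cutoff, test on everything at or after it. The field name and cutoff below are assumptions for illustration.

```python
from datetime import date


def temporal_split(encounters, cutoff, date_key="discharge_date"):
    """Split rows so the test set is strictly later in time than the training set."""
    train = [e for e in encounters if e[date_key] < cutoff]
    test = [e for e in encounters if e[date_key] >= cutoff]
    return train, test


# Hypothetical toy encounters.
encounters = [
    {"encounter_id": "E1", "discharge_date": date(2022, 11, 3)},
    {"encounter_id": "E2", "discharge_date": date(2023, 4, 18)},
    {"encounter_id": "E3", "discharge_date": date(2023, 7, 1)},
    {"encounter_id": "E4", "discharge_date": date(2023, 9, 30)},
]
train, test = temporal_split(encounters, cutoff=date(2023, 7, 1))
```

If the dataset's date range is too narrow for both sides of the cutoff to contain enough outcome events, that is itself an answer to question 6.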

A dataset that fails any of these questions will cost you more in rework than it saves in acquisition time. Healthcare ML is hard enough when the data is right. When it's wrong, you don't find out until you've invested months of work — and the failure mode is often a model that looks fine in development and falls apart in deployment, causing exactly the kind of clinical trust problem that is most difficult to recover from.

Get the data right first. Budget for it. Specify it with precision. Verify it before you build. Everything else follows from that foundation — and without it, nothing else holds.