We Had the Algorithm. We Just Couldn't Get the Data.

The model was ready. The team had spent fourteen months building a 30-day readmission risk predictor for heart failure patients — the kind of tool that, in simulations, flagged 73% of high-risk patients before discharge. If deployed across their health system's 400-bed hospital, they estimated it could prevent 380 readmissions per year. That's 380 people who would not face another hospitalization. 380 families spared. Hundreds of thousands of dollars in reduced costs that could be redirected to other care.

They couldn't deploy it. Not because the model wasn't good enough. Because they couldn't train it on enough data to satisfy the system's IRB. The process had started fourteen months earlier. It was still going.

"We had the architecture. We had the compute. We had the clinical collaboration. We had a genuine shot at something that could help real patients. And we were waiting on a signature."

That's how one data scientist described it to us. Not with bitterness — she understood why the process existed. Just with the particular exhaustion of someone who can see the finish line and can't reach it. Her lead ML engineer had moved to a new role at a tech company six months earlier. The cardiologist who was their clinical champion had left for an academic position. The departmental budget cycle had come and gone, taking with it the compute allocation they'd been counting on.

The model sat on a hard drive. The patients it would have helped were admitted, treated, discharged — and 380 of them, statistically, were readmitted within 30 days.

This story is not unusual. In our conversations with clinical data scientists and healthcare ML teams, we hear versions of it constantly. The technical work is done — or close to done — and the blocker is access. Not compute. Not talent. Not methodology. Access.

This article is about what that wait actually costs, how the system that creates it works, what alternatives exist, and how teams that have found a path through the access problem are using that path to build things that matter.

Healthcare Is Where AI Goes to Wait

In almost every other industry, data for AI development is a solved problem. A self-driving car company generates millions of miles of sensor data automatically and trains on it in real time. A recommendation engine learns from billions of user interactions per day. A fraud detection system updates its models continuously from streaming transaction data. The limiting factor in these domains is compute and talent, not access.

In healthcare, getting the data to train your model requires navigating an institutional infrastructure designed for a different era: one where data access was rare, carefully controlled, and granted to researchers with years of institutional standing whose projects went through rigorous ethical review before anyone saw a single record.

That infrastructure is necessary. Patient privacy is not an abstraction — it's a real protection for real people who haven't consented to have their medical histories used to train a commercial algorithm. We don't dispute this. The regulations exist for good reasons, born from genuine historical abuses of research subjects that the regulatory apparatus was designed to prevent from recurring.

What we're observing is that the regulatory apparatus hasn't scaled to meet the pace of modern AI development. The same process that took six months in 2005 (when data access requests were rare, when machine learning on clinical data was largely theoretical, and when "training a model" required specialized hardware unavailable to most teams) still takes six months in 2026. The world has changed dramatically around a process that has barely changed at all.

And the teams that suffer most are the ones building tools that could genuinely help — readmission predictors, sepsis early warning systems, medication adherence models, clinical documentation assistants, imaging interpretation aids — tools where delay doesn't just mean slower innovation. It means worse outcomes for real patients, today.

How IRB Approval Actually Works

If you've never navigated the IRB process, the 14-month timeline in our opening story may seem extraordinary. It's not. Understanding why requires understanding what the Institutional Review Board actually does and how it does it.

The purpose and jurisdiction of IRBs

Institutional Review Boards are the mechanism by which the U.S. research regulatory framework (primarily the Common Rule, 45 CFR Part 46, and the FDA's research regulations at 21 CFR Parts 50 and 56) is operationalized at the institutional level. Any research involving human subjects that is conducted or supported by federal funding must receive IRB review. Most institutions extend that requirement to all research using identifiable patient data, regardless of funding source, as a condition of their institutional policies.

"Research" in the regulatory sense means systematic investigation designed to develop or contribute to generalizable knowledge. This is relevant to healthcare AI teams because many clinical AI projects — particularly those intended to publish results, validate models against reference cohorts, or contribute to the scientific literature — meet the regulatory definition of research even when they have an operational deployment goal.

The review process and its timeline

IRB review has three tiers: exempt review (for research that poses minimal risk and meets specific categorical criteria), expedited review (for research that poses no more than minimal risk and fits within categories specified in the regulations), and full board review (for research that doesn't qualify for exempt or expedited status).

Clinical AI projects rarely qualify for exempt review, because they typically involve identifiable health information, pose more than minimal risk if the model's outputs could affect clinical decisions, or both. Expedited review is sometimes available for retrospective analyses of de-identified data, but the de-identification itself must be documented and verified. Full board review, which requires a convened meeting of the full IRB committee, a quorum, a vote, and formal written approval, is the most common outcome for clinical AI research proposals.

Full board meetings typically occur monthly at most institutions. This means that if your application is complete and accepted at the beginning of Month 1, the earliest you can receive a decision is the meeting in Month 2. If the board has questions or requires revisions, which is common for first-time submissions and for novel research methodologies, you respond to the contingencies, resubmit, and wait for the next meeting. A two- or three-cycle review process is not unusual. That puts approval in Month 3 or Month 4, assuming nothing else goes wrong.

And then the data access process begins.

What comes after IRB approval

IRB approval establishes that the research is ethically permissible. It does not establish that you have the legal right to access the data. That requires additional agreements.

The Data Use Agreement (DUA) is a contract between your organization and the data-holding institution that specifies what data you can access, how it can be used, how it must be protected, what happens to it after the study is complete, and what restrictions apply to publication and disclosure. DUAs are negotiated by legal counsel on both sides, and they routinely take 60 to 120 days to execute — sometimes longer when either institution's legal department has a long queue, when the DUA template used by one institution conflicts with the standard terms required by the other, or when novel data use scenarios (like training a commercial AI model) raise legal questions that the template doesn't address.

If the data includes PHI — which clinical AI training data typically does, even after partial de-identification — a Business Associate Agreement (BAA) under HIPAA is also required. The BAA establishes the data receiver as a business associate of the covered entity, with all the obligations that entails: safeguard requirements, breach notification obligations, and limitations on use and disclosure of PHI consistent with the minimum necessary standard.

Add a de-identification review (another 30-90 days if the institution's data governance team does this itself), a data transfer mechanism negotiation (how does the data actually move, in what format, through what secure channel), and an institutional information security review (does your computing environment meet the institution's data security requirements for handling PHI), and you have a process that reasonably takes 6 to 18 months from first inquiry to first data point.

The IRB and data access process isn't broken. It's doing exactly what it was designed to do. The problem is that it was designed for a world where accessing patient data was rare and where AI development timelines were measured in years rather than weeks. That world no longer exists — but the process does.

Why Healthcare AI Teams Fail at the Data Stage, Not the Technical Stage

A 2023 survey of clinical AI teams at academic medical centers found that data access was cited as the primary bottleneck by 67% of respondents — compared to 12% who cited technical limitations and 8% who cited compute resources. The pattern holds consistently across our own conversations with data science teams in healthcare.

The failure mode is predictable. A team assembles — clinical champion, data scientist, ML engineer, program manager. They identify the clinical problem: readmission, sepsis, medication adherence. They sketch the model architecture, identify the features they'll need, estimate the sample size required for the statistical power they want. They submit the IRB application. They wait.

While they wait, the clinical champion gets pulled into other priorities. The ML engineer, unable to do meaningful model development without data, gets allocated to other projects. The program manager loses budget visibility because there's no clear timeline to anchor to. The team's shared momentum — that particular energy that comes from a group of people aligned on a goal they believe in — dissipates under months of institutional friction.

When the data finally arrives, it often arrives to a different team than the one that submitted the proposal. The original ML engineer has moved on. The clinical champion's enthusiasm has cooled. The budget has been reallocated. The institutional priority landscape has shifted. And the model that was going to deploy in Q3 is now a research project that might publish someday.

The specific compliance framework stack

To make the access problem concrete, here is the full compliance framework stack that a clinical AI team at a US health system must navigate:

Typical Data Access Requirements

IRB Application — Institutional Review Board approval. Full board review for most clinical AI research. Timeline: 2-6 months depending on institution and number of review cycles. Requires complete research protocol, informed consent waiver justification, data security plan, and conflict of interest disclosures.

Data Use Agreement (DUA) — Legal contract governing data access, permissible use, security requirements, publication rights, and data destruction requirements. Negotiated between legal teams. Timeline: 60-120 days. Often requires multiple rounds of redline negotiation on commercial use rights, IP ownership of model outputs, and indemnification terms.

Business Associate Agreement (BAA) — Required under HIPAA §164.308 when PHI is disclosed to a business associate. Establishes security obligations, breach notification requirements, and use restrictions. Timeline: 30-60 days. Must be executed before any PHI is transmitted.

Institutional Data Governance Review — Many health systems have an internal data governance committee that reviews data access requests independently of the IRB. This committee evaluates whether the requested data is the minimum necessary, whether the requesting team has appropriate credentials, and whether the use case is consistent with the institution's data sharing policies. Timeline: 30-90 days, often running parallel to other processes but adding additional contingencies.

De-identification Review and Processing — Even approved research often requires formal de-identification under the HIPAA Safe Harbor standard (45 CFR §164.514(b)) or Expert Determination standard. The Safe Harbor standard requires removal of 18 specific identifier categories. This work is performed by the data-holding institution's informatics team and adds 30-90 days depending on queue depth and data complexity.

Information Security Review — The receiving institution's information security team must certify that the computing environment where data will be stored and processed meets the data security requirements specified in the DUA. For cloud computing environments, this often requires a cloud security questionnaire, penetration testing results, and sometimes a site visit. Timeline: 30-60 days.

Annual Renewals and Amendments — IRB approval is typically granted for one year and requires annual renewal. DUAs often have expiration dates requiring renegotiation. Amendments to the research protocol (changing your model architecture, adding features, expanding the cohort) may require re-review. Each renewal or amendment restarts portions of the process.

Each of these steps is individually reasonable. Collectively, they represent a process that takes 6 months to 2 years from first request to first data access — for a single dataset, from a single institution. If your model requires data from multiple institutions for generalizability (which most production clinical AI tools do), multiply accordingly.
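To see how the individual timelines add up, here is a minimal sketch that totals the ranges quoted in the list above. It rests on two assumptions: the IRB range of 2-6 months is approximated as 60-180 days, and the steps are treated as either strictly sequential or fully parallel, when real projects fall somewhere in between. The function names are illustrative only.

```python
# Durations in days, taken from the ranges quoted in the list above.
# Assumption: the IRB range of 2-6 months is approximated as 60-180 days.
STEPS = {
    "IRB application": (60, 180),
    "Data Use Agreement": (60, 120),
    "Business Associate Agreement": (30, 60),
    "Data governance review": (30, 90),
    "De-identification": (30, 90),
    "Information security review": (30, 60),
}


def sequential_total(steps):
    """Best and worst case if every step waits for the previous one."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


def fully_parallel_total(steps):
    """Lower bound: everything runs at once, so the longest step dominates."""
    low = max(lo for lo, _ in steps.values())
    high = max(hi for _, hi in steps.values())
    return low, high


seq_low, seq_high = sequential_total(STEPS)
par_low, par_high = fully_parallel_total(STEPS)
print(f"Sequential:     {seq_low}-{seq_high} days "
      f"(~{seq_low / 30:.0f}-{seq_high / 30:.0f} months)")
print(f"Fully parallel: {par_low}-{par_high} days")
```

Run strictly in sequence, the steps total roughly 8 to 20 months, which is consistent with the range above. Annual renewals and amendments are left out because the list gives no duration for them.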

What Happens to Your Model While You Wait 14 Months

The 14-month wait is not neutral time. Clinical AI models and the teams that build them don't simply pause during the access process and resume unchanged when data arrives. Here is what actually happens while you wait.

Model drift before deployment

A readmission prediction model developed on simulated or synthetic data, or on a small pilot dataset, reflects the patient population and clinical practice patterns of the moment it was trained. Healthcare is not static. Medication formularies change. Clinical guidelines are updated. Coding practices shift with ICD and CPT updates. Patient demographics evolve as populations age. Pandemic-era disruptions altered care patterns in ways that persisted for years.

A model developed in Month 1 and deployed in Month 14 may already be partially stale before it makes a single prediction. The features that mattered most in the training data may have shifted in the real patient population. The model may perform well on retrospective validation and underperform on prospective deployment — not because the methodology was wrong, but because 14 months elapsed between the world the model learned about and the world it was asked to predict in.

Staff turnover

Clinical AI projects are built by people. People change jobs. The ML engineer who architected the feature engineering pipeline and knows the reasoning behind every preprocessing decision is, 14 months later, working somewhere else. The data scientist who established the clinical relationships that made this project possible has moved to a faculty position. The clinical champion who was going to shepherd deployment through the medical staff process has rotated to a different administrative role.

Knowledge doesn't transfer perfectly. The new team that inherits the project doesn't fully understand the decisions that were made before they arrived. They rebuild parts they don't understand. They make different decisions. The model evolves into something its original architects wouldn't recognize — not necessarily worse, but different, and disconnected from the institutional relationships that were going to enable deployment.

Priority shifts and budget reallocation

A 14-month timeline spans at least two budget cycles at most health systems. Budget cycles bring reallocation. The compute budget for model training that seemed generous in Month 1 looks different when a competing priority — a new EHR implementation, a regulatory compliance initiative, a cost-cutting mandate — absorbs the discretionary budget in Month 8. Projects that aren't actively producing output are easy targets for budget cuts. A project that's waiting for data access isn't producing output. It's spending institutional political capital to hold its place in line while generating no visible results.

The 14-month wait isn't just 14 months of delay. It's 14 months of compounding organizational risk. Every month that passes is a month in which something can go wrong with the team, the budget, the clinical relationships, or the institutional priorities that were going to carry the project to deployment. The longer the wait, the more likely that something will.

The Hidden Cost of IRB Delays: What 380 Prevented Readmissions Actually Means

Let's make the cost concrete. The readmission prediction team we opened with estimated their model would prevent 380 heart failure readmissions per year at their 400-bed hospital. What does that actually represent?

The average cost of a heart failure readmission to a US hospital is approximately $14,000 to $18,000 in direct facility costs, with total cost of care closer to $24,000 when including physician fees, post-acute care, and downstream utilization. Using a conservative midpoint of $16,000 per readmission: 380 prevented readmissions represent $6.08 million in annual healthcare cost reduction.

Under Medicare's Hospital Readmissions Reduction Program (HRRP), hospitals with excess readmission rates for conditions including heart failure face payment penalties of up to 3% of their total Medicare DRG payments. For a 400-bed hospital with significant Medicare volume, that penalty exposure can represent $1.5 to $3 million annually. A model that meaningfully reduces heart failure readmissions reduces that penalty exposure — potentially by more than the model's development cost in a single year.

But the cost that can't be put in dollars is the human cost. Heart failure readmissions are not comfortable. They involve emergency department visits, hospitalization, often intensive care, repeated procedures, extended recovery, and in many cases — particularly in elderly patients — they represent a step in a declining trajectory. A patient who is readmitted for heart failure decompensation within 30 days of discharge is a patient who didn't receive optimal post-discharge management. They may have missed warning signs that a better-supported care plan would have caught. They may have been unable to access the follow-up care that was planned. The model that would have flagged them as high-risk — and triggered the care management intervention that might have prevented the readmission — was sitting on a hard drive waiting for an IRB signature.

Over the 14-month wait for IRB approval: approximately 443 heart failure readmissions that the model might have prevented. At $16,000 each: $7.1 million in preventable healthcare costs. In human terms: 443 families who faced another hospitalization that adequate care transitions might have prevented.

These numbers are estimates, not certainties. Models don't prevent every readmission they flag. Care interventions don't always succeed. But the orders of magnitude are right. The cost of waiting is measured in outcomes, not just engineering hours.
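For readers who want to check or adapt the estimate, here is a minimal sketch of the arithmetic. It uses only the inputs stated above (380 prevented readmissions per year, $16,000 per readmission, a 14-month wait) and assumes that prevented readmissions accrue evenly across the year. The variable and function names are illustrative.

```python
READMISSIONS_PER_YEAR = 380     # the team's estimate quoted in this article
COST_PER_READMISSION = 16_000   # midpoint used above, in USD
WAIT_MONTHS = 14


def prevented_during_wait(per_year, months):
    """Prorate an annual figure over a wait, assuming an even monthly rate."""
    return per_year * months / 12


annual_cost = READMISSIONS_PER_YEAR * COST_PER_READMISSION
missed = prevented_during_wait(READMISSIONS_PER_YEAR, WAIT_MONTHS)
missed_cost = missed * COST_PER_READMISSION

print(f"Annual avoidable cost: ${annual_cost:,}")
print(f"During the wait: {missed:.0f} readmissions, ${missed_cost / 1e6:.1f}M")
```

Swapping in your own institution's readmission volume and cost figures gives a first-order sense of what a given delay costs.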

The Regulatory Landscape for Healthcare AI in 2025-2026

The regulatory environment for clinical AI is evolving rapidly, and teams building in this space need to understand where the regulations are going — not just where they are today.

FDA Software as a Medical Device (SaMD) guidance

The FDA's framework for Software as a Medical Device (SaMD) applies to clinical software that meets the definition of a medical device under 21 U.S.C. §321(h). Clinical AI tools that are intended to diagnose, treat, mitigate, cure, or prevent disease may meet this definition — and the line between "decision support" (which may be exempt) and "SaMD" (which requires FDA clearance or approval) is not always clear.

The FDA's 2021 Action Plan for AI/ML-Based Software as a Medical Device established a framework for "Predetermined Change Control Plans" — a regulatory pathway that would allow AI models to update and improve over time without requiring a new 510(k) clearance for each update. The concept is sound: clinical AI should be able to improve as it learns, not be frozen at the version that received initial clearance. The regulatory implementation is still evolving as of early 2026, and teams building models with deployment ambitions should be engaging FDA's Digital Health Center of Excellence early.

The FDA's Draft Guidance on Artificial Intelligence-Enabled Device Software Functions (2025) further clarifies the agency's expectations around clinical validation, transparency, and documentation for AI tools that meet the SaMD definition. The guidance emphasizes the importance of testing on representative patient populations, a requirement that directly implicates the quality and diversity of training data.

ONC and the interoperability mandate

The ONC's 21st Century Cures Act Final Rule and the subsequent CMS Interoperability and Prior Authorization Final Rule (CMS-0057-F, finalized in 2024) create new requirements for health systems and payers to make clinical data available via standardized FHIR APIs. For clinical AI teams, this is significant: as more institutions implement FHIR R4 APIs with SMART on FHIR authorization, programmatic access to patient data for approved research uses becomes more standardized, though no less regulated.
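To give a sense of what standardized programmatic access looks like, here is a small sketch that builds a FHIR R4 search URL for one patient's HbA1c observations. The base URL and patient ID are hypothetical placeholders, and the helper name is ours. The search parameters (`patient`, `code`, `date`, `_sort`) are standard FHIR search parameters for the Observation resource, and 4548-4 is the LOINC code for hemoglobin A1c. No request is sent; a real call would also carry an access token obtained through the institution's SMART on FHIR authorization flow.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real endpoint comes from the data-holding institution.
FHIR_BASE = "https://fhir.example.org/r4"


def observation_search_url(base, patient_id, loinc_code, since):
    """Build a FHIR R4 Observation search for one patient and one LOINC code."""
    params = {
        "patient": patient_id,
        "code": f"http://loinc.org|{loinc_code}",  # system|code token
        "date": f"ge{since}",                       # on or after this date
        "_sort": "date",
    }
    return f"{base}/Observation?{urlencode(params)}"


url = observation_search_url(FHIR_BASE, "example-patient", "4548-4", "2024-01-01")
print(url)
```

The point is that the query shape is the same at any conforming institution. The approvals required before you are allowed to run it are unchanged.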

The ONC's Trusted Exchange Framework and Common Agreement (TEFCA) and the supporting Qualified Health Information Networks (QHINs) are creating a national infrastructure for health information exchange that, in principle, could eventually enable broader multi-institutional data access for approved research. In practice, TEFCA is still maturing, and the pathway from "approved to participate in TEFCA" to "able to access data for AI training" is not yet clearly defined.

State-level AI regulation

Several states have enacted or are considering legislation specifically addressing clinical AI. Colorado's SB 21-169 requires health insurers to audit AI systems used in coverage determinations for discriminatory impact. California's AB 2930 (proposed) would require impact assessments for high-risk AI systems including those used in healthcare. The federal regulatory vacuum on clinical AI governance has created a patchwork of state-level requirements that teams building AI tools for multi-state deployment need to track.

Alternative Approaches: Federated Learning, Differential Privacy, and Synthetic Data

Teams blocked by data access have developed several alternative technical approaches to the training data problem. Each has genuine merits and genuine limitations, and the honest accounting of each is more nuanced than the marketing materials suggest.

Federated learning

Federated learning is a training approach where the model is sent to the data (rather than the data being sent to the model). Each participating institution trains the model on its local data, sends the model updates (gradients, not raw data) to a central server, and the central server aggregates the updates to improve the global model. The raw patient data never leaves the institution.
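The loop described above can be sketched in a few lines. This is a toy illustration, not a production protocol: the three "institutions" are simulated arrays, the model is a plain linear regression, and each site returns locally updated weights rather than raw gradients (the federated averaging variant). It also omits secure aggregation, authentication, and everything else a real deployment needs. All names are hypothetical.

```python
import numpy as np


def local_update(weights, X, y, lr=0.1, epochs=5):
    """Run a few gradient steps of linear regression on one site's data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def federated_round(global_w, sites):
    """One round: each site trains locally, the server averages the results."""
    updates, sizes = [], []
    for X, y in sites:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    # Weight each site's model by its sample count (federated averaging).
    return np.average(np.stack(updates), axis=0, weights=sizes)


rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
sites = []
for n in (200, 50, 120):  # three simulated institutions of different sizes
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, sites)
print(w)  # close to true_w; only weights crossed the site boundary
```

In this sketch the server only ever sees weight vectors. The simulated sites here share one distribution, which is the easy case.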

The appeal is obvious: federated learning seems to offer multi-institutional training without the multi-institutional data access process. In practice, the complications are significant. Federated learning still requires IRB approval at each participating institution, because the institution is contributing to research using its patients' data even if that data doesn't leave. The DUA and BAA requirements may still apply depending on what information is transmitted. The technical overhead of implementing a federated training infrastructure is substantial — it requires coordination, standardization, and consistent data preprocessing across institutions that may have different EHR systems, different coding practices, and different data quality profiles. And federated learning can perform significantly worse than centralized training in some model architectures and data distribution scenarios, particularly when local datasets are small or heterogeneous.

Federated learning is a genuine tool with genuine applications, particularly for large-scale multi-site collaborations where centralized data aggregation is politically impossible. It is not a shortcut around the access problem. It's a different, more complex version of the same problem.

Differential privacy

Differential privacy is a mathematical framework for adding calibrated statistical noise to data or model updates in a way that provides provable privacy guarantees while preserving as much of the data's statistical utility as possible. Training a model with differential privacy bounds the contribution that any individual's data can make to the model, which limits how far the model can "memorize" specific patients in a way that would let someone reconstruct their records from its parameters.

Differential privacy is increasingly seen as a best practice for clinical AI model training, and the FDA's SaMD guidance notes its relevance for privacy-preserving AI. But differential privacy does not eliminate the data access requirement — you still need to access the data to apply the privacy mechanism to it. It reduces the privacy risk of the resulting model, which may reduce the stringency of some regulatory requirements, but it does not substitute for IRB approval, DUA execution, or the other elements of the access framework.

The privacy-utility tradeoff in differential privacy is also real: adding enough noise to provide meaningful privacy guarantees often degrades model performance, particularly for small cohorts and rare conditions. Setting the privacy budget (epsilon) low enough to provide strong guarantees can produce a model that doesn't perform well enough to deploy. Setting it high enough to preserve performance may provide weaker privacy guarantees than the framework implies.
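The tradeoff is easiest to see on a simple count query rather than on full model training. The sketch below applies the Laplace mechanism, where the noise scale is the query's sensitivity divided by epsilon, to a hypothetical cohort of 40 patients. It illustrates the principle only; differentially private model training typically uses gradient clipping and noise instead. The cohort size and function names are assumptions for illustration.

```python
import numpy as np


def laplace_scale(sensitivity, epsilon):
    """Noise scale for the Laplace mechanism: sensitivity / epsilon."""
    return sensitivity / epsilon


def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy (sensitivity 1)."""
    return true_count + rng.laplace(loc=0.0, scale=laplace_scale(1.0, epsilon))


rng = np.random.default_rng(42)
cohort_size = 40  # hypothetical rare-condition cohort

for eps in (0.1, 1.0, 10.0):
    scale = laplace_scale(1.0, eps)
    noisy = private_count(cohort_size, eps, rng)
    print(f"epsilon={eps:>4}: noise scale={scale:5.1f}, noisy count={noisy:6.1f}")
```

At epsilon 0.1 the noise scale is 10, a quarter of the true count of 40. At epsilon 10 the scale is 0.1, so the count is nearly exact and the guarantee is correspondingly weak.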

Synthetic data

Synthetic patient data — data that is generated to reflect the statistical and structural properties of real clinical data without being derived from any specific real patient — addresses the data access problem at its source. If the training data doesn't contain PHI, the HIPAA BAA requirement doesn't apply. If it wasn't derived from research on human subjects, the IRB requirement may not apply (depending on the research design). If it has no real patient data to de-identify, the de-identification process is eliminated.

The limitations of synthetic data are real and should be stated honestly. A model trained exclusively on synthetic data and deployed without validation against real patient outcomes is not a finished clinical tool. The statistical fidelity of synthetic data depends on the quality of the generation process — synthetic data that doesn't accurately reflect the joint distribution of clinical variables (comorbidity patterns, temporal relationships between events, realistic missingness patterns) can introduce biases that persist into the trained model. And for rare conditions or rare events, synthetic data generation faces the same fundamental challenge as any generative process: you can only generate realistic examples of things you have enough real examples to learn from.

Where synthetic data excels is in the development and testing phases — before real data is available and while real data access is being negotiated. A team that uses synthetic data to build and validate its preprocessing pipeline, feature engineering logic, model architecture, and evaluation framework arrives at real data access with a working foundation. The real data isn't used to build the scaffolding — it's used for what it's most valuable for: final training, validation, and the statistical rigor that clinical deployment requires.

Synthetic data and real data are not competing choices. They're sequential tools. Synthetic data for development. Real data for validation. This is the hybrid approach used by the teams that get the most out of both.

What Healthcare Datasets Need That General ML Datasets Never Have

One of the persistent mistakes that data scientists moving into healthcare make is underestimating how different clinical data is from the tabular and image datasets they've worked with before. The differences are not superficial. They go to the structure of the data, the domain knowledge required to work with it, and the specific properties that a healthcare dataset needs to have for a model trained on it to generalize to real clinical environments.

Temporal relationships

Clinical data is fundamentally longitudinal. A patient's diagnosis history, medication changes, lab trends, hospitalization record, and vital sign trajectories are not independent observations — they're a time series where earlier events influence later ones, where the absence of an event is itself clinically meaningful, and where the timing between events matters as much as the events themselves.

A readmission prediction model needs to understand that a serum creatinine that has been rising over three visits is clinically different from a serum creatinine that is elevated but stable. A sepsis early warning model needs to understand the rate of change in vital signs, not just their current values. A medication adherence model needs to understand the temporal pattern of prescription refills relative to supply days — which requires not just the prescription events but the gaps between them.
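As a small illustration of the creatinine example, the sketch below derives a trend feature alongside the latest value. The lab values and visit spacing are invented for illustration, and a least-squares slope is only one of many ways to encode a trend.

```python
import numpy as np


def trend_per_day(days, values):
    """Least-squares slope of a lab value over time (units per day)."""
    slope, _intercept = np.polyfit(days, values, deg=1)
    return slope


days = np.array([0, 30, 60])         # days since first visit (illustrative)
rising = np.array([1.1, 1.4, 1.8])   # mg/dL, climbing over three visits
stable = np.array([1.8, 1.8, 1.8])   # mg/dL, elevated but flat

features = {
    "rising": {"latest": rising[-1], "slope": trend_per_day(days, rising)},
    "stable": {"latest": stable[-1], "slope": trend_per_day(days, stable)},
}

for name, f in features.items():
    print(f"{name}: latest={f['latest']:.1f} mg/dL, slope={f['slope']:.4f} mg/dL/day")
```

Both patients have the same latest value, so a model that sees only the most recent lab cannot tell them apart. The slope feature separates them.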

General ML datasets rarely have this temporal structure. Healthcare synthetic data that doesn't capture realistic temporal relationships between clinical events — the expected progression of disease markers over time, the realistic distribution of time between encounters for different clinical contexts, the lagged relationship between a diagnosis and its complications — will produce models that fail to capture the dynamics that matter most in clinical prediction.

ICD hierarchy awareness

ICD-10-CM codes are not arbitrary labels. They exist in a hierarchical structure where codes at different levels of specificity represent different clinical entities, and where the relationship between a parent code and its child codes is clinically meaningful. E11 (type 2 diabetes mellitus) is the category under which E11.21 (with diabetic nephropathy), E11.311 (with unspecified diabetic retinopathy with macular edema), and E11.65 (with hyperglycemia) all sit, and each of these represents a distinct clinical situation.

A model that treats ICD-10 codes as a flat vocabulary, where E11 and E11.21 are simply different tokens with no structural relationship, misses the hierarchical information that is encoded in the code structure. Models that incorporate ICD ontology awareness, either through hierarchical embedding approaches or through feature engineering that captures the code's position in the hierarchy, often outperform models that treat codes as flat categorical variables.
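One simple form of the feature engineering route is to expand each code into itself plus its ancestors, so that E11.21 and E11.65 share the E11 feature. The sketch below does this by prefix truncation, which is a simplification: it ignores the chapter and block levels above the three-character category and does not treat seventh-character extensions specially. The function names are hypothetical.

```python
def icd10_ancestors(code):
    """Return the code and its ancestors, from the 3-character category down."""
    code = code.strip().upper()
    category, _, extension = code.partition(".")
    levels = [category]
    for i in range(1, len(extension) + 1):
        levels.append(f"{category}.{extension[:i]}")
    return levels


def hierarchical_features(codes):
    """Expand a patient's codes into a set that includes every ancestor level."""
    expanded = set()
    for code in codes:
        expanded.update(icd10_ancestors(code))
    return expanded


print(icd10_ancestors("E11.311"))
print(sorted(hierarchical_features(["E11.21", "E11.65"])))
```

With this expansion, two patients coded E11.21 and E11.65 overlap on E11 instead of looking entirely unrelated to the model.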

Synthetic data that generates realistic ICD-10 code combinations — not just plausible individual codes but clinically realistic patterns of specificity and combination — provides training material that develops the right inductive biases in models designed to work with these codes.

Comorbidity correlations and realistic population distributions

Real clinical populations have comorbidity distributions that are highly non-uniform. Diabetes and hypertension co-occur at far higher rates than random chance would predict. Heart failure and CKD co-occur with specific patterns of severity correlation. Depression and chronic pain co-occur with bidirectional relationships that have clinical implications. A synthetic dataset that generates comorbidities independently, assigning each condition based on its marginal prevalence without accounting for the correlation structure, produces an artificial population whose joint distributions do not match those of any real one.

This matters because models trained on synthetic data with incorrect comorbidity correlations will learn the wrong joint distributions. When those models encounter real patients, whose conditions do follow the realistic correlation structure, their predictions rest on learned feature relationships that don't match how conditions actually co-occur in clinical practice.
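A quick way to see whether a dataset generated its comorbidities independently is to compute the co-occurrence lift for a pair of conditions: the observed joint rate divided by the rate independence would predict. The following is a minimal sketch with a hypothetical function name and a made-up ten-patient cohort; it is not drawn from any real population.

```python
def cooccurrence_lift(patients: list[set[str]], a: str, b: str) -> float:
    """Observed P(a and b) divided by P(a) * P(b); 1.0 means independence."""
    n = len(patients)
    p_a = sum(a in p for p in patients) / n
    p_b = sum(b in p for p in patients) / n
    p_ab = sum(a in p and b in p for p in patients) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both conditions must appear at least once")
    return p_ab / (p_a * p_b)


# Toy cohort (illustrative only): 4 with diabetes, 5 with hypertension, 3 with both.
toy_cohort = (
    [{"diabetes", "hypertension"}] * 3
    + [{"diabetes"}]
    + [{"hypertension"}] * 2
    + [set()] * 4
)
print(round(cooccurrence_lift(toy_cohort, "diabetes", "hypertension"), 2))
```

A lift near 1.0 for pairs that the epidemiology literature reports as strongly associated is a sign that the generator sampled each condition from its marginal prevalence alone.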

Realistic missingness patterns

In general ML datasets, missing values are typically handled by imputation or exclusion. In clinical datasets, missingness is not random — it's clinically informative. A patient who doesn't have an HbA1c recorded in their chart may not have diabetes. A patient who has an HbA1c recorded but no follow-up within the expected monitoring window may be poorly adherent to care. A patient whose lab values are missing for a specific period may have been hospitalized at another institution. The pattern of what's missing tells you something about the patient.

Synthetic data that generates missingness at random — or that generates complete data for every patient — produces training material that teaches models to handle missingness as a technical artifact rather than a clinical signal. Models trained on this data will underperform on real data where missingness is informative.

High-quality healthcare synthetic data should generate realistic missingness patterns: the right types of observations missing in the right clinical contexts, with the right temporal relationships to the clinical events that should explain the missingness.
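To make the idea concrete, here is a minimal sketch of missingness that depends on clinical context, plus an explicit indicator feature so a model can use the pattern as a signal. The function names and both probabilities are assumptions chosen for illustration, not estimates from any real population.

```python
import random


def simulate_hba1c_recorded(has_diabetes: bool, rng: random.Random,
                            p_if_diabetic: float = 0.9,
                            p_if_not: float = 0.15) -> bool:
    """Decide whether an HbA1c value gets recorded for this patient.

    The probabilities are illustrative assumptions: the test is ordered far
    more often for patients with a diabetes diagnosis than for those without.
    """
    return rng.random() < (p_if_diabetic if has_diabetes else p_if_not)


def add_missingness_indicator(record: dict, field: str) -> dict:
    """Return a copy of the record with an explicit `<field>_missing` flag."""
    out = dict(record)
    out[f"{field}_missing"] = record.get(field) is None
    return out


rng = random.Random(0)  # fixed seed so the sketch is reproducible
patient = {"has_diabetes": False, "hba1c": None}
if simulate_hba1c_recorded(patient["has_diabetes"], rng):
    patient["hba1c"] = 5.4  # placeholder value for the sketch
print(add_missingness_indicator(patient, "hba1c"))
```

The design choice is to keep the flag alongside any imputed value rather than letting imputation erase the fact that the observation was never made.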

A Realistic Synthetic Data Evaluation Framework

If you're evaluating whether synthetic data is good enough for your specific use case, you need a framework that goes beyond visual inspection and basic statistical summaries. Here is how to assess synthetic healthcare data systematically.

Marginal fidelity

The simplest test: do the marginal distributions of key variables in the synthetic data match what you'd expect from a real clinical population? Age distribution, sex distribution, prevalence of common ICD-10 categories, and distribution of common labs should all be within a reasonable range of published clinical epidemiology. If a synthetic dataset shows a 25% prevalence of type 2 diabetes when the real population prevalence is 11%, the marginal distribution is wrong and any model trained on it will have biased priors.
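The check itself is a few lines. The sketch below uses the 25% versus 11% figures from the example above; the function name and the 25% relative tolerance are assumptions you would replace with your own acceptance criteria.

```python
def prevalence_check(flags: list[bool], reference: float,
                     rel_tolerance: float = 0.25) -> tuple[float, float, bool]:
    """Compare observed prevalence with a reference value.

    Returns (observed prevalence, relative error, within tolerance).
    The default tolerance is an arbitrary illustrative threshold.
    """
    observed = sum(flags) / len(flags)
    rel_error = abs(observed - reference) / reference
    return observed, rel_error, rel_error <= rel_tolerance


# 25 of 100 synthetic patients have type 2 diabetes; the reference prevalence is 11%.
synthetic_t2dm = [True] * 25 + [False] * 75
observed, rel_error, ok = prevalence_check(synthetic_t2dm, reference=0.11)
print(f"observed={observed:.2f} relative_error={rel_error:.2f} ok={ok}")
```

Running the same function over every key variable gives a first-pass report of which marginals need attention before any modeling starts.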

Joint fidelity

More demanding: do the joint distributions of related variables match? Specifically, do comorbidity co-occurrence rates match expected rates from the epidemiology literature? Is the correlation between HbA1c and diabetes diagnosis code presence realistic? Is the relationship between eGFR values and CKD stage codes consistent with clinical staging criteria? Joint fidelity tests whether the synthetic data reflects the correlation structure of real clinical data, not just its marginal properties.
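The eGFR question can be turned into a direct consistency check. This sketch maps ICD-10-CM CKD stage codes to the standard eGFR ranges for each stage and reports how often a coded stage agrees with a paired eGFR value. It is deliberately simplified: real staging also depends on persistence over time and on markers of kidney damage, and the function name and record layout are hypothetical.

```python
# eGFR range per CKD stage code, in mL/min/1.73 m^2 (lower inclusive, upper exclusive).
EGFR_RANGE_BY_CODE = {
    "N18.1": (90.0, float("inf")),  # stage 1
    "N18.2": (60.0, 90.0),          # stage 2
    "N18.30": (30.0, 60.0),         # stage 3, unspecified
    "N18.31": (45.0, 60.0),         # stage 3a
    "N18.32": (30.0, 45.0),         # stage 3b
    "N18.4": (15.0, 30.0),          # stage 4
    "N18.5": (0.0, 15.0),           # stage 5
}


def stage_consistency_rate(records: list[tuple[str, float]]) -> float:
    """Fraction of (code, eGFR) pairs whose eGFR falls in the coded stage's range.

    Pairs with a code outside the table are skipped rather than counted as errors.
    """
    checked = consistent = 0
    for code, egfr in records:
        bounds = EGFR_RANGE_BY_CODE.get(code)
        if bounds is None:
            continue
        low, high = bounds
        checked += 1
        consistent += low <= egfr < high
    if checked == 0:
        raise ValueError("no CKD stage codes found")
    return consistent / checked


sample = [("N18.31", 52.0), ("N18.4", 22.0), ("N18.2", 40.0), ("E11.9", 80.0)]
print(round(stage_consistency_rate(sample), 2))
```

Real data will not score 100% on this check either, so the useful comparison is between the synthetic rate and the rate you would expect from a real cohort.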

Temporal fidelity

Most demanding: do longitudinal patterns in the synthetic data reflect realistic disease progression trajectories? If you track a cohort of diabetic patients through the synthetic data over multiple encounters, do you see the expected patterns of HbA1c change in response to medication adjustments? Do lab values change in the expected directions when medications are added or discontinued? Does the rate of complication development in diabetic patients match published rates from longitudinal cohort studies?
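One of these questions, whether HbA1c moves in the expected direction after a medication is started, can be sketched as a per-patient before-and-after comparison averaged over a cohort. Everything here is an assumption made for illustration: the function names, the 180-day follow-up window, and the data layout of dated HbA1c values plus a medication start date.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Optional

Measurement = tuple[date, float]  # (draw date, HbA1c in percent)


def hba1c_change_after_start(measurements: list[Measurement], start: date,
                             window_days: int = 180) -> Optional[float]:
    """Follow-up HbA1c minus baseline HbA1c around a medication start date.

    Baseline is the latest value on or before `start`; follow-up is the latest
    value after `start` and within `window_days`. Returns None if either is absent.
    """
    window_end = start + timedelta(days=window_days)
    before = [(d, v) for d, v in measurements if d <= start]
    after = [(d, v) for d, v in measurements if start < d <= window_end]
    if not before or not after:
        return None
    return max(after)[1] - max(before)[1]


def mean_cohort_change(cohort: list[tuple[list[Measurement], date]]) -> float:
    """Average change across patients who have both a baseline and a follow-up."""
    changes = [hba1c_change_after_start(m, s) for m, s in cohort]
    observed = [c for c in changes if c is not None]
    if not observed:
        raise ValueError("no patient has both baseline and follow-up values")
    return mean(observed)


patient_a = ([(date(2023, 1, 10), 9.1), (date(2023, 6, 1), 7.8)], date(2023, 2, 1))
patient_b = ([(date(2023, 1, 5), 8.4)], date(2023, 3, 1))  # no follow-up value
print(round(mean_cohort_change([patient_a, patient_b]), 2))
```

The number this produces only means something next to a published effect size for the same drug class and follow-up window, which you would need to source separately.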

Train-test transfer fidelity

The ultimate test: train a model on synthetic data and test it on real data. If the model performs significantly better on synthetic test data than on real test data, the gap represents the distributional difference between the synthetic and real populations — the "synthetic-to-real gap" that determines how much retraining will be required when real data becomes available. A smaller gap means the synthetic data is more useful for development; a larger gap means the synthetic data is better used for architectural development and pipeline testing than for parameter estimation.
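The gap can be reported as a single number once you fix a metric. The sketch below uses AUC, which is a choice made here for illustration rather than something the framework prescribes, and computes it by direct pairwise comparison so it needs no external libraries. The function names and the toy scores are hypothetical.

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive scores higher than a random negative.

    Ties count as half. Pairwise O(n*m) comparison: fine for a sketch,
    too slow for large cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def synthetic_to_real_gap(synth_labels: list[int], synth_scores: list[float],
                          real_labels: list[int], real_scores: list[float]) -> float:
    """AUC on held-out synthetic data minus AUC on real data, for the same model."""
    return auc(synth_labels, synth_scores) - auc(real_labels, real_scores)


# Toy scores from one hypothetical model evaluated on both test sets.
gap = synthetic_to_real_gap(
    [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9],    # synthetic test set
    [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8],   # real test set
)
print(round(gap, 2))
```

For a risk model, a calibration measure tracked alongside AUC would show whether the gap is in ranking, in predicted probabilities, or in both.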

Team and Organizational Strategies for Reducing Data Access Friction

While navigating the compliance framework is unavoidable, there are strategies that meaningfully reduce the time and friction involved in clinical data access. Teams that build these capabilities are systematically faster at moving from proposal to data than teams that treat each data access request as a one-off negotiation.

Build institutional relationships before you need data

IRB review committees look more favorably on researchers with established track records at the institution. Data governance committees are more likely to expedite review for teams that have previously demonstrated responsible data stewardship. Clinical informatics teams move faster for requestors they know and trust. The investment in relationships — participation in data governance committees, collaboration on less sensitive projects, publication of results from prior approved projects — pays back in reduced friction for the projects that matter most.

Develop a library of pre-negotiated agreement templates

DUA negotiation is slow when each agreement is negotiated from scratch. Teams that have worked with their legal department to develop pre-negotiated templates for common data access scenarios — retrospective analysis of de-identified data, federated learning participation, synthetic data derivation from real cohorts — can short-circuit weeks of back-and-forth by starting from a template that both institutions' legal teams have already reviewed and accepted for routine requests.

Invest in data governance infrastructure

Institutions that have robust internal data governance infrastructure — a clinical data warehouse with standardized access protocols, a data stewardship committee that can evaluate and approve requests internally, a data catalog that makes available datasets discoverable — are systematically faster at enabling approved research. If you're at an institution building these capabilities, advocating for investment in them is advocacy for faster AI development timelines as much as it is advocacy for good data governance.

Use synthetic data to reduce the scope of real data requests

A well-considered synthetic data strategy can reduce the amount of real data you need to request. If you can demonstrate on synthetic data that your model architecture works, that your feature engineering is correct, and that your evaluation framework is sound, you can request a smaller, more targeted real dataset for final validation — rather than requesting broad access to everything you might possibly need. Smaller, more targeted requests move faster through the access process and face less scrutiny from data governance committees.

The Hybrid Approach: Synthetic Data for Development, Real Data for Validation

The teams making the most effective use of both synthetic and real clinical data aren't treating them as alternatives. They're using them sequentially, for different purposes, in a development workflow that gets the most value out of each.

Here is what that workflow looks like in practice, drawn from conversations with data science teams at three different health systems who have developed it independently and converged on similar approaches:

Phase 1 — Architecture and feature engineering (using synthetic data): Build the model architecture. Implement and test the feature engineering pipeline. Establish baseline model performance on synthetic validation sets. Identify the most important feature categories. Document the design decisions and their rationale. This phase can begin before any real data access request is submitted, using synthetic data that reflects the clinical domain of interest.

Phase 2 — IRB and access process (running in parallel with Phase 1): Submit the IRB application. Initiate DUA and BAA negotiations. Work through the data governance review. While the access process proceeds, the technical team continues with Phase 1, ensuring that when data access is granted, there is a working pipeline ready to process it.

Phase 3 — Real data validation and refinement (using real data): When real data arrives, run it through the established pipeline. Compare the feature distributions in the real data to the synthetic training data. Identify and address distribution gaps. Retrain the model on the real data. Validate performance on a held-out real cohort. The synthetic-data phase has established a foundation that makes this phase faster and more principled — the team knows what they're looking for and has tested the pipeline to handle it.

Phase 4 — Deployment and monitoring (real data): Deploy the model in the clinical environment. Monitor for performance degradation and distribution shift. Retrain periodically as the underlying patient population and clinical practice patterns evolve. Maintain synthetic data as a stable test set for regression testing when the model is updated.
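The Phase 3 step of comparing real feature distributions to the synthetic training data can be automated per feature. The sketch below uses the population stability index (PSI) as the comparison statistic; that is one common choice, not a method the teams described, and the function names, bin edges, and age values are illustrative assumptions.

```python
import math
from bisect import bisect_right


def bin_fractions(values: list[float], edges: list[float],
                  eps: float = 1e-6) -> list[float]:
    """Share of values in each bin defined by sorted `edges`.

    Empty bins are floored at `eps` so the logarithm in PSI stays defined.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(expected: list[float], actual: list[float],
                               edges: list[float]) -> float:
    """PSI between two samples of one feature; 0.0 means identical bin shares."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Illustrative ages: the real cohort skews older than the synthetic one.
synthetic_ages = [30, 40, 50, 60, 70, 80]
real_ages = [60, 65, 70, 75, 80, 85]
print(population_stability_index(synthetic_ages, real_ages, edges=[45, 65]))
```

Sorting features by this score gives the team a ranked list of distribution gaps to address before retraining, and the same function can be rerun in Phase 4 to watch for drift.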

The teams that have adopted this workflow describe a consistent benefit: they use their real data access period for what real data is uniquely valuable for — final validation and calibration — rather than for building and debugging the infrastructure that synthetic data could have handled. The result is faster progression from data access to deployment, and better-quality models because the architecture decisions were made thoughtfully during the synthetic phase rather than hastily during the constrained real-data phase.

Stop Waiting. Start Building.

Download synthetic patient data across 60+ specialties — clinically realistic, 7 export formats including FHIR R4 and Parquet. HIPAA-free. IRB-free. Realistic comorbidity patterns, temporal relationships, and the domain-specific complexity your model actually needs to learn from. Ready today.

The Model That Could Have Prevented 380 Readmissions Is a Warning, Not Just a Story

The readmission prediction team we opened with eventually got their model deployed. It took two years from first research proposal to first patient flag — 14 months of IRB and access process, followed by 10 months of validation, deployment preparation, clinical workflow integration, and the gradual trust-building that clinical AI deployment requires before the alerts get acted on.

In the two years between the first research proposal and deployment, approximately 760 heart failure patients at that hospital were readmitted within 30 days of discharge. Statistically, the model would have flagged a significant portion of them as high-risk before discharge: patients who could have received enhanced discharge planning, earlier follow-up appointments, remote monitoring, or care management outreach. Some of those readmissions might have been prevented. We'll never know how many.

The team believes, and we believe them, that the hybrid synthetic-and-real approach made their eventual deployment better. Using synthetic data during the access waiting period let them build and test a more robust pipeline than they could have built under the time pressure of a constrained real-data window. The model that deployed was better than the model that would have deployed if they'd spent the waiting period in limbo rather than building.

But "better than it would have been" isn't the same as "as good as it could have been." The 14-month wait was not neutral. It was not merely inconvenient. It was a period during which a working clinical tool was withheld from the patients it was built to serve — not by malice, not by negligence, but by a regulatory system that hasn't yet found a way to move at the pace that modern healthcare AI development requires.

The story of the readmission model is not just a story about one team's struggle with IRB timelines. It's a story about the systems and incentives and regulatory frameworks that determine how quickly healthcare AI can reach the patients it's designed to help. Every team building clinical AI is navigating some version of this story. Every month of delay is a month that the model isn't helping anyone.

That's not a reason to bypass the safeguards that protect patient privacy. It's a reason to use every tool available — including high-quality synthetic data, federated learning where appropriate, and the hybrid development workflow that gets the most out of every phase — to move faster through the system that exists. And it's a reason to advocate, loudly, for a regulatory apparatus that can distinguish between the rare bad actor that the access controls were designed to stop and the many teams of careful, mission-driven researchers who are trying to help — and who shouldn't have to wait 14 months to try.

The model that could have prevented 380 readmissions in a year is a warning. Not about the danger of clinical AI. About the cost of not having it. About what it means when the thing that could help is ready, and the patients who need it are waiting, and the gap between them is not technical. It's paperwork.

Build what you can build now. Fight for what you can't build yet. And don't let either task crowd out the other.