Synthetic vs. Real Patient Data — HIPAA, Licensing, and What to Use When

She had spent eight months on the IRB application. Eight months of protocol revisions, committee reviews, clarification requests, and resubmissions — each round requiring coordination between her company's legal team, the institution's IRB administrator, and the clinical research committee that ultimately had to sign off. When approval finally arrived, it felt like a genuine organizational achievement. She submitted the data use agreement request the same afternoon.

Six weeks later, the DUA came back — fully executed, bearing the signatures of the institution's legal counsel, its privacy officer, and the data governance committee chair. She read it carefully for the first time. Section 7, subsection (c): "Data shall not be used in connection with any commercial product, service, or offering, including but not limited to software products, clinical decision support systems, and analytics platforms offered for sale or license to third parties."

Her company was a digital health startup. Every product they built was commercial. The dataset she needed — longitudinal diabetes management records with linked pharmacy and lab data, one of the richest available for her research question — was not obtainable from any other institution under terms that would work for commercial deployment. She had spent the better part of a year gaining access to data she could not use.

That story is not exceptional. It is the normal arc of a healthcare data access project that didn't start with a clear-eyed understanding of the compliance landscape. The regulatory framework governing real patient data is complex, multi-layered, and designed to protect patient privacy in ways that create significant friction for commercial data use. Understanding when you need real data, when synthetic data works, and what the regulatory machinery actually requires is not a legal nicety. It's the most important project planning decision you'll make.

This article is the guide that the researcher who spent eight months on her IRB wished she'd had before she started.

The Full Compliance Framework for Working with Real PHI

HIPAA is the first name that comes up in any conversation about healthcare data access — but it's not the only regulatory framework that governs the use of real patient data, and in some respects it's not the most restrictive. A complete map of the compliance landscape looks like this:

HIPAA: The Federal Floor

The Health Insurance Portability and Accountability Act of 1996 and its implementing regulations establish baseline federal requirements for the protection of protected health information. The two rules that matter most here are the Privacy Rule (45 CFR Part 160 and Subparts A and E of Part 164) and the Security Rule (45 CFR Part 160 and Subparts A and C of Part 164). PHI is defined broadly: any individually identifiable health information held or transmitted by a covered entity or business associate, in any form.

Covered entities under HIPAA are healthcare providers that conduct certain electronic transactions, health plans, and healthcare clearinghouses. Business associates are persons or entities that perform functions or activities involving the use or disclosure of PHI on behalf of covered entities — a definition that encompasses most data analytics vendors, ML platform providers, and research organizations that receive PHI from hospitals or insurers.

The key HIPAA permission pathways for using PHI outside of direct treatment include: patient authorization, the research exception (which requires IRB or privacy board approval or meets specific criteria for waiver), and several narrow administrative exceptions. For most commercial ML applications, none of these pathways are available without significant process investment.

The Common Rule and IRB Oversight

The Common Rule (45 CFR Part 46) governs human subjects research that is federally funded or conducted at institutions that have signed a Federalwide Assurance with HHS. It requires IRB review and approval before research involving human subjects can begin.

Whether a commercial data project constitutes "research" under the Common Rule is a nuanced question. The definition of research — "a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge" — is broad enough to potentially encompass ML model development that will contribute to generalizable algorithmic knowledge. IRBs at data-holding institutions frequently take the conservative position and require review of any project involving their patients' data, regardless of whether the regulatory definition technically applies.

State Privacy Laws That Exceed HIPAA

Several states have enacted health privacy laws that are more restrictive than HIPAA's federal baseline. California's Confidentiality of Medical Information Act (CMIA) applies to providers, health service plans, pharmaceutical companies, and contractors — a broader covered entity definition than HIPAA. It prohibits disclosure of medical information without patient authorization in terms that are stricter than HIPAA's research exception. The California Consumer Privacy Act (CCPA) and its successor the California Privacy Rights Act (CPRA) create additional obligations for businesses that hold health-adjacent data, including the right to know, right to delete, and right to opt out of sale that HIPAA doesn't address.

Washington State's My Health My Data Act, enacted in 2023 and effective from 2024, applies to entities that aren't covered under HIPAA and governs consumer health data broadly, including any personal information that identifies a consumer's health condition, treatment, or medical history. It includes a private right of action, which HIPAA does not. Texas, Nevada, and Virginia have passed similar laws with varying scopes and effective dates.

The practical implication: a project that is technically HIPAA-compliant may still violate state health privacy laws, depending on where the data originates and where the organization conducting the analysis is located. Legal review of applicable state law is not optional for commercial healthcare data projects.

What IRB Approval Actually Involves

For researchers who haven't navigated IRB approval before, the process is frequently more time-consuming and operationally complex than they expect. Understanding what's involved helps set realistic expectations and informs the decision about whether real data access is worth pursuing at all for a given project phase.

The Application Process

An IRB application for a research project involving patient data typically requires: a full research protocol describing the scientific objectives, design, and methods; a detailed privacy and data security plan; a data use agreement or description of the data access mechanism; a consent waiver request or documentation of why patient consent is not feasible; a justification for the minimum necessary data elements required; and a description of how data will be stored, secured, and destroyed at project completion.

For projects involving large datasets, commercial entities, or novel methods like machine learning, IRBs often request additional documentation: a description of the ML model's intended clinical application, a description of the risks and benefits to subjects, and evidence that the organization has the technical capacity to maintain data security at the required standard.

Committee Review and Typical Timelines

IRB committees typically review applications on a monthly cycle — many committees meet once a month, and submissions are due two to three weeks before the meeting. An initial submission that requires clarification may be tabled to the next cycle, adding four to six weeks to the timeline. Complex protocols may go through full board review rather than expedited review, which requires committee quorum and may require multiple rounds of correspondence.

For commercially oriented projects — particularly those involving AI/ML methods or commercial data analytics companies — IRBs often apply heightened scrutiny. The committee may have concerns about commercial interests affecting the research, about data security at a commercial entity, or about the appropriateness of the research exception when the primary output is a commercial product rather than generalizable scientific knowledge.

A realistic timeline for IRB approval on a complex project involving a commercial entity and a large institutional dataset: six to eighteen months from initial submission to final approval. Simple retrospective chart reviews at an institution where the applicant already has affiliation may move faster — three to six months is achievable for straightforward protocols. Novel AI/ML applications at institutions unfamiliar with these methods should budget twelve months or more.

What Makes a Protocol Approvable

IRBs apply the criteria of 45 CFR 46.111 when evaluating whether to approve research. The key criteria: risks to subjects are minimized and reasonable in relation to anticipated benefits; subject selection is equitable; informed consent is sought or appropriately waived; data monitoring and safety procedures are adequate; privacy and confidentiality protections are adequate.

For ML research projects, the most common grounds for IRB concern are: privacy protection adequacy (how will the data be secured, who will have access, what are the de-identification procedures), commercial interest disclosure (what financial interest does the sponsoring organization have in the results), and the informed consent waiver justification (why can't the research be practically conducted with individual consent).

A protocol that anticipates these concerns and addresses them explicitly — with specific, credible technical security measures, a clear disclosure of commercial interests that doesn't undermine the scientific integrity justification, and a thorough minimum-necessary-data argument — is substantially more likely to receive timely approval than one that provides generic assurances.

Data Use Agreements: What Hospitals Actually Require

Assuming IRB approval is obtained, the Data Use Agreement is the contract that actually governs how the data can be used. DUAs are negotiated individually between the data-providing institution and the data-receiving organization, and the terms vary enormously across institutions.

Standard DUA Provisions

Most institutional DUAs include the following provisions as standard: a description of the specific dataset being transferred, including its contents, format, and size; the specific research purposes for which the data may be used; the specific personnel who may access the data; data security requirements, including encryption standards, access controls, and audit logging; a prohibition on re-identification of individual patients; a requirement to report any security breach or unauthorized access within a specified timeframe (typically 24-72 hours); a data destruction or return requirement at the end of the project; and a term and renewal provision specifying how long the data access agreement remains in effect.

The Commercial Use Clause: The Killer Provision

The provision that kills commercial projects most often is the commercial use restriction. Hospitals and health systems are acutely sensitive to the perception that patient data is being used for commercial gain without explicit patient consent or institutional benefit-sharing. Many standard DUA templates prohibit commercial use categorically — not as a negotiating position but as a firm institutional policy.

Some institutions are willing to negotiate commercial use permissions in exchange for revenue-sharing arrangements, co-authorship rights on resulting publications, or a first right of negotiation on any product developed from the data. These arrangements require longer negotiation timelines and often involve institutional technology transfer offices, which have their own review processes and timelines.

The practical lesson: if your intended use is commercial, ask about the commercial use clause before you start any other part of the process. Do not spend months on IRB applications and data access negotiations without first confirming that the institution is willing to grant commercial use rights in principle.

The Re-identification Covenant

Nearly every DUA governing de-identified data includes a covenant prohibiting re-identification: a legal commitment that the recipient will not attempt to identify any individual from the dataset, will not combine the dataset with other data in ways that could enable re-identification, and will not share the dataset with any party that has not agreed to equivalent re-identification prohibitions.

This covenant applies even to data that has been de-identified under HIPAA's Safe Harbor method. The fact that the data is no longer legally PHI doesn't change the contractual obligation not to re-identify. For ML practitioners, this means that any enrichment of the dataset with external data sources — even publicly available ones — must be carefully reviewed against the DUA's re-identification provisions.

Data Destruction Requirements

Most DUAs require certified data destruction at the end of the project: not just deletion of the data files, but documented destruction using approved methods (cryptographic erasure, overwriting to DoD standards, or physical destruction of media). This requirement extends to all copies of the data, including any copies made for backup purposes, any derived datasets, and any model artifacts that might allow reconstruction of the original data.

The data destruction requirement creates a significant operational challenge for ML projects. The trained model itself may need to be evaluated against the DUA to determine whether it constitutes a "copy" or derived version of the training data that is subject to the destruction requirement. Privacy lawyers with expertise in ML generally advise that trained neural networks are not themselves PHI — they don't contain patient records — but the legal analysis is not universally settled.

The Non-Commercial License Problem in Public Datasets

For teams that can't access institutional data, the natural alternative is public datasets — PhysioNet, Kaggle, government data repositories. These are valuable resources. They are also riddled with licensing restrictions that block the majority of commercial applications.

PhysioNet: Research-Only by Design

PhysioNet, hosted by MIT's Laboratory for Computational Physiology, is the home of MIMIC-III, MIMIC-IV, eICU, and dozens of other clinical datasets. It is the most important public clinical data repository in existence. Its Data Use Agreement, which all credentialed users must sign, explicitly restricts use to "non-commercial research" and "educational purposes."

The PhysioNet DUA's non-commercial restriction is not ambiguous. It states: "I will not use this data for commercial advantage, for private monetary gain, or to gain commercial competitive advantage." This prohibition applies to training commercial ML models as clearly as it applies to selling the data directly. If your company generates revenue from a product that was trained on PhysioNet data, you are in violation of your DUA.

The stakes are significant beyond legal exposure. PhysioNet data is used extensively in published academic research, and many papers describe approaches that were subsequently commercialized. The transition from research to commercial product is exactly where this restriction matters — and it's where teams frequently discover the limitation after significant development investment.

Kaggle: The NC Clause You Missed in the License Section

Kaggle hosts healthcare datasets with a wide range of licenses, and the license details are easily overlooked by engineers focused on evaluating data quality. Creative Commons Attribution-NonCommercial (CC BY-NC) and Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) licenses permit only NonCommercial use, which the license text defines as use "not primarily intended for or directed towards commercial advantage or monetary compensation." These are among the most common licenses on Kaggle healthcare datasets.

The "primarily commercial" qualifier creates ambiguity that teams sometimes use to argue that their internal research pipeline isn't "primarily" commercial. This argument is unlikely to survive scrutiny: if the purpose of training the model is to build a commercial product, the training is in service of commercial advantage regardless of how the pipeline is characterized internally.

The more important point is behavioral: teams that treat Kaggle data access as a licensing technicality rather than a genuine constraint are building on a foundation that could collapse under them. Healthcare is a trust-sensitive industry. A licensing controversy — even one that generates no legal action — can undermine partnerships, regulatory relationships, and customer trust in ways that are difficult to recover from.
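The license check that engineers tend to skip can be made a routine step at data intake. The sketch below is illustrative only: the manifest format, the `DatasetEntry` type, and the `NONCOMMERCIAL_MARKERS` list are assumptions made for this example, not part of Kaggle's API, and a string match is a tripwire for human review rather than a legal determination.

```python
from dataclasses import dataclass

# Substrings that suggest a NonCommercial license. Illustrative, not exhaustive.
NONCOMMERCIAL_MARKERS = ("-NC-", "NONCOMMERCIAL", "NON-COMMERCIAL")


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    license_id: str  # e.g. "CC-BY-NC-SA-4.0", "CC0-1.0", or "" when unknown


def is_blocked_for_commercial_use(entry: DatasetEntry) -> bool:
    """Return True when the license is missing or looks NonCommercial."""
    normalized = entry.license_id.strip().upper().replace(" ", "-")
    if not normalized:
        return True  # unknown license: block until someone reviews it
    # Pad with hyphens so "NC" only matches as a whole license component.
    padded = f"-{normalized}-"
    return any(marker in padded for marker in NONCOMMERCIAL_MARKERS)


def split_manifest(entries):
    """Partition entries into (allowed, needs_review) lists."""
    allowed, needs_review = [], []
    for entry in entries:
        target = needs_review if is_blocked_for_commercial_use(entry) else allowed
        target.append(entry)
    return allowed, needs_review


# Hypothetical manifest; the dataset names are made up for the example.
manifest = [
    DatasetEntry("hypothetical-diabetes-set", "CC BY-NC 4.0"),
    DatasetEntry("hypothetical-claims-set", "CC0-1.0"),
    DatasetEntry("hypothetical-unlabeled-set", ""),
]
allowed, needs_review = split_manifest(manifest)
print("allowed:", [e.name for e in allowed])
print("needs review:", [e.name for e in needs_review])
```

Treating a missing license as blocked is a deliberate choice in this sketch: an unlabeled dataset is a question for counsel, not a green light.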

MIMIC-IV and the Successor Problem

MIMIC-IV, the successor to MIMIC-III, uses a substantially similar Data Use Agreement with the same non-commercial restriction. Some teams have sought to argue that the updated agreement has modified terms — it has, in some respects, but not with respect to the commercial use prohibition. MIMIC-IV remains research-only for the same institutional and ethical reasons that MIMIC-III was research-only.

The PhysioNet team has been clear in public communications that commercial use of PhysioNet datasets is not permitted and that they intend to enforce this restriction. Teams that are building commercial products on PhysioNet data are exposed to legal risk that is not theoretical.

The Decision Framework: 8 Questions to Determine Whether You Need Real Data

The decision between real and synthetic data is not a technical question — it's a question about the regulatory requirements and scientific standards of your specific use case. These eight questions provide a systematic framework for making that decision:

The 8-Question Real vs. Synthetic Decision Framework

1. Is this project subject to FDA regulatory requirements? Clinical decision support tools, AI/ML-enabled medical devices, and in vitro diagnostic software that will be marketed or used in clinical practice may require FDA submission, and FDA submissions require evidence based on real patient outcomes. If FDA clearance or approval is required, real data validation is non-negotiable at the submission stage. Development can still proceed on synthetic data.

2. Are you generating real-world evidence for a regulatory body? CMS coverage decisions for new technologies increasingly require real-world evidence studies. FDA post-market surveillance requirements may require real-world data collection. Drug approval supplemental applications based on real-world evidence require real patient data. If your output is regulatory evidence, you need real data for the evidence generation phase.

3. Will the model's outputs directly affect clinical decisions for real patients before additional validation? A model that will be deployed into clinical workflows and whose outputs will influence clinical decisions (ordering tests, recommending treatments, triaging patients) requires real data validation before deployment, regardless of how it was developed. The validation phase requires real data; development can use synthetic.

4. Is the scientific contribution dependent on real population statistics? If you're publishing epidemiological research, actuarial analyses, or public health studies where the specific prevalence rates and outcome rates in a real population are the scientific contribution, you need real data. If the contribution is an algorithmic method that generalizes across realistic populations, synthetic data may be adequate.

5. Does your customer or partner explicitly require models trained on real patient data? Some institutional customers, particularly large health systems and payers, have policies requiring that vendor ML models be validated on real patient data before deployment. Check this requirement before starting development. If the customer requires real data validation, plan for it, but you can still develop on synthetic until the validation phase.

6. Is the model being used to price or underwrite a live insurance product? Actuarial modeling for active insurance products (setting premiums, underwriting risk) requires real claims and outcomes data. Synthetic data can support model architecture and feature engineering development, but the statistical basis for active pricing decisions must ultimately reflect real populations.

7. Are you currently in algorithm development, pipeline construction, or testing? If yes, synthetic data is almost certainly adequate and appropriate. The purpose of development-phase data is to give the algorithm something realistic to learn from and the pipeline something realistic to process. Synthetic data that mirrors real clinical patterns serves this purpose without the access barriers and licensing restrictions of real data.

8. Does your timeline accommodate the real data access process? If your development timeline is measured in weeks, real data access is not compatible: IRB review, DUA negotiation, and data transfer processes routinely take six to eighteen months. The mismatch between the speed of commercial product development and the pace of healthcare data access is itself a reason to start with synthetic data.

The correct interpretation of this framework: if you answered "yes" to any of questions 1 through 6, real data will be required at some stage, but that doesn't mean it's required now. Questions 7 and 8 determine whether this stage of the project can proceed on synthetic data. For most teams, the answer is yes for a longer period of the project lifecycle than they initially assumed.
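The eight questions can be encoded as a small checklist so that the answer is recorded rather than argued from memory. This is a minimal sketch, not a compliance tool: the `Answers` type, the field names, and the `assess` function are hypothetical, and combining questions 7 and 8 with a logical "or" is one reading of the framework rather than a rule stated in it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answers:
    fda_regulated: bool                # Q1
    regulatory_evidence: bool          # Q2
    affects_clinical_decisions: bool   # Q3
    needs_population_statistics: bool  # Q4
    customer_requires_real: bool       # Q5
    live_insurance_pricing: bool       # Q6
    in_development_phase: bool         # Q7
    timeline_fits_real_access: bool    # Q8


def assess(a: Answers) -> dict:
    """Summarize the framework: Q1-6 decide 'eventually', Q7-8 decide 'now'."""
    real_data_required_at_some_stage = any([
        a.fda_regulated,
        a.regulatory_evidence,
        a.affects_clinical_decisions,
        a.needs_population_statistics,
        a.customer_requires_real,
        a.live_insurance_pricing,
    ])
    # Assumption: either a development-phase project (Q7) or a timeline that
    # cannot absorb the access process (Q8) is enough to start on synthetic.
    start_on_synthetic = a.in_development_phase or not a.timeline_fits_real_access
    return {
        "real_data_required_at_some_stage": real_data_required_at_some_stage,
        "start_on_synthetic": start_on_synthetic,
    }


# Hypothetical example: an FDA-bound product still in early development.
example = Answers(
    fda_regulated=True,
    regulatory_evidence=False,
    affects_clinical_decisions=True,
    needs_population_statistics=False,
    customer_requires_real=False,
    live_insurance_pricing=False,
    in_development_phase=True,
    timeline_fits_real_access=False,
)
print(assess(example))
```

The two outputs are deliberately independent: a project can need real data eventually and still be right to start on synthetic data today.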

The Legitimate Uses of Synthetic Data: A Comprehensive Map

There is still a pervasive sense in parts of the healthcare data community that synthetic data is a second-best substitute — something you use when you can't get the real thing. This framing is increasingly inaccurate. Synthetic data is not a substitute for real data in cases where real data is genuinely required. But it is the clearly correct choice — not a compromise — in a large set of important use cases.

Algorithm Development and Architecture Design

When you're determining whether a transformer architecture or a gradient boosting approach better captures temporal dynamics in clinical data, you need data that behaves like real clinical data — not data that is real clinical data. Synthetic data calibrated to real clinical distributions supports this investigation as well as real data does, and often better, because you can generate specific distributions that stress-test your architecture in controlled ways.

Data Pipeline Development and Integration Testing

Building FHIR parsers, HL7 message handlers, claim ingestion pipelines, and clinical data warehouse ETL processes requires data to run against. Using synthetic data in these pipelines carries zero compliance risk — a junior engineer who accidentally logs a patient record to an unencrypted log file creates a potential HIPAA breach when the data is real; with synthetic data, the same incident has no compliance implications at all.

Software Testing and QA with Edge Cases

Synthetic data can be generated with specific edge cases that real datasets may not contain in sufficient quantity for reliable testing: patients with unusual comorbidity combinations, records with systematic missing fields, encounters that span institutional boundaries, patients with rare diagnoses that don't appear in sufficient numbers in real training sets to support meaningful testing. With synthetic data, you control the edge case distribution. With real data, you test on the edge cases the data happens to contain — which may not be the edge cases that matter most for your use case.
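Controlling the edge case distribution can be as simple as exposing the rates as parameters. The sketch below is a toy generator under stated assumptions: the record fields, the `generate_records` function, and the diagnosis lists are invented for illustration, and the values are drawn uniformly rather than calibrated to any clinical distribution.

```python
import random

# Illustrative ICD-10-CM codes; the lists are assumptions for this example.
COMMON_DX = ["E11.9", "I10", "E78.5"]  # type 2 diabetes, hypertension, hyperlipidemia
RARE_DX = ["E84.0"]                    # cystic fibrosis with pulmonary manifestations


def generate_records(n, rare_dx_rate, missing_lab_rate, seed=0):
    """Generate n toy records with a controlled share of edge cases."""
    rng = random.Random(seed)  # seeded so test fixtures are reproducible
    records = []
    for i in range(n):
        is_rare = rng.random() < rare_dx_rate
        lab_missing = rng.random() < missing_lab_rate
        records.append({
            "patient_id": f"SYN-{i:06d}",
            "age": rng.randint(18, 90),
            "dx": rng.choice(RARE_DX if is_rare else COMMON_DX),
            "hba1c": None if lab_missing else round(rng.uniform(4.5, 12.0), 1),
            "synthetic": True,  # label every record so it is never mistaken for PHI
        })
    return records


# Oversample the edge cases a QA suite needs: 30% rare diagnoses, 20% missing labs.
fixture = generate_records(1000, rare_dx_rate=0.30, missing_lab_rate=0.20, seed=42)
rare_share = sum(r["dx"] in RARE_DX for r in fixture) / len(fixture)
missing_share = sum(r["hba1c"] is None for r in fixture) / len(fixture)
print(f"rare dx share: {rare_share:.2f}, missing lab share: {missing_share:.2f}")
```

A real dataset fixes these shares at whatever the source population happened to contain; here they are arguments you can set per test suite.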

Demonstrating Algorithms to Investors, Partners, and Customers

Investor demonstrations, partner evaluations, and customer proof-of-concept demonstrations often require showing a working algorithm on realistic clinical data. Showing real patient records in a sales meeting, investor pitch, or partner demo is ethically and legally indefensible regardless of the de-identification status of the data. Synthetic data — clearly labeled as such — is the only appropriate vehicle for demonstrating healthcare ML capabilities in these contexts.

Sharing Data Across Organizational Boundaries

When you're working with offshore development teams, academic collaborators, third-party vendors, or other external parties, sharing real patient data — even de-identified — creates compliance obligations: HIPAA business associate agreements, DUA subcontracting provisions, data transfer agreements if data crosses national boundaries. Synthetic data carries none of these obligations. You can email a synthetic dataset to a contractor in another country without any of the legal infrastructure that sharing real data would require.

Training and Onboarding New Team Members

Getting new data scientists, engineers, and clinical informatics staff familiar with healthcare data formats — ICD-10 hierarchies, FHIR resource types, HL7 message structures, claim adjudication logic — requires practice data. Using real patient data for team training and onboarding is both legally risky and ethically questionable. The HIPAA minimum necessary standard requires that PHI be used only to the extent necessary for the intended purpose; familiarizing a new hire with data formats is not a purpose that justifies PHI exposure.

Education, Publications, and Conference Presentations

Academic papers that demonstrate novel ML methods, conference presentations that show realistic clinical data examples, and educational content that teaches clinical informatics all require realistic healthcare data that can be published or displayed publicly. De-identified real data carries residual re-identification risk, and public release is where that risk is hardest to control. Synthetic data that contains no real patient records does not carry it.

When You Still Need Real Data: Cases Where Synthetic Won't Suffice

Honest treatment of this topic requires acknowledging where synthetic data is genuinely inadequate — not just where the access barriers make real data difficult to obtain, but where the scientific or regulatory requirements mean real data is necessary.

FDA 510(k) clearance and De Novo submissions for AI/ML-enabled medical devices. The FDA's guidance on AI/ML-based software as a medical device requires clinical validation studies based on real patient data from real clinical settings. Neither a predicate-based 510(k) nor a De Novo application can rely solely on synthetic data for its clinical performance evidence.

Final validation before clinical deployment. Any algorithm that will make or inform clinical decisions about real patients, even one developed entirely on synthetic data, requires prospective or retrospective validation on real patient outcomes before clinical deployment. This is a scientific rather than regulatory requirement, but it is a firm one.

Epidemiological research where real population prevalence rates are the finding. If you're studying the actual prevalence of a condition, the actual rate of a complication, or the actual distribution of a clinical characteristic in a population, synthetic data cannot substitute for observation of real events in real people.

Clinical trials and interventional research. By definition, clinical trials measure what happens to actual patients when an intervention is applied. There is no synthetic substitute for this.

Actuarial tables and risk models used for active insurance pricing. Regulatory requirements for actuarial model support in insurance generally require that the statistical basis for pricing decisions derive from real claims and outcomes data.

De-identification as a Middle Path: When It Works and When It Doesn't

De-identification — removing identifying information from real patient data under one of HIPAA's approved methods — is often proposed as a middle path between using real PHI and using synthetic data. The appeal is intuitive: keep the real clinical signal, remove the identifying information. In practice, de-identification is a more complex and imperfect solution than its proponents often acknowledge.

HIPAA's Safe Harbor de-identification method requires removing 18 specific categories of identifiers. When properly executed, it produces data that is no longer legally PHI and is therefore not subject to HIPAA's use and disclosure restrictions. But "not legally PHI" is not the same as "impossible to re-identify." Research by Latanya Sweeney and others has demonstrated repeatedly that de-identified data from healthcare datasets can be re-identified with high accuracy when combined with publicly available data. Five-digit ZIP code, full date of birth, and sex alone are sufficient to uniquely identify approximately 87% of the US population, according to Sweeney's landmark 2000 analysis. Safe Harbor removes full birth dates and truncates ZIP codes for exactly this reason, but the finding shows how few ordinary attributes it takes to single someone out, and clinical fields that survive de-identification, such as rare diagnoses, can play the same role when linked with outside data.
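The uniqueness argument is easy to check on your own tables before any release decision. The sketch below measures what share of records is unique on a set of quasi-identifiers, then repeats the measurement after coarsening. The function names and the four toy rows are invented for illustration, and the coarsening shown (first three ZIP digits, birth year only) is a simplification: Safe Harbor has additional conditions, such as handling of low-population ZIP prefixes and ages over 89, that this sketch does not implement.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers=("zip", "dob", "sex")):
    """Share of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def coarsen(record):
    """Simplified generalization: 3-digit ZIP prefix and birth year only."""
    return {
        "zip": record["zip"][:3],
        "dob": record["dob"][:4],  # ISO date string, so the first 4 chars are the year
        "sex": record["sex"],
    }


# Made-up rows for the example; not drawn from any real dataset.
rows = [
    {"zip": "02139", "dob": "1975-03-02", "sex": "F"},
    {"zip": "02139", "dob": "1975-03-02", "sex": "F"},
    {"zip": "02139", "dob": "1975-08-19", "sex": "F"},
    {"zip": "02141", "dob": "1975-01-30", "sex": "F"},
]

print("unique before coarsening:", unique_fraction(rows))
print("unique after coarsening: ", unique_fraction([coarsen(r) for r in rows]))
```

A low unique fraction is necessary but not sufficient: it says nothing about what an attacker can learn by joining the remaining clinical fields against outside data.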

De-identification also doesn't eliminate the DUA restrictions that accompany most institutional datasets. A DUA that restricts commercial use applies to the de-identified data as much as to the raw PHI — the de-identification changes the HIPAA status of the data but doesn't change the contractual terms under which it was licensed.

For these reasons, de-identification works well as a path to broader sharing of research datasets where re-identification risk has been carefully analyzed and the distribution context is controlled. It works poorly as a general solution to the commercial data access problem when DUAs with commercial use restrictions are in place.

The Hidden Costs of Real Data Access

When teams budget for real data access, they typically account for the direct costs: IRB application fees, data transfer fees, storage infrastructure. The larger costs are frequently invisible until you're in the middle of them.

Legal Review and DUA Negotiation

DUA negotiation for complex commercial projects routinely involves your legal counsel, the institution's legal counsel, and sometimes a technology transfer attorney. Legal review time at commercial rates ($400-$800 per hour for healthcare regulatory attorneys) can accumulate quickly over a multi-month negotiation. A DUA that goes through three rounds of negotiation across six months, with two to three hours of legal review per round, represents six to nine hours, or roughly $2,400-$7,200, for document review alone. Once redlining, calls with the institution's counsel, and internal briefings are added, $5,000-$15,000 in legal fees before the first byte of data is transferred is a realistic range.

IRB Coordination and Protocol Development

Writing a credible IRB application for a complex ML project requires significant time from technically sophisticated people: the principal investigator, a biostatistician, and often a privacy officer or IRB-savvy attorney. The opportunity cost of the key personnel involved in IRB application and revision cycles is rarely reflected in project budgets.

Data Security Infrastructure

HIPAA-compliant data storage for real PHI requires specific technical safeguards: encryption at rest and in transit, access logging, role-based access control, periodic access review, incident response procedures, and business continuity planning. If your organization doesn't already have this infrastructure, building it to satisfy the DUA's security requirements is a non-trivial investment. The major cloud providers (AWS, Microsoft Azure, and Google Cloud, including its Cloud Healthcare API) offer HIPAA-eligible services under a business associate agreement, but these require configuration expertise and can carry additional cost, and signing a BAA does not by itself make a deployment compliant.

The Hidden Cost of Scope Creep

Real data access agreements are scoped to specific datasets and specific research purposes. When your research question evolves — as it always does — the DUA may not cover the new direction. Expanding the data request or changing the research scope may require amendment procedures that restart significant portions of the approval process. Scope creep in real-data projects is not just a project management problem — it's a compliance problem that can require re-engaging the IRB, re-negotiating the DUA, and re-obtaining institutional approvals.

The ROI Calculation: When Synthetic Data Development Pays Off

The ROI of synthetic data use in healthcare ML projects can be calculated concretely. Consider a team that needs training data for a readmission prediction model:

Path A: Real Data. IRB application and revision: 3 months. IRB approval: add 6 months (conservative estimate for a complex commercial project). DUA negotiation: add 3 months. Legal review of DUA: $12,000. Data transfer and security infrastructure: $15,000. Total time to first training run: 12 months. Total additional cost: $27,000+ before data quality assessment.

Path B: Synthetic Data First. Dataset acquisition from a commercial provider: 1 day. License review: 2 hours. Data quality assessment: 1 week. Total time to first training run: 2 weeks. Total cost: the dataset price (typically $500-$5,000 for commercial synthetic healthcare datasets at the scale needed for readmission modeling).
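The two paths can be totaled directly from the figures above. This is a rough sketch: the 30-day month and 8-hour workday are simplifying assumptions, the steps within each path are treated as sequential, and staff time is not costed.

```python
HOURS_PER_WORKDAY = 8   # ASSUMPTION for converting the 2-hour license review
DAYS_PER_WEEK = 7
DAYS_PER_MONTH = 30     # ASSUMPTION: simplified month length

# (step, duration in days, direct cost in USD), using the figures quoted above.
PATH_A = [
    ("IRB application and revision", 3 * DAYS_PER_MONTH, 0),
    ("IRB approval", 6 * DAYS_PER_MONTH, 0),
    ("DUA negotiation and legal review", 3 * DAYS_PER_MONTH, 12_000),
    ("Data transfer and security infrastructure", 0, 15_000),
]
PATH_B = [
    ("Dataset acquisition", 1, 0),
    ("License review", 2 / HOURS_PER_WORKDAY, 0),
    ("Data quality assessment", 1 * DAYS_PER_WEEK, 0),
]
DATASET_PRICE = (500, 5_000)  # USD range quoted above for Path B


def totals(steps):
    """Sum durations (days) and direct costs (USD) across sequential steps."""
    days = sum(duration for _, duration, _ in steps)
    cost = sum(cost for _, _, cost in steps)
    return days, cost


a_days, a_cost = totals(PATH_A)
b_days, _ = totals(PATH_B)

# Direct cost difference at the high and low ends of the dataset price range.
savings = (a_cost - DATASET_PRICE[1], a_cost - DATASET_PRICE[0])

print(a_days / DAYS_PER_MONTH, a_cost)  # 12.0 27000
print(b_days)                           # 8.25
print(savings)                          # (22000, 26500)
```

The Path B steps sum to a little over eight days, so the two-week figure leaves some slack for procurement and setup.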

The ROI calculation is not close in most project phases. The false economy argument against synthetic data — that you'll have to "redo" everything on real data — underestimates two things: first, how much of the model architecture, feature engineering approach, and pipeline infrastructure carries over from synthetic to real data validation; second, how much faster the real data validation phase goes when you arrive with a working system rather than a blank slate. Teams that build on synthetic data and then validate on real data move faster in the validation phase than teams that waited 12 months for real data before starting.

What Privacy Lawyers and IRB Coordinators Actually Say

"The question I get most often from commercial clients working in healthcare data is: how do we get access to patient data faster? And my answer, more often than not, is: step back and ask whether you actually need patient data for this phase of the project. In my experience, most teams are at a phase where synthetic data would serve just as well — and the months they'd spend on IRB and DUA they could spend building a better model."

That perspective is increasingly common among attorneys who work extensively with healthcare data and technology companies. The legal community that deals with healthcare data access has developed a pragmatic view: real data access is an expensive, time-consuming process that is appropriate when necessary and should be deferred when it isn't.

IRB coordinators at major research institutions tell a similar story from the institutional side. They report that commercial ML projects are among the most labor-intensive applications they review, requiring more rounds of clarification and more committee scrutiny than traditional research protocols. They also report that many applications are premature: the applicant is seeking access to data before having a research question clear enough to support a credible IRB protocol. The teams that arrive with a clear, specific, well-justified data request, often because they've already done extensive development work on synthetic data, move through the IRB process faster than teams that haven't.

PatientDatasets: The Commercial License That Makes It Viable

The central practical limitation of many public synthetic data options is licensing. Some permit commercial use (Synthea, for example, is released under the Apache 2.0 license), but many open-source datasets and research-grade synthetic repositories carry non-commercial or research-only terms. They're available, but they're not licensed in the way that production product development requires.

PatientDatasets provides synthetic patient datasets under an explicit commercial license: no Non-Commercial restriction, no research-only limitation, no prohibition on training commercial ML models or deploying those models in commercial products. The license is the foundation on which everything else is built, because a dataset with excellent clinical content but an NC license is not usable for the majority of commercial healthcare technology applications.

Beyond licensing, PatientDatasets provides what the commercial use case requires: clinical depth calibrated to real epidemiology, CPT codes and billing data for revenue cycle applications, realistic ICD-10 distributions including appropriate rare diagnosis frequencies, relational schema with documented foreign key relationships, and multiple export formats that serve diverse team workflows.

No IRB. No DUA. No Waiting.

Synthetic patient data across 60+ specialties — explicitly commercially licensed, clinically realistic, with relational schema, CPT codes, and lab values calibrated to real population distributions. Available immediately. Skip the 12-month access process and start building the model that will define your product. Get a free sample dataset today and see the quality for yourself.

The researcher who spent eight months on her IRB application eventually did get real data she could use commercially: from a different institution, under different DUA terms, eighteen months after she started. By that point, her team had built nearly the entire development pipeline on synthetic data, and the real data integration and validation took weeks rather than months. The model they'd built on synthetic data transferred well to real data with calibration adjustments, because the clinical patterns in the synthetic data were realistic enough to provide a strong algorithmic foundation.

She told us she wished she'd understood earlier that the decision between real and synthetic data wasn't binary. Starting with synthetic data didn't delay the project — it accelerated the phase that mattered most. The real data phase, when it came, was focused and purposeful: validation, calibration, and deployment preparation, not exploration and development. The months of IRB and DUA process weren't wasted, because they happened concurrently with development rather than blocking it.

That's the insight that sophisticated teams in healthcare AI have internalized. The question is never "synthetic or real." It's "synthetic now, real when it genuinely matters — and in parallel, not in sequence."