It is Friday afternoon at 3:15. Your sprint demo is Monday morning. The feature you're demoing requires realistic patient records, not the four recycled fake patients your team has been using for six months. Someone on Slack suggests Synthea. You clone the repo, read the README, and your stomach sinks. Java runtime environment. Gradle build. A configuration system that spans multiple directories. Module customization for the specific disease populations you want. A generation run that, depending on your patient count and your hardware, takes somewhere between twenty minutes and several hours. Output in FHIR, C-CDA, or Synthea's own CSV schema, none of which is what your application expects. Preprocessing required to get it into any format you can actually load.

By 5 PM, you have a running Synthea instance that produces no records because you haven't configured the disease modules correctly. By 7 PM, you have records that look nothing like what your application needs because Synthea doesn't include CPT codes and your billing module expects CPT codes. By 9 PM, you are questioning whether you should have just built four more fake patients in JSON and called it a demo.

Synthea is a powerful, well-maintained, academically respected tool. It is also the wrong tool for a significant number of use cases. This article is an honest, detailed assessment of Synthea and five alternative approaches to getting synthetic patient data: what each option actually produces, what it actually takes to use it, who it is right for, and whose Friday afternoon it will waste.

The Friday afternoon data problem is more common than teams admit, because admitting it requires acknowledging that the project's test data strategy was never properly planned. Most healthcare software projects start with a handful of hand-crafted fake patients in a JSON file somewhere. Those patients age poorly — they accumulate special-case logic in the code, they don't exercise edge cases, and they stop representing realistic clinical scenarios as the application matures. Then a demo is scheduled, a stakeholder asks "but does it work with real patient data?" and someone opens a new browser tab and types "synthetic patient data generator." This is where the evaluation usually begins, far later than it should.

What Synthea Actually Is and What It Generates

Synthea — developed at MITRE Corporation and open-sourced under an Apache 2.0 license — is a synthetic patient population generator built on clinically validated disease progression models. The project began in 2016 as a research initiative to produce realistic patient populations for healthcare IT testing without using real patient data. It is now used in FDA digital health submissions, academic research, EHR vendor testing environments, and healthcare interoperability demonstrations worldwide.

At its core, Synthea works by simulating patient lifecycles. Each synthetic patient is born with a gender, ethnicity, geographic location, and socioeconomic context drawn from US census data. The patient then progresses through time — acquiring conditions based on statistically realistic incidence rates, encountering the healthcare system for prevention and treatment, receiving medications, having lab results, and eventually, in Synthea's model, dying of realistic causes at actuarially plausible ages. The disease progression logic is encoded in "modules" — JSON state machine files that define the sequence of clinical events for specific conditions.
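
To make the state-machine idea concrete, here is a minimal sketch in Python. It is not Synthea's module format or engine: the state names, the transition probabilities, and the `run_module` helper are all hypothetical, chosen only to show how a module can be read as a graph of states with weighted transitions.

```python
import random

# Hypothetical toy module: each state lists (next_state, probability) pairs.
# State names and probabilities are illustrative, not taken from Synthea.
TOY_MODULE = {
    "Initial": [("Wellness_Visit", 1.0)],
    "Wellness_Visit": [("Prediabetes_Onset", 0.3), ("Terminal", 0.7)],
    "Prediabetes_Onset": [("Diabetes_Onset", 0.5), ("Terminal", 0.5)],
    "Diabetes_Onset": [("Terminal", 1.0)],
    "Terminal": [],
}


def run_module(module, rng):
    """Walk the state graph from Initial until a state with no transitions."""
    state = "Initial"
    history = [state]
    while module[state]:
        targets, weights = zip(*module[state])
        state = rng.choices(targets, weights=weights, k=1)[0]
        history.append(state)
    return history


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so repeated runs give the same patients
    for patient_id in range(3):
        print(patient_id, " -> ".join(run_module(TOY_MODULE, rng)))
```

A real module additionally attaches clinical codes, delays, and guard conditions to its states, which is where most of the authoring effort goes.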

Synthea currently ships with approximately 80 disease and condition modules covering common conditions.

Output formats include FHIR R4 bundles (JSON), FHIR DSTU2 bundles, C-CDA documents, and CSV files (in a Synthea-specific schema). OMOP CDM is not a native export; it is typically reached by running a separate ETL step over the CSV output. The FHIR output includes Patient, Encounter, Condition, Observation, MedicationRequest, Procedure (using SNOMED procedure codes), Immunization, DiagnosticReport, and several other resource types.

The Honest Synthea Setup Experience

The Synthea GitHub README describes the quickstart process as: clone the repo, run a single Gradle command, done. The actual experience for most developers — particularly those who don't regularly work in Java ecosystems — is more involved.

Step one: Java. Synthea requires a Java Development Kit (JDK), version 11 or later, because the project is built from source. Many developers don't have a JDK installed or have the wrong version. Installing the correct version, setting JAVA_HOME correctly, and ensuring the build tool (Gradle) can find it takes anywhere from five minutes (if you know Java environments) to forty-five minutes (if you don't).

Step two: the build. Synthea uses Gradle as its build tool. The first build downloads all dependencies — a substantial number of Java dependencies — and compiles the source. On a fast connection, this takes 3-5 minutes. On a slow connection, longer. First-time build failures from dependency resolution errors are common enough that the Synthea project maintains an active issues tracker with build troubleshooting entries.

Step three: configuration. The default Synthea run generates patients from Massachusetts. If you need a different geographic area, different demographic distribution, or a specific mix of conditions, you need to modify configuration files. The primary configuration file at src/main/resources/synthea.properties has dozens of options. The demographic data that drives realistic population distributions lives in a separate set of CSV files. Getting from "I want 1,000 patients with a realistic mix of chronic conditions from the Pacific Northwest" to actual configuration requires reading several documentation pages and understanding the demographic configuration system.
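
Before editing the properties file, it can help to see which exporters are currently switched on. The sketch below parses simple `key = value` lines and lists the enabled exporters. The key names in `SAMPLE` are assumptions modeled on an `exporter.<name>.export` pattern, and both helper functions are hypothetical; check the keys against the file in your own checkout. The parser also ignores Java properties features such as `:` separators and line continuations.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def enabled_exporters(props):
    """Return names whose 'exporter.<name>.export' key is set to true."""
    names = []
    for key, value in props.items():
        parts = key.split(".")
        if len(parts) == 3 and parts[0] == "exporter" and parts[2] == "export":
            if value.lower() == "true":
                names.append(parts[1])
    return sorted(names)


# Assumed sample content for illustration; not copied from the Synthea repo.
SAMPLE = """
# output settings
exporter.baseDirectory = ./output/
exporter.fhir.export = true
exporter.ccda.export = false
exporter.csv.export = true
"""

if __name__ == "__main__":
    print(enabled_exporters(parse_properties(SAMPLE)))
```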

Step four: module customization. If the conditions you need aren't covered by the 80 existing modules — or if you need specific prevalence rates different from Synthea's defaults — you write or modify JSON module files. The module format is well-documented, but writing clinically valid modules requires understanding the state machine model and knowledge of realistic clinical event sequences. This is a non-trivial task for teams without clinical informatics expertise.

Step five: the generation run. With everything configured, you run the generation command. For 1,000 patients with a few complex conditions, expect 10-30 minutes on a modern laptop. For 10,000 patients, proportionally longer. The output lands in an output directory that you then need to navigate.
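
If you script the run, a small wrapper keeps the arguments explicit. This sketch only assembles the argument list. The `run_synthea` launcher name and the `-p` (population) and `-s` (seed) flags follow the usage line in Synthea's README as far as it is known here, so treat them as assumptions and confirm them against your checkout. `build_synthea_command` is a hypothetical helper.

```python
import shlex


def build_synthea_command(population, state=None, city=None, seed=None):
    """Assemble an argument list for the run_synthea launcher script."""
    if city and not state:
        raise ValueError("a city requires a state")
    cmd = ["./run_synthea"]
    if seed is not None:
        cmd += ["-s", str(seed)]  # fixed seed for repeatable output
    cmd += ["-p", str(population)]
    if state:
        cmd.append(state)
    if city:
        cmd.append(city)
    return cmd


if __name__ == "__main__":
    cmd = build_synthea_command(1000, state="Washington", seed=42)
    print(shlex.join(cmd))
    # To execute from the repository root:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing a seed is worth the extra flag for demos, since it lets you regenerate the same population after a configuration change.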

Step six: output processing. Synthea's FHIR output is a directory of individual JSON files — one bundle per patient. If your application expects a database, a structured CSV, an HL7 v2 feed, or a 837P claim file, you have another processing step. If your application expects a FHIR server (not just flat files), you need to load those bundles into a FHIR server — which is its own setup task.
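
The flattening step is usually the first piece of that processing. Below is a minimal sketch that pulls Condition resources out of one FHIR R4 bundle and writes them as CSV text. The sample bundle is hand-written for illustration and is not actual Synthea output; the column choice and both function names are assumptions.

```python
import csv
import io


def flatten_conditions(bundle):
    """Turn Condition resources in a FHIR R4 Bundle into flat row dicts."""
    rows = []
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        if res.get("resourceType") != "Condition":
            continue
        codings = res.get("code", {}).get("coding", [])
        coding = codings[0] if codings else {}
        rows.append({
            "patient": res.get("subject", {}).get("reference", ""),
            "system": coding.get("system", ""),
            "code": coding.get("code", ""),
            "display": coding.get("display", ""),
            "onset": res.get("onsetDateTime", ""),
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    fields = ["patient", "system", "code", "display", "onset"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hand-written sample bundle for illustration only.
SAMPLE_BUNDLE = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {
            "resourceType": "Condition",
            "subject": {"reference": "urn:uuid:p1"},
            "code": {"coding": [{
                "system": "http://snomed.info/sct",
                "code": "44054006",
                "display": "Diabetes mellitus type 2",
            }]},
            "onsetDateTime": "2015-03-01T09:00:00-05:00",
        }},
    ],
}

if __name__ == "__main__":
    print(rows_to_csv(flatten_conditions(SAMPLE_BUNDLE)))
```

For a whole output directory, you would loop over the JSON files (for example with `pathlib.Path(...).glob("*.json")` and `json.load`) and concatenate the rows; the directory path depends on your configuration.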

The six-hour estimate in this article's title is not hyperbole. It represents the realistic time from "I will try Synthea" to "I have usable patient records in a format my application can consume" for a developer who is competent but unfamiliar with Java build environments and FHIR data processing pipelines. For teams with Java expertise and existing FHIR infrastructure, setup is faster. For teams without those assets, it can take longer.

What Synthea Generates Well — and What It Lacks

Being specific about Synthea's capabilities and limitations is important because it is the most widely cited synthetic patient data tool, and that citation frequency creates an impression that it is a general-purpose solution when it is a specialized one.

What Synthea Does Well

Longitudinal patient histories with clinically plausible disease progression. If you want to model how a patient's Type 2 diabetes develops over fifteen years — the A1c trajectory, the complication emergence, the medication changes from metformin to insulin — Synthea's disease module framework does this better than any alternative. The temporal coherence of Synthea records is its strongest feature: a patient's lab values track with their conditions, their medications are appropriate for their diagnoses, their encounters are spaced realistically.

FHIR R4 resource structure. Synthea's FHIR output validates against the FHIR R4 specification. The resource structure — Patient, Encounter, Condition, Observation, MedicationRequest, Procedure — follows the FHIR specification correctly. For testing FHIR parsers and FHIR-native applications, Synthea provides structurally valid records.

US-realistic demographic distributions. Synthea uses US census data to generate realistic age, gender, and ethnicity distributions by geographic area. The synthetic patient population has realistic prevalence rates for common conditions calibrated to US epidemiology. For population-level analytics research, this statistical realism is valuable.

Active open-source community. The Synthea project has a large, active community and is maintained by MITRE with ongoing development. The disease module library grows regularly. Bug reports are addressed. This is not an abandoned academic project — it is actively used and maintained.

What Synthea Lacks

CPT procedure codes. This is the most significant limitation for the majority of healthcare software development use cases. Synthea uses SNOMED CT procedure codes, not CPT codes. If your application processes procedure codes for billing, prior authorization, scheduling, or analytics, Synthea's SNOMED procedure codes are not what you need. There is no Synthea configuration option to substitute CPT codes; it is a foundational design decision of the project.
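
You can verify this on your own output in a few lines. The sketch below counts which coding systems appear on Procedure resources in a bundle. The two system URIs are the identifiers FHIR uses for SNOMED CT and CPT; the sample bundle is hand-written for illustration, and the function names are hypothetical.

```python
from collections import Counter

SNOMED_SYSTEM = "http://snomed.info/sct"
CPT_SYSTEM = "http://www.ama-assn.org/go/cpt"


def procedure_code_systems(bundle):
    """Count the coding systems used by Procedure resources in a Bundle."""
    counts = Counter()
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        if res.get("resourceType") != "Procedure":
            continue
        for coding in res.get("code", {}).get("coding", []):
            counts[coding.get("system", "(no system)")] += 1
    return counts


def has_cpt(bundle):
    """True if at least one Procedure coding uses the CPT system URI."""
    return procedure_code_systems(bundle)[CPT_SYSTEM] > 0


# Hand-written sample for illustration; not actual Synthea output.
SAMPLE_BUNDLE = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {
            "resourceType": "Procedure",
            "code": {"coding": [{
                "system": SNOMED_SYSTEM,
                "code": "80146002",
                "display": "Appendectomy",
            }]},
        }},
    ],
}

if __name__ == "__main__":
    print(dict(procedure_code_systems(SAMPLE_BUNDLE)))
    print("CPT present:", has_cpt(SAMPLE_BUNDLE))
```

Running a check like this against a sample of generated bundles before Friday evening is cheaper than discovering the gap in your billing module.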

Billing and claims data. Synthea does not generate 837P professional claims, 837I institutional claims, Explanation of Benefits (EOB) structures, remittance advice (835) formats, or any of the financial data structures that define healthcare revenue cycle management. The FHIR ExplanationOfBenefit resource in Synthea output is present but minimal and does not reflect realistic payer processing. For RCM software, denial management tools, clearinghouse testing, or payer analytics, Synthea produces nothing useful.

Clinical narrative notes. Synthea does not generate physician-authored clinical documentation. There are no H&P notes, no progress notes, no operative reports, no discharge summaries, no radiology reports with narrative text. The FHIR DiagnosticReport resources Synthea produces reference Observation resources but do not include narrative text representing a physician's written interpretation. For NLP model training, clinical documentation testing, or medical coding practice, this is a critical gap.

Specialty-specific complexity. Synthea's disease modules cover common conditions with realistic progression, but the specialty complexity that defines cardiology, oncology, orthopedics, and behavioral health coding is largely absent. There is no module for cardiac catheterization procedures coded according to CPT bundling rules, no chemotherapy administration coding complexity, no orthopedic global period scenarios.

Realistic comorbidity distributions. Synthea patients have comorbidities, but the specific clustering of conditions that reflects real clinical populations — the specific overlap pattern between diabetes, hypertension, CKD, peripheral vascular disease, and obesity that internists manage daily — may not match real-world clustering statistics as closely as curated datasets from actual patient populations. For ML training where comorbidity correlation matters for model validity, this can affect model performance on real-world data.
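
One way to check this for any candidate dataset is to measure how often pairs of conditions occur together and compare the result with a reference you trust. The sketch below computes pairwise co-occurrence rates from per-patient condition lists. The sample patients are invented for illustration, and the reference figures to compare against are something you would have to supply.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_rates(patients):
    """Return {(cond_a, cond_b): share of patients who have both}."""
    pair_counts = Counter()
    for conditions in patients:
        # sorted(set(...)) gives each unordered pair one canonical key
        for pair in combinations(sorted(set(conditions)), 2):
            pair_counts[pair] += 1
    total = len(patients)
    if total == 0:
        return {}
    return {pair: n / total for pair, n in pair_counts.items()}


# Invented sample patients for illustration only.
SAMPLE_PATIENTS = [
    ["diabetes", "hypertension", "ckd"],
    ["diabetes", "hypertension"],
    ["hypertension"],
    ["obesity", "diabetes"],
]

if __name__ == "__main__":
    for pair, rate in sorted(cooccurrence_rates(SAMPLE_PATIENTS).items()):
        print(pair, f"{rate:.2f}")
```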

Geographic realism beyond demographics. While Synthea calibrates demographics to US census data, geographic healthcare utilization patterns — the fact that patients in rural areas have different access patterns than urban patients, that certain health systems have specific formularies, that different regions have different specialist availability — are not modeled.

Payer-specific data. Synthea doesn't model payer-specific coverage policies, authorization requirements, or claim adjudication logic. For testing applications that interact with specific payer systems, Synthea data won't reflect how those payers actually process and respond to claims.

The 5 Real Alternatives: Honest Pros and Cons

Option 1 — Fastest Time to Value

PatientDatasets.com

A commercial synthetic patient data service built specifically for teams that need realistic records without infrastructure overhead. Download structured datasets in CSV, JSON, FHIR R4, HL7 v2, 837P claims, C-CDA, or pipe-delimited formats — immediately, without any setup. Records include clinical narrative notes, CPT procedure codes, ICD-10 diagnosis codes, HCPCS Level II drug codes, and demographic data calibrated to realistic US population distributions. Specialty-specific dataset packs are available for cardiology, oncology, orthopedics, behavioral health, emergency medicine, and primary care. A free tier enables initial exploration; paid tiers unlock volume, specialty packs, and custom demographic configurations.

Pros
  • Instant download — zero setup time
  • 7 output formats including 837P and HL7 v2
  • CPT codes and HCPCS Level II included
  • Clinical narrative physician notes
  • Commercial license — use in products, demos, client work
  • Specialty-specific dataset packs
  • Realistic comorbidity patterns
  • No Java, Gradle, or preprocessing pipeline
  • Support team available if questions arise

Cons
  • Paid after free tier — not open source
  • Cannot customize disease module parameters
  • Fixed dataset — not a generative model you can run
  • Less suitable for academic research requiring open provenance

Option 2 — Open Source Standard

Synthea (SyntheticMass)

Developed at MITRE Corporation and maintained under an Apache 2.0 license, Synthea generates synthetic patient records using clinically validated disease progression models. It is the most widely cited open-source synthetic patient data generator in healthcare informatics research. Used in FDA digital health submissions, EHR vendor testing environments, and academic research publications around the world. The community is large, the documentation is extensive, and the FHIR output is structurally valid for FHIR integration testing.

Pros
  • Free and open source (Apache 2.0)
  • Large, active community and ongoing maintenance
  • Clinically validated disease progression models
  • Customizable disease modules for specific conditions
  • FHIR R4, C-CDA, and CSV output (OMOP CDM via a separate ETL step)
  • Academically credible provenance for research
  • Realistic longitudinal patient histories
  • US census-calibrated demographics

Cons
  • No CPT procedure codes (SNOMED only)
  • No 837P/837I billing claim structures
  • No clinical narrative notes
  • 6+ hours to configure from scratch for most teams
  • Java/Gradle dependency chain: not trivial for non-Java teams
  • No HL7 v2 or 837 output; CSV export uses Synthea's own schema rather than one most applications expect
  • Outputs require additional processing for most production use cases
  • Module customization requires clinical informatics knowledge

Option 3 — Research Grade (Real De-identified Data)

MIMIC-IV (PhysioNet)

MIMIC-IV — Medical Information Mart for Intensive Care — is not synthetic data. It is de-identified data from real ICU patients treated at Beth Israel Deaconess Medical Center in Boston, covering over 500,000 admissions from 2008-2019. It includes structured data (diagnoses, procedures, medications, lab results, vital signs), clinical notes (nursing notes, physician notes, discharge summaries), and detailed time-series physiologic data. MIMIC-IV is the gold standard for critical care research and has been used in hundreds of peer-reviewed publications. If your use case involves ICU patient populations, MIMIC-IV is almost certainly the most realistic dataset available anywhere.

Pros
  • Real clinical data (de-identified per HIPAA Safe Harbor)
  • Academically rigorous — hundreds of published papers use it
  • Free with credentialing
  • Extensive structured data including time-series vitals and labs
  • Includes real clinical notes for NLP work
  • Large active research community; many derived datasets and analyses available

Cons
  • Requires PhysioNet account, CITI training completion, and DUA signature — credentialing takes days to weeks
  • NonCommercial license — cannot be used in commercial products or for client deliverables
  • ICU population only — no outpatient, no primary care, no pediatric general wards
  • Cannot redistribute data or share derived datasets that retain individual-level information
  • Complex relational database schema — requires significant data engineering to use
  • Historical data (2008-2019) — may not reflect current clinical practice patterns
  • ICD-9 codes for older records, ICD-10 for more recent ones — mixed coding system

Option 4 — Government Published Data

CMS Synthetic Medicare Data

The Centers for Medicare & Medicaid Services publishes several synthetic datasets derived from Medicare claims data, most notably the Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF). These datasets follow real Medicare claims patterns and are useful for Medicare-specific policy research, analytics proof-of-concept work, and understanding CMS claims data structures. No credentialing or data use agreement is required; the data is publicly downloadable.

Pros
  • Free, publicly downloadable — no credentialing required
  • Realistic Medicare claims structure (Part A, B, D)
  • Useful for Medicare policy research and analytics POC work
  • Familiar format for teams working with CMS data

Cons
  • Medicare population only (65+, certain disabled under-65) — no pediatric, no general working-age adult population
  • No clinical notes, no laboratory narrative, no vital signs
  • Claims-only structure — no encounter-level clinical data
  • Older schema — the DE-SynPUF specifically has been criticized for statistical artifacts that limit its use for ML training
  • Not useful for EHR integration testing — wrong format and population
  • Limited specialty coverage — Medicare population skews toward chronic disease management and doesn't reflect specialty procedure populations well

Option 5 — Build Your Own

Faker / SDV / Custom Python Generation

For teams with data engineering resources, Python-based approaches offer flexibility at the cost of significant up-front development time. The Faker library (and its healthcare-specific extensions) can generate realistic-looking demographic and basic clinical data. The Synthetic Data Vault (SDV) from the MIT Data to AI Lab trains generative models on existing structured data and produces statistically similar synthetic outputs — useful when you have real data whose statistical properties you want to preserve. For teams building complex custom requirements, a fully custom generation script using clinical terminology APIs and medical ontologies is also an option.
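
The gap between "realistic-looking" and "clinically coherent" is easy to demonstrate. The sketch below uses only the standard library to stand in for a Faker-style generator: one function samples a diagnosis and a procedure independently, the other picks a procedure that plausibly fits the diagnosis. The code lists are a tiny hand-picked sample with paraphrased descriptions, and the `PLAUSIBLE` map is a hypothetical stand-in for the clinical logic a real generator would need.

```python
import random

# Small hand-picked code lists for illustration only.
DIAGNOSES = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "M17.11": "Unilateral primary osteoarthritis, right knee",
}
PROCEDURES = {
    "83036": "Hemoglobin A1c test",
    "93000": "12-lead electrocardiogram",
    "27447": "Total knee arthroplasty",
}
# Hypothetical coherence map: procedures that plausibly fit each diagnosis.
PLAUSIBLE = {"E11.9": ["83036"], "I10": ["93000"], "M17.11": ["27447"]}


def random_encounter(rng):
    """Naive approach: sample diagnosis and procedure independently."""
    return rng.choice(list(DIAGNOSES)), rng.choice(list(PROCEDURES))


def coherent_encounter(rng):
    """Pick a diagnosis first, then a procedure that is plausible for it."""
    dx = rng.choice(list(DIAGNOSES))
    return dx, rng.choice(PLAUSIBLE[dx])


def is_coherent(dx, cpt):
    return cpt in PLAUSIBLE[dx]


if __name__ == "__main__":
    rng = random.Random(1)
    naive = [random_encounter(rng) for _ in range(1000)]
    share = sum(is_coherent(dx, cpt) for dx, cpt in naive) / len(naive)
    print(f"coherent share with independent sampling: {share:.2f}")
```

With three diagnoses and three procedures, independent sampling lands on a sensible pair only about a third of the time, and the odds get worse as the code lists grow. Building and maintaining the real equivalent of `PLAUSIBLE` is where the development time goes.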

Pros
  • Fully customizable to your specific requirements
  • Free and open source
  • SDV generates data that statistically mirrors real data if you have real data to train on
  • Can produce any output format your application requires
  • No dependence on external vendors or hosted services

Cons
  • Requires a data engineer with healthcare domain knowledge — not a weekend project
  • Faker produces statistically plausible but clinically incoherent records — random ICD-10 and CPT combinations that don't reflect real clinical scenarios
  • SDV requires real patient data to train on — if you have real data, you have the original problem (PHI) that synthetic data is supposed to solve
  • No ICD-10/CPT clinical coherence without extensive custom logic
  • Validation that output is clinically realistic is your responsibility
  • Ongoing maintenance burden as code sets update annually

Option 6 — Enterprise Grade

Commercial Synthetic Data Platforms (MDClone, Syntegra, Gretel)

Enterprise synthetic data platforms generate privacy-certified synthetic data from institutional real patient datasets. MDClone, Syntegra, and Gretel (with its healthcare verticals) use advanced generative AI and differential privacy techniques to produce synthetic data that mirrors an institution's real patient population while providing formal privacy guarantees. These platforms are designed for large health systems that need to share their de-identified institutional data for research partnerships, commercial analytics, or AI model training without the risks of traditional de-identification.

Pros
  • Privacy-certified synthetic data with formal differential privacy guarantees
  • Mirrors the statistical properties of a real institutional dataset
  • Suitable for sharing data that would otherwise be restricted by the institution
  • Enterprise support and compliance documentation
  • Can generate large-scale population datasets reflecting specific health systems

Cons
  • Contracts typically range from $50,000 to $500,000+ annually
  • Requires a sales process — no self-service option for most platforms
  • Requires the institution to provide its own real patient data as the generative model input
  • Overkill for most development, testing, and ML prototyping use cases
  • Not a practical option for startups, researchers without institutional backing, or teams needing data quickly
  • Implementation and integration requires significant coordination with the vendor

Side-by-Side Comparison

Feature | PatientDatasets | Synthea | MIMIC-IV | CMS Synthetic | DIY (Faker/SDV) | Enterprise Platforms
Time to first record | Minutes | 6+ hours | Days–weeks | Hours | Weeks | Months
CPT codes | Yes | No | Partial (ICD-PCS) | Partial | DIY | Depends on source
ICD-10 codes | Yes | Yes | Mixed ICD-9/10 | Yes | DIY | Depends on source
Clinical narrative notes | Yes | No | Yes | No | DIY | Depends on source
837P billing format | Yes | No | No | Partial | DIY | Varies
FHIR R4 | Yes | Yes | Via tools | No | DIY | Varies
HL7 v2 | Yes | No | No | No | DIY | Varies
Commercial license | Yes | Apache 2.0 | No (NonCommercial) | Public domain | MIT/Apache | Enterprise contract
Cost | Free tier + paid | Free | Free | Free | Free (dev time) | $50k–$500k+
Setup required | None | Significant | Credentialing + data eng | Minimal | Substantial dev work | Enterprise procurement

The Evaluation Framework: 10 Questions to Ask

Before committing to any synthetic data source, work through these ten questions. Your answers will point you to the right option more reliably than any general comparison.

  1. What format does my application consume? FHIR R4 JSON, HL7 v2 messages, 837P claims files, CSV, relational database tables? The format question eliminates several options immediately.
  2. Do I need CPT procedure codes? Yes eliminates Synthea and MIMIC-IV. No keeps all options open.
  3. Do I need clinical narrative notes? Yes requires PatientDatasets.com, MIMIC-IV, or a DIY approach with NLP-generated text.
  4. What is my timeline? Hours: PatientDatasets.com or CMS Synthetic. Days to weeks: Synthea, MIMIC-IV (credentialing), or a simple DIY script. Months: enterprise platforms or a complex DIY build.
  5. Will this data appear in a commercial product or client deliverable? Yes requires a commercial license — eliminates MIMIC-IV (NonCommercial) and adds cost consideration for all options.
  6. Do I need to produce arbitrary volumes of new records on demand? Yes favors Synthea (generative) or a DIY approach. Curated datasets have fixed volume.
  7. Do I need a specific patient population (ICU, pediatric, oncology, geriatric)? ICU: MIMIC-IV is unmatched. Specialty outpatient: PatientDatasets.com specialty packs. Geriatric Medicare: CMS Synthetic.
  8. Do I need my synthetic data to statistically mirror a specific institution's real patient population? Yes requires either SDV (if you have real data) or enterprise platforms.
  9. Is academic provenance important — will this data be cited in a paper? Yes favors Synthea (widely cited, Apache 2.0) or MIMIC-IV (gold standard, but NonCommercial).
  10. Do I need payer-specific data structures? Medicare claims: CMS Synthetic. Payer-specific adjudication: enterprise platforms or PatientDatasets.com commercial data with payer-typical structures.
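
Several of these questions work as plain elimination rules, which makes them easy to encode. The sketch below applies questions 2, 3, and 5 from the list above to the six options. It is a simplification, not a full decision tool: the `shortlist` function and its parameter names are hypothetical, and the remaining questions (format, timeline, population) would need their own rules.

```python
OPTIONS = [
    "PatientDatasets.com",
    "Synthea",
    "MIMIC-IV",
    "CMS Synthetic",
    "DIY",
    "Enterprise platforms",
]


def shortlist(needs_cpt=False, needs_notes=False, commercial_use=False):
    """Apply questions 2, 3, and 5 as elimination rules, in that order."""
    remaining = list(OPTIONS)
    if needs_cpt:
        # Question 2: a CPT requirement eliminates Synthea and MIMIC-IV.
        remaining = [o for o in remaining if o not in {"Synthea", "MIMIC-IV"}]
    if needs_notes:
        # Question 3: notes require PatientDatasets.com, MIMIC-IV, or DIY.
        keep = {"PatientDatasets.com", "MIMIC-IV", "DIY"}
        remaining = [o for o in remaining if o in keep]
    if commercial_use:
        # Question 5: commercial use eliminates MIMIC-IV (NonCommercial).
        remaining = [o for o in remaining if o != "MIMIC-IV"]
    return remaining


if __name__ == "__main__":
    print(shortlist(needs_cpt=True, needs_notes=True, commercial_use=True))
    print(shortlist(needs_notes=True))
```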

The License Reality: What NonCommercial Actually Means

License terms for healthcare data deserve more attention than most developers give them. The most common license misconception involves MIMIC-IV's "NonCommercial" designation.

The MIMIC-IV Data Use Agreement explicitly prohibits commercial use — and "commercial use" in this context includes using the data to develop a product that is then sold, using the data in a consulting deliverable for a paying client, and using the data to train a model that is then deployed in a commercial healthcare application. The prohibition is not just on redistributing the data — it is on using the data as part of any commercial activity.

This means: if you are a health IT startup, you cannot use MIMIC-IV to train your ML model. If you are a consulting firm, you cannot use MIMIC-IV to build a demo for a client. If you are a payer analytics team building tools that your company will use to make commercial decisions, MIMIC-IV's status is legally unclear and carries organizational risk.

Synthea's Apache 2.0 license is permissive — you can use Synthea-generated data in commercial products, in client deliverables, and in commercial ML model training. This is a meaningful advantage of Synthea over MIMIC-IV for commercial teams, separate from any feature comparison.

PatientDatasets.com's commercial license covers all commercial use cases — including building products, client deliverables, and ML model training for commercial deployment — with explicit IP indemnification in paid tiers. For regulated healthcare companies that need documented data provenance, this matters.

The license question is often an afterthought in data source evaluation — until a legal or compliance team asks about data provenance. In healthcare, where data misuse carries regulatory and reputational consequences, discovering six months into a project that the data you used for model training carries a NonCommercial restriction is a painful and expensive problem. Evaluate license terms on day one, not after launch.

Use-Case-to-Option Matching: Specific Scenarios

"I need data to build and test a readmission prediction model"

For a readmission model, you need longitudinal patient records with inpatient admissions, diagnoses, procedures, medications, lab results, and discharge dispositions — and you need enough data to train a generalizable model, typically tens of thousands of patient records. You also need some records that include actual readmission events.
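
Whatever source you choose, you can check early whether it contains enough positive examples by deriving readmission labels from admission and discharge dates. The sketch below flags each stay that is followed by another admission for the same patient within a window. The 30-day default is an assumption (a common convention, not something this article specifies), and the tuple layout and sample stays are invented for illustration.

```python
from datetime import date


def label_readmissions(stays, window_days=30):
    """stays: list of (patient_id, admit_date, discharge_date) tuples.

    Returns (patient_id, admit_date, readmitted) for every stay, where
    readmitted is True if the same patient's next admission starts within
    window_days of this stay's discharge.
    """
    by_patient = {}
    for pid, admit, discharge in stays:
        by_patient.setdefault(pid, []).append((admit, discharge))
    labels = []
    for pid, visits in by_patient.items():
        visits.sort()  # chronological order by admit date
        for i, (admit, discharge) in enumerate(visits):
            readmitted = False
            if i + 1 < len(visits):
                gap = (visits[i + 1][0] - discharge).days
                readmitted = 0 <= gap <= window_days
            labels.append((pid, admit, readmitted))
    return labels


# Invented sample stays for illustration only.
SAMPLE_STAYS = [
    ("A", date(2023, 1, 1), date(2023, 1, 5)),
    ("A", date(2023, 1, 20), date(2023, 1, 25)),
    ("A", date(2023, 6, 1), date(2023, 6, 3)),
    ("B", date(2023, 2, 1), date(2023, 2, 2)),
]

if __name__ == "__main__":
    labels = label_readmissions(SAMPLE_STAYS)
    positives = sum(flag for _, _, flag in labels)
    print(f"{positives} of {len(labels)} stays followed by a readmission")
```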

If this is an academic research project and ICU readmissions are acceptable: MIMIC-IV is the right choice. The data depth is unmatched, and the academic community has published extensively on readmission modeling with MIMIC.

If this is a commercial product and you need general inpatient populations (not just ICU): PatientDatasets.com inpatient datasets provide readmission-relevant records with the necessary fields. Alternatively, Synthea can generate inpatient records with longitudinal history — the limitation is SNOMED procedure codes and no clinical notes, which may or may not matter depending on your feature set.

"I need data to test my FHIR integration"

For FHIR integration testing, you need FHIR R4 resources that conform to the specific profiles you are integrating against — likely US Core profiles with specific required elements populated. You need multiple records per resource type to exercise edge cases. You probably need records that test specific scenarios — observations with missing values, patients with unusual name structures, coverage resources with complex plan hierarchies.
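
A quick audit can tell you whether a dataset actually contains the edge cases you want to exercise. As one example, the sketch below classifies Observation resources by how they carry (or fail to carry) a value. The category names and sample resources are hypothetical, and the check is deliberately looser than any specific profile's rules.

```python
from collections import Counter


def observation_value_status(obs):
    """Classify an Observation dict by how its result is represented."""
    # value[x] appears as valueQuantity, valueString, valueCodeableConcept, ...
    if any(key.startswith("value") for key in obs):
        return "has_value"
    if "dataAbsentReason" in obs:
        return "absent_with_reason"
    if obs.get("component") or obs.get("hasMember"):
        return "components_only"
    return "missing"


def summarize(observations):
    """Count observations per status category."""
    return Counter(observation_value_status(o) for o in observations)


# Hand-written sample resources for illustration only.
SAMPLES = [
    {"resourceType": "Observation", "code": {"text": "Hemoglobin A1c"},
     "valueQuantity": {"value": 6.1, "unit": "%"}},
    {"resourceType": "Observation", "code": {"text": "Hemoglobin A1c"},
     "dataAbsentReason": {"text": "Not performed"}},
    {"resourceType": "Observation", "code": {"text": "Blood pressure"},
     "component": [{"code": {"text": "Systolic"},
                    "valueQuantity": {"value": 128, "unit": "mmHg"}}]},
    {"resourceType": "Observation", "code": {"text": "Body weight"}},
]

if __name__ == "__main__":
    print(dict(summarize(SAMPLES)))
```

If a dataset reports zero observations in the absent or missing categories, your parser's handling of those cases is not being tested, regardless of how many records you load.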

Synthea produces valid FHIR R4 resources and is a reasonable choice here. PatientDatasets.com FHIR output is US Core-conformant with validated resource structure and covers the specific edge cases that integration tests need. Either works; the choice is between free-and-configure (Synthea) and instant-and-download (PatientDatasets.com).

"I need data to train new medical coders"

For medical coding training, you need records with CPT codes, ICD-10 codes, clinical narrative documentation, and realistic complexity — multiple comorbidities per encounter, documentation quality issues, specialty-specific procedures. Neither Synthea nor MIMIC-IV provides this. PatientDatasets.com medical coding practice datasets are specifically designed for this use case with all of these elements.

"I need data to build a prior authorization workflow demo"

For a prior authorization demo, you need records that include procedure requests (CPT codes), diagnosis justifications (ICD-10), patient insurance information (Coverage), and ideally clinical notes that support medical necessity. Synthea lacks CPT codes and clinical notes. MIMIC-IV is ICU-only and NonCommercial. PatientDatasets.com provides all of these elements with a commercial license that allows use in a client-facing demo.

"I need data to stress-test my EHR's claim generation pipeline"

For RCM pipeline testing, you need 837P claim structures, ERA/835 response files, EOB data, or at minimum structured claims data with realistic charge codes, modifier situations, and payer identifiers. Synthea generates none of this. CMS Synthetic Medicare data has claims structure but is Medicare-only and older format. PatientDatasets.com 837P format datasets are designed specifically for RCM pipeline testing. For testing against specific payer adjudication logic, the enterprise platform options are the only ones with payer-specific response modeling, but at enterprise cost.

"I need to test my NLP model on clinical notes"

For NLP work on clinical notes, you need narrative text in realistic physician documentation style. MIMIC-IV is the gold standard if your use case tolerates the NonCommercial license and ICU population. PatientDatasets.com provides synthetic clinical notes across specialties with a commercial license. Synthea provides no narrative notes. DIY approaches using LLM-generated notes are possible but require careful validation that the generated text reflects realistic documentation patterns rather than textbook-perfect prose that NLP models trained on real notes won't generalize from.

The Timeline Reality

The right synthetic data source for your use case sometimes depends less on features than on time. Here is a realistic assessment of what each option requires.

Hours: PatientDatasets.com (instant download), CMS Synthetic Medicare (downloadable without credentialing, though format work may take additional time).

Days: Synthea, if you have Java experience and the configuration is straightforward. MIMIC-III (the predecessor to MIMIC-IV) credentialing has historically processed in 24-72 hours for many applicants, though MIMIC-IV credentialing timelines vary.

Weeks: MIMIC-IV credentialing on the longer end, particularly for applicants without existing CITI training. Synthea, if you need custom disease modules or complex configuration. DIY approaches for simple use cases.

Months: Enterprise platforms (MDClone, Syntegra) — the sales cycle, contract negotiation, data transfer, and platform setup are multi-month endeavors. DIY approaches for complex use cases requiring clinically coherent records at scale.

A Note on Synthea, Specifically

Synthea deserves a direct address at the end of this comparison because it is the tool most frequently recommended by default — and because that reflexive recommendation does a disservice to teams whose use cases Synthea doesn't actually serve.

Synthea is an excellent tool. The MITRE team and the open-source community have built something genuinely valuable for the healthcare IT ecosystem. For research requiring longitudinal synthetic patient populations with FHIR R4 output, for academic projects requiring open data provenance, for teams building FHIR-native applications who need to test basic resource parsing — Synthea is the right answer.

The problem is not Synthea. The problem is recommendation without evaluation. When a developer asks "what synthetic patient data should I use?" and the answer is reflexively "Synthea" without asking what format they need, whether they need CPT codes, whether they have clinical notes requirements, and how much time they have — that's how Friday afternoon data problems happen. And they happen a lot.

The right question is not "which tool is best?" The right question is "which tool is best for this specific use case, at this timeline, with these license requirements, in this format?" Answer those questions honestly and the right choice usually becomes obvious. Most of the time, if the timeline is measured in hours rather than days and the use case involves billing data or clinical documentation, Synthea is not the answer — regardless of how many Stack Overflow answers recommend it.

Whatever your use case, the right tool exists. The goal of this comparison is simply to make sure you reach for it before it is 5 PM on a Friday and your demo is Monday morning. That distinction — knowing ahead of time which tool to reach for — is what separates teams that spend weekends configuring Java build environments from teams that spend weekends building features.