Q: How is PatientDatasets.com different from Synthea?

Synthea generates statistical demographics and skeletal encounters: useful scaffolding, but clinically flat. PatientDatasets.com delivers complete narrative records, including HPI, ROS, physical exam, assessment, plan, labs with reference ranges, and medications with dosages, written the way a clinician actually documents a chart. We offer 7 pre-built export formats; Synthea natively exports FHIR, C-CDA, and CSV, and OMOP requires a separate ETL step. And you get an instant download vs. 6+ hours of preprocessing and configuration to generate a usable Synthea cohort.

Python / pandas · ML Training Data · scikit-learn / PyTorch · FHIR R4 / HL7 · Parquet / Spark / BigQuery · Clinical NLP

Synthetic Patient Data for
Machine Learning & AI

Q: Can I use synthetic patient data to train a machine learning model?

Yes — 100% synthetic data with no real PHI means no HIPAA restrictions, no data use agreements, and no IRB approval required. You can train, test, benchmark, and publish results freely. Parquet and CSV formats are optimized for pandas, scikit-learn, Spark, and BigQuery pipelines. Every record includes 36+ demographic fields, structured labs, vitals, medications, and full clinical notes ready for feature engineering.

Q: What format is best for pandas and scikit-learn?

CSV is available from the Sampler tier ($49) and works natively with pandas, R, and Excel. Parquet is available from the Innovator tier ($229) and is optimized for columnar access with pandas, Spark, DuckDB, and BigQuery — ideal for large-scale ML pipelines. Both formats require zero preprocessing: download and load immediately.

Q: Is IRB approval required to publish results using this data?

No. The data is 100% synthetic — no real patients, no real PHI. Human subjects research regulations (45 CFR 46) do not apply. You can publish results, share the dataset with co-authors, post to GitHub, and cite it in a Methods section without any IRB approval or institutional review. A commercial-use license is included with every purchase.

Q: What is the largest dataset available?

The Premium tier includes 2,500 records in all 7 formats. For larger needs, the Enterprise tier provides 10,000+ record custom cohorts with specialty filtering, custom ICD-10 distributions, and white-label rights. Contact support@patientdatasets.com for Enterprise pricing and custom configurations.

Production-ready healthcare datasets in 7 formats. No HIPAA overhead. No IRB delays.
Train your model today.

Download Free Sample → View Pricing

✓ HIPAA-free: no compliance overhead
✓ IRB-free: publish results freely
✓ Instant download
✓ Commercial-use license included

19,900+ records generated
60+ clinical specialties
7 export formats
92.5%+ BOSS quality score
9 encounter types
36+ demographic fields

Built for ML workflows

Why data scientists choose PatientDatasets

Every design decision optimizes for the data science workflow — from download to model in minutes, not days.

📈

Parquet + CSV native

Load directly into pandas, Spark, DuckDB, or BigQuery. Parquet columnar format cuts memory overhead by 60–80% vs. raw CSV for large feature engineering jobs.

Parquet CSV DuckDB

⚡

Zero preprocessing required

Download and use immediately. Consistent schema, validated data types, normalized ICD-10 codes, and reference-range-annotated labs — no cleaning scripts needed.

Ready to train

🔒

HIPAA-free — zero compliance tax

100% synthetic means no BAA, no DUA, no IRB, no de-identification review. Work on your laptop, a shared cluster, or a public cloud instance without legal overhead.

No DUA required

🎯

Realistic comorbidity patterns

Calibrated to CDC/NHANES epidemiological baselines. Diabetes co-occurs with CKD at realistic rates, and hypertension co-occurs with cardiovascular disease at realistic rates, so models train on a realistic covariate structure.

ICD-10-CM SNOMED-CT

📚

IRB-free — publish freely

No human subjects, no 45 CFR 46 requirements. Publish benchmark results, share on GitHub, submit to arXiv or peer-reviewed journals without institutional review.

Commercial license included

⚖

36+ demographic fields for bias testing

Age, sex, race/ethnicity, insurance type, ZIP-level SDoH, preferred language, and more — slice your model's performance across every dimension your IRB would require.
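The slicing described above could look something like the sketch below, which computes accuracy per group with pandas. The `y_true` and `y_pred` columns and the toy rows are assumptions for illustration; only `insurance_type` is a column name taken from the quickstart.

```python
import pandas as pd

# Toy predictions. "y_true" and "y_pred" are hypothetical column names,
# and the rows are invented for illustration.
results = pd.DataFrame({
    "insurance_type": ["Medicare", "Medicare", "Medicaid",
                       "Medicaid", "Commercial", "Commercial"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1],
})


def accuracy_by_group(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Return the fraction of correct predictions within each group."""
    correct = df["y_true"] == df["y_pred"]
    return correct.groupby(df[group_col]).mean()


per_group = accuracy_by_group(results, "insurance_type")
print(per_group)
```

The same function can be called once per demographic column to build a small fairness report.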

Fairness testing SDoH fields

5-line quickstart

From download to DataFrame in under a minute

No ETL pipeline. No schema mapping. No type coercion. The Parquet file loads directly into a clean pandas DataFrame with correctly typed columns and no nulls in key fields.

✓ Parquet schema includes column types, nullable flags, and metadata
✓ ICD-10 codes pre-mapped — no crosswalk step needed
✓ Dates parsed as ISO 8601, labs as float64 with reference ranges included
✓ Full clinical notes in a single text column — NLP-ready

quickstart.py
import pandas as pd

# Load the dataset with zero preprocessing
# (pd.read_parquet needs a Parquet engine such as pyarrow installed)
df = pd.read_parquet("patientdatasets_500.parquet")

# 500 patients, 36+ columns, ready
print(df.shape)   # (500, 42)

# Filter: diabetic patients with CKD (na=False treats missing codes as non-matches)
cohort = df[
    df["primary_dx_code"].str.startswith("E11", na=False) &
    df["secondary_dx"].str.contains("N18", na=False)
]

# Clinical notes column, NLP-ready
notes = df["clinical_note"].tolist()
print(f"{len(notes)} notes loaded")  # 500 notes

# Feature engineering example
X = df[[
    "age", "bmi", "hba1c", "egfr",
    "systolic_bp", "insurance_type"
]]
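Most scikit-learn estimators expect numeric input, so the `insurance_type` column in the feature frame above would usually be encoded before training. The sketch below shows one common approach, one-hot encoding with `pd.get_dummies`. It uses the quickstart's column names, but the rows are invented stand-ins rather than real dataset values.

```python
import pandas as pd

# Stand-in rows using the quickstart's column names; the values are invented.
X = pd.DataFrame({
    "age": [54, 67, 71],
    "bmi": [31.2, 27.8, 29.5],
    "hba1c": [8.1, 6.9, 7.4],
    "egfr": [58.0, 72.0, 44.0],
    "systolic_bp": [142, 128, 150],
    "insurance_type": ["Medicare", "Commercial", "Medicare"],
})

# One-hot encode the single categorical column; numeric columns pass through.
X_encoded = pd.get_dummies(X, columns=["insurance_type"], dtype=int)
print(X_encoded.columns.tolist())
```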
                        

Use cases

What ML teams build with this data

From production healthcare AI to portfolio projects — these are the workflows our customers run.

🤖

Clinical NLP model training

Fine-tune BioBERT, ClinicalBERT, or GPT models on realistic clinical narratives — chief complaint, HPI, ROS, assessment/plan. Full discharge summaries and operative notes included.
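Before fine-tuning, it is often useful to split a note into its sections so each can be tokenized or labeled separately. The sketch below assumes a hypothetical layout in which each section starts on its own line with an upper-case header and a colon. The real notes may be formatted differently, so the pattern would need adjusting.

```python
import re

# Hypothetical note layout: real header names and formatting may differ.
note = (
    "CHIEF COMPLAINT: Fatigue for two weeks.\n"
    "HPI: 58-year-old with type 2 diabetes reports worsening fatigue.\n"
    "ROS: Negative for chest pain or dyspnea.\n"
    "ASSESSMENT: Poorly controlled type 2 diabetes.\n"
    "PLAN: Increase metformin; recheck HbA1c in 3 months.\n"
)

HEADER = re.compile(
    r"^(CHIEF COMPLAINT|HPI|ROS|ASSESSMENT|PLAN):\s*", re.MULTILINE
)


def split_sections(text: str) -> dict[str, str]:
    """Map each section header to the text that follows it."""
    matches = list(HEADER.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section ends where the next header begins, or at end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


sections = split_sections(note)
print(sections["PLAN"])
```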

NER BioBERT ICD coding

📊

Denial prediction models

Build and validate prior-authorization and claims denial classifiers. Structured diagnosis codes, insurance type, procedure codes, and payer fields make for rich feature sets.

XGBoost Classification

🏢

HCC risk adjustment ML

Train and benchmark HCC RAF score predictors. Comorbidity patterns and demographic distributions are calibrated to CMS V28 model epidemiological baselines.

HCC V28 Risk scoring

📄

EHR NLP pipelines

Test extraction pipelines for medications, lab values, problem lists, and clinical events. FHIR R4 bundles provide structured reference output for evaluation.
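One simple way to score an extraction pipeline against structured reference output is set-based precision and recall. The sketch below assumes the reference medications have already been pulled from the structured record and the extracted ones from the note text. Both sets here are invented examples.

```python
def precision_recall(extracted: set[str], reference: set[str]) -> tuple[float, float]:
    """Compare extracted items to a reference set and return (precision, recall)."""
    true_pos = len(extracted & reference)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Invented example: what the structured record lists vs. what the NLP step found.
reference_meds = {"metformin", "lisinopril", "atorvastatin"}
extracted_meds = {"metformin", "lisinopril", "aspirin"}

precision, recall = precision_recall(extracted_meds, reference_meds)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In practice, drug names would likely need normalizing (case, brand vs. generic) before the comparison.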

FHIR R4 spaCy medspaCy

🏆

Portfolio & Kaggle-style projects

Build a compelling healthcare ML portfolio without data access barriers. Commercial-use license means you can publish your notebook, share on GitHub, and present to employers.

Public repo OK Commercial license

💾

Benchmark datasets

Establish reproducible baselines for clinical NLP and structured prediction tasks. Versioned releases with DOI-style citation satisfy most journal data transparency requirements.

Reproducible Versioned

7 formats

7 formats, zero conversion

Every format is pre-built and validated. No OMOP transformation scripts, no FHIR mapping, no ETL. Pick your stack and load.

Format	Best for	Available from
.parquet	Spark, DuckDB, BigQuery, pandas — columnar analytics at scale	Innovator+
.csv	pandas, R, Excel, any tabular tool — universal compatibility	All tiers
.json	REST APIs, document stores, MongoDB, Elasticsearch	Sampler+
FHIR R4	Healthcare interoperability, SMART on FHIR app testing, FHIR server validation	Architect+
HL7 v2.x	Interface engine testing, ADT/ORU message pipelines, legacy EHR integration	Architect+
C-CDA	Clinical document exchange, CDA parsing pipelines, Meaningful Use testing	Architect+
.sqlite	Relational queries, SQL practice, embedded database applications	Professional+

📌 Architect tier ($349) includes all 7 formats in a single download
📌 FHIR R4 bundles are conformant, R4-validated, and include LOINC + SNOMED-CT codes
📌 Parquet schema uses Apache Arrow metadata
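For the SQLite format, Python's built-in `sqlite3` module is enough to run relational queries. The sketch below builds a small in-memory table to stay self-contained. The `patients` table and its columns are assumptions, not the shipped schema; with a real download you would pass the `.sqlite` file path to `sqlite3.connect` instead.

```python
import sqlite3

# In-memory stand-in; the "patients" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id TEXT, age INTEGER, primary_dx_code TEXT)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [("p1", 54, "E11.9"), ("p2", 67, "I10"),
     ("p3", 71, "E11.22"), ("p4", 45, "J44.1")],
)

# Count patients per three-character ICD-10 category.
rows = conn.execute(
    "SELECT substr(primary_dx_code, 1, 3) AS category, COUNT(*) AS n "
    "FROM patients GROUP BY category ORDER BY n DESC, category"
).fetchall()
print(rows)
conn.close()
```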

One-time purchase

One-time purchase. Instant download.

No subscription. No per-seat fees. Buy once, use forever. Download in under a minute.

Innovator

$229

one-time purchase

500 Records

✓ CSV + JSON + Parquet
✓ 60+ specialties
✓ 36+ demographic fields
✓ Commercial-use license
✓ Instant download
✓ 14-day refund policy

Buy Innovator →

Architect

$349

one-time purchase

1,000 Records — All 7 Formats

✓ All 7 formats in one download
✓ CSV, JSON, Parquet, SQLite
✓ FHIR R4 + HL7 v2.x + C-CDA
✓ 60+ specialties, all 9 encounter types
✓ Commercial-use license
✓ 14-day refund policy

Buy Architect →

Premium

$599

one-time purchase

2,500 Records

✓ All 7 formats
✓ 2,500 records — ML-scale training set
✓ 60+ specialties, full HEDIS-valid cohort
✓ HCC risk validation ready
✓ Commercial-use license
✓ Priority support

Buy Premium →

Need 10,000+ records or custom specialty cohorts? Contact us for Enterprise pricing → | Academic discount: ACADEMIC30

Comparison

How we compare to Synthea and MIMIC

Know your options. Here's an honest comparison for the ML engineer evaluating synthetic healthcare data sources.

Criterion	PatientDatasets.com	Synthea	MIMIC-IV
Clinical narrative depth	Full HPI, ROS, exam, A/P, notes	Skeletal encounters	Real ICU notes (de-identified)
HIPAA / compliance	HIPAA-free, no DUA	HIPAA-free	Requires CITI training + DUA
IRB required to publish	No	No	Institutional review recommended
Time to usable data	Under 1 minute	6+ hours setup + generation	Days to weeks (approval + setup)
Export formats	7 (CSV, JSON, Parquet, FHIR R4, HL7, C-CDA, SQLite)	FHIR, C-CDA, CSV (OMOP via separate ETL)	CSV / PostgreSQL only
Commercial use	Yes — license included	Apache 2.0	Non-commercial / research only
Specialty coverage	60+ specialties	Limited modules	ICU-only
Parquet / BigQuery ready	Yes — native	Requires conversion	Requires conversion

FAQ

Questions from data scientists

Answers to the questions ML engineers and data scientists ask before purchasing.

Can I use synthetic patient data to train a machine learning model?

Yes — this is the primary use case. The data is 100% synthetic, meaning no HIPAA restrictions, no data use agreements, and no IRB approval is required. You can train, test, benchmark, and publish results freely. Parquet and CSV formats are optimized for pandas, scikit-learn, Spark, and BigQuery pipelines. Every record includes 36+ demographic fields, structured labs with reference ranges, vitals, medications with dosages, ICD-10-coded problem lists, and full clinical narratives.

What format is best for pandas and scikit-learn?

CSV works natively with pd.read_csv() and is available from the Sampler tier ($49). Parquet is available from the Innovator tier ($229) and is the preferred format for ML workflows — columnar storage cuts memory usage significantly and loads 3–5x faster than equivalent CSV for large datasets. Both formats require zero preprocessing: download and load immediately. For Spark or BigQuery pipelines, Parquet with Arrow metadata is the clear choice.
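CSV files carry no type information, so pandas infers column types on load. Parquet stores the schema in the file, and `pd.read_parquet` picks the types up from there. If you prefer explicit types on the CSV path, one option is to pin them at load time, as in the sketch below. The column names and rows are hypothetical stand-ins for a downloaded file.

```python
import io

import pandas as pd

# Inline stand-in for a downloaded CSV; column names and rows are hypothetical.
csv_text = (
    "patient_id,encounter_date,hba1c,primary_dx_code\n"
    "p1,2024-01-15,8.1,E11.9\n"
    "p2,2024-02-03,6.9,I10\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"patient_id": "string", "primary_dx_code": "string"},
    parse_dates=["encounter_date"],  # ISO 8601 dates parse directly
)
print(df.dtypes)
```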

How is this different from Synthea?

Synthea generates statistical demographics and skeletal encounters: useful scaffolding, but clinically flat. PatientDatasets.com delivers complete narrative records, including chief complaint, full HPI (OLDCARTS), review of systems, physical exam, assessment, plan, operative reports, and discharge summaries, written the way a clinician actually documents a chart. We ship 7 formats pre-built; Synthea natively exports FHIR, C-CDA, and CSV, and OMOP requires a separate ETL step. And you get an instant download vs. 6+ hours of Java environment setup, configuration, and generation time just to get a basic Synthea cohort running.

Is IRB approval required to publish results based on this data?

No. There are no real patients and no real PHI — 100% AI-generated. Human subjects research regulations (45 CFR 46) do not apply. You can publish results in peer-reviewed journals, share the dataset with co-authors, post to GitHub, cite it in a Methods section, and present at conferences without any institutional review. A commercial-use license is included with every paid tier. Each release is versioned with a DOI-style citation that satisfies most journal data transparency requirements.

What is the largest dataset available?

The Premium tier includes 2,500 records in all 7 formats — suitable for training most classification and NLP models. For larger needs, the Enterprise tier provides 10,000+ record custom cohorts with specialty filtering, custom ICD-10 code distributions, and white-label rights. Contact support@patientdatasets.com for Enterprise pricing and custom configuration options.

Can I use this for a Kaggle competition or portfolio project?

Yes — a commercial-use license is included with every paid tier. You can publish Jupyter notebooks, share derived datasets on Kaggle, post trained models to Hugging Face, and present your work to employers or investors. The only restriction is redistribution of the raw dataset files themselves. The Sampler tier ($49) is a practical entry point for Kaggle and portfolio work; the Innovator tier ($229, 500 records, Parquet) is recommended for anything you plan to publish.

Are the FHIR R4 bundles valid for testing FHIR servers and SMART apps?

Yes. Every FHIR export is a conformant R4 Bundle containing Patient, Encounter, Condition, Observation (vitals and labs), MedicationRequest, and AllergyIntolerance resources. Resources use standard LOINC and SNOMED-CT codes with proper reference chaining. They are suitable for testing FHIR servers, FHIR validation engines, SMART on FHIR applications, and CDS Hooks integrations without needing a live EHR connection. FHIR R4 is available in the Architect tier and above.
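A quick sanity check on any FHIR R4 Bundle is to count resources by type and confirm that every reference resolves to a resource in the same bundle. The sketch below does this with the standard library on a tiny hand-written bundle. It assumes relative references of the form `ResourceType/id`; bundles that use `urn:uuid:` full URLs would need the lookup keyed on `fullUrl` instead.

```python
import json
from collections import Counter

# Tiny hand-written bundle for illustration, not an actual export.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Encounter", "id": "e1",
                  "subject": {"reference": "Patient/p1"}}},
    {"resource": {"resourceType": "Condition", "id": "c1",
                  "subject": {"reference": "Patient/p1"},
                  "encounter": {"reference": "Encounter/e1"}}}
  ]
}
"""

bundle = json.loads(bundle_json)
resources = [entry["resource"] for entry in bundle["entry"]]
counts = Counter(r["resourceType"] for r in resources)

# Every resource is addressable as "ResourceType/id" (assumed reference style).
known_ids = {f'{r["resourceType"]}/{r["id"]}' for r in resources}


def find_references(node):
    """Yield every 'reference' string nested anywhere in a resource."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "reference" and isinstance(value, str):
                yield value
            else:
                yield from find_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_references(item)


dangling = [
    ref for r in resources for ref in find_references(r) if ref not in known_ids
]
print(dict(counts), "dangling references:", dangling)
```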

How are comorbidity patterns calibrated? Will my model generalize?

Comorbidity co-occurrence rates are calibrated to CDC/NHANES and CMS administrative data baselines. Conditions like type 2 diabetes (E11) with CKD (N18), hypertension (I10) with heart failure (I50), and COPD (J44) with respiratory failure appear at epidemiologically realistic rates — not randomly assigned. Demographic distributions for age, sex, race/ethnicity, and insurance type match national enrollment distributions. Models trained on PatientDatasets records have realistic covariate structure for generalization studies.
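You can check a co-occurrence rate yourself before relying on it. The sketch below computes the share of type 2 diabetes records (E11) that also carry a CKD code (N18), using the quickstart's `primary_dx_code` and `secondary_dx` column names. The five rows are invented, so the resulting rate says nothing about the actual dataset.

```python
import pandas as pd

# Invented rows using the quickstart's column names.
df = pd.DataFrame({
    "primary_dx_code": ["E11.9", "E11.22", "I10", "E11.65", "J44.1"],
    "secondary_dx": ["N18.3", "N18.4", "I50.9", None, "J96.00"],
})

# na=False treats missing codes as non-matches instead of propagating NaN.
has_t2dm = df["primary_dx_code"].str.startswith("E11", na=False)
has_ckd = df["secondary_dx"].str.contains("N18", na=False, regex=False)

# Share of T2DM records that also have CKD.
ckd_rate_among_t2dm = (has_t2dm & has_ckd).sum() / has_t2dm.sum()
print(f"{ckd_rate_among_t2dm:.2f}")
```

Running the same calculation on a purchased cohort and comparing it with a published prevalence figure is one way to verify the calibration for the conditions your model depends on.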

FREE SAMPLE

Free 5-record sample.
No credit card.

Download 5 complete synthetic patient records — full clinical narrative, 36+ demographic fields, structured labs, ICD-10 codes, medications — and evaluate the quality before you commit.

Download Free Sample → View Pricing

Delivered in under 1 minute · No signup required · 14-day refund on paid tiers
