Python / pandas · ML Training Data · scikit-learn / PyTorch · FHIR R4 / HL7 · Parquet / Spark / BigQuery · Clinical NLP

Synthetic Patient Data for
Machine Learning & AI

Production-ready healthcare datasets in 7 formats. No HIPAA overhead. No IRB delays.
Train your model today.

HIPAA-free: no compliance overhead  ·  IRB-free: publish results freely  ·  Instant download  ·  Commercial-use license included
19,900+ Records generated
60+ Clinical specialties
7 Export formats
92.5%+ BOSS quality score
9 Encounter types
36+ Demographic fields
Built for ML workflows

Why data scientists choose PatientDatasets

Every design decision optimizes for the data science workflow — from download to model in minutes, not days.

📈

Parquet + CSV native

Load directly into pandas, Spark, DuckDB, or BigQuery. Parquet columnar format cuts memory overhead by 60–80% vs. raw CSV for large feature engineering jobs.

Parquet · CSV · DuckDB

Zero preprocessing required

Download and use immediately. Consistent schema, validated data types, normalized ICD-10 codes, and reference-range-annotated labs — no cleaning scripts needed.

Ready to train
🔒

HIPAA-free — zero compliance tax

100% synthetic means no BAA, no DUA, no IRB, no de-identification review. Work on your laptop, a shared cluster, or a public cloud instance without legal overhead.

No DUA required
🎯

Realistic comorbidity patterns

Calibrated to CDC/NHANES epidemiological baselines. Diabetes with CKD co-occurs at realistic rates. HTN cascades into CVD correctly. Models trained here generalize.

ICD-10-CM · SNOMED-CT
📚

IRB-free — publish freely

No human subjects, no 45 CFR 46 requirements. Publish benchmark results, share on GitHub, submit to arXiv or peer-reviewed journals without institutional review.

Commercial license included

36+ demographic fields for bias testing

Age, sex, race/ethnicity, insurance type, ZIP-level SDoH, preferred language, and more — slice your model's performance across every dimension your IRB would require.

Fairness testing · SDoH fields
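
To slice performance by a demographic field, group the predictions and compute the metric per group. The sketch below is a minimal illustration using only the standard library. `insurance_type` is a column named in the quickstart, but the rows, labels, and predictions are invented placeholders, not values from the dataset.

```python
from collections import defaultdict


def accuracy_by_group(rows, group_key):
    """Return {group value: accuracy} for rows holding 'label' and 'pred'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        correct[group] += int(row["label"] == row["pred"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical model outputs; real rows would pair dataset records with your model's predictions.
rows = [
    {"insurance_type": "Medicare", "label": 1, "pred": 1},
    {"insurance_type": "Medicare", "label": 0, "pred": 1},
    {"insurance_type": "Commercial", "label": 1, "pred": 1},
    {"insurance_type": "Commercial", "label": 0, "pred": 0},
]

per_group = accuracy_by_group(rows, "insurance_type")
# Largest accuracy difference between any two groups.
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

The same function works for any other categorical field (sex, race/ethnicity, preferred language) by changing `group_key`.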
Quickstart

From download to DataFrame in under a minute

No ETL pipeline. No schema mapping. No type coercion. The Parquet file loads directly into a clean pandas DataFrame with correctly typed columns and no nulls in key fields.

  • Parquet schema includes column types, nullable flags, and metadata
  • ICD-10 codes pre-mapped — no crosswalk step needed
  • Dates parsed as ISO 8601, labs as float64 with reference ranges included
  • Full clinical notes in a single text column — NLP-ready
quickstart.py
    import pandas as pd  # read_parquet needs a Parquet engine such as pyarrow installed

    # Load the dataset with zero preprocessing
    df = pd.read_parquet("patientdatasets_500.parquet")

    # 500 patients, 36+ columns, ready
    print(df.shape)  # (500, 42)

    # Filter: diabetic patients with CKD
    cohort = df[
        df["primary_dx_code"].str.startswith("E11", na=False)
        & df["secondary_dx"].str.contains("N18", na=False)
    ]

    # Clinical notes column, NLP-ready
    notes = df["clinical_note"].tolist()
    print(f"{len(notes)} notes loaded")  # 500 notes

    # Feature engineering example
    X = df[["age", "bmi", "hba1c", "egfr", "systolic_bp", "insurance_type"]]
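
The feature matrix selected in the quickstart includes `insurance_type`, which is presumably a string column. Most scikit-learn estimators expect numeric input, so one possible next step is one-hot encoding. The sketch below uses a tiny invented frame as a stand-in for `X`; the values and category names are assumptions.

```python
import pandas as pd

# Toy stand-in for the quickstart's X; the values are invented for illustration.
X = pd.DataFrame({
    "age": [54, 67, 41],
    "hba1c": [7.2, 8.1, 5.6],
    "insurance_type": ["Medicare", "Medicaid", "Medicare"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X, columns=["insurance_type"])
print(list(X_encoded.columns))
```

Tree-based libraries that accept categorical dtypes directly may not need this step.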
Use cases

What ML teams build with this data

From production healthcare AI to portfolio projects — these are the workflows our customers run.

🤖

Clinical NLP model training

Fine-tune BioBERT, ClinicalBERT, or GPT models on realistic clinical narratives — chief complaint, HPI, ROS, assessment/plan. Full discharge summaries and operative notes included.

NER · BioBERT · ICD coding
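
Before fine-tuning, it often helps to split each narrative into its sections so that models see HPI, ROS, and assessment/plan text separately. The sketch below assumes the notes mark sections with uppercase headers followed by a colon. That layout, the header names, and the example note are assumptions for illustration; adjust the pattern to the headers the notes actually use.

```python
import re

# Assumed section headers; the real notes may label sections differently.
SECTION_PATTERN = re.compile(
    r"^(CHIEF COMPLAINT|HPI|ROS|ASSESSMENT/PLAN):\s*", re.MULTILINE
)


def split_sections(note):
    """Split a note into {header: body} using the assumed headers above."""
    parts = SECTION_PATTERN.split(note)
    # With one capture group, parts = [preamble, header1, body1, header2, body2, ...]
    headers, bodies = parts[1::2], parts[2::2]
    return {header: body.strip() for header, body in zip(headers, bodies)}


# Invented example note, not taken from the dataset.
note = """CHIEF COMPLAINT: Fatigue for two weeks.
HPI: 58-year-old with type 2 diabetes reports increasing fatigue.
ROS: Denies chest pain or dyspnea.
ASSESSMENT/PLAN: Check HbA1c and eGFR; follow up in two weeks."""

sections = split_sections(note)
print(sections["HPI"])
```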
📊

Denial prediction models

Build and validate prior-authorization and claims denial classifiers. Structured diagnosis codes, insurance type, procedure codes, and payer fields make for rich feature sets.

XGBoost · Classification
🏢

HCC risk adjustment ML

Train and benchmark HCC RAF score predictors. Comorbidity patterns and demographic distributions are calibrated to CMS V28 model epidemiological baselines.

HCC V28 · Risk scoring
📄

EHR NLP pipelines

Test extraction pipelines for medications, lab values, problem lists, and clinical events. FHIR R4 bundles provide structured reference output for evaluation.

FHIR R4 · spaCy · medspaCy
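
One way to use a FHIR bundle as reference output is to pull the structured medication list from it and score an extraction pipeline's output against that list. The sketch below assumes medications appear as `MedicationRequest` resources with a `medicationCodeableConcept.text` value, which is one of the shapes FHIR R4 allows. The bundle and the "predicted" set are invented, and the real bundles may encode medications differently.

```python
import json


def reference_medications(bundle):
    """Collect lowercase medication names from MedicationRequest resources."""
    names = set()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == "MedicationRequest":
            text = resource.get("medicationCodeableConcept", {}).get("text")
            if text:
                names.add(text.lower())
    return names


def precision_recall(predicted, reference):
    """Set-based precision and recall; returns 0.0 when a denominator is empty."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Trimmed, invented bundle; real resources carry many more fields.
bundle = json.loads("""
{"resourceType": "Bundle", "type": "collection", "entry": [
  {"resource": {"resourceType": "MedicationRequest",
                "medicationCodeableConcept": {"text": "Metformin"}}},
  {"resource": {"resourceType": "MedicationRequest",
                "medicationCodeableConcept": {"text": "Lisinopril"}}},
  {"resource": {"resourceType": "Condition",
                "code": {"text": "Type 2 diabetes mellitus"}}}
]}
""")

predicted = {"metformin", "atorvastatin"}  # hypothetical pipeline output
reference = reference_medications(bundle)
precision, recall = precision_recall(predicted, reference)
print(precision, recall)
```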
🏆

Portfolio & Kaggle-style projects

Build a compelling healthcare ML portfolio without data access barriers. Commercial-use license means you can publish your notebook, share on GitHub, and present to employers.

Public repo OK · Commercial license
💾

Benchmark datasets

Establish reproducible baselines for clinical NLP and structured prediction tasks. Versioned releases with DOI-style citation satisfy most journal data transparency requirements.

Reproducible · Versioned
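
A reproducible baseline also needs a train/test split that does not change between runs or machines. One common approach, sketched here, is to hash a stable patient identifier and assign the split from the hash. The `P0000`-style IDs are a hypothetical format, not the dataset's actual identifier scheme.

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2):
    """Deterministically map an ID to 'train' or 'test' via a stable hash."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    # First 8 hex digits give an integer in [0, 2**32); scale it into [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


patient_ids = [f"P{n:04d}" for n in range(500)]  # hypothetical ID format
splits = {pid: assign_split(pid) for pid in patient_ids}
test_count = sum(1 for split in splits.values() if split == "test")
print(test_count, "of", len(patient_ids), "patients in the test split")
```

Because the assignment depends only on the ID, a patient stays in the same split even if rows are reordered or new records are added.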
7 formats

7 formats, zero conversion

Every format is pre-built and validated. No OMOP transformation scripts, no FHIR mapping, no ETL. Pick your stack and load.

| Format | Best for | Available from |
|---|---|---|
| .parquet | Spark, DuckDB, BigQuery, pandas: columnar analytics at scale | Innovator+ |
| .csv | pandas, R, Excel, any tabular tool: universal compatibility | All tiers |
| .json | REST APIs, document stores, MongoDB, Elasticsearch | Sampler+ |
| FHIR R4 | Healthcare interoperability, SMART on FHIR app testing, FHIR server validation | Architect+ |
| HL7 v2.x | Interface engine testing, ADT/ORU message pipelines, legacy EHR integration | Architect+ |
| C-CDA | Clinical document exchange, CDA parsing pipelines, Meaningful Use testing | Architect+ |
| .sqlite | Relational queries, SQL practice, embedded database applications | Professional+ |
📌 Architect tier ($349) includes all 7 formats in a single download
📌 FHIR R4 bundles are conformant, R4-validated, and include LOINC + SNOMED-CT codes
📌 Parquet schema uses Apache Arrow metadata
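
For the SQLite format, Python's built-in `sqlite3` module is enough to run relational queries. The sketch below builds a small in-memory database as a stand-in for the downloaded file. The table names (`patients`, `diagnoses`), column names, and rows are hypothetical; inspect the shipped schema before adapting the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the downloaded .sqlite file
conn.executescript("""
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE diagnoses (patient_id TEXT, icd10_code TEXT);
    INSERT INTO patients VALUES ('P1', 61), ('P2', 47), ('P3', 70);
    INSERT INTO diagnoses VALUES
        ('P1', 'E11.9'), ('P1', 'N18.30'), ('P2', 'I10'), ('P3', 'E11.65');
""")

# Patients with any type 2 diabetes code (ICD-10-CM prefix E11).
rows = conn.execute("""
    SELECT p.patient_id, p.age
    FROM patients AS p
    JOIN diagnoses AS d ON d.patient_id = p.patient_id
    WHERE d.icd10_code LIKE 'E11%'
    GROUP BY p.patient_id, p.age
    ORDER BY p.patient_id
""").fetchall()
conn.close()
print(rows)
```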
One-time purchase

One-time purchase. Instant download.

No subscription. No per-seat fees. Buy once, use forever. Download in under a minute.

Innovator
$229

one-time purchase

500 Records

  • CSV + JSON + Parquet
  • 60+ specialties
  • 36+ demographic fields
  • Commercial-use license
  • Instant download
  • 14-day refund policy
Buy Innovator →
Premium
$599

one-time purchase

2,500 Records

  • All 7 formats
  • 2,500 records — ML-scale training set
  • 60+ specialties, full HEDIS-valid cohort
  • HCC risk validation ready
  • Commercial-use license
  • Priority support
Buy Premium →
Need 10,000+ records or custom specialty cohorts? Contact us for Enterprise pricing →  |  Academic discount: ACADEMIC30
Comparison

How we compare to Synthea and MIMIC

Know your options. Here's an honest comparison for the ML engineer evaluating synthetic healthcare data sources.

| Criterion | PatientDatasets.com | Synthea | MIMIC-IV |
|---|---|---|---|
| Clinical narrative depth | Full HPI, ROS, exam, A/P, notes | Skeletal encounters | Real ICU notes (de-identified) |
| HIPAA / compliance | HIPAA-free, no DUA | HIPAA-free | Requires CITI training + DUA |
| IRB required to publish | No | No | Institutional review recommended |
| Time to usable data | Under 1 minute | 6+ hours setup + generation | Days to weeks (approval + setup) |
| Export formats | 7 (CSV, JSON, Parquet, FHIR R4, HL7, C-CDA, SQLite) | FHIR, OMOP only | CSV / PostgreSQL only |
| Commercial use | Yes, license included | Apache 2.0 | Non-commercial / research only |
| Specialty coverage | 60+ specialties | Limited modules | ICU-only |
| Parquet / BigQuery ready | Yes, native | Requires conversion | Requires conversion |
FAQ

Questions from data scientists

Answers to the questions ML engineers and data scientists ask before purchasing.

Can I use synthetic patient data to train a machine learning model?

Yes, this is the primary use case. The data is 100% synthetic, so there are no HIPAA restrictions, no data use agreements, and no IRB approval requirements. You can train, test, benchmark, and publish results freely. Parquet and CSV formats are optimized for pandas, scikit-learn, Spark, and BigQuery pipelines. Every record includes 36+ demographic fields, structured labs with reference ranges, vitals, medications with dosages, ICD-10-coded problem lists, and full clinical narratives.

What format is best for pandas and scikit-learn?

CSV works natively with pd.read_csv() and is available from the Sampler tier ($49). Parquet is available from the Innovator tier ($229) and is the preferred format for ML workflows — columnar storage cuts memory usage significantly and loads 3–5x faster than equivalent CSV for large datasets. Both formats require zero preprocessing: download and load immediately. For Spark or BigQuery pipelines, Parquet with Arrow metadata is the clear choice.

How is this different from Synthea?

Synthea generates statistical demographics and skeletal encounters — useful scaffolding, but clinically flat. PatientDatasets.com delivers complete narrative records: chief complaint, full HPI (OLDCARTS), review of systems, physical exam, assessment, plan, operative reports, and discharge summaries — written the way a clinician actually documents a chart. We ship 7 formats natively vs. Synthea's FHIR/OMOP only. And you get an instant download vs. 6+ hours of Java environment setup, configuration, and generation time just to get a basic Synthea cohort running.

Is IRB approval required to publish results based on this data?

No. There are no real patients and no real PHI: the data is 100% AI-generated. Human subjects research regulations (45 CFR 46) do not apply. You can publish results in peer-reviewed journals, share the dataset with co-authors, post to GitHub, cite it in a Methods section, and present at conferences without any institutional review. A commercial-use license is included with every paid tier. Each release is versioned with a DOI-style citation that satisfies most journal data transparency requirements.

What is the largest dataset available?

The Premium tier includes 2,500 records in all 7 formats — suitable for training most classification and NLP models. For larger needs, the Enterprise tier provides 10,000+ record custom cohorts with specialty filtering, custom ICD-10 code distributions, and white-label rights. Contact support@patientdatasets.com for Enterprise pricing and custom configuration options.

Can I use this for a Kaggle competition or portfolio project?

Yes — a commercial-use license is included with every paid tier. You can publish Jupyter notebooks, share derived datasets on Kaggle, post trained models to Hugging Face, and present your work to employers or investors. The only restriction is redistribution of the raw dataset files themselves. The Sampler tier ($49) is a practical entry point for Kaggle and portfolio work; the Innovator tier ($229, 500 records, Parquet) is recommended for anything you plan to publish.

Are the FHIR R4 bundles valid for testing FHIR servers and SMART apps?

Yes. Every FHIR export is a conformant R4 Bundle containing Patient, Encounter, Condition, Observation (vitals and labs), MedicationRequest, and AllergyIntolerance resources. Resources use standard LOINC and SNOMED-CT codes with proper reference chaining. They are suitable for testing FHIR servers, FHIR validation engines, SMART on FHIR applications, and CDS Hooks integrations without needing a live EHR connection. FHIR R4 is available in the Architect tier and above.
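
Reference chaining can be spot-checked before loading a bundle into a test server: every `reference` value should point at a resource that exists in the same bundle. The sketch below assumes relative references of the form `ResourceType/id`; bundles that use `urn:uuid:` references with `fullUrl` would need a different lookup. The bundle is an invented, trimmed example with one deliberately broken reference.

```python
def unresolved_references(bundle):
    """Return sorted 'Type/id' references that match no resource in the bundle."""
    known = set()
    for entry in bundle.get("entry", []):
        resource = entry["resource"]
        known.add(f"{resource['resourceType']}/{resource['id']}")

    def walk(node):
        # Yield every string stored under a 'reference' key, at any depth.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "reference" and isinstance(value, str):
                    yield value
                else:
                    yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)

    return sorted(ref for ref in set(walk(bundle)) if ref not in known)


# Trimmed, invented bundle; real resources carry required fields omitted here.
bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1",
                      "subject": {"reference": "Patient/p1"}}},
        {"resource": {"resourceType": "Condition", "id": "c1",
                      "subject": {"reference": "Patient/p1"},
                      "encounter": {"reference": "Encounter/e2"}}},
    ],
}

missing = unresolved_references(bundle)
print(missing)
```

An empty result means every relative reference resolves inside the bundle; it does not replace full profile validation with a FHIR validator.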

How are comorbidity patterns calibrated? Will my model generalize?

Comorbidity co-occurrence rates are calibrated to CDC/NHANES and CMS administrative data baselines. Conditions like type 2 diabetes (E11) with CKD (N18), hypertension (I10) with heart failure (I50), and COPD (J44) with respiratory failure appear at epidemiologically realistic rates — not randomly assigned. Demographic distributions for age, sex, race/ethnicity, and insurance type match national enrollment distributions. Models trained on PatientDatasets records have realistic covariate structure for generalization studies.
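
Co-occurrence rates like these can be checked directly on a downloaded cohort. The sketch below computes, among patients with an index condition, the fraction who also carry a comorbid condition, matching on ICD-10 code prefixes. The patient IDs and code lists are invented for illustration, so the resulting rate says nothing about the actual dataset.

```python
def cooccurrence_rate(patients, index_prefix, comorbid_prefix):
    """Among patients with an index-prefix code, the fraction who also have a comorbid-prefix code."""
    def has(codes, prefix):
        return any(code.startswith(prefix) for code in codes)

    index_cohort = [codes for codes in patients.values() if has(codes, index_prefix)]
    if not index_cohort:
        return None  # no patients with the index condition
    both = sum(1 for codes in index_cohort if has(codes, comorbid_prefix))
    return both / len(index_cohort)


# Invented patients mapped to their ICD-10 code lists.
patients = {
    "P1": ["E11.9", "N18.30"],
    "P2": ["E11.65"],
    "P3": ["I10", "I50.9"],
    "P4": ["E11.22", "N18.4", "I10"],
}

rate = cooccurrence_rate(patients, "E11", "N18")
print(f"CKD among patients with type 2 diabetes: {rate:.2f}")
```

Comparing rates computed this way against a published baseline is one way to verify the calibration for the condition pairs your model depends on.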

FREE SAMPLE

Free 5-record sample.
No credit card.

Download 5 complete synthetic patient records — full clinical narrative, 36+ demographic fields, structured labs, ICD-10 codes, medications — and evaluate the quality before you commit.

Download Free Sample →  |  View Pricing

Delivered in under 1 minute  ·  No signup required  ·  14-day refund on paid tiers