Parquet + CSV native
Load directly into pandas, Spark, DuckDB, or BigQuery. Parquet columnar format cuts memory overhead by 60–80% vs. raw CSV for large feature engineering jobs.
Every design decision optimizes for the data science workflow — from download to model in minutes, not days.
Download and use immediately. Consistent schema, validated data types, normalized ICD-10 codes, and reference-range-annotated labs — no cleaning scripts needed.
100% synthetic means no BAA, no DUA, no IRB, no de-identification review. Work on your laptop, a shared cluster, or a public cloud instance without legal overhead.
Calibrated to CDC/NHANES epidemiological baselines. Diabetes co-occurs with CKD at realistic rates, and hypertension (HTN) progresses to cardiovascular disease (CVD) along clinically plausible pathways, so models trained here see a realistic covariate structure.
No human subjects, no 45 CFR 46 requirements. Publish benchmark results, share on GitHub, submit to arXiv or peer-reviewed journals without institutional review.
Age, sex, race/ethnicity, insurance type, ZIP-level SDoH, preferred language, and more — slice your model's performance across every dimension your IRB would require.
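Slicing performance by demographic field can be as simple as grouping predictions before scoring them. The sketch below is a minimal illustration using only the Python standard library. The column names (`sex`, `insurance_type`, `y_true`, `y_pred`) and the sample rows are assumptions for demonstration, not the dataset's documented schema.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key):
    """Return {group value: accuracy} for one demographic column."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        hits[group] += int(row["y_true"] == row["y_pred"])
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical rows: column names and values are illustrative only.
rows = [
    {"sex": "F", "insurance_type": "Medicare", "y_true": 1, "y_pred": 1},
    {"sex": "F", "insurance_type": "Commercial", "y_true": 0, "y_pred": 1},
    {"sex": "M", "insurance_type": "Medicare", "y_true": 1, "y_pred": 1},
    {"sex": "M", "insurance_type": "Medicaid", "y_true": 0, "y_pred": 0},
]

by_sex = accuracy_by_group(rows, "sex")
by_insurance = accuracy_by_group(rows, "insurance_type")
print(by_sex)
print(by_insurance)
```

The same function works for any categorical field, so one loop over the demographic columns gives a per-dimension breakdown.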
No ETL pipeline. No schema mapping. No type coercion. The Parquet file loads directly into a clean pandas DataFrame with correctly typed columns and no nulls in key fields.
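If you want to confirm the "correctly typed, no nulls in key fields" property on your own copy, a quick check along these lines may help. It is a sketch that uses only the Python standard library on an inline CSV sample rather than the Parquet file itself. The column names (`patient_id`, `age`, `sex`, `hba1c`) and the choice of key fields are assumptions for illustration, not the documented schema.

```python
import csv
import io

# Inline stand-in for a downloaded CSV export (hypothetical columns).
SAMPLE_CSV = """patient_id,age,sex,hba1c
P0001,67,F,7.2
P0002,54,M,6.1
P0003,41,F,
"""

KEY_FIELDS = ("patient_id", "age", "sex")  # assumed key fields

def find_problems(csv_text, key_fields=KEY_FIELDS):
    """Return (row_number, message) pairs for empty key fields or non-integer ages."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for field in key_fields:
            if not (row.get(field) or "").strip():
                problems.append((row_number, f"empty key field: {field}"))
        age = (row.get("age") or "").strip()
        if age and not age.isdigit():
            problems.append((row_number, f"age is not an integer: {age!r}"))
    return problems

problems = find_problems(SAMPLE_CSV)
print(problems)
```

An empty list means every key field is populated and every age parses as an integer. The blank `hba1c` in the third sample row is not flagged because it is not treated as a key field here.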
From production healthcare AI to portfolio projects — these are the workflows our customers run.
Fine-tune BioBERT, ClinicalBERT, or GPT models on realistic clinical narratives — chief complaint, HPI, ROS, assessment/plan. Full discharge summaries and operative notes included.
Build and validate prior-authorization and claims denial classifiers. Structured diagnosis codes, insurance type, procedure codes, and payer fields make for rich feature sets.
Train and benchmark HCC RAF score predictors. Comorbidity patterns and demographic distributions are calibrated to the epidemiological baselines of the CMS-HCC V28 model.
Test extraction pipelines for medications, lab values, problem lists, and clinical events. FHIR R4 bundles provide structured reference output for evaluation.
Build a compelling healthcare ML portfolio without data access barriers. Commercial-use license means you can publish your notebook, share on GitHub, and present to employers.
Establish reproducible baselines for clinical NLP and structured prediction tasks. Versioned releases with DOI-style citation satisfy most journal data transparency requirements.
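For the extraction-testing workflow described above, scoring usually comes down to comparing what a pipeline pulled out of a note with the structured reference for the same record. The sketch below shows one common way to do that with set-based precision, recall, and F1. The medication names are invented for illustration, and how you obtain and normalize each set (for example from a FHIR bundle) is left as an assumption.

```python
def precision_recall_f1(extracted, reference):
    """Compare two sets of normalized items and return (precision, recall, f1)."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: medication names pulled from a note by some extractor,
# versus the names listed in the structured reference for the same record.
reference_meds = {"metformin", "lisinopril", "atorvastatin", "aspirin"}
extracted_meds = {"metformin", "lisinopril", "ibuprofen"}

precision, recall, f1 = precision_recall_f1(extracted_meds, reference_meds)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact-match scoring like this is strict. Real evaluations often normalize strings or map to codes first, which is outside the scope of this sketch.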
Every format is pre-built and validated. No OMOP transformation scripts, no FHIR mapping, no ETL. Pick your stack and load.
| Format | Best for | Available from |
|---|---|---|
| .parquet | Spark, DuckDB, BigQuery, pandas — columnar analytics at scale | Innovator+ |
| .csv | pandas, R, Excel, any tabular tool — universal compatibility | All tiers |
| .json | REST APIs, document stores, MongoDB, Elasticsearch | Sampler+ |
| FHIR R4 | Healthcare interoperability, SMART on FHIR app testing, FHIR server validation | Architect+ |
| HL7 v2.x | Interface engine testing, ADT/ORU message pipelines, legacy EHR integration | Architect+ |
| C-CDA | Clinical document exchange, CDA parsing pipelines, Meaningful Use testing | Architect+ |
| .sqlite | Relational queries, SQL practice, embedded database applications | Professional+ |
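As an illustration of the relational use case in the `.sqlite` row, the sketch below runs a join with Python's built-in `sqlite3` module. It builds a tiny in-memory database so it can run anywhere. The table and column names (`patients`, `conditions`, `icd10_code`) are hypothetical and may not match the shipped schema.

```python
import sqlite3

# In-memory stand-in for the shipped .sqlite file; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, age INTEGER, sex TEXT);
    CREATE TABLE conditions (patient_id TEXT, icd10_code TEXT);
    INSERT INTO patients VALUES
        ('P0001', 67, 'F'), ('P0002', 54, 'M'), ('P0003', 41, 'F');
    INSERT INTO conditions VALUES
        ('P0001', 'E11.9'), ('P0001', 'N18.3'),
        ('P0002', 'I10'),
        ('P0003', 'E11.9');
""")

# Patients with any type 2 diabetes code (ICD-10 E11.*), oldest first.
query = """
    SELECT DISTINCT p.patient_id, p.age
    FROM patients AS p
    JOIN conditions AS c ON c.patient_id = p.patient_id
    WHERE c.icd10_code LIKE 'E11%'
    ORDER BY p.age DESC
"""
diabetic_patients = conn.execute(query).fetchall()
print(diabetic_patients)
conn.close()
```

With a real download you would pass the file path to `sqlite3.connect()` instead of `":memory:"` and skip the setup script.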
No subscription. No per-seat fees. Buy once, use forever. Download in under a minute.
one-time purchase
500 Records
1,000 Records — All 7 Formats
2,500 Records
ACADEMIC30
Know your options. Here's an honest comparison for the ML engineer evaluating synthetic healthcare data sources.
| Criterion | PatientDatasets.com | Synthea | MIMIC-IV |
|---|---|---|---|
| Clinical narrative depth | Full HPI, ROS, exam, A/P, notes | Skeletal encounters | Real ICU notes (de-identified) |
| HIPAA / compliance | HIPAA-free, no DUA | HIPAA-free | Requires CITI training + DUA |
| IRB required to publish | No | No | Institutional review recommended |
| Time to usable data | Under 1 minute | 6+ hours setup + generation | Days to weeks (approval + setup) |
| Export formats | 7 (CSV, JSON, Parquet, FHIR R4, HL7 v2.x, C-CDA, SQLite) | FHIR, C-CDA, CSV (OMOP via separate ETL) | CSV / PostgreSQL only |
| Commercial use | Yes — license included | Apache 2.0 | Non-commercial / research only |
| Specialty coverage | 60+ specialties | Limited modules | ICU-only |
| Parquet / BigQuery ready | Yes — native | Requires conversion | Requires conversion |
Answers to the questions ML engineers and data scientists ask before purchasing.
Yes, this is the primary use case. The data is 100% synthetic, so there are no HIPAA restrictions, no data use agreements, and no IRB approval requirement. You can train, test, benchmark, and publish results freely. Parquet and CSV formats are optimized for pandas, scikit-learn, Spark, and BigQuery pipelines. Every record includes 36+ demographic fields, structured labs with reference ranges, vitals, medications with dosages, ICD-10-coded problem lists, and full clinical narratives.
CSV works natively with pd.read_csv() and is available from the Sampler tier ($49). Parquet is available from the Innovator tier ($229) and is the preferred format for ML workflows — columnar storage cuts memory usage significantly and loads 3–5x faster than equivalent CSV for large datasets. Both formats require zero preprocessing: download and load immediately. For Spark or BigQuery pipelines, Parquet with Arrow metadata is the clear choice.
Synthea generates statistical demographics and skeletal encounters: useful scaffolding, but clinically flat. PatientDatasets.com delivers complete narrative records (chief complaint, full HPI in OLDCARTS form, review of systems, physical exam, assessment, plan, operative reports, and discharge summaries), written the way a clinician actually documents a chart. We ship 7 formats natively, whereas Synthea's built-in exporters center on FHIR, C-CDA, and CSV, with OMOP requiring a separate ETL step. And you get an instant download instead of the 6+ hours of Java environment setup, configuration, and generation time it can take to get a basic Synthea cohort running.
No. There are no real patients and no real PHI — 100% AI-generated. Human subjects research regulations (45 CFR 46) do not apply. You can publish results in peer-reviewed journals, share the dataset with co-authors, post to GitHub, cite it in a Methods section, and present at conferences without any institutional review. A commercial-use license is included with every paid tier. Each release is versioned with a DOI-style citation that satisfies most journal data transparency requirements.
The Premium tier includes 2,500 records in all 7 formats — suitable for training most classification and NLP models. For larger needs, the Enterprise tier provides 10,000+ record custom cohorts with specialty filtering, custom ICD-10 code distributions, and white-label rights. Contact support@patientdatasets.com for Enterprise pricing and custom configuration options.
Yes — a commercial-use license is included with every paid tier. You can publish Jupyter notebooks, share derived datasets on Kaggle, post trained models to Hugging Face, and present your work to employers or investors. The only restriction is redistribution of the raw dataset files themselves. The Sampler tier ($49) is a practical entry point for Kaggle and portfolio work; the Innovator tier ($229, 500 records, Parquet) is recommended for anything you plan to publish.
Yes. Every FHIR export is a conformant R4 Bundle containing Patient, Encounter, Condition, Observation (vitals and labs), MedicationRequest, and AllergyIntolerance resources. Resources use standard LOINC and SNOMED CT codes, and references between resources resolve within the bundle. The bundles are suitable for testing FHIR servers, FHIR validation engines, SMART on FHIR applications, and CDS Hooks integrations without needing a live EHR connection. FHIR R4 is available in the Architect tier and above.
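A quick way to sanity-check a bundle before pointing a server or validator at it is to count resource types and confirm that references resolve. The sketch below does this on a small hand-written Bundle with the standard `json` module. It assumes relative references of the form `ResourceType/id`, which may not match how the actual exports are linked (bundles can also use `urn:uuid:` references), and the sample includes one deliberately broken reference to show what the check reports.

```python
import json
from collections import Counter

# Tiny hand-written Bundle for illustration; real exports contain far more fields.
BUNDLE_JSON = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Encounter", "id": "e1",
                  "subject": {"reference": "Patient/p1"}}},
    {"resource": {"resourceType": "Condition", "id": "c1",
                  "subject": {"reference": "Patient/p1"},
                  "encounter": {"reference": "Encounter/e1"}}},
    {"resource": {"resourceType": "Observation", "id": "o1",
                  "subject": {"reference": "Patient/p1"},
                  "encounter": {"reference": "Encounter/e9"}}}
  ]
}
"""

def collect_references(node):
    """Yield every string stored under a "reference" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "reference" and isinstance(value, str):
                yield value
            else:
                yield from collect_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_references(item)

bundle = json.loads(BUNDLE_JSON)
resources = [entry["resource"] for entry in bundle["entry"]]

type_counts = Counter(r["resourceType"] for r in resources)
known_ids = {f'{r["resourceType"]}/{r["id"]}' for r in resources}
dangling = sorted(
    ref for r in resources for ref in collect_references(r) if ref not in known_ids
)

print(dict(type_counts))
print(dangling)
```

An empty `dangling` list means every relative reference points at a resource present in the same bundle. This is a structural check only and is no substitute for a full FHIR validator.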
Comorbidity co-occurrence rates are calibrated to CDC/NHANES and CMS administrative data baselines. Conditions like type 2 diabetes (E11) with CKD (N18), hypertension (I10) with heart failure (I50), and COPD (J44) with respiratory failure (J96) appear at epidemiologically realistic rates rather than being randomly assigned. Demographic distributions for age, sex, race/ethnicity, and insurance type match national enrollment distributions. Models trained on PatientDatasets records therefore see a realistic covariate structure, which matters for generalization studies.
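You can measure co-occurrence yourself and compare it against whatever published baseline you trust. The sketch below computes the share of patients with one ICD-10 code family who also carry another. The problem lists are invented for illustration, and representing each patient as a plain list of code strings is an assumption about how you have loaded the data.

```python
def cooccurrence_rate(problem_lists, index_prefix, comorbid_prefix):
    """Among patients with a code starting with index_prefix, return the
    fraction who also have a code starting with comorbid_prefix."""
    with_index = [
        codes for codes in problem_lists
        if any(code.startswith(index_prefix) for code in codes)
    ]
    if not with_index:
        return None  # no patients with the index condition
    with_both = sum(
        any(code.startswith(comorbid_prefix) for code in codes)
        for codes in with_index
    )
    return with_both / len(with_index)

# Hypothetical problem lists, one list of ICD-10 codes per patient.
problem_lists = [
    ["E11.9", "N18.3", "I10"],
    ["E11.65"],
    ["I10", "I50.9"],
    ["E11.22", "N18.4"],
    ["J44.1"],
]

rate = cooccurrence_rate(problem_lists, "E11", "N18")
print(f"CKD among T2DM patients: {rate:.2f}")
```

Prefix matching keeps the example short. With only five invented patients the printed rate says nothing about the real dataset.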