
Clinically Engineered Data Sets

Q: I'm new — which tier should I buy?

Start with the Architect tier ($349, 1,000 records, all 7 formats). It's where most customers land. For $50 more than Professional you get 250 more records (1,000 versus 750) and every format: CSV, JSON, SQLite, Parquet, FHIR R4, HL7 v2, C-CDA. If you're just evaluating, download the free 5-record sample first. No credit card is required, and it is delivered in under a minute.

Q: Can I see a sample record before I pay?

Yes — download 5 full records free spanning a mix of disciplines. You'll see the complete structured header (36+ demographic fields), full clinical narrative (HPI, ROS, physical exam, assessment, plan), structured labs with reference ranges, medications with dosages and indications, ICD-10 coded problem list, and SDOH screening. Every field in the paid tiers is in the free sample.

Q: How is PatientDatasets.com different from Synthea or MIMIC?

Synthea generates statistical demographics and skeletal encounters — useful scaffolding, but clinically flat. PatientDatasets.com delivers complete narrative records: HPI, ROS, physical exam, assessment, plan, and operative reports written the way a clinician actually documents a chart. MIMIC-IV is real de-identified ICU data but requires CITI training, a Data Use Agreement, and research-only licensing. PatientDatasets.com is zero-gate, commercial-use-cleared, and available on-demand. Download today, ship product this week.

Q: Can I use this data to train or benchmark a machine learning model?

Yes — that is the primary use case. The data is 100% synthetic so there are no HIPAA restrictions, no data use agreements, and no IRB approval required. You can train, test, benchmark, and publish results freely. Parquet and CSV are optimized for pandas, Spark, scikit-learn, and BigQuery pipelines. Every record includes 36+ demographic fields, structured labs, vitals, medications, and full clinical notes.

Q: Is IRB approval required? Can I publish results based on this data?

No IRB required — there is no real patient data involved, so human subjects research rules do not apply. You can publish results, share the dataset with co-authors, and cite it in a Methods section without restriction. Each release is versioned with a DOI-style citation that satisfies most journal data transparency requirements.

Q: Are the FHIR R4 bundles valid for interface and system testing?

Yes. Every FHIR export is a conformant R4 Bundle containing Patient, Encounter, Condition, Observation (vitals and labs), MedicationRequest, and AllergyIntolerance resources. Resources use standard LOINC and SNOMED-CT codes with proper reference chaining. They are suitable for testing FHIR servers, validation engines, and SMART on FHIR apps without needing a live EHR connection.

Q: Can I practice ICD-10-CM/PCS and E/M coding from these records?

Yes — that is exactly what they are built for. Each record contains a complete clinical narrative: chief complaint, HPI (OLDCARTS), review of systems, physical exam, assessment, and plan. The documentation supports E/M level assignment, ICD-10-CM principal and secondary diagnosis selection, and procedure coding practice. Records do not have codes pre-filled — you derive them from the note, the same way you would on a real chart.

Q: Is this real patient data? Can I share it with my staff without legal concerns?

No real patients, no real PHI — 100% AI-generated. HIPAA does not apply. You can share the records openly with your billing staff, front desk team, or students without any compliance concern. There is no BAA required, no de-identification process, and no breach liability. Use it freely for internal training, workflow testing, or onboarding.

Q: Do you offer academic or nonprofit discounts?

Yes. 30% off Innovator, Professional, and Architect tiers for verified .edu domains — use coupon code ACADEMIC30 at checkout. 25% off the same tiers for verified 501(c)(3) nonprofits — use coupon code NONPROFIT25. Premium and Enterprise are margin-constrained; for volume pricing, email support@patientdatasets.com.

Q: What is the refund policy?

14-day no-questions refund on Innovator, Professional, and Architect tiers. Premium and Enterprise are custom-processed and non-refundable once delivered — but a free sample is available so you can evaluate quality before committing at that level.

Quality-audited, clinically coherent synthetic records.

No HIPAA No IRB Link delivered by email 100% Synthetic

🤖

ML & AI Training

pandas · scikit-learn · PyTorch

🏥

EHR / EMR Testing

Validate before go-live

📋

Medical Coding

ICD-10 · CPT · HCPCS

💼

Claims & RCM

837P/I · EOB · denial workflows

📊

Medicare & HEDIS

HCC · quality audits · compliance

🩺

Ancillary & Behavioral

Chiro · Psych · PT · OT · SLP

DATA SCIENCE

patient-analysis.py

CODING

Procedure Codes / ICD-10 / HCPCS

PATIENT RECORD — Gardner_Leslie_000001.txt

HEDIS MEASURES

Healthcare Effectiveness Data & Information Set

hedis_roster_2025.csv

MemberID,Last Name,First Name,DOB,Measure,Num,Gap
M-10001,Adams,Rebecca,3/14/1961,Breast Cancer Screen,Y,CLOSED
M-10002,Alvarez,Roberto,8/22/1971,Diabetes HbA1c <8,N,OPEN
M-10003,Anderson,Tanya,11/5/1958,Controlling BP,Y,CLOSED
M-10004,Barnes,Gregory,6/17/1965,Colorectal Screen,N,OPEN
M-10005,Bell,Christine,1/29/1973,Diabetes Eye Exam,Y,CLOSED
M-10006,Brown,Curtis,5/6/1960,Statin Therapy,Y,CLOSED
M-10007,Campbell,Denise,9/21/1954,Breast Cancer Screen,N,OPEN
M-10008,Carter,William,12/3/1968,Diabetes HbA1c <8,Y,CLOSED
M-10009,Chen,Wei,11/30/1965,Controlling BP,Y,CLOSED
M-10010,Clark,Beverly,1/25/1953,Colorectal Screen,N,OPEN
M-10011,Coleman,Andre,4/8/1977,Diabetes Eye Exam,N,OPEN
M-10012,Cruz,Maria,7/19/1962,Statin Therapy,Y,CLOSED
M-10013,Davis,Angela,8/1/1957,Breast Cancer Screen,Y,CLOSED
M-10014,Dixon,Terrence,3/15/1970,Diabetes HbA1c <8,N,OPEN
M-10015,Edwards,Patricia,10/27/1959,Controlling BP,N,OPEN
M-10016,Ellis,Raymond,2/11/1975,Colorectal Screen,Y,CLOSED
M-10017,Flores,Carmen,6/23/1963,Diabetes Eye Exam,Y,CLOSED
M-10018,Foster,Keith,12/9/1956,Statin Therapy,N,OPEN
M-10019,Garcia,Elena,9/28/1962,Breast Cancer Screen,Y,CLOSED
M-10020,Green,Darlene,4/14/1969,Diabetes HbA1c <8,Y,CLOSED
M-10021,Hall,Sandra,4/3/1966,Controlling BP,Y,CLOSED
M-10022,Harris,Jerome,8/30/1972,Colorectal Screen,N,OPEN
M-10023,Hernandez,Carlos,11/8/1970,Diabetes Eye Exam,Y,CLOSED
M-10024,Hill,Brenda,5/16/1955,Statin Therapy,Y,CLOSED
M-10025,Howard,Marcus,1/22/1978,Breast Cancer Screen,N,OPEN
M-10026,Jackson,Terrell,5/20/1961,Diabetes HbA1c <8,N,OPEN
M-10027,James,Loretta,9/7/1964,Controlling BP,Y,CLOSED
M-10028,Johnson,Keisha,4/25/1969,Colorectal Screen,Y,CLOSED
M-10029,Jones,Phillip,12/18/1957,Diabetes Eye Exam,N,OPEN
M-10030,Kim,Jisoo,2/11/1978,Statin Therapy,Y,CLOSED
M-10031,King,Valerie,7/4/1963,Breast Cancer Screen,Y,CLOSED
M-10032,Lee,David,12/19/1983,Diabetes HbA1c <8,Y,CLOSED
M-10033,Lewis,Sharon,3/26/1956,Controlling BP,N,OPEN
M-10034,Lopez,Miguel,10/13/1974,Colorectal Screen,Y,CLOSED
M-10035,Martin,Cynthia,8/9/1960,Diabetes Eye Exam,Y,CLOSED
M-10036,Martinez,Diego,3/22/1975,Statin Therapy,N,OPEN
M-10037,Miller,Jacqueline,1/31/1967,Breast Cancer Screen,N,OPEN
M-10038,Mitchell,Darryl,6/15/1971,Diabetes HbA1c <8,N,OPEN
M-10039,Moore,Diane,4/20/1958,Controlling BP,Y,CLOSED
M-10040,Nguyen,Linh,12/3/1973,Colorectal Screen,N,OPEN
M-10041,O'Brien,Sean,7/19/1955,Diabetes Eye Exam,Y,CLOSED
M-10042,Patel,Arun,1/17/1980,Statin Therapy,Y,CLOSED
M-10043,Perez,Gloria,11/28/1964,Breast Cancer Screen,Y,CLOSED
M-10044,Powell,Franklin,2/6/1959,Diabetes HbA1c <8,Y,CLOSED
M-10045,Ramirez,Sofia,2/28/1972,Controlling BP,N,OPEN
M-10046,Robinson,Tamara,6/30/1964,Colorectal Screen,Y,CLOSED
M-10047,Singh,Priya,10/14/1968,Diabetes Eye Exam,N,OPEN
M-10048,Thompson,Maria,3/14/1958,Statin Therapy,Y,CLOSED
M-10049,Walker,Jerome,7/16/1977,Breast Cancer Screen,Y,CLOSED
M-10050,Williams,Daphne,6/9/1952,Diabetes HbA1c <8,N,OPEN
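As a rough illustration of how a roster like this could be summarized, the sketch below computes a per-measure compliance rate and lists members with open gaps, using only Python's standard library. It assumes the seven-column layout of the sample roster, with `Num` as the Y/N numerator flag, and inlines a handful of its rows. The function name `compliance_by_measure` is hypothetical, not part of any shipped tooling.

```python
import csv
import io
from collections import defaultdict

# A handful of rows from the sample roster, inlined so the sketch is self-contained.
ROSTER_CSV = """MemberID,Last Name,First Name,DOB,Measure,Num,Gap
M-10001,Adams,Rebecca,3/14/1961,Breast Cancer Screen,Y,CLOSED
M-10002,Alvarez,Roberto,8/22/1971,Diabetes HbA1c <8,N,OPEN
M-10003,Anderson,Tanya,11/5/1958,Controlling BP,Y,CLOSED
M-10004,Barnes,Gregory,6/17/1965,Colorectal Screen,N,OPEN
M-10005,Bell,Christine,1/29/1973,Diabetes Eye Exam,Y,CLOSED
M-10007,Campbell,Denise,9/21/1954,Breast Cancer Screen,N,OPEN
"""


def compliance_by_measure(csv_text):
    """Return {measure: share of members whose numerator flag is Y}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Measure"]] += 1
        if row["Num"] == "Y":
            hits[row["Measure"]] += 1
    return {measure: hits[measure] / totals[measure] for measure in totals}


rates = compliance_by_measure(ROSTER_CSV)

# Members whose care gap is still open, in roster order.
open_gaps = [
    row["MemberID"]
    for row in csv.DictReader(io.StringIO(ROSTER_CSV))
    if row["Gap"] == "OPEN"
]
```

To run this against a full export, replace the inline string with the file's contents; the column names would need to match whatever the actual file uses.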

audit_summary.log

hedis_analytics_dashboard.py: HEDIS Quality Measures Analysis (Population Health Analytics Dashboard)

Compliance Rate: Compliant 67% · Partial 21% · Gap 12%

Measure Performance: 487 patients analyzed · 89.2% data completeness

Portfolio Ready

Synthetic Patient Data for ML, Research & Education

Download Free Sample → Browse Datasets

ML-ready: CSV, JSON, Parquet, FHIR R4

100% Synthetic — No HIPAA. No IRB.

Buy → Download in under 2 minutes

18,400+

Records Generated

7

Export Formats

Zero

HIPAA / IRB Risk

36+

Demographic Fields

Who Uses PatientDatasets.com

Real healthcare data is locked behind HIPAA. Ours isn't.

🤖

Data Scientists & ML Engineers

Train readmission models. Build NLP pipelines on clinical notes. Benchmark classification algorithms. Kaggle-style portfolio projects. All without touching a single real patient record.

Python R pandas scikit-learn SQL Jupyter

📚

Healthcare Students & Educators

Practice claims processing, clinical coding (ICD-10, CPT), and EDI workflows. Real-world billing scenarios for CPC, CCS, and COC certification prep.

CPC CCS RHIT / RHIA Medical Billing RCM Nursing

🔬

Researchers, Admins & Vendors

IRB-free pilot studies. EHR system testing. Healthcare AI startup demos. Public health cohort analysis. Revenue cycle training without using live patient accounts.

Academic Research EHR Testing AI Startups Public Health

What's Inside Each Record

36+ demographic fields, 20+ clinical sections per record — quality-audited

👤

36+ Demographics

Name, DOB, race, gender identity, SDOH, insurance, MRN — matched to real-world distributions.

🏥

Diagnoses (ICD-10-CM)

Primary and secondary diagnoses with active/resolved problem lists and comorbidity patterns.

📋

Full Clinical Notes

CC, HPI, ROS, Physical Exam, Assessment & Plan — 32 structured sections, NLP-ready.

💊

Medications

Name, dose, route, frequency — coherent with diagnoses. Allergies included. FDA-grounded.

🧪

Labs & Imaging

Structured tables: Test, Value, Units, Ref Range, Flag. Radiology findings with impressions.

🔪

Operative Reports

Surgical technique, anesthesia, EBL, implants, post-op plan. Discharge planning included.

🩺

Therapy & Ancillary SOAP

PT, OT, SLP, Chiro, and Psych SOAP notes with outcome measures and session documentation.

🧠

Psychiatric & Behavioral

MSE, risk assessment, DSM-5-TR diagnoses, safety plans, substance use history.

📦

Every dataset ships in multiple formats

Load in Python, R, SQL, or any EHR system — no conversion needed

CSV JSON Parquet FHIR R4 HL7 v2.x C-CDA SQLite

Cardiology · Pulmonology · Nephrology · Endocrinology · Gastroenterology · Hepatology · Hematology · Medical Oncology · Neurology · Rheumatology · Dermatology · Infectious Disease · Internal Medicine · Family Medicine · Geriatrics · Emergency Medicine · Radiology · Ophthalmology

Orthopedic Surgery · General Surgery · Neurosurgery · Vascular Surgery · Colorectal Surgery · Urology · Ob-Gyn · Pediatric Surgery · Pediatric Oncology · Pediatrics · Neonatology · Sports Medicine · Pain Management · Critical Care · Palliative Care · Sleep Medicine · Wound Care

⚡ Chiropractic · ⚡ Physical Therapy · ⚡ Occupational Therapy · ⚡ Speech-Language Pathology · ⚡ Psychology · ⚡ Psychiatry · ⚡ Behavioral Health · ⚡ Addiction Medicine · ⚡ Telehealth · ⚡ Home Health · ⚡ Hospice · ⚡ Integrative Medicine · ⚡ Rehabilitation · ⚡ Mental Health

60+ Medical Specialties.

Clinically Coherent Data.


Pearls, not platitudes.

Clinical Depth Generic Synthetic Data Can't Match.

Every high-acuity record is built to read like documentation from a seasoned clinician — the pitfalls, the decision points, the specialty-specific subtlety that distinguishes training data that looks medical from training data that teaches medicine.

💎

Clinical Pearls Where They Matter

HIGH-Acuity Routing Pitfalls Decision Points Specialty-Specific

ICU admissions, surgical emergencies, complex oncology, decompensating pediatrics — our pipeline injects targeted clinical wisdom into every high-acuity record before generation begins. The model writes like a senior clinician, not a med student. Routine visits stay routine. Hard cases carry the nuance.

🔗

Grounded in Authoritative Sources

SNOMED CT RxNorm FDA Labels ICD-10-CM LCD-Compliant

Every record is anchored in peer-reviewed medical literature, standardized terminology, FDA-recognized drug information, and evidence-based clinical pathways. Notes carry the coding specificity and billing defensibility that real downstream systems require — and would survive a payer audit.

🎯

Multi-Stage Clinical Audit

Structural Review Clinical Scoring Terminology QA ~50% Rejected

Every record moves through three audit stages before it ever reaches your download. Records are graded on more than a dozen clinical and documentation dimensions, then cleared by a 1.9-million-term medical terminology pass. Roughly half of what the pipeline generates is rejected. Only the records that clear every gate ship.

📈

Every Release More Rigorous

Traceable Self-Learning Disclosed Improving

Every validated record teaches our system what good documentation looks like; every rejected draft teaches it what not to do. Records with clinically unusual content (for example, gender-affirming care) are flagged for appropriate customer handling. Every release is more rigorous than the last.

Practice Billing Before You Touch a Real Claim.

60+ specialties. Real note structure. Built for coding practice, billing workflows, and EDI testing.

🦴

Chiropractic

98940–98942 M99.xx PART Criteria LCD L33906

Full SOAP documentation with spinal segment notation, orthopedic test findings, subluxation coding, and region-count-to-CPT matching. Every record is structured to satisfy a Medicare LCD audit.

🏥

PT, OT & Speech Therapy

97110 97165–97168 92507 8-Min Rule IDDSI FIM Score POC Cert

Timed CPT codes with 8-minute rule compliance across all three disciplines. PT records include MMT and ROM. OT includes ADL/IADL and FIM scoring. SLP includes dysphagia assessments, IDDSI diet levels, and swallowing eval documentation.
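The 8-minute rule can be sketched in a few lines. The example assumes the common CMS convention that total timed minutes convert to 15-minute units once at least 8 minutes are documented (8 to 22 minutes is 1 unit, 23 to 37 minutes is 2 units, and so on). The function name `timed_units` is hypothetical, and payer-specific variants of the rule are not modeled.

```python
def timed_units(total_minutes: int) -> int:
    """Billable 15-minute units for a total of timed treatment minutes."""
    if total_minutes < 0:
        raise ValueError("minutes cannot be negative")
    # Each full 15 minutes is one unit; a remainder of 8 or more minutes
    # earns one additional unit, which is what adding 7 before the
    # integer division accomplishes.
    return (total_minutes + 7) // 15


# Example: 40 minutes of timed therapy across all timed codes in one visit.
units_for_40_minutes = timed_units(40)
```

A coding exercise could compare the units a student assigns against this calculation for the minutes documented in the note.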

🧠

Psychology & Psychiatry

90837 90792 DSM-5 MSE PHQ-9 Risk Assessment

Session-time-based CPT codes with full mental status exams, DSM-5 diagnoses, validated screening tools, and documented medical necessity. Psychiatric records include medication management and E/M complexity documentation.

📋

EDI & Claims

837P / 837I 835 EOB 270 / 271 Modifiers Denial Codes

Every record contains the subscriber, encounter, and payer data needed to build a complete claim transaction. Practice 837P and 837I construction, remittance reconciliation, eligibility verification, and denial resolution — before any of it touches a live payer.

Stack On Exactly What You Need

Every paid tier ships with CSV. JSON, Parquet, SQLite, and the interop formats unlock from Innovator up. Add formats à la carte at checkout.

{ }

JSON

+$29 Nested Records REST-ready

Fully nested record structure. Every encounter, diagnosis, and medication linked relationally. Drops into any REST pipeline or notebook.

🗄️

SQLite

+$39 Relational DB SQL-ready

Pre-built relational DB. SQL queries, joins, and dashboard connectors — zero setup.
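For a sense of what querying a relational export looks like, here is a minimal sketch using Python's built-in `sqlite3` module. The actual table and column names in the shipped database are not documented on this page, so the `patients` and `diagnoses` schema below is purely hypothetical, and an in-memory database stands in for the delivered file.

```python
import sqlite3

# In-memory stand-in for the delivered SQLite file; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, last_name TEXT);
    CREATE TABLE diagnoses (patient_id TEXT, icd10 TEXT, status TEXT);
    INSERT INTO patients VALUES ('P1', 'Adams'), ('P2', 'Chen');
    INSERT INTO diagnoses VALUES
        ('P1', 'E11.9', 'active'),
        ('P1', 'I10', 'active'),
        ('P2', 'I10', 'resolved');
    """
)

# Count active diagnoses per patient with a join across the two tables.
rows = conn.execute(
    """
    SELECT p.last_name, COUNT(*) AS active_dx
    FROM patients AS p
    JOIN diagnoses AS d ON d.patient_id = p.patient_id
    WHERE d.status = 'active'
    GROUP BY p.patient_id, p.last_name
    ORDER BY p.last_name
    """
).fetchall()
conn.close()
```

With a real export you would pass the file path to `sqlite3.connect` and adjust the table and column names to match its schema.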

📐

Parquet

+$35 Columnar Spark / BigQuery

Columnar format for Spark, Databricks, and BigQuery. Fastest for ML pipelines.

🏥

FHIR R4

+$59 SMART on FHIR EHR Testing

Industry-standard healthcare exchange format. Test SMART on FHIR apps and EHR integrations.

📡

HL7 v2.x

+$49 ADT / ORM / ORU Legacy EHR

ADT, ORM, ORU messages. Still the dominant format in lab and hospital interfaces.

📄

C-CDA

+$89 Care Transitions CDA R2.1

Structured clinical documents for care transitions, referrals, and compliance reviews.

🎁

All 3 Interoperability Formats

Save $58

FHIR R4 + HL7 v2.x + C-CDA — buy the bundle, skip the add-up.

FHIR R4 + HL7 v2.x + C-CDA: $139 (vs. $197 à la carte)

Why PatientDatasets.com

Real clinical structure. Every healthcare workflow. Zero HIPAA.

🧬

Statistically Realistic

Age, race, disease prevalence, and comorbidity patterns calibrated to real-world population statistics. Not toy data.

🔗

Clinical Coherence Built-In

Diagnoses, labs, and medications align throughout every record. A diabetic patient has Metformin and an elevated HbA1c.

🔒

Zero HIPAA. Zero IRB.

100% synthetic — no real patients. No data use agreement, no IRB approval, no compliance overhead. Buy and use today.

🔄

Your Order, Locked In

Each purchase randomly draws from our pre-generated pool. Your exact dataset is tied to your order — share it freely with colleagues. 100% synthetic means no redistribution restrictions.

📊

Multi-Format, Multi-Tool

CSV, JSON, Parquet, SQLite, FHIR R4, HL7, C-CDA — one dataset, every tool you already use.

⚡

Delivered After Checkout

Payment triggers automated format conversion for your tier. You'll receive a download link by email — typically within a few minutes. No account required. One-time purchase.

Engineered by a Clinician. Validated by an Auditor.

These records aren't procedurally assembled — they're clinically reasoned, pathway-guided, and gated through an automated quality audit before release.

🩺

Built by a 23-Year RN

The data generator was designed by a retired registered nurse with 23 years of bedside and health-plan clinical experience — including delegation oversight, utilization review, and clinical quality auditing. The records reflect how real patients actually present, progress, and are documented.

🗺️

Clinically Guided Pathways

Every patient follows a condition-specific clinical pathway. Diagnoses drive medication selection. Lab values are consistent with the primary condition. Comorbidities co-occur at realistic rates. The result: records that hold up under clinical scrutiny — not just statistical inspection.

✅

Automated Quality Gate

Before any record ships, an automated auditor scores it across nine clinical dimensions — structural completeness, HPI depth, medication-diagnosis consistency, coding readiness, and more. Records below the release threshold are held back. Only quality-passing records reach your download.

🔒

100% Synthetic — No Real Patients. No Real Records. No PHI.

Every name, date of birth, address, diagnosis, medication, and clinical note in our datasets is entirely computer-generated. No real patient data was used as source material at any stage of production. There is no Protected Health Information (PHI) to protect because there is none. HIPAA does not apply. IRB approval is not required. You can use these records freely for research, training, and education.

How It Works

Three steps from order to data in your inbox

Choose

Pick a tier (50–2,500 records). Add format packs. Enterprise 10,000+ is a custom quote.

Pay

Secure checkout via Stripe — or submit your email for 5 free records, no card required.

Receive

Your selected formats are built automatically. A download link arrives by email — minutes for smaller orders.

Simple, Transparent Pricing

One-time purchase. Link delivered by email. Zero HIPAA complications.

🎁 Free

5 records

No credit card required

✓ 5 encounter types represented
✓ CSV — open instantly in Excel or pandas
✓ Full clinical notes included

🌱 Sampler

$49

50 records

$0.98 per record

✓ Mixed specialties
✓ CSV format
✓ Proof-of-concept ready

Buy Now

🚀 Starter

$149

250 records

$0.60 per record

✓ 9 encounter types
✓ CSV format
✓ CPC / CCS coding practice

Buy Now

💡 Innovator

$229

500 records

$0.46 per record

✓ Broad specialty mix
✓ CSV + JSON + Parquet
✓ Spark, DuckDB & BigQuery ready

Buy Now

💼 Professional

$299

750 records

$0.40 per record

✓ Deep specialty coverage
✓ CSV + JSON + SQLite
✓ Pre-built relational database

Buy Now

⭐ Architect

$349

1,000 records

$0.35 per record

✓ All 9 encounter types
✓ All 7 formats — CSV, JSON, SQLite, Parquet, FHIR R4, HL7, C-CDA
✓ EHR integration & interoperability testing

Get Architect

💎 Premium

$599

2,500 records

$0.24 per record

✓ 60+ specialties represented
✓ All 7 formats included
✓ Scale for ML training & benchmarks

Buy Now

🏢 Enterprise

Custom

10,000+ records

Filtered by specialty or acuity

✓ All 7 formats, volume pricing
✓ Specialty & demographic filtering
✓ White-label licensing & SLA

Get Quote

Frequently Asked Questions

Start Here

I'm new — which tier should I buy?

Short answer: start with Architect ($349, 1,000 records, all 7 formats). It's where most customers land, and for $50 more than Professional you get 250 more records (1,000 versus 750) and every format you'll ever need: CSV, JSON, SQLite, Parquet, FHIR R4, HL7 v2, C-CDA. If you're just evaluating, download the free sample first: 5 records, no credit card, delivered in under a minute. See All Tiers →

Can I see a sample record before I pay?

Yes — download 5 full records free spanning a mix of disciplines so you see the breadth of the catalog. You'll see the complete structured header (36+ demographic fields), full narrative (HPI, ROS, physical exam, assessment, plan), structured labs with reference ranges, medications with dosages and indications, ICD-10 coded problem list, and SDOH screening. Every field that's in our paid tiers is in the sample. We want you to know exactly what you're buying.

How is this different from Synthea or MIMIC?

Synthea generates statistical demographics and skeletal encounters: useful scaffolding, but clinically flat. We deliver complete narrative records (HPI, ROS, physical exam, assessment, plan, and operative reports when applicable) written the way a clinician actually documents a chart. MIMIC-IV is real de-identified ICU data but gated behind CITI training, a DUA, and research-only licensing. We are zero-gate, commercial-use-cleared, and on-demand. Synthea gives you the scaffolding. We give you the note the physician actually wrote. Download today. Ship product this week.

Data Scientists & Researchers

Can I use this data to train or benchmark a machine learning model?

Yes — that's the primary use case. The data is 100% synthetic, so there are no HIPAA restrictions, no data use agreements, and no IRB approval required. You can train, test, benchmark, and publish results freely. Parquet and CSV are optimized for pandas, Spark, scikit-learn, and BigQuery pipelines. Every record includes 36+ demographic fields, structured labs, vitals, medications, and full clinical notes. Most ML teams start with Architect ($349, 1,000 records) — enough signal for a real model, small enough to iterate on a laptop. Compare tiers →

Is IRB approval required? Can I publish results based on this data?

No IRB required — there is no real patient data involved, so human subjects research rules do not apply. You can publish results, share the dataset with co-authors, and cite it in a Methods section without restriction. Each release is versioned with a DOI-style citation, which satisfies most journal data transparency requirements.

How clinically realistic is the data?

Very. A patient with Type 2 diabetes carries Metformin, elevated HbA1c, and appropriate comorbidities. Comorbidity co-occurrence, lab distributions, and demographic breakdowns are calibrated against epidemiological baselines — not randomly assembled. Every record is reviewed against clinical documentation standards before it ships. Download the free sample and judge for yourself.

I'm a student. Every free dataset I can find doesn't match the question I actually want to research. How does synthetic help?

You're describing the silent tax every data science student pays: start with a research question → spend weeks hunting free datasets → find nothing that fits → reverse-engineer your hypothesis to match whatever Kaggle had → submit a thesis investigating a question you didn't actually care about. That's not science. That's the data dictating the theory. It's also why so many undergraduate and master's theses end up being "whatever the open dataset allowed" instead of "what the student actually wanted to prove."

Synthetic data inverts the problem. You design the cohort your hypothesis needs — age range, comorbidity pattern, sample size that gives you statistical power, specialty mix, outcome distribution — and we generate it. Testing whether Type 2 diabetes comorbidity predicts 30-day readmission in a rural Medicare population? Build that cohort. Need 500 ancillary SOAP notes to train a CPT auto-coder? Here they are. Want pediatric oncology patients with specific TNM staging and chemotherapy regimens for a survival-analysis project? Done. Research the question you wanted to research — not the one Kaggle happened to have.

Student pricing: 30% off Sampler, Starter, and Innovator tiers with your .edu email — use code ACADEMIC30 at checkout. Innovator tier at the student rate works out to around $160 for 500 records of your exact research cohort. Less than a stats textbook. More useful than any free dataset you were going to find anyway.

EHR Implementers & Health Informatics

Are the FHIR R4 bundles valid for interface and system testing?

Yes. Every FHIR export is a conformant R4 Bundle containing Patient, Encounter, Condition, Observation (vitals + labs), MedicationRequest, and AllergyIntolerance resources. Resources use standard LOINC and SNOMED-CT codes with proper reference chaining. They're suitable for testing FHIR servers, validation engines, and SMART on FHIR apps without needing a live EHR connection.
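One way to sanity-check reference chaining before loading a bundle into a test server is to confirm that every `reference` resolves to an entry in the same bundle. The sketch below does that with the standard library. The bundle is a trimmed, hand-written illustration rather than an actual export, and it assumes references point at `urn:uuid:` `fullUrl` values; a bundle that uses relative references such as `Patient/123` would need a different lookup. The helper names are hypothetical.

```python
import json

# Hand-written, trimmed bundle for illustration only (not an actual export).
BUNDLE_JSON = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"fullUrl": "urn:uuid:pat-1",
     "resource": {"resourceType": "Patient", "id": "pat-1"}},
    {"fullUrl": "urn:uuid:enc-1",
     "resource": {"resourceType": "Encounter", "id": "enc-1",
                  "status": "finished",
                  "subject": {"reference": "urn:uuid:pat-1"}}},
    {"fullUrl": "urn:uuid:cond-1",
     "resource": {"resourceType": "Condition", "id": "cond-1",
                  "subject": {"reference": "urn:uuid:pat-1"},
                  "encounter": {"reference": "urn:uuid:enc-1"}}}
  ]
}
"""


def find_references(node):
    """Yield every 'reference' string nested anywhere inside a resource."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "reference" and isinstance(value, str):
                yield value
            else:
                yield from find_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_references(item)


def unresolved_references(bundle):
    """Return sorted references that match no entry fullUrl in the bundle."""
    known = {entry["fullUrl"] for entry in bundle["entry"]}
    found = set()
    for entry in bundle["entry"]:
        found.update(find_references(entry["resource"]))
    return sorted(found - known)


bundle = json.loads(BUNDLE_JSON)
resource_types = [e["resource"]["resourceType"] for e in bundle["entry"]]
missing = unresolved_references(bundle)
```

An empty `missing` list means every reference in the bundle resolves internally. This is a structural check only and does not replace a full FHIR validator.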

Does the HL7 v2.x output work with interface engines like Mirth or Rhapsody?

Yes. HL7 exports are v2.4 ADT^A01 messages with full segment coverage: MSH, EVN, PID, PV1, IN1, AL1, DG1, and OBX (vitals + labs with LOINC codes, reference ranges, and abnormal flags). Special characters are properly escaped per the HL7 encoding rules. Drop them into any interface engine and they parse cleanly.
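To show how the segment structure reads in practice, the sketch below splits a message into segments and pulls out a few fields with plain string operations. The sample message is hand-written for illustration (names, identifiers, and values are made up, and it carries fewer segments than the export described above). It assumes the standard carriage-return segment terminator and `|` field separator, and does not handle escape sequences or repetitions.

```python
# Hand-written ADT^A01 sample for illustration; not an actual export.
SAMPLE_ADT = "\r".join([
    "MSH|^~\\&|SENDER|FACILITY|RECEIVER|DEST|20250101120000||ADT^A01|MSG00001|P|2.4",
    "EVN|A01|20250101120000",
    "PID|1||000001^^^HOSP^MR||Gardner^Leslie||19610314|F",
    "PV1|1|I",
    "OBX|1|NM|4548-4^Hemoglobin A1c^LN||8.2|%|4.0-5.6|H|||F",
])


def parse_segments(message):
    """Split an HL7 v2 message into (segment_id, fields) pairs."""
    segments = []
    for line in message.split("\r"):
        if line:
            fields = line.split("|")
            segments.append((fields[0], fields))
    return segments


segments = parse_segments(SAMPLE_ADT)
segment_ids = [seg_id for seg_id, _ in segments]

# In MSH the field separator itself counts as MSH-1, so MSH-9 (message
# type) sits at list index 8 after splitting on "|".
msh_fields = segments[0][1]
message_type = msh_fields[8]

# For every other segment, list index N corresponds to field N.
obx_fields = next(fields for seg_id, fields in segments if seg_id == "OBX")
loinc_code = obx_fields[3].split("^")[0]  # OBX-3, observation identifier
abnormal_flag = obx_fields[8]             # OBX-8, abnormal flag
```

An interface engine or a dedicated HL7 library would handle escaping and repeating fields properly; this is only meant to make the segment layout concrete.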

Can I use these records to test EDI 837 claim construction or 835 remittance workflows?

The clinical documentation is structured to support EDI testing — each record contains the subscriber/member data, encounter detail, and payer fields needed to populate an 837P or 837I transaction. The records do not ship as pre-built EDI files, but every data element required for claim construction is present and mapped in the CSV and JSON exports.

Coding Students, RHIA & Revenue Cycle

Can I practice ICD-10-CM/PCS and E/M coding from these records?

Yes — that's exactly what they're built for. Each record contains a complete clinical narrative: chief complaint, HPI (OLDCARTS), review of systems, physical exam, assessment, and plan. The documentation supports E/M level assignment, ICD-10-CM principal and secondary diagnosis selection, and procedure coding practice. Records do not have codes pre-filled — you derive them from the note, the same way you would on a real chart.

Are the records appropriate for CCS, CPC, RHIA, or CRCR exam prep?

Yes. The notes are structured to match the documentation standards those certifications test against — including E/M complexity, operative report detail, discharge summaries, and problem list management. Records span 9 encounter types and 60+ specialties, giving you exposure to the full breadth of case types that appear in certification exams and real-world HIM practice.

Can I get a custom cohort — specific conditions, DRGs, or payer mix?

Yes. Enterprise orders support custom ICD-10 code distributions, targeted condition cohorts, specific payer mixes, and volume above 10,000 records. Email support@patientdatasets.com with your specifications and we'll quote it.

Independent Clinicians & Ancillary Providers

I'm a PT, OT, chiropractor, or psychologist. How do these records help me?

Our ancillary and behavioral health records are built to reflect real clinical documentation in your discipline — not generic medical notes. Chiropractic records include spinal segment notation, PART criteria, and orthopedic test findings. PT records include MMT grades, ROM measurements, validated outcome tools, and plan-of-care language. Psychology records include MSE components, risk assessments, and session documentation. Use them to train new billing staff, audit your own documentation practices, or build internal QA workflows — before those habits touch a real Medicare claim.

Do the records reflect Medicare LCD and payer documentation requirements?

Yes. Chiropractic records reflect LCD L33906 documentation expectations (PART criteria). Therapy records include the clinical elements needed to support medical necessity under CMS guidelines — including skilled care justification, functional baselines, and progress indicators. These records are designed to show your staff what a defensible note looks like, and what a vulnerable one looks like.

Is this real patient data? Can I share it with my staff without legal concerns?

No real patients, no real PHI — 100% AI-generated. HIPAA does not apply. You can share the records openly with your billing staff, front desk team, or students without any compliance concern. There is no BAA required, no de-identification process, and no breach liability. Use it freely for internal training, workflow testing, or onboarding. See Professional & Architect tiers →

Psychiatry & Psychology

Are psychiatric and psychology records session-time accurate for CPT billing?

Yes. CPT codes are matched to documented session durations — 90832 (16–37 min), 90834 (38–52 min), 90837 (53+ min), 90791/90792 for intake. Every record includes complete Mental Status Exam, risk assessment (SI/HI/self-harm), DSM-5 diagnostic justification, and medical necessity documentation. Built to train staff on documentation that won't trigger a denial.
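The duration thresholds above translate directly into a lookup. The sketch below maps documented psychotherapy minutes to the three time-based codes using exactly the ranges listed. The function name `psychotherapy_cpt` is hypothetical, and intake codes (90791/90792) are left out because they are not selected by session length.

```python
from typing import Optional


def psychotherapy_cpt(minutes: int) -> Optional[str]:
    """Map documented psychotherapy minutes to a CPT code.

    Returns None when the session is shorter than 16 minutes, the
    minimum for the shortest time-based code.
    """
    if minutes >= 53:
        return "90837"
    if minutes >= 38:
        return "90834"
    if minutes >= 16:
        return "90832"
    return None


# Example: a 45-minute individual therapy session.
code_for_45_minutes = psychotherapy_cpt(45)
```

A billing trainee could run the documented session time from each note through this check and compare the result with the code they selected.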

Do the records cover the full range of visit types — intake, therapy, crisis, group, testing?

Yes. Psychology records span six visit subtypes: intake evaluation, individual therapy, family/couples, group, crisis assessment, and neuropsychological testing. Each carries the appropriate CPT code and the documentation elements the payer will actually audit for. Psychiatric records include MSE, risk stratification, substance use history, medication management, and E/M complexity for med checks.

Can I use these for billing department training?

Yes. Behavioral health is one of the most commonly denied specialties. Train your billing team on records with proper medical necessity framing, correct add-on E/M codes, and risk stratification language before they submit their first real claim. Architect or Premium tier recommended for a full training catalog. See pricing →

Healthcare Administrators & Compliance

Can I use this for HEDIS, HCC, or quality measure testing?

Yes. Records carry complete problem lists (active + resolved), preventive care gaps, transitions-of-care documentation, and the Z-code capture (Z55–Z65) that HEDIS social-determinants measures require. Chronic condition capture uses 4th-digit ICD-10 specificity, comorbidity co-occurrence reflects epidemiological reality, and every record carries the supporting clinical documentation a RADV audit would demand. Premium ($599, 2,500 records) is the go-to tier for HCC model development and HEDIS abstraction practice. See pricing →
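To illustrate the Z-code capture mentioned above, here is a minimal sketch that picks codes in the Z55–Z65 range out of a problem list. The helper names and the sample problem list are made up for illustration; the actual column layout of the records is not assumed here.

```python
import re

# Category letter Z plus two digits, with an optional extension
# (written with or without the dot). Hypothetical pattern for illustration.
_Z_CODE = re.compile(r"^Z(\d{2})(?:\.?[0-9A-Z]{1,4})?$")


def is_sdoh_z_code(code: str) -> bool:
    """Return True if the ICD-10-CM code falls in categories Z55 through Z65."""
    match = _Z_CODE.match(code.strip().upper())
    return bool(match) and 55 <= int(match.group(1)) <= 65


def sdoh_codes(problem_list):
    """Filter a list of ICD-10-CM codes down to the Z55-Z65 entries."""
    return [code for code in problem_list if is_sdoh_z_code(code)]


if __name__ == "__main__":
    # Made-up problem list, not taken from a real or purchased record.
    sample = ["E11.9", "Z59.0", "I10", "Z63.4", "Z79.4"]
    print(sdoh_codes(sample))
```

Z79.4 is excluded because its category (Z79) sits outside the Z55–Z65 block, even though it is a Z code.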

Scaling up fast. Watch the live counter at the top of this page — we're on a path to 1,000,000 audited synthetic records. Every tier grows with the warehouse, and early buyers lock in today's price before volume tiers reprice. Stay tuned. 🎯

Is the documentation defensible for CMS audit preparation?

Yes. Documentation reflects CMS Conditions of Participation §482.24, Joint Commission RC standards, and AAPC E/M guidelines. Discipline-specific documentation expectations — PART for chiropractic, the 8-minute rule for therapies, MSE for behavioral health — are present in the records they apply to. Built to survive a real audit so your compliance team can train on it with confidence.

Does it include transitions-of-care and discharge planning documentation?

Yes. Inpatient and surgical records ship with full discharge planning: disposition, DME orders, home health referrals, medication reconciliation, PCP notification, follow-up scheduling, and patient education documented with teach-back. Everything a HEDIS Transitions-of-Care measure or Joint Commission survey would ask for.

We're not IRB or HIPAA-constrained. What does synthetic actually do for quality, audit, and compliance workflows?

The honest answer: if compliance is the only reason you'd consider synthetic, you probably don't need it. You're already working with masked or de-identified data, and your legal team has the controls to manage it. Synthetic doesn't earn its keep on privacy — it earns it on operational tempo. Quality, audit, and compliance teams lose weeks every quarter to three specific lags that synthetic collapses:

1. Ground truth you already know. Every HEDIS measure, every HCC capture, every care-gap abstraction requires someone reading a chart and deciding whether the patient met criteria. On real records, you don't know the right answer — you have to build a gold-standard panel (expensive, slow, political). On synthetic records, we wrote the spec; the right answer is known by construction. When NCQA updates a HEDIS measure, CMS changes an HCC mapping, or your CDI team ships a new query template, you revalidate your abstraction tooling against known-answer records in hours, not weeks. Every quality measure release cycle stops being a fire drill.

2. Edge cases on demand. The records that actually break your pre-bill scrubber, NCCI edit engine, prior-auth workflow, or denial-prediction model are rare. Real data is long-tailed: the cases that break systems sit far out in the tail, so any given extract contains very few of them. We generate the rare ones on request. Need 50 records with specific modifier combinations to stress-test NCCI edits? Done. Need 200 records with documented conflicts between assessment and medication list to validate CDI queries? Here. Need adversarial upcoding patterns to train a fraud-waste-abuse model before it sees a real claim? Generated. These are the cases those systems exist to catch, and real data barely contains them.

3. Pilot tempo. RADV audit readiness, coder productivity benchmarking, compliance workflow redesign, Corporate Integrity Agreement monitoring — every one of these wants to pilot a change against representative cases without coordinating a one-off data-warehouse extract that takes three weeks to approve. Not because PHI forbids it, but because the ticket for "25 cases matching these criteria" sits behind 40 other tickets. Synthetic cohorts materialize in minutes. The pilot happens. The workflow ships. The measure goes live before the next board meeting, not after.
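The known-answer idea from point 1 can be sketched as a simple comparison between what an abstraction tool decided and what the record was built to contain. This is an illustrative sketch only: the Case structure, field names, and sample values are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Case:
    record_id: str
    expected_met: bool  # answer known by construction (hypothetical field)
    tool_met: bool      # what the abstraction tool decided


def revalidate(cases: List[Case]) -> Tuple[float, List[str]]:
    """Return the agreement rate and the record IDs where the tool disagreed."""
    mismatches = [c.record_id for c in cases if c.expected_met != c.tool_met]
    agreement = 1 - len(mismatches) / len(cases) if cases else 0.0
    return agreement, mismatches


if __name__ == "__main__":
    # Made-up cases for illustration.
    cases = [
        Case("R001", True, True),
        Case("R002", False, False),
        Case("R003", True, False),  # tool missed a measure that was met
        Case("R004", False, False),
    ]
    agreement, mismatches = revalidate(cases)
    print(agreement, mismatches)
```

Because the expected answer ships with each record, the mismatch list points directly at the charts to review after a measure or mapping update.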

Here is the plain-English version of how synthetic data actually moves your bottom line.

Healthcare organizations make and lose money based on how quickly they can test and deploy their own software. Your billing system, your claim scrubber, your coding tools, your risk-adjustment algorithms, your quality-measure abstractors — all of them depend on having realistic patient records to test against. When the records are real, you wait three weeks for your data warehouse team to pull a batch. When the records are synthetic, you get them in minutes. That difference between three weeks and three minutes is where synthetic data starts translating into dollars.

On the revenue side. Medicare pays Medicare Advantage plans more money when their members are sicker, and it pays bonuses when the plan hits quality targets. Both payments depend on software that reads patient charts and either captures a diagnosis or decides whether a quality measure was met. Every year, Medicare updates the rules — which diagnoses map to higher payments, which measures count, how they're scored. When the rules change, your software has to be re-tested against charts where you already know the right answer. On real charts, figuring out the right answer means paying a panel of nurses to read them manually, which is slow, expensive, and often debated. On synthetic charts, we wrote the specification, so the right answer is built in. You revalidate your software in days instead of a quarter, start capturing new revenue immediately, and collect the Medicare bonus while your competitors are still testing. For a mid-sized Medicare Advantage plan, landing one additional half-star rating is worth tens of millions of dollars a year. For every chronic condition your software correctly captures on a Medicare Advantage member, your plan earns an additional three thousand to ten thousand dollars a year in capitation. Synthetic data is how you get that software retrained fast enough for the new capture window to matter.

On the cost side. Every insurance claim that gets denied costs your billing department roughly twenty-five to a hundred dollars to fix and resubmit, and it delays payment by a month or two. The software that catches claim errors before submission is only as good as the unusual cases it has been tested against. Real data does not contain many unusual cases; that is what makes them unusual. Synthetic data gives you every tricky modifier combination, every rare documentation gap, every uncommon denial-reason code, on demand. Your scrubber catches more errors before the claim leaves your building, your denial rate drops, and your cash flows in faster. A medical group with a hundred million dollars in annual claims that cuts its denial rate by even a single percentage point recovers about a million dollars in accounts receivable that was previously stuck in rework. The same logic applies to training new medical coders, who normally take six to twelve months to reach full productivity. Synthetic training records cut that ramp by two to three months, which is twelve to eighteen thousand dollars in salary saved per new hire.
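A quick back-of-envelope check of the figures in the paragraph above, using only the numbers quoted there. It assumes the denial rate is applied to claim dollars rather than claim counts; the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
ANNUAL_CLAIMS_USD = 100_000_000
DENIAL_RATE_CUT_PCT = 1  # one percentage point

# Integer math avoids floating-point noise in the result.
recovered_usd = ANNUAL_CLAIMS_USD * DENIAL_RATE_CUT_PCT // 100

# Coder ramp: months saved and the salary savings quoted for each.
RAMP_MONTHS_SAVED = (2, 3)
SALARY_SAVED_USD = (12_000, 18_000)

# Monthly cost per new hire implied by the quoted savings.
implied_monthly_cost = [
    usd // months for usd, months in zip(SALARY_SAVED_USD, RAMP_MONTHS_SAVED)
]

print(recovered_usd)
print(implied_monthly_cost)
```

A one-point cut on $100 million works out to about $1 million, and both ends of the coder-ramp range imply the same monthly cost of $6,000 per new hire.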

On the audit-exposure side. Medicare periodically audits health plans to verify that the diagnoses they billed for are actually supported by the patient records. When documentation does not support the diagnosis, Medicare claws the money back: five to fifty million dollars per audit cycle for a mid-sized plan. If your team can run a mock audit against synthetic patient cohorts before the real auditors arrive, you find your documentation gaps yourself at zero cost and fix them quietly. The chargebacks you would have paid stay in your margin instead. This is the single cleanest return on investment synthetic data offers any health plan: you are spending four digits to prevent seven- and eight-digit chargebacks.

On the vendor and procurement side. Every software vendor pitching your organization wants to demo their product against realistic patient data, and no compliance officer is going to hand them a production extract on the first call. When the sample data takes a month to extract and clear legal review, your vendor evaluation takes a full quarter, and during that entire quarter you are still writing checks to the legacy system you are trying to replace. When the sample data is synthetic and deliverable the same afternoon, your evaluation takes two weeks, and you stop paying for the old system a full quarter sooner.

The mechanism is consistent across every one of these examples. Synthetic data does not invent new revenue, replace your team, or change your regulatory environment. What it does is remove the waiting time between the moment your team knows they need to test something and the moment they actually can. Every line item that improves when your team ships improvements faster — your revenue capture, your denial rate, your audit exposure, your working capital, your coder productivity, your vendor replacement cycle — moves in your direction. That is the entire business case, in one sentence: synthetic data shortens the distance between "we should fix this" and "we fixed it," and every dollar in between belongs to you instead of the problem.

How are health plans, medical groups, and EDI teams actually using this at scale?

Short answer: anywhere real patient data would create a compliance headache or a DUA delay. Long answer — three buyer patterns:

Health plans (payers): EDI pipeline QA — stress-testing 837 ingestion, 835 generation, and 277 response logic against thousands of synthetic claims with realistic denial distributions, zero staging-PHI exposure. HCC/RADV model development — training risk-adjustment ML on records where diagnosis capture is explicit and defensible. HEDIS abstraction automation — validating abstraction tools against records with known Z-code captures, preventive-care gaps, and transitions-of-care documentation (you know the right answer because you wrote the spec). Fraud and abuse detection — generating known-adversarial patterns to train detection models without waiting for real cases to accumulate. Fee-schedule modeling — running reimbursement scenarios against synthetic cohorts before sitting down across the table from providers.

Medical groups (providers): EHR migration validation — loading synthetic patients into a staging Epic, Athena, or Cerner instance so billing and clinical staff run real shifts against fake charts, surfacing every workflow gap before Day 1. Coder and CDI training — new hires code synthetic charts and their errors teach without a real-patient lawsuit lurking; CDI specialists train on records that carry defensible and vulnerable documentation side by side. Revenue cycle ML — training denial prediction, prior-auth approval, and appeal-likelihood models on synthetic 837/835 pairs. Vendor pilot sandbox — every EHR vendor, every RCM vendor, every AI startup pitching you wants "real data to demo against." Hand them a sampler, not your production backup.

EDI teams (on either side of the fence): X12 5010 compliance testing across the full transaction suite (837P/I, 835, 270/271, 276/277, 278) on synthetic files that already carry the clinical context behind them. Clearinghouse onboarding — every Availity, Change Healthcare, Optum, Waystar integration requires test files, and your implementation timeline drops by weeks when you can generate them on demand. CARC/RARC taxonomy coverage — records span the full denial-reason spectrum, so your edit engine catches every code in production, not just the ones that happened to show up in the last 90 days. Modifier-driven reimbursement QA — test every valid code combination at volume without a single real-member exposure.
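As a small illustration of the kind of check an EDI team might script against test files, the sketch below pulls the transaction set type out of each ST segment. It assumes "~" as the segment terminator and "*" as the element separator; real interchanges declare their delimiters in the ISA segment. The sample string is a hand-written fragment, not output from the dataset, and transaction_sets is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def transaction_sets(
    x12: str, seg_term: str = "~", elem_sep: str = "*"
) -> List[Tuple[str, Optional[str]]]:
    """Return (transaction set ID, implementation reference) for each ST segment.

    Assumes fixed delimiters; production parsers should read them from ISA.
    """
    found = []
    for raw in x12.split(seg_term):
        segment = raw.strip()
        if not segment:
            continue
        parts = segment.split(elem_sep)
        if parts[0] == "ST" and len(parts) > 1:
            reference = parts[3] if len(parts) > 3 else None
            found.append((parts[1], reference))
    return found


if __name__ == "__main__":
    # Hand-written fragment for illustration only.
    sample = (
        "ST*837*0001*005010X222A1~"
        "BHT*0019*00*0123*20240101*1200*CH~"
        "SE*3*0001~"
    )
    print(transaction_sets(sample))
```

The same loop extends naturally to counting segments per transaction or checking that each ST has a matching SE control number.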

The pattern underneath all of it: synthetic data removes the compliance tax on experimentation. Every workflow you've wanted to pilot, every ML model you've wanted to train, every vendor you've wanted to evaluate — synthetic lets it happen in days instead of quarters, at a unit cost that's a fraction of what your legal team spends negotiating a single DUA.

Can synthetic data speed up our AI agents, EDI automation, and auto-coding projects — and what does that mean for the people implementing them?

Yes, and this is arguably the single most important reason to buy this today.

Every AI agent, every EDI automation, every claim auto-coder, every prior-auth bot your organization is piloting or planning needs one thing before it can do anything useful: a large corpus of realistic patient records to train on, test against, and validate. That corpus is also the single longest delay in every AI project in healthcare. Real records require compliance review, data-use agreements, de-identification pipelines, and typically ninety days of coordination with your data warehouse team before your ML engineers can even run their first experiment. Multiply that delay across five AI initiatives and you have roughly fifteen months of cumulative waiting before any of them produce measurable ROI.

Synthetic patient records collapse that delay into an afternoon. Your ML engineers can train a denial-prediction model, fine-tune a CPT auto-coder, stand up a prior-auth agent, or validate an EDI 837 pipeline this week instead of next quarter. Every automation project your organization ships a quarter sooner is a quarter of accelerated ROI, a quarter of reduced denial rework, and a quarter of faster revenue capture. At typical enterprise automation returns of four to six dollars saved per automated task, shipping a single AI project one quarter earlier across an organization running a hundred thousand claims a month means hundreds of thousands of dollars in margin per project per quarter. Multiply that across your full automation portfolio. This is the real unlock.

For the people who will actually build and operate those systems, the value runs in the same direction. The single most valuable skill in healthcare administration over the next three years will be the ability to configure, query, test, and QA AI agents against realistic clinical data. The coders, billers, CDI specialists, and RCM analysts who know how to feed a synthetic cohort into a claim-edit engine, spot where the model gets it wrong, and refine the prompt or retrain the model — those are the people organizations pay a premium to keep and promote. Synthetic patient records are the training ground where those skills are acquired affordably, safely, and without needing permission to touch production data you do not yet have clearance for.

The dual-sided value is what makes this different from every other data offering on the market. If you are an administrator evaluating AI vendors, synthetic data is the infrastructure that compresses your implementation timeline from quarters to weeks. If you are a coder, biller, analyst, or CDI specialist, synthetic data is the weekend project that puts you on the operator side of every AI workflow your industry is now adopting. Same dataset, different price tier. Starter at $149 for an individual upskilling on their own. Architect at $349 for a small team piloting an agent. Premium and Enterprise for organizational AI-readiness programs.

The hidden multiplier. Synthetic data does not just accelerate your automation projects. It builds the workforce that operates them. Your organization's biggest competitive advantage over the next three years will not be which AI vendor you chose. It will be which of your people learned how to run the AI before the vendor's onboarding deck was even finished. The synthetic dataset sitting on that person's laptop is what makes them that person.

Pricing, Discounts & Support

Do you offer academic or nonprofit discounts?

Yes. 30% off Innovator, Professional, and Architect tiers for verified .edu domains — use coupon code ACADEMIC30 at checkout. 25% off the same tiers for verified 501(c)(3) nonprofits — use coupon code NONPROFIT25. Premium and Enterprise are margin-constrained; for volume pricing, email us directly.
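For budgeting, the coupon math can be sketched as below. The Architect price ($349) is the one quoted on this page; rounding to the cent is an assumption, and the actual checkout may round differently. The function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Coupon codes and eligible tiers as stated in the answer above.
COUPONS = {"ACADEMIC30": Decimal("0.30"), "NONPROFIT25": Decimal("0.25")}
ELIGIBLE_TIERS = {"Innovator", "Professional", "Architect"}


def discounted_price(tier: str, list_price: Decimal, coupon: str) -> Decimal:
    """Apply a coupon to an eligible tier; other tiers keep the list price."""
    if tier not in ELIGIBLE_TIERS or coupon not in COUPONS:
        return list_price
    price = list_price * (1 - COUPONS[coupon])
    # Rounding rule is an assumption for illustration.
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(discounted_price("Architect", Decimal("349"), "ACADEMIC30"))
    print(discounted_price("Architect", Decimal("349"), "NONPROFIT25"))
    print(discounted_price("Premium", Decimal("599"), "ACADEMIC30"))
```

Under these assumptions, Architect comes to $244.30 with the academic code and $261.75 with the nonprofit code, while Premium is unchanged.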

What's the license? Can I use this in a commercial product?

Yes — Innovator through Premium include a commercial-use license covering internal use, research, and publication. Redistribution of the raw dataset is not permitted (derivative models trained on the data are fine). Enterprise includes redistribution and white-label rights. Full terms arrive with your order confirmation.

Refund policy?

14-day no-questions refund on Innovator, Professional, and Architect tiers. Premium and Enterprise are custom-processed and non-refundable once delivered — but we offer a free sample first so you can evaluate quality before committing at that level.

How do I download my records after purchase?

Instantly. After Stripe confirms your payment, you land on a post-purchase dashboard where every format your tier grants is available as a one-click download. Links are signed and valid for 24 hours per request — lose the email, log back in and regenerate. You never lose access to what you paid for.
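For readers curious how a signed, time-limited link generally works, here is a generic sketch using an HMAC signature and a 24-hour expiry. It is not a description of how PatientDatasets.com implements its links; the secret, path, query parameter names, and function names are all hypothetical.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret-not-real"  # hypothetical key for illustration
TTL_SECONDS = 24 * 60 * 60        # 24-hour validity, as described above


def _signature(path: str, expires: int) -> str:
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_link(path: str, now=None) -> str:
    """Return the path with an expiry timestamp and signature appended."""
    issued = int(time.time() if now is None else now)
    expires = issued + TTL_SECONDS
    return f"{path}?expires={expires}&sig={_signature(path, expires)}"


def verify_link(path: str, expires: int, sig: str, now=None) -> bool:
    """Accept the link only if the signature matches and it has not expired."""
    current = time.time() if now is None else now
    return hmac.compare_digest(_signature(path, expires), sig) and current < expires


if __name__ == "__main__":
    url = sign_link("/downloads/architect.parquet")
    print(url)
```

Regenerating a link simply means signing the same path again with a fresh expiry, which is why losing an old link does not mean losing access.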

Will my records be updated with new codes and guidelines?

Every monthly release ships against current-year code sets and clinical references (ICD-10-CM, CPT/HCPCS, LOINC, RxNorm, FDA labels). One-time purchases include the release you bought. Optional annual refresh subscriptions are on our roadmap for institutional customers who want continuous updates.

Support response time?

24-hour weekday response on all paid tiers. Premium customers get 4-hour response. Enterprise customers get 1-hour response with a named account manager. Contact: support@patientdatasets.com

Still have questions?

Two fastest paths: (1) download the free sample and see for yourself whether the records meet your bar; (2) email support@patientdatasets.com with your use case and we'll tell you which tier fits — no pressure, no sales call.

FREE 5 records.

No credit card. No account. Instant download — five complete synthetic patient records delivered as a relational CSV bundle.

5 clinically engineered records

CSV format — ready to open

No spam, ever

Clinically Engineered Data Sets

HEDIS Quality Measures Analysis

Compliance Rate

Measure Performance

Who Uses PatientDatasets.com

Data Scientists & ML Engineers

Healthcare Students & Educators

Researchers, Admins & Vendors

What's Inside Each Record

36+ Demographics

Diagnoses (ICD-10-CM)

Full Clinical Notes

Medications

Labs & Imaging

Operative Reports

Therapy & Ancillary SOAP

Psychiatric & Behavioral

60+ Medical Specialties.

Clinical Depth Generic Synthetic Data Can't Match.

Clinical Pearls Where They Matter

Grounded in Authoritative Sources

Multi-Stage Clinical Audit

Every Release More Rigorous

Practice Billing Before You Touch a Real Claim.

Chiropractic

PT, OT & Speech Therapy

Psychology & Psychiatry

EDI & Claims

Stack On Exactly What You Need

JSON

SQLite

Parquet

FHIR R4

HL7 v2.x

C-CDA

Why PatientDatasets.com

Statistically Realistic

Clinical Coherence Built-In

Zero HIPAA. Zero IRB.

Your Order, Locked In

Multi-Format, Multi-Tool

Delivered After Checkout

Engineered by a Clinician. Validated by an Auditor.

Built by a 23-Year RN

Clinically Guided Pathways

Automated Quality Gate

How It Works

Choose

Pay

Receive

Simple, Transparent Pricing

Frequently Asked Questions
