Your Healthcare App Passed Every Test. Then It Met Real Patients.

The call came in at 3:18am. A production alert — medication reconciliation module returning null values for compound drug names. A small hospital in rural Ohio had just deployed the new EHR integration and a night-shift nurse noticed something off. The engineering team had 40 minutes of logs to comb through. And a creeping, terrible realization: this exact edge case had never appeared in their test data.

That story was shared with us by a developer at a mid-sized HealthTech company. He asked us not to use his name. But he wanted others to know how it happened — and why it was preventable.

"We had thousands of test records," he told us. "But they were all clean. Textbook. The kind of patient data you'd construct if you wanted to demonstrate a feature, not the kind that actually shows up in a rural family practice where the physician has been prescribing the same compound for thirty years under a name that doesn't match any formulary."

The compound in question was a topical preparation — a specific concentration of a steroid and an antifungal, compounded by a local pharmacy, referenced in the chart under the pharmacy's house name rather than any standard RxNorm identifier. The medication reconciliation module had never been asked to handle it. Not because the developers hadn't thought about edge cases. Because the edge case had never been in their test data, so the code path had never been exercised, and the null pointer exception had been waiting patiently in production for the first real patient it would encounter.

That night cost the team four hours of emergency response, two days of hotfix development and re-testing, a delayed go-live for a second hospital, and a conversation with a clinical partner that nobody on the team wanted to have. The bug was fixable. The trust damage was harder.

The Problem With "Good Enough" Test Data

Healthcare software development has a test data problem that almost no one talks about openly. The challenge isn't a lack of awareness — most developers know their test data isn't perfect. The challenge is that alternatives are hard, the problem is invisible until it explodes, and there's no obvious moment where someone looks at a test suite and says: this isn't realistic enough.

Real patient data is locked down. Accessing it typically requires some combination of a data use agreement, IRB approval, a business associate agreement, and institutional sign-off, and the process can take months. And even when you get it, you can't share it with your team freely. You can't use it in your CI/CD pipeline. You can't hand it to a contractor to debug a specific issue at 3am. You can't push it to a staging environment without triggering a HIPAA compliance review.

So teams do what they can. They hand-craft a few hundred test records — carefully, with good intentions, but inevitably shaped by what the developers know rather than what patients actually present with. They use Synthea to generate synthetic data and then wonder why the outputs don't look like anything their clinical partners recognize. They copy a few anonymized examples from a textbook. They create "edge case" records that represent the edges they already knew about. And they ship.

The problem is structural. Test data that's built by developers represents the universe of what developers know about clinical data. But clinical reality is vastly larger than that universe. Patients are messy, complex, longitudinal creatures. Their charts accumulate over decades. Their medications are adjusted, discontinued, restarted. Their diagnoses evolve, combine, and contradict. Their documentation is written by dozens of different providers with dozens of different styles.

It's 2:00pm on a Tuesday. Your QA team is testing a new prior authorization workflow. The test records all look the same — clean demographics, a handful of ICD-10 codes, one medication, normal vitals. Everything passes.

What the tests never saw: a 73-year-old patient with 14 active diagnoses, 22 medications, three active care gaps, and a clinical note so dense with abbreviations that even an experienced coder would need a second read. That's what Tuesday looks like in a real physician's office. Your QA team tested a toy version of that reality and called it done.

The problem isn't that developers are careless. It's that unrealistic test data is invisible until it isn't. Everything passes. The demo looks great. The clinical pilot runs smoothly, because the pilot site hand-selected clean records for the integration testing phase. The software ships. And then, somewhere in month two of production, a night-shift nurse encounters a patient chart that your test suite never imagined.

How Teams Build Test Data Today — and Why It Fails

To understand why this problem is so persistent, it helps to walk through how development teams actually construct their test data. There are four common approaches, and each has a specific, predictable failure mode.

Approach 1: Hand-crafted records

The most common approach, especially early in a project, is to hand-build records that cover the known test scenarios. A developer or QA engineer sits down and creates patients: a 45-year-old with hypertension, a 67-year-old with diabetes and CKD, a pediatric patient for the age-based logic. These records are precise and controllable, which makes them useful for unit testing specific features.

The failure mode is selection bias. Hand-crafted records reflect what the creator knows, expects, and considers important. They don't reflect the full distribution of what real patients look like. A developer who has never worked in a clinical setting will hand-craft records that look like they came from a clinical textbook — which is exactly where they got their mental model of what a patient record looks like.

Approach 2: Anonymized production records

Some teams get access to records from a clinical partner, strip the identifiers, and use the result for testing. This seems like an obvious solution: real data, no PHI. In practice, it creates three problems. First, the records come from a single institution, so they reflect that institution's documentation culture and patient population. Second, de-identification is rarely airtight, and the compliance and legal risk of using even "anonymized" patient data in a development environment is substantial. Third, the records can't be freely shared, version-controlled, or integrated into automated pipelines.

There's also the HIPAA Privacy Rule's Safe Harbor de-identification standard (45 CFR §164.514(b)(2)) to consider. Truly safe-harbored data has 18 categories of identifiers removed, including all date elements more specific than the year, most geographic information below the state level, and any other unique identifying numbers. After that process, you're often left with records that are clinically impoverished, stripped of exactly the contextual details that make them realistic.

Approach 3: Synthea-generated data

Synthea, the open-source synthetic patient generator from MITRE, produces FHIR-compliant records with plausible demographics, conditions, medications, and clinical observations. It's a genuine contribution to the field, and for many development teams it's a significant improvement over hand-crafted records.

But Synthea has well-documented limitations that become acute when you're building software for real clinical environments. Its clinical note narratives are templated and formulaic — not the way physicians actually write. Its medication lists don't capture compound medications, off-label prescribing, or the kind of medication complexity that appears in patients with multiple chronic conditions. Its comorbidity patterns are probabilistic but not calibrated to specific populations. And its ICD-10 code selection doesn't reflect the coding nuances that real billing environments introduce — specificity variations, combination codes, sequencing issues, and the coding guidance that differs by payer.

Clinical engineers who have worked with both Synthea and real patient data describe the gap immediately: Synthea records look clean in a way that real records never are. The messiness of real clinical documentation — the inconsistencies, the abbreviations, the carry-forward from prior visits, the discrepancies between what the note says and what the codes say — simply isn't there.

Approach 4: Textbook or sandbox data

EHR vendors provide sandbox environments with pre-populated test data. Clinical informatics textbooks include sample records as exercises. CMS provides synthetic beneficiary data for FHIR API testing. These resources exist and are useful — for their intended purposes. But they're explicitly designed to be clean, simple, and illustrative. They're teaching tools, not production stress tests.

A test suite built on sandbox data tells you whether your software works in the best possible case. It tells you almost nothing about whether it works in the realistic case — which is the only case that matters when real patients are involved.

A Taxonomy of Production Failures

To understand what your test data is missing, it helps to walk through the specific categories of failure that healthcare applications encounter in production — and to name the exact mechanisms by which oversimplified test data fails to catch them.

Medication Name Failures

The 3am incident described at the start of this article was a medication name failure. It's one of the most common categories, and one of the most dangerous, because it occurs at the intersection of clinical workflow and patient safety.

Real medication data is extraordinarily diverse. The RxNorm ontology contains over 100,000 concept unique identifiers. A production medication reconciliation system will encounter brand names, generic names, compound pharmacy names, old brand names for discontinued drugs, name variants from international formularies, and physician shorthand that bears only a passing resemblance to any official name. It will encounter drug names with special characters — em dashes, Greek letters, parenthetical strength indicators. It will encounter medications documented in a prior visit note using a name that was superseded by a formulary update.

Test data built from a list of 50 common medications will catch none of this. It will test your happy path flawlessly, and your parse failure will wait in production for the first compounded medication or the first name that doesn't match your lookup table.

Specific failure patterns to test for:

Compound medications (e.g., "hydrocortisone 1%/clotrimazole 1% topical cream, 60g")
Medications with embedded dosage in the name field rather than a separate field
Medications with null or empty strength values because the strength was documented in a separate free-text note
Discontinued brand names still referenced in long-term patients' charts (e.g., Darvocet, Vioxx)
Medications with route specified via non-standard abbreviations (SQ vs. SubQ vs. "subcutaneous injection")
Multi-ingredient medications where the name contains a slash character that your parser might interpret as a path separator
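
One way to keep these patterns from turning into null values is to make the parser incapable of returning nothing. The sketch below is a minimal illustration under stated assumptions: the vocabulary, the dose-form word list, and the names parse_medication and ParsedMedication are all hypothetical, and a production system would consult a maintained drug terminology rather than a hard-coded set. The point is the shape of the result: the raw text always survives, and anything unrecognized is flagged for review instead of dropped.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical vocabulary for illustration; a real system would query a drug terminology.
KNOWN_INGREDIENTS: Set[str] = {"hydrocortisone", "clotrimazole", "hydrocodone", "acetaminophen"}

# Dose-form words dropped when guessing ingredient names (illustrative, not complete).
FORM_WORDS = {"topical", "cream", "ointment", "tablet", "capsule", "solution"}

# Strength tokens such as "1%", "60g", "180 mg", or "5/325mg".
STRENGTH = re.compile(
    r"\d+(?:\.\d+)?(?:/\d+(?:\.\d+)?)?\s*(?:%|mcg|mg|ml|g)(?![A-Za-z])",
    re.IGNORECASE,
)


@dataclass
class ParsedMedication:
    raw: str                 # original text, always preserved
    ingredients: List[str]   # best-effort guesses, possibly empty
    needs_review: bool       # True when nothing was found or a guess is unknown


def parse_medication(raw: Optional[str]) -> ParsedMedication:
    text = (raw or "").strip()
    # Remove strengths first so the "/" in "5/325mg" is not read as an ingredient separator.
    without_strength = STRENGTH.sub(" ", text)
    ingredients = []
    for part in without_strength.split("/"):
        words = [w.strip(",.;").lower() for w in part.split()]
        name = " ".join(w for w in words if w and w not in FORM_WORDS)
        if name:
            ingredients.append(name)
    unknown = [i for i in ingredients if i not in KNOWN_INGREDIENTS]
    return ParsedMedication(text, ingredients, needs_review=bool(unknown) or not ingredients)
```

A pharmacy house name such as "Main Street Pharmacy itch cream #2" comes back with needs_review set and the original string intact, which gives a reconciliation workflow something to show a clinician instead of an empty row.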

Character Encoding Edge Cases

Healthcare data contains special characters in quantities that most software developers don't anticipate. Patient names with diacritics (accents, umlauts, tildes) are routine in diverse patient populations. Clinical notes contain degree symbols, Greek letters (alpha and beta for drug receptor specificity), and medication names with hyphens, en dashes, and em dashes that look similar but encode differently. FHIR JSON payloads containing non-ASCII characters will break parsers that expect only ASCII, and payloads that were mis-encoded upstream will break parsers that expect valid UTF-8. HL7 v2.x messages use pipe and caret characters as delimiters, and real clinical data sometimes contains those characters in unexpected places, especially in free-text fields.

The encoding failures that surface in production are almost never caught in development because test data is almost always ASCII-safe and cleanly structured. Real patient names are not.

If your test suite has never included a patient named "Müller" or "García-López" or "Nguyễn Thị Hương," you haven't tested your name field handling. Across a population of any size, you will encounter all of these. And if your software fails on the name, it fails on the patient.
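
A name-handling test along these lines is cheap to write. The sketch below uses only the Python standard library; the function names and the "family" key are illustrative assumptions, not part of any particular system. It checks two things: that a name survives a UTF-8 JSON round trip unchanged, and that composed and decomposed Unicode forms of the same name compare equal after normalization.

```python
import json
import unicodedata

# The three names from the paragraph above.
SAMPLE_NAMES = ["Müller", "García-López", "Nguyễn Thị Hương"]


def normalize_name(name: str) -> str:
    """Normalize to NFC so composed and decomposed forms compare equal."""
    return unicodedata.normalize("NFC", name)


def round_trip(name: str) -> str:
    """Serialize to UTF-8 JSON bytes and back, as an API boundary would."""
    payload = json.dumps({"family": name}, ensure_ascii=False).encode("utf-8")
    return json.loads(payload.decode("utf-8"))["family"]


def names_match(a: str, b: str) -> bool:
    """Compare two names after normalization."""
    return normalize_name(a) == normalize_name(b)
```

The normalization step matters because two systems can send the same visible name as different code point sequences, and a naive string comparison will treat them as different patients.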

Demographic Edge Cases

Demographic data is where software makes assumptions so quietly that teams don't notice them until production. Some specific failure categories:

Age extremes: Neonates with ages measured in hours or days, not years. Patients over 100 years old, whose ages overflow or truncate in systems that store age as a two-digit number. Logic that assumes a patient's age is a whole number of years.
Gender identity complexity: HL7 FHIR R4 carries administrative gender as a core Patient element, while gender identity and sex assigned at birth are typically conveyed through extensions (US Core defines one for birth sex, for example). Systems built on older standards may not handle the distinction. Systems that hard-code binary gender assumptions will fail on non-binary patients.
Name field complexity: Legal names that include suffixes (Jr., III, Esq.), single-name patients (mononymous individuals, which is common in some cultures), names with particles (de, van, von, el) that sort differently depending on convention, maiden names vs. current names stored inconsistently across systems.
Address edge cases: Rural route addresses (RR 3 Box 47), military addresses (APO AE), addresses in US territories (Puerto Rico, Guam), post office boxes, homelessness indicated by a shelter address or by a missing address field that your required-field validation rejects.
Insurance complexity: Patients with multiple active insurance policies, patients whose insurance coverage changed mid-encounter, patients whose insurance information is stored under a different name than the patient's legal name (common with minors on a parent's policy).
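
The age extremes in particular are easy to pin down in code. The following is a minimal sketch, assuming ages are derived from dates rather than stored; the function names and the display thresholds are illustrative choices, not a clinical standard. It keeps days and months alongside years so that a newborn is not rounded to zero and a centenarian is not squeezed into two digits.

```python
from datetime import date
from typing import Tuple


def age_parts(birth: date, on: date) -> Tuple[int, int, int]:
    """Return (completed_years, completed_months, total_days) as of `on`."""
    if birth > on:
        raise ValueError("birth date is after the reference date")
    months = (on.year - birth.year) * 12 + (on.month - birth.month)
    if on.day < birth.day:
        months -= 1  # the current month is not yet complete
    return months // 12, months, (on - birth).days


def display_age(birth: date, on: date) -> str:
    """Pick a unit that keeps neonatal precision (thresholds are illustrative)."""
    years, months, days = age_parts(birth, on)
    if months < 1:
        return f"{days} d"
    if years < 2:
        return f"{months} mo"
    return f"{years} y"
```

Under this convention a patient born on February 29 completes a year on March 1 in non-leap years. Other conventions exist, so the choice is worth confirming with clinical and billing stakeholders.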

Date and Time Format Edge Cases

Date handling in healthcare software is a remarkably deep source of production failures. The problem is that real clinical data contains dates in formats that your test suite never generates, and clinical workflows create date scenarios that your software's logic never anticipated.

Partial dates: A patient's year of birth is known but not the month or day — this is common in refugee populations and elderly patients whose birth records are incomplete. FHIR allows partial dates. Your validation logic may not.
Date in the future: Scheduled procedures documented with future dates. Follow-up orders written for a date six months out. Insurance coverage start dates that are tomorrow. Your system may interpret a future date as a data error and reject it.
Cross-midnight encounters: An ED encounter that begins at 11:40pm and ends at 2:15am the next day has two different calendar dates. Systems that use the encounter date as a key field may treat these as two separate encounters or may assign the wrong date to the admission record.
Time zone handling: HL7 FHIR uses ISO 8601-style timestamps and requires a timezone offset whenever a time of day is included. Records generated by systems in different time zones, records transferred between systems with different timezone configurations, and records in systems that store timestamps in local time without offset information all create ambiguities that your test data, generated in a single timezone, will never surface.
Leap year edge cases: February 29th birth dates. Lab results dated February 29th in a non-leap year (a data entry error, but a real one). Age calculations that break on leap years.
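
Partial dates and impossible dates can both be handled by one small parser. This sketch assumes the three FHIR date shapes (year, year-month, year-month-day); the PartialDate type and parse_fhir_date function are hypothetical names introduced for illustration. It accepts a bare birth year and raises on a February 29 in a non-leap year, so the caller decides what to do with the bad record rather than having it slip through.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class PartialDate(NamedTuple):
    year: int
    month: Optional[int]
    day: Optional[int]


# Matches YYYY, YYYY-MM, or YYYY-MM-DD.
_FHIR_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_fhir_date(value: str) -> PartialDate:
    """Parse a possibly partial date; raise ValueError on malformed or impossible dates."""
    m = _FHIR_DATE.match(value)
    if not m:
        raise ValueError(f"not a recognized date shape: {value!r}")
    year, month, day = (int(g) if g is not None else None for g in m.groups())
    if month is not None:
        # Use day 1 when the day is unknown so the month is still validated.
        date(year, month, day if day is not None else 1)
    return PartialDate(year, month, day)
```

Downstream code then has to state explicitly how it treats a missing month or day, which is exactly the decision that clean test data lets teams avoid making.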

Comorbidity Combination Failures

Real patients have multiple conditions. In a Medicare population, multiple chronic conditions are the norm rather than the exception, and a meaningful share of beneficiaries have five or more. A patient with heart failure, diabetes, CKD, COPD, and hypertension (a common combination in elderly patients) generates clinical interactions that simple test records never surface.

The failure modes here are subtle. A prior authorization workflow that checks for a single contraindicated condition will miss a combination that creates the same contraindication. A drug interaction checker that checks each medication against each diagnosis individually may miss an interaction that only emerges from the combination of two medications in the context of a specific comorbidity. A clinical decision support rule that fires on a single lab value may behave unexpectedly when multiple lab values are simultaneously abnormal in ways that don't occur in isolation.

ICD-10-CM has specific combination codes for conditions that frequently co-occur. E11.65, for example, is the code for "Type 2 diabetes mellitus with hyperglycemia" — but there are also combination codes that capture diabetes with specific complications (E11.21 for diabetic nephropathy, E11.311 for diabetic retinopathy with macular edema). A system that parses ICD-10 codes without understanding the combination code structure may misinterpret a record that uses these correctly.
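
A rule that looks for one exact code will miss its more specific siblings. The sketch below shows category-level matching by prefix; the helper names are hypothetical, and prefix matching is a simplification that covers the hierarchy described above, not the full set of ICD-10-CM coding rules.

```python
from typing import Iterable


def normalize_icd10(code: str) -> str:
    """Uppercase and drop the dot so 'e11.65' and 'E1165' compare equal."""
    return code.strip().upper().replace(".", "")


def in_category(code: str, category: str) -> bool:
    """True when `code` falls under `category`, e.g. E11.65 under E11."""
    return normalize_icd10(code).startswith(normalize_icd10(category))


def has_condition(codes: Iterable[str], category: str) -> bool:
    """True when any code on the record belongs to the category."""
    return any(in_category(c, category) for c in codes)
```

With this in place, a check for type 2 diabetes finds a record coded E11.21 or E11.311 as readily as one coded E11.9, which an equality check against a single code would not.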

Three Development Teams, Three Hard Lessons

Rather than speak in abstractions, let's walk through three fictionalized but representative accounts of development teams who encountered production failures that their test data didn't catch. The specifics are changed; the patterns are real.

Team One: The EHR Integration Firm

A company building an HL7 v2.x to FHIR translation layer for a regional hospital network ran extensive testing before go-live. Their test suite contained 800 records covering the major message types: ADT, ORU, ORM, SIU. Everything worked.

Three weeks into production, they started seeing translation failures on a subset of messages originating from one of the legacy systems in the network. The legacy system was a 1990s-era lab information system that transmitted HL7 v2.2 messages with non-standard segment separators — using a vertical tab character instead of a carriage return between segments. Their parser, which assumed a carriage return, silently dropped every message from that system.

It took eleven days to identify the issue — because the messages weren't failing loudly, they were failing silently. Lab results for patients being processed through that system were simply not appearing in the FHIR representation. During those eleven days, several patients had lab results that weren't visible in the system the hospitalists were using.

The fix was two lines of code. The test data that would have caught it would have required one non-standard message in their test suite. Neither was there.
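
To make the failure concrete, here is a minimal sketch of a segment splitter that tolerates more than one separator and refuses to fail quietly. It is not the team's actual fix. It assumes any transport framing has already been removed (the vertical tab also serves as a framing byte in some HL7 transports), and the function name is hypothetical.

```python
import re
from typing import List

# The standard HL7 v2 segment terminator is a carriage return; the others are tolerated here.
_SEGMENT_BREAK = re.compile(r"\r\n|\r|\n|\x0b")


def split_segments(message: str) -> List[str]:
    """Split a message into segments, raising instead of silently returning nothing."""
    segments = [s for s in _SEGMENT_BREAK.split(message) if s.strip()]
    if not segments or not segments[0].startswith("MSH"):
        raise ValueError("message does not start with an MSH segment")
    return segments
```

The one test record that would have caught the problem is equally small: a message whose segments are joined by a vertical tab, with an assertion that the parser still finds every segment.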

Team Two: The Clinical Decision Support Startup

A clinical decision support company built a medication contraindication checker that integrated into a pharmacy system at a community hospital. Their test data included 200 patients with realistic medications and diagnoses. The system performed well in testing — catching known contraindications with high accuracy and a false positive rate low enough that the pharmacists found it useful rather than annoying.

Six months after deployment, the pharmacy director flagged a concern: the system wasn't generating alerts for a specific combination that it should have caught. The combination was warfarin and fluconazole, a well-known drug-drug interaction that substantially potentiates the anticoagulant effect of warfarin and usually calls for closer monitoring, dose adjustment, or substitution. The system wasn't catching it because the warfarin in the patient's medication list was documented as "Coumadin" (the brand name), and the system's drug interaction lookup used RxNorm ingredient names for matching. Coumadin resolved to warfarin only when the brand-name-to-generic mapping table was loaded and current. A data pipeline issue had left that table six months stale, and lookups against it were failing quietly: an unresolved brand name produced no alert instead of an error.

The test data had used the generic name "warfarin" exclusively. No test record had ever used "Coumadin." The failure path had been there since day one.
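
The underlying pattern, resolving every name to ingredients before checking and treating an unresolved name as an error, can be sketched in a few lines. Everything below is a hypothetical illustration: the tables are tiny hard-coded stand-ins for a maintained drug vocabulary, and the function names are invented for this example.

```python
from typing import FrozenSet, Iterable, Set


class UnmappedDrugError(LookupError):
    """Raised when a medication name cannot be resolved to ingredients."""


# Hypothetical, deliberately tiny tables for illustration only.
KNOWN_INGREDIENTS = {"warfarin", "fluconazole", "aspirin", "acetaminophen", "caffeine"}
BRAND_TO_INGREDIENTS = {
    "coumadin": {"warfarin"},
    "diflucan": {"fluconazole"},
    "excedrin": {"aspirin", "acetaminophen", "caffeine"},
}
INTERACTING_PAIRS = {frozenset({"warfarin", "fluconazole"})}


def resolve(name: str) -> Set[str]:
    """Map a brand or generic name to its ingredient set, or fail loudly."""
    key = name.strip().lower()
    if key in KNOWN_INGREDIENTS:
        return {key}
    if key in BRAND_TO_INGREDIENTS:
        return set(BRAND_TO_INGREDIENTS[key])
    raise UnmappedDrugError(name)


def interactions(med_names: Iterable[str]) -> Set[FrozenSet[str]]:
    """Return every known interacting pair present across the resolved ingredients."""
    ingredients: Set[str] = set()
    for name in med_names:
        ingredients |= resolve(name)
    return {pair for pair in INTERACTING_PAIRS if pair <= ingredients}
```

A test list that mixes brand and generic names exercises both lookup paths, and a brand that is missing from the table surfaces as an UnmappedDrugError rather than as a missing alert.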

Team Three: The Patient Portal Provider

A patient portal company offering a SMART on FHIR application that allowed patients to view their own records ran into trouble when a major health system integrated the portal with their Epic EHR. During testing on the vendor's sandbox environment, everything had worked correctly. The UI rendered records cleanly. Lab values appeared in the right format. Medication lists displayed completely.

In production, they started receiving support tickets from patients who said their medication lists were cut off — showing only the first 12 medications. The issue was a UI component that had been designed with a maximum list length assumption baked into its render logic. The assumption had never been questioned because no test record had more than 8 medications. In the real health system's patient population, a substantial number of patients — particularly the elderly patients with multiple chronic conditions — had 15 or more active medications. For those patients, the portal was showing an incomplete view of their medication record. They didn't know what they weren't seeing.

The fix took half a day. The reputational impact with the health system took considerably longer to repair.
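
Whatever the rendering layer looks like, the safer design is for any truncation to be explicit. A minimal sketch, with hypothetical names and a page size chosen only to mirror the story:

```python
from typing import List, NamedTuple, Sequence


class Page(NamedTuple):
    items: List[str]
    hidden_count: int  # how many entries are not shown on this page


def first_page(medications: Sequence[str], page_size: int = 12) -> Page:
    """Return the first page plus an explicit count of what was left out."""
    visible = list(medications[:page_size])
    return Page(visible, len(medications) - len(visible))
```

A UI that receives hidden_count can say "13 more medications" instead of presenting a partial list as complete, and a test with 25 medications makes the boundary visible before a patient does.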

The Hidden Cost of Every Missed Edge Case

Development teams tend to think about bugs in terms of engineering cost: how long does it take to fix, how many sprints does it disrupt, what's the opportunity cost of the resources involved. These are real costs, but they're not the full picture.

Direct financial costs

A production incident in a healthcare software context typically involves emergency engineering response (often at overtime rates), coordination with clinical partners (who have their own labor costs for the time spent managing the incident), potential SLA penalties if the health system contract specifies uptime requirements, and the cost of the hotfix development, testing, and deployment cycle. For a significant incident at a mid-sized company, this can run $50,000 to $150,000 in direct costs before the revenue impact is even considered.

Revenue impact is harder to quantify but often larger. Health system contracts are long and difficult to win. A production incident in the first 90 days of deployment — the period when the clinical team is forming their lasting impression of your software — can convert a multi-year reference customer into a churned account. In enterprise HealthTech sales, losing a reference customer doesn't just cost you that contract. It costs you the five deals that reference customer would have influenced.

Regulatory and compliance costs

Depending on the nature of the failure, a production incident in healthcare software can trigger regulatory consequences that go far beyond the immediate bug. If the software is certified under ONC's 2015 Edition or 2015 Edition Cures Update Health IT Certification Criteria (45 CFR Part 170), a functional failure in a certified capability may require reporting to ONC, a corrective action plan, or in serious cases, suspension of certification. The certification itself, which health systems rely on for Promoting Interoperability (formerly Meaningful Use) attestation and MIPS participation, represents significant revenue for the health systems that depend on it. A software failure that jeopardizes certification is not just a technical problem.

Under the 21st Century Cures Act and its implementing regulations, EHR developers face information blocking prohibitions that can result in civil monetary penalties of up to $1 million per violation. While a test data gap causing a software failure isn't itself information blocking, the downstream consequences of a system that incorrectly processes or fails to transmit clinical data can create compliance exposure that the original bug never anticipated.

Patient safety implications

This is the cost that matters most, and the one that the industry is most reluctant to discuss openly. When healthcare software fails to correctly process clinical data, the downstream effects can reach patients. A medication reconciliation module that returns null values for compound drug names doesn't just create an error log. In a workflow where that module's output feeds a prescriber decision support tool, or feeds an automated prior authorization process, or feeds a discharge summary that a patient takes home — a null value can mean a medication is omitted from a record in ways that a busy clinician under time pressure may not catch.

The Joint Commission's sentinel event data has repeatedly listed medication errors among the reported categories of patient harm events. Healthcare software systems are increasingly embedded in the workflows that should prevent those errors. When that software has a defect that slips through because the test data didn't include the right edge case, the software that was supposed to be a safety net becomes an invisible hole in it.

The developer who builds a healthcare application inherits a form of moral responsibility that most software development doesn't carry. The edge case that you didn't test for isn't just a tech debt item on the backlog. It's a patient who might not get the care they need — not because their clinician failed them, but because the software their clinician depended on failed silently in a way nobody anticipated.

What "Production-Realistic" Actually Means for Each Clinical Data Domain

Production-realistic test data isn't a single standard — it means something different depending on the domain of the clinical data you're working with. Here's what it looks like across the major domains:

Laboratory Results

Production-realistic lab data includes: results from multiple performing labs with different reference ranges (a TSH reported as 2.4 mIU/L with a reference range of 0.4–4.0 from one lab and the same value reported against a reference range of 0.35–4.5 from another); results flagged as critical values (typically indicated with a specific flag code in HL7 or FHIR, with different conventions across lab systems); delta-flag results (where a result has changed significantly from a prior value, triggering additional review); results with comments (free-text addenda that contain clinically important information not captured in the numeric value); and results marked as pending, preliminary, corrected, or cancelled — all states that your processing logic needs to handle correctly.

Realistic lab data also includes realistic patterns of missingness. Real patients don't have every lab drawn at every visit. A longitudinal record will have gaps, and your software's behavior when expected labs are absent is as important to test as its behavior when they're present.

Medications

Production-realistic medication data includes compound medications, off-label prescribing with supporting documentation in the note, medications discontinued due to adverse events (with the adverse event documented in the allergy or reaction list), medications that are "on hold" pending a lab result, medications prescribed by an out-of-network provider that appear in the medication list as documented by the patient rather than confirmed from a pharmacy record, and medications with dose adjustments over time (a patient who has been on warfarin for three years will have dose change events that tell a story about their anticoagulation management).

It also includes the full diversity of dosing language: "take 1 tablet by mouth twice daily with food," "apply thin layer to affected area TID PRN," "1 puff INH QAM, 2 puffs INH QHS," and dozens of other conventions that are human-readable but parser-hostile.

Diagnoses

Production-realistic diagnosis data reflects the full complexity of ICD-10-CM. The ICD-10-CM code set has over 72,000 codes, and real patient records use a much broader range of them than any curated test dataset. Realistic records include codes with up to seven characters (the full hierarchical specificity the code set supports), codes that require additional characters for laterality (right vs. left vs. bilateral, which affects how the code is constructed), codes that require an encounter qualifier (initial, subsequent, sequelae), and codes from categories that textbook test cases rarely include — Z codes for social determinants of health, codes for external causes of morbidity, codes for factors influencing health status that don't constitute active diagnoses.

Realistic records also use combination codes where they're available — E11.65 (type 2 diabetes mellitus with hyperglycemia) rather than E11.9 (type 2 diabetes mellitus without complications) when the documentation supports it. Software that assumes a maximum of 4 or 5 diagnosis codes per encounter will fail on the complex patient with 14 active conditions, each appropriately coded to the level of specificity the documentation supports.
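
A shape check is a useful first line of defense, as long as it is not mistaken for a lookup against the code set. The sketch below validates only the general form of an ICD-10-CM code (three characters, then up to four more after the dot) and extracts the seventh character when one is present. The function names are hypothetical, and a code that passes this check may still not exist in the official code set.

```python
import re
from typing import Optional

# Letter, digit, alphanumeric, then optionally a dot and one to four alphanumerics.
_ICD10CM_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?$")


def looks_like_icd10cm(code: str) -> bool:
    """Shape check only; says nothing about whether the code is in the code set."""
    return bool(_ICD10CM_SHAPE.match(code.strip().upper()))


def seventh_character(code: str) -> Optional[str]:
    """Return the 7th character (for example A, D, or S) when the code has one."""
    compact = code.strip().upper().replace(".", "")
    return compact[6] if len(compact) == 7 else None
```

Note what the sketch deliberately leaves out: there is no cap on the number of codes per encounter, because the realistic upper bound is higher than most developers would guess.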

Allergies and Adverse Reactions

Allergy records are a particularly rich source of edge cases. Real allergy records include: drug allergies documented as free text (patient reports "penicillin but I don't remember what happened"); allergies with documented reactions of varying severity (mild rash vs. anaphylaxis, each of which triggers different clinical decision support logic); cross-reactivity documentation (allergy to penicillin documented with a note that this also implies cephalosporin caution); allergies marked as "unverified" or "patient-reported" vs. confirmed by challenge testing; allergies to inactive ingredients in medications, such as dyes or preservatives; allergies that are conflated in the chart (shellfish allergy and iodinated contrast allergy, for example, are sometimes incorrectly treated as equivalent); and allergies that have been deleted, merged, or overridden with documentation of the clinical reasoning.

Vital Signs

Vital sign records need to include outlier values: values that are clinically significant but outside the range that seems "reasonable" to a developer writing validation logic. A blood pressure of 220/130 is a hypertensive crisis. It's also a value that exceeds the upper bound a developer might plausibly code into a validation rule, causing high-BP records to be rejected as data errors. A temperature of 40.1°C is a fever of concern. A heart rate of 130 in a patient with atrial fibrillation is not surprising. A GCS score of 3 is the minimum possible score and represents a completely unresponsive patient; software that interprets a GCS of 3 as missing data rather than a clinical value creates dangerous gaps in the record.
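
One way to encode that distinction is to separate "implausible" from "unusual" and reject only the former. The sketch below is illustrative only: the numeric limits are placeholder assumptions and not clinical guidance, and the names are hypothetical. Real thresholds should come from clinical stakeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VitalCheck:
    accepted: bool
    flag: Optional[str]


# Placeholder limits for illustration only; not clinical guidance.
SYSTOLIC_PLAUSIBLE = (20, 350)  # outside this range, suspect a data entry error
SYSTOLIC_TYPICAL = (90, 180)    # outside this range, accept the value but flag it


def check_systolic(value: Optional[float]) -> VitalCheck:
    if value is None:
        return VitalCheck(False, "missing")
    low, high = SYSTOLIC_PLAUSIBLE
    if not low <= value <= high:
        return VitalCheck(False, "implausible")
    low, high = SYSTOLIC_TYPICAL
    if not low <= value <= high:
        return VitalCheck(True, "out_of_typical_range")
    return VitalCheck(True, None)


def check_gcs(value: Optional[int]) -> VitalCheck:
    """GCS totals run from 3 to 15; 3 is a valid score, not a stand-in for missing."""
    if value is None:
        return VitalCheck(False, "missing")
    if 3 <= value <= 15:
        return VitalCheck(True, None)
    return VitalCheck(False, "implausible")
```

Under these placeholder limits a systolic pressure of 220 is stored and flagged rather than rejected, and a GCS of 3 is treated as a recorded value.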

A Specific List of Edge Cases That Sink Healthcare Applications

This is the practical reference: a catalog of specific edge cases, drawn from documented production failures, that healthcare application developers should build into their test data strategy.

ICD-10 code E11.65 — Type 2 diabetes with hyperglycemia. Systems that parse ICD-10 as a flat code may not handle the hierarchical relationship between E11.65 and the broader E11.* category correctly.
Medication name: "diltiazem HCl CD 180mg capsule" — Tests handling of the "HCl" salt designation, the "CD" formulation indicator, and the embedded strength in a single string.
Patient age: 0 days — A neonate born today. Tests age-dependent logic and date arithmetic when the birth date equals the encounter date.
Patient age: 104 years — Tests systems that store age as a two-digit integer or that calculate age using year-only arithmetic.
Lab result with a ">" prefix — A serum creatinine reported as ">15 mg/dL" because the assay upper limit was exceeded. Tests parser handling of non-numeric prefixes in numeric fields.
Encounter with no active-disease diagnoses — A preventive visit where all listed codes are Z codes. Tests billing logic that assumes at least one "active" diagnosis code per encounter.
Patient with 25 active medications — Tests list rendering, scrolling, pagination, and any logic that assumes a reasonable maximum medication count.
Clinical note containing a medication name with a forward slash — e.g., "hydrocodone/acetaminophen 5/325mg." Tests parsers that use "/" as a delimiter.
ICD-10 code W61.62XA — "Struck by duck, initial encounter." A real, billable code. Tests systems that validate ICD-10 codes against a whitelist rather than against the full code set.
Encounter start time after encounter end time — A documentation error that occurs in real systems when a chart is opened for addenda after the encounter time window. Tests temporal validation logic.
Patient whose preferred language is listed as "Other" — Tests systems that enumerate language choices and may not handle the catch-all category.
Allergy to "aspirin" and an active prescription for "Excedrin" — Tests drug-allergy interaction checking against combination products where the allergen is an ingredient rather than the product name.
A discharge summary referencing an encounter at a different facility — Tests longitudinal record assembly that draws on records from multiple sources.
A FHIR Bundle with a contained resource — Tests parsers that assume all resources are direct Bundle entries rather than using the contained resource pattern.
HL7 v2.x OBX segment with a null value represented as "" — Two double-quotation marks is the HL7 v2.x representation of an explicit null. Parsers that collapse it into an empty or omitted field lose the distinction between "no value was sent" and "the value is explicitly null."
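Several of the cases in this catalog come down to small parsing decisions. The sketch below shows one possible handling for four of them: the comparator-prefixed lab value, the HL7 v2.x explicit null, age arithmetic at both extremes, and the ICD-10 category prefix. All function and class names are hypothetical, and the lab parser assumes units arrive in a separate field rather than inside the value string.

```python
import re
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

# Optional comparator followed by a decimal number (units assumed to be separate).
_LAB_VALUE = re.compile(r"^\s*(<=|>=|<|>)?\s*(\d+(?:\.\d+)?)\s*$")


@dataclass(frozen=True)
class LabValue:
    comparator: Optional[str]  # "<", ">", "<=", ">=", or None for an exact value
    value: float


def parse_lab_value(raw: str) -> LabValue:
    """Keep the comparator instead of failing on, or silently dropping, it."""
    match = _LAB_VALUE.match(raw)
    if match is None:
        raise ValueError(f"unrecognized lab value: {raw!r}")
    return LabValue(comparator=match.group(1), value=float(match.group(2)))


class Hl7FieldState(Enum):
    NOT_VALUED = "not_valued"        # nothing between the delimiters
    EXPLICIT_NULL = "explicit_null"  # the two-character sequence ""
    VALUED = "valued"


def hl7_field_state(raw: str) -> Hl7FieldState:
    """Distinguish an explicit null from a field that was simply not sent."""
    if raw == '""':
        return Hl7FieldState.EXPLICIT_NULL
    if raw == "":
        return Hl7FieldState.NOT_VALUED
    return Hl7FieldState.VALUED


def age_in_years(birth: date, on: date) -> int:
    """Full-date arithmetic: correct for a newborn and for a 104-year-old."""
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


def icd10_category(code: str) -> str:
    """Return the three-character category, e.g. 'E11' for 'E11.65'."""
    return code.split(".", 1)[0]
```

Each function is deliberately narrow. The point is less the specific implementation than that every item in the catalog can be turned into an executable check.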

Regulatory Implications: ONC Certification, Interoperability Mandates, and What Your Software Actually Has to Do

For developers building software that will be used in the US healthcare system, regulatory requirements create a second layer of urgency around test data quality. It's not just about catching bugs — it's about demonstrating to regulators that your software can handle the real world.

ONC Health IT Certification

The Office of the National Coordinator for Health Information Technology (ONC) administers the ONC Health IT Certification Program under 45 CFR Part 170. Certified EHR technology is required for participation in CMS programs such as Promoting Interoperability. The certification criteria include specific technical capabilities that must be tested. ONC publishes the test procedures, ONC-Authorized Testing Laboratories (ONC-ATLs) carry them out, and ONC-Authorized Certification Bodies (ONC-ACBs) issue the certifications. Those procedures rely on test data that's been carefully constructed to exercise the certified capabilities.

But ONC certification tests a minimum floor of capability. A product that passes ONC certification testing has demonstrated that it can handle the scenarios in the ONC test procedures, not that it can handle every scenario it will encounter in production. The ONC Cures Act Final Rule (85 FR 25642, published May 1, 2020) introduced new interoperability requirements, including support for FHIR R4 APIs. The FHIR API test data used in certification testing is, by design, simplified. Production FHIR data is not.

The 21st Century Cures Act and Information Blocking

The information blocking provisions of the 21st Century Cures Act (42 U.S.C. 300jj-52) prohibit practices that are likely to interfere with access, exchange, or use of electronic health information. While the provision targets intentional blocking, the practical implication for developers is clear: software that fails to correctly process, transmit, or display clinical data may create information blocking situations even when the failure is unintentional.

The ONC Information Blocking Final Rule (85 FR 25642) includes exceptions for practices that meet specific conditions. One exception relevant to developers is the Health IT Performance exception (45 CFR 171.205), which covers practices implemented to maintain or improve health IT performance. To invoke that exception, a developer needs to show that the practice was reasonable and necessary. A software defect caused by inadequate test data is harder to characterize as reasonable and necessary.

Meaningful Use and MIPS

Health systems and clinicians that use certified EHR technology under Medicare must meet the requirements of what used to be called Meaningful Use: today, the Medicare Promoting Interoperability Program for hospitals and the Promoting Interoperability performance category of MIPS for clinicians. The specific measures include requirements around electronic prescribing, clinical information reconciliation, and patient electronic access. When EHR software fails to correctly process data in these areas, the failure can jeopardize the organization's Promoting Interoperability score, which for MIPS participants feeds into the MIPS final score and the resulting Medicare payment adjustment.

A software vendor whose product causes a health system to miss Promoting Interoperability requirements is not just facing a support ticket. They're potentially facing contractual liability and a damaged relationship with a customer whose regulatory standing has been affected by a software defect.

Building a Test Data Strategy, Not Just Test Cases

The solution to inadequate test data isn't to add more hand-crafted edge cases to an existing test suite. It's to develop a test data strategy — a systematic approach to ensuring that the data your software tests against reflects the full distribution of what it will encounter in production.

Start with a data domain inventory

Before you can assess whether your test data is adequate, you need to know what data domains your software touches. For each domain — medications, diagnoses, labs, vitals, allergies, demographics, clinical notes, insurance — document the full range of values, formats, and edge cases that could appear in production. This is a clinical knowledge exercise, not just an engineering one. Partnering with clinical informatics specialists or practicing clinicians to review and expand your domain inventory is worth the investment.

Assess coverage, not just volume

Test data volume is a poor proxy for test data coverage. A suite of 10,000 records that all have the same structure and complexity profile is less useful than a suite of 1,000 records that covers the realistic distribution of complexity. For each domain, assess: what percentage of your test records include the high-complexity variants? What percentage include the edge cases? What percentage include values in the realistic extremes of the distribution?
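Those percentages are straightforward to compute once each edge case is written down as a predicate over a record. The following is a minimal sketch under stated assumptions: the record layout (`medications`, `age_years`, `age_days`) and the check names are hypothetical, and the thresholds are examples rather than recommendations.

```python
from typing import Callable, Dict, Iterable, Mapping

Record = Mapping[str, object]
Check = Callable[[Record], bool]


def coverage_report(records: Iterable[Record], checks: Mapping[str, Check]) -> Dict[str, float]:
    """Return, for each named check, the percentage of records that satisfy it."""
    rows = list(records)
    if not rows:
        return {name: 0.0 for name in checks}
    return {
        name: 100.0 * sum(1 for row in rows if check(row)) / len(rows)
        for name, check in checks.items()
    }


# Hypothetical checks against a hypothetical record layout.
CHECKS: Dict[str, Check] = {
    "polypharmacy": lambda r: len(r.get("medications", [])) >= 20,
    "centenarian": lambda r: r.get("age_years", 0) >= 100,
    "neonate": lambda r: r.get("age_days") == 0,
}
```

Any check that reports 0.0 against the current suite names a code path that no test record exercises, regardless of how many records the suite contains.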

Integrate clinical fidelity review

Have a clinician or clinical informaticist review a sample of your test records and answer a simple question: does this look like a real patient chart? If the answer is "sort of, but it's much cleaner than what I see in practice," that gap in cleanliness is a gap in your test coverage. Real clinical notes have shorthand, abbreviations, carry-forward text from prior visits, and the occasional inconsistency. Real medication lists have medications that look like they shouldn't go together, because the clinical context that makes them appropriate isn't obvious from the list alone. Real lab histories have missing values, corrected values, and values from external labs that use different reference ranges.

Build test data into your CI/CD pipeline

Tests should run against your test data automatically on every build. That means the test data needs to be versionable, shareable, and pipeline-compatible. Those requirements rule out real patient data and call for synthetic data with realistic clinical properties. When a test fails because a new build can't handle a specific edge case in the test data, you've caught the failure in development instead of in production. That's the goal.
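As a rough sketch of what that can look like, the example below drives a standard-library `unittest` case from a fixture of edge-case medication names. Everything here is illustrative: `reconcile_medication` is a hypothetical stand-in for the module under test, the fixture entries (including the invented pharmacy house name) are made up, and in a real pipeline the fixture would be a versioned file in the repository rather than an inline string.

```python
import json
import unittest

# Inlined so the sketch is self-contained; normally a versioned fixture file.
EDGE_CASE_FIXTURE = json.loads("""
[
  {"id": "med-combination-slash", "name": "hydrocodone/acetaminophen 5/325mg"},
  {"id": "med-salt-and-formulation", "name": "diltiazem HCl CD 180mg capsule"},
  {"id": "med-house-compound", "name": "Main Street Pharmacy cream #4"}
]
""")


def reconcile_medication(name: str) -> dict:
    """Hypothetical stand-in for the module under test.

    Unrecognized names are kept verbatim and flagged for review
    instead of being turned into a null value.
    """
    known = {"diltiazem HCl CD 180mg capsule": "diltiazem"}
    ingredient = known.get(name)
    return {
        "display_name": name,
        "ingredient": ingredient,
        "needs_review": ingredient is None,
    }


class MedicationEdgeCaseTests(unittest.TestCase):
    def test_no_edge_case_produces_a_null_display_name(self):
        for case in EDGE_CASE_FIXTURE:
            # subTest reports each failing fixture entry separately.
            with self.subTest(case=case["id"]):
                result = reconcile_medication(case["name"])
                self.assertIsNotNone(result["display_name"])
                self.assertIn("needs_review", result)


def run_edge_case_suite() -> bool:
    """Run the suite programmatically, as a CI step might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MedicationEdgeCaseTests)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Because the fixture is plain data, adding a new edge case is a one-line change that the next build picks up without any new test code.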

Expand coverage as you learn from production

Every production incident that involves an edge case the test data didn't cover should result in that edge case being added to the test data. This is how a test data library grows to reflect the real world: not by anticipating every possible scenario in advance, but by encoding each production learning as a permanent addition to the test suite. Over time, a team that does this consistently builds a test data library that reflects years of real-world clinical complexity — a library that new software can be tested against before it encounters any of those situations in production.
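One lightweight way to encode each production learning is an append-only library of regression cases keyed by incident. The sketch below uses a JSON Lines file for that purpose. The file layout, the field names, and the `add_regression_case` helper are assumptions made for illustration, not a prescribed format.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_cases(library: Path) -> List[Dict[str, str]]:
    """Read every regression case from a JSON Lines file (empty if absent)."""
    if not library.exists():
        return []
    with library.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def add_regression_case(library: Path, incident_id: str, description: str, payload: str) -> bool:
    """Append a case unless the incident is already recorded.

    Returns True when a new case was written, False for a duplicate.
    """
    if any(case["incident_id"] == incident_id for case in load_cases(library)):
        return False
    record = {"incident_id": incident_id, "description": description, "payload": payload}
    with library.open("a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII names readable in the file.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

Keeping the incident identifier alongside the payload means a failing test can be traced back to the production event that motivated it.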

See What Your Test Suite Is Missing

Download a free sample of synthetic patient records — clinically realistic, 7 export formats, ready to drop into your test pipeline. Complex comorbidities, edge-case medications, production-grade demographic diversity.

A Note on Synthetic Data Quality

Not all synthetic patient data is equally useful. The key differentiators are clinical fidelity — does the data reflect how real patients actually present? — and structural realism — does it have the format and complexity that your software will encounter in production?

PatientDatasets.com generates records across 60+ medical specialties with clinically validated comorbidity patterns, complete clinical documentation (HPI, ROS, physical exam, assessment, plan), structured lab values with reference ranges including critical values and delta flags, compound and off-label medication names, realistic demographic distributions including age extremes and naming conventions, and accurate ICD-10-CM coding to the level of specificity the documentation supports. Available in 7 formats including FHIR R4, HL7 v2.x, CSV, Parquet, and SQLite — so it drops directly into your existing pipeline.

No data use agreements. No IRB approval required. No BAA. Download today and start testing against data that looks like your customers' actual patients.

What It Means When Your App Works for Everyone

Here's what the developer who lived through that 3am alert understood, once the dust had settled from his production incident. His software had worked perfectly for the easy patients. It had worked for the patients his test data could represent. But his software didn't get to choose which patients used it. It had to work for everyone.

Healthcare is not a domain where "works for most cases" is good enough. A medication reconciliation module that fails on compound medications doesn't fail a data type — it fails a patient. A patient who might be elderly, who might be managing a complex regimen, who might have been counting on that software to catch something a busy clinician missed. When the software fails, it fails that specific person, in that specific moment, when they needed it to work.

The flip side of that moral weight is the moral significance of getting it right. When your software handles every edge case — the 25-medication patient, the compound drug, the 104-year-old, the non-ASCII name, the partial date of birth — it works for every patient. Not just the easy ones. Not just the ones your developers happened to think of when they were building the test suite. Every one.

That's what production-realistic test data makes possible. Not just confidence in your software — confidence in your software for everyone who will ever use it. And in healthcare, that means something beyond what it means in any other domain.

The 3am call is optional. The patients are not.