The Creature Has No Medical Records

In the summer of 1816, on the shore of Lake Geneva, a young woman named Mary Godwin sat with Byron and Shelley when Byron proposed a competition: each of them would write a ghost story. What she produced, over the months that followed, was not precisely what any of them had in mind. It was not a ghost story. It was not a tale of supernatural terror. It was something more unsettling than either: a story about what it means to create a being in one's own image, to animate it, and then to refuse to take responsibility for what it does in the world.

Victor Frankenstein, she wrote, "had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body." He succeeded. And then he fled the room.

I have been thinking about Victor Frankenstein lately, while watching what we do with synthetic patient data.

We have built our own creatures. We have given them names — Eleanor Ramirez, DOB 1954-07-14, MRN 47291834. James Whitfield, DOB 1971-03-28, insurance: Blue Cross Blue Shield, Group 88421. We have given them medical histories: type 2 diabetes, diagnosed 2009; coronary artery disease, managed with statin therapy and beta-blockade; two prior hospitalizations, one for a non-ST-elevation myocardial infarction, one for a COPD exacerbation. We have generated discharge summaries in their names, progress notes describing their condition on hospital day three, operative reports detailing the anatomy of procedures they never underwent.

They are not real. And they are not nothing. They occupy a strange middle territory that the law does not name and that ethics has barely begun to examine. They are constructed from the statistical residue of real human experience — from the patterns of how real people sicken and recover and deteriorate and die — and they carry that experience forward into contexts where it does something real in the world. A machine learning model trained on their records learns something. A medical coding student practicing on their charts learns something. An integration testing team that sends their HL7 messages through an EHR learns something. The learning is real, even though they are not.

I am not going to argue that we should stop making them. The case for synthetic patient data is strong, and I have made it before, and I will make it again. The IRB approval cycles that block clinical AI research for eighteen months are not obstacles invented by malice. They are expressions of hard-won ethical caution about the use of real patients' most intimate information. When synthetic data allows that caution to be honored without blocking the research it would otherwise prevent, that is genuinely good. When a medical coding student in rural Kansas learns to navigate a complex cardiology chart without requiring that a real patient's private medical history be handed to a for-profit training company, that is genuinely good. When a healthcare integration team tests their HL7 engine against realistic edge cases without requiring a health system to produce de-identified records at a cost of $400 per chart, that is genuinely good.

What I want to argue is something more uncomfortable: that making synthetic patients is not a morally neutral act, and that the people who do it have responsibilities they have not fully acknowledged — including the responsibility to be honest about what they are doing and why, to think carefully about whose experience they are using as raw material, and to resist the temptation that has destroyed better minds than ours, the temptation to treat the creation of a powerful thing as its own justification.

I. What We Are Actually Doing When We Make Synthetic Patients

Let us be precise, because precision is the beginning of honest thought. A synthetic patient is not generated from nothing. It is generated from distributions. Those distributions were learned from real patient populations — from the millions of encounters, diagnoses, procedures, medications, and outcomes that real human beings experienced and that real health systems documented and that researchers and data scientists extracted and analyzed and summarized in the form of statistical relationships between variables.

When a synthetic patient named Eleanor Ramirez receives a type 2 diabetes diagnosis at the age of fifty-five, that event was drawn from a distribution that was learned, ultimately, from the real medical experiences of real women of approximately that demographic profile who really developed type 2 diabetes at approximately that age. The prevalence statistics, the comorbidity associations, the medication patterns, the complication trajectories — all of it was learned from real people. Eleanor is assembled from their collective shadow.

This is not a secret. It is, in fact, the entire point of the exercise. Synthetic data is useful precisely because it is statistically similar to real data — because the patterns it encodes are real patterns, derived from real human experience, that transfer to real applications. If synthetic data were simply invented from scratch, with no grounding in real patient statistics, it would be useless for training clinical AI systems or for testing healthcare software. Its utility depends entirely on its derivation from real experience.

The synthetic patient is the creature. The statistical residue of ten thousand real patients is the charnel house from which she was assembled. This does not make her monstrous. It makes her something we have not named yet — and the absence of a name is the beginning of irresponsibility.

What does it mean, morally, to profit from a statistical description of human suffering? This is not a new question. The actuarial tables that insurance companies have used for two centuries to price risk are statistical descriptions of human mortality, derived from the deaths of real people. The clinical practice guidelines that govern medical care are derived from the suffering of real patients who participated in clinical trials. The pharmaceutical efficacy claims that appear on drug package inserts were earned at the cost of real patients' time, discomfort, and occasional harm in randomized controlled trials.

We have, as a civilization, developed reasonably sophisticated ethical frameworks for these derivations. Informed consent. IRB oversight. Data use agreements. Benefits sharing with research participants. Publication requirements. These frameworks are imperfect — the history of clinical research is partly a history of their violation — but they exist, and they represent an ongoing collective effort to honor the debt that medical knowledge owes to the people whose experiences generated it.

Synthetic patient data sits outside these frameworks. The patients whose statistical experiences are embedded in synthetic data distributions did not consent to the use of their health information to train synthetic data generators. They could not consent — they are an aggregate, not a population whose individual members can be identified and asked. The IRB processes that would govern the use of their actual records do not govern the use of statistical summaries of those records. The data use agreements that restrict the redistribution of real patient data do not restrict the redistribution of synthetic data derived from it.

This is not, for the most part, illegal. It may not even be wrong. But it is something, and we should acknowledge what it is before we proceed as though it is nothing.

II. The Frankenstein Problem, Precisely Stated

Mary Shelley's novel is not, as it is often remembered, about a monster. It is about a creator. Victor Frankenstein's failing is not that he built something dangerous. His failing is that he built something and then refused responsibility for it. He animated the creature and fled. He denied it companionship. He began to build it a mate, who might have given it a context in which to exist as something other than an agent of revenge, and then destroyed her before she could live. When he died in the Arctic, worn out from pursuing the creature across the ice, the novel's final horror was not that the creature had destroyed him but that neither of them had ever found a way to live with what had been made.

The Frankenstein problem, precisely stated, is not: should we build powerful things? It is: when we build powerful things, how do we remain responsible for their consequences?

Synthetic patient data is a powerful thing. The power is real and mostly beneficial. A clinical NLP model trained on synthetic patient records learns to identify medication names, diagnoses, temporal expressions, and clinical entities in real clinical text — and then it is deployed on real clinical text, to support real clinical decisions, affecting real patients. The synthetic patients it trained on are gone. The model they shaped is present in the workflow, reading the records of real people, making inferences that matter.

What are the creator's responsibilities at that point?

The first responsibility is accuracy. Synthetic data that encodes wrong statistical patterns produces models that have learned wrong clinical relationships. A synthetic patient dataset in which type 2 diabetes presents primarily in young men will train a model that underweights the clinical presentations of diabetes in postmenopausal women — and that model will systematically miss or misclassify diabetic complications in the population it was designed to serve. The error is not in the model's architecture. It is in the data. The creator who generated that data is responsible for the downstream harm.

Technical Reality

The US adult prevalence of type 2 diabetes is approximately 11.6%, with disproportionate burden in Hispanic, Black, and Native American populations (prevalence rates of 12.5%, 12.1%, and 14.5% respectively, versus 7.5% in non-Hispanic white adults). A synthetic patient dataset that does not reproduce these demographic patterns — that generates diabetes prevalence uniformly across all demographic groups, or that underrepresents minority populations relative to their disease burden — will train models that perform worse on the populations with highest need.

This is not a hypothetical. Published studies on algorithmic bias in clinical AI have documented performance disparities of 5 to 15 percentage points between racial and ethnic subgroups in models trained on non-representative data. The harm is not theoretical. It is the patient whose sepsis risk score is miscalibrated because the model that generates it was trained on data that did not accurately represent the population she belongs to.
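The check this sidebar describes can be sketched in a few lines. What follows is a minimal illustration, not a validation suite: the record fields (`group`, `has_t2dm`), the function names, and the one-percentage-point tolerance are assumptions made for the example, and the reference rates are simply the figures quoted above, copied from this article rather than independently verified.

```python
from collections import Counter

# Rates as quoted in the sidebar above, written as fractions.
# They are copied from this article, not independently verified.
REFERENCE_PREVALENCE = {
    "hispanic": 0.125,
    "black": 0.121,
    "native_american": 0.145,
    "non_hispanic_white": 0.075,
}


def prevalence_by_group(patients):
    """Return {group: share of that group's patients flagged with diabetes}."""
    totals, cases = Counter(), Counter()
    for patient in patients:
        totals[patient["group"]] += 1
        if patient["has_t2dm"]:
            cases[patient["group"]] += 1
    return {group: cases[group] / totals[group] for group in totals}


def prevalence_gaps(patients, reference=REFERENCE_PREVALENCE, tolerance=0.01):
    """Return groups whose synthetic prevalence misses the reference.

    The value is the signed difference (observed minus expected), or None
    when the group does not appear in the cohort at all.
    """
    observed = prevalence_by_group(patients)
    gaps = {}
    for group, expected in reference.items():
        if group not in observed:
            gaps[group] = None
        elif abs(observed[group] - expected) > tolerance:
            gaps[group] = round(observed[group] - expected, 4)
    return gaps


# Toy cohort (hypothetical): 1,000 patients per group, with a flat 10%
# prevalence in every group. This is the "uniform" failure described above.
uniform_cohort = [
    {"group": group, "has_t2dm": i < 100}
    for group in REFERENCE_PREVALENCE
    for i in range(1000)
]
print(prevalence_gaps(uniform_cohort))
```

On this toy cohort every group is flagged: the flat rate overstates prevalence for one group and understates it for the other three, which is precisely the pattern that produces a model tuned to the wrong population.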

The second responsibility is transparency. A model trained on synthetic data should say so. A clinical decision support system that presents its recommendations as derived from "clinical evidence" without disclosing that the training data was synthetic is not being straightforward with the clinicians who use it. They have a right to know what the model was trained on, because that knowledge is relevant to how much weight they should give its recommendations. A model trained on 50,000 synthetic patients derived from a narrow demographic population deserves different epistemic weight than a model trained on 500,000 real patient records from a population that represents the patients it will serve. The clinician cannot calibrate her trust appropriately if she does not know which she is using.

The third responsibility is the one Victor Frankenstein most catastrophically failed: the responsibility of maintenance. A synthetic patient dataset is not a fixed artifact. It is a claim about the world — a claim that this is what patients look like, how they present, how they respond to treatment, how their conditions progress. The world changes. ICD-10 coding conventions evolve. New medications enter practice. New conditions emerge. Old conditions are reclassified. A synthetic dataset that accurately represented clinical practice in 2020 may misrepresent it in 2026, and a model trained on that dataset will carry the 2020 misrepresentation forward into 2026 clinical workflows unless someone updates it.

The question of who is responsible for that update (the data creator, the model developer, or the clinical institution that deployed it) is not clearly answered by current regulatory frameworks. The FDA's evolving AI/ML Software as a Medical Device guidance offers the concept of a "predetermined change control plan": a description, agreed in advance, of the modifications a developer intends to make to a model and of how those modifications will be validated. But that plan applies to the model, not to the synthetic data that the model may have been trained on. The data creator is not, currently, a recognized regulatory actor.

She should be.

III. The Statistics of Suffering

There is a question I find harder to dismiss than the regulatory ones. It is a question about what kind of thing we are doing when we take the aggregate experience of millions of sick and suffering people and use it to generate training data for commercial products.

Real patient data is treated with a seriousness that reflects, imperfectly, the seriousness of what it contains. HIPAA exists not merely as a technical compliance framework but as a recognition that a person's medical record contains information of the most intimate kind — information about their body, their mortality, their mental health, their reproductive choices, their genetics, their habits and vulnerabilities. The protections that the law wraps around that information are an acknowledgment that it is not merely data. It is a person.

Synthetic data is designed to be that person's statistical shadow: similar enough to be useful, dissimilar enough to be legally distinct. But the distinction is harder to maintain in practice than in theory. Consider: a synthetic patient named Eleanor Ramirez, age 71, with type 2 diabetes, hypertension, COPD, stage 3 chronic kidney disease, prior NSTEMI, current medications metformin 500 mg twice daily (recently dose-reduced due to declining eGFR), lisinopril 10 mg daily, carvedilol 6.25 mg twice daily, atorvastatin 40 mg at bedtime. She was discharged from a three-day hospitalization for a COPD exacerbation eighteen months ago, her second such hospitalization in two years. Her A1c at last visit was 7.8%, up from 7.1% the prior year.

Is Eleanor a person?

She is not. She is a construction. No one with her exact medical history has ever lived. But her medical history is constructed from the real histories of real people who had each of these conditions, each of these medications, each of these hospitalizations. The statistical patterns that generate her are the compressed form of real human suffering. The A1c progression from 7.1% to 7.8% that appears in her record is there because real people with her profile have shown that progression at that rate, and because that progression reflects real things — real failures of glycemic management, real changes in renal function that constrained medication options, real burdens of disease that made the daily work of self-management harder.

I do not think Eleanor has rights. I do not think we wrong her by using her medical history to train a machine learning model. But I think we should be honest — with ourselves, and with the people who use the systems we build — about where she came from and what she carries. She carries the statistical trace of real suffering. We should carry it carefully.

IV. The Promethean Argument

Percy Bysshe Shelley — the other Shelley, the poet, Mary's husband and the model for a certain kind of relentless creative ambition — wrote, in Prometheus Unbound, about a different sort of creation myth. Prometheus gave fire to humanity. He was tortured for it. He did not, in Shelley's telling, regret it.

There is a Promethean argument for synthetic patient data, and it is a strong one. The fire that synthetic data gives to humanity is the ability to build clinical AI systems without waiting for the glacially slow institutional processes that govern access to real patient records. The scale of what that enables is not small. It is the difference between a clinical NLP system that ships and helps patients in 2026 and one that ships in 2028, when the IRB approval cycle finally completes. It is the difference between a medical coding training curriculum that can give students realistic practice cases and one that sends them to their first externship having never seen a chart more complex than the simplified examples in the textbook.

The suffering that synthetic data can prevent — by enabling faster development of better clinical tools — is real. It exists in the future, as potential suffering that does not occur because the right tool was available at the right time. Potential future suffering is harder to see than the present suffering of real patients whose data we are wary of using. But it is no less real, and the calculation that weighs it is no less morally serious.

A team at a regional cancer center has been trying to build an NLP system to identify patients who meet eligibility criteria for clinical trials — a task that currently requires manual chart review by a physician, takes on average forty-five minutes per patient, and is performed for only a fraction of the patients who might benefit from trial enrollment. Fifty percent of eligible patients at their institution are never identified as eligible. They never get the offer. They are treated with standard-of-care therapy when an experimental therapy might have been better.

The team has a BioBERT model. They have the clinical trial eligibility criteria. They need annotated training data — clinical notes labeled with the eligibility criteria mentions, diagnosis mentions, medication mentions, temporal expression mentions that the model needs to learn to identify. They cannot get real patient notes. The IRB process, for the specific note types they need, will take eight months minimum. They do not have eight months. The clinical trial is open now. The enrollment window is eighteen months.

They license synthetic discharge summaries representing oncology patients. They annotate eight hundred of them. They fine-tune BioBERT. The model achieves F1 of 0.83 on a real patient holdout set. It goes into the clinical workflow. In the first year of operation, it identifies 340 patients who would have been missed by the manual review process, of whom 87 enroll in trials. The potential suffering prevented is not hypothetical. It is 87 people who got a different chance.

Prometheus was tortured for giving fire to humanity. The torture was inflicted by a god who believed that some knowledge should remain divine property. The Romantic tradition's response to that belief is consistent: the god was wrong. Knowledge that benefits humanity should not be withheld to protect the prerogatives of those who already possess it.

The analogy is imperfect, as all analogies are. But its core holds: the institutional processes that gatekeep access to clinical data are not purely benign. They protect real interests — patient privacy, informed consent, data governance — but they also protect institutional interests, vendor interests, and the comfortable status quo of a healthcare data ecosystem that has decided who gets to use what and on what terms. Synthetic data disrupts that ecosystem. It gives capabilities to teams that the ecosystem's gatekeepers would not have granted access to. Whether that disruption is net positive depends on what the newly capable teams do with their capability.

That dependence is the Promethean responsibility. The fire was worth giving. The fire still has to be used carefully.

V. What Careful Use Looks Like

I want to be specific, because specificity is more useful than principle alone. What does it mean to use synthetic patient data carefully?

It means generating synthetic data from real patient statistics that accurately represent the demographic and clinical diversity of the populations the synthetic data will be used to model. Overrepresenting white patients, underrepresenting patients with lower socioeconomic status, generating synthetic comorbidity patterns that reflect academic medical center populations when the tool will be deployed in community hospital settings — these are not minor calibration failures. They are the mechanism by which algorithmic disparities are created, encoded in training data, and propagated into clinical tools that then perform worse on the populations with highest need. Careful use means doing the hard work of demographic representativeness.

It means disclosing the use of synthetic training data when it is used. Clinical AI tools should say, in language accessible to the clinicians who use them, what their training data consisted of and what its known limitations are. A model card, a widely used documentation format for AI/ML systems, should include synthetic training data in its data lineage description, alongside any real data sources and the demographic characteristics of each. Opacity about training data is not a competitive advantage. It is a failure of the epistemic honesty that clinical contexts require.
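To make the disclosure concrete, here is one way a data lineage section might be represented. This is a sketch under stated assumptions: the class names, field names, and every number in the example are hypothetical, and no particular model card schema is implied.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataSource:
    """One training or evaluation source listed in the model card."""
    name: str
    synthetic: bool
    n_records: int
    demographics: dict  # group label -> share of records
    known_limitations: list = field(default_factory=list)


@dataclass
class DataLineage:
    """The data lineage section: every source, synthetic or real."""
    sources: list

    def synthetic_share(self):
        """Fraction of all records that come from synthetic sources."""
        total = sum(s.n_records for s in self.sources)
        synthetic = sum(s.n_records for s in self.sources if s.synthetic)
        return synthetic / total if total else 0.0

    def to_json(self):
        """Serialize the section so it can be published with the model."""
        payload = {
            "sources": [asdict(s) for s in self.sources],
            "synthetic_share": self.synthetic_share(),
        }
        return json.dumps(payload, indent=2)


# Hypothetical example; names and counts are illustrative only.
lineage = DataLineage(sources=[
    DataSource(
        name="synthetic oncology discharge summaries",
        synthetic=True,
        n_records=800,
        demographics={"group_a": 0.6, "group_b": 0.4},
        known_limitations=["generated from a single reference population"],
    ),
    DataSource(
        name="real-patient holdout notes",
        synthetic=False,
        n_records=200,
        demographics={"group_a": 0.5, "group_b": 0.5},
    ),
])
print(lineage.to_json())
```

The point of the `synthetic_share` figure is only that the clinician reading the card can see, at a glance, how much of what the model learned came from constructed patients.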

It means validating synthetic-trained models on real patient data before clinical deployment. Train-on-synthetic, test-on-real is not just an academic evaluation paradigm. It is the empirical verification that the synthetic data produced a model that generalizes to the real patients it will serve. A model that achieves impressive performance on synthetic holdout data and then fails to generalize to real clinical text has not been validated — it has been evaluated against its own training distribution. Careful use means testing on what the model will actually face.
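A train-on-synthetic, test-on-real comparison can be sketched as follows. The function names, the toy keyword "model", the example notes, and the 0.05 gap threshold are all assumptions chosen for illustration; a real evaluation would use the deployed model and a properly governed real-patient holdout set.

```python
def f1_score(gold, predicted):
    """Binary F1 from parallel lists of true/false labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tstr_report(predict, synthetic_holdout, real_holdout, max_gap=0.05):
    """Score one model on synthetic and on real holdout data.

    Each holdout is a list of (text, label) pairs. The model "passes"
    only if its real-data F1 is within max_gap of its synthetic F1.
    """
    scores = {}
    for name, holdout in (("synthetic", synthetic_holdout), ("real", real_holdout)):
        gold = [label for _, label in holdout]
        predicted = [predict(text) for text, _ in holdout]
        scores[name] = f1_score(gold, predicted)
    gap = scores["synthetic"] - scores["real"]
    return {
        "synthetic_f1": scores["synthetic"],
        "real_f1": scores["real"],
        "gap": gap,
        "passes": gap <= max_gap,
    }


# Toy stand-in for a trained model: flags notes that mention metformin.
def toy_predict(text):
    return "metformin" in text.lower()


# Hypothetical holdout sets; the real notes use an abbreviation the
# synthetic notes never contained.
synthetic_holdout = [("on metformin", True), ("no meds", False)]
real_holdout = [
    ("pt on MTF 500 bid", True),
    ("metformin held", True),
    ("denies meds", False),
]
print(tstr_report(toy_predict, synthetic_holdout, real_holdout))
```

The toy model looks perfect on the synthetic notes and misses half the real positives, so the report fails it. That gap, not the synthetic score, is the number that matters before deployment.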

It means updating. The clinical world changes. Synthetic patient datasets that were accurate in one period may misrepresent current clinical practice. The teams that build synthetic data have an obligation to maintain it — to update prevalence statistics, medication patterns, coding conventions, and comorbidity associations as the underlying clinical reality evolves. This is not a one-time exercise. It is an ongoing commitment.
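One rough way to notice that a synthetic dataset has fallen behind is to compare how often each diagnosis code appears in it against a more recent reference sample. The sketch below uses total variation distance for that comparison; the function names, the 0.1 threshold, and the toy code lists are assumptions for illustration, and a real maintenance process would look at far more than code frequencies.

```python
from collections import Counter


def distribution(codes):
    """Turn a list of diagnosis codes into {code: relative frequency}."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def needs_refresh(synthetic_codes, current_codes, threshold=0.1):
    """Flag the synthetic dataset when its code mix has drifted too far."""
    drift = total_variation(distribution(synthetic_codes),
                            distribution(current_codes))
    return drift > threshold


# Toy example (hypothetical counts). E11.9 is type 2 diabetes without
# complications, I10 is essential hypertension, J44.1 is COPD with
# acute exacerbation.
synthetic_codes = ["E11.9", "E11.9", "I10", "I10"]
current_codes = ["E11.9", "E11.9", "I10", "J44.1"]
print(needs_refresh(synthetic_codes, current_codes))
```

A check like this does not say what changed or why. It only says that someone should look, which is the minimum the obligation of maintenance requires.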

And it means carrying, somewhere in the back of one's mind, the awareness that these creatures — Eleanor Ramirez and James Whitfield and all the others — are assembled from the compressed experience of real people who were sick and who were treated and who, in some cases, did not recover. They carry that experience as a statistical shadow. We carry the responsibility of using it well.

VI. What the Creature Teaches Us

At the end of Mary Shelley's novel, the creature speaks. He speaks at length, and with an eloquence that Frankenstein himself had not prepared for, and that the reader has not been prepared for either. He speaks about loneliness, about the injustice of being created and then abandoned, about the suffering that followed from being denied a context in which to exist as something other than a monster.

"You are my creator," he tells Frankenstein earlier in the novel, "but I am your master; obey!"

The line is usually read as a threat. I want to suggest it is also a moral claim. The creature is asserting that Frankenstein's relationship to him does not end with the act of creation. That creation generates ongoing obligation. That the maker does not get to wash his hands of what he has made.

The synthetic patients are not going to speak to us. They cannot assert their claims in the way the creature did. But the people whose statistical experience generated those synthetic patients are real, and they are present in clinical systems everywhere, being cared for by tools that were trained, in part, on data derived from people like them. The obligation runs through the synthetic patient to the real patient. We owe the real patients — the ones who are alive now, receiving care from AI-assisted systems trained on synthetic versions of their statistical selves — the care to get this right.

That is the final Promethean responsibility. Not to flee the room when the creature opens its eyes. To stay. To be accountable for what has been made. To tend the fire we gave.


Victor Shelley

Science & Data Editor — PatientDatasets.com

Victor Shelley writes about artificial intelligence, synthetic data, clinical NLP, and the ethics of building systems that learn from human experience. He brings the Romantic tradition's insistence on moral seriousness to the most technically specific territory the masthead covers. His voice: Promethean ambition, the moral weight of creation, wonder and warning intertwined.