The data team had been at it for six weeks. They'd designed and built a processing pipeline — a proper one, with unit tests and code review and documentation — to strip names, addresses, phone numbers, social security numbers, and medical record numbers from a large longitudinal hospital encounter dataset. They ran validation checks. They documented every transformation. When they were done, they were genuinely proud of what they'd built.
Then the IRB reviewer — a physician with a background in clinical informatics — asked a single question that stopped the project cold: "How did you handle ZIP codes for patients in rural counties, and what's your analysis of re-identification risk for the patients in your dataset who have rare diagnoses?"
They hadn't. The ZIP codes were still in the data — truncated to five digits, which felt intuitively like de-identification. But in rural counties where a single five-digit ZIP code covers 800 people, and one of those patients carries a diagnosis of a condition that affects 1 in 50,000 people nationally, the combination of ZIP code, age, sex, and ICD-10 code can narrow the field to a handful of people — sometimes just one. The IRB flagged the dataset as insufficiently de-identified and required the team to rework the entire geographic and rare-diagnosis handling before resubmission.
Three additional months. A project that was supposed to be in model training was still in data preparation. And when the rework was complete and the revised dataset was approved, the team lead said something that stuck: "I wish someone had told us at the beginning that de-identification is not the same as removing the obvious things."
De-identification sounds like it should be simple. Take out the identifying information. What remains isn't identifying. The problem is that "identifying information" in HIPAA's framework is not the same as "information that could identify someone." The regulation defines a specific list of things to remove — but data that contains none of the 18 listed identifiers can still be re-identifiable in combination with external data sources. Safe Harbor removes specific obvious identifiers. It does not eliminate re-identification risk. It does not even guarantee that risk is low.
This distinction is one of the most consequential things compliance teams need to understand about healthcare data de-identification — and one of the most frequently misunderstood.
HIPAA's Two De-identification Standards: Safe Harbor and Expert Determination
HIPAA's Privacy Rule provides two legally recognized methods for de-identifying protected health information. Both, when properly applied, produce data that is no longer legally PHI and therefore not subject to HIPAA's use and disclosure restrictions. But they differ significantly in what they require and what they guarantee.
Safe Harbor: The Checklist Method (45 CFR §164.514(b)(2))
Safe Harbor is the de-identification method that most organizations use, because it provides a concrete checklist: remove these 18 categories of identifiers, confirm you have no actual knowledge that the remaining information could identify an individual, and the data is legally de-identified. No statistical expertise required. No external review required. Just the checklist.
The appeal is obvious. The limitation is equally obvious once you understand it: Safe Harbor is a bright-line rule that creates a legal status — "this data is not PHI" — without providing any statistical assurance about actual re-identification risk. A dataset that has had all 18 identifiers removed is de-identified under Safe Harbor even if a data scientist could re-identify a substantial fraction of patients by combining it with publicly available voter registration data.
Safe Harbor is appropriate for many use cases, and it is the practical standard for the majority of de-identification work done in healthcare today. But organizations that treat Safe Harbor compliance as a complete privacy solution — rather than a regulatory minimum — are exposed to risks that are real even if they're not immediately apparent.
Expert Determination: The Statistical Method (45 CFR §164.514(b)(1))
Expert Determination requires engaging a qualified statistical or scientific expert who applies generally accepted principles for statistical and scientific methodology to determine that the risk of identifying an individual is very small. The expert must document the methods and results, and the covered entity must retain that documentation.
The Expert Determination standard provides a positive assertion about re-identification risk that Safe Harbor does not: an expert has analyzed the data using appropriate statistical methods and concluded that the probability of re-identification is below a threshold that qualifies as "very small." This is a fundamentally stronger privacy protection than Safe Harbor, because it's grounded in actual risk analysis rather than identifier removal.
Expert Determination is more expensive — engaging a qualified privacy statistician to perform a thorough re-identification risk analysis can cost $10,000-$50,000 or more depending on the complexity of the dataset. It is also more time-consuming, typically taking two to four months for a rigorous engagement. For datasets that will be widely distributed, used in high-visibility research, or combined with external data sources, the investment is often justified. For routine clinical research datasets that remain within a controlled access environment, Safe Harbor may be sufficient.
The choice between the two methods is ultimately a risk management decision. Safe Harbor creates legal protection. Expert Determination creates legal protection plus a defensible statistical basis for the privacy claim. Organizations that face significant reputational, regulatory, or legal consequences from a re-identification incident should seriously consider Expert Determination for sensitive datasets.
All 18 Safe Harbor Identifiers: The Complete Reference
The 18 categories of identifiers that must be removed under Safe Harbor are specified in 45 CFR §164.514(b)(2)(i). Each category has nuances that are easy to miss in a first read of the regulation. Below is a complete reference with the specific implementation details that compliance teams most frequently overlook.
The 18 Identifier Categories Under HIPAA Safe Harbor — Complete Reference
Names. All names — first, last, middle, maiden, married, prefixes (Mr., Mrs., Dr.), and suffixes (Jr., Sr., MD, PhD). This includes nicknames, aliases, initials, pseudonyms, and any name by which an individual is or has been known. For clinical data, this includes names in free-text fields: "Patient is a 67-year-old man named James" in a clinical note must be treated as a name even if the structured name field has been removed. Provider names in clinical notes are also covered when they could be used to link to specific institutional records that identify the patient.
Geographic subdivisions smaller than a state. This is arguably the most complex identifier to implement correctly. The rule covers all geographic units smaller than a state: street addresses, cities, counties, precincts, ZIP codes, and their equivalent geocodes. The one exception is the initial three digits of a ZIP code, which may be retained only if the geographic unit formed by combining all ZIP codes with that three-digit prefix contains more than 20,000 people according to current Census Bureau data. Prefixes covering 20,000 or fewer people are restricted and must be replaced with "000." HHS guidance based on 2000 Census data lists 17 restricted three-digit prefixes, most of them representing rural and frontier areas. In practice, every five-digit ZIP code gets reduced: if its first three digits are restricted, the entire code is replaced with 000; otherwise only the first three digits are kept, never the full five-digit code.
All dates (except year) directly related to an individual. Date of birth — including any element of the date beyond year. Admission date, discharge date, and death date — including month and day. Any date of service for any clinical encounter. Any date element beyond year that relates to the individual. The year alone may be retained. Ages derived from dates may be retained if the age is 89 or younger; ages of 90 and above must be aggregated into a single category (typically "90 or older"). This means you cannot retain the exact year of birth for a patient who is 90 or older, because year of birth combined with "90 or older" provides near-exact age information for very elderly individuals. Relative time references — "3 weeks after admission" — are not explicitly prohibited but may constitute quasi-identifiers when combined with other data.
Telephone numbers. All phone numbers associated with the individual: home phone, work phone, mobile phone, pager number. This includes numbers appearing in structured fields and in free-text notes. A clinical note that mentions "patient can be reached at 617-555-1234" must have the phone number removed even if the structured phone number field has been cleared.
Fax numbers. Explicitly enumerated separately from telephone numbers in the regulation. Fax numbers associated with the individual — including fax numbers appearing in referral documentation, prior authorization fax cover sheets, and similar records that may be stored in the patient's electronic record.
Electronic mail addresses. All email addresses associated with the individual. Patient portal email addresses, contact emails provided at registration, emails captured in patient-generated health data. As with phone numbers, email addresses appearing in free-text fields — "contact patient at jsmith@email.com" in a care coordination note — must also be removed.
Social security numbers. Full nine-digit SSN and any partial representation that could be combined with other available data to reconstruct the full SSN. Last four digits of SSN, while commonly used as a "partial identifier" for administrative purposes, should be treated as an identifier under Safe Harbor given how easily they can be combined with publicly available information to identify individuals.
Medical record numbers. Any identifier assigned by a healthcare provider to track an individual patient within the institution's systems: MRN, patient ID, episode number, account number assigned by the healthcare system. Internal record identifiers must be replaced with pseudonymous codes if records need to be linked across tables — the pseudonymous code must not be derivable from the original MRN, must not be a simple sequential number that preserves ordering, and must be stored separately from the linked data.
Health plan beneficiary numbers. Insurance member ID numbers, group numbers, subscriber IDs, beneficiary IDs from Medicare, Medicaid, or any other health plan. These identifiers are particularly sensitive because they appear in claims data and can often be verified against insurance company records that contain additional identifying information.
Account numbers. Any account number associated with the individual in any financial or administrative system — bank account numbers if captured for payment purposes, credit card numbers, patient financial services account numbers. Also includes account numbers in the sense of any system-assigned account identifier that persists across transactions and can be used to track an individual over time.
Certificate and license numbers. License numbers of any kind (driver's license, medical license, nursing license, DEA registration number) when they are associated with the patient record rather than the provider record. Also includes any certificate number tied to the individual, such as insurance certificates or certification numbers from accrediting bodies, if applicable to the patient record.
Vehicle identifiers and serial numbers, including license plate numbers. Vehicle identification numbers (VINs), license plate numbers, and any other vehicle-related identifier associated with the individual. These may appear in emergency department records (motor vehicle accident presentations), social work documentation, or transportation-related care coordination records. They are comparatively rare in clinical data but must be addressed when present.
Device identifiers and serial numbers. Serial numbers for implantable medical devices (pacemakers, defibrillators, cochlear implants, insulin pumps), durable medical equipment assigned to a specific patient, and any other device identifier that is linked to an individual. Device identifiers are particularly sensitive because they are often registered with manufacturers under the patient's name and can be traced through the FDA's device registry. A pacemaker serial number in a dataset can sometimes identify the patient through the manufacturer's recall notification database.
Web universal resource locators (URLs). Any URL that could identify an individual: patient portal URLs that contain patient identifiers, personal website URLs, URLs appearing in patient-provided contact information, or URLs in social work documentation that reference the patient's online presence. Also includes URLs embedded in clinical communication records if they contain session tokens or identifiers that link to the patient's record.
Internet Protocol (IP) addresses. IP addresses associated with any individual: IP addresses captured in patient portal access logs, telehealth session IP addresses, IP addresses from patient-generated health data submissions. IP addresses are particularly sensitive in the context of telehealth, where the patient's home network IP may appear in session metadata. IP addresses can be resolved to approximate geographic locations and, for static residential IP addresses, can sometimes identify a specific household.
Biometric identifiers, including finger and voice prints. Any biometric data tied to an individual: fingerprints, palm prints, retinal scans, iris patterns, facial geometry measurements, voice prints, gait measurements. Biometric identifiers are by definition unique to an individual and are therefore maximally identifying. Biometric data is increasingly present in clinical settings — patient identification using fingerprint or palm vein scanners, voice authentication for patient portal access, facial recognition at registration. All such biometric data must be removed.
Full-face photographs and any comparable images. Photographs where the individual's face is visible and could be used to identify them, as well as any other image that serves a comparable identifying function: wound photographs that include identifying birthmarks or tattoos, imaging that includes patient labels with name and MRN visible, photographs in social work records, patient ID photographs. Radiological images (X-rays, CT scans, MRI) are often treated as non-identifying because they are not photographs of the face, but head CT and MRI volumes can be rendered into recognizable facial surfaces, and care should also be taken with burned-in patient name overlays and identifying DICOM metadata.
Any other unique identifying number, characteristic, or code. The catch-all provision. This clause requires covered entities to remove any other identifier not explicitly listed — any characteristic, number, or code that could be used to identify an individual, either alone or in combination with other data. In practice, this clause is broader than it appears: it encompasses any internally assigned pseudonymous code that could be reversed if the mapping table is accessible, any combination of quasi-identifiers that together are uniquely identifying, and any rare characteristic that effectively identifies an individual by its singularity.
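For structured fields, the dates-and-ages category in the list above reduces to a few mechanical transformations. The following is a minimal Python sketch, not a complete implementation: the function names and the "90+" label are illustrative choices, and it assumes ages are computed against a single reference date chosen by the data team.

```python
from datetime import date
from typing import Optional

AGE_CAP = 90  # ages 90 and above collapse into a single category


def generalize_date(d: date) -> int:
    """Keep only the year of a date directly related to the individual."""
    return d.year


def generalize_age(age: int) -> str:
    """Return the age as a string, top-coding 90 and above."""
    return "90+" if age >= AGE_CAP else str(age)


def birth_year_for_release(birth: date, as_of: date) -> Optional[int]:
    """Return the birth year, or None when it would reveal an age of 90 or more.

    `as_of` is a hypothetical reference date (for example, the extract date).
    """
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    age = as_of.year - birth.year - (0 if had_birthday else 1)
    return None if age >= AGE_CAP else birth.year
```

A real pipeline would also need to check every retained event year against the birth year, since an admission year combined with a birth year can reveal an age above 89 even when the age on the reference date does not.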
The Hidden Traps: Where Safe Harbor Datasets Remain Re-identifiable
Removing all 18 identifier categories satisfies the regulatory requirement. It does not eliminate re-identification risk. The following failure modes are the ones that compliance teams most frequently encounter — and that IRBs, privacy officers, and journal reviewers most often catch.
The ZIP Code Trap in Detail
The three-digit ZIP code rule is one of the most technically complex aspects of Safe Harbor implementation, and one of the most frequently implemented incorrectly. HHS has published guidance listing the specific three-digit ZIP code prefixes that must be suppressed: those where the covered population is 20,000 or fewer. There are currently 17 such restricted prefixes, most of them covering sparsely populated areas in states like Wyoming, New Mexico, Nevada, and rural New England.
But the restriction doesn't end with those 17 prefixes. Even in areas where the three-digit ZIP prefix covers more than 20,000 people, the combination of three-digit ZIP with other retained information — age, sex, and a specific diagnosis — can dramatically reduce the population matching that combination. In a metropolitan area where the three-digit ZIP covers 100,000 people, a specific three-digit ZIP plus age 82 plus sex male plus diagnosis amyotrophic lateral sclerosis may match only two or three individuals. The Safe Harbor rule doesn't prohibit this, but it creates real re-identification risk.
Some institutions take the more conservative position of removing all geographic information below the state level, accepting the loss of geographic granularity in exchange for stronger re-identification protection. For datasets covering rare condition populations, this is often the right call.
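The three-digit rule itself is simple to express in code, which makes the common mistakes (keeping all five digits, or forgetting the restricted list) easier to spot. Below is a minimal Python sketch. The function name is illustrative, and the restricted set shown is a deliberately partial placeholder: a real pipeline should load the full list of restricted prefixes from current HHS guidance rather than hard-coding it.

```python
from typing import FrozenSet

# Illustrative subset only. Load the complete restricted list from HHS guidance.
ILLUSTRATIVE_RESTRICTED: FrozenSet[str] = frozenset({"036", "821"})


def generalize_zip(zip_code: str, restricted_prefixes: FrozenSet[str]) -> str:
    """Reduce a ZIP or ZIP+4 string to its Safe Harbor form.

    Returns the three-digit prefix, or "000" when the prefix is restricted
    or the input does not contain at least five digits.
    """
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        # Malformed input (for example, a leading zero lost in a numeric
        # column) is suppressed rather than guessed at.
        return "000"
    prefix = digits[:3]
    return "000" if prefix in restricted_prefixes else prefix


print(generalize_zip("02139-4307", ILLUSTRATIVE_RESTRICTED))  # 021
print(generalize_zip("82190", ILLUSTRATIVE_RESTRICTED))       # 000
```

Treating ZIP codes as strings throughout is a deliberate choice here: storing them as integers silently drops leading zeros and shifts the prefix.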
The Rare Disease Problem: When a Diagnosis Is an Identifier
This is the failure mode that caught the team in our opening story, and it's one of the most consequential — and least discussed — aspects of Safe Harbor's limitations.
Consider a condition with a prevalence of 1 in 50,000 people in the United States. In a metropolitan area of 500,000 people, there are approximately 10 people with this condition. If your dataset contains a record for one of those 10 people with a ZIP code (even three-digit) that identifies the metropolitan area, an age, and a sex, you've potentially reduced the candidate population to 2-5 people. If the dataset also contains the year of first diagnosis, you've narrowed it further. The diagnosis itself has become an identifier — not because of the diagnosis code, which is not on the Safe Harbor list, but because of the statistical rarity of the condition combined with other retained quasi-identifiers.
The Safe Harbor checklist has no item for "diagnosis codes for rare conditions." Technically, ICD-10 codes are not identifiers under Safe Harbor. But practically, a dataset that contains records of patients with conditions affecting fewer than 1 in 10,000 people — or even 1 in 1,000 in some geographic contexts — may carry re-identification risk through those diagnosis codes that is not addressed by removing all 18 listed identifiers.
Expert Determination is the appropriate de-identification approach when a dataset contains significant numbers of records with rare diagnoses, because only a proper statistical analysis can quantify the re-identification risk that the ICD codes create.
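Before an expert is engaged, a team can at least screen for the problem. The Python sketch below reproduces the prevalence arithmetic from this section and flags diagnosis-by-region combinations that appear in only a few records. The function names, field names (`zip3`, `dx`), and the cell-size threshold are all assumptions made for illustration; the appropriate threshold is a policy decision, not something Safe Harbor specifies.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Tuple


def expected_cases(population: int, one_in: int) -> float:
    """Expected number of people with a condition of prevalence 1 in `one_in`."""
    return population / one_in


def small_cells(
    records: Iterable[Mapping[str, str]],
    keys: List[str],
    min_cell_size: int,
) -> Dict[Tuple[str, ...], int]:
    """Return value combinations that occur in fewer than `min_cell_size` records."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: n for cell, n in counts.items() if n < min_cell_size}


# The article's example: a 1-in-50,000 condition in a metro area of 500,000.
print(expected_cases(500_000, 50_000))  # 10.0
```

A screen like this only counts records inside the dataset. It says nothing about how many matching people exist in the surrounding population, which is the question a formal risk analysis has to answer.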
The Date Re-identification Risk
Safe Harbor requires removing date elements below the year level — month and day must go. But several forms of date-related re-identification risk persist even after this truncation.
First, if a patient's clinical event was publicly notable — a local politician hospitalized, a high-profile accident, a community figure's death — the year alone may be sufficient to link the clinical record to the individual when combined with other retained information (diagnosis, geographic region, age, sex). Adversaries with access to news archives can use public events as anchors for re-identification attempts.
Second, when a dataset spans many years and includes records for elderly patients, the combination of retained year of birth and clinical history can narrow the re-identification pool significantly. An 89-year-old patient (the maximum retained age before aggregation) whose birth year is 1936 has a very specific demographic profile. Combined with a cancer diagnosis, a specific treatment history, and a geographic region, this patient's record may match only a handful of real individuals even without any of the 18 listed identifiers.
Third, relative time references in clinical notes — "three weeks after discharge," "presenting two months after diagnosis" — can effectively recover approximate dates when the discharge year is known. If a dataset retains the year of admission and a clinical note says "patient presented two months after her diagnosis of stage IV ovarian cancer," a motivated adversary who knows from public records that a woman of approximately this age in this geographic region was diagnosed with ovarian cancer in a specific year can match the records.
Free-Text Clinical Notes: The Biggest Hidden Trap
Free-text clinical notes are the most significant and most commonly overlooked re-identification risk in de-identified clinical datasets. The 18 Safe Harbor identifiers apply to notes just as they do to structured fields, but they are far easier to find and remove in structured fields. Clinical notes are unstructured: they contain clinician-authored narratives that routinely include information that falls outside the 18-identifier list but is highly identifying in practice.
A few illustrative examples of note content that is not on the Safe Harbor list but creates re-identification risk:
- Employer names: "Patient works at the county school district as a bus driver." Employer name plus occupation plus geographic region can narrow a population to tens of people.
- Family member names: "Patient's daughter, Susan, is involved in care." A family member's name combined with the patient's approximate age and geographic region is often searchable in public records.
- Reference to specific events: "Patient was admitted following a fall at the town's Fourth of July celebration." A public event plus timing plus geographic region plus injury type can link to news coverage that identifies the patient.
- Physical characteristics not covered by identifier 16 (biometric identifiers): height, weight (if unusual), distinctive tattoos or birthmarks described in examination notes, unusual physical characteristics mentioned in history and physical examinations.
- Provider names: Notes that mention referring physicians by name can sometimes be combined with referral pattern analysis to narrow the patient's geographic origin or insurance network.
The only adequate approach for clinical notes in a de-identified dataset is NLP-based identification and removal of identifying content, followed by human review of a statistically representative sample. Regular expression-based removal of phone numbers, SSN patterns, and email addresses is necessary but nowhere near sufficient. Clinical NLP tools for de-identification — including MIST, MedDEID, and commercial offerings from clinical NLP vendors — apply machine learning models trained on clinical text to identify a broader range of potentially identifying content. Even these tools have false negative rates that justify human auditing of samples.
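To make the "necessary but nowhere near sufficient" point concrete, here is a minimal Python sketch of the regular-expression layer. The patterns and the `scrub` function are illustrative assumptions covering only common US formats; they are not drawn from any of the tools named above.

```python
import re

# Illustrative patterns for common US formats only.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def scrub(text: str) -> str:
    """Replace pattern-matchable identifiers with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(scrub("patient can be reached at 617-555-1234"))
# patient can be reached at [PHONE]
print(scrub("Patient's daughter, Susan, is involved in care."))
# Patient's daughter, Susan, is involved in care.
```

The second call returns the note unchanged. Names, employers, and event references have no fixed pattern, which is why pattern matching can only be the first layer beneath NLP-based detection and human review.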
Safe Harbor is a regulatory minimum, not a privacy guarantee. Complying with all 18 identifier categories satisfies HIPAA's de-identification standard and removes legal PHI status from the data. It does not guarantee that a determined adversary with access to external data cannot re-identify individuals in your dataset. For high-sensitivity use cases — datasets covering rare conditions, elderly populations, or small geographic areas — Expert Determination provides a stronger foundation because it requires actual statistical analysis of re-identification risk rather than a checklist of items removed.
The "Any Other" Clause: Broader Than It Looks
Identifier number 18 — "any other unique identifying number, characteristic, or code" — is frequently treated as a minor afterthought to the 17 specific categories that precede it. It is actually one of the most consequential provisions in the Safe Harbor framework, and compliance teams that treat it as a footnote consistently underestimate its scope.
The "any other" clause is a general catch-all that was deliberately written broadly to ensure that the 18-item list could not be worked around by structuring data in ways that technically avoid the listed identifiers while preserving their identifying function. In practice, it means that any characteristic, code, or combination of characteristics that uniquely identifies an individual — even if not on the explicit list — must be treated as a Safe Harbor identifier.
Practical implications that often surprise compliance teams:
- Pseudonymous record identifiers. When you replace an MRN with a pseudonymous code to allow record linkage across tables (joining patient records to encounter records to diagnosis records), the pseudonymous code must itself comply with the "any other" clause. If the pseudonymous code is derived from the MRN, for example a keyed hash whose key is known or an unkeyed hash of a short numeric MRN, it is not adequately de-identified, because anyone who can recompute the hash over candidate MRNs can recover the mapping.
- Combinations of quasi-identifiers. The "any other" clause extends to combinations of variables that are individually non-identifying but are collectively identifying. Age + sex + ZIP prefix + rare diagnosis can be uniquely identifying even though none of these variables is on the Safe Harbor list individually. The covered entity's attestation that it has "no actual knowledge that the remaining information could be used to identify an individual" must account for these combinations — not just the individual variables.
- Occupation and employer in some contexts. A specific job title combined with a small employer or a rare specialty in a specific geographic area can be uniquely identifying. "Chief Medical Officer at the county's only critical access hospital" combined with an age and a diagnosis may identify a specific person even without any listed identifier.
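For the pseudonymous identifier case, one way to avoid a reversible derivation is to assign each MRN a random token and keep the mapping table outside the released dataset. The Python sketch below illustrates that idea under stated assumptions: the function name is hypothetical, and how the mapping is stored, access-controlled, or destroyed is left to the organization.

```python
import secrets
from typing import Dict, Iterable


def build_pseudonym_map(mrns: Iterable[str]) -> Dict[str, str]:
    """Assign each distinct MRN a random 128-bit token.

    Tokens come from a cryptographic random source, so they are not
    derivable from the MRN and carry no ordering information.
    """
    mapping: Dict[str, str] = {}
    used = set()
    for mrn in mrns:
        if mrn in mapping:
            continue  # the same patient keeps the same token across tables
        token = secrets.token_hex(16)
        while token in used:  # collisions are vanishingly unlikely, but cheap to rule out
            token = secrets.token_hex(16)
        used.add(token)
        mapping[mrn] = token
    return mapping


mapping = build_pseudonym_map(["1001", "1002", "1001"])
released_ids = sorted(mapping.values())  # only these tokens go into the dataset
```

The design trade-off is that a random mapping must be persisted somewhere if new records will be linked later, and that mapping table then becomes the sensitive asset to protect.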
The Mosaic Effect: How Safe Data Becomes Unsafe
The Mosaic Effect — also called the aggregation problem or the jigsaw identification attack — refers to the phenomenon by which individually innocuous pieces of data become identifying when assembled. It's the de-identification challenge that makes statistical approaches like Expert Determination valuable and makes checklist approaches like Safe Harbor insufficient for high-risk scenarios.
The canonical illustration: age is not an identifier. Sex is not an identifier. A three-digit ZIP prefix (above the 20,000-person threshold) is not an identifier. Diagnosis code is not an identifier. But the combination of age 67, female, three-digit ZIP 024, and ICD-10 code C50.911 (malignant neoplasm of unspecified site of right female breast) combined with a year of diagnosis may match fewer than 10 people in the geographic area, and may match only one or two when combined with the year of first treatment.
Harvard's Latanya Sweeney demonstrated in a landmark 2000 paper that 87% of the US population at the time could be uniquely identified using only three variables: five-digit ZIP code, date of birth, and sex. This was before social media, consumer genomics, and the proliferation of data brokers that have expanded the pool of external data available to adversaries. The re-identification threat has increased substantially since that analysis, not decreased.
The practical implication for compliance teams: Safe Harbor's approach of removing specific identifiers addresses the obvious, direct identifiers. It does not address the Mosaic Effect — the way that combinations of non-identifier variables can become identifying. Addressing the Mosaic Effect requires statistical analysis, not a checklist. It requires understanding the distribution of values in your specific dataset and comparing the uniqueness of record combinations against relevant external data sources. This is exactly what Expert Determination is designed to provide.
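A first look at the Mosaic Effect in a specific dataset is to count how many records share each combination of quasi-identifiers. The Python sketch below computes the smallest group size (the k in k-anonymity) and the fraction of records that are unique. The field names and toy records are invented for illustration, and this measures uniqueness within the dataset only, not against any external population.

```python
from collections import Counter
from typing import Dict, List, Mapping, Sequence


def uniqueness_summary(
    records: Sequence[Mapping[str, object]],
    quasi_identifiers: List[str],
) -> Dict[str, float]:
    """Summarize how distinguishable records are on the given columns."""
    if not records:
        raise ValueError("records must not be empty")
    # Group records by their combination of quasi-identifier values.
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for n in sizes.values() if n == 1)
    return {"k": min(sizes.values()), "unique_fraction": unique / len(records)}


toy = [
    {"age": 67, "sex": "F", "zip3": "024", "dx": "C50.911"},
    {"age": 67, "sex": "F", "zip3": "024", "dx": "C50.911"},
    {"age": 67, "sex": "F", "zip3": "024", "dx": "E11.9"},
    {"age": 82, "sex": "M", "zip3": "024", "dx": "G12.21"},
]

print(uniqueness_summary(toy, ["zip3"]))
# {'k': 4, 'unique_fraction': 0.0}
print(uniqueness_summary(toy, ["age", "sex", "zip3", "dx"]))
# {'k': 1, 'unique_fraction': 0.5}
```

No single column changes between the two calls. Only the combination does, and half the toy records become unique.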
Real-World Re-identification Attacks: What the Research Shows
Re-identification research has repeatedly demonstrated that de-identified clinical datasets can be re-identified at higher rates than the de-identification process was designed to permit. Understanding this research is not academic — it directly informs the risk assessment that organizations must conduct when deciding whether de-identified data is safe to share or publish.
Sweeney's Massachusetts Health Data Study
In 1997, then-Governor William Weld of Massachusetts assured the public that patient records released by the state's Group Insurance Commission were safe because they'd been de-identified — names, addresses, and SSNs had been removed. Latanya Sweeney, then a graduate student at MIT, purchased the voter registration list for Cambridge, Massachusetts for $20. She matched the de-identified medical records to voter records using ZIP code, birth date, and sex. She identified Governor Weld's medical record and mailed it to his office. The anecdote is not just illustrative — it fundamentally shaped the development of HIPAA's de-identification standards and the subsequent research on re-identification risk.
The Netflix Prize Re-identification Study
In 2007, researchers Arvind Narayanan and Vitaly Shmatikov demonstrated that the Netflix Prize dataset (a collection of 100 million movie ratings that Netflix had released after removing subscriber names) could be de-anonymized by matching it against public ratings on the Internet Movie Database (IMDb). Knowing only a handful of a subscriber's movie ratings and their approximate dates was enough to single out that subscriber's record in most cases. While not a healthcare dataset, this study was seminal for privacy researchers because it demonstrated the general principle: de-identified data can be re-identified when combined with any external data that shares a common linking variable.
The AOL Search Data Incident
In 2006, AOL released a dataset of 20 million search queries "anonymized" by replacing usernames with random numbers. Within days, New York Times journalists identified individual users from their search histories, including Thelma Arnold, a 62-year-old widow from Georgia, whose identity was recovered from queries about "landscapers in Lilburn, Ga" and specific ailments. Her privacy violation was both real and predictable: the queries themselves were uniquely identifying in combination. No identifier was needed. The queries were the identifier.
Clinical Dataset Re-identification Studies
Multiple peer-reviewed studies have analyzed the re-identification risk in clinical datasets. A 2013 study by Benitez and Malin found that patients with rare diseases were re-identifiable in hospital discharge data at rates orders of magnitude higher than patients with common diseases — underscoring the specific risk that rare diagnoses create. A 2019 study in Nature Communications analyzed genetic and demographic data and found that nearly any modern individual can be identified from a few dozen single-nucleotide polymorphisms combined with demographic information — foreshadowing the challenge that genomic data integration will create for clinical dataset de-identification in the coming decade.
Expert Determination: When and How to Use It
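Expert Determination under 45 CFR §164.514(b)(1) requires a qualified statistical or scientific expert to apply generally accepted principles to determine that the risk of identifying an individual from the remaining data is very small. The regulation doesn't define "very small," and HHS guidance states that no single numerical level of risk universally meets the standard. The expert is instead expected to justify a threshold for the specific dataset and its release context. One frequently cited benchmark is the estimate that roughly 0.04% of the US population (4 in 10,000) remains unique on the demographics Safe Harbor leaves in place, though many privacy scientists set thresholds based on the sensitivity of the data and the recipients involved.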
The Expert Determination process typically involves:
- Dataset characterization. The expert analyzes the dataset's structure, the distribution of values across all retained variables, and the proportion of records containing rare or potentially identifying value combinations.
- Threat model development. The expert identifies the most plausible re-identification attack vectors given the dataset's intended use and distribution context. A dataset shared with a small group of academic researchers faces a different threat model than one published on a public data repository.
- Population uniqueness analysis. Using the k-anonymity framework or more sophisticated approaches (l-diversity, t-closeness, differential privacy metrics), the expert quantifies the degree to which records in the dataset are uniquely identifiable.
- External data linkage analysis. The expert identifies relevant external data sources that an adversary might use in a linkage attack and models the re-identification success rate using those sources.
- Risk quantification and documentation. The expert documents the methods used, the results of the analysis, and the conclusion that the risk meets the "very small" standard — or recommends transformations to reach that standard.
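The population uniqueness step can be sketched in a few lines of Python. This is a minimal illustration, not an expert's methodology: the column names in `QUASI_IDENTIFIERS`, the sample records, and the `uniqueness_report` helper are all hypothetical, and treating worst-case risk as 1/k assumes an adversary who already knows the target is in the dataset. Under that assumption, keeping worst-case risk at or below 0.04 would require every equivalence class to hold at least 25 records.

```python
from collections import Counter

# Hypothetical quasi-identifier columns; a real analysis would choose these
# from the threat model rather than hard-coding them.
QUASI_IDENTIFIERS = ("zip3", "age_band", "sex", "icd10")


def equivalence_class_sizes(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def uniqueness_report(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Summarize k-anonymity for a non-empty list of dict records."""
    if not records:
        raise ValueError("records must not be empty")
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    k = min(sizes.values())  # size of the smallest equivalence class
    unique = sum(1 for n in sizes.values() if n == 1)
    return {
        "k": k,
        "unique_records": unique,
        "fraction_unique": unique / len(records),
        # Worst-case risk, assuming the adversary knows the target is present.
        "max_risk": 1 / k,
    }


# Made-up records: three patients with a common diagnosis share one class,
# while one patient with a rare diagnosis (E75.22, Gaucher disease) stands alone.
records = [
    {"zip3": "303", "age_band": "60-69", "sex": "F", "icd10": "E11.9"},
    {"zip3": "303", "age_band": "60-69", "sex": "F", "icd10": "E11.9"},
    {"zip3": "303", "age_band": "60-69", "sex": "F", "icd10": "E11.9"},
    {"zip3": "598", "age_band": "40-49", "sex": "M", "icd10": "E75.22"},
]
report = uniqueness_report(records)
print(report)
```

A single record with a rare diagnosis drives k down to 1 for the whole dataset, which is why rare conditions dominate the risk picture even when most records sit in large classes.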
Expert Determination is particularly valuable when:
- The dataset covers rare conditions (prevalence below 1 in 1,000) or populations with narrow demographic profiles.
- The dataset will be publicly released or shared with a large number of recipients.
- The dataset will be combined with or linked to external data sources.
- The dataset contains detailed geographic information, longitudinal records, or other quasi-identifiers that Safe Harbor retains.
- The re-identification consequences — legal, reputational, or harm to data subjects — would be severe.
What De-identification Does NOT Protect Against
Organizations that achieve HIPAA Safe Harbor de-identification sometimes assume they have resolved all privacy and compliance obligations related to the dataset. This assumption is wrong in several important ways.
State Privacy Laws
HIPAA de-identification removes PHI status under federal law. It does not preempt state health privacy laws that may apply independently. California's CMIA governs "medical information" rather than PHI, a differently defined category. Washington's My Health My Data Act applies to consumer health data held by entities that are not HIPAA covered entities. State breach notification laws impose obligations when health-related information is breached even if the data was not technically PHI. Organizations that share de-identified clinical data across state lines must evaluate the applicable state law in each relevant jurisdiction.
CCPA and State Consumer Privacy Laws
The California Consumer Privacy Act (CCPA) and the California Privacy Rights Act (CPRA) apply to "personal information" — a category that is broader than HIPAA's PHI and that may encompass de-identified health data in certain contexts. CCPA's definition of de-identified data requires that the data be processed to make it "reasonably unable to be associated with" the individual — a standard that may not be satisfied by Safe Harbor de-identification alone. California consumers have rights including the right to know, right to delete, and right to opt out that apply to covered businesses handling personal information, regardless of HIPAA status.
FTC Enforcement Authority
The Federal Trade Commission has enforcement authority over deceptive or unfair trade practices — authority that it has used in healthcare data contexts even where HIPAA doesn't apply. The FTC's enforcement actions related to health data privacy have consistently taken the position that companies' privacy representations must be accurate and that re-identification of ostensibly de-identified health data can constitute an unfair or deceptive practice. FTC enforcement is not precluded by HIPAA compliance.
Contractual and DUA Obligations
As noted elsewhere, achieving HIPAA de-identification doesn't change the contractual restrictions in a Data Use Agreement. A DUA that prohibits commercial use applies equally to de-identified data from the covered dataset. A DUA that requires data destruction applies to de-identified derivatives as well as the original PHI. De-identification changes the regulatory status of the data — it doesn't change the contract.
Business Associate Agreements: When They're Required
The Business Associate Agreement (BAA) is the contractual mechanism by which HIPAA's requirements extend to organizations that receive PHI from covered entities. Understanding when a BAA is required — and what it must contain — is essential for any organization working in the healthcare data ecosystem.
Who Is a Business Associate?
A business associate is a person or entity that performs functions or activities involving the use or disclosure of PHI on behalf of a covered entity. The definition is broad: data analytics vendors, cloud storage providers that host PHI, software-as-a-service platforms that process PHI, ML companies that train models on PHI, and research organizations that receive PHI from hospitals are all typically business associates.
Critically, the BAA requirement applies whether or not the business associate was aware it was receiving PHI. An organization that receives what it was told was de-identified data — but which was later found to contain identifying information — is nevertheless a business associate if the data was PHI. The organizational responsibility to verify de-identification status before treating data as non-PHI is non-trivial.
What a BAA Must Contain
45 CFR §164.504(e) specifies the required elements of a Business Associate Agreement. The BAA must:
- Establish the permitted and required uses and disclosures of PHI by the business associate.
- Require the business associate to implement appropriate safeguards to prevent unauthorized use or disclosure.
- Require the business associate to report any use or disclosure not provided for in the BAA.
- Require the business associate to ensure that any subcontractors also agree to appropriate restrictions.
- Require the business associate to make PHI available for patient access, amendment, and accounting of disclosures.
- Require the business associate to return or destroy PHI when the business relationship ends.
Organizations that receive PHI without a signed BAA from the covered entity are in violation of HIPAA — even if the use of the data is otherwise appropriate. BAA violations are among the most common findings in HIPAA enforcement actions, and they frequently occur because organizations began using data before finalizing the BAA paperwork.
Why Synthetic Data Sidesteps the Entire De-identification Problem
The fundamental insight that motivates sophisticated teams' growing use of synthetic data: de-identification is an attempt to make real patient data safe for broader use by removing or obscuring identifying information. The problem is that real patient data carries identifying information not just in the 18 listed identifiers but in the statistical structure of the data itself — in the combinations of quasi-identifiers that are retained, in the clinical narratives of free-text notes, and in the rarity of certain clinical presentations that makes them effectively unique.
Synthetic data doesn't have this problem. A synthetic patient record is not derived from any real patient's information. There is no underlying identity to recover. There is no real person whose privacy can be violated. The Mosaic Effect cannot reconstitute a real individual from a synthetic record because the synthetic record was never a real individual to begin with.
This is not a legal argument. It's a factual one. Synthetic data generated from statistical models of real populations captures the population-level patterns of real clinical data without instantiating any real patient's specific combination of characteristics. The population exists in the synthetic data. The individuals do not.
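To make the distinction concrete, here is a deliberately tiny sketch of sampling records from aggregate tables. It is an illustration only and not a description of how any particular product generates data: the probabilities in `AGE_BAND_WEIGHTS` and `DIAGNOSIS_GIVEN_AGE` are invented, the function names are hypothetical, and a realistic generator would model far more variables and their dependencies.

```python
import random

# Hypothetical aggregate statistics. In practice these would come from
# published population tables, never from individual patient rows.
AGE_BAND_WEIGHTS = {"18-39": 0.35, "40-64": 0.40, "65+": 0.25}
DIAGNOSIS_GIVEN_AGE = {
    "18-39": {"J45.909": 0.6, "E11.9": 0.1, "I10": 0.3},
    "40-64": {"J45.909": 0.2, "E11.9": 0.3, "I10": 0.5},
    "65+": {"J45.909": 0.1, "E11.9": 0.3, "I10": 0.6},
}


def weighted_choice(rng, weights):
    """Draw one key from a {value: probability} mapping."""
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=1)[0]


def synthetic_patients(n, seed=0):
    """Sample n records from the aggregate tables; no real record is read or copied."""
    rng = random.Random(seed)
    patients = []
    for i in range(n):
        age_band = weighted_choice(rng, AGE_BAND_WEIGHTS)
        # The diagnosis is drawn conditionally on age band, so the
        # population-level association between the two is preserved.
        icd10 = weighted_choice(rng, DIAGNOSIS_GIVEN_AGE[age_band])
        patients.append(
            {"patient_id": f"SYN-{i:06d}", "age_band": age_band, "icd10": icd10}
        )
    return patients


cohort = synthetic_patients(1000)
print(cohort[:3])
```

Every record here is a draw from a distribution. The only inputs are the aggregate tables, so there is no source row that a given output could be traced back to.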
From a compliance perspective, this means: no HIPAA applicability (synthetic data is not PHI), no BAA requirement, no IRB review for the data itself, no DUA to negotiate, no Safe Harbor checklist, no Expert Determination analysis, no data destruction obligation, no re-identification risk, and no state privacy law concern from the data itself. The entire compliance machinery that governs real patient data simply doesn't apply.
PatientDatasets vs. Going Through De-identification Yourself
The team that built the six-week de-identification pipeline — the one the IRB rejected — eventually got their project approved. But when the data science lead reflected on the experience, she said something that has stayed with us: "We spent six weeks building a pipeline for data we ended up having to re-process anyway. And even after the re-processing, there was still a background anxiety every time we shared the data with someone new. What if this person could figure out who these patients are? We never had that with synthetic data."
That background anxiety has a real cost. It shapes decisions about who can access the data, whether it can be shared with external collaborators, whether it can be shown in demos, whether it can be included in publications. Synthetic data eliminates it entirely.
Building a proper de-identification pipeline for real clinical data — one that actually satisfies a rigorous IRB review — requires:
- An NLP pipeline for free-text de-identification, including clinical NLP tools for entity recognition and a human review process for sampling and validation.
- A geographic processing pipeline that correctly identifies restricted three-digit ZIP prefixes, applies the 20,000-person threshold rule, and handles edge cases like PO boxes and non-standard ZIP codes.
- Date handling that correctly removes all date elements below the year level, aggregates ages over 89, and handles relative time references in clinical notes.
- A pseudonymization system for record linkage keys that satisfies the "any other" clause — generating pseudonymous codes that cannot be reversed, are not simply sequential, and are maintained in a secure, separate key mapping table.
- A rare diagnosis analysis to identify ICD-10 codes with low population prevalence in the covered geographic area and assess their re-identification contribution.
- A device identifier check that handles implanted device serial numbers against the FDA's device registry.
- An audit and validation process that confirms the pipeline's performance on a representative sample before the data is used or shared.
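A few of the structured-field steps above can be sketched as follows. This is a simplified illustration under stated assumptions: `ZIP3_POPULATION` holds made-up counts standing in for real Census data, the function and class names are hypothetical, and the sketch does not attempt free-text, rare-diagnosis, or device-identifier handling.

```python
import secrets
from datetime import date

# Hypothetical population counts per three-digit ZIP prefix. A real pipeline
# would load these from Census data; the numbers here are made up.
ZIP3_POPULATION = {"303": 1_200_000, "598": 15_000}
MIN_ZIP3_POPULATION = 20_000


def generalize_zip(zip_code, populations=ZIP3_POPULATION):
    """Keep the first three digits only if that prefix covers more than 20,000 people."""
    prefix = zip_code.strip()[:3]
    if populations.get(prefix, 0) > MIN_ZIP3_POPULATION:
        return prefix
    return "000"  # restricted or unknown prefixes are suppressed


def generalize_date(value):
    """Drop month and day, keeping only the year."""
    return value.year


def generalize_age(age):
    """Aggregate ages over 89 into a single 90+ category."""
    return "90+" if age > 89 else str(age)


class Pseudonymizer:
    """Assign random codes that are not derived from the source identifier."""

    def __init__(self):
        # This mapping is the re-identification key. Store it separately
        # from the released dataset, with restricted access.
        self._codes = {}

    def code_for(self, source_id):
        if source_id not in self._codes:
            self._codes[source_id] = secrets.token_hex(16)
        return self._codes[source_id]


pseudo = Pseudonymizer()
row = {"mrn": "MRN-0001", "zip": "59801", "admitted": date(2021, 3, 14), "age": 93}
released = {
    "patient_code": pseudo.code_for(row["mrn"]),
    "zip3": generalize_zip(row["zip"]),
    "admit_year": generalize_date(row["admitted"]),
    "age": generalize_age(row["age"]),
}
print(released)
```

Random codes are used instead of hashes of the record number so that the pseudonym carries no information about the original identifier. Treating unknown ZIP prefixes as restricted is a conservative design choice for the sketch, not a regulatory requirement.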
This is months of engineering work, not weeks. And it's work that needs to be repeated whenever the underlying data changes, whenever the regulatory requirements are updated, and whenever the use case or sharing context changes in ways that affect the re-identification risk profile.
PatientDatasets provides synthetic patient data that eliminates this entire process. Not because it exploits a regulatory loophole, but because synthetic data is genuinely a different category of information, one to which healthcare privacy regulation simply doesn't apply. The time your team would spend building and validating a de-identification pipeline, you spend building the model. The anxiety your team would carry about re-identification risk, you don't carry.
The datasets are clinically realistic: calibrated to real population statistics, with ICD-10 distributions that match CMS utilization data, lab value distributions that match known population ranges, and comorbidity co-occurrence patterns that reflect published clinical epidemiology. They include CPT codes, claim-level financial data, and a relational schema with documented foreign keys. They are available under explicit commercial licenses, with no non-commercial (NC) restriction.
No De-identification Required
Synthetic patient data that is clinically realistic, commercially licensed, and carries zero re-identification risk — because there are no real patients to re-identify. No Safe Harbor checklist. No Expert Determination analysis. No IRB review. No BAA requirement. No data destruction obligation. Available today across 60+ specialties, in seven export formats including FHIR R4 and Parquet. Get started free.
The team whose IRB flagged their ZIP codes eventually got their project approved, after three additional months of rework, a full geographic re-processing pass, a rare diagnosis analysis, a revised privacy plan, and resubmission to the IRB. The project succeeded. But the data science lead told us that the lesson she took from the experience wasn't about de-identification technique. It was about project design.
"In retrospect," she said, "the question I should have asked at the start was: does this phase of the project actually require real patient data? And the honest answer was no — we were building and testing an algorithm. We didn't need real patients. We needed realistic data. Those are different things, and synthetic data would have given us the second thing without any of the compliance overhead. The real data phase came much later, and by then we had something worth validating."
That's the insight that most compliance teams learn too late — not from textbooks or training, but from projects that stalled, IRBs that flagged, and re-identification reviews that sent engineering teams back to the drawing board. The goal isn't to de-identify data as efficiently as possible. The goal is to use the right category of data for each phase of the work. And in most phases — development, testing, prototyping, training, integration — the right category is synthetic. Not because it's easier. Because it's right.