Synthetic Patient Data for Pharma R&D: Three Real Pilots Reviewed

Executive Summary

Synthetic patient data — data generated by algorithms to resemble real patient records without containing actual patient information — has moved through three distinct phases over the past five years. Phase 1 was conceptual, dominated by academic and methodological papers. Phase 2 was experimental, with sponsor-side pilots in low-stakes use cases. Phase 3, which we are now in, is operational: pilots have moved into use cases that affect regulatory decisions, and the FDA’s posture has begun to crystallize through public guidance and engagement.

This article reviews three pilot patterns that are publicly visible in 2025-2026: synthetic control arms in clinical trials, rare disease cohort augmentation, and privacy-preserving data sharing for external research. We summarize what each pattern is, what it has delivered, where it is struggling, and what regulators have said. The aim is to give pharma R&D leaders a calibrated read on where the technology is genuinely useful versus where it remains aspirational.

Three phases of synthetic patient data adoption are now visible: conceptual (pre-2020), experimental (2020-2024), and operational (2024 onward). The transition into the operational phase is what makes the current regulatory and industry signal meaningfully different from earlier years.

Why Synthetic Patient Data Is Moving Now

Three forces have converged to push synthetic patient data from concept to operational use. The first is the maturation of generative modeling techniques, particularly those based on transformer architectures and diffusion models, which can now generate synthetic data that preserves the statistical properties of source data with substantially higher fidelity than earlier approaches. The second is the regulatory direction, which has moved from skepticism to structured engagement. The FDA’s Real-World Evidence framework, EMA’s reflection papers on synthetic data, and ICH-level discussions have all signaled that regulators are prepared to engage with synthetic data when it is used credibly. The third is the operational pressure on pharma R&D timelines, which has made any technology that can reduce time to evidence — including synthetic data approaches — strategically attractive.

The convergence of these forces is what distinguishes the current operational phase from the earlier experimental phase. Five years ago, synthetic data pilots were primarily about demonstrating technical feasibility. Today, they are about demonstrating regulatory credibility, operational integration with existing R&D workflows, and measurable acceleration of specific R&D activities. The conversation has shifted from “can synthetic data work?” to “what is it good for, what is it not good for, and what does it take to convince a regulator?”

The FDA’s Real-World Evidence program has been one of the most significant catalysts. While not specific to synthetic data, the RWE framework’s articulation of when and how non-traditional data can support regulatory decisions has created intellectual scaffolding that synthetic data approaches can attach to. The 2023 publication of the FDA’s draft guidance on the use of real-world data to support clinical study designs formalized a framework that pharma sponsors are now extending into synthetic data discussions.

Pilot Pattern 1: Synthetic Control Arms

The first and most public pilot pattern is the synthetic control arm: a constructed comparison cohort generated to model what would happen to patients receiving standard of care, used in place of (or alongside) a traditional control arm in a clinical trial. The use case is most attractive in oncology, where ethical and recruitment considerations make randomized controls difficult, and in rare diseases, where natural-history controls may be unobtainable at sufficient scale.

Public examples include several oncology trials where synthetic control arms have been constructed from electronic health record data, registry data, and historical trial data. The pattern is typically: a sponsor generates a synthetic control cohort that resembles the patients in the active treatment arm on prognostic variables, and the trial’s primary efficacy analysis compares the active arm against the synthetic control. The FDA has been engaged with several of these submissions, and the agency’s posture has been notably nuanced: open to the approach in specific contexts of use, demanding on the credibility evidence, and explicit that the synthetic control is not a substitute for randomization where randomization is feasible.
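
One way to make "resembles the active arm on prognostic variables" concrete is a covariate balance check. The sketch below computes a standardized mean difference (SMD) per variable; it is a minimal illustration, not a method attributed to any sponsor or to the FDA. The variable names, the values, and the 0.1 threshold (a convention often used in the propensity-score literature) are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd


def balance_report(active_arm, synthetic_control, threshold=0.1):
    """Return {variable: (smd, is_balanced)} for each prognostic variable."""
    report = {}
    for name in active_arm:
        smd = standardized_mean_difference(active_arm[name], synthetic_control[name])
        report[name] = (round(smd, 3), abs(smd) < threshold)
    return report


# Hypothetical illustrative values, not real patient data.
active_arm = {"age": [61, 58, 70, 66, 63], "ecog": [0, 1, 1, 0, 1]}
synthetic_control = {"age": [60, 59, 69, 67, 62], "ecog": [1, 1, 1, 1, 0]}

report = balance_report(active_arm, synthetic_control)
for name, (smd, balanced) in report.items():
    print(f"{name}: SMD={smd}, balanced={balanced}")
```

In this toy example age is balanced while ECOG status is not, which is the kind of gap that would need to be addressed in submission documentation before the comparison is credible.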

What has worked. Synthetic control arms have demonstrably reduced trial timelines and have made trials feasible that would not otherwise have been practical (particularly in rare-disease contexts and in second-line oncology). They have produced data packages that have supported regulatory engagement, including in some cases regulatory approval pathways. The published methodological literature, including work surveyed by BCG’s biopharmaceuticals practice and presented at clinical trial conferences, indicates that the pattern is being used by multiple top-20 pharma companies.

What has been harder. Demonstrating credibility under the FDA’s evolving credibility framework requires substantial methodological investment. Bias from confounders, distribution shift between historical and current standard-of-care populations, and the technical question of how the synthetic data was generated all have to be addressed in submission documentation. Pilots that have not invested adequately in this work have stalled at regulatory engagement.

Pilot Pattern 2: Rare Disease Cohort Augmentation

The second pilot pattern uses synthetic data to augment small cohorts in rare disease research. The problem is structural: in many rare diseases, the patient population is too small to support traditional statistical analyses, and the available real-world data is fragmented across registries, electronic health records, and natural history studies. Synthetic data is being used to construct extended cohorts that preserve the statistical properties of the small real cohort while permitting analyses that would be underpowered against the real data alone.

The most credible examples have come out of academic-pharma partnerships, where the academic partner has access to the natural history data and the pharma partner contributes computational infrastructure and the regulatory engagement experience. The pattern typically involves: characterizing the real-world rare disease cohort, generating a synthetic cohort that preserves the joint distribution of clinically meaningful variables, validating the synthetic cohort against held-out real-world data, and then using the synthetic cohort to support specific analytical activities (such as natural history modeling, sample-size determination for prospective trials, or comparator analyses).
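
The validation step against held-out real-world data can be sketched as a per-variable comparison of empirical distributions. The example below computes the two-sample Kolmogorov-Smirnov statistic by hand and flags variables that diverge. The variable names, the values, and the 0.3 tolerance are assumptions for illustration; a real validation plan would set tolerances per context of use.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def flag_divergent_variables(synthetic, held_out, tolerance=0.3):
    """Names of variables whose synthetic distribution drifts from held-out data."""
    return [
        name
        for name in held_out
        if ks_statistic(synthetic[name], held_out[name]) > tolerance
    ]


# Hypothetical illustrative values, not real patient data.
held_out = {
    "age_at_onset": [4, 6, 7, 9, 12, 15],
    "baseline_score": [20, 22, 25, 27, 30, 31],
}
synthetic = {
    "age_at_onset": [5, 6, 8, 9, 11, 14],
    "baseline_score": [35, 36, 38, 40, 41, 43],
}

flagged = flag_divergent_variables(synthetic, held_out)
print(flagged)
```

This checks marginal distributions only. Agreement on every marginal is necessary but not sufficient for preserving the joint distribution, so a check like this would sit alongside multivariate comparisons rather than replace them.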

What has worked. The pattern has materially accelerated rare disease research by making certain analyses possible where they were not previously. It has supported regulatory engagement on natural history characterization, which is itself a key input to rare disease drug development pathways. It has also opened collaborative research models between academic centers and pharma sponsors that would have been impractical without the privacy-preserving properties of synthetic data.

What has been harder. The validation discipline required to demonstrate that the synthetic cohort accurately represents the real population is substantial, and the methodological choices are scrutinized by both regulators and reviewers. Pilots that have over-claimed the fidelity of their synthetic cohorts have faced credibility challenges that have set back the entire research program. The discipline is to be conservative about what the synthetic data is being used for and explicit about its limits.

Pilot Pattern 3: Privacy-Preserving Data Sharing

The third pilot pattern uses synthetic data to enable data sharing across organizations without exposing the underlying patient-level data. The use case is most attractive in multi-stakeholder research consortia (academic centers, pharma companies, regulators) where data sharing is technically possible but practically constrained by privacy regulations, institutional review board requirements, and contractual restrictions.

Public examples include several pharma-led consortia that have used synthetic data to share data with academic collaborators, with regulators in pre-submission engagements, and with internal teams that would not otherwise have access to the underlying patient-level data. The pattern is typically: an organization generates a synthetic version of a real dataset, validates that the synthetic version preserves the statistical properties needed for the receiving party’s intended analyses, and then shares the synthetic version under terms that are materially less restrictive than the underlying real data would require.
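
Whether a synthetic dataset "preserves the statistical properties needed" depends on the receiving party’s analysis. If that analysis relies on relationships between variables, one simple check is to compare pairwise correlations in the real and synthetic versions. The sketch below is an assumption-laden illustration: the variable names and values are hypothetical, and it assumes every column is numeric and non-constant.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_gaps(real, synthetic):
    """Absolute difference in Pearson correlation for every variable pair."""
    gaps = {}
    for a, b in combinations(sorted(real), 2):
        gaps[(a, b)] = abs(
            pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b])
        )
    return gaps


# Hypothetical illustrative values, not real patient data.
real = {"dose": [1, 2, 3, 4, 5], "response": [2, 4, 6, 8, 10]}
faithful = {"dose": [1, 2, 3, 4, 5], "response": [3, 5, 7, 9, 11]}
inverted = {"dose": [1, 2, 3, 4, 5], "response": [10, 8, 6, 4, 2]}

print(correlation_gaps(real, faithful))
print(correlation_gaps(real, inverted))
```

The second candidate matches the real data on each marginal range yet reverses the dose-response relationship, which is why utility validation is tied to the intended analysis rather than to summary statistics alone.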

What has worked. The pattern has enabled collaborations that would not have been practical otherwise, particularly with external academic researchers. It has accelerated the engagement cycle with regulators in pre-submission meetings, where the regulator can engage with realistic data without the friction of formal patient-level data access. It has also supported internal capability development, where data science teams can work with realistic data without the friction of HIPAA-equivalent access controls. Bain’s healthcare and life sciences insights have documented similar patterns across the broader life sciences ecosystem.

What has been harder. The risk of re-identification — even theoretical re-identification — has slowed adoption in some organizations whose privacy posture is conservative. Methodologies for quantifying re-identification risk have matured significantly, but the legal and regulatory comfort level around residual risk remains uneven. Pilots that have over-rotated toward maximizing utility at the expense of privacy guarantees have faced internal resistance that has delayed deployment.
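
As an illustration of what quantifying re-identification risk can look like, the sketch below computes two heuristics that appear in the synthetic data literature: the distance from each synthetic record to its closest real record, and the share of synthetic records that are exact copies of real ones. It assumes numeric features already scaled to comparable ranges, and the records are hypothetical. Neither number is a formal privacy guarantee of the kind differential privacy provides.

```python
from math import dist


def distance_to_closest_record(synthetic_rows, real_rows):
    """Euclidean distance from each synthetic row to its nearest real row."""
    return [min(dist(s, r) for r in real_rows) for s in synthetic_rows]


def exact_copy_rate(synthetic_rows, real_rows):
    """Fraction of synthetic rows that reproduce a real row exactly."""
    real_set = {tuple(r) for r in real_rows}
    copies = sum(tuple(s) in real_set for s in synthetic_rows)
    return copies / len(synthetic_rows)


# Hypothetical scaled feature vectors, not real patient data.
real_rows = [(0.1, 0.5), (0.9, 0.2), (0.4, 0.8)]
synthetic_rows = [(0.1, 0.5), (0.5, 0.5), (0.8, 0.3)]

dcr = distance_to_closest_record(synthetic_rows, real_rows)
copy_rate = exact_copy_rate(synthetic_rows, real_rows)
print([round(d, 3) for d in dcr], round(copy_rate, 3))
```

A distance of zero, as for the first synthetic row here, means the generator has memorized a real record. How small a nonzero distance is acceptable is the residual-risk judgment on which comfort levels remain uneven.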

Pilot pattern	Primary use case	Maturity (May 2026)	Key barrier
Synthetic control arms	Replace or augment control cohorts in trials	Operational at top-20 pharma	Credibility framework documentation
Rare disease cohort augmentation	Extend small cohorts for analytical power	Operational in select partnerships	Validation discipline against real cohorts
Privacy-preserving data sharing	Enable cross-organization data flows	Operational in multi-stakeholder consortia	Re-identification risk tolerance

What’s Working Across the Three Patterns

Across the three patterns, several common success factors are visible. First, the use cases that have moved fastest are those where the underlying analytical question is well-defined and where the synthetic data is being deployed against a specific, narrow purpose rather than as a general-purpose replacement for real data. The credibility framework discipline rewards specificity, and pilots that have over-generalized have struggled.

Second, the partnerships that have produced the most defensible work involve close engagement between data scientists, biostatisticians, regulatory affairs leaders, and clinical scientists. The cross-functional integration is where the credibility work happens; pilots that have been run by data science alone have produced technically interesting outputs that do not survive regulatory or clinical scrutiny.

Third, the organizations that have moved fastest have invested in their methodological documentation as deliberately as in the technical work. The validation reports, the credibility framework alignment, and the explicit articulation of the synthetic data’s intended use and its limits are what regulators engage with, and pilots that have documented this work well have had more productive regulatory engagement than pilots with weaker documentation.

Sakara Digital perspective: The pharma R&D leaders who are getting the most value from synthetic data are not the ones treating it as a general-purpose acceleration tool. They are the ones who have identified specific, narrow use cases where the synthetic data fills a gap that real data cannot fill — rare disease cohorts that are too small, control arms that are ethically or practically difficult, data sharing that would not otherwise happen — and who have invested in the methodological discipline that makes the synthetic data credible to regulators and to clinical reviewers. The pattern is specificity and discipline, not generality and speed.

What Isn’t Working Yet

Several use cases have been proposed for synthetic patient data that have not yet produced the results their advocates suggested they would.

General-purpose acceleration of clinical trial design. The idea that synthetic data could allow trials to be largely designed and simulated before any real patient enrollment has not materialized at scale. The challenges include the difficulty of generating synthetic data that captures the full distribution of patient heterogeneity, including the long-tail behaviors that often drive trial outcomes, and the regulatory caution about replacing prospective data with generated data for high-stakes design decisions.

Replacement of natural history data for rare diseases. Synthetic data has been useful for augmenting natural history understanding but has not replaced the underlying natural history work. Regulators want to see real natural history data, with synthetic data as a complement rather than a substitute. Pilots that proposed pure synthetic-data natural history models have not advanced through regulatory engagement.

Pre-marketing safety signal detection. Synthetic data has been explored as a tool for pre-marketing safety signal detection, with the intuition that simulating large populations could reveal rare adverse events that small trials cannot. The reality is that synthetic data inherits the safety profile of its source data; it cannot reveal adverse events that were not present in the data it was trained on. The use case has therefore been narrowed to specific scenarios where the safety question can be addressed by re-sampling existing data with adjusted weights.
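
The limitation can be shown with a toy example. The sketch below uses a plain weighted resampler as a stand-in for a generator (an assumption: real generative models are more elaborate, but they are fit to the same source data). The event names, counts, and weights are hypothetical. However large the simulated population, an event absent from the source never appears.

```python
import random
from collections import Counter


def weighted_resample(records, weights, n, seed=0):
    """Draw n records with replacement, proportional to the given weights."""
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=n)


# Hypothetical source trial: 100 patients, two adverse event types observed.
source_events = ["none"] * 95 + ["nausea"] * 4 + ["rash"] * 1

# Up-weight patients with any adverse event to study them in more detail.
weights = [1.0 if event == "none" else 5.0 for event in source_events]

simulated = weighted_resample(source_events, weights, n=100_000)
counts = Counter(simulated)

print(sorted(counts))
print(counts["hepatotoxicity"])  # an event never observed in the source
```

Adjusting the weights changes how often observed events appear, which is the narrowed use case described above, but it cannot surface an event type the source data never contained.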

Health economics and outcomes research at scale. Synthetic data has been proposed for HEOR modeling at scale, but the credibility challenges are significant. Payers and regulators are skeptical of HEOR models built on synthetic data unless the synthetic data is being used to address very specific gaps in the underlying real-world data. The general-purpose HEOR use case has not produced credible deliverables yet.

The Converging Regulatory Posture

The FDA and EMA postures on synthetic patient data have converged on a recognizable framework over the past 18 months. The framework rests on three principles.

Context of use specificity. Both agencies expect sponsors to articulate the specific use case for synthetic data with enough precision that the credibility requirements can be calibrated. A generic claim that “we used synthetic data” is not adequate; sponsors are expected to spell out the analytical question, the data flow, and the role of synthetic data within that flow.

Risk-proportional validation. The validation expectation for synthetic data is proportional to the regulatory consequence of the analysis it supports. Synthetic data used in early-stage exploratory analyses faces a lighter validation expectation than synthetic data used to support a primary efficacy endpoint in a registration trial. The FDA’s credibility framework, articulated in its January 2025 draft guidance, applies directly.

Privacy guarantees as a prerequisite. Both agencies expect sponsors to articulate the privacy guarantees underlying the synthetic data, including the methodology for assessing re-identification risk and the residual risk’s acceptability against the use case. Privacy is treated as a prerequisite, not as a tradeoff against utility. The EMA’s reflection papers on related data topics have reinforced this point repeatedly.
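
The three principles can be read as a checklist that a sponsor’s internal tooling might encode. The sketch below is purely hypothetical: the class names, the consequence tiers, and the validation checks attached to each tier are illustrative assumptions, not requirements taken from FDA or EMA documents.

```python
from dataclasses import dataclass
from enum import IntEnum


class Consequence(IntEnum):
    """Hypothetical tiers for the regulatory consequence of an analysis."""
    EXPLORATORY = 1
    SUPPORTIVE = 2
    PRIMARY_EFFICACY = 3


@dataclass(frozen=True)
class ContextOfUse:
    analytical_question: str
    data_flow: str
    synthetic_data_role: str
    consequence: Consequence
    reidentification_assessed: bool


def readiness_gaps(cou):
    """List what is missing before the context of use can be presented."""
    gaps = []
    for field_name in ("analytical_question", "data_flow", "synthetic_data_role"):
        if not getattr(cou, field_name).strip():
            gaps.append(f"missing {field_name}")
    if not cou.reidentification_assessed:
        # Privacy is treated as a prerequisite, not a tradeoff.
        gaps.append("privacy assessment not documented")
    return gaps


def required_checks(cou):
    """Hypothetical mapping: heavier validation as consequence rises."""
    checks = ["marginal distribution comparison"]
    if cou.consequence >= Consequence.SUPPORTIVE:
        checks.append("held-out cohort validation")
    if cou.consequence >= Consequence.PRIMARY_EFFICACY:
        checks.append("external methodology review")
    return checks


cou = ContextOfUse(
    analytical_question="Compare 12-month survival against standard of care",
    data_flow="Registry data -> generator -> synthetic control cohort",
    synthetic_data_role="External comparator for primary analysis",
    consequence=Consequence.PRIMARY_EFFICACY,
    reidentification_assessed=False,
)
print(readiness_gaps(cou))
print(required_checks(cou))
```

The design point is that validation effort is derived from the declared consequence rather than chosen ad hoc, and that a missing privacy assessment blocks readiness regardless of how useful the data is.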

The convergence is significant. Sponsors deploying synthetic data across multiple jurisdictions can now work to a substantially harmonized framework, rather than navigating jurisdiction-specific expectations. The harmonization is not yet complete — ICH has not yet published a guideline specifically on synthetic data — but the directional alignment between FDA and EMA is clear enough that operational planning can proceed with confidence.

Implications for R&D Leaders Planning Synthetic Data Deployment

For pharma R&D leaders planning synthetic data deployment over the next 12 to 18 months, several implications follow from the publicly visible pilot patterns and the converging regulatory posture. First, the use cases where synthetic data is most likely to deliver value are well-defined: synthetic control arms in contexts where randomization is impractical, rare disease cohort augmentation where real cohorts are too small, and privacy-preserving data sharing across organizational boundaries. R&D leaders should prioritize these use cases over more speculative applications.

Second, the investment required to make a synthetic data deployment defensible is substantial and should not be underestimated. Methodological documentation, validation against real-world cohorts, credibility framework alignment, and regulatory engagement are all material work streams. R&D leaders who scope synthetic data deployments as primarily technical projects, without budgeting for the regulatory and methodological work, consistently produce deployments that stall at the engagement phase.

Third, partnerships with academic centers, methodology specialists, and (where appropriate) regulators themselves are leverage. The credibility of a synthetic data deployment is materially enhanced when the methodology has been reviewed by external experts, validated against external datasets, and discussed with regulators in pre-submission engagement. The investment in partnership is typically smaller than the cost of a stalled deployment, and the credibility dividend is significant.

The trajectory across the next 24 months will likely include the publication of more formal regulatory guidance specifically on synthetic data, increasing operational deployment of synthetic control arms in non-rare-disease oncology, growing use of privacy-preserving synthetic data in multi-stakeholder consortia, and emerging applications in pediatric drug development where real-world data is structurally limited. R&D leaders who anticipate this trajectory and build the methodological and regulatory infrastructure to support it will be positioned to capture material value as the technology continues to mature.

References & Sources

FDA Real-World Evidence Program — U.S. Food and Drug Administration. The intellectual scaffolding within which synthetic data discussions are situated; defines when and how non-traditional data can support regulatory decisions.
FDA Draft Guidance: Considerations for the Use of Real-World Data and Real-World Evidence to Support Regulatory Decision-Making for Drug and Biological Products — FDA. The 2023 draft guidance referenced throughout for context-of-use and credibility framing as it extends into synthetic data discussions.
European Medicines Agency — EMA. Source for EMA reflection papers and scientific opinions on real-world data, synthetic data, and the converging regulatory posture between FDA and EMA on non-traditional data approaches.
BCG Biopharmaceuticals Practice — Boston Consulting Group. Industry analysis covering synthetic control arms, real-world evidence integration, and the operational patterns visible across top-20 pharma R&D organizations.
Bain Healthcare and Life Sciences Insights — Bain & Company. Strategy analysis covering data sharing models, privacy-preserving research collaboration, and the consortium patterns that have produced the most visible privacy-preserving synthetic data deployments.
McKinsey Life Sciences Insights — McKinsey & Company. Industry analysis covering R&D acceleration patterns, synthetic data deployment economics, and the operational trajectory expected over the next 24 months in pharma R&D.

Amie Harpe, Founder and Principal Consultant

Amie Harpe is a strategic consultant, IT leader, and founder of Sakara Digital, with 20+ years of experience delivering global quality, compliance, and digital transformation initiatives across pharma, biotech, medical device, and consumer health. She specializes in GxP compliance, AI governance and adoption, document management systems (including Veeva QMS), program management, and operational optimization — with a proven track record of leading complex, high-impact initiatives (often with budgets exceeding $40M) and managing cross-functional, multicultural teams. Through Sakara Digital, Amie helps organizations navigate digital transformation with clarity, flexibility, and purpose, delivering senior-level fractional consulting directly to clients and through strategic partnerships with consulting firms and software providers. She currently serves as Strategic Partner to IntuitionLabs on GxP compliance and AI-enabled transformation for pharmaceutical and life sciences clients. Amie is also the founder of Peacefully Proven (peacefullyproven.com), a wellness brand focused on intentional, peaceful living.
