
Building a Real-World Evidence Data Lakehouse Without Migrating CDISC

Executive Summary

Most pharma real-world evidence (RWE) initiatives stall on a false premise: that CDISC-formatted clinical trial data must be migrated into the RWE lakehouse before the lakehouse can produce meaningful integrated analyses. The migration tax is large, the migration timeline is long, and the migration target is fragile because CDISC standards evolve. The result is RWE programs that announce ambitious roadmaps and then deliver narrow proofs-of-concept while the migration drags on.

There is a better pattern: build the RWE lakehouse to link to CDISC data where it lives rather than migrating CDISC into the lakehouse. The pattern preserves the validated state of CDISC datasets, leverages the regulatory submission tooling that already operates on CDISC, and produces an RWE lakehouse that delivers value within months rather than years. This article describes the link-don’t-migrate architecture, the identity resolution layer that makes it work, the governance considerations, the adoption sequence that produces value fast, and what the pattern unlocks for pharma RWE programs.

18-30 months is the typical timeline pharma RWE programs allocate to migrating CDISC-formatted clinical trial data into a unified RWE platform before the platform produces meaningful integrated analyses. The link-don’t-migrate pattern compresses time-to-value to 4-9 months by removing the migration as a dependency. [1]

Why So Many Pharma RWE Programs Stall

Pharma RWE programs typically launch with broad ambitions. Integrate clinical trial data with real-world data from EHRs, claims, registries, wearables, and patient-reported outcomes. Produce comparative effectiveness analyses, support label expansions, generate evidence for payer negotiations, and ground regulatory submissions in real-world context. The strategic logic is sound, and the value case is well-established.

The execution reality is harder. Most programs run into a recurring set of blockers within the first 12 months. The clinical trial data is in CDISC SDTM and ADaM formats, validated against regulatory requirements, sitting in submission-ready archives. Migrating this data into the RWE platform requires translating from CDISC into whatever target schema the lakehouse uses, validating the translations, and maintaining the validated state as CDISC standards evolve. The work is real, expensive, and politically fraught because it involves re-validating data that has already been validated for submission purposes.

Programs that proceed with migration as a prerequisite typically spend the first 12 to 24 months on migration work, deliver narrow proofs-of-concept on the migrated subset, and burn through executive patience before producing the integrated analyses that justified the program. By the time the migration is mostly complete, the underlying business case has often shifted, and the program is restructured or quietly scaled back.

The pattern is consistent enough that it is worth questioning the migration premise itself.

The CDISC Migration Premise and Why It Is Wrong

The premise behind CDISC migration into the RWE lakehouse is that integrated analyses require the data to live in a single platform with a unified schema. This premise is wrong, or more precisely, it is right at the analytical query layer but wrong at the storage layer.

Modern data platforms support federated query patterns that allow analyses to span multiple storage locations and multiple schemas. The analytical query layer can produce a unified view of clinical trial data and real-world data without the underlying data being in a single platform. The federation pattern is well-established in financial services and retail, and the underlying technology has matured to the point where it is operationally viable for pharma RWE.

The cost of preserving CDISC in place is the cost of building the federation and identity resolution layer. The cost of migrating CDISC is the cost of translation, revalidation, ongoing standards alignment, and the political work of re-validating submission-ready data. In our analysis, the federation cost is substantially lower than the migration cost for any RWE program with material CDISC data volumes, and the federation cost produces value faster because it does not require completing migration before analyses can run.

The link-don’t-migrate pattern, in summary: keep CDISC datasets where they are, in their submission-ready validated state. Build the RWE lakehouse for real-world data sources. Build a federation and identity resolution layer that allows analyses to span CDISC and the RWE lakehouse without moving the CDISC data.

The architecture has five components. None is new in isolation; the design effort concentrates on the discipline of integrating them.

Component | Role
CDISC archive (existing) | Preserves submission-ready clinical trial data in SDTM and ADaM formats; remains the source of truth for clinical data
RWE lakehouse | Ingests and curates real-world data from EHRs, claims, registries, wearables, and patient-reported outcomes
Identity resolution layer | Maps subjects, sites, indications, and time references across CDISC and RWE datasets
Federation query layer | Allows analyses to span CDISC and RWE without physically moving data
Governance and audit layer | Maintains lineage, access control, and regulatory traceability across the federated environment

The architecture’s key insight is that the RWE lakehouse does not need to contain CDISC data in order to serve integrated analyses. The lakehouse contains the real-world data, the federation layer joins the lakehouse to the CDISC archive at query time, and the identity resolution layer ensures the joins produce meaningful results.

The federation query layer is a well-established pattern in modern data platforms. Snowflake’s external tables, Databricks’ Delta Sharing and federated catalogs, and standalone federated query engines all support federation across heterogeneous storage. The technology is mature; the design effort is in how to structure the federation to produce performant, governed analyses against pharma’s specific data shapes.
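To make the query-time join concrete, here is a minimal sketch using Python's built-in sqlite3 module, where ATTACH DATABASE stands in for a federated engine. It is an illustration under stated assumptions, not a description of any vendor's product: the table names (dm, claims, subject_link), the column names, the token values, and the sample rows are all hypothetical, and the CDISC table is drastically simplified relative to a real SDTM domain.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Create one connection that sees two separately stored databases."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent in-memory database. In a real
    # deployment these would be distinct systems behind a federated engine.
    conn.execute("ATTACH DATABASE ':memory:' AS cdisc_archive")
    conn.execute("ATTACH DATABASE ':memory:' AS rwe_lakehouse")

    # Hypothetical, heavily simplified demographics table on the CDISC side.
    conn.execute(
        "CREATE TABLE cdisc_archive.dm (usubjid TEXT PRIMARY KEY, arm TEXT)"
    )
    conn.executemany(
        "INSERT INTO cdisc_archive.dm VALUES (?, ?)",
        [("STUDY1-001", "TREATMENT"), ("STUDY1-002", "PLACEBO")],
    )

    # Hypothetical claims table on the real-world data side.
    conn.execute(
        "CREATE TABLE rwe_lakehouse.claims "
        "(patient_token TEXT, service_date TEXT, icd10 TEXT)"
    )
    conn.executemany(
        "INSERT INTO rwe_lakehouse.claims VALUES (?, ?, ?)",
        [
            ("tok-a", "2024-03-01", "E11.9"),
            ("tok-a", "2024-06-15", "E11.9"),
            ("tok-b", "2024-04-10", "I10"),
        ],
    )

    # The identity resolution layer is reduced here to one crosswalk table.
    conn.execute(
        "CREATE TABLE main.subject_link (usubjid TEXT, patient_token TEXT)"
    )
    conn.executemany(
        "INSERT INTO main.subject_link VALUES (?, ?)",
        [("STUDY1-001", "tok-a"), ("STUDY1-002", "tok-b")],
    )
    return conn


def claims_per_arm(conn: sqlite3.Connection) -> dict:
    """Count real-world claims per trial arm without copying either table."""
    rows = conn.execute(
        """
        SELECT dm.arm, COUNT(*)
        FROM cdisc_archive.dm AS dm
        JOIN main.subject_link AS link ON link.usubjid = dm.usubjid
        JOIN rwe_lakehouse.claims AS c ON c.patient_token = link.patient_token
        GROUP BY dm.arm
        ORDER BY dm.arm
        """
    ).fetchall()
    return dict(rows)


conn = build_demo_connection()
result = claims_per_arm(conn)
print(result)
```

The point of the sketch is the shape of the query: the trial data and the claims data stay in their own stores, and the only shared artifact is the crosswalk that the identity resolution layer maintains.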

The Identity Resolution Layer

The identity resolution layer is the component that makes the federation meaningful. Without it, the federated query layer can still join CDISC and RWE datasets, but the joins will not produce coherent analyses because the keys on each side do not refer to the same things.

Identity resolution in pharma RWE involves several distinct identity domains:

  • Subject identity. Clinical trial subjects are identified by trial-internal IDs that are not the same as the patient identifiers in EHR or claims data. Linking trial subjects to their real-world data requires careful tokenization, often through third-party data linkage services, while preserving privacy.
  • Site and provider identity. Clinical trial sites have site IDs that are not the same as the facility identifiers in real-world data sources. Linking requires reference data that maps sites to facilities.
  • Indication and condition identity. Clinical trial indications are coded in MedDRA. Real-world data uses ICD-10, SNOMED CT, and other coding systems. Linking requires cross-coding maps that are maintained over time.
  • Product identity. Clinical trial products are identified by sponsor-internal codes that are not the same as NDC, ATC, or RxNorm codes used in real-world data. Linking requires product reference data.
  • Time and visit identity. Clinical trial visits are scheduled events. Real-world data has irregular care encounters. Linking these on a time axis requires care in how visit windows and real-world encounters are aligned.
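As an illustration of the subject identity domain, the sketch below derives a linkage token with a keyed hash (HMAC-SHA256) over normalized identifying fields. It is a simplified teaching example, not how any particular third-party linkage service works: the choice of fields, the normalization rules, and the function names normalize and subject_token are assumptions made here, and a keyed hash on its own does not satisfy privacy or de-identification requirements.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, punctuation, and whitespace, then uppercase.

    The goal is that trivial formatting differences between sources
    do not change the resulting token.
    """
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def subject_token(
    first_name: str, last_name: str, birth_date: str, secret_key: bytes
) -> str:
    """Return a hex token for one person; birth_date is assumed ISO 8601.

    The same inputs and key always give the same token, so both the trial
    side and the real-world side can compute it independently and join on it.
    """
    message = "|".join(
        [normalize(first_name), normalize(last_name), birth_date]
    ).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Hypothetical key; in practice the key would be held by the linkage party.
demo_key = b"example-key-not-for-production"
token_trial = subject_token("José", "O'Brien", "1980-02-29", demo_key)
token_claims = subject_token("jose ", "OBrien", "1980-02-29", demo_key)
print(token_trial == token_claims)
```

Using a keyed hash rather than a plain hash means a party without the key cannot rebuild tokens by hashing guessed names and birth dates, which is one reason linkage is often delegated to a party that holds the key.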

Each identity domain has established methods, but the integration of all five into a coherent layer requires deliberate design. The work is data engineering with a substantial regulatory and clinical content component, which is why pharma RWE programs often underinvest in the identity resolution layer relative to other components.
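The time and visit domain can be sketched in the same spirit. The example below assigns irregular real-world encounter dates to scheduled trial visits using symmetric visit windows, and sets aside encounters that fall outside every window. The Visit class, the window sizes, and the closest-target tie-breaking rule are assumptions for illustration; a real protocol defines its own windows and rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Visit:
    name: str
    target: date
    window_days: int  # allowed deviation on either side of the target date


def assign_encounters(
    visits: List[Visit], encounter_dates: List[date]
) -> Dict[Optional[str], List[date]]:
    """Map each encounter to the nearest visit whose window contains it.

    Encounters that fall outside every window are collected under None.
    """
    assignment: Dict[Optional[str], List[date]] = {v.name: [] for v in visits}
    assignment[None] = []
    for encounter in sorted(encounter_dates):
        candidates = [
            v for v in visits
            if abs((encounter - v.target).days) <= v.window_days
        ]
        if candidates:
            # When windows overlap, prefer the visit with the closest target.
            best = min(candidates, key=lambda v: abs((encounter - v.target).days))
            assignment[best.name].append(encounter)
        else:
            assignment[None].append(encounter)
    return assignment


schedule = [
    Visit("BASELINE", date(2024, 1, 1), window_days=3),
    Visit("WEEK4", date(2024, 1, 29), window_days=7),
]
encounters = [date(2024, 1, 25), date(2024, 1, 2), date(2024, 2, 20)]
aligned = assign_encounters(schedule, encounters)
print(aligned)
```

Keeping the unassigned encounters visible, rather than silently dropping them, lets analysts see how much real-world activity falls outside the trial's visit structure.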

Governance and Regulatory Considerations

The link-don’t-migrate pattern has several governance and regulatory considerations that need to be addressed explicitly.

The first is data integrity for the CDISC archive. Federation that reads from the CDISC archive must not modify it, and the read patterns must preserve the validated state. This is both a technical and a procedural requirement: read-only federation leaves the archive's integrity intact, but the access patterns need to be governed so that the integrity is verifiably preserved. The principles in the FDA’s data integrity guidance apply.
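One way to picture "verifiably preserved" is a read path that opens the archive read-only and compares a checksum before and after the read. The sketch below does this with a SQLite file standing in for the archive; the file name, the dm table, and the helper names are hypothetical, and a real archive would rely on its own storage controls and audit tooling rather than this exact mechanism.

```python
import hashlib
import os
import sqlite3
import tempfile
from pathlib import Path


def file_sha256(path: str) -> str:
    """Checksum the file in chunks so large archives do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_demo_archive(path: str) -> None:
    """Create a tiny stand-in for a validated archive (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE dm (usubjid TEXT PRIMARY KEY, arm TEXT)")
    conn.execute("INSERT INTO dm VALUES ('STUDY1-001', 'TREATMENT')")
    conn.commit()
    conn.close()


def read_with_integrity_check(path: str):
    """Read through a read-only connection and confirm the file is unchanged."""
    before = file_sha256(path)
    # mode=ro asks SQLite to open the file read-only; writes are rejected.
    conn = sqlite3.connect(Path(path).as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute("SELECT usubjid, arm FROM dm").fetchall()
        try:
            conn.execute("INSERT INTO dm VALUES ('X', 'Y')")
            write_blocked = False
        except sqlite3.OperationalError:
            write_blocked = True
    finally:
        conn.close()
    after = file_sha256(path)
    return rows, write_blocked, before == after


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "cdisc_archive.db")
    build_demo_archive(archive_path)
    rows, write_blocked, unchanged = read_with_integrity_check(archive_path)

print(rows, write_blocked, unchanged)
```

The technical control (read-only access) and the evidence (matching checksums) are separate on purpose: the first prevents modification, and the second gives the governance process something to record.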

The second is regulatory submission readiness. The CDISC archive serves regulatory submissions directly. The federated environment is for analytical work. The two purposes need to be distinguished operationally, so that submission-bound analyses use the appropriate tools and the federated environment does not become an inadvertent submission source.

The third is privacy and consent. Real-world data is governed by HIPAA, state privacy laws, GDPR for EU data, and increasingly by patient consent frameworks. The federation must respect these governance constraints on the RWE side. The CDISC side has its own consent and privacy framework from the original trials. The interaction between the two needs to be governed coherently.

The fourth is the RWE regulatory framework. The FDA’s real-world evidence program articulates expectations for RWE quality, provenance, and analytical methods. The federated environment should be designed to support these expectations from the start, particularly the data quality and provenance requirements that underpin the FDA’s RWE framework guidance.
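Provenance in a federated environment largely comes down to recording which query ran against which versions of which sources. The sketch below shows one possible shape for such a record, with a deterministic fingerprint so that a later reviewer can confirm the record has not been altered. The LineageRecord fields, the version labels, and the record_query helper are hypothetical and are not drawn from any FDA guidance or specific tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class LineageRecord:
    analysis_id: str
    query_text: str
    source_versions: Dict[str, str]  # source name -> dataset version label
    executed_by: str
    executed_at: str  # ISO 8601 timestamp in UTC

    def fingerprint(self) -> str:
        """Hash a canonical JSON form so identical records hash identically."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def record_query(
    analysis_id: str,
    query_text: str,
    source_versions: Dict[str, str],
    executed_by: str,
    now: Optional[datetime] = None,
) -> LineageRecord:
    """Build a lineage record; `now` is injectable so results are reproducible."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return LineageRecord(
        analysis_id, query_text, dict(source_versions), executed_by, timestamp
    )


fixed_time = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
record = record_query(
    "analysis-001",
    "SELECT arm, COUNT(*) FROM ... GROUP BY arm",
    {"cdisc_archive.dm": "v3", "rwe_lakehouse.claims": "2024-05"},
    "analyst@example.com",
    now=fixed_time,
)
print(record.fingerprint())
```

Recording dataset versions alongside the query text is what lets an analysis be traced back to the exact state of the CDISC archive and the lakehouse at the time it ran.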

Sakara Digital perspective: The link-don’t-migrate pattern does not eliminate governance work; it shifts the governance work from data migration validation to federation governance. The total governance investment is comparable, but the federation governance produces value earlier and the federation pattern preserves the optionality to migrate selectively in the future if specific use cases warrant it. Programs that adopt link-don’t-migrate retain more strategic flexibility than programs that commit to a full migration upfront.

The Adoption Sequence That Produces Value Fast

The adoption sequence that produces value within 4 to 9 months follows three phases.

Phase 1: RWE lakehouse foundation (0-3 months). Stand up the RWE lakehouse with a single real-world data source, typically claims data or a single EHR network. Build the curation, harmonization, and access control layers for that data source. The lakehouse produces its first analyses against the single source within Phase 1.

Phase 2: Federation with CDISC (3-6 months). Build the federation query layer connecting the lakehouse to the CDISC archive. Implement identity resolution for one or two specific use cases, typically one indication area where integrated analyses are most valuable. The first integrated analyses spanning CDISC and RWE data are delivered within Phase 2.

Phase 3: Expansion (6-12 months and beyond). Onboard additional real-world data sources to the lakehouse. Extend identity resolution to additional indications. Build the analytical patterns and the user-facing tools that the RWE function uses operationally. By the 12-month mark, the program is producing meaningful integrated analyses across multiple indications and data sources.

The compression of time-to-value relative to the migration-first pattern is the central operational benefit. Programs that follow the link-don’t-migrate pattern typically deliver their first integrated analysis within 6 months, versus 18 to 30 months for programs that complete migration first.

What the Pattern Unlocks for RWE Programs

Once the link-don’t-migrate pattern is operational, three categories of value become accessible faster than they would under the migration pattern.

The first is comparative effectiveness analyses for products with active commercial portfolios. RWE programs can support payer negotiations, label expansion discussions, and post-marketing commitments with integrated analyses that combine trial data with real-world outcomes. The pattern aligns with FDA’s RWE framework, which has consistently emphasized the use of fit-for-purpose real-world data alongside trial data.

The second is portfolio-level epidemiology and target identification. The RWE lakehouse, populated with claims and EHR data, supports epidemiological analyses that inform pipeline strategy and target identification. These analyses do not always require CDISC integration, but they benefit from federation with CDISC for indications where the company has trial data.

The third is faster regulatory engagement on real-world evidence. The FDA, EMA, and other agencies are increasingly engaging with sponsors on RWE submissions and on the role of real-world data in regulatory decisions. Sponsors that have an operational RWE platform can engage in these conversations from a position of capability rather than from a position of “we are still building.” BioPharma Dive coverage of RWE-driven label updates and regulatory engagements consistently highlights this readiness gap.

For pharma RWE leaders, the strategic implication is that the migration premise should be challenged explicitly. The link-don’t-migrate pattern is technically viable, governance-compatible, and substantially faster to value. Programs that commit to the migration premise without examining it tend to spend 18 to 30 months in migration work before producing the integrated analyses that justify the program. Programs that adopt the link-don’t-migrate pattern produce those analyses within 6 to 12 months and retain the optionality to migrate selectively in the future. The strategic posture is to interrogate the premise before committing the program’s first two years to executing on it.

References & Sources


  1. FDA Real-World Evidence Program — FDA. The agency’s framework for the use of real-world data and real-world evidence in regulatory decision-making.
  2. Data Integrity and Compliance With Drug CGMP — FDA Guidance. The agency’s data integrity expectations that apply to the federated environment’s access to validated CDISC archives.
  3. CDISC Standards — Clinical Data Interchange Standards Consortium. The standards framework that governs clinical trial data and that the link-don’t-migrate pattern preserves in place rather than translating.
  4. Deloitte Life Sciences and Health Care — Deloitte. Strategic analysis of RWE program design, including architecture and execution considerations.
  5. BioPharma Dive — BioPharma Dive. Industry reporting on RWE-driven regulatory engagements, label updates, and post-marketing commitments.
  6. IntuitionLabs Articles — IntuitionLabs. Practitioner perspectives on pharma RWE architecture and the data engineering patterns that make integrated analyses feasible at scale.
Amie Harpe Founder and Principal Consultant
Amie Harpe is a strategic consultant, IT leader, and founder of Sakara Digital, with 20+ years of experience delivering global quality, compliance, and digital transformation initiatives across pharma, biotech, medical device, and consumer health. She specializes in GxP compliance, AI governance and adoption, document management systems (including Veeva QMS), program management, and operational optimization — with a proven track record of leading complex, high-impact initiatives (often with budgets exceeding $40M) and managing cross-functional, multicultural teams. Through Sakara Digital, Amie helps organizations navigate digital transformation with clarity, flexibility, and purpose, delivering senior-level fractional consulting directly to clients and through strategic partnerships with consulting firms and software providers. She currently serves as Strategic Partner to IntuitionLabs on GxP compliance and AI-enabled transformation for pharmaceutical and life sciences clients. Amie is also the founder of Peacefully Proven (peacefullyproven.com), a wellness brand focused on intentional, peaceful living.

