The Data Quality Cleanup That Came Before AI: A Mid-Cap Pharma Case Pattern

Executive Summary

Mid-cap pharma companies that have successfully deployed AI in regulated workflows almost always preceded the deployment with twelve to eighteen months of focused data quality cleanup. The cleanup work is unglamorous, underdiscussed in conference talks, and consistently underestimated by leadership teams that want to move directly to AI use cases. The pattern is recognizable enough across the mid-cap segment to extract as a working reference for organizations earlier in the journey.

This article reconstructs the pattern: the starting state most mid-caps share, the first twelve months of inventory and triage work, the governance scaffolding that has to be built in parallel, the remediation and master data consolidation in months twelve to eighteen, and what the pattern actually enables for AI work afterward. We close with the mistakes mid-caps consistently make and how to avoid them.

80% of AI projects that fail through 2026 will do so because of inadequate data quality and data governance infrastructure, not because of model limitations, according to Gartner research synthesized across industry analyses of life sciences AI readiness. [1]

Why the Pattern Is Worth Extracting

The public story of AI in pharma is dominated by the largest companies: AstraZeneca’s workforce training programs, Pfizer’s manufacturing analytics, Novartis’s clinical AI partnerships. The mid-cap segment, which includes hundreds of companies between roughly $500M and $5B in annual revenue, gets far less coverage. This is unfortunate because the mid-cap pattern is more relevant to the majority of pharma organizations than the large-cap pattern is. Most pharma quality leaders work in mid-cap or smaller organizations, not in the global top-10.

The mid-cap pattern is also more honest about the work required. Large-cap pharma can afford to brute-force the data quality cleanup with thousand-person data engineering organizations and tens of millions in tooling. Mid-cap pharma cannot. The mid-cap cleanup is necessarily more targeted, more sequenced, and more visible — which makes the pattern more useful as a reference for organizations that need to make tradeoffs.

The pattern matters for three operational reasons. First, it gives quality leaders a defensible timeline to present to executive leadership when scoping AI initiatives. Second, it surfaces the governance work that has to happen in parallel with the technical work, which is often missed in initial scoping. Third, it provides a realistic cost baseline for the foundational investment that AI requires before the AI itself produces value.

The Starting State Most Mid-Caps Share

The mid-cap starting state is remarkably consistent across the segment. The recurring features:

Fragmented data across functional systems. ERP, MES, LIMS, EDC, CTMS, regulatory information management, pharmacovigilance, commercial CRM — each function has its own system of record, and the systems were typically procured at different times by different teams with limited attention to cross-system data consistency. As Acceldata’s analysis of data quality governance in pharma observes, this fragmentation is one of the most persistent data quality challenges facing the industry, with information silos preventing the integrated analysis that AI applications require.

Inconsistent master data. The same product, customer, supplier, or material is represented differently across systems. Different identifiers, different naming conventions, different attribute structures. Mid-caps generally have no master data management discipline at the start of the journey; consolidation has been deferred because the pain of fragmentation has been absorbed by individual functional teams rather than addressed structurally.

Variable data quality within systems. Even within a single system, data quality varies substantially. Some fields are well-maintained because they drive critical workflows; others are populated inconsistently because they are not enforced and not consumed downstream. The variability is rarely documented; quality leaders typically discover it during the inventory phase.

Limited data lineage. Data flows between systems through batch jobs, API integrations, manual extracts, and reports that were built ad hoc over years. Tracing a specific data element from its origin to its current location is typically a manual archaeological exercise rather than a documented capability.

Documentation gaps. System interfaces are documented unevenly. Data dictionaries exist in some places, not others. Business rules are encoded in code, in stored procedures, in spreadsheet logic, and in tribal knowledge. The combination of these is the actual operational state of the data; reconstructing it requires substantial effort.

Mid-caps that recognize themselves in this description are not failing; they are typical. The starting state reflects organic growth, sequential system acquisition, and consistent prioritization of functional delivery over cross-functional data discipline. The cleanup pattern is the work required to remediate this starting state to a level where AI can be deployed responsibly.
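Once data flows are written down as source-to-target pairs, the tracing exercise described above stops being archaeology and becomes a graph traversal. The following is a minimal Python sketch of that idea, not a description of any particular lineage tool. The system and dataset names are hypothetical, and a real inventory would be harvested from integration configurations and job schedulers rather than typed by hand.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds data into the targets in its list.
# All system and dataset names are illustrative assumptions.
LINEAGE = {
    "LIMS.assay_result": ["warehouse.batch_quality"],
    "MES.batch_record": ["warehouse.batch_quality"],
    "warehouse.batch_quality": ["report.release_dashboard", "export.qa_spreadsheet"],
    "report.release_dashboard": [],
    "export.qa_spreadsheet": [],
}


def downstream(start, edges):
    """Return every node reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream(target, edges):
    """Return every node that feeds `target`, directly or indirectly."""
    reverse = {}
    for src, dests in edges.items():
        for dest in dests:
            reverse.setdefault(dest, []).append(src)
    # Walking the reversed graph from the target yields its origins.
    return downstream(target, reverse)
```

Calling `upstream("report.release_dashboard", LINEAGE)` answers "where did this come from", while `downstream("LIMS.assay_result", LINEAGE)` answers "what is affected if this source changes". Both questions use the same edge list, which is why documenting the flows once pays off twice.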

The First Twelve Months: Inventory, Profiling, Triage

The first twelve months of the cleanup are dominated by inventory, profiling, and triage. The work produces relatively little visible output but creates the foundation that subsequent remediation depends on. The pattern across the mid-cap segment:

Months | Workstream | Output
0-3 | System and data inventory | Catalog of systems, data domains, and primary data flows
2-5 | Data profiling for priority domains | Quality assessment by domain (accuracy, completeness, consistency, timeliness)
4-7 | Master data current-state assessment | Documentation of master data variations across systems
5-9 | Business glossary and data dictionary development | Shared vocabulary for cross-functional data discussions
7-10 | Triage and prioritization | Ranked remediation backlog with business impact framing
9-12 | Governance committee chartering and operating model | Active data governance committee with defined decision rights
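To make the data profiling workstream concrete, here is a small Python sketch of three of the measurements it typically produces: completeness, format validity, and a simple consistency check. The records, field names, and the `M-` identifier format are assumptions invented for the example. A real profiling pass would run against full extracts, usually with a dedicated profiling tool.

```python
import re

# Hypothetical sample of material master records; field names are illustrative.
records = [
    {"material_id": "M-0001", "name": "Lactose", "unit": "kg", "supplier_id": "S-10"},
    {"material_id": "M-0002", "name": "", "unit": "KG", "supplier_id": None},
    {"material_id": "m0003", "name": "Magnesium stearate", "unit": "kg", "supplier_id": "S-11"},
    {"material_id": "M-0004", "name": "Povidone", "unit": None, "supplier_id": "S-10"},
]

ID_PATTERN = re.compile(r"^M-\d{4}$")  # assumed format rule for this example


def completeness(rows, field):
    """Share of rows where `field` is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def validity(rows, field, pattern):
    """Share of rows whose `field` value matches the expected pattern."""
    ok = sum(1 for r in rows if isinstance(r.get(field), str) and pattern.match(r[field]))
    return ok / len(rows)


def distinct_values(rows, field):
    """Sorted distinct non-empty values; mixed spellings hint at inconsistency."""
    return sorted({r[field] for r in rows if r.get(field) not in (None, "")})


profile = {f: completeness(records, f) for f in ("material_id", "name", "unit", "supplier_id")}
id_validity = validity(records, "material_id", ID_PATTERN)
unit_values = distinct_values(records, "unit")
```

Even on four rows the sketch surfaces the kinds of findings the inventory phase tends to produce: a field that is always filled but not always well-formed, and a unit column that carries both `kg` and `KG`.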

As IntuitionLabs’s analysis of pharma AI pilots and data foundations documents, the mid-cap pattern consistently produces a triaged remediation backlog by month nine to twelve, with the highest-priority issues identified and the governance scaffolding in place to address them. This is the inflection point at which active remediation can begin.

The work in the first twelve months is heavily dependent on cross-functional collaboration. The data engineering and analytics teams that own the technical work cannot produce the prioritization on their own; the prioritization requires business stakeholders to articulate which data quality gaps are blocking which business outcomes. Mid-caps that under-invest in the business engagement during this phase consistently produce remediation backlogs that look technical but do not align with business priority.
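One way to picture why the prioritization needs both groups is a scoring sketch in which business stakeholders supply the impact rating and the technical team supplies the spread and effort estimates. The formula, the 1-to-5 scales, and the sample issues below are illustrative assumptions, not a scoring method taken from the sources cited here.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    business_impact: int     # 1 (minor) to 5 (blocks a priority outcome); set by the business
    records_affected: float  # share of records affected, 0.0 to 1.0; from profiling
    effort: int              # 1 (mechanical fix) to 5 (multi-team change); set by engineering


def priority(issue):
    """Illustrative score: impact weighted by spread, discounted by effort."""
    return issue.business_impact * issue.records_affected / issue.effort


# Hypothetical backlog entries for demonstration only.
backlog = [
    Issue("Inconsistent units of measure", 3, 0.40, 1),
    Issue("Duplicate supplier records", 5, 0.15, 4),
    Issue("Missing batch expiry dates", 5, 0.10, 2),
]

ranked = sorted(backlog, key=priority, reverse=True)
```

With the impact column left blank or filled in by engineers alone, the ranking collapses to "widest and cheapest first", which is the technical-looking backlog the paragraph above warns about.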

The Governance Scaffolding That Has to Be Built

The governance scaffolding built during the first twelve months is, in the mid-cap pattern, often more consequential than the technical work itself. The scaffolding determines whether the remediation work persists beyond the initial project funding and whether the AI deployment that follows operates on data quality that is actively maintained.

The core elements of the mid-cap governance scaffolding:

Data governance committee. Cross-functional, with representation from IT, Quality, Regulatory, Commercial, Manufacturing, and R&D. Owns data quality standards, data domain ownership decisions, remediation prioritization, and exception management. The committee operates on a regular cadence (typically biweekly during the cleanup, monthly afterward) and produces documented decisions that the operational teams reference.

Data domain ownership model. Each major data domain — products, materials, customers, sites, assays, study data — has a designated owner from the business with accountability for data quality in that domain. This is not a part-time responsibility; mid-caps that treat data ownership as nominal consistently fail to maintain data quality after the initial cleanup.

Data stewardship roles. Beneath the domain owners, data stewards are responsible for day-to-day data quality monitoring, exception handling, and remediation execution. The stewards are typically embedded in functional teams rather than centralized; their role is to apply governance standards to the operational data their teams produce and consume.

Data quality standards by domain. Each domain has documented standards for what good data looks like — required fields, valid value sets, format rules, cross-field validation rules. Standards are the reference against which remediation is scoped and against which ongoing quality is measured. As the Ideagen guide to data quality frameworks for pharma and life sciences explains, the framework approach has become the predominant pattern in the industry because it produces consistent standards across data domains without requiring centralized control over every data interaction.

Issue management process. When data quality issues are detected — whether by automated monitoring, by business users, or during AI model validation — there is a documented pathway for triage, assignment, remediation, and closure. The process produces a feedback loop that maintains quality over time rather than allowing it to degrade after the cleanup.
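The four kinds of rule named under data quality standards (required fields, valid value sets, format rules, cross-field validation) translate fairly directly into executable checks. The sketch below assumes a hypothetical "product" domain. Every field name and rule in it is invented for illustration, and real standards would be agreed by the domain owner and governance committee.

```python
import re
from datetime import date

# Hypothetical standard for a "product" domain. Every rule is an assumption
# made for illustration, not a regulatory or vendor-defined requirement.
REQUIRED = ("product_id", "name", "dosage_form", "approval_date")
VALID_DOSAGE_FORMS = {"tablet", "capsule", "injection"}
PRODUCT_ID = re.compile(r"^P-\d{5}$")


def validate(record):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    # Required fields
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")
    # Format rule
    pid = record.get("product_id")
    if pid and not PRODUCT_ID.match(pid):
        errors.append(f"bad product_id format: {pid}")
    # Valid value set
    form = record.get("dosage_form")
    if form and form not in VALID_DOSAGE_FORMS:
        errors.append(f"unknown dosage_form: {form}")
    # Cross-field rule: a withdrawal cannot precede approval
    approved = record.get("approval_date")
    withdrawn = record.get("withdrawal_date")
    if approved and withdrawn and withdrawn < approved:
        errors.append("withdrawal_date is earlier than approval_date")
    return errors


example = {
    "product_id": "12",
    "name": "",
    "dosage_form": "syrup",
    "approval_date": date(2020, 1, 1),
    "withdrawal_date": date(2019, 1, 1),
}
violations = validate(example)
```

Returning a list of violations rather than a single pass/fail flag is a deliberate choice here: each entry can become a ticket in the issue management process, so the standard and the feedback loop share one vocabulary.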

Sakara Digital perspective: The single most underestimated element of the mid-cap pattern is the governance scaffolding. Quality leaders who treat the cleanup as a technical project consistently produce one-time improvements that degrade within six months of project completion. Quality leaders who treat the cleanup as a governance build with technical execution embedded consistently produce durable improvements. The framing matters more than the specific technical choices.

Months 12-18: Remediation and Master Data Work

Months twelve to eighteen are when the cleanup work becomes operationally visible. The triaged backlog from the first phase is worked through in priority order, master data consolidation begins, and the governance scaffolding starts producing measurable outcomes.

The remediation work falls into several recognizable buckets:

Reference data alignment. Standardization of reference data across systems — units of measure, country codes, currency codes, regulatory codes, product classifications. This is often the lowest-hanging fruit because the gaps are well-defined and the remediation is mechanical. Reference data alignment also enables cross-system reporting that was previously impossible.

Master data consolidation. Establishing single golden records for products, materials, customers, suppliers, and sites. This is the most consequential and most complex of the remediation work because it requires resolving conflicts between system-of-record claims from multiple functional teams. As P360’s analysis of MDM in pharma describes, master data consolidation is the foundation for the integrated analytics that AI applications require, but the consolidation work itself requires substantial governance engagement.

Critical data element remediation. Within each domain, specific data elements that are critical to downstream use are remediated through a combination of automated correction, manual review, and source system improvement. The work is targeted by the business impact framing developed during the prioritization phase.

Integration and lineage documentation. Data flows between systems are documented and, where possible, instrumented for ongoing monitoring. Lineage documentation enables both impact analysis (what downstream uses are affected by a change in a source system) and traceability (where did this specific data element come from).

Monitoring infrastructure. Automated data quality monitoring is deployed for the highest-priority domains, producing ongoing visibility into quality metrics rather than relying on periodic assessment. The monitoring is the operational foundation for maintaining quality after the cleanup completes.
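The master data consolidation step can be sketched as two operations: matching records that describe the same entity, then applying a survivorship rule to decide which system's value wins for each attribute. The Python below is a deliberately simplified illustration. The source ranking, the exact-match key, and the supplier records are all assumptions, and production MDM platforms use far richer matching than this.

```python
# Hypothetical source ranking: which system wins when values conflict.
# In practice this is a governance decision, often made per attribute.
SOURCE_PRIORITY = {"ERP": 0, "LIMS": 1, "CRM": 2}


def normalize_key(name):
    """Crude match key: lowercase, alphanumerics only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def consolidate(records):
    """Group records by match key and build one golden record per group."""
    groups = {}
    for rec in records:
        groups.setdefault(normalize_key(rec["name"]), []).append(rec)

    golden = []
    for key, members in groups.items():
        members.sort(key=lambda r: SOURCE_PRIORITY[r["source"]])
        merged = {
            "match_key": key,
            # Keep cross-references so every source record stays traceable.
            "source_ids": [f'{r["source"]}:{r["id"]}' for r in members],
        }
        # Survivorship: take each attribute from the highest-priority
        # source that actually has a value for it.
        for attr in ("name", "country", "gmp_status"):
            merged[attr] = next((r[attr] for r in members if r.get(attr)), None)
        golden.append(merged)
    return golden


# Illustrative supplier records from three systems.
suppliers = [
    {"source": "CRM", "id": "8841", "name": "Acme Chem, Inc.", "country": "US", "gmp_status": None},
    {"source": "ERP", "id": "V-17", "name": "ACME CHEM INC", "country": None, "gmp_status": "approved"},
    {"source": "LIMS", "id": "22", "name": "Borealis Labs", "country": "SE", "gmp_status": None},
]
golden = consolidate(suppliers)
```

The code is short, but the `SOURCE_PRIORITY` table is where the conflict between system-of-record claims shows up. Someone has to decide that ranking, which is why consolidation needs governance engagement and not just engineering time.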

The work in months twelve to eighteen is more visible than the work in months zero to twelve because it produces measurable improvement in operational data quality. This visibility is important politically: executive leadership that funded the initial work begins to see the return on investment, which is essential for maintaining funding for ongoing governance.

What the Pattern Actually Enables for AI

Mid-caps that complete the eighteen-month cleanup pattern are positioned to deploy AI in ways that mid-caps starting from the original fragmented state are not. The differences are substantial.

Reliable training data. AI models trained on data that has been through the cleanup pattern produce more reliable outputs because the underlying data more accurately represents the operational reality. The accuracy of training data is the foundation for the credibility of model output, and cleaned data produces credible models that fragmented data cannot.

Defensible validation. Model validation requires reference data sets with known characteristics. Cleaned data produces reference sets that can be defended; fragmented data produces reference sets whose characteristics are uncertain, which undermines validation credibility.

Operational integration. AI models that need to integrate with operational systems require consistent data interfaces. Cleaned master data and documented integration patterns make this integration tractable; fragmented data makes it brittle.

Ongoing monitoring. Production AI requires monitoring of input data quality to detect drift and degradation. The monitoring infrastructure deployed during the cleanup is directly reusable for AI monitoring, providing operational continuity rather than parallel systems.

Regulatory defensibility. When regulatory inspectors ask about the data underpinning an AI deployment, organizations with the cleanup pattern in place can produce documented data governance, lineage, quality metrics, and ownership. Organizations without the pattern produce post-hoc reconstructions that read as exactly what they are.

The IntuitionLabs analysis of the critical role of data quality and data culture in successful AI solutions for pharma articulates this enabling effect explicitly: AI deployments built on cleaned data produce reliable outputs whose credibility can be defended, while deployments built on uncleaned data consistently produce outputs whose credibility is uncertain even when the model itself is well-constructed.
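As an illustration of how cleanup-era monitoring carries over to production AI, the sketch below compares an incoming batch of one numeric model input against a baseline recorded at validation time. The baseline values and thresholds are assumptions chosen for the example, and real deployments would typically track more statistics per field and use a proper drift test.

```python
from statistics import mean

# Hypothetical baseline captured when the model was validated.
# Values and thresholds are illustrative assumptions.
BASELINE = {"null_rate": 0.02, "mean": 7.1}
MAX_NULL_RATE_INCREASE = 0.05  # absolute increase tolerated
MAX_MEAN_SHIFT = 0.10          # relative shift tolerated


def check_batch(values, baseline=BASELINE):
    """Compare one incoming batch of a numeric field against the baseline."""
    if not values:
        return ["empty batch"]
    present = [v for v in values if v is not None]
    null_rate = 1 - len(present) / len(values)
    alerts = []
    # Completeness degradation: more missing values than at validation time.
    if null_rate - baseline["null_rate"] > MAX_NULL_RATE_INCREASE:
        alerts.append(
            f"null rate {null_rate:.2f} exceeds baseline {baseline['null_rate']:.2f}"
        )
    # Distribution drift: the field's average has moved too far.
    if present:
        shift = abs(mean(present) - baseline["mean"]) / baseline["mean"]
        if shift > MAX_MEAN_SHIFT:
            alerts.append(f"mean shifted by {shift:.0%}")
    return alerts


healthy = check_batch([7.0, 7.2, 7.1, 7.1])
degraded = check_batch([9.0, None, 9.2, None])
```

The same check serves both purposes: run on a schedule against source systems it is data quality monitoring, and run on model inputs it is drift detection, which is the operational continuity described above.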

Mistakes Mid-Caps Consistently Make

The mid-cap pattern is recognizable in part because the failure modes are also recognizable. Five mistakes recur consistently and are worth flagging for organizations earlier in the journey.

Mistake 1: Treating the cleanup as a technical project. The cleanup is a governance build with technical execution embedded. Treating it as a technical project produces one-time improvements that degrade. The governance framing is what produces durability.

Mistake 2: Underestimating the timeline. Eighteen months is the realistic minimum for mid-cap pharma to complete the foundational cleanup. Programs that scope twelve months or less consistently underdeliver, and the underdelivery is visible to executive leadership in ways that erode trust in the data team. Honest scoping is more politically defensible than optimistic scoping that fails to deliver.

Mistake 3: Deferring master data work to the second phase. Master data is foundational. Programs that defer it to a “second phase” after the initial cleanup find that the cleanup itself is undermined by the master data gaps. Master data work should be in scope from month one, even if remediation happens later.

Mistake 4: Building monitoring after remediation rather than during. Monitoring infrastructure built after remediation often fails to capture the operational patterns it needs to observe. Monitoring built during remediation, with the remediation work as the test bed, produces durable infrastructure that operates correctly in production.

Mistake 5: Allowing AI use case selection to bypass the cleanup pattern. The most damaging mistake is allowing executive enthusiasm for AI to drive the deployment of AI use cases that depend on data that has not been cleaned. The resulting deployments produce unreliable outputs, undermine trust in AI more broadly, and create remediation pressure that compresses the cleanup work into unworkable timelines. The discipline of completing the foundational work before deploying production AI is the discipline that makes the AI work credible.

The mid-cap pattern is achievable. Organizations earlier in the journey have a tractable problem, not an open one. The discipline is in honest scoping, realistic timelines, governance-first framing, and the patience to complete the foundational work before pursuing the AI use cases that the foundation enables. Quality leaders who hold this line produce sustained AI capability; those who do not end up with a series of failed pilots that consume credibility without producing operational value.

How the pattern interacts with regulatory expectations

One additional dimension worth flagging is how the mid-cap cleanup pattern interacts with the emerging regulatory expectations for AI in pharma. The FDA’s credibility framework, the FDA/EMA Good AI Practice principles, the ICH M15 harmonized framework, and the EMA Annex 22 draft all assume that AI models are built on data whose quality and lineage are documented. Organizations completing the cleanup pattern produce documentation that satisfies these regulatory expectations as a byproduct; organizations skipping the cleanup pattern face mounting documentation burden when AI deployments come under regulatory scrutiny.

This regulatory alignment is itself a significant return on the cleanup investment. The cleanup work is justified by AI enablement, but it pays compounding returns in regulatory defensibility, operational efficiency, and cross-functional data discipline. Quality leaders making the business case for the cleanup should articulate all three dimensions of return rather than framing the work narrowly as an AI prerequisite.

The role of external partnerships in the mid-cap pattern

Mid-caps generally do not complete the cleanup pattern entirely in-house. The recurring pattern involves partnership with specialized data quality consultancies, master data management platform vendors, and data governance advisory firms. The partnerships contribute capabilities that mid-cap data teams typically do not have at sufficient depth — particularly in master data consolidation, data governance operating model design, and data quality tooling implementation.

Mid-caps that attempt the cleanup entirely in-house typically extend the timeline beyond eighteen months and produce uneven coverage across domains. Mid-caps that engage external partners strategically — for specific capability gaps rather than blanket delivery — complete the work on the eighteen-month timeline and produce more even coverage. The strategic engagement of partners is itself a governance decision that should be made at the steering committee level rather than handled tactically.

What success looks like at eighteen months

For organizations earlier in the journey, it is useful to articulate what success looks like at the eighteen-month mark. The recognizable success pattern includes: a chartered and active data governance committee that has been making documented decisions for at least nine months, master data consolidated for at least three priority domains, automated quality monitoring deployed for the top five data domains, a documented data quality remediation backlog with measurable progress, an enterprise data dictionary that is referenced rather than aspirational, and data lineage documented for the data flows that AI use cases will depend on.

This success pattern is achievable for mid-cap pharma with focused investment and disciplined execution. It is not the end of the data quality journey — the work continues — but it is the inflection point at which AI deployment becomes defensible. Quality leaders communicating progress to executive leadership should anchor in this success pattern rather than in incremental metric improvements that fail to capture the structural shift.

References & Sources

  1. A Comprehensive Guide to Data Quality Governance in Pharmaceuticals — Acceldata. Industry analysis of data quality challenges in pharma including the recurring fragmentation pattern that mid-caps share.
  2. Pharma AI Pilots: Fixing Data Foundations for Scale — IntuitionLabs. Practitioner analysis of why AI pilots stall on data foundations and the cleanup pattern required to enable scale.
  3. The Critical Role of Data Quality and Data Culture in Successful AI Solutions for Pharma — IntuitionLabs. Analysis of how data quality enables credible AI outputs in regulated pharma environments.
  4. Data Quality Frameworks guide for pharma and life sciences — Ideagen. Industry guide to data quality framework approaches for pharma, including the governance scaffolding pattern described in the article.
  5. Master Data Management in Pharma: Overcome Data Challenges — P360 Activate. Analysis of master data management in pharma and its role in enabling integrated analytics and AI.
  6. Fixing the Foundations: How Pharma Can Remediate Common Data Quality Issues — Sakara Digital. Practitioner analysis of common data quality remediation patterns in pharma.
Amie Harpe, Founder and Principal Consultant
Amie Harpe is a strategic consultant, IT leader, and founder of Sakara Digital, with 20+ years of experience delivering global quality, compliance, and digital transformation initiatives across pharma, biotech, medical device, and consumer health. She specializes in GxP compliance, AI governance and adoption, document management systems (including Veeva QMS), program management, and operational optimization — with a proven track record of leading complex, high-impact initiatives (often with budgets exceeding $40M) and managing cross-functional, multicultural teams. Through Sakara Digital, Amie helps organizations navigate digital transformation with clarity, flexibility, and purpose, delivering senior-level fractional consulting directly to clients and through strategic partnerships with consulting firms and software providers. She currently serves as Strategic Partner to IntuitionLabs on GxP compliance and AI-enabled transformation for pharmaceutical and life sciences clients. Amie is also the founder of Peacefully Proven (peacefullyproven.com), a wellness brand focused on intentional, peaceful living.

