AI and Data Governance: You Can't Govern AI Without Governing Data

Your organization's AI governance framework probably includes model selection criteria, approval workflows, risk scoring matrices, and vendor questionnaires. Those things matter. But if your data isn't classified, if you can't trace its lineage, if access controls are inconsistent, and if retention policies are aspirational rather than enforced, your AI governance program is built on sand.

I've reviewed dozens of AI policies over the past two years. Most of them focus heavily on the AI itself—what models can be used, what approvals are needed, how outputs should be reviewed. Very few address the underlying data governance foundation that makes any of those controls meaningful. This creates a predictable pattern: organizations discover data quality issues only after an AI system produces problematic outputs, or worse, after a compliance event forces them to explain what data was used to train or fine-tune a model.

AI and data governance aren't separate disciplines. They can't be. Every AI compliance risk—bias, privacy violations, inaccurate outputs, regulatory breaches—traces back to data. If you can't govern the data going into your AI systems, you cannot govern the AI.

The Data Governance Gap Most Organizations Don't See

The enthusiasm around AI adoption has created a dangerous assumption: that existing data governance is "good enough" to support AI use cases. In my experience working with healthcare providers, defense contractors, and other regulated entities, this is almost never true.

Traditional business intelligence and reporting can tolerate a certain amount of data governance sloppiness. If a quarterly report pulls from an inconsistently labeled dataset, someone usually catches it during review. The feedback loop is short, the audience is limited, and the stakes are manageable.

AI systems don't offer the same safety net. They scale instantly, operate with minimal human review, and make decisions or generate content that can reach customers, patients, or regulators before anyone realizes the underlying data was incomplete, mislabeled, or out of scope. A model trained on data you thought was production-ready but that actually included test records, deprecated fields, or improperly retained information doesn't just produce bad outputs—it creates compliance exposure you may not discover until an audit.

The pattern I see most often: an organization deploys an AI tool—a chatbot, a document analysis system, a predictive model—without a clear inventory of what data it touches. When I ask basic questions during an assessment—what data sources does this use, who classified that data, what's the retention schedule, who approved access—the answers are vague. "It pulls from the CRM." "It uses historical records." "The vendor handles that." None of those are governance answers.

Data Classification: The Foundation You Can't Skip

You cannot make intelligent decisions about AI risk if you don't know what kind of data you're feeding into the system. This sounds obvious, but data classification remains one of the weakest areas in most compliance programs.

Data classification isn't just labeling files as "confidential" or "public." It's a systematic process of identifying what data you have, where it lives, who owns it, what regulatory or contractual obligations apply to it, and what risk it represents. For AI purposes, classification also needs to capture whether data is approved for training, whether it contains attributes that could introduce bias, and whether it falls under restrictions that would prohibit automated decision-making.

In healthcare environments governed by HIPAA, this becomes especially critical. An AI tool processing protected health information (PHI) without proper classification and safeguards isn't just a technical problem—it's a regulatory violation. I've seen organizations deploy AI-powered documentation tools that ingested PHI without anyone confirming that the data was properly classified, that the vendor had signed a business associate agreement, or that the use case was even appropriate under the organization's HIPAA compliance program.

Defense contractors face similar challenges with Controlled Unclassified Information (CUI) and ITAR-controlled technical data. If you're using AI to analyze engineering documents, generate reports, or assist with proposal development, you need to know whether those datasets contain CUI or export-controlled information. Feeding ITAR technical data into a commercial AI service without appropriate controls isn't just an innovation risk. It can be an export violation.

Effective data classification for AI governance requires three things most organizations don't have in place:

Automated discovery and labeling: Manual classification doesn't scale to the data volumes AI systems consume. You need tooling that can scan repositories, identify sensitive data types, and apply labels consistently.
Clear AI-specific categories: Traditional classification schemes (public, internal, confidential, restricted) don't answer the questions AI governance needs answered. You need to know whether data can be used for training, whether it contains personal information subject to data subject rights, and whether it includes attributes protected under anti-discrimination law.
Enforceable policies tied to classification: Classification is meaningless if it doesn't drive access control, retention, and usage decisions. If your AI platform can access any data regardless of its label, your classification program is performative.
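
To make the second and third points concrete, here is a minimal sketch of a classification label that carries AI-specific attributes and a policy check driven by it. Everything in it is an assumption for illustration: the DataLabel fields, the training_use_decision function, and the rule that blocks ITAR data from vendor-hosted services are hypothetical, not a standard scheme or any product's API.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


@dataclass(frozen=True)
class DataLabel:
    """Hypothetical classification label; field names are illustrative."""
    dataset: str
    owner: str
    sensitivity: Sensitivity
    obligations: frozenset          # e.g. {"HIPAA"}, {"ITAR"}, {"GDPR"}
    approved_for_training: bool     # AI-specific: may this data train a model?
    has_personal_data: bool         # AI-specific: subject to data subject rights?
    has_protected_attributes: bool  # AI-specific: covered by anti-discrimination law?


def training_use_decision(label: DataLabel, vendor_hosted: bool) -> tuple[bool, str]:
    """Return (allowed, reason) for using a dataset to train a model.

    The rules below are an example policy, not a statement of what any
    regulation requires.
    """
    if not label.approved_for_training:
        return False, "dataset is not approved for training"
    if vendor_hosted and "ITAR" in label.obligations:
        return False, "ITAR-labeled data is blocked from vendor-hosted services"
    if label.has_protected_attributes:
        return True, "allowed, but a bias review is required first"
    return True, "allowed"
```

The point of the sketch is the shape, not the specific rules: the label answers the questions AI governance asks, and the platform consults it before data is used rather than after.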

Data Lineage: Knowing Where Your AI's Inputs Come From

When an AI system produces a problematic output—a biased recommendation, a privacy violation, an inaccurate clinical summary—the first question investigators ask is: what data did it use? If you can't answer that question with specificity, your governance program has failed.

Data lineage is the ability to trace data from its origin through every transformation, aggregation, and use. For AI governance, this means knowing:

What source systems contributed data to a training dataset or knowledge base
What transformations, filters, or enrichments were applied
When the data was collected and when it was last updated
Who approved the data for this specific use case
What subset of the data the model actually accessed during training or inference

Without lineage, you're managing AI risk blind. You can't assess whether a model was trained on representative data. You can't identify whether sensitive data leaked into a training set. You can't respond to a data subject access request that asks whether someone's personal information was used to train a model. You can't demonstrate to an auditor that you complied with data minimization principles.

I've worked with healthcare organizations that couldn't trace which patient records contributed to an AI model's training data. They knew the model was built using "historical encounter data," but they couldn't specify the date range, the patient population, the encounter types, or whether any of those records had since been subject to deletion requests. That's not a technical gap—it's a compliance failure that becomes indefensible during an audit.

Data lineage for AI doesn't happen by accident. It requires deliberate architecture:

Metadata capture at every stage: When data moves from a source system to a data lake, from a lake to a training pipeline, or from a pipeline to a model, that movement needs to be logged with sufficient detail to reconstruct the chain later.
Versioning and immutability: Training datasets need version control. You should be able to identify exactly which version of a dataset was used to train a specific model version, and that dataset version should be immutable.
Integration with governance workflows: Lineage isn't just for forensics. It should feed into approval processes. Before a model goes into production, someone should review its lineage and confirm that the data sources and transformations align with policy.
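
The versioning point can be sketched with nothing more than a content hash. This is a minimal illustration, assuming the training data fits in memory as a list of JSON-serializable rows; dataset_fingerprint and verify_training_input are hypothetical names, and in practice a data versioning tool would do this work.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict]) -> str:
    """SHA-256 over a canonical encoding of the rows.

    Rows are serialized with sorted keys and then sorted, so the
    fingerprint does not depend on row order or key order.
    """
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def verify_training_input(rows: list[dict], registered_fingerprint: str) -> bool:
    """Check that the data about to be used matches the approved version."""
    return dataset_fingerprint(rows) == registered_fingerprint
```

Recording the fingerprint next to the model version at approval time gives a reviewer something concrete to check: if the data changed after approval, the fingerprints no longer match.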

Access Control: Governing Who Can Use Data for AI

AI systems often consolidate data from multiple sources, crossing organizational boundaries that were carefully constructed for a reason. A model trained on aggregated customer data might pull from sales, support, billing, and product usage systems—each of which has different access controls in the source environment. If the AI platform doesn't enforce equivalent controls, you've just created a data leakage path.

This is where AI and data governance intersect most directly with identity and access management. The principle is straightforward: access to data through an AI system should be governed by the same controls that apply to direct access. If a user isn't authorized to query the finance database directly, they shouldn't be able to extract financial data by asking an AI assistant that has access to it.

In practice, this is harder than it sounds. Many AI platforms operate with broad service account permissions, accessing data on behalf of all users rather than enforcing user-level entitlements. This is convenient for the platform vendor, but it's a governance problem for you. If your organization operates under role-based access controls—especially in regulated environments—you need to ensure those controls extend to AI interactions.

For organizations subject to HIPAA, this becomes a specific requirement under the minimum necessary standard. An AI tool that allows any authenticated user to access any patient record, regardless of their role or treatment relationship, is inconsistent with that standard. I've seen this pattern in AI-powered clinical decision support tools and documentation assistants: technically sophisticated, but deployed without the access controls that the rest of the EHR enforces.

Defense contractors face similar challenges under CMMC and NIST SP 800-171. If your AI platform can access CUI without enforcing least privilege, without logging access, or without applying need-to-know restrictions, you're not meeting the access control requirements. The fact that it's an AI system rather than a traditional application doesn't change the obligation.

Access Control Patterns That Work for AI

Effective access control for AI requires integration between your identity provider, your data classification system, and the AI platform. Here's what that looks like in practice:

User-context enforcement: The AI platform authenticates users through your identity provider and applies their entitlements before accessing data. This isn't optional for regulated environments.
Data-level permissions: Access control applies at the data object level, not just the system level. A user authorized to access certain customer records but not others should see those restrictions honored when interacting with an AI tool.
Audit logging tied to identity: Every data access by the AI system should be logged with the user's identity, not just a service account. You need to know who asked the question that caused the model to retrieve specific data.
Approval workflows for sensitive use cases: Access to use certain datasets for training or fine-tuning should require explicit approval, not just inherited permissions from production access.
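
The first three patterns can be sketched as a filter that sits between retrieval and the model. This is a simplified illustration, not any platform's API: Document, User, retrieve_for_user, and the in-memory audit_log are hypothetical, and a real deployment would pull entitlements from the identity provider and write to a tamper-resistant log.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    doc_id: str
    required_entitlement: str  # entitlement needed to read this object
    text: str


@dataclass(frozen=True)
class User:
    user_id: str
    entitlements: frozenset    # assumed to come from the identity provider


# Stand-in for a real audit store.
audit_log: list[dict] = []


def retrieve_for_user(user: User, candidates: list[Document]) -> list[Document]:
    """Drop documents the user may not read before the model sees them,
    and log the access under the user's identity, not a service account."""
    allowed = [d for d in candidates if d.required_entitlement in user.entitlements]
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user.user_id,
        "returned": [d.doc_id for d in allowed],
        "withheld": [d.doc_id for d in candidates if d not in allowed],
    })
    return allowed
```

The design choice that matters is where the filter runs: before the model receives any content, so a restricted document can never influence an answer to a user who couldn't have opened it directly.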

Data Retention: The AI Complication No One Wants to Talk About

Your retention policies were probably designed for structured databases, email systems, and file shares. They likely weren't written with AI in mind. This creates a significant gap: what happens to data that's been ingested into a training dataset, embedded into a model's weights, or cached in a vector database after the retention period expires?

The legal and regulatory answer is clear: data subject to retention limits needs to be deleted when those limits are reached, regardless of where it lives. The technical reality is messier. Once data is baked into a machine learning model, you can't simply delete a record and assume the model no longer "knows" it. Depending on the architecture, the data may persist in model weights, in fine-tuning layers, or in retrieval-augmented generation (RAG) indexes.

This isn't theoretical. Under GDPR, individuals have the right to erasure. Under the CCPA and other state privacy laws, consumers can request deletion. Under HIPAA, patients can request restrictions or amendments. If your AI system was trained on or has indexed data subject to these rights, you need a process to honor them. "The data is in the model, we can't remove it" is not a compliant answer.

I've seen organizations struggle with this in predictive analytics and customer service automation. A customer requests deletion under state privacy law. The database team confirms deletion from the CRM. But the AI-powered recommendation engine was trained on historical data that included that customer's purchase and browsing history. The model continues to generate recommendations influenced by that data. The organization has a problem.
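
A process for that scenario has two parts: remove what can be removed, and identify what can't. The sketch below is a deliberately simplified illustration under stated assumptions: the retrieval index is a plain dict keyed by chunk ID, each model version has a manifest listing the subject IDs in its training data, and handle_erasure_request is a hypothetical name.

```python
def handle_erasure_request(subject_id, rag_index, training_manifests):
    """Propagate a deletion request beyond the source database.

    rag_index: dict of chunk_id -> {"subject_id": ..., "text": ...}
    training_manifests: dict of model_version -> set of subject IDs
        whose data was in that model's training set.

    Returns (removed_chunk_ids, models_flagged_for_review).
    """
    # Retrieval-time data can be deleted directly.
    removed = [cid for cid, chunk in rag_index.items()
               if chunk["subject_id"] == subject_id]
    for cid in removed:
        del rag_index[cid]

    # Data embedded in weights cannot; flag those models for a
    # retraining or documentation decision instead.
    flagged = sorted(model for model, subjects in training_manifests.items()
                     if subject_id in subjects)
    return removed, flagged
```

Note that the second step only works if the training manifests exist, which is the lineage problem again.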

Retention Strategies for AI Systems

Managing retention in AI environments requires thinking about data lifecycle differently:

Retention-aware training pipelines: Before data is included in a training set, check its retention status. Data approaching its retention limit shouldn't be used for training a model that will be in production for years.
Model retraining cycles tied to retention: If your retention policies require data deletion on a regular cycle, your model retraining schedule needs to align. A model trained on data that's since been deleted may need to be retrained from scratch with compliant data.
Separation of training data from inference data: RAG systems and knowledge bases that retrieve data at inference time are easier to manage for retention than models where data is embedded in weights. If retention compliance is a significant concern, architectural choices matter.
Documentation of what can't be undone: In some cases, you may determine that certain data cannot be fully removed from a model without retraining. That determination needs to be documented, and it may affect whether you can use that data for AI in the first place.
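
The first strategy, a retention-aware training pipeline, comes down to one comparison per record. Here is a minimal sketch, assuming each record carries a hypothetical delete_by date and that you can estimate how long the model will stay in production; eligible_for_training is an illustrative name.

```python
from datetime import date, timedelta


def eligible_for_training(records, today, planned_model_lifetime):
    """Keep only records whose retention deadline falls after the
    model's planned retirement date.

    records: list of dicts with a "delete_by" date (hypothetical field).
    planned_model_lifetime: timedelta the model is expected to be in use.
    """
    retire_on = today + planned_model_lifetime
    return [r for r in records if r["delete_by"] > retire_on]
```

For example, with a two-year planned lifetime, a record due for deletion in six months is excluded at ingestion, so it never creates a retraining obligation later.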

Vendor AI and the Governance Problem You're Inheriting

Most AI deployments in regulated organizations aren't built in-house. They're vendor products: SaaS platforms with embedded AI, specialized tools for document analysis or customer service, off-the-shelf models fine-tuned for industry use cases. Every one of these vendor relationships imports someone else's data governance—or lack thereof—into your environment.

When I conduct third-party risk assessments for AI vendors, the questions I ask aren't about model architecture or accuracy metrics. They're about data governance:

What data does your system collect from us, and how is it classified?
Is our data used to train or improve models used by other customers?
Where is our data stored, and who has access to it?
What is your data retention policy, and can we enforce deletion on our schedule?
How do you handle data subject rights requests?
Can you provide lineage showing what data contributed to outputs generated for us?
What happens to our data if we terminate the contract?

The answers to these questions are often unsatisfying. Many AI vendors operate on assumptions that don't align with regulated industry requirements. They may retain data indefinitely for model improvement. They may not support user-level access controls. They may not be able to trace lineage. They may not even be willing to sign a business associate agreement if you're in healthcare.

This is where AI governance and procurement need to converge. If your vendor can't meet your data governance requirements, the AI capabilities they offer are irrelevant. You can't comply your way out of a vendor's architectural limitations. The conversation needs to happen before the contract is signed, and it needs to include your legal, compliance, and information security teams—not just the business unit excited about the AI features.

The Intersection of AI Governance and Regulatory Frameworks

AI-specific regulations are emerging, but most AI governance obligations are already embedded in existing frameworks. HIPAA doesn't have an "AI section," but it has clear requirements about safeguarding PHI, limiting access to the minimum necessary, and ensuring business associate agreements cover all disclosures. Those requirements apply whether you're using a traditional database or an AI-powered clinical tool.

Similarly, CMMC and NIST SP 800-171 don't explicitly mention AI, but they have detailed controls for protecting CUI, enforcing access controls, conducting risk assessments, and managing third-party risk. If you're using AI to process CUI, as many defense contractors are, those controls apply.

The EU AI Act introduces AI-specific obligations, but even there, much of the compliance burden rests on data governance. The Act's requirements around transparency, documentation, human oversight, and risk management all depend on your ability to govern the data flowing through AI systems. If you can't demonstrate what data you used, how you validated it, and how you minimized bias, you can't demonstrate compliance with the Act's risk-based obligations.

Organizations waiting for "AI regulations" before building governance are missing the point. The regulatory foundation already exists. Executive leaders should be asking whether their existing compliance programs extend to AI use cases, not whether they need a separate AI compliance program.

Making Data Governance an Enabler, Not a Blocker

The risk with emphasizing data governance is that it sounds like a barrier to AI adoption. That's the wrong framing. Good data governance makes AI adoption faster and safer. It reduces the risk of deploying systems on bad data. It shortens the time required for compliance reviews. It prevents the kind of post-deployment discoveries that force you to pull systems offline and start over.

Organizations with mature data governance programs—accurate classification, documented lineage, enforced access controls, clear retention policies—can evaluate and deploy AI tools quickly because they already know what data they have and what they can do with it. They can answer vendor questions. They can complete risk assessments without getting stuck on basic data inventory questions. They can move fast because the foundation is solid.

Organizations without that foundation end up in a cycle of pilot projects that never reach production because someone finally asks the data governance questions and no one can answer them. Or worse, they deploy without asking, and the governance gap becomes an incident.

If you're leading AI strategy and you're not investing equally in data governance, you're building on a foundation that won't hold. The AI governance framework you need isn't separate from data governance—it's built on top of it. Classification, lineage, access control, and retention aren't obstacles to AI. They're the prerequisites for doing it in a way that doesn't create unmanageable risk.

What CISOs and Compliance Leaders Should Do Now

If your organization is deploying AI—or preparing to—the data governance work can't wait until after the AI strategy is finalized. It needs to happen in parallel, and in many cases, it needs to come first.

Start with an honest assessment of your current data governance posture. Can you produce an inventory of your data sources, classified by sensitivity and regulatory obligation? Can you trace the lineage of data flowing into your most critical systems? Are your access controls enforced consistently across platforms, or do they vary by system? Do you have retention policies that are actually enforced, or are they documented but not implemented?

If the answers to those questions expose gaps, those gaps will become AI governance failures. Address them now, before they're compounded by AI complexity.

For AI-specific initiatives, require data governance review as part of the approval process. Before any AI tool goes into production, someone needs to document what data it accesses, how that data is classified, whether access controls are enforced, and how retention will be managed. This doesn't need to be a bureaucratic process, but it does need to produce clear answers.
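
A lightweight version of that review can be a four-field record with a check that rejects blanks and non-answers. The sketch below is illustrative: DataGovernanceReview, its fields, and the list of phrases treated as non-answers are all assumptions, not a prescribed template.

```python
from dataclasses import dataclass, fields


@dataclass
class DataGovernanceReview:
    """Hypothetical pre-production review record for one AI tool."""
    tool_name: str
    data_sources: str = ""     # what data the tool accesses
    classification: str = ""   # how that data is classified
    access_controls: str = ""  # how access controls are enforced
    retention_plan: str = ""   # how retention will be managed


# Responses that do not count as governance answers (example list).
NON_ANSWERS = {"", "tbd", "the vendor handles that"}


def unanswered(review: DataGovernanceReview) -> list[str]:
    """Return the names of fields that are blank or a known non-answer."""
    return [f.name for f in fields(review)
            if getattr(review, f.name).strip().lower() in NON_ANSWERS]
```

An approval workflow could then hold a tool out of production until unanswered returns an empty list, which keeps the process short while still forcing clear answers.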

Work with your vendors to understand their data governance practices and ensure they align with your requirements. If they can't or won't meet your standards, that's a signal. Either the vendor isn't mature enough for regulated use cases, or you need to architect around their limitations.

And recognize that AI governance and data governance are not separate workstreams. They're the same work. You cannot govern AI without governing data. The organizations that figure this out early will have a significant advantage—not just in managing risk, but in capturing the value AI can provide when it's built on a solid foundation.