Most organizations racing to deploy AI are asking the wrong first question. They want to know which model to use, how to write policies, or how to satisfy the EU AI Act. But you can't govern AI effectively if you don't know what data it's consuming, where that data came from, who had access to it, or whether you were supposed to keep it in the first place. AI governance depends on data governance — the unglamorous, infrastructure-level work that most organizations have been postponing for years.
I see this pattern constantly: leadership greenlights an AI pilot, IT spins up a vendor integration, and six weeks later someone in compliance asks where the training data came from. No one knows. Was it customer data? Protected health information? Controlled technical data under ITAR? The answer is usually "probably some of each." At that point, either the project stops or the organization accepts risk it doesn't understand.
AI and data governance aren't separate disciplines. One is simply the application layer of the other. If your data governance is weak — if you lack accurate inventories, clear classification schemes, enforced access controls, or defensible retention practices — your AI governance will inherit every one of those weaknesses and amplify them at scale.
The Data Governance Debt Comes Due
For the past decade, many organizations have treated data governance as a compliance checkbox rather than an operational discipline. They built data inventories during audit prep and let them go stale. They wrote data classification policies that no one outside legal could understand, much less follow. They designed access controls that accumulated exceptions until the controls became meaningless.
AI exposes all of it. When you feed data into a model, you're not just storing or transmitting it — you're transforming it into decision-making logic that might persist for years and affect thousands of downstream transactions. If you didn't know that dataset contained PII, you'll find out when your model starts making inferences about individuals. If you didn't enforce your retention policy, you'll discover it when a model trained on data you should have deleted three years ago produces a result you now have to defend in litigation.
The technical properties of AI make poor data governance far more damaging. Models are trained on datasets that may aggregate information from dozens of sources, crossing jurisdictions, sensitivity classifications, and retention schedules. A single training run can blend customer relationship data, employee files, third-party research, scraped web content, and vendor-supplied datasets into a single artifact. If you cannot say what data went in, you cannot tell your general counsel, your auditor, or your regulator what the model "knows."
What Happens When You Get It Wrong
In regulated industries, the consequences are not hypothetical. Healthcare organizations using AI diagnostic tools have struggled to demonstrate that training data complied with minimum necessary standards under HIPAA. Defense contractors deploying AI to analyze technical data are discovering, too late, that ITAR-controlled information was mixed into commercial datasets. Financial institutions are finding that models trained on transaction data can inadvertently expose patterns that reveal information about individuals who never consented to AI processing.
The regulatory response is underway. The EU AI Act imposes explicit data governance requirements on the training, validation, and testing data behind high-risk AI systems. The FTC has made clear that data practices (provenance, consent, fairness) are within scope when evaluating AI deployments. NIST's AI Risk Management Framework ties trustworthiness to data quality, integrity, and privacy controls. Sector-specific scrutiny is increasing too: the use of AI tools with data covered by HIPAA is drawing closer attention, and DoD contractors face overlapping requirements from CMMC, DFARS, and ITAR when defense-related data enters AI systems.
Data Lineage: You Need to Know Where It Came From
Data lineage is the record of where data originated, how it moved through your systems, what transformations it underwent, and where it ultimately landed. In traditional data management, lineage is useful for debugging ETL pipelines and satisfying auditors. In AI governance, it's load-bearing infrastructure.
You cannot assess the legal or regulatory risk of an AI model if you cannot trace its training data back to source. You cannot respond to a data subject access request if you don't know which datasets contributed to a model that made a decision about that person. You cannot retire a vendor relationship cleanly if you cannot identify which models were trained on that vendor's data.
In my experience, fewer than a third of organizations have tooling or process to track data lineage reliably, and almost none extend that lineage tracking through to AI model training. Most can tell you what database a report drew from, but they cannot tell you what sources contributed to a feature engineering pipeline that fed a model now running in production. That gap is not sustainable once AI is making decisions that have legal, financial, or safety impact.
Building Lineage That Scales
Good lineage starts with inventory: an accurate, current, and queryable catalog of data assets. That catalog should include not just structured databases, but also file shares, cloud storage buckets, SaaS application data, data lake partitions, and third-party data feeds. Every asset should be tagged with origin, classification, jurisdiction, sensitivity, and retention schedule.
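As a rough illustration of what "queryable" can mean at the smallest scale, here is a minimal sketch of a catalog entry carrying the tags listed above. The class names, field names, and four-level scheme are assumptions made for the example, not a standard; a real catalog would live in a database or a dedicated tool rather than in memory.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical four-level scheme; substitute your own levels.
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


@dataclass(frozen=True)
class DataAsset:
    asset_id: str
    location: str                 # e.g. a bucket URI or database name
    origin: str                   # system or vendor that produced the data
    classification: Classification
    jurisdiction: str             # e.g. "US" or "EU"
    sensitivity: frozenset        # free-form tags such as {"PII", "PHI"}
    retention_days: int
    collected_on: date

    def delete_by(self) -> date:
        """Date by which the retention schedule says the asset must be gone."""
        return self.collected_on + timedelta(days=self.retention_days)


def query(catalog, *, min_classification=None, jurisdiction=None, tag=None):
    """Return the assets that match every filter that was supplied."""
    matches = []
    for asset in catalog:
        if min_classification is not None and asset.classification < min_classification:
            continue
        if jurisdiction is not None and asset.jurisdiction != jurisdiction:
            continue
        if tag is not None and tag not in asset.sensitivity:
            continue
        matches.append(asset)
    return matches
```

With entries like these, "which EU assets contain PII?" becomes a call such as query(catalog, jurisdiction="EU", tag="PII") rather than a week of emails.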
Next, you need to capture transformations. When data moves from one system to another, when it's aggregated or anonymized, when it's enriched with external data, that event should be logged in a way that allows you to reconstruct the chain later. This doesn't require expensive tooling at the start — many organizations begin with structured metadata and audit logs before investing in dedicated lineage platforms. But it does require discipline and process enforcement.
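To make the "structured metadata and audit logs" starting point concrete, the sketch below records each transformation as an append-only event and walks the log backwards to find the original sources behind a derived dataset. The names LineageEvent and LineageLog are invented for this example, and the in-memory list stands in for whatever durable log store you actually use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    output_id: str        # dataset produced by the transformation
    input_ids: tuple      # datasets it was derived from
    operation: str        # e.g. "join", "anonymize", "aggregate"
    actor: str            # person or service account that ran it
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class LineageLog:
    def __init__(self):
        self._events = []

    def record(self, output_id, input_ids, operation, actor):
        event = LineageEvent(output_id, tuple(input_ids), operation, actor)
        self._events.append(event)
        return event

    def upstream_sources(self, dataset_id):
        """Return every root source that feeds dataset_id, directly or indirectly."""
        # If a dataset was produced more than once, the latest event wins.
        produced_by = {e.output_id: e for e in self._events}
        sources, seen, stack = set(), set(), [dataset_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            event = produced_by.get(current)
            if event is None:
                # No recorded producer: treat it as an original source.
                sources.add(current)
            else:
                stack.extend(event.input_ids)
        return sources


log = LineageLog()
log.record("crm_clean", ["crm_raw"], "deduplicate", "etl-service")
log.record("features_v1", ["crm_clean", "vendor_feed"], "join", "data-scientist")
```

Here log.upstream_sources("features_v1") traces the feature set back to crm_raw and vendor_feed, which is the question an auditor will eventually ask.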
Finally, lineage must extend into the AI pipeline. When a data scientist pulls a dataset for model training, that action should be recorded with enough context to answer basic questions: What data was included? What was excluded and why? What preprocessing was applied? What version of the model resulted? Who approved the training run? These are not aspirational questions. They're the questions your legal team will ask after an incident, and the questions regulators are already asking in enforcement actions.
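One lightweight way to capture those answers is a manifest written alongside every training run. The sketch below is an assumption about what such a record might contain; the field names and the validation rules are illustrative, not drawn from any particular MLOps platform.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingRunRecord:
    model_version: str
    included_datasets: tuple   # what data was included
    excluded_datasets: dict    # dataset id -> reason it was excluded
    preprocessing: tuple       # ordered preprocessing steps
    approved_by: str           # who approved the training run

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the record, useful for detecting later edits."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


def problems(record):
    """List the gaps that would make this record hard to defend later."""
    found = []
    if not record.included_datasets:
        found.append("no included datasets listed")
    if not record.approved_by.strip():
        found.append("no approver recorded")
    for dataset_id, reason in record.excluded_datasets.items():
        if not reason.strip():
            found.append(f"no exclusion reason for {dataset_id}")
    return found
```

A pipeline could refuse to start a training job while problems(record) returns anything, which turns the questions above from an after-the-fact interview into a precondition.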
Data Classification: If You Can't Label It, You Can't Govern It
Classification is the process of identifying and marking data according to its sensitivity, legal status, and handling requirements. It's foundational to every other control: access rules depend on classification, retention schedules depend on classification, and breach notification thresholds depend on classification. Yet most organizations have classification schemes that are either too vague to enforce or too complex for anyone to follow.
AI makes bad classification immediately visible. When you ask a model to process data, it doesn't interpret your classification policy — it consumes whatever you feed it. If you misclassified customer payment information as "internal business data," the model won't know the difference. If you failed to mark a dataset as containing ITAR technical data, the model won't refuse to process it. The model is indifferent to your schema. It will do exactly what you told it to do, using exactly the data you provided, and you will own the outcome.
What Actually Works
Effective classification schemes share certain traits. They are simple: three to five levels, clearly defined, with bright lines between categories. They are role-aligned: people can apply the scheme without needing a law degree or a data science background. And they are enforced: technical controls prevent misclassified data from being used in ways the classification forbids.
For AI governance, classification must be machine-readable and programmatically enforceable. If your classification lives in a SharePoint document and depends on user judgment at the point of data access, it will not survive contact with an AI pipeline where datasets are pulled programmatically, merged automatically, and fed into training jobs that run overnight. You need metadata that travels with the data and policies that systems can evaluate without human intervention.
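A minimal sketch of what "policies that systems can evaluate without human intervention" might look like follows. The levels, the purpose names, and the ceilings in MAX_LEVEL_FOR_PURPOSE are all hypothetical; the point is only that a training job can call a function like this before it touches any data, and that missing labels and unknown purposes are denied by default.

```python
from enum import IntEnum


class Level(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical policy: the highest classification each purpose may consume.
MAX_LEVEL_FOR_PURPOSE = {
    "reporting": Level.CONFIDENTIAL,
    "model_training": Level.INTERNAL,
}


def evaluate(datasets, purpose):
    """Return (allowed, reason) for using the given datasets for a purpose.

    `datasets` maps dataset id -> Level, or None when the label is missing.
    """
    ceiling = MAX_LEVEL_FOR_PURPOSE.get(purpose)
    if ceiling is None:
        return False, f"no policy defined for purpose '{purpose}'"
    if not datasets:
        return False, "no datasets supplied"
    unlabeled = sorted(name for name, level in datasets.items() if level is None)
    if unlabeled:
        return False, f"unlabeled datasets: {', '.join(unlabeled)}"
    # A merged dataset is as sensitive as its most sensitive input.
    highest = max(datasets.values())
    if highest > ceiling:
        return False, f"{highest.name} exceeds {ceiling.name} ceiling for {purpose}"
    return True, "ok"
```

Treating a merged dataset as inheriting its most sensitive input is a deliberately conservative design choice: it blocks the overnight job that quietly joins a restricted table into an otherwise harmless training set.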
In regulated industries, classification must map to legal definitions. In healthcare, you need to distinguish between protected health information and de-identified data under HIPAA standards. In defense, you need to mark controlled unclassified information (CUI) and ITAR-controlled technical data according to federal guidelines. In privacy-regulated environments, you need to identify personal data, special category data, and data subject to cross-border transfer restrictions. Your AI governance framework inherits these obligations, and your classification scheme is the mechanism for ensuring compliance.
Speaking on AI Governance and Data Privacy
Carl B. Johnson delivers keynotes on AI governance, regulatory compliance, and data privacy strategy for organizations navigating real-world risk. His presentations are built on experience, not theory, and tailored to the challenges leadership teams face when deploying AI in regulated industries.
Access Control: Who Touches the Data Touches the Model
Access control determines who can read, modify, or delete data. In traditional environments, access failures lead to unauthorized disclosure or tampering. In AI environments, access failures lead to unauthorized disclosure, tampering, and training — and the training part is where things get complicated.
When someone with excessive access pulls data for model training, they embed that access decision into the model. If a contractor with temporary credentials downloads a dataset that includes customer financial records, and that dataset is later used to train a customer service AI, you've now enshrined a policy violation in production logic. The model doesn't forget. The model doesn't respect the expiration of the contractor's access. The model operates on the data it was given, and auditing what that data contained after the fact is harder than preventing the access in the first place.
This pattern shows up constantly in shadow AI deployments, where teams use personal cloud accounts or unapproved SaaS tools to bypass IT controls. A product manager uploads a CRM export to an AI coding assistant to generate customer segmentation logic. A financial analyst feeds transaction data into a third-party forecasting tool. An HR director uses a generative AI service to draft performance review templates based on employee records. In every case, data left the environment where access controls applied and entered one where they didn't. That's not an AI governance failure — it's a data governance failure that AI made visible.
Designing Access Controls for AI Workflows
AI workflows are different from traditional data access patterns, and access controls need to account for those differences. Data scientists and ML engineers often need broader access than typical users because their job is to explore datasets, identify patterns, and prototype models. Locking down access so tightly that they cannot work is not a viable strategy. But giving them unrestricted access and hoping they'll self-govern is not viable either.
The answer is structured access with explicit logging and approval workflows. Data access for AI purposes should require a documented use case, a defined scope, and a time limit. Access should be logged with enough detail to reconstruct what data was accessed, when, and for what purpose. And access should be reviewed regularly — not annually in a compliance exercise, but continuously as part of operational governance.
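The three requirements (documented use case, defined scope, time limit) map naturally onto a small data structure. The sketch below is one possible shape, with invented names; it logs every decision, allowed or denied, so the access history can be reconstructed later.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    user: str
    use_case: str            # documented justification for the access
    dataset_ids: frozenset   # defined scope
    expires_at: datetime     # time limit


def check_access(grant, user, dataset_id, now, audit_log):
    """Decide whether `user` may read `dataset_id` and log the decision."""
    if grant.user != user:
        decision = "denied: grant belongs to another user"
    elif dataset_id not in grant.dataset_ids:
        decision = "denied: dataset outside approved scope"
    elif now >= grant.expires_at:
        decision = "denied: grant expired"
    else:
        decision = "allowed"
    audit_log.append({
        "user": user,
        "dataset_id": dataset_id,
        "use_case": grant.use_case,
        "at": now.isoformat(),
        "decision": decision,
    })
    return decision == "allowed"


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
grant = AccessGrant(
    user="ml-engineer",
    use_case="churn model prototype",
    dataset_ids=frozenset({"crm_clean"}),
    expires_at=issued + timedelta(days=30),
)
audit_log = []
```

Because denials are logged too, a continuous review can look for patterns such as repeated out-of-scope requests rather than waiting for the annual audit.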
Technical controls help. Tokenization and anonymization can reduce the sensitivity of datasets used for training without eliminating their analytical value. Data masking can hide PII in non-production environments. Federated learning allows models to train on data without centralizing it. These are not silver bullets, but they are practical tools that reduce the blast radius of an access control failure.
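As a small example of the first two techniques, the sketch below tokenizes an identifier with a keyed hash and masks an email address for display. It uses only Python's standard hmac and hashlib modules; the function names and the sample key are placeholders. Note that keyed hashing is pseudonymization, not anonymization: anyone holding the key can recompute the token for a known value, so the key needs the same protection as the data.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 token.

    The same value and key always give the same token, so joins across
    tables still work, but the raw identifier never reaches the training set.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Hide most of the local part of an email for non-production use."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"
    return local[:1] + "***@" + domain


# Placeholder key for illustration only; load real keys from a secrets manager.
DEMO_KEY = b"replace-me"
token = tokenize("customer-12345", DEMO_KEY)
masked = mask_email("alice@example.com")   # "a***@example.com"
```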
Retention and Deletion: Models Remember What You've Forgotten
Data retention policies specify how long data should be kept and when it must be deleted. Every organization has one, even if it's implicit. Most organizations do not follow it, and most have no systematic way to verify compliance. AI makes that failure a liability.
When you train a model on data that should have been deleted, you've violated your retention policy in a way that's difficult to remediate. The model retains information about that data: not necessarily the records themselves, but the patterns, correlations, and inferences the data made possible. If that data was subject to a regulatory retention limit or a customer deletion request, you now have a compliance problem that cannot be solved by deleting the source data. The model itself is the violation.
This is not a theoretical concern. Regulators have shown they are willing to reach past source data to the systems built on it: in several enforcement actions, the FTC has ordered companies to delete models and algorithms derived from improperly obtained data. If a California consumer submits a CCPA deletion request, and you delete their record from your CRM but continue to operate a recommendation model trained on their purchase history, it is hard to argue that you have fully honored the request. The model is still making inferences based on their data. The model is still, in a meaningful sense, processing information about them.
Retention Policy in Practice
Enforcing retention policy in AI environments requires more than scheduled deletion jobs. It requires knowing what data went into which models, when those models were trained, and whether any of the underlying data is now subject to deletion obligations. That means maintaining not just data lineage but model lineage — a record of which datasets contributed to which model versions, and when.
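Model lineage can start as nothing more than a registry that maps each model version to the datasets it was trained on. The sketch below assumes such a registry exists as a list of records (the ModelVersion type and the sample entries are invented) and answers the question a deletion obligation raises: which models are touched?

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str
    trained_on: date
    dataset_ids: frozenset   # datasets that contributed to this version


def models_affected_by(registry, dataset_ids):
    """Return model versions trained on any of the given datasets, oldest first."""
    targets = set(dataset_ids)
    hits = [m for m in registry if m.dataset_ids & targets]
    return sorted(hits, key=lambda m: m.trained_on)


registry = [
    ModelVersion("recommender", "2.0", date(2024, 6, 1), frozenset({"purchases", "catalog"})),
    ModelVersion("recommender", "1.0", date(2023, 6, 1), frozenset({"purchases"})),
    ModelVersion("forecast", "1.0", date(2024, 1, 15), frozenset({"inventory"})),
]
```

When a deletion obligation lands on the purchases dataset, models_affected_by(registry, ["purchases"]) returns both recommender versions and leaves the forecast model out of scope.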
Some organizations are beginning to implement model expiration policies that mirror data retention schedules. If a dataset has a three-year retention limit, any model trained on that dataset might have a three-year operational limit as well, after which it must be retrained on current data or retired. This is not yet standard practice, but it's the direction privacy-forward organizations are moving, particularly in jurisdictions where data minimization is a legal requirement.
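One possible reading of such a policy, sketched below, is that a model's operational limit runs from its training date and is as long as the shortest retention period among its training datasets. That rule is an assumption made for illustration; an organization might instead anchor the limit to when each dataset was collected.

```python
from datetime import date, timedelta


def model_expiry(trained_on: date, retention_days_by_dataset: dict) -> date:
    """Expiry date under the assumed rule: training date plus the
    shortest retention period of any dataset used in training."""
    if not retention_days_by_dataset:
        raise ValueError("a model with no recorded datasets cannot be assessed")
    shortest = min(retention_days_by_dataset.values())
    return trained_on + timedelta(days=shortest)


def is_expired(trained_on: date, retention_days_by_dataset: dict, today: date) -> bool:
    """True once the model has reached its operational limit."""
    return today >= model_expiry(trained_on, retention_days_by_dataset)
```

A scheduled job could run is_expired over every production model and open a retraining or retirement ticket when it returns True.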
Retention also intersects with incident response. If you suffer a data breach and must notify affected individuals, you need to know whether the breached data was used in AI training. If it was, you may have downstream obligations: re-evaluating the model, notifying additional stakeholders, or even retiring the model if the compromised data cannot be cleanly excised. None of that is possible without the governance infrastructure to connect data incidents to model artifacts.
The Tooling Problem: Governance Needs Infrastructure
Data governance and AI governance both require tooling, but most organizations discover their existing tools were not designed for the scale, complexity, or velocity of AI workflows. Data catalogs built for reporting and analytics often cannot track the feature engineering and transformation steps that feed model training. Access management platforms designed for human users struggle with service accounts and API-driven data access. Retention automation built for structured databases doesn't extend to data lakes, object storage, or third-party AI platforms.
I've seen organizations try to govern AI with spreadsheets, SharePoint sites, and email threads. It doesn't work. The volume of data movement, the speed of experimentation, and the distributed nature of modern AI development exceed what manual processes can manage. You need systems that can enforce policy programmatically, log activity automatically, and surface violations in real time.
That said, tooling is not a substitute for policy, and vendors are eager to sell you platforms that promise governance without requiring you to do the hard work of defining what you're governing. A data catalog is only as good as the metadata you put into it. A policy enforcement engine is only as effective as the policies you write. A lineage tracker is only useful if your teams log transformations consistently. The tooling enables governance, but governance is still a people and process problem first.
Building the Right Stack
A functional AI and data governance stack typically includes several categories of tools. You need discovery and classification tools that can scan your environment, identify sensitive data, and apply labels automatically or with minimal human input. You need access management systems that integrate with your identity provider and enforce attribute-based or role-based access controls at the data layer. You need lineage and cataloging platforms that track data movement and make that information queryable for compliance and operational purposes. And you need monitoring and alerting systems that flag policy violations, unusual access patterns, or high-risk AI activity before it becomes an incident.
Not every organization needs enterprise-grade tooling on day one. Smaller organizations can start with native cloud provider tools, open-source solutions, and structured logging before investing in commercial platforms. But you cannot scale AI without eventually investing in infrastructure that makes governance enforceable, auditable, and sustainable.
AI Governance Is Data Governance at Scale
If you're building an AI governance framework, start with data. Inventory what you have. Classify it according to sensitivity and legal obligations. Enforce access controls that map to roles and use cases. Implement retention policies that apply not just to source data but to the models trained on it. Log everything in a way that allows you to reconstruct decisions, respond to incidents, and satisfy auditors.
The organizations that govern AI well are not necessarily the ones with the most sophisticated models or the largest data science teams. They're the ones that did the unglamorous data governance work before AI became a priority. They built inventories that stay current. They wrote classification schemes that people actually follow. They implemented access controls that survived the shift to cloud, remote work, and third-party SaaS. They enforced retention policies that now protect them from retaining data that creates liability.
AI does not require a fundamentally different approach to data governance — it requires a better version of the approach you should have been following all along. The stakes are higher because AI amplifies risk and operates at scale. The scrutiny is greater because regulators and customers are paying attention. But the principles are the same: know your data, control who touches it, enforce how long you keep it, and be able to prove all of the above when someone asks.
What Leadership Should Do Now
If you're a CISO, CTO, or compliance leader, you cannot govern AI without first understanding the state of your data governance. That means conducting an honest assessment: Do we have a current, accurate inventory of data assets? Can we trace data lineage through transformation pipelines and into AI training processes? Do we enforce classification and access controls programmatically, or do we rely on user judgment and policy documents no one reads? Can we demonstrate compliance with retention schedules, and do those schedules account for data embedded in models?
If the answer to any of those questions is no, that's your starting point. You do not need to solve every data governance gap before deploying AI, but you do need to understand which gaps create unacceptable risk and prioritize closing them. Some AI use cases — internal productivity tools, low-risk automation, non-sensitive data analysis — may be viable even with immature data governance. Others — customer-facing decisioning, processing of regulated data, high-risk AI under the EU AI Act — are not.
Communicate the dependency clearly. AI governance is not a separate workstream that can proceed in parallel with data governance — it's downstream. If your organization is investing in AI strategy, model development, and vendor evaluations without simultaneously investing in data inventory, classification, lineage, and access control, you are building on sand. The structure will work until the first audit, the first incident, or the first regulator inquiry, and then it will collapse in a way that's expensive and embarrassing to fix.
For organizations that are serious about getting this right, the path forward is not a mystery. Inventory your data. Classify it. Control who can access it and log when they do. Enforce retention and deletion. Extend those practices into your AI pipelines and hold your data science and engineering teams to the same standards you hold the rest of IT. Make governance a prerequisite for production deployment, not a checklist you complete after the fact.
The organizations that thrive in the next decade of AI will be the ones that treated data governance as infrastructure rather than overhead. They will respond faster to regulatory changes because they already know what data they have and where it lives. They will deploy AI with confidence because they can demonstrate compliance at every layer. And they will avoid the high-profile failures that come from discovering, too late, that the model everyone depends on was trained on data no one should have touched.
AI and data governance are not separate disciplines. One is simply the visible expression of the other. If your data governance is strong, your AI governance has a foundation. If your data governance is weak, your AI governance is a facade. Most organizations will figure that out the hard way. You don't have to be one of them.