Data Governance

What Is Data Governance in Purview?

Microsoft Purview Data Governance provides a unified data catalog and data map that discovers, classifies, and inventories data assets across on-premises, multi-cloud, and SaaS sources. This layer answers the foundational question: where is sensitive data, and what kind is it?

Governance discovery feeds directly into the protection stack — labels applied by governance scanning inform DLP enforcement, and posture gaps surfaced by scanning inform DSPM risk scoring. The governance layer is discovery-focused; it does not enforce controls directly.

Two-Lane Governance Model

🏛️ Lane 1 — Compliance Governance

Managed by the Compliance team. Focuses on regulatory requirements: HIPAA, HITECH, PCI DSS, GDPR, SOX. Outputs are DLP policies, retention policies, audit reports, and eDiscovery holds. Primary tool: Purview Compliance Portal.

DLP policy ownership
Retention and records management
eDiscovery and litigation hold
Audit log review and reporting

🗺️ Lane 2 — Data Asset Governance

Managed by the Data team. Focuses on data discoverability, cataloging, lineage, and classification consistency across the enterprise data estate. Primary tool: Purview Data Catalog / Unified Governance Portal.

Data Map scanning and registration
Data Catalog ownership and glossary
Data classification at asset level
Business glossary and lineage

Purview Data Map

The Purview Data Map is the foundational metadata store that registers, scans, and classifies data sources. It creates a living inventory of all data assets — tables, files, reports, databases — with sensitivity classifications, ownership, and lineage attached.

Supported Source Types

☁️ Azure

Azure Data Lake, Azure SQL, Azure Blob, Synapse Analytics, Azure Cosmos DB, Azure Data Factory lineage

🏢 On-Premises

SQL Server, Oracle, SAP HANA, Teradata, file shares (Windows/Linux) via Self-Hosted Integration Runtime (SHIR)

☁️ Multi-Cloud

Amazon S3, Google Cloud Storage, Snowflake. Cross-cloud scanning requires network connectivity and registered credentials.

📁 Microsoft 365

SharePoint, OneDrive, Exchange (email body and attachments). Scanned natively — no SHIR required.

💼 SaaS

Power BI, Salesforce, SAP S/4HANA, Erwin, Looker. Requires connector registration and credential management.

🖥️ DFSR / File Shares

Windows file servers that use DFSR replication are scanned via SHIR, with the MIP scanner agent deployed on the member servers.

SHIR On-Premises Onboarding

What is SHIR? The Self-Hosted Integration Runtime (SHIR) is a data movement agent installed on an on-premises or private network server. It enables the Purview Data Map to scan on-premises sources without exposing them to the public internet.

Step	Action	Notes
1	Register SHIR in Purview portal	Integration Runtimes → New → Self-Hosted. Download the installer and copy the authentication key.
2	Install SHIR on dedicated Windows Server	Minimum: Windows Server 2016+, 4 vCPU, 8 GB RAM. Install on a server that is not a domain controller.
3	Configure firewall outbound rules	Allow HTTPS (443) to purview.azure.com and servicebus.windows.net. No inbound required.
4	Register data source in Data Map	Sources → Register → select source type. Provide SHIR as runtime.
5	Create and run scan	Set classification rules, scanning scope, trigger (manual or scheduled). Review scan report.
6	Review classifications in Catalog	Asset details show detected sensitive info types and applied labels. Validate accuracy.
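
Step 3 can be checked before a scan is ever scheduled. The following is a minimal sketch of an outbound preflight check run from the SHIR server, assuming the two hostnames listed in the table are the endpoints to test; the function names are hypothetical, and the exact endpoints (including any regional subdomains) should be confirmed for your tenant. It only confirms that a TCP connection on port 443 can be opened; it does not validate TLS inspection or proxy behavior.

```python
import socket
from typing import Callable, Dict, Iterable, List

# Hostnames taken from step 3 of the table above. Assumption: these are the
# endpoints to test; confirm the exact list for your tenant and region.
REQUIRED_HOSTS = ("purview.azure.com", "servicebus.windows.net")


def tcp_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


def check_outbound(
    hosts: Iterable[str] = REQUIRED_HOSTS,
    port: int = 443,
    connect: Callable[[str, int], bool] = tcp_connect,
) -> Dict[str, bool]:
    """Map each host to whether the outbound connection attempt succeeded."""
    return {host: connect(host, port) for host in hosts}


def blocked_hosts(results: Dict[str, bool]) -> List[str]:
    """List the hosts whose connection attempt failed, in sorted order."""
    return sorted(host for host, ok in results.items() if not ok)


if __name__ == "__main__":
    for host in blocked_hosts(check_outbound()):
        print(f"Outbound 443 blocked or unreachable: {host}")
```

The `connect` parameter is injectable so the check can be exercised without network access, for example in a deployment pipeline test.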

Data Catalog

Asset Registration

Every scanned source populates the catalog with asset entries. Each asset has: schema, sensitivity classification, owner, glossary terms, lineage, and scan history. Assets are searchable across the entire data estate.
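
To make the shape of an asset entry concrete, here is an illustrative sketch of a record carrying the attributes listed above, plus a simple keyword search across a set of assets. The class and field names are hypothetical and do not reflect the actual Purview catalog schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogAsset:
    """Illustrative asset record; field names are assumptions, not the Purview schema."""
    name: str
    source: str
    schema: List[str] = field(default_factory=list)           # column or field names
    classifications: List[str] = field(default_factory=list)  # sensitivity classifications
    owner: Optional[str] = None
    glossary_terms: List[str] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)         # lineage: assets this one derives from
    scan_history: List[str] = field(default_factory=list)     # ISO dates of completed scans


def search(assets: List[CatalogAsset], keyword: str) -> List[CatalogAsset]:
    """Return assets whose name, schema, classifications, or glossary terms contain keyword."""
    needle = keyword.lower()
    hits = []
    for asset in assets:
        haystack = [asset.name, *asset.schema, *asset.classifications, *asset.glossary_terms]
        if any(needle in text.lower() for text in haystack):
            hits.append(asset)
    return hits


members = CatalogAsset(
    name="members",
    source="Azure SQL",
    schema=["member_id", "ssn"],
    classifications=["PII"],
    glossary_terms=["Member Account"],
)
web_logs = CatalogAsset(name="web_logs", source="Azure Blob")
pii_assets = search([members, web_logs], "pii")  # matches on classification
```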

Business Glossary

Canonical definitions for business terms — linked to catalog assets. Ensures consistent vocabulary across data teams. Governance stewards own term definitions. Examples: "Member Account," "PII Asset," "Regulated Data."

Lineage Tracking

Shows data flow from source to consumption: raw files → ETL pipelines → data warehouse → reports. Lineage enables impact analysis — who consumes a dataset and what breaks if it changes.
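
Impact analysis over lineage amounts to a graph traversal. The sketch below assumes lineage has been exported as a simple mapping from each asset to the assets that consume it; the asset names and the `downstream_impact` helper are hypothetical and are not part of any Purview API.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE: Dict[str, List[str]] = {
    "raw_files": ["etl_pipeline"],
    "etl_pipeline": ["data_warehouse"],
    "data_warehouse": ["sales_report", "finance_report"],
}


def downstream_impact(lineage: Dict[str, List[str]], changed: str) -> List[str]:
    """Breadth-first walk returning every asset that depends on `changed`."""
    seen = {changed}          # guards against cycles and duplicate visits
    impacted: List[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                impacted.append(consumer)
                queue.append(consumer)
    return impacted


# Everything that could break if the ETL pipeline changes:
affected = downstream_impact(LINEAGE, "etl_pipeline")
# affected == ["data_warehouse", "sales_report", "finance_report"]
```

Breadth-first order lists direct consumers before indirect ones, which is usually the order in which owners need to be notified.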

Classification Accuracy Review

Classification results should be reviewed quarterly. False positives (e.g., test data classified as PII) should be corrected and the scan rule tuned. False negatives (sensitive data the scan failed to flag) require adjusting the sensitive information type (SIT) or the classifier.
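
One way to put numbers on a quarterly review is to hand-check a sample of assets and compute precision and recall from the false positives and false negatives found. This is a minimal sketch under that assumption; the `ReviewedAsset` record and `accuracy_summary` function are hypothetical, and the sample has to be labeled manually by a reviewer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReviewedAsset:
    name: str
    classified_sensitive: bool  # what the scan reported
    actually_sensitive: bool    # what the human reviewer determined


def accuracy_summary(sample: List[ReviewedAsset]) -> Dict[str, Optional[float]]:
    """Count false positives/negatives and derive precision and recall."""
    tp = sum(1 for a in sample if a.classified_sensitive and a.actually_sensitive)
    fp = sum(1 for a in sample if a.classified_sensitive and not a.actually_sensitive)
    fn = sum(1 for a in sample if not a.classified_sensitive and a.actually_sensitive)
    return {
        "false_positives": fp,
        "false_negatives": fn,
        # Precision: share of flagged assets that really are sensitive.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Recall: share of truly sensitive assets the scan caught.
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


sample = [
    ReviewedAsset("members_table", True, True),
    ReviewedAsset("claims_export", True, True),
    ReviewedAsset("qa_test_data", True, False),   # false positive: test data flagged as PII
    ReviewedAsset("legacy_share", False, True),   # false negative: missed by the scan
    ReviewedAsset("public_docs", False, False),
]
summary = accuracy_summary(sample)
```

Low precision points toward tuning the scan rule; low recall points toward adjusting the SIT or classifier.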

Governance Best Practices

Scan scheduling and scope

Start with targeted scopes (specific folders, databases) rather than full-system scans
Schedule scans during off-peak hours — scans can be I/O intensive on source systems
Run full scans monthly; incremental scans weekly for active repositories
Review scan failure logs: by default a failed scan raises no alert, so the assets it missed go unnoticed
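
The cadence and failure checks above can be sketched as a small report over scan run records. This assumes run history has been exported into plain dictionaries; the record layout, the `scan_gaps` function, and the 31-day and 7-day thresholds (approximating "monthly" and "weekly") are assumptions for illustration, not Purview fields or defaults.

```python
from datetime import date, timedelta
from typing import Dict, List

FULL_MAX_AGE = timedelta(days=31)        # assumption: "monthly" full scans
INCREMENTAL_MAX_AGE = timedelta(days=7)  # assumption: "weekly" incremental scans


def scan_gaps(runs: List[dict], sources: List[str], today: date) -> Dict[str, List[str]]:
    """Return {source: [issues]} for sources with failed or overdue scans."""
    gaps: Dict[str, List[str]] = {}
    for source in sources:
        mine = [r for r in runs if r["source"] == source]
        ok = [r for r in mine if r["status"] == "succeeded"]
        issues = []
        failed = len(mine) - len(ok)
        if failed:
            issues.append(f"{failed} failed run(s)")
        full_dates = [r["finished"] for r in ok if r["kind"] == "full"]
        if not full_dates or today - max(full_dates) > FULL_MAX_AGE:
            issues.append("full scan overdue")
        # A recent successful full scan also satisfies the weekly requirement.
        any_dates = [r["finished"] for r in ok]
        if not any_dates or today - max(any_dates) > INCREMENTAL_MAX_AGE:
            issues.append("incremental scan overdue")
        if issues:
            gaps[source] = issues
    return gaps


runs = [
    {"source": "hr_db", "kind": "full", "status": "succeeded", "finished": date(2024, 6, 10)},
    {"source": "hr_db", "kind": "incremental", "status": "succeeded", "finished": date(2024, 6, 28)},
    {"source": "file_share", "kind": "full", "status": "succeeded", "finished": date(2024, 5, 1)},
    {"source": "file_share", "kind": "incremental", "status": "failed", "finished": date(2024, 6, 29)},
]
report = scan_gaps(runs, ["hr_db", "file_share", "crm"], today=date(2024, 6, 30))
```

Passing the full list of registered sources matters: a source with no runs at all (`crm` here) would otherwise never appear in the report.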

Data owner assignment

Every asset should have a declared data owner — enforce this via catalog policy
Data owners, not the security team, are responsible for classification accuracy
Build a quarterly data owner review process — catalog ownership decays without maintenance
Use Collections in the catalog to align assets with business domains and assign ownership at scale
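
The ownership practices above can be supported by a periodic gap report. The sketch below assumes catalog assets have been exported as dictionaries with `name`, `collection`, and `owner` keys; that layout and both function names are hypothetical. It groups unowned assets by collection and shows bulk assignment at the collection level.

```python
from collections import defaultdict
from typing import Dict, List


def unowned_by_collection(assets: List[dict]) -> Dict[str, List[str]]:
    """Group the names of assets with no declared owner by their collection."""
    gaps = defaultdict(list)
    for asset in assets:
        if not (asset.get("owner") or "").strip():
            gaps[asset["collection"]].append(asset["name"])
    return {collection: sorted(names) for collection, names in sorted(gaps.items())}


def assign_owner(assets: List[dict], collection: str, owner: str) -> int:
    """Set `owner` on every unowned asset in `collection`; return how many changed."""
    changed = 0
    for asset in assets:
        if asset["collection"] == collection and not (asset.get("owner") or "").strip():
            asset["owner"] = owner
            changed += 1
    return changed


assets = [
    {"name": "members", "collection": "Membership", "owner": "a.lee"},
    {"name": "claims_2023", "collection": "Claims", "owner": None},
    {"name": "claims_2024", "collection": "Claims", "owner": ""},
    {"name": "gl_entries", "collection": "Finance"},
]
before = unowned_by_collection(assets)
updated = assign_owner(assets, "Claims", "claims.steward")
after = unowned_by_collection(assets)
```

Running the report each quarter and comparing it with the previous one gives a simple measure of how quickly ownership is decaying.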
