Introduction
Modern data platforms make it easy to scale analytics, but access management often remains surprisingly manual.
A team asks for access, an engineer reviews the request, someone grants permissions object by object, and weeks later nobody is fully sure who still has access or why.
That manual process does not fail because people are careless. It fails because the process itself is hard to scale. Databricks Unity Catalog gives us a strong permissions model with securable objects and privilege inheritance, but many teams still lack a reusable operational layer that translates business requests into governed technical permissions.
This article presents a practical pattern for solving that problem: a Python-based access engine packaged in a Git-backed Databricks repo, executed through notebooks, and validated through Genie Spaces.
The goal is not to build a full identity platform. The goal is to create a lightweight and auditable control layer that sits between a business access request and Unity Catalog permissions.
Why this pattern is useful
Unity Catalog already supports catalogs, schemas, tables, views, functions, volumes and a broad privilege model.
Permissions are inherited through the object hierarchy, which is powerful, but also means that grant and revoke behavior must be designed carefully.
A user can keep effective access through catalog or schema level privileges even if a table level privilege changes.
That is why a notebook full of ad hoc GRANT statements is usually not enough.
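To make the inheritance issue concrete, here is a minimal illustration. The `billing` catalog, `gold` schema and `analysts` group are hypothetical names used only for this sketch:

```python
# Privileges granted at catalog or schema level are inherited by child objects.
spark.sql("GRANT USE CATALOG ON CATALOG billing TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA billing.gold TO `analysts`")
spark.sql("GRANT SELECT ON SCHEMA billing.gold TO `analysts`")

# Revoking a table-level SELECT does not remove effective access:
# the schema-level SELECT above still covers every table in billing.gold.
spark.sql("REVOKE SELECT ON TABLE billing.gold.billing_sales_finance_invoices FROM `analysts`")
```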
A better design is to separate the problem into two layers: a configuration layer that captures business access requests as data, and an execution layer, the access engine, that translates those rows into Unity Catalog grants and revokes.
The target operating model
Let's suppose this is the business flow we want:
Admin action → Configuration row → Access engine → Unity Catalog permissions → Audit log
In plain language:
“Maria needs access to the Sales business unit and the Finance department.”
The engine converts that into technical actions: granting USE CATALOG on the relevant catalog, USE SCHEMA on the relevant schema, and SELECT on the tables that belong to the Sales / Finance scope.
This is the core idea: let business language drive technical permissions through a governed translation layer.
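As a sketch of that translation, the request above could live as a single configuration row, from which the engine derives concrete Unity Catalog statements. The catalog, schema, table and email values below are illustrative and not part of the pattern itself:

```python
# One business request, stored as data rather than executed by hand.
request = {
    "user_email": "maria@example.com",
    "group_name": "data-analyst",
    "business_unit": "sales",
    "department": "finance",
    "is_active": True,
}

# The kind of statements the engine would derive from that row.
derived_statements = [
    "GRANT USE CATALOG ON CATALOG billing TO `maria@example.com`",
    "GRANT USE SCHEMA ON SCHEMA billing.gold TO `maria@example.com`",
    "GRANT SELECT ON TABLE billing.gold.billing_sales_finance_invoices TO `maria@example.com`",
]
```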
A simple data model for access requests
A simple data model is enough to get started.
1. user_access_config
This table stores the requested access.
| Column | Meaning |
|---|---|
| user_email | Who needs access |
| group_name | What business role/profile they are requesting |
| business_unit | Requested business unit |
| department | Requested department |
| is_active | Whether the scope is currently active |
| processed_business_unit | Last provisioned business unit |
| processed_department | Last provisioned department |
| last_processed | Last execution timestamp |
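A minimal DDL sketch for this table could look like the following; the `admin.access_mgmt` location and the column types are assumptions to adapt to your environment:

```python
spark.sql("""
CREATE TABLE IF NOT EXISTS admin.access_mgmt.user_access_config (
    user_email               STRING,
    group_name               STRING,
    business_unit            STRING,
    department               STRING,
    is_active                BOOLEAN,
    processed_business_unit  STRING,
    processed_department     STRING,
    last_processed           TIMESTAMP
)
""")
```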
2. valid_groups
Maps a user to a domain and role.
3. role_permissions
Defines the permission model for each role.
Example:
| role | object_type | privilege |
|---|---|---|
| data-analyst | CATALOG | USE CATALOG |
| data-analyst | SCHEMA | USE SCHEMA |
| data-analyst | TABLE | SELECT |
| data-engineer | TABLE | MODIFY |
4. domain_catalog_mapping
Maps a business domain to a catalog and schema.
5. business_unit_patterns and department_patterns
Store the matching logic used to identify which tables belong to which scope.
6. access_grant_log
Records every grant and revoke attempt with a status.
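To make the supporting tables more concrete, here is an illustrative sketch for two of them. The column names and the `admin.access_mgmt` location are assumptions, not the pattern's required schema:

```python
spark.sql("""
CREATE TABLE IF NOT EXISTS admin.access_mgmt.domain_catalog_mapping (
    domain        STRING,   -- business domain, for example 'billing'
    catalog_name  STRING,   -- target Unity Catalog catalog
    schema_name   STRING    -- target schema inside that catalog
)
""")

spark.sql("""
CREATE TABLE IF NOT EXISTS admin.access_mgmt.access_grant_log (
    user_email   STRING,
    action       STRING,    -- 'GRANT' or 'REVOKE'
    object_type  STRING,    -- 'CATALOG', 'SCHEMA' or 'TABLE'
    object_name  STRING,
    privilege    STRING,
    status       STRING,    -- 'SUCCESS' or 'FAILED'
    message      STRING,
    event_time   TIMESTAMP
)
""")
```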
From notebook prototype to Python package
The first version of this pattern is usually built as a notebook. That is useful for experimentation, but it becomes difficult to review, reuse and maintain.
The better next step is to package the logic.
A minimal structure looks like this:
```
user_mgt_poc/
├── src/
│   ├── access_engine.py
│   └── admin_ops.py
└── notebooks/
    ├── run_provisioning.py
    └── admin_operations.py
```
This separation matters:
access_engine.py contains the provisioning logic
admin_ops.py contains safe and admin friendly helper functions
notebooks become the operational runbook
The notebook stops being the "place where all logic lives" and becomes the "controlled execution surface" for packaged code.
The operational notebook pattern
Once the Python files are placed in a Databricks repo, the notebook becomes very simple.
First, we make the repo importable:
```python
import sys

REPO_ROOT = "/Workspace/Users/<user>/my_repo"
if REPO_ROOT not in sys.path:
    sys.path.append(REPO_ROOT)
```
Then we import the package:
```python
from user_mgt_poc.src.access_engine import (
    CONFIG,
    provision_access,
    get_user_access_df,
    get_audit_log_df,
)
from user_mgt_poc.src.admin_ops import (
    add_user,
    upsert_user,
    update_user_row,
    exclude_access,
    deactivate_user,
    reactivate_user,
    remove_user,
    list_users,
    check_duplicates,
)
```
Then we choose the execution mode:
```python
CONFIG["verbose"] = True
CONFIG["dry_run"] = True  # switch to False for real execution
```
That one line is extremely useful operationally. It gives us a rehearsal mode before any real privilege changes are applied.
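One way to honour that flag inside the engine is to route every privilege change through a single helper. This is a sketch, not the package's actual implementation:

```python
def run_sql(spark, statement: str):
    """Execute a SQL statement, or only print it when dry_run is enabled."""
    if CONFIG.get("dry_run", True):
        print(f"[DRY RUN] {statement}")
        return
    if CONFIG.get("verbose"):
        print(f"[EXECUTING] {statement}")
    spark.sql(statement)
```

If every grant and revoke goes through a helper like this, a dry run produces a readable plan of changes instead of real privilege modifications.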
The admin operations in plain English
A good operational interface should not force admins to manually write SQL inserts or GRANT statements. That is what the helper functions are for.
add_user(): adds a brand new access row.
upsert_user(): adds or reactivates safely. In practice, this is the best default function for daily use.
update_user_row(): moves a user from one scope to another.
exclude_access(): useful when a user has a broad scope such as "ALL business units" and we need to exclude one department.
deactivate_user(): turns off a scope but keeps the row for traceability.
reactivate_user(): restores a previously deactivated scope.
remove_user(): performs full cleanup by deactivating, revoking and deleting the row.
list_users() and check_duplicates(): support operational visibility and data quality.
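A short usage sketch is shown below. The argument order for deactivate_user, reactivate_user and exclude_access is an assumption modelled on upsert_user, so check the actual signatures in admin_ops.py:

```python
# Switch a scope off without losing the row, then bring it back.
deactivate_user(spark, "maria@example.com", "sales", "finance")
reactivate_user(spark, "maria@example.com", "sales", "finance")

# Broad business-unit scope, but one department carved out.
exclude_access(spark, "maria@example.com", "ALL", "hr")

# Operational visibility and data quality checks.
display(list_users(spark, "maria@example.com"))
display(check_duplicates(spark))
```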
How the engine works internally
We do not need to understand every line of code to understand the design.
At a high level, the provisioning engine does six things:
1. Detects duplicate logical rows.
2. Processes deactivated rows and runs revoke logic.
3. Reads active rows.
4. Detects new scopes or changed scopes.
5. Grants the new permissions.
6. Updates the processed state.
A simplified execution model looks like this:
```python
def provision_access(spark):
    # 1. Detect and resolve duplicate logical rows before doing anything else.
    duplicates = validate_no_duplicates(spark)
    if duplicates:
        auto_resolve_duplicates(spark, duplicates)

    # 2. Deactivated rows drive the revoke path.
    deactivated_rows = load_deactivated_rows(spark)
    for row in deactivated_rows:
        revoke_access(...)
        mark_row_unprocessed(...)

    # 3-6. Active rows drive the grant path and the processed state.
    active_rows = load_active_rows(spark)
    for row in active_rows:
        if is_new_scope(row) or is_changed_scope(row):
            grant_access(...)
            mark_row_processed(...)
```
That is the entire mental model.
The engine does not "guess" access. It works entirely from config rows and matching rules.
Matching business scopes to technical objects
This is where the pattern becomes powerful.
Let's suppose our gold billing tables use naming like this:
billing_sales_finance_invoices
billing_sales_finance_invoice_lines
billing_support_hr_invoices
We can store matching rules such as a business unit pattern for sales and a department pattern for finance (for example, SQL LIKE patterns along the lines of '%_sales_%' and '%_finance_%').
The engine then finds tables where both patterns match.
A simplified matching function looks like this:
```python
def get_matching_tables(spark, catalog, schema, business_unit, department):
    bu_pattern = load_bu_pattern(...)
    dept_pattern = load_dept_pattern(...)
    rows = spark.sql(f"""
        SELECT table_name
        FROM system.information_schema.tables
        WHERE table_catalog = '{catalog}'
          AND table_schema = '{schema}'
          AND LOWER(table_name) LIKE LOWER('{bu_pattern}')
          AND LOWER(table_name) LIKE LOWER('{dept_pattern}')
          AND table_name NOT LIKE '__materialization_%'
    """).collect()
    return [r["table_name"] for r in rows]
```
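With the hypothetical patterns above, and assuming a `billing` catalog with a `gold` schema, a call would narrow the earlier table list down to the Sales / Finance objects:

```python
tables = get_matching_tables(spark, "billing", "gold", "sales", "finance")
# Expected result with the example naming convention:
# ['billing_sales_finance_invoices', 'billing_sales_finance_invoice_lines']
```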
Grant logic versus revoke logic
This is one of the most important design details.
In Unity Catalog, privileges can apply at the catalog, schema and table level, and access flows from parent objects to child objects. That means grant and revoke logic should be designed symmetrically.
A common first implementation grants USE CATALOG, USE SCHEMA and SELECT on the matching tables, but then revokes only the table level permissions.
That creates a gap: the user may still keep effective access through schema or catalog level privileges.
The improved design is:
always revoke matching table privileges for the removed scope
revoke schema privilege only if the user has no other active scope that still needs that schema
revoke catalog privilege only if the user has no other active scope that still needs that catalog
That makes the revoke behavior symmetric with the grant behavior and avoids over-revoking shared access.
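A sketch of that scope-aware revoke logic is shown below. Here has_other_active_scope is a hypothetical helper standing in for a lookup against the remaining active configuration rows, and run_sql is the dry-run helper sketched earlier:

```python
def revoke_access(spark, user, catalog, schema, business_unit, department):
    # Always revoke the table privileges that belong to the removed scope.
    for table in get_matching_tables(spark, catalog, schema, business_unit, department):
        run_sql(spark, f"REVOKE SELECT ON TABLE {catalog}.{schema}.{table} FROM `{user}`")

    # Drop the schema privilege only if no other active scope still needs this schema.
    if not has_other_active_scope(spark, user, catalog=catalog, schema=schema):
        run_sql(spark, f"REVOKE USE SCHEMA ON SCHEMA {catalog}.{schema} FROM `{user}`")

    # Same rule one level up, for the catalog.
    if not has_other_active_scope(spark, user, catalog=catalog):
        run_sql(spark, f"REVOKE USE CATALOG ON CATALOG {catalog} FROM `{user}`")
```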
An example business scenario
Let’s use a clean generic example.
Request
Maria needs access to:
Business Unit: sales
Department: finance
Profile: data-analyst
Notebook action
```python
upsert_user(
    spark,
    "maria@example.com",
    "data-analyst",
    "sales",
    "finance"
)

provision_access(spark)
```
Result:
Maria’s row is active
the processed fields are updated
Unity Catalog permissions are granted on the matching catalog, schema and tables
audit log shows the grant actions
Admin validation
```python
display(list_users(spark, "maria@example.com"))
display(get_audit_log_df(spark, limit=50))
```
Why Git and Databricks Repos matter
Once the logic lives in Python files instead of only notebooks, it becomes much easier to manage properly.
Putting the module in Git gives us version history, code review and a clear audit trail of changes to the access logic itself.
Databricks Repos make this model practical because the same packaged code can be pulled into the workspace and executed from notebooks in a controlled way.
Using Genie Spaces as a business-facing validation layer
Once the access is granted technically, business users still need a way to validate that they can actually use the data.
This is where Genie Spaces become useful.
Databricks describes Genie Spaces as natural language chat interfaces over curated data, and setting up a Genie Space requires a supported SQL warehouse, access to the relevant data and privileges such as SELECT on the connected objects.
The important point is:
the Python package is the access engine
the notebook is the runbook
Genie is the business-facing validation layer
So after provisioning access, we can create a small Genie Space on curated gold billing tables and validate with business friendly questions such as:
“Show me 10 recent invoices for Sales / Finance”
“Show me invoice lines for Sales / Finance”
“How many invoices are available for Sales / Finance?”