
Building a reusable access management layer on Databricks with Python, Notebooks and Genie spaces


Introduction

Modern data platforms make it easy to scale analytics, but access management often remains surprisingly manual.

A team asks for access, an engineer reviews the request, someone grants permissions object by object, and weeks later nobody is fully sure who still has access or why.

That manual process does not fail because people are careless. It fails because the process itself is hard to scale. Databricks Unity Catalog gives us a strong permissions model with securable objects and privilege inheritance, but many teams still lack a reusable operational layer that translates business requests into governed technical permissions.

This article presents a practical pattern for solving that problem: a Python-based access engine packaged in a Git-backed Databricks repo, executed through notebooks, and validated through Genie Spaces.

The goal is not to build a full identity platform. The goal is to create a lightweight and auditable control layer that sits between a business access request and Unity Catalog permissions.

Why this pattern is useful

Unity Catalog already supports catalogs, schemas, tables, views, functions, volumes and a broad privilege model.

Permissions are inherited through the object hierarchy, which is powerful, but also means that grant and revoke behavior must be designed carefully.

A user can keep effective access through catalog or schema level privileges even if a table level privilege changes.

That is why a notebook full of ad hoc GRANT statements is usually not enough.

A better design is to separate the problem into two layers:

  • A config layer that stores who needs access to what.

  • An execution layer that calculates and applies the required grants and revokes.

The target operating model

Let's suppose this is the business flow we want:

Admin action → Configuration row → Access engine → Unity Catalog permissions → Audit log

In plain language:

“Maria needs access to the Sales business unit and the Finance department.”

The engine converts that into technical actions such as:

  • grant USE CATALOG

  • grant USE SCHEMA

  • grant SELECT on the matching tables

  • log each action and outcome

This is the central idea: let business language drive technical permissions through a governed translation layer.
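To make the translation concrete, here is a minimal sketch of what that layer could produce. The function name build_grants and its signature are illustrative, not the article's actual implementation:

```python
# Hypothetical sketch: one business scope becomes an ordered list of
# Unity Catalog GRANT statements (catalog, then schema, then tables).
def build_grants(principal, catalog, schema, tables, privilege="SELECT"):
    """Translate a business access request into grant statements."""
    stmts = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
    ]
    for table in tables:
        stmts.append(
            f"GRANT {privilege} ON TABLE {catalog}.{schema}.{table} TO `{principal}`"
        )
    return stmts
```

The ordering matters: USE CATALOG and USE SCHEMA come first, because the table-level privilege is only usable once the parent objects are reachable.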

A simple data model for access requests

A simple data model is enough to get started.

1. user_access_config

This table stores the requested access.

Column                     Meaning
user_email                 Who needs access
group_name                 What business role/profile they are requesting
business_unit              Requested business unit
department                 Requested department
is_active                  Whether the scope is currently active
processed_business_unit    Last provisioned business unit
processed_department       Last provisioned department
last_processed             Last execution timestamp

2. valid_groups

Maps a user to a domain and role.

3. role_permissions

Defines the permission model for each role.

Example:

role             object_type    privilege
data-analyst     CATALOG        USE CATALOG
data-analyst     SCHEMA         USE SCHEMA
data-analyst     TABLE          SELECT
data-engineer    TABLE          MODIFY

4. domain_catalog_mapping

Maps a business domain to a catalog and schema.

5. business_unit_patterns and department_patterns

Store the matching logic used to identify which tables belong to which scope.

6. access_grant_log

Records every grant and revoke attempt with a status.
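A possible DDL for the central user_access_config table follows the columns described above. The governance.access location and the column types are assumptions for illustration:

```python
# Illustrative DDL for the config table; the catalog/schema name
# (governance.access) and the types are assumptions, not a fixed convention.
USER_ACCESS_CONFIG_DDL = """
CREATE TABLE IF NOT EXISTS governance.access.user_access_config (
    user_email              STRING,
    group_name              STRING,
    business_unit           STRING,
    department              STRING,
    is_active               BOOLEAN,
    processed_business_unit STRING,
    processed_department    STRING,
    last_processed          TIMESTAMP
)
"""
# In a notebook this would be executed with: spark.sql(USER_ACCESS_CONFIG_DDL)
```

Keeping the requested columns (business_unit, department) separate from the processed_* columns is what later lets the engine detect new or changed scopes.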

From notebook prototype to Python package

The first version of this pattern is usually built as a notebook. That is useful for experimentation, but it becomes difficult to review, reuse and maintain.

The better next step is to package the logic.

A minimal structure looks like this:

user_mgt_poc/
├── src/
│   ├── access_engine.py
│   └── admin_ops.py
└── notebooks/
    ├── run_provisioning.py
    └── admin_operations.py

This separation matters:

  • access_engine.py contains the provisioning logic

  • admin_ops.py contains safe and admin friendly helper functions

  • notebooks become the operational runbook

The notebook stops being the "place where all logic lives" and becomes the "controlled execution surface" for packaged code.

The operational notebook pattern

Once the Python files are placed in a Databricks repo, the notebook becomes very simple.

First, we make the repo importable:

import sys

REPO_ROOT = "/Workspace/Users/<user>/my_repo"

if REPO_ROOT not in sys.path:
    sys.path.append(REPO_ROOT)

Then we import the package:

from user_mgt_poc.src.access_engine import (
    CONFIG,
    provision_access,
    get_user_access_df,
    get_audit_log_df,
)

from user_mgt_poc.src.admin_ops import (
    add_user,
    upsert_user,
    update_user_row,
    exclude_access,
    deactivate_user,
    reactivate_user,
    remove_user,
    list_users,
    check_duplicates,
)

Then we choose the execution mode:

CONFIG["verbose"] = True
CONFIG["dry_run"] = True   # switch to False for real execution

That one line is extremely useful operationally. It gives us a rehearsal mode before any real privilege changes are applied.
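One way the dry_run flag could be wired into the engine is a single gate through which every grant or revoke statement passes. The function and parameter names here are illustrative, not the article's actual code:

```python
# Sketch: every privilege change goes through one gate that honors dry_run.
# "execute" is any callable that runs SQL (e.g. lambda s: spark.sql(s)).
def apply_statement(stmt, execute, dry_run=True, verbose=False):
    """Run a grant/revoke statement, or only report it in rehearsal mode."""
    if verbose:
        print(f"{'DRY RUN' if dry_run else 'EXECUTE'}: {stmt}")
    if dry_run:
        return "skipped"   # nothing touches Unity Catalog
    execute(stmt)
    return "applied"
```

Because the gate is centralized, switching CONFIG["dry_run"] to False changes behavior everywhere at once, and the rehearsal output lists exactly the statements that would run.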

The admin operations in plain English

A good operational interface should not force admins to manually write SQL inserts or GRANT statements. That is what the helper functions are for.

  • add_user(): adds a brand new access row.

  • upsert_user(): adds or reactivates safely. In practice, this is the best default function for daily use.

  • update_user_row(): moves a user from one scope to another.

  • exclude_access(): useful when a user has a broad scope such as "ALL business units" and we need to exclude one department.

  • deactivate_user(): turns off a scope but keeps the row for traceability.

  • reactivate_user(): restores a previously deactivated scope.

  • remove_user(): performs full cleanup by deactivating, revoking and deleting the row.

  • list_users() and check_duplicates(): support operational visibility and data quality.
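The upsert semantics can be sketched without Spark using plain dicts; the real helpers operate on user_access_config through Spark, and the function below is only an illustration of the behavior:

```python
# Sketch of upsert semantics: reactivate an existing scope row if one exists,
# otherwise insert a new one. Never creates a duplicate logical row.
def upsert_user_rows(rows, user_email, group_name, business_unit, department):
    for row in rows:
        if (row["user_email"], row["business_unit"], row["department"]) == (
            user_email, business_unit, department,
        ):
            row["is_active"] = True        # reactivate instead of duplicating
            row["group_name"] = group_name
            return row
    row = {
        "user_email": user_email,
        "group_name": group_name,
        "business_unit": business_unit,
        "department": department,
        "is_active": True,
    }
    rows.append(row)
    return row
```

This is why upsert_user() is the safest daily default: calling it twice for the same scope is a no-op rather than a duplicate row.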

How the engine works internally

We do not need to understand every line of code to understand the design.

At a high level, the provisioning engine does 6 things:

  • Detects duplicate logical rows.

  • Processes deactivated rows and runs revoke logic.

  • Reads active rows.

  • Detects new scopes or changed scopes.

  • Grants the new permissions.

  • Updates the processed state.

A simplified execution model looks like this:

def provision_access(spark):
    duplicates = validate_no_duplicates(spark)
    if duplicates:
        auto_resolve_duplicates(spark, duplicates)

    deactivated_rows = load_deactivated_rows(spark)
    for row in deactivated_rows:
        revoke_access(...)
        mark_row_unprocessed(...)

    active_rows = load_active_rows(spark)
    for row in active_rows:
        if is_new_scope(row) or is_changed_scope(row):
            grant_access(...)
            mark_row_processed(...)

That is the entire mental model.

The engine does not "guess" access. It works entirely from config rows and matching rules.
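The "new scope or changed scope" check reduces to comparing the requested columns with the processed_* columns of the config table. A minimal sketch, using the field names from the data model above:

```python
# Sketch of scope-change detection against user_access_config columns.
def is_new_scope(row):
    """A row that has never been provisioned has no processed_* state yet."""
    return row.get("processed_business_unit") is None

def is_changed_scope(row):
    """Requested scope differs from what was last provisioned."""
    return not is_new_scope(row) and (
        row["business_unit"] != row["processed_business_unit"]
        or row["department"] != row["processed_department"]
    )
```

A row where the requested and processed columns match is simply skipped, which makes repeated runs of provision_access idempotent.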

Matching business scopes to technical objects

This is where the pattern becomes powerful.

Let's suppose our gold billing tables use naming like this:

billing_sales_finance_invoices
billing_sales_finance_invoice_lines
billing_support_hr_invoices

We can store matching rules such as:

  • Business Unit = sales → '%sales%'

  • Department = finance → '%finance%'

Then the engine finds tables where both patterns match.

A simplified matching function looks like this:

def get_matching_tables(spark, catalog, schema, business_unit, department):
    bu_pattern = load_bu_pattern(...)
    dept_pattern = load_dept_pattern(...)

    # The patterns come from our own config tables; in production, prefer
    # parameterized queries over f-string interpolation. The underscores in
    # the materialization filter are escaped because `_` is a LIKE wildcard.
    rows = spark.sql(f"""
        SELECT table_name
        FROM system.information_schema.tables
        WHERE table_catalog = '{catalog}'
          AND table_schema = '{schema}'
          AND LOWER(table_name) LIKE LOWER('{bu_pattern}')
          AND LOWER(table_name) LIKE LOWER('{dept_pattern}')
          AND table_name NOT LIKE '\\_\\_materialization\\_%' ESCAPE '\\'
    """).collect()

    return [r["table_name"] for r in rows]
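The same matching logic can be sanity-checked locally without Spark by translating SQL LIKE wildcards into fnmatch patterns. The sql_like helper below is illustrative, not part of the engine:

```python
import fnmatch

# Illustrative local check of the LIKE-based matching, using the example
# billing table names from above.
def sql_like(name, pattern):
    """Case-insensitive SQL LIKE: % -> *, _ -> ? (fnmatch wildcards)."""
    translated = pattern.lower().replace("%", "*").replace("_", "?")
    return fnmatch.fnmatch(name.lower(), translated)

tables = [
    "billing_sales_finance_invoices",
    "billing_sales_finance_invoice_lines",
    "billing_support_hr_invoices",
]
matches = [t for t in tables if sql_like(t, "%sales%") and sql_like(t, "%finance%")]
# matches -> the two sales/finance tables; the support/hr table is excluded
```

Requiring both patterns to match is what keeps a sales/finance scope from accidentally picking up the billing_support_hr_invoices table.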

Grant logic versus revoke logic

This is one of the most important design details.

In Unity Catalog, privileges can apply at catalog, schema, and table level and access can flow from parent objects to child objects. That means grant and revoke logic should be designed symmetrically.

A common first implementation grants:

  • USE CATALOG

  • USE SCHEMA

  • table level SELECT or MODIFY

But then revokes only the table level permissions.

That creates a gap: the user may still keep effective access through schema or catalog level privileges.

The improved design is:

  • always revoke matching table privileges for the removed scope

  • revoke schema privilege only if the user has no other active scope that still needs that schema

  • revoke catalog privilege only if the user has no other active scope that still needs that catalog

That makes the revoke behavior far more realistic and avoids over-revoking shared access.
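The symmetric revoke rule can be sketched as a pure function: table privileges are always revoked, while schema and catalog privileges are revoked only when no other active scope still needs them. Names below (compute_revokes, other_active_scopes) are illustrative; in practice the active scopes would come from user_access_config:

```python
# Sketch: compute which revokes are safe for a removed scope without
# breaking the user's remaining active scopes.
def compute_revokes(catalog, schema, tables, other_active_scopes):
    revokes = [("SELECT", f"TABLE {catalog}.{schema}.{t}") for t in tables]
    # Revoke USE SCHEMA only if no other active scope maps to this schema.
    if not any(s["catalog"] == catalog and s["schema"] == schema
               for s in other_active_scopes):
        revokes.append(("USE SCHEMA", f"SCHEMA {catalog}.{schema}"))
    # Revoke USE CATALOG only if no other active scope maps to this catalog.
    if not any(s["catalog"] == catalog for s in other_active_scopes):
        revokes.append(("USE CATALOG", f"CATALOG {catalog}"))
    return revokes
```

For example, a user losing a sales/finance scope while keeping a sales/hr scope in the same catalog would lose the table grants but keep USE CATALOG.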

An example business scenario

Let’s use a clean generic example.

Request

Maria needs access to:

  • Business Unit: sales

  • Department: finance

  • Profile: data-analyst

Notebook action

upsert_user(
    spark,
    "[email protected]",
    "data-analyst",
    "sales",
    "finance"
)

provision_access(spark)

Result:

  • Maria’s row is active

  • the processed fields are updated

  • Unity Catalog permissions are granted on the matching catalog, schema and tables

  • audit log shows the grant actions

Admin validation

display(list_users(spark, "[email protected]"))
display(get_audit_log_df(spark, limit=50))

Why Git and Databricks Repos matter

Once the logic lives in Python files instead of only notebooks, it becomes much easier to manage properly.

Putting the module in Git gives us:

  • version history

  • pull request review

  • safer promotion between environments

  • cleaner collaboration

  • consistent reuse from multiple notebooks

Databricks Repos make this model practical because the same packaged code can be pulled into the workspace and executed from notebooks in a controlled way.

Using Genie Spaces as a business facing validation layer

Once the access is granted technically, business users still need a way to validate that they can actually use the data.

This is where Genie Spaces become useful.

Databricks describes Genie Spaces as natural language chat interfaces over curated data, and setting up a Genie Space requires a supported SQL warehouse, access to the relevant data and privileges such as SELECT on the connected objects.

The important point is:

  • the Python package is the access engine

  • the notebook is the runbook

  • Genie is the business facing validation layer

So after provisioning access, we can create a small Genie Space on curated gold billing tables and validate with business friendly questions such as:

  • “Show me 10 recent invoices for Sales / Finance”

  • “Show me invoice lines for Sales / Finance”

  • “How many invoices are available for Sales / Finance?”