AI-Based Data Classification Pipeline (PII Detection, Anonymization Rules)

Introduction

Every enterprise handles sensitive data — names, emails, IDs, addresses, medical notes, financial numbers. Regulations (GDPR, CCPA, HIPAA) and simple business risk demand you find personally identifiable information (PII) quickly and treat it correctly. Manual scans are slow and error-prone. An AI-based data classification pipeline fast-tracks discovery, applies anonymization or masking rules, and provides human-review flows for edge cases.

This article walks you through designing and implementing a production pipeline that:

  • Detects PII with hybrid methods (rules + ML)

  • Assigns sensitivity and confidence levels

  • Applies anonymization, pseudonymization, or redaction policies

  • Supports human-in-the-loop review (Angular UI)

  • Integrates with a .NET backend, storage, and message buses

  • Is auditable, testable, and scalable

I assume you want a practical guide you can implement in enterprise systems: clear architecture, data models, sample code, testing strategy, and operational guidance.

Goals and Non-Goals

Goals

  • Reliable PII detection across structured (tables) and unstructured (text, documents) data.

  • Risk-based anonymization: reversible pseudonymization for analytics; irreversible masking for exports.

  • Human review for low-confidence detections.

  • Scalable and observable pipeline for batch and streaming workloads.

Non-Goals

  • This is not a legal compliance checklist — consult privacy counsel for final policy.

  • This is not an attempt to cover every PII type (biometrics, genetic data) — extend patterns as needed.

High-Level Architecture

             ┌────────────────────────┐
             │   Data Sources         │
             │ (DBs, Files, Streams)  │
             └──────────┬─────────────┘
                        │
                        ▼
               ┌────────────────────┐
               │ Ingest Layer       │
               │ (Batch / Stream)   │
               └──────────┬─────────┘
                          │
            ┌─────────────▼────────────┐
            │ Preprocessing & Normal.  │
            │ (tokenize, normalize)    │
            └─────────────┬────────────┘
                          │
             ┌────────────▼────────────┐
             │ Hybrid Detector         │
             │ - Regex & Heuristics    │
             │ - ML Model (NER)        │
             └────────────┬────────────┘
                          │
         ┌────────────────┴────────────────┐
         │ Confidence Scorer & Risk Engine │
         └────────────┬─────────────┬──────┘
                      │             │
                      ▼             ▼
        ┌─────────────────────┐  ┌──────────────┐
       │ Auto-Anonymizer     │  │ Review Queue │
       │ (mask/replace/hash) │  │ (UI + Human) │
       └─────────┬───────────┘  └──────────────┘
                 │                        │
                 ▼                        ▼
        ┌────────────────────┐    ┌────────────────┐
        │ Output Store (Safe)│    │ Audit + Logs   │
        │ (masked / pseud.)  │    │ (who, when)    │
        └────────────────────┘    └────────────────┘

Key Concepts and Decisions

Hybrid Detection: Rules + ML

  • Rules (regular expressions, dictionaries, checksum validators) are deterministic and fast. Use them for exact patterns: email, credit-card numbers (Luhn), phone numbers.

  • ML (NER — Named Entity Recognition) catches context-dependent PII: person names in free text, organization names, ID tokens that vary by country.

  • Combine both in a pipeline: run fast rules first, then ML for remaining uncertain tokens.
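To make the rule pass concrete, here is a minimal Python sketch — the patterns, names, and detection shape are illustrative, not a production rule set. It pairs an email regex with a Luhn checksum over card-number candidates, and keeps character offsets for later redaction:

```python
import re

# Illustrative patterns; a real rule set would be larger and configurable.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Checksum validation for candidate card numbers (Luhn algorithm)."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def run_rules(text: str) -> list[dict]:
    """Return rule-based detections with offsets for later redaction."""
    hits = []
    for m in EMAIL_RE.finditer(text):
        hits.append({"type": "EMAIL", "start": m.start(), "end": m.end(),
                     "confidence": 0.99})
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):  # checksum filters random digit runs
            hits.append({"type": "CREDIT_CARD", "start": m.start(),
                         "end": m.end(), "confidence": 0.99})
    return hits
```

Note the checksum step: without it, any 13-19 digit run (invoice numbers, timestamps) would false-positive as a card.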

Confidence And Risk

Every detection must carry a confidence score and a risk label (Low/Medium/High). Risk drives action:

  • High confidence, high-risk → automatic anonymize.

  • Medium confidence → send to review queue.

  • Low confidence → annotate but leave untouched.

Anonymization Modes

  • Masking / Redaction: Replace characters with *** — irreversible. Use for public exports.

  • Pseudonymization: Replace identifier with a reversible token (store mapping in a secure vault) — useful for analytics where you need to join events across datasets without exposing raw PII.

  • Tokenization (Vault): Store actual value in a secrets store or HSM and replace with token reference.

  • Hashing: Irreversible hash, optionally salted per-tenant. Use when joinability is needed and reversal is forbidden.
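The modes above are simple to sketch in Python — the function names and the in-memory dict are illustrative stand-ins for real vault/HSM integration:

```python
import hashlib
import hmac
import uuid

def mask(value: str, keep_last: int = 0) -> str:
    """Irreversible masking; optionally keep trailing characters for readability."""
    if keep_last <= 0:
        return "***"
    return "*" * max(len(value) - keep_last, 0) + value[-keep_last:]

def salted_hash(value: str, tenant_salt: bytes) -> str:
    """Irreversible but stable per tenant, so hashed values stay joinable."""
    return hmac.new(tenant_salt, value.encode("utf-8"), hashlib.sha256).hexdigest()

def tokenize_value(value: str, vault: dict) -> str:
    """Vault-style tokenization: store the original, return an opaque token.
    The dict stands in for a real secrets store or HSM."""
    token = f"tok_{uuid.uuid4().hex}"
    vault[token] = value
    return token
```

The per-tenant salt in `salted_hash` is what keeps hashes joinable inside a tenant but unlinkable across tenants.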

Human-in-the-Loop

Low confidence items should be pushed into a review UI where privacy officers or data stewards can approve, adjust classification, or add suppression rules.

Auditability and Explainability

Record each detection event (original value, detected type, detector used, score, action taken, operator decisions). This is essential for compliance audits.

Data Model (Core Tables)

Simplified SQL schema for detection metadata:

CREATE TABLE DetectionJob (
  JobId UNIQUEIDENTIFIER PRIMARY KEY,
  SourceName NVARCHAR(200),
  StartedAt DATETIME2,
  CompletedAt DATETIME2 NULL,
  Status NVARCHAR(20) -- Pending, Running, Completed, Failed
);

CREATE TABLE DetectionRecord (
  RecordId BIGINT IDENTITY PRIMARY KEY,
  JobId UNIQUEIDENTIFIER,
  SourceKey NVARCHAR(400), -- pointer to source row or file path
  FieldName NVARCHAR(200) NULL,
  OriginalSnippet NVARCHAR(MAX),
  DetectedType NVARCHAR(100), -- EMAIL, SSN, NAME, CREDIT_CARD
  Confidence FLOAT,
  RiskLevel NVARCHAR(20), -- Low, Medium, High
  ActionTaken NVARCHAR(50), -- Masked, Pseudonymized, Reviewed
  ActionedAt DATETIME2 NULL,
  ActionedBy NVARCHAR(200) NULL,
  AuditJson NVARCHAR(MAX) NULL -- detector traces, regex matched groups, model logits
);

For pseudonymization mapping:

CREATE TABLE PseudonymMap (
  PseudoId UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
  HashKey NVARCHAR(200), -- salted hash of original
  EncryptedValue VARBINARY(MAX), -- encrypted original value
  CreatedAt DATETIME2 DEFAULT SYSUTCDATETIME()
);

Encryption and key management must use Key Vault/HSM in production.

Pipeline Flow (Detailed)

  1. Ingest

    • Batch: schedule ETL jobs to extract columns / files.

    • Stream: subscribe to CDC topics, Kafka, or messaging streams.

    • For files (PDF/DOCX), run OCR first (Tesseract or cloud OCR) to get text.

  2. Preprocess

    • Normalize whitespace, remove HTML tags, canonicalize formats (phone internationalization).

    • Tokenize into sentences/words; keep offsets for redaction.

  3. Rule-Based Pass

    • Run fast regex checks and dictionary lookups.

    • Validate numbers (credit-card Luhn, IBAN patterns, checksum validation).

    • Mark exact matches as high confidence.

  4. ML Model Pass

    • Run a Named Entity Recognition model (fine-tuned for PII).

    • Models can be Transformers (DistilBERT, RoBERTa), CRF over tokens, or lighter spaCy pipelines for speed.

    • Capture per-entity confidence logits.

  5. Confidence Fusion & Risk Scoring

    • Combine rule and ML signals with heuristics: e.g., regex match + model support → raise confidence.

    • Apply contextual checks: presence of "SSN" label near numbers increases risk.

    • Assign final action via policy rules.

  6. Auto-Anonymize or Queue for Review

    • If auto-action → apply anonymization per policy (mask, pseudonymize).

    • If review → emit event to review queue with original snippet and suggested action.

  7. Persist and Audit

    • Store detection record and mapping (if pseudonymized) securely.

    • Push audit logs to immutable store (append-only) with operator metadata.

  8. Downstream

    • Safe data flows to analytics, exports, or storage.

    • Provide APIs for data retrieval which enforce decryption/authorized token resolution.
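The offset bookkeeping in steps 2 and 6 is worth making concrete: redactions must be applied against the original string positions, and applying them right-to-left keeps earlier offsets valid as the string changes length. A minimal Python sketch (the detection dict shape is illustrative):

```python
def redact(text: str, detections: list[dict], placeholder: str = "[{type}]") -> str:
    """Replace detected spans with typed placeholders.

    Spans are applied right-to-left so earlier offsets remain valid
    as the string shrinks or grows. Detection dicts are assumed to
    carry 'start', 'end', and 'type' keys.
    """
    for d in sorted(detections, key=lambda d: d["start"], reverse=True):
        text = text[:d["start"]] + placeholder.format(type=d["type"]) + text[d["end"]:]
    return text
```

Applying left-to-right instead would shift every later span and corrupt the output — a classic redaction bug.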

Implementation: .NET Backend

Technology Choices

  • .NET 8 / ASP.NET Core for APIs and workers.

  • Entity Framework Core for metadata and mapping tables (or plain Dapper for performance).

  • Kafka / Azure Event Hub for streaming ingestion and review queue.

  • Redis for fast rule caches (blacklists, dictionaries).

  • Key Vault / HSM for pseudonym encrypted storage.

  • ML model serving, with two options:

    • Host models in Python microservice (FastAPI) and call from .NET via HTTP/gRPC.

    • Use ONNX models and run them in .NET via Microsoft.ML or ONNX Runtime for faster, native inference.

Recommendation: Use ONNX for inference performance and to avoid the operational complexity of a cross-language deployment. For development you can prototype with Python + HuggingFace.

Example: Detection Worker (C# pseudocode)

public class DetectionWorker : BackgroundService
{
    private readonly IDataSource _source;
    private readonly IDetector _detector;
    private readonly IDbContext _db;
    private readonly IEventProducer _producer;

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        await foreach (var batch in _source.ReadBatchesAsync(ct))
        {
            foreach (var row in batch)
            {
                var normalized = Preprocess(row);
                var ruleMatches = _detector.RunRules(normalized);
                var mlEntities = await _detector.RunModelAsync(normalized);
                var fused = Fuse(ruleMatches, mlEntities);
                foreach (var entity in fused)
                {
                    var record = MapToDetectionRecord(entity, row);
                    await _db.DetectionRecords.AddAsync(record);
                    if (ShouldAutoAnonymize(entity))
                    {
                        var anonymized = await ApplyAnonymizerAsync(entity);
                        await SaveResult(row, anonymized);
                    }
                    else
                    {
                        await _producer.ProduceAsync("review-queue", CreateReviewEvent(record));
                    }
                }
            }
            await _db.SaveChangesAsync(ct);
        }
    }
}

Anonymization Example (Pseudonymization)

public async Task<Guid> PseudonymizeAsync(string value)
{
    var salt = GetTenantSalt();
    var key = ComputeHmac(salt, value); // stable hash per tenant
    var existing = await _db.PseudonymMap.FirstOrDefaultAsync(p => p.HashKey == key);
    if (existing != null) return existing.PseudoId;
    var encrypted = EncryptWithVault(value);
    var map = new PseudonymMap { HashKey = key, EncryptedValue = encrypted };
    _db.PseudonymMap.Add(map);
    await _db.SaveChangesAsync();
    return map.PseudoId;
}

Security note: Use KMS-managed encryption keys, audit every decrypt operation.
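The same get-or-create flow reads naturally in Python — here with an in-memory map standing in for the PseudonymMap table (a sketch only; a production version would encrypt and persist the original via the vault):

```python
import hashlib
import hmac
import uuid

class PseudonymStore:
    """In-memory stand-in for the PseudonymMap table; a real system would
    encrypt the original value via a vault/HSM rather than keep it here."""

    def __init__(self, tenant_salt: bytes):
        self._salt = tenant_salt
        self._by_hash: dict[str, uuid.UUID] = {}

    def pseudonymize(self, value: str) -> uuid.UUID:
        # A stable per-tenant HMAC makes the operation idempotent:
        # the same value always maps back to the same pseudonym,
        # which preserves joinability across datasets.
        key = hmac.new(self._salt, value.encode("utf-8"),
                       hashlib.sha256).hexdigest()
        if key not in self._by_hash:
            self._by_hash[key] = uuid.uuid4()
        return self._by_hash[key]
```

The hash lookup before insert is what makes repeated pseudonymization of the same value stable rather than generating a new token each time.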

Implementation: ML Models

Model Choices

  • spaCy (fast, pipeline-friendly) — good starting point for NER.

  • Transformers (DistilBERT / RoBERTa) — higher accuracy for ambiguous entities.

  • Custom CRF over token features — good for lower-resource or domain-specific tuning.

Training Data

  • Public NER corpora (CoNLL) for structure; synthetic generation for domain-specific PII (generate fake names/numbers in realistic contexts).

  • Labelled datasets from internal manual review (human-in-the-loop) will drastically improve performance.

Model Serving

  • Convert trained model to ONNX for fast cross-platform inference.

  • Host inference as a microservice with batching and GPU support.

  • Provide metrics: latency, throughput, model confidence distribution.

Example Fusion Heuristic

  • If regex confidence >= 0.95 → mark high confidence.

  • Else if ML confidence >= 0.9 → high confidence.

  • If both exist → boost to 0.99.

  • If ML conf between 0.6–0.9 → review.

Tune thresholds with ROC/AUC analysis against labeled validation set.

Angular UI: Review And Rule Management

UI Responsibilities

  • Show pending detection items with context (original text + highlighted tokens).

  • Offer suggested action (mask, pseudonymize, ignore) with one-click.

  • Allow editing of entity span or type.

  • Provide rule management UI: add regex, add dictionary entries, set policy per tenant.

  • Approval workflow: assign to reviewer, status updates, audit trail.

Component Outline (Angular)

  • ReviewListComponent — paged list of pending items.

  • ReviewDetailComponent — text viewer with highlighted spans and action buttons.

  • RuleManagerComponent — create/edit/delete rules; test regex against sample text.

  • PolicyEditorComponent — per-tenant anonymization rules (which PII types auto-mask, pseudonymize, etc.)

Sample HTTP Interceptor (auto attach tokens)

@Injectable()
export class AuthInterceptor implements HttpInterceptor {
  constructor(private auth: AuthService) {}

  intercept(req: HttpRequest<unknown>, next: HttpHandler): Observable<HttpEvent<unknown>> {
    const token = this.auth.getToken();
    const cloned = req.clone({ setHeaders: { Authorization: `Bearer ${token}` } });
    return next.handle(cloned);
  }
}

UX Tips

  • Show confidence visually (bar or color).

  • Provide quick "Accept All Low Risk" for bulk handling with audit logs.

  • Support keyboard shortcuts for fast triage.

Policies And Governance

Create policy documents that define actions mapped to detection types and risk levels, for example:

Type          Low        Medium         High
EMAIL         annotate   pseudonymize   mask
NAME          annotate   review         mask
CREDIT_CARD   mask       mask           mask

Policies should be configurable per tenant and stored as versioned JSON policies.
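A versioned JSON policy matching the table above might look like the following (field names are illustrative), paired with a lookup helper that falls back to review for unknown types:

```python
import json

# Hypothetical versioned policy document mirroring the table above.
POLICY_JSON = """
{
  "version": 3,
  "actions": {
    "EMAIL":       {"Low": "annotate", "Medium": "pseudonymize", "High": "mask"},
    "NAME":        {"Low": "annotate", "Medium": "review",       "High": "mask"},
    "CREDIT_CARD": {"Low": "mask",     "Medium": "mask",         "High": "mask"}
  },
  "default_action": "review"
}
"""

def resolve_action(policy: dict, pii_type: str, risk: str) -> str:
    """Look up the action for a detection; unknown types fall back to review."""
    return policy["actions"].get(pii_type, {}).get(risk, policy["default_action"])

policy = json.loads(POLICY_JSON)
```

Defaulting unknown types to review (rather than to no action) is the safer failure mode when a new PII type appears before its policy entry does.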

Testing Strategy

  • Unit Tests: tokenizer, regex rules, HMAC hashing, DB mapping.

  • Model Validation: holdout dataset with metrics (precision, recall, F1) per PII type.

  • Integration Tests: end-to-end ingestion → detection → action → audit.

  • Human Review Trials: measure disagreement rates and tune thresholds.

  • Load Tests: ensure model server handles expected QPS with acceptable latency.

Also run false-positive and false-negative audits periodically.
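For model validation, per-type precision, recall, and F1 fall directly out of confusion counts; a small helper like this (illustrative) keeps the metric definitions explicit:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion counts for one PII type.

    Guard clauses return 0.0 rather than dividing by zero when a
    type has no predictions or no ground-truth instances.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For PII work, recall usually matters more than precision: a missed detection leaks data, while a false positive only costs reviewer time.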

Performance And Scaling

  • Rule pass is fast; run it inline.

  • ML inference is heavier — batch multiple samples and use GPU if available.

  • Separate pipeline: preprocess + rule pass synchronous; collect uncertain items and run batched ML inference.

  • Use horizontal scaling for workers and autoscale model serving by CPU/GPU metrics.

Throughput design targets

  • Latency sensitive flows (ingest → mask) need sub-second rule-based masking.

  • Bulk historical scans can take hours — prioritize by risk and use distributed workers.

Privacy, Security And Legal Considerations

  • Minimize exposure of raw PII inside application logs — censor values in logs.

  • Encrypt PII at rest and in transit.

  • Audit all access to PII (who viewed original, who approved changes).

  • Implement role-based access control for review UI.

  • Provide data subject request APIs (export, delete) and ensure pipeline respects them.

  • Keep data retention policies and deletion workflows.

Monitoring And Observability

Metrics to capture:

  • Number of detections per PII type.

  • Auto-anonymize rate vs review rate.

  • Average ML confidence per type.

  • Review queue size and reviewer throughput.

  • Time to action (how long items stay in review).

Logs and traces should include traceId, job id and detection id for end-to-end debugging.

Example End-to-End Scenario

Scenario: Bulk scan of customer support transcripts to remove PII before analytics.

  1. ETL extracts transcripts into object store and emits a scan-request event.

  2. Detection worker picks up a transcript, runs rule pass: finds a few emails and phone numbers → masks them immediately.

  3. ML pass identifies suspected names with medium confidence → creates detection records and enqueues them in review-queue.

  4. Data steward opens Angular review UI, approves most suggestions and corrects a wrong span. Approval triggers policy action: pseudonymize approved names.

  5. Final masked transcript stored in safe store and analytics job consumes it.

Audit trail records the original snippet (encrypted), reviewer decisions, and the final action.

Roadmap And Improvements

  • Add active learning: feed reviewer corrections back to retrain model periodically.

  • Add language detection and per-language models for better performance.

  • Expand to structured data sources (databases) with column-level policies and differential thresholds.

  • Integrate BYOK for tenant-specific key control over pseudonymization.

Summary

An AI-based data classification pipeline is a practical, high-value capability for any organisation that stores or processes sensitive data. A production-ready pipeline combines deterministic rules, ML models, a robust .NET backend, and a focused Angular review UI. Key success factors are accuracy (low false positives/negatives), auditable actions, scalable model serving, and clear governance rules.

Start with a simple hybrid detector (regex + off-the-shelf NER), add human review for low-confidence items, and iterate with active learning.