Observability in Azure Functions: Monitoring, Metrics

Tuhin Paul
Oct 24


Introduction
Why Observability Isn’t Optional in Serverless
Real-World Scenario: Real-Time Fraud Detection in Digital Banking
Enabling Application Insights for Azure Functions
Logging Custom Metrics from Function Code
End-to-End Monitoring Dashboard Setup
Best Practices for Enterprise Observability
Conclusion

Introduction

In the world of serverless computing, you don’t manage infrastructure—but you absolutely must manage observability. Without visibility into execution duration, failure rates, dependency latency, and custom business metrics, your Azure Functions become black boxes. And in production, black boxes fail silently—until they cost you customers, compliance, or capital.

This article works through a real-time fraud detection system in digital banking. It shows how to enable Application Insights from day one and how to log custom metrics that align with business KPIs, with code you can adapt for production.

Why Observability Isn’t Optional in Serverless

Azure Functions abstract away servers, but not responsibility. When a payment transaction is processed in 80ms versus 1.2 seconds, or when a fraud alert is missed due to an unlogged exception, the impact is financial and reputational.

Observability in serverless means:

Tracking every invocation’s duration, success, and dependencies
Correlating requests across microservices using operation IDs
Emitting custom metrics (e.g., “high-risk transactions blocked”)
Alerting on anomalies before users notice
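Correlation across services depends on a shared trace ID. Application Insights follows the W3C Trace Context standard, in which the `traceparent` HTTP header carries a trace ID that surfaces as `operation_Id`. The Functions host and OpenTelemetry SDKs normally propagate this header for you. As a hedged, stdlib-only illustration of what is being propagated, the hypothetical helper `parse_traceparent` below extracts the IDs from a version-00 header so a downstream service could log the same trace ID.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
_TRACEPARENT_RE = re.compile(
    r"^00-(?P<trace_id>[0-9a-f]{32})-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str   # shows up as operation_Id in Application Insights
    parent_id: str  # span ID of the calling operation
    sampled: bool   # lowest bit of the trace flags


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context from a version-00 traceparent header, or None if invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id = match["trace_id"], match["parent_id"]
    # All-zero IDs are invalid according to the specification
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceContext(trace_id, parent_id, sampled)
```

For example, the specification's sample header `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` yields trace ID `4bf92f3577b34da6a3ce929d0e0e4736` with sampling enabled; filtering on that value as `operation_Id` would show every service's telemetry for the request.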

Without this, you’re flying blind.

Real-World Scenario: Real-Time Fraud Detection in Digital Banking

A Tier-1 European bank processes 2.4 million digital transactions daily through its mobile app. Each transaction triggers an Azure Function that:

Validates the transaction
Scores fraud risk using a machine learning model
Blocks or allows the transaction in < 150ms

Business Requirement

P95 latency ≤ 120ms (to avoid user abandonment)
Zero undetected high-risk transactions
Real-time dashboard for the fraud ops team showing blocked transactions per minute
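A P95 target means 95% of invocations must finish within 120ms. To make that requirement testable, here is a minimal sketch of the nearest-rank percentile definition; the function names are hypothetical. Application Insights' KQL `percentile()` uses an approximation, so its figures can differ slightly from an exact calculation like this one.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], pct: float) -> float:
    """Smallest sample such that at least pct% of all samples are at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def meets_latency_slo(samples_ms: Sequence[float], p95_budget_ms: float = 120.0) -> bool:
    """True if the P95 of the observed latencies is within the budget."""
    return percentile_nearest_rank(samples_ms, 95) <= p95_budget_ms
```

With 100 samples, the P95 is the 95th-smallest value, so up to five slow outliers can exceed the budget without breaking the target.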

The Crisis

During a holiday sale surge, the team noticed spikes in payment timeouts—but logs showed no errors. Only after enabling deep telemetry did they discover:

Cold starts were adding 900ms during off-peak scaling
ML model loading was taking 300ms on the first invocation
There was no visibility into “transactions blocked due to risk score > 0.95”

The fix wasn’t code—it was observability.
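Problems like these are easier to spot when each invocation records whether it was the first one served by its worker process. A minimal stdlib sketch follows, assuming you attach the returned dictionary to your telemetry as custom dimensions; `invocation_context`, `cold_start`, and `process_uptime_s` are hypothetical names. Filtering on `cold_start` would then separate first-request latency (including model loading) from steady-state latency.

```python
import threading
import time

# Module scope runs once per worker process, so this marks when the process started.
_PROCESS_START = time.monotonic()
_first_invocation = True
_lock = threading.Lock()  # invocations may run concurrently on a thread pool


def invocation_context() -> dict:
    """Hypothetical custom dimensions for the current invocation.

    cold_start is True only for the first invocation handled by this process.
    """
    global _first_invocation
    with _lock:
        is_cold = _first_invocation
        _first_invocation = False
    return {
        "cold_start": is_cold,
        "process_uptime_s": round(time.monotonic() - _PROCESS_START, 3),
    }
```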

Enabling Application Insights for Azure Functions

The Azure portal offers to create and connect an Application Insights resource when you create a Function App, but in infrastructure-as-code (IaC) deployments you must provision the resource and wire it to the app explicitly.

Bicep Deployment with App Insights

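// infra.bicep
param location string = 'westeurope'
param appName string = 'fraud-detection-fn'

// Workspace-based Application Insights stores its data in a Log Analytics workspace
// (classic, non-workspace Application Insights resources have been retired)
resource workspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: '${appName}-law'
  location: location
  properties: {
    sku: { name: 'PerGB2018' }
    retentionInDays: 30
  }
}

resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: '${appName}-ai'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: workspace.id
  }
}

// Storage account names must be globally unique: 3-24 lowercase letters and digits
resource storage 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: '${take(replace(appName, '-', ''), 11)}${uniqueString(resourceGroup().id)}'
  location: location
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
}

// Python function apps run on Linux only, so the Elastic Premium plan is a Linux plan
resource plan 'Microsoft.Web/serverfarms@2022-09-01' = {
  name: '${appName}-asp'
  location: location
  kind: 'elastic'
  sku: {
    name: 'EP1' // Premium plan: pre-warmed instances for low latency
    tier: 'ElasticPremium'
  }
  properties: {
    reserved: true // required for Linux plans
  }
}

var storageConnection = 'DefaultEndpointsProtocol=https;AccountName=${storage.name};EndpointSuffix=${environment().suffixes.storage};AccountKey=${storage.listKeys().keys[0].value}'

resource functionApp 'Microsoft.Web/sites@2022-09-01' = {
  name: appName
  location: location
  kind: 'functionapp,linux'
  properties: {
    serverFarmId: plan.id
    siteConfig: {
      linuxFxVersion: 'Python|3.11'
      http20Enabled: true
      appSettings: [
        {
          name: 'APPLICATIONINSIGHTS_CONNECTION_STRING'
          value: appInsights.properties.ConnectionString
        }
        {
          name: 'FUNCTIONS_EXTENSION_VERSION'
          value: '~4'
        }
        {
          name: 'FUNCTIONS_WORKER_RUNTIME'
          value: 'python'
        }
        {
          name: 'AzureWebJobsStorage'
          value: storageConnection
        }
        {
          // Content share settings used by Elastic Premium plans
          name: 'WEBSITE_CONTENTAZUREFILECONNECTIONSTRING'
          value: storageConnection
        }
        {
          name: 'WEBSITE_CONTENTSHARE'
          value: toLower(appName)
        }
      ]
    }
  }
}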

Deploy with:

az deployment group create -g banking-rg --template-file infra.bicep

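Once deployed, the Functions host records every invocation automatically: duration, success or failure, and exceptions. Depending on language and configuration, some calls made through bindings also appear as dependencies, but downstream calls your Python code makes directly (for example, through the Cosmos DB or Key Vault SDKs) typically need extra instrumentation, such as OpenTelemetry, before they show up.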
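A missing or mistyped app setting is a common reason telemetry silently fails to arrive. As a hedged sketch, a startup check like the one below fails fast instead; the helper names are hypothetical, and it only inspects the `APPLICATIONINSIGHTS_CONNECTION_STRING` setting that `infra.bicep` configures. The connection string is a semicolon-separated list of `Key=Value` pairs such as `InstrumentationKey` and `IngestionEndpoint`.

```python
import os
from typing import Mapping, Optional


def parse_connection_string(value: str) -> dict:
    """Split 'Key1=Value1;Key2=Value2' into a dict, skipping empty segments."""
    parts = {}
    for segment in value.split(";"):
        segment = segment.strip()
        if not segment:
            continue
        # partition keeps any '=' inside the value intact
        key, sep, val = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {segment!r}")
        parts[key.strip()] = val.strip()
    return parts


def check_app_insights_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Raise at startup if the connection string is missing or has no InstrumentationKey."""
    env = os.environ if env is None else env
    raw = env.get("APPLICATIONINSIGHTS_CONNECTION_STRING")
    if not raw:
        raise RuntimeError("APPLICATIONINSIGHTS_CONNECTION_STRING is not set")
    parts = parse_connection_string(raw)
    if "InstrumentationKey" not in parts:
        raise RuntimeError("connection string has no InstrumentationKey")
    return parts
```

If crashing at startup is too strict for your team, the same check can log a warning instead; the point is to surface the misconfiguration before a production incident does.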

Logging Custom Metrics from Function Code

Built-in telemetry isn’t enough. You need business-aware metrics.

Here’s a Python function that logs:

Transaction risk score
Whether the transaction was blocked
Model inference latency

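# __init__.py  (Python v1 programming model: the HTTP trigger is declared in function.json)
# requirements.txt needs azure-functions and azure-monitor-opentelemetry
import logging
import time

import azure.functions as func
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

# The legacy `applicationinsights` package is no longer maintained, and it expected
# an instrumentation key that infra.bicep never sets. The Azure Monitor OpenTelemetry
# distro reads APPLICATIONINSIGHTS_CONNECTION_STRING instead.
configure_azure_monitor()

meter = metrics.get_meter("fraud-detection")
risk_score_metric = meter.create_histogram("Fraud_Risk_Score")
inference_latency_metric = meter.create_histogram("Model_Inference_Latency_ms", unit="ms")
blocked_counter = meter.create_counter("Transactions_Blocked")
allowed_counter = meter.create_counter("Transactions_Allowed")
# These instruments are pre-aggregated before export: in customMetrics, `value` is a
# sum over `valueCount` samples, so average with sum(value) / sum(valueCount).

# Load ML model once per worker process (module scope), not on every invocation
from fraud_model import RiskScorer
scorer = RiskScorer()

BLOCK_THRESHOLD = 0.95


def main(req: func.HttpRequest) -> func.HttpResponse:
    transaction_id = req.headers.get('X-Transaction-ID', 'unknown')
    user_id = req.params.get('user_id')
    if not user_id:
        return func.HttpResponse("Missing user_id", status_code=400)
    try:
        amount = float(req.params.get('amount', 0))
    except ValueError:
        return func.HttpResponse("Invalid amount", status_code=400)

    try:
        # Score transaction and time only the model inference
        start_time = time.perf_counter()
        risk_score = scorer.predict(user_id, amount)
        inference_ms = (time.perf_counter() - start_time) * 1000

        # Business metrics, stored as custom metrics in Application Insights
        risk_score_metric.record(risk_score)
        inference_latency_metric.record(inference_ms)

        if risk_score > BLOCK_THRESHOLD:
            blocked_counter.add(1)
            logging.warning("BLOCKED: high-risk transaction %s (score=%.2f)", transaction_id, risk_score)
            return func.HttpResponse("Transaction blocked: High fraud risk", status_code=403)

        allowed_counter.add(1)
        return func.HttpResponse("OK", status_code=200)

    except Exception:
        # logging.exception records the stack trace in Application Insights
        logging.exception("Fraud check failed for transaction %s", transaction_id)
        return func.HttpResponse("Internal error", status_code=500)

    # No per-invocation flush: the worker process is long-lived and the distro exports
    # metrics on a timer; a synchronous flush here would add export latency to every
    # request against a 120ms P95 budget.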
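Because these metrics drive dashboards and alerts, it helps to unit test that the right ones are emitted. One hedged design choice, not the article's own code: keep the block/allow decision in a small function that writes to an injected sink, so production code can pass an adapter over the telemetry SDK while tests pass an in-memory fake. `MetricSink`, `InMemorySink`, and `decide` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple

BLOCK_THRESHOLD = 0.95  # matches the 0.95 threshold used by the function above


class MetricSink(Protocol):
    """Anything that can record a named metric value (e.g. an SDK adapter)."""

    def record(self, name: str, value: float) -> None: ...


@dataclass
class InMemorySink:
    """Test double that keeps every (name, value) pair it receives."""

    points: List[Tuple[str, float]] = field(default_factory=list)

    def record(self, name: str, value: float) -> None:
        self.points.append((name, value))


def decide(risk_score: float, sink: MetricSink) -> int:
    """Return the HTTP status for a risk score and emit the matching business metrics."""
    sink.record("Fraud_Risk_Score", risk_score)
    if risk_score > BLOCK_THRESHOLD:
        sink.record("Transactions_Blocked", 1)
        return 403
    sink.record("Transactions_Allowed", 1)
    return 200
```

A test can then assert that a score of 0.97 returns 403 and records `Transactions_Blocked`, without any Azure resources.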

End-to-End Monitoring Dashboard Setup

In the Azure portal:

Go to your Application Insights resource
Open Logs (Analytics)
Run this Kusto query, then pin the resulting chart to an Azure dashboard for the fraud ops team:

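customMetrics
| where name in ("Transactions_Blocked", "Fraud_Risk_Score")
| summarize
    blocked = sumif(value, name == "Transactions_Blocked"),
    risk_sum = sumif(value, name == "Fraud_Risk_Score"),
    risk_samples = sumif(valueCount, name == "Fraud_Risk_Score")
  by bin(timestamp, 1m)
// value can be a pre-aggregated sum over valueCount samples, so average as sum / count
| extend avg_risk = iff(risk_samples > 0, risk_sum / risk_samples, real(null))
| project timestamp, blocked, avg_risk
| render timechart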
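The same per-minute aggregation in plain Python may make the arithmetic clearer. This is a hedged sketch that assumes rows shaped like `customMetrics` (`timestamp`, `name`, `value`, `valueCount`); `MetricRow` and `per_minute` are hypothetical names. When an SDK pre-aggregates, `value` holds a sum over `valueCount` samples, so the correct average is `sum(value) / sum(valueCount)`, not the mean of the `value` column.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, NamedTuple


class MetricRow(NamedTuple):
    """One customMetrics row: value is a sum over value_count samples."""

    timestamp: datetime
    name: str
    value: float
    value_count: int


def per_minute(rows: Iterable[MetricRow]) -> Dict[datetime, dict]:
    """Blocked transactions and sample-weighted average risk per one-minute bin."""
    buckets = defaultdict(lambda: {"blocked": 0.0, "risk_sum": 0.0, "risk_count": 0})
    for row in rows:
        minute = row.timestamp.replace(second=0, microsecond=0)  # like bin(timestamp, 1m)
        bucket = buckets[minute]
        if row.name == "Transactions_Blocked":
            bucket["blocked"] += row.value
        elif row.name == "Fraud_Risk_Score":
            bucket["risk_sum"] += row.value
            bucket["risk_count"] += row.value_count
    result = {}
    for minute, bucket in sorted(buckets.items()):
        count = bucket["risk_count"]
        avg = bucket["risk_sum"] / count if count else None
        result[minute] = {"blocked": bucket["blocked"], "avg_risk": avg}
    return result
```

For two risk rows in the same minute, one with `value=1.5, valueCount=2` and one with `value=0.5, valueCount=2`, the sample-weighted average is 2.0 / 4 = 0.5, while naively averaging the `value` column would give 1.0.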

Set up alert rules:

Alert if Transactions_Blocked stays at zero for 10 minutes (a possible sign that model scoring has failed)
Alert if the P95 request duration exceeds 150ms
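The first rule has a subtlety: an SDK may send no `Transactions_Blocked` row at all for a quiet minute instead of a row with 0, so the check must treat missing minutes as zero. Here is a hedged, stdlib-only version of that logic; the function name and input shape (per-minute counts keyed by the minute's start) are assumptions. In a log alert you would express the same zero-fill in KQL, for example with `make-series ... default=0`.

```python
from datetime import datetime, timedelta
from typing import Mapping


def blocked_silence(
    blocked_by_minute: Mapping[datetime, float],
    now: datetime,
    window_minutes: int = 10,
) -> bool:
    """True if nothing was blocked in any of the last `window_minutes` complete minutes.

    Minutes with no entry count as zero, because quiet minutes may produce no row.
    """
    current_minute = now.replace(second=0, microsecond=0)  # incomplete, so excluded
    window = [current_minute - timedelta(minutes=i) for i in range(1, window_minutes + 1)]
    return all(blocked_by_minute.get(minute, 0) == 0 for minute in window)
```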

Best Practices for Enterprise Observability

Always enable Application Insights at deployment—never as an afterthought
Log custom metrics aligned with business outcomes (e.g., “fraud prevented”)
Use operation_Id to trace requests across functions and services
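Know when your telemetry SDK sends buffered data, and flush explicitly wherever a process might exit before the next export
Avoid logging PII: sanitize what you log, and put structured fields in custom dimensions instead of interpolating them into message text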
Set up synthetic transactions to detect cold-start regressions
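One way to keep user identity out of telemetry while still grouping events per user is to log a keyed hash instead of the raw ID. A minimal sketch, assuming a secret key loaded from a secure store such as Key Vault; the helper names and the `user_ref` dimension are hypothetical. Pseudonymized values can still count as personal data under regulations such as GDPR, so treat this as risk reduction rather than anonymization.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret: bytes) -> str:
    """Stable, non-reversible token for a user ID (HMAC-SHA256, first 16 hex chars)."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def telemetry_dimensions(user_id: str, transaction_id: str, secret: bytes) -> dict:
    """Custom dimensions that carry no raw user ID, amount, or account data."""
    return {
        "user_ref": pseudonymize(user_id, secret),
        "transaction_id": transaction_id,
    }
```

Because the same key always maps a user to the same `user_ref`, the fraud ops team can still count blocked transactions per user without the raw ID ever reaching Application Insights.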

Conclusion

In serverless architectures, observability is your control plane. The bank in our scenario reduced fraud losses by 22% and cut payment latency by 63%—not by rewriting code, but by instrumenting it correctly from the start.

Application Insights + custom metrics + proactive alerting = trust at scale.