Clustering

Comprehensive guide to running Elsa Workflows in clustered and distributed production environments, covering architecture patterns, distributed locking, scheduling, and operational best practices.

Executive Summary

Running Elsa Workflows in a clustered environment is essential for achieving high availability, scalability, and fault tolerance in production deployments. A clustered setup allows multiple Elsa instances to work together, distributing workload across nodes while maintaining consistency and preventing data corruption.

Why Clustering Matters

Production Requirements Clustering Solves:

High Availability: If one node fails, others continue processing workflows without interruption
Horizontal Scalability: Handle increased load by adding more nodes rather than scaling vertically
Zero-Downtime Deployments: Rolling updates with no service interruption
Geographic Distribution: Deploy nodes across regions for disaster recovery and reduced latency
Load Balancing: Distribute HTTP requests and background jobs across multiple instances

Key Challenges Clustering Addresses:

Concurrent Modification: Preventing multiple nodes from modifying the same workflow instance simultaneously
Duplicate Scheduling: Ensuring timers and scheduled tasks execute only once
Cache Consistency: Keeping in-memory caches synchronized across nodes
Race Conditions: Managing concurrent bookmark resume attempts

Without proper clustering configuration, you may encounter:

Workflow state corruption from simultaneous updates
Duplicate timer executions causing repeated notifications or side effects
Cache inconsistencies leading to stale workflow definitions
Race conditions when external events trigger workflow resumption

Conceptual Overview

Understanding Corruption and Duplication Risks

Problem 1: Duplicate Timer Execution

Scenario: A workflow with a timer activity (e.g., "Send reminder email in 24 hours") is deployed across 3 nodes.

Without Clustering:

Time T+24h:
- Node 1 checks: "Timer expired? Yes" → Sends email
- Node 2 checks: "Timer expired? Yes" → Sends email  ❌ Duplicate!
- Node 3 checks: "Timer expired? Yes" → Sends email  ❌ Duplicate!

Result: Customer receives 3 identical reminder emails instead of 1.

With Clustering (Quartz.NET Clustering):

Time T+24h:
- Quartz Cluster: Node 1 acquires job lock → Executes task
- Node 2 attempts lock → Already held by Node 1 → Skips
- Node 3 attempts lock → Already held by Node 1 → Skips

Result: Customer receives exactly 1 email as intended.

Problem 2: Concurrent Workflow Modification

Scenario: An HTTP workflow receives two simultaneous requests that both attempt to resume the same workflow instance.

Without Distributed Locking:

Request A arrives at Node 1:
1. Load workflow instance (State: Step 2)
2. Execute Step 3
3. Save workflow instance (State: Step 3)

Request B arrives at Node 2 (simultaneously):
1. Load workflow instance (State: Step 2)  ← Loads stale state!
2. Execute Step 3
3. Save workflow instance (State: Step 3)  ← Overwrites Node 1's changes!

Result: Workflow execution is corrupted; steps may be skipped or repeated.

With Distributed Locking:

Request A arrives at Node 1:
1. Acquire lock on workflow instance
2. Load, execute, save
3. Release lock

Request B arrives at Node 2 (simultaneously):
1. Attempt to acquire lock → Blocked (Node 1 holds it)
2. Wait for lock release
3. Lock released → Detect workflow already resumed → Skip or continue appropriately

Result: Workflow executes correctly without corruption.

Problem 3: Cache Invalidation

Scenario: An administrator updates a workflow definition in Elsa Studio.

Without Distributed Cache Invalidation:

Admin updates workflow via Node 1:
- Node 1: Clears local cache, reloads definition ✓
- Node 2: Keeps stale cache ❌ (uses old version)
- Node 3: Keeps stale cache ❌ (uses old version)

Result: New workflow instances on Node 2 and 3 use outdated definitions.

With Distributed Cache Invalidation (MassTransit):

Admin updates workflow via Node 1:
- Node 1: Publishes "WorkflowDefinitionChanged" event to message bus
- Node 2: Receives event → Clears local cache ✓
- Node 3: Receives event → Clears local cache ✓

Result: All nodes use the updated workflow definition immediately.

How Elsa Mitigates These Risks

Elsa provides four key mechanisms for safe clustering:

1. Bookmark Hashing

Bookmarks (suspension points in workflows) are assigned deterministic hashes based on their properties. When multiple nodes attempt to create the same bookmark, the hash collision is detected, and only one bookmark is persisted.

Code Reference: src/modules/Elsa.Workflows.Core/Contexts/ActivityExecutionContext.cs - CreateBookmark method

// Simplified illustration of bookmark hashing
var bookmarkHash = GenerateHash(activityId, payload, correlationId);
var existingBookmark = await FindBookmarkByHash(bookmarkHash);

if (existingBookmark != null)
{
    // Bookmark already exists; don't create duplicate
    return existingBookmark;
}

// Create new bookmark with unique hash
var bookmark = new Bookmark { Hash = bookmarkHash, ... };
await SaveBookmark(bookmark);

2. Distributed Locking

The WorkflowResumer service acquires a distributed lock before resuming a workflow instance. This ensures only one node processes a resume request at a time.

Code Reference: src/modules/Elsa.Workflows.Runtime/Services/WorkflowResumer.cs

// Simplified illustration from WorkflowResumer
public async Task<ResumeWorkflowResult> ResumeWorkflowAsync(
    string workflowInstanceId,
    string bookmarkId,
    CancellationToken cancellationToken)
{
    // Generate deterministic lock key
    var lockKey = $"workflow:{workflowInstanceId}:bookmark:{bookmarkId}";
    
    // Acquire distributed lock (Redis, PostgreSQL, etc.)
    await using var lockHandle = await _distributedLockProvider
        .AcquireLockAsync(lockKey, timeout: TimeSpan.FromSeconds(30), cancellationToken);
    
    if (lockHandle == null)
    {
        // Another node is already processing this resume
        _logger.LogInformation("Lock acquisition failed; resume already in progress");
        return ResumeWorkflowResult.AlreadyInProgress();
    }
    
    try
    {
        // Safe to resume - we hold the lock
        var result = await ResumeWorkflowCoreAsync(...);
        return result;
    }
    finally
    {
        // Lock automatically released when lockHandle is disposed
    }
}

Lock Providers Supported:

Redis: Fast, in-memory locking via Medallion.Threading.Redis
PostgreSQL: Database-backed locking via Medallion.Threading.Postgres
SQL Server: Database-backed locking via Medallion.Threading.SqlServer
Azure Blob Storage: Cloud-native locking via Medallion.Threading.Azure

3. Centralized Scheduler (Quartz.NET Clustering)

Quartz.NET clustering ensures scheduled jobs (timers, delays, cron triggers) execute only once across the cluster.

Code References:

src/modules/Elsa.Scheduling/Services/DefaultBookmarkScheduler.cs - Enqueues bookmark resume tasks
src/modules/Elsa.Scheduling/Tasks/ResumeWorkflowTask.cs - Quartz job that resumes workflows

How It Works:

DefaultBookmarkScheduler creates a Quartz job for each scheduled bookmark
Quartz stores job metadata in a shared database
At execution time, nodes compete for a database lock
The node that acquires the lock executes the job; others skip it
Failed nodes' jobs are recovered by surviving nodes (failover)

4. Distributed Cache Invalidation

When workflow definitions or other cached data changes, MassTransit publishes cache invalidation events to all nodes via a message broker (RabbitMQ, Azure Service Bus, etc.).

Message Flow:

Node 1 (Admin updates workflow) → Publish event to RabbitMQ
                                    ↓
                    ┌───────────────┼───────────────┐
                    ↓               ↓               ↓
                  Node 1          Node 2          Node 3
            (clear cache)   (clear cache)   (clear cache)

Architecture Patterns and Deployment Models

Pattern 1: Shared Database + Distributed Locks

Best for: Most production scenarios with moderate to high traffic

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Elsa Node  │  │  Elsa Node  │  │  Elsa Node  │
│   (Pod 1)   │  │   (Pod 2)   │  │   (Pod 3)   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       └────────────────┼────────────────┘
                        ↓
            ┌───────────────────────┐
            │  PostgreSQL Database  │
            │  - Workflow instances │
            │  - Bookmarks          │
            │  - Distributed locks  │
            │  - Quartz tables      │
            └───────────────────────┘
                        ↑
                        │
            ┌───────────────────────┐
            │  Redis (optional)     │
            │  - Distributed locks  │
            │  - Session cache      │
            └───────────────────────┘
                        ↑
                        │
            ┌───────────────────────┐
            │  RabbitMQ             │
            │  - Cache invalidation │
            │  - Event pub/sub      │
            └───────────────────────┘

Configuration:

All nodes connect to the same database
Distributed runtime enabled: runtime.UseDistributedRuntime()
Distributed locks via Redis or PostgreSQL
Quartz clustering enabled for scheduled tasks
MassTransit for cache invalidation

Pros:

Simple architecture
Easy to scale horizontally
No single point of failure (stateless nodes)

Cons:

Database becomes a bottleneck at extreme scale
Requires careful database tuning

Pattern 2: Leader-Election Scheduler

Best for: Environments where you want precise control over scheduling overhead

┌─────────────────────┐
│  Scheduler Node     │ ← Leader (elected or manually designated)
│  - Quartz scheduler │
│  - No HTTP endpoint │
└─────────┬───────────┘
          │ Schedules bookmark resume tasks
          ↓
┌──────────────────────────────────────┐
│  Worker Nodes (HTTP-only)            │
│  - Handle HTTP requests              │
│  - Execute workflows                 │
│  - No Quartz scheduler               │
└──────────────────────────────────────┘

Configuration:

Scheduler Node:

builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime());
    elsa.UseScheduling(scheduling => scheduling.UseQuartzScheduler());
    elsa.UseQuartz(quartz => quartz.UsePostgreSql(connectionString));
    // Don't expose HTTP endpoints (or expose for admin only)
});

Worker Nodes:

builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime());
    elsa.UseWorkflowsApi();  // Handle HTTP requests
    // Don't call UseScheduling() - no scheduler on workers
});

Pros:

Centralized scheduling (easier to monitor)
Workers focused on request handling
Lower resource usage on workers

Cons:

Scheduler is a single point of failure (mitigate with active-standby setup)
More complex deployment configuration

Pattern 3: Quartz Clustering (All-Nodes-Participate)

Best for: Simplicity and automatic failover

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Node 1     │  │  Node 2     │  │  Node 3     │
│  - Quartz   │  │  - Quartz   │  │  - Quartz   │
│  - HTTP     │  │  - HTTP     │  │  - HTTP     │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       └────────────────┼────────────────┘
                        ↓
            ┌───────────────────────┐
            │  PostgreSQL           │
            │  - Quartz tables      │
            │  - Cluster locks      │
            └───────────────────────┘

Configuration:

builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime());
    elsa.UseScheduling(scheduling => scheduling.UseQuartzScheduler());
    elsa.UseQuartz(quartz => 
    {
        quartz.UsePostgreSql(connectionString);
        // Clustering enabled automatically
    });
});

Pros:

Simple configuration (same for all nodes)
Automatic failover (if Node 1 crashes, Node 2 picks up its jobs)
No single point of failure

Cons:

Every node runs Quartz scheduler (slightly higher resource usage)
More database queries for cluster coordination

Recommendation: Use this pattern unless you have specific reasons to use leader-election.

Pattern 4: External Scheduler

Best for: Multi-tenant environments or complex orchestration needs

┌───────────────────────┐
│  External Scheduler   │ ← Kubernetes CronJob, Azure Functions, AWS Lambda
│  - Triggers workflows │
│  - No Elsa runtime    │
└───────────┬───────────┘
            │ HTTP API calls
            ↓
┌──────────────────────────────────────┐
│  Elsa Worker Nodes                   │
│  - REST API endpoints                │
│  - Execute workflows                 │
│  - No internal scheduler             │
└──────────────────────────────────────┘

Example: Kubernetes CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report-trigger
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: trigger
              image: curlimages/curl:latest
              command:
                - /bin/sh
                - -c
                - |
                  curl -X POST https://elsa.example.com/elsa/api/workflow-instances \
                    -H "Content-Type: application/json" \
                    -d '{"definitionId": "DailyReport", "input": {}}'
          restartPolicy: OnFailure

Pros:

No Quartz dependency
Leverage platform-native scheduling (Kubernetes, cloud functions)
Easier multi-cloud deployments

Cons:

External system must remain operational
More complex integration (API authentication, error handling)
No built-in bookmark scheduling (must implement externally)

Practical Configuration

Configuring Distributed Locks

Option 1: Redis-Based Locking (Recommended for Performance)

Prerequisites:

Redis 6.0+ deployed and accessible
NuGet package: Medallion.Threading.Redis

Configuration Example:

using Elsa.Extensions;
using Medallion.Threading.Redis;
using StackExchange.Redis;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowRuntime(runtime =>
    {
        runtime.UseDistributedRuntime();
        
        runtime.DistributedLockProvider = serviceProvider =>
        {
            var redisConnectionString = builder.Configuration.GetConnectionString("Redis");
            var connection = ConnectionMultiplexer.Connect(redisConnectionString);
            
            return new RedisDistributedSynchronizationProvider(
                connection.GetDatabase(),
                options =>
                {
                    // Lock expires after 30 seconds if not released
                    options.Expiry(TimeSpan.FromSeconds(30));
                    
                    // Minimum time before lock can expire (prevents premature expiration)
                    options.MinimumDatabaseExpiry(TimeSpan.FromSeconds(10));
                });
        };
    });
});

var app = builder.Build();
app.Run();

Connection String Example:

redis-host:6379,password=YOUR_PASSWORD,ssl=False,abortConnect=False,connectTimeout=5000

See: examples/redis-lock-setup.md for detailed configuration and troubleshooting.

Option 2: PostgreSQL-Based Locking (No Additional Infrastructure)

Prerequisites:

PostgreSQL 12+ (same database as Elsa workflow storage)
NuGet package: Medallion.Threading.Postgres

Configuration Example:

using Elsa.Extensions;
using Medallion.Threading.Postgres;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowRuntime(runtime =>
    {
        runtime.UseDistributedRuntime();
        
        runtime.DistributedLockProvider = serviceProvider =>
            new PostgresDistributedSynchronizationProvider(
                builder.Configuration.GetConnectionString("PostgreSql"),
                options =>
                {
                    // Keep connection alive with periodic keepalive
                    options.KeepaliveCadence(TimeSpan.FromMinutes(5));
                    
                    // Use connection multiplexing for efficiency
                    options.UseMultiplexing();
                });
    });
});

var app = builder.Build();
app.Run();

Connection String Example:

Server=postgres-host;Port=5432;Database=elsa;User Id=elsa;Password=YOUR_PASSWORD;MaxPoolSize=100

Pros vs Redis:

✅ No additional infrastructure required
✅ Uses existing database connection
❌ Slower lock acquisition (disk I/O vs in-memory)
❌ Adds load to database

Medallion.Threading Usage in Elsa Core:

Elsa uses Medallion.Threading abstractions to remain agnostic to the lock provider. The IDistributedLockProvider interface is implemented by all Medallion providers:

RedisDistributedSynchronizationProvider
PostgresDistributedSynchronizationProvider
SqlDistributedSynchronizationProvider
AzureDistributedSynchronizationProvider

To use a different provider, simply register it as shown above. Elsa's WorkflowResumer will automatically use the registered provider.

Configuring Quartz Clustering

Example quartz.properties:

# Cluster instance configuration
quartz.scheduler.instanceName = ElsaQuartzCluster
quartz.scheduler.instanceId = AUTO

# Thread pool
quartz.threadPool.type = Quartz.Simpl.SimpleThreadPool, Quartz
quartz.threadPool.threadCount = 10

# Persistent job store with clustering
quartz.jobStore.type = Quartz.Impl.AdoJobStore.JobStoreTX, Quartz
quartz.jobStore.dataSource = default
quartz.jobStore.tablePrefix = qrtz_
quartz.jobStore.driverDelegateType = Quartz.Impl.AdoJobStore.PostgreSQLDelegate, Quartz

# Enable clustering
quartz.jobStore.clustered = true
quartz.jobStore.clusterCheckinInterval = 20000
quartz.jobStore.clusterCheckinMisfireThreshold = 60000

# PostgreSQL data source
quartz.dataSource.default.provider = Npgsql
quartz.dataSource.default.connectionString = Server=localhost;Database=elsa;User Id=elsa;Password=YOUR_PASSWORD

# Serialization
quartz.serializer.type = json

Configuration in Program.cs:

using Elsa.Extensions;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddElsa(elsa =>
{
    // Enable distributed runtime
    elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime());
    
    // Enable Quartz scheduling
    elsa.UseScheduling(scheduling => scheduling.UseQuartzScheduler());
    
    // Configure Quartz with PostgreSQL and clustering
    elsa.UseQuartz(quartz =>
    {
        // This automatically enables clustering
        quartz.UsePostgreSql(builder.Configuration.GetConnectionString("PostgreSql"));
    });
});

// Configure Quartz service
builder.Services.AddQuartzHostedService(options =>
{
    options.WaitForJobsToComplete = true;
});

var app = builder.Build();
app.Run();

Environment Variables:

QUARTZ__CLUSTERED=true
QUARTZ__INSTANCENAME=ElsaQuartzCluster
QUARTZ__SCHEDULER_INSTANCEID=AUTO
CONNECTIONSTRINGS__POSTGRESQL="Server=postgres-host;Database=elsa;User Id=elsa;Password=YOUR_PASSWORD"

See: examples/quartz-cluster-config.md for detailed configuration options.

Kubernetes Configuration

Minimal Deployment Snippet:

See examples/k8s-deployment.yaml for a complete example with:

Deployment with 3 replicas
Service (ClusterIP, no session affinity)
HorizontalPodAutoscaler
Pod Disruption Budget
Health probes (liveness, readiness, startup)

Key Points:

No Sticky Sessions Required: Elsa's distributed runtime manages state externally, so requests can be routed to any node
Readiness Probes: Use /health/ready endpoint to ensure pods are ready before receiving traffic
Liveness Probes: Use /health/live endpoint to restart unhealthy pods
Anti-Affinity: Spread pods across nodes for high availability
Resource Limits: Set appropriate CPU/memory limits to prevent resource contention

Ingress Configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: elsa-ingress
  namespace: elsa-workflows
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    # No session affinity required
    nginx.ingress.kubernetes.io/affinity: "none"
spec:
  ingressClassName: nginx
  rules:
    - host: elsa.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: elsa-server
                port:
                  number: 80

Helm Values:

See examples/helm-values.yaml for an annotated Helm values file with:

Multiple replicas configuration
Database, Redis, and RabbitMQ settings
Distributed runtime and locking configuration
Quartz clustering settings
HPA and resource limits
Health probe configurations

Docker Compose Development Example

For local development and testing clustering behavior:

version: '3.8'

services:
  elsa-node1:
    image: elsaworkflows/elsa-server-v3:latest
    environment:
      - ASPNETCORE_ENVIRONMENT=Development
      - CONNECTIONSTRINGS__POSTGRESQL=Server=postgres;Database=elsa;User Id=elsa;Password=elsa
      - REDIS__CONNECTIONSTRING=redis:6379
      - RABBITMQ__CONNECTIONSTRING=amqp://guest:guest@rabbitmq:5672/
      - ELSA__RUNTIME__TYPE=Distributed
      - QUARTZ__CLUSTERED=true
    ports:
      - "5001:8080"
    depends_on:
      - postgres
      - redis
      - rabbitmq

  elsa-node2:
    image: elsaworkflows/elsa-server-v3:latest
    environment:
      - ASPNETCORE_ENVIRONMENT=Development
      - CONNECTIONSTRINGS__POSTGRESQL=Server=postgres;Database=elsa;User Id=elsa;Password=elsa
      - REDIS__CONNECTIONSTRING=redis:6379
      - RABBITMQ__CONNECTIONSTRING=amqp://guest:guest@rabbitmq:5672/
      - ELSA__RUNTIME__TYPE=Distributed
      - QUARTZ__CLUSTERED=true
    ports:
      - "5002:8080"
    depends_on:
      - postgres
      - redis
      - rabbitmq

  elsa-node3:
    image: elsaworkflows/elsa-server-v3:latest
    environment:
      - ASPNETCORE_ENVIRONMENT=Development
      - CONNECTIONSTRINGS__POSTGRESQL=Server=postgres;Database=elsa;User Id=elsa;Password=elsa
      - REDIS__CONNECTIONSTRING=redis:6379
      - RABBITMQ__CONNECTIONSTRING=amqp://guest:guest@rabbitmq:5672/
      - ELSA__RUNTIME__TYPE=Distributed
      - QUARTZ__CLUSTERED=true
    ports:
      - "5003:8080"
    depends_on:
      - postgres
      - redis
      - rabbitmq

  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=elsa
      - POSTGRES_USER=elsa
      - POSTGRES_PASSWORD=elsa
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes

  rabbitmq:
    image: rabbitmq:3-management-alpine
    ports:
      - "5672:5672"
      - "15672:15672"  # Management UI

volumes:
  postgres-data:

Testing Commands:

# Start cluster
docker-compose up -d

# Check all nodes are healthy
curl http://localhost:5001/health/ready
curl http://localhost:5002/health/ready
curl http://localhost:5003/health/ready

# Create a workflow with a timer
curl -X POST http://localhost:5001/elsa/api/workflow-instances \
  -H "Content-Type: application/json" \
  -d '{"definitionId": "TimerTest"}'

# Check Quartz cluster state
docker-compose exec postgres psql -U elsa -c "SELECT * FROM qrtz_scheduler_state;"

# Check logs for distributed lock activity
docker-compose logs -f elsa-node1 | grep -i "lock"

Operational Topics

Metrics to Monitor

Workflow Execution Metrics:

Active workflow instances
Workflows completed per minute
Workflow execution failures
Average workflow execution time

Distributed Locking Metrics:

Lock acquisition time (P50, P95, P99)
Lock acquisition failures
Lock hold duration
Lock contention rate

Quartz Scheduling Metrics:

Scheduled jobs count
Job execution rate
Job misfires
Scheduler heartbeat intervals

System Metrics:

CPU usage per pod
Memory usage per pod
Database connection pool utilization
Redis connection pool utilization

Database Metrics:

Query execution time
Connection pool exhaustion
Deadlocks
Lock wait time

Log Levels and Structured Logging

Recommended Log Levels:

Production:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning",
      "Microsoft.Hosting.Lifetime": "Information",
      "Elsa": "Information",
      "Elsa.Workflows.Runtime": "Information",
      "Quartz": "Warning"
    }
  }
}

Debugging Clustering Issues:

{
  "Logging": {
    "LogLevel": {
      "Elsa.Workflows.Runtime.Services.WorkflowResumer": "Debug",
      "Elsa.Scheduling": "Debug",
      "Quartz": "Debug"
    }
  }
}

Key Log Messages to Watch:

Successful Lock Acquisition:

[INF] Acquired distributed lock for workflow instance {WorkflowInstanceId}

Lock Acquisition Failure (expected in clusters):

[INF] Lock acquisition failed; resume already in progress for workflow {WorkflowInstanceId}

Quartz Job Execution:

[INF] Quartz job executed: ResumeWorkflowTask for bookmark {BookmarkId}

Cache Invalidation:

[INF] Received cache invalidation event for workflow definition {DefinitionId}

Troubleshooting Common Issues

Issue: Duplicate Workflow Executions

Symptoms:

Timers firing multiple times
Duplicate notifications or side effects
Multiple log entries for the same workflow execution

Diagnosis:

# Check if distributed runtime is enabled
kubectl logs <pod-name> | grep "UseDistributedRuntime"

# Check Quartz cluster state
kubectl exec -it postgres-pod -- psql -U elsa -c "SELECT * FROM qrtz_scheduler_state;"

# Check distributed lock acquisitions
kubectl logs <pod-name> | grep -i "lock acquisition"

Solutions:

Verify runtime.UseDistributedRuntime() is called in configuration
Ensure Quartz clustering is enabled (quartz.jobStore.clustered = true)
Check distributed lock provider is registered and accessible
Verify all nodes use the same database and Redis instance

Issue: Bookmark Not Found

Symptoms:

Scheduled tasks fail with "Bookmark not found" error
Workflows not resuming at expected time

Diagnosis:

-- Check bookmarks table
SELECT * FROM elsa.bookmarks WHERE workflow_instance_id = '<instance-id>';

-- Check Quartz scheduled jobs
SELECT * FROM qrtz_triggers WHERE trigger_group = 'Elsa.Scheduling';

Solutions:

Check database connectivity from all nodes
Verify clock synchronization across nodes (NTP)
Ensure time zones are configured consistently
Check for database replication lag (if using replicas)

Issue: Lock Acquisition Timeouts

Symptoms:

Workflows stuck in "Suspended" state
Logs show "Failed to acquire lock after timeout"

Diagnosis:

# Check Redis connectivity
redis-cli -h redis-host PING

# Check for stuck locks in Redis
redis-cli --scan --pattern "workflow:*" | wc -l

# Check PostgreSQL locks
SELECT * FROM pg_locks WHERE locktype = 'advisory';

Solutions:

Increase lock acquisition timeout
Check Redis/database connectivity and latency
Verify lock expiration is configured to prevent stuck locks

Clear stale locks (use with caution):

# Redis
redis-cli --scan --pattern "workflow:*" | xargs redis-cli DEL

# PostgreSQL (Medallion.Threading creates advisory locks, they auto-release)

Issue: Cache Inconsistencies

Symptoms:

Nodes using different workflow definitions
Changes not reflecting immediately on all nodes

Diagnosis:

# Check MassTransit/RabbitMQ connectivity
kubectl logs <pod-name> | grep -i "masstransit"

# Check RabbitMQ queues
kubectl exec -it rabbitmq-pod -- rabbitmqctl list_queues

Solutions:

Verify elsa.UseDistributedCache(dc => dc.UseMassTransit()) is configured
Check RabbitMQ connectivity from all nodes
Verify message broker is operational
Restart all pods to force cache refresh

Retention and Cleanup

Workflow Instance Cleanup:

Old completed or faulted workflow instances should be cleaned up periodically:

// Configure retention policy in Elsa
builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowManagement(management =>
    {
        management.UseWorkflowInstanceRetention(retention =>
        {
            retention.RetentionPeriod = TimeSpan.FromDays(30);
            retention.SweepInterval = TimeSpan.FromHours(1);
        });
    });
});

Manual Cleanup (SQL):

-- Delete completed workflows older than 30 days
DELETE FROM elsa.workflow_instances
WHERE status = 'Completed'
  AND finished_at < NOW() - INTERVAL '30 days';

-- Delete faulted workflows older than 90 days
DELETE FROM elsa.workflow_instances
WHERE status = 'Faulted'
  AND finished_at < NOW() - INTERVAL '90 days';

Quartz Cleanup:

Quartz automatically cleans up completed jobs, but you may want to clean up old execution history:

-- Delete old fired triggers (already executed)
DELETE FROM qrtz_fired_triggers
WHERE fired_time < EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000;

Bookmark Cleanup:

Orphaned bookmarks (no associated workflow instance) should be cleaned up:

-- Find orphaned bookmarks
SELECT b.* FROM elsa.bookmarks b
LEFT JOIN elsa.workflow_instances wi ON b.workflow_instance_id = wi.id
WHERE wi.id IS NULL;

-- Delete orphaned bookmarks
DELETE FROM elsa.bookmarks
WHERE workflow_instance_id NOT IN (SELECT id FROM elsa.workflow_instances);

Validation Checklist for Cluster Behavior

Use this checklist to validate your clustering setup in a test environment:

1. Distributed Runtime Validation

Deploy at least 3 Elsa nodes
Create an HTTP workflow with a suspend/resume pattern
Send simultaneous requests to resume the same workflow instance via different nodes
Verify in logs that only one node acquires the lock and processes the resume
Verify the workflow completes successfully without state corruption

Test Command:

# Send simultaneous requests to different nodes
for i in {1..10}; do
  curl -X POST http://node1/elsa/api/workflow-instances/<id>/resume &
  curl -X POST http://node2/elsa/api/workflow-instances/<id>/resume &
  curl -X POST http://node3/elsa/api/workflow-instances/<id>/resume &
done
wait

# Check logs for lock acquisitions
kubectl logs -l app=elsa-server | grep "Acquired distributed lock"

2. Scheduled Task Validation

Create a workflow with a timer (e.g., delay 30 seconds)
Start the workflow on Node 1
Monitor all nodes' logs
Verify that exactly one node executes the timer resume
Check Quartz scheduler state in database

Test Command:

# Create workflow with timer
curl -X POST http://node1/elsa/api/workflow-instances \
  -H "Content-Type: application/json" \
  -d '{"definitionId": "TimerWorkflow"}'

# Monitor all nodes
kubectl logs -l app=elsa-server -f | grep "ResumeWorkflowTask"

# Check Quartz state
kubectl exec -it postgres-pod -- psql -U elsa -c "SELECT * FROM qrtz_scheduler_state;"

3. Cache Invalidation Validation

Connect to Node 1's Elsa Studio
Update a workflow definition
Immediately create a new workflow instance on Node 2
Verify Node 2 uses the updated definition (not cached old version)
Check logs for cache invalidation events on all nodes

Test Command:

# Update workflow via Node 1
curl -X PUT http://node1/elsa/api/workflow-definitions/<id> \
  -H "Content-Type: application/json" \
  -d '{...updated definition...}'

# Create instance on Node 2
curl -X POST http://node2/elsa/api/workflow-instances \
  -H "Content-Type: application/json" \
  -d '{"definitionId": "<id>"}'

# Check cache invalidation logs
kubectl logs -l app=elsa-server | grep "cache invalidation"

4. Failover Validation

Start a workflow with a timer scheduled 5 minutes in the future
Note which node is scheduled to execute it (check Quartz)
Kill that node before the timer fires
Verify another node picks up and executes the scheduled task
Check Quartz for failover recovery in logs

Test Command:

# Start workflow with delayed timer
curl -X POST http://node1/elsa/api/workflow-instances \
  -H "Content-Type: application/json" \
  -d '{"definitionId": "DelayedTimerWorkflow"}'

# Check which node owns the scheduled job
kubectl exec -it postgres-pod -- psql -U elsa -c \
  "SELECT * FROM qrtz_fired_triggers;"

# Kill the owning node
kubectl delete pod elsa-server-<pod-id>

# Wait for timer to fire and verify another node executed it
kubectl logs -l app=elsa-server | grep "ResumeWorkflowTask executed"

5. High Availability Validation

Deploy cluster with 3 nodes
Generate continuous load (workflow executions)
Perform rolling restart of all nodes
Verify zero failed workflow executions during restart
Check that readiness probes prevent traffic to restarting nodes

Test Command:

# Generate load
while true; do
  curl -X POST http://elsa-service/elsa/api/workflow-instances \
    -H "Content-Type: application/json" \
    -d '{"definitionId": "TestWorkflow"}'
  sleep 1
done &

# Perform rolling restart
kubectl rollout restart deployment/elsa-server

# Monitor
kubectl rollout status deployment/elsa-server
kubectl logs -l app=elsa-server --tail=100

6. Distributed Lock Validation

Enable debug logging for Elsa.Workflows.Runtime.Services.WorkflowResumer
Trigger concurrent workflow resumes
Check logs for lock acquisition and release messages
Verify lock acquisition times are reasonable (< 100ms for Redis, < 500ms for DB)
Confirm no deadlocks or stuck locks

Test Command:

# Enable debug logging (update configmap or environment variable)
kubectl set env deployment/elsa-server \
  LOGGING__LOGLEVEL__ELSA_WORKFLOWS_RUNTIME_SERVICES_WORKFLOWRESUMER=Debug

# Trigger concurrent resumes
for i in {1..50}; do
  curl -X POST http://elsa-service/elsa/api/workflow-instances/<id>/resume &
done
wait

# Analyze logs
kubectl logs -l app=elsa-server | grep -E "(Acquired|Released) distributed lock"

Security and Networking

Database Access Security

Recommendations:

Use TLS/SSL for database connections:

Server=postgres-host;Database=elsa;User Id=elsa;Password=...;SSLMode=Require

Restrict database access to Elsa nodes only:
- Use Kubernetes Network Policies
- Configure database firewall rules (cloud-managed databases)

Use dedicated database users with minimal permissions:

CREATE USER elsa WITH PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE elsa TO elsa;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO elsa;

Rotate credentials regularly:
- Use external secret management (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault)
- Implement automated rotation policies

Network Latency Considerations

Impact on Clustering:

Distributed lock acquisition time increases with latency
Quartz cluster check-ins may timeout with high latency
Cache invalidation events delayed

Recommendations:

Co-locate Elsa nodes and dependencies in the same region/AZ
Monitor network latency between Elsa nodes and:
- Database (< 10ms recommended)
- Redis (< 5ms recommended)
- RabbitMQ (< 10ms recommended)

Adjust timeouts if cross-region deployment is unavoidable:

// Increase lock acquisition timeout for high-latency environments
var lockHandle = await _distributedLockProvider.AcquireLockAsync(
    lockKey,
    timeout: TimeSpan.FromSeconds(60),  // Increased from 30s
    cancellationToken);

Use database connection pooling:

Server=postgres-host;Database=elsa;MaxPoolSize=100;Connection Idle Lifetime=300

Time Zone Considerations for Timers

Issue: Scheduled workflows may execute at incorrect times if nodes have different time zones.

Recommendations:

Ensure all nodes use UTC:

# Dockerfile
ENV TZ=UTC
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

Configure time zone in Kubernetes:
```
env:
  - name: TZ
    value: "UTC"
```
Store all timestamps in UTC in the database
Convert to user's local time in the UI/API layer

Tokenized Resume URL Security

Code Reference: src/modules/Elsa.Http/Extensions/BookmarkExecutionContextExtensions.cs - GenerateBookmarkTriggerUrl

When workflows are suspended waiting for HTTP requests, Elsa can generate tokenized URLs that allow external systems to resume the workflow:

// Example: HTTP endpoint activity generates a resume URL
var resumeUrl = context.GenerateBookmarkTriggerUrl();
// Result: https://elsa.example.com/workflows/resume/{token}

Security Considerations:

Tokens are opaque and unguessable:
- Generated using cryptographically secure random number generator
- Typically 32-64 characters long
Tokens should be single-use:
- Elsa automatically invalidates tokens after workflow resumes
- Replay attacks prevented
Use HTTPS for resume URLs:
- Never send tokens over unencrypted HTTP
- Configure TLS/SSL on ingress controller
Token expiration:
- Configure bookmark expiration to automatically clean up old tokens
- Expired bookmarks cannot be used to resume workflows
Audit logging:
- Log all resume attempts (successful and failed)
- Monitor for unusual patterns (repeated resume attempts, token scanning)

Example Configuration:

builder.Services.AddElsa(elsa =>
{
    elsa.UseHttp(http =>
    {
        // Require HTTPS for all HTTP workflows
        http.RequireHttps = true;
        
        // Base URL for generated resume URLs
        http.BaseUrl = new Uri("https://elsa.example.com");
    });
});

Studio-Specific Notes

Embedding Studio Behind Ingress

When deploying Elsa Studio (the visual workflow designer) in a clustered environment:

Deployment Pattern:

                    ┌─────────────────┐
                    │  Ingress/LB     │
                    └────┬───────┬────┘
                         │       │
             ┌───────────┘       └───────────┐
             ↓                               ↓
    ┌─────────────────┐           ┌─────────────────┐
    │  Elsa Server    │           │  Elsa Studio    │
    │  (API)          │           │  (UI)           │
    │  Replicas: 3+   │←──────────│  Replicas: 2+   │
    └─────────────────┘           └─────────────────┘

Configuration Example:

# Ingress routing
spec:
  rules:
    - host: elsa.example.com
      http:
        paths:
          # Studio UI
          - path: /
            pathType: Prefix
            backend:
              service:
                name: elsa-studio
                port:
                  number: 80
          # API
          - path: /elsa/api
            pathType: Prefix
            backend:
              service:
                name: elsa-server
                port:
                  number: 80

Session Affinity for Studio UI

Do you need sticky sessions for Studio?

Short answer: No - If Studio is a stateless SPA (Single Page Application) that only communicates with the API.

Long answer:

Elsa Studio (Blazor WebAssembly) is stateless and doesn't require session affinity
All state is managed by the Elsa Server API (which uses distributed state)
Studio can be freely routed to any pod

Exception: If using Elsa Studio Blazor Server (not WebAssembly), you do need session affinity:

annotations:
  nginx.ingress.kubernetes.io/affinity: "cookie"
  nginx.ingress.kubernetes.io/session-cookie-name: "elsa-studio-session"
  nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"

Recommendation: Use Elsa Studio WebAssembly for clustered deployments to avoid session affinity complexity.

Ingress Settings

Recommended Annotations (NGINX Ingress):

metadata:
  annotations:
    # SSL redirect
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    
    # Body size (for uploading large workflow definitions)
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    
    # Timeouts (for long-running workflow executions)
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    
    # CORS (if Studio and API on different domains)
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://studio.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    
    # No session affinity (not needed for Elsa Server or Studio WebAssembly)
    nginx.ingress.kubernetes.io/affinity: "none"
    
    # Rate limiting (optional, for API protection)
    nginx.ingress.kubernetes.io/limit-rps: "100"

Studio Authentication in Clusters

Scenario: Multiple Studio pods behind a load balancer.

Requirements:

Shared authentication provider (don't use in-memory auth)
Distributed session storage (if using cookie-based auth)
Token-based authentication (recommended for stateless clusters)

Example: JWT Bearer Token Authentication:

// In Elsa Server (API)
builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowsApi(api =>
    {
        api.AddJwtAuthentication(jwt =>
        {
            jwt.Authority = "https://identity.example.com";
            jwt.Audience = "elsa-api";
        });
    });
});

// In Elsa Studio
builder.Services.AddElsaStudio(studio =>
{
    studio.ServerUrl = "https://api.example.com/elsa/api";
    studio.Authentication = new BearerTokenAuthentication("your-jwt-token");
});

Alternative: OpenID Connect with External Provider:

Azure AD
Auth0
Keycloak
IdentityServer

This ensures authentication state is managed externally and works seamlessly across all cluster nodes.

Placeholders for Screenshots

[Screenshot: Elsa Studio showing workflow definitions synchronized across nodes]

[Screenshot: Kubernetes dashboard displaying 3 healthy Elsa pods with auto-scaling enabled]

[Screenshot: Grafana dashboard showing distributed lock acquisition metrics and Quartz job execution rates]

[Screenshot: Logs demonstrating only one node executing a scheduled timer across a 3-node cluster]

[Screenshot: Redis Commander showing distributed lock keys with TTL]

[Screenshot: PostgreSQL query result showing Quartz cluster state with multiple scheduler instances]

Distributed Hosting - Core distributed runtime concepts
Kubernetes Deployment Guide - General Kubernetes deployment
Database Configuration - Database setup
Authentication Guide - Securing your deployment

Example Code Repository

For complete, deployable examples:

elsa-samples - Official sample projects
elsa-guides - Step-by-step guide implementations

References

Elsa Core Source Code: https://github.com/elsa-workflows/elsa-core
Medallion.Threading: https://github.com/madelson/DistributedLock
Quartz.NET: https://www.quartz-scheduler.net/
MassTransit: https://masstransit.io/

Support

GitHub Discussions
GitHub Issues
Community: Discord, Slack (see main README for links)

Last Updated: 2025-11-24

Acceptance Criteria Checklist (DOC-015):

✅ Explains clustering concepts and common pitfalls with practical mitigation steps
✅ Contains working, copy-pasteable configuration snippets for Redis and SQL-based lock options
✅ Includes Quartz clustering configuration snippet
✅ Provides Kubernetes deployment examples (k8s-deployment.yaml)
✅ References elsa-core source files:
- WorkflowResumer (distributed locking)
- DefaultBookmarkScheduler (enqueuing scheduled bookmarks)
- ResumeWorkflowTask (Quartz job execution)
- ActivityExecutionContext.CreateBookmark (bookmark hashing)
- BookmarkExecutionContextExtensions.GenerateBookmarkTriggerUrl (tokenized URLs)
✅ Adds SUMMARY.md entry (see next step)
✅ Includes operational guidance, troubleshooting, validation checklist, and security notes

PreviousKubernetes Deployment NextHTTP Workflows

Last updated 8 minutes ago

Executive Summary

Why Clustering Matters

Conceptual Overview

Understanding Corruption and Duplication Risks

Problem 1: Duplicate Timer Execution

Problem 2: Concurrent Workflow Modification

Problem 3: Cache Invalidation

How Elsa Mitigates These Risks

1. Bookmark Hashing

2. Distributed Locking

3. Centralized Scheduler (Quartz.NET Clustering)

4. Distributed Cache Invalidation

Architecture Patterns and Deployment Models

Pattern 1: Shared Database + Distributed Locks

Pattern 2: Leader-Election Scheduler

Pattern 3: Quartz Clustering (All-Nodes-Participate)

Pattern 4: External Scheduler

Practical Configuration

Configuring Distributed Locks

Option 1: Redis-Based Locking (Recommended for Performance)

Option 2: PostgreSQL-Based Locking (No Additional Infrastructure)

Configuring Quartz Clustering

Kubernetes Configuration

Docker Compose Development Example

Operational Topics

Metrics to Monitor

Log Levels and Structured Logging

Troubleshooting Common Issues

Issue: Duplicate Workflow Executions

Issue: Bookmark Not Found

Issue: Lock Acquisition Timeouts

Issue: Cache Inconsistencies

Retention and Cleanup

Validation Checklist for Cluster Behavior

1. Distributed Runtime Validation

2. Scheduled Task Validation

3. Cache Invalidation Validation

4. Failover Validation

5. High Availability Validation

6. Distributed Lock Validation

Security and Networking

Database Access Security

Network Latency Considerations

Time Zone Considerations for Timers

Tokenized Resume URL Security

Studio-Specific Notes

Embedding Studio Behind Ingress

Session Affinity for Studio UI

Ingress Settings

Studio Authentication in Clusters

Placeholders for Screenshots

Related Documentation

Example Code Repository

References

Support