Clustering

Comprehensive guide to running Elsa Workflows in clustered and distributed production environments, covering architecture patterns, distributed locking, scheduling, and operational best practices.

Executive Summary

Running Elsa Workflows in a clustered environment is essential for achieving high availability, scalability, and fault tolerance in production deployments. A clustered setup allows multiple Elsa instances to work together, distributing workload across nodes while maintaining consistency and preventing data corruption.

Why Clustering Matters

Production Requirements Clustering Solves:

  1. High Availability: If one node fails, others continue processing workflows without interruption

  2. Horizontal Scalability: Handle increased load by adding more nodes rather than scaling vertically

  3. Zero-Downtime Deployments: Rolling updates with no service interruption

  4. Geographic Distribution: Deploy nodes across regions for disaster recovery and reduced latency

  5. Load Balancing: Distribute HTTP requests and background jobs across multiple instances

Key Challenges Clustering Addresses:

  • Concurrent Modification: Preventing multiple nodes from modifying the same workflow instance simultaneously

  • Duplicate Scheduling: Ensuring timers and scheduled tasks execute only once

  • Cache Consistency: Keeping in-memory caches synchronized across nodes

  • Race Conditions: Managing concurrent bookmark resume attempts

Without proper clustering configuration, you may encounter:

  • Workflow state corruption from simultaneous updates

  • Duplicate timer executions causing repeated notifications or side effects

  • Cache inconsistencies leading to stale workflow definitions

  • Race conditions when external events trigger workflow resumption

Conceptual Overview

Understanding Corruption and Duplication Risks

Problem 1: Duplicate Timer Execution

Scenario: A workflow with a timer activity (e.g., "Send reminder email in 24 hours") is deployed across 3 nodes.

Without Clustering: each of the 3 nodes schedules the timer independently and fires it when the 24 hours elapse.

Result: Customer receives 3 identical reminder emails instead of 1.

With Clustering (Quartz.NET Clustering): the timer job lives in a shared job store; at fire time the nodes compete for a database lock, and only the winning node executes the job.

Result: Customer receives exactly 1 email as intended.

Problem 2: Concurrent Workflow Modification

Scenario: An HTTP workflow receives two simultaneous requests that both attempt to resume the same workflow instance.

Without Distributed Locking: both nodes load the same workflow instance, execute it concurrently, and overwrite each other's state when persisting.

Result: Workflow execution is corrupted; steps may be skipped or repeated.

With Distributed Locking: the first request acquires a lock on the workflow instance; the second waits until the lock is released and then resumes against the up-to-date state.

Result: Workflow executes correctly without corruption.

Problem 3: Cache Invalidation

Scenario: An administrator updates a workflow definition in Elsa Studio.

Without Distributed Cache Invalidation: only the node that handled the update refreshes its in-memory cache; the other nodes keep serving the old definition.

Result: New workflow instances on Node 2 and 3 use outdated definitions.

With Distributed Cache Invalidation (MassTransit): the updating node publishes a cache invalidation event, and every node evicts the stale definition from its local cache.

Result: All nodes use the updated workflow definition immediately.

How Elsa Mitigates These Risks

Elsa provides four key mechanisms for safe clustering:

1. Bookmark Hashing

Bookmarks (suspension points in workflows) are assigned deterministic hashes based on their properties. When multiple nodes attempt to create the same bookmark, they compute the same hash, so the duplicate is detected and only one bookmark is persisted.

Code Reference: src/modules/Elsa.Workflows.Core/Contexts/ActivityExecutionContext.cs - CreateBookmark method
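The idea can be illustrated with a small sketch. This is not Elsa's actual hashing code; it only shows how a deterministic hash over a bookmark's identifying properties lets every node derive the same value for the same logical bookmark:

```csharp
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Illustrative only: deterministic hash over a bookmark's identifying data.
// Elsa's IBookmarkHasher works on the same principle, but its exact inputs
// and algorithm may differ.
public static class BookmarkHashSketch
{
    public static string Compute(string activityTypeName, object payload)
    {
        // Stable serialization: the same logical payload yields the same JSON,
        // and therefore the same hash, on every node.
        var json = JsonSerializer.Serialize(payload);
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes($"{activityTypeName}:{json}"));
        return Convert.ToHexString(bytes);
    }
}
```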

2. Distributed Locking

The WorkflowResumer service acquires a distributed lock before resuming a workflow instance. This ensures only one node processes a resume request at a time.

Code Reference: src/modules/Elsa.Workflows.Runtime/Services/WorkflowResumer.cs
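The pattern can be sketched as follows, assuming an IDistributedLockProvider (Medallion.Threading) is registered in DI; the lock name format and timeout are illustrative, not Elsa's exact values:

```csharp
using Medallion.Threading;

// Sketch of the lock-then-resume pattern; not the actual WorkflowResumer code.
public class LockedResumerSketch(IDistributedLockProvider lockProvider)
{
    public async Task ResumeAsync(string workflowInstanceId, Func<Task> resume, CancellationToken ct)
    {
        // One lock per workflow instance: only one node may resume it at a time.
        var instanceLock = lockProvider.CreateLock($"workflow-instance:{workflowInstanceId}");

        // Waits (up to the timeout) until no other node holds the lock.
        await using var handle = await instanceLock.AcquireAsync(TimeSpan.FromSeconds(30), ct);

        // Load, resume, and persist the workflow instance while holding the lock.
        await resume();
    }
}
```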

Lock Providers Supported:

  • Redis: Fast, in-memory locking via Medallion.Threading.Redis

  • PostgreSQL: Database-backed locking via Medallion.Threading.Postgres

  • SQL Server: Database-backed locking via Medallion.Threading.SqlServer

  • Azure Blob Storage: Cloud-native locking via Medallion.Threading.Azure

3. Centralized Scheduler (Quartz.NET Clustering)

Quartz.NET clustering ensures scheduled jobs (timers, delays, cron triggers) execute only once across the cluster.

Code References:

  • src/modules/Elsa.Scheduling/Services/DefaultBookmarkScheduler.cs - Enqueues bookmark resume tasks

  • src/modules/Elsa.Scheduling/Tasks/ResumeWorkflowTask.cs - Quartz job that resumes workflows

How It Works:

  1. DefaultBookmarkScheduler creates a Quartz job for each scheduled bookmark

  2. Quartz stores job metadata in a shared database

  3. At execution time, nodes compete for a database lock

  4. The node that acquires the lock executes the job; others skip it

  5. Failed nodes' jobs are recovered by surviving nodes (failover)

4. Distributed Cache Invalidation

When workflow definitions or other cached data changes, MassTransit publishes cache invalidation events to all nodes via a message broker (RabbitMQ, Azure Service Bus, etc.).

Message Flow: the node that applies the change publishes an invalidation message to the broker; every node (including the publisher) consumes it and evicts the affected entries from its local cache.
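On the consuming side, each node evicts the affected entries from its own cache. A minimal sketch, assuming a hypothetical WorkflowDefinitionsUpdated message and a plain IMemoryCache (Elsa's actual message and cache types differ):

```csharp
using MassTransit;
using Microsoft.Extensions.Caching.Memory;

// Hypothetical event published when a workflow definition changes.
public record WorkflowDefinitionsUpdated(string DefinitionId);

// Runs on every node; removes the stale entry from that node's local cache.
public class EvictDefinitionCacheConsumer(IMemoryCache cache) :
    IConsumer<WorkflowDefinitionsUpdated>
{
    public Task Consume(ConsumeContext<WorkflowDefinitionsUpdated> context)
    {
        cache.Remove($"workflow-definition:{context.Message.DefinitionId}");
        return Task.CompletedTask;
    }
}
```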

Architecture Patterns and Deployment Models

Pattern 1: Shared Database + Distributed Locks

Best for: Most production scenarios with moderate to high traffic

Configuration:

  • All nodes connect to the same database

  • Distributed runtime enabled: runtime.UseDistributedRuntime()

  • Distributed locks via Redis or PostgreSQL

  • Quartz clustering enabled for scheduled tasks

  • MassTransit for cache invalidation
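Wired together in Program.cs, this pattern might look like the sketch below. Only the calls referenced in this guide are shown; persistence and MassTransit transport configuration are omitted, connection names are placeholders, and Quartz clustering is configured separately (see Configuring Quartz Clustering below):

```csharp
using Elsa.Extensions;
using Medallion.Threading;
using Medallion.Threading.Redis;
using StackExchange.Redis;

var builder = WebApplication.CreateBuilder(args);

// Every node points at the same database, Redis instance, and message broker.
var redis = await ConnectionMultiplexer.ConnectAsync(
    builder.Configuration.GetConnectionString("Redis")!);

// Distributed locks via Medallion.Threading; Elsa resolves the provider from DI.
builder.Services.AddSingleton<IDistributedLockProvider>(
    new RedisDistributedSynchronizationProvider(redis.GetDatabase()));

builder.Services.AddElsa(elsa =>
{
    // Shared persistence is configured here as well (provider-specific, omitted).
    elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime());

    // Cache invalidation events are published to all nodes via MassTransit.
    elsa.UseDistributedCache(dc => dc.UseMassTransit());
});

var app = builder.Build();
app.Run();
```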

Pros:

  • Simple architecture

  • Easy to scale horizontally

  • No single point of failure (stateless nodes)

Cons:

  • Database becomes a bottleneck at extreme scale

  • Requires careful database tuning

Pattern 2: Leader-Election Scheduler

Best for: Environments where you want precise control over scheduling overhead

Configuration:

Scheduler Node: a single dedicated node registers the scheduling module and is responsible for firing timers, delays, and cron triggers.

Worker Nodes: the remaining nodes handle HTTP requests and workflow execution but do not register the scheduling module, so they never compete for scheduled work.
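Building on the Pattern 1 sketch, the role split might be expressed with a configuration flag. The NODE_ROLE setting and the elsa.UseScheduling() call are assumptions to adapt to your Elsa version:

```csharp
var builder = WebApplication.CreateBuilder(args);

// "scheduler" on the dedicated scheduling node, "worker" everywhere else.
var isScheduler = builder.Configuration["NODE_ROLE"] == "scheduler";

builder.Services.AddElsa(elsa =>
{
    elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime());
    elsa.UseDistributedCache(dc => dc.UseMassTransit());

    if (isScheduler)
    {
        // Only the scheduler node registers the scheduling module and fires
        // timers, delays, and cron triggers.
        elsa.UseScheduling();
    }
});
```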

Pros:

  • Centralized scheduling (easier to monitor)

  • Workers focused on request handling

  • Lower resource usage on workers

Cons:

  • Scheduler is a single point of failure (mitigate with active-standby setup)

  • More complex deployment configuration

Pattern 3: Quartz Clustering (All-Nodes-Participate)

Best for: Simplicity and automatic failover

Configuration: identical on every node. Each node registers the scheduling module and runs Quartz against a shared, clustered job store (quartz.jobStore.clustered = true), while the distributed runtime, lock provider, and cache invalidation settings from Pattern 1 remain unchanged. See Configuring Quartz Clustering below for a concrete example.

Pros:

  • Simple configuration (same for all nodes)

  • Automatic failover (if Node 1 crashes, Node 2 picks up its jobs)

  • No single point of failure

Cons:

  • Every node runs Quartz scheduler (slightly higher resource usage)

  • More database queries for cluster coordination

Recommendation: Use this pattern unless you have specific reasons to use leader-election.

Pattern 4: External Scheduler

Best for: Multi-tenant environments or complex orchestration needs

Example: Kubernetes CronJob
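A sketch of such a CronJob, assuming the workflow is started through Elsa's REST API; the endpoint path, image, and secret name are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report-workflow
spec:
  schedule: "0 2 * * *"        # every night at 02:00
  concurrencyPolicy: Forbid    # never run two triggers concurrently
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.8.0
              env:
                - name: ELSA_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: elsa-api-credentials
                      key: api-key
              args:
                - -sS
                - -X
                - POST
                - -H
                - "Authorization: Bearer $(ELSA_API_KEY)"
                - http://elsa-server/elsa/api/workflow-definitions/nightly-report/execute
```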

Pros:

  • No Quartz dependency

  • Leverage platform-native scheduling (Kubernetes, cloud functions)

  • Easier multi-cloud deployments

Cons:

  • External system must remain operational

  • More complex integration (API authentication, error handling)

  • No built-in bookmark scheduling (must implement externally)

Practical Configuration

Configuring Distributed Locks

Option 1: Redis-Based Locking

Prerequisites:

  • Redis 6.0+ deployed and accessible

  • NuGet package: Medallion.Threading.Redis

Configuration Example:
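A minimal sketch of the registration, assuming Elsa picks up the IDistributedLockProvider from DI as described in this guide; the connection name is a placeholder:

```csharp
using Medallion.Threading;
using Medallion.Threading.Redis;
using StackExchange.Redis;

var builder = WebApplication.CreateBuilder(args);

// Connect once and reuse the multiplexer for the lifetime of the process.
var redis = await ConnectionMultiplexer.ConnectAsync(
    builder.Configuration.GetConnectionString("Redis")!);

// Redis-backed distributed locks via Medallion.Threading.
builder.Services.AddSingleton<IDistributedLockProvider>(
    new RedisDistributedSynchronizationProvider(redis.GetDatabase()));
```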

Connection String Example:
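A typical StackExchange.Redis connection string (host name is illustrative):

```
redis-master.default.svc.cluster.local:6379,abortConnect=false,connectTimeout=5000
```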

See: examples/redis-lock-setup.md for detailed configuration and troubleshooting.

Option 2: PostgreSQL-Based Locking (No Additional Infrastructure)

Prerequisites:

  • PostgreSQL 12+ (same database as Elsa workflow storage)

  • NuGet package: Medallion.Threading.Postgres

Configuration Example:
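A minimal sketch, reusing the same PostgreSQL database that stores Elsa's workflow data; the connection name is a placeholder:

```csharp
using Medallion.Threading;
using Medallion.Threading.Postgres;

var builder = WebApplication.CreateBuilder(args);

var connectionString = builder.Configuration.GetConnectionString("Default")!;

// PostgreSQL-backed distributed locks (advisory locks) via Medallion.Threading.
builder.Services.AddSingleton<IDistributedLockProvider>(
    new PostgresDistributedSynchronizationProvider(connectionString));
```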

Connection String Example:
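A typical Npgsql connection string (values are placeholders):

```
Host=postgres;Port=5432;Database=elsa;Username=elsa;Password=<secret>;Maximum Pool Size=100
```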

Pros vs Redis:

  • ✅ No additional infrastructure required

  • ✅ Uses existing database connection

  • ❌ Slower lock acquisition (disk I/O vs in-memory)

  • ❌ Adds load to database

Medallion.Threading Usage in Elsa Core:

Elsa uses Medallion.Threading abstractions to remain agnostic to the lock provider. The IDistributedLockProvider interface is implemented by all Medallion providers:

  • RedisDistributedSynchronizationProvider

  • PostgresDistributedSynchronizationProvider

  • SqlDistributedSynchronizationProvider

  • AzureDistributedSynchronizationProvider

To use a different provider, simply register it as shown above. Elsa's WorkflowResumer will automatically use the registered provider.

Configuring Quartz Clustering

Example quartz.properties:
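An illustrative quartz.properties for a PostgreSQL-backed, clustered job store; the instance name and connection string are placeholders:

```properties
quartz.scheduler.instanceName = ElsaScheduler
quartz.scheduler.instanceId = AUTO

quartz.jobStore.type = Quartz.Impl.AdoJobStore.JobStoreTX, Quartz
quartz.jobStore.driverDelegateType = Quartz.Impl.AdoJobStore.PostgreSQLDelegate, Quartz
quartz.jobStore.dataSource = default
quartz.jobStore.tablePrefix = QRTZ_
quartz.jobStore.useProperties = true

# Clustering: every node shares the job store and checks in periodically.
quartz.jobStore.clustered = true
quartz.jobStore.clusterCheckinInterval = 15000

quartz.dataSource.default.provider = Npgsql
quartz.dataSource.default.connectionString = Host=postgres;Database=elsa;Username=elsa;Password=<secret>

quartz.serializer.type = json
```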

Configuration in Program.cs:
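The same settings expressed with the Quartz.NET hosting API, as a sketch; how this composes with Elsa's own Quartz integration depends on the Elsa version you use:

```csharp
using Quartz;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddQuartz(quartz =>
{
    quartz.SchedulerName = "ElsaScheduler";

    quartz.UsePersistentStore(store =>
    {
        store.UseProperties = true;
        store.UseClustering();
        store.UsePostgres(builder.Configuration.GetConnectionString("Quartz")!);
        store.UseJsonSerializer(); // requires the Quartz.Serialization.Json package
    });
});

// Keep the scheduler running for the lifetime of the host.
builder.Services.AddQuartzHostedService(options => options.WaitForJobsToComplete = true);
```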

Environment Variables:
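With the sketch above, connection strings can be supplied per environment via standard .NET configuration environment variables (names follow from the configuration keys used above):

```
ConnectionStrings__Quartz=Host=postgres;Database=elsa;Username=elsa;Password=<secret>
ConnectionStrings__Redis=redis-master:6379,abortConnect=false
```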

See: examples/quartz-cluster-config.md for detailed configuration options.

Kubernetes Configuration

Minimal Deployment Snippet:
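A trimmed-down sketch of such a Deployment; the image, secret name, and port are illustrative, while the probe paths match the health endpoints described below:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elsa-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: elsa-server
  template:
    metadata:
      labels:
        app: elsa-server
    spec:
      containers:
        - name: elsa-server
          image: registry.example.com/elsa-server:latest   # your Elsa server image
          ports:
            - containerPort: 8080
          envFrom:
            - secretRef:
                name: elsa-connection-strings               # DB, Redis, RabbitMQ
          readinessProbe:
            httpGet: { path: /health/ready, port: 8080 }
          livenessProbe:
            httpGet: { path: /health/live, port: 8080 }
          resources:
            requests: { cpu: 250m, memory: 512Mi }
            limits: { cpu: "1", memory: 1Gi }
```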

See examples/k8s-deployment.yaml for a complete example with:

  • Deployment with 3 replicas

  • Service (ClusterIP, no session affinity)

  • HorizontalPodAutoscaler

  • Pod Disruption Budget

  • Health probes (liveness, readiness, startup)

Key Points:

  • No Sticky Sessions Required: Elsa's distributed runtime manages state externally, so requests can be routed to any node

  • Readiness Probes: Use /health/ready endpoint to ensure pods are ready before receiving traffic

  • Liveness Probes: Use /health/live endpoint to restart unhealthy pods

  • Anti-Affinity: Spread pods across nodes for high availability

  • Resource Limits: Set appropriate CPU/memory limits to prevent resource contention

Ingress Configuration:

Helm Values:

See examples/helm-values.yaml for an annotated Helm values file with:

  • Multiple replicas configuration

  • Database, Redis, and RabbitMQ settings

  • Distributed runtime and locking configuration

  • Quartz clustering settings

  • HPA and resource limits

  • Health probe configurations

Docker Compose Development Example

For local development and for testing clustering behavior, run multiple Elsa nodes against shared infrastructure with Docker Compose:
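A sketch of such a compose file with two Elsa nodes sharing PostgreSQL, Redis, and RabbitMQ; the Elsa image and environment variable names are placeholders for your own build:

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: elsa
      POSTGRES_USER: elsa
      POSTGRES_PASSWORD: elsa

  redis:
    image: redis:7

  rabbitmq:
    image: rabbitmq:3-management

  elsa-1: &elsa-node
    image: your-elsa-server-image   # placeholder for your Elsa server image
    depends_on: [postgres, redis, rabbitmq]
    environment:
      ConnectionStrings__Default: Host=postgres;Database=elsa;Username=elsa;Password=elsa
      ConnectionStrings__Redis: redis:6379
      ConnectionStrings__RabbitMq: amqp://guest:guest@rabbitmq:5672
    ports:
      - "5001:8080"

  elsa-2:
    <<: *elsa-node
    ports:
      - "5002:8080"
```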

Testing Commands:
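With the file above, start the stack and watch both nodes (service names follow the sketch):

```
docker compose up -d
docker compose logs -f elsa-1 elsa-2
```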

Operational Topics

Metrics to Monitor

Workflow Execution Metrics:

  • Active workflow instances

  • Workflows completed per minute

  • Workflow execution failures

  • Average workflow execution time

Distributed Locking Metrics:

  • Lock acquisition time (P50, P95, P99)

  • Lock acquisition failures

  • Lock hold duration

  • Lock contention rate

Quartz Scheduling Metrics:

  • Scheduled jobs count

  • Job execution rate

  • Job misfires

  • Scheduler heartbeat intervals

System Metrics:

  • CPU usage per pod

  • Memory usage per pod

  • Database connection pool utilization

  • Redis connection pool utilization

Database Metrics:

  • Query execution time

  • Connection pool exhaustion

  • Deadlocks

  • Lock wait time

Log Levels and Structured Logging

Recommended Log Levels:

Production:
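A reasonable appsettings.json baseline; the category names are the standard .NET logging categories for these libraries:

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning",
      "Elsa": "Information",
      "Quartz": "Warning",
      "MassTransit": "Warning"
    }
  }
}
```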

Debugging Clustering Issues:
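When diagnosing locking, scheduling, or cache issues, temporarily raise the relevant categories:

```json
{
  "Logging": {
    "LogLevel": {
      "Elsa": "Debug",
      "Quartz": "Debug",
      "MassTransit": "Debug"
    }
  }
}
```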

Key Log Messages to Watch:

Successful Lock Acquisition:

Lock Acquisition Failure (expected in clusters):

Quartz Job Execution:

Cache Invalidation:

Troubleshooting Common Issues

Issue: Duplicate Workflow Executions

Symptoms:

  • Timers firing multiple times

  • Duplicate notifications or side effects

  • Multiple log entries for the same workflow execution

Diagnosis:

Solutions:

  1. Verify runtime.UseDistributedRuntime() is called in configuration

  2. Ensure Quartz clustering is enabled (quartz.jobStore.clustered = true)

  3. Check distributed lock provider is registered and accessible

  4. Verify all nodes use the same database and Redis instance

Issue: Bookmark Not Found

Symptoms:

  • Scheduled tasks fail with "Bookmark not found" error

  • Workflows not resuming at expected time

Diagnosis:

Solutions:

  1. Check database connectivity from all nodes

  2. Verify clock synchronization across nodes (NTP)

  3. Ensure time zones are configured consistently

  4. Check for database replication lag (if using replicas)

Issue: Lock Acquisition Timeouts

Symptoms:

  • Workflows stuck in "Suspended" state

  • Logs show "Failed to acquire lock after timeout"

Diagnosis:

Solutions:

  1. Increase lock acquisition timeout

  2. Check Redis/database connectivity and latency

  3. Verify lock expiration is configured to prevent stuck locks

  4. Clear stale locks (use with caution):

Issue: Cache Inconsistencies

Symptoms:

  • Nodes using different workflow definitions

  • Changes not reflecting immediately on all nodes

Diagnosis:

Solutions:

  1. Verify elsa.UseDistributedCache(dc => dc.UseMassTransit()) is configured

  2. Check RabbitMQ connectivity from all nodes

  3. Verify message broker is operational

  4. Restart all pods to force cache refresh

Retention and Cleanup

Workflow Instance Cleanup:

Old completed or faulted workflow instances should be cleaned up periodically:

Manual Cleanup (SQL):
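A sketch of such a cleanup in PostgreSQL. The table and column names below are assumptions based on Elsa's EF Core schema; verify them against your database before running:

```sql
-- Delete finished, faulted, or cancelled instances older than 30 days.
DELETE FROM "WorkflowInstances"
WHERE "SubStatus" IN ('Finished', 'Faulted', 'Cancelled')
  AND "UpdatedAt" < NOW() - INTERVAL '30 days';
```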

Quartz Cleanup:

Quartz automatically cleans up completed jobs, but you may want to clean up old execution history as well.

Bookmark Cleanup:

Orphaned bookmarks (no associated workflow instance) should be cleaned up:
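A sketch in PostgreSQL, again assuming EF Core table names that should be verified first:

```sql
-- Remove bookmarks whose workflow instance no longer exists.
DELETE FROM "Bookmarks" b
WHERE NOT EXISTS (
    SELECT 1
    FROM "WorkflowInstances" wi
    WHERE wi."Id" = b."WorkflowInstanceId"
);
```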

Validation Checklist for Cluster Behavior

Use this checklist to validate your clustering setup in a test environment:

1. Distributed Runtime Validation

Test Command:

2. Scheduled Task Validation

Test Command:

3. Cache Invalidation Validation

Test Command:

4. Failover Validation

Test Command:

5. High Availability Validation

Test Command:

6. Distributed Lock Validation

Test Command:

Security and Networking

Database Access Security

Recommendations:

  1. Use TLS/SSL for database connections:

  2. Restrict database access to Elsa nodes only:

    • Use Kubernetes Network Policies

    • Configure database firewall rules (cloud-managed databases)

  3. Use dedicated database users with minimal permissions:

  4. Rotate credentials regularly:

    • Use external secret management (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault)

    • Implement automated rotation policies

Network Latency Considerations

Impact on Clustering:

  • Distributed lock acquisition time increases with latency

  • Quartz cluster check-ins may timeout with high latency

  • Cache invalidation events delayed

Recommendations:

  1. Co-locate Elsa nodes and dependencies in the same region/AZ

  2. Monitor network latency between Elsa nodes and:

    • Database (< 10ms recommended)

    • Redis (< 5ms recommended)

    • RabbitMQ (< 10ms recommended)

  3. Adjust timeouts if cross-region deployment is unavoidable:

  4. Use database connection pooling:

Time Zone Considerations for Timers

Issue: Scheduled workflows may execute at incorrect times if nodes have different time zones.

Recommendations:

  1. Ensure all nodes use UTC:

  2. Configure time zone in Kubernetes:

  3. Store all timestamps in UTC in the database

  4. Convert to user's local time in the UI/API layer

Tokenized Resume URL Security

Code Reference: src/modules/Elsa.Http/Extensions/BookmarkExecutionContextExtensions.cs - GenerateBookmarkTriggerUrl

When workflows are suspended waiting for HTTP requests, Elsa can generate tokenized URLs that allow external systems to resume the workflow:

Security Considerations:

  1. Tokens are opaque and unguessable:

    • Generated using cryptographically secure random number generator

    • Typically 32-64 characters long

  2. Tokens should be single-use:

    • Elsa automatically invalidates tokens after workflow resumes

    • Replay attacks prevented

  3. Use HTTPS for resume URLs:

    • Never send tokens over unencrypted HTTP

    • Configure TLS/SSL on ingress controller

  4. Token expiration:

    • Configure bookmark expiration to automatically clean up old tokens

    • Expired bookmarks cannot be used to resume workflows

  5. Audit logging:

    • Log all resume attempts (successful and failed)

    • Monitor for unusual patterns (repeated resume attempts, token scanning)

Example Configuration:

Studio-Specific Notes

Embedding Studio Behind Ingress

When deploying Elsa Studio (the visual workflow designer) in a clustered environment:

Deployment Pattern:

Configuration Example:

Session Affinity for Studio UI

Do you need sticky sessions for Studio?

Short answer: No, as long as Studio runs as a stateless SPA (Single Page Application) that only communicates with the API.

Long answer:

  • Elsa Studio (Blazor WebAssembly) is stateless and doesn't require session affinity

  • All state is managed by the Elsa Server API (which uses distributed state)

  • Studio can be freely routed to any pod

Exception: If using Elsa Studio Blazor Server (not WebAssembly), you do need session affinity:
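For example, with the NGINX ingress controller, cookie-based affinity can be enabled through annotations:

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "elsa-studio-affinity"
```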

Recommendation: Use Elsa Studio WebAssembly for clustered deployments to avoid session affinity complexity.

Ingress Settings

Recommended Annotations (NGINX Ingress):
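The following annotations are a common starting point; tune the timeouts and body size to your workloads:

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "16m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
```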

Studio Authentication in Clusters

Scenario: Multiple Studio pods behind a load balancer.

Requirements:

  1. Shared authentication provider (don't use in-memory auth)

  2. Distributed session storage (if using cookie-based auth)

  3. Token-based authentication (recommended for stateless clusters)

Example: JWT Bearer Token Authentication:
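A minimal sketch using the standard ASP.NET Core JWT bearer middleware; the authority and audience are placeholders for your identity provider:

```csharp
using Microsoft.AspNetCore.Authentication.JwtBearer;

var builder = WebApplication.CreateBuilder(args);

// Stateless bearer-token authentication: any node can validate a token
// without shared session state, which is exactly what a cluster needs.
builder.Services
    .AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.Authority = "https://login.example.com"; // your identity provider
        options.Audience = "elsa-api";                   // placeholder audience
    });

builder.Services.AddAuthorization();

var app = builder.Build();
app.UseAuthentication();
app.UseAuthorization();
app.Run();
```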

Alternative: OpenID Connect with External Provider:

  • Azure AD

  • Auth0

  • Keycloak

  • IdentityServer

This ensures authentication state is managed externally and works seamlessly across all cluster nodes.

Placeholders for Screenshots

[Screenshot: Elsa Studio showing workflow definitions synchronized across nodes]

[Screenshot: Kubernetes dashboard displaying 3 healthy Elsa pods with auto-scaling enabled]

[Screenshot: Grafana dashboard showing distributed lock acquisition metrics and Quartz job execution rates]

[Screenshot: Logs demonstrating only one node executing a scheduled timer across a 3-node cluster]

[Screenshot: Redis Commander showing distributed lock keys with TTL]

[Screenshot: PostgreSQL query result showing Quartz cluster state with multiple scheduler instances]

Example Code Repository

For complete, deployable examples, see the files referenced throughout this guide: examples/k8s-deployment.yaml, examples/helm-values.yaml, examples/redis-lock-setup.md, and examples/quartz-cluster-config.md.

References

  • Elsa Core Source Code: https://github.com/elsa-workflows/elsa-core

  • Medallion.Threading: https://github.com/madelson/DistributedLock

  • Quartz.NET: https://www.quartz-scheduler.net/

  • MassTransit: https://masstransit.io/


Last Updated: 2025-11-24
