Clustering
Comprehensive guide to running Elsa Workflows in clustered and distributed production environments, covering architecture patterns, distributed locking, scheduling, and operational best practices.
Executive Summary
Running Elsa Workflows in a clustered environment is essential for achieving high availability, scalability, and fault tolerance in production deployments. A clustered setup allows multiple Elsa instances to work together, distributing workload across nodes while maintaining consistency and preventing data corruption.
Why Clustering Matters
Production Requirements Clustering Solves:
High Availability: If one node fails, others continue processing workflows without interruption
Horizontal Scalability: Handle increased load by adding more nodes rather than scaling vertically
Zero-Downtime Deployments: Rolling updates with no service interruption
Geographic Distribution: Deploy nodes across regions for disaster recovery and reduced latency
Load Balancing: Distribute HTTP requests and background jobs across multiple instances
Key Challenges Clustering Addresses:
Concurrent Modification: Preventing multiple nodes from modifying the same workflow instance simultaneously
Duplicate Scheduling: Ensuring timers and scheduled tasks execute only once
Cache Consistency: Keeping in-memory caches synchronized across nodes
Race Conditions: Managing concurrent bookmark resume attempts
Without proper clustering configuration, you may encounter:
Workflow state corruption from simultaneous updates
Duplicate timer executions causing repeated notifications or side effects
Cache inconsistencies leading to stale workflow definitions
Race conditions when external events trigger workflow resumption
Conceptual Overview
Understanding Corruption and Duplication Risks
Problem 1: Duplicate Timer Execution
Scenario: A workflow with a timer activity (e.g., "Send reminder email in 24 hours") is deployed across 3 nodes.
Without Clustering: each node schedules and fires the timer independently.
Result: The customer receives 3 identical reminder emails instead of 1.
With Clustering (Quartz.NET Clustering): only the node that wins the Quartz database lock fires the job.
Result: The customer receives exactly 1 email as intended.
Problem 2: Concurrent Workflow Modification
Scenario: An HTTP workflow receives two simultaneous requests that both attempt to resume the same workflow instance.
Without Distributed Locking: both requests load the same instance, resume it concurrently, and overwrite each other's state on save.
Result: Workflow execution is corrupted; steps may be skipped or repeated.
With Distributed Locking: the second request waits until the first node releases the instance lock, then resumes against the up-to-date state.
Result: The workflow executes correctly without corruption.
Problem 3: Cache Invalidation
Scenario: An administrator updates a workflow definition in Elsa Studio.
Without Distributed Cache Invalidation: only the node that handled the update refreshes its in-memory cache; the other nodes keep serving the old definition.
Result: New workflow instances on Nodes 2 and 3 use the outdated definition.
With Distributed Cache Invalidation (MassTransit): the updating node publishes an invalidation event and every node evicts the stale cache entry.
Result: All nodes use the updated workflow definition immediately.
How Elsa Mitigates These Risks
Elsa provides four key mechanisms for safe clustering:
1. Bookmark Hashing
Bookmarks (suspension points in workflows) are assigned deterministic hashes based on their properties. When multiple nodes attempt to create the same bookmark, the hash collision is detected, and only one bookmark is persisted.
Code Reference: src/modules/Elsa.Workflows.Core/Contexts/ActivityExecutionContext.cs - CreateBookmark method
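For intuition, deterministic hashing of this kind can be sketched as follows. This is a simplified illustration, not Elsa's actual hashing code, and it assumes the stimulus payload serializes stably:

```csharp
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Simplified illustration (not Elsa's actual algorithm): the same activity type and
// stimulus payload always yield the same hash, so two nodes creating "the same"
// bookmark produce identical keys and the duplicate can be detected on persist.
static string ComputeBookmarkHash(string activityTypeName, object stimulus)
{
    var payloadJson = JsonSerializer.Serialize(stimulus); // assumes stable serialization
    var bytes = Encoding.UTF8.GetBytes($"{activityTypeName}:{payloadJson}");
    return Convert.ToHexString(SHA256.HashData(bytes));
}
```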
2. Distributed Locking
The WorkflowResumer service acquires a distributed lock before resuming a workflow instance. This ensures only one node processes a resume request at a time.
Code Reference: src/modules/Elsa.Workflows.Runtime/Services/WorkflowResumer.cs
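The resume-under-lock pattern looks roughly like the sketch below. It is illustrative only; the lock name format and timeout are assumptions, not the actual WorkflowResumer source:

```csharp
using Medallion.Threading;

public class LockedResumer
{
    private readonly IDistributedLockProvider _locks;

    public LockedResumer(IDistributedLockProvider locks) => _locks = locks;

    public async Task ResumeAsync(string workflowInstanceId, CancellationToken cancellationToken)
    {
        // Only one node can hold the lock for a given instance; others wait or time out.
        await using var handle = await _locks.AcquireLockAsync(
            $"workflow-instance:{workflowInstanceId}", TimeSpan.FromSeconds(30), cancellationToken);

        // ... load the instance, apply the bookmark payload, and execute the workflow ...
    }
}
```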
Lock Providers Supported:
Redis: Fast, in-memory locking via Medallion.Threading.Redis
PostgreSQL: Database-backed locking via Medallion.Threading.Postgres
SQL Server: Database-backed locking via Medallion.Threading.SqlServer
Azure Blob Storage: Cloud-native locking via Medallion.Threading.Azure
3. Centralized Scheduler (Quartz.NET Clustering)
Quartz.NET clustering ensures scheduled jobs (timers, delays, cron triggers) execute only once across the cluster.
Code References:
src/modules/Elsa.Scheduling/Services/DefaultBookmarkScheduler.cs - Enqueues bookmark resume tasks
src/modules/Elsa.Scheduling/Tasks/ResumeWorkflowTask.cs - Quartz job that resumes workflows
How It Works:
DefaultBookmarkScheduler creates a Quartz job for each scheduled bookmark
Quartz stores job metadata in a shared database
At execution time, nodes compete for a database lock
The node that acquires the lock executes the job; others skip it
Failed nodes' jobs are recovered by surviving nodes (failover)
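To make the flow concrete, a Quartz job of this kind looks roughly like the sketch below. It is illustrative only; the actual ResumeWorkflowTask and its job data keys differ:

```csharp
using Quartz;

// Because the job store is clustered, exactly one node acquires the database lock and
// runs Execute for a given trigger; the other nodes skip it, and a crashed node's
// pending jobs are picked up by surviving nodes.
public class ResumeBookmarkJob : IJob
{
    public Task Execute(IJobExecutionContext context)
    {
        var bookmarkId = context.MergedJobDataMap.GetString("BookmarkId"); // hypothetical key
        // ... resolve the workflow runtime and resume the bookmark identified above ...
        return Task.CompletedTask;
    }
}
```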
4. Distributed Cache Invalidation
When workflow definitions or other cached data changes, MassTransit publishes cache invalidation events to all nodes via a message broker (RabbitMQ, Azure Service Bus, etc.).
Message Flow:
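The flow can be sketched with a plain MassTransit consumer. The message contract below is hypothetical (not Elsa's actual event type) and is only meant to show the publish/consume-and-evict pattern:

```csharp
using MassTransit;
using Microsoft.Extensions.Caching.Memory;

// Hypothetical contract: published by the node that saved the updated definition.
public record WorkflowDefinitionUpdated(string DefinitionId);

// Registered on every node: evicts the stale entry from the local in-memory cache.
public class WorkflowDefinitionUpdatedConsumer : IConsumer<WorkflowDefinitionUpdated>
{
    private readonly IMemoryCache _cache;

    public WorkflowDefinitionUpdatedConsumer(IMemoryCache cache) => _cache = cache;

    public Task Consume(ConsumeContext<WorkflowDefinitionUpdated> context)
    {
        _cache.Remove($"workflow-definition:{context.Message.DefinitionId}");
        return Task.CompletedTask;
    }
}
```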
Architecture Patterns and Deployment Models
Pattern 1: Shared Database + Distributed Locks
Best for: Most production scenarios with moderate to high traffic
Configuration:
All nodes connect to the same database
Distributed runtime enabled via runtime.UseDistributedRuntime()
Distributed locks via Redis or PostgreSQL
Quartz clustering enabled for scheduled tasks
MassTransit for cache invalidation
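Putting the list above together, a minimal Program.cs sketch might look like this. The UseWorkflowRuntime wrapper and the omitted persistence, lock, and Quartz wiring are assumptions; exact module names vary by Elsa version:

```csharp
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddElsa(elsa =>
{
    // All nodes share the same database for workflow state (persistence setup omitted).
    elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime());

    // Propagate cache invalidation events to every node (broker setup omitted).
    elsa.UseDistributedCache(dc => dc.UseMassTransit());
});

// Additionally: register an IDistributedLockProvider (see "Configuring Distributed Locks")
// and enable Quartz clustering (see "Configuring Quartz Clustering").

var app = builder.Build();
app.Run();
```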
Pros:
Simple architecture
Easy to scale horizontally
No single point of failure (stateless nodes)
Cons:
Database becomes a bottleneck at extreme scale
Requires careful database tuning
Pattern 2: Leader-Election Scheduler
Best for: Environments where you want precise control over scheduling overhead
Configuration:
Scheduler Node: a single dedicated instance runs the scheduler and dispatches timer/cron work.
Worker Nodes: the remaining instances handle HTTP requests and workflow execution with the scheduler disabled.
Pros:
Centralized scheduling (easier to monitor)
Workers focused on request handling
Lower resource usage on workers
Cons:
Scheduler is a single point of failure (mitigate with active-standby setup)
More complex deployment configuration
Pattern 3: Quartz Clustering (All-Nodes-Participate)
Best for: Simplicity and automatic failover
Configuration: identical on every node — all nodes run the Quartz scheduler against a shared job store with clustering enabled (see "Configuring Quartz Clustering" below).
Pros:
Simple configuration (same for all nodes)
Automatic failover (if Node 1 crashes, Node 2 picks up its jobs)
No single point of failure
Cons:
Every node runs Quartz scheduler (slightly higher resource usage)
More database queries for cluster coordination
Recommendation: Use this pattern unless you have specific reasons to use leader-election.
Pattern 4: External Scheduler
Best for: Multi-tenant environments or complex orchestration needs
Example: Kubernetes CronJob
Pros:
No Quartz dependency
Leverage platform-native scheduling (Kubernetes, cloud functions)
Easier multi-cloud deployments
Cons:
External system must remain operational
More complex integration (API authentication, error handling)
No built-in bookmark scheduling (must implement externally)
Practical Configuration
Configuring Distributed Locks
Option 1: Redis-Based Locking (Recommended for Performance)
Prerequisites:
Redis 6.0+ deployed and accessible
NuGet package:
Medallion.Threading.Redis
Configuration Example:
Connection String Example:
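A minimal sketch covering both of the above, assuming the StackExchange.Redis and Medallion.Threading.Redis packages; host names and credentials are placeholders. Elsa resolves whichever IDistributedLockProvider is registered in DI, as noted under "Medallion.Threading Usage in Elsa Core" below:

```csharp
using Medallion.Threading;
using Medallion.Threading.Redis;
using StackExchange.Redis;

var builder = WebApplication.CreateBuilder(args);

// Example connection string (placeholder values); enable TLS and authentication in production.
const string redisConnectionString = "redis-host:6379,password=<secret>,ssl=true,abortConnect=false";

// Register the Redis-backed lock provider as IDistributedLockProvider so Elsa's
// WorkflowResumer (and other lock consumers) resolve it from DI.
builder.Services.AddSingleton<IDistributedLockProvider>(_ =>
{
    var multiplexer = ConnectionMultiplexer.Connect(redisConnectionString);
    return new RedisDistributedSynchronizationProvider(multiplexer.GetDatabase());
});
```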
See: examples/redis-lock-setup.md for detailed configuration and troubleshooting.
Option 2: PostgreSQL-Based Locking (No Additional Infrastructure)
Prerequisites:
PostgreSQL 12+ (same database as Elsa workflow storage)
NuGet package:
Medallion.Threading.Postgres
Configuration Example:
Connection String Example:
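A minimal sketch, assuming the Medallion.Threading.Postgres package; the connection string is a placeholder and typically points at the same database Elsa already uses:

```csharp
using Medallion.Threading;
using Medallion.Threading.Postgres;

var builder = WebApplication.CreateBuilder(args);

// Example connection string (placeholder values); advisory locks require no extra schema.
const string postgresConnectionString =
    "Host=postgres-host;Port=5432;Database=elsa;Username=elsa_app;Password=<secret>;SSL Mode=Require";

builder.Services.AddSingleton<IDistributedLockProvider>(
    _ => new PostgresDistributedSynchronizationProvider(postgresConnectionString));
```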
Pros vs Redis:
✅ No additional infrastructure required
✅ Uses existing database connection
❌ Slower lock acquisition (disk I/O vs in-memory)
❌ Adds load to database
Medallion.Threading Usage in Elsa Core:
Elsa uses Medallion.Threading abstractions to remain agnostic to the lock provider. The IDistributedLockProvider interface is implemented by all Medallion providers:
RedisDistributedSynchronizationProvider
PostgresDistributedSynchronizationProvider
SqlDistributedSynchronizationProvider
AzureBlobLeaseDistributedSynchronizationProvider
To use a different provider, simply register it as shown above. Elsa's WorkflowResumer will automatically use the registered provider.
Configuring Quartz Clustering
Example quartz.properties:
Configuration in Program.cs:
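A sketch of the clustered job store wired up in Program.cs, assuming the Quartz.Extensions.DependencyInjection, Quartz.Extensions.Hosting, and Quartz.Serialization.Json packages with a PostgreSQL job store; depending on your Elsa version, Quartz may instead be configured through Elsa's scheduling feature:

```csharp
using Quartz;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddQuartz(quartz =>
{
    quartz.UsePersistentStore(store =>
    {
        // Every node must point at the same job store database.
        store.UsePostgres(builder.Configuration.GetConnectionString("Quartz")!);

        // Equivalent to quartz.jobStore.clustered = true: nodes check in to the shared
        // store, compete for row locks at fire time, and recover failed nodes' jobs.
        store.UseClustering();

        store.UseNewtonsoftJsonSerializer();
        store.UseProperties = true;
    });
});

builder.Services.AddQuartzHostedService(options => options.WaitForJobsToComplete = true);
```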
Environment Variables:
See: examples/quartz-cluster-config.md for detailed configuration options.
Kubernetes Configuration
Minimal Deployment Snippet:
See examples/k8s-deployment.yaml for a complete example with:
Deployment with 3 replicas
Service (ClusterIP, no session affinity)
HorizontalPodAutoscaler
Pod Disruption Budget
Health probes (liveness, readiness, startup)
Key Points:
No Sticky Sessions Required: Elsa's distributed runtime manages state externally, so requests can be routed to any node
Readiness Probes: Use the /health/ready endpoint to ensure pods are ready before receiving traffic
Liveness Probes: Use the /health/live endpoint to restart unhealthy pods
Anti-Affinity: Spread pods across nodes for high availability
Resource Limits: Set appropriate CPU/memory limits to prevent resource contention
Ingress Configuration:
Helm Values:
See examples/helm-values.yaml for an annotated Helm values file with:
Multiple replicas configuration
Database, Redis, and RabbitMQ settings
Distributed runtime and locking configuration
Quartz clustering settings
HPA and resource limits
Health probe configurations
Docker Compose Development Example
For local development and testing clustering behavior:
Testing Commands:
Operational Topics
Metrics to Monitor
Workflow Execution Metrics:
Active workflow instances
Workflows completed per minute
Workflow execution failures
Average workflow execution time
Distributed Locking Metrics:
Lock acquisition time (P50, P95, P99)
Lock acquisition failures
Lock hold duration
Lock contention rate
Quartz Scheduling Metrics:
Scheduled jobs count
Job execution rate
Job misfires
Scheduler heartbeat intervals
System Metrics:
CPU usage per pod
Memory usage per pod
Database connection pool utilization
Redis connection pool utilization
Database Metrics:
Query execution time
Connection pool exhaustion
Deadlocks
Lock wait time
Log Levels and Structured Logging
Recommended Log Levels:
Production:
Debugging Clustering Issues:
Key Log Messages to Watch:
Successful Lock Acquisition:
Lock Acquisition Failure (expected in clusters):
Quartz Job Execution:
Cache Invalidation:
Troubleshooting Common Issues
Issue: Duplicate Workflow Executions
Symptoms:
Timers firing multiple times
Duplicate notifications or side effects
Multiple log entries for the same workflow execution
Diagnosis:
Solutions:
Verify runtime.UseDistributedRuntime() is called in configuration
Ensure Quartz clustering is enabled (quartz.jobStore.clustered = true)
Check that the distributed lock provider is registered and accessible
Verify all nodes use the same database and Redis instance
Issue: Bookmark Not Found
Symptoms:
Scheduled tasks fail with "Bookmark not found" error
Workflows not resuming at expected time
Diagnosis:
Solutions:
Check database connectivity from all nodes
Verify clock synchronization across nodes (NTP)
Ensure time zones are configured consistently
Check for database replication lag (if using replicas)
Issue: Lock Acquisition Timeouts
Symptoms:
Workflows stuck in "Suspended" state
Logs show "Failed to acquire lock after timeout"
Diagnosis:
Solutions:
Increase lock acquisition timeout
Check Redis/database connectivity and latency
Verify lock expiration is configured to prevent stuck locks
Clear stale locks (use with caution):
Issue: Cache Inconsistencies
Symptoms:
Nodes using different workflow definitions
Changes not reflecting immediately on all nodes
Diagnosis:
Solutions:
Verify elsa.UseDistributedCache(dc => dc.UseMassTransit()) is configured
Check RabbitMQ connectivity from all nodes
Verify message broker is operational
Restart all pods to force cache refresh
Retention and Cleanup
Workflow Instance Cleanup:
Old completed or faulted workflow instances should be cleaned up periodically:
Manual Cleanup (SQL):
Quartz Cleanup:
Quartz automatically cleans up completed jobs, but you may want to clean up old execution history:
Bookmark Cleanup:
Orphaned bookmarks (no associated workflow instance) should be cleaned up:
Validation Checklist for Cluster Behavior
Use this checklist to validate your clustering setup in a test environment:
1. Distributed Runtime Validation
Test Command:
2. Scheduled Task Validation
Test Command:
3. Cache Invalidation Validation
Test Command:
4. Failover Validation
Test Command:
5. High Availability Validation
Test Command:
6. Distributed Lock Validation
Test Command:
Security and Networking
Database Access Security
Recommendations:
Use TLS/SSL for database connections:
Restrict database access to Elsa nodes only:
Use Kubernetes Network Policies
Configure database firewall rules (cloud-managed databases)
Use dedicated database users with minimal permissions:
Rotate credentials regularly:
Use external secret management (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault)
Implement automated rotation policies
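For example, connection strings can be pulled from a vault at startup instead of living in environment variables. A sketch assuming Azure Key Vault with the Azure.Extensions.AspNetCore.Configuration.Secrets and Azure.Identity packages; the vault URL and secret names are placeholders:

```csharp
using Azure.Identity;

var builder = WebApplication.CreateBuilder(args);

// Secrets (database, Redis, RabbitMQ credentials) are resolved from Key Vault at startup,
// so rotating them requires no image rebuild — only a restart or configuration reload.
builder.Configuration.AddAzureKeyVault(
    new Uri("https://my-elsa-vault.vault.azure.net/"), // placeholder vault URL
    new DefaultAzureCredential());

var elsaDbConnectionString = builder.Configuration.GetConnectionString("Elsa"); // placeholder name
```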
Network Latency Considerations
Impact on Clustering:
Distributed lock acquisition time increases with latency
Quartz cluster check-ins may timeout with high latency
Cache invalidation events delayed
Recommendations:
Co-locate Elsa nodes and dependencies in the same region/AZ
Monitor network latency between Elsa nodes and:
Database (< 10ms recommended)
Redis (< 5ms recommended)
RabbitMQ (< 10ms recommended)
Adjust timeouts if cross-region deployment is unavoidable:
Use database connection pooling:
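The sketch below shows both ideas under assumed values: an Npgsql-style connection string with explicit pool bounds, and a more generous lock acquisition timeout for higher-latency links:

```csharp
using Medallion.Threading;

public static class CrossRegionSettings
{
    // Placeholder connection string with explicit pool bounds (Npgsql keywords).
    public const string PooledConnectionString =
        "Host=postgres-host;Database=elsa;Username=elsa_app;Password=<secret>;" +
        "Minimum Pool Size=10;Maximum Pool Size=100;Timeout=30";

    // Allow more time to acquire distributed locks when dependencies are farther away.
    public static async Task ResumeWithRelaxedTimeoutAsync(IDistributedLockProvider locks, string instanceId)
    {
        await using var handle = await locks.AcquireLockAsync(
            $"workflow-instance:{instanceId}", timeout: TimeSpan.FromSeconds(60));
        // ... resume the workflow instance while holding the lock ...
    }
}
```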
Time Zone Considerations for Timers
Issue: Scheduled workflows may execute at incorrect times if nodes have different time zones.
Recommendations:
Ensure all nodes use UTC:
Configure time zone in Kubernetes:
Store all timestamps in UTC in the database
Convert to user's local time in the UI/API layer
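A small sketch of that convention: persist UTC, convert only for display (the time zone ID below is just an example):

```csharp
// Persist timestamps in UTC.
var dueAtUtc = DateTimeOffset.UtcNow.AddHours(24);

// Convert at the UI/API edge, per user.
var userZone = TimeZoneInfo.FindSystemTimeZoneById("Europe/Amsterdam"); // example zone ID
var dueAtLocal = TimeZoneInfo.ConvertTime(dueAtUtc, userZone);

Console.WriteLine($"Stored (UTC): {dueAtUtc:O}, displayed: {dueAtLocal:O}");
```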
Tokenized Resume URL Security
Code Reference: src/modules/Elsa.Http/Extensions/BookmarkExecutionContextExtensions.cs - GenerateBookmarkTriggerUrl
When workflows are suspended waiting for HTTP requests, Elsa can generate tokenized URLs that allow external systems to resume the workflow:
Security Considerations:
Tokens are opaque and unguessable (see the illustrative sketch after this list):
Generated using cryptographically secure random number generator
Typically 32-64 characters long
Tokens should be single-use:
Elsa automatically invalidates tokens after workflow resumes
Replay attacks prevented
Use HTTPS for resume URLs:
Never send tokens over unencrypted HTTP
Configure TLS/SSL on ingress controller
Token expiration:
Configure bookmark expiration to automatically clean up old tokens
Expired bookmarks cannot be used to resume workflows
Audit logging:
Log all resume attempts (successful and failed)
Monitor for unusual patterns (repeated resume attempts, token scanning)
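As an illustration of the first point above (this is not Elsa's actual token code), a URL-safe, cryptographically random token can be produced like this:

```csharp
using System.Security.Cryptography;

// 32 random bytes -> roughly 43-character URL-safe token; opaque and unguessable to callers.
static string GenerateResumeToken()
{
    var bytes = RandomNumberGenerator.GetBytes(32);
    return Convert.ToBase64String(bytes)
        .TrimEnd('=')
        .Replace('+', '-')
        .Replace('/', '_');
}
```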
Example Configuration:
Studio-Specific Notes
Embedding Studio Behind Ingress
When deploying Elsa Studio (the visual workflow designer) in a clustered environment:
Deployment Pattern:
Configuration Example:
Session Affinity for Studio UI
Do you need sticky sessions for Studio?
Short answer: No, provided Studio is a stateless SPA (Single Page Application) that only communicates with the API.
Long answer:
Elsa Studio (Blazor WebAssembly) is stateless and doesn't require session affinity
All state is managed by the Elsa Server API (which uses distributed state)
Studio can be freely routed to any pod
Exception: If using Elsa Studio Blazor Server (not WebAssembly), you do need session affinity:
Recommendation: Use Elsa Studio WebAssembly for clustered deployments to avoid session affinity complexity.
Ingress Settings
Recommended Annotations (NGINX Ingress):
Studio Authentication in Clusters
Scenario: Multiple Studio pods behind a load balancer.
Requirements:
Shared authentication provider (don't use in-memory auth)
Distributed session storage (if using cookie-based auth)
Token-based authentication (recommended for stateless clusters)
Example: JWT Bearer Token Authentication:
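A sketch using the standard ASP.NET Core JWT bearer middleware (package Microsoft.AspNetCore.Authentication.JwtBearer); the authority and audience are placeholders, and depending on your setup the tokens may come from Elsa's own identity module or an external provider:

```csharp
using Microsoft.AspNetCore.Authentication.JwtBearer;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.Authority = "https://identity.example.com"; // placeholder issuer
        options.Audience = "elsa-api";                       // placeholder audience
        options.RequireHttpsMetadata = true;
    });

builder.Services.AddAuthorization();

var app = builder.Build();
app.UseAuthentication();
app.UseAuthorization();
app.Run();

// Tokens are validated statelessly on every request, so any pod in the cluster can
// authenticate a caller without shared session state or sticky sessions.
```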
Alternative: OpenID Connect with External Provider:
Azure AD
Auth0
Keycloak
IdentityServer
This ensures authentication state is managed externally and works seamlessly across all cluster nodes.
Placeholders for Screenshots
[Screenshot: Elsa Studio showing workflow definitions synchronized across nodes]
[Screenshot: Kubernetes dashboard displaying 3 healthy Elsa pods with auto-scaling enabled]
[Screenshot: Grafana dashboard showing distributed lock acquisition metrics and Quartz job execution rates]
[Screenshot: Logs demonstrating only one node executing a scheduled timer across a 3-node cluster]
[Screenshot: Redis Commander showing distributed lock keys with TTL]
[Screenshot: PostgreSQL query result showing Quartz cluster state with multiple scheduler instances]
Related Documentation
Distributed Hosting - Core distributed runtime concepts
Kubernetes Deployment Guide - General Kubernetes deployment
Database Configuration - Database setup
Authentication Guide - Securing your deployment
Example Code Repository
For complete, deployable examples:
elsa-samples - Official sample projects
elsa-guides - Step-by-step guide implementations
References
Elsa Core Source Code: https://github.com/elsa-workflows/elsa-core
Medallion.Threading: https://github.com/madelson/DistributedLock
Quartz.NET: https://www.quartz-scheduler.net/
MassTransit: https://masstransit.io/
Support
Community: Discord, Slack (see main README for links)
Last Updated: 2025-11-24