# Cluster
Distributed worker coordination, leader election, heartbeats, and work stealing.
The `cluster` package enables multiple Dispatch instances to coordinate work. At any time, exactly one instance is the cluster leader; the others are followers.
## How it works
- Every Dispatch instance registers a `cluster.Worker` record at startup with its hostname, queues, and concurrency.
- Workers send heartbeats every `HeartbeatInterval` (default: 10s) to update `LastSeen`.
- The leader is determined by `AcquireLeadership`, which uses optimistic locking in the store. Leadership is renewed before it expires.
- The leader periodically (see the loop sketch below):
  - Fires due cron entries
  - Scans for stale workers (workers whose heartbeat has been missing longer than `StaleJobThreshold`)
  - Reclaims stale workers' in-flight jobs and re-enqueues them (work stealing)
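Put together, each instance runs a small coordination loop. The sketch below is illustrative only: the `Store` interface, its method signatures, and the `runWorker` helper are assumptions made for this example rather than Dispatch's actual API; only the `HeartbeatInterval`, `LastSeen`, and `AcquireLeadership` names come from the description above.

```go
package clusterexample

import (
	"context"
	"log"
	"time"
)

// Store is a stand-in for the cluster store; the real interface in
// Dispatch will differ.
type Store interface {
	Heartbeat(ctx context.Context, workerID string) error // updates LastSeen
	AcquireLeadership(ctx context.Context, workerID string) (bool, error)
}

// runWorker illustrates the coordination loop: send a heartbeat on every
// HeartbeatInterval tick, then try to (re)acquire leadership so that
// exactly one instance performs the leader-only duties.
func runWorker(ctx context.Context, store Store, workerID string, heartbeatInterval time.Duration) {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := store.Heartbeat(ctx, workerID); err != nil {
				log.Printf("heartbeat failed: %v", err)
				continue
			}
			isLeader, err := store.AcquireLeadership(ctx, workerID)
			if err != nil {
				log.Printf("leadership check failed: %v", err)
				continue
			}
			if isLeader {
				// Leader-only duties: fire due cron entries, scan for
				// stale workers, and re-enqueue their in-flight jobs.
			}
		}
	}
}
```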
## Worker states

| State | Meaning |
|---|---|
| `active` | Healthy and processing jobs |
| `draining` | Finishing in-flight jobs, not accepting new ones (shutdown) |
| `dead` | Stopped responding; in-flight jobs eligible for reassignment |
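If you need to branch on these states in your own code, they can be modeled as a small string-backed enum. The type and constant names below are assumptions for illustration; only the string values come from the table.

```go
// Illustrative only: the type and constant names are assumptions, not
// Dispatch's actual API. The string values mirror the table above.
type WorkerState string

const (
	WorkerActive   WorkerState = "active"   // healthy and processing jobs
	WorkerDraining WorkerState = "draining" // finishing in-flight jobs during shutdown
	WorkerDead     WorkerState = "dead"     // stopped responding; jobs eligible for reassignment
)
```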
## Leadership
Only the leader performs:
- Cron scheduling
- Stale-job recovery
- Work rebalancing
If leadership is lost mid-operation, `dispatch.ErrLeadershipLost` is returned. This is normal and safe to ignore; another worker will acquire leadership shortly.
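A typical way to treat the error as benign is to check for it with `errors.Is` and move on. The wrapper function below is hypothetical; only `dispatch.ErrLeadershipLost` comes from the package.

```go
package clusterexample

import (
	"context"
	"errors"
	"log"

	"github.com/xraph/dispatch"
)

// runLeaderCycle is a hypothetical wrapper around one leader-only pass
// (cron firing, stale-job recovery, rebalancing).
func runLeaderCycle(ctx context.Context, leaderTask func(context.Context) error) error {
	if err := leaderTask(ctx); err != nil {
		if errors.Is(err, dispatch.ErrLeadershipLost) {
			// Benign: another instance will acquire leadership shortly.
			log.Println("leadership lost mid-operation; skipping this cycle")
			return nil
		}
		return err
	}
	return nil
}
```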
## Kubernetes consensus

For Kubernetes deployments, use `cluster/k8s`, which implements leader election via the Kubernetes API (client-go):

```go
import "github.com/xraph/dispatch/cluster/k8s"

leaderElector := k8s.NewLeaderElector(kubeClient, namespace, leaseName)

eng := engine.Build(d,
	engine.WithLeaderElector(leaderElector),
)
```

This ensures correct behavior across pod restarts and rolling updates.
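The snippet above assumes a `kubeClient` already exists. Inside a pod, one common way to build it is client-go's in-cluster config; the namespace and lease name used here are placeholders, not values prescribed by Dispatch.

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"

	"github.com/xraph/dispatch/cluster/k8s"
)

func main() {
	// In-cluster config uses the pod's service account credentials.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build clientset: %v", err)
	}

	// Placeholder namespace and lease name; use values that match your deployment.
	leaderElector := k8s.NewLeaderElector(kubeClient, "dispatch", "dispatch-leader")
	_ = leaderElector // pass to engine.WithLeaderElector as shown above
}
```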
## Configuration

| Config field | Default | Meaning |
|---|---|---|
| `HeartbeatInterval` | 10s | How often workers update `LastSeen` |
| `StaleJobThreshold` | 30s | How long before a worker is considered dead |
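To make the defaults concrete, here is a minimal sketch of how such a configuration could be represented. The `Config` struct and `DefaultConfig` helper are assumptions for illustration; only the field names and default values come from the table above.

```go
package clusterexample

import "time"

// Config is illustrative only; the field names and defaults mirror the
// documented configuration, but the struct itself is an assumption.
type Config struct {
	HeartbeatInterval time.Duration // how often workers update LastSeen
	StaleJobThreshold time.Duration // how long before a worker is considered dead
}

// DefaultConfig returns the documented defaults.
func DefaultConfig() Config {
	return Config{
		HeartbeatInterval: 10 * time.Second,
		StaleJobThreshold: 30 * time.Second,
	}
}
```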