# Cluster
Distributed worker coordination, leader election, heartbeats, and work stealing.
The `cluster` package enables multiple Dispatch instances to coordinate work. At any time, exactly one instance is the cluster leader; the others are followers.
## How it works
- Every Dispatch instance registers a `cluster.Worker` record at startup with its hostname, queues, and concurrency.
- Workers send heartbeats every `HeartbeatInterval` (default: 10s) to update `LastSeen`.
- The leader is determined by `AcquireLeadership`, which uses optimistic locking in the store. Leadership is renewed before it expires.
- The leader periodically (see the loop sketch below):
  - Fires due cron entries
  - Scans for stale workers (workers whose heartbeat has been missing longer than `StaleJobThreshold`)
  - Reclaims stale workers' in-flight jobs and re-enqueues them (work stealing)
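Put together, each instance runs a small coordination loop. The sketch below is illustrative only: the `Store` interface, its method signatures, and the `runWorker` helper are assumptions made for this example rather than Dispatch's actual API; only the `HeartbeatInterval`, `LastSeen`, and `AcquireLeadership` names come from the description above.

```go
package clusterexample

import (
	"context"
	"log"
	"time"
)

// Store is a stand-in for the cluster store; the real interface in
// Dispatch will differ.
type Store interface {
	Heartbeat(ctx context.Context, workerID string) error // updates LastSeen
	AcquireLeadership(ctx context.Context, workerID string) (bool, error)
}

// runWorker illustrates the coordination loop: send a heartbeat on every
// HeartbeatInterval tick, then try to (re)acquire leadership so that
// exactly one instance performs the leader-only duties.
func runWorker(ctx context.Context, store Store, workerID string, heartbeatInterval time.Duration) {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := store.Heartbeat(ctx, workerID); err != nil {
				log.Printf("heartbeat failed: %v", err)
				continue
			}
			isLeader, err := store.AcquireLeadership(ctx, workerID)
			if err != nil {
				log.Printf("leadership check failed: %v", err)
				continue
			}
			if isLeader {
				// Leader-only duties: fire due cron entries, scan for
				// stale workers, and re-enqueue their in-flight jobs.
			}
		}
	}
}
```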
## Worker states

| State | Meaning |
|---|---|
| `active` | Healthy and processing jobs |
| `draining` | Finishing in-flight jobs, not accepting new ones (shutdown) |
| `dead` | Stopped responding; in-flight jobs eligible for reassignment |
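If you need to branch on these states in your own code, they can be modeled as a small string-backed enum. The type and constant names below are assumptions for illustration; only the string values come from the table.

```go
// Illustrative only: the type and constant names are assumptions, not
// Dispatch's actual API. The string values mirror the table above.
type WorkerState string

const (
	WorkerActive   WorkerState = "active"   // healthy and processing jobs
	WorkerDraining WorkerState = "draining" // finishing in-flight jobs during shutdown
	WorkerDead     WorkerState = "dead"     // stopped responding; jobs eligible for reassignment
)
```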
## Leadership
Only the leader performs:
- Cron scheduling
- Stale-job recovery
- Work rebalancing
If leadership is lost mid-operation, `dispatch.ErrLeadershipLost` is returned. This is normal and safe to ignore; another worker will acquire leadership shortly.
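A typical way to treat the error as benign is to check for it with `errors.Is` and move on. The wrapper function below is hypothetical; only `dispatch.ErrLeadershipLost` comes from the package.

```go
package clusterexample

import (
	"context"
	"errors"
	"log"

	"github.com/xraph/dispatch"
)

// runLeaderCycle is a hypothetical wrapper around one leader-only pass
// (cron firing, stale-job recovery, rebalancing).
func runLeaderCycle(ctx context.Context, leaderTask func(context.Context) error) error {
	if err := leaderTask(ctx); err != nil {
		if errors.Is(err, dispatch.ErrLeadershipLost) {
			// Benign: another instance will acquire leadership shortly.
			log.Println("leadership lost mid-operation; skipping this cycle")
			return nil
		}
		return err
	}
	return nil
}
```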
## Kubernetes consensus

For Kubernetes deployments, use `cluster/k8s`, which implements leader election via the Kubernetes API (client-go):

```go
import "github.com/xraph/dispatch/cluster/k8s"

leaderElector := k8s.NewLeaderElector(kubeClient, namespace, leaseName)

eng := engine.Build(d,
	engine.WithLeaderElector(leaderElector),
)
```

This ensures correct behavior across pod restarts and rolling updates.
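The snippet above assumes a `kubeClient` already exists. Inside a pod, one common way to build it is client-go's in-cluster config; the namespace and lease name used here are placeholders, not values prescribed by Dispatch.

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"

	"github.com/xraph/dispatch/cluster/k8s"
)

func main() {
	// In-cluster config uses the pod's service account credentials.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build clientset: %v", err)
	}

	// Placeholder namespace and lease name; use values that match your deployment.
	leaderElector := k8s.NewLeaderElector(kubeClient, "dispatch", "dispatch-leader")
	_ = leaderElector // pass to engine.WithLeaderElector as shown above
}
```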
## Configuration

| Config field | Default | Meaning |
|---|---|---|
| `HeartbeatInterval` | 10s | How often workers update `LastSeen` |
| `StaleJobThreshold` | 30s | How long before a worker is considered dead |
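To make the defaults concrete, here is a minimal sketch of how such a configuration could be represented. The `Config` struct and `DefaultConfig` helper are assumptions for illustration; only the field names and default values come from the table above.

```go
package clusterexample

import "time"

// Config is illustrative only; the field names and defaults mirror the
// documented configuration, but the struct itself is an assumption.
type Config struct {
	HeartbeatInterval time.Duration // how often workers update LastSeen
	StaleJobThreshold time.Duration // how long before a worker is considered dead
}

// DefaultConfig returns the documented defaults.
func DefaultConfig() Config {
	return Config{
		HeartbeatInterval: 10 * time.Second,
		StaleJobThreshold: 30 * time.Second,
	}
}
```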