Partial downtime in Hatchet Cloud us-west-2
Resolved
May 13, 2026 at 6:20pm UTC
Hatchet experienced ~15 minutes of partial, unexpected downtime on some parts of the system on May 13th, 2026 between 17:56 and 18:20 UTC. This caused delayed task runs, a dashboard that appeared frozen, and ingestion errors in certain cases. Cron runs were also impacted and will need to be replayed manually (which we can help with, if needed). This was an isolated incident, and we do not expect it to happen again.
Root Cause: At approximately 17:56 UTC, we began a routine deploy that included a database migration. This migration ran an ALTER TABLE statement before a VALIDATE CONSTRAINT in the same transaction. Because a lock is held until its transaction ends, the ACCESS EXCLUSIVE lock acquired by the ALTER TABLE remained in place for the duration of the VALIDATE CONSTRAINT (which would otherwise not require an ACCESS EXCLUSIVE lock). This kept the internal table locked for an extended period of time, causing significant backpressure on our internal queue and some upstream effects on publishers.
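To illustrate the locking behavior described above, here is a minimal sketch of one way to keep a VALIDATE CONSTRAINT out of the transaction that holds the ACCESS EXCLUSIVE lock. It is not the actual migration or the tooling we are building. It assumes PostgreSQL semantics, where VALIDATE CONSTRAINT on its own takes a weaker lock than ACCESS EXCLUSIVE. The table name, the constraint name, and the helpers split_validation and render are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """A group of SQL statements that commit (and release locks) together."""
    statements: tuple


def split_validation(statements):
    """Group statements so each VALIDATE CONSTRAINT runs in its own transaction.

    Statements before a VALIDATE CONSTRAINT are committed first, so any
    ACCESS EXCLUSIVE lock they took is released before validation starts.
    """
    transactions = []
    pending = []
    for stmt in statements:
        if "VALIDATE CONSTRAINT" in stmt.upper():
            if pending:
                transactions.append(Transaction(tuple(pending)))
                pending = []
            transactions.append(Transaction((stmt,)))
        else:
            pending.append(stmt)
    if pending:
        transactions.append(Transaction(tuple(pending)))
    return transactions


def render(transactions):
    """Render the plan as SQL text with explicit transaction boundaries."""
    lines = []
    for tx in transactions:
        lines.append("BEGIN;")
        lines.extend(s.rstrip(";") + ";" for s in tx.statements)
        lines.append("COMMIT;")
    return "\n".join(lines)


# Hypothetical migration: names are placeholders, not our real schema.
MIGRATION = [
    "ALTER TABLE example_queue ADD CONSTRAINT example_check "
    "CHECK (priority >= 0) NOT VALID",
    "ALTER TABLE example_queue VALIDATE CONSTRAINT example_check",
]

plan = split_validation(MIGRATION)
print(render(plan))
```

With the two statements in separate transactions, the short ALTER TABLE commits and releases its lock before the longer-running validation begins, so the validation scan no longer blocks ordinary reads and writes on the table.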
We took steps to mitigate the effects of the issue as soon as we became aware of it. From that point, full recovery took approximately ten minutes.
At this time, we believe that task ingestion was continuing to progress as normal, albeit with delays on execution, as long as the task was accepted by Hatchet.
Action Items: We’re actively working on an improved way of running migrations under load on large databases, and plan to have it ready within the coming weeks, to help safeguard against this kind of unexpected failure in our deploy process. Additionally, we’re continuing to improve our monitoring and alerting so that we can recover more quickly from incidents such as this one.
Affected services