3.
Railway outage postmortem: GCP suspended multiple accounts
A multi-AZ platform went down after a single-cloud account failure, turning cloud-provider dependency into an immediate reliability design problem.
2 appearances on the backlist front page in the last 30 days.
We designed the network control plane to survive AZ failures without interruption. However, nuking every AZ within a single cloud was not in our threat model. The fix: running a shard on every cloud in the network ring (AWS, GCP, Metal).
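To make the failure mode concrete, here is a minimal sketch of why spreading shards across providers changes the outcome. It is not Railway's implementation: the shard names, the placement maps, and the strict-majority rule are all assumptions chosen for illustration, since the postmortem excerpt only says that a shard now runs on every cloud in the ring.

```python
# Hypothetical placements: shard name -> provider hosting it.
# Three shards in one cloud models spreading across AZs of a single provider.
SINGLE_CLOUD = {"shard-a": "gcp", "shard-b": "gcp", "shard-c": "gcp"}
MULTI_CLOUD = {"shard-a": "aws", "shard-b": "gcp", "shard-c": "metal"}


def survives_provider_loss(placement: dict[str, str], lost: str) -> bool:
    """Return True if a strict majority of shards remain when `lost` disappears.

    The majority rule is an assumption for this sketch; the source does not
    say how the control plane decides it is still healthy.
    """
    remaining = sum(1 for provider in placement.values() if provider != lost)
    return remaining > len(placement) // 2


if __name__ == "__main__":
    for name, placement in (("single-cloud", SINGLE_CLOUD), ("multi-cloud", MULTI_CLOUD)):
        print(name, "survives losing gcp:", survives_provider_loss(placement, "gcp"))
```

Under these assumptions, an account suspension at one provider removes every shard in the single-cloud layout, while the multi-cloud layout keeps two of three.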