Service Disruption
Summary of impact: Starting around 15:43 UTC, IRP monitoring began alerting on disrupted access to IRP merchant sites. The disruption began as increased latency and progressed to timeouts and errors. It lasted until approximately 19:07 UTC, after which services began to recover, although some access remained intermittent while the recovery was in progress. IRP merchant services were fully recovered by 22:40 UTC.
Root cause: Azure confirmed degradation of the Azure Front Door (CDN) service, which IRP use to host client services. The outage also affected other Azure and Microsoft services that depend on Azure Front Door, including the Azure Portal, which IRP use for management and monitoring of client services. This was a global outage. Azure have confirmed that a tenant configuration triggered it after their protection mechanisms failed to detect the at-fault deployment.
Party Responsible: Azure
Mitigation: IRP actively monitored the situation and were in discussions with our hosting partners throughout. Azure took action to restore services, including blocking further configuration changes and rolling back to the last known good state. Azure confirmed that the rollback completed successfully and that they were recovering nodes and re-routing traffic through healthy nodes using a "gradual by design" process to ensure stability and prevent overloading of recovered services. During this recovery some traffic may still have been routed to unhealthy nodes, so IRP merchant services remained intermittent until the process was complete. Full recovery was expected around 23:20 UTC, but IRP monitoring recorded no further disruption after 22:40 UTC. Azure have also implemented additional validation and rollback controls to prevent a recurrence.
Resolution: Azure restored services by rolling back to the last known good state and gradually re-routing traffic through healthy nodes.