Incident start: 2021/09/07 - 15:41 UTC
Incident end: 2021/09/07 - 17:30 UTC
| Service | Impact | Impact window |
| --- | --- | --- |
| SMS | HTTP SMS API request timeouts, high response times, and HTTP 504 responses. Affected approximately half of our traffic during the impact period. | 15:41 UTC - 17:30 UTC |
| Verify | API request failures due to failed internal quota service requests (most were eventually retried successfully) | 15:41 UTC - 17:26 UTC |
| Messages API | Undelivered messages | 15:41 UTC - 17:26 UTC |
| Voice | Certain calls may have been impacted due to failed internal quota service requests | |
On September 7th at 15:41 UTC, the HTTP SMS API in our US data center began suffering request timeouts. Soon after, the HTTP SMS API in our Asia-Pacific data center, which is used to fail over traffic from the US data center, began suffering the same symptoms. The Verify, Messages, and Voice APIs were also impacted, to a lesser extent.
The incident was resolved at 17:30 UTC.
| Time (UTC) | Event |
| --- | --- |
| 15:36 | Cloud provider starts suffering packet loss within and into our US data center |
| 15:41 | First timeouts of HTTP SMS API in US data centers |
| 15:45 | Automated alert raised for growing SMS queues |
| 15:56 | Automated alert raised for high SMS API response times |
| 15:57 | First timeouts of HTTP SMS API in Asia-Pacific data centers |
| 16:02 | Public announcement requested |
| 16:06 | Public HTTP SMS traffic failed over to Asia-Pacific data centers |
| 16:13 | Public HTTP SMS traffic moved back to US data centers because of timeouts in Asia-Pacific |
| 16:23 | Public announcement posted on https://www.nexmostatus.com/incidents/ddt05b9m6l5q |
| 17:26 | Incident recovered for HTTP SMS API in US data centers |
| 17:30 | Incident recovered for HTTP SMS API in Asia-Pacific data centers |
| 17:35 | Cloud provider confirms their packet loss issues are resolved |
A global networking issue with one of our cloud providers led to very high packet loss; their interim postmortem attributes the issue to a misconfiguration of their backbone network.
Due to the high packet loss, our global SMS platform was unable to reliably send messages to suppliers in various data centers, and internal queues grew rapidly across a significant proportion of our supplier connections. Under normal operation we reject messages when an individual queue becomes too large, which keeps the APIs stable. In this incident the queues for many supplier connections grew rapidly at the same time, memory usage grew unsustainably, and the API became unresponsive; load balancers responded with request timeouts.
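For illustration, the sketch below shows the kind of per-supplier bounded queue with fail-fast rejection described above. It is a minimal Go sketch under assumptions: the `supplierQueue` type, the queue capacity, and the supplier name are hypothetical and do not reflect our actual platform code.

```go
package main

import (
	"errors"
	"fmt"
)

// message is a minimal stand-in for an SMS awaiting delivery to a supplier.
type message struct {
	to   string
	body string
}

// supplierQueue is a hypothetical bounded, in-memory queue for one supplier
// connection. When the queue is full, new messages are rejected immediately
// so the API fails fast instead of buffering without limit.
type supplierQueue struct {
	name  string
	queue chan message
}

var errQueueFull = errors.New("supplier queue full: message rejected")

func newSupplierQueue(name string, capacity int) *supplierQueue {
	return &supplierQueue{name: name, queue: make(chan message, capacity)}
}

// enqueue adds a message if there is room, otherwise rejects it.
func (q *supplierQueue) enqueue(m message) error {
	select {
	case q.queue <- m:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	// Illustrative only: a tiny per-supplier limit to show the rejection path.
	q := newSupplierQueue("supplier-a", 2)
	for i := 0; i < 3; i++ {
		if err := q.enqueue(message{to: "+15550000000", body: fmt.Sprintf("msg %d", i)}); err != nil {
			fmt.Printf("message %d: %v\n", i, err)
			continue
		}
		fmt.Printf("message %d queued\n", i)
	}
}
```

A per-queue limit of this kind protects the API when a single supplier connection backs up; when a large number of supplier connections back up simultaneously, as happened here, aggregate memory use can still grow to unsustainable levels before enough load is shed.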
Our normal practice of failing over to Asia-Pacific data centers did not resolve the API issues, as messages destined for other data centers began queuing in the internal queues of the Asia-Pacific SMS platform, causing a similar effect to that seen in our US data center.
Our systems alerted us to growing queues and packet loss in the US data center, and we initiated a standard failover to the Asia-Pacific data center; that data center then began suffering similar queueing when trying to send messages back to US and EU suppliers, so the failover was reversed.
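The failover decision can be thought of as routing public traffic to whichever data center still has healthy queues. The sketch below is purely illustrative Go under assumptions: the `dataCenter` type, its fields, and the health criterion are hypothetical and are not our actual routing logic.

```go
package main

import "fmt"

// dataCenter is a hypothetical view of one region's HTTP SMS front end.
type dataCenter struct {
	name         string
	queueDepth   int  // current internal queue depth (illustrative)
	queueHealthy bool // true while queue depth is below the alert threshold
}

// chooseDataCenter is an illustrative routing decision: prefer the primary
// data center, fail over to the secondary only while the primary is unhealthy
// and the secondary is still healthy, otherwise fall back to the primary.
func chooseDataCenter(primary, secondary dataCenter) dataCenter {
	if !primary.queueHealthy && secondary.queueHealthy {
		return secondary
	}
	return primary
}

func main() {
	us := dataCenter{name: "US", queueDepth: 120000, queueHealthy: false}
	apac := dataCenter{name: "Asia-Pacific", queueDepth: 95000, queueHealthy: false}

	// Once both regions' queues are unhealthy, the decision falls back to the
	// primary (US) data center.
	target := chooseDataCenter(us, apac)
	fmt.Println("routing public HTTP SMS traffic to:", target.name)
}
```

In a scenario like this incident, where both regions' queues eventually became unhealthy, such a decision falls back to the primary data center, which mirrors the reversal of the failover at 16:13 UTC.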
As the packet loss subsided, we were able to recover the global SMS platform through targeted traffic redirection and platform restarts.