Issues with HTTP SMS in Washington Data Center
Incident Report for Vonage API
Postmortem

Date/Time Impacted

2021/09/07 - 15:41 UTC

2021/09/07 - 17:30 UTC

Services Impacted

Product | Impact | Timeframe
SMS | HTTP SMS API request timeouts, high response times, and HTTP 504 responses; affected approximately half of our traffic during the impact period | 15:41 - 17:30 UTC
Verify | API request failures due to failed internal quota service requests (most of them eventually retried successfully) | 15:41 - 17:26 UTC
Messages | Undelivered messages | 15:41 - 17:26 UTC
Voice | Certain calls may have been impacted due to failed internal quota service requests

Summary

On September 7th at 15:41 UTC the HTTP SMS API in our US data center began suffering request timeouts. Soon after, the HTTP SMS API in our Asia-Pacific data center, which is used for failing over traffic from the US data center, also began to suffer the same symptoms. Verify API, Messages API, and Voice APIs were also impacted to a lesser extent.

The incident was resolved at 17:30 UTC.

Timeline

UTC Event
15:36 Cloud provider starts suffering packet loss within and into our US data center
15:41 First timeouts of HTTP SMS API in US data centers
15:45 Automated alert raised for growing SMS queues
15:56 Automated alert raised for high SMS API response times
15:57 First timeouts of HTTP SMS API in Asia-Pacific data centers
16:02 Public announcement requested
16:06 Public HTTP SMS traffic failed over to Asia-Pacific data centers
16:13 Public HTTP SMS traffic moved back to US data centers because of timeouts in Asia-Pacific
16:23 Public announcement posted on https://www.nexmostatus.com/incidents/ddt05b9m6l5q
17:26 Incident recovered for HTTP SMS API in US data centers
17:30 Incident recovered for HTTP SMS API in Asia-Pacific data centers
17:35 Cloud provider confirms their packet loss issues are resolved

Root Cause

A global networking issue with one of our cloud providers led to very high packet loss; their interim postmortem attributes the problem to a misconfiguration of their backbone network.

Due to the high packet loss, our global SMS platform was unable to reliably send messages to suppliers in various data centers, leading to rapidly growing internal queues across a significant proportion of our supplier connections. Under normal operation we reject messages when individual queues become too large to ensure that APIs remain stable. In this incident the queues for many supplier connections grew rapidly, causing memory usage to grow unsustainably and the API to become unresponsive; load balancers responded with request timeouts.

Our normal practice of failing over to Asia-Pacific data centers did not resolve the API issues: messages that needed to be sent to the other data centers began accumulating in the internal queues of the Asia-Pacific SMS platform, causing the same effect as in our US data center.

Restoration Action

Our systems alerted us to growing queues and packet loss in the US data center, and we initiated a standard failover to the Asia-Pacific data center. That data center began suffering similar queue growth when trying to send messages back to US and EU suppliers, so the failover was reversed.

As the packet loss subsided, we were able to recover the global SMS platform through targeted traffic redirection and platform restarts.

Next Steps

  • Finish the migration away from the affected cloud provider for core services (Q4'21 US, Q1'22 APAC)
  • Finish replacing the message queueing system so that queues can grow indefinitely without impacting memory usage or application performance (30th Nov'21)
  • Decrease the alerting threshold for increased SMS API response times (10th Sep'21)
  • Add packet loss alerting (completed)
  • Improve the platform failover procedure (17th Sep'21)
  • Train communication managers to ensure there are no delays in public announcements
Posted Sep 09, 2021 - 18:18 UTC

Resolved
UPDATE: From 15:40 UTC, customers may have experienced timeouts, rejections, and failures for HTTP SMS in our Washington and Singapore Data Centers. These issues have now been fully resolved, but we are continuing to monitor the situation.

Customer impact would have been seen with the following services:

SMS API
Verify
Messages API

A postmortem with further details will be published within the next two days; in the meantime, please contact support with any questions.
Posted Sep 07, 2021 - 19:29 UTC
Monitoring
Customers should now be seeing stable HTTP SMS API performance as of 17:30 UTC.
Posted Sep 07, 2021 - 18:25 UTC
Update
We are continuing to investigate this issue.
Posted Sep 07, 2021 - 17:37 UTC
Update
We continue to investigate as a matter of urgency. In addition to slow response times, customers may also be experiencing failures and rejections.
Posted Sep 07, 2021 - 17:32 UTC
Update
We are continuing to investigate this issue.
Posted Sep 07, 2021 - 17:19 UTC
Update
We are continuing to investigate this issue.
Posted Sep 07, 2021 - 16:36 UTC
Investigating
Some customers may have experienced slow response times for the HTTP SMS and Verify APIs in our Washington and Singapore Data Centers from 15:40 UTC.
Posted Sep 07, 2021 - 16:23 UTC
This incident affected: SMS API (Outbound SMS), Verify API, and Verify SDK.