Service Issue in Platform - SMS
Incident Report for Vonage API
Postmortem

Summary

On March 2nd 2021 at 10:09 UTC, our monitoring services alerted us to an issue impacting our SMS infrastructure in our US data center. In order to remediate the failure the on-call team initiated a failover of the HTTP SMS traffic to our Asia-Pacific data center at 10:24 UTC; this completed successfully at 10:31 UTC.

We currently do not support failing over SMPP traffic to other data centers. Failing over between SMPP cluster instances in any datacenter is a responsibility delegated to our customers.
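As an illustration of what customer-side SMPP failover could look like, the sketch below tries a list of endpoints in order and keeps the first one that accepts a connection. It is a minimal sketch, not a Vonage-provided client: the function names are hypothetical, the port value is a placeholder assumption (2775 is the IANA-registered SMPP port; use the port from your own account details), and a real client would follow the TCP connect with an SMPP bind.

```python
import socket
from typing import Any, Callable, Iterable, Tuple

Endpoint = Tuple[str, int]

# Assumption: placeholder port, replace with the port from your SMPP account details.
SMPP_PORT = 2775


def default_connect(endpoint: Endpoint, timeout: float = 5.0) -> socket.socket:
    """Open a plain TCP connection; a real client would then send an SMPP bind."""
    return socket.create_connection(endpoint, timeout=timeout)


def connect_with_failover(
    endpoints: Iterable[Endpoint],
    connect: Callable[[Endpoint], Any] = default_connect,
) -> Tuple[Endpoint, Any]:
    """Try each endpoint in order; return (endpoint, connection) for the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except OSError as exc:  # covers refused connections and timeouts
            errors.append((endpoint, exc))
    raise ConnectionError(f"all SMPP endpoints failed: {errors}")
```

A caller might pass `[("smpp1.nexmo.com", SMPP_PORT), ("smpp2.nexmo.com", SMPP_PORT)]` and re-run the function whenever the active session drops.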

This incident began during a business-as-usual (BAU) deployment, which led our engineering team to incorrectly conclude that the deployment had caused the issue. They therefore initiated a rollback, the standard remedial action in such situations. The rollback did not recover the service, and the team began investigating other possible causes for the incident.

During the investigation the team concluded that a severe known issue with the Quota balance management service had occurred; the standard procedure to recover from this type of issue is a data center-wide restart of the entire Quota and SMS cluster. This action restored full SMPP functionality at 11:31 UTC.

After an extended period of monitoring, SMS HTTP traffic was routed back to the US data center on March 3rd 2021 at 09:15 UTC, at which point all systems were fully restored and operational.

Technical Analysis

In our WDC datacenter (IBM/SoftLayer) we operate a number of SMS servers to offer redundancy:

SMPP1/2 clustered pair of SMPP servers

Active-Active load-balanced HTTP SMS servers

There is also a standalone SMPP server, SMPP0; as a mitigation to this single point of failure, there is a cold standby that can be used.

All SMS servers, independent of the protocol, use the Quota service to check credit and deduct charges from customer account balances. The Quota service runs as a highly-available cluster on multiple servers in the same datacenter. The Quota cluster nodes use the JGroups toolkit to communicate and synchronise with each other. The toolkit manages the joining and leaving of clusters, node membership detection, notifications for joined/left/crashed cluster nodes, and the removal of crashed nodes, as well as the sending and receiving of node-to-cluster messages.
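To make the "check credit and deduct charges" step concrete, here is a minimal single-process sketch. It is a stand-in for illustration only: the class and method names are hypothetical, and it does not model the clustered Quota service or its JGroups-based synchronisation. The lock makes the check and the deduction one atomic step, so two concurrent messages cannot both spend the same credit.

```python
import threading
from decimal import Decimal
from typing import Dict


class InsufficientCredit(Exception):
    """Raised when an account cannot cover the requested charge."""


class QuotaLedger:
    """Hypothetical in-memory ledger: check the balance and deduct atomically."""

    def __init__(self, balances: Dict[str, Decimal]) -> None:
        self._balances = dict(balances)
        self._lock = threading.Lock()

    def check_and_deduct(self, account: str, charge: Decimal) -> Decimal:
        """Deduct `charge` from `account` and return the new balance."""
        with self._lock:
            balance = self._balances.get(account, Decimal("0"))
            if balance < charge:
                raise InsufficientCredit(account)
            self._balances[account] = balance - charge
            return self._balances[account]
```

Because every SMS server calls a step like this synchronously before sending, a stalled Quota cluster stalls message processing regardless of protocol.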

The team identified two likely causes for the Quota cluster failure, both related to a combination of a large traffic spike and an additional trigger:

Ungraceful removal of the JGroups node during the deployment process impacted the flow control mechanism of the cluster that aims to ensure no servers are overwhelmed during periods of high traffic. JGroups relies on timely inter-node communication in order to ensure target nodes are ready to handle new messages. If any node is unable to process messages, the cluster will apply back-pressure until impacted nodes can resume processing. Since the impacted node had not yet been evicted from the cluster, and requests sent via JGroups went unanswered, the overall throughput of the cluster rapidly decreased. This, in turn, resulted in a backlog of messages awaiting further processing and a significant increase in the memory footprint.
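The back-pressure behaviour described above can be illustrated with a simplified credit-based model. JGroups' flow control is credit based, but this sketch is not its implementation: the class, its methods, and the credit values are hypothetical. The sender holds credits per cluster member, spends them on every cluster-wide message, and can only continue once each member has replenished its credits. A member that has stopped responding but has not yet been evicted never replenishes, so the sender stalls.

```python
from typing import Dict, Iterable


class CreditSender:
    """Hypothetical sender that needs credits from every member to send to the cluster."""

    def __init__(self, members: Iterable[str], max_credits: int) -> None:
        self.max_credits = max_credits
        self.credits: Dict[str, int] = {m: max_credits for m in members}

    def can_send(self, size: int) -> bool:
        """A cluster-wide message needs enough credits for every current member."""
        return all(c >= size for c in self.credits.values())

    def send(self, size: int) -> bool:
        """Spend credits and return True; return False where a real sender would block."""
        if not self.can_send(size):
            return False
        for member in self.credits:
            self.credits[member] -= size
        return True

    def replenish(self, member: str) -> None:
        """Called when a healthy member reports it has processed its messages."""
        self.credits[member] = self.max_credits

    def evict(self, member: str) -> None:
        """Remove a crashed member so it no longer holds back the sender."""
        del self.credits[member]
```

With members `a`, `b` and `c`, where only `a` and `b` ever call `replenish`, the sender runs out of credits for `c` after a few messages and `send` starts returning `False`; throughput recovers only after `evict("c")`.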

The majority of the traffic spike was being routed to a single outbound gateway, causing an unusually long queue in an SMS cluster node; this exposed a concurrency bug in the way JGroups cluster state is accessed and shared under high load. During the deployment, one of the servers was removed from the cluster, which caused a further spike in traffic on the remaining nodes and again triggered the concurrency issue on those nodes.
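A rough sense of why removing a node during a spike raises the load on the others can be had from simple arithmetic. The sketch below assumes traffic is spread evenly across nodes, which was not the case in this incident (most of the spike went to a single gateway), and the node counts and rates are hypothetical since the report does not state them.

```python
def per_node_load(total_rate: float, nodes: int) -> float:
    """Messages per second each node handles if traffic is spread evenly."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return total_rate / nodes


def load_increase_after_removal(nodes: int) -> float:
    """Fractional increase in per-node load when one of `nodes` equal nodes leaves."""
    if nodes < 2:
        raise ValueError("need at least two nodes to remove one")
    return nodes / (nodes - 1) - 1
```

For example, under these assumptions a three-node cluster handling 900 messages per second moves from 300 to 450 per node when one node is removed, a 50% increase on top of the spike itself.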

Timeline

Time (UTC), 2021-03-02 | Event
10:02 | Unpredictable surge of SMS traffic in our US data center - no issues with the platform at this point
10:04 | Routine SMS software deployment started
10:07 | First occurrence of SMS failures
10:09 | First internal alert
10:24 | Manual failover of HTTP traffic to Asia-Pacific data center
10:27 | Engineering identifies issues processing traffic on all US SMPP endpoints
10:31 | HTTP SMS traffic failover to Asia-Pacific data center complete (continued customer impact halted)
11:00 | Problematic SMPP traffic cause is identified
11:15 | Decision to restart core services to fully restore Quota cluster and related SMPP functionality
11:31 | Services restarted successfully, all SMPP endpoints now fully operational
11:31 - 2021-03-03 09:00 | Team investigates root cause and monitors services before planning to roll back SMS traffic to the US data center
Time (UTC), 2021-03-03 | Event
09:00 | Start SMS traffic rollback to the US data center
09:15 | Rollback completed, marking the end of all customer impact
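The key intervals implied by the timeline can be worked out directly from the timestamps. The sketch below takes March 2nd 2021 as the incident date, as stated in the Summary; the dictionary keys are labels chosen for this illustration rather than terms from the report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two UTC timestamps written as YYYY-MM-DD HH:MM."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


FIRST_FAILURE = "2021-03-02 10:07"

DURATIONS = {
    # First SMS failure to first internal alert
    "time_to_first_alert": minutes_between(FIRST_FAILURE, "2021-03-02 10:09"),
    # First SMS failure to completed HTTP failover
    "http_impact": minutes_between(FIRST_FAILURE, "2021-03-02 10:31"),
    # First SMS failure to SMPP endpoints fully operational
    "smpp_impact": minutes_between(FIRST_FAILURE, "2021-03-02 11:31"),
    # HTTP traffic served from Asia-Pacific until the rollback completed
    "http_served_from_apac": minutes_between("2021-03-02 10:31", "2021-03-03 09:15"),
}
```

This gives 2 minutes to the first alert, 24 minutes of HTTP impact, and 84 minutes of SMPP impact, matching the impact timeframes listed per product.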

Product-specific impact

During this incident, customers may have experienced issues sending SMS through the following products:

Product | Impact | Impact timeframe (2021-03-02)
SMS HTTP API | SMS API requests timed out or received 499 or 503 HTTP responses | 10:07 - 10:31 UTC
SMS SMPP (smpp0.nexmo.com, smpp1.nexmo.com, smpp2.nexmo.com) | MT submissions may have been rejected or timed out | 10:07 - 11:31 UTC
Verify API | Verify API requests were accepted; however, associated SMS may have failed. Subsequently, the next applicable step in the Verify workflow would be triggered | 10:07 - 11:31 UTC
Messages API | Messages API requests were accepted; however, associated SMS traffic may have failed, resulting in an undeliverable callback | 10:07 - 11:31 UTC
Dispatch API | Dispatch API requests were accepted; however, associated SMS traffic may have failed, resulting in an undeliverable callback | 10:07 - 11:31 UTC

Preventive Actions

Preventive Action | Expected Outcome | Status
Added checks to deployment plans to ensure we have minimal queues before every step | Reduced likelihood of cluster overload during deployments | Completed
Implemented a fix to address nodes notifying the JGroups cluster on shutdown | More stable cluster during deployments (fixes the first identified root cause) | Completed
Improve how the JGroups cluster state is accessed and shared | More stable cluster under high load (fixes the second identified root cause) | In progress (production: April 2nd)
Remove the synchronous Quota queries from the SMS processing pipeline. Note: the Quota cluster architecture experienced its first instance of instability on November 3rd 2020, and since then we have been working on architectural changes to the platform to decouple SMS traffic from the Quota cluster, so that failures of Quota do not impact SMS traffic flow. This work had not been completed by the time of the incident. | SMS traffic will be unaffected in the event of a Quota cluster issue | In progress (production: April 2nd; migration: April/May)
Enhance runbook processes and define more granular investigative steps | Faster recovery from failures | Completed
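The first preventive action, checking for minimal queues before every deployment step, could be sketched along the following lines. This is an assumption-laden illustration rather than the actual deployment tooling: the function names, the shape of the queue-depth data, and the threshold are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def safe_to_proceed(queue_depths: Dict[str, int], max_depth: int) -> bool:
    """True only if every monitored queue is at or below the allowed depth."""
    return all(depth <= max_depth for depth in queue_depths.values())


def run_deployment(
    steps: Iterable[Tuple[str, Callable[[], None]]],
    read_queue_depths: Callable[[], Dict[str, int]],
    max_depth: int,
) -> List[str]:
    """Run steps in order, re-checking queues before each; return the completed step names."""
    completed: List[str] = []
    for name, action in steps:
        if not safe_to_proceed(read_queue_depths(), max_depth):
            break  # stop the deployment instead of adding load to a busy cluster
        action()
        completed.append(name)
    return completed
```

Reading the queue depths again before each step, rather than once at the start, means a traffic surge that arrives mid-deployment pauses the rollout before the next node is taken out of the cluster.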
Posted Mar 04, 2021 - 11:51 UTC

Resolved
This problem has been resolved.

Customers might have experienced timeouts sending messages via the SMS API and SMPP endpoints smpp0, smpp1 and smpp2.

The following services were impacted:

SMS API and SMPP endpoints smpp0, smpp1 and smpp2

Verify API:
Customers may have seen SMS-only workflows fail, and SMS/TTS workflows fail over to their TTS steps.

Messages and Dispatch API:
Customers may have experienced an increase in SMS undeliverable callbacks.

Services were impacted from 2021-03-02 10:07 UTC to 2021-03-02 11:31 UTC.


All services have been restored. 

Please let support@nexmo.com know if you continue to experience problems with this. 

A post-mortem will be published in the next few days.
Posted Mar 02, 2021 - 16:44 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 02, 2021 - 11:55 UTC
Identified
We have implemented a fix for this issue.

The following services were impacted:
SMS API and SMPP endpoints smpp0, smpp1 and smpp2

Customers may have experienced timeouts sending messages via the SMS API and SMPP endpoints smpp0, smpp1 and smpp2.

Verify API
Customers may have seen SMS-only workflows fail, and SMS/TTS workflows fail over to their TTS steps.

Messages and Dispatch API
Customers may have experienced an increase in SMS undeliverable callbacks.

We will update this status as soon as we have more detailed information to share.

Please contact support@nexmo.com if you have any questions on this.
Posted Mar 02, 2021 - 11:53 UTC
Investigating
Our monitoring has alerted us to a service issue within our SMS platform. The following services are impacted:

SMS API
SMPP
SMS sent via Verify API, Messages API and Dispatch API

We aim to provide an update on our investigation within 1 hour.

Please see this article - https://help.nexmo.com/hc/en-us/articles/360015693092-Nexmo-Incident-Handling - for an explanation of our approach to publishing incidents.
Posted Mar 02, 2021 - 10:41 UTC
This incident affected: SMS API (Outbound SMS, Inbound SMS, SMPP) and [Beta] Messages API, [Beta] Dispatch API.