On March 2nd 2021 at 10:09 UTC, our monitoring services alerted us to an issue impacting our SMS infrastructure in our US data center. To remediate the failure, the on-call team initiated a failover of HTTP SMS traffic to our Asia-Pacific data center at 10:24 UTC; the failover completed successfully at 10:31 UTC.
We do not currently support failing over SMPP traffic to other data centers; failing over between SMPP cluster instances within a data center is a responsibility delegated to our customers.
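Because SMPP failover between cluster instances is customer-driven, client applications should bind to an ordered list of endpoints and move to the next one when a bind fails. The sketch below is a hypothetical illustration only: `SmppSession` and `bind()` are placeholders for whatever SMPP client library you use, and the port shown is simply the conventional SMPP port, not necessarily your account's configuration.

```java
import java.util.List;

public class SmppFailover {

    /** Placeholder for a bound SMPP session from your client library of choice. */
    interface SmppSession extends AutoCloseable {}

    /** Hypothetical bind step: stands in for connect + bind_transceiver in a real library. */
    static SmppSession bind(String hostPort) throws Exception {
        throw new Exception("stub: replace with a real SMPP bind to " + hostPort);
    }

    // Ordered preference list of SMPP endpoints (hostnames from the impact table below).
    static final List<String> ENDPOINTS =
            List.of("smpp1.nexmo.com:2775", "smpp2.nexmo.com:2775");

    static SmppSession connectWithFailover() throws Exception {
        Exception last = null;
        for (String endpoint : ENDPOINTS) {
            try {
                return bind(endpoint);   // success: stay on this endpoint
            } catch (Exception e) {
                last = e;                // failure: fall through to the next endpoint
            }
        }
        throw new Exception("all SMPP endpoints unavailable", last);
    }
}
```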
This incident began during a business-as-usual (BAU) deployment, leading our engineering team to incorrectly assess that the issue was deployment-related and to initiate the rollback process, the standard remedial action in such situations. The rollback did not recover the service, and the team began investigating other possible causes.
During the investigation the team concluded that a known severe issue with the Quota balance management service had occurred; the standard recovery procedure for this type of issue is a data center-wide restart of the entire Quota and SMS cluster. This action restored full SMPP functionality at 11:31 UTC.
After an extended period of monitoring, HTTP SMS traffic was returned to the US data center on March 3rd 2021 at 09:15 UTC; all systems were then fully restored and operational.
In our WDC data center (IBM/SoftLayer) we operate a number of SMS servers to provide redundancy:

- SMPP1/SMPP2: a clustered pair of SMPP servers
- Active-active load-balanced HTTP SMS servers

There is also a standalone SMPP server, SMPP0; as mitigation for this single point of failure, a cold standby is available.
All SMS servers, independent of the protocol, use the Quota service to check credit and deduct charges from customer account balances. The Quota service runs as a highly available cluster on multiple servers in the same data center. The Quota cluster nodes use the JGroups toolkit to facilitate communication and synchronisation between themselves. The toolkit manages nodes joining and leaving the cluster, node membership detection, notifications for joined/left/crashed cluster nodes, and the removal of crashed nodes, as well as the sending and receiving of node-to-cluster messages.
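As a rough sketch of how a cluster node participates, the JGroups 4.x `JChannel` API surfaces membership changes through view callbacks. The cluster name and logging below are illustrative, not our production configuration:

```java
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.Receiver;
import org.jgroups.View;

public class QuotaClusterNode {
    public static void main(String[] args) throws Exception {
        // Create a channel with the default protocol stack and join the cluster.
        // "quota-cluster" is an illustrative name, not our actual configuration.
        JChannel channel = new JChannel();
        channel.setReceiver(new Receiver() {
            @Override
            public void viewAccepted(View view) {
                // Invoked whenever membership changes: nodes joining, leaving,
                // or being evicted after a crash is detected.
                System.out.println("New membership view: " + view.getMembers());
            }

            @Override
            public void receive(Message msg) {
                // Node-to-cluster messages (e.g., balance updates) arrive here.
                System.out.println("Received: " + msg.getObject());
            }
        });
        channel.connect("quota-cluster");
    }
}
```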
The team identified two likely causes for the Quota cluster failure, both involving a combination of a large traffic spike and an additional trigger; simplified sketches of both failure modes follow this list.

1. Ungraceful removal of a JGroups node during the deployment process impacted the cluster's flow control mechanism, which aims to ensure that no server is overwhelmed during periods of high traffic. JGroups relies on timely inter-node communication to confirm that target nodes are ready to handle new messages; if any node is unable to process messages, the cluster applies back-pressure until the impacted nodes can resume processing. Since the impacted node had not yet been evicted from the cluster, and requests sent to it via JGroups went unanswered, the overall throughput of the cluster rapidly decreased. This, in turn, produced a backlog of messages awaiting processing and a significant increase in memory footprint.
2. The majority of the traffic spike was routed to a single outbound gateway, causing an unusually deep queue on one SMS cluster node; this exposed a concurrency bug in the way JGroups cluster state is accessed and shared under high load. During the deployment, one of the servers was removed from the cluster, further increasing traffic on the remaining nodes and again triggering the concurrency issue on those nodes.
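To picture the first failure mode, consider a simplified credit-based flow control scheme, similar in spirit to (but far simpler than) the UFC/MFC flow control protocols JGroups ships with: senders spend a credit per message and block once credits run out, and only acknowledgements from receivers replenish them. A node that stops acknowledging but is never evicted therefore stalls every sender. The class and numbers below are illustrative, not our production code.

```java
import java.util.concurrent.Semaphore;

// Simplified credit-based flow control: a sender may have at most
// MAX_CREDITS unacknowledged messages in flight. Illustrative only;
// JGroups' real UFC/MFC protocols are considerably more involved.
public class CreditFlowControl {
    private static final int MAX_CREDITS = 1000;
    private final Semaphore credits = new Semaphore(MAX_CREDITS);

    // Sender side: blocks once the receiver owes MAX_CREDITS acknowledgements.
    // If the receiver is wedged but still a cluster member, every sender
    // eventually parks here and cluster throughput collapses.
    public void send(byte[] message) throws InterruptedException {
        credits.acquire();
        transmit(message);
    }

    // Receiver side: each processed message replenishes one credit.
    // A node that never acknowledges never replenishes credits.
    public void onAck() {
        credits.release();
    }

    private void transmit(byte[] message) { /* hand off to the transport */ }
}
```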
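The second failure mode is a classic shared-state race. As a hypothetical example of the bug class (not the actual JGroups code), a plain `HashMap` mutated by many threads under load can lose updates or corrupt its internal structure; performing the read-modify-write atomically on a concurrent structure avoids this:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClusterStateRace {
    // Buggy: HashMap is not thread-safe, and the check-then-act below is not
    // atomic, so concurrent writers can lose increments under high load.
    private final Map<String, Long> unsafeQueueDepths = new HashMap<>();

    // Fixed: ConcurrentHashMap plus an atomic per-key read-modify-write.
    private final Map<String, Long> safeQueueDepths = new ConcurrentHashMap<>();

    public void recordUnsafe(String gateway) {
        Long depth = unsafeQueueDepths.getOrDefault(gateway, 0L); // read
        unsafeQueueDepths.put(gateway, depth + 1);                // write: races here
    }

    public void recordSafe(String gateway) {
        safeQueueDepths.merge(gateway, 1L, Long::sum);            // atomic per key
    }
}
```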
| Time (UTC) 2021-03-02 | Event |
| --- | --- |
| 10:02 | Unexpected surge of SMS traffic in our US data center; no issues with the platform at this point |
| 10:04 | Routine SMS software deployment started |
| 10:07 | First occurrence of SMS failures |
| 10:09 | First internal alert |
| 10:24 | Manual failover of HTTP traffic to the Asia-Pacific data center initiated |
| 10:27 | Engineering identifies issues processing traffic on all US SMPP endpoints |
| 10:31 | HTTP SMS traffic failover to the Asia-Pacific data center completed, halting further customer impact on HTTP traffic |
| 11:00 | Cause of the problematic SMPP traffic identified |
| 11:15 | Decision made to restart core services to fully restore the Quota cluster and related SMPP functionality |
| 11:31 | Services restarted successfully; all SMPP endpoints fully operational |
| 11:31 - 2021-03-03 09:00 | Team investigates root cause and monitors services before planning the return of SMS traffic to the US data center |

| Time (UTC) 2021-03-03 | Event |
| --- | --- |
| 09:00 | Return of SMS traffic to the US data center started |
| 09:15 | Return of traffic completed, marking the end of all customer impact |
During this incident, customers may have experienced issues sending SMS through the following products:
| Product | Impact | Impact timeframe (2021-03-02) |
| --- | --- | --- |
| SMS HTTP API | SMS API requests timed out or received 499 or 503 HTTP responses | 10:07 - 10:31 UTC |
| SMS SMPP (smpp0.nexmo.com, smpp1.nexmo.com, smpp2.nexmo.com) | MT submissions may have been rejected or timed out | 10:07 - 11:31 UTC |
| Verify API | Verify API requests were accepted; however, the associated SMS may have failed, in which case the next applicable step in the Verify workflow was triggered | 10:07 - 11:31 UTC |
| Messages API | Messages API requests were accepted; however, the associated SMS traffic may have failed, resulting in an undeliverable callback | 10:07 - 11:31 UTC |
| Dispatch API | Dispatch API requests were accepted; however, the associated SMS traffic may have failed, resulting in an undeliverable callback | 10:07 - 11:31 UTC |
| Preventive Action | Expected Outcome | Status |
| --- | --- | --- |
| Added checks to deployment plans to ensure queues are minimal before every step | Reduced likelihood of cluster overload during deployments | Completed |
| Implemented a fix to ensure nodes notify the JGroups cluster on shutdown | More stable cluster during deployments (addresses the first identified root cause) | Completed |
| Improve how the JGroups cluster state is accessed and shared | More stable cluster under high load (addresses the second identified root cause) | In progress. Production: April 2nd |
| Remove the synchronous Quota queries from the SMS processing pipeline (a hypothetical sketch of this decoupling follows the table). Note: the Quota cluster architecture first showed instability on November 3rd 2020, and since then we have been working on architectural changes to decouple SMS traffic from the Quota cluster so that Quota failures do not impact SMS traffic flow; this work had not been completed by the time of the incident | SMS traffic will be unaffected in the event of a Quota cluster issue | In progress. Production: April 2nd. Migration: April/May |
| Enhance runbook processes and define more granular investigative steps | Faster recovery from failures | Completed |
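The Quota decoupling work in the table above amounts to taking the balance check off the send path. As a hypothetical sketch only (the table does not describe our actual design), the SMS pipeline could deduct from a locally pre-fetched allowance synchronously and reconcile with the Quota cluster in the background, so that a Quota outage delays billing reconciliation instead of blocking traffic:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of decoupling quota from the send path:
// deduct from a locally cached allowance synchronously, and reconcile
// with the Quota cluster from a background thread.
public class AsyncQuotaGate {
    private final AtomicLong localAllowance = new AtomicLong(10_000); // pre-fetched credit
    private final BlockingQueue<Long> pendingCharges = new LinkedBlockingQueue<>();

    // Hot path: no synchronous call to the Quota cluster.
    public boolean tryCharge(long amount) {
        if (localAllowance.addAndGet(-amount) < 0) {
            localAllowance.addAndGet(amount);   // roll back and reject
            return false;
        }
        pendingCharges.offer(amount);           // reconcile later, off the hot path
        return true;
    }

    // Background reconciliation loop; a Quota outage only delays this.
    public void reconcileLoop() throws InterruptedException {
        while (true) {
            long amount = pendingCharges.take();
            // chargeQuotaCluster(amount) would be the real remote call here.
        }
    }
}
```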