10:53 UTC A deployment was made that included code changes to the SMS product.
13:32 Our US data center began returning intermittent 504 errors to SMS and Verify HTTP requests with increasing frequency. SMPP clients started to receive intermittent errors when attempting to bind to the smpp0/1/2 servers.
14:05 The volume of errors crossed a threshold that triggered our monitoring systems, and we started investigating.
14:23 We published the first announcement that we were experiencing an incident at nexmostatus.com.
14:31 We attempted to fail over to our Asian data center as a temporary fix, but the intermittent failures continued in the new location and were therefore global. SMPP clients in Asia then began receiving intermittent errors when attempting to bind to smpp3/4.
In the meantime, we identified and reverted a problematic code change (see below) and restarted services in the US data center.
14:57 We moved all HTTP traffic back to the US data center and all services began to recover quickly.
15:05 All services were back to normal, including geolocation fail-overs.
Root cause: the 10:53 UTC deployment to our customer-facing servers included code to aid traceability of requests within the SMS product. This change passed all existing tests in our QA environment, but under the high concurrency and complexity of production traffic it slowly exhausted server resources, producing the intermittent 504s and SMPP bind failures described above.
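The post-mortem does not include the change itself, but the failure class it describes is a common one: per-request state that is cleaned up only on the happy path. A minimal, hypothetical Python sketch of how such a leak stays invisible in QA yet exhausts resources in production (all names here are illustrative, not Nexmo's actual code):

```python
import threading

# Hypothetical sketch of the failure class only -- not Nexmo's code.
_trace_contexts = {}          # request_id -> metadata; shared, unbounded
_lock = threading.Lock()

def start_trace(request_id, metadata):
    """Record tracing metadata when a request starts."""
    with _lock:
        _trace_contexts[request_id] = metadata

def end_trace(request_id):
    """Release tracing metadata when a request finishes."""
    with _lock:
        _trace_contexts.pop(request_id, None)

def process(payload):
    # Stand-in for the real SMS pipeline; under heavy production load it
    # can raise (upstream timeouts, 5xx responses, dropped connections).
    return len(payload)

def handle_request(request_id, payload):
    start_trace(request_id, {"size": len(payload)})
    result = process(payload)
    # BUG: this line is skipped whenever process() raises, so every failed
    # request leaks an entry. Light, clean QA traffic never surfaces the
    # leak; sustained production concurrency exhausts memory over hours.
    end_trace(request_id)
    return result
```

Moving end_trace into a try/finally block (or a context manager) would close this particular leak; the broader lesson is that functional tests in QA rarely reproduce production concurrency, which is what the actions below address.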
To reduce the risk of recurrence, we will take the following actions:
Improve HTTP error monitoring and alerting.
Improve system monitoring and alerting.
Remove service dependencies from within SMS product code, to reduce fault impact and improve tolerance.
Review and improve SMS services deployment processes to minimise risk to our customers. This includes, but is not limited to, a plan to follow a more gradually phased rollout by region (sketched below).
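For illustration only, here is what a region-phased rollout with automated rollback can look like. The region names, threshold, soak time, and helper functions are assumptions for the sketch, not a description of Nexmo's actual tooling:

```python
import time

# Hypothetical sketch of a region-phased rollout with automated rollback.
REGIONS = ["us-east", "us-west", "apac"]   # smallest blast radius first
ERROR_RATE_THRESHOLD = 0.01                # abort above 1% HTTP 5xx
SOAK_SECONDS = 30 * 60                     # observe each region under real traffic

def deploy(region, version):
    print(f"deploying {version} to {region}")        # stand-in for the deploy step

def rollback(region, version):
    print(f"rolling back {region} to {version}")     # stand-in for the rollback step

def error_rate(region):
    return 0.0   # stand-in for a query against monitoring/alerting

def phased_rollout(new_version, previous_version):
    for region in REGIONS:
        deploy(region, new_version)
        time.sleep(SOAK_SECONDS)                     # soak before expanding
        if error_rate(region) > ERROR_RATE_THRESHOLD:
            rollback(region, previous_version)
            return False                             # stop before other regions
    return True
```

The design intent is that a regression like the one above would be caught while it affects a single region, instead of every customer at once.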
Details about the Nexmo Incident Management process, including post-mortems, can be found at https://help.nexmo.com/hc/en-us/articles/360015693092-Nexmo-Incident-Handling