504 errors returned intermittently from rest.nexmo.com
Incident Report for Nexmo
Postmortem

What happened

10:53 UTC A deployment was made including code changes to the SMS product.

13:32 Our US data center returned intermittent 504 errors to SMS and Verify HTTP requests with increasing frequency. SMPP clients started to receive intermittent errors when attempting to bind to smpp0/1/2 servers.

14:05 The volume of errors reached a threshold that triggered our monitoring systems, and we started investigating.

14:23 We published the first announcement that we were experiencing an incident at nexmostatus.com

14:31 We attempted to fail over to our Asian data center as a temporary fix, but the intermittent failures continued in the new location, making the incident global. SMPP clients in Asia then received intermittent errors when attempting to bind to smpp3/4.

In the meantime, we identified and reverted a problematic code change (see below) and restarted services in the US data center.

14:57 We moved all HTTP traffic to the US data center and all services started to quickly recover.

15:05 All services back to normal, including geolocation fail-overs.

[Figure: Summary of failed HTTP requests]

Cause

10:53 UTC A deployment was made to our customer-facing servers, including some code to aid traceability of requests within the SMS product. This improvement passed all existing tests in our QA environment, but caused resources to be slowly exhausted under production conditions of high complexity and concurrency.

Preventive Actions

Improve HTTP error monitoring and alerting.

Improve system monitoring and alerting.

Remove service dependencies from within SMS product code, to reduce fault impact and improve tolerance.

Review and improve SMS services deployment processes to minimise risk to our customers. This includes, but is not limited to, a plan to follow a more gradually phased rollout by region.
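
To illustrate the kind of HTTP error alerting named in the first action, here is a minimal sketch of a sliding-window error-rate check. It is not Nexmo's monitoring system: the class name `ErrorRateMonitor`, the 60-second window, the 5% threshold and the minimum request count are all hypothetical values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of 5xx responses over a sliding time window.

    Hypothetical helper for illustration only; all defaults are assumptions.
    """

    def __init__(self, window_seconds=60.0, threshold=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold        # alert when the error share reaches this
        self.min_requests = min_requests  # ignore windows with too little traffic
        self._events = deque()            # (timestamp, is_error) pairs, oldest first
        self._errors = 0                  # number of errors currently in the window

    def record(self, timestamp, status_code):
        """Record one response and drop events that fell out of the window."""
        is_error = 500 <= status_code <= 599
        self._events.append((timestamp, is_error))
        self._errors += int(is_error)
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            self._errors -= int(was_error)

    def error_rate(self):
        total = len(self._events)
        return self._errors / total if total else 0.0

    def should_alert(self):
        enough_traffic = len(self._events) >= self.min_requests
        return enough_traffic and self.error_rate() >= self.threshold
```

A caller would feed `record()` the timestamp and status code of each response and page an operator when `should_alert()` returns true. A rate-based check like this can fire while intermittent errors are still a small share of traffic, rather than waiting for an absolute error count to build up.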

Details about the Nexmo Incident Management process, including post-mortems, can be found at https://help.nexmo.com/hc/en-us/articles/360015693092-Nexmo-Incident-Handling

Posted Sep 13, 2018 - 09:03 UTC

Resolved
All queued callbacks (inbound SMS and delivery receipts) have now been cleared, and we have seen no recurrence of the 504 errors in our monitoring since the announcement at 15:14 UTC.

A thorough investigation will be completed and a post-mortem will be shared here.
Posted Sep 11, 2018 - 17:35 UTC
Monitoring
We have resolved the immediate problem, and have confirmed that customers using either https://rest.nexmo.com or SMPP can now connect and use the API. We are continuing to monitor the situation closely. Customers may experience delays in callbacks for delivery receipts and inbound messages until some queues are cleared.
Posted Sep 11, 2018 - 15:14 UTC
Identified
We have identified the underlying cause of the issue and are working to resolve it.

The incident has widened to all data centers globally.

Note that SMPP customers will also experience errors when attempting to bind to our SMPP servers.
Posted Sep 11, 2018 - 14:59 UTC
Investigating
We have observed that the endpoint https://rest.nexmo.com is intermittently returning 504 errors in our US data center. Our Operations team is investigating, and we will update this announcement as more information becomes available.
Posted Sep 11, 2018 - 14:23 UTC
This incident affected: SMS API (Outbound SMS, SMPP), Voice APIs (See https://goo.gl/2WoAtx for Current known problems) (TTS API, Voice API), and Number Insight API, Verify API and Verify SDK, Chat App API.