HTTP Service Degradation
Incident Report for Nexmo
Postmortem

What happened

Between 14-Jun-2019 and 17-Jun-2019, customers encountered timeouts when making API requests or receiving callbacks from Nexmo. The incident also caused latency in the processing of CDRs. This affected all Nexmo products and all regions. Customers may have experienced timeouts during the following periods:

14-Jun-2019 00:00 - 00:40 UTC - 4.011%
14-Jun-2019 04:00 - 04:15 UTC - 0.015%
14-Jun-2019 07:00 - 07:20 UTC - 0.028%
14-Jun-2019 13:32 - 13:35 UTC - 12%
15-Jun-2019 05:00 - 05:20 UTC - 41%
15-Jun-2019 13:17 - 13:23 UTC - 8%
15-Jun-2019 21:42 - 21:46 UTC - 14%
15-Jun-2019 22:52 - 23:03 UTC - 7%
17-Jun-2019 Brief periods between 02:00 - 09:30 UTC - 0.2%

Causes

Anomalous traffic and routing protocol instability within the infrastructure of Nexmo's US Data Center caused the underlying issue. Nexmo mitigated the problem where possible to reduce customer impact, which included failing services over to our Singapore Data Center. Once traffic had been failed over, the Singapore Data Center experienced capacity issues, and customers in APAC were impacted by timeouts until the capacity issue was resolved.

Preventive Actions

A program of preventive work has been identified, including:

  • Improved failover actions and policies
  • Improved monitoring and alerting
Posted 2 months ago. Jun 21, 2019 - 09:15 UTC

Resolved
The HTTP service degradation problem has now been fully resolved following a 24 hour monitoring period with no further issues.

All services were impacted for short periods between 14-Jun-2019 00:00 UTC and 17-Jun-2019 09:30 UTC.

All services have been restored. SMS and Voice call transaction record searches for our APIs are still delayed in our dashboard, and the delay continues to decrease.

Please let support@nexmo.com know if you continue to experience problems with this.
A post-mortem will be published in the next few days.
Posted 2 months ago. Jun 18, 2019 - 09:53 UTC
Update
We have seen no new issues since 09:30 UTC and services are currently stable. SMS and Voice call transaction record searches are still delayed in our dashboard, and the delay is slowly decreasing. Our data center provider has now classified their incident as "Resolved".

Due to the complexity of the initial issue and all the mitigating actions that were implemented in response, we will continue to monitor services closely to ensure no further disruptions before closing out this incident from our side.
Posted 2 months ago. Jun 17, 2019 - 15:21 UTC
Update
We have experienced brief periods of timeouts from 02:00 - 09:30 UTC for Voice API requests ("/v1/calls") in the Asia-Pacific region (0.2% of total requests).
We are also currently experiencing latency in CDR availability in our Dashboard.
No other product services are currently impacted.
We and our data center provider both continue to monitor the situation closely.
Posted 2 months ago. Jun 17, 2019 - 11:08 UTC
Update
Our data center provider has now communicated that they have addressed the underlying cause of this incident, and we have therefore fully restored services in all data centers. We and our data center provider both continue to monitor the situation closely.

As always, please contact us at support@nexmo.com if you are experiencing any service degradations.
Posted 2 months ago. Jun 16, 2019 - 21:55 UTC
Update
Our data center provider has informed us that they have found the cause of the issue and are working on a fix with the hardware vendor. The problematic hardware is no longer being used and they therefore do not predict there will be further instability.

At the same time, we are aware that they are operating with reduced redundancy, so all teams will be monitoring systems very closely.
Posted 2 months ago. Jun 16, 2019 - 10:56 UTC
Update
We have again experienced periods of higher latency and timeouts from 22:52 - 23:03 UTC and therefore will continue to closely monitor this issue. Mitigating actions are being taken to lessen any customer impact as we work with our data center provider.
Posted 2 months ago. Jun 16, 2019 - 00:15 UTC
Update
We have again experienced periods of higher latency and timeouts from 21:42 - 21:46 UTC and therefore will continue to closely monitor this issue. Mitigating actions are being taken to lessen any customer impact as we work with our data center provider.
Posted 2 months ago. Jun 15, 2019 - 22:32 UTC
Update
Our data center provider has confirmed that they have seen no further issues on their side since 13:23 UTC. They have further narrowed down the source of the underlying issue, but are continuing their investigations. We are currently working with them to implement mitigating measures in order to ensure minimal customer impact.

Please contact us at support@nexmo.com if you are experiencing any service degradations.
Posted 2 months ago. Jun 15, 2019 - 19:05 UTC
Update
We have again experienced increased latency and timeouts from 13:17 - 13:23 UTC and therefore will continue to closely monitor this issue. Mitigating actions are being taken to lessen any customer impact as we work with our data center provider.
Posted 2 months ago. Jun 15, 2019 - 14:10 UTC
Update
We have again experienced increased latency and timeouts from 05:00 - 05:20 UTC and therefore will continue to closely monitor this issue. Mitigating actions are being taken to lessen any customer impact as we work with our data center provider.
Posted 2 months ago. Jun 15, 2019 - 07:37 UTC
Monitoring
Our data center provider has found the root cause and our services are now fully restored.

We will continue to monitor the service during the next few hours. If you experience any issues, please contact us at support@nexmo.com.
Posted 2 months ago. Jun 14, 2019 - 21:15 UTC
Update
We are working with our data center provider to understand the root cause and are awaiting further updates.

We still see signs of intermittent network issues and therefore will continue to monitor the service during the next few hours.

Please contact us at support@nexmo.com if you experience any issues.
Posted 2 months ago. Jun 14, 2019 - 17:20 UTC
Identified
We've seen a further spike of failed requests from 13:32 - 13:35 UTC. These have been mostly 499s and 504s returned from rest.nexmo.com.

We have now re-routed traffic geographically and are no longer seeing these errors. We are working with our hosting provider to further investigate.
Posted 2 months ago. Jun 14, 2019 - 13:45 UTC
Monitoring
The issue has been resolved and we are following up to fully identify the root cause of the issue.

We will continue to monitor the service during the next few hours and post updates should anything change.
Posted 2 months ago. Jun 14, 2019 - 13:15 UTC
Identified
We are currently investigating an issue that affected traffic in our US Data Center.

You may have experienced errors when making API requests and receiving callbacks.

The following services were impacted 14-Jun-2019:
00:00 - 00:40 UTC - 4.011% of all traffic impacted
04:00 - 04:15 UTC - 0.015% of all traffic impacted
07:00 - 07:20 UTC - 0.028% of all traffic impacted

We will update this status as soon as we have more information on this issue.
Posted 2 months ago. Jun 14, 2019 - 12:14 UTC
This incident affected: SMS API (Outbound SMS, Inbound SMS, SMPP), Voice APIs (See https://goo.gl/2WoAtx for Current known problems) (TTS API, SIP, Voice API), and Number Insight API, Verify API and Verify SDK, Developer API, Reports API, [Beta] Messages API, [Beta] Dispatch API.