On February 25th at 12:58 UTC, our monitoring services alerted us to an issue impacting one of our database servers in Europe. The issue prevented customers from logging into the Customer Dashboard and caused API calls to fail with internal server errors for the following services: MMS sent via the Messages API, the Numbers and Account API, and the Reports API. Impact on the Customer Dashboard began at 12:00 UTC, and the remaining services were affected from 13:00 UTC. Our engineering team immediately investigated the alerts and re-routed traffic to a healthy server at 13:47 UTC, at which point services began to recover and customer impact ended. The problematic database server was repaired and put back into rotation at 14:29 UTC.
One of our database machines became unresponsive, which caused a slowdown in our cluster and eventually caused the entire cluster to become unresponsive.
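As an illustration of the kind of check that could catch this failure mode earlier, below is a minimal liveness-probe sketch. The database engine, the psycopg2 driver, and every threshold are assumptions made purely for illustration; the report does not describe our actual tooling.

```python
# Hypothetical liveness probe. PostgreSQL, psycopg2, and the numbers below
# are assumptions for illustration only; the report does not name the stack.
import time

import psycopg2

PROBE_TIMEOUT_S = 5      # assumed: fail fast rather than hang on a stuck node
FAILURE_THRESHOLD = 3    # assumed: consecutive misses before pulling the node
PROBE_INTERVAL_S = 30    # assumed: how often to probe

def node_is_responsive(dsn: str) -> bool:
    """Run a trivial query with connection and statement timeouts, so a
    hung server fails the probe instead of blocking it indefinitely."""
    try:
        conn = psycopg2.connect(
            dsn,
            connect_timeout=PROBE_TIMEOUT_S,
            options=f"-c statement_timeout={PROBE_TIMEOUT_S * 1000}",
        )
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1;")
            return True
        finally:
            conn.close()
    except psycopg2.Error:
        return False

def monitor(dsn: str, remove_from_rotation) -> None:
    """Pull the node out of rotation after FAILURE_THRESHOLD consecutive
    failed probes; `remove_from_rotation` is a hypothetical callback into
    whatever routes traffic (load balancer, proxy, service discovery)."""
    failures = 0
    while True:
        failures = 0 if node_is_responsive(dsn) else failures + 1
        if failures >= FAILURE_THRESHOLD:
            remove_from_rotation(dsn)
            return
        time.sleep(PROBE_INTERVAL_S)
```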
Upgrade the affected machine to a newer software version.
Implement a new database performance monitoring tool.
Improve periodic testing of alerts and monitoring processes.
Enhance alert escalation processes to assign engineering resources more efficiently (one possible escalation loop is sketched after this list).
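As a sketch of what the last item could look like in practice, here is one possible escalation loop. The escalation chain, the acknowledgement window, and the `page` stub are illustrative assumptions, not our actual on-call policy.

```python
# Hypothetical escalation loop. The chain, timings, and paging stub are
# assumptions for illustration; they do not describe our actual on-call setup.
import time
from typing import Callable

ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "engineering-lead"]
ACK_WINDOW_S = 10 * 60   # assumed: escalate after 10 unacknowledged minutes
POLL_INTERVAL_S = 15     # assumed: how often to check for an acknowledgement

def page(target: str, incident: str) -> None:
    """Stand-in for a real paging integration (e.g. a webhook to a pager)."""
    print(f"paging {target}: {incident}")

def escalate(incident: str, acknowledged: Callable[[], bool]) -> None:
    """Page each level in turn, moving to the next whenever the current
    page goes unacknowledged for the full window."""
    for target in ESCALATION_CHAIN:
        page(target, incident)
        deadline = time.monotonic() + ACK_WINDOW_S
        while time.monotonic() < deadline:
            if acknowledged():
                return            # someone took ownership; stop escalating
            time.sleep(POLL_INTERVAL_S)
```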