From 11-10-2018 05:00 AM UTC until 11-10-2018 11:30 AM UTC, aggregate data was queued, so reporting data was not available in the dashboard as expected.
An exceptionally large query overloaded the reporting database. This had a cascading impact on our data pipeline systems and severely delayed writes of aggregated usage data. While the delay persisted, the pending data queue could not be cleared by adding write capacity alone. To resolve the queueing issue quickly and with minimal disruption, we restarted the aggregation service, which flushed the pending queue and allowed the service to resume normal processing from that point on. This immediately restored real-time aggregated reporting to the dashboard. We then rebuilt the affected aggregates offline over the following two weeks. No data was lost.
Redesigning the queries to avoid unnecessary locking on read-only requests.
Implementing the new real-time aggregates functionality to prevent this from happening in future.