The incident was primarily caused by an unhandled exception in a critical microservice responsible for data processing, which led to a crash. Unfortunately, this crash went undetected, causing a data processing stall. As a result, data processing tasks were not completed within the expected timeframe.
Upon discovering the data processing stall, our operations team promptly initiated an investigation to identify the cause. To resolve the issue, we restarted the crashed microservice, which allowed data processing to resume. The investigation did not detect any data loss.
We are continuously working on improving our monitoring infrastructure to include automated checks for microservice health and performance. This includes proactive monitoring of critical metrics, such as resource usage, response times, and error rates, to detect anomalies or crashes and alert us to them.
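As an illustration only, an automated health check of this kind could be sketched as follows. This is not our production monitoring code: the `HealthSample` type, the `evaluate` function, the `seconds_since_last_task` field, and all threshold values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HealthSample:
    """One observation of a service (hypothetical structure)."""
    cpu_percent: float              # resource usage
    response_ms: float              # response time of the health endpoint
    error_rate: float               # fraction of failed requests, 0.0 to 1.0
    seconds_since_last_task: float  # time since the last completed task


# Assumed thresholds; real values would depend on the service.
MAX_CPU_PERCENT = 90.0
MAX_RESPONSE_MS = 2000.0
MAX_ERROR_RATE = 0.05
MAX_TASK_AGE_SECONDS = 300.0


def evaluate(sample: Optional[HealthSample]) -> List[str]:
    """Return alert messages; an empty list means the service looks healthy.

    A sample of None stands for a service that did not respond at all,
    which is how a crash would show up to the checker.
    """
    if sample is None:
        return ["service unreachable (possible crash)"]

    alerts = []
    if sample.cpu_percent > MAX_CPU_PERCENT:
        alerts.append("high resource usage")
    if sample.response_ms > MAX_RESPONSE_MS:
        alerts.append("slow response time")
    if sample.error_rate > MAX_ERROR_RATE:
        alerts.append("elevated error rate")
    if sample.seconds_since_last_task > MAX_TASK_AGE_SECONDS:
        alerts.append("data processing stalled")
    return alerts


if __name__ == "__main__":
    # A crashed service produces no sample, so the checker raises an alert.
    print(evaluate(None))
    # A healthy sample produces no alerts.
    print(evaluate(HealthSample(35.0, 120.0, 0.01, 12.0)))
```

In this sketch, a missing sample is treated as an alert in its own right, so a crashed service cannot pass silently, and the task-age check would flag a stall even if the process were still running.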
We regret the inconvenience caused by the data processing stall resulting from the undetected microservice crash, and we are committed to implementing the necessary measures to prevent such incidents in the future.