Electric Utility Company Chooses Ubertal Ovela AIOps Platform
Introduction
Our client is one of the largest electric utility companies in the world. Its Big Data Analytics and AI Platform department is responsible for the infrastructure that runs all of the company's machine learning and AI algorithms. The department operates services including Hadoop, HBase, ZooKeeper, Hive, Spark, Storm, and Kafka.
Challenges
The platform's KPI is 99.99% uptime. However, every couple of weeks the HBase service would go down, usually in the middle of the night. Without real-time alerting, operators were not informed of the incidents. All the machine learning and AI jobs running on the platform were interrupted and had to be rerun, wasting days or even weeks of progress. Unable to trace the failures to a root cause, operators were left with only one option: restarting the service. The problem recurred for months, and the operations team could not pinpoint its exact cause.
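To put the 99.99% target in perspective, the arithmetic below converts an uptime percentage into an allowed downtime budget. It is a minimal sketch: the function name and the choice of a 365-day year and 30-day month are assumptions made for illustration, not part of the client's KPI definition.

```python
def downtime_budget_minutes(uptime_pct: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime allowed by an uptime target over a period.

    Assumption: the period is measured in whole calendar days of 24 hours.
    """
    allowed_fraction = 1.0 - uptime_pct / 100.0
    return allowed_fraction * period_days * 24 * 60


if __name__ == "__main__":
    yearly = downtime_budget_minutes(99.99)                   # 365-day year
    monthly = downtime_budget_minutes(99.99, period_days=30)  # 30-day month
    print(f"Yearly budget:  {yearly:.2f} minutes")   # 52.56 minutes
    print(f"Monthly budget: {monthly:.2f} minutes")  # 4.32 minutes
```

Under these assumptions the whole year's budget is under an hour, so an unnoticed overnight outage lasting one hour, for example, would by itself exceed it.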
Solution
With Ubertal’s Ovela AIOps platform, real-time alerting triggered notifications through several channels (emails, phone calls, short messages, or mobile app notifications) whenever the system went down. The Root Cause Analytics (RCA) module correlated different metrics and log entries, which led to the root cause: the time servers were drifting, and different key services were using different time servers. Because the heartbeat timestamps were out of sync, the ZooKeeper service falsely concluded that the HBase service was down.
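The failure mode described above can be illustrated with a toy model of a timestamp-based liveness check. This is a simplified sketch, not ZooKeeper's or HBase's actual implementation and not Ovela's RCA logic; the class, the function names, the 30-second timeout, and the 45-second clock offset are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Heartbeat:
    sender: str
    sent_at: float  # seconds, as read from the SENDER's clock


def local_time(true_time: float, clock_offset: float) -> float:
    """Time shown by a host whose clock is off by clock_offset seconds."""
    return true_time + clock_offset


def appears_down(hb: Heartbeat, receiver_now: float, timeout_s: float) -> bool:
    """Toy liveness check: the heartbeat is stale if it looks older than the timeout."""
    return (receiver_now - hb.sent_at) > timeout_s


def max_skew(offsets: Mapping[str, float]) -> float:
    """Largest pairwise clock difference among the hosts."""
    return max(offsets.values()) - min(offsets.values())


# Hypothetical clock offsets (seconds) caused by hosts following different time servers.
offsets = {"hbase-node": -45.0, "coordinator": 0.0}
TIMEOUT_S = 30.0

# The heartbeat is sent at true time 1000 s and checked one second later.
hb = Heartbeat("hbase-node", local_time(1000.0, offsets["hbase-node"]))
now = local_time(1001.0, offsets["coordinator"])

print(appears_down(hb, now, TIMEOUT_S))      # True: the heartbeat looks 46 s old
synced = Heartbeat("hbase-node", local_time(1000.0, 0.0))
print(appears_down(synced, now, TIMEOUT_S))  # False: with synced clocks it is 1 s old
print(max_skew(offsets))                     # 45.0
```

In this model the sender is healthy in both cases; only the disagreement between the two clocks makes a fresh heartbeat appear older than the timeout. A check like max_skew over per-host offsets is one simple way such drift could be surfaced before it exceeds the timeout.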