Fault
The Zisson Interact platform was unstable, with problems affecting call handling and logging on and off queues.
Description
The Zisson Interact platform consists of numerous microservices hosted in a modern private cloud datacenter environment with redundant networking equipment across multiple physical locations in Oslo, Norway.
When
This incident occurred on Tuesday, January 30, 2024, from 10:30 until 11:50.
What happened
On Tuesday, January 30 at 10:30, Zisson received alerts from our internal monitoring system, and soon after from customers, that agent counts and queue counts were missing in Zisson Interact. A war room was established with key people, and we started to investigate the problems reported by customers. The Zisson Interact platform consists of numerous microservices, and some of these were experiencing unusually high loads. Despite the efforts of multiple technicians, the root cause remained elusive: attempts were made to restart these components, but they returned to the same high load. Our technicians moved the most important microservices to a mirrored server, where they again worked normally. Our operations partner was contacted and initiated troubleshooting. They discovered a fault in the storage unit used by one of their ESXi hosts. Subsequently, they relocated the 7 servers affected by this host to another one, and the platform started recovering.
Root cause
This issue was caused by a certificate trust failure after a certificate change for the redundant storage infrastructure in Intility’s “In Cloud” platform. Please read Intility’s report below for more details.
What have we done:
The incident was resolved on Tuesday, January 30 at 11:50 by moving our servers to another ESXi host.
What are we doing:
Intility Report below:
Intility experienced an issue on parts of the central compute platform on Tuesday, January 30, 2024.
User impact and scope of impact
The incident resulted from significant overallocation of CPU and memory resources on the ESXi host, which led to performance degradation and subsequent service unavailability on the affected virtual machines. In total, 151 VMs were affected by the incident.
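For context, “overallocation” here means that the vCPUs and memory promised to the failed-over virtual machines exceeded what the receiving host could physically supply. The report does not give actual host or VM sizes, so the figures below are purely hypothetical, chosen only to illustrate how an unexpected wave of failed-over VMs pushes a host past capacity:

```python
def overcommit_ratio(allocated: float, physical: float) -> float:
    """Ratio of resources promised to VMs versus what the host physically has."""
    return allocated / physical

# Hypothetical receiving host: 64 physical cores, 512 GB RAM (not from the report).
host_cores, host_mem_gb = 64, 512

# Hypothetical steady state plus a wave of failed-over VMs (8 vCPU / 32 GB each).
resident_vms, failover_vms = 20, 30
vcpu_per_vm, mem_per_vm_gb = 8, 32

total_vcpus = (resident_vms + failover_vms) * vcpu_per_vm
total_mem_gb = (resident_vms + failover_vms) * mem_per_vm_gb

print(f"CPU overcommit: {overcommit_ratio(total_vcpus, host_cores):.1f}x")
print(f"RAM overcommit: {overcommit_ratio(total_mem_gb, host_mem_gb):.1f}x")
```

With these illustrative numbers the host would be asked to deliver several times its physical CPU and memory, which is consistent with the performance degradation the report describes.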
Incident start date and time
Tuesday, January 30, 2024, at 10:14.
Incident end date and time
Tuesday, January 30, 2024, at 11:45.
Total incident time
1 hour and 31 minutes.
Cause of events
The issue was caused by a certificate trust failure after a certificate change for the redundant storage infrastructure. The relevant part of the central compute infrastructure did not trust the new storage certificate, causing the affected hosts to initiate a High Availability (HA) migration of their virtual machines (VMs) to other parts of the compute platform. This unexpected HA failover of VMs resulted in overallocation of CPU and memory resources. The issue was resolved by performing a manual certificate refresh on all affected hosts, and the affected virtual machines functioned normally after live migration over to the now-healthy hosts.
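The failure mode described above, where compute hosts stop trusting a rotated storage certificate until their trust data is refreshed, can be sketched as a minimal fingerprint-based trust check. This is only a conceptual illustration: the certificate bytes, trust store, and helper names are hypothetical, and real ESXi hosts validate certificates through their platform trust chain, not a pinned fingerprint list.

```python
import hashlib

def sha256_fingerprint(cert_der: bytes) -> str:
    """Return the SHA-256 fingerprint of a (hypothetical) DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()

def host_trusts(cert_der: bytes, trusted_fingerprints: set[str]) -> bool:
    """A host 'trusts' a presented certificate only if its fingerprint
    is already present in the host's trust store."""
    return sha256_fingerprint(cert_der) in trusted_fingerprints

# Stand-ins for the old storage certificate and its rotated replacement.
old_cert = b"OLD-STORAGE-CERT"
new_cert = b"NEW-STORAGE-CERT"

# The hosts' trust store still holds only the old fingerprint...
trust_store = {sha256_fingerprint(old_cert)}

print(host_trusts(old_cert, trust_store))  # True: old certificate accepted
print(host_trusts(new_cert, trust_store))  # False: rotated certificate rejected

# ...so after the rotation the storage looks untrusted, which in the real
# incident triggered HA migration of the VMs. The fix maps to refreshing
# the trust store:
trust_store.add(sha256_fingerprint(new_cert))
print(host_trusts(new_cert, trust_store))  # True: trust restored after refresh
```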
Measures
To minimize the risk of similar incidents in the future, Intility will implement the following measures:
Sequence of events
Tuesday, January 30, 2024:
• 10:14: Intility performs a certificate change for redundant infrastructure on the central compute platform.
• 10:27: Virtual machines running on the relevant part of the infrastructure are automatically migrated to other parts of the compute platform.
• 11:15: Intility’s monitoring system alerts about loss of contact with several virtual machines. Intility receives an inquiry by chat from one customer regarding loss of access to virtual machines.
• 11:18: Incident responders escalate the alerts to a specialized department for troubleshooting.
• 11:27: The Intility response team is alerted. The issue is categorized with the highest criticality. Troubleshooting continues in an ongoing call.
• 11:41: Corporate IT contacts at affected customers are notified by SMS about the issue: Intility is experiencing an issue affecting access to servers and services running on the affected part of the platform. The problem is resolved, but it may take a few minutes until normal functionality is regained due to migration of servers to unaffected parts of the platform. We are working to resolve the issue. New status will follow. […]
• 11:45: The problem is resolved. Affected virtual machines are up and running on other parts of the compute platform after automatic HA failover.
• 11:51: Corporate IT Contact(s) are updated about status: The problem which affected access to certain servers and services is resolved. Affected services now function normally. Intility will follow up any remaining alerts. Incident report will follow. […]