Fault
The Zisson Interact platform was unstable, with problems affecting call handling and logging on and off queues.
Description
The Zisson Interact platform consists of numerous microservices hosted in a modern private cloud datacenter environment with redundant networking equipment across multiple physical locations in Oslo, Norway.
When
This incident occurred on Tuesday, January 30, 2024, from 10:30 until 11:50.
What happened
On Tuesday, January 30 at 10:30, Zisson received alerts from our internal monitoring system, and soon after from customers, that agent counts and queue counts were missing in Zisson Interact. A war room was established with key people, and we started to investigate the problems reported by customers. The Zisson Interact platform consists of numerous microservices, and some of these were experiencing unusually high loads. Despite the efforts of multiple technicians, the root cause remained elusive: attempts were made to restart these components, but they returned to the same high load. Our technicians moved the most important microservices to a mirrored server, where they again worked normally. Our operations partner was contacted and initiated troubleshooting. They discovered a fault in the storage unit used by one of their ESXi hosts. Subsequently, they relocated the 7 servers affected by this host to another one, and the platform started recovering.
Root cause
This issue was caused by a certificate trust failure after a certificate change for the redundant storage infrastructure in Intility’s “In Cloud” platform. Please read Intility’s report below for more details.
What have we done:
The incident was resolved on Tuesday, January 30 at 11:50 by moving our servers to another ESXi host.
What are we doing:
Intility Report below:
Intility experienced an issue on parts of the central compute platform on Tuesday, January 30, 2024.
User impact and scope of impact
The incident resulted from significant overallocation of CPU and memory resources on the ESXi host, which led to performance degradation and subsequent service unavailability on the affected virtual machines. In total, 151 VMs were affected by the incident.
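For context, “overallocation” here means that the vCPUs and memory promised to the failed-over virtual machines exceeded what the receiving host could physically supply. The report does not give actual host or VM sizes, so the figures below are purely hypothetical, chosen only to illustrate how an unexpected wave of failed-over VMs pushes a host past capacity:

```python
def overcommit_ratio(allocated: float, physical: float) -> float:
    """Ratio of resources promised to VMs versus what the host physically has."""
    return allocated / physical

# Hypothetical receiving host: 64 physical cores, 512 GB RAM (not from the report).
host_cores, host_mem_gb = 64, 512

# Hypothetical steady state plus a wave of failed-over VMs (8 vCPU / 32 GB each).
resident_vms, failover_vms = 20, 30
vcpu_per_vm, mem_per_vm_gb = 8, 32

total_vcpus = (resident_vms + failover_vms) * vcpu_per_vm
total_mem_gb = (resident_vms + failover_vms) * mem_per_vm_gb

print(f"CPU overcommit: {overcommit_ratio(total_vcpus, host_cores):.1f}x")
print(f"RAM overcommit: {overcommit_ratio(total_mem_gb, host_mem_gb):.1f}x")
```

With these illustrative numbers the host would be asked to deliver several times its physical CPU and memory, which is consistent with the performance degradation the report describes.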
Incident start date and time
Tuesday, January 30, 2024, at 10:14.
Incident end date and time
Tuesday, January 30, 2024, at 11:45.
Total incident time
1 hour and 31 minutes.
Cause of events
The issue was caused by a certificate trust failure after a certificate change for the redundant storage infrastructure. The relevant part of the central compute infrastructure did not trust the new storage certificate, causing the affected hosts to initiate a High Availability (HA) migration of their virtual machines (VMs) to other parts of the compute platform. This unexpected HA failover of VMs resulted in overallocation of CPU and memory resources. The issue was resolved by performing a manual certificate refresh on all affected hosts, and the affected virtual machines functioned normally after live migration over to the now-healthy hosts.
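The failure mode described above, where compute hosts stop trusting a rotated storage certificate until their trust data is refreshed, can be sketched as a minimal fingerprint-based trust check. This is only a conceptual illustration: the certificate bytes, trust store, and helper names are hypothetical, and real ESXi hosts validate certificates through their platform trust chain, not a pinned fingerprint list.

```python
import hashlib

def sha256_fingerprint(cert_der: bytes) -> str:
    """Return the SHA-256 fingerprint of a (hypothetical) DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()

def host_trusts(cert_der: bytes, trusted_fingerprints: set[str]) -> bool:
    """A host 'trusts' a presented certificate only if its fingerprint
    is already present in the host's trust store."""
    return sha256_fingerprint(cert_der) in trusted_fingerprints

# Stand-ins for the old storage certificate and its rotated replacement.
old_cert = b"OLD-STORAGE-CERT"
new_cert = b"NEW-STORAGE-CERT"

# The hosts' trust store still holds only the old fingerprint...
trust_store = {sha256_fingerprint(old_cert)}

print(host_trusts(old_cert, trust_store))  # True: old certificate accepted
print(host_trusts(new_cert, trust_store))  # False: rotated certificate rejected

# ...so after the rotation the storage looks untrusted, which in the real
# incident triggered HA migration of the VMs. The fix maps to refreshing
# the trust store:
trust_store.add(sha256_fingerprint(new_cert))
print(host_trusts(new_cert, trust_store))  # True: trust restored after refresh
```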
Measures
To minimize the risk of similar incidents in the future, Intility will implement the following measures:
Sequence of events
Tuesday, January 30, 2024:
• 10:14: Intility performs a certificate change for redundant infrastructure on the central compute platform.
• 10:27: Virtual machines running on the relevant part of the infrastructure are automatically migrated to other parts of the compute platform.
• 11:15: Intility’s monitoring system alerts about loss of contact with several virtual machines. Intility receives an inquiry by chat from one customer regarding loss of access to virtual machines.
• 11:18: Incident responders escalate the alerts to a specialized department for troubleshooting.
• 11:27: The Intility response team is alerted. The issue is categorized with the highest criticality. Troubleshooting continues in an ongoing call.
• 11:41: Corporate IT contacts at affected customers are notified by SMS about the issue: Intility is experiencing an issue affecting access to servers and services running on the affected part of the platform. The problem is resolved, but it may take a few minutes until normal functionality is regained due to migration of servers to unaffected parts of the platform. We are working to resolve the issue. New status will follow. […]
• 11:45: The problem is resolved. Affected virtual machines are up and running on other parts of the compute platform after automatic HA failover.
• 11:51: Corporate IT Contact(s) are updated about status: The problem which affected access to certain servers and services is resolved. Affected services now function normally. Intility will follow up any remaining alerts. Incident report will follow. […]