Interact is partially unavailable
Incident Report for Zisson.com
Postmortem

The following is a postmortem report on the disruption on Tuesday 2. April.

Fault
The Zisson Interact application had increased latency in call handling, and agents were not able to log on to call queues.

Description
The Zisson Interact platform is built from several microservices which together make up the Zisson Interact application. One of these microservices, the Agent Server, handles agent state, including logging agents on to and off call queues.

The fault affected customers using the Zisson Interact K platform, which is hosted in Norway (zissoninteract.com).

The fault did not affect all services. The web frontend and API services, telephony services (including IVRs and IP-PABX), and chat services for end-users were still working. The Zisson Interact T and SE platforms were not affected.

When
This incident occurred on Tuesday 2. April between approximately 07:00 and 10:58.

What happened
On Thursday 21. March at 22:00 Zisson deployed the 24.02 release to the Zisson Interact K platform. This release included new versions of many microservices, including the Agent Server.

On Tuesday 2. April at approximately 08:02 the Zisson Operations team was notified by our monitoring system that there were signs of increased latency on the Agent Server. Zisson immediately started to investigate the system. At approximately 08:12 the Zisson Operations team was notified by our support team that some customers were having issues.

At 08:26 Zisson deactivated an integration for mobile phone status in the Agent Server. This integration was working properly, but we decided to deactivate it because it is not critical and doing so would reduce the latency of the Agent Server. This improved the situation but did not resolve the problem completely, so Zisson continued to investigate.

At 10:58 another integration, which looks up the names and addresses of external callers in a third-party directory service, was disabled. This resolved the incident completely, and the Zisson platform was fully operational afterwards.

Root Cause
The root cause of the incident was the integration between the Agent Server and a third-party directory service used to look up the names and addresses of external callers. The integration called the directory's external HTTP-based API from a single processing thread inside the Agent Server. Because this thread also handled all other action and event processing, every lookup delayed the processing of events and actions severely. This caused problems for agents trying to log on to and off queues, and for call handling in general.
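The effect described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the event names, arrival times and costs are hypothetical and are not taken from the Agent Server. It shows how one slow call on a single processing thread delays every event queued behind it.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    arrival: float  # seconds since start (hypothetical)
    cost: float     # seconds of thread time needed to handle the event


def run_single_thread(events):
    """Process events in arrival order on one thread; return the wait per event."""
    clock = 0.0
    delays = {}
    for ev in sorted(events, key=lambda e: e.arrival):
        start = max(clock, ev.arrival)  # wait until the thread is free
        clock = start + ev.cost
        delays[ev.name] = start - ev.arrival
    return delays


events = [
    Event("directory_lookup", arrival=0.0, cost=2.0),  # assumed slow HTTP call
    Event("agent_logon", arrival=0.1, cost=0.01),
    Event("agent_logoff", arrival=0.2, cost=0.01),
]
delays = run_single_thread(events)
for name, delay in delays.items():
    print(f"{name}: waited {delay:.2f} s")
```

In this sketch the two agent events need only 10 ms of work each, but both wait close to two seconds because the lookup occupies the thread first.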

Zisson has had a third-party directory integration for many years, but in the 24.02 release this integration was refactored into the Agent Server in order to simplify and improve it. The integration worked as intended functionally, but it caused performance problems under very high system load.

Why was this not caught earlier by QA testing?
The 24.02 release was tested by the Zisson QA team three weeks before it was deployed on the Zisson Interact K platform. It had also been running in production on the Zisson Interact T platform since 14. March, where it worked as intended for two weeks without any problems. (The Zisson Interact T platform is used by approximately 28% of Zisson Interact customers.)

The problem only occurs under very high system load, and Tuesday 2. April was a day with very high call volume and very high system load.
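Why such a problem can stay hidden at normal load can be shown with a short calculation. The numbers below are purely hypothetical and are not measurements from the Zisson platform. For a single processing thread, utilization is the event arrival rate multiplied by the average handling time per event; once it exceeds 1, events arrive faster than they can be handled and the backlog keeps growing.

```python
def utilization(events_per_second, avg_seconds_per_event):
    """Fraction of one thread's time needed to keep up with the incoming events."""
    return events_per_second * avg_seconds_per_event


def backlog_growth(events_per_second, avg_seconds_per_event):
    """Events per second added to the queue when the thread cannot keep up."""
    capacity = 1.0 / avg_seconds_per_event  # events the thread can handle per second
    return max(0.0, events_per_second - capacity)


# Hypothetical figures: 25 ms average handling time including directory lookups.
normal_day = utilization(20, 0.025)   # below 1: the thread keeps up
busy_day = utilization(60, 0.025)     # above 1: the queue grows
growth = backlog_growth(60, 0.025)    # events per second piling up

print(f"normal day utilization: {normal_day:.2f}")
print(f"busy day utilization:   {busy_day:.2f}")
print(f"backlog growth:         {growth:.0f} events/s")
```

Under these assumed figures the same code path is harmless at 20 events per second and falls behind at 60, which is consistent with a fault that only appears on a day with unusually high call volume.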

What have we done
Zisson has taken the following actions:

  • Zisson has disabled the integration with the third-party directory service.

What are we doing

Zisson is taking the following actions:

  • Zisson has developed a fix for the third-party directory service integration. This fix changes the integration to make its API calls in an asynchronous and parallelizable manner. The fix will be deployed to production with the next Zisson Interact release (24.03), which will be released in April/May 2024.
  • Zisson will continue to improve its QA process. We are currently working on automating more of our QA processes, which will allow us to improve QA coverage.
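
The general idea of making lookups asynchronous and parallel can be sketched as follows. This is not Zisson's implementation: `lookup_caller` is a hypothetical stand-in that simulates network latency with `asyncio.sleep` instead of calling a real directory API, and the event names are illustrative only.

```python
import asyncio


async def lookup_caller(number: str) -> dict:
    """Hypothetical stand-in for the external directory HTTP call."""
    await asyncio.sleep(0.05)  # simulated network latency
    return {"number": number, "name": f"Caller {number}"}


async def handle_events(log: list) -> None:
    """Stand-in for agent state events that should not wait on lookups."""
    for name in ("agent_logon", "agent_logoff"):
        log.append(name)
        await asyncio.sleep(0)  # yield control back to the event loop


async def main() -> tuple:
    log = []
    numbers = ["1001", "1002", "1003"]
    # Start all lookups concurrently instead of one after another.
    lookups = asyncio.gather(*(lookup_caller(n) for n in numbers))
    # Agent events are processed while the lookups are still in flight.
    await handle_events(log)
    log.append("events_done")
    results = await lookups
    log.append("lookups_done")
    return log, results


log, results = asyncio.run(main())
print(log)
print([r["name"] for r in results])
```

In this sketch the agent events complete before any lookup returns, and the three lookups overlap rather than running back to back.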
Posted Apr 24, 2024 - 11:58 CEST

Resolved
The issue is now resolved. A detailed explanation of the issue will follow after a thorough investigation has been conducted.
Posted Apr 02, 2024 - 09:00 CEST
Monitoring
We have now implemented a fix and expect improvement shortly. A new update will follow soon.
Posted Apr 02, 2024 - 08:42 CEST
Investigating
Interact is currently experiencing instability. We are investigating the cause. We will get back to you as soon as we know more, or within 30 minutes at the latest.
Posted Apr 02, 2024 - 08:18 CEST
This incident affected: Zisson Interact.