Below is a postmortem report on the disturbance on 2. April.
Fault
The Zisson Interact application experienced latency in call handling, and agents were unable to log on to call queues.
Description
The Zisson Interact platform is built from several microservices that together make up the Zisson Interact application. One of these microservices, the Agent Server, handles agent state, including logging agents on to and off call queues.
The fault affected customers using the Zisson Interact K platform, which is hosted in Norway (zissoninteract.com).
The fault did not affect all services: web frontend and API services, telephony services including IVRs and IP-PABX, and chat services for end-users continued to work. The Zisson Interact T and SE platforms were not affected.
When
This incident occurred on Tuesday 2. April between approximately 07:00 and 10:58.
What happened
On Thursday 21. March at 22:00 Zisson deployed the 24.02 release on the Zisson K platform. This release included new versions of many microservices, including the Agent Server.
On Tuesday 2. April at approximately 08:02 the Zisson Operations team was notified by our monitoring system that there were signs of increased latency on the Agent Server. Zisson immediately started to investigate the system. At approximately 08:12 the Zisson Operations team was notified by our support team that some customers were having issues.
At 08:26 Zisson deactivated an integration for mobile phone status in the Agent Server. This integration was working properly, but we decided to deactivate it because it is not critical and deactivating it would reduce the latency of the Agent Server. This improved the situation but did not resolve the problem completely, so Zisson continued to investigate.
At 10:58 another integration, which looks up the names and addresses of external callers in a third-party directory service, was disabled. This resolved the incident completely, and the Zisson platform was fully operational afterwards.
Root Cause
The root cause of the incident was the integration between the Agent Server and a third-party directory service used to look up the names and addresses of external callers. The integration performed these lookups through an external HTTP-based API on a single processing thread inside the Agent Server. That same thread also handled all other action and event processing, so other events and actions had to wait while lookups were in progress. As a result, processing inside the Agent Server was severely delayed, which caused problems for agents trying to log on to and off queues and for call handling in general.
Zisson has had a third-party directory integration for many years, but in the 24.02 release this integration was refactored into the Agent Server in order to simplify and improve it. The refactored integration worked as intended but caused performance problems under very high system load.
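The effect of running a slow lookup on the event-processing thread can be illustrated with a minimal sketch. It is not the Agent Server's code: it assumes a Python asyncio event loop (Python 3.9 or later) as a stand-in for the single processing thread, and the function names and the 50 ms lookup latency are hypothetical.

```python
import asyncio
import time

LOOKUP_SECONDS = 0.05  # hypothetical latency of one directory lookup


def directory_lookup(number: str) -> str:
    """Stand-in for a blocking HTTP call to a directory service."""
    time.sleep(LOOKUP_SECONDS)
    return f"name-for-{number}"


async def handle_call_blocking(number: str) -> str:
    # The lookup runs on the event-processing thread; nothing else runs meanwhile.
    return directory_lookup(number)


async def handle_call_offloaded(number: str) -> str:
    # The lookup runs on a worker thread; the event-processing thread stays free.
    return await asyncio.to_thread(directory_lookup, number)


async def agent_logon_delay(handler, calls: int) -> float:
    """Return how long a queued agent event waits behind `calls` lookups."""
    tasks = [asyncio.create_task(handler(str(i))) for i in range(calls)]
    queued_at = time.monotonic()
    await asyncio.sleep(0)  # yield once; we resume only after the queued tasks have had a turn
    delay = time.monotonic() - queued_at
    await asyncio.gather(*tasks)
    return delay


def measure(handler, calls: int = 10) -> float:
    """Run one measurement on a fresh event loop."""
    return asyncio.run(agent_logon_delay(handler, calls))
```

Calling `measure(handle_call_blocking)` returns roughly the number of calls times the lookup latency, because the agent event waits behind every lookup. `measure(handle_call_offloaded)` returns a far smaller delay, because the lookups no longer occupy the event-processing thread.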
Why was this not caught earlier by QA testing?
The 24.02 release was tested by the Zisson QA team three weeks before it was deployed on the Zisson K platform. It had also been running in production on the Zisson Interact T platform since 14. March, where it worked as intended for two weeks without any problems. (The Zisson Interact T platform is used by approximately 28% of Zisson Interact customers.)
The problem only occurs under very high system load, and Tuesday 2. April was a day with very high call volume and very high system load.
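Why such a problem can stay hidden until load is very high can be shown with a small back-of-the-envelope model. It is a sketch under stated assumptions, not a measurement of the Agent Server: it assumes one thread that serves lookups one at a time, and the arrival rates and the 100 ms lookup time are hypothetical.

```python
def backlog_after(arrivals_per_second: float, lookup_seconds: float, seconds: int) -> float:
    """Events still waiting after `seconds` when one thread serves lookups one at a time."""
    capacity_per_second = 1.0 / lookup_seconds  # lookups the single thread can finish per second
    backlog = 0.0
    for _ in range(seconds):
        # The backlog grows by the excess of arrivals over capacity and never goes below zero.
        backlog = max(0.0, backlog + arrivals_per_second - capacity_per_second)
    return backlog


# Hypothetical numbers: a 100 ms lookup gives a capacity of 10 lookups per second.
print(backlog_after(8, 0.1, 60))   # 0.0: below capacity, nothing queues up
print(backlog_after(12, 0.1, 60))  # 120.0: above capacity, the queue grows every second
```

In this model a backlog of 120 events at 100 ms each means a new event waits about 12 seconds. Below the capacity threshold the same code shows no delay at all, which is consistent with a fault that passes testing and normal production traffic.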
What have we done
Zisson has taken the following actions:
What are we doing
Zisson is taking the following actions: