Call dispatching / Agent queue logon issue
Incident Report for Zisson.com
Postmortem

The following is a detailed postmortem of the incident that occurred during the night and morning of the 12th of March.

Fault
The Zisson Interact application was unable to dispatch calls from callers to agents, and agents were unable to log on to call queues.

Description
The Zisson Interact platform is built from several microservices which together make up the Zisson Interact application. These services run on a set of fully redundant servers. During the incident a server was rebooted as part of routine maintenance, and as a result two microservices ended up running on two servers at the same time. This impacted queue handling in the Zisson Interact application. The fault affected customers using the Zisson Interact platform hosted in Norway (zissoninteract.com). The fault did not affect all services: web frontend and API services, telephony services including IVRs and IP-PABX, and chat services for end-users were still working. The Zisson Interact K and SE platforms were not affected.

When
This incident occurred on Tuesday 12. March between approximately 02:30 and 07:14.

What happened
On Tuesday 12. March at 02:30 some of the servers hosting the Zisson Interact application were scheduled for operating system security patching and reboots. Our hosting provider performs this job on a monthly schedule. The patching is spread over multiple days so that the application remains fully functional during the process.

At 06:42 the Zisson operations team was notified by our Support team that some customers were having a problem with call dispatching.

At 06:50 a war room with the Zisson operations team was established to handle the incident.

At 07:05 the cause of the incident was identified: multiple instances of the same microservices were running on multiple servers.

At 07:07 the incident was resolved by stopping and restarting the affected microservices.

At 07:14 all services were restored and the Zisson platform was fully operational.

Root Cause
The root cause of the incident was that two of the microservices were running on two servers at the same time. These services were:

  • Agent Server: This service handles agents logging on and off queues
  • Queue Server: This service handles dispatching incoming calls to available agents

These services normally run on a server in the Zisson platform called an SVC server (the primary SVC server).

On January 30. 2024, however, the Queue Server and Agent Server were moved to another server (a secondary SVC server) during another incident (see the incident report for the 2024-01-30 VMWare Storage Incident).

On Tuesday 12. March at 02:30 the primary SVC server was rebooted for security patching, and the startup scripts on this server started the Agent Server and Queue Server services on the primary SVC server while they were still running on the secondary SVC server.

This resulted in two running instances of both the Agent Server and the Queue Server, and this led to various race conditions and state inconsistencies so that agent and queue handling were not working as intended.
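To illustrate the kind of state inconsistency described above, here is a minimal sketch. It assumes, purely for illustration, that each running instance keeps its own record of which agents are logged on. The class `QueueInstance`, its methods, and the agent and call names are hypothetical and do not describe the actual Zisson services.

```python
class QueueInstance:
    """Hypothetical stand-in for one running copy of the agent/queue services."""

    def __init__(self, name):
        self.name = name
        # Agents this particular instance believes are logged on to the queue.
        self.logged_on = set()

    def log_on(self, agent):
        self.logged_on.add(agent)

    def dispatch(self, call):
        # Return an agent for the call, or None if this instance
        # does not know of any logged-on agent.
        if not self.logged_on:
            return None
        return sorted(self.logged_on)[0]


primary = QueueInstance("primary-svc")
secondary = QueueInstance("secondary-svc")

# The agent's logon happens to reach one instance...
secondary.log_on("agent-1")

# ...while an incoming call may be handled by either instance.
result_primary = primary.dispatch("call-42")
result_secondary = secondary.dispatch("call-42")
print(result_primary, result_secondary)
```

In this sketch the instance that never saw the logon cannot dispatch the call, even though an agent is available according to the other instance.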

The incident was resolved at 07:07 by stopping the Agent Server and Queue Server on the secondary SVC server and restarting them on the primary SVC server.

Agents whose session was left in an inconsistent state might have needed to log off and back on to their queues before call dispatching worked for them again.

Why was this not caught earlier by monitoring?
This incident should have been identified earlier by monitoring. However, the monitoring setup for these microservices only checked whether each microservice was running, not whether multiple copies of it were running.
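A duplicate-aware check could look roughly like the following sketch. It assumes the monitoring system can produce a snapshot of (server, service) pairs that are currently running; the function name, the server and service names, and the "exactly one instance" rule are illustrative assumptions, not a description of Zisson's monitoring.

```python
from collections import defaultdict

# Assumption: each of the monitored services should run exactly once.
EXPECTED_INSTANCES = 1


def find_instance_anomalies(running, expected_services):
    """Return {service: [servers]} for services not running exactly once.

    running: iterable of (server, service) pairs reported as running.
    expected_services: names of the services that must be checked.
    """
    placement = defaultdict(set)
    for server, service in running:
        placement[service].add(server)

    anomalies = {}
    for service in expected_services:
        servers = sorted(placement.get(service, ()))
        # Flags both "not running at all" and "running more than once".
        if len(servers) != EXPECTED_INSTANCES:
            anomalies[service] = servers
    return anomalies


# Hypothetical snapshot resembling the situation during the incident.
snapshot = [
    ("svc-primary", "agent-server"),
    ("svc-primary", "queue-server"),
    ("svc-secondary", "agent-server"),
    ("svc-secondary", "queue-server"),
]
anomalies = find_instance_anomalies(snapshot, ["agent-server", "queue-server"])
print(anomalies)
```

A check that only asks "is the service running?" would pass on this snapshot, whereas counting instances per service flags both services.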

What have we done
Zisson has taken the following actions:

  • Zisson restored the Agent Server and Queue Server to normal operating status.

  • Zisson changed our internal routine for microservice handling to avoid duplicate microservices during security patching and rebooting.
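One way such a routine could be enforced in a startup script is sketched below. It assumes a hypothetical placement record that names the server currently designated to run each service; the function `should_start` and all server and service names are illustrative only and are not taken from Zisson's actual routine.

```python
def should_start(service, this_server, active_placement):
    """Decide whether a startup script on this_server may start the service.

    active_placement: hypothetical mapping of service name to the server
    currently designated to run it.
    """
    designated = active_placement.get(service)
    # Start only when this server is the designated host. Services with
    # no placement record are not started automatically.
    return designated == this_server


# Hypothetical placement after the services were moved to the secondary server.
placement = {
    "agent-server": "svc-secondary",
    "queue-server": "svc-secondary",
}

start_on_primary = should_start("agent-server", "svc-primary", placement)
start_on_secondary = should_start("agent-server", "svc-secondary", placement)
print(start_on_primary, start_on_secondary)
```

With a guard like this, a rebooted server whose startup scripts still list a service would skip it when the service has been moved elsewhere.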

What are we doing
Zisson is taking the following actions:

  • Zisson will continue to improve its QA process and we are currently working on automating more of our QA processes which will allow us to improve QA coverage.
  • Zisson is working to improve its monitoring setup to better detect duplicate microservices.

  • The Zisson Operations team is also working on a new deployment system for the Zisson Interact application based on Kubernetes which will allow us to recover from failing servers more quickly.

Posted Mar 20, 2024 - 13:39 CET

Resolved
We experienced a problem during the night of the 12th of March with call dispatching and agent queue logon.

The issue started at approximately 02:30 and was resolved at 07:14 CET.
Posted Mar 12, 2024 - 06:30 CET