Calastone’s operational resilience: managing incidents and operational capacity
30 Jan 2017
Continuous service improvement plays a central role in any well-managed business. It is essential to ensuring that a business remains competitive and scalable as client requirements change and as incidents arise. Carefully planned metrics make it possible to measure performance, identify areas for improvement, and mitigate potential issues before they become incidents.
When I joined Calastone in 2009, we were an entirely domestic service and the continuous service improvement programme (CSIP) sessions were less formal, focusing largely on the evaluation of processes and the development of products.
Since then, we have expanded into 29 countries and territories and the network has now scaled to over 11,000 active trading links. It processes over six million messages per month, in hundreds of different formats. Our CSIP has therefore become more formal and more focused, with a far greater emphasis on high-quality, scalable service delivery through automation.
We have developed a number of processes that enable us to manage incidents, growth and operational capacity, ensuring the robustness of our network and the higher service quality that results. These processes are never static – as we continuously improve the Calastone network, we always find new ways to manage changes, issues and new client requirements.
Preparing for Incidents
Incidents affecting the network are rare, but the nature of the incidents we encounter changes regularly, so the processes and tools that underpin our lines of defence are highly dynamic. We follow a large number of processes when dealing with incidents (an incident being defined as any deviation from the normal service). Through extensive team training, ever-improving toolsets and proactive management, we are able to administer a complex and evolving set of technologies and client requirements.
For example, our clients’ dealing and technology behaviours are baselined. This allows us to set tolerances against these baselines and ensures that we can generate automatic alerts if they are breached. This dealing behaviour is dynamic, however, and so these baselines are checked and reviewed regularly to ensure that we continue to provide scalable service delivery.
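As an illustration only, the baseline-and-tolerance idea can be sketched in a few lines of Python. This is not Calastone's implementation: the function names, the sample figures and the choice of a tolerance of three standard deviations around the mean are all assumptions made for the example.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarise historical observations as (mean, standard deviation)."""
    return mean(history), stdev(history)

def breaches_tolerance(value, baseline, k=3.0):
    """Return True when value lies more than k standard deviations from the mean.

    The multiplier k is a hypothetical tolerance setting, not a documented one.
    """
    centre, spread = baseline
    return abs(value - centre) > k * spread

# Hypothetical daily order counts for a single client.
daily_orders = [100, 104, 98, 101, 97, 102, 99]
baseline = build_baseline(daily_orders)

print(breaches_tolerance(103, baseline))  # within tolerance
print(breaches_tolerance(160, baseline))  # outside tolerance, would raise an alert
```

Because dealing behaviour changes over time, a baseline of this kind would need to be rebuilt from recent history at regular intervals rather than computed once.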
Part of this process is to identify and catalogue error conditions, or conditions that appear to be outside of the normal baseline range. We use automated toolsets to rank them by severity, impact and urgency. We then set automatic audible, visual and email alerts to the Operations team so that they can investigate. This means we often identify and fix potential issues, wherever in the connectivity chain they may occur, before they affect the service or our clients.
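A minimal sketch of ranking catalogued conditions by severity, impact and urgency might look like the following. The 1 to 3 scoring scale, the field names and the sample conditions are assumptions for illustration; the article does not describe how the scores are actually assigned or combined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorCondition:
    name: str
    severity: int  # assumed scale: 1 (low) to 3 (high)
    impact: int    # assumed scale: 1 (low) to 3 (high)
    urgency: int   # assumed scale: 1 (low) to 3 (high)

def rank(conditions):
    """Order conditions from most to least pressing.

    Severity is compared first, then impact, then urgency.
    """
    return sorted(
        conditions,
        key=lambda c: (c.severity, c.impact, c.urgency),
        reverse=True,
    )

# Hypothetical catalogue entries.
catalogue = [
    ErrorCondition("slow acknowledgement", severity=1, impact=2, urgency=1),
    ErrorCondition("link down", severity=3, impact=3, urgency=3),
    ErrorCondition("malformed message", severity=2, impact=3, urgency=1),
]

for condition in rank(catalogue):
    print(condition.name)
```

Comparing the three scores in a fixed order keeps the ranking easy to explain. A weighted sum would be an equally reasonable alternative if the three dimensions should trade off against each other.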
We also create knowledge bases. Error conditions and frequently asked procedural questions are formally tagged so that they suggest possible fixes and lines of enquiry to the Operations team. This happens as the system generates error messages, but the facility is also accessible to the team during the course of day-to-day operations.
Together with rigorous training, a culture of collaboration and individual learning, this helps us to maintain and continuously improve the service that our clients expect.
Managing Operational Capacity
Managing capacity is essential, as it ensures that we meet all current and future client requirements for Calastone’s services. Capacity Management forms an important part of our CSIP, not only because of changing client requirements, but also because as technology advances and services broaden, greater demands are placed on system performance, which in turn feeds client expectations.
In order to manage our operational capacity, we formally define the expected performance tolerances of the various technologies. We measure the transactional and technical response times for a given route and baseline them. For example, deals between parties typically have a consistent round-trip time and consume a consistent amount of resource from the systems they pass through. This data is logged and tolerances can be set. Using this information, we generate proactive alerts at defined thresholds. This results in a tactical early warning system that is constantly in use by the Operations team – it allows us to take steps to prevent potential incidents from impacting clients.
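The round-trip example above can be sketched as a tiered threshold check. The use of the median as the baseline, the ratios of 1.5 and 2.0 and the millisecond figures are all hypothetical choices for this sketch, not values taken from the Calastone network.

```python
from statistics import median

def classify_round_trip(sample_ms, baseline_ms, warn_ratio=1.5, critical_ratio=2.0):
    """Classify one round-trip time against the baseline for its route.

    warn_ratio and critical_ratio are assumed thresholds, expressed as
    multiples of the baseline.
    """
    if sample_ms >= baseline_ms * critical_ratio:
        return "critical"
    if sample_ms >= baseline_ms * warn_ratio:
        return "warning"
    return "ok"

# Hypothetical logged round-trip times (milliseconds) for a single route.
history_ms = [200, 210, 190, 205, 195]
baseline_ms = median(history_ms)

for sample in (220, 320, 450):
    print(sample, classify_round_trip(sample, baseline_ms))
```

A warning tier below the critical tier is what makes this an early warning: the team is alerted while the route is degraded but still working.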
In addition, we automate the reporting of internal predicted capacity of system components. This is then managed at regular, formal capacity review meetings. This gives us the information we need to plan capacity upgrades well in advance of any breach of utilisation markers. As part of our CSIP, we also run regular capacity test labs to predict future need and to match our systems to the requirements and tolerances of the business.
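To show how predicted capacity could feed a review meeting, the sketch below fits a straight line to past utilisation readings and estimates how many reporting periods remain before a utilisation marker is reached. The linear trend, the function name and the figures are assumptions for the example; real capacity forecasting would likely account for seasonality and step changes in demand.

```python
def periods_until_marker(utilisation, marker):
    """Estimate the periods remaining before utilisation reaches the marker.

    Fits a least-squares straight line to equally spaced readings.
    Returns None when the trend is flat or falling.
    """
    n = len(utilisation)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(utilisation) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, utilisation)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing = (marker - intercept) / slope  # period index at which the line hits the marker
    return max(0.0, crossing - (n - 1))      # periods beyond the latest reading

# Hypothetical monthly utilisation (percent) of one system component.
monthly_utilisation = [50, 52, 54, 56, 58]
print(periods_until_marker(monthly_utilisation, marker=80))
```

An estimate like this is only a prompt for discussion: it indicates which components should be looked at first, not when an upgrade must happen.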
The Importance of Continuous Improvement
The importance of regular CSIPs cannot be overstated. In the context of capacity management, they define the formal frameworks that ensure tools are developed and services evolve to meet and exceed client requirements. It is through our CSIPs that we have evolved into the robust network that we are today. As with our business continuity processes, they are not just a generic process that everyone is forced to follow. They are at the centre of the culture, not just of the Operations team but of Calastone as a whole.