Environment Support

Introduction

Modern exchange platforms, smart order routers, market data aggregators, are complex systems with a lot of components, workflows and functions embedded. Therefore, they require special maintenance and support.

Special departments are established to provide the support and maintenance of such systems.

Their goal is to look after the system by using various monitoring tools and alert the business immediately if something suspicious occurs.

This requires in-depth business and technical knowledge about the supported system.

Exactpro has gained extensive experience in testing, which allows us to excel in production systems support, because:

- it provides user-like experience to the support engineer;

- the investigation of issues and the subsequent communication with the development team nurtures an advanced understanding of the system;

- it provides the support engineer with a unique skill of being confident and calm even when the system works incorrectly.

In most cases, a production system, unlike a test system, works properly. However, if a live system malfunctions, our support engineers are well-prepared for troubleshooting, to help determine the root cause. Their broad experience with testing gives them a mindset of confidence, which, along with the time-tested support procedures and checklists, makes our engineers qualified to provide first class support for a wide range of systems.

The Common Scheme of Support

System
The system under support. The system can be accessed by the client, first line support, back line support and the support manager. Clients usually have limited access to the system, while members of the Support Team have unlimited access to see all the configurations and transactions (sometimes ‘read-only’ for security reasons).
Client
A user of the system, who uses its functions on a paid basis.
First line support
Support engineers continuously monitoring the system. The main quality is speed and vigilance. The main task is looking after the system and detecting errors/suspicious behavior.
Back line support
Support engineers with strong knowledge of the system and the way it can be configured. The main quality is deep system understanding and ability to solve problems. The main task is troubleshooting a variety of issues that first line support cannot resolve.
Client services
Support engineers responsible for informing the clients. The main quality is excellent communication skills. The main task is conveying messages about the system status and issues to the clients, receiving their calls, providing basic consultancy. Other tasks include letting first line monitor the system and report the issues, as well as resolve the issues without spending additional time on communication Relevant only for medium/big support teams. In small support teams, these functions are performed by first line support or a support manager.
Support manager
A person responsible for making crucial decisions and managing critical situations. The main quality is deep understanding of the system, related business processes and interests of the clients. If any serious problems are encountered or something should be undertaken with the system, the manager is responsible for handling the situation and providing a decision. A support manager can perform back line support functions.
Incidents tracker

A system used to track all the incidents, similar to bugtracker, but lighter. Allows to raise incidents faster (e.g. Supportworks), provides aggregation of information about incidents and access to it at any time. It is used as a reference for certain incidents among the members of the support team. Incident records can be raised by both clients and members of the support team.

When the system is running in production, first line support looks after it to detect errors or unusual behavior. To ease the tracking, the support centres use wide screens with performance statistics and health status, combined with a common phone line to communicate with clients.

If a problem is minor and first line support knows exactly how to fix it, they do it on their own, making sure it is fixed and creating an incident report. Although the problem is already fixed, the report can be important to the other members of the team or to the manager, to track and analyze the reliability of the system.

If the problem is serious, first line support needs to create a ticket quickly and identify if it impacts the client. If the client is impacted, first line support should contact a client’s manager, client services or the client directly, to let the client be notified about the problem. Then the first line support gets back to line support to investigate the problem.

When the cause is found, appropriate action should be taken to tackle it. But before performing any risky or significant remedy actions, first line support should receive a confirmation from the support manager. While the problem is being investigated, the client should be updated about the process, usually it's done periodically, for example each 15 minutes via mail or by phone. When the problem is resolved, the client should immediately be notified about the system being back to normal, so the routine business activities can be resumed.

Serious problems, like outages or freezes of production should be accompanied by official incident reports sent to the client, containing detailed explanations of the cause and the remedy used to fix the problem, as well as preventive actions to ensure that it will not happen again in the future.

Each critical problem should be thoroughly described to the client, making him confident that appropriate actions are rigorously conducted to provide the best quality of services.

It is worth mentioning that the support team is not only responsible for production monitoring, but also for Pre-Production and UAT environments for both monitoring and deployment of new changes.

And since the new patches/updates are delivered to UAT and Pre-Production for testing first, it gives them an advantage of anticipating and troubleshooting the possible issues.

In addition to the support activities mentioned above and, sometimes, maintenance of UAT and Pre-Production, the Support Team can be engaged in backup/restore procedures in production.

Different Types of the System’s Work Cycle and Support

24/6 - when the system is available for 24 hours, 6 days a week, with one day for maintenance.
24/7 - when the system is available for 24 hours, 7 days a week, short interruptions for emergency maintenance are possible.
Several hours a day - when the system is actively used only within a certain time window.

The last type of work cycles is the most common one, and the majority of systems, like exchange platforms, market data aggregators, smart order routers follow it.


Usually the work cycle is split into 3 separate stages:

  • Beginning Of Day, preparation before the start and the start of the system;
  • Intraday, routine functioning of the system;
  • End of Day, shutdown of the system, maintenance activities, preparation to the next day.

Beginning of Day (BOD)

As Aristotle said, "well begun is half done", and it may well be appropriate for any complex system, like an exchange trading platform or a smart order router. High or medium complexity systems consist of dozens of components and hundreds of processes. All the components start at a scheduled time during BOD and the main tasks for Support here are revealing any errors encountered, examining the impact, understanding the root cause and possibly resolving it.

If the current specialist can not investigate and resolve the problem in a few minutes, other experienced members can be engaged in troubleshooting via call or mail. The aim is to resolve the encountered problem as soon as possible, before the clients connects and starts using the system. Naturally, the problem during BOD does not pose as much urgency as the one encountered during production hours, because no losses are inflicted yet. But a quick reaction, good communication skills and a sharp mind are essential, because sometimes third parties are providing the internet connection, the VPN connection, gateways connection, etc., so it could take time to figure the situation out and reach the required person, thus no delays are allowed during BOD problems.

Together with the start of components, processes like uploading some sample data and weighty database queries take place during BOD, which also requires attention, as well as involves consumption of CPU, RAM, hard drive space and load average on all the servers used by the system.

When all the BOD activities are done, a mail notification about the status of the system can be sent to notify the clients about the fact that the system has been started successfully and is ready for use.

Intraday

When the system is up and running and clients are working with it, it is also important to see if any errors/problems pop up and how the system behaves in terms of resource consumption (RAM, CPU, HDD, Load Average, etc.).

However, most of Support’s daily routine during the Intraday phase consists in handling customers requests and replying to their questions. Requests can be received via email or by phone, and sometimes clients have claims or worries, they may be anxious or in a rush, so a big part of accepting a request lies in psychological training. It is imperative to be polite and honest and simultaneously cool-headed, to be able to quickly distinguish the matter of the problem, or swiftly figure out the points necessary to be confirmed with the client, to get enough information about the issue. If any disruption or malfunction occurs, the client should be immediately informed about it and kept posted about the status of troubleshooting.

End Of Day

When the Intraday phase ends and the last client stops using the system, the End Of Day part takes over. If needed, some configuration changes take place here, the required reference data can be added, and the required patches - installed.

All the processes are being shut down, logs are being archived, the possible queries into the database are being executed. If necessary, reports and status messages are sent to the customers.

From all the 3 phases, problems are less likely to occur here, and there is less of a rush in investigating the issues. It is obvious that the status of the previous End Of Day is important for the consequent Beginning Of Day, so all outstanding changes or encountered issues should be conveyed to the next shift of Support.

In the case of the "several hours a day" cycle, one shift of support team is sufficient to conduct the support activities, however in the case of the 24/6 or the 24/7 work cycle, it is required to split the support team into at least 2 shifts. Since support can be done remotely, it is convenient to use the advantage of time zone differences. For example when it is just 5 AM in London (winter time), it is already 8 AM in Moscow, so it is a natural benefit for performing morning hours monitoring and support from the Moscow time zone.

For example, London Stock Exchange conducts its trading services each weekday from 8:00 am to 4:30 pm, which on Moscow Time is 11 AM to 7:30 PM, so it is convenient to start operations and support procedures in Moscow, 2-3 hours before the market opens.



Exactpro has long-term experience in working this way, as well as in providing both the 24/6 and the 24/7 systems’ support in various time zones from Europe to the US.

How to Monitor and Examine the System

Backend monitoring

Usually, error records have a specific format in the log, so BASH or Perl scripts can be periodically executed to detect and send results via mail to the support team or all interested parties. RAM, CPU, HDD consumption and the load average can also be tracked by the scripts.

SSH client (e.g. PuTTy) can be used for manual monitoring on demand. It allows to manage all the processes running on the backend, see RAM, CPU, HDD consumption and the load average, view the logs, the coredumps, and check the information about the system. With enough skills, it is a very powerful supplement for automatic monitoring via scripts. And, surely, it is used to confirm the results retrieved using automated scripts.

Network resources monitoring tools
They are predominantly used by IT-support but anyway they are a part of production monitoring, allowing to see if there are disruptions in the network or the hardware in the physical or virtual machines. If anything like that is observed, the related parties should be contacted and remedy actions should be taken as soon as possible.
System frontends

The system frontends (FEs) are made to provide a user interface to interact with the systems. Usually, each system has a frontend. It should be launched each day to make sure that the user can login and interact with the system. Launching frontends in advance, before the user, provides additional time to tackle the problems, should they be encountered.

Keeping FEs open throughout the day allows to track future activities of the system and apply remedy actions, if needed.

Smart tools for real-time monitoring of processes and applications.

They may be an alternative or a good supplement of Perl and BASH scripts, used for immediate real-time monitoring of the system. Provides a good interface and a wide range of functions to monitor processes, logs, script results, usage of hardware resources, the read/write speed, and the flow of transactions through the system.

Here are some examples of the tools ITRS Geneos, SolarWinds Server & Application Monitor, AlertSite, AppDynamics APM, CA APM.

Needless to say, all the approaches above require a certain level of skill, and our team is experienced with BASH/Perl scripts, manual monitoring and management of linux backends, configuration and usage of a variety of monitoring tools.