Troubleshooting troubles! – part 2

PRAMOD SRIDHARAMURTHY
Tuesday, July 16, 2013

Basic troubleshooting and Automation

In the previous blog, we looked at some of the steps commonly followed during troubleshooting, and at how the overall approach stays similar even though the specifics differ.

While looking for solutions that enable support, we need to remember that not all support problems can be treated the same way. Support issues are typically broken down into levels, depending on the complexity of the problem. Most organizations have three or four levels: L1 through L3 or L4.

  • Issues that get resolved at the L1/L2 level – These form the majority of support cases in almost all support organizations. While these issues have a lower mean time to resolution (MTTR), their sheer volume means they take up most of the support team’s time.
  • Issues that get resolved at the L2/L3 or L3/L4 levels (depending on how organizations have structured their teams) – While these are fewer in number, the MTTR for such problems is a lot higher.

The two categories above need different handling. But before we discuss ways of solving this, one of the most important things to remember is that the support organization is always under tremendous pressure to resolve issues in the shortest time possible. A support engineer will always choose the method that gets results in the shortest time and the fewest steps, so a solution that does not meet those goals will not succeed.

With the above premise, let’s look at possible solutions to the two types of issues listed above. In this blog we will look at L1/L2 issues and continue with the rest in the next blog. For L1/L2 issues, what if there were:

  • A system that can process incoming device logs within seconds
  • A log vault that allows the support engineer to get access to the log in the shortest possible time
  • A search interface to quickly and easily search historical logs
  • An integrated file viewer and file diff tool that lets a support engineer view a log file of interest and see what changed compared to previous files
  • A rules and alert engine that can be configured to look for the most common problems (a known issue repository) and alert the support engineer when issues are found
  • An interface that highlights all the known issues in a given log as soon as it is loaded into the system, using the known issue repository
  • An interface to show related cases or known bugs – This would require integration with the case and bug management systems
  • An interface that shows the current configuration of the system and changes in configuration of interest – Requires an interface to define what constitutes a configuration and which parts of it need to be tracked for change
  • Integration with the case management and RMA systems, to automatically open cases for known issues
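
To make the rules engine and known issue repository a little more concrete, here is a minimal sketch in Python. It assumes that each known issue can be expressed as a regular expression matched against individual log lines. The `KnownIssue` class, the `scan_log` function, the issue IDs, and the sample log are all made up for illustration and do not describe any particular product.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    """One entry in a hypothetical known issue repository."""
    issue_id: str
    pattern: str      # regular expression matched against each log line
    description: str


# Illustrative rules only; a real repository would be built from past cases.
KNOWN_ISSUES = [
    KnownIssue("KI-001", r"disk .* is \d+% full", "Disk nearing capacity"),
    KnownIssue("KI-002", r"fan \d+ failed", "Fan failure, possible RMA"),
]


def scan_log(log_text, issues=KNOWN_ISSUES):
    """Return (line_number, issue_id, line) for every line matching a rule."""
    compiled = [(issue, re.compile(issue.pattern, re.IGNORECASE)) for issue in issues]
    hits = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        for issue, regex in compiled:
            if regex.search(line):
                hits.append((line_number, issue.issue_id, line.strip()))
    return hits


# Made-up sample log, used only to exercise the rules above.
sample_log = """\
2013-07-16 10:00:01 INFO system boot complete
2013-07-16 10:05:12 WARN disk sda1 is 97% full
2013-07-16 10:07:45 ERROR Fan 2 failed
"""

for line_number, issue_id, line in scan_log(sample_log):
    print(f"line {line_number}: {issue_id}: {line}")
```

The same list of hits could drive both the alerting and the highlighting described above: an alert is raised when the list is non-empty, and the line numbers tell the viewer which lines to highlight.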

While the above list is neither comprehensive nor detailed, it does provide an overview of the interfaces, systems, and processes that can be put in place to make things easier for L1/L2 support.
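
As a concrete illustration of the file diff and configuration-tracking ideas from the list, the sketch below compares two snapshots of a configuration and reports only the changes in keys that have been marked as interesting. It assumes a simple `key = value` text format; the tracked keys, the function names, and the sample snapshots are hypothetical. The line-by-line diff uses Python's standard `difflib` module.

```python
import difflib

# Hypothetical choice of which configuration keys matter for support.
TRACKED_KEYS = {"firmware_version", "raid_level", "mtu"}


def parse_config(text):
    """Parse simple 'key = value' lines into a dict, skipping blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


def tracked_changes(old_text, new_text, tracked=TRACKED_KEYS):
    """Return {key: (old_value, new_value)} for tracked keys whose value changed."""
    old, new = parse_config(old_text), parse_config(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(tracked)
        if old.get(key) != new.get(key)
    }


# Made-up snapshots taken from two consecutive logs.
previous = "firmware_version = 4.1\nmtu = 1500\nhostname = node-a\n"
current = "firmware_version = 4.2\nmtu = 1500\nhostname = node-b\n"

# Only the tracked key is reported; the hostname change is ignored.
print(tracked_changes(previous, current))

# A plain line-by-line diff, as a file diff tool would show it.
for line in difflib.unified_diff(
    previous.splitlines(), current.splitlines(),
    fromfile="previous", tofile="current", lineterm="",
):
    print(line)
```

Separating "what changed" from "what changed that we care about" is the point of defining which parts of a configuration are tracked: the full diff stays available, but the engineer sees the tracked changes first.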

In the next blog, we will continue this topic and look at systems and processes that can help with L3/L4 issues.