Problem Management – Key to Achieving Operational Excellence

Are you operating an application in production? If so, is problem management part of your internal processes? If it is not, you are operating your applications reactively, and your operating model likely has room for improvement. In this post we'll discuss a strategy for achieving operational excellence through problem management.

When operating an application in production, one scenario you want your team to avoid is firefighting mode. Besides frustrating end users, developers, and support staff, firefighting is costly: most of the time you also have to pull senior people into troubleshooting instead of letting them focus on more value-adding activities. You therefore want to avoid that scenario, or at least minimize it. Another operations support scenario that can be optimized is having support staff work on recurring incidents and service requests.

Major outages cannot be avoided entirely because of the sheer number of variables that can lead to one, some of which are outside your control. Many of the variables that can be monitored were covered in our previous post, Service Operations Processes – Event Management. Event management can be used to look for signs of errors before a major outage occurs. Retrace's monitoring and alerting features can detect these events so that the support team can react, address them, and where possible prevent a major outage.

For recurring incidents and service requests, there are two approaches to minimizing, if not eliminating, these recurring activities:

1. Conduct a periodic operations review

An operations review is a regular meeting in which the operations support team discusses and analyzes its tickets. To save time, someone on the team can do the analysis before the meeting. Normally, the review includes an assessment of how the whole team is performing against its KPIs. Part of the ticket analysis is looking for patterns in service requests and incidents. Recurring service requests and incidents are candidates for problem management. Ideally, a problem ticket should be created for each one so that the resulting changes can be monitored and tracked.
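The pattern search described above can be sketched in a few lines of Python. This is only an illustration: the ticket records, their field names (`type`, `summary`), and the threshold of three occurrences are assumptions made for the example, not the export format of Retrace or of any particular ticketing tool.

```python
from collections import Counter

# Hypothetical ticket export; the field names are assumptions for illustration.
tickets = [
    {"id": 1, "type": "service_request", "summary": "Password reset"},
    {"id": 2, "type": "incident", "summary": "Report job timed out"},
    {"id": 3, "type": "service_request", "summary": "password reset"},
    {"id": 4, "type": "incident", "summary": "Disk full"},
    {"id": 5, "type": "incident", "summary": "Report job timed out"},
    {"id": 6, "type": "service_request", "summary": "Password reset "},
    {"id": 7, "type": "incident", "summary": "Report job timed out"},
    {"id": 8, "type": "service_request", "summary": "Password reset"},
]

def recurring_candidates(tickets, min_occurrences=3):
    """Return (type, summary, count) for tickets seen at least min_occurrences times."""
    # Normalize the summary so trivial differences in case or spacing still match.
    counts = Counter(
        (t["type"], t["summary"].strip().lower()) for t in tickets
    )
    return [
        (kind, summary, n)
        for (kind, summary), n in counts.most_common()
        if n >= min_occurrences
    ]

for kind, summary, n in recurring_candidates(tickets):
    print(f"{kind}: '{summary}' occurred {n} times")
```

Matching on a normalized summary is a deliberately crude grouping rule. In practice the team would more likely group by a category or error signature that the ticketing tool already records.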

It would also be helpful for the support team to review the Retrace Dashboard, Errors, and Logs during the operations review. This is a good opportunity to assess and recalibrate the acceptable limits (i.e., the upper and lower limits) of metrics.
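One possible way to recalibrate limits is to derive them from recent measurements. The sketch below assumes a simple rule of mean plus or minus three standard deviations. Both the rule and the sample values are assumptions made for illustration, not something Retrace prescribes, and the team should judge whether the rule suits each metric.

```python
from statistics import mean, pstdev

def recalibrated_limits(samples, k=3.0):
    """Return (lower, upper) as the mean minus/plus k population standard deviations."""
    center = mean(samples)
    spread = pstdev(samples)
    return center - k * spread, center + k * spread

# Hypothetical response times in milliseconds from a normal operating period.
response_times_ms = [100, 110, 90, 105, 95]

lower, upper = recalibrated_limits(response_times_ms)
print(f"acceptable range: {lower:.1f} ms to {upper:.1f} ms")
```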

To decide whether a recurring incident or service request warrants an enhancement or a permanent fix, you need some data: the time it takes to resolve the incident or fulfill the service request, and the number of occurrences.

For incidents, it's important to understand the difference between a quick fix and a permanent fix. A quick fix is an action taken by a support resource or developer to resolve a specific occurrence of the incident. Once a quick fix is applied, the ticket for that occurrence can be marked as resolved, but there is no guarantee that a similar incident will not happen again. A permanent fix involves more steps: root cause analysis, design of the fix, implementation, verification, and, if it is a code change, deployment. The key characteristic of an effective permanent fix is that once it is implemented, the incident no longer recurs.

The choice between a quick fix and a permanent fix depends on the effort and risks involved in implementing each. If the permanent fix takes 80 man-hours to implement while the quick fix takes only 10 minutes, the quick fix makes more sense. If the incident occurs often enough, though, the permanent fix may be worth considering. The support team can keep applying the quick fix while the permanent fix is being worked on. Sometimes no quick fix is available, and in that case you have no choice but to implement the permanent fix. A permanent fix that is waiting to be implemented must be logged as a problem ticket so that it can be monitored and tracked through to implementation.
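The effort side of that comparison can be made concrete with a small break-even calculation. The sketch below uses the 80-hour and 10-minute figures from the example above. The function names and the monthly occurrence rates are hypothetical, and the calculation deliberately ignores risk and user impact, which also belong in the decision.

```python
import math

def break_even_occurrences(permanent_fix_hours, quick_fix_minutes):
    """Number of quick fixes whose combined effort equals the permanent fix."""
    return math.ceil(permanent_fix_hours * 60 / quick_fix_minutes)

def months_to_break_even(permanent_fix_hours, quick_fix_minutes, occurrences_per_month):
    """Months of recurring incidents before the permanent fix pays for itself."""
    needed = break_even_occurrences(permanent_fix_hours, quick_fix_minutes)
    return needed / occurrences_per_month

# 80 hours of permanent-fix effort versus a 10-minute quick fix.
print(break_even_occurrences(80, 10))      # 480 occurrences
print(months_to_break_even(80, 10, 40))    # 12.0 months at 40 incidents per month
print(months_to_break_even(80, 10, 5))     # 96.0 months at 5 incidents per month
```

At five occurrences a month the quick fix stays cheaper for years, while at forty a month the permanent fix recovers its effort within a year. This is why the number of occurrences is part of the data to gather.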

2. Real-time analysis of incidents

The choice between a quick fix and a permanent fix can also be made as soon as the incident is reported, without waiting for the operations review. Tools such as Retrace become very valuable here, because considering a permanent fix requires proper root cause analysis so that the remedy and the effort to implement it can be determined as quickly as possible. The same considerations described above for choosing between a quick fix and a permanent fix apply.

If a quick fix is available and chosen over a permanent fix, a problem ticket must be created to monitor and track the status of the permanent fix.

Benefits of Problem Management

The ultimate benefit of problem management is increased productivity, because it frees up resources for more value-adding activities such as enhancements. This comes on top of the expected outcome: improved user and customer engagement from a seamless experience with your application.
