I have been following a very interesting discussion at the ITSkeptic (http://www.itskeptic.org/we-should-create-problem-record-right-front-incide) and have resisted the LinkedIn group, as I think the madness will continue there. Why do I call it madness? Well, the Skeptic accused me of oversimplifying things, and maybe he is right. I actually often think that I overcomplicate things, but I think I get the right blend with some pragmatism thrown in.
I will not go over the same old ground as on that site, so I will simply state my position:
Incident Management's (IM) primary objective is to restore the Service and get the customer up and running. This means that if it is a high impact Incident, IM manages the restoration process. Why do I need to state this when I think it is obvious? Well, some people think that Problem Management's (PM) charter to find and remove the Root Cause may take precedence over restoring the Service to the customer. I say that this is resoundingly incorrect. Yes, there will be instances where diagnostic information is lost, but if we are dealing with a high impact Incident our focus should be customer and business oriented!
So that is my opening argument. For a high impact Incident PM should always become involved, firstly by creating a related Problem record and possibly by playing a coordinating role.
So what happens if, as part of the Incident process, we discover what the root cause is? Some would argue that we then do not need to create a Problem record, and that is something I see causing havoc. Yes, we may have found the root cause, BUT it is not within IM's charter to decide whether we want to remove that root cause; that is the charter of PM! I have seen many organisations that have melded IM and PM, and it has always caused them difficulties and killed their metrics, because they have a large number of Incidents left open while the root cause has not been addressed. Leaving Incidents open to track root cause removal has no place in IM.
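To make that separation concrete, here is a minimal sketch of how I picture the hand-off. The record types, field names and the restore_service function are all hypothetical and not taken from any particular tool: the Incident is resolved as soon as the Service is restored, and anything we learned about the root cause is handed to a related Problem record for PM to decide on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    # Hypothetical Problem record; PM decides later whether to investigate or fix.
    summary: str
    root_cause: Optional[str] = None
    status: str = "open"


@dataclass
class Incident:
    # Hypothetical Incident record owned by IM.
    summary: str
    high_impact: bool
    status: str = "open"
    problem: Optional[Problem] = None


def restore_service(incident: Incident, suspected_root_cause: Optional[str] = None) -> Incident:
    """Resolve the Incident once service is back; pass root cause work to PM."""
    # IM's job ends when the customer is up and running again.
    incident.status = "resolved"
    # A related Problem record is raised for every high impact Incident,
    # and also whenever IM happened to find a root cause along the way.
    if incident.high_impact or suspected_root_cause is not None:
        incident.problem = Problem(
            summary=incident.summary,
            root_cause=suspected_root_cause,
        )
    return incident


outage = restore_service(Incident("Payments down", high_impact=True), "expired certificate")
```

The point of the sketch is that the Incident never waits on the Problem: the Incident is resolved while the Problem stays open until PM makes its call.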
PM is where we decide, if the root cause has not been discovered, whether we should even be interested in looking for it. And not just the root cause, since to most organisations that is just the last event that occurred before the error was detected. What about all the contributing causes? They can be very time consuming to discover, but remedying some of them may reduce the risk of the Incident reoccurring at a much lower cost.
Whether we actually perform root cause analysis must be determined by weighing up the risk of the Incident occurring again and what the impact will be. If we have a quick workaround in place, the impact may be mitigated to such an extent that we do not consider doing further root cause analysis. Say we have a memory leak and our workaround is to reboot the server. What is the cost of this, though, and does anybody ever go back to consider whether we really need to find and remove the root cause? How many times do these workarounds just get ingrained in the standard operating procedure?
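One rough way to put numbers on that question is to compare what the workaround costs us over a period with the one-off cost of finding and removing the root cause. The sketch below is only an illustration: the function name, its parameters and the figures in the example are all made up, and a real decision would also weigh things like customer impact that do not reduce to a single cost.

```python
def worth_investigating(recurrences_per_year: float,
                        cost_per_occurrence: float,
                        rca_and_fix_cost: float,
                        horizon_years: float = 1.0) -> bool:
    """Return True when living with the workaround costs more than fixing the cause.

    All inputs are estimates supplied by whoever reviews the Problem record.
    """
    # Total cost of applying the workaround every time the Incident recurs.
    workaround_cost = recurrences_per_year * cost_per_occurrence * horizon_years
    return workaround_cost > rca_and_fix_cost


# Hypothetical memory leak: a weekly reboot at 150 per reboot (staff time plus
# downtime) versus an estimated 5000 to find and remove the leak.
weekly_reboot = worth_investigating(52, 150.0, 5000.0)
```

Even a crude comparison like this forces somebody to go back and look at the workaround, rather than letting it quietly become standard operating procedure.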
Whatever the case, we should ALWAYS raise a related Problem record for a high impact Incident - this is the whole tenet of Reactive Problem Management. Some would argue that the greatest benefit of Problem Management is removing the root cause of high impact Incidents, and this may be the case for organisations that experience many of these (an entire blog topic in itself), but is that true of all organisations? I am not so sure! It will certainly show from a cost perspective, as these high impact Incidents have the potential to actually cost the company money.
It is in proactive Problem Management that we perform data mining to look for trends, generally across our lower priority Incidents. Very few organisations do this well - the reason being that few organisations have the mature Incident classification or Configuration data that is so important to finding trends. There is also a lack of data mining tools with sufficient intelligence to predict these trends, and where they do exist they are generally neglected in favour of slick and shiny dashboards that provide value from a marketing perspective. Proactive PM will provide more internal benefits to IT, as it will stop expensive resources doing the same thing over and over again; rarely will it have the visibility of removing the cause of high impact Incidents.
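The trend hunting does not need a clever tool to get started. As a minimal sketch, assuming each Incident is exported as a dictionary with hypothetical "category", "ci" (Configuration Item) and "priority" fields, and that priority 1 is the most urgent, we can simply count the lower priority Incidents per classification and Configuration Item and flag the groups that keep coming back:

```python
from collections import Counter


def recurring_groups(incidents, min_count=3, low_priority_from=3):
    """List (category, ci) groups of low priority Incidents that recur.

    Assumes priority 1 is most urgent, so values >= low_priority_from
    are treated as lower priority. Field names are illustrative only.
    """
    counts = Counter(
        (incident["category"], incident["ci"])
        for incident in incidents
        if incident["priority"] >= low_priority_from
    )
    # most_common() orders groups from most to least frequent.
    return [(group, n) for group, n in counts.most_common() if n >= min_count]


# Made-up sample export for illustration.
sample = [
    {"category": "email", "ci": "mail01", "priority": 4},
    {"category": "email", "ci": "mail01", "priority": 4},
    {"category": "email", "ci": "mail01", "priority": 3},
    {"category": "network", "ci": "sw02", "priority": 4},
    {"category": "email", "ci": "mail01", "priority": 1},
]
candidates = recurring_groups(sample)
```

Each group that comes back is a candidate Problem record. Note that the counting is only as good as the classification and Configuration data behind it, which is exactly where most organisations fall down.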
One of the things that I have said in other blogs is that we often try to be too smart when a simpler solution will do the job. The same is true of PM: don't overcomplicate it.