One of the most important things in troubleshooting is that you follow a system or workflow. You have to attack the problem methodically and analytically. There’s nothing worse than trying all kinds of things, suddenly realizing that it works, and not knowing what you did to get it working. It is great that it works again, but you still don’t know what caused the problem.

The approach

There are a lot of different approaches you can take.
My approach roughly looks like the list below. I don’t always do the steps in the same order.
  1. Take a deep breath, a cup of coffee or tea and manage the expectations
  2. Analyze the problem
  3. Verify that you still have the problem you described.
  4. Know when to call your vendor/integrator
  5. Do your research
  6. Create a sketch of the systems or components involved.
  7. Make sure your systems match the documentation or logfiles
  8. Check for common problems and solutions
  9. Call your vendor/integrator

I’m not going into too much detail on all of the steps above, but I hope it gives you enough structure for your troubleshooting journey. And don’t forget, depending on your specific situation you might jump to another step in the process. I can imagine that in a lot of cases

1. Take a deep breath, a cup of coffee or tea and manage the expectations

IMHO this is one of the most important things to do, if not the most important. Don’t panic, sit down and relax. When you start changing all kinds of things you will probably end up further away from a solution than ever. Another thing that you most definitely should NOT do is CYA, cover your ass. Don’t erase log files to hide your screw-up. In the end it will all come out anyway, and by deleting your tracks you make it hard, or even impossible, to find the root cause of your problem. Focus on solving the problem. Saving your job or career comes after that.

It is also imperative that you start managing expectations. Everybody wants their problem fixed as soon as possible, but troubleshooting takes time. People looking over your shoulder the whole time doesn’t help in fixing the problem either. Try to set up a communication process, for example ‘we have a call every 30 minutes’, depending on the problem. And when you say that you will have a call every 30 minutes, DO call. They are eager to know when they can get back to their work/process/business.

2. Analyze the problem

It is very important that you can describe your problem. “It doesn’t work” really doesn’t qualify as knowing what the problem is.

What is it that isn’t working? Are you unable to install something? Are you unable to reach a system? Is the system not behaving like it should? And how does that manifest itself? Be as specific as you can and list as many symptoms as you can. No ‘I think this or that happens’ or ‘someone somewhere said that it was broken’, and no filtering.

As part of describing the problem you can try to answer the Who, What, Why, When and Where questions, as well as a few others:

  1. What is the problem?
  2. Who is having the problem?
  3. When is the problem occurring?
  4. Where is the problem occurring?
  5. (if you can) Why is the problem occurring?
  6. Did it ever work in the first place?
  7. What changed since the time that the system or component was working?

3. Verify the problem still exists

Replicate the problem. Don’t trust another architect/consultant/engineer/<fill in the blanks>. Troubleshooting a problem that has disappeared, or relying on something you didn’t see yourself, is a waste of time. It is important to know whether the error still exists. If you don’t see any events in your event logs, try to generate an error or message that you know the program or system should report. Perhaps the problem was intermittent and has already been solved on another level, for example a network outage.
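Generating a message you know should be reported can be as simple as writing a uniquely tagged test event and checking that it arrives. Below is a minimal Python sketch of that idea. It assumes the application logs to a plain file through Python’s standard logging module; the function name, logger name and marker format are made up for illustration, so adapt them to whatever logging path your own system uses.

```python
import logging
import uuid
from pathlib import Path


def logging_path_works(log_file: Path) -> bool:
    """Write a uniquely tagged test event and check that it reaches the log file."""
    # Hypothetical marker format; the random part makes the event easy to find.
    marker = f"troubleshooting-probe-{uuid.uuid4().hex}"

    logger = logging.getLogger("probe")
    logger.setLevel(logging.ERROR)
    handler = logging.FileHandler(log_file, encoding="utf-8")
    logger.addHandler(handler)
    try:
        logger.error("Test event, safe to ignore: %s", marker)
    finally:
        # Flush and detach so the file is complete before we read it back.
        handler.close()
        logger.removeHandler(handler)

    return marker in log_file.read_text(encoding="utf-8")
```

If this returns False, the missing events in your log say more about the logging path than about the problem you are chasing.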

4. Know when to call your vendor/integrator

Depending on the priority of the problem this might be a good time to notify your vendor or integrator. At least you have thought about your problem and you described it in detail. You even know that the problem still exists.

I don’t mean to say that you need to call your vendor/integrator right away for every problem you have. Most of the time it is useful to do the legwork first before calling. You may solve the problem yourself. On the other hand there is the risk of wasting time. Often your vendor/integrator wants to do some checking themselves, as in the first paragraph of point 3.

5. Do your research

Suspect the components that were changed since the last known good state. I have done a lot of troubleshooting on systems where the problem was caused by a change performed earlier. After a rollback of the change ‘everything worked again’.

Once you have checked those off, it is time to check log files, systems and configurations.

  1. Check what the system is supposed to do and if it does that. It is especially important to know if the system or component was working, even before the incident.
  2. Check errors, event logs, log files
  3. Check the configuration based on your documentation (you do have documentation, don’t you?), but don’t change anything yet!!

Most of the time the logs will give you a hint where to look next. However, do consider that the absence of an error can also be a sign of a problem. Sometimes the log file lacks the details you need to troubleshoot. When that happens you can crank up the level of detail of the log messages.

If after reading the log files you still don’t have a clue what’s going on, then you should log a support case with the vendor of your product, like VMware. Don’t wait too long with this. If it looks like you will need help to solve it within a reasonable time, go create a support case. One of the things you can do to speed up the process is to create a support bundle beforehand, so you can add it to your support case.
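A quick first pass over a large log file can be scripted. The Python sketch below counts lines per severity and keeps the matching lines with their line numbers. The severity keywords and the sample lines are assumptions for illustration only; real products use their own formats, so adjust the pattern to what your logs actually contain.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to whatever your product actually writes.
SEVERITY = re.compile(r"\b(ERROR|WARN(?:ING)?|CRITICAL|FATAL)\b", re.IGNORECASE)


def summarize_log(lines):
    """Return (counts per severity, matching lines) for an iterable of log lines."""
    counts = Counter()
    hits = []
    for number, line in enumerate(lines, start=1):
        match = SEVERITY.search(line)
        if match:
            level = match.group(1).upper()
            if level == "WARN":
                level = "WARNING"  # treat both spellings as one severity
            counts[level] += 1
            hits.append((number, line.rstrip()))
    return counts, hits


# Made-up sample lines, not taken from any real system.
sample = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:05 warning disk usage at 91%",
    "2024-01-01 10:00:09 ERROR connection to database refused",
]
counts, hits = summarize_log(sample)
for number, line in hits:
    print(f"line {number}: {line}")
```

An empty result is not proof that all is well. It may just mean the component stopped logging or the log level is set too low.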

6. Create a sketch of the systems or components involved.

Take a piece of paper, or walk to a whiteboard, and start drawing the situation. Most of the time, sketching out your environment with the problem component in it will give you the usual suspects on where to troubleshoot. At the very least it gives you and everyone else involved in the troubleshooting process the same basic information on how the environment is configured.

I find it very helpful to explain it to somebody else. When talking to someone you have to structure your thoughts and create a logical timeline/story. If the other person also asks questions about the problem or environment you may gain insights you didn’t have before.

7. Make sure your systems match the documentation or logfiles

When you are done checking the systems and components you should have a list of things that don’t conform to your documentation. Go ahead and change those items (after you created a backup or noted the settings, of course).
  1. Change one item at a time.
  2. Document your change.
  3. If it doesn’t work, roll back to the state before the change.
  4. Replace components with known working components.
  5. ‘Break’ components on purpose so you know what outcome you can expect. USE WITH CAUTION!!
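The list of deviations can be produced mechanically if both the documented and the actual settings are available as key/value pairs. The Python sketch below shows one way to do that. The setting names and values are invented for illustration, and how you export the actual configuration depends entirely on your product.

```python
def find_drift(documented: dict, actual: dict) -> dict:
    """Map each differing key to a (documented, actual) pair.

    A key missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(documented.keys() | actual.keys()):
        expected = documented.get(key)
        found = actual.get(key)
        if expected != found:
            drift[key] = (expected, found)
    return drift


# Hypothetical settings, only here to show the shape of the result.
documented = {"ntp_server": "10.0.0.1", "mtu": 1500, "dns": "10.0.0.53"}
actual = {"ntp_server": "10.0.0.1", "mtu": 9000, "syslog": "10.0.0.20"}

for key, (expected, found) in find_drift(documented, actual).items():
    print(f"{key}: documented={expected!r} actual={found!r}")
```

Each entry in the result is a candidate for one change. Working through them one at a time, and re-running the comparison after each change, keeps the one-item-at-a-time rule honest.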

8. Check for common problems and solutions

Now that you know how the components are related to each other you can check if they are able to communicate. One of the things you want to check is ‘can I ping the server from the related component/server?’. Of course this only works if ping packets are allowed between client and server (verify before you draw any conclusions!)
Also search the internet for solutions to your problem. One of the most important places to check out is the knowledge base. There is a chance that other people already solved your problem. For VMware products you check of course.
And perhaps it might be something simple, like ‘reboot the server’.
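When ping is blocked, testing the TCP port that the service actually listens on is a useful alternative. The Python sketch below is not a ping replacement; it only checks whether a TCP connection can be opened. The function name and the default timeout are my own choices, and you need to know which port your service uses.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run it from the related component, not from your own workstation, for example tcp_reachable("server.example.com", 443) with your own host name and port. A False result only tells you that this path failed from this machine, not where along the path it failed.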

9. Call your vendor/integrator

By now you have checked everything, walked through your configuration and changed the things that didn’t conform to the documentation. If the problem still isn’t resolved you really need to call in the troops. If you haven’t done so in step 4, now is really the time to call your vendor/integrator.
Be aware that this also takes time. They need all the information you gathered, like a good problem description, logfiles, sketches, and so on. This is why I said earlier that in some cases it is important not to wait too long before creating a support ticket with your vendor/integrator.

Closing thoughts

Every problem is different. I have used the process outlined here often, not only when I have to do my own troubleshooting, but especially when I have to do troubleshooting with different teams. It helps to describe what everyone needs to do or check within his or her expertise. In case of a problem there’s no need for panic. Just make sure you do the right things at the right time. Keeping a clear head goes a long way toward fixing the problem.

If you want to know more about the art of troubleshooting you can also check “The art of troubleshooting”. It describes more approaches to troubleshooting in detail.