Oracle Troubleshooting
Version 11.2.0.3
 
Why a troubleshootings page?

Because I am up to my [...] in frustration right now working with a team that does not know how to troubleshoot a problem and that is not the least bit interested in listening to those that do. They have a RAC cluster with one node that restarts and immediately shoots itself in the head.

Is it an Oracle issue? It isn't. But they are not even remotely close to considering that it might not be. So staggering amounts of expensive time are being wasted. When the help desk reports that an application is slow the thing every DBA is advised to do is determine where in the technology stack the issue exists. Is it the database or is it perhaps an application/app server issue, network issue, storage layer, etc. Unfortunately this good advice is seemingly gone when there is a database outage. The sanity that prevails for "slow" is gone in this situation and it is the default assumption that the issue is the database.

So for those of you facing issues related to database and instance outages these notes are for you and all based on my experiences
 
Topic Discussion
Terms and Definitions
Term Definition
CSI Customer Service Identifier
MOS MyOracleSupport. The horrible, by improving, website that provides Oracle on-line support.
RDA Remote Diagnostic Agent
Sev Severity. An outage is Sev 1
SR Service Request
TAR Technical Assistance Request: The old name for a Service Request (SR)
My Oracle Support: Topic 1 Set a timer. If you can not get everything back online in less than 3 minutes then your first step is to open an SR with Oracle Support. If by some chance you fix the problem before they respond then close it: No harm done. If you have not then use Oracle Support to sanity check what you are thinking of doing. It is not uncommon for DBAs to make a bad situation far worse by doing the wrong thing. You want Oracle to agree with your proposed course of action and, in addition to helping you with your decision making, having them involved will keep your employer from making your life a living hell (see CYA).
My Oracle Support: Topic 2 Not all DBAs are of equal quality and not all support engineers with Oracle are either. If you have uploaded an RDA and a support engineer asks you for information that is in that RDA ask to speak to an escalation manager: Doing so is your right as a customer. Do not tolerate this "engineer" as they are either lazy or not sufficiently skilled to read what you uploaded.

If you have a support engineer that is asking you to run diagnostic test after diagnostic test, multiple system state dumps, etc. and it is not getting you anywhere ... ask to speak to an escalation manager.

The time you are wasting is your own.
Change Management If you do not have the ability to, within one or two minutes walk through changes made to your environment by system admins, network admins, storage admins, vendors, yourself, and other DBAs on your team your organization flunks by definition. Did a network engineer reconfigure the DNS server? You do not know. Did a storage admin patch the software on your SAN? You have no idea. Did a UNIX SA decide to fence CPUs? You are blind. Another member of your team change an init parameter? Perhaps. And exactly how perfect is anyone's memory about what they did yesterday much less on Friday of last week? Please don't make me laugh (or cry).

Change management is a serious topic and you need to implement change management practices that include not just "here's what I did" but also the actual keystrokes, spooled to a file, for later review.
RDA If you do not know what an RDA is know that I would never hire you as a DBA on any project and neither should anyone else. If you do and you do not have a current one ready to send to Oracle attached to an SR know that I'd have you in my office immediately after the outage for a serious discussion of your career path choices. If you have one ... then make sure it is updated every time you make a configuration change, zip it up, and have it ready to upload. An RDA on a big system can take hours to create. Trying to create one during an outage is a guaranteed non-starter.

Not having a current, RDA ready to use when opening an SR, should be a firing offense on any production system.
Database Alert Log If you have not reviewed your alert log every day prior to this outage you are starting with a negative rating. You have no idea what errors are unrelated to the current situation and which ones may be warnings prior to the current outage.

Find the alert log entries for the current outage and then review the previous 48 hours to see if there are any warnings that might be related. If you see a warning or error that might be related and you do not know what it is ... use google. You do not have time to mess around with MyOracleSupport. Copy those that might be related to an electronic scratchpad and hold them.
ASM Alert Log If you are using ASM then you have a separate Oracle instance that is the ASM home with its own alert log. If you do not know where it is then you need to learn that now and document it as you should be reading the alert log every day for warnings and errors.
Clusterware Logs If you have a RAC cluster, by definition, you have Oracle Clusterware installed which means you have log files for the nodeapps and you need to know where to find these and need to read them every day for warnings and errors.
O/S Logs /var/log/messages
Shell History Log .bash_history
Services If a user or application connects using a service, as they all should, and the service is not running the user or application will not be able to connect. If you do not know what services should be running, and with RAC on what nodes, then one of your tasks is to document this using the following SQL statement when everything is fine.
SELECT inst_id, name, goal, dtp, blocked, clb_goal
FROM gv$active_services
ORDER BY 2, 1;
Resource Manager Are you using Resource Management (DBMS_RESOURCE_MGR)?
 
Blocking Sessions
  SELECT (
  SELECT username
  FROM gv$session
  WHERE sid=a.sid) blocker,
  a.sid, ' is blocking ', (
  SELECT username
  FROM gv$session
  WHERE sid=b.sid) blockee,
  b.sid
FROM gv$lock a, gv$lock b
WHERE a.block = 1
AND b.request > 0
AND a.id1 = b.id1
AND a.id2 = b.id2;
  SELECT *
FROM dba_blockers;
 
Waiting Sessions
  SELECT *
FROM dba_waiters;
   
 
nxt
ORA-01555  
ORA-01333  
ORA-00600  
ORA-07455  
   
 
 
Morgan's Library Page Footer
This site is maintained by Dan Morgan. Last Updated: This site is protected by copyright and trademark laws under U.S. and International law. © 1998-2014 Daniel A. Morgan All Rights Reserved