先说说这个数据库的环境:
Oracle 9.2.0.4 RAC,只不过这个RAC只运行了一个节点,另一节点没有开启。
AIX 5.3 TL04
主机为p550,4CPU,16G内存
应用为部署在Weblogic下的WEB应用。

故障现象:
首先是客户端的操作没有响应,从weblogic上看连接数非常高,其日志里面不停报超出连接池的最大连接数。在主机上用sqlplus "/ as sysdba",在显示sqlplus的banner后,停止响应。

从故障现象来看,是数据库hang住了。

由于sqlplus不能操作,那么这个时候没办法通过oracle来dump system state。先看看操作系统里面,用topas命令观察,发现一个oracle进程占用了26%左右的CPU资源,IO等待几乎为0,可用的物理内存还比较多。根据那个占用CPU的进程号用ps命令查看,是一个普通的Server Process。

看来起这个进程陷入死循环了,26%的CPU资源正好是1个CPU(因为系统共4个CPU)。如果一个oracle进程拿到比较重要的资源,比如shared pool latch、library cache latch等,然后陷入了死循环(SPIN)后,其他进程没法解析SQL等,也就只有挂起了。

用kill命令杀掉那个进程,系统恢复正常,看来前面对故障的推断是正确的,不过没过几分钟,又出现了此故障现象。

只有找到oracle当时正在干什么,才能进行处理。用dbx来dump system state:

# dbx -a 446910
Waiting to attach to process 446910 ...
Successfully attached to oracle.
Type 'help' for help.
reading symbolic information ...
stopped in iosl.select at 0x9000000000c94d8 ($t2)
0x9000000000c94d8 (select+0xfffffffffff06318) e8410028 ld r2,0x28(r1)
(dbx) print ksudss(10)

Segmentation fault in slrac at 0x100083aa0 ($t2)
0x100083aa0 (slrac+0xe4) 88030000 lbz r0,0x0(r3)
(dbx) detach

分析trace文件:

Starting Systemstate 1
..............................................................................
......................
Ass.Awk Version 1.0.9 - Processing oa2_ora_446910.trc

System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer' seq=16329
3:
4: waiting for 'rdbms ipc message' seq=33158
5: waiting for 'ges remote message' seq=30917
6: waiting for 'gcs remote message' seq=52339
7: waiting for 'gcs remote message' seq=52407
8: waiting for 'rdbms ipc message' seq=32661
9: waiting for 'rdbms ipc message' seq=32504
10: waiting for 'rdbms ipc message' seq=19573
11: waiting for 'rdbms ipc message' seq=52125
12: waiting for 'latch free' [Latch 7000001491d2a40] seq=16415
13: waiting for 'rdbms ipc message' seq=270
14: waiting for 'rdbms ipc message' seq=3494
15: waiting for 'rdbms ipc message' seq=29593
16: waiting for 'latch free' [Latch 7000001491d2a40] seq=54430
17: waiting for 'rdbms ipc message' seq=52224
18: waiting for 'latch free' [Latch 7000001491d2a40] seq=23413
19: waiting for 'latch free' [Latch 7000001491d2a40] seq=49796
20: waiting for 'latch free' [Latch 7000001491d2a40] seq=20730
21: waiting for 'latch free' [Latch 7000001491d2a40] seq=12464
Cmd: Select
22: waiting for 'latch free' [Latch 7000001491d2a40] seq=34653
Cmd: Select
23: waiting for 'latch free' [Latch 7000001491d2850] seq=58618
24: waiting for 'latch free' [Latch 7000001491d2a40] seq=36619
25: waiting for 'latch free' [Latch 7000001491d2a40] seq=56192
26: waiting for 'latch free' [Latch 7000001491d2a40] seq=60361
27: waiting for 'latch free' [Latch 7000001491d2a40] seq=19963
28: waiting for 'latch free' [Latch 7000001491d2a40] seq=10464
29: waiting for 'latch free' [Latch 7000001491d2a40] seq=29548
30: waiting for 'latch free' [Latch 7000001491d2a40] seq=15077
31: waiting for 'latch free' [Latch 7000001491d2a40] seq=32472
32: waiting for 'latch free' [Latch 7000001491d2a40] seq=17466
33: waiting for 'latch free' [Latch 7000001491d2a40] seq=34712
34: waiting for 'latch free' [Latch 7000001491d2a40] seq=27146
35: waiting for 'latch free' [Latch 7000001491d2a40] seq=2911
36: waiting for 'latch free' [Latch 7000001491d2a40] seq=16634
Cmd: Select
37: waiting for 'latch free' [Latch 7000001491d2a40] seq=43400
38: waiting for 'latch free' [Latch 7000001491d2a40] seq=8435
39: waiting for 'latch free' [Latch 7000001491d2a40] seq=39939
40: waiting for 'latch free' [Latch 7000001491d2a40] seq=59619
41: waiting for 'latch free' [Latch 7000001491d2a40] seq=16259
42: waiting for 'latch free' [Latch 7000001491d2a40] seq=21150

43: waiting for 'latch free' [Latch 7000001491d2a40] seq=35843
44: waiting for 'latch free' [Latch 7000001491d2a40] seq=27141
45: waiting for 'latch free' [Latch 7000001491d2a40] seq=29400
46: waiting for 'latch free' [Latch 7000001491d2a40] seq=51444
47: waiting for 'latch free' [Latch 7000001491d2a40] seq=19142
48: waiting for 'latch free' [Latch 7000001491d2a40] seq=1031
49: waiting for 'latch free' [Latch 7000001491d2a40] seq=31852
50: waiting for 'latch free' [Latch 7000001491d2a40] seq=46957
51: waiting for 'latch free' [Latch 7000001491d2a40] seq=41609
52: waiting for 'latch free' [Latch 7000001491d2850] seq=57255
53: waiting for 'latch free' [Latch 7000001491d2a40] seq=34982
54: waiting for 'latch free' [Latch 7000001491d2a40] seq=26060
55: waiting for 'latch free' [Latch 7000001491d2a40] seq=16070
56: waiting for 'latch free' [Latch 7000001491d2a40] seq=63913
57: waiting for 'latch free' [Latch 7000001491d2a40] seq=2724
58: waiting for 'latch free' [Latch 7000001491d2a40] seq=42177
59: waiting for 'latch free' [Latch 7000001491d2a40] seq=19973
60: waiting for 'latch free' [Latch 7000001491d2a40] seq=47446
61: waiting for 'latch free' [Latch 7000001491d2a40] seq=42112
Cmd: Select
62: waiting for 'latch free' [Latch 7000001491d2a40] seq=927
63: waiting for 'latch free' [Latch 7000001491d2850] seq=3182
64: waiting for 'latch free' [Latch 7000001491d2a40] seq=2347
65: waiting for 'latch free' [Latch 7000001491d2a40] seq=18928
66: waiting for 'latch free' [Latch 7000001491d2a40] seq=1512
67: waiting for 'latch free' [Latch 7000001491d2a40] seq=53568
68: waiting for 'latch free' [Latch 7000001491d2a40] seq=20420
69: waiting for 'latch free' [Latch 7000001491d2a40] seq=30353
70: waiting for 'latch free' [Latch 7000001491d2a40] seq=36936
71: waiting for 'latch free' [Latch 7000001491d2a40] seq=50852
72: waiting for 'latch free' [Latch 7000001491d2a40] seq=8533
73: waiting for 'latch free' [Latch 7000001491d2a40] seq=31009
74: waiting for 'latch free' [Latch 7000001491d2a40] seq=15406
75: waiting for 'latch free' [Latch 7000001491d2a40] seq=4375
Cmd: Select
76: waiting for 'library cache load lock' seq=207
78: waiting for 'latch free' [Latch 7000001491d2a40] seq=6165
79: waiting for 'latch free' [Latch 7000001491d2a40] seq=22977
80: last wait for 'SQL*Net message from client'
81: waiting for 'latch free' [Latch 7000001491d2a40] seq=20226
82: waiting for 'latch free' [Latch 7000001491d2a40] seq=17323
Cmd: Select
83: waiting for 'latch free' [Latch 7000001491d2a40] seq=19680
84: waiting for 'latch free' [Latch 7000001491d2a40] seq=5997
85: waiting for 'latch free' [Latch 7000001491d2a40] seq=54166
Cmd: Select
86: waiting for 'latch free' [Latch 7000001491d2a40] seq=61874
87: waiting for 'latch free' [Latch 7000001491d2a40] seq=33285
88: waiting for 'latch free' [Latch 7000001491d2a40] seq=16784
89: last wait for 'latch free'
Cmd: Select
90: waiting for 'latch free' [Latch 7000001491d2a40] seq=63834
91: waiting for 'latch free' [Latch 7000001491d2a40] seq=42945
92: waiting for 'latch free' [Latch 7000001491d2a40] seq=3676
93: waiting for 'latch free' [Latch 7000001491d2a40] seq=59730
94: waiting for 'latch free' [Latch 7000001491d2a40] seq=42691
Cmd: Update
95: waiting for 'latch free' [Latch 7000001491d2a40] seq=41076
96: waiting for 'latch free' [Latch 7000001491d2a40] seq=23230
97: waiting for 'latch free' [Latch 7000001491d2a40] seq=64835
98: waiting for 'latch free' [Latch 7000001491d2a40] seq=45314
99: waiting for 'latch free' [Latch 7000001491d2a40] seq=65301
100:waiting for 'latch free' [Latch 7000001491d2a40] seq=51423
101:waiting for 'latch free' [Latch 7000001491d2a40] seq=25171
Blockers
~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.

Resource Holder State
Latch 7000001491d2a40 80: last wait for 'SQL*Net message from client'
Latch 7000001491d2a40 89: last wait for 'latch free'
Latch 7000001491d2850 26: 26: is waiting for 80: 89:

Object Names
~~~~~~~~~~~~
Latch 7000001491d2a40 Child library cache
Latch 7000001491d2850 Child library cache

很多进程都在等待library cache latch,而持有library cache latch的进程出现了异常,因而导致了数据库被hang住。对出现异常的进程进行分析,发现异常进程执行的SQL都是应用中的普通的SQL(并且后来出现异常的多个进程执行的SQL都不相同)。由于急需恢复应用,所以不得已先重启了数据库,再继续查找原因。

然而数据库重启只能使用1天,就会再次出现此类问题。
针对此故障,首先怀疑可能是BUG引起。但在确认BUG之前,需要收集足够多的证据。随后发现,这个数据库存在下面几个问题:

1、这个库的shared pool 设置很大,达到了4G。
2、应用没有使用绑定变量,全部都是拼接的SQL,并且没有使用任何存储过程,也就是应用的全部SQL都是硬解析。4G的shared pool可以在1天之内全部填满。
3、从alert日志文件里面看到,这个库经常因为ORA-04031错误崩溃。

看来这个库的问题比较严重。数据库hang住和ORA-04031应该都与硬解析过多有关。但目前改变应用是十分困难的事情。只好通过将cursor_sharing设置为similar,强制使用绑定变量。

将cursor_sharing设置为similiar之后,经过一段时间的观察,数据库运行正常,不再出现hang住的现象,hard parse从以前每秒数十次减少为每秒不到1次。shared pool的内存使用非常少,并且减少了shared_pool_size的设置。

不过cursor_sharing设置为similar有比较多的BUG,最常见的就是statspack的BUG和会有大量高version count的SQL。

对于OLTP系统来说,使用绑定变量,怎么强调都不过分。

Trackback

3 comments untill now

  1. […] oracle系统状态。在记一次Oracle数据库无响应(hang住)故障的处理一文中,就曾使用dbx做systemstate […]

  2. 学习了 谢谢

    [回复]

  3. in Oracle 9i & 10g & 11g, you can try this to login in the db if the db hang:

    sqlplus -prelim / as sysdba
    then do a hanganalyze

    [回复]

Add your comment now