我给《DBA 手记III》投的一篇稿子,名为《一次由隐含参数引起性能问题的处理》,这篇文章描述了一套关键的系统,由于”_kghdsidx_count”这个参数设置为1,导致了严重的性能问题。从故障现象上看是大量的library cache latch的等待,以及shared pool latch的等待,但是前者的等待时间比后者长得多。在文章中,我提到,在当时我推断,由于”_kghdsidx_count”这个隐含参数设置为1,导致shared pool只有1个subpool,引起了shared pool latch的严重竞争,进而引起了library cache cache的更为严重的竞争,这个竞争的过程如下:

  1. 由于”_kghdsidx_count”=1,使得shared pool latch只有1个child latch。而library cache latch的child latch数量跟CPU数量有关,最大值为67,编号为child #1-#67。
  2. 会话1持有shared pool latch。
  3. 会话2解析SQL语句,首先持有了library cache latch的child latch,假设这里为child #1,然后去请求shared pool latch。很显然,这时候被会话1持有,那么会话2就会等待shared pool latch。
  4. 会话3解析SQL语句,首先需要请求library cache latch,如果请求的library cache child latch刚好是#1,那么由于会话2持有了这个child latch,就会等待library cache latch。
  5. 因此,实际上会话1和会话2的shared pool latch的竞争引起了会话3的library cache latch的等待。如果并发数不是太高,那么shared pool latch的竞争看上去就会比library cache latch的竞争多一些。但是如果有几百个活动会话,这个时候,就会有大量的会话首先等待library cache latch,因为在解析SQL时是首先需要获取library cache latch再获取shared pool latch。由于大量的软解析,甚至不需要获取shared pool latch,同时一个大型的OLTP系统中,某几条相同的SQL并发执行的概率很高,这样会使很多会话同时请求同一library cache child latch;另外,在解析过程中,可能会多次请求library cache latch和shared pool latch,而前者请求和释放的次数会比后者多得多;这样大量的会话在获取library cache latch时处于等待状态,从现象上看就比shared pool latch的等待多得多。


而本文主要表达的是,怎么来验证在解析时,Oracle进程在持有了library cache latch的情况下去请求shared pool latch,而不是在请求shared pool时不需要持有library cache latch。

由于这个验证过程过于internal,所以没有在《DBA手记III》中描述出来。这里写出来,供有兴趣的朋友参考。

验证上面的这一点,有2个方法。下面以测试过程来详细描述。
测试环境 :Oracle 10.2.0.5.1 for Linux X86.

方法一:使用oradebug。

1. 将数据库的”_kghdsidx_count”参数值设为1,并重启数据库,以保证只有一个shared pool child latch。
2. 使用sqlplus连接到数据库,假设这个会话是session-1,查询当前的SID:

SQL> select sid from v$mystat where rownum=1;

       SID
----------
       159

同时获得当前连接的spid为2415。
3. 新建一个连接到数据库,假设会话是session-2,查询shared pool latch的地址,并使用oradebug将这个地址对应的值置为1,以表示该latch已经被持有:


SQL> select addr,latch#,level#,child#,name,gets  from v$latch_children where name='shared pool';

ADDR         LATCH#     LEVEL#     CHILD# NAME                                                     GETS
-------- ---------- ---------- ---------- -------------------------------------------------- ----------
200999BC        216          7          1 shared pool                                             34949
20099A20        216          7          2 shared pool                                                 6
20099A84        216          7          3 shared pool                                                 6
20099AE8        216          7          4 shared pool                                                 6
20099B4C        216          7          5 shared pool                                                 6
20099BB0        216          7          6 shared pool                                                 6
20099C14        216          7          7 shared pool                                                 6

SQL> oradebug poke 0x200999BC 4 1
BEFORE: [200999BC, 200999C0) = 00000000
AFTER:  [200999BC, 200999C0) = 00000001

4. 在session-1会话中执行下面的SQL:

SQL> select owner from t where object_id=3003;

正如预料之中的反映,这个会话hang住。

5. 在session-2中,对session-1的进程作process dump。(注意这个时候不能查询v$session_wait、v$latchholder等视图)

SQL> oradebug setospid 2415
Oracle pid: 15, Unix process pid: 2415, image: oracle@xty (TNS V1-V3)
SQL> oradebug dump processstate 10
Statement processed.
SQL> oradebug tracefile_name
/oracle/app/oracle/admin/xty/udump/xty_ora_2415.trc

然后从/oracle/app/oracle/admin/xty/udump/xty_ora_2415.trc这个TRACE文件中可以找到下面的信息:


Process global information:
     process: 0x51a2dce8, call: 0x51b465e4, xact: (nil), curses: 0x51b2a4b8, usrses: 0x51b2a4b8
  ----------------------------------------
  SO: 0x51a2dce8, type: 2, owner: (nil), flag: INIT/-/-/0x00
  (process) Oracle pid=15, calls cur/top: 0x51b465e4/0x51b465e4, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 110 0 4
              last post received-location: kslpsr
              last process to post me: 51a2a904 1 6
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 51a2a904 1 6
    (latch info) wait_event=0 bits=20
        Location from where call was made: kghupr1: Chunk Header
        Context saved from call: 1306572176
      waiting for 200999bc Child shared pool level=7 child#=1
        Location from where latch is held: kghupr1: Chunk Header
        Context saved from call: 1299065692
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           10 (54, 1304393092, 54)
           11 (54, 1304393092, 54)
           24 (15, 1304393092, 15)
           15 (6, 1304393092, 6)
           waiter count=4
          gotten 35138 times wait, failed first 502 sleeps 5
          gotten 0 times nowait, failed: 0
      on wait list for 200999bc
      holding    (efd=3) 4f7c75f8 Child library cache level=5 child#=1
        Location from where latch is held: kgllkdl: child: no lock handle: latch
        Context saved from call: 2
        state=busy, wlstate=free

从waiting for 200999bc Child shared pool level=7 child#=1和holding (efd=3) 4f7c75f8 Child library cache level=5 child#=1可以看到,进程的确是在持有了library cache latch的同时去请求 shared pool latch。

最后使用“oradebug poke 0×200999BC 4 0”命令将shared pool latch置为free状态。

方法二:使用gdb。
1. 还原”_kghdsidx_count”参数,重启数据库。
2. 使用sqlplus连接到数据库,获取对应的Oracle Process的spid为4763。
3. 使用gdb:

[oracle@xty ~]$ gdb -p 4763
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
...省略部分输出...
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_files.so.2
0x0059b402 in __kernel_vsyscall ()

4. 在gdb中,在latch的获取和释放的函数上设置断点:

(gdb) b kslgpl
Breakpoint 1 at 0x8339fca
(gdb) b kslgetsl
Breakpoint 2 at 0x833a2d5
(gdb) b kslgpl
Breakpoint 3 at 0x83473bf
(gdb) b kslg2c
Breakpoint 4 at 0x8347886
(gdb) b kslg2csl
Breakpoint 5 at 0x834794e
(gdb) b kslgprl
Breakpoint 6 at 0x8346186
(gdb)
Note: breakpoint 6 also set at pc 0x8346186.
Breakpoint 7 at 0x8346186
(gdb) b kslfre
Breakpoint 8 at 0x833c05d
(gdb) b kslf2c
Breakpoint 9 at 0x8347a1e
(gdb) b kslfpl
Breakpoint 10 at 0x8347812
(gdb) b kslfal
Breakpoint 11 at 0x833fbca
(gdb) b kslfpl0
Breakpoint 12 at 0x834766d
(gdb)
注:最后几个函数只是猜测跟latch有关....latch的请求和释放所调用的函数主要为:kslgetl、kslgetsl和kslfre。
(gdb) c
Continuing.

5. 在sqlplus中执行SQL:

SQL> select * from t where object_id=10000;

正如预期的那样,会话hang住。这是因为运行到了gdb设置的断点。

6.在gdb中可以看到:

Breakpoint 2, 0x0833a2d5 in kslgetsl ()
(gdb) info f
Stack level 0, frame at 0xbfd3fb08:
 eip = 0x833a2d5 in kslgetsl; saved eip 0x84257e1
 called by frame at 0xbfd3fb44
 Arglist at 0xbfd3fb00, args:
 Locals at 0xbfd3fb00, Previous frame's sp is 0xbfd3fb08
 Saved registers:
  ebp at 0xbfd3fb00, eip at 0xbfd3fb04
(gdb) x /10x 0xbfd3fb00
0xbfd3fb00:     0xbfd3fb3c      0x084257e1      0x51b4c38c      0x00000001
0xbfd3fb10:     0x51b11738      0x00000058      0x00000008      0x00000002
0xbfd3fb20:     0x0d099c40      0x0d099d3c

0×51b4c38c这是一个地址,从v$latch和v$latch_children中可以查到正是是 session idle bit latch。

6.在gdb中继续:

(gdb) c
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb) info f
Stack level 0, frame at 0xbfd3fb18:
 eip = 0x833c05d in kslfre; saved eip 0x842581a
 called by frame at 0xbfd3fb44
 Arglist at 0xbfd3fb10, args:
 Locals at 0xbfd3fb10, Previous frame's sp is 0xbfd3fb18
 Saved registers:
  ebp at 0xbfd3fb10, eip at 0xbfd3fb14
(gdb) x /10x 0xbfd3fb10
0xbfd3fb10:     0xbfd3fb3c      0x0842581a      0x51b4c38c      0x00000002
0xbfd3fb20:     0x0d099c40      0x0d099d3c      0x0d099c40      0x00000002
0xbfd3fb30:     0x00000000      0x517a7090
(gdb) c
Continuing.
----下面很重要-----
Breakpoint 1, 0x08339fca in kslgetl ()
(gdb) x /10x $ebp+8
0xbfd3ecc0:     0x4f7c7594      0x00000001      0x00000001      0x00000a57
0xbfd3ecd0:     0x0af275ca      0x00000000      0x0dcfdcc0      0x0d045860
0xbfd3ece0:     0xbfd3ed58      0x0b20fc09
(gdb) c
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb) x /10x $ebp+8
0xbfd3ed64:     0x4f7c7594      0x00000000      0x10010040      0x0d045860
0xbfd3ed74:     0xbfd3ef10      0x0b1f5d10      0x0d099c40      0x4f7c7594
0xbfd3ed84:     0x00000000      0x00000008
(gdb) c
Continuing.

Breakpoint 1, 0x08339fca in kslgetl ()
(gdb) x /10x $ebp+8
0xbfd3ecc0:     0x4f7c7594      0x00000001      0x00000001      0x00000a57
0xbfd3ecd0:     0x0af275ca      0x00000000      0x0dcfdcc0      0x0d045860
0xbfd3ece0:     0xbfd3ed58      0x0b20fc09
(gdb) c
Continuing.

Breakpoint 1, 0x08339fca in kslgetl ()
(gdb)  x /10x $ebp+8
0xbfd3eb50:     0x200999bc      0x00000001      0x00000000      0x000009de
0xbfd3eb60:     0x0d099c40      0x0000000f      0x0c3c3ec0      0x0d045860
0xbfd3eb70:     0xbfd3ec10      0x0af3983b
(gdb) c
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb)  x /10x $ebp+8
0xbfd3eb6c:     0x200999bc      0x0d099c40      0x2001d310      0x0d045860
0xbfd3eb7c:     0xbfd3ec10      0x0af390a8      0x0d099c40      0x200999bc
0xbfd3eb8c:     0xbfd3ebe8      0x00000164
(gdb) c
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb)  x /10x $ebp+8
0xbfd3ed64:     0x4f7c7594      0x00000000      0x10010040      0x0d045860
0xbfd3ed74:     0xbfd3ef10      0x0b1f58e6      0x0d099c40      0x4f7c7594
0xbfd3ed84:     0x00000000      0x00000008
(gdb) quit
A debugging session is active.

        Inferior 1 [process 4763] will be detached.

同时可以从数据库中查到:

SQL> select addr,latch#,child#,level#,name from v$latch_children
where name in ('library cache','shared pool');
ADDR         LATCH#     CHILD#     LEVEL# NAME
-------- ---------- ---------- ---------- --------------------------------------------------
4F7C75F8        217          1          5 library cache
4F7C7594        217          2          5 library cache
4F7C7530        217          3          5 library cache
200999BC        216          1          7 shared pool
20099A20        216          2          7 shared pool
20099A84        216          3          7 shared pool
20099AE8        216          4          7 shared pool
20099B4C        216          5          7 shared pool
20099BB0        216          6          7 shared pool
20099C14        216          7          7 shared pool

标记为”下面很重要“的输出,汇总起来就是:

kslgetl  0x4f7c7594  library cache
kslfre   0x4f7c7594  library cache
kslgetl  0x4f7c7594  library cache
kslgetl  0x200999bc  shared pool
kslfre   0x200999bc  shared pool
kslfre   0x4f7c7594  library cache

这个过程可以理解为:Oracle进程首先获取library cache latch,在library cache中查找是否已经有要执行的SQL语句(这里显然没有找到),然后释放library cache。接下来是硬解析SQL的后半部分,首先获取library cache latch,然后获取shared pool latch,从shared pool中申请内存,然后释放shared pool latch,最后再获取library cache latch。

后面不再继续跟踪解析的过程,这个过程会有相当多的latch的获取和释放,但是从上面可以很清晰地看到,进程在持有了library cache latch的同时去请求shared pool latch。

至于这个案例,大家会在《DBA 手记III》上看到完整的介绍。