客户的一套重要生产系统,出现了性能问题。这个问题涉及的信息如下:

  • 数据库主机的CPU利用率长期在100%左右。
  • 应用系统在晚上进行调整后,开始出现了问题。
  • 数据库中出现大量的latch: shared pool、latch: library cache、cursor: pin S、latch: cache buffers chains和latch: cache buffers lru chain等各种等待。
  • 由于系统较大,整个系统实际上被分成了2个部分,每个部分服务不同的用户。二者的应用基本相同,分别对应2套数据库,二者也同样在出问题之前进行了调整,而数据库的主机配置及参数也是相同的,同时均为10.2.0.5双节点RAC数据库。但是只有其中1套库有性能问题。
  • 2套库的配置虽然没有差别,但是负载方式却有很大的区别,正常的那套库,2个节点的负载基本上是均衡的,而现在有性能问题的这套库,所有的负载基本上全部在第1个节点上(虽然已经多次要求开发商整改,不幸的是...这里不用说了)。

下面是AWR报告中的数据:

性能正常时间段的数据(采集时间2小时):

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:      3408 19-Sep-11 09:00:21     4,690      39.5
  End Snap:      3412 19-Sep-11 11:00:07     4,950      38.7
   Elapsed:              119.76 (mins)
   DB Time:            2,900.95 (mins)

Cache Sizes
~~~~~~~~~~~                       Begin        End
                             ---------- ----------
               Buffer Cache:    35,840M    35,840M  Std Block Size:         8K
           Shared Pool Size:     6,144M     6,144M      Log Buffer:    30,632K

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:          1,968,314.84              6,360.04
              Logical reads:          1,038,182.89              3,354.59
              Block changes:            148,577.35                480.09
             Physical reads:              3,541.98                 11.44
            Physical writes:                546.31                  1.77
                 User calls:              7,811.87                 25.24
                     Parses:              3,187.30                 10.30
                Hard parses:                 10.07                  0.03
                      Sorts:              6,258.90                 20.22
                     Logons:                  9.65                  0.03
                   Executes:             24,100.52                 77.87
               Transactions:                309.48
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.69    In-memory Sort %:  100.00
            Library Hit   %:   99.36        Soft Parse %:   99.68
         Execute to Parse %:   86.77         Latch Hit %:   99.64
Parse CPU to Parse Elapsd %:   45.61     % Non-Parse CPU:   99.46

 Shared Pool Statistics        Begin    End
                              ------  ------
             Memory Usage %:   95.44   94.46
    % SQL with executions>1:   89.10   77.18
  % Memory for SQL w/exec>1:   87.82   76.62

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                        114,107          65.6
db file sequential read           8,232,067      40,956      5   23.5   User I/O
db file scattered read              929,509       2,392      3    1.4   User I/O
log file sync                     2,119,459       2,372      1    1.4     Commit
SQL*Net message from dblink       1,375,619       1,576      1    0.9    Network
          -------------------------------------------------------------
^LWait Events                         DB/Inst: CRM2DB/crm2db1  Snaps: 3408-3412
-> s  - second
-> cs - centisecond -     100th of a second
-> ms - millisecond -    1000th of a second
-> us - microsecond - 1000000th of a second
-> ordered by wait time desc, waits desc (idle events last)
-> %Timeouts: value of 0 indicates value was <  .5%.  Value of null is truly 0

                                                                  Avg
                                            %Time  Total Wait    wait     Waits
Event                                 Waits -outs    Time (s)    (ms)      /txn
---------------------------- -------------- ----- ----------- ------- ---------
db file sequential read           8,232,067   N/A      40,956       5       3.7
db file scattered read              929,509   N/A       2,392       3       0.4
log file sync                     2,119,459     0       2,372       1       1.0
SQL*Net message from dblink       1,375,619   N/A       1,576       1       0.6
SQL*Net more data from clien      1,871,442   N/A       1,428       1       0.8
control file sequential read      3,010,438   N/A       1,179       0       1.4
Backup: sbtbackup                        12   N/A       1,167   97243       0.0
gc cr grant 2-way                 3,688,572   N/A         993       0       1.7
log file parallel write           2,205,552   N/A         974       0       1.0
gc cr multi block request         2,378,748   N/A         721       0       1.1
gc current block 2-way            1,941,104   N/A         693       0       0.9
SQL*Net more data to client       5,142,664   N/A         498       0       2.3
gc buffer busy                      166,717     0         428       3       0.1
direct path read                    424,645   N/A         344       1       0.2
db file parallel read                61,816   N/A         298       5       0.0
enq: HW - contention                 17,487     0         216      12       0.0
db file parallel write              221,865   N/A         210       1       0.1
Backup: sbtwrite2                    36,271   N/A         145       4       0.0
gc current grant 2-way              445,456   N/A         115       0       0.2
enq: TM - contention                    255    74         102     400       0.0
                         

在应用调整第2天上午的awr数据(采集时长1小时):

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:          2,420,236.58              6,734.48
              Logical reads:          1,277,888.60              3,555.82
              Block changes:            164,638.57                458.12
             Physical reads:              1,208.59                  3.36
            Physical writes:                456.79                  1.27
                 User calls:              8,785.42                 24.45
                     Parses:              4,103.35                 11.42
                Hard parses:                 12.87                  0.04
                      Sorts:              8,052.30                 22.41
                     Logons:                  9.05                  0.03
                   Executes:             32,573.41                 90.64
               Transactions:                359.38

  % Blocks changed per Read:   12.88    Recursive Call %:    85.01
 Rollback per transaction %:    0.19       Rows per Sort:     8.74

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.97       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.92    In-memory Sort %:  100.00
            Library Hit   %:   99.92        Soft Parse %:   99.69
         Execute to Parse %:   87.40         Latch Hit %:   99.33
Parse CPU to Parse Elapsd %:   29.94     % Non-Parse CPU:   99.17

 Shared Pool Statistics        Begin    End
                              ------  ------
             Memory Usage %:   95.42   95.02
    % SQL with executions>1:   61.41   72.38
  % Memory for SQL w/exec>1:   60.04   73.07

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                         69,736          56.2
Backup: sbtwrite2                 2,319,231      22,510     10   18.1 Administra
db file sequential read           3,476,830      20,923      6   16.9   User I/O
latch: shared pool                  159,416       9,280     58    7.5 Concurrenc
Backup: sbtbackup                        28       2,746  98056    2.2 Administra
          -------------------------------------------------------------
^LWait Events                         DB/Inst: CRM2DB/crm2db1  Snaps: 3602-3604
-> s  - second
-> cs - centisecond -     100th of a second
-> ms - millisecond -    1000th of a second
-> us - microsecond - 1000000th of a second
-> ordered by wait time desc, waits desc (idle events last)
-> %Timeouts: value of 0 indicates value was <  .5%.  Value of null is truly 0

                                                                  Avg
                                            %Time  Total Wait    wait     Waits
Event                                 Waits -outs    Time (s)    (ms)      /txn
---------------------------- -------------- ----- ----------- ------- ---------
Backup: sbtwrite2                 2,319,231   N/A      22,510      10       1.8
db file sequential read           3,476,830   N/A      20,923       6       2.7
latch: shared pool                  159,416   N/A       9,280      58       0.1
Backup: sbtbackup                        28   N/A       2,746   98056       0.0
log file sync                     1,283,319     0       2,538       2       1.0
gc buffer busy                    1,088,432     0       2,324       2       0.8
latch: library cache                 31,278   N/A       1,328      42       0.0
control file sequential read      1,512,218   N/A       1,082       1       1.2
log file parallel write           1,157,491   N/A       1,045       1       0.9
SQL*Net message from dblink         952,956   N/A         987       1       0.7
SQL*Net more data from clien        962,454   N/A         837       1       0.7
kst: async disk IO                   17,574   N/A         483      27       0.0
gc cr grant 2-way                 1,539,289   N/A         439       0       1.2
gc current block 2-way              946,330   N/A         353       0       0.7
direct path read                    291,853   N/A         313       1       0.2
db file parallel write              261,735   N/A         298       1       0.2
read by other session                92,451     0         270       3       0.1
cursor: pin S                   304,676,088   N/A         233       0     235.1
db file scattered read               28,062   N/A         132       5       0.0
enq: TX - row lock contentio            466    52         123     263       0.0
direct path write                   227,989   N/A         113       0       0.2
          

从Load profile数据对比来看,应用调整后系统负载有一定的提高,同时每事务逻辑读也有一定的增加(不足10%)。这会是个问题吗?
从第2份数据来看,latch: shared pool 和 latch: library cache有明显的增加。

下面的是性能问题更加严重的时候的AWR数据(采集时长为2小时):

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:      3744 26-Sep-11 09:00:01     4,504      38.2
  End Snap:      3748 26-Sep-11 11:00:26     4,916      52.7
   Elapsed:              120.41 (mins)
   DB Time:           21,868.27 (mins)

Cache Sizes
~~~~~~~~~~~                       Begin        End
                             ---------- ----------
               Buffer Cache:    35,840M    35,840M  Std Block Size:         8K
           Shared Pool Size:     6,144M     6,144M      Log Buffer:    30,632K

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:          1,790,430.57              5,945.33
              Logical reads:          1,162,875.15              3,861.46
              Block changes:            154,064.15                511.59
             Physical reads:              3,007.80                  9.99
            Physical writes:                337.21                  1.12
                 User calls:              8,227.28                 27.32
                     Parses:              3,603.24                 11.96
                Hard parses:                 11.25                  0.04
                      Sorts:              7,304.40                 24.26
                     Logons:                 11.27                  0.04
                   Executes:             25,994.53                 86.32
               Transactions:                301.15

  % Blocks changed per Read:   13.25    Recursive Call %:    82.87
 Rollback per transaction %:    0.21       Rows per Sort:     8.14

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.99       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.75    In-memory Sort %:  100.00
            Library Hit   %:   99.94        Soft Parse %:   99.69
         Execute to Parse %:   86.14         Latch Hit %:   98.90
Parse CPU to Parse Elapsd %:   14.69     % Non-Parse CPU:   84.51

 Shared Pool Statistics        Begin    End
                              ------  ------
             Memory Usage %:   95.48   94.33
    % SQL with executions>1:   87.63   74.92
  % Memory for SQL w/exec>1:   86.52   74.66

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
latch: shared pool                  991,918     274,715    277   20.9 Concurrenc
latch: library cache                941,443     232,344    247   17.7 Concurrenc
CPU time                                        212,150          16.2
cursor: pin S                  ############     115,049      0    8.8      Other
latch: cache buffers lru chain    1,106,570      96,442     87    7.4      Other
          -------------------------------------------------------------
^LWait Events                         DB/Inst: CRM2DB/crm2db1  Snaps: 3744-3748
-> s  - second
-> cs - centisecond -     100th of a second
-> ms - millisecond -    1000th of a second
-> us - microsecond - 1000000th of a second
-> ordered by wait time desc, waits desc (idle events last)
-> %Timeouts: value of 0 indicates value was < .5%.  Value of null is truly 0

                                                                  Avg
                                            %Time  Total Wait    wait     Waits
Event                                 Waits -outs    Time (s)    (ms)      /txn
---------------------------- -------------- ----- ----------- ------- ---------
latch: shared pool                  991,918   N/A     274,715     277       0.5
latch: library cache                941,443   N/A     232,344     247       0.4
cursor: pin S                 9,095,470,035   N/A     115,049       0   4,180.5
latch: cache buffers lru cha      1,106,570   N/A      96,442      87       0.5
db file sequential read           7,819,285   N/A      49,564       6       3.6
latch: cache buffers chains         511,936   N/A      35,753      70       0.2
latch free                           95,257     0      14,609     153       0.0
latch: object queue header o        221,907   N/A      14,378      65       0.1
log file sync                     2,156,343     1      11,496       5       1.0
cursor: pin S wait on X             422,320   100       4,603      11       0.2
gc cr grant 2-way                 3,526,856     0       3,430       1       1.6
db file scattered read              624,480   N/A       3,083       5       0.3
enq: TX - row lock contentio          5,946    88       2,600     437       0.0
latch: row cache objects             16,983   N/A       2,281     134       0.0
SQL*Net message from dblink       1,284,487   N/A       2,179       2       0.6
gc buffer busy                      353,228     0       2,054       6       0.2
gc cr multi block request         2,499,709     0       1,974       1       1.1
gc current block 2-way            1,969,516     0       1,920       1       0.9
SQL*Net more data from clien      1,667,513   N/A       1,914       1       0.8
control file sequential read      2,974,349   N/A       1,735       1       1.4
db file parallel read               230,795   N/A       1,645       7       0.1
log file parallel write           1,581,432   N/A       1,303       1       0.7
Backup: sbtbackup                        14   N/A         730   52120       0.0
buffer busy waits                   143,869     0         488       3       0.1
latch: library cache lock             5,058   N/A         378      75       0.0
enq: HW - contention                 22,039     0         370      17       0.0
gc current grant 2-way              424,243   N/A         357       1       0.2
enq: SQ - contention                  5,010     1         354      71       0.0
read by other session                50,153     0         339       7       0.0
direct path read                    324,226   N/A         305       1       0.1
db file parallel write              207,793   N/A         274       1       0.1
row cache lock                      175,301     0         263       1       0.1
          

从第3份数据来看,性能更为恶化,情形更为复杂。

那么针对这个性能现象,可以提出如下的问题:

  • 最直观的,最容易想到的就是,性能问题的出现是否与应用调整有关,如果是,为什么另一套库没有出现问题?会不会是另一套库的负载在2个节点都有相对均衡的负担,而出问题的库,负载全部集中在1个节点上引起?
  • 客户是在应用调整几天后才报告性能问题,这个问题会不会是一个逐渐出现的问题?如果一开始就有严重的性能问题,那么应该会很快报告。不过中间跨了一个周末,所以对于”逐渐出现问题“的判断又加了一些难度。
  • 这么多的性能问题相关的现象中,哪个更贴近问题的原因?实际上在主机的CPU长期保持在100%左右时,会放大平时的一些轻微的问题,或者引起另一些问题。从等待来看,latch: cache buffers chains和cursor: pin S都会引起CPU的急剧增加,而其他的latch竞争同样也会引起CPU的上升,虽然上升没有前者大。到底是SQL执行效率不高引起CPU急剧增加然后引起了shared pool latch等与解析相关的资源争用还是因为解析相关的问题导致CPU急剧增加引起了cache buffers chains等与SQL执行相关的latch争用?或者是2者的共同作用?

下面首先来分析,会不会是应用调整引起的问题,也就是说,是不是由SQL引起的问题。如果是SQL引起的问题,会有几种可能,1是部分频繁执行的SQL执行效率变差;2是增加了很多的硬解析;3是并发增加或者说是SQL的执行次数有上升。仔细检查对比现在和以前的AWR数据,可以排除第1和2这种可能。至于第3种,可能略有增加,但是不一定跟应用调整有关,毕竟正常情况下业务量的变化也是有可能的。所以应用调整引起问题的可能性较小。

会不会是因为负载过大引起的问题,这个很容易验证。取业务低谷期,比如下班时间的数据,进行分析可以发现,既然在CPU在稍低(75%以下)的时候,仍然有大量的library cache latch的争用。下面是取自周日19:00-20:00的AWR数据(注:本文涉及故障的分析时间是在星期一):

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                         49,224          48.3
latch: shared pool                  185,951      41,178    221   40.4 Concurrenc
latch: library cache                 45,636       8,861    194    8.7 Concurrenc
db file sequential read             815,066       4,763      6    4.7   User I/O
cursor: pin S                  ############       1,732      0    1.7      Other
          -------------------------------------------------------------

                                                                  Avg
                                            %Time  Total Wait    wait     Waits
Event                                 Waits -outs    Time (s)    (ms)      /txn
---------------------------- -------------- ----- ----------- ------- ---------
latch: shared pool                  185,951   N/A      41,178     221       0.6
latch: library cache                 45,636   N/A       8,861     194       0.2
db file sequential read             815,066   N/A       4,763       6       2.8
cursor: pin S                 1,890,998,700   N/A       1,732       0   6,459.2
latch free                            8,549     0       1,278     150       0.0
control file sequential read      1,507,789   N/A         466       0       5.2
Backup: sbtbackup                         7   N/A         428   61096       0.0
log file sync                       285,442     0         309       1       1.0
SQL*Net more data from clien        145,137   N/A         221       2       0.5
db file scattered read               67,716   N/A         173       3       0.2
gc buffer busy                       56,842     0         135       2       0.2
gc cr grant 2-way                   359,146   N/A          99       0       1.2
SQL*Net message from dblink         123,206   N/A          98       1       0.4
log file parallel write             289,048   N/A          91       0       1.0
SQL*Net more data to client       1,242,471   N/A          79       0       4.2
gc current block 2-way              179,654   N/A          66       0       0.6
direct path read                     62,336   N/A          65       1       0.2
gc cr multi block request           228,693   N/A          52       0       0.8
          

至于性能问题是不是逐渐引起的,这个问题目前并不关键,暂时不考虑。只是做为后面分析问题的一个参考。

Read the rest of this entry

,