十
03
客户的一套重要生产系统,出现了性能问题。这个问题涉及的信息如下:
- 数据库主机的CPU利用率长期在100%左右。
- 应用系统在晚上进行调整后,开始出现了问题。
- 数据库中出现大量的latch: shared pool、latch: library cache、cursor: pin S、latch: cache buffers chains和latch: cache buffers lru chain等各种等待。
- 由于系统较大,整个系统实际上被分成了2个部分,每个部分服务不同的用户。二者的应用基本相同,分别对应2套数据库,二者也同样在出问题之前进行了调整,而数据库的主机配置及参数也是相同的,同时均为10.2.0.5双节点RAC数据库。但是只有其中1套库有性能问题。
- 2套库的配置虽然没有差别,但是负载方式却有很大的区别,正常的那套库,2个节点的负载基本上是均衡的,而现在有性能问题的这套库,所有的负载基本上全部在第1个节点上(虽然已经多次要求开发商整改,不幸的是...这里不用说了)。
下面是AWR报告中的数据:
性能正常时间段的数据(采集时间2小时):
Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------- Begin Snap: 3408 19-Sep-11 09:00:21 4,690 39.5 End Snap: 3412 19-Sep-11 11:00:07 4,950 38.7 Elapsed: 119.76 (mins) DB Time: 2,900.95 (mins) Cache Sizes ~~~~~~~~~~~ Begin End ---------- ---------- Buffer Cache: 35,840M 35,840M Std Block Size: 8K Shared Pool Size: 6,144M 6,144M Log Buffer: 30,632K Load Profile ~~~~~~~~~~~~ Per Second Per Transaction --------------- --------------- Redo size: 1,968,314.84 6,360.04 Logical reads: 1,038,182.89 3,354.59 Block changes: 148,577.35 480.09 Physical reads: 3,541.98 11.44 Physical writes: 546.31 1.77 User calls: 7,811.87 25.24 Parses: 3,187.30 10.30 Hard parses: 10.07 0.03 Sorts: 6,258.90 20.22 Logons: 9.65 0.03 Executes: 24,100.52 77.87 Transactions: 309.48 Instance Efficiency Percentages (Target 100%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Buffer Nowait %: 100.00 Redo NoWait %: 100.00 Buffer Hit %: 99.69 In-memory Sort %: 100.00 Library Hit %: 99.36 Soft Parse %: 99.68 Execute to Parse %: 86.77 Latch Hit %: 99.64 Parse CPU to Parse Elapsd %: 45.61 % Non-Parse CPU: 99.46 Shared Pool Statistics Begin End ------ ------ Memory Usage %: 95.44 94.46 % SQL with executions>1: 89.10 77.18 % Memory for SQL w/exec>1: 87.82 76.62 Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- CPU time 114,107 65.6 db file sequential read 8,232,067 40,956 5 23.5 User I/O db file scattered read 929,509 2,392 3 1.4 User I/O log file sync 2,119,459 2,372 1 1.4 Commit SQL*Net message from dblink 1,375,619 1,576 1 0.9 Network ------------------------------------------------------------- ^LWait Events DB/Inst: CRM2DB/crm2db1 Snaps: 3408-3412 -> s - second -> cs - centisecond - 100th of a second -> ms - millisecond - 1000th of a second -> us - microsecond - 1000000th of a second -> ordered by wait time desc, waits desc (idle events last) -> %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0 Avg %Time Total Wait wait Waits Event Waits -outs Time (s) (ms) /txn ---------------------------- -------------- ----- ----------- ------- --------- db file sequential read 8,232,067 N/A 40,956 5 3.7 db file scattered read 929,509 N/A 2,392 3 0.4 log file sync 2,119,459 0 2,372 1 1.0 SQL*Net message from dblink 1,375,619 N/A 1,576 1 0.6 SQL*Net more data from clien 1,871,442 N/A 1,428 1 0.8 control file sequential read 3,010,438 N/A 1,179 0 1.4 Backup: sbtbackup 12 N/A 1,167 97243 0.0 gc cr grant 2-way 3,688,572 N/A 993 0 1.7 log file parallel write 2,205,552 N/A 974 0 1.0 gc cr multi block request 2,378,748 N/A 721 0 1.1 gc current block 2-way 1,941,104 N/A 693 0 0.9 SQL*Net more data to client 5,142,664 N/A 498 0 2.3 gc buffer busy 166,717 0 428 3 0.1 direct path read 424,645 N/A 344 1 0.2 db file parallel read 61,816 N/A 298 5 0.0 enq: HW - contention 17,487 0 216 12 0.0 db file parallel write 221,865 N/A 210 1 0.1 Backup: sbtwrite2 36,271 N/A 145 4 0.0 gc current grant 2-way 445,456 N/A 115 0 0.2 enq: TM - contention 255 74 102 400 0.0
在应用调整第2天上午的awr数据(采集时长1小时):
Load Profile ~~~~~~~~~~~~ Per Second Per Transaction --------------- --------------- Redo size: 2,420,236.58 6,734.48 Logical reads: 1,277,888.60 3,555.82 Block changes: 164,638.57 458.12 Physical reads: 1,208.59 3.36 Physical writes: 456.79 1.27 User calls: 8,785.42 24.45 Parses: 4,103.35 11.42 Hard parses: 12.87 0.04 Sorts: 8,052.30 22.41 Logons: 9.05 0.03 Executes: 32,573.41 90.64 Transactions: 359.38 % Blocks changed per Read: 12.88 Recursive Call %: 85.01 Rollback per transaction %: 0.19 Rows per Sort: 8.74 Instance Efficiency Percentages (Target 100%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Buffer Nowait %: 99.97 Redo NoWait %: 100.00 Buffer Hit %: 99.92 In-memory Sort %: 100.00 Library Hit %: 99.92 Soft Parse %: 99.69 Execute to Parse %: 87.40 Latch Hit %: 99.33 Parse CPU to Parse Elapsd %: 29.94 % Non-Parse CPU: 99.17 Shared Pool Statistics Begin End ------ ------ Memory Usage %: 95.42 95.02 % SQL with executions>1: 61.41 72.38 % Memory for SQL w/exec>1: 60.04 73.07 Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- CPU time 69,736 56.2 Backup: sbtwrite2 2,319,231 22,510 10 18.1 Administra db file sequential read 3,476,830 20,923 6 16.9 User I/O latch: shared pool 159,416 9,280 58 7.5 Concurrenc Backup: sbtbackup 28 2,746 98056 2.2 Administra ------------------------------------------------------------- ^LWait Events DB/Inst: CRM2DB/crm2db1 Snaps: 3602-3604 -> s - second -> cs - centisecond - 100th of a second -> ms - millisecond - 1000th of a second -> us - microsecond - 1000000th of a second -> ordered by wait time desc, waits desc (idle events last) -> %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0 Avg %Time Total Wait wait Waits Event Waits -outs Time (s) (ms) /txn ---------------------------- -------------- ----- ----------- ------- --------- Backup: sbtwrite2 2,319,231 N/A 22,510 10 1.8 db file sequential read 3,476,830 N/A 20,923 6 2.7 latch: shared pool 159,416 N/A 9,280 58 0.1 Backup: sbtbackup 28 N/A 2,746 98056 0.0 log file sync 1,283,319 0 2,538 2 1.0 gc buffer busy 1,088,432 0 2,324 2 0.8 latch: library cache 31,278 N/A 1,328 42 0.0 control file sequential read 1,512,218 N/A 1,082 1 1.2 log file parallel write 1,157,491 N/A 1,045 1 0.9 SQL*Net message from dblink 952,956 N/A 987 1 0.7 SQL*Net more data from clien 962,454 N/A 837 1 0.7 kst: async disk IO 17,574 N/A 483 27 0.0 gc cr grant 2-way 1,539,289 N/A 439 0 1.2 gc current block 2-way 946,330 N/A 353 0 0.7 direct path read 291,853 N/A 313 1 0.2 db file parallel write 261,735 N/A 298 1 0.2 read by other session 92,451 0 270 3 0.1 cursor: pin S 304,676,088 N/A 233 0 235.1 db file scattered read 28,062 N/A 132 5 0.0 enq: TX - row lock contentio 466 52 123 263 0.0 direct path write 227,989 N/A 113 0 0.2
从Load profile数据对比来看,应用调整后系统负载有一定的提高,同时每事务逻辑读也有一定的增加(不足10%)。这会是个问题吗?
从第2份数据来看,latch: shared pool 和 latch: library cache有明显的增加。
下面的是性能问题更加严重的时候的AWR数据(采集时长为2小时):
Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------- Begin Snap: 3744 26-Sep-11 09:00:01 4,504 38.2 End Snap: 3748 26-Sep-11 11:00:26 4,916 52.7 Elapsed: 120.41 (mins) DB Time: 21,868.27 (mins) Cache Sizes ~~~~~~~~~~~ Begin End ---------- ---------- Buffer Cache: 35,840M 35,840M Std Block Size: 8K Shared Pool Size: 6,144M 6,144M Log Buffer: 30,632K Load Profile ~~~~~~~~~~~~ Per Second Per Transaction --------------- --------------- Redo size: 1,790,430.57 5,945.33 Logical reads: 1,162,875.15 3,861.46 Block changes: 154,064.15 511.59 Physical reads: 3,007.80 9.99 Physical writes: 337.21 1.12 User calls: 8,227.28 27.32 Parses: 3,603.24 11.96 Hard parses: 11.25 0.04 Sorts: 7,304.40 24.26 Logons: 11.27 0.04 Executes: 25,994.53 86.32 Transactions: 301.15 % Blocks changed per Read: 13.25 Recursive Call %: 82.87 Rollback per transaction %: 0.21 Rows per Sort: 8.14 Instance Efficiency Percentages (Target 100%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Buffer Nowait %: 99.99 Redo NoWait %: 100.00 Buffer Hit %: 99.75 In-memory Sort %: 100.00 Library Hit %: 99.94 Soft Parse %: 99.69 Execute to Parse %: 86.14 Latch Hit %: 98.90 Parse CPU to Parse Elapsd %: 14.69 % Non-Parse CPU: 84.51 Shared Pool Statistics Begin End ------ ------ Memory Usage %: 95.48 94.33 % SQL with executions>1: 87.63 74.92 % Memory for SQL w/exec>1: 86.52 74.66 Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- latch: shared pool 991,918 274,715 277 20.9 Concurrenc latch: library cache 941,443 232,344 247 17.7 Concurrenc CPU time 212,150 16.2 cursor: pin S ############ 115,049 0 8.8 Other latch: cache buffers lru chain 1,106,570 96,442 87 7.4 Other ------------------------------------------------------------- ^LWait Events DB/Inst: CRM2DB/crm2db1 Snaps: 3744-3748 -> s - second -> cs - centisecond - 100th of a second -> ms - millisecond - 1000th of a second -> us - microsecond - 1000000th of a second -> ordered by wait time desc, waits desc (idle events last) -> %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0 Avg %Time Total Wait wait Waits Event Waits -outs Time (s) (ms) /txn ---------------------------- -------------- ----- ----------- ------- --------- latch: shared pool 991,918 N/A 274,715 277 0.5 latch: library cache 941,443 N/A 232,344 247 0.4 cursor: pin S 9,095,470,035 N/A 115,049 0 4,180.5 latch: cache buffers lru cha 1,106,570 N/A 96,442 87 0.5 db file sequential read 7,819,285 N/A 49,564 6 3.6 latch: cache buffers chains 511,936 N/A 35,753 70 0.2 latch free 95,257 0 14,609 153 0.0 latch: object queue header o 221,907 N/A 14,378 65 0.1 log file sync 2,156,343 1 11,496 5 1.0 cursor: pin S wait on X 422,320 100 4,603 11 0.2 gc cr grant 2-way 3,526,856 0 3,430 1 1.6 db file scattered read 624,480 N/A 3,083 5 0.3 enq: TX - row lock contentio 5,946 88 2,600 437 0.0 latch: row cache objects 16,983 N/A 2,281 134 0.0 SQL*Net message from dblink 1,284,487 N/A 2,179 2 0.6 gc buffer busy 353,228 0 2,054 6 0.2 gc cr multi block request 2,499,709 0 1,974 1 1.1 gc current block 2-way 1,969,516 0 1,920 1 0.9 SQL*Net more data from clien 1,667,513 N/A 1,914 1 0.8 control file sequential read 2,974,349 N/A 1,735 1 1.4 db file parallel read 230,795 N/A 1,645 7 0.1 log file parallel write 1,581,432 N/A 1,303 1 0.7 Backup: sbtbackup 14 N/A 730 52120 0.0 buffer busy waits 143,869 0 488 3 0.1 latch: library cache lock 5,058 N/A 378 75 0.0 enq: HW - contention 22,039 0 370 17 0.0 gc current grant 2-way 424,243 N/A 357 1 0.2 enq: SQ - contention 5,010 1 354 71 0.0 read by other session 50,153 0 339 7 0.0 direct path read 324,226 N/A 305 1 0.1 db file parallel write 207,793 N/A 274 1 0.1 row cache lock 175,301 0 263 1 0.1
从第3份数据来看,性能更为恶化,情形更为复杂。
那么针对这个性能现象,可以提出如下的问题:
- 最直观的,最容易想到的就是,性能问题的出现是否与应用调整有关,如果是,为什么另一套库没有出现问题?会不会是另一套库的负载在2个节点都有相对均衡的负担,而出问题的库,负载全部集中在1个节点上引起?
- 客户是在应用调整几天后才报告性能问题,这个问题会不会是一个逐渐出现的问题?如果一开始就有严重的性能问题,那么应该会很快报告。不过中间跨了一个周末,所以对于”逐渐出现问题“的判断又加了一些难度。
- 这么多的性能问题相关的现象中,哪个更贴近问题的原因?实际上在主机的CPU长期保持在100%左右时,会放大平时的一些轻微的问题,或者引起另一些问题。从等待来看,latch: cache buffers chains和cursor: pin S都会引起CPU的急剧增加,而其他的latch竞争同样也会引起CPU的上升,虽然上升没有前者大。到底是SQL执行效率不高引起CPU急剧增加然后引起了shared pool latch等与解析相关的资源争用还是因为解析相关的问题导致CPU急剧增加引起了cache buffers chains等与SQL执行相关的latch争用?或者是2者的共同作用?
下面首先来分析,会不会是应用调整引起的问题,也就是说,是不是由SQL引起的问题。如果是SQL引起的问题,会有几种可能,1是部分频繁执行的SQL执行效率变差;2是增加了很多的硬解析;3是并发增加或者说是SQL的执行次数有上升。仔细检查对比现在和以前的AWR数据,可以排除第1和2这种可能。至于第3种,可能略有增加,但是不一定跟应用调整有关,毕竟正常情况下业务量的变化也是有可能的。所以应用调整引起问题的可能性较小。
会不会是因为负载过大引起的问题,这个很容易验证。取业务低谷期,比如下班时间的数据,进行分析可以发现,既然在CPU在稍低(75%以下)的时候,仍然有大量的library cache latch的争用。下面是取自周日19:00-20:00的AWR数据(注:本文涉及故障的分析时间是在星期一):
Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- CPU time 49,224 48.3 latch: shared pool 185,951 41,178 221 40.4 Concurrenc latch: library cache 45,636 8,861 194 8.7 Concurrenc db file sequential read 815,066 4,763 6 4.7 User I/O cursor: pin S ############ 1,732 0 1.7 Other ------------------------------------------------------------- Avg %Time Total Wait wait Waits Event Waits -outs Time (s) (ms) /txn ---------------------------- -------------- ----- ----------- ------- --------- latch: shared pool 185,951 N/A 41,178 221 0.6 latch: library cache 45,636 N/A 8,861 194 0.2 db file sequential read 815,066 N/A 4,763 6 2.8 cursor: pin S 1,890,998,700 N/A 1,732 0 6,459.2 latch free 8,549 0 1,278 150 0.0 control file sequential read 1,507,789 N/A 466 0 5.2 Backup: sbtbackup 7 N/A 428 61096 0.0 log file sync 285,442 0 309 1 1.0 SQL*Net more data from clien 145,137 N/A 221 2 0.5 db file scattered read 67,716 N/A 173 3 0.2 gc buffer busy 56,842 0 135 2 0.2 gc cr grant 2-way 359,146 N/A 99 0 1.2 SQL*Net message from dblink 123,206 N/A 98 1 0.4 log file parallel write 289,048 N/A 91 0 1.0 SQL*Net more data to client 1,242,471 N/A 79 0 4.2 gc current block 2-way 179,654 N/A 66 0 0.6 direct path read 62,336 N/A 65 1 0.2 gc cr multi block request 228,693 N/A 52 0 0.8
至于性能问题是不是逐渐引起的,这个问题目前并不关键,暂时不考虑。只是做为后面分析问题的一个参考。 Read the rest of this entry