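Oracle Study: DataGuard Maximum Protection Mode Failure (ORA-16198)
System environment:
Operating system: RedHat EL5
Oracle: Oracle 11gR2 (11.2.0.1.0)
Symptom:
When a physical standby configuration was switched from Maximum Performance to Maximum Protection, the following failure occurred: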
10:13:06 SYS@ prod1>startup force mount;
ORACLE instance started.

Total System Global Area  418484224 bytes
Fixed Size                  1336932 bytes
Variable Size             281020828 bytes
Database Buffers          130023424 bytes
Redo Buffers                6103040 bytes
Database mounted.
10:13:30 SYS@ prod1>select name,protection_mode from v$database;
NAME PROTECTION_MODE
--------- --------------------
PROD1 MAXIMUM PROTECTION
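A minimal sketch, assuming a SYSDBA session on the mounted primary: besides PROTECTION_MODE, V$DATABASE also exposes PROTECTION_LEVEL (the level currently in effect) and DATABASE_ROLE, which can help confirm what the instance is actually enforcing before the open is attempted.

```sql
-- Configured protection mode vs. the level currently in effect,
-- plus the role of this database (PRIMARY / PHYSICAL STANDBY).
select name, database_role, protection_mode, protection_level
  from v$database;
```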
Opening the database fails:
10:07:04 SYS@ prod1>alter database open;
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 4612
Session ID: 1 Serial number: 5
Check the alert log:
alter database open
Thu Jun 11 10:07:10 2015
LGWR: STARTING ARCH PROCESSES
Thu Jun 11 10:07:10 2015
ARC0 started with pid=19, OS id=4614
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Thu Jun 11 10:07:10 2015
ARC1 started with pid=20, OS id=4616
Thu Jun 11 10:07:10 2015
ARC2 started with pid=21, OS id=4618
ARC1: Archival started
ARC2: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC1: Becoming the heartbeat ARCH
LGWR: Primary database is in MAXIMUM PROTECTION mode
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
Thu Jun 11 10:07:11 2015
NSS2 started with pid=18, OS id=4620
Thu Jun 11 10:07:40 2015
ORA-16198: LGWR received timedout error from KSR
Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Error 16198 verifying archivelog destination LOG_ARCHIVE_DEST_2
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Continuing...
LGWR: Minimum of 1 applicable standby database required
Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:
ORA-16072: a minimum of one standby database destination is required
Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:
ORA-16072: a minimum of one standby database destination is required
LGWR (ospid: 4565): terminating the instance due to error 16072
Instance terminated by LGWR, pid = 4565
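---The log shows that the local destination LOG_ARCHIVE_DEST_1 did not fail: "not serviced by LGWR" only means that it is archived by the ARCn processes. The failure is on the remote destination LOG_ARCHIVE_DEST_2, which timed out with ORA-16198 and was marked UNSYNCHRONIZED. Maximum Protection requires at least one synchronized standby destination, so LGWR raised ORA-16072 and terminated the instance.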
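The state of each archive destination can also be read from the dynamic views rather than from the alert log alone. The following is a sketch, assuming a SYSDBA session on the primary in MOUNT state and that destinations 1 and 2 are the local and remote destinations as in this article.

```sql
-- Status and last error of the local (1) and remote (2) destinations.
select dest_id, status, target, error
  from v$archive_dest
 where dest_id in (1, 2);

-- Synchronization state of the standby destination as seen by the primary.
select dest_id, status, protection_mode, synchronization_status, error
  from v$archive_dest_status
 where dest_id = 2;
```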
Solution:
Following Oracle's recommendation, change the net_timeout value (on both the primary and the standby):
10:10:23 SYS@ prod1>alter system set log_archive_dest_2='service=shdb lgwr sync affirm VALID_FOR=(online_logfiles,primary_role) net_timeout=30 DB_UNIQUE_NAME=shdb';
10:11:25 SYS@ shdb>alter system set log_archive_dest_2='service=bjdb lgwr sync affirm VALID_FOR=(online_logfiles,primary_role) net_timeout=30 DB_UNIQUE_NAME=bjdb';
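To confirm that the new attributes were actually picked up, the destination can be queried afterwards. This is a sketch, assuming destination 2 is the standby destination on whichever database it is run on.

```sql
-- Expect a synchronous transmit mode, AFFIRM = YES and NET_TIMEOUT = 30
-- for the destination configured above.
select dest_id, transmit_mode, affirm, net_timeout,
       valid_type, valid_role, db_unique_name
  from v$archive_dest
 where dest_id = 2;
```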
Add standby redo logs:
Primary (standby redo logs added while in Maximum Performance):
10:20:35 SYS@ prod1>select group#,status ,bytes from v$log;
    GROUP# STATUS                BYTES
---------- ---------------- ----------
         1 INACTIVE           52428800
         2 CURRENT            52428800
         3 INACTIVE           52428800
10:20:54 SYS@ prod1>select member from v$logfile;
MEMBER
------------------------------------------------------------------------------------------------------------------------
/u01/app/oracle/oradata/prod1/redo03.log
/u01/app/oracle/oradata/prod1/redo02.log
/u01/app/oracle/oradata/prod1/redo01.log
/u01/app/oracle/oradata/prod1/std_redo01.log
/u01/app/oracle/oradata/prod1/std_redo02.log
6 rows selected.
10:21:03 SYS@ prod1>alter database add standby logfile
10:21:25   2  '/u01/app/oracle/oradata/prod1/std_redo03.log' size 50m;
Database altered.
10:21:46 SYS@ prod1>alter database add standby logfile
10:21:51   2  '/u01/app/oracle/oradata/prod1/std_redo04.log' size 50m;
Database altered.
10:01:48 SYS@ shdb>select member from v$logfile;
MEMBER
------------------------------------------------------------------------------------------------------------------------
/u01/app/oracle/oradata/shdb/redo03.log
/u01/app/oracle/oradata/shdb/redo02.log
/u01/app/oracle/oradata/shdb/redo01.log
/disk2/arch_prod11_0_881851982.dbf
/u01/app/oracle/oradata/shdb/std_redo01.log
/u01/app/oracle/oradata/shdb/std_redo02.log
6 rows selected.
Standby:
10:18:17 SYS@ shdb>alter database open;
Database altered.
10:20:21 SYS@ shdb>select group#,status ,bytes from v$log;
    GROUP# STATUS                BYTES
---------- ---------------- ----------
         1 CLEARING           52428800
         2 CLEARING           52428800
         3 CLEARING_CURRENT   52428800
10:20:45 SYS@ shdb>alter database add standby logfile
10:22:41   2  '/u01/app/oracle/oradata/shdb/std_redo03.log' size 50m;
Database altered.
10:22:57 SYS@ shdb>alter database add standby logfile
10:23:02   2  '/u01/app/oracle/oradata/shdb/std_redo04.log' size 50m;
Database altered.
10:23:14 SYS@ shdb>col member for a50
10:23:23 SYS@ shdb>select group#,member from v$logfile;
    GROUP# MEMBER
---------- --------------------------------------------------
         3 /u01/app/oracle/oradata/shdb/redo03.log
         2 /u01/app/oracle/oradata/shdb/redo02.log
         1 /u01/app/oracle/oradata/shdb/redo01.log
         5 /u01/app/oracle/oradata/shdb/std_redo01.log
         6 /u01/app/oracle/oradata/shdb/std_redo02.log
         7 /u01/app/oracle/oradata/shdb/std_redo03.log
         8 /u01/app/oracle/oradata/shdb/std_redo04.log
8 rows selected.
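Oracle's usual guideline is that standby redo logs should be at least as large as the online redo logs, with one more standby group than online groups per thread. With three 50 MB online groups, as shown above, that gives the four 50 MB standby groups this article ends up with. A sketch for checking both sides of that comparison, to be run on each database:

```sql
-- Online redo logs: number of groups and largest size per thread.
select thread#, count(*) as online_groups, max(bytes)/1024/1024 as size_mb
  from v$log
 group by thread#;

-- Standby redo logs: one row per group, with size and current usage state.
select group#, thread#, sequence#, bytes/1024/1024 as size_mb, archived, status
  from v$standby_log
 order by group#;
```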
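-----After the steps above the problem persisted: in Maximum Protection mode the primary still could not be opened, whereas it opened normally in Maximum Availability and Maximum Performance. The root cause is still under investigation...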
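While the cause remains open, two things can be sketched. They are not a fix: the first block only inspects whether the standby is running the processes that receive and apply redo, and the second is the fallback that this article reports as working (opening the primary in Maximum Availability). Which session runs which block is stated in the comments. Whether the standby's state is related to the failure here is an assumption, not something the article establishes.

```sql
-- On the standby (shdb): list the managed standby processes
-- (for example RFS receiving redo and MRP0 applying it) and their state.
select process, status, client_process, thread#, sequence#
  from v$managed_standby;

-- On the primary (prod1), in MOUNT state: lower the protection mode
-- to one in which the article reports that the database opens.
alter database set standby database to maximize availability;
alter database open;
```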
References:
Analysis of solutions for the database error ORA-16198
--------http://blog.itpub.net/28546804/viewspace-1260003/
1. First, the official documentation's description of the ORA-16198 error
.......................
The likely cause is that net_timeout is set too low; in older versions the default was 10, and changing it to 30 is recommended
……………………………
The net_timeout attribute in the log_archive_dest_2 on the primary is set too low so that LNS couldn't finish sending redo block in 10 seconds in this example.
…………………………….
If a setting of 30 still does not help, check disk I/O usage and network transfer conditions
…………………………..
Note: If NET_TIMEOUT attribute has already been set to 30, and you still get ORA-16198, that means LNS couldn't finish sending redo block in 30 seconds.
The slowness may be caused by:
1. Operating System. Please keep track of OS usage (like iostat).
2. Network. Please keep track network flow (like tcpdump).
……………………………
It may also be a bug: versions 11.2.0.1 and 10.2.0.4 are confirmed as affected, and upgrading to 11.2.0.2 or later is recommended
…………………………..
Bug 9259587 Multiple LGWR reconnect attempts in Data Guard MAXIMUM_AVAILABILITY
This note gives a brief overview bug 9259587.
Affects:
Product (Component): Oracle Server (Rdbms)
Range of versions believed to be affected: Versions BELOW 12.1
Versions confirmed as being affected: 11.2.0.1, 10.2.0.4
Platforms affected: Generic (all / most platforms affected)
Fixed:
This issue is fixed in: 12.1 (Future Release), 11.2.0.2 (Server Patch Set)
Symptoms:
Related To:
Hang (Process Spins)
Active Dataguard (ADG)
Physical Standby Database / Dataguard
Description
…………………………………………………
The errors reported look roughly like the following
…………………………………………………
Rediscovery Notes:
Alert log contains messages like:
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Errors in file /app/oracle/diag/rdbms/ora11g_dga/ora11g/trace/ora11g_lgwr_290838.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Network asynch I/O wait error 16198 log 2 service 'ora11g_DGb'
LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'ora11g_DGb'
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Failed to archive log 2 thread 1 sequence 1422 (16198)
…………………………………………………
In a Data Guard configuration using LGWR SYNC transport on one or more LOG_ARCHIVE_DEST_n parameters, and using a protection mode of MAXIMUM_AVAILABILITY, then if the primary database becomes disconnected from the standby database, LGWR continues to attempt to reconnect to the standby database. It should instead avoid attempts to reconnect until an ARCH process has re-established communication with the standby database.
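So the following can be concluded:
This error mainly occurs in a Data Guard architecture. The cause is that redo shipped from the primary to the standby did not complete within the allowed time, or that redo could not be sent to the standby at all. The main causes are described below:
2. Failure caused by a parameter set too low
The NET_TIMEOUT value of LOG_ARCHIVE_DEST_2 may be set too low, so redo transmission cannot complete within the allowed time; a setting of 30 is recommended.
Query NET_TIMEOUT: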
SQL> select DEST_NAME,NET_TIMEOUT FROM V$ARCHIVE_DEST;
DEST_NAME NET_TIMEOUT
------------------------- -----------
LOG_ARCHIVE_DEST_1 0
LOG_ARCHIVE_DEST_2 30
…………… (output omitted)
Check the LOG_ARCHIVE_DEST_2 parameter:
SQL> show parameter log_archive_dest_2
The value is 'service=orcl_std reopen=120 lgwr sync valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std'
I did not set NET_TIMEOUT explicitly, yet it defaults to 30 because my version is 11.2.0.3.
If your value is not 30, change it, for example:
SQL>ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='service=orcl_std reopen=120 lgwr sync net_timeout=30 valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std';
Then watch whether the error still occurs.
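One way to watch for recurrences without scanning the alert log by hand is the V$DATAGUARD_STATUS view, which keeps recent Data Guard messages in memory. A sketch, assuming it is run on the primary and that the two error codes discussed in this article are the ones of interest:

```sql
-- Recent Data Guard messages for the timeout error and the
-- "minimum of one standby destination" error.
select timestamp, severity, dest_id, error_code, message
  from v$dataguard_status
 where error_code in (16198, 16072)
 order by timestamp;
```

The view is cleared when the instance restarts, so after an instance termination the alert log remains the authoritative record.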
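3. Failure caused by a congested network, busy storage I/O, or other factors
If the error is caused by network congestion or busy storage, use operating system commands such as tcpdump, iostat, or vmstat to check the relevant resource usage, or ask the network and storage administrators to help with the analysis.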
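Alongside the operating system tools, the database's own wait statistics can hint at whether redo transport is the slow part. The sketch below is an assumption-laden starting point: the names of the redo transport wait events differ between releases, so the LIKE patterns are guesses to be adjusted after looking at V$EVENT_NAME on the system in question.

```sql
-- Cumulative waits since instance startup for events that look related
-- to LGWR/LNS redo transport; TIME_WAITED is in hundredths of a second.
select event, total_waits, time_waited, average_wait
  from v$system_event
 where event like 'LNS%'
    or event like 'LGWR%LNS%'
 order by time_waited desc;
```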
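If none of the above is the problem, another possibility is that the sys password was changed on the primary or on the standby alone, without the same change on the other side, so authentication from the primary to the standby fails.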
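A password mismatch can be narrowed down from inside the database. This is a sketch, assuming destination 2 is the standby destination: the first query shows the last error recorded for it on the primary, where an authentication failure would normally appear, and the second lists the password file users and is meant to be compared between the two databases.

```sql
-- On the primary: last error recorded for the standby destination.
select dest_id, status, error
  from v$archive_dest
 where dest_id = 2;

-- On both primary and standby: users present in the password file.
select username, sysdba, sysoper
  from v$pwfile_users;
```

If the passwords have diverged, the usual remedy in this release is to copy the primary's password file to the standby rather than to change the password separately on each side.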
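4. Database bug
If none of the above resolves the problem and you cannot identify a specific cause, and your database version happens to be 11.2.0.1 or 10.2.0.4, then upgrade.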
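The exact release can be checked with either of the following queries, so that it can be compared against the affected versions listed in the bug note:

```sql
-- Full version banner of the database and its components.
select banner from v$version;

-- Version string of the running instance only.
select version from v$instance;
```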
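5. Summary
Analyze this kind of problem from several angles: a parameter value that is too low, storage usage, network transfer, a changed sys password, a database bug, and so on.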