星期六晚上成功将一个生产环境数据库从9.2.0.6升级到10.2.0.1.
操作系统为windows 2003 enterprise edition sp1.
今天早上就出现数据库连接不上的状况, 登到server一看, listener挂掉, 起不来, 也不能以DBA登录到db, 报tns无监听程序,协议适配器错误.
查看sqlnet.log:
***********************************************************************
Fatal NI connect error 12505, connecting to: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=HIS01)(PORT=1521))(CONNECT_DATA=(SID=HIS)(CID=(PROGRAM=D:\oracle\product\10.2.0\HIS\bin\emagent.exe)(HOST=HIS)(USER=SYSTEM))))
VERSION INFORMATION:
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
Windows NT TCP/IP NT Protocol Adapter for 32-bit Windows: Version 10.2.0.1.0 - Production
Time: 01-12月-2008 11:38:42
Tracing not turned on.
Tns error struct:
ns main err code: 12564
TNS-12564: TNS: 拒绝连接
ns secondary err code: 0
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
***********************************************************************
查看db alert log:
Mon Dec 01 11:36:55 2008
Errors in file d:\oracle\admin\his\udump\his_ora_3868.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C8224B2] [ADDR:0x2D484D4E] [UNABLE_TO_READ] []
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C8224B2] [ADDR:0x2D484D4E] [UNABLE_TO_READ] []
查看对应trace文件:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [unable_to_trans_pc] [PC:0x7C8224B2] [ADDR:0x2D484D4E] [UNABLE_TO_READ] []
OR
Current SQL information unavailable - no session.
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
重启server以后,数据库以及listener均正常工作, 但过一段时间,大概是30分钟左右, 问题依旧.
为了production环境AP能够正常使用, 只能再次重启, 并且等待下一次crash.
最终查到Metalink Note:405904.1, note中描述状况与我们遇到的基本一致.
Solution:
修改SQLNET.ORA, 添加:
SQLNET.INBOUND_CONNECT_TIMEOUT = 0
修改LISTENER.ORA, 添加:
INBOUND_CONNECT_TIMEOUT_LISTENER = 0
另, 发现该instance PGA设的很小:24M, 改为128M.
再次重启server之后, 恢复正常.
但是这次正常也只不过持续了4个小时, 又挂掉了, 问题依旧.
于是开始怀疑是内存问题, 因为将PGA修改之后, 能够持续得更长时间.
到晚上,对比该环境与另外一生产环境(同样也是跑在10g上)的pfile, 终于发现了问题所在.
虽然我们升级了DB, 但是相关parameter依旧采用9i的模式, 例如SGA为采用自动管理(未设定sga_target), 而是手动设定了里面相关模块的固定大小.
pga也存在同样问题.
不光如此, sga中设置的个内存模块大小比例也感觉很不恰当, 且还有一些其它的奇怪设定.
对照修改pfile之后, 第二天上班, 一切正常.
看来升级数据库的时候还是有不少陷阱的, 一定要注意参数的修改.