Tuesday, May 25, 2010

FAL Failing to Resolve Archive Gaps

FAL failing to resolve archive gaps:

I had been having problems with FAL failing to resolve archive gaps. The classic error message on my 10.2.0.3 physical standby looked something like this:

FAL[client]: Failed to request gap sequence
GAP – thread 1 sequence 11402-11403
DBID 2878580510 branch 589635294
FAL[client]: All defined FAL servers have been attempted.
————————————————————-
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that is sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
————————————————————-

Now, a FAL (Fetch Archive Log) request is triggered by the MRP process detecting a gap in the available archives to apply. The server to contact to fulfil the request is determined by the FAL_SERVER parameter which you set to an appropriate tnsnames alias. This essentially enables your standby to contact the primary database and get it to ship the appropriate missing log using the arc process.

While the above situation was logging the above on the standby, no indication of any issues were being logged on the primary. First thought of course was connectivity, but that was easily disproved.

So to get more information I deployed extra logging on both the primary and standby, using the log_archive_trace parameter. This parameter has the ability to trace a number of background processes and is thus useful on both the primary and the standby.

In fact it was the arcn process on the primary that really showed what was wrong, as the trace continually had the following lines:

FAL[server]: Duplicate request detected (dbid 0 branch 127 thread 1 sequence 11402 dest STANDBY)
FAL[server]: Request not queued in this case
FAL[server]: Duplicate request detected (dbid 0 branch 127 thread 1 sequence 11403 dest STANDBY)
FAL[server]: Request not queued in this case
*** 2008-06-03 11:57:18.229
FAL[subfal]: Waiting on response from destination for thread 1 sequence 10890.

So, what is this actually telling us? The standby is asking for 11402, and 11403 and the primary knows the standby is after these, however the primary also thinks the standby is after 10890, but the standby having already applied this does not care about receiving this archived log.

Essentially this log is blocking the FAL process from working properly.

Turns out this is a bug situation. We were hitting bug:5576816 with an associated MetaLink Note: 5576816.8. Seems like it’s related to having max_connections set on the log_archive_dest parameter (which we did).

As an attempted workaround I tried setting the log_archive_max_processes down from 6 to 1, with the following shown in the primary alert log:

Shutting down archive processes
ALTER SYSTEM SET log_archive_max_processes=1 SCOPE=BOTH;
ARCH shutting down
ARC4: Archival stopped
ARCH shutting down
ARC3: Archival stopped
ARCH shutting down
ARC2: Archival stopped
ARCH shutting down
ARC1: Archival stopped
ARC5: Becoming the ‘no FAL’ ARCH
ARC5: Becoming the ‘no SRL’ ARCH
ARC5: Becoming the heartbeat ARCH
ARCH shutting down
ARC5: Archival stopped
ARC0: Becoming the ‘no FAL’ ARCH
ARC0: Becoming the ‘no SRL’ ARCH
ARC0: Becoming the heartbeat ARCH

I then killed (OS kill that is) the final running arc background process – note this is safe according Oracle support, and pmon detects it is dead and restarts the arc process:


$ ps -ef|grep arc|grep
$ kill -9

ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
ARC0: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC0 started with pid=276, OS id=24202
ARC0: Becoming the ‘no FAL’ ARCH
ARC0: Becoming the ‘no SRL’ ARCH
ARC0: Becoming the heartbeat ARCH

No comments: