<TITLE>
TITLE: HP Tru64 UNIX - Corrects device related hangs, panics and boot issues.
Copyright (c) Hewlett-Packard Company 2006. All rights reserved.
PRODUCT: HP TruCluster Server [R] V5.1B-3
SOURCE: Hewlett-Packard Company
ECO INFORMATION:
ECO Name: TCRKIT1001020-V51BB26-E-20061205
ECO Kit Approximate Size: 0.00MB
Kit Applies To: HP TruCluster Server V5.1B-3 PK5 (BL26)
ECO Kit CHECKSUMS:
/usr/bin/sum results:
/usr/bin/cksum results:
MD5 results:
SHA1 results:
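As an illustration of how a downloaded kit could be checked against published MD5 and SHA1 values, the following is a minimal sketch. The function name file_digests and the kit file path are hypothetical, and this notice does not list the expected values.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return the MD5 and SHA1 hex digests of the file at `path`.

    `path` is whatever local file name the kit was saved under
    (hypothetical; not specified by this notice).
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so a large kit does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}
```

The returned strings can be compared by eye, or with ==, against the values published with the kit.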
ECO KIT SUMMARY:
A dupatch-based, Early Release Patch kit exists for HP TruCluster Server V5.1B-3
that contains solutions to the following problem(s):
This patch fixes a configuration issue found in non-CAM devices and CD-ROM
devices. This patch improves the reliability of the Tru64 Cluster DRD
subsystem when handling tape devices and tape device failures.
There was a timing hole in which two opens could be sent to the tape driver
at the same time. Before the tape driver checked whether it was already
open, the paths could be changed, which would result in a kernel memory
fault panic. A typical stack trace for the panic would be:
THREAD 1
drd_open()
drd_set_tape_changer_server()
drd_check_path()
drd_issue_local_ioctl()
ctape_ioctl()
ccmn_path_setup3
ccmn_alloc_path3()
ccmn_reg_hier_path3
THREAD 2
drd_open()
drd_local_open()
drd_local_device_open()
drd_issue_local_ioctl()
ctape_ioctl()
ctape_verify_path()
ccmn_path_setup3
ccmn_del_stale_paths3()
ccmn_destroy_invalid_paths()
ccmn_reg_hier_path3
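The timing hole described above is a check-then-act race: the "already open?" check and the path setup were not atomic with respect to a second open. The following is a minimal sketch of the general remedy, not the actual driver code. The class TapeDevice and its methods are hypothetical stand-ins, and a single lock is assumed to cover both the check and the path setup.

```python
import threading


class TapeDevice:
    """Hypothetical stand-in for a tape device; not the Tru64 driver."""

    def __init__(self):
        self._lock = threading.Lock()
        self.is_open = False
        self.path_setups = 0  # how many times paths were (re)built

    def _setup_paths(self):
        # In the failure described above, a second caller could change
        # the paths while the first caller was still using them.
        self.path_setups += 1

    def open(self):
        # The open check and the path setup run under the same lock,
        # so two concurrent opens cannot interleave between them.
        with self._lock:
            if self.is_open:
                return False  # the later open sees a consistent state
            self._setup_paths()
            self.is_open = True
            return True


def concurrent_opens(n=8):
    """Issue n opens from n threads and collect each result."""
    dev = TapeDevice()
    results = []
    results_lock = threading.Lock()

    def worker():
        ok = dev.open()
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dev, results
```

However the threads are scheduled, exactly one open performs the path setup and the others observe the device as already open.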
When a device is deleted via hwmgr while an open is in progress, the open
can hang. This patch removes the timing hole that allows the open to
progress to the point where it hangs.
When a device fails, all current IOs are returned with an appropriate error
status code. If the upper layers continue to send IOs after the device has
been marked as failed, IOs can hang in drd.
This patch also fixes barrier issues when devices fail and a barrier is
in progress. Symptoms for 2, 3 and 4:
Status of a drd disk with stalled IOs.
drd_disk            d_hwid  d_state  d_flags     d_type  errno   eei     d_bp_cnt
0xfffffc00f4fe0e00  0x0086  0x0003   0x0a800081  0x0000  0x0013  0x0000  1
DRD_FAILED
DRD_DISK_BLOCKED
DK_DAIO_DISK
DRD_DISK_NOT_USABLE bp 0xfffffc00291b3500 00:02:24.180
DRD_DRAINED_FLAGS
DRD_DISK_FAILED
DRD_STOP_SERVER
DRD_DO_NOT_DELETE
DRD_IS_BARRIERABLE
DRD_CAM_REGISTERED
Typical thread trace for vold threads at the time of hung IOs.
>0 thread_block
1 volsiowait
4 volsioctl_rea
5 spec_ioctl
6 vn_ioctl
7 ioctl_base
8 syscall
9 _Xsyscall
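The failed-device behavior described above (outstanding IOs returned with an error, later IOs not left to hang) can be sketched as follows. This is an illustration only, not the DRD implementation. The Disk class, its methods, and the choice of EIO as the error code are assumptions made for the example.

```python
import errno


class Disk:
    """Hypothetical disk object used only to illustrate fail-fast IO."""

    def __init__(self):
        self.failed = False
        self.pending = []  # IO requests accepted but not yet completed

    def submit(self, request):
        # Once the device is marked failed, reject new IO immediately
        # instead of queueing it where nothing will ever complete it.
        if self.failed:
            raise OSError(errno.EIO, "device marked failed")
        self.pending.append(request)

    def mark_failed(self):
        # Return every outstanding IO with an error status (EIO is an
        # arbitrary choice for this sketch) and leave the queue empty.
        self.failed = True
        drained = [(request, errno.EIO) for request in self.pending]
        self.pending.clear()
        return drained
```

The point of the sketch is that no request can sit in the pending queue after the failure: it is either drained with an error or refused at submission.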
This patch fixes an error in the DRD subsystem wherein uninitialized disk
attributes can cause a system panic.
a) This patch fixes an error in the DRD subsystem wherein a few
uninitialized disk attributes could result in a system panic with the
following or similar stack trace:
4 panic
5 trap
6 _XentMM
7 free
8 drd_release_bp_resources
9 drd_ics_io
10 drd_ics_read
11 svr_drd_ics_read
12 icssvr_daemon_from_pool
This problem appears when an open or read is attempted on deleted XCR disks.
This patch also fixes an error during failback of a tape device wherein
the character devt is not restored properly.
Corrects a problem where the DRD event thread may run indefinitely while
responding to a bid server transaction.
This patch fixes a problem wherein the DRD subsystem may cause a system
panic because strategy routines can be called from a lightweight context
(LWC).
This could result in a system panic with the following or similar stack
trace:
0 boot
1 panic
2 thread_block
3 lock_wait
4 lock_write
5 (source file cannot be determined)
6 (source file cannot be determined)
7 (source file cannot be determined)
8 drd_restart_io
9 drd_io_barrier_complete_timeout
10 softclock_scan
11 lwc_schedule
12 exception_exit
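The trace above shows a routine that takes a blocking lock being reached from a timeout running in a lightweight context. One common way to avoid this class of problem, shown here only as a sketch and not necessarily what the patch does, is to have the callback hand the work to a thread that is allowed to block. The DeferredWorker class and the thread name are hypothetical.

```python
import queue
import threading


class DeferredWorker:
    """Runs deferred callables on a dedicated thread that may block."""

    def __init__(self):
        self._q = queue.Queue()  # unbounded, so defer() never waits for space
        self._thread = threading.Thread(
            target=self._run, name="deferred-worker", daemon=True
        )
        self._thread.start()

    def defer(self, fn, *args):
        # Intended for callers that must not block (for example a timer
        # callback): only enqueue the work, do not run it here.
        self._q.put_nowait((fn, args))

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:  # sentinel posted by stop()
                break
            fn, args = item
            fn(*args)  # may take locks or sleep; this thread can block

    def stop(self):
        # Work queued before stop() is processed first (FIFO order).
        self._q.put(None)
        self._thread.join()
```

A timer callback would call defer() with the routine that needs the lock, and the routine then runs on the worker thread rather than in the callback itself.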
Fixes a hang with disklabel(8) that occurred if a local open failed for the
same disk simultaneously.
Corrects reference counting issues within the DRD subsystem that can
prevent the deletion of hwids.
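Reference counting problems of this kind usually come from a hold that is never released on some path, which leaves the count above zero and blocks deletion forever. The following is a minimal sketch under that assumption. The DeviceRef class is hypothetical and is not the DRD data structure.

```python
class DeviceRef:
    """Hypothetical reference-counted device entry keyed by hwid."""

    def __init__(self, hwid):
        self.hwid = hwid
        self.refs = 0
        self.deleted = False

    def hold(self):
        self.refs += 1
        return self

    def release(self):
        if self.refs <= 0:
            raise RuntimeError("release without matching hold")
        self.refs -= 1

    def try_delete(self):
        # Deletion is refused while any reference is outstanding.
        if self.refs:
            return False
        self.deleted = True
        return True

    # Context-manager form: the release runs even if the body raises,
    # so an error path cannot leak a reference.
    def __enter__(self):
        return self.hold()

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False  # do not swallow exceptions
```

Using the with form for every access makes the hold and release pair up automatically, including on error paths.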
Fixes a disk I/O hang in DRD. This patch fixes a problem in DRD that could
cause commands such as disklabel and showfdmn, or any file system I/O, to
hang.
A typical stack trace is as follows:
0 thread_block
1 sleep_prim
2 mpsleep
3 drd_reopen_partitions
4 drd_change_server_node
5 drd_complete_failback
6 drd_handle_event_io_drained
7 drd_handle_one_event
8 drd_handle_events
9 drd_event_thread
DRD now plays an active role in the device deletion callback and voting. In
the past, drd was notified via an EVM event only after the device deletion
had occurred. This caused numerous panics and hung devices, because drd
could attempt to access a deleted device. With this fix, drd no longer
accesses a device that has a deletion pending or in progress.
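The idea of voting on a deletion and then refusing access while the deletion is pending can be sketched as a small state machine. This is an illustration only. The Device class, the state names, the policy of vetoing while the device is open, and the use of ENODEV are all assumptions made for the example.

```python
import errno
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DELETE_PENDING = "delete_pending"
    DELETED = "deleted"


class Device:
    """Hypothetical device that takes part in a deletion vote."""

    def __init__(self):
        self.state = State.ACTIVE
        self.open_count = 0

    def vote_on_delete(self):
        """Return True to allow the deletion, False to veto it."""
        if self.open_count > 0:
            return False  # assumed policy: veto while the device is in use
        self.state = State.DELETE_PENDING
        return True

    def complete_delete(self):
        if self.state is not State.DELETE_PENDING:
            raise RuntimeError("deletion was not voted on")
        self.state = State.DELETED

    def open(self):
        # No access once a deletion is pending or has completed.
        if self.state is not State.ACTIVE:
            raise OSError(errno.ENODEV, "device deletion pending or done")
        self.open_count += 1

    def close(self):
        if self.open_count <= 0:
            raise RuntimeError("close without matching open")
        self.open_count -= 1
```

Because the state changes at vote time rather than after the deletion, there is no window in which a new open can reach a device that is about to disappear.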
This patch fixes an issue of DRD returning incorrect device information
when the hwid is not found.
Corrects an existing timing-hole.
Provides a fix for a kernel memory fault in the drd disk code.
A typical stack trace of the problem is as follows:
0 boot
1 panic
2 trap
3 _XentMM
4 simple_lock_D
5 drd_add_server
6 drd_find_local_disks
7 drd_config_thread
Fixes DRD_IOCTL_ERROR handling for tape devices.
Fixes a kernel memory fault in the IO path for served disks and for stalled
IOs.
A typical stack trace of the problem is as follows:
0 stop_secondary_cpu
1 panic
2 event_timeout
3 printf
4 panic
5 trap
6 _XentMM
7 drd_ics_get_disk
8 drd_ics_io
9 drd_ics_read
10 svr_drd_ics_read
11 icssvr_daemon_from_pool
Fixes disk access issues that show up early in the boot process.
This problem could result in a system panic with the following or similar
stack trace:
PANIC: "CNX MGR: Invalid configuration for cluster seq disk"
0 boot
1 panic
2 init_globals
3 init_cnx
4 cnx_subsys_configure
5 cnx_callback
6 dispatch_callback
7 main
8 main
Fixes a hang during cluster bootup caused by early reservation conflicts.
During cluster bootup, the following warning message appears and the node
hangs until another node comes up:
"WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed to mount"
Fixes a cluster hang issue during cluster boot-up, when local disk open
operations fail while disklabel is in progress.
This patch corrects an erroneous error message which can be displayed by
drdmgr when relocating a device. For example:
drdmgr: Error, Uknown error -1431655766 for device 'tape0' attribute
DRD_SERVER
Handles reservation conflict errors to address a cluster node hang during
boot. During cluster booting, the following warning message appears and
the node may hang until the second node comes up. A typical message that
appears on the console when the node hangs is:
"WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed to mount"
This is due to the path being configured later in the boot process
resulting in a reservation conflict.
Allows retries of a disk open at boot time if the device is in MUNSA reject
state. A disk open can fail if the device is currently in MUNSA reject
state. This can result in boot hang conditions while the system is being
booted.
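Retrying an open that fails for a transient reason can be sketched as a bounded retry loop. This is an illustration only and does not reflect the retry count or timing used by the patch. The exception DeviceBusy, the function open_with_retry, and the default values are hypothetical.

```python
import time


class DeviceBusy(Exception):
    """Hypothetical: raised while the device is rejecting opens."""


def open_with_retry(open_fn, attempts=5, delay=0.5, sleep=time.sleep):
    """Call open_fn until it succeeds or `attempts` tries are used up.

    The attempt count and delay are arbitrary values for this sketch.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return open_fn()
        except DeviceBusy as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)  # wait before the next try, not after the last
    # Every attempt failed: report the last transient error to the caller
    # instead of waiting forever.
    raise last_error
```

Bounding the number of attempts matters here: an unbounded loop would simply replace one boot hang with another.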
The Patch Kit Installation Instructions and the Patch Summary and Release
Notes documents provide patch kit installation and removal instructions
and a summary of each patch. Please read these documents prior to
installing patches on your system.
The patches in this ERP kit will also be available in the next mainstream
patch kit - HP TruCluster Server V5.1B-6.
INSTALLATION NOTES:
1) Install this kit with the dupatch utility that is included in the patch
kit. You may need to baseline your system if you have manually changed
system files on your system. The dupatch utility provides the baselining
capability.
2) The patch in this ERP kit does not have any file intersections with any
other ERP available at this time for this product version.
3) This ERP kit will NOT install over any Customer Specific Patches (CSPs)
which have file intersections with this ERP kit. Contact your normal
Service Provider for assistance if the installation of this ERP kit is
blocked by any of your installed CSPs.
INSTALLATION PREREQUISITES:
You must have installed HP TruCluster Server V5.1B-3 PK5 (BL26) prior to
installing this Early Release Patch Kit.
SUPERSEDE INFORMATION:
TCRKIT1000547-V51BB26-E-20060420
KNOWN PROBLEMS WITH THE PATCH KIT:
None.
RELEASE NOTES FOR TCRKIT1001020-V51BB26-E-20061205:
[R] UNIX is a registered trademark in the United States and other countries
licensed exclusively through X/Open Company Limited.
Copyright Hewlett-Packard Company 2006. All Rights reserved.
This software is proprietary to and embodies the confidential technology
of Hewlett-Packard Company. Possession, use, or copying of this
software and media is authorized only pursuant to a valid written license
from Hewlett-Packard or an authorized sublicensor.
This ECO has not been through an exhaustive field test process.
Due to the experimental stage of this ECO/workaround, Hewlett-Packard
makes no representations regarding its use or performance. The
customer shall have the sole responsibility for adequate protection
and back-up data used in conjunction with this ECO/workaround.