Linux Archive


raj sourabh 06-18-2012 06:44 AM

RHEL 5.5 Oracle RAC cluster rebooted due to processor hung!!
 
Hi,

I have raised this question with Red Hat support as well. I just want to
collect your thoughts on the issue below.
----
*Platform: RHEL 5.5*
*Arch: 64-bit, running Oracle RAC 11gR2 (2-node cluster)*
*Problem description: Node 2 of the cluster was rebooted. The reboot
was initiated by Oracle for unknown reasons. /var/log/messages
shows that a processor was hung for 10 seconds (please see the logs
below). What could be the cause of this?*


Jun 10 19:22:04 prddbs02 snmpd[5158]: Received SNMP packet(s) from UDP: [127.0.0.1]:17955
Jun 10 19:22:34 prddbs02 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 10 19:22:34 prddbs02 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jun 10 19:22:34 prddbs02 kernel: bonding: bond0: making interface eth2 the new active one.
Jun 10 19:22:34 prddbs02 kernel: device eth2 entered promiscuous mode
Jun 10 19:22:46 prddbs02 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [multipathd:5060]
Jun 10 19:22:46 prddbs02 kernel: CPU 2:
Jun 10 19:22:46 prddbs02 kernel: Modules linked in: oracleacfs(PFU) oracleadvm(PFU) oracleoks(PU) autofs4 hidp smbus(U) ipmi_devintf ipmi_si ipmi_msghandler rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table bonding dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport joydev sr_mod cdrom i2c_i801 igb pcspkr i2c_core 8021q e1000e dca sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc(U) scsi_transport_fc ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Jun 10 19:22:46 prddbs02 kernel: Pid: 5060, comm: multipathd Tainted: PF M 2.6.18-194.el5 #1
Jun 10 19:22:46 prddbs02 kernel: RIP: 0010:[<ffffffff8007767a>] [<ffffffff8007767a>] __smp_call_function_many+0x9a/0xbc
Jun 10 19:22:46 prddbs02 kernel: RSP: 0018:ffff8108e79a5bf8 EFLAGS: 00000297
Jun 10 19:22:46 prddbs02 kernel: Pid: 5060, comm: multipathd Tainted: PF M 2.6.18-194.el5 #1
Jun 10 19:22:46 prddbs02 kernel: RIP: 0010:[<ffffffff8007767a>] [<ffffffff8007767a>] __smp_call_function_many+0x9a/0xbc
Jun 10 19:22:46 prddbs02 kernel: RSP: 0018:ffff8108e79a5bf8 EFLAGS: 00000297
Jun 10 19:22:46 prddbs02 kernel: RAX: 0000000000000006 RBX: 0000000000000007 RCX: 0000000000000000
Jun 10 19:22:46 prddbs02 kernel: RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
Jun 10 19:22:46 prddbs02 kernel: RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000038
Jun 10 19:22:46 prddbs02 kernel: R10: ffff8108e79a5b98 R11: 0000000000000000 R12: ffffffff80143e16
Jun 10 19:22:46 prddbs02 kernel: R13: 0000000000000003 R14: ffff810366ec2c58 R15: ffff81093da13340
Jun 10 19:22:46 prddbs02 kernel: FS: 000000004189d940(0063) GS:ffff81012071cec0(0000) knlGS:0000000000000000
Jun 10 19:22:46 prddbs02 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 10 19:22:46 prddbs02 kernel: CR2: 00002aaaac004000 CR3: 0000000928447000 CR4: 00000000000006e0
Jun 10 19:22:46 prddbs02 kernel:
Jun 10 19:22:46 prddbs02 kernel: Call Trace:
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff8007754d>] do_flush_tlb_all+0x0/0x6a
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff8007754d>] do_flush_tlb_all+0x0/0x6a
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff80077778>] smp_call_function_many+0x38/0x4c
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff8007754d>] do_flush_tlb_all+0x0/0x6a
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff80077869>] smp_call_function+0x4e/0x5e
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff8007754d>] do_flush_tlb_all+0x0/0x6a
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff881fcb28>] :dm_mod:dev_status+0x0/0x38
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff800958c1>] on_each_cpu+0x10/0x22
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff800d2017>] __remove_vm_area+0x2b/0x42
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff800d2046>] remove_vm_area+0x18/0x25
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff800d209a>] __vunmap+0x47/0xed
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff881fdeff>] :dm_mod:ctl_ioctl+0x237/0x25b
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff800424bd>] do_ioctl+0x55/0x6b
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff800304d6>] vfs_ioctl+0x457/0x4b9
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff8000d3e9>] dput+0x2c/0x114
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff8004cbb7>] sys_ioctl+0x59/0x78
Jun 10 19:22:46 prddbs02 kernel: [<ffffffff8005e116>] system_call+0x7e/0x83
Jun 10 19:22:46 prddbs02 kernel:
Jun 10 19:23:04 prddbs02 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [eecd:8758]
Jun 10 19:23:04 prddbs02 kernel: CPU 4:
Jun 10 19:23:04 prddbs02 kernel: Modules linked in: oracleacfs(PFU) oracleadvm(PFU) oracleoks(PU) autofs4 hidp smbus(U) ipmi_devintf ipmi_si ipmi_msghandler rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table bonding dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport joydev sr_mod cdrom i2c_i801 igb pcspkr i2c_core 8021q e1000e dca sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc(U) scsi_transport_fc ata_piix li:
Jun 10 19:23:04 prddbs02 kernel: Pid: 8758, comm: eecd Tainted: PF M 2.6.18-194.el5 #1
Jun 10 19:23:04 prddbs02 kernel: RIP: 0010:[<ffffffff80065bfc>] [<ffffffff80065bfc>] .text.lock.spinlock+0x2/0x30
Jun 10 19:23:04 prddbs02 kernel: RSP: 0018:ffff8108997d1bc0 EFLAGS: 00000286
Jun 10 19:23:04 prddbs02 kernel: RAX: 0000000000000000 RBX: 00000000d2a03d30 RCX: 0000000000000001
Jun 10 19:23:04 prddbs02 kernel: RDX: ffff8108997d1d98 RSI: ffffffff885dd304 RDI: ffffffff8030e6c8
Jun 10 19:23:04 prddbs02 kernel: RBP: ffff8102f1aa8c10 R08: 0000000000000001 R09: ffff8108997d1bf8
Jun 10 19:23:04 prddbs02 kernel: R10: ffff81089d5285c0 R11: 0000000000000000 R12: 0000000000000000
Jun 10 19:23:04 prddbs02 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000000fb
Jun 10 19:23:04 prddbs02 kernel: FS: 0000000000000000(0000) GS:ffff81012077dd40(0063) knlGS:00000000d2a04b90
Jun 10 19:23:04 prddbs02 kernel: CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Jun 10 19:23:04 prddbs02 kernel: CR2: 00000000d2a02ddc CR3: 00000008f0781000 CR4: 00000000000006e0
Jun 10 19:23:04 prddbs02 kernel:
Jun 10 19:23:04 prddbs02 kernel: Call Trace:
Jun 10 19:23:04 prddbs02 kernel: [<ffffffff80077764>] smp_call_function_many+0x24/0x4c
Jun 10 19:23:04 prddbs02 kernel: [<ffffffff885dd304>] :smbus:smbus_GetCpuError_callback+0x0/0x14
Jun 10 19:23:04 prddbs02 kernel: [<ffffffff80077869>] smp_call_function+0x4e/0x5e
Jun 10 19:23:04 prddbs02 kernel: [<ffffffff885e4fcd>] :smbus:smbus_ioctl+0x2880/0x2f74
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff80063ff8>] thread_return+0x62/0xfe
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff8002b379>] flush_tlb_page+0xac/0xda
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff80011149>] do_wp_page+0x3fd/0x902
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff80009677>] __handle_mm_fault+0xee5/0xfaa
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff80022127>] __up_read+0x19/0x7f
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff80067b88>] do_page_fault+0x4fe/0x874
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff885e56d7>] :smbus:smbus_ioctl_compat+0x16/0x1d
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff800fb8d4>] compat_sys_ioctl+0xc5/0x2b2
Jun 10 19:23:05 prddbs02 kernel: [<ffffffff8006249d>] sysenter_do_call+0x1e/0x76
Jun 10 19:23:05 prddbs02 kernel:
Jun 10 19:23:14 prddbs02 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [eecd:8758]
Jun 10 19:23:14 prddbs02 kernel: CPU 4:
Jun 10 19:23:14 prddbs02 kernel: Modules linked in: oracleacfs(PFU) oracleadvm(PFU) oracleoks(PU) autofs4 hidp smbus(U) ipmi_devintf ipmi_si ipmi_msghandler rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table bonding dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport joydev sr_mod cdrom i2c_i801 igb pcspkr i2c_core 8021q e1000e dca sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc(U) scsi_transport_fc ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Jun 10 19:23:14 prddbs02 kernel: Pid: 8758, comm: eecd Tainted: PF M 2.6.18-194.el5 #1
Jun 10 19:23:14 prddbs02 kernel: RIP: 0010:[<ffffffff80065bfc>] [<ffffffff80065bfc>]


Thanks for any help in advance :)

Regards,
Raj
--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@redhat.com?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
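
Editorial note: the log in the post above was re-wrapped by the mail client, so entries run into each other. Below is a minimal Python sketch, assuming every entry starts with a "Mon DD HH:MM:SS hostname " prefix as in the post, that splits such a paste back into entries and lists the soft lockup reports. The function names split_entries and soft_lockups are made up for this illustration.

```python
import re

# Syslog prefix as it appears in the post, e.g. "Jun 10 19:22:46 prddbs02 ".
# Treating any non-space token after the time as the hostname is an assumption.
STAMP = re.compile(r"[A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2} \S+ ")

LOCKUP = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def split_entries(text):
    """Undo the mail wrapping, then cut the text before every timestamp."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    starts = [m.start() for m in STAMP.finditer(flat)]
    bounds = starts + [len(flat)]
    # Anything before the first timestamp is dropped.
    return [flat[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def soft_lockups(text):
    """Return (timestamp, cpu, seconds, command, pid) for each lockup report."""
    found = []
    for entry in split_entries(text):
        m = LOCKUP.match(entry)
        if m:
            found.append((m.group("ts"), int(m.group("cpu")),
                          int(m.group("secs")), m.group("comm"),
                          int(m.group("pid"))))
    return found


if __name__ == "__main__":
    pasted = """Jun 10 19:22:34 prddbs02 kernel: device eth2 entered promiscuous mode Jun
10 19:22:46 prddbs02 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
[multipathd:5060] Jun 10 19:22:46 prddbs02 kernel: CPU 2:"""
    for report in soft_lockups(pasted):
        print(report)
```

Run against the full paste, this should make it easier to see which process was on the stuck CPU at each report (multipathd on CPU#2, then eecd on CPU#4) without reading the wrapped text by eye.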

Georgios Magklaras 06-21-2012 08:04 AM

RHEL 5.5 Oracle RAC cluster rebooted due to processor hung!!
 
On 06/18/2012 08:44 AM, raj sourabh wrote:

> Jun 10 19:22:04 prddbs02 snmpd[5158]: Received SNMP packet(s) from UDP: [127.0.0.1]:17955
> Jun 10 19:22:34 prddbs02 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Jun 10 19:22:34 prddbs02 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
> Jun 10 19:22:34 prddbs02 kernel: bonding: bond0: making interface eth2 the new active one.
> Jun 10 19:22:34 prddbs02 kernel: device eth2 entered promiscuous mode

Before the soft lockup, what exactly caused the NETDEV WATCHDOG to
lose eth0?
For the __smp_call_function_many lockup, there were many fixes between
5.5 and 5.6 relating to multipath and other third-party drivers
that caused similar lockups. (Why are you on 5.5 and not at least 5.6?
Which kernel are you running?)


Best regards,

--
--
George Magklaras PhD
RHCE no: 805008309135525

Senior Systems Engineer/IT Manager
Biotechnology Center of Oslo and
the Norwegian Center for Molecular Medicine
EMBnet TMPC Chair

http://folk.uio.no/georgios
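
Editorial note: the "which kernel" question can be partly answered from the trace itself, since each report carries a "Pid: ... Tainted: PF M 2.6.18-194.el5" line. The Python sketch below pulls the kernel version and taint letters out of such a line. It rests on two assumptions: that 2.6.18-194 and 2.6.18-238 are the base kernel builds of RHEL 5.5 and 5.6, and that the taint letters carry their upstream meanings (RHEL 5 may define vendor-specific letters that are not covered). The function name describe is made up for this illustration.

```python
import re

# Assumed base kernel builds for the two minor releases discussed here.
RHEL5_BASE_KERNELS = {"2.6.18-194": "5.5", "2.6.18-238": "5.6"}

# Upstream meanings of the taint letters seen in the posted trace.
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "F": "module force-loaded",
    "M": "machine check exception reported",
}

PID_LINE = re.compile(
    r"Tainted: (?P<flags>[A-Z ]+?) (?P<kernel>\d+\.\d+\.\d+-\S+) #"
)


def describe(pid_line):
    """Return kernel version, assumed RHEL release and decoded taint flags."""
    m = PID_LINE.search(pid_line)
    if m is None:
        return None
    kernel = m.group("kernel")
    # Errata kernels keep the base build number, e.g. 2.6.18-194.x.y.el5.
    base = re.match(r"\d+\.\d+\.\d+-\d+", kernel).group(0)
    taints = [TAINT_FLAGS.get(c, "unknown flag " + c)
              for c in m.group("flags") if c != " "]
    return {
        "kernel": kernel,
        "rhel_release": RHEL5_BASE_KERNELS.get(base, "unknown"),
        "taints": taints,
    }


if __name__ == "__main__":
    line = "Pid: 5060, comm: multipathd Tainted: PF M 2.6.18-194.el5 #1"
    print(describe(line))
```

If the upstream meaning of the M flag applies here, a machine check had been reported before the lockups, which may be worth raising with the hardware vendor alongside the driver questions above.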





Shashank 06-30-2012 02:19 AM

RHEL 5.5 Oracle RAC cluster rebooted due to processor hung!!
 
Actually, the cluster checks are done via the private network, so the
loss of eth0 should not have crashed the server.

Do you see anything in /var/crash? Is kdump/netdump set up? Can you
post the ocssd logs (they should be under the grid directory) for the
10-15 minutes before the crash?

Also post /var/log/messages for the 10-15 minutes prior to the crash.
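
Editorial note: a minimal Python sketch for cutting the requested window out of /var/log/messages. Classic syslog timestamps carry no year, so the year is taken from the crash time passed in. The crash time used in the example (19:23:14 on 10 June 2012, the last entry in the posted log) and the function names are assumptions for illustration, and month names are parsed in the default C/English locale.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line, or return None."""
    try:
        return datetime.strptime("%d %s" % (year, line[:15]),
                                 "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # continuation line or unrelated text


def lines_before(lines, crash, minutes=15):
    """Keep the lines stamped within `minutes` before `crash` (inclusive)."""
    start = crash - timedelta(minutes=minutes)
    kept = []
    for line in lines:
        stamp = parse_stamp(line, crash.year)
        if stamp is not None and start <= stamp <= crash:
            kept.append(line)
    return kept


if __name__ == "__main__":
    crash = datetime(2012, 6, 10, 19, 23, 14)  # assumed crash time
    with open("/var/log/messages") as log:
        for line in lines_before(log, crash):
            print(line, end="")
```

The same function could be pointed at the ocssd log only if its timestamps use the same format, which is not confirmed in this thread.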




