Old 07-07-2010, 08:33 AM
Alexander Farber
 
Default kernel: Machine check events logged

Hello,

every few hours I get the following message in /var/log/messages:

Jul 5 20:23:28 hXXX kernel: Machine check events logged
Jul 5 20:53:28 hXXX kernel: Machine check events logged
Jul 5 22:13:28 hXXX kernel: Machine check events logged
Jul 5 23:53:28 hXXX kernel: Machine check events logged
Jul 5 23:58:27 hXXX kernel: Machine check events logged
Jul 6 01:38:27 hXXX kernel: Machine check events logged
Jul 6 04:48:27 hXXX kernel: Machine check events logged
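
To put a number on "every few hours", here is a minimal Python sketch that computes the gaps between the log entries above. The year is an assumption (syslog lines omit it; 2010 is taken from the date of this post), and `%b` parsing assumes an English/C locale. The helper names are made up for illustration.

```python
from datetime import datetime

# The seven syslog lines quoted above, unchanged.
LOG_LINES = """\
Jul 5 20:23:28 hXXX kernel: Machine check events logged
Jul 5 20:53:28 hXXX kernel: Machine check events logged
Jul 5 22:13:28 hXXX kernel: Machine check events logged
Jul 5 23:53:28 hXXX kernel: Machine check events logged
Jul 5 23:58:27 hXXX kernel: Machine check events logged
Jul 6 01:38:27 hXXX kernel: Machine check events logged
Jul 6 04:48:27 hXXX kernel: Machine check events logged
""".splitlines()

YEAR = 2010  # assumption: not present in the log, taken from the post date


def parse_timestamp(line, year=YEAR):
    """Parse the leading 'Mon D HH:MM:SS' of a syslog line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def intervals_seconds(lines):
    """Return the gap in seconds between each pair of consecutive entries."""
    stamps = [parse_timestamp(line) for line in lines]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


for gap in intervals_seconds(LOG_LINES):
    print(f"{gap:6d} s  ({gap / 60:.1f} min)")
```

For these seven lines the gaps range from about 5 minutes to a little over 3 hours. Every gap is within a second of a multiple of five minutes, which would be consistent with the events being picked up by a periodic check rather than at the moment they occur; that is an inference from the timestamps, not something the log states.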

And in the /var/log/mcelog I see:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
uptime (unreliable)]
MISC c008000001000000 ADDR 1148f5940
Northbridge NB Array Error
bit35 = err cpu3
bit42 = L3 subcache in error bit 0
bit43 = L3 subcache in error bit 1
bit46 = corrected ecc error
bit59 = misc error valid
memory/cache error 'generic read mem transaction, generic
transaction, level generic'
STATUS 9c1f4cf8001c011b MCGSTATUS 0
No DIMM found for 1148f5940 in SMBIOS
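
The STATUS value in this record can be checked by hand. The following is a minimal Python sketch, assuming the value follows the standard x86 MCi_STATUS register layout (flag bits 57 to 63) and that the low 16 bits use the memory-hierarchy error code layout `0000 0001 RRRR TTLL`. The function names are hypothetical; bits 35, 42, 43, 46 and 59 are simply the ones mcelog itself listed above.

```python
# Sketch: pull a few fields out of the STATUS value from the mcelog record.
# Assumption: the value follows the standard x86 MCi_STATUS register layout.
STATUS = 0x9C1F4CF8001C011B

# Architectural flag bits at the top of MCi_STATUS.
FLAGS = {
    63: "VAL (record valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Bits that the mcelog output reported as set.
MCELOG_REPORTED_BITS = (35, 42, 43, 46, 59)


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def decode_flags(status):
    """Map each architectural flag name to True/False."""
    return {name: bool(bit(status, n)) for n, name in FLAGS.items()}


def split_error_code(status):
    """Split the low 16 bits, assuming the layout 0000 0001 RRRR TTLL."""
    code = status & 0xFFFF
    return {
        "code": code,
        "rrrr": (code >> 4) & 0xF,  # request type field
        "tt": (code >> 2) & 0x3,    # transaction type field
        "ll": code & 0x3,           # cache level field
    }


for name, is_set in decode_flags(STATUS).items():
    print(f"{name}: {is_set}")
print("mcelog bits set:", [n for n in MCELOG_REPORTED_BITS if bit(STATUS, n)])
print("error code fields:", split_error_code(STATUS))
```

With this value, UC and PCC come out clear while VAL, EN, MISCV and ADDRV are set. That agrees with the "corrected ecc error" line: the hardware reports the error as corrected. It does not explain why the errors keep recurring.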

My machine (a CentOS 5.5/64-bit server rented from the German
hosting provider strato.de) seems to run OK as a LAMP server, though...

What do these messages actually mean? Is the RAM defective,
and how critical is it? (I have an important event this Friday
and would prefer not to take the machine offline.)

Thank you. I'm attaching my dmesg output below.

Regards
Alex

Linux version 2.6.18-194.8.1.el5 (mockbuild@builder10.centos.org) (gcc
version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Thu Jul 1 19:04:48
EDT 2010
Command line: ro root=LABEL=/ console=tty0 console=ttyS0,57600
BIOS-provided physical RAM map:
BIOS-e820: 0000000000010000 - 000000000009f000 (usable)
BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000ddfb0000 (usable)
BIOS-e820: 00000000ddfb0000 - 00000000ddfbe000 (ACPI data)
BIOS-e820: 00000000ddfbe000 - 00000000ddfe0000 (ACPI NVS)
BIOS-e820: 00000000ddfe0000 - 00000000ddfee000 (reserved)
BIOS-e820: 00000000ddff0000 - 00000000de000000 (reserved)
BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
DMI present.
ACPI: RSDP (v000 ACPIAM ) @ 0x00000000000faf80
ACPI: RSDT (v001 032510 RSDT1503 0x20100325 MSFT 0x00000097) @
0x00000000ddfb0000
ACPI: FADT (v002 032510 FACP1503 0x20100325 MSFT 0x00000097) @
0x00000000ddfb0200
ACPI: MADT (v001 032510 APIC1503 0x20100325 MSFT 0x00000097) @
0x00000000ddfb0390
ACPI: MCFG (v001 032510 OEMMCFG 0x20100325 MSFT 0x00000097) @
0x00000000ddfb0400
ACPI: OEMB (v001 032510 OEMB1503 0x20100325 MSFT 0x00000097) @
0x00000000ddfbe040
ACPI: HPET (v001 032510 OEMHPET 0x20100325 MSFT 0x00000097) @
0x00000000ddfb48c0
ACPI: SSDT (v001 A M I POWERNOW 0x00000001 AMD 0x00000001) @
0x00000000ddfb4900
ACPI: DSDT (v001 A96B3 A96B3210 0x00000210 INTL 0x20051117) @
0x0000000000000000
No NUMA configuration found
Faking a node at 0000000000000000-0000000120000000
Bootmem setup node 0 0000000000000000-0000000120000000
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
On node 0 totalpages: 1022763
DMA zone: 2627 pages, LIFO batch:0
DMA32 zone: 890856 pages, LIFO batch:31
Normal zone: 129280 pages, LIFO batch:31
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 0:4 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 0:4 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
Processor #2 0:4 APIC version 16
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Processor #3 0:4 APIC version 16
ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 4, version 33, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to physical flat
ACPI: HPET id: 0x8300 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Nosave address range: 000000000009f000 - 00000000000a0000
Nosave address range: 00000000000a0000 - 00000000000e4000
Nosave address range: 00000000000e4000 - 0000000000100000
Nosave address range: 00000000ddfb0000 - 00000000ddfbe000
Nosave address range: 00000000ddfbe000 - 00000000ddfe0000
Nosave address range: 00000000ddfe0000 - 00000000ddfee000
Nosave address range: 00000000ddfee000 - 00000000ddff0000
Nosave address range: 00000000ddff0000 - 00000000de000000
Nosave address range: 00000000de000000 - 00000000ff700000
Nosave address range: 00000000ff700000 - 0000000100000000
Allocating PCI resources starting at e0000000 (gap: de000000:21700000)
SMP: Allowing 4 CPUs, 0 hotplug CPUs
Built 1 zonelists. Total pages: 1022763
Kernel command line: ro root=LABEL=/ console=tty0 console=ttyS0,57600
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Checking aperture...
CPU 0: aperture @ 4000000 size 32 MB
Aperture too small (32 MB)
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 4000000
Nosave address range: 0000000004000000 - 0000000008000000
ACPI: DMAR not present
Memory: 4016200k/4718592k available (2575k kernel code, 144564k
reserved, 1303k data, 212k init)
Calibrating delay loop (skipped), value calculated using timer
frequency.. 5000.21 BogoMIPS (lpj=2500105)
Security Framework v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 0/0 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
Using local APIC timer interrupts.
Detected 12.500 MHz APIC timer.
SMP alternatives: switching to SMP code
Booting processor 1/4 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 5000.36 BogoMIPS (lpj=2500184)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 1/1 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
Quad-Core AMD Opteron(tm) Processor 1381 stepping 02
SMP alternatives: switching to SMP code
Booting processor 2/4 APIC 0x2
Initializing CPU#2
Calibrating delay using timer specific routine.. 4999.71 BogoMIPS (lpj=2499855)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 2/2 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 2
Quad-Core AMD Opteron(tm) Processor 1381 stepping 02
SMP alternatives: switching to SMP code
Booting processor 3/4 APIC 0x3
Initializing CPU#3
Calibrating delay using timer specific routine.. 5000.16 BogoMIPS (lpj=2500084)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 3/3 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 3
Quad-Core AMD Opteron(tm) Processor 1381 stepping 02
Brought up 4 CPUs
testing NMI watchdog ... OK.
time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
time.c: Detected 2500.108 MHz processor.
sizeof(vma)=176 bytes
sizeof(page)=56 bytes
sizeof(inode)=560 bytes
sizeof(dentry)=216 bytes
sizeof(ext3inode)=760 bytes
sizeof(buffer_head)=96 bytes
sizeof(skbuff)=248 bytes
migration_cost=230
checking if image is initramfs... it is
Freeing initrd memory: 2614k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
PCI: Using configuration type 1
PCI: Using configuration type 1 for extended access
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: set SATA to AHCI mode
PCI: Ignoring BAR0-3 of IDE controller 0000:00:14.1
PCI: Transparent bridge - 0000:00:14.4
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE4._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE5._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0PC._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 *5 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 7 10 11 12 14 *15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKF] (IRQs 9) *0, disabled.
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 7 10 11 12 14 15) *0, disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 14 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
PCI: Cannot allocate resource region 1 of device 0000:00:14.0
NetLabel: Initializing
NetLabel: domain hash size = 128
NetLabel: protocols = UNLABELED CIPSOv4
NetLabel: unlabeled traffic allowed by default
hpet0: at MMIO 0xfed00000 (virtual 0xffffffffff5fe000), IRQs 2, 8, 0, 0
hpet0: 4 32-bit timers, 14318180 Hz
ACPI: DMAR not present
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ 4000000 size 65536 KB
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
pnp: 00:0b: ioport range 0xa00-0xa0f has been reserved
pnp: 00:0b: ioport range 0xa10-0xa1f has been reserved
PCI: Error while updating region 0000:00:14.0/1 (e0000004 != 8000a014)
PCI: Bridge: 0000:00:01.0
IO window: c000-cfff
MEM window: fe800000-fe9fffff
PREFETCH window 0x00000000fc000000-0x00000000fdffffff
PCI: Bridge: 0000:00:04.0
IO window: d000-dfff
MEM window: fea00000-feafffff
PREFETCH window: disabled.
PCI: Bridge: 0000:00:05.0
IO window: e000-efff
MEM window: feb00000-febfffff
PREFETCH window: disabled.
PCI: Bridge: 0000:00:14.4
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Setting latency timer of device 0000:00:04.0 to 64
PCI: Setting latency timer of device 0000:00:05.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
audit: initializing netlink socket (disabled)
type=2000 audit(1278326009.398:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
SELinux: Registering netfilter hooks
Initializing Cryptographic API
alg: No test for crc32c (crc32c-generic)
ksign: Installing public key data
Loading keyring
- Added public key 71959A475B93578
- User ID: CentOS (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
Boot video device is 0000:01:05.0
PCI: Setting latency timer of device 0000:00:04.0 to 64
PCI: Setting latency timer of device 0000:00:05.0 to 64
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: duty_cycle spans bit 4
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
Real Time Clock Driver v1.12ac
hpet_resources: 0xfed00000 is busy
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
00:05: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:06: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
brd: module loaded
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
SB600_PATA: IDE controller at PCI slot 0000:00:14.1
GSI 16 sharing vector 0xC1 and IRQ 16
ACPI: PCI Interrupt 0000:00:14.1[A] -> GSI 16 (level, low) -> IRQ 193
SB600_PATA: chipset revision 0
SB600_PATA: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xff00-0xff07, BIOS settings: hdaio, hdbio
Probing IDE interface ide0...
Probing IDE interface ide0...
Probing IDE interface ide1...
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
PNP: PS/2 controller doesn't have AUX irq; using default 12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
ACPI: (supports S0 S1 S3 S4 S5)
Initalizing network drop monitor service
Freeing unused kernel memory: 212k freed
Write protecting the kernel read-only data: 504k
GSI 17 sharing vector 0xC9 and IRQ 17
ACPI: PCI Interrupt 0000:00:13.5[D] -> GSI 19 (level, low) -> IRQ 201
ehci_hcd 0000:00:13.5: EHCI Host Controller
ehci_hcd 0000:00:13.5: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:13.5: applying AMD SB600/SB700 USB freeze workaround
ehci_hcd 0000:00:13.5: debug port 1
ehci_hcd 0000:00:13.5: irq 201, io mem 0xfe7ff000
ehci_hcd 0000:00:13.5: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 10 ports detected
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ACPI: PCI Interrupt 0000:00:13.0[A] -> GSI 16 (level, low) -> IRQ 193
ohci_hcd 0000:00:13.0: OHCI Host Controller
ohci_hcd 0000:00:13.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:13.0: irq 193, io mem 0xfe7fe000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
GSI 18 sharing vector 0xD1 and IRQ 18
ACPI: PCI Interrupt 0000:00:13.1[b] -> GSI 17 (level, low) -> IRQ 209
ohci_hcd 0000:00:13.1: OHCI Host Controller
ohci_hcd 0000:00:13.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:13.1: irq 209, io mem 0xfe7fd000
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
GSI 19 sharing vector 0xD9 and IRQ 19
ACPI: PCI Interrupt 0000:00:13.2[C] -> GSI 18 (level, low) -> IRQ 217
ohci_hcd 0000:00:13.2: OHCI Host Controller
ohci_hcd 0000:00:13.2: new USB bus registered, assigned bus number 4
ohci_hcd 0000:00:13.2: irq 217, io mem 0xfe7fc000
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:13.3[b] -> GSI 17 (level, low) -> IRQ 209
ohci_hcd 0000:00:13.3: OHCI Host Controller
ohci_hcd 0000:00:13.3: new USB bus registered, assigned bus number 5
ohci_hcd 0000:00:13.3: irq 209, io mem 0xfe7fb000
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:13.4[C] -> GSI 18 (level, low) -> IRQ 217
ohci_hcd 0000:00:13.4: OHCI Host Controller
ohci_hcd 0000:00:13.4: new USB bus registered, assigned bus number 6
ohci_hcd 0000:00:13.4: irq 217, io mem 0xfe7fa000
usb usb6: configuration #1 chosen from 1 choice
hub 6-0:1.0: USB hub found
hub 6-0:1.0: 2 ports detected
USB Universal Host Controller Interface driver v3.0
md: raid1 personality registered for level 1
SCSI subsystem initialized
libata version 3.00 loaded.
ahci 0000:00:12.0: version 3.0
GSI 20 sharing vector 0xE1 and IRQ 20
ACPI: PCI Interrupt 0000:00:12.0[A] -> GSI 22 (level, low) -> IRQ 225
ahci 0000:00:12.0: controller can't do 64bit DMA, forcing 32bit
ahci 0000:00:12.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:12.0: flags: ncq sntf ilck pm led clo pmp pio slum part
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
ata1: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ff900 irq 225
ata2: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ff980 irq 225
ata3: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ffa00 irq 225
ata4: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ffa80 irq 225
ata1: softreset failed (device not ready)
ata1: failed due to HW bug, retry pmp=0
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: Hitachi HDS721050CLA362, JP2OA39C, max UDMA/133
ata1.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata2: softreset failed (device not ready)
ata2: failed due to HW bug, retry pmp=0
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-8: Hitachi HDS721050CLA362, JP2OA39C, max UDMA/133
ata2.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
Vendor: ATA Model: Hitachi HDS72105 Rev: JP2O
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
Vendor: ATA Model: Hitachi HDS72105 Rev: JP2O
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
sdb: sdb1 sdb2 sdb3
sd 1:0:0:0: Attached scsi disk sdb
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel@redhat.com
device-mapper: dm-raid45: initialized v0.2594l
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb3 ...
md: adding sdb3 ...
md: sdb1 has different UUID to sdb3
md: adding sda3 ...
md: sda1 has different UUID to sdb3
md: created md1
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
raid1: raid set md1 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md0
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md0 active with 2 out of 2 mirrors
md: ... autorun DONE.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: Disabled at runtime.
SELinux: Unregistering netfilter hooks
type=1404 audit(1278326037.818:2): selinux=0 auid=4294967295 ses=4294967295
input: PC Speaker as /class/input/input0
e1000e: Intel(R) PRO/1000 Network Driver - 1.0.2-k3.1
e1000e: Copyright (c) 1999-2008 Intel Corporation.
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 193
PCI: Setting latency timer of device 0000:02:00.0 to 64
EDAC MC: Ver: 2.0.1 Jul 1 2010
EDAC amd64_edac: Ver: 3.2.0 Jul 1 2010
Floppy drive(s): fd0 is 1.44M
e1000e 0000:02:00.0: Warning: detected ASPM enabled in EEPROM
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: Attached scsi generic sg1 type 0
eth0: (PCI Express:2.5GB/s:Width x1) 40:61:86:ec:c0:45
eth0: Intel(R) PRO/1000 Network Connection
eth0: MAC: 2, PHY: 2, PBA No: ffffff-0ff
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 209
PCI: Setting latency timer of device 0000:03:00.0 to 64
e1000e 0000:03:00.0: Warning: detected ASPM enabled in EEPROM
eth1: (PCI Express:2.5GB/s:Width x1) 40:61:86:ec:c0:46
eth1: Intel(R) PRO/1000 Network Connection
eth1: MAC: 2, PHY: 2, PBA No: ffffff-0ff
piix4_smbus 0000:00:14.0: Found 0000:00:14.0 device
EDAC amd64: This node reports that Memory ECC is currently disabled,
set F3x44[22] (0000:00:18.3).
EDAC amd64: WARNING: ECC is disabled by BIOS. Module will NOT be loaded.
Either Enable ECC in the BIOS, or set 'ecc_enable_override'.
Also, use of the override can cause unknown side effects.
amd64_edac: probe of 0000:00:18.2 failed with error -22
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
floppy0: no floppy controllers found
Floppy drive(s): fd0 is 1.44M
floppy0: no floppy controllers found
lp: driver loaded but no devices found
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [PWRB]
ACPI: Mapper loaded
dell-wmi: No known WMI GUID found
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
device-mapper: multipath: version 1.0.5 loaded
EXT3 FS on md1, internal journal
Adding 1052248k swap on /dev/sda2. Priority:-1 extents:1 across:1052248k
Adding 1052248k swap on /dev/sdb2. Priority:-2 extents:1 across:1052248k
powernow-k8: Found 1 Quad-Core AMD Opteron(tm) Processor 1381
processors (4 cpu cores) (version 2.20.00)
powernow-k8: 0 : fid 0x0 gid 0x0 (2500 MHz)
powernow-k8: 1 : fid 0x0 gid 0x0 (1800 MHz)
powernow-k8: 2 : fid 0x0 gid 0x0 (1300 MHz)
powernow-k8: 3 : fid 0x0 gid 0x0 (800 MHz)
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
ADDRCONF(NETDEV_UP): eth0: link is not ready
e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
eth0: 10/100 speed: disabling TSO
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
ADDRCONF(NETDEV_UP): eth1: link is not ready
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
Machine check events logged
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
 
Old 07-07-2010, 12:51 PM
 
Default kernel: Machine check events logged

Alexander Farber wrote:
> Hello,
>
> every few hours I get the following message in /var/log/message:
>
> Jul 5 20:23:28 hXXX kernel: Machine check events logged
<snip>
> And in the /var/log/mcelog I see:
>
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
> uptime (unreliable)]
> MISC c008000001000000 ADDR 1148f5940
> Northbridge NB Array Error
> bit35 = err cpu3
> bit42 = L3 subcache in error bit 0
> bit43 = L3 subcache in error bit 1
> bit46 = corrected ecc error
> bit59 = misc error valid
> memory/cache error 'generic read mem transaction, generic
> transaction, level generic'
> STATUS 9c1f4cf8001c011b MCGSTATUS 0
> No DIMM found for 1148f5940 in SMBIOS
>
> My machine (a CentOS 5.5/64bit server rented at German
> hoster strato.de) seems to run ok as a LAMP server though...
>
> What do these messages actually mean,
> is RAM defect and how critical is it
> (because I have an important event this Friday
> and would prefer not to take the machine offline)
<snip>
First, this is *very* bad - I'm not good enough on this to tell you if
it's the CPU, or the motherboard, but it's one of the two, *not* just
memory. Second, if you're paying for hosting, and it's *their* server, you
need to get on the phone with them *now*, and tell them that they need to
fix it, yesterday would be preferable. They *should* have seen the logs.

Dunno if you have a physical machine hosted there, or a VM; if the latter,
they can move it without you seeing any downtime at all. If the former,
they can just hot swap the drives into another server.

But call them *NOW*. You're paying for the service.

mark



 
Old 07-07-2010, 01:32 PM
Alexander Farber
 
Default kernel: Machine check events logged

Hello Mark,

On Wed, Jul 7, 2010 at 2:51 PM, <m.roth@5-cent.us> wrote:
> First, this is *very* bad - I'm not good enough on this to tell you if
> it's the CPU, or the motherboard, but it's one of the two, *not* just
> memory. Second, if you're paying for hosting, and it's *their* server, you
> need to get on the phone with them *now*, and tell them that they need to
> fix it, yesterday would be preferable. They *should* have seen the logs.

yes, thanks for confirming this.

I called them a few hours ago,
and they are currently running "hardware tests"
on my dedicated server.

Stupidly, they (Strato.de) have refused to
move my HDDs to another machine and
then just test the old machine "offline" :-(
(Not the best service, but I'm locked
into an 18-month contract...)

Regards
Alex
 
Old 07-07-2010, 01:41 PM
 
Default kernel: Machine check events logged

Alexander Farber wrote:
> Hello Mark,
>
> On Wed, Jul 7, 2010 at 2:51 PM, <m.roth@5-cent.us> wrote:
>> First, this is *very* bad - I'm not good enough on this to tell you if
>> it's the CPU, or the motherboard, but it's one of the two, *not* just
>> memory. Second, if you're paying for hosting, and it's *their* server,
>> you need to get on the phone with them *now*, and tell them that they need
>> to fix it, yesterday would be preferable. They *should* have seen the
>> logs.
>
> yes, thanks for confirming this.
>
> I've called them few hours ago and they are currently performing
> "hardware tests" with my dedicated server now.
>
> Stupidly they (Strato.de) have refused to move my HDDs to another
> machine and then just test the old machine "offline" :-(
> (Not the best service, but I'm locked by an 18-month contract...)

Really? And what's the SLA they have in the contract (and there *better*
be one)?

mark

 
Old 07-07-2010, 02:21 PM
Peter Kjellstrom
 
Default kernel: Machine check events logged

On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
> Alexander Farber wrote:
> > every few hours I get the following message in /var/log/message:
> > Jul 5 20:23:28 hXXX kernel: Machine check events logged
...
> > MCE 0
> > HARDWARE ERROR. This is *NOT* a software problem!
> > Please contact your hardware vendor
> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
> > uptime (unreliable)]
> > MISC c008000001000000 ADDR 1148f5940
> > Northbridge NB Array Error
> > bit35 = err cpu3
> > bit42 = L3 subcache in error bit 0
> > bit43 = L3 subcache in error bit 1
> > bit46 = corrected ecc error
> > bit59 = misc error valid
> > memory/cache error 'generic read mem transaction, generic
> > transaction, level generic'
> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
> > No DIMM found for 1148f5940 in SMBIOS
...
> First, this is *very* bad

That's a bit harsh. Depending on what the actual error is that triggers this
MCE, it may actually be just an annoyance (even though, yes, it is a hardware
problem). Also, the OP did mention that the server runs without any obvious
problems.

> - I'm not good enough on this to tell you if
> it's the CPU, or the motherboard, but it's one of the two, *not* just
> memory.

What do you base that on? I've seen a lot of different MCE errors being
resolved by finding and replacing flaky DIMMs.

> Second, if you're paying for hosting, and it's *their* server, you
> need to get on the phone with them *now*, and tell them that they need to
> fix it, yesterday would be preferable. They *should* have seen the logs.
>
> Dunno if you have a physical machine hosted there, or a VM'

I'm quite sure you can't get that kind of MCE-dump inside a VM.

/Peter

> if the latter,
> they can move it without you seeing any downtime at all. If the former,
> they can just hot swap the drives into another server.
>
> But call them *NOW*. You're paying for the service.
>
> mark
 
Old 07-07-2010, 02:26 PM
 
Default kernel: Machine check events logged

Peter Kjellstrom wrote:
> On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
>> Alexander Farber wrote:
>> > every few hours I get the following message in /var/log/message:
>> > Jul 5 20:23:28 hXXX kernel: Machine check events logged
> ...
>> > MCE 0
>> > HARDWARE ERROR. This is *NOT* a software problem!
>> > Please contact your hardware vendor
>> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
>> > uptime (unreliable)]
>> > MISC c008000001000000 ADDR 1148f5940
>> > Northbridge NB Array Error
>> > bit35 = err cpu3
>> > bit42 = L3 subcache in error bit 0
>> > bit43 = L3 subcache in error bit 1
>> > bit46 = corrected ecc error
>> > bit59 = misc error valid
>> > memory/cache error 'generic read mem transaction, generic
>> > transaction, level generic'
>> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
>> > No DIMM found for 1148f5940 in SMBIOS
> ...
<snip>
>> - I'm not good enough on this to tell you if
>> it's the CPU, or the motherboard, but it's one of the two, *not* just
>> memory.
>
> What do you base that on? I've seen a lot of different MCE-errors being
> resolved by finding and replacing flaky dimms.

Because it says NB Array error, and errors in the L3 subcache. I've seen
enough memory errors, and not seen an NB array & subcache error.

I do just note that there's "No DIMM found for ... in SMBIOS", but I
assume that's just a bank that's not filled.
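
One way to sanity-check the reported address is to compare it against the BIOS-e820 map from the dmesg output earlier in the thread. The sketch below is in Python and makes two assumptions: that the ADDR value in the MCE record is a physical address, and that each e820 line is a half-open range (start included, end excluded). `find_region` is a made-up helper name.

```python
# BIOS-e820 map copied from the dmesg output in the first post.
E820_MAP = [
    (0x0000000000010000, 0x000000000009F000, "usable"),
    (0x000000000009F000, 0x00000000000A0000, "reserved"),
    (0x00000000000E4000, 0x0000000000100000, "reserved"),
    (0x0000000000100000, 0x00000000DDFB0000, "usable"),
    (0x00000000DDFB0000, 0x00000000DDFBE000, "ACPI data"),
    (0x00000000DDFBE000, 0x00000000DDFE0000, "ACPI NVS"),
    (0x00000000DDFE0000, 0x00000000DDFEE000, "reserved"),
    (0x00000000DDFF0000, 0x00000000DE000000, "reserved"),
    (0x00000000FF700000, 0x0000000100000000, "reserved"),
    (0x0000000100000000, 0x0000000120000000, "usable"),
]

# ADDR field from the mcelog record (assumed to be a physical address).
MCE_ADDR = 0x1148F5940


def find_region(addr, e820=E820_MAP):
    """Return the (start, end, kind) entry containing addr, or None."""
    for start, end, kind in e820:
        if start <= addr < end:  # assumption: end address is exclusive
            return (start, end, kind)
    return None


region = find_region(MCE_ADDR)
if region is None:
    print(f"{MCE_ADDR:#x} is not covered by any e820 entry")
else:
    start, end, kind = region
    print(f"{MCE_ADDR:#x} lies in {start:#x} - {end:#x} ({kind})")
```

Under those assumptions the address falls inside the last "usable" range, the one above 4 GiB, so the BIOS reports RAM at that location. This does not identify a DIMM, and it does not settle whether the fault is in memory or in the cache holding that line.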

mark

 
Old 07-07-2010, 02:26 PM
Alexander Farber
 
Default kernel: Machine check events logged

I've only found this Solaris blog, but don't understand it well enough:
http://blogs.sun.com/gavinm/entry/amd_opteron_athlon64_turion64_fault

Can't provide you more details, because my dedicated server
has been under the hoster's "hardware tests" for 5 hours now :-(
(and I guess everyone will run home for the Germany-Spain game soon)

Regards
Alex

>> > MCE 0
>> > HARDWARE ERROR. This is *NOT* a software problem!
>> > Please contact your hardware vendor
>> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
>> > uptime (unreliable)]
>> > MISC c008000001000000 ADDR 1148f5940
>> > Northbridge NB Array Error
>> > bit35 = err cpu3
>> > bit42 = L3 subcache in error bit 0
>> > bit43 = L3 subcache in error bit 1
>> > bit46 = corrected ecc error
>> > bit59 = misc error valid
>> > memory/cache error 'generic read mem transaction, generic
>> > transaction, level generic'
>> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
>> > No DIMM found for 1148f5940 in SMBIOS
 
Old 07-07-2010, 02:38 PM
 
Default kernel: Machine check events logged

Alexander Farber wrote:
> I've only found this Solaris blog, but don't understand it well enough:
> http://blogs.sun.com/gavinm/entry/amd_opteron_athlon64_turion64_fault
>
> Can't provide you more details, because my dedicated server
> is under hoster's "hardware tests" since 5 hours :-(
> (and I guess everyone will run home for the Germany-Spain game soon)
>
First, that's Solaris (or OpenSolaris), so it's not the same. Second,
you'll notice that the diagram and the table do *not* mention L3 caches,
so the architecture's a bit different.

Finally, note where the article says, "If an error is recoverable then it
does not raise a Machine Check Exception (MCE or mc#) when detected. The
recoverable errors, broadly speaking, are single-bit ECC errors from
ECC-protected arrays and parity errors on clean parity- <snip>
If an error is irrecoverable then detection of that error will raise a
machine check exception (if the bit that controls mc# for that error type
is set; if not you'll either never know or you pick it up by polling). The
mc# handler can extract information about the error from the machine check
architecture registers as before, but has the additional responsibility of
deciding what further actions (which may include panic and reboot) are
required. A machine check exception is a form of interrupt which allows
immediate notification of an error condition - you can't afford to wait to
poll for the error since that could result in the use of bad data and
associated data corruption.
--- end excerpt ---

So, it is, in fact, serious, and non-recoverable, so they have a problem
with their hardware, and you've paid for a service that they provide,
including hardware that's supposed to be up 99.<whatever you paid for>% of
the time. If they don't get it up, there should be penalties against them,
or at least money rebates to *you*.

There may also be limits that would mean they've broken the contract, and
are liable.

mark
> Regards
> Alex
>
>>> > MCE 0
>>> > HARDWARE ERROR. This is *NOT* a software problem!
>>> > Please contact your hardware vendor
>>> > CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51
>>> > uptime (unreliable)]
>>> > MISC c008000001000000 ADDR 1148f5940
>>> > Northbridge NB Array Error
>>> > bit35 = err cpu3
>>> > bit42 = L3 subcache in error bit 0
>>> > bit43 = L3 subcache in error bit 1
>>> > bit46 = corrected ecc error
>>> > bit59 = misc error valid
>>> > memory/cache error 'generic read mem transaction, generic
>>> > transaction, level generic'
>>> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
>>> > No DIMM found for 1148f5940 in SMBIOS


 
Old 07-07-2010, 03:44 PM
Peter Kjellstrom
 
Default kernel: Machine check events logged

On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
> Peter Kjellstrom wrote:
> > On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
> >> Alexander Farber wrote:
...
> >> > MISC c008000001000000 ADDR 1148f5940
> >> > Northbridge NB Array Error
> >> > bit35 = err cpu3
> >> > bit42 = L3 subcache in error bit 0
> >> > bit43 = L3 subcache in error bit 1
> >> > bit46 = corrected ecc error
> >> > bit59 = misc error valid
> >> > memory/cache error 'generic read mem transaction, generic
> >> > transaction, level generic'
> >> > STATUS 9c1f4cf8001c011b MCGSTATUS 0
> >> > No DIMM found for 1148f5940 in SMBIOS
...
> >> - I'm not good enough on this to tell you if
> >> it's the CPU, or the motherboard, but it's one of the two, *not* just
> >> memory.
> >
> > What do you base that on? I've seen a lot of different MCE-errors being
> > resolved by finding and replacing flaky dimms.
>
> Because it says NB Array error, and errors in the L3 subcache. I've seen
> enough memory errors, and not seen an NB array & subcache error.

That does sound like a reasonable guess. However, you presented it as absolute
truth. The MCE could just as easily be read as: NB means not IC/DC/BU
(instruction cache, data cache, bus unit) => actual RAM.

Given that real-world figures show bad RAM to be a lot more likely than a bad
CPU, I'd start by looking at the DIMMs (or at the very least not exclude
them...).
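
As a rough aid for reading the MCE by hand, the STATUS value from the log can
be split into its flag bits. This is only a sketch: bits 57-63 follow the
generic x86 machine-check status layout, bits 45/46 are taken to be the
AMD-specific ECC flags (consistent with the "bit46 = corrected ecc error" line
that mcelog printed), and the names STATUS_FLAGS, decode_status and
is_corrected are made up for illustration.

```python
# Sketch: decode selected flag bits of a machine-check STATUS value.
# Assumption: bits 57-63 use the generic x86 machine-check layout;
# bits 45/46 are AMD-specific and match the mcelog output quoted above.
STATUS_FLAGS = {
    63: "VAL   (register holds a valid error)",
    62: "OVER  (an earlier error was overwritten)",
    61: "UC    (uncorrected error)",
    60: "EN    (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC   (processor context corrupt)",
    46: "CECC  (corrected ECC error)",
    45: "UECC  (uncorrected ECC error)",
}


def decode_status(status: int) -> dict:
    """Return {bit: (description, is_set)} for the flags listed above."""
    return {
        bit: (name, bool((status >> bit) & 1))
        for bit, name in STATUS_FLAGS.items()
    }


def is_corrected(status: int) -> bool:
    """True if the entry is valid and neither UC nor PCC is set."""
    valid = bool((status >> 63) & 1)
    uncorrected = bool((status >> 61) & 1)
    context_corrupt = bool((status >> 57) & 1)
    return valid and not uncorrected and not context_corrupt


if __name__ == "__main__":
    status = 0x9C1F4CF8001C011B  # STATUS value quoted in this thread
    flags = decode_status(status)
    for bit in sorted(flags, reverse=True):
        name, is_set = flags[bit]
        print(f"bit{bit}: {int(is_set)}  {name}")
    print("corrected:", is_corrected(status))
```

For the value in this thread the sketch shows UC and PCC clear and CECC set,
which agrees with mcelog's own decode. It says nothing about which component
produced the error, so it does not settle the RAM versus CPU question.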

> I do just note that there's "No DIMM found for ... in SMBIOS", but I
> assume that's just a bank that's not filled.

or the SMBIOS data is borked, wouldn't be the first time...

/Peter
 
Old 07-07-2010, 03:45 PM
Alexander Farber
 
Default kernel: Machine check events logged

Anyway, my hoster has finished the "hardware tests"
(probably just kept running memtest86 or some vendor CD?)
on my CentOS 5.5/64bit machine with a quad-core Opteron 1381
and said that they haven't found any issues.

I'll post a short note here if I experience any issues
on my LAPP server (preferans.de - I run phpBB3+
PostgreSQL+my backend for a facebook card game there)
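
To see whether the events keep recurring, a small script could pull the
"Machine check events logged" lines out of the syslog and print the gaps
between them. This is only a sketch: the function names are made up for
illustration, the year is an assumption (classic syslog timestamps omit it),
and the month parsing assumes an English/C locale. The sample lines are the
first three from the start of this thread.

```python
from datetime import datetime

MARKER = "kernel: Machine check events logged"


def mce_timestamps(lines, year=2010):
    """Yield a datetime for each syslog line containing the MCE marker.

    The year is passed in because the log lines do not carry one (assumption).
    """
    for line in lines:
        if MARKER not in line:
            continue
        month, day, clock = line.split()[:3]
        yield datetime.strptime(
            f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
        )


def gaps(stamps):
    """Return the time differences between consecutive events, oldest first."""
    ordered = sorted(stamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    sample = [
        "Jul 5 20:23:28 hXXX kernel: Machine check events logged",
        "Jul 5 20:53:28 hXXX kernel: Machine check events logged",
        "Jul 5 22:13:28 hXXX kernel: Machine check events logged",
    ]
    for gap in gaps(mce_timestamps(sample)):
        print(gap)
```

Feeding it the lines of /var/log/messages instead of the sample list would
give the same overview for the whole log.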

Thank you
Alex
 
