Sometimes the trickiest errors are the ones you should ignore.
I’m running PBX in a Flash, based on CentOS 5.2, on a Dell PowerEdge 1500SC. Back in February, I got a segmentation fault and, after some discussion on the PBX in a Flash forum, concluded that I needed to replace the hardware or at least the memory. Now I’m not so sure.
For one thing, the server has been running without issues for the last eight months. Is it really on its last legs? Let’s do some checking.
Memory Errors
I ran MemTest86+ 4.00 and Microsoft’s Windows Memory Diagnostic on the machine. Both of them report errors. The interesting thing is that if 1GB of RAM is installed, MemTest86+ reports an error at 1023.9MB; if I remove those DIMMs and install 256MB, it reports an error at 255.9MB. The error follows the top of whatever memory is installed rather than any particular DIMM, so it looks less like bad RAM and more like the test tools not getting along with the memory controller.
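To make that pattern explicit, here is a small sketch of the arithmetic. It assumes 1GB means 1024MB and uses only the two readings quoted above; the function name is just illustrative.

```python
def offset_from_top(installed_mb, error_mb):
    """Distance in MB between the reported error address and the top of installed RAM."""
    return round(installed_mb - error_mb, 1)

# (installed RAM in MB, MemTest86+ error address in MB) from the two test runs
# Assumption: "1GB" is treated as 1024MB.
runs = [(1024, 1023.9), (256, 255.9)]

offsets = [offset_from_top(installed, err) for installed, err in runs]
print(offsets)
```

If the offsets come out the same for different DIMMs, the reported address is tracking the top of memory, which is hard to explain with a single bad chip.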
Sure enough, the Dell MPMemory diagnostics, part of Dell’s bootable 32 Bit Diagnostics CD, find no errors in the 1GB of memory, even after five passes. A Dell tech support rep confirmed that those diagnostics should be giving an accurate read on the PowerEdge 1500SC.
agpgart Errors
The next thing that concerned me was a set of errors that occurred on startup.
Fairly early in the boot sequence, there are some agpgart errors, retrieved here in their entirety from the dmesg command:
Linux agpgart interface v0.101 (c) Dave Jones
agpgart: unable to determine aperture size.
agpgart: agp_backend_initialize() failed.
agpgart-serverworks: probe of 0000:00:00.0 failed with error -22
agpgart: unable to determine aperture size.
agpgart: agp_backend_initialize() failed.
agpgart-serverworks: probe of 0000:00:00.1 failed with error -22
agpgart: ServerWorks CNB20HE is unsupported due to lack of documentation.
agpgart: ServerWorks CNB20HE is unsupported due to lack of documentation.
As the Dell tech pointed out, agpgart has to do with video drivers. Apparently CentOS doesn’t support the AGP side of the server’s ServerWorks CNB20HE chipset, as the last two lines of the log say. But the video works fine (all I need is the command line), so I can ignore these errors.
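The "error -22" in the probe lines is a negated errno value. A quick way to look it up, assuming you run this on a Linux machine where Python's errno table matches the kernel's numbering (the helper name is made up for this example):

```python
import errno
import os

def describe_kernel_error(code):
    """Map a negative kernel return code such as -22 to its errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

# The value reported by the agpgart-serverworks probe
print(describe_kernel_error(-22))
```

On Linux, 22 should map to EINVAL (invalid argument), which fits a driver that gives up because it can't work out the aperture size, rather than a hardware fault.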
hda Errors
These hda errors were actually a bigger concern:
hda: media error (bad sector): status=0x51 { DriveReady SeekComplete Error }
hda: media error (bad sector): error=0x30 { LastFailedSense=0x03 }
ide: failed opcode was: unknown
ATAPI device hda:
Error: Medium error -- (Sense key=0x03)
(reserved error code) -- (asc=0x11, ascq=0x05)
The failed "Read 10" packet command was:
"28 00 00 00 00 10 00 00 02 00 00 00 00 00 00 00 "
end_request: I/O error, dev hda, sector 64
Buffer I/O error on device hda, logical block 8
hda: media error (bad sector): status=0x51 { DriveReady SeekComplete Error }
hda: media error (bad sector): error=0x30 { LastFailedSense=0x03 }
ide: failed opcode was: unknown
ATAPI device hda:
Error: Medium error -- (Sense key=0x03)
(reserved error code) -- (asc=0x11, ascq=0x05)
The failed "Read 10" packet command was:
"28 00 00 00 00 06 00 00 0a 00 00 00 00 00 00 00 "
end_request: I/O error, dev hda, sector 24
Buffer I/O error on device hda, logical block 3
Buffer I/O error on device hda, logical block 4
Buffer I/O error on device hda, logical block 5
Buffer I/O error on device hda, logical block 6
Buffer I/O error on device hda, logical block 7
Is a hard drive failing? Or maybe the file structure is corrupt?
The server uses a RAID 5 array of SCSI disks attached to a PERC 3SC controller. First, from the RAID BIOS, I checked the status of each physical disk to see if any disk errors had been recorded. Nope.
Then, again using the Dell 32 Bit Diagnostics CD, I ran the GUI Diagnostics to test the hard drives. They all passed fine (short test only).
Wait a second. What disk is hda anyway? Following some tips in this post, after booting into CentOS, I ran fdisk -l, which shows I have two “drives” (logical drives from the RAID controller’s perspective): sda and sdb. So hda is not one of the logical hard drives used by CentOS.
According to Wikipedia, “/dev/hda is the path to the block device node of the first IDE device,” which, if the system does not use IDE hard drives, is often the optical drive. Sure enough, when I follow the link /dev/cdrom, it tries to read /dev/hda.
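If you'd rather check the link from a script than by hand, a minimal sketch looks like this. It assumes /dev/cdrom exists as a symlink on your system, which depends on the distribution; the function name is only for illustration.

```python
import os

def resolve_device(path):
    """Follow any symlinks and return the final device node path."""
    return os.path.realpath(path)

if __name__ == "__main__":
    # On the server described here this would be expected to point at /dev/hda
    print(resolve_device("/dev/cdrom"))
```

If the path isn't a symlink (or doesn't exist), os.path.realpath just hands back the normalized path, so the output alone won't tell you the link is missing.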
I have a bootable SystemRescueCD in the CD-ROM drive at all times (so I can use the Dell Remote Access Card to remotely format drives and restore a backup if necessary). Apparently the disc has some read errors during bootup, but that only means a bad CD or, at worst, a bad CD-ROM drive. So it looks like I can ignore these hda errors as well. (In fact, after going to the client site and replacing the SystemRescueCD with a fresh copy, the hda errors stopped and I can boot from the CD.)
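The hda log lines themselves back this up. The sketch below decodes the two failed "Read 10" packets using the standard READ(10) layout (opcode 0x28, block address in bytes 2-5, block count in bytes 7-8). The 2048-byte CD block size and the 4096-byte buffer block size are my assumptions; the second is inferred from the log, where sector 64 lines up with logical block 8.

```python
CD_BLOCK_BYTES = 2048       # assumption: standard data CD block size
KERNEL_SECTOR_BYTES = 512   # unit used in "end_request: I/O error ... sector N"
BUFFER_BLOCK_BYTES = 4096   # assumption: inferred from sector 64 <-> logical block 8

def decode_read10(cdb_hex):
    """Decode a READ(10) packet given as space-separated hex bytes."""
    cdb = bytes.fromhex("".join(cdb_hex.split()))
    if cdb[0] != 0x28:
        raise ValueError("not a READ(10) command")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: starting block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: number of blocks to read
    return lba, count

def to_kernel_units(lba, count):
    """Convert a CD block range to the first 512-byte sector and the buffer block range."""
    first_sector = lba * CD_BLOCK_BYTES // KERNEL_SECTOR_BYTES
    last_sector = (lba + count) * CD_BLOCK_BYTES // KERNEL_SECTOR_BYTES - 1
    first_block = first_sector * KERNEL_SECTOR_BYTES // BUFFER_BLOCK_BYTES
    last_block = last_sector * KERNEL_SECTOR_BYTES // BUFFER_BLOCK_BYTES
    return first_sector, (first_block, last_block)

# The two packets exactly as they appear in the dmesg output
packets = [
    "28 00 00 00 00 10 00 00 02 00 00 00 00 00 00 00 ",
    "28 00 00 00 00 06 00 00 0a 00 00 00 00 00 00 00 ",
]
for packet in packets:
    lba, count = decode_read10(packet)
    print(lba, count, to_kernel_units(lba, count))
```

The first packet works out to block 16, which is sector 64 and logical block 8; the second covers blocks 6 through 15, which is sector 24 and logical blocks 3 through 7. Those match the log only if the device uses 2048-byte blocks, which points at a CD rather than a hard disk.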
Maybe this old PowerEdge is good for a while longer after all!