The AVIDD-B Cluster

Outages and Hardware Service

2006

Date System Problem
01/13/06 AVIDD-B nodes bc39-bc59 unavailable due to a power outage, ~4:00pm-7:00pm

2005

Date System Problem
11/22/05 bh1 System rebooted for unknown reasons, ~1:30pm
09/08/05 bh1 Disk failure; system powered down ~3:20pm
08/20/05 GPFS Filesystem became unresponsive due to storage node communication issues. Service restored 08/22 ~10:00
08/18/05 ih3 System rebooted to resolve persistent NFS/networking problem. Unavailable ~3:00pm until 5:00pm
08/10/05 ih1 Kernel panic, unknown cause, ~1:40pm, rebooted ~2:20pm
07/22/05 ih1 System rebooted due to NFS-related hangs (~3:00pm)
07/22/05 bh2 System rebooted due to GPFS-related kernel errors (~1:15pm)
06/07/05 AVIDD Maintenance window extended due to difficulties with the Force10 hardware upgrade
06/06/05 am node rebooted due to kernel-related networking problem, ~11am
05/03/05 AVIDD Maintenance window extended due to difficulties with the EMC NAS/SAN software upgrade
04/12/05 bc44 node offlined due to a failed disk; replacement is pending
04/08/05 bh2 node rebooted due to NFS problems, 2:05pm
04/04/05 ih1 node rebooted due to NFS problems, 11:45am
04/04/05 ic61 node offlined with hardware errors (online 04/04/05)
04/04/05 bc37 node offlined due to SCSI errors (online 04/04/05)
03/25/05 AVIDD-B NFS/home directory problems affecting all manner of cluster functionality
02/21/05 ic13 OoM, rebooted
02/17/05 bh2 system hang, rebooted ~4:40pm
02/01/05 ic53,ic54 system hang; rebooted
01/29/05 bc94 crash; out of memory
01/28/05 bc89-bc91,bc93 offlined for a special project
01/28/05 ic89-ic91,ic93 offlined for a special project
01/21/05 bc52 offlined due to a bad Myrinet port; unavailable indefinitely
01/20/05 bh2 crash ~10:40am; rebooted ~11:15am
01/07/05 bc50 system hang; rebooted
01/06/05 bc39,bc58,bc95 systems powered down for unknown reasons
01/01/05 bc02,bc17,bc50,bc68 systems hung; rebooted

2004

Date System Problem
12/19/04 bc60 rebooted to clear zombies
12/13/04 /N/gpfsi GPFS outage due to SysAdmin error
12/06/04 bc62 system hang; cycled power
11/28/04 bc23,bc30 system hang; cycled power
11/25/04 ih2 NFS induced kernel panic; power cycled
11/22/04 ic65,ic79 system hang; cycled power
11/20/04 ic16 system hang; cycled power
11/19/04 bc59 crash w/ NMI; reseated Myrinet card and system rebooted OK
11/18/04 ic89 GM kernel panic
11/16/04 bh1 Out-of-Memory kernel killer brought the system down; rebooted
11/12/04 bc87 system unresponsive as of ~2am, power-cycled at 12:00pm
11/06/04 bf1 system crashed due to power-supply failure, replaced on 9-Nov.
11/06/04 bh1 system hung, rebooted 10:30am, 11/7/04
10/26/04 bc96 power-cycled after node hung, ~12:15pm
10/22/04 bc44 power-cycled after node hung, ~22:19 10/21
10/18/04 AVIDD-B, AVIDD-I restarted PBS and Maui ~5:00pm to clean out exiting jobs left over from the power outage at IUPUI
10/18/04 AVIDD-I electrical failure resulted in sporadic failures of nodes ic39-ic76, most of the day
10/18/04 AVIDD-I power-cycled ic62, ic63 after they became unresponsive (~9:30am)
10/08/04 AVIDD-B restarted Maui on bh2, ~9:30am (had died at ~8:37)
10/06/04 AVIDD-B nodes bc36,bc45 unresponsive, power cycled ~2:30pm
09/30/04 bh2 kernel panic at ~9:40am, system restarted at ~10:40am
08/31/04 bc83 dead Myrinet LED; switch side; removed from PBS
08/17/04 bh2 hang on /N/B; other NFS filesystems OK; cycled power
08/17/04 bc77 kernel oops related to ext3 filesystem; cycled power
08/17/04 bh2 hang on /N/B; other NFS filesystems OK; cycled power
08/17/04 bh1 hang on /N/B; other NFS filesystems OK; cycled power
08/16/04 bh1,bh2 hang on /N/B; other NFS filesystems OK; cycled power
08/13/04 bc05,bc15,bc31 Nodes were unresponsive, causing Maui to hang; rebooted ~9:45am
08/06/04 ic38 HT turned off
08/06/04 ic89 one dead CPU; cycled power, now two
08/06/04 ic66 one dead CPU
07/09/04 ic44 PS#1 DC output not in normal range; unable to power on.
06/09/04 AVIDD-B bh1 became unresponsive, probably due to GPFS issues; rebooted at ~4:50pm
06/07/04 AVIDD-B bc88 repeatedly drops connections over the Myrinet adapter. Node has been removed from PBS
06/06/04 AVIDD-I GPFS problems on several nodes caused by one node, ic67, on which GPFS hung in a strange state. Problem cleared after ic67 was rebooted.
06/02/04 AVIDD-I F10 switch problems causing NFS mounts and jobs to hang. Incident open with Force10.
05/26/04 AVIDD-I Torque problems prevented jobs from running; Torque was restarted, causing the loss of some jobs.
05/25/04 ih1 Out of memory error caused head node to hang, rebooted and restored to service
05/18/04 tm1 Filesystem errors hung NFS server on the IA64 cluster
05/11/04 ic41,ic42 Out-of-memory errors; nodes required rebooting
04/27/04 im Problems with Ethernet devices; may cause slow connections cluster-wide on AVIDD-I
04/23/04 ih1 GPFS hung on ih1 at 12:45, rebooted and back at ~1:20
04/22/04 AVIDD-B GPFS unavailable for ~6 hours; admins are still investigating
04/11/04 AVIDD-B GPFS instability; Maui restarted
04/10/04 AVIDD-B Sporadic GPFS problems
03/27/04 ic66 node wouldn't reboot with rest of cluster
03/27/04 AVIDD-I GPFS hung, compute nodes rebooted
03/27/04 ih1 Server hung, rebooted
03/24/04 bc34 Myrinet card failure
03/24/04 if3 NMI crash and reboot
03/23/04 AVIDD-I GPFS hung, GPFS restarted
03/22/04 AVIDD-I GPFS hung, rebooted cluster
03/21/04 ic85 node hung, rebooted
03/18/04 AVIDD-I GPFS hung, rebooted cluster
03/16/04 if3 hardware problems, server dead
03/09/04 AVIDD-I GPFS hung, rebooted cluster
03/06/04 AVIDD-B GPFS hung, GPFS restarted
03/05/04 AVIDD-I GPFS hung, rebooted cluster
03/05/04 ic82,ic91 GPFS hung, rebooted systems
02/25/04 AVIDD-B and /N/B on AVIDD-I and AVIDD-I64 power failure on WCC power distribution unit #7
02/24/04 AVIDD-B and /N/B on AVIDD-I and AVIDD-I64 power failure on WCC power distribution unit #7
02/21/04 AVIDD-B half the Torque Moms died; rolled back to Storm version of Moms
02/20/04 ic38 hung and rebooted
02/19/04 bh1 crash due to disk I/O errors
02/18/04 bc38 crash due to NMI
02/17/04 bc48 crash due to NMI
02/17/04 bc50 dead CPU
02/16/04 bc14 GM/Myrinet/kernel problem
02/16/04 AVIDD-B jobs lost during upgrade to Torque resource manager
02/15/04 ic86 dead Myrinet [pci?|switch?]
02/15/04 AVIDD-I The -I cluster was rebooted to clear a GPFS hang
02/11/04 ic86 Node hung, rebooted
02/11/04 ic91 Node hung, rebooted
02/10/04 ic83 dead Myrinet [pci?|switch?]
02/09/04 AVIDD-B GPFS unavailable, ~12:30-2:30pm
02/09/04 AVIDD-I (ih1) out of memory, rebooted
02/08/04 bc83 dead Myrinet PCI card
02/08/04 ic70 dead Myrinet [pci?|switch?]
02/08/04 AVIDD-I (ih1) public network not accepting connections; eth2 restarted
02/06/04 AVIDD-I (ih1) out of memory, rebooted
01/30/04 ic65 hung, rebooted
01/30/04 ic78 hung, rebooted
01/27/04 ic55 hung, rebooted
01/27/04 bh1 Out of memory, rebooted
01/26/04 ic76 hung and rebooted
01/25/04 ic77 Dead Myrinet card
01/21/04 bc09 Dead Myrinet [card?|switch?]
01/18/04 bc37 Dead Myrinet PCI card
01/17/04 bh1 Ran out of memory and crashed
01/17/04 bc05 Dead Myrinet PCI card
01/16/04 bc58 Dead Myrinet switch port
01/15/04 bc48 Dead Myrinet switch port
01/15/04 bc47 Dead Myrinet switch port
01/14/04 AVIDD-I GPFS hung on all nodes at 11:30am forcing a reboot of many of the nodes.
01/12/04 bh1 Out of Memory state required a reboot
01/10/04 ic17 Dead Myrinet PCI card
01/10/04 ic86 Dead Myrinet PCI card
01/08/04 bh1 Out of Memory state required a reboot
01/02/04 bh1 more resource issues, periodic outages throughout the morning
01/01/04 bh1 became resource starved and required rebooting

2003

Date System Problem
12/25/03 ic40 Myrinet PCI card failure
12/22/03 bc18 Myrinet PCI card (laser) failure
12/19/03 bh1 Disk I/O error; system hung midnight-7:00 am EST.
12/18/03 AVIDD-B GPFS unavailable ~4:00pm until 5:45pm, cause not yet determined
12/12/03 AVIDD-I 7:21am All compute nodes shut down because of AC outage in Indianapolis machine room.
12/11/03 ic01 Dead Myrinet card, node removed from PBS queue.
12/09/03 bc02 Dead Myrinet card, node removed from PBS queue.
11/20/03 bc56 Dead Myrinet card, node removed from PBS queue.
11/20/03 bf2 Power Supply failure. (note: repaired 11/21/03)
11/14/03 ic57 Node shows only 512MB RAM, removed from PBS queue until diagnosis can be made.
11/11/03 AVIDD-B GPFS Filesystems (/N/gpfsb, /N/ivdgl) are unavailable (~10:15am EST). (Service restored ~11:45am)
11/10/03 bc20 Dead Myrinet adapter, node removed from PBS
11/09/03 AVIDD-B GPFS Filesystems (/N/gpfsb, /N/ivdgl) are unavailable. Resolved, ~1:00pm.
10/31/03 ic10 Myrinet card laser died: ordering replacement.
10/27/03 ic50 Myrinet card laser died; replaced but out of service until Myrinet remapped.
10/26/03 AVIDD-I PBS server died; restarted.
10/23/03 bc46 Node removed from PBS due to Myrinet card issue
10/21/03 bh1 rebooted at ~10:30am due to out of memory condition. Performance was erratic from ~10:00am until ~11:00am.
10/17/03 bc03, bc21, bc42 Myrinet cards dead
10/15/03 AVIDD-I PBS server died at 13:55, restarted.
10/09/03 AVIDD-I GPFS filesystem hung; multiple nodes rebooted.
10/02/03 ic26 Myrinet card dead, node will be offline until replaced
10/02/03 ic24 numerous zombie MOMs; rebooted
10/01/03 ic74 Myrinet card hung, rebooted
10/01/03 ic86,ic87 PBS MOMs dead, restarted
09/29/03 bc67 Dead LED; Myrinet PCI
09/28/03 EXP500#12 left Environmental Sensor Monitor alert; pulled/reseated
09/28/03 bc93 dead Myrinet switch port
09/28/03 bc91 kernel oops
09/27/03 if3 kernel oops
09/26/03 ic89 One cpu dead, node removed from service until it can be replaced.
09/25/03 ic93 Myrinet network unavailable, hardware working but reboot didn't clear
09/24/03 AVIDD-N restarted PBS, rebooted a09 due to system hang
09/23/03 bc82 kernel oops; no details available
09/23/03 ih1 rebooted to clear NFS problems
09/22/03 a05 Myrinet interface isn't responding, node has been removed from PBS queue (again...)
09/22/03 bh1 Unplanned reboot due to hung mmfs (GPFS) kernel module ~11:15am EST (this was during the maintenance window)
09/17/03 AVIDD-B, AVIDD-I Problems continue with NFS mounts on both clusters.
09/16/03 AVIDD-B, AVIDD-N Intermittent outages due to emergency SSH upgrade may have been experienced ~15:30 EST
09/16/03 ic93 Myrinet failure
09/16/03 ih1 System rebooted to clear hanging NFS mount (~10:00am EST)
09/15/03 if3 Crash with GPFS kernel oops; no service outage due to redundant storage servers.
09/14/03 bc18,bc71 Nodes removed from PBS queue due to dead Myrinet switch ports. Should be back online following the next maintenance window
09/13/03 bh1 Crash; no specific cause determined
09/10/03 bc40 ext3 assertion failure; crashed; power cycled
09/04/03 a05 Myrinet interface isn't responding, node has been removed from PBS queue
08/30/03 AVIDD-B WCC machine room power outage
08/30/03 bc50 Myrinet switch port failure
08/27/03 bc11 Myrinet dead, node removed from scheduler
08/27/03 ic20 Myrinet hiccup; node removed from scheduler
08/27/03 ic57 Myrinet hiccup; node removed from scheduler
08/27/03 if3 Memory problems, node down; call placed with IBM. This will cause GPFS performance degradation until fixed.
08/26/03 if3 kernel oops; mmfsd; crash/reboot
08/25/03 ic57 probable Myrinet failure; crash/reboot
08/24/03 bc47 out of memory; rebooted
08/23/03 bc49 GPFS dir hang; GPFS stopped/restarted
08/19/03 bc07 node hung; no response; answered pings; rebooted
08/19/03 bc83 commands hang on execution; rebooted
08/18/03 AVIDD-I, AVIDD-B scheduled maintenance activities
08/17/03 bf2 remapped Myrinet topology & returned to service
08/16/03 AVIDD-B WCC power outage
08/15/03 bf2 kernel oops; crash; Myrinet switch port failure
08/14/03 bc23 Myrinet interface isn't responding, node has been removed from PBS queue
08/14/03 a05 Myrinet interface isn't responding, node has been removed from PBS queue
08/09/03 bc36 Myrinet interface isn't responding, node has been removed from PBS queue
07/25/03 AVIDD-B cluster rebooted/unavailable from 01:00-05:00 EST for emergency GPFS maintenance
07/24/03 AVIDD-B, AVIDD-I clusters unavailable 08:00-17:00 EST for emergency kernel upgrade
07/23/03 bh1 system rebooted to clear hang on iVDGL disks
07/21/03 bc93 system reboots continuously; powered down until further diagnosis is available
07/17/03 AVIDD-N Network connectivity between IUB and IUN intermittent beginning at ~4:00pm, ended ~4:30pm
07/12/03 AVIDD-N Power outages in the IUN machine room caused cluster reboots at 1:00pm and 3:30pm.
07/10/03 bc92,bc94,ic93,ic94 Nodes removed from PBS and Force10 switch for research project
07/09/03 ic35 Myrinet switch chassis LED failure; will be recovered on next mapping
07/09/03 AFS Access to filesystem problem reported; to be diagnosed
07/03/03 AVIDD-B WCC machine room power outage; 2:15 EST to 18:00 EST
06/29/03 bc45 Myrinet; switch LED failure; removed from PBS
06/28/03 bh1 local filesystem induced kernel panic; undeletable file deleted
06/27/03 bh1 local filesystem induced kernel panic; system rebooted around 6am
06/26/03 bh1 Kernel crash; system rebooted around 7am
06/23/03 bc29 Kernel crash; dereference of NULL pointer; system rebooted
06/22/03 bc93 Bad diode on Myrinet adapter; node removed from PBS queue
06/21/03 bc71 Kernel oops; NMI received for unknown reason 35.
06/19/03 ic94 Myrinet dead; card lasing LED dead; being diagnosed
06/18/03 bc32 Power supply 1: DC output not in normal range; self corrected
06/18/03 bc83 Myrinet GM drivers hung; system rebooted
06/17/03 ic89 system crash; hung during reboot; being diagnosed
06/16/03 bc16 Myrinet nonfunctional; switch lasing LED dead; moved to new port
06/16/03 bc88 Myrinet nonfunctional; card lasing LED dead; returned to manufacturer
06/12/03 bc69 node returned to service; no problem found
06/12/03 bc40 node back online after maintenance
06/12/03 AVIDD-I (compute nodes) Some avidd-i.iupui.edu compute nodes were rebooted to kill a runaway process that hung GPFS.
06/10/03 bc40 node removed from PBS queue for maintenance (scheduled for 06/11/03)
06/09/03 ih1 rebooted to kill a runaway process.
06/09/03 AVIDD-I GPFS is hanging on some operations; sysadmins are investigating: Restored 2:55 pm
06/06/03 ic29 Myrinet connectivity lost
06/06/03 GPFS on I cluster Filesystem service restored
06/05/03 GPFS on I cluster Filesystem hangs when certain directories are accessed
06/05/03 bc85 Node returned to PBS; no problem found
06/05/03 bc40 Node back online with firmware update
06/05/03 bc69 Node rebooted ~6am due to Myrinet error
06/04/03 bc40 Node down; IBM engineers are performing maintenance
06/02/03 bc95,bc96,ic95,ic96 Nodes removed from PBS to test AFS
06/02/03 bc85 Node removed from PBS due to Myrinet adapter problems; currently being diagnosed
05/30/03 AVIDD-I NFS hang from AVIDD-B caused logins to fail this morning, NFS remount fixed the problem
05/29/03 ic66 Rebooted, node status is online (this is a change of status from 05/27/03)
05/27/03 AVIDD-I GPFS service restored ~10:30am
05/27/03 ic29 failed Myrinet adapter, node status is offline
05/27/03 ic66 failed motherboard, node status is offline
05/27/03 AVIDD-I GPFS status unchanged from 05/20/03, IBM is analyzing system logs in an effort to resolve the issue
05/26/03 AVIDD-I GPFS status unchanged from 05/20/03
05/25/03 AVIDD-I GPFS status unchanged from 05/20/03
05/24/03 AVIDD-I GPFS status unchanged from 05/20/03
05/23/03 bc40 rebooted ~11:50am, DASD errors
05/23/03 AVIDD-I GPFS status unchanged from 05/20/03
05/22/03 AVIDD-I NFS mount problems due to a routing issue on ih3; /N/B was unavailable from approximately 6pm-9pm
05/22/03 bc42 rebooted ~4:11pm, stack overflow errors, cause unknown
05/22/03 bc40 rebooted, reporting DASD errors; IBM has been contacted
05/22/03 AVIDD-I GPFS status unchanged from 05/20/03
05/21/03 AVIDD-I GPFS status unchanged from 05/20/03
05/20/03 AVIDD-I GPFS down due to residual problems from Monday's upgrade
05/20/03 AVIDD-N cluster rebooted at 10:38am and 11:37am due to power outage in IUN machine room.
05/14/03 bf4 Storage node hung, causing home directories on AVIDD-B to be unavailable; rebooted.
05/12/03 bc96 Motherboard failure
05/11/03 bc29 Myrinet adapter failure
05/07/03 bc08 Node hung, rebooted
04/30/03 AVIDD-B, -I Both clusters were unavailable to diagnose performance problems and evaluate the Force10 network equipment.
04/27/03 AVIDD-I GPFS was down; a complete cluster reboot was required