| Date |
System |
Problem |
| 12/19/04 |
bc60 |
rebooted to clear zombies |
| 12/13/04 |
/N/gpfsi |
GPFS outage due to SysAdmin error |
| 12/06/04 |
bc62 |
system hang; cycled power |
| 11/28/04 |
bc23,bc30 |
system hang; cycled power |
| 11/25/04 |
ih2 |
NFS induced kernel panic; power cycled |
| 11/22/04 |
ic65,ic79 |
system hang; cycled power |
| 11/20/04 |
ic16 |
system hang; cycled power |
| 11/19/04 |
bc59 |
crash w/ NMI; reseated Myrinet card and system rebooted OK |
| 11/18/04 |
ic89 |
GM kernel panic |
| 11/16/04 |
bh1 |
Out-of-Memory kernel killer brought the system down; rebooted |
| 11/12/04 |
bc87 |
system unresponsive as of ~2am, power-cycled at 12:00pm |
| 11/06/04 |
bf1 |
system crashed due to power-supply failure, replaced on 9-Nov. |
| 11/06/04 |
bh1 |
system hung, rebooted 10:30am, 11/7/04 |
| 10/26/04 |
bc96 |
power-cycled after node hung, ~12:15pm |
| 10/22/04 |
bc44 |
power-cycled after node hung, ~22:19 10/21 |
| 10/18/04 |
AVIDD-B, AVIDD-I |
restarted PBS and Maui ~5:00pm to clean out exiting jobs left over from the power outage at IUPUI |
| 10/18/04 |
AVIDD-I |
electrical failure resulted in sporadic failures of nodes ic39-ic76, most of the day |
| 10/18/04 |
AVIDD-I |
power-cycled ic62, ic63 after they became unresponsive (~9:30am) |
| 10/08/04 |
AVIDD-B |
restarted Maui on bh2, ~9:30am (had died at ~8:37) |
| 10/06/04 |
AVIDD-B |
nodes bc36,bc45 unresponsive, power cycled ~2:30pm |
| 09/30/04 |
bh2 |
kernel panic at ~9:40am, system restarted at ~10:40am |
| 08/31/04 |
bc83 |
dead Myrinet LED; switch side; removed from PBS |
| 08/17/04 |
bh2 |
hang on /N/B; other NFS filesystems OK; cycled power |
| 08/17/04 |
bc77 |
kernel oops related to the ext3 filesystem; cycled power |
| 08/17/04 |
bh2 |
hang on /N/B; other NFS filesystems OK; cycled power |
| 08/17/04 |
bh1 |
hang on /N/B; other NFS filesystems OK; cycled power |
| 08/16/04 |
bh1,bh2 |
hang on /N/B; other NFS filesystems OK; cycled power |
| 08/13/04 |
bc05,bc15,bc31 |
Nodes were unresponsive, causing Maui to hang; rebooted ~9:45am |
| 08/06/04 |
ic38 |
HT turned off |
| 08/06/04 |
ic89 |
one dead CPU; cycled power, now two |
| 08/06/04 |
ic66 |
one dead CPU |
| 07/09/04 |
ic44 |
PS#1 DC output not in normal range; unable to power on. |
| 06/09/04 |
AVIDD-B |
bh1 became unresponsive, probably due to GPFS issues; rebooted at ~4:50pm |
| 06/07/04 |
AVIDD-B |
bc88 repeatedly drops connections over the Myrinet adapter. Node has been removed from PBS |
| 06/06/04 |
AVIDD-I |
GPFS problems on several nodes caused by one node, ic67, whose GPFS was hung in a strange state. The problem cleared after ic67 was rebooted. |
| 06/02/04 |
AVIDD-I |
F10 switch problems causing NFS mounts and jobs to hang. Incident open with Force10. |
| 05/26/04 |
AVIDD-I |
Torque problems prevented jobs from running; restarting Torque caused the loss of some jobs. |
| 05/25/04 |
ih1 |
Out of memory error caused head node to hang, rebooted and restored to service |
| 05/18/04 |
tm1 |
Filesystem errors hung NFS server on the IA64 cluster |
| 05/11/04 |
ic41,ic42 |
Out-of-memory errors required the nodes to be rebooted |
| 04/27/04 |
im |
Problems with ethernet devices, might cause slow connections cluster wide on AVIDD-I |
| 04/23/04 |
ih1 |
GPFS hung on ih1 at 12:45, rebooted and back at ~1:20 |
| 04/22/04 |
AVIDD-B |
GPFS unavailable for ~6 hours; admins are still investigating |
| 04/11/04 |
AVIDD-B |
GPFS instability; Maui restarted |
| 04/10/04 |
AVIDD-B |
Sporadic GPFS problems |
| 03/27/04 |
ic66 |
node wouldn't reboot with rest of cluster |
| 03/27/04 |
AVIDD-I |
GPFS hung, compute nodes rebooted |
| 03/27/04 |
ih1 |
Server hung, rebooted |
| 03/24/04 |
bc34 |
Myrinet card failure |
| 03/24/04 |
if3 |
NMI crash and reboot |
| 03/23/04 |
AVIDD-I |
GPFS hung, GPFS restarted |
| 03/22/04 |
AVIDD-I |
GPFS hung, rebooted cluster |
| 03/21/04 |
ic85 |
node hung, rebooted |
| 03/18/04 |
AVIDD-I |
GPFS hung, rebooted cluster |
| 03/16/04 |
if3 |
hardware problems, server dead |
| 03/09/04 |
AVIDD-I |
GPFS hung, rebooted cluster |
| 03/06/04 |
AVIDD-B |
GPFS hung, GPFS restarted |
| 03/05/04 |
AVIDD-I |
GPFS hung, rebooted cluster |
| 03/05/04 |
ic82,ic91 |
GPFS hung, rebooted systems |
| 02/25/04 |
AVIDD-B and /N/B on AVIDD-I and AVIDD-I64 |
power failure on WCC power distribution unit #7 |
| 02/24/04 |
AVIDD-B and /N/B on AVIDD-I and AVIDD-I64 |
power failure on WCC power distribution unit #7 |
| 02/21/04 |
AVIDD-B |
half the Torque Moms died; rolled back to Storm version of Moms |
| 02/20/04 |
ic38 |
hung and rebooted |
| 02/19/04 |
bh1 |
crash due to disk I/O errors |
| 02/18/04 |
bc38 |
crash due to NMI |
| 02/17/04 |
bc48 |
crash due to NMI |
| 02/17/04 |
bc50 |
dead CPU |
| 02/16/04 |
bc14 |
GM/Myrinet/kernel problem |
| 02/16/04 |
avidd-b |
jobs lost during upgrade to Torque resource manager |
| 02/15/04 |
ic86 |
dead myrinet [pci?|switch?] |
| 02/15/04 |
AVIDD-I |
The -I cluster was rebooted to clear a GPFS hang |
| 02/11/04 |
ic86 |
Node hung, rebooted |
| 02/11/04 |
ic91 |
Node hung, rebooted |
| 02/10/04 |
ic83 |
dead myrinet [pci?|switch?] |
| 02/09/04 |
avidd-b |
GPFS unavailable, ~12:30-2:30pm |
| 02/09/04 |
avidd-i (ih1) |
out of memory, rebooted |
| 02/08/04 |
bc83 |
dead myrinet pci card |
| 02/08/04 |
ic70 |
dead myrinet [pci?|switch?] |
| 02/08/04 |
avidd-i (ih1) |
public network not accepting connections; eth2 restarted |
| 02/06/04 |
avidd-i (ih1) |
out of memory, rebooted |
| 01/30/04 |
ic65 |
hung, rebooted |
| 01/30/04 |
ic78 |
hung, rebooted |
| 01/27/04 |
ic55 |
hung, rebooted |
| 01/27/04 |
bh1 |
Out of memory, rebooted |
| 01/26/04 |
ic76 |
hung and rebooted |
| 01/25/04 |
ic77 |
Dead Myrinet card |
| 01/21/04 |
bc09 |
Dead Myrinet [card?|switch?] |
| 01/18/04 |
bc37 |
Dead Myrinet PCI card |
| 01/17/04 |
bh1 |
bh1 ran out of memory & crashed |
| 01/17/04 |
bc05 |
Dead Myrinet PCI card |
| 01/16/04 |
bc58 |
Dead Myrinet switch port |
| 01/15/04 |
bc48 |
Dead Myrinet switch port |
| 01/15/04 |
bc47 |
Dead Myrinet switch port |
| 01/14/04 |
AVIDD-I |
GPFS hung on all nodes at 11:30am, forcing a reboot of many of the nodes. |
| 01/12/04 |
bh1 |
Out of Memory state required a reboot |
| 01/10/04 |
ic17 |
Dead Myrinet PCI card |
| 01/10/04 |
ic86 |
Dead Myrinet PCI card |
| 01/08/04 |
bh1 |
Out of Memory state required a reboot |
| 01/02/04 |
bh1 |
more resource issues; periodic outages throughout the morning |
| 01/01/04 |
bh1 |
became resource starved and required rebooting |
| 12/25/03 |
ic40 |
Myrinet PCI card failure |
| 12/22/03 |
bc18 |
Myrinet PCI card (laser) failure |
| 12/19/03 |
bh1 |
Disk I/O error; system hung midnight-7:00 am EST. |
| 12/18/03 |
AVIDD-B |
GPFS unavailable ~4:00pm until 5:45pm, cause not yet determined |
| 12/12/03 |
AVIDD-I |
7:21am: all compute nodes shut down because of an AC outage in the Indianapolis machine room. |
| 12/11/03 |
ic01 |
Dead Myrinet card, node removed from PBS queue. |
| 12/09/03 |
bc02 |
Dead Myrinet card, node removed from PBS queue. |
| 11/20/03 |
bc56 |
Dead Myrinet card, node removed from PBS queue. |
| 11/20/03 |
bf2 |
Power Supply failure. (note: repaired 11/21/03) |
| 11/14/03 |
ic57 |
Node shows only 512MB RAM, removed from PBS queue until diagnosis can be made. |
| 11/11/03 |
AVIDD-B |
GPFS Filesystems (/N/gpfsb, /N/ivdgl) are unavailable (~10:15am EST). (Service restored ~11:45am) |
| 11/10/03 |
bc20 |
Dead myrinet adapter, node removed from PBS |
| 11/09/03 |
AVIDD-B |
GPFS Filesystems (/N/gpfsb, /N/ivdgl) are unavailable. Resolved, ~1:00pm. |
| 10/31/03 |
ic10 |
Myrinet card laser died: ordering replacement. |
| 10/27/03 |
ic50 |
Myrinet card laser died; replaced but out of service until Myrinet remapped. |
| 10/26/03 |
avidd-i |
PBS server died; restarted. |
| 10/23/03 |
bc46 |
Node removed from PBS due to Myrinet card issue |
| 10/21/03 |
bh1 |
rebooted at ~10:30am due to out of memory condition. Performance was erratic from ~10:00am until ~11:00am. |
| 10/17/03 |
bc03, bc21, bc42 |
Myrinet cards dead |
| 10/15/03 |
AVIDD-I |
PBS server died at 13:55, restarted. |
| 10/09/03 |
AVIDD-I |
GPFS filesystem hung; multiple nodes rebooted. |
| 10/02/03 |
ic26 |
Myrinet card dead, node will be offline until replaced |
| 10/02/03 |
ic24 |
numerous zombie MOMS; rebooted |
| 10/01/03 |
ic74 |
Myrinet card hung, rebooted |
| 10/01/03 |
ic86 and 87 |
PBS moms dead, restarted |
| 09/29/03 |
bc67 |
Dead LED; Myrinet PCI |
| 09/28/03 |
EXP500#12 |
left Environmental Sensor Monitor alert; pulled/reseated |
| 09/28/03 |
bc93 |
dead myrinet switch port |
| 09/28/03 |
bc91 |
kernel oops |
| 09/27/03 |
if3 |
kernel oops |
| 09/26/03 |
ic89 |
One cpu dead, node removed from service until it can be replaced. |
| 09/25/03 |
ic93 |
Myrinet network unavailable, hardware working but reboot didn't clear |
| 09/24/03 |
AVIDD-N |
restarted PBS, rebooted a09 due to system hang |
| 09/23/03 |
bc82 |
kernel oops; no details available |
| 09/23/03 |
ih1 |
rebooted to clear NFS problems |
| 09/22/03 |
a05 |
Myrinet interface isn't responding, node has been removed from PBS queue (again...) |
| 09/22/03 |
bh1 |
Unplanned reboot due to hung mmfs (GPFS) kernel module ~11:15am EST (this was during the maintenance window) |
| 09/17/03 |
AVIDD-B, AVIDD-I |
Problems continue with NFS mounts on both clusters. |
| 09/16/03 |
AVIDD-B, AVIDD-N |
Intermittent outages due to emergency SSH upgrade may have been experienced ~15:30 EST |
| 09/16/03 |
ic93 |
Myrinet failure |
| 09/16/03 |
ih1 |
System rebooted to clear hanging NFS mount (~10:00am EST) |
| 09/15/03 |
if3 |
Crash with gpfs kernel oops, no service outage due to redundant storage servers. |
| 09/14/03 |
bc18,bc71 |
Nodes removed from PBS queue due to dead Myrinet switch ports. Should be back online following the next maintenance window |
| 09/13/03 |
bh1 |
Crash; no specific cause determined |
| 09/10/03 |
bc40 |
ext3 assertion failure; crashed; power cycled |
| 09/04/03 |
a05 |
Myrinet interface isn't responding, node has been removed from PBS queue |
| 08/30/03 |
AVIDD-B |
WCC machine room power outage |
| 08/30/03 |
bc50 |
Myrinet switch port failure |
| 08/27/03 |
bc11 |
Myrinet dead, node removed from scheduler |
| 08/27/03 |
ic20 |
Myrinet hiccup removed node from scheduler |
| 08/27/03 |
ic57 |
Myrinet hiccup removed node from scheduler |
| 08/27/03 |
if3 |
Memory problems, node down, call placed with IBM. This will cause GPFS performance degradation until fixed. |
| 08/26/03 |
if3 |
kernel oops; mmfsd; crash/reboot |
| 08/25/03 |
ic57 |
probable Myrinet failure; crash/reboot |
| 08/24/03 |
bc47 |
out of memory; rebooted |
| 08/23/03 |
bc49 |
GPFS dir hang; GPFS stopped/restarted |
| 08/19/03 |
bc07 |
node hung; no response; answered pings; rebooted |
| 08/19/03 |
bc83 |
commands hang on execution; rebooted |
| 08/18/03 |
avidd-i avidd-b |
scheduled maintenance activities |
| 08/17/03 |
bf2 |
remapped Myrinet topology & returned to service |
| 08/16/03 |
avidd-b |
WCC power outage |
| 08/15/03 |
bf2 |
kernel oops; crash; Myrinet switch port failure |
| 08/14/03 |
bc23 |
Myrinet interface isn't responding, node has been removed from PBS queue |
| 08/14/03 |
a05 |
Myrinet interface isn't responding, node has been removed from PBS queue |
| 08/09/03 |
bc36 |
Myrinet interface isn't responding, node has been removed from PBS queue |
| 07/25/03 |
AVIDD-B |
cluster rebooted/unavailable from 01:00-05:00 EST for emergency GPFS maintenance |
| 07/24/03 |
AVIDD-B, AVIDD-I |
clusters unavailable 08:00-17:00 EST for emergency kernel upgrade |
| 07/23/03 |
bh1 |
system rebooted to clear hang on iVDGL disks |
| 07/21/03 |
bc93 |
system reboots continuously; powered down until further diagnosis is available |
| 07/17/03 |
AVIDD-N |
Network connectivity between IUB and IUN intermittent beginning at ~4:00pm, ended ~4:30pm |
| 07/12/03 |
AVIDD-N |
Power outages in the IUN machine room caused cluster reboots at 1:00pm and 3:30pm. |
| 07/10/03 |
bc92,bc94,ic93,ic94 |
Nodes removed from PBS and Force10 switch for research project |
| 07/09/03 |
ic35 |
Myrinet switch chassis LED failure; will be recovered on next mapping |
| 07/09/03 |
AFS |
Access to filesystem problem reported; to be diagnosed |
| 07/03/03 |
avidd-B |
WCC machine room power outage; 2:15 EST to 18:00 EST |
| 06/29/03 |
bc45 |
Myrinet; switch LED failure; removed from PBS |
| 06/28/03 |
bh1 |
local filesystem induced kernel panic; undeletable file deleted |
| 06/27/03 |
bh1 |
local filesystem induced kernel panic; system rebooted around 6am |
| 06/26/03 |
bh1 |
Kernel crash; system rebooted around 7am |
| 06/23/03 |
bc29 |
Kernel crash; dereference of NULL pointer; system rebooted |
| 06/22/03 |
bc93 |
Bad diode on Myrinet adapter; node removed from PBS queue |
| 06/21/03 |
bc71 |
Kernel oops; NMI received for unknown reason 35. |
| 06/19/03 |
ic94 |
myrinet dead; card lasing LED dead; being diagnosed |
| 06/18/03 |
bc32 |
Power supply 1: DC output not in normal range; self corrected |
| 06/18/03 |
bc83 |
myrinet GM drivers hung; system rebooted |
| 06/17/03 |
ic89 |
system crash; hung during reboot; being diagnosed |
| 06/16/03 |
bc16 |
myrinet nonfunctional; switch lasing LED dead; moved to new port |
| 06/16/03 |
bc88 |
myrinet nonfunctional; card lasing LED dead; returned to manufacturer |
| 06/12/03 |
bc69 |
node returned to service; no problem found |
| 06/12/03 |
bc40 |
node back online after maintenance |
| 06/12/03 |
avidd-i (compute nodes) |
Some avidd-i.iupui.edu compute nodes were rebooted to kill a runaway process that hung gpfs. |
| 06/10/03 |
bc40 |
node removed from PBS queue for maintenance (scheduled for 06/11/03) |
| 06/09/03 |
ih1 |
rebooted to kill a runaway process. |
| 06/09/03 |
AVIDD-I |
GPFS is hanging on some operations; sysadmins are investigating. Restored 2:55pm |
| 06/06/03 |
ic29 |
Myrinet connectivity lost |
| 06/06/03 |
GPFS on I cluster |
Filesystem service restored |
| 06/05/03 |
GPFS on I cluster |
Filesystem hangs when certain directories are accessed |
| 06/05/03 |
bc85 |
Node returned to PBS; no problem found |
| 06/05/03 |
bc40 |
Node back online with firmware update |
| 06/05/03 |
bc69 |
Node rebooted ~6am due to Myrinet error |
| 06/04/03 |
bc40 |
Node down; IBM engineers are performing maintenance |
| 06/02/03 |
bc95,bc96,ic95,ic96 |
Nodes removed from PBS to test AFS |
| 06/02/03 |
bc85 |
Node removed from PBS due to Myrinet adapter problems; currently being diagnosed |
| 05/30/03 |
AVIDD-I |
NFS hang from AVIDD-B caused logins to fail this morning; an NFS remount fixed the problem |
| 05/29/03 |
ic66 |
Rebooted, node status is online (this is a change of status from 05/27/03) |
| 05/27/03 |
AVIDD-I |
GPFS service restored ~10:30am |
| 05/27/03 |
ic29 |
failed Myrinet adapter, node status is offline |
| 05/27/03 |
ic66 |
failed motherboard, node status is offline |
| 05/27/03 |
AVIDD-I |
GPFS status unchanged from 05/20/03, IBM is analyzing system logs in an effort to resolve the issue |
| 05/26/03 |
AVIDD-I |
GPFS status unchanged from 05/20/03 |
| 05/25/03 |
AVIDD-I |
GPFS status unchanged from 05/20/03 |
| 05/24/03 |
AVIDD-I |
GPFS status unchanged from 05/20/03 |
| 05/23/03 |
bc40 |
rebooted ~11:50am, DASD errors |
| 05/23/03 |
AVIDD-I |
GPFS status unchanged from 05/20/03 |
| 05/22/03 |
AVIDD-I |
NFS mount problems due to a routing issue on ih3; /N/B was unavailable from approximately 6pm-9pm |
| 05/22/03 |
bc42 |
rebooted ~4:11pm, stack overflow errors, cause unknown |
| 05/22/03 |
bc40 |
rebooted, reporting DASD errors; IBM has been contacted |
| 05/22/03 |
AVIDD-I |
GPFS status unchanged from 05/20/03 |
| 05/21/03 |
AVIDD-I |
GPFS status unchanged from 05/20/03 |
| 05/20/03 |
AVIDD-I |
GPFS down due to residual problems from Monday's upgrade |
| 05/20/03 |
AVIDD-N |
cluster rebooted at 10:38am and 11:37am due to power outage in IUN machine room. |
| 05/14/03 |
bf4 |
Storage node hung, causing home directories on AVIDD-B to be unavailable; rebooted. |
| 05/12/03 |
bc96 |
Motherboard failure |
| 05/11/03 |
bc29 |
Myrinet adapter failure |
| 05/07/03 |
bc08 |
Node hung, rebooted |
| 04/30/03 |
AVIDD-B, -I |
Both clusters were unavailable to diagnose performance problems and evaluate the Force10 network equipment. |
| 04/27/03 |
AVIDD-I |
GPFS was down; a complete cluster reboot was required |