Monday, July 30, 2007

chkbootblk.sh

i wrote a small shell script as a proof of concept for checking for corrupted boot blocks on solaris.
here you go:
#!/bin/ksh

ARCH=`uname -i`
DEV=`df -k / | awk '{ print $1 }' | tail -1`
PREDEF="/tmp/chkbootblk_predef.tmp"
CURRENT="/tmp/chkbootblk_current.tmp"

# dump the first three od lines of the pristine boot block
dd if=/usr/platform/$ARCH/lib/fs/ufs/bootblk \
ibs=1b count=1 2>/dev/null | od -c | head -3 > $PREDEF

# dump the same range of the installed boot block (sector 1)
dd if=$DEV ibs=1b iseek=1 count=1 2>/dev/null | od -c | \
head -3 > $CURRENT

cmp -s $PREDEF $CURRENT
if [ $? -eq 0 ]
then
    echo "*********************************"
    echo "* BOOT-BLOCK SEEMS TO BE O.K.   *"
    echo "*********************************"
else
    echo "*********************************"
    echo "* WARNING: CORRUPTED BOOT-BLOCK *"
    echo "*********************************"
fi

rm $PREDEF $CURRENT
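a possible tightening of the script above (a hypothetical sketch, not tested on every platform): compare checksums of the first 48 bytes - the same range the `od -c | head -3` dump covers - instead of diffing temp files. the device detection is the same placeholder logic as in the script.

```shell
#!/bin/ksh
# hypothetical variant of chkbootblk.sh: checksum the first 48 bytes
# (the range covered by "od -c | head -3") instead of writing temp files.

ARCH=`uname -i`
DEV=`df -k / | awk '{ print $1 }' | tail -1`

# checksum of the pristine boot block shipped with the platform
REF=`dd if=/usr/platform/$ARCH/lib/fs/ufs/bootblk bs=48 count=1 2>/dev/null | cksum`

# checksum of the installed boot block: read sector 1 (iseek counts in
# ibs units, so ibs=1b iseek=1 skips the label sector), then keep 48 bytes
CUR=`dd if=$DEV ibs=1b iseek=1 count=1 2>/dev/null | dd bs=48 count=1 2>/dev/null | cksum`

if [ "$REF" = "$CUR" ]
then echo "boot-block matches pristine copy"
else echo "WARNING: boot-block differs from pristine copy"
fi
```

note that iseek is given in ibs units - with bs=48 alone the seek would skip only 48 bytes instead of a full sector, hence the two-stage dd pipeline.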

Wednesday, July 25, 2007

short resume on ZFS

while the guys at the solaris code camp glorified ZFS and its possibilities, i wanted to take a closer look.
since a filesystem/volume manager should be production-ready, it should at least handle hotplugged devices - a failed drive in this case - correctly.
so i set up a sun enterprise 420r with a d130 storage array attached to it.
`zpool create tank raidz c1t10d0 c1t11d0 c1t12d0` (don't bother about the 3 disk raidz setup - thanks ;-)
then i physically detached one disk of the array. `zpool status` hung without giving any output, and it was not even possible to log in via ssh anymore.
this is a known issue by the ZFS community:
OpenSolaris Case
and that is just one problem of the current ZFS implementation - we still have to wait for proper FMA integration.
ongoing discussions can be found at eschrock's blog:
http://blogs.sun.com/eschrock/

Tuesday, July 24, 2007

check for corrupted boot-blocks on solaris

i just wanted to check whether a machine's boot block is corrupted.
i came up with checking the "pre-defined" block at /usr/platform/..../bootblk against the installed one. the following procedure works on veritas volumes as well as metadevices and raw slices. you just have to remember the offset of one block (iseek=1) on the "real" disks, since sector 0 holds the disk label and the boot block starts at sector 1:

# dd if=/usr/platform/SUNW,Sun-Fire-15000/lib/fs/ufs/bootblk ibs=1b count=1 | od -c | head -3
1+0 records in
1+0 records out
0000000 375 003 J 331 \0 \0 027 0 314 022 \t / p a c k
0000020 a g e s 002 004 4 024 \0 034 022 024 C a n '
0000040 t f i n d / p a c k a g e s

# dd if=/dev/md/rdsk/d0 ibs=1b iseek=1 count=1 | od -c | head -3
1+0 records in
1+0 records out
0000000 375 003 J 331 \0 \0 027 0 314 022 \t / p a c k
0000020 a g e s 002 004 4 024 \0 034 022 024 C a n '
0000040 t f i n d / p a c k a g e s

# dd if=/dev/rdsk/c1t1d0s0 ibs=1b iseek=1 count=1 | od -c | head -3
1+0 records in
1+0 records out
0000000 375 003 J 331 \0 \0 027 0 314 022 \t / p a c k
0000020 a g e s 002 004 4 024 \0 034 022 024 C a n '
0000040 t f i n d / p a c k a g e s


if there's no difference at all, the boot-block should be ok.
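as a quick-and-dirty alternative (a hypothetical sketch, not part of the original procedure): the dumps above all contain the text "/packages" from the bootblk's Forth code, so one could simply look for that signature in the installed boot block. the device path below is just a placeholder.

```shell
#!/bin/sh
# hypothetical sanity check: the od dumps above show the ufs bootblk
# contains the string "/packages", so search the installed boot block
# for it. DEV is a placeholder - substitute your root device.
DEV=/dev/rdsk/c1t1d0s0

# read sector 1, squash non-printable bytes, and grep for the signature
if dd if=$DEV ibs=1b iseek=1 count=1 2>/dev/null | \
   tr -c '[:print:]' ' ' | grep packages >/dev/null
then echo "boot-block signature found"
else echo "WARNING: boot-block signature missing"
fi
```

this only proves a boot block is present, not that it is intact - the full byte comparison above remains the real check.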

SF15k feature

a 15k domain came down today.
the first problem was that the domain powered off automatically every few minutes, even at the ok prompt. the second was bad boot blocks on all five disks.
so we discovered a feature which has been official since 2004:
a 15k domain automatically gets reset if the obp is not able to detect a valid boot block (a corrupted one counts as undetected). this took several hours to figure out and cost time throughout the SL ;-)
...a feature, not a bug!
SunSolve #76798