[clug] How to make my server robust for booting

steve jenkin sjenkin at canb.auug.org.au
Fri Sep 13 06:06:39 UTC 2019



> On 12 Sep 2019, at 10:42, Tony Lewis via linux <linux at lists.samba.org> wrote:
> 
> Thanks for the link.  From that, it recommends making sure root is not hardcoded as /dev/hd0, which it isn't; it uses /dev/mapper/md1_crypt.
> 
> So it looks like it should work in the real world.  I'll try it when I get that far.


Tony,

Did I miss that on the 1st pass - that your boot partition (md1) is encrypted?

I’ve never played with encrypted filesystems, but they all share a common boot problem - feeding in the password(s) to unlock the keys whenever they (cold) boot.

Does GRUB (are you using v1 or v2?) support encrypted boot drives without manual intervention?
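
My understanding - hedged, I haven’t run an encrypted /boot myself - is that GRUB2 can read a LUKS-encrypted /boot if cryptodisk support is switched on, but it still prompts for a passphrase at the console, so it isn’t completely hands-off. A rough sketch on a Debian-style system (device names are illustrative):

    # /etc/default/grub - tell GRUB2 to unlock LUKS volumes at boot
    GRUB_ENABLE_CRYPTODISK=y

    # regenerate the config and reinstall GRUB on each disk in the mirror
    update-grub
    grub-install /dev/sda
    grub-install /dev/sdb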

I’ve seen boot problems on other POSIX solutions with RAID 1 / mirroring after a disk failure:

 - the bootloader comes up, finds the remaining disk, loads the RAID software and then ‘fails to proceed'
 - mirroring driver refuses to boot, because it doesn’t have a ‘quorum’, defined as (N/2) + 1

When N=2, the quorum is 2 :( Not real helpful when booting, but the system survives if it’s already running, and if you’re attentive and have hot-swap disks you can recover on-the-fly: no data loss, no service interruption, no unexpected boot problems later.
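
That on-the-fly recovery is straightforward with Linux md, for what it’s worth - a rough sketch, assuming a two-disk mirror /dev/md1 with /dev/sdb1 as the replacement partition:

    cat /proc/mdstat                             # [U_] means one half of the mirror is missing
    mdadm --detail /dev/md1                      # shows which member failed
    mdadm --manage /dev/md1 --remove /dev/sdb1   # drop the dead member if still listed
    mdadm --manage /dev/md1 --add /dev/sdb1      # add the replacement; resync starts automatically
    watch cat /proc/mdstat                       # follow the rebuild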

If you have multiple drives and are using a small boot partition, you may be able to mirror the boot partition across 3 or more drives to avoid quorum problems.
But first check whether that’s actually the problem. [should be quick to test in VirtualBox]
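
A sketch of that idea with Linux mdadm (device names are hypothetical, and it assumes GRUB as the bootloader):

    # three-way RAID-1 for a small /boot; metadata 1.0 keeps the superblock at the
    # end of the partition so the bootloader just sees a plain filesystem
    mdadm --create /dev/md0 --level=1 --raid-devices=3 --metadata=1.0 \
          /dev/sda1 /dev/sdb1 /dev/sdc1
    mkfs.ext4 /dev/md0
    # put the bootloader on every member so any surviving disk can boot
    grub-install /dev/sda
    grub-install /dev/sdb
    grub-install /dev/sdc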

In answer to the person who said, “Use a RAID Card”, I’d strongly advise against that.

At one site they had a Dell server with a PERC disk controller. The previous IT group had saved a little money by not buying hot-swap drives, opting instead for a ‘remove the server from the rack and pull it right down’ disk cage. Maybe 8 drives in all.

The drives were split into two sets:
 - one for the Unix file system (RAID-5 of 3 drives, IIRC)
 - one for user files on the remaining 5 drives, maybe RAID-5, maybe RAID-6, exported via the Novell file protocol (not SMB) to all user PCs
		(resulting in an on-going file share/lock problem with Attache Accounting [v7?].
		 It seemed to be an old version, with no support contract for it either.)

This fault happened on the evening of the last day of the server support contract.
The organisation seemed to have decided its main server, which included all shared storage, didn’t need hardware support.
They ran SuSE SLES & Novell tools, & heavily relied on the Novell GUI to manage users & access to areas of the shared file storage.

The organisation didn’t consider it necessary to have a SLES/Novell support contract - the admins there had no expertise in it, and I hadn’t been trained in it or asked about it during hiring. Nor was there any local SLES operations/admin documentation, beyond possibly some on-line manuals.
[Nor did they have support contracts for other business-critical software, always running ancient versions]

For some reason, I was rebooting all servers and had done so before. It wasn’t a high-risk or high-impact task, quite routine.
[I’d checked batteries in all UPS’s, all were seriously degraded and needed replacement. Could’ve been that.]

About 9-10PM, I did a normal shutdown/restart of the main server, and it hung - the RAID controller wasn’t recognising any disks.
Left it a while, never came good.

Did a ‘cold boot’ - power reset - and the RAID card had cleared its fault. Only it didn’t remember that one drive in the RAID-5 group holding the Unix system had been declared faulty and removed from the array months before, well before I arrived. That drive still had ‘readable’ filesystem blocks, though quite ancient ones.

The hardware RAID controller presented a corrupted volume (a mix of old and new data) and hence we had a corrupted file-system.
I don’t know how it even worked, because you’d _think_ that, as every block had a checksum, the reads should’ve all failed.
But maybe the PERC controller only checked checksums when there was a CRC hard-error or a soft read error.

The organisation never checked logs, nobody cared about auto-detecting and alerting on soft- or hard-failures, and I was the deer in the headlights when the inevitable happened.
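
None of that would have saved a hardware PERC volume, but for Linux software RAID that sort of alerting costs almost nothing to set up - a minimal sketch (the mail address is made up, and most distros already ship these as services):

    # /etc/mdadm/mdadm.conf - mail on array events (Fail, DegradedArray, ...)
    MAILADDR admin@example.org

    # run the monitor by hand if the distro does not already
    mdadm --monitor --scan --daemonise

    # /etc/smartd.conf - watch all drives, mail on SMART trouble
    DEVICESCAN -a -m admin@example.org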

That was a Monday night.
Dell very generously agreed to ship a replacement drive Tuesday morning; it arrived Tuesday afternoon & was installed before CoB.
On Tuesday afternoon, a SLES trained admin was found and brought in. [they didn’t sign a support contract with him after this]
Sometime on Wednesday the system was reinstalled and configs restored (or recreated, unsure).

On Thursday, staff could get email, share files, do accounting/ payroll and work as usual.

The whole Org was down for Tuesday, had some limited services on Wed (no email), and was back up properly on Thu.

Pretty expensive jaunt down ‘penny pinching avenue’ I thought.

I’ve heard some shockers about other systems with hardware RAID where the _only_ copy of the config was buried in the card.

Techs could disassemble the cards, move the EPROM to a new, identical card, and ‘restore service’. It took days, was fraught, and with old hardware there were no new spares available - they had to find an identical card discarded from elsewhere.

So, I’m not a fan of hardware RAID cards,
but used conservatively, they might be a very good solution for some applications.

regards
steve

--
Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin


