[clug] How to make my server robust for booting

Chris Smart clug at csmart.io
Wed Sep 11 01:01:25 UTC 2019


On Wed, 11 Sep 2019, at 08:49, Tony Lewis via linux wrote:
> All,
> 
> I'm rebuilding my home server, and want to make it 'robust' for boot 
> purposes.  For example, if a disk fails, the system can continue to 
> function until I replace it.
> 
> I am testing my ideas in VirtualBox at the moment.
> 
> The stuff I think I've sorted are:
> 
>   * encrypted RAID1 for /
>   * RAID1 for /boot
> 
> What I'm stuck on is how to handle a failure of the drive where GRUB is 
> installed.  I thought it might be as simple as doing grub-install 
> /dev/sd[bcd] (as well as on /dev/sda) and BIOS would just find *a* copy 
> of GRUB and be able to continue the boot process.
> 

Yep, pretty sure I've done that successfully before. I'm not sure if I used grub-install or just dd'd the first 446 bytes from one disk to the other; it's been a while.
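
For reference, a rough sketch of the two options. The helper name copy_boot_code and the device names are assumptions for illustration, not something from my actual setup. It copies only the 446 bytes of boot code, so the partition table of the target disk (bytes 446-511) is left alone. Bear in mind that dd only covers the boot code in the MBR, while grub-install also embeds GRUB's core image after the MBR, so running grub-install against each disk is probably the safer route.

```shell
# Hypothetical helper: copy only the MBR boot code (first 446 bytes)
# from one disk or image to another, leaving the partition table intact.
copy_boot_code() {
    src=$1
    dst=$2
    # conv=notrunc matters when dst is a regular image file:
    # without it dd would truncate the file to 446 bytes.
    dd if="$src" of="$dst" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Example usage (device names are assumptions; run as root):
#   copy_boot_code /dev/sda /dev/sdb
# Or install GRUB onto every RAID member instead:
#   for d in /dev/sda /dev/sdb; do grub-install "$d"; done
```

Trying it on disk image files first is a cheap way to check the byte ranges before pointing it at real devices.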

Can you use hexdump to check if Grub is at least embedded in the second drive?
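
Something along these lines would do as a quick check. The helper name has_grub_boot_code is made up for this example, and it assumes GRUB's boot code carries the literal string "GRUB" in the first sector (it does in the versions I have looked at, but treat that as an assumption). It also checks for the 55 aa boot signature at the end of the sector.

```shell
# Hypothetical helper: succeeds if the first sector of a disk or image
# contains the "GRUB" marker string and ends with the 55 aa boot signature.
has_grub_boot_code() {
    dev=$1
    # Look for the "GRUB" string inside the first 512 bytes.
    dd if="$dev" bs=512 count=1 2>/dev/null | grep -a -q 'GRUB' || return 1
    # Bytes 510-511 of a bootable MBR are 0x55 0xaa.
    sig=$(dd if="$dev" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Example usage (device name is an assumption; run as root):
#   has_grub_boot_code /dev/sdb && echo "GRUB boot code found"
# To eyeball the same bytes by hand:
#   hexdump -C -n 512 /dev/sdb
```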

What does your Grub device.map look like? Does it only have hd0?

> It's not working in VirtualBox at least.  If I let it boot unaided, it 
> cannot find a bootable medium (expected behaviour). If I interrupt that 
> with F12 and choose the second hard drive to boot from, it locks up.  It 
> might be a VirtualBox thing, and so the physical server would be OK.  Or 
> more likely I don't understand what I'm doing.
> 

What if you disconnect the first drive, leaving only the second (which obviously becomes the first), does that work? It could be a VirtualBox bug; can you try it on a KVM host somewhere?

> What's the best way to architect things so that a failed hard drive 
> where GRUB is installed, is easily handled?
> 

For what it's worth, I just tested it in KVM and CentOS 7 just does the right thing out of the box without anything fancy; literally just set RAID1 for /boot and encrypted RAID1 for / in the installer and it Just Works(tm).

There's a timeout when looking for the missing device (90 seconds or something); if it takes too long you could try a kernel arg like 'rd.retry=30'.
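
A minimal sketch of making that argument persistent, assuming the usual CentOS 7 layout where kernel arguments live in GRUB_CMDLINE_LINUX in /etc/default/grub. The helper name add_kernel_arg is hypothetical, and it relies on GNU sed for the -i flag.

```shell
# Hypothetical helper: append a kernel argument to GRUB_CMDLINE_LINUX
# in a grub defaults file, unless the argument is already in the file.
add_kernel_arg() {
    file=$1
    arg=$2
    # Do nothing if the argument already appears anywhere in the file.
    grep -qF -- "$arg" "$file" && return 0
    # Insert the argument just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX=\".*\)\"\$|\1 $arg\"|" "$file"
}

# Example usage (paths assume CentOS 7 with BIOS boot; run as root):
#   add_kernel_arg /etc/default/grub rd.retry=30
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

For a one-off test you can also just press 'e' at the GRUB menu and add the argument to the end of the linux line.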

Another solution is probably a hardware RAID card, letting it take care of it all for you.... ;-)

-c


