[clug] Is a weekly RAID scrub too much?

Paul Wayper paulway at mabula.net
Sun Feb 26 11:16:24 UTC 2017


On 26/02/17 17:32, Eyal Lebedinsky wrote:
> On 26/02/17 17:15, Paul Wayper wrote:
>> On 24/02/17 22:35, Eyal Lebedinsky wrote:
[...]
>>> Is the scrub doing more harm than good by shortening the service life of the
>>> disk?
>>
>> So, what do you mean by "weekly RAID scrub"?
>>
>> I've never heard of anyone doing this on a RAID array.
>>
>> I've used 'scrub' on a disk or files to overwrite them with random data to
>> avoid anyone using a scanning electron microscope on them to view your data
>> again.  With disk geometries, cylinder tolerances, and encoding mechanisms
>> these days, even overwriting with zeros will render the data inaccessible to
>> almost all attackers.
> 
> I use 'shred' for this.
> 
>> But I've never heard of 'RAID scrubbing'.  What are you doing, Eyal?
> 
> RAID scrub refers to, in Linux software RAID, a 'check' request: the full
> array is read and verified. For RAID6 it means P and Q are verified.
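
Ah, right - so it's the md 'check' machinery.  For anyone else playing
along, kicking one off by hand looks something like this (assuming an
array at /dev/md0 - adjust for yours):

    # start a read-only verify pass over /dev/md0
    echo check > /sys/block/md0/md/sync_action

    # watch progress
    cat /proc/mdstat

    # mismatches found so far (counted in sectors)
    cat /sys/block/md0/md/mismatch_cnt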

I looked up the ever-helpful Arch Linux pages on RAID and LVM:

https://wiki.archlinux.org/index.php/Software_RAID_and_LVM#Scrubbing

That took me to Wikipedia's page on "Data Scrubbing":

https://en.wikipedia.org/wiki/Data_scrubbing

Which contained lots of hyperbolic, panic-inducing language like:

"""Data integrity is a high-priority concern in writing, reading, storage,
transmission, or processing of the computer data in computer operating systems
and in computer storage and data transmission systems. However, only a few of
the currently existing and used file systems provide sufficient protection
against data corruption."""

To me this is totally overblown.  In all the many years I've been looking at
files, I don't think I've ever seen a file corrupted on a healthy disk.  I've
seen one IBM XT with dodgy memory that started corrupting everything that was
written to disk - and funnily enough running full disk scans made the problem
worse.  In the case of RAID 5 and 6 you've already got full redundancy should
one piece go missing - you can literally lose an entire disk and keep going in
degraded mode, so why do you care if a block goes bad?

My favourite bit was:

"""Due to the high integration density of contemporary computer memory chips,
the individual memory cell structures became small enough to be vulnerable to
cosmic rays and/or alpha particle emission. The errors caused by these
phenomena are called soft errors. This can be a problem for DRAM- and
SRAM-based memories."""

So the solution to the possibility of one of the sectors going silently bad -
despite all the systems in the drive itself to detect and correct these errors
- is to read it into memory which can also be faulty?  Tell me again how this
is going to help? :-)

Joking aside, I think maybe a scrub once a year might be a possibility if
you're dealing with hardware you don't entirely trust.  But I'd keep it to a
minimum because, as you observe, drives do wear out with overuse, and full
disk scans will do that.
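
One thing worth checking before settling on a schedule: some distros
already scrub for you - from memory, Debian's mdadm package ships a
checkarray cron job and Fedora/RHEL a raid-check one - so "once a year"
probably starts with finding and turning those off.  Roughly:

    # see whether the distro has already scheduled a scrub
    # (paths are a guess - adjust for your system)
    grep -rl -e checkarray -e raid-check \
        /etc/cron.d /etc/cron.weekly /etc/cron.monthly 2>/dev/null

    # and a check that's already running can be cancelled with
    echo idle > /sys/block/md0/md/sync_action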

It's actually one of the common failure modes in RAID arrays - you have
several near-failing drives, but they're fine under the normal workload.  Then
one drive fails and you replace it, but now the RAID rebuild has to read every
sector of every remaining disk, so the other drives that were near their
failure point get pushed over the limit.

But by "common" here I mean "this almost never happens".  I've heard
anecdotes, and I'm hoping some people on the list can share hilarious stories
about drive and array failures :-)

So, anyone running BTRFS?  Can you grep your logs for "csum failed ino" and
tell us all how often you see that error?  BTRFS stores checksums for both
metadata and data (XFS checksums only its metadata), verifies them on read,
and falls back to a redundant copy, where one exists, when a checksum doesn't
match.  If we could get some kind of statistical analysis of the number of
reads and writes performed versus the number of actual checksum failures, we'd
get an idea of how likely they are in the real world.
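
For the curious, something along these lines should get the raw numbers
(journalctl on a systemd box, otherwise grep whatever file your syslog
writes to; /mnt is just a placeholder mount point):

    # count btrfs checksum failures recorded in the kernel log
    journalctl -k | grep -c 'csum failed'

    # kick off a scrub, then read back the per-device error counters
    btrfs scrub start /mnt
    btrfs scrub status /mnt
    btrfs device stats /mnt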

Have fun,

Paul


