[clug] Is a weekly RAID scrub too much?

Eyal Lebedinsky eyal at eyal.emu.id.au
Sun Feb 26 12:11:30 UTC 2017


On 26/02/17 22:16, Paul Wayper wrote:
> On 26/02/17 17:32, Eyal Lebedinsky wrote:
>> On 26/02/17 17:15, Paul Wayper wrote:
>>> On 24/02/17 22:35, Eyal Lebedinsky wrote:
> [...]
>>>> Is the scrub doing more harm than good by shortening the service life of the
>>>> disk?
>>>
>>> So, what do you mean by "weekly RAID scrub"?
>>>
>>> I've never heard of anyone doing this on a RAID array.
>>>
>>> I've used 'scrub' on a disk or files to overwrite them with random data to
>>> avoid anyone using a scanning electron microscope on them to view your data
>>> again.  With disk geometries, cylinder tolerances, and encoding mechanisms
>>> these days, even overwriting with zeros will render the data inaccessible to
>>> almost all attackers.
>>
>> I use 'shred' for this.
>>
>>> But I've never heard of 'RAID scrubbing'.  What are you doing, Eyal?
>>
>> RAID scrub refers to, in linux software RAID, a 'check' request. It means that
>> the full RAID is read and verified. For RAID6 it means P and Q are verified.
>
> I looked up the ever-helpful Arch Linux pages on RAID and LVM:
>
> https://wiki.archlinux.org/index.php/Software_RAID_and_LVM#Scrubbing
>
> That took me to Wikipedia's page on "Data Scrubbing":
>
> https://en.wikipedia.org/wiki/Data_scrubbing
>
> Which contained lots of hyperbolic, panic-inducing language like:
>
> """Data integrity is a high-priority concern in writing, reading, storage,
> transmission, or processing of the computer data in computer operating systems
> and in computer storage and data transmission systems. However, only a few of
> the currently existing and used file systems provide sufficient protection
> against data corruption."""
>
> To me this is totally overblown.  In all the many years I've been looking at
> files, I don't think I've ever seen a file corrupted on a healthy disk.  I've

I have had the scrub discover problems a few times. I do not know the reason for
those failures: no disk error was logged, and I was not aware of any event that
could have caused them.

However, these problems pointed to files that were read without complaint but
whose delivered data was likely bad.

[aside]
We do not have a "safe read" option for raid6, which would verify data as it is
read and correct it when necessary (and possible). Furthermore, while raid6 can
do a 'repair', which is like a 'check' except that on detecting an error it
recalculates the syndromes, this repair is not ideal: raid6 actually carries
enough redundancy to identify *which* sector is bad and rewrite just that
sector, rather than assuming the data is good and the syndromes are bad.
Note: if a disk returns an actual i/o error, the kernel does repair the failed
sectors on the fly.

In short, when a 'check' error is detected I identify the affected files and
deal with their content, restoring from backup if necessary.
[/aside]
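For concreteness, on Linux md the scrub is driven through sysfs. A minimal
sketch (the array name md0 is an assumption; adjust for your setup, and note
these writes need root):

```shell
# Hedged sketch: start a scrub ('check') on an md array and look at the
# mismatch counter. Assumes the array is /dev/md0; does nothing if it isn't.
MD=/sys/block/md0/md
if [ -d "$MD" ]; then
    echo check > "$MD/sync_action"     # read and verify the whole array
    cat "$MD/mismatch_cnt"             # mismatches found by the last check
    # echo repair > "$MD/sync_action"  # recalculate and rewrite the syndromes
fi
```

Progress of a running check shows up in /proc/mdstat.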

> seen one IBM XT with dodgy memory that started corrupting everything that was
> written to disk - and funnily enough running full disk scans made the problem
> worse.  In the case of RAID 5 and 6 you've already got full redundancy should
> one piece go missing - you can literally lose an entire disk and keep going in
> degraded mode, so why do you care if a block goes bad?
>
> My favourite bit was:
>
> """Due to the high integration density of contemporary computer memory chips,
> the individual memory cell structures became small enough to be vulnerable to
> cosmic rays and/or alpha particle emission. The errors caused by these
> phenomena are called soft errors. This can be a problem for DRAM- and
> SRAM-based memories."""
>
> So the solution to the possibility of one of the sectors going silently bad -
> despite all the systems in the drive itself to detect and correct these errors
> - is to read it into memory which can also be faulty?  Tell me again how this
> is going to help? :-)
>
> Joking aside, I think maybe a scrub once a year might be a possibility if
> you're dealing with hardware you don't entirely trust.  But I'd keep it to a
> minimum because, as you observe, drives do wear out with overuse, and full
> disk scans will do that.

I doubt a yearly scrub will ever come out clean :-(
But I probably need to reduce the frequency to monthly.
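For reference, a monthly scrub could be scheduled with a system crontab entry
along these lines (the array name md0 and the timing are assumptions; some
distributions ship their own raid-check script for this):

```shell
# /etc/crontab fragment: scrub md0 at 01:00 on the 1st of each month
# (assumed array name; requires root, hence the user field).
0 1 1 * *  root  echo check > /sys/block/md0/md/sync_action
```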

Question: how often do other people run a scrub?

> It's actually one of the common failure modes in RAID arrays - you have
> several near-failing drives but they're fine with the normal workload.  One
> drive fails and you replace it, but now the RAID rebuild has to scan every
> single disk, so all the other drives near fail point are pushed over the limit.

True, but while seeing a second disk fail during a rebuild is scary, it is not
fatal. In my experience disks fail slowly, a few bad sectors at a time, and I
can deal with a few bad sectors while replacing a failing disk.

[aside]
I have not yet had a catastrophic failure in 12 years of running software
raid, but the last (current) set of disks is the worst. These are WD Blacks;
out of the 7 disks, six were replaced (some more than once) and the seventh has
had "events" suggesting it is next. This is after 3.5 years of the planned
5-year life of this array.

BTW, these disks do not list a workload limit, and I assume they are designed
for continuous use with a high load.
[/aside]

> But by "common" here I mean "this almost never happens".  I've heard
> anecdotes, and I'm hoping some people on the list can share hilarious stories
> about drive and array failures :-)
>
> So, anyone running BTRFS?  Can you grep your logs for "csum failed ino" and
> tell us all how often you see that error?  BTRFS stores checksums on metadata
> and data (XFS stores checksums only on metadata), and verifies the checksum on
> read and re-reads data if it's not correct.  If we could get some kind of
> statistical analysis of the number of reads and writes performed vs the number
> of actual checksum failures, we'd get an idea of how likely they are in the
> real world.
>
> Have fun,
>
> Paul
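For anyone who wants to answer Paul's question: a quick way to count those
btrfs checksum errors is to grep the kernel log. The sketch below runs against
a sample line so it works anywhere; on a real box you would pipe dmesg (or the
journal) into the same grep.

```shell
# Count btrfs checksum failures. Shown against a sample log line
# (an assumption, for illustration) so the pipeline is visible without root.
sample='BTRFS warning (device sda1): csum failed ino 257 off 0
some unrelated kernel message'
count=$(printf '%s\n' "$sample" | grep -c 'csum failed ino')
echo "$count"
# On a real system: dmesg | grep -c 'csum failed ino'
```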

cheers

-- 
Eyal Lebedinsky (eyal at eyal.emu.id.au)


