[clug] Is a weekly RAID scrub too much?
eyal at eyal.emu.id.au
Sun Feb 26 12:11:30 UTC 2017
On 26/02/17 22:16, Paul Wayper wrote:
> On 26/02/17 17:32, Eyal Lebedinsky wrote:
>> On 26/02/17 17:15, Paul Wayper wrote:
>>> On 24/02/17 22:35, Eyal Lebedinsky wrote:
>>>> Is the scrub doing more harm than good by shortening the service life of the
>>> So, what do you mean by "weekly RAID scrub"?
>>> I've never heard of anyone doing this on a RAID array.
>>> I've used 'scrub' on a disk or files to overwrite them with random data to
>>> avoid anyone using a scanning electron microscope on them to view your data
>>> again. With disk geometries, cylinder tolerances, and encoding mechanisms
>>> these days, even overwriting with zeros will render the data inaccessible to
>>> almost all attackers.
>> I use 'shred' for this.
>>> But I've never heard of 'RAID scrubbing'. What are you doing, Eyal?
>> RAID scrub refers to, in linux software RAID, a 'check' request. It means that
>> the full RAID is read and verified. For RAID6 it means P and Q are verified.
> I looked up the ever-helpful Arch Linux pages on RAID and LVM:
> That took me to Wikipedia's page on "Data Scrubbing":
> Which contained lots of hyperbolic, panic-inducing language like:
> """Data integrity is a high-priority concern in writing, reading, storage,
> transmission, or processing of the computer data in computer operating systems
> and in computer storage and data transmission systems. However, only a few of
> the currently existing and used file systems provide sufficient protection
> against data corruption."""
> To me this is totally overblown. In all the many years I've been looking at
> files, I don't think I've ever seen a file corrupted on a healthy disk. I've
I had the scrub discover problems a few times. I do not know the reason for those
failures: there was no disk error logged and I was not aware of any event that
could have caused them.
However, these problems pointed to files that were read back happily even though
the data delivered was likely bad.
We do not have a "safe read" option for raid6, one that would check data as it
is read and correct it if necessary (and possible). raid6 does offer a 'repair'
action, which is like a 'check' except that on detecting an error it recalculates
the syndromes. That repair is not so good: it assumes the data is good and the
syndromes are bad, when raid6 actually has enough information to work out *which*
sector is bad and rewrite just that sector. Note: if a disk returns an actual I/O
error then the kernel does repair the failed sectors on the fly.
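To illustrate what I mean, here is a toy sketch of my own (one byte per disk,
the same GF(2^8) field the kernel raid6 code uses; not actual md code): with
both P and Q you can locate a single silently-corrupted block, because the P
syndrome gives the error value and the Q syndrome gives the same value scaled
by the position of the bad disk.

# Toy demonstration of locating a single corrupted RAID6 data block
# from the P and Q syndromes alone.

# GF(2^8) log/exp tables, polynomial 0x11d, generator 2.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def syndromes(data):
    """P = xor of the data bytes, Q = xor of g^i * data[i]."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q

data = [0x11, 0x22, 0x33, 0x44, 0x55]    # one byte per data disk
P, Q = syndromes(data)

bad = data[:]
bad[3] ^= 0x5a                           # silent corruption, no I/O error

p2, q2 = syndromes(bad)
sp, sq = P ^ p2, Q ^ q2                  # P and Q syndromes of the bad stripe
z = (LOG[sq] - LOG[sp]) % 255            # position of the corrupted block
print("corrupted disk:", z)              # -> 3
print("repaired byte :", hex(bad[z] ^ sp))  # -> 0x44

As far as I know this is the idea behind the raid6check tool in the mdadm
source tree, which works stripe by stripe; the in-kernel 'repair' does not
do this.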
In short, when a 'check' error is detected I identify the affected files and
deal with their content, restoring from backup if necessary.
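For reference, the scrub itself is just the md 'check' action. A minimal sketch
(assuming an array at /dev/md0 and root) of what a scrub job boils down to,
plus reading the mismatch counter afterwards:

import time

MD = "/sys/block/md0/md"               # adjust to your array

def write(name, value):
    with open(f"{MD}/{name}", "w") as f:
        f.write(value)

def read(name):
    with open(f"{MD}/{name}") as f:
        return f.read().strip()

write("sync_action", "check")          # start the scrub
while read("sync_action") != "idle":   # reads 'check' while it is running
    time.sleep(60)
print("mismatch_cnt:", read("mismatch_cnt"))  # non-zero means a stripe did not verify

This is essentially what the distro raid-check / checkarray cron scripts do.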
> seen one IBM XT with dodgy memory that started corrupting everything that was
> written to disk - and funnily enough running full disk scans made the problem
> worse. In the case of RAID 5 and 6 you've already got full redundancy should
> one piece go missing - you can literally lose an entire disk and keep going in
> degraded mode, so why do you care if a block goes bad?
> My favourite bit was:
> """Due to the high integration density of contemporary computer memory chips,
> the individual memory cell structures became small enough to be vulnerable to
> cosmic rays and/or alpha particle emission. The errors caused by these
> phenomena are called soft errors. This can be a problem for DRAM- and
> SRAM-based memories."""
> So the solution to the possibility of one of the sectors going silently bad -
> despite all the systems in the drive itself to detect and correct these errors
> - is to read it into memory which can also be faulty? Tell me again how this
> is going to help? :-)
> Joking aside, I think maybe a scrub once a year might be a possibility if
> you're dealing with hardware you don't entirely trust. But I'd keep it to a
> minimum because, as you observe, drives do wear out with overuse, and full
> disk scans will do that.
I doubt a yearly scrub will ever come out clean :-(
But I probably need to reduce the frequency to monthly.
Question: How often do other people run a scrub?
> It's actually one of the common failure modes in RAID arrays - you have
> several near-failing drives but they're fine with the normal workload. One
> drive fails and you replace it, but now the RAID rebuild has to scan every
> single disk, so all the other drives near fail point are pushed over the limit.
True, but while seeing a second disk fail during a rebuild is scary, it is not
fatal. In my experience disks fail slowly, a few bad sectors at a time, and I
can deal with a few bad sectors while replacing a failing disk.
I have not yet had a catastrophic failure in the 12 years of running software
raid, but the last (current) set of disks is the worst. These are WD Blacks and,
out of 7 disks, six were replaced (some more than once) and the seventh had
"events" suggesting it is next. This is after 3.5 years of the planned 5-year
life of this array.
BTW, these disks do not list a workload limit, and I assume they are designed
for continuous use with a high load.
> But by "common" here I mean "this almost never happens". I've heard
> anecdotes, and I'm hoping some people on the list can share hilarious stories
> about drive and array failures :-)
> So, anyone running BTRFS? Can you grep your logs for "csum failed ino" and
> tell us all how often you see that error? BTRFS stores checksums on metadata
> and data (XFS stores checksums only on metadata), and verifies the checksum on
> read and re-reads data if it's not correct. If we could get some kind of
> statistical analysis of the number of reads and writes performed vs the number
> of actual checksum failures, we'd get an idea of how likely they are in the
> real world.
> Have fun,
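For anyone wanting to put a number on that, a rough sketch (assuming systemd's
journal and the "csum failed ino" string Paul mentions) that tallies those
errors:

import subprocess

# Kernel messages from the current boot; widen the range if your journal
# keeps more history.
out = subprocess.run(
    ["journalctl", "-k", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

hits = [line for line in out.splitlines() if "csum failed ino" in line]
print(f"{len(hits)} btrfs checksum failures logged")
for line in hits[:10]:                 # show a sample
    print(line)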
Eyal Lebedinsky (eyal at eyal.emu.id.au)