[clug] raid question [some resolution]
Eyal Lebedinsky
eyal at eyal.emu.id.au
Tue Feb 25 05:45:06 MST 2014
Turns out 'check' does not hit the bad sector because it lies early enough
on the disk to be inside the md header, before the data area. From "mdadm -E /dev/sdi1":
Avail Dev Size : 7813771264 (3725.90 GiB 4000.65 GB)
Array Size : 19534425600 (18629.48 GiB 20003.25 GB)
Used Dev Size : 7813770240 (3725.90 GiB 4000.65 GB)
Data Offset : 262144 sectors
and the error is in sector 259648.
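A quick sanity check of the numbers above (a sketch using the figures
from the "mdadm -E" output; nothing here touches the disk):

```shell
# The Data Offset marks where array data begins on the component.
# A sector below that offset belongs to the md header, not to array data,
# which is why a 'check' pass never reads it.
BAD_SDI1=259648       # bad sector, relative to the start of sdi1
DATA_OFFSET=262144    # "Data Offset" from mdadm -E /dev/sdi1, in sectors
if [ "$BAD_SDI1" -lt "$DATA_OFFSET" ]; then
    echo "sector $BAD_SDI1 is in the header, $((DATA_OFFSET - BAD_SDI1)) sectors before the data start"
fi
```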
The discussion is continuing on the linux-raid list, to select the best
method of recovery before the system hits the bad sector in the header
and gets really annoyed.
I also want to understand how resilient software raid is in the face of
this situation: a bad data sector may lose some data, while a bad header
sector may kick the whole disk out of the array (hopefully leaving it only
slightly degraded).
cheers
Eyal
On 02/20/14 13:08, Eyal Lebedinsky wrote:
> In short: smartctl lists one pending sector. A dd provokes an i/o error as expected.
> An mdadm 'check' does not find a problem and does not trigger an i/o error. Why?
>
>
> My smart log is indicating a pending sector in a component of a 7x4TB raid6 device.
> Looking at the component I see:
>
> # smartctl -x /dev/sdi
> SMART Extended Self-test Log Version: 1 (1 sectors)
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Short offline Completed: read failure 90% 5878 261696
>
> I then test it:
>
> # dd if=/dev/sdi of=/dev/null skip=261120 count=2048
> dd: error reading '/dev/sdi': Input/output error
> 576+0 records in
> 576+0 records out
> 294912 bytes (295 kB) copied, 3.18338 s, 92.6 kB/s
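The dd result lines up with the smartctl report: 576 full 512-byte records
were read from sector 261120 before the error, so the first unreadable
sector should be 261120 + 576. A sketch of that check:

```shell
# dd read 576 records of 512 bytes each, starting at sector 261120,
# before hitting the i/o error. The first failing sector is therefore
# the skip offset plus the number of records read successfully.
SKIP=261120       # dd skip= value, in 512-byte sectors
RECORDS_IN=576    # "576+0 records in" from dd
echo "first failing sector: $((SKIP + RECORDS_IN))"
# This matches LBA_of_first_error = 261696 from smartctl.
```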
>
> and the log shows:
>
> # dmesg|tail
> [768141.382189] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> [768141.461997] 00 03 fe 40
> [768141.503122] sd 6:0:6:0: [sdi]
> [768141.542668] Add. Sense: Unrecovered read error - auto reallocate failed
> [768141.623913] sd 6:0:6:0: [sdi] CDB:
> [768141.667622] Read(16): 88 00 00 00 00 00 00 03 fe 40 00 00 00 08 00 00
> [768141.748586] end_request: I/O error, dev sdi, sector 261696
> [768141.816217] Buffer I/O error on device sdi, logical block 32712
> [768141.889061] ata13: EH complete
> [768141.927696] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
>
> I decided to run a raid check on this part (the first 1GB is enough to cover this
> bad sector), but it found no problem and triggered no i/o error.
>
> This last fact I find unexpected, as I thought the mdadm 'check' operation would read
> all 7 parts of each stripe and validate the parity.
>
> Q1) Why do I not see an i/o error from the raid check?
>
> I want to use debugfs to see where the problem is and fix it. I need to know which
> fs blocks include this sector (actually the whole stripe needs to be attended to).
> If any are in use I will try to recover the files. I will then run a raid 'repair'
> on the area.
>
> sdi sector 261696 is sdi1 sector 259648, and in a 7-part raid6 that is 259648*5=1298240
> sectors into the fs, or 162280 4k blocks. The blocks in this area are my focus.
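The mapping in the quoted message can be sketched as below. Note this is a
rough sketch of that reasoning only: it assumes 5 data devices per stripe in
the 7-device raid6 and ignores the data offset and chunk layout, which the
follow-up at the top shows actually matter here.

```shell
# In a 7-device raid6, 5 devices carry data per stripe, so (roughly)
# each component sector corresponds to 5 sectors of array data.
# Eight 512-byte sectors make one 4k filesystem block.
SDI1_SECTOR=259648
FS_SECTOR=$((SDI1_SECTOR * 5))   # approximate sector offset into the fs
FS_BLOCK=$((FS_SECTOR / 8))      # the same offset in 4k blocks
echo "fs sector $FS_SECTOR, 4k block $FS_BLOCK"
```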
>
> Q2) Is this logic correct?
>
> TIA
>
--
Eyal Lebedinsky (eyal at eyal.emu.id.au)