[clug] S.M.A.R.T message for hd failure

steve jenkin sjenkin at canb.auug.org.au
Thu Jan 31 22:59:02 GMT 2008


Joshua Worth wrote on 31/1/08 9:28 PM:
> It doesn't look good, but I saw on a forum that it might be lying to me
> but I can't be sure. This message was appearing when I had an extra 80
> gigabyte drive in my computer, but after taking that out and doing some
> tests, it turned out to be fine. I am using OpenSuSE 10.3 X86_64
> Is there a way I could fix this without destroying any data?
>
> Here is the forum I looked at:
> http://suseforums.net/index.php?showtopic=42621
>   

'S.M.A.R.T.' stands for "Self-Monitoring, Analysis, and Reporting
Technology"
<http://en.wikipedia.org/wiki/S.M.A.R.T.>

In the last year there have been two major studies published on the
failure rates of newish technology disk drives.
[Why 'newish' drives? You have to run drives for 5 years to collect the
baseline.]

Notably, SMART addresses mechanical faults and cannot warn of, or report
on, electronics failures, which are typically sudden and catastrophic.
"The Google team found that 36% of the failed drives did not exhibit a
single SMART-monitored failure."

Take-away:
* Keep an eye on the SMART output. It *will* tell you about some failing
drives.
* Don't forget that for about one third of failures, SMART gives no
warning at all.
* Old drives fail much more...
* System/Controller/Software faults that scramble data are as likely as,
or more likely than, drive failure.
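
To make "keep an eye on the SMART output" concrete, here is a minimal
sketch that scans the attribute table printed by `smartctl -A` (from
smartmontools) for a few counters that are worth a second look when
they are nonzero. The watch list and the function names are
assumptions for illustration, not a vendor standard, and a clean
result does not mean the drive is healthy.

```python
import subprocess

# Attributes treated here as warning signs when their raw value is nonzero.
# This watch list is an assumption for illustration, not a vendor standard.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(attr_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in attr_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header and banner lines are skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


def read_attributes(device):
    """Run `smartctl -A` on a device; needs smartmontools and usually root."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout
```

Calling nonzero_watched(read_attributes("/dev/sda")) as root returns an
empty dict when none of the watched counters has moved. The device name
/dev/sda is only an example.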

Two blog posts that pull those studies together:

"Google’s Disk Failure Experience"
<http://storagemojo.com/2007/02/19/googles-disk-failure-experience/>

"Everything You Know About Disks Is Wrong"
<http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/>

CMU paper: "Disk failures in the real world: What does an MTTF of
1,000,000 hours mean to you?"
[They looked at 100,000 drives]
<http://www.usenix.org/events/fast07/tech/schroeder.html>
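
As a rough guide to what the MTTF figure in that title implies, here is
a small sketch that converts an MTTF in hours into a yearly failure
probability. It assumes a constant failure rate (an exponential
lifetime model), which is a simplification, and the function name is
made up for this example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous operation


def annual_failure_rate(mttf_hours):
    """Chance a drive fails within one year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mttf_hours)


afr = annual_failure_rate(1_000_000)
print(f"{afr:.2%}")  # prints 0.87%
```

So a 1,000,000 hour MTTF, taken at face value, means a bit under 1 in
100 drives failing per year.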

Google: "Failure Trends in a Large Disk Drive Population"
<http://labs.google.com/papers/disk_failures.pdf> [PDF]


Summary:
As another post advised, run the Linux SMART utility (smartctl, from the
smartmontools package) to check the drive.
Make sure you have second, safe copies of important data.
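
A minimal sketch of the check, assuming smartmontools is installed and
the script is run as root. It calls `smartctl -H` and reduces the
output to a one-word verdict. The function names and the device path
are made up for this example, and given the numbers above, a "passed"
verdict is no substitute for backups.

```python
import subprocess


def parse_health(output):
    """Map `smartctl -H` output to 'passed', 'failed', or 'unknown'."""
    for line in output.splitlines():
        # ATA drives report "... self-assessment test result: PASSED";
        # SCSI drives report "SMART Health Status: OK".
        if "test result:" in line or "SMART Health Status:" in line:
            verdict = line.split(":", 1)[1].strip()
            return "passed" if verdict in ("PASSED", "OK") else "failed"
    return "unknown"


def check_drive(device):
    """Run `smartctl -H` on a device such as /dev/sda and parse the verdict."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return parse_health(result.stdout)
```

For example, check_drive("/dev/sda") returns "unknown" if the drive does
not report a health status at all.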

HTH
sj

-- 
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin


