I recently received a critical alert email from my FreeNAS box with the following error:
Device: /dev/ada3, Self-Test Log error count increased from 0 to 1
Device: /dev/ada3, 1 Currently unreadable (pending) sectors
Rather glad to know the email alerts I set up are working reliably, but it looks as though I might have a few bad sectors on one of my drives.
The following commands resolved the error without resulting in any downtime.
The drive in question was /dev/ada3, so first log in to a shell on your FreeNAS box as root and run a SMART long self-test (replace adaX with your corresponding device).
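The command for this step is missing from the text, so here is a minimal sketch of what it would typically look like with smartmontools. The function name is my own, and the device path is whatever your alert email named.

```shell
# Start an extended (long) SMART self-test on one disk.
# $1 is the device node, e.g. /dev/ada3 (replace with your own).
start_long_test() {
    smartctl -t long "$1"
}

# Usage (as root): start_long_test /dev/ada3
```

The test runs inside the drive in the background, so the command returns straight away and the pool stays online.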
After the test has finished (it might take a few hours), view the results.
From the results remember the sector size and the location of the faulty sector (LBA_of_first_error).
In my case my sector size was 512 and LBA_of_first_error was 3082982872.
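As a rough sketch, the LBA can also be pulled out of the self-test log with awk instead of reading it by eye. This assumes the log layout smartctl normally prints, where failed entries contain "read failure" and the LBA is the last column. The function name is hypothetical.

```shell
# Print the LBA_of_first_error of the most recent failed self-test.
# Reads smartctl self-test log output on stdin.
first_error_lba() {
    awk '/^# *[0-9]+/ && /read failure/ { print $NF; exit }'
}

# Usage (as root): smartctl -l selftest /dev/ada3 | first_error_lba
```

Entries that completed without error show a dash in that column and are skipped by the "read failure" match.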
To correct the SMART error we will zero out the bad sector(s) on the drive, but first we need to permit write access to the drive.
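The original command is not shown here. On FreeBSD-based systems, the usual way to allow raw writes to a disk that is in use is the GEOM debug flag below, so treat this as my assumption about what was meant. The function names are hypothetical.

```shell
# Allow writes to a disk that GEOM currently has open (assumed step).
allow_raw_writes() {
    sysctl kern.geom.debugflags=0x10
}

# Put the flag back to its default once the sector has been rewritten.
restore_geom_flags() {
    sysctl kern.geom.debugflags=0
}

# Usage (as root): allow_raw_writes, run dd, then restore_geom_flags
```

Leaving the flag set removes a safety net, so it seems sensible to reset it as soon as the write is done.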
Now zero out the sector reported in the self-test results.
Replace of=/dev/adaX, bs=512, seek=3082982872 with values relevant to your drive, and keep count=1 so that only that one sector is written.
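Because a typo in this step can wipe a disk, here is a cautious sketch that only prints the dd command so it can be reviewed before being run by hand. The helper name is mine, and the values are the ones quoted above.

```shell
# Print (do not run) the dd command that overwrites exactly one sector.
# $1 = device, $2 = sector size in bytes, $3 = LBA of the bad sector.
zero_sector_cmd() {
    echo "dd if=/dev/zero of=$1 bs=$2 count=1 seek=$3"
}

# Usage: zero_sector_cmd /dev/adaX 512 3082982872
```

dd counts seek in blocks of bs bytes, so with bs equal to the sector size the seek value is simply the LBA, and count=1 stops the write after one block.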
Re-run the SMART report command to check if the ‘Current_Pending_Sector’ is now showing 0.
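A small sketch for checking just that one attribute, assuming the usual ten-column attribute table where the raw value is the last column. The function name is hypothetical.

```shell
# Print the raw Current_Pending_Sector value from the attribute table.
# Reads smartctl attribute output on stdin.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Usage (as root): smartctl -A /dev/ada3 | pending_sectors
```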
To check the ZFS file system integrity, run a scrub of the pool. Replace poolX with the name of the pool the drive belongs to (list pools with ‘zpool list’).
Finally check the output of the scrub to ensure there are no known data errors.
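The two ZFS steps can be sketched together as below. The function name is my own and the pool name is a placeholder.

```shell
# Start a scrub on a pool, then show its status including any data errors.
# $1 is the pool name as listed by 'zpool list'.
scrub_and_report() {
    zpool scrub "$1" && zpool status -v "$1"
}

# Usage (as root): scrub_and_report poolX
```

The scrub runs in the background, so the status shown right after starting it is only a progress report. Run the status command again once the scrub has finished.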
If you have redundancy, one or two bad sectors can be fixed even without any downtime.
The error message might look something like this:
You can see that the faulty device is /dev/ada2. Now, log in to the console and run a long SMART test:
It should tell you that the test has started and when it will finish. After it finishes, check the test results:
The output will tell you two important things. In the information section there is the sector size:
Near the end there is the SMART self-test log, which tells you whether the test failed and which sector is faulty (LBA_of_first_error column).
Now we have all the info we need to fix it. We will write directly to this sector to force it to reallocate. The important parameters are ‘of’, which is your faulty device name, ‘bs’, which is the sector size, and ‘seek’, which is the sector number. Limit the write with count=1 so that only one sector is overwritten.
Then check if the ‘Current_Pending_Sector’ in ‘SMART Attributes Data Structure’ table went to 0:
If not, repeat the long test and write to all reported sectors until they reallocate. Then run a scrub, replacing ‘poolname’ with your pool name:
Finally, check the scrub status:
That should do it.
That dd command looks dangerously wrong, unless you really want to write a block as big as 850MiB. Once smartctl told you the LBA_of_first_error, and assuming your device has 512 byte sector blocks, you should then use something like:
dd if=/dev/zero of=/dev/yourdevice bs=512 count=1 seek=892134344
It is best to check the bs value using diskinfo -v /dev/ada(your disk number).
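As a sketch, the sector size can be picked out of the diskinfo output with awk. This assumes the usual "value # label" layout of diskinfo -v. The function name is hypothetical.

```shell
# Print the logical sector size reported by diskinfo -v.
# Reads diskinfo output on stdin.
sector_size_from_diskinfo() {
    awk '$2 == "#" && $3 == "sectorsize" { print $1 }'
}

# Usage: diskinfo -v /dev/ada1 | sector_size_from_diskinfo
```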
Johnrezzo is correct. bs=sector. However, even though my sectors were 512, nothing was fixed after 40+ tries. It only worked once I went above what the sector number was and d > so my shell command was: dd if=/dev/zero of=/dev/ada2 bs=2048 count=1 seek=27945554
BS= stripesize and hence 512 did not work. 2048 worked but best to find the info using command: diskinfo -v /dev/ada1
Oooof. I really wish I had read the comments first before trying this…
In reply to “Oooof”: lol. One would think FreeNAS would have a simple point-and-click web UI way of dealing with this common problem, but no, FreeNAS failed us massively. When your one and only job is to handle storage drives, how could you not handle something as simple as SMART errors? Anyway, the safest, simplest noob way to handle this, if you care about your data, is to download SeaTools for DOS and burn it to a CD, run the long test on the drive, repair the bad sectors/blocks, then boot FreeNAS and do a scrub. All done, all is well till next time. It is a pain in the butt, but it works well and is easy. Here is a link to the SeaTools for DOS download: http://www.seagate.com/support/downloads/item/seatools-dos-master/
And yes, it works on any drive brand and any file system, for me anyway. Have fun.
Instead of increasing sector number, you can just increase count, since total bytes written = sector size * count. And if the part of the disk that is failing is more than a few bytes large, it might pay to set count=something higher than 1, maybe even, say, 10, to force that number of sectors to remap.
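To make the arithmetic concrete, here is a small sketch that prints where a given dd write would start and how many bytes it would cover. The helper name is mine, and the numbers in the usage line are only an illustration built from the sector quoted earlier in the comments.

```shell
# Print the starting byte offset and the length in bytes of a dd write.
# $1 = bs, $2 = count, $3 = seek (counted in blocks of bs bytes).
write_range() {
    start=$(( $3 * $1 ))
    length=$(( $1 * $2 ))
    echo "$start $length"
}

# Usage: write_range 512 10 892134344
# prints: 456772784128 5120
```

So count=10 with bs=512 rewrites the bad sector plus the nine sectors after it, 5120 bytes in total.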
In my case I also used /dev/random instead of /dev/zero.
Due to not reading the comments first, I unfortunately (what was I thinking? Oh yeah, I was not) executed the original command listed in the blog with bs= and seek=0. Then waited some seconds before realizing it was wrong and trying to stop it with control-C which seemed nonresponsive. Basically I zeroed out my drive lol. This caused ZFS to take it completely offline and put my pool in a DEGRADED state. I’m currently dealing with that, considering just physically removing the drive at this point, chalking it up as a loss and replacing it with a spare I have and resilvering.
Original Poster: PLEASE edit your post to fix the commands! Your advice will cause people to zero out their disks! lol
Since drives (as I understand it) already have some capacity set aside for automatic sector remapping as the drive is used, it’s my opinion that if these actually start showing up in userland, things are already f***ed and it may pay to just remove the drive entirely. Seems a waste since the rest of the drive is still theoretically readwriteable, but whatever, capacity is cheap these days.
Ah, I see what the problem here is. WordPress is filtering out anything in between less-than and greater-than signs, period. That’s lame. SO my command above that sa >
There is an error in the information given. The command appears as:
dd if=/dev/zero of=/dev/ada2 bs=892134344 count=1 seek= conv=noerror,sync
It should be:
dd if=/dev/zero of=/dev/ada2 bs=4096 count=1 seek=892134344 conv=noerror,sync
Note: 4096 is the stripesize, found by using the command diskinfo -v /dev/ada(your disk number).
For example, my disk is number 1, so ada1 is used in this example:
# diskinfo -v /dev/ada1
512 # sectorsize
2000398934016 # mediasize in bytes (1.8T)
3907029168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
3876021 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
WD-WCAVY6679527 # Disk ident.
The seek value is the sector number, as shown in the self-test log below.
So in my case the seek would be 887447520.
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA
# 1 Extended offline Completed: read failure 80% 23265 887
# 2 Short offline Completed without error 00% 18175 –
# 3 Short offline Completed without error 00% 13959 –
# 4 Short offline Completed without error 00% 12206 –
# 5 Short offline Completed without error 00% 2308 –
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Hence the command I would use is:
dd if=/dev/zero of=/dev/ada1 bs=4096 count=1 seek=887447520 conv=noerror,sync
Key: bs = stripesize.
I hope this helps, because although the above text is useful it did have an error in it, and it took me some time to resolve after reading a number of other articles.
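One thing worth double-checking before running a command like the one above: dd counts seek in blocks of bs bytes, while the LBA in the self-test log is counted in logical sectors (512 bytes on this disk, going by the diskinfo output). With the numbers shown, 887447520 blocks of 4096 bytes would be about 3.6 TB, which is past the 2000398934016-byte media size. A hedged sketch of the conversion, with a helper name of my own:

```shell
# Convert an LBA into the dd seek value for a different block size.
# $1 = LBA, $2 = size of one LBA sector in bytes, $3 = dd block size (bs).
seek_for_bs() {
    echo $(( $1 * $2 / $3 ))
}

# Usage: seek_for_bs 887447520 512 4096
# prints: 110930940
```

Integer division rounds down, so the result is the 4096-byte block that contains the bad 512-byte sector.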
I have a disk with some pending unreadable sectors, according to smartd. What would be the easiest way to make the disk remap them and stop smartd from complaining?
Today, I get two of these every hour:
The system is an x86 system running Ubuntu Linux 9.10 (jaunty). The disk is part of an LVM group. This is how smartctl identifies the disk:
A pending unreadable sector is one that returned a read error and which the drive has marked for remapping at the first possible opportunity. However, it can’t do the remapping until one of two things happens:
- The sector is reread successfully
- The sector is rewritten
Until then, the sector remains pending. So you have two corresponding ways to deal with this:
1. Keep trying to reread the sector until you succeed
2. Overwrite that sector with new data
Obviously, (1) is non-destructive, so you should probably try it first, although keep in mind that if the drive is starting to fail in a serious way then continual reading from a bad area is likely to make it fail much more quickly. If you have a lot of pending sectors and other errors, and you care about the data on the drive, I recommend taking it out of service and using the excellent tool ddrescue to recover as much data as possible. Then discard the drive.
If the sector in question contains data you don’t care about, or can restore from a backup, then overwriting it is probably the quickest and simplest solution. You can then view the reallocated and pending counts for the drive to make sure the sector was taken care of.
How do you find out what the sector corresponds to in the filesystem? I found an excellent article on the smartmontools web site, although it’s fairly technical and is specific to ext2/3/4 and reiser file systems.
A simpler approach, which I used on one of my own (Mac) drives, is to use find / -xdev -type f -print0 | xargs -0 followed by a tool that reads each file (eg md5sum) to read every file on the system. Make a note of the pending count before running this. If the sector is inside a file, you will get an error message from the tool you used to read the files showing you the path to it. You can then focus your attention on re-reading just this file until it reads successfully. Often this will solve the problem, if it’s an infrequently-used file which just needed to be reread a few times. If the error goes away, or you don’t encounter any errors in reading all the files, check the pending count to see if it’s decreased. If it has, the problem was solved by reading.
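A minimal sketch of that read-everything pass, assuming md5sum is the reading tool (on a Mac the equivalent tool is md5). The function name is my own.

```shell
# Read every regular file under a directory without crossing into
# other filesystems. Checksums are discarded; read errors stay visible
# on stderr together with the path of the file that failed.
read_all_files() {
    find "$1" -xdev -type f -print0 | xargs -0 md5sum > /dev/null
}

# Usage (as root): read_all_files /
```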
If the file cannot be read successfully after multiple tries (eg 20) then you need to overwrite the file, or the block within the file, to allow the drive to reallocate the sector. You can use ddrescue on the file (rather than the partition) to overwrite just the one sector, by copying to a temporary file and then copying back again. Note that just removing the file at this point is a bad idea, because the bad sector will go into the free list where it will be harder to find. Completely overwriting it is bad too, because again the sectors will go into the free list. You need to rewrite the existing blocks. The notrunc option of dd is one way to do this.
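Here is a rough sketch of the notrunc idea. Note that this version overwrites one block of the file with zeros, so whatever was in that block is lost. It is not the ddrescue copy-and-restore approach. The function name and arguments are hypothetical.

```shell
# Overwrite one block inside an existing file, in place.
# $1 = file, $2 = block number within the file, $3 = block size in bytes.
# conv=notrunc keeps the rest of the file and its length untouched.
rewrite_block() {
    dd if=/dev/zero of="$1" bs="$3" seek="$2" count=1 conv=notrunc
}

# Usage: rewrite_block /path/to/file 1 512
```

Without conv=notrunc, dd would truncate the file after the written block, which is exactly the free-list problem described above.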
If you encounter no errors, and the pending count did not decrease, then the sector must be in the freelist or in part of the filesystem infrastructure (eg an inode table). You can try filling up all the free space with cat /dev/zero >tempfile , and then check the pending count. If it goes down, the problem was in the free list and has now gone away.
If the sector is in the infrastructure, you have a more serious problem, and you will probably encounter errors just walking the directory tree. In this situation, I think the only sensible solution is to reformat the drive, optionally using ddrescue to recover data if necessary.
Keep a very close eye on the drive. Sector reallocation is a very good canary in the coal mine, potentially giving you early warning of a drive that is failing. By taking early action you can prevent a later catastrophic and very painful landslide. I’m not suggesting that a few sector reallocations are an indication that you should discard the drive. All modern drives need to do some reallocation. However, if the drive isn’t very old ( 1/month) then I recommend you replace it asap.
I don’t have empirical evidence to prove it, but my experience suggests that disk problems can be reduced by reading the whole disk once in a while, either by a dd of the raw disk or by reading every file using find . Almost all the disk problems I’ve experienced in the past several years have cropped up first in rarely-used files, or on machines that are not used much. This makes sense heuristically, too, in that if a sector is being reread frequently the drive has a chance to reallocate it when it first detects a minor problem with that sector rather than waiting until the sector is completely unreadable. The drive is powerless to do anything with a sector unless the host accesses it somehow, either by reading or writing it or by conducting one of the SMART tests.
I’d like to experiment with the idea of a nightly or weekly cron job that reads the whole disk. Currently I’m using a «poor man’s RAID» in which I have a second hard drive in the machine and I back up the main disk to it every night. In some ways, this is actually better than RAID mirroring, because if I goof and delete a file by mistake I can get yesterday’s version immediately from the backup disk. On the other hand, I believe a hardware RAID controller does a lot of good work in the background to monitor, report and fix disk problems as they emerge. My current backup script uses rsync to avoid copying data that hasn’t changed, but in view of the need to reread all sectors maybe it would be better to copy everything, or to have a separate script that reads the entire raw disk every week.
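A sketch of what such a weekly read pass might look like. The function name, the script path and the schedule in the crontab comment are all hypothetical, and I have not measured how long this takes on a large disk.

```shell
# Read an entire device (or file) and throw the data away, so the
# drive gets a chance to notice weak sectors.
# $1 = device node, e.g. /dev/sdX.
read_whole_disk() {
    dd if="$1" of=/dev/null bs=1048576
}

# Example crontab entry, Sundays at 03:00 (hypothetical script path):
# 0 3 * * 0 /usr/local/sbin/read_whole_disk.sh /dev/sdX
```

dd exits with an error when it hits an unreadable sector, so a failure reported by cron would be a hint to go and look at the SMART counters.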