Copying files without fails

You can talk about almost anything that you want to on this board.

Moderator: Moderators

User avatar
koitsu
Posts: 4203
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Re: Copying files without fails

Post by koitsu »

Zepper wrote:Using chkdsk now... ;)
CHKDSK will not fix this problem. All that's going to do is potentially make your situation worse depending on how important the data is on the drive right now. *sighs* The only option you have given this drive's bad condition is to open up an RMA with Seagate and have the product replaced. The enclosure and the drive are a single product. Seagate, assuming the product is still under warranty (their site can determine this for you), will replace the entire product free of charge.

As for the individual attributes that are of concern -- and I'm having to go off of what HD Tune Pro shows rather than smartmontools, so my ability to reliably decode this is somewhat limited -- they are decoded below. Please ignore the "warning"/yellow labels in HD Tune Pro -- the author of this software does not fully understand that a non-zero value in some attributes DOES NOT indicate trouble (furthers my point about people needing to know how to decode the data properly). Also be aware that assuming SMART attributes are all zeroed from the factory on new drives is false.

Also be aware that even the descriptions of SMART attributes on Wikipedia are wrong. For example, attribute 197 says "If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased" -- that is completely false for many models of drives (ex. all Western Digital drives I've ever used and analysed, including present-day ones). So you can't entirely trust that either.

Attribute 1 (0x01) -- with Seagate drives, sometimes this attribute is vendor-encoded and other times it's a mix between a "rate" and a counter. Therefore, sometimes a non-zero value here can indicate repeated re-read attempts done by the drive itself (the storage layer has no idea what's going on under the hood) where eventually the drive is successful in reading data from a physical sector. Whether or not that's the case on this specific model I do not know (I'm not familiar with this exact model).

Attribute 5 (0x05) -- indicates there have been 88 successful LBA remaps during its power-on lifetime. An LBA is simply an arbitrary number that acts as a pointer to a physical sector on the disk. Disks made in the past 15-20 years use LBA addressing, thus the computer accesses data on a drive via an LBA, not a sector (the OS has no concept of what physical sector an LBA points to). So what this counter indicates is that there have been 88 events where an LBA was repointed to a spare sector. More on what that actually "means" down below.

Attribute 9 (0x09) -- indicates the number of power-on hours of the drive. On this model the counter represents hours, thus 1613 hours is roughly 67 days of power-on time. This is a fairly new, or at least fairly unused, drive.
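(The arithmetic, for anyone checking: hours divided by 24. The "raw value counts whole hours" interpretation is specific to this model.)

```python
# Attribute 9's raw value on this model counts whole hours of power-on time.
power_on_hours = 1613
days = power_on_hours / 24
print(round(days, 1))  # prints 67.2 -- i.e. roughly 67 days
```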

Attribute 11 (0x0B) -- indicates the drive has had 5 physical recalibration events during its power-on lifetime. Because this is a portable drive, this is more common/more likely than if it was a drive in a stationary system (ex. desktop). However, I would classify 5 full recalibrations during such a short power-on lifetime as an indicator of something physical going on within the drive. It may have been dropped, jostled, or incorrectly assembled at the factory (despite QC/QA). All are possibilities, and all are speculative.

Attribute 191 (0xBF) -- indicates the drive has had 5 shock events during its power-on lifetime. This number correlates with Attribute 11 above. "Shock events", or "G-shock", are indicators that the drive itself was dropped or jostled while it was on. (The drive cannot count these types of events when power is off). One of the problems with portable drives is that their G-shock sensors are extremely sensitive; I've seen these attributes increment on 2.5" Western Digital drives in laptops simply by someone picking up the laptop and putting it back down on a flat desk. They're very sensitive. However, these kinds of physical movements can in fact jostle the actuator arm and heads to the point where misalignment can take place. Remember: hard disk R/W heads are literally floating a few nanometres above the platters. Whether or not these physical events caused damage to the platters, inducing LBA remaps, is impossible for me to tell (especially with HD Tune Pro; I might have a better idea with smartmontools).

Attribute 194 (0xC2) -- this value is vendor-encoded on Seagate drives and cannot be decoded using HD Tune Pro. I believe smartmontools can decode this. This value should be ignored for this analysis.

Attribute 195 (0xC3) -- most Seagate drives have this attribute as non-zero, and it is vendor-encoded. This is the first time I have seen a Seagate drive show a 0 value for this attribute. I'm noting it here because it's a good indicator of how each drive model and firmware version changes in behaviour vs. comparative models. Normally this attribute indicates a count or possibly a rate of how often sector-level ECC has to be used to autocorrect data read from a sector (each actual sector on a hard disk contains an ECC region, alongside data and some metadata).

Attribute 196 (0xC4) -- Relates to attribute 5, indicating that there were 88 reallocation event counts. Note that this number does not necessarily have to equal that of attribute 5; this is an "event count", which does not necessarily guarantee an actual LBA remap. HD Tune Pro mislabels this attribute, sadly.

Attribute 197 (0xC5) -- indicates there are 64 "suspect" LBAs that are pending analysis. This explanation is long, so get some coffee or whatever.

During a read operation, a drive can have problems reading data from an LBA (which points to a physical sector); dust on the platters, head misalignment, spindle motor problems, actuator arm is slightly off, the list is endless. The drive internally (OS has no idea) will attempt to re-read the LBA an arbitrary number of times (varies per firmware implementation), and once reaching a retry threshold, will mark the LBA "suspect" and move on.

"Suspect" means the LBA can no longer be read by anything -- including the OS. You'll receive an I/O error when attempting to read from it. It does not mean the physical sector the LBA points to is bad/unusable, it just means that at that moment in time the drive could not read data from that LBA (and the drive will no longer let anything read that LBA).

The data at that LBA is effectively lost. You cannot get that data back, aside from one possibility: taking the drive to a data recovery company (specifically one who does physical data recovery, as in moves the platters to a donor drive or takes physical repair action). This requires that you have issued absolutely NO WRITES to the drive. That's very hard to guarantee too, because Windows writes crap to a disk under the hood all the time; you have no idea what it's doing. And I'll explain why the "DO NOT WRITE TO THE DRIVE!" matters:

A "suspect" LBA is only re-analysed (to determine if the sector the LBA points to is actually good/usable or not, or if the LBA should be mapped to a spare sector) on a write. So in some cases, yes, you can literally write to all the "suspect" LBAs on a drive and the number shown in attribute 197 will decrease as each sector at that LBA is deemed usable. Of course because you're writing data to the LBA, if successful, the data you just wrote will (naturally) overwrite whatever was there, but at least there wasn't an LBA remap. (Also in the case of an LBA remap or no LBA remap, attribute 196 will not decrement, hence why it's an "event counter" rather than a "remap counter").
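To make that concrete, here's a toy model -- purely illustrative, NOT real firmware (the exact semantics vary per vendor, model, and firmware, as I keep saying) -- of how attributes 5, 196, and 197 can interact when a "suspect" LBA is rewritten:

```python
# Toy model -- purely illustrative, NOT real firmware -- of how attributes
# 5 (remaps), 196 (reallocation events), and 197 ("suspect"/pending LBAs)
# can interact when a suspect LBA is rewritten.
class ToyDrive:
    def __init__(self):
        self.suspect = set()     # attribute 197: LBAs pending analysis
        self.realloc_events = 0  # attribute 196: event counter (never decrements)
        self.remapped = set()    # attribute 5: LBAs actually repointed to spares
        self.data = {}

    def read(self, lba):
        if lba in self.suspect:
            # the drive refuses to return data for a suspect LBA
            raise IOError("I/O error reading LBA %d" % lba)
        return self.data.get(lba, b"\x00" * 512)

    def write(self, lba, payload, sector_ok=True):
        if lba in self.suspect:
            self.suspect.discard(lba)   # 197 decreases: a write triggers re-analysis
            self.realloc_events += 1    # 196 counts the event either way
            if not sector_ok:
                self.remapped.add(lba)  # 5 increases only on an actual remap
        self.data[lba] = payload

d = ToyDrive()
d.suspect.add(100)                       # pretend a read retry threshold was hit
d.write(100, b"\x00" * 512)              # sector turned out usable
assert 100 not in d.suspect              # attribute 197 went down
assert d.realloc_events == 1             # attribute 196 counted the event anyway
assert len(d.remapped) == 0              # attribute 5 unchanged: no real remap
```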

A common technique I use to "test" drives in states where there are a very low (say 1 or 2) number of "suspect" LBAs is to simply write zeros to the LBAs which cannot be read (figuring out which LBA numbers to use is something the drive itself can do, believe it or not -- a SMART selective test, and on some drives a SMART short or SMART long test, can be used to get LBA numbers per results in the SMART self-test log -- HD Tune Pro does not support this, and it's a tedious/complex operation I won't describe here, but I use it regularly to do data recovery for people).

Now you see why ANY writes to the drive can potentially mean data loss if you are in fact wanting data recovery, particularly if the write hits a "suspect" LBA.

I'll use this opportunity to point something out: LBA numbers shown in OSes/within software (particularly on Windows) do not always map 1:1 with the LBA numbers used by a drive. They SHOULD map 1:1, but I have personally experienced many occasions where the OS has claimed LBA xyz is unreadable when in fact the LBA is some arbitrary number lower or higher than what the OS claims. I believe this is caused by storage drivers (ex. SATA/AHCI drivers) which use NCQ to report the incorrect LBA number (i.e. a driver bug). This is why I prefer to use the drive's own analysis tools (at the SMART level) to give me numbers.

And one more thing, more relevant I think: determining what file on a filesystem uses what LBA number is extremely painful on Windows (on Linux and FreeBSD it's a bit easier, but it depends on the filesystem (ext3 vs. reiserfs vs. UFS/FFS vs. ZFS)). Windows is a complete pain about this. There is speculation that fsutil can provide this on Vista or 7 or 8 (not XP), but the few times I've used it the numbers it's given are wrong / don't match reality. So I think it might actually be giving an NTFS cluster offset/number, which IS NOT the same as an LBA.
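If fsutil really is handing back NTFS cluster numbers, converting a cluster number to a drive LBA takes the partition's starting LBA plus the cluster and sector sizes. A hedged sketch of the arithmetic (the concrete values below are assumptions; the real ones come from the partition table and the NTFS boot sector):

```python
# Hypothetical NTFS cluster-number -> LBA conversion. The cluster size,
# sector size, and partition start LBA below are assumptions; the real
# values come from the NTFS boot sector and the partition table.
def cluster_to_lba(cluster, partition_start_lba, cluster_bytes=4096, sector_bytes=512):
    sectors_per_cluster = cluster_bytes // sector_bytes
    return partition_start_lba + cluster * sectors_per_cluster

# e.g. a partition starting at LBA 2048 with default 4 KiB clusters:
print(cluster_to_lba(1000, 2048))  # 2048 + 1000 * 8 = 10048
```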

The best thing to do to find out what files are impacted by "suspect" LBAs is to use a file copy utility (not a filesystem or partition or disk copy utility). Files read which return I/O errors are obviously impacted, and the utility should obviously give you the filename.

I hope this explains why using verification utilities when writing data to the drive is now questionable -- meaning: sure, you can write a 300MByte file to a drive successfully, but it doesn't mean you're necessarily going to successfully read all that data back (yes, it's true: a write can succeed where a read of the same LBA fails. My above explanation about "suspect" LBAs and how to reanalyse them should explain why/how that's possible).

Attribute 198 (0xC6) -- indicates how many failed LBA remaps there have been. This is particularly common if a drive has undergone extensive LBA remapping and has run out of spare sectors (uncommon but does happen, especially on 4K / 4096-byte sector drives). This value is 0 so that's good, it just means there haven't been any failed remaps.

Attribute 200 (0xC8) -- write version of attribute 1. I won't go into this for the same reasons as described in attribute 1.

Attribute 223 (0xDF) and attribute 225 (0xE1) -- I'm tired and am opting out of explaining these... sorry.

There is also one more situation that people have speculated about: bit rot. I've never personally or professionally encountered it (usually I can explain sudden checksum failures when using ZFS, for example; I can correlate them to SATA or SCSI or SAS events), but I do believe it's a strong possibility given how magnetic media works. But do not be inclined to believe SSDs are somehow better in this regard -- SSDs have their own sets of major problems that MHDDs don't. I don't want to get into a talk about that, but search Slashdot for "SSD" sometime and read the analysis done by some folks. Also, don't trust things you read on "enthusiast" websites (i.e. gamer-fuelled hardware review sites) -- these guys often have no idea what the fuck they're talking about, and that includes occasionally dudes like Anand Lal Shimpi (guy who runs Anandtech -- note the guy started the site when he was 14 years old... yeah, great, a 14 year old doing hardware reviews... I know he isn't 14 now, duh, but still...). Proper reviews and analysis have to be done by actual engineers; "enthusiast" sites often "talk tech" but have no fucking idea how something actually works under the hood. When it comes to hard disks, particularly IDE/PATA/ATA/SATA/AHCI, I'm one of the few who does. (What I don't fully understand are the physical characteristics, because I'm not a hardware engineer.)

I'm certain this very long explanation will induce a billion questions of all sorts (I can see Tepples writing up some enormous 200-page inline response), just know that I can't/won't really answer a lot of them because drives are very complex and it's a tedious process for me to explain it all with text/typing. I've done software-level data recovery for a long time (I speak/read ATA protocol and have worked on some of the FreeBSD ATA and AHCI subsystem drivers, although not at very deep levels) so that's where I get the education on this -- and a lot of people who do the same thing do it wrong/badly because they don't understand how drives work (or the fact that different models and vendors of drives behave differently; ex. WD drives do not behave the same with regards to many of the above as Seagate drives (I have more experience with WD drives)).

Anyway... the reason you're getting back anomalies during verifications after writing data to the drive is because while your writes worked fine, reads of the same LBAs you just wrote to fail / those LBAs are marked "suspect" by the drive.

So, long story short: don't bother copying anything to that drive any longer. Any data you have on it right now which you want to keep, copy off to another drive/somewhere else (and any I/O errors when reading that file means that you should ignore that file -- it's been lost, I hope you have other backups of it :) ), then do an RMA with Seagate to get a replacement product. That's the simple answer. There is no point in trying to "save" this drive given the behaviour/description of the problem -- it will perpetually be like this forever.

If you want to see the (probably hundreds by now?) examples of me assisting in drive problems, Google "koitsu dslreports drive" or "koitsu dslreport disk" and sift through them all. I even tell stories of data recovery, with data if you're into that sort of thing.
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Copying files without fails

Post by tepples »

Thanks for the detailed explanation, koitsu. Here's my 200 page response:

@k:
Buy another drive. Send this one back to the manufacturer if under warranty. Next time, for especially valuable data, make backups with forward error correction (such as PAR2) that can reconstruct the data in "suspect" sectors. And keep a backup set off-site.
.res 200*256 + @k - *
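For anyone wondering how forward error correction can rebuild data in unreadable sectors at all: here's a toy single-parity sketch. Real PAR2 uses Reed-Solomon coding and can survive multiple missing blocks; plain XOR parity survives exactly one, but the rebuild idea is the same.

```python
# Toy single-parity "forward error correction": XOR parity can rebuild
# exactly one missing block. PAR2 itself uses Reed-Solomon coding and can
# survive multiple missing blocks, but the rebuild idea is the same.
def make_parity(blocks):
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(blocks, parity):
    """blocks: list with exactly one None (the unreadable block)."""
    missing = blocks.index(None)
    out = bytearray(parity)
    for j, block in enumerate(blocks):
        if j != missing:
            for i, byte in enumerate(block):
                out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)
# pretend the middle block landed on a "suspect" LBA and is unreadable:
assert recover([b"AAAA", None, b"CCCC"], parity) == b"BBBB"
```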
User avatar
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil
Contact:

Re: Copying files without fails

Post by Zepper »

I'm doing a CHKDSK and... it started at morning. Now it's night time and still running. :)
I don't know if it works, but that's what I have for today.

The verdict isn't "buy a new HDD", but which one could be better? In the past, I used DVDs for backups, then I bought this HDD and everything's there. More than 10 years of backups.

edit: removed non-sense. :)
Last edited by Zepper on Fri Jul 18, 2014 7:51 pm, edited 1 time in total.
User avatar
koitsu
Posts: 4203
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Re: Copying files without fails

Post by koitsu »

The company doesn't really matter. There's a semi-recent... uh... "study" (*cough* if you can call it that *cough*) that indicates Seagate is abysmal compared to others, but it's all anecdotal if you ask me. Go with whatever brand you want / meets your budgetary constraints.

If you're asking me for my own experiences, I can tell you that I am incredibly happy with Western Digital products, including a 1TB My Passport drive (its USB/SATA bridge is worthwhile + allows full SMART passthrough, and it even has SES, which is pretty unique/rare). It's also USB 3.0 so that's an added bonus (if I ever end up going to Windows 7).

The timing of all of this is kinda funny though:

I ran Parodius for almost two decades nearly exclusively on Western Digital disks (we did have a couple Maxtors at one point) and during that time only had 4 separate disk issues (2 were drives that started going bad and within 48 hours died completely, the other 2 were an excess of "suspect" LBAs that grew to a point where I didn't feel comfortable using the drives any longer + had them RMA'd -- I used ZFS raidz1 (think RAID-5 but with checksumming and automatic data repair if issues were found)). The disks I bought were all "consumer grade"; in my experience there is no real difference between "enterprise grade" and "consumer grade" disks other than one thing: some enterprise disks have better shock absorption, which can matter a lot when using them in a very large SAN (say 16 disks per shelf), but for me it's not worth the 2-3x cost markup.

Outside of Parodius I've only had 1 or 2 disk issues, which is why I do backups.

The part that's funny is that the most recent issue happened 4 days ago -- my WD3003FZEX (3TB) used for backups, which had only been used for about 3-4 months, started throwing read errors, and all my analysis on it showed the situation would just get worse, so I Advance RMA'd it with WD. The replacement arrived yesterday and is in good shape; I just finished testing it about an hour ago.

So my experience/track record with Western Digital drives has been excellent, but keep in mind I tend to prefer drives that only use single platters (e.g. present-day WD drives that do not exceed 1TB in size). More platters == more heads and actuator arms == higher chance of something going wrong (yes, I try to apply KISS principle to hard disks :-) ). The WD1003FZEX (1TB) drives use single 1TB platters thus 2 heads, so I really like them. The WD3003FZEX I have, on the other hand, has 4 platters (4x750GB = 3TB) thus 8 heads, and as such I'm not surprised it was the one which started developing issues.
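If you want the "more heads == more chances of something going wrong" argument as arithmetic: assume each head/surface fails independently with some probability over a given period; the drive fails when any one of them does. (The probability below is made up purely for illustration; only the shape of the comparison matters.)

```python
# Back-of-the-envelope: if each head/surface independently fails with
# probability p over some period, a drive with n surfaces fails when ANY
# one of them does. The p value here is made up purely for illustration.
def drive_failure_prob(p_per_head, heads):
    return 1 - (1 - p_per_head) ** heads

print(round(drive_failure_prob(0.01, 2), 4))  # single platter, 2 heads: 0.0199
print(round(drive_failure_prob(0.01, 8), 4))  # four platters, 8 heads: 0.0773
```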

I have colleagues who swear by Seagate because their experience with WD has been abysmal. So like I said, it just varies per person. There are only two brands I strongly recommend you avoid: Fujitsu (I dealt with SCSI disk failures of theirs constantly at my old job, I'm talking 2-3 a week) and OCZ (who only does SSDs; and they're going byebye soon anyway since Toshiba just bought them for a hilariously low US$35M). I'm not particularly fond of present-day Seagate drives due to their confirmed and repeated firmware bugs and bad engineering choices (like excessively parking heads, although the WD Green drives do that too, which is why I also avoid those), nor am I fond of Samsung (confirmed firmware bugs). That's all just based on my personal and professional experience though.

With regards to your currently-going-bad Seagate drive, I'd recommend just doing an RMA if you can. I know you're in Brazil but I believe they have a Brazil office and can do RMAs there. You already spent the money on the product, might as well get the replacement rather than essentially throwing away money. But if you want to try a different brand of portable drive, I really like the WD My Passport drives.
User avatar
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil
Contact:

Re: Copying files without fails

Post by Zepper »

One more thing: in Windows 7, "ejecting" the HDD before pulling out the USB plug is a "must do" or a "should do"?
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Copying files without fails

Post by tepples »

I'd say "must do" when backups are involved. I'm not familiar with the guarantees that Windows Explorer makes on having forced the operating system to commit its cache of data to be written to disk. And some drives themselves have their own internal cache.
User avatar
TmEE
Posts: 789
Joined: Wed Feb 13, 2008 9:10 am
Location: Estonia, Rapla city (50 and 60Hz compatible :P)
Contact:

Re: Copying files without fails

Post by TmEE »

If you disable write-behind caching it goes into the "should do" category. Otherwise it is "must do", especially if you've got any virus scanners... those tend to make the buffer commits happen a whole lot later in my experience.
User avatar
koitsu
Posts: 4203
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Re: Copying files without fails

Post by koitsu »

What TmEE said is correct -- that feature mainly matters if write caching is enabled (in Windows) for the device or not. If it isn't, then it's usually safe to (physically) unplug the drive whenever you want. But if it's enabled you'll want to ensure you use the "eject" thing every time, otherwise pending/cached writes will not be flushed to the drive, and you'll end up eventually with a corrupt or damaged filesystem.

Keep in mind, however, there are multiple places where write caching can be used. For example there could be a caching layer within the filesystem driver, a separate layer within the USB driver (or IDE or AHCI driver if using those), and finally the actual drive itself (handled by the drive firmware + utilising its own on-board cache). I'm under the impression that when write caching is enabled on the device in Windows, "ejecting" the drive causes Windows to basically ensure any caches used at the filesystem and/or USB/IDE/AHCI layer are issued to the drive in advance.

For the deepest layer (the drive itself), there is actually an ATA/SCSI command that can be issued to the drive that is *supposed to* cause the drive to write all its cached data to the platters/media (in ATA it's FLUSH CACHE EXT or FLUSH CACHE, and the commands are supposed to only return when they have finished the operation). I say *supposed to* because on MHDDs this is usually reliable/truthful, while there have been anecdotal "studies" done on SSDs in the past year or two showing that many SSDs lie (e.g. FLUSH CACHE EXT returns very quickly but the data has not been fully written to NAND cells before power is removed). There's speculation that the issue lies with not having or using large capacitors that can hold enough voltage to keep the drive alive for a very short period after physically losing power (giving it a chance to fully flush things to NAND cells). Do MHDDs have this? Yes, to some degree (which is how that SMART attribute can be updated/incremented/allow tracking of such events. SMART attributes, by the way, are actually written to a special area on the hard disk that you can't normally access, probably a subset of the HPA region even if HPA isn't actively in use).
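From the application side, the layering looks like this (a Python sketch: flush() drains the userland buffer into the OS page cache, and os.fsync() asks the OS to push the page cache to the device -- whether the drive's own on-board cache then honours a FLUSH CACHE is up to the firmware, as described above):

```python
# Application-side view of the caching layers: flush() drains Python's
# userland buffer into the OS page cache; os.fsync() asks the OS to push
# the page cache to the device. Whether the drive's own on-board cache
# then honours a FLUSH CACHE is up to the firmware.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "flush_demo.bin")
with open(path, "wb") as f:
    f.write(b"important data")    # may sit in the userland buffer
    f.flush()                     # userland buffer -> OS page cache
    os.fsync(f.fileno())          # OS page cache -> device

with open(path, "rb") as f:
    assert f.read() == b"important data"
os.remove(path)
```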

All that said, regardless of the cache setting in Windows: "ejecting" the drive/device in Windows can also cause the underlying storage subsystem layer to submit things like ATA STANDBY or SLEEP CDBs to the underlying drive, giving it a chance to not only fully flush its cache, but to also "properly shut down" before the device is physically unplugged. There are some drives which are more sensitive to this needing to be done than others. Some will increment attribute 192 (0xC0), others will increment a different attribute (varies per manufacturer, model, and firmware). In my experience it's often 2.5" MHDDs which are sensitive about this.

TL;DR -- it's best to get into the habit of doing the "eject" method every single time, just to be safe/cautious, but if you have write caching disabled and/or don't particularly care about the latter, then just physically unplugging is okay. I personally got in the habit of "ejecting" after Windows one day, despite write caching being disabled, popped up a message about how not all data had been fully flushed/written to the USB-attached drive I was using before I had pulled it. I was like "OH REALLY? THANKS FOR LYING TO ME THEN". (Review of the drive showed that it did in fact write all the data, so I still don't know what Windows was complaining about, but it did not sit well with me regardless).
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Copying files without fails

Post by tepples »

koitsu wrote:I personally got in the habit of "ejecting" after Windows one day, despite write caching being disabled, popped up a message about how not all data had been fully flushed/written to the USB-attached drive I was using before I had pulled it. I was like "OH REALLY? THANKS FOR LYING TO ME THEN". (Review of the drive showed that it did in fact write all the data, so I still don't know what Windows was complaining about, but it did not sit well with me regardless).
I remember reading about a "currently mounted" bit in the file system header that's turned on when a file system is mounted and turned off when it is unmounted or otherwise cleanly synced. Under Windows 98, booting from a "currently mounted" drive triggered an automatic ScanDisk. I'm under the impression that if the "currently mounted" bit is set on NTFS, Windows will replay the journal to make the metadata consistent, and if it's set on a non-journaling file system, Windows will complain at the user in the way you describe. Perhaps all the data and metadata got written but the "currently mounted" bit didn't get turned off.
User avatar
koitsu
Posts: 4203
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Re: Copying files without fails

Post by koitsu »

While what you say is true, I don't think I did a good job articulating what happened. Meaning, in response to your last two lines: no, it would only be able to detect that situation on a re-mount. So let me explain clearly what happened and why as a result I have used "eject" consistently:

1. Attached USB-based (SATA) hard disk to system. Windows is configured for this device to have write caching DISABLED.
2. Did a bunch of I/O over the course of 15-20 minutes.
3. Did more I/O, but only on a specific file.
4. Was finished with device, so unplugged from USB port.
5. Windows systray immediately popped up a message (commonly called "toast" or "a toast") blabbing about how the device removed from the system may have lost data/issues because there was pending data to write to the device.
6. Reattachment did not show any anomalies, but this may be due to NTFS journal replay or whatever.
7. Review of file in step #3 showed everything I expected, i.e. journal replay did not "cause" loss of data.

So like I said, the write caching setting in Windows for the device is very important, however there is obviously some part of Windows (at least on XP -- and like my other threads, I'm not getting into a discussion about that, end of story) that still seems to cache some form of I/O to the device despite the setting implying otherwise. Hence, I use "eject" consistently every time, no matter what the write caching setting is set to.
Post Reply