Severe Problems with the SSD SanDisk Ultra II 240 GB

Yesterday, my SanDisk Ultra II 240 GB solid state drive, which hosts my Arch Linux system partition, showed a lot of ATA errors in the kernel log (dmesg). Rebooting was not possible: the system startup halted, again with ATA failures. Very annoying! I had been very happy with the performance of this SSD so far. But how reliable is it? Anyway, only my system installation was on it, so not much valuable data, but setting the system up once more would be annoying enough. (Next time, I will set up my system as described in my blog post RAID1 with SSD and HDD!)

So I removed the SSD, put it into a USB enclosure, and connected it to a different Linux computer running openSUSE 13.1. After checking its condition, I wanted to get as much data back as possible. Read on...

Checking the condition with SMART values

sudo smartctl -a /dev/sdf first got me this message:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/sdf: Unknown USB bridge [0x174c:0x5136 (0x001)]
Please specify device type with the -d option.

I found out that for my external USB enclosure I had to use the -d sat option: sudo smartctl -d sat -a /dev/sdf. This gave me the following output, which I could not interpret very well, but it is evident that some errors occurred (see the end of the output):

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-29-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SanDisk SDSSDHII240G
Serial Number:    143616402365
LU WWN Device Id: 5 001b44 c7bc95fbd
Firmware Version: X31200RL
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA >3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Aug  7 09:22:47 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===                                                                                                                                            
SMART overall-health self-assessment test result: PASSED

General SMART Values:                                                                                                                                                               
Offline data collection status:  (0x00) Offline data collection activity                                                                                                            
                                        was never started.                                                                                                                          
                                        Auto Offline Data Collection: Disabled.                                                                                                     
Self-test execution status:      (   0) The previous self-test routine completed                                                                                                    
                                        without error or no self-test has ever                                                                                                      
                                        been run.                                                                                                                                   
Total time to complete Offline                                                                                                                                                      
data collection:                (    0) seconds.                                                                                                                                    
Offline data collection                                                                                                                                                             
capabilities:                    (0x11) SMART execute Offline immediate.                                                                                                            
                                        No Auto Offline data collection support.                                                                                                    
                                        Suspend Offline collection upon new                                                                                                         
                                        command.                                                                                                                                    
                                        No Offline surface scan supported.                                                                                                          
                                        Self-test supported.                                                                                                                        
                                        No Conveyance Self-test supported.                                                                                                          
                                        No Selective Self-test supported.                                                                                                           
SMART capabilities:            (0x0003) Saves SMART data before entering                                                                                                            
                                        power-saving mode.                                                                                                                          
                                        Supports SMART auto save timer.                                                                                                             
Error logging capability:        (0x01) Error logging supported.                                                                                                                    
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   253   100   ---    Old_age   Always       -       2153
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       474
165 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       34367078489
166 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       43
168 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       9
169 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       2
174 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       72
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   061   043   ---    Old_age   Always       -       39 (Min/Max 17/43)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
230 Unknown_SSD_Attribute   0x0032   100   100   ---    Old_age   Always       -       180391247914
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Media_Wearout_Indicator 0x0032   100   100   ---    Old_age   Always       -       344
234 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       762
241 Total_LBAs_Written      0x0030   253   253   ---    Old_age   Offline      -       605
242 Total_LBAs_Read         0x0030   253   253   ---    Old_age   Offline      -       676
244 Unknown_Attribute       0x0032   000   100   ---    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 314 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 314 occurred at disk power-on lifetime: 2153 hours (89 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  41 40 08 c0 19 81 ed   8 sectors at LBA = 0x0d8119c0 = 226564544

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 c0 19 81 ed 08      00:00:00.000  READ DMA
  ca 00 88 18 bc ca e6 08      00:00:00.000  WRITE DMA
  ef 10 02 00 00 00 a0 08      00:00:00.000  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:00:00.000  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:00:00.000  SET FEATURES [Set transfer mode]

Error 313 occurred at disk power-on lifetime: 2153 hours (89 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  41 40 02 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:00:00.000  SET FEATURES [Enable SATA feature]
  c8 00 08 c0 19 81 ed 08      00:00:00.000  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:00.000  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:00:00.000  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:00:00.000  SET FEATURES [Set transfer mode]

Error 312 occurred at disk power-on lifetime: 2153 hours (89 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  41 40 02 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:00:00.000  SET FEATURES [Enable SATA feature]
  c8 00 08 90 b9 94 ed 08      00:00:00.000  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:00.000  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:00:00.000  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:00:00.000  SET FEATURES [Set transfer mode]

Error 311 occurred at disk power-on lifetime: 2153 hours (89 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  41 40 98 e0 bb 54 ec   at LBA = 0x0c54bbe0 = 206879712

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 98 e0 bb 54 ec 08      00:00:00.000  WRITE DMA
  c8 00 08 90 b9 94 ed 08      00:00:00.000  READ DMA
  ca 00 08 10 bc ca e6 08      00:00:00.000  WRITE DMA
  ca 00 78 98 bb ca e6 08      00:00:00.000  WRITE DMA
  ca 00 08 f0 2f a0 e9 08      00:00:00.000  WRITE DMA

Error 310 occurred at disk power-on lifetime: 2153 hours (89 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  41 40 02 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:00:00.000  SET FEATURES [Enable SATA feature]
  c8 00 08 c0 19 81 ed 08      00:00:00.000  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:00.000  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:00:00.000  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:00:00.000  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Selective Self-tests/Logging not supported
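
To make the long output above a bit easier to read, here is a minimal sketch that picks single raw values out of the attribute table and rebuilds the LBA of an error-log entry from its register bytes. The function names smart_raw and lba28 are made up for this example. The drive is not in the smartctl database, so I am assuming that the generic attribute names (187 = reported uncorrectable errors, 199 = interface CRC errors) apply to it.

```sh
#!/bin/sh
# Print the RAW_VALUE column of one SMART attribute.
# Reads an attribute table (as printed by `smartctl -A`) on stdin.
smart_raw() (
    id=$1
    # Column 1 is the attribute ID, column 3 the hex FLAG,
    # column 10 the first word of RAW_VALUE.
    awk -v id="$id" '$1 == id && $3 ~ /^0x/ { print $10 }'
)

# Rebuild the 28-bit LBA from the SN, CL, CH and DH register bytes of an
# error-log entry (hex, without 0x prefix). Only the low four bits of DH
# belong to the address.
lba28() (
    sn=$1 cl=$2 ch=$3 dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
)

# Three rows copied from the output above.
sample='  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       38
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0'

echo "reallocated sectors:  $(printf '%s\n' "$sample" | smart_raw 5)"
echo "uncorrectable errors: $(printf '%s\n' "$sample" | smart_raw 187)"
echo "CRC errors:           $(printf '%s\n' "$sample" | smart_raw 199)"
echo "LBA of error 314:     $(lba28 c0 19 81 ed)"
```

On the real device the table would come from sudo smartctl -d sat -A /dev/sdf instead of the sample text. The decoded LBA matches the 226564544 that smartctl prints for error 314. If the generic names do apply, 38 uncorrectable errors together with 0 CRC errors point at the flash itself rather than at the cable, but that is an interpretation, not something the drive states.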

Data Recovery

So the device has some serious problems, which is why I couldn't start the system from it anymore.

Which toolkit to use to get as much data out as possible? There are mainly two choices:

Package name    Program name
gddrescue       ddrescue
ddrescue        dd_rescue

An answer on Stack Overflow gives you an idea of which one suits you better.

I went for GNU ddrescue. (On openSUSE 13.1 I had to install the package gnu_ddrescue.)

First run, saving everything that's still readable without errors:

REC_PATH=/local/ssd-recovery
mkdir -p "$REC_PATH"
cd "$REC_PATH"
# -n: skip the slow pass over damaged areas, just grab what reads cleanly
sudo ddrescue -n /dev/sdf2 "$REC_PATH/sdf2.iso" "$REC_PATH/sdf2.iso.ddrescue.log"

(Before running the ddrescue command, I ran sudo fdisk -l to check the physical sector size. It's 512 bytes. For newer HDDs the physical sector size is often 4096 bytes. Add the -b4096 option to the ddrescue command in that case.)
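
Instead of picking the sector size out of the fdisk output, it can also be read from sysfs. A minimal sketch, assuming the kernel exposes /sys/block/<disk>/queue/physical_block_size (I believe all current Linux kernels do). The function name ddrescue_sector_opt and its second parameter are made up; the second parameter only exists so that the function can be tried against a fake directory tree.

```sh
#!/bin/sh
# Print the ddrescue sector-size option for a disk, e.g. "-b4096".
# $1: disk name without /dev/ (for example sdf)
# $2: sysfs root, defaults to /sys (hypothetical parameter, for testing only)
ddrescue_sector_opt() (
    disk=$1
    sysfs=${2:-/sys}
    file=$sysfs/block/$disk/queue/physical_block_size
    # Fail loudly if the disk does not exist or sysfs looks different.
    [ -r "$file" ] || { echo "cannot read $file" >&2; exit 1; }
    echo "-b$(cat "$file")"
)
```

With the function loaded in the current shell, sudo ddrescue -n $(ddrescue_sector_opt sdf) /dev/sdf2 ... would then pass the matching option. Note that the name is the whole disk (sdf), not the partition (sdf2).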

Second run, which also retries the areas that could not be read in the first run (the log file tells ddrescue where to continue):

sudo ddrescue /dev/sdf2 "$REC_PATH/sdf2.iso" "$REC_PATH/sdf2.iso.ddrescue.log"

When it finishes, the output shows the final status:

Initial status (read from logfile)
rescued:   118005 MB,  errsize:   1044 kB,  errors:      91
Current status
rescued:   118006 MB,  errsize:    713 kB,  current rate:        0 B/s
   ipos:   117346 MB,   errors:     107,    average rate:      571 B/s
   opos:   117346 MB,    time since last successful read:     6.2 m
Finished

So the second run recovered a little more. All in all, only 713 kB out of roughly 118 GB are lost (provided that the rescued data is actually intact). I will see how the system behaves when I write the data to a fresh drive. This one will go back to the shop, however!
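
The log file that ddrescue writes is plain text, so the lost areas can be added up independently of the status display. A minimal sketch, assuming the usual layout of that file: comment lines start with #, and each block line has the three fields position, size and status, where "-" marks bad sectors and "+" rescued data. The function name bad_bytes and the sample values are made up for illustration; they are not my real log.

```sh
#!/bin/sh
# Sum the sizes of all blocks that ddrescue marked as bad sectors ("-").
# $1: path to the ddrescue log file (called mapfile in newer releases)
bad_bytes() (
    total=0
    while read -r pos size status; do
        case $status in
            # size is a hex number such as 0x00000200
            -) total=$(( total + $size )) ;;
        esac
    done < "$1"
    echo "$total"
)

# Made-up sample log with two bad areas of 0x200 and 0x400 bytes.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
# Rescue Logfile. Created by GNU ddrescue
# current_pos  current_status
0x00003000     +
#      pos        size  status
0x00000000  0x00001000  +
0x00001000  0x00000200  -
0x00001200  0x00000E00  +
0x00002000  0x00000400  -
0x00002400  0x00000C00  +
EOF

echo "bad bytes: $(bad_bytes "$sample_log")"
rm -f "$sample_log"
```

For the sample this prints 1536 (512 + 1024 bytes). On the real log the call would be bad_bytes "$REC_PATH/sdf2.iso.ddrescue.log". Areas in other states (not yet tried, or failed but not yet retried) are ignored here, so the number can be lower than the errsize shown by ddrescue while a rescue is still in progress.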

Restoring the Partition

On the target system:

nc -l -p 8001 | dd bs=32M of=/dev/sdb3

On the system with the recovered partition image:

dd bs=32M if=~/Documents/ssd-recovery/sdf2.iso | nc 192.168.178.77 8001

If I had to do this over SSH, I would do it like this:

dd bs=16M if=/dev/sdX | ssh root@dst_machine "dd bs=16M of=/dev/sdY"
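
Neither netcat nor dd tells you whether the data arrived unchanged. When the image and the target are reachable from the same machine, one way to check is to compare the image with the beginning of the target partition. A minimal sketch, assuming GNU cmp (for the -n option). The function name verify_copy is made up for this example.

```sh
#!/bin/sh
# Compare an image file with the beginning of the device (or file) it was
# written to. Only the first <image size> bytes are compared, because the
# target partition may be larger than the image.
# $1: image file, $2: target device or file
verify_copy() (
    image=$1
    target=$2
    # Arithmetic expansion strips any padding that wc may print.
    size=$(( $(wc -c < "$image") )) || exit 1
    # cmp exits with 0 if the first $size bytes are identical.
    cmp -n "$size" "$image" "$target"
)
```

Run as root, this would be something like verify_copy sdf2.iso /dev/sdb3. When the two sit on different machines, comparing checksums of both sides is the usual alternative.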

Checking the file system

fsck.ext4 -cDfttyv -C 0 /dev/sdxx > fsck.log

-c: check for bad sectors with the badblocks program
-D: optimize directories if possible
-f: force check, even if the file system seems clean
-tt: print (even more) timing stats
-y: answer "yes" to all questions, so repairs run without prompting
-C 0: print progress to the console
-v: verbose mode
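
Since the output goes to fsck.log, the quickest summary of the run is the exit status of fsck.ext4, which is a bit mask. A minimal sketch that translates it into text, using the values from the e2fsck man page. The function name explain_fsck_status is made up for this example.

```sh
#!/bin/sh
# Translate an e2fsck / fsck.ext4 exit status (a bit mask) into text.
# $1: exit status, e.g. the value of $? right after the fsck run
explain_fsck_status() (
    status=$1
    [ "$status" -eq 0 ] && { echo "no errors"; exit 0; }
    [ $(( status & 1 ))   -ne 0 ] && echo "file system errors corrected"
    [ $(( status & 2 ))   -ne 0 ] && echo "errors corrected, system should be rebooted"
    [ $(( status & 4 ))   -ne 0 ] && echo "file system errors left uncorrected"
    [ $(( status & 8 ))   -ne 0 ] && echo "operational error"
    [ $(( status & 16 ))  -ne 0 ] && echo "usage or syntax error"
    [ $(( status & 32 ))  -ne 0 ] && echo "canceled by user request"
    [ $(( status & 128 )) -ne 0 ] && echo "shared library error"
    exit 0
)
```

Usage: run the fsck command from above and call explain_fsck_status $? directly afterwards. A status of 1 is the expected result after a repair. Anything with bit 4 set means the restored file system still has problems.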

Recovery of GRUB

https://wiki.ubuntuusers.de/GRUB_2/Grundlagen#MBR-mit-GUID-Partitionstabelle-GPT
