I will attempt to explain this problem with minimal ambiguity so that any help provided can be directed at technical solutions to this specific problem.
This topic has been posted about several times in the QNAP forums as well as other online forums. Typically, this problem is initially observed in Storage Manager and appears as the following:
Current Speed: 3 Gbps
Maximum Speed: 6 Gbps
QNAP Forum Posts related to this topic:
- Disk 1 only running at 3.0 Gbps others 6.0 Gbps
- Reported disk speed
- Current speed 3Gbps no 6 Gbps like other drives
- HS-251 WD RED 3TB 3Gbps current speed
- TS-870 PRO - SATA II vs. SATA III weirdness
- Disk speed only 3gbps
- Some SATA 3 drives operating at SATA 2 speed?s
After this initial problem observation here are some extra steps to make further observations:
Step #1 of 3:
Using the QNAP Diagnostic Tool --> Kernel Log Analyzer, the following is observed:
- ata1: hard resetting link --Count:N (where N is the number of errors that have occurred thus far).
Step #2 of 3:
Using Storage Manager --> Disks/VJBOD --> SMART Information --> SATA_R-Error_Count, the following is observed:
- SATA_R-Error_Count: >= 1
Step #3 of 3:
Inspect the Kernel Logs using a Kernel Log Dump (QNAP Diagnostic Tool --> Dump Log) or using SSH and the 'dmesg' command.
Specific errors observed in the Kernel Logs vary slightly, but some typical observations would be as follows:
Kernel Log Sample #1 (SATA Link Speed Error)*:
[ 9.968361] ata1: SATA max UDMA/133 abar m2048@0x81815000 port 0x81815100 irq 319
[ 10.288258] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 330)
[ 10.303240] ata1.00: ATA-9: WDC WD80EFZX-68UW8N0, 83.H0A83, max UDMA/133
[ 10.309978] ata1.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[ 10.350289] ata1.00: configured for UDMA/133
[ 10.370799] ata1.00: set queue depth = 31
[ 36.818439] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
[ 36.826007] ata1.00: irq_stat 0x08000000, interface fatal error
[ 36.831942] ata1: SError: { UnrecovData Handshk }
[ 36.836661] ata1.00: failed command: WRITE DMA
[ 36.841129] ata1.00: cmd ca/00:08:28:00:00/00:00:00:00:00/e0 tag 19 dma 4096 out
[ 36.841129] res 50/00:00:7f:03:08/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[ 36.856441] ata1.00: status: { DRDY }
[ 36.860129] ata1: hard resetting link
[ 36.863815] ata1: Speed down due to signal issue.
[ 36.868533] ata1: Speed down due to signal issue.
[ 37.177405] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 37.192372] ata1.00: configured for UDMA/133
[ 37.196690] ata1: EH complete
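To reproduce the per-port count that the Kernel Log Analyzer reports, a saved kernel log can be scanned for the "hard resetting link" message. The following is only a minimal sketch: the function name count_hard_resets and the sample text are my own, and it assumes the log has been saved as plain text (for example from 'dmesg').

```python
import re
from collections import Counter

# Matches port-level lines such as "ata1: hard resetting link".
# Device-level lines ("ata1.00: ...") are deliberately not matched.
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")


def count_hard_resets(log_text):
    """Return a Counter mapping port name (e.g. 'ata1') to its reset count."""
    return Counter(m.group(1) for m in RESET_RE.finditer(log_text))


# A few lines taken from Kernel Log Sample #1.
sample_log = """\
[ 36.856441] ata1.00: status: { DRDY }
[ 36.860129] ata1: hard resetting link
[ 36.863815] ata1: Speed down due to signal issue.
[ 37.177405] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
"""

resets = count_hard_resets(sample_log)
print(dict(resets))
```

Running this against a full dump after each reboot gives a quick way to see whether resets are confined to one port.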
The error below does not occur as often. When it does occur, it appears later in the boot log.
Kernel Log Sample #2 (SATA Link Speed Error)*:
<3>[ 119.131479] ata1.00: exception Emask 0x10 SAct 0xfc0 SErr 0x400100 action 0x6 frozen
<3>[ 119.139394] ata1.00: irq_stat 0x08000000, interface fatal error
<3>[ 119.145325] ata1: SError: { UnrecovData Handshk }
<3>[ 119.150462] ata1.00: failed command: WRITE FPDMA QUEUED
<3>[ 119.155704] ata1.00: cmd 61/00:30:30:42:73/04:00:a2:03:00/40 tag 6 ncq 524288 out
<3>[ 119.155704] res 40/00:00:30:56:73/00:00:a2:03:00/40 Emask 0x10 (ATA bus error)
<3>[ 119.171114] ata1.00: status: { DRDY }
<3>[ 119.175267] ata1.00: failed command: WRITE FPDMA QUEUED
<3>[ 119.180519] ata1.00: cmd 61/00:38:30:46:73/04:00:a2:03:00/40 tag 7 ncq 524288 out
<3>[ 119.180519] res 40/00:00:30:56:73/00:00:a2:03:00/40 Emask 0x10 (ATA bus error)
<3>[ 119.196252] ata1.00: status: { DRDY }
<3>[ 119.199953] ata1.00: failed command: WRITE FPDMA QUEUED
<3>[ 119.205536] ata1.00: cmd 61/00:40:30:4a:73/04:00:a2:03:00/40 tag 8 ncq 524288 out
<3>[ 119.205536] res 40/00:00:30:56:73/00:00:a2:03:00/40 Emask 0x10 (ATA bus error)
<3>[ 119.220971] ata1.00: status: { DRDY }
<3>[ 119.224675] ata1.00: failed command: WRITE FPDMA QUEUED
<3>[ 119.229911] ata1.00: cmd 61/00:48:30:4e:73/04:00:a2:03:00/40 tag 9 ncq 524288 out
<3>[ 119.229911] res 40/00:00:30:56:73/00:00:a2:03:00/40 Emask 0x10 (ATA bus error)
<3>[ 119.245361] ata1.00: status: { DRDY }
<3>[ 119.250064] ata1.00: failed command: WRITE FPDMA QUEUED
<3>[ 119.255315] ata1.00: cmd 61/00:50:30:52:73/04:00:a2:03:00/40 tag 10 ncq 524288 out
<3>[ 119.255315] res 40/00:00:30:56:73/00:00:a2:03:00/40 Emask 0x10 (ATA bus error)
<3>[ 119.270811] ata1.00: status: { DRDY }
<3>[ 119.274511] ata1.00: failed command: WRITE FPDMA QUEUED
<3>[ 119.279761] ata1.00: cmd 61/00:58:30:56:73/04:00:a2:03:00/40 tag 11 ncq 524288 out
<3>[ 119.279761] res 40/00:00:30:56:73/00:00:a2:03:00/40 Emask 0x10 (ATA bus error)
<3>[ 119.295261] ata1.00: status: { DRDY }
<6>[ 119.299017] ata1: hard resetting link
<4>[ 119.302722] ata1: Speed down due to signal issue.
<4>[ 119.307437] ata1: Speed down due to signal issue.
<6>[ 119.616458] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
<6>[ 119.637328] ata1.00: configured for UDMA/133
<6>[ 119.641666] ata1: EH complete
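Both samples end with the link coming back up at 3.0 Gbps. A small script can pick out such renegotiations from a saved log by tracking the last "SATA link up" speed seen per port. This is a sketch under my own assumptions: the name find_downshifts is hypothetical, and it relies only on the "SATA link up N Gbps" wording shown in the samples above.

```python
import re

# Matches e.g. "ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 330)".
LINK_UP_RE = re.compile(r"\b(ata\d+): SATA link up (\d+(?:\.\d+)?) Gbps")


def find_downshifts(log_text):
    """Return (port, old_gbps, new_gbps) for every link-up at a lower speed."""
    last_speed = {}
    events = []
    for m in LINK_UP_RE.finditer(log_text):
        port, speed = m.group(1), float(m.group(2))
        if port in last_speed and speed < last_speed[port]:
            events.append((port, last_speed[port], speed))
        last_speed[port] = speed
    return events


# Two link-up lines taken from Kernel Log Sample #1.
sample_log = """\
[ 10.288258] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 330)
[ 36.860129] ata1: hard resetting link
[ 37.177405] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
"""

print(find_downshifts(sample_log))
```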
[NCQ Commands and Non-NCQ Commands]
In the kernel logs, this problem has been repeatedly observed for both NCQ and non-NCQ commands. When 'UnrecovData Handshk' errors occur for both NCQ and non-NCQ commands, this strongly indicates a physical connection problem, likely a backplane problem [1][2].
[1] Source: https://patchwork.ozlabs.org/patch/136587/
[2] Source: https://wiki.lime-technology.com/The_An ... ive_Issues
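To check whether both command types are affected on a given port, the "failed command" lines can be tallied. The sketch below uses a simple heuristic of my own: any command whose name contains "FPDMA QUEUED" is counted as NCQ, everything else as non-NCQ. The function name classify_failed_commands is hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "ata1.00: failed command: WRITE FPDMA QUEUED".
FAILED_RE = re.compile(r"\b(ata\d+)\.\d+: failed command: (.+)")


def classify_failed_commands(log_text):
    """Count failed NCQ and non-NCQ commands per port."""
    tally = defaultdict(lambda: {"ncq": 0, "non_ncq": 0})
    for m in FAILED_RE.finditer(log_text):
        port, command = m.group(1), m.group(2).strip()
        # Heuristic (assumption): NCQ commands carry "FPDMA QUEUED" in the name.
        kind = "ncq" if "FPDMA QUEUED" in command else "non_ncq"
        tally[port][kind] += 1
    return dict(tally)


# Lines taken from Kernel Log Samples #1 and #2.
sample_log = """\
[ 36.836661] ata1.00: failed command: WRITE DMA
<3>[ 119.150462] ata1.00: failed command: WRITE FPDMA QUEUED
<3>[ 119.175267] ata1.00: failed command: WRITE FPDMA QUEUED
"""

print(classify_failed_commands(sample_log))
```

A port that shows non-zero counts in both columns matches the pattern described above.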
In both cases, this example error starts off the error sequence observed in the logs:
ata2.00: exception Emask 0x10 SAct 0x7ff4f SErr 0x400100 action 0x6 frozen
ata2.00: irq_stat 0x08000000, interface fatal error
ata2: SError: { UnrecovData Handshk }
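The SErr value and the flag names on the SError line carry the same information. The sketch below decodes the hex value into names. The bit table is an assumption on my part, written from memory of the SERR_* definitions in the Linux libata headers, so it should be checked against the kernel source for the kernel version in use. It does agree with the logs above, where 0x400100 is printed as { UnrecovData Handshk }.

```python
# (bit position, name as printed by the kernel) -- assumed from libata's
# SERR_* constants; verify against the kernel source before relying on it.
SERR_FLAGS = [
    (0, "RecovData"),
    (1, "RecovComm"),
    (8, "UnrecovData"),
    (9, "Persist"),
    (10, "Proto"),
    (11, "HostInt"),
    (16, "PHYRdyChg"),
    (17, "PHYInt"),
    (18, "CommWake"),
    (19, "10B8B"),
    (20, "Dispar"),
    (21, "BadCRC"),
    (22, "Handshk"),
    (23, "LinkSeq"),
    (24, "TrStaTrns"),
    (25, "UnrecFIS"),
    (26, "DevExch"),
]


def decode_serr(value):
    """Return the names of all flags set in an SErr register value."""
    return [name for bit, name in SERR_FLAGS if value & (1 << bit)]


print(decode_serr(0x400100))
```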
Initial Problem Analysis:
"This is transmission error. Most common causes are power related or unreliable connection especially if backplanes are involved. Is the problem still reproducible? If so, can you please try to move it to different power connector and SATA port and see what changes?"
This problem seems to involve a hardware error and a renegotiation of SATA link speed to a lower speed (from SATA III (6.0 Gbps) down to SATA II (3.0 Gbps)), so that data can be more reliably transferred across the physical SATA link.
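The renegotiation is also visible in the SStatus and SControl values printed on the "SATA link up" lines (133/330 before the reset, 123/320 after). As I understand the SATA register layout, both are printed in hex, with DET in bits 0-3, SPD in bits 4-7 and IPM in bits 8-11. In SStatus, SPD is the negotiated generation. In SControl, SPD is the highest generation the host allows. The sketch below decodes them under that assumption. The helper names are my own.

```python
# Generation number -> line rate in Gbps.
SPD_GBPS = {1: 1.5, 2: 3.0, 3: 6.0}


def decode_fields(value):
    """Split an SStatus/SControl value into its DET, SPD and IPM nibbles."""
    return {
        "det": value & 0xF,
        "spd": (value >> 4) & 0xF,
        "ipm": (value >> 8) & 0xF,
    }


def speed_gbps(value):
    """Return the line rate encoded in the SPD field, or None if it is 0/unknown."""
    return SPD_GBPS.get(decode_fields(value)["spd"])


# Values from the kernel log samples (printed by the kernel without the 0x prefix).
print("before reset:", speed_gbps(0x133), "negotiated,", speed_gbps(0x330), "allowed")
print("after reset: ", speed_gbps(0x123), "negotiated,", speed_gbps(0x320), "allowed")
```

If this reading is right, the change from SControl 330 to 320 means the kernel itself capped the port at 3.0 Gbps after the error, which fits the "Speed down due to signal issue" messages.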
I am aware that it has been stated that running a mechanical HDD with a link speed renegotiated down from SATA III (6.0 Gbps) to SATA II (3.0 Gbps) is acceptable, because current (2017) mechanical (non-SSD) HDDs cannot come close to saturating even a SATA II (3.0 Gbps) link. While technically true, I do not find this an acceptable reason for ignoring what is potentially a hardware signaling problem, which MAY lead to a more serious problem later on. Simply put, the observed behavior is incorrect, and ignoring it because of the current technical limitations of mechanical HDDs will not aid in finding a serious solution to this problem.
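For reference, the bandwidth argument can be put into numbers. SATA uses 8b/10b encoding, so only 8 of every 10 bits on the wire are payload. The sketch below works out the usable rate for each link speed. The 200 MB/s drive figure is purely an illustrative assumption of mine, not a measured or published value for these drives.

```python
def sata_payload_mb_per_s(line_rate_gbps):
    """Usable payload rate in MB/s for a SATA link with 8b/10b encoding."""
    data_bits_per_s = line_rate_gbps * 1e9 * 8 / 10  # 8 data bits per 10 line bits
    return data_bits_per_s / 8 / 1e6                 # bits -> bytes -> MB


# Hypothetical sustained rate for a mechanical HDD (assumption for illustration).
ASSUMED_HDD_MB_PER_S = 200

for gbps in (3.0, 6.0):
    usable = sata_payload_mb_per_s(gbps)
    print(f"{gbps} Gbps link: {usable:.0f} MB/s usable, "
          f"headroom over assumed HDD: {usable - ASSUMED_HDD_MB_PER_S:.0f} MB/s")
```

So throughput is indeed not the concern here. The concern is the signaling error that triggers the downshift in the first place.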
Observed Problem Frequency:
This problem happens nearly every time the device is powered on or rebooted. I have observed a clean startup, with the correct link speed negotiated and maintained throughout the entire boot process, but this is rare.
Device Info:
RAM: 8GB
Model: QNAP TS-453A
Firmware: 4.3.3_20170606
Kernel: Linux version 4.2.8
Troubleshooting Background:
Initially, my configuration was a RAID 10, which is when I first noticed these errors. Later, I reinitialized the NAS without using RAID, instead using a separate individual Volume for each physical HDD. In either configuration, the ATA "hard resetting link" error occurred with the same frequency.
Original HDD Configuration:
*All HDD have been scanned (without errors reported) and no S.M.A.R.T. errors (other than those stated below) are occurring. All HDD are brand new, and have been in use for less than 30 days.
- SATA Port 1 - [HDD A1]: WDC WD80EFZX-68UW8N0 83.H0A83 8TB (Firmware: 83.H0A83)
- SATA Port 2 - [HDD B2]: WDC WD80EFZX-68UW8N0 83.H0A83 8TB (Firmware: 83.H0A83)
- SATA Port 3 - [HDD C3]: WDC WD80EFZX-68UW8N0 83.H0A83 8TB (Firmware: 83.H0A83)
- SATA Port 4 - [HDD D4]: WDC WD80EFZX-68UW8N0 83.H0A83 8TB (Firmware: 83.H0A83)
I have tried swapping in one of the other available drives [HDD D4] into SATA Port 1 (with the SATA Ports 2-4 populated with all other drives each time). The ATA "hard resetting link" error occurs on SATA Port 1 with the other [HDD D4] connected to it. Short of plugging in every HDD that I have into SATA Port 1 and then testing them individually, I think it safe to say that the cause of this problem is not a particular HDD. I describe my troubleshooting methodology in detail below:
Methodology:
[Rearrange Drives - #1 of 3]
Originally, [HDD A1] was plugged into SATA Port 1 when I first noticed this error. It managed to accumulate the following errors while plugged into SATA Port 1:
[HDD A1]:
- ID: 199
- SATA_R-Error_Count (UltraDMA CRC Error Count): 32
[HDD D4]:
- ID: 199
- SATA_R-Error_Count (UltraDMA CRC Error Count): 10
Afterwards, I swapped [HDD A1] back into SATA Port 1, and put [HDD D4] back into SATA Port 4. Once this original/starting configuration was restored, errors began to once again accumulate on [HDD A1] plugged into SATA Port 1.
[Hot Swap Drive - #2 of 3]
I also tried physically Hot Swapping the Drive (plugging/unplugging) while the NAS is running and it did renegotiate the correct link speed. This can be observed in the Storage Manager as follows:
- Current Speed: 6 Gbps
Maximum Speed: 6 Gbps
- Current Speed: 3 Gbps
Maximum Speed: 6 Gbps
Reinitialized the NAS using only a single HDD plugged into SATA Port 1. SATA Port 2-4 did not have any HDD plugged into them. I still got a link speed error on SATA Port 1. This can be observed in the Storage Manager as follows:
- Current Speed: 3 Gbps
Maximum Speed: 6 Gbps
- - ID: 199
Following these results, one could conclude that the source of the problem is not the HDD, but either SATA Port 1 on the backplane or the SATA Controller.
Storage Manager Link Speed Display (Bug):
NOTE: The display in Storage Manager is sometimes inaccurate (bug?), in that it will display (Current Speed: 6 Gbps / Maximum Speed: 6 Gbps) even when the link is running at a lower speed. Running the following command over SSH displays the actual current link speed:
[~] # cat /sys/class/ata_link/link1/sata_spd
3.0 Gbps
Alternatively, after each reboot, using SSH, try using a combination of commands like these to determine if a link speed error occurred and the current link speed:
- Get the current SATA Link Speed(s):
[~] # cat /sys/class/ata_link/link1/sata_spd
3.0 Gbps
[~] # cat /sys/class/ata_link/link2/sata_spd
6.0 Gbps
[~] # cat /sys/class/ata_link/link3/sata_spd
6.0 Gbps
[~] # cat /sys/class/ata_link/link7/sata_spd
6.0 Gbps
- Check kernel logs for ATA errors, etc.
[~] # dmesg | grep -i ata
[~] # cat /sys/class/ata_link/link1/sata_spd
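The per-link checks above can be collected into one pass over the sysfs directory. The following is a rough sketch, assuming Python 3 is available on the machine where it runs (it may not be on a stock NAS) and that each link exposes a sata_spd file as shown above. The function names, and the handling of placeholder values for empty ports, are my own assumptions.

```python
from pathlib import Path


def read_link_speeds(base="/sys/class/ata_link"):
    """Return {link name: contents of its sata_spd file}."""
    speeds = {}
    for spd_file in sorted(Path(base).glob("link*/sata_spd")):
        speeds[spd_file.parent.name] = spd_file.read_text().strip()
    return speeds


def _gbps(text):
    """Parse '3.0 Gbps' into 3.0; return None for anything unparseable."""
    try:
        return float(text.split()[0])
    except (ValueError, IndexError):
        return None


def degraded_links(speeds, expected_gbps=6.0):
    """Return the links that negotiated a speed below expected_gbps."""
    slow = []
    for name, text in speeds.items():
        value = _gbps(text)
        # Ports without a parseable speed (e.g. empty bays) are skipped.
        if value is not None and value < expected_gbps:
            slow.append(name)
    return sorted(slow)


if __name__ == "__main__":
    current = read_link_speeds()
    print(current)
    print("below expected speed:", degraded_links(current))
```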
S.M.A.R.T. Error Background:
[SATA_R-Error_Count]
"UltraDMA CRC Error Count S.M.A.R.T. parameter indicates the total quantity of CRC errors during UltraDMA mode. The raw value of this attribute indicates the number of errors found during data transfer in UltraDMA mode by ICRC (Interface CRC)."
Source: https://kb.acronis.com/content/9135
My observed S.M.A.R.T. Values:
Model : TS-453A
Firmware : 4.3.3 (20170606)
NAS : QXXXXXXXXX
==========[ BAY 1, WDCWD80EFZX-68UW8N07630885, XXXXXXXXX ]
ID Description RawValue Value WorstValue Threshold Status
001 Raw_Read_Error_Rate 0x0 100 100 016 Good
002 Throughput_Performance 0x75 131 131 054 Good
003 Spin_Up_Time 0x801c001c0 147 147 024 Good
004 Start_Stop_Count 0x42 100 100 000 Good
005 Retired_Block_Count 0x0 100 100 005 Good
007 Seek_Error_Rate 0x0 100 100 067 Good
008 Seek_Time_Performance 0x12 128 128 020 Good
009 Power-On_Hours 0xa0 100 100 000 Good
010 Spin_Retry_Count 0x0 100 100 060 Good
012 Power_Cycle_Count 0x14 100 100 000 Good
022 Unknown_Attribute 0x64 100 100 025 Good
192 Power-Off_Retract_Count 0xf2 100 100 000 Good
193 Load_Cycle_Count 0xf2 100 100 000 Good
194 Temperature_Celsius 0x2c00190027 153 153 000 Good
196 Reallocated_Event_Count 0x0 100 100 000 Good
197 Current_Pending_Sector 0x0 100 100 000 Good
198 Uncorrectable_Sector_Count 0x0 100 100 000 Good
199 SATA_R-Error_Count 0x7 200 200 000 Good
==========[ BAY 2, WDCWD80EFZX-68UW8N07630885, XXXXXXXXX ]
ID Description RawValue Value WorstValue Threshold Status
001 Raw_Read_Error_Rate 0x0 100 100 016 Good
002 Throughput_Performance 0x70 132 132 054 Good
003 Spin_Up_Time 0x901c101c0 147 147 024 Good
004 Start_Stop_Count 0x3f 100 100 000 Good
005 Retired_Block_Count 0x1 100 100 005 Good
007 Seek_Error_Rate 0x0 100 100 067 Good
008 Seek_Time_Performance 0x12 128 128 020 Good
009 Power-On_Hours 0xa0 100 100 000 Good
010 Spin_Retry_Count 0x0 100 100 060 Good
012 Power_Cycle_Count 0x13 100 100 000 Good
022 Unknown_Attribute 0x64 100 100 025 Good
192 Power-Off_Retract_Count 0xe6 100 100 000 Good
193 Load_Cycle_Count 0xe6 100 100 000 Good
194 Temperature_Celsius 0x2d00190028 150 150 000 Good
196 Reallocated_Event_Count 0x1 100 100 000 Good
197 Current_Pending_Sector 0x0 100 100 000 Good
198 Uncorrectable_Sector_Count 0x0 100 100 000 Good
199 SATA_R-Error_Count 0x0 200 200 000 Good
==========[ BAY 3, WDCWD80EFZX-68UW8N07630885, XXXXXXXXX ]
ID Description RawValue Value WorstValue Threshold Status
001 Raw_Read_Error_Rate 0x0 100 100 016 Good
002 Throughput_Performance 0x74 131 131 054 Good
003 Spin_Up_Time 0x901c701c8 144 144 024 Good
004 Start_Stop_Count 0x2a 100 100 000 Good
005 Retired_Block_Count 0x0 100 100 005 Good
007 Seek_Error_Rate 0x0 100 100 067 Good
008 Seek_Time_Performance 0x12 128 128 020 Good
009 Power-On_Hours 0xa0 100 100 000 Good
010 Spin_Retry_Count 0x0 100 100 060 Good
012 Power_Cycle_Count 0x15 100 100 000 Good
022 Unknown_Attribute 0x64 100 100 025 Good
192 Power-Off_Retract_Count 0xd3 100 100 000 Good
193 Load_Cycle_Count 0xd3 100 100 000 Good
194 Temperature_Celsius 0x2d00190028 150 150 000 Good
196 Reallocated_Event_Count 0x0 100 100 000 Good
197 Current_Pending_Sector 0x0 100 100 000 Good
198 Uncorrectable_Sector_Count 0x0 100 100 000 Good
199 SATA_R-Error_Count 0x0 200 200 000 Good
==========[ BAY 4, WDCWD80EFZX-68UW8N07630885, XXXXXXXXX ]
ID Description RawValue Value WorstValue Threshold Status
001 Raw_Read_Error_Rate 0x0 100 100 016 Good
002 Throughput_Performance 0x6c 133 133 054 Good
003 Spin_Up_Time 0x901be01cd 145 145 024 Good
004 Start_Stop_Count 0x27 100 100 000 Good
005 Retired_Block_Count 0x0 100 100 005 Good
007 Seek_Error_Rate 0x0 100 100 067 Good
008 Seek_Time_Performance 0x12 128 128 020 Good
009 Power-On_Hours 0xa0 100 100 000 Good
010 Spin_Retry_Count 0x0 100 100 060 Good
012 Power_Cycle_Count 0x14 100 100 000 Good
022 Unknown_Attribute 0x64 100 100 025 Good
192 Power-Off_Retract_Count 0xc1 100 100 000 Good
193 Load_Cycle_Count 0xc1 100 100 000 Good
194 Temperature_Celsius 0x2b00190027 153 153 000 Good
196 Reallocated_Event_Count 0x0 100 100 000 Good
197 Current_Pending_Sector 0x0 100 100 000 Good
198 Uncorrectable_Sector_Count 0x0 100 100 000 Good
199 SATA_R-Error_Count 0x17 200 200 000 Good
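The RawValue column is in hex, so it helps to convert attribute 199 to decimal per bay when comparing dumps taken at different times. The sketch below does this for a dump in the layout shown above. It is only an illustration: the function name crc_errors_by_bay is hypothetical and the regular expressions assume exactly this dump format.

```python
import re

# Matches the bay header, e.g. "==========[ BAY 1, WDC..., XXXXXXXXX ]".
BAY_RE = re.compile(r"^=+\[ BAY (\d+),")
# Matches an attribute row: ID, description, hex raw value.
ATTR_RE = re.compile(r"^(\d{3})\s+(\S+)\s+(0x[0-9a-fA-F]+)\s")


def crc_errors_by_bay(dump_text, attr_id=199):
    """Return {bay number: raw value of attr_id as a decimal integer}."""
    counts = {}
    bay = None
    for line in dump_text.splitlines():
        line = line.strip()
        header = BAY_RE.match(line)
        if header:
            bay = int(header.group(1))
            continue
        row = ATTR_RE.match(line)
        if row and bay is not None and int(row.group(1)) == attr_id:
            counts[bay] = int(row.group(3), 16)
    return counts


# Abbreviated excerpt of the dump above (bays 1 and 4 only).
sample_dump = """\
==========[ BAY 1, WDCWD80EFZX-68UW8N07630885, XXXXXXXXX ]
ID Description RawValue Value WorstValue Threshold Status
198 Uncorrectable_Sector_Count 0x0 100 100 000 Good
199 SATA_R-Error_Count 0x7 200 200 000 Good
==========[ BAY 4, WDCWD80EFZX-68UW8N07630885, XXXXXXXXX ]
ID Description RawValue Value WorstValue Threshold Status
199 SATA_R-Error_Count 0x17 200 200 000 Good
"""

print(crc_errors_by_bay(sample_dump))
```

Note that the counter stays with the drive, not the bay, so a drive that has been moved keeps the errors it accumulated in its previous slot.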
After some research online via Google and DuckDuckGo, I found several theories as to why this error occurs.
[Physical Hardware Problems]
- A physical connection problem between the HDD SATA Port and the HOST SATA Port involving the SATA Cable. (Not Applicable) (QNAP NAS Device uses a backplane, not SATA Cables)
- A physical connection problem between the HDD SATA Port on the back of the HDD and the Backplane SATA Port in the NAS.
- HDD hardware is faulty. (often suggested to run S.M.A.R.T. tests)
[Software Problems]
- A kernel bug in the underlying Linux OS used by QTS.
- A driver/firmware bug in the SATA Controller, which in this case, is the Marvell 88SE9215-NAA2.
[Other Possible Physical Hardware Problems]
- Faulty or overloaded Power Supply.
Source: https://ubuntuforums.org/showthread.php?t=2272486
Source: http://eliasoenal.com/2012/10/31/power- ... g-to-find/
I hope that my gathering of sources and troubleshooting will be of assistance to others facing similar problems.
If anyone has an alternate theory as to the cause of this problem or has any advice, please respond. Any help would be greatly appreciated.