QNAP firmware issues : Major file system bug

SuperMario
Getting the hang of things
Posts: 75
Joined: Tue May 24, 2011 5:01 pm

Re: QNAP firmware issues : Major file system bug

Post by SuperMario »

Onlyalex:

All the hard drives in use are Seagates. We are confident that they aren't at fault here as we have large numbers of these drives in use in other QNAP units and in our workstations.

For the record, though, they are:

Seagate ST31500341AS CC1H
Seagate ST2000DL003-9VT1 CC32

More tellingly, however, the very same drives have all previously worked without any problems on the very same QNAP hardware with the older FW versions.

In an attempt to help diagnose this bug, I removed two of the QNAP NAS units and performed various downgrades/upgrades, complete wipes and rebuilds, and RAID expansions, and tried different filesystems (EXT3/EXT4, write cache on and off), different data sets, etc. We have narrowed the problem down to what is described in this thread.

We are way past the point of passing the blame onto someone else. It's been pretty much proven beyond any doubt to be a QNAP firmware/filesystem related issue.

Mario.
TonyPh12345
Been there, done that
Posts: 738
Joined: Tue Jul 13, 2010 11:53 pm

Re: QNAP firmware issues : Major file system bug

Post by TonyPh12345 »

By the way, the command is "dmesg," not "demsg."
tmt
Experience counts
Posts: 1006
Joined: Mon Nov 16, 2009 11:02 am

Re: QNAP firmware issues : Major file system bug

Post by tmt »

AdrianW wrote: I previously posted those details for my NAS here, but shortly after that the thread died.

I have 3.4.3 firmware on mine, but have had the same issue for at least the previous two releases.

For me the pause problem doesn't happen every time I copy data to the NAS, but still quite regularly. Usually it will happen when nothing has been copied to the NAS for a while (at least a few minutes) and the file being copied is quite large (say 100MB or larger). After the pause has ended and I've copied a file, I can usually initiate a few more copies without the pause. But if I then wait a couple of minutes (sometimes as little as a few seconds) and start another copy, the pause will occur again.
In the post you reference, you indicate that you're copying the files by drag-and-drop with Windows Explorer. Correct? Do you see the pauses if you copy with other techniques (e.g. xcopy, robocopy, etc), and without any explorer windows open?

The reason I ask is because Windows Explorer typically requests a change-notify on directories that it's viewing, so it can refresh promptly when the files change. There's a message sent from the server to Windows when the new file is created, and it might be causing some of this delay - especially if a large file is involved, requiring many buffer cache blocks to be flushed to disk at the server. Also, QNAP reportedly fixed a bug around rename related to change-notify, so I wonder if there's more to it here.

One other question - have you tried disabling oplocks on the share in the QNAP admin GUI? Oplocks have a similar callback behavior, and can cause pauses like this if they are not responded to.
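
To take Explorer out of the picture entirely when testing, a rough Python sketch like the one below could time a write to the share chunk by chunk and report any chunk that takes suspiciously long. It is only an illustration: the function name `timed_write`, the 1 MB chunk size and the 2-second threshold are my own assumptions, not anything provided by QNAP or Windows.

```python
import os
import time


def timed_write(dest_path, total_bytes, chunk_size=1024 * 1024, stall_threshold=2.0):
    """Write total_bytes of zeros to dest_path, timing every chunk.

    Returns (elapsed_seconds, stalls), where stalls is a list of
    (offset_in_bytes, seconds) for chunks slower than stall_threshold.
    """
    chunk = b"\0" * chunk_size
    stalls = []
    written = 0
    start = time.monotonic()
    with open(dest_path, "wb") as f:
        while written < total_bytes:
            piece = chunk[: min(chunk_size, total_bytes - written)]
            t0 = time.monotonic()
            f.write(piece)
            f.flush()
            # Ask the OS to push the data out rather than keep it in a local buffer.
            os.fsync(f.fileno())
            dt = time.monotonic() - t0
            if dt > stall_threshold:
                stalls.append((written, dt))
            written += len(piece)
    return time.monotonic() - start, stalls
```

For example, `timed_write(r"Z:\stall_test.bin", 200 * 1024 * 1024)` would write a 200 MB test file to a share mapped as Z: (a hypothetical drive letter). Running it after the NAS has been idle for a few minutes may help show whether the pause still happens with no Explorer window involved.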
SS-439, Ubuntu Server 12.04.3 LTS, EXT4, RAID10, 4xHitachi 5K1000
TS-112, 4.1.x Beta, EXT4, 1xHitachi 7K1000
AdrianW
Know my way around
Posts: 249
Joined: Thu Jul 10, 2008 6:17 pm

Re: QNAP firmware issues : Major file system bug

Post by AdrianW »

Thanks for helping tmt.

Yes, I do normally use drag and drop (it's the easiest method). I just now tried using xcopy, and managed to copy a number of files without any issues. I'll have to try a few more times over the next day or so, as the problem doesn't occur every time. But, you might be on to something there.

With oplocks, it looks like I have to disable them at each individual share level, correct? Are there any downsides to switching them off? For most of the shares I'm the only user with write permissions anyway. I'll have to give it a go to see what happens (maybe over the weekend).

I guess I could switch to using FTP or maybe an alternative file manager (like FreeCommander) - but just using Windows is preferable.

I just remembered something, though. I still use Microsoft Money 2000, which I have set up to back up its database file to my QNAP when it exits. The file is around 30MB, and the last time I used it (without any Explorer windows open) and exited, it started backing up and caused the NAS to hang (it's pretty obvious when I get yelled at for stalling my wife's TV show). So that's a program writing to the NAS directly, not Explorer.

Also, I will sometimes copy a large batch of files to the NAS into one folder, and then take my time dragging and dropping files into their correct folders - moving files around on the NAS like this using Windows Explorer never creates a pause for me - it's only when writing a new file to the NAS.
TS-853 Pro; TS-859 Pro; TS-409
tmt
Experience counts
Posts: 1006
Joined: Mon Nov 16, 2009 11:02 am

Re: QNAP firmware issues : Major file system bug

Post by tmt »

Oh, if the entire NAS hangs, that's different. I thought just the one SMB session was hanging. In any case, these aren't really intended as fixes or even workarounds, just ways to gather more info. There's clearly an issue.

Do you think there might be a network problem? For example, copying large files creates traffic which could possibly cause a lost connection in the face of other errors. The recovery from that takes many seconds, and if the Ethernet is glitching, that would explain the streaming interruption too.

There's a slight downside to disabling oplocks because it affects caching at the client. But it's slight, and in some cases it can actually help. Try it mainly as an experiment.
SS-439, Ubuntu Server 12.04.3 LTS, EXT4, RAID10, 4xHitachi 5K1000
TS-112, 4.1.x Beta, EXT4, 1xHitachi 7K1000
AdrianW
Know my way around
Posts: 249
Joined: Thu Jul 10, 2008 6:17 pm

Re: QNAP firmware issues : Major file system bug

Post by AdrianW »

tmt wrote: Do you think there might be a network problem? For example, copying large files creates traffic which could possibly cause a lost connection in the face of other errors. The recovery from that takes many seconds, and if the Ethernet is glitching, that would explain the streaming interruption too.
It's definitely not the network (which is gigabit), as I can often copy multi-gigabyte files to the NAS while other users are streaming video from it without a problem. But when it does hang, no one can access anything on it for that period (which is now up to almost 30 seconds). The really weird thing is that when it hangs, the LEDs for just drives 2, 4 and 7 all blink in unison.
TS-853 Pro; TS-859 Pro; TS-409
schumaku
Guru
Posts: 43579
Joined: Mon Jan 21, 2008 4:41 pm
Location: Kloten (Zurich), Switzerland -- Skype: schumaku

Re: QNAP firmware issues : Major file system bug

Post by schumaku »

AdrianW wrote: But when it does hang, no one can access anything on it for that period (which is now up to almost 30 seconds). The really weird thing is that when it hangs, the LEDs for just drives 2, 4 and 7 all blink in unison.
Is error correction ongoing on the disk drives? The NAS has to stop its activities, because SATA drives are not responsive during these actions. I'm fairly sure we talked about that before - any insight on # dmesg or SMART counters?
AdrianW
Know my way around
Posts: 249
Joined: Thu Jul 10, 2008 6:17 pm

Re: QNAP firmware issues : Major file system bug

Post by AdrianW »

schumaku wrote: ...any insight on # dmesg or SMART counters?
I'll check the SMART counters when I get home. And as for that dmesg output, it's all Greek to me.
TS-853 Pro; TS-859 Pro; TS-409
AdrianW
Know my way around
Posts: 249
Joined: Thu Jul 10, 2008 6:17 pm

Re: QNAP firmware issues : Major file system bug

Post by AdrianW »

I just checked the SMART reports for my drives. The raw values of all the error-type attributes are zero for every drive, except for one drive that has a value of 1 in a couple of them. The attributes I looked at were Raw_Read_Error_Rate, Reallocated_Sector_Ct, Seek_Error_Rate, Spin_Retry_Count, Calibration_Retry_Count, Hardware_ECC_Recovered, UDMA_CRC_Error_Count, Multi_Zone_Error_Rate and Load_Retry_Count.

So, I don't think the drives are at fault.
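
For anyone who would rather script this check than read the tables by eye, here is a minimal sketch. It assumes you can get each drive's attribute table as text in the ten-column layout that `smartctl -A` prints; whether `smartctl` is available on a given QNAP firmware is an assumption on my part, and the function name `nonzero_error_attributes` is made up for illustration. Raw values are vendor-specific, so a non-zero hit is a hint rather than a verdict.

```python
# Attribute names taken from the list in the post above.
ERROR_ATTRIBUTES = {
    "Raw_Read_Error_Rate", "Reallocated_Sector_Ct", "Seek_Error_Rate",
    "Spin_Retry_Count", "Calibration_Retry_Count", "Hardware_ECC_Recovered",
    "UDMA_CRC_Error_Count", "Multi_Zone_Error_Rate", "Load_Retry_Count",
}


def nonzero_error_attributes(smart_table):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value.

    Assumed column layout (as printed by `smartctl -A`):
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    found = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Only plain integer raw values are compared; anything else is skipped.
        if name in ERROR_ATTRIBUTES and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found
```

Feeding it the saved table for each drive in turn gives a quick list of which drive and which counters are worth a closer look.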
TS-853 Pro; TS-859 Pro; TS-409
AdrianW
Know my way around
Posts: 249
Joined: Thu Jul 10, 2008 6:17 pm

Re: QNAP firmware issues : Major file system bug

Post by AdrianW »

Maybe it's not Samba - I just had exactly the same hang occur when initiating an FTP transfer (using FileZilla).
TS-853 Pro; TS-859 Pro; TS-409
tmt
Experience counts
Posts: 1006
Joined: Mon Nov 16, 2009 11:02 am

Re: QNAP firmware issues : Major file system bug

Post by tmt »

AdrianW wrote: I just checked the SMART reports for my drives. The raw values of all the error-type attributes are zero for every drive, except for one drive that has a value of 1 in a couple of them. The attributes I looked at were Raw_Read_Error_Rate, Reallocated_Sector_Ct, Seek_Error_Rate, Spin_Retry_Count, Calibration_Retry_Count, Hardware_ECC_Recovered, UDMA_CRC_Error_Count, Multi_Zone_Error_Rate and Load_Retry_Count.

So, I don't think the drives are at fault.
Assuming you have some level of RAID mirroring configured (and maybe a backup too!), you should try pulling that drive temporarily and see if the behavior changes. SMART is a highly unreliable indicator of drive health, but any non-zero values in things like CRC and seek errors can indicate larger issues. Not always, of course, but it's a lead which you probably ought to follow.
SS-439, Ubuntu Server 12.04.3 LTS, EXT4, RAID10, 4xHitachi 5K1000
TS-112, 4.1.x Beta, EXT4, 1xHitachi 7K1000
florian.baumert
Starting out
Posts: 15
Joined: Sat Jul 02, 2011 11:02 pm

Re: QNAP firmware issues : Major file system bug

Post by florian.baumert »

I think I've run into the same problem. I documented it here:

http://forum.qnap.com/viewtopic.php?f=11&t=46573
SuperMario
Getting the hang of things
Posts: 75
Joined: Tue May 24, 2011 5:01 pm

Re: QNAP firmware issues : Major file system bug

Post by SuperMario »

Hello all,

Just a note to those following this thread. I had a voice call late last week from a manager called Dave at QNAP in Taiwan. The upshot was that they have their engineers "working on it", and that we should basically give them more time to sort it out.

As they make a good product I will give them the benefit of the doubt - but my past experience with all of the NMP-1000 issues (which are still not resolved) doesn't give me much faith...

Thanks to the people who posted to this thread with theories; it's appreciated. But I can tell you that this is definitely not a network transport or HDD/SMART issue, nor anything directly related to the SAMBA/FTP processes themselves. It appears to be, as the thread name implies, a file-system-level issue that shows itself as a stall of the calling process (in my case, a guest launch of SAMBA when creating directories or files).

Tmt: The mention you make of a previous bugfix about a renaming issue is very telling... it seems to me that maybe QNAP need to start looking for cause and effect from there.

Mario.
AdrianW
Know my way around
Posts: 249
Joined: Thu Jul 10, 2008 6:17 pm

Re: QNAP firmware issues : Major file system bug

Post by AdrianW »

SuperMario wrote: The upshot was that they have their engineers "working on it", and that we should basically give them more time to sort it out.
Well, that's good to hear. I wonder if it means they've been able to reproduce the issue in their labs?

Hopefully we'll see a fix in an upcoming firmware release.
Last edited by AdrianW on Mon Jul 18, 2011 1:16 pm, edited 1 time in total.
TS-853 Pro; TS-859 Pro; TS-409
SuperMario
Getting the hang of things
Posts: 75
Joined: Tue May 24, 2011 5:01 pm

Re: QNAP firmware issues : Major file system bug

Post by SuperMario »

Hello all,

A little further info that may be of help. I SSH'ed into one of the test boxes (the TS-859Pro+ unit), as it has just started to exhibit the stall behavior when I try to change into a new directory after creating it and copying a file into it.

I used a different workstation with a larger monitor to access the QNAP via SSH, so when running top in refresh mode I can see a lot more on screen vertically - and that displayed some interesting information.

Interestingly, as the attached logs show, when the guest "smbd" process is stalled (its task status is "D"), the admin "kjournald" process is also stalled (its status is "DW") for the duration.

PID   USER  STATUS  RSS  PPID %CPU %MEM COMMAND
16017 guest D      4140  3172  1.1  0.4 smbd ; #3127 *** Stalled
16430 admin R       908 15399  0.3  0.0 top
4374  admin S       732     1  0.3  0.0 hwmond
14723 admin S      2824  3172  0.0  0.2 smbd ; #3172
9955  admin S      2724  3172  0.0  0.2 smbd ; #3172
22052 admin S      2640  3172  0.0  0.2 smbd ; #3172
21060 admin S      2560  3172  0.0  0.2 smbd ; #3172
4702  admin S <    2232     1  0.0  0.2 iscsid
15399 admin S      1488 15394  0.0  0.1 sh
15394 admin S      1392  3414  0.0  0.1 sshd
3100  admin S      1088     1  0.0  0.1 cupsd
4743  admin S      1044     1  0.0  0.1 qLogEngined
3172  admin S       948     1  0.0  0.0 smbd
2974  admin S       940     1  0.0  0.0 _thttpd_
5188  admin S       928     1  0.0  0.0 nmbd
2720  admin S       848     1  0.0  0.0 upnpd
3081  admin S       784     1  0.0  0.0 Qthttpd
4579  admin S       708     1  0.0  0.0 hd_util
2412  admin S       580     1  0.0  0.0 hotswap
3414  admin S       536     1  0.0  0.0 sshd
3287  admin S       532     1  0.0  0.0 crond
5240  admin S       524     1  0.0  0.0 upnpcd
4737  admin S       520     1  0.0  0.0 lcdmond
4614  admin S       516     1  0.0  0.0 gen_bandwidth
5227  admin S       492     1  0.0  0.0 mDNSResponderPo
4275  admin S       488     1  0.0  0.0 bcclient
16433 admin D       476  3172  0.0  0.0 smbd ; #3172 *** Stalled
3310  admin S       472     1  0.0  0.0 ntpdated
3177  admin S       472  3172  0.0  0.0 smbd
1     admin S       460     0  0.0  0.0 init
4425  admin S N     460     1  0.0  0.0 acpid
4847  admin S       456     1  0.0  0.0 klogd.sh
4365  admin S       452     1  0.0  0.0 gpiod
5128  admin S       452     1  0.0  0.0 qsyncman
4451  admin S       448     1  0.0  0.0 upsutil
4295  admin S       448     1  0.0  0.0 picd
2420  admin S       412     1  0.0  0.0 qsmartd
4701  admin S       404     1  0.0  0.0 iscsid
4832  admin S       396     1  0.0  0.0 upsd
4707  admin S       384     1  0.0  0.0 vdd_control
4760  admin S       372     1  0.0  0.0 qShield
1858  admin S <     372     1  0.0  0.0 qwatchdogd
4463  admin S N     368     1  0.0  0.0 rsyncd
5331  admin S       364     1  0.0  0.0 getty
5332  admin S       364     1  0.0  0.0 getty
1773  admin S       360     1  0.0  0.0 daemon_mgr
4748  admin S       348     1  0.0  0.0 qsyslogd
3327  admin S       316     1  0.0  0.0 stunnel
4855  admin S       288  4847  0.0  0.0 dd
2165  admin S       180     1  0.0  0.0 modagent
2371  admin SW        0     2  0.0  0.0 md0_raid5
1503  admin SW        0     2  0.0  0.0 md9_raid1
412   admin SW        0     2  0.0  0.0 kswapd0
14    admin SW        0     2  0.0  0.0 events/3
2398  admin DW        0     2  0.0  0.0 kjournald ; ***** Stalled?!
13    admin SW        0     2  0.0  0.0 events/2
193   admin SW        0     2  0.0  0.0 bdi-default

I believe that "kjournald" handles the filesystem journaling or something related to it, so maybe it's related to the culprit here?
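
In case it helps others capture the same evidence, below is a small sketch that picks the "D" (uninterruptible sleep, usually waiting on I/O) rows out of a saved listing in the column layout shown above. The function name `uninterruptible_processes` is hypothetical, and the parsing assumes the same eight columns plus the hand-added "; ..." notes, so treat it as a starting point rather than a finished tool.

```python
def uninterruptible_processes(top_output):
    """Return (pid, user, status, command) for every row whose status starts with "D"."""
    blocked = []
    for line in top_output.splitlines():
        fields = line.split()
        # Skip the header and anything that does not start with a numeric PID.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        pid, user, status = int(fields[0]), fields[1], fields[2]
        rest = fields[3:]
        # The status column can carry a detached flag, as in "S <" or "S N".
        if rest and rest[0] in ("<", "N"):
            status += rest[0]
            rest = rest[1:]
        # rest should now be: RSS PPID %CPU %MEM COMMAND...
        if len(rest) < 5:
            continue
        # Drop hand-written notes such as "; #3172 *** Stalled".
        command = " ".join(rest[4:]).split(";")[0].strip()
        if status.startswith("D"):
            blocked.append((pid, user, status, command))
    return blocked
```

Run against the listing above, this should single out the two smbd processes and kjournald. Saving the listing every few seconds during a hang and comparing the results would show whether the same processes stay blocked together for the whole stall.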

Mario.