A little update:
QNAP support told me they have escalated my issue and will get back in touch with me once they hear back from the technical staff.
What really makes me scratch my head quite hard is the behaviour of the storage (6x HDDs) configured as RAID5.
At some point the data transfer and the cache growth just stop and the file share becomes unresponsive: I can't browse any files with Explorer and can't access the share even after restarting Explorer. The QNAP itself, however, doesn't crash.
I can still access the files using File Station, but trying to delete them just hangs the application. All I can do is reboot the QNAP.
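For anyone trying to watch this happen live, the cache growth described above can be observed from an SSH session while a large copy runs. This is just a diagnostic sketch (not something from QNAP support); the sample count is arbitrary:

```shell
# Sample free memory and pagecache size once per second.
# During a large sequential write, Cached climbs while MemFree
# shrinks; on the affected box the share hangs once free RAM bottoms out.
for i in 1 2 3 4 5; do
    grep -E '^(MemFree|Cached):' /proc/meminfo
    echo "---"
    sleep 1
done
```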
Crash when writing more than 200GB in one batch
- mr-auh
- Starting out
- Posts: 17
- Joined: Thu Nov 29, 2018 5:44 pm
Re: Crash when writing more than 200GB in one batch
You do not have the required permissions to view the files attached to this post.
- storageman
- Ask me anything
- Posts: 5507
- Joined: Thu Sep 22, 2011 10:57 pm
Re: Crash when writing more than 200GB in one batch
This is off when it works without crashing, right? Or did your SSH do something more.
Strange we don't hear other people with this issue on that model.
- mr-auh
Re: Crash when writing more than 200GB in one batch
storageman wrote: ↑Fri Dec 07, 2018 5:07 pm
This is off when it works without crashing, right? Or did your SSH do something more.
Strange we don't hear other people with this issue on that model.

This setting has been off for all my tests. Given the behaviour of the problem, it looks like a software bug to me. However, I would imagine that a RAID5 across all 6 HDDs in this box is quite a common setup, so others should be seeing this problem as well. Keen to see what QNAP support has to say.

temp.png
- storageman
Re: Crash when writing more than 200GB in one batch
How about creating a volume on those SSDs and doing a 200GB+ copy?
Personally, I think those Reds are part of the problem.
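A 200GB+ sequential write can also be generated directly on the NAS over SSH, which takes the network client out of the equation. A sketch only: the target path is a placeholder, and the default size is deliberately tiny so a dry run is harmless; on the NAS you would use something like SIZE_MB=204800 for a real ~200 GiB test.

```shell
#!/bin/sh
# Sequentially write SIZE_MB megabytes of zeros to TARGET.
# Example for the real test: TARGET=/share/Public/bigfile SIZE_MB=204800
TARGET=${TARGET:-/tmp/bigwrite.bin}
SIZE_MB=${SIZE_MB:-64}

dd if=/dev/zero of="$TARGET" bs=1M count="$SIZE_MB"
ls -l "$TARGET"
```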
- mr-auh
Re: Crash when writing more than 200GB in one batch
storageman wrote: ↑Fri Dec 07, 2018 6:26 pm
How about creating a volume on those SSDs and doing a 200GB+ copy?
Personally, I think those Reds are part of the problem.

I tried the same thing using the NVMe SSDs in RAID1, but hit the same problem. We also have another QNAP with 12 WD Reds in a RAID6 for the same purpose, and that one works without any issues.
- mr-auh
Re: Crash when writing more than 200GB in one batch
TL;DR at the bottom.

Another update:
Three different QNAP engineers have meanwhile tried their luck in a week-long remote session, but nothing really came of it. Yesterday they contacted me again and asked for a phone call at 10am. Sadly, no one ever called, and I got no reply to my mail asking what is going on.

Yesterday evening I invested another two hours building on the drop_caches workaround to see if I could automate it. First of all, you don't have to use

Code: Select all
echo 3 > /proc/sys/vm/drop_caches

as the option;

Code: Select all
echo 1 > /proc/sys/vm/drop_caches

works just as well (it only frees the pagecache). Some websites suggest running a cronjob every five minutes, but I didn't like that solution at all: most of the time you don't need it, and sometimes the five-minute window is still too long, resulting in a crash.

Looking into the virtual memory subsystem, I found the following setting:

Code: Select all
vm.min_free_kbytes = 131072

which pretty much explains the behaviour I observed. Writing at about 700 MB/s, the pagecache just jumps over this limit right into non-existent RAM addresses.

TL;DR:
Increasing the min_free_kbytes value did the trick for me, without cronjobs or other dirty hacks. I added the following line to my autorun.sh:

Code: Select all
sysctl -w vm.min_free_kbytes=33554432

Surely a lower value would also work, but for now I think keeping 32GB free should be just enough. It works with the system cache either enabled or disabled.

2019-02-21_08_02_30.png

Going to run some longer tests now to see if the system stays stable for productive use.
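For completeness, the workaround above could live in autorun.sh with a small sanity check. A minimal sketch, assuming the 32 GiB figure from the post (33554432 KiB) and that sysctl is on the PATH; the echo lines are just illustrative logging:

```shell
#!/bin/sh
# Reserve 32 GiB, expressed in KiB: 32 * 1024 * 1024 = 33554432.
# This makes the kernel start reclaiming pagecache well before RAM runs out.
RESERVE_KB=$((32 * 1024 * 1024))
echo "target min_free_kbytes: $RESERVE_KB"

# The current kernel value is readable without root:
echo "current min_free_kbytes: $(cat /proc/sys/vm/min_free_kbytes)"

# Applying the new value needs root; uncomment on the NAS:
# sysctl -w vm.min_free_kbytes="$RESERVE_KB"
```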
- storageman
Re: Crash when writing more than 200GB in one batch
mr-auh wrote: ↑Thu Feb 21, 2019 3:41 pm
TL;DR at the bottom. [...]

Nobody should need to do any of this. QNAP needs to fix either the hardware or the software issue here.