Crash when writing more than 200GB in one batch

mr-auh
Starting out
Posts: 17
Joined: Thu Nov 29, 2018 5:44 pm

Re: Crash when writing more than 200GB in one batch

Post by mr-auh »

A little update:
QNAP support told me they have escalated my issue and will get back in touch with me once they hear back from the technical staff.

What really makes me scratch my head is the behaviour of the storage (6× HDDs) configured as RAID5.
At some point the data transfer and the growth of the cache just stop and the file share becomes unresponsive (I can't browse any files with Explorer and can't access the share even after restarting Explorer), yet the QNAP itself doesn't crash.
(attached screenshot: 2018-12-07_07_42_30.png)
I can still access the files using File Station, but trying to delete them just hangs the application. All I can do is reboot the QNAP.
(attached screenshot: 2018-12-07_07_50_41.png)
storageman
Ask me anything
Posts: 5507
Joined: Thu Sep 22, 2011 10:57 pm

Re: Crash when writing more than 200GB in one batch

Post by storageman »

This is off when it works without crashing, right? Or did your SSH workaround do something more?
Strange that we don't hear from other people with this issue on that model.
(attached screenshot: temp.png)
mr-auh
Starting out
Posts: 17
Joined: Thu Nov 29, 2018 5:44 pm

Re: Crash when writing more than 200GB in one batch

Post by mr-auh »

storageman wrote: Fri Dec 07, 2018 5:07 pm This is off when it works without crashing, right? Or did your SSH workaround do something more?
Strange that we don't hear from other people with this issue on that model.
(attached screenshot: temp.png)
This setting has been off for all my tests. From the behaviour of the problem, it all looks like a software bug to me. However, I would imagine that a RAID5 across all 6 HDDs in this box is quite a common configuration, so others should be hitting this problem as well. Keen to see what QNAP Support has to say.
storageman
Ask me anything
Posts: 5507
Joined: Thu Sep 22, 2011 10:57 pm

Re: Crash when writing more than 200GB in one batch

Post by storageman »

How about creating a volume on those SSDs and doing a 200GB+ copy?
Personally, I think those Reds are part of the problem.
mr-auh
Starting out
Posts: 17
Joined: Thu Nov 29, 2018 5:44 pm

Re: Crash when writing more than 200GB in one batch

Post by mr-auh »

storageman wrote: Fri Dec 07, 2018 6:26 pm How about creating a volume on those SSDs and doing a 200GB+ copy?
Personally, I think those Reds are part of the problem.
I tried the same thing using the NVMe SSDs in RAID1 and hit the same problem. We also have another QNAP with 12 WD Reds in a RAID6 for the same purpose, and that one works without any issues.
mr-auh
Starting out
Posts: 17
Joined: Thu Nov 29, 2018 5:44 pm

Re: Crash when writing more than 200GB in one batch

Post by mr-auh »

TL;DR at the bottom.

Another update:
In the meantime, three different QNAP engineers tried their luck in a one-week remote session, but nothing really came of it.
Yesterday they contacted me once again and asked for a phone call at 10am. Sadly, no one ever called, and I got no reply to the mail I sent asking what is going on.

Yesterday evening I invested another two hours into building on the drop_caches workaround to see if I could automate it.
First of all, you don't have to use

Code: Select all

echo 3 > /proc/sys/vm/drop_caches
as the option;

Code: Select all

echo 1 > /proc/sys/vm/drop_caches
works just as well (it only frees the page cache).
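For reference, the kernel accepts three values for drop_caches (this comes straight from the stock kernel's vm documentation, nothing QNAP-specific):

Code: Select all

sync                                # flush dirty pages first so they can actually be dropped
echo 1 > /proc/sys/vm/drop_caches   # free the page cache only
echo 2 > /proc/sys/vm/drop_caches   # free reclaimable slab objects (dentries and inodes)
echo 3 > /proc/sys/vm/drop_caches   # free both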
Some websites suggest dropping the caches from a cron job every five minutes (see the sketch below for what that looks like), but I didn't like that solution at all: most of the time you don't need it, and sometimes a five-minute interval is still too long and the box crashes anyway.
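Just to illustrate the approach those sites propose (the crontab path is an assumption on my part; QTS keeps its crontab at /etc/config/crontab as far as I know):

Code: Select all

# hypothetical entry for /etc/config/crontab: drop the page cache every five minutes
*/5 * * * * /bin/sh -c 'echo 1 > /proc/sys/vm/drop_caches'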

Looking into the virtual memory subsystem, I found the following setting:

Code: Select all

vm.min_free_kbytes = 131072
which pretty much explains the behaviour I observed: 131072 kB is only 128 MB of reserved headroom, so when writing at about 700 MB/s the page cache just blows straight past that limit and exhausts the RAM.

TL;DR:
Increasing the min_free_kbytes value pretty much did the trick for me, without cron jobs or other dirty hacks. I added the following line to my autorun.sh:

Code: Select all

sysctl -w vm.min_free_kbytes=33554432
A lower value would surely work too, but for now I think reserving 32 GB (33554432 kB) of headroom should be more than enough. It works with the system cache setting either enabled or disabled.
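A quick sanity check after a reboot to confirm autorun.sh really applied the setting:

Code: Select all

# both commands should report 33554432
sysctl vm.min_free_kbytes
cat /proc/sys/vm/min_free_kbytes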
(attached screenshot: 2019-02-21_08_02_30.png)
Going to run some longer tests now to see whether the system stays stable enough for production use.
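If anyone wants to reproduce the big batch write directly on the NAS over SSH, something like this should push data through the page cache at full speed (the target path is an assumption, adjust it to an existing share on your data volume):

Code: Select all

# write ~250 GB in one batch through the page cache
dd if=/dev/zero of=/share/Public/bigfile.bin bs=1M count=256000
# in a second SSH session, watch free memory and the page cache
while true; do grep -E 'MemFree|^Cached' /proc/meminfo; sleep 5; done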
storageman
Ask me anything
Posts: 5507
Joined: Thu Sep 22, 2011 10:57 pm

Re: Crash when writing more than 200GB in one batch

Post by storageman »

mr-auh wrote: Thu Feb 21, 2019 3:41 pm [...] Increasing the min_free_kbytes value pretty much did the trick for me, without cron jobs or other dirty hacks. [...]
Nobody should need to do any of this; QNAP needs to fix either the hardware or the software issue here.