OK guys, I'm sorry this post is going to be so long, but if you actually give a crap about getting this cache performance thing solved, you may want to plow through it anyway.
TL;DR: I think I'm coming to an understanding of why the performance starts out great with the cache and then hits a brick wall after some time, and I think I may be on the right path for finding a solution that avoids this performance crash. Read on for the gory details.
@joshuacant, my personal belief right now is as follows, and I'm doing research and investigation (and learning) to confirm or deny it, so here goes:
I believe that the QNAP device is not properly flushing the cache of dirty blocks. Over time the number of dirty blocks grows and grows until the entire cache is full of dirty blocks, upon which all subsequent writes will force the dirty blocks to first be written out to disk, then the blocks are free to be overwritten with new data. This is probably what people are seeing when they say that upon filling the cache performance plummets.
I'm new to all of this, so I've been doing some intense Googling, reading, and poking around on my TS-963X.
My setup: a 5-disk RAID 6 (five 7200RPM Seagate IronWolf 8TB drives), currently cached through a 2TB Samsung 860 EVO with 10% overprovisioning, connected directly to my Windows 10 PC over 10GbE. The cache is set to cache all I/O (i.e., no cache bypass threshold is set, so all writes go through the cache). From my testing so far the cache is helping.
My "lots of small files" test is a directory of almost all JPG and GIF images, plus a few video files: 14.76GB of data spread over 2771 files in 21 folders. Some of those files are largish; many are quite small, on the order of a few tens of KB. With this directory I see a very modest boost in write speeds from my Windows 10 machine to the QNAP (the directory lives on a 2TB Samsung 970 EVO NVMe drive, so the source side is really fast), and a more substantial ~10% or so benefit in read speeds when I copy the directory back to the PC from the QNAP.
My "sequential I/O" test is another directory with 46.86GB of data in just 93 files (some 10MB JPEGs, 55MB RAW photos, and some video clips, some in the GB size range). Writing this to the array is roughly 30% faster through the cache than without it; reading it back sees a more modest maybe 5% boost, because reading from the RAID 6 is already pretty fast, and a single SATA SSD can barely outread five of those drives in RAID 6.
I have a second 2TB SSD installed in the QNAP, but I can't RAID 0 it with the first one and test that yet, because I've only got 8GB of RAM in the QNAP and a 4TB cache requires 16GB of RAM. That will be resolved when my RAM arrives tomorrow; I'll RAID 0 my two SSDs and see how that affects read/write performance to the array. I'm expecting a very noticeable boost, and I'll report back here.
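For anyone who wants a quick sanity check of cached vs. uncached write throughput without involving a second machine, you can time a big sequential write right from the QNAP shell. Treat this as a sketch, not a recipe: the share path here is hypothetical (use one of your own shares), and oflag=direct, which bypasses the RAM page cache so you're measuring the SSD/array rather than memory, isn't supported by every dd build, so check yours first.
# write 4GB and time it; run once with the SSD cache enabled and once with it disabled
time dd if=/dev/zero of=/share/Public/ddtest.bin bs=1M count=4096 oflag=direct
# clean up the test file afterwards
rm /share/Public/ddtest.bin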
Here are some commands that will help people who are inclined to poke around on their QNAP to look and learn about their cache. Use these at your own risk. I'm not a guru of this stuff, but I've used UNIX systems before. These commands just look at things. I haven't changed anything yet. I'm just looking and learning.
dmsetup status
This command will tell you some good status and info regarding your cache. Here's what I see when I run this:
[/] # dmsetup status
vg288-lv1: 0 46522171392 linear
CG0ssddev: 0 3463413760 flashcache_ssd CG0 enable WRITE_BACK 1690112 283823 264263 0 0
conf:
capacity(1690112M), associativity(512), data block size(1024K) metadata block size(4096b)
forced_plugout 0 cached disk 1 stop_sync 0 suspend_flush 0
Slot Info:
1 Used 3UXT5T-Zv9m-wHev-X1sJ-DsdS-zDYv-iCw71Q
stats:
nr_queued(0) nr_jobs(0)
lru hot blocks(7417779), lru warm blocks(7417086)
lru promotions(25214), lru demotions(0)
cachedev1: 0 46522171392 flashcache enable 283823 264263
write_uncached 0 stop_sync 0 suspend_flush 0
vg256-lv256: 0 3463413760 linear
Here's another command:
more /proc/flashcache/CG0/flashcache_stats
My cache device is called CG0 (you can see this in the dmsetup status output above), and the flashcache module writes its statistics to this proc file. I won't paste the full output here, but here are a couple of things to check out:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 3259529
dirty_write_hits: 11751
dirty_write_hit_percent: 99
dirty_blocks: 264305
Note the dirty_blocks value. That reports the number of blocks currently marked dirty in the cache. I copied my 46GB "large files" directory onto the array through this cache probably an hour ago, and had run other such tests last night since starting the cache up.
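If you want to know how full of dirty blocks your cache is, divide dirty_blocks by the total block count. From the conf line above, my cache holds 1690112 blocks of 1024K each, so 264305 dirty blocks is roughly 15.6% of the cache. Here's a throwaway one-liner to compute it; it assumes my CG0 name and my 1690112-block capacity, so substitute your own values:
# dirty_blocks as a percentage of total cache blocks ($NF is the last field on the line)
awk '/dirty_blocks/ {printf "%.1f%% of cache is dirty\n", $NF*100/1690112}' /proc/flashcache/CG0/flashcache_stats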
Here I've just shut down the cache, which forced a flush of its dirty contents to the array. That took probably 10 or 15 minutes.
Looking at the stats again after turning the cache back on, here's what we see:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 418
dirty_write_hits: 418
dirty_write_hit_percent: 67
dirty_blocks: 25
Notice only 25 dirty blocks. I've just turned the cache back on and I haven't yet written anything to the array.
Ok, here I've just copied my 46GB "large files" directory to the array from my Windows 10 machine. Notice the dirty blocks now:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 40524
dirty_write_hits: 15078
dirty_write_hit_percent: 30
dirty_blocks: 48655
(Sanity check: 48655 dirty blocks at the 1024K data block size works out to about 47.5GB, which lines up nicely with the 46.86GB I just copied plus the handful of blocks that were already dirty.) I'm about to go outside and work in the yard for a while; when I get back I'll look at the cache stats again. Here's why: all "idle" dirty blocks should have been flushed to the array by the time I get back, and I'll bet dollars to donuts that they won't have been. How do I know they should have been written out by then? Because if you run the following command, you'll see that the flashcache parameters have been set to do exactly that.
# sysctl -a |grep fallow
dev.flashcache.CG0.fallow_clean_speed = 2
dev.flashcache.CG0.fallow_delay = 900
What is the fallow_delay parameter? Look at this document:
https://github.com/facebookarchive/flas ... -guide.txt
Here you see the following:
Sysctls for writeback mode only :
dev.flashcache.<cachedev>.fallow_delay = 900
In seconds. Clean dirty blocks that have been "idle" (not
read or written) for fallow_delay seconds. Default is 15
minutes.
Setting this to 0 disables idle cleaning completely.
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set
per second. Defaults to 2.
Since my cache is set up in writeback mode (you can confirm that via the dmsetup status command, which shows WRITE_BACK), these settings should apply. fallow_delay is the number of seconds a dirty block (one that's been written into the cache but not yet flushed to the array) must sit idle, neither read nor written, before it becomes eligible for cleaning; here that's 900 seconds, or 15 minutes. I'm going outside to work in the yard for more than 15 minutes, so while I'm gone all of the dirty blocks I just created by copying that 46GB to the array should at least have begun being flushed, but they won't have been. I'll print out some stats when I get back to prove this.
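If you want to run this same experiment yourself, something like the following loop logs a timestamped dirty count once a minute while you're away, so you get an unambiguous time series showing whether fallow cleaning ever kicks in. It's trivial, but note it assumes your cache is named CG0 like mine:
# log the dirty count once a minute (Ctrl-C to stop)
while true; do echo "$(date '+%H:%M:%S') $(grep dirty_blocks /proc/flashcache/CG0/flashcache_stats)"; sleep 60; done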
Ok, it's about an hour later, and here's the results of the cache stats:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 449801
dirty_write_hits: 10713
dirty_write_hit_percent: 99
dirty_blocks: 53915
Not only are all the blocks previously written by the 46GB copy still dirty, but over time as more filesystem activity has occurred, the cache has accumulated more dirty blocks.
I just looked at the Resource Monitor under Disk Activity in QTS, and all five hard drives are showing 0 IOPS, while the SSD is showing 24-26 write IOPS and around 14 read IOPS, which is consistent with the slow climb in dirty blocks in the cache.
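If you'd rather watch this from the shell than from Resource Monitor, you can sample the cumulative write counters in /proc/diskstats twice and compare. On each device line, the eighth field is writes completed since boot; the sd[a-z] device names will differ from box to box, so treat this as a sketch:
# print writes-completed per whole disk, wait 10 seconds, print again
awk '$3 ~ /^sd[a-z]$/ {print $3, $8}' /proc/diskstats
sleep 10
awk '$3 ~ /^sd[a-z]$/ {print $3, $8}' /proc/diskstats
# if the hard drives' numbers barely move between samples, nothing is being flushed to the array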
Conclusion so far: the cache is absorbing all of the writes, which become dirty blocks in the cache (ie: material has NOT been flushed out to the array yet), and those dirty blocks are not being flushed. They should begin being flushed 15 minutes after they are written if they haven't been read from or subsequently written to again, but that isn't happening.
This is problematic for several reasons.
1) I have a RAID 6 for a reason: it's far more secure in protecting my data than a single SSD, and of course way, way more secure than a pair of SSDs in RAID 0, which is what I'll probably make my cache. Leaving data sitting for potentially days or weeks only in the cache, never written back to the array, exposes that data to whatever risk the cache hardware carries, and in the case of a RAID 0 SSD cache that risk could be considerable.
2) So far everything I'm doing with the array involves data originating on other computers that will likely still exist for a while even after it's copied over to the array, so even if I lost a small amount of data by having a write to a RAID 0 cache interrupted by one of the SSDs going bad, I would still have the original data and I'd be OK. Obviously this could change if I started actually working with files stored on the array, but for now I could survive a RAID 0 cache failure.
3) If the cache were flushing properly then even in the event of a RAID 0 cache SSD failure the amount of data lost would be limited to whatever was written to the cache in the last 15 minutes or so (give or take, since it does take some time to flush the cache to the array, especially if data is being written into the cache at a very high continuous transfer rate), which I could probably survive just fine.
4) Since the cache is not being flushed properly, I believe what people are seeing is the result of the cache filling completely with dirty blocks; at that point any write to the cache, or read of data not already cached, first forces dirty blocks out to the array before the requested I/O can complete. This would explain why people's performance absolutely tanks after a while.
With all this being said, I believe that this situation can be fixed. Since I'm seeing concrete benefit to writing to, and reading from, the cache when doing my directory copies from my Windows 10 PC to the array, I would like to see the problems mitigated so that I can continue to use the cache. Also, given how many people have struggled through this issue in this thread, it seems that a proper resolution to this issue would benefit a lot of people.
So, here's what I'm going to continue to do. If anyone is smarter than me about these things please feel free to chime in, since I'm an absolute neophyte here on this sort of issue. I know enough to be dangerous to myself, but not that much more.
1) See if I can figure out why the dirty cache blocks aren't being flushed out properly, or at least why it doesn't appear that they are. One thing for me to consider here is the following setting, printed out here again:
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set
per second. Defaults to 2.
I don't know exactly what "disk writes per set per second" means. This cache, according to my dmsetup status results, is 512-way set associative with a capacity of 1690112 blocks, which works out to exactly 3301 sets (1690112 / 512). If a cache set is the "set" being discussed, then at 2 fallow-clean writes per set per second I should in theory see up to roughly 6600 cache-flushing disk writes per second, which I clearly am not seeing. If by "set" the entire cache is meant, then the default flushing rate, which was probably chosen something like 10 years ago when flashcache was first created and SSD caches were massively smaller than they can be today, is simply woefully inadequate: if the filesystem receives more new write IOPS than the cache can flush at 2 writes per second, that explains why the dirty count keeps climbing over time, eventually filling the cache and sending performance into a nose-dive.
This is probably tunable via sysctl; you can see that it's exposed as a tunable parameter from my "sysctl -a |grep fallow" results above. One experiment I may work up the courage to do tonight is use sysctl to raise fallow_clean_speed to something much higher than the read IOPS I'm currently seeing in the QTS Resource Monitor, and then check whether write IOPS to the actual array disks jumps by a corresponding figure (see the first sketch after this list). That would indicate the cache is in fact now being flushed, and I'd track it by printing the flashcache stats in /proc and watching whether dirty_blocks falls over time. If this actually works, then it simply comes down to finding out where QTS sets the fallow_clean_speed parameter in the first place, and experimenting to find a new value that flushes the cache fast enough to keep up with my intended usage rates. Ideally, other than during massive high-rate file transfers onto the array, I would like to see the number of dirty blocks stay at approximately zero, or whatever very low number represents ongoing OS filesystem activity.
2) If tuning fallow_clean_speed and/or fallow_delay cannot tame this problem, then I need to figure out if there is a command I can run on the command line that forces flashcache to flush dirty blocks to the array. I bet there is, but I don't (yet) know what it is. If there is, then I could simply set up a cron job that forces a flush every so many minutes (see the second sketch below). If the cache is flushed often, the amount of data and I/O each flush generates is likely to be small enough not to burden ongoing user activity. The SSDs themselves are capable of way more IOPS than all five disks in the array, so even while writes from the cache to the array are taking place, new writes into the cache should still run about as fast as they otherwise would. I would bet that if we can get flashcache to flush promptly, for some definition of "prompt", we could end up with an SSD cache setup that pretty much never hits that 100%-dirty performance crash during typical usage by most users. All these complaints from people seeing great performance and then hitting a brick wall after a couple of weeks would evaporate.
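Here's roughly what I have in mind for experiment 1). I haven't run this yet, so it's a sketch: the CG0 name is from my box, and 100 is just a starting guess for the new rate:
# raise the fallow clean rate, then poll the dirty count to see if it finally starts dropping
sysctl -w dev.flashcache.CG0.fallow_clean_speed=100
while true; do grep dirty_blocks /proc/flashcache/CG0/flashcache_stats; sleep 10; done
And for 2): the same flashcache-sa-guide.txt linked above documents a do_sync sysctl that schedules cleaning of all dirty blocks when set, which looks like exactly the forced-flush command I'm after. I haven't tried it, so again this is a sketch; verify the CG0 name, the cron syntax your box accepts, and the path to sysctl before relying on it:
# one-shot forced flush of all dirty blocks
sysctl -w dev.flashcache.CG0.do_sync=1
# hypothetical crontab entry: force a flush every 15 minutes
*/15 * * * * /sbin/sysctl -w dev.flashcache.CG0.do_sync=1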
I think I'll summon the gumption this evening to use sysctl to set fallow_clean_speed to something like 100, or maybe 1000, and see if that results in an immediate increase in IOPS to the array. If it does, then I think I'm onto something that could help a ton of people out here. An SSD cache should in fact be useful and helpful; as things currently function, though, the benefits go sour once the cache fills, people turn the cache off, and they leave real performance gains on the table. That's a problem.
I think that the 2TB SSDs were probably a mistake on my part. I already had one of them, and bought the second for experimentation with this array; it'll be worth it if I get $279 worth of education out of it, since that's what I paid. I'm thinking I probably would have been better served by 4x1TB SSDs rather than 2x2TB. The problem is that a 4TB SSD cache requires 16GB of RAM in the QNAP, which I will have, but that's all it can support, so I can't go RAID 10 with the 2TB SSDs, since that would require 32GB of RAM. On the other hand, if I had gone with 4x1TB SSDs in a RAID 10 setup I could have had a 2TB cache, which requires only 8GB of RAM, and had both the speed benefit of striped SSDs and the data security of the mirror. Who knows, though; SSDs are so cheap these days, I may just end up buying four 1TB SSDs anyway to try it out.
If I'm heading in the right direction here, then it's a crying shame that QNAP's engineers never figured this stuff out on their own and set up their configurations to avoid this performance crash. It's not like people haven't been complaining about this for years already.