OK guys, I'm sorry this post is going to be so long, but if you actually give a crap about getting this cache performance thing solved, you may want to plow through it anyway.
TL;DR: I think I'm coming to an understanding of why the performance starts out great with the cache and then hits a brick wall after some time, and I think I may be on the right path for finding a solution that avoids this performance crash. Read on for the gory details.
@joshuacant, my personal belief right now is as follows, and I'm doing research and investigation (and learning) to confirm or deny it, so here goes:
I believe that the QNAP device is not properly flushing the cache of dirty blocks. Over time the number of dirty blocks grows and grows until the entire cache is full of dirty blocks, upon which all subsequent writes will force the dirty blocks to first be written out to disk, then the blocks are free to be overwritten with new data. This is probably what people are seeing when they say that upon filling the cache performance plummets.
I'm new to all of this, so I've been doing some intense Googling, reading, and poking around on my TS-963X.
My setup: a 5-disk RAID 6 (five 7200RPM Seagate IronWolf 8TB drives), currently cached through a 2TB Samsung 860 EVO with 10% overprovisioning, connected directly to my Windows 10 PC over 10GbE. The cache is set to cache all I/O (i.e., no cache bypass threshold is set, so all writes go through the cache). From my testing so far the cache is helping.
My "lots of small files" test is a directory of almost all JPG and GIF images, plus a few video files: 14.76GB of data spread over 2771 files in 21 folders. Some of those files are largish; many are quite small, on the order of a few tens of KB. With this directory I see a very modest boost in write speeds from my Windows 10 machine to the QNAP (the directory lives on a 2TB Samsung 970 EVO NVMe drive, so the source side is really fast), and a more substantial ~10% or so benefit in read speeds when I copy the directory back to the PC from the QNAP.
My "sequential I/O" test is another directory with 46.86GB of data in just 93 files (some 10MB JPEGs, 55MB RAW photos, and some video clips, some in the GB size range). Writing this to the array is roughly 30% faster through the cache than without it; reading it back sees a more modest maybe 5% boost, because reading from the RAID 6 is already pretty fast, and a single SATA SSD can barely outread five of those drives in RAID 6.
I have a second 2TB SSD installed in the QNAP, but I can't RAID 0 it with the first one and test that yet, because I've only got 8GB of RAM in the QNAP and a 4TB cache requires 16GB of RAM. That will be resolved when my RAM arrives tomorrow; I'll RAID 0 my two SSDs and see how that affects read/write performance to the array. I'm expecting a very noticeable boost, and I'll report back here.
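For anyone who wants a quick sanity check of cached vs. uncached write throughput without involving a second machine, you can time a big sequential write right from the QNAP shell. Treat this as a sketch, not a recipe: the share path here is hypothetical (use one of your own shares), and oflag=direct, which bypasses the RAM page cache so you're measuring the SSD/array rather than memory, isn't supported by every dd build, so check yours first.
# write 4GB and time it; run once with the SSD cache enabled and once with it disabled
time dd if=/dev/zero of=/share/Public/ddtest.bin bs=1M count=4096 oflag=direct
# clean up the test file afterwards
rm /share/Public/ddtest.bin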
Here are some commands that will help people who are inclined to poke around on their QNAP to look and learn about their cache. Use these at your own risk. I'm not a guru of this stuff, but I've used UNIX systems before. These commands just look at things. I haven't changed anything yet. I'm just looking and learning.
dmsetup status
This command will tell you some good status and info regarding your cache. Here's what I see when I run this:
[/] # dmsetup status
vg288-lv1: 0 46522171392 linear
CG0ssddev: 0 3463413760 flashcache_ssd CG0 enable WRITE_BACK 1690112 283823 264263 0 0
conf:
capacity(1690112M), associativity(512), data block size(1024K) metadata block size(4096b)
forced_plugout 0 cached disk 1 stop_sync 0 suspend_flush 0
Slot Info:
1 Used 3UXT5T-Zv9m-wHev-X1sJ-DsdS-zDYv-iCw71Q
stats:
nr_queued(0) nr_jobs(0)
lru hot blocks(7417779), lru warm blocks(7417086)
lru promotions(25214), lru demotions(0)
cachedev1: 0 46522171392 flashcache enable 283823 264263
write_uncached 0 stop_sync 0 suspend_flush 0
vg256-lv256: 0 3463413760 linear
Here's another command:
more /proc/flashcache/CG0/flashcache_stats
My cache device is called CG0 (you can see this in the dmsetup status output above), and the flashcache module writes its statistics to this proc file. I won't paste the full output here, but here are a couple of things to check out:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 3259529
dirty_write_hits: 11751
dirty_write_hit_percent: 99
dirty_blocks: 264305
Note the dirty_blocks value. That reports the number of blocks currently marked dirty in the cache. I copied my 46GB "large files" directory onto the array through this cache probably an hour ago, and had run other such tests last night since starting the cache up.
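If you want to know how full of dirty blocks your cache is, divide dirty_blocks by the total block count. From the conf line above, my cache holds 1690112 blocks of 1024K each, so 264305 dirty blocks is roughly 15.6% of the cache. Here's a throwaway one-liner to compute it; it assumes my CG0 name and my 1690112-block capacity, so substitute your own values:
# dirty_blocks as a percentage of total cache blocks ($NF is the last field on the line)
awk '/dirty_blocks/ {printf "%.1f%% of cache is dirty\n", $NF*100/1690112}' /proc/flashcache/CG0/flashcache_stats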
Here I've just shut down the cache, which forced a flush of its dirty contents to the array. That took probably 10 or 15 minutes.
Looking at the stats again after turning the cache back on, here's what we see:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 418
dirty_write_hits: 418
dirty_write_hit_percent: 67
dirty_blocks: 25
Notice only 25 dirty blocks. I've just turned the cache back on and I haven't yet written anything to the array.
Ok, here I've just copied my 46GB "large files" directory to the array from my Windows 10 machine. Notice the dirty blocks now:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 40524
dirty_write_hits: 15078
dirty_write_hit_percent: 30
dirty_blocks: 48655
(Sanity check: 48655 dirty blocks at the 1024K data block size works out to about 47.5GB, which lines up nicely with the 46.86GB I just copied plus the handful of blocks that were already dirty.) I'm about to go outside and work in the yard for a while; when I get back I'll look at the cache stats again. Here's why: all "idle" dirty blocks should have been flushed to the array by the time I get back, and I'll bet dollars to donuts that they won't have been. How do I know they should have been written out by then? Because if you run the following command, you'll see that the flashcache parameters have been set to do exactly that.
# sysctl -a |grep fallow
dev.flashcache.CG0.fallow_clean_speed = 2
dev.flashcache.CG0.fallow_delay = 900
What is the fallow_delay parameter? Look at this document:
https://github.com/facebookarchive/flas ... -guide.txt
Here you see the following:
Sysctls for writeback mode only :
dev.flashcache.<cachedev>.fallow_delay = 900
In seconds. Clean dirty blocks that have been "idle" (not
read or written) for fallow_delay seconds. Default is 15
minutes.
Setting this to 0 disables idle cleaning completely.
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set
per second. Defaults to 2.
Since my cache is set up in writeback mode (you can confirm that via the dmsetup status command, which shows WRITE_BACK), these settings should apply. fallow_delay is the number of seconds a dirty block (one that's been written into the cache but not yet flushed to the array) must sit idle, neither read nor written, before it becomes eligible for cleaning; here that's 900 seconds, or 15 minutes. I'm going outside to work in the yard for more than 15 minutes, so while I'm gone all of the dirty blocks I just created by copying that 46GB to the array should at least have begun being flushed, but they won't have been. I'll print out some stats when I get back to prove this.
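If you want to run this same experiment yourself, something like the following loop logs a timestamped dirty count once a minute while you're away, so you get an unambiguous time series showing whether fallow cleaning ever kicks in. It's trivial, but note it assumes your cache is named CG0 like mine:
# log the dirty count once a minute (Ctrl-C to stop)
while true; do echo "$(date '+%H:%M:%S') $(grep dirty_blocks /proc/flashcache/CG0/flashcache_stats)"; sleep 60; done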
Ok, it's about an hour later, and here's the results of the cache stats:
[/] # more /proc/flashcache/CG0/flashcache_stats|grep dirty
raw_dirty_write_hits: 449801
dirty_write_hits: 10713
dirty_write_hit_percent: 99
dirty_blocks: 53915
Not only are all the blocks previously written by the 46GB copy still dirty, but over time as more filesystem activity has occurred, the cache has accumulated more dirty blocks.
I just looked at the Resource Monitor under Disk Activity in QTS, and all five hard drives are showing 0 IOPS, while the SSD is showing 24-26 write IOPS and around 14 read IOPS, which is consistent with the slow climb in dirty blocks in the cache.
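If you'd rather watch this from the shell than from Resource Monitor, you can sample the cumulative write counters in /proc/diskstats twice and compare. On each device line, the eighth field is writes completed since boot; the sd[a-z] device names will differ from box to box, so treat this as a sketch:
# print writes-completed per whole disk, wait 10 seconds, print again
awk '$3 ~ /^sd[a-z]$/ {print $3, $8}' /proc/diskstats
sleep 10
awk '$3 ~ /^sd[a-z]$/ {print $3, $8}' /proc/diskstats
# if the hard drives' numbers barely move between samples, nothing is being flushed to the array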
Conclusion so far: the cache is absorbing all of the writes, which become dirty blocks in the cache (ie: material has NOT been flushed out to the array yet), and those dirty blocks are not being flushed. They should begin being flushed 15 minutes after they are written if they haven't been read from or subsequently written to again, but that isn't happening.
This is problematic for several reasons.
1) I have a RAID 6 for a reason: it's far more secure in protecting my data than a single SSD, and of course way, way more secure than a pair of SSDs in RAID 0, which is what I'll probably make my cache. Leaving data sitting for potentially days or weeks only in the cache, never written back to the array, exposes that data to whatever risk the cache hardware carries, and in the case of a RAID 0 SSD cache that risk could be considerable.
2) So far everything I'm doing with the array involves data originating on other computers that will likely still exist for a while even after it's copied over to the array, so even if I lost a small amount of data by having a write to a RAID 0 cache interrupted by one of the SSDs going bad, I would still have the original data and I'd be OK. Obviously this could change if I started actually working with files stored on the array, but for now I could survive a RAID 0 cache failure.
3) If the cache were flushing properly then even in the event of a RAID 0 cache SSD failure the amount of data lost would be limited to whatever was written to the cache in the last 15 minutes or so (give or take, since it does take some time to flush the cache to the array, especially if data is being written into the cache at a very high continuous transfer rate), which I could probably survive just fine.
4) Since the cache is not being flushed properly, I believe what people are seeing is the result of the cache filling completely with dirty blocks; at that point any write to the cache, or read of data not already cached, first forces dirty blocks out to the array before the requested I/O can complete. This would explain why people's performance absolutely tanks after a while.
With all this being said, I believe that this situation can be fixed. Since I'm seeing concrete benefit to writing to, and reading from, the cache when doing my directory copies from my Windows 10 PC to the array, I would like to see the problems mitigated so that I can continue to use the cache. Also, given how many people have struggled through this issue in this thread, it seems that a proper resolution to this issue would benefit a lot of people.
So, here's what I'm going to continue to do. If anyone is smarter than me about these things please feel free to chime in, since I'm an absolute neophyte here on this sort of issue. I know enough to be dangerous to myself, but not that much more.
1) See if I can figure out why the dirty cache blocks aren't being flushed out properly, or at least why it doesn't appear that they are. One thing for me to consider here is the following setting, printed out here again:
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set
per second. Defaults to 2.
I don't know exactly what "disk writes per set per second" means. This cache, according to my dmsetup status results, is 512-way set associative with a capacity of 1690112 blocks, which works out to exactly 3301 sets (1690112 / 512). If a cache set is the "set" being discussed, then at 2 fallow-clean writes per set per second I should in theory see up to roughly 6600 cache-flushing disk writes per second, which I clearly am not seeing. If by "set" the entire cache is meant, then the default flushing rate, which was probably chosen something like 10 years ago when flashcache was first created and SSD caches were massively smaller than they can be today, is simply woefully inadequate: if the filesystem receives more new write IOPS than the cache can flush at 2 writes per second, that explains why the dirty count keeps climbing over time, eventually filling the cache and sending performance into a nose-dive.
This is probably tunable via sysctl; you can see that it's exposed as a tunable parameter from my "sysctl -a |grep fallow" results above. One experiment I may work up the courage to do tonight is use sysctl to raise fallow_clean_speed to something much higher than the read IOPS I'm currently seeing in the QTS Resource Monitor, and then check whether write IOPS to the actual array disks jumps by a corresponding figure (see the first sketch after this list). That would indicate the cache is in fact now being flushed, and I'd track it by printing the flashcache stats in /proc and watching whether dirty_blocks falls over time. If this actually works, then it simply comes down to finding out where QTS sets the fallow_clean_speed parameter in the first place, and experimenting to find a new value that flushes the cache fast enough to keep up with my intended usage rates. Ideally, other than during massive high-rate file transfers onto the array, I would like to see the number of dirty blocks stay at approximately zero, or whatever very low number represents ongoing OS filesystem activity.
2) If tuning fallow_clean_speed and/or fallow_delay cannot tame this problem, then I need to figure out if there is a command I can run on the command line that forces flashcache to flush dirty blocks to the array. I bet there is, but I don't (yet) know what it is. If there is, then I could simply set up a cron job that forces a flush every so many minutes (see the second sketch below). If the cache is flushed often, the amount of data and I/O each flush generates is likely to be small enough not to burden ongoing user activity. The SSDs themselves are capable of way more IOPS than all five disks in the array, so even while writes from the cache to the array are taking place, new writes into the cache should still run about as fast as they otherwise would. I would bet that if we can get flashcache to flush promptly, for some definition of "prompt", we could end up with an SSD cache setup that pretty much never hits that 100%-dirty performance crash during typical usage by most users. All these complaints from people seeing great performance and then hitting a brick wall after a couple of weeks would evaporate.
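Here's roughly what I have in mind for experiment 1). I haven't run this yet, so it's a sketch: the CG0 name is from my box, and 100 is just a starting guess for the new rate:
# raise the fallow clean rate, then poll the dirty count to see if it finally starts dropping
sysctl -w dev.flashcache.CG0.fallow_clean_speed=100
while true; do grep dirty_blocks /proc/flashcache/CG0/flashcache_stats; sleep 10; done
And for 2): the same flashcache-sa-guide.txt linked above documents a do_sync sysctl that schedules cleaning of all dirty blocks when set, which looks like exactly the forced-flush command I'm after. I haven't tried it, so again this is a sketch; verify the CG0 name, the cron syntax your box accepts, and the path to sysctl before relying on it:
# one-shot forced flush of all dirty blocks
sysctl -w dev.flashcache.CG0.do_sync=1
# hypothetical crontab entry: force a flush every 15 minutes
*/15 * * * * /sbin/sysctl -w dev.flashcache.CG0.do_sync=1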
I think I'll summon the gumption this evening to use sysctl to set fallow_clean_speed to something like 100, or maybe 1000, and see if that results in an immediate increase in IOPS to the array. If it does, then I think I'm onto something that could help a ton of people out here. An SSD cache should in fact be useful and helpful; as things currently function, though, the benefits go sour once the cache fills, people turn the cache off, and they leave real performance gains on the table. That's a problem.
I think that the 2TB SSDs were probably a mistake on my part. I already had one of them, and bought the second for experimentation with this array; it'll be worth it if I get $279 worth of education out of it, since that's what I paid. I'm thinking I probably would have been better served by 4x1TB SSDs rather than 2x2TB. The problem is that a 4TB SSD cache requires 16GB of RAM in the QNAP, which I will have, but that's all it can support, so I can't go RAID 10 with the 2TB SSDs, since that would require 32GB of RAM. On the other hand, if I had gone with 4x1TB SSDs in a RAID 10 setup I could have had a 2TB cache, which requires only 8GB of RAM, and had both the speed benefit of striped SSDs and the data security of the mirror. Who knows, though; SSDs are so cheap these days, I may just end up buying four 1TB SSDs anyway to try it out.
If I'm heading in the right direction here, then it's a crying shame that QNAP's engineers never figured this stuff out on their own and set up their configurations to avoid this performance crash. It's not like people haven't been complaining about this for years already.