MFU: data should be accessed more than twice to be in MFU (ARC and thus L2ARC) #16499
If the first write doesn't get counted, maybe using MAX(2, (median access count over the last 1% of MRU accesses) + 1) as the criterion for moving from MRU to MFU could be scan-resistant for a few passes. The current implementation is only one-pass scan-resistant.
I think that doing what is described here is not going to be very useful. When writing, ZFS buffers as much as …
That means that even a write-centric workload can not really "purge" the MFU, as both throttling (see …) …
Finally, it is my understanding that ZFS does not really count per-block cache hits. Rather, it has two lists: MRU and MFU. If anything already stored in MRU is demand-read, it is moved into MFU. Adding such counters would be a significant change, complicating and slowing down a very time-critical path (ARC read), which is already much slower than the Linux pagecache.
I am not that familiar with ZFS, so please correct me if I'm missing something. But it seems new data buffers are always added to MRU except when caching is disabled. With this kind of write-then-read workload, all data of the files will be in MRU after the first write. When the files are later read from the MRU, they will all get promoted to MFU and push more valuable data out of it.
There seem to be mfu_hits and mru_hits counts too. But if counting is too much, maybe a simple …
This feels subjective to me, but not impossible. I guess in some write-only scenarios it could be beneficial to even consider just-written data as a separate state (uncached?) to evict immediately, or with a new separate size and ghost state for auto-adaptation. Needs thinking. Meanwhile I'd like to mention that, aside from just the second access, there is currently a second factor for promotion: the time since the last access, which must be at least 62ms ago (see ARC_MINTIME). I suppose this should filter out multiple accesses that are really parts of one workload.
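A minimal sketch of how those two promotion factors (a repeated demand access plus the ARC_MINTIME age check) combine, written in C with invented names (buf_t, access_buf); it is only an illustration of the idea, not the actual arc_access() code:

```c
#include <stdio.h>

/*
 * Simplified, hypothetical model of the MRU -> MFU promotion decision.
 * The real logic lives in arc_access() in module/zfs/arc.c; the constant
 * mirrors ARC_MINTIME, which is (hz >> 4) ticks, about 62 ms at hz = 1000.
 */
#define	HZ		1000
#define	ARC_MINTIME	(HZ >> 4)

typedef enum { STATE_MRU, STATE_MFU } buf_state_t;

typedef struct {
	buf_state_t	state;		/* which ARC list the buffer is on */
	long		last_access;	/* tick of the previous access */
} buf_t;

/* Apply a new demand access happening at tick `now`. */
static void
access_buf(buf_t *b, long now)
{
	if (b->state == STATE_MRU && now - b->last_access >= ARC_MINTIME) {
		/* A second access, spaced far enough apart: promote. */
		b->state = STATE_MFU;
	}
	/*
	 * Accesses closer together than ARC_MINTIME are treated as parts
	 * of one burst and leave the buffer in MRU.
	 */
	b->last_access = now;
}

int
main(void)
{
	buf_t b = { STATE_MRU, 0 };

	access_buf(&b, 10);			/* too soon: stays in MRU */
	printf("after burst re-read: %s\n",
	    b.state == STATE_MRU ? "MRU" : "MFU");

	access_buf(&b, 10 + ARC_MINTIME);	/* old enough: promoted */
	printf("after later re-read: %s\n",
	    b.state == STATE_MRU ? "MRU" : "MFU");
	return (0);
}
```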
Disclaimer: this is my own understanding of how the ARC works, which may very well be wrong or incomplete.
This is correct.
Not "all data", only the tail of what you wrote which fits into MRU: MFU is not reduced on write unless you buffer writes for more than current MRU size (and write buffer is capped by Sure, the "tail" can be so big to actually include all your data - in case of large MRU this is a possibility, and this seem what happened on your test (25G MRU vs 20G An example on a test machine:
Interesting. I completely missed that, thank you for reporting.
Maybe the …
Describe the feature you would like to see added to OpenZFS
ZFS should be more adaptive when moving data from MRU into MFU. To be moved into MFU, data should be accessed maybe 3 times, a user-configurable number of times, or a self-tuned number of times.
How will this feature improve OpenZFS?
Since the MRU is also acting as a write cache, data that is written and then read could easily get counted as accessed twice and moved into MFU. That would push out data that may be more valuable in the MFU. Two accesses may simply be too common a pattern. For example, many large files are downloaded and then read only once during a system upgrade, a file server in a cluster downloads files and then distributes them to other nodes, a user searches in large files twice, etc. Maybe the first write shouldn't be counted as an access. Or maybe this could be easily prevented by using 3 accesses as the criterion instead: data that is accessed fewer than 3 times stays in the MRU.
The number of times could be user-configurable or, better, adaptive and self-tuned. Maybe the number of accesses to each block in MRU could be counted and used for self-tuning, e.g. MAX(3, average or maximum access count over the last 1% of MRU accesses)? Using only the last 1% would adapt quickly to changing workloads: say there are 100,000 blocks of data in MRU, only the last 1,000 accesses would be used. A rough sketch of this idea follows.
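A minimal sketch of the proposed self-tuning threshold, in C. All names here (record_mru_access, promotion_threshold, WINDOW) are invented for illustration and do not exist in OpenZFS; it only shows how a window of recent MRU access counts could drive a MAX(3, average) criterion:

```c
#include <stdio.h>

#define	WINDOW	1000	/* recent MRU accesses to remember ("last 1%") */

static unsigned int recent[WINDOW];
static unsigned long nrecent;

/* Record the running access count of an MRU buffer on each hit. */
static void
record_mru_access(unsigned int access_count)
{
	recent[nrecent++ % WINDOW] = access_count;
}

/* Promotion criterion: MAX(3, average access count over the window). */
static unsigned int
promotion_threshold(void)
{
	unsigned long n = nrecent < WINDOW ? nrecent : WINDOW;
	unsigned long sum = 0;

	for (unsigned long i = 0; i < n; i++)
		sum += recent[i];

	unsigned int avg = (n != 0) ? (unsigned int)(sum / n) : 0;
	return (avg > 3 ? avg : 3);
}

int
main(void)
{
	/* A scan-like phase: blocks are touched only once or twice. */
	for (int i = 0; i < 500; i++)
		record_mru_access(i % 2 ? 1 : 2);
	printf("threshold after scan: %u\n", promotion_threshold());

	/* A hotter phase: frequently re-read blocks raise the bar. */
	for (int i = 0; i < 500; i++)
		record_mru_access(8);
	printf("threshold after hot phase: %u\n", promotion_threshold());
	return (0);
}
```

With a floor of 3, one write plus one read would never be enough for promotion, while the moving average lets the threshold rise further under workloads whose MRU blocks are genuinely re-read many times.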
How to trigger the problem?
fio can be used to write 20 GiB of test data and then read it back, producing two accesses and polluting the MFU and MRU. After the fio run the MFU data size increased suddenly, presumably filled with test data that will never be needed again, and the MRU data size decreased suddenly as well.