Used size increased by ~400GB while defragmenting #266

Open

Timarrr opened this issue Sep 4, 2023 · 2 comments

Timarrr commented Sep 4, 2023

I had half a terabyte left on my 4TB HDD and wanted to dedupe it to increase available space. After running bees for over 36 hours, btrfs filesystem usage -h /hdd reports Free (estimated): 161.13GiB. Bees is still buzzing along, and free space has stopped shrinking around this point. I also have to mention that the 4GB hash table started overfilling, so I had to restart beesd with an 8GB DB size in the config (sketch after the histogram below).

Hash table page occupancy histogram (339892117/536870912 cells occupied, 63%)
                                                                 1048576 pages
                                                               # 524288
                                                               # 262144
               ####                                            # 131072
              ######                                           # 65536
             ########                                          # 32768
            ##########                                        ## 16384
            ##########                                       ### 8192
           ############                                     #### 4096
           #############                                   ##### 2048
           #############                                  ###### 1024
          ###############                                ####### 512
          ###############                               ######## 256
          ################                             ######### 128
         #################                             ######### 64
         ##################                           ########## 32
         ##################                          ########### 16
        #####################                       ############ 8
        #####################                       ############ 4
        ######################   #           ## #  ############# 2
       #######################   #   ##  #   ## ################ 1
0%      |      25%      |      50%      |      75%      |   100% page fill
compressed 51958167 (15%)
uncompressed 287933950 (84%) unaligned_eof 266731 (0%) toxic 23379 (0%)
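
For reference, the DB size bump was done in the beesd config file. This is only a minimal sketch assuming the layout of the shipped beesd.conf.sample; the file name, UUID and service name below are placeholders, so check your own setup:

# /etc/bees/<filesystem-uuid>.conf  -- placeholder path
UUID=00000000-0000-0000-0000-000000000000    # filesystem UUID (placeholder)
DB_SIZE=$((8*1024*1024*1024))                # hash table size in bytes, bumped from 4GiB to 8GiB

# then restart the service, e.g.:
# systemctl restart beesd@00000000-0000-0000-0000-000000000000.service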

Another thing: bees seems to spam the following message, sometimes for 15 seconds straight:
2023-09-05 02:11:33 513194.513219<7> crawl_5_680152: exception (ignored): exception type std::runtime_error: FIXME: too many duplicate candidates, bailing out here
Is this bad?

Timarrr (Author) commented Sep 7, 2023

Update:
Free space now reports around 300GiB, but I needed to increase the DB size to 12GiB to keep it from overfilling.
I also found out that bees performs WAY better with one thread in my situation: in the worst case, with very frequent seeks, it still sits at 3-4MB/s, but now it sometimes reaches 100-something MB/s. With one thread it also doesn't load the system nearly as much (with default settings all my cores were busy waiting on I/O and the load average was ~12; now it's only 1-2), and the HDD doesn't heat up as much.
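
For completeness, a sketch of the single-thread setup, assuming the OPTIONS variable from beesd.conf.sample is what ends up on the bees command line (double-check your beesd wrapper version; --thread-count is listed in the bees options):

# /etc/bees/<filesystem-uuid>.conf  -- same placeholder file as above
OPTIONS="--thread-count 1"    # limit bees to a single worker thread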

kakra (Contributor) commented Sep 7, 2023

I'm not sure the DB overfilling is really such a big issue. In the end, it's okay to push out older hashes and keep the hashes for big blocks, and you don't want too many shared extents per hash anyway. So you probably don't want to keep hashes for small blocks, because that's like spending 99% of the time for 1% of the space savings.

Also, the problem with multiple threads is more likely lock contention in btrfs. But I'm not sure whether bees does any seek optimization by re-ordering queued jobs, so seeking may be an issue, too.

What you observe for space is documented behavior of bees, especially when coming from other dedup programs: before any space is freed, used space fills up (or free space stops growing) until the effort of bees finally resolves into freeing the extents, once the final snapshot sharing them has been deduplicated.
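
A quick way to sanity-check that sharing is actually happening while free space looks flat (rough sketch; /hdd is the mount point from the first comment, and scanning a full 4TB filesystem this way can take a while):

btrfs filesystem du -s /hdd       # "Set shared" should grow as dedup progresses
btrfs filesystem usage -h /hdd    # overall allocation, same command as used above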
