
[IMPROVEMENT] No longer checks hash when searching files #132

Closed
fvalasiad wants to merge 1 commit

Conversation

fvalasiad
Collaborator

Implements the changes described in #131. Closes #131.

@fvalasiad fvalasiad requested a review from zvr December 5, 2022 04:29
@fvalasiad
Collaborator Author

Also closes #123, since the segfault occurs in the hash check.

@zvr
Collaborator

zvr commented Dec 9, 2022

I'm not sure I understand... If we don't check the hash, how can we decide whether this is the same file object?

[trivia: there is exactly one (1) system call whose result can be cached, i.e., a second call will generate the same result. Everything else has to be checked every time.]

@fvalasiad fvalasiad marked this pull request as draft December 10, 2022 08:07
@fvalasiad fvalasiad marked this pull request as ready for review December 10, 2022 09:23
@fvalasiad
Collaborator Author

@zvr

  1. The process opens a file with abspath = /path/to/file for reading or for writing
  2. A new entry is pushed onto our finfo array; note that it ends up on top of the array
  3. find_finfo looks for an entry whose abspath field matches, scanning last to first (see the sketch below)
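
To make the lookup concrete, here is a minimal sketch of the abspath-only backwards search proposed here. The struct layout and the names finfo / nfinfo are illustrative, not the actual build_recorder definitions; only find_finfo and the abspath field come from the description above.

```c
/*
 * Minimal sketch of the abspath-only backwards lookup; struct layout and
 * the names "finfo" / "nfinfo" are illustrative, not the real definitions.
 */
#include <stddef.h>
#include <string.h>

struct finfo {
	char *abspath;  /* absolute path the process opened */
	char *hash;     /* content hash; no longer consulted by the lookup */
};

static struct finfo finfo[4096];  /* newest entry sits at finfo[nfinfo - 1] */
static size_t nfinfo;

/* Scan last to first so the most recently recorded entry wins. */
static struct finfo *
find_finfo(const char *abspath)
{
	for (size_t i = nfinfo; i-- > 0; )
		if (strcmp(finfo[i].abspath, abspath) == 0)
			return &finfo[i];
	return NULL;  /* path not recorded yet */
}
```

Because the scan starts from the end, the entry recorded most recently for a given abspath is the one returned.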

The reason we have also checked the hash so far is that abspath isn't unique in our finfo array: two entries can share the same abspath.

(1) For example:

  1. The file is written many times
  2. The file at /path/to/file is rename(2)d to another path /path/to/newfile, and a new file later appears at /path/to/file
  3. etc...

So there are scenarios where matching on abspath alone looks problematic. The point I am trying to make, though, is that none of these known (1) problems actually makes this PR produce wrong results.

  1. If a file is written many times, the last write is all we care about, and it sits in the topmost position of the finfo array, which means find_finfo(3) will find it first.
  2. In the rename scenario, the new file represented by abspath = /path/to/file is also closer to the top than previous versions of /path/to/file, so find_finfo(3) again finds the correct one first, since it searches backwards.

In other words, I am asking whether you can pinpoint any scenario where the first file to match abspath when searching backwards is not actually the file we are looking for.

I've found one so far: when the process uses sync(2). That is something we can trace, though, as described in #136.

We could even make a dummy test of this by taking the main branch and adding some tracking code that counts how many lookups pass the first test (matching abspath) and then how many of those actually fail the second test (matching hash). My argument is that the answer will be 0.
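
A rough sketch of what that tracking code could look like; the counters and the same_file() helper are hypothetical names used only for illustration, standing in for the existing comparison on the main branch:

```c
/*
 * Sketch of the tracking code described above: count how often an entry
 * matches on abspath (first test) but then fails the hash comparison
 * (second test). Names here are hypothetical.
 */
#include <stdio.h>
#include <string.h>

struct finfo {
	char *abspath;
	char *hash;
};

static unsigned long abspath_hits;     /* passed the first test */
static unsigned long hash_mismatches;  /* ...but failed the second */

static int
same_file(const struct finfo *f, const char *abspath, const char *hash)
{
	if (strcmp(f->abspath, abspath) != 0)
		return 0;
	abspath_hits++;
	if (strcmp(f->hash, hash) != 0) {
		hash_mismatches++;  /* abspath alone would have picked the wrong entry */
		return 0;
	}
	return 1;
}

/* Dump the counters when the recorder exits to see whether the answer is 0. */
static void
report_counters(void)
{
	fprintf(stderr, "abspath matches: %lu, hash mismatches: %lu\n",
	    abspath_hits, hash_mismatches);
}
```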

I will perform this exact test and post the results.

@fvalasiad
Collaborator Author

The result when compiling the xbps project (about 8 seconds of compile time) is that only 3 lookups passed the first test but failed the second.

As you mentioned, this isn't among our current goals, but if we can get that number down to 0 it will be a very worthwhile optimization in the future.

@zvr
Collaborator

zvr commented Dec 10, 2022

We are not talking about an isolated system.

Don't over-think it; there are cases where even without recording any write action you cannot say the file is the same.

Example: you run build_recorder make, which is supposed to compile a number of files. You record the first use of cc by adding /usr/bin/cc to finfo. When compiling the second file, /usr/bin/cc is re-used.

But unbeknownst to you, an apt update is also running concurrently, and the compiler gets updated between the first and second invocation.

Without checking the hash, you would report that the older version of the compiler was used in the second compilation, which is not correct.

@fvalasiad
Collaborator Author

@zvr Well, yes, I made the assumption that the files the build process handles aren't updated by external means. If we have to handle such a scenario, this pull request is indeed invalid, and thus I am closing it.

@fvalasiad fvalasiad closed this Dec 10, 2022
@fvalasiad
Collaborator Author

@zvr We definitely need to look for ways to skip hashing files in the future, though. One thing I can think of is checking modification dates.
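
For illustration only, a rough sketch of that idea, assuming the st_mtim from a previous stat(2) is cached per entry; the maybe_unchanged() helper is hypothetical and not part of this PR:

```c
/*
 * Rough sketch of the "check modification dates" idea: skip re-hashing a
 * file whose st_mtim matches the value cached when it was first recorded.
 */
#include <stdbool.h>
#include <sys/stat.h>

static bool
maybe_unchanged(const char *abspath, struct timespec cached_mtime)
{
	struct stat sb;

	if (stat(abspath, &sb) != 0)
		return false;  /* can't tell; re-hash to be safe */
	return sb.st_mtim.tv_sec == cached_mtime.tv_sec &&
	    sb.st_mtim.tv_nsec == cached_mtime.tv_nsec;
}
```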

@zvr
Collaborator

zvr commented Dec 10, 2022

Modification dates can be set arbitrarily, so they can't be used to reliably determine whether something has been modified. Check touch(1) and utime(2), for example.
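
For example, a tiny sketch (the filename is arbitrary) showing how utime(2) lets a file's owner claim any modification time at all, which is the kind of change touch -d makes:

```c
/*
 * Why mtime proves nothing: the file's owner (or a privileged process)
 * can set it to any value. "scratch.txt" is just an example file.
 */
#include <stdio.h>
#include <utime.h>

int
main(void)
{
	/* Claim the file was last modified at the Unix epoch. */
	struct utimbuf tb = { .actime = 0, .modtime = 0 };

	if (utime("scratch.txt", &tb) != 0) {
		perror("utime");
		return 1;
	}
	puts("scratch.txt now reports an mtime of 1970-01-01");
	return 0;
}
```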

@fvalasiad
Collaborator Author

@zvr All this time I thought touch(1) was just a tool to create files. The more you know.

Well we will see in the future I guess...

@fvalasiad fvalasiad deleted the optimize branch December 10, 2022 10:31