
[IMPROVEMENT] No longer checks hash when searching files #132

Closed
fvalasiad wants to merge 1 commit

Conversation

fvalasiad
Collaborator

Implements the changes described in #131. Closes #131.

@fvalasiad fvalasiad requested a review from zvr December 5, 2022 04:29
@fvalasiad
Collaborator Author

Also closes #123, since the segfault occurs in the hash check.

@zvr
Collaborator

zvr commented Dec 9, 2022

I'm not sure I understand... If we don't check the hash, how can we decide whether this is the same file object?

[trivia: there is exactly one (1) system call whose result can be cached, i.e., a second call will generate the same result. Everything else has to be checked every time.]

@fvalasiad fvalasiad marked this pull request as draft December 10, 2022 08:07
@fvalasiad fvalasiad marked this pull request as ready for review December 10, 2022 09:23
@fvalasiad
Collaborator Author

@zvr

  1. The process opens a file with abspath = /path/to/file for reading or for writing
  2. A new entry is pushed onto our finfo array; note that it ends up on top of the array
  3. find_finfo looks for an entry whose abspath field matches, scanning last to first (see the sketch below)
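
To make the lookup concrete, here is a minimal sketch of the abspath-only backwards search proposed here. The struct layout and the names finfo / nfinfo are illustrative, not the actual build_recorder definitions; only find_finfo and the abspath field come from the description above.

```c
/*
 * Minimal sketch of the abspath-only backwards lookup; struct layout and
 * the names "finfo" / "nfinfo" are illustrative, not the real definitions.
 */
#include <stddef.h>
#include <string.h>

struct finfo {
	char *abspath;  /* absolute path the process opened */
	char *hash;     /* content hash; no longer consulted by the lookup */
};

static struct finfo finfo[4096];  /* newest entry sits at finfo[nfinfo - 1] */
static size_t nfinfo;

/* Scan last to first so the most recently recorded entry wins. */
static struct finfo *
find_finfo(const char *abspath)
{
	for (size_t i = nfinfo; i-- > 0; )
		if (strcmp(finfo[i].abspath, abspath) == 0)
			return &finfo[i];
	return NULL;  /* path not recorded yet */
}
```

Because the scan starts from the end, the entry recorded most recently for a given abspath is the one returned.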

The reason we have also checked the hash so far is that abspath isn't unique in our finfo array: two entries can share the same abspath.

(1) For example:

  1. The file is written many times
  2. The file at /path/to/file is rename(2)d to another path /path/to/newfile, and a new file later appears at /path/to/file
  3. etc...

So there are scenarios where matching on abspath alone looks problematic. The point I am trying to make, though, is that none of these known (1) problems actually makes this PR produce wrong results.

  1. If a file is written many times, the last write is all we care about, and it sits in the topmost position of the finfo array, which means find_finfo(3) will find it first.
  2. In the rename scenario, the new file represented by abspath = /path/to/file is also closer to the top than previous versions of /path/to/file, so find_finfo(3) again finds the correct one first, since it searches backwards.

In other words, I am asking whether you can pinpoint any scenario where the first file to match abspath when searching backwards is not actually the file we are looking for.

I've found one so far: when the process uses sync(2). That is something we can trace, though, as described in #136.

We could even make a dummy test of this by taking the main branch and adding some tracking code that counts how many lookups pass the first test (matching abspath) and then how many of those actually fail the second test (matching hash). My argument is that the answer will be 0.
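
A rough sketch of what that tracking code could look like; the counters and the same_file() helper are hypothetical names used only for illustration, standing in for the existing comparison on the main branch:

```c
/*
 * Sketch of the tracking code described above: count how often an entry
 * matches on abspath (first test) but then fails the hash comparison
 * (second test). Names here are hypothetical.
 */
#include <stdio.h>
#include <string.h>

struct finfo {
	char *abspath;
	char *hash;
};

static unsigned long abspath_hits;     /* passed the first test */
static unsigned long hash_mismatches;  /* ...but failed the second */

static int
same_file(const struct finfo *f, const char *abspath, const char *hash)
{
	if (strcmp(f->abspath, abspath) != 0)
		return 0;
	abspath_hits++;
	if (strcmp(f->hash, hash) != 0) {
		hash_mismatches++;  /* abspath alone would have picked the wrong entry */
		return 0;
	}
	return 1;
}

/* Dump the counters when the recorder exits to see whether the answer is 0. */
static void
report_counters(void)
{
	fprintf(stderr, "abspath matches: %lu, hash mismatches: %lu\n",
	    abspath_hits, hash_mismatches);
}
```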

I will perform this exact test and post the results.

@fvalasiad
Collaborator Author

The result when compiling the xbps project (about 8 seconds of compile time) is that only 3 lookups passed the first test but failed the second.

As you mentioned, this isn't among our current goals, but if we can get that number down to 0 it will be a very worthwhile optimization in the future.

@zvr
Collaborator

zvr commented Dec 10, 2022

We are not talking about an isolated system.

Don't over-think it; there are cases where even without recording any write action you cannot say the file is the same.

Example: you run build_recorder make, which is supposed to compile a number of files. You record the first use of cc by adding /usr/bin/cc to finfo. When compiling the second file, /usr/bin/cc is re-used.

But unbeknownst to you, an apt update is also running concurrently, and the compiler gets updated between the first and second invocation.

Without checking the hash, you would report that the older version of the compiler was used in the second compilation, which is not correct.

@fvalasiad
Collaborator Author

@zvr Well, yes, I made the assumption that the files the build process handles aren't updated by external means. If we have to handle such a scenario, this pull request is indeed invalid, and thus I am closing it.

@fvalasiad fvalasiad closed this Dec 10, 2022
@fvalasiad
Collaborator Author

@zvr We definitely need to look for ways to skip hashing files in the future, though. One thing I can think of is checking modification dates.
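
For illustration only, a rough sketch of that idea, assuming the st_mtim from a previous stat(2) is cached per entry; the maybe_unchanged() helper is hypothetical and not part of this PR:

```c
/*
 * Rough sketch of the "check modification dates" idea: skip re-hashing a
 * file whose st_mtim matches the value cached when it was first recorded.
 */
#include <stdbool.h>
#include <sys/stat.h>

static bool
maybe_unchanged(const char *abspath, struct timespec cached_mtime)
{
	struct stat sb;

	if (stat(abspath, &sb) != 0)
		return false;  /* can't tell; re-hash to be safe */
	return sb.st_mtim.tv_sec == cached_mtime.tv_sec &&
	    sb.st_mtim.tv_nsec == cached_mtime.tv_nsec;
}
```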

@zvr
Collaborator

zvr commented Dec 10, 2022

Modification dates can be set arbitrarily, so they can't be used to reliably determine whether something has been modified. Check touch(1) and utime(2), for example.
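
For example, a tiny sketch (the filename is arbitrary) showing how utime(2) lets a file's owner claim any modification time at all, which is the kind of change touch -d makes:

```c
/*
 * Why mtime proves nothing: the file's owner (or a privileged process)
 * can set it to any value. "scratch.txt" is just an example file.
 */
#include <stdio.h>
#include <utime.h>

int
main(void)
{
	/* Claim the file was last modified at the Unix epoch. */
	struct utimbuf tb = { .actime = 0, .modtime = 0 };

	if (utime("scratch.txt", &tb) != 0) {
		perror("utime");
		return 1;
	}
	puts("scratch.txt now reports an mtime of 1970-01-01");
	return 0;
}
```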

@fvalasiad
Collaborator Author

@zvr All this time I thought touch(1) was just a tool to create files. The more you know.

Well we will see in the future I guess...

@fvalasiad fvalasiad deleted the optimize branch December 10, 2022 10:31