Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regarding performance #131

Open
fvalasiad opened this issue Dec 5, 2022 · 6 comments
Open

Regarding performance #131

fvalasiad opened this issue Dec 5, 2022 · 6 comments
Assignees

Comments

@fvalasiad
Copy link
Collaborator

@zvr After some profiling, turns out a massive performance hit occurs due to our "file entries re-usage".

While compiling the xbps package manager, i made the following measurements:

(All in CPU time)

without build recorder wrapping it: 6.934s
with build recorder wrapping it: 14.988s

That's an overhead of 8.064s, of which 5.180s is due to get_file_hash function call at src/tracer.c:255.

This raises the question, do we really need to search using the hash as well? Absolute paths are unique to begin with and we are already tracking rename(2) calls. This paired with the fact that the finfo array is sorted in ascending order with respect to time, is a guarantee that the first match on a backward search is the file we are looking for.

Is my logic flawed? Is the hash check there important?

@fvalasiad
Copy link
Collaborator Author

This PR #132 , makes the changes described above. As a result the new measurement for the build while build_recorder wraps it is 7.882s, which means an overhead of 0.948s compared to the former 8.064s.

Huge performance gain.

@fvalasiad
Copy link
Collaborator Author

Given the problems we encountered at #136, we can now confirm that the hash check isn't in vain. The performance overhead isn't to be ignored though so I think the solution provided by #136 is a much preferred one and if we were to anyway keep the hash instead, we should find ways to optimize the way we fetch hashes. Potentially by skipping work already done, checking a file's modification dates etc...

@fvalasiad
Copy link
Collaborator Author

So after properly profiling the code using valgrind. Even after the "optimization" mentioned above the three critical points in term of performance are:

  1. Computing the hash of each file, 70%
  2. find_finfo(3) & find_pinfo(3) most of the rest

The bigger the project, The bigger and more dominant the complexity of find_finfo(3) and find_pinfo(3) grows.

Now for 1 we can't do much. They are library calls so all we can do is decrease the number of calls we make to get_file_hash(3). I doubt we can find a faster algorithm(perhaps md5?) to hash the files.

find_pinfo(3) could be improved massively using two simple tactics:

  1. Keep a separate array that holds only the processes that are currently running and use it to search instead. That's because we only care for processes that are alive throughout our code.
  2. Extract pid field from PROCESS_INFO into a separate array. Purpose being that since we use it for searching, we don't want to pollute our caches with the rest of the struct's fields

This way we can keep a simple array used for find_pinfo(3).

For find_finfo(3) though we need to resort to other data structures, first thing i am going to try is a simple hashmap implementation.

@zvr
Copy link
Collaborator

zvr commented Dec 9, 2022

No matter the performance, the first goal would be to produce correct results.

Regarding the optimizations, my first thought would be that, once a process has ended, its data can probably be removed, as they will not be needed any more. Even if we stick with linear arrays (pinfo), the slot could be marked as free and re-used.

However, such optimizations are not the current goal.

@fvalasiad
Copy link
Collaborator Author

@zvr Well an update, after some extra tests.

It's the file reading that cripples performance, not the hash itself.

So that's the best we can get! Unless we can improve the way we read from disk somehow(hard to imagine).

@fvalasiad
Copy link
Collaborator Author

@zvr What about two options?

  1. One that assumes an immutable OS.
  2. One that assumes the build files are not to be modified by ANY external source

These give us respectively the ability to:

  1. Not use hash when searching for non-local files(libraries, executables and headers in /lib /bin etc.)
  2. Also skip checking hash on local files that haven't been modified by a write, rename, etc.

All of this is food for thought by the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants