Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improvement(setup): improve caching of local tarballs #557

Merged
merged 1 commit into from
Feb 6, 2024

Conversation

juliayakovlev
Copy link
Contributor

@juliayakovlev juliayakovlev commented Jan 28, 2024

If the test version is unstable (not release, maybe private branch) and installation folder exists, we want to check if this version was downloaded nd installed already in the past and was changed.
In this case the version should be downloaded and installed again. Compare hash of saved version (it is saved in the 'scylla-core-package/source.txt' file) and hash of a new package.
If it is same - skip the download. If not - remove existing folder and download again.

Tests:

@juliayakovlev juliayakovlev force-pushed the improve_local_tarballs_caching branch 3 times, most recently from 22be9ac to 2c32b7a Compare January 28, 2024 17:40
@juliayakovlev juliayakovlev marked this pull request as ready for review January 29, 2024 07:56
@juliayakovlev juliayakovlev self-assigned this Jan 29, 2024
@fruch fruch requested a review from tchaikov January 31, 2024 08:26
@@ -172,7 +172,8 @@ def get_url_hash(url: str) -> str:
"""

if os.path.exists(url): # if file/dir is local, hash based on the path
return hashlib.md5(url).hexdigest()
# Strings must be encoded before hashing
return hashlib.md5(url.encode('utf-8')).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might worth adding a unit test for this case (we have test for downloading, but not for using local file)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@juliayakovlev juliayakovlev force-pushed the improve_local_tarballs_caching branch 3 times, most recently from 2d5f7dd to 262b1ed Compare January 31, 2024 17:29
@juliayakovlev juliayakovlev force-pushed the improve_local_tarballs_caching branch 2 times, most recently from 9823d09 to 64970e2 Compare February 1, 2024 08:01
ccmlib/common.py Outdated
# get hash from file
for line in lines:
if line.startswith("hash="):
current_hash = line.replace("hash=", "").strip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that the source_file does not contain a line of "hash=.."? is it possible that it contains multiple lines of "hash=.."? if both answers are "no", i'd suggest put something like:

if not source_file.exists():
    return ""
with open(source_file, 'r') as f:
    for line in f:
        if line.startswith("hash="):
            return line.replace("hash=", "").strip()
throw ArgumentError(f"{source_file} does not contains the hash")

less indent this way. and less tolerant for sure.

Copy link
Contributor Author

@juliayakovlev juliayakovlev Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few lines with "hash" is not possible because we replace whole file and not add the lines.

No "hash": it is possible. Due to this I added the comment:

https://github.com/scylladb/scylla-ccm/pull/557/files#:~:text=%23%20Current%20hash%20may,start%20downloading%20again

# Current hash may be None. It may be due to uncompleted downloading and installation.
            # Or because of it is old installation that was not saved the hash yet.
            # In any case the new download and installation should be performed.
            # For this goal we need to remove existing folder and start downloading again

It means that if hash is not in the source.txt file - version will be downloaded.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see. then still good to return early in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think we need to return and stop the test? Maybe there is the reason that I did not think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @tchaikov is talking more on the style, i.e. readability of it, nothing logical you are missing here.
i.e. no need for exception, but the style suggestion still can be relevert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov I changed it, let me know if it is more clear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. indeed. i was concerned about the readability. the code was and still correct.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a couple comments.

if validate_by_hash and packages:
"""
Validate if packages hash was changed and the new package(s) have to be downloaded
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, i'd suggest use a comment instead of a multi-line string for comment. as it cannot be used as doc-string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, if we had a linter of any kind on this project, it would have complain about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 334 to 335
if not (new_hash and current_hash) or new_hash != current_hash:
# remove version_dir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, new_hash should always be a non-empty string, right? so, probably we can simplify this a little bit. how about:

for package in zip(packages._fields, packages):
    # ...
    if current_hash is None or new_hash != current_hash:
        rmdirs(version_dir)
        version_dir = None
        break

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I do not know, may be from some reason it will be empty. I added this validation to be sure. If it will be None (I can not let you example now), the comparison new_hash != current_hash will fail

Copy link
Contributor

@tchaikov tchaikov Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see. the ! and not in the condition expression makes my eyes hurt. hence this suggestion. if we can trust the 'ETag' returned by S3, then it should not be empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if get_url_hash always returns string, there no need to check for emptiness of it return value.

also get_installed_scylla_package_hash doesn't return None (at least not now), it always return a string, could be empty, so using is None wouldn't work as one might expect.

so current_hash and new_hash != current_hash should be suffice
if get_url_hash might return None
so current_hash and new_hash and new_hash != current_hash should do the trick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we can trust. 'ETag' is not mandatory. https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html

@fruch please, let you opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov I changed it, let me know if it is more clear


def test_get_local_tarball_hash(self):
this_path = Path().resolve()
url_hash = get_url_hash(url=str(this_path / "tests" / "test_data" / "scylla_unified_master_2023_04_03.tar.gz"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is that file ? in the the source code ?

let make sure it's not a full sized release, the a minimal empty tar.gz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yeah, CI show there no such file:

FAILED tests/test_scylla_repository.py::TestScyllaRepositoryRelease::test_get_local_tarball_hash - requests.exceptions.MissingSchema: Invalid URL '/home/runner/work/scylla-ccm/scylla-ccm/tests/test_data/scylla_unified_master_2023_04_03.tar.gz': No scheme supplied. Perhaps you meant https:///home/runner/work/scylla-ccm/scylla-ccm/tests/test_data/scylla_unified_master_2023_04_03.tar.gz?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took is from unstable. It's size 166M

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yeah, CI show there no such file:

FAILED tests/test_scylla_repository.py::TestScyllaRepositoryRelease::test_get_local_tarball_hash - requests.exceptions.MissingSchema: Invalid URL '/home/runner/work/scylla-ccm/scylla-ccm/tests/test_data/scylla_unified_master_2023_04_03.tar.gz': No scheme supplied. Perhaps you meant https:///home/runner/work/scylla-ccm/scylla-ccm/tests/test_data/scylla_unified_master_2023_04_03.tar.gz?

Yeah, I did not added it. fixing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fruch
Oh, it is too big. It's is allowed 100m. I'll take any tarball. It is not must to be scylla package, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliayakovlev as I've said, if we can make it even smaller, I would rather not had 24mb binary on a git repository

and if we need something at this size for some reason, that it would fetch or generated during the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliayakovlev as I've said, if we can make it even smaller, I would rather not had 24mb binary on a git repository

and if we need something at this size for some reason, that it would fetch or generated during the test

replaced it with small 131K. is it ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better than 24Mb

@juliayakovlev juliayakovlev force-pushed the improve_local_tarballs_caching branch 4 times, most recently from 15e000f to 35aba32 Compare February 1, 2024 12:25
Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better. thanks!

@juliayakovlev juliayakovlev force-pushed the improve_local_tarballs_caching branch 2 times, most recently from bf32b6e to 7a4d007 Compare February 1, 2024 13:25
@juliayakovlev juliayakovlev force-pushed the improve_local_tarballs_caching branch 2 times, most recently from ad3a94e to 61671b9 Compare February 1, 2024 14:47
If the test version is unstable (not release, maybe private branch) and installation folder exists,
we want to check if this version was downloaded nd installed already in the past and was
changed.
In this case the version should be downloaded and installed again.
Compare hash of saved version (it is saved in the 'scylla-core-package/source.txt' file) and hash
of a new package.
If it is same - skip the download. If not - remove existing folder and download again.
@juliayakovlev
Copy link
Contributor Author

juliayakovlev commented Feb 6, 2024

@fruch @bhalevy can you review, please?

Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@juliayakovlev can you also update dtest readme, and remove the comment about this issue?

@fruch fruch merged commit cf94a58 into scylladb:next Feb 6, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants