
rosdistro_build_cache should have a max age option for the source repos cache #66

Open
mikepurvis opened this issue Feb 25, 2016 · 15 comments


@mikepurvis
Contributor

Via #65 (comment):

It would be better to optionally extend the cache building to explicitly know how to walk source repos. And then with the extended cache the rosinstall_generator could take advantage of the extended cache and fallback to upstream sources.

The biggest issue I see here is that, unlike with the GBP release branches, it's harder to tell whether a devel branch cache entry is stale. The best thing you could do would be to cache the commit hash, and then use git ls-remote to determine whether the branch has moved on from that point.
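A minimal sketch of that staleness check, assuming a plain git remote (the repository URL and branch below are placeholders, not part of any existing rosdistro API):

import subprocess

def devel_branch_is_stale(repo_url, branch, cached_hash):
    # Ask the remote for the current tip of the devel branch without cloning.
    output = subprocess.check_output(
        ['git', 'ls-remote', repo_url, 'refs/heads/%s' % branch])
    remote_hash = output.split()[0].decode() if output else None
    # Stale if the branch has moved past (or away from) the cached commit.
    return remote_hash != cached_hash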

Apart from that, there are two major ways I see this going:

  • rosdistro_build_cache gets a switch like --from-source, and the resulting cache format is unchanged. Has the disadvantage of needing to build a separate cache for source, but the advantage is that other tooling like rosinstall_generator is unchanged.
  • The cache format changes, and the source branch package.xml is cached alongside the release one.

Thoughts?

@dirk-thomas
Member

For release repositories the distribution file contains an exact version. Until that changes the cache entry is valid and doesn't have to be rechecked. That is why each cache update only takes a few seconds.

For the source entry the repository state can change anytime and therefore the script must query the state of the repo every time. Each cache update will take quite some time.

Tools like rosinstall_generator should therefore maybe clone the exact hash used for building the cache. Otherwise there will be inconsistencies between the cached information and the cloned repos.

Besides that, I don't see any other problem. I would suggest implementing the new option (to build a separate cache) and then gaining experience by using it and seeing if anything unforeseen happens.

@mikepurvis
Contributor Author

Tools like rosinstall_generator should therefore maybe clone the exact hash used for building the cache. Otherwise there will be inconsistencies between the cached information and the cloned repos.

Another way we're looking at tackling this (and by extension, the business of creating reproducible nightlies) is a rosdistro_freeze script, which creates a single commit swapping all source/devel branch pointers to hashes, and then tags that frozen branch with a timestamp. From there, we can generate a cache and create a deterministic "source" workspace from which to generate a build asset.
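As a rough sketch of what such a hypothetical rosdistro_freeze step could do, assuming the usual distribution.yaml layout with a url/version pair on each git source entry:

import subprocess
import yaml

def freeze_source_versions(distribution_yaml_path):
    # Replace each source entry's branch pointer with the commit it currently
    # points at, so the frozen file describes an immutable set of sources.
    with open(distribution_yaml_path) as f:
        dist = yaml.safe_load(f)
    for repo in dist.get('repositories', {}).values():
        source = repo.get('source')
        if not source:
            continue
        # Assumes a git source entry; other VCS types would need their own lookup.
        out = subprocess.check_output(
            ['git', 'ls-remote', source['url'], 'refs/heads/%s' % source['version']])
        if out:
            source['version'] = out.split()[0].decode()
    with open(distribution_yaml_path, 'w') as f:
        yaml.safe_dump(dist, f, default_flow_style=False)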

Would be happy to discuss further how this workflow would go and how we might be able to collaborate on it.

(For now, our plan had been to build all/most of this as in-house tooling, but if a script like rosdistro_freeze would be welcome in python-rosdistro, we'd be delighted to contribute it...)

@dirk-thomas
Member

Just looking at the tools we already have and what they do:

  • rosinstall_generator: generate a .rosinstall file from either the release entries, or the upstream repo of the gbp pointing to the exact release tag, or the development branch which is specified in the source entry
  • wstool / vcstool: clone a set of repositories based on a .rosinstall / .repos file
    • after cloning, vcstool can e.g. generate a derived .repos file which uses the hashes instead of the branches

If we ignore efficiency for a second your workflow is almost covered by:

  • running rosinstall_generator to export the set of repos
  • running vcs import to clone the repos, running vcs export --exact to get the exact hashes

The remaining step would be to use the resulting set of repos to generate a manifest cache. This cache is slightly different from the standard rosdistro cache since it contains manifest files for packages the distribution file doesn't know about: unlike release entries, source entries do not specify the package names contained in the repo. And since the package names and their locations within a repo are unknown, I think this step will always require a (shallow) clone of the repos in question.

So I think the next step would be to write a script which generates a cache based on the manifest files found in a workspace which contains a set of cloned repos. Does that sound reasonable?
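A sketch of what such a script might do, assuming a workspace where each top-level directory is one cloned repo (names and layout are illustrative only):

import os

def collect_manifests(workspace_dir):
    # Map each top-level repo directory to the package.xml files found beneath it,
    # keyed by their path relative to the repo root.
    cache = {}
    for repo_name in sorted(os.listdir(workspace_dir)):
        repo_path = os.path.join(workspace_dir, repo_name)
        if not os.path.isdir(repo_path):
            continue
        manifests = {}
        for dirpath, dirnames, filenames in os.walk(repo_path):
            if 'package.xml' in filenames:
                with open(os.path.join(dirpath, 'package.xml')) as f:
                    manifests[os.path.relpath(dirpath, repo_path)] = f.read()
                # A package.xml marks a package root; don't recurse into it.
                dirnames[:] = []
        cache[repo_name] = manifests
    return cache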

@tfoote
Member

tfoote commented Feb 25, 2016

I fixed the above usage to use the correct --exact option.

@mikepurvis
Contributor Author

I'd really like to have the tag information all the way back at the master rosdistro level, though, rather than only in the rosinstall file (a derived entity). There are a variety of reasons for this, but a big one is giving us the ability to +1 it: I want all the sources from that nightly, plus one slight change (for example, a set of PR branches). I also don't care for the snapshot definition being only a rosinstall file + cache, since it's not as clear how that would be stored, whereas a git tag of the rosdistro repo is an extremely natural and obvious means of storing it.

And, of course, efficiency. Doing git ls-remote on every source repo is fast compared with cloning everything and working backwards from a created workspace. Even if you do end up cloning everything anyway to generate the cache, it's still preferable to have the snapshot operation itself be a fast, relatively atomic affair.

So yes, it would be reasonable to generate the cache from an existing workspace, but if we have a snapshotted/frozen rosdistro anyway, and we already have a suite of functions capable of quickly fetching package.xml files given git URLs, it seems like it would be preferable to maintain the semantics of rosdistro_build_cache, either with a flag or a separate rosdistro_build_source_cache: ROSDISTRO_INDEX_URL in, *-cache.yaml files out. The only gap (which you've identified) is finding package.xml files in subdirectories. All that means is that a shallow clone will always be required for the source cache, at least until someone implements spidering repos with the trees API, for example.
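For illustration, the shallow-clone step could look roughly like this (a sketch only; the real manifest providers in python-rosdistro have their own interfaces, and the function name here is made up):

import os
import subprocess
import tempfile

def source_manifests(repo_url, branch):
    # Shallow-clone only the tip of the devel branch, then collect every
    # package.xml keyed by its path relative to the repository root.
    checkout = tempfile.mkdtemp()
    subprocess.check_call(
        ['git', 'clone', '--quiet', '--depth', '1', '--branch', branch, repo_url, checkout])
    commit = subprocess.check_output(
        ['git', '-C', checkout, 'rev-parse', 'HEAD']).strip().decode()
    manifests = {}
    for dirpath, _dirnames, filenames in os.walk(checkout):
        if 'package.xml' in filenames:
            with open(os.path.join(dirpath, 'package.xml')) as f:
                manifests[os.path.relpath(dirpath, checkout)] = f.read()
    return commit, manifests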

@dirk-thomas
Member

I don't understand what you mean by "master distro level". On one hand you mention it being updated automatically on e.g. a nightly basis. On the other hand you want the ability to +1 those. (Maybe we should talk via Hangout to figure out the exact needs / goals?)

Since the source repos don't have to be on GitHub the tree API can only be an optimization. The script needs to be able to work for arbitrary repos (even non-git).

@mikepurvis
Contributor Author

I'm conflating two related use cases. The first is freezing the rosdistro for the purposes of cutting an overall release of the software stack; that's where the +1 builds are most critical. The second is generating nightly builds. +1 builds of the nightly may be important, but in that case it's more about easy reproducibility. The developer debugging a failed nightly shouldn't be casting around for a generated rosinstall file that was stashed somewhere; they should be checking out a tag from the rosdistro and pointing rosinstall_generator at that.

Re: Github. Yup, it's just that the current Github manifest provider won't work in a source branch cache builder, since it depends on the package.xml being at the root.

To respond to this implementation proposal more directly:

... I think the next step would be to write a script which generates a cache based on the manifest files found in a workspace which contains a set of cloned repos. Does that sound reasonable?

This feels brittle. There would be a rosdistro_cache_from_workspace tool, but it would only be a cache of as much of the rosdistro as you happened to have in your workspace at that particular moment, and only at the branches you happened to have checked out. This would be fine for automated workflows, but it would be hard to document and easy for a user to screw up.

I'd really rather generate a source distro cache directly from the distribution.yaml itself, even if that does necessitate a shallow clone of every repo whose devel branch has changed.

@mikepurvis
Contributor Author

From a phone discussion between @dirk-thomas, @jjekircp, and myself on March 2:

  • rosinstall_generator's ability to construct workspaces of source repos would be improved by having up-to-date dependency information.
  • A centralized cache that is generally external to the repo itself (like the release one) is preferred.
  • Because of file size concerns, the initial implementation should save the source branch cache to a separate file (probably from a new CLI script, e.g. rosdistro_build_source_cache?)
  • Unlike the release cache, which is just a dict keyed on repo name, the format should include at a minimum the hash of the commit it comes from, the list of packages in the repo (since that info is otherwise unknown for source branches), and the date fetched. It's conceivable that some effort could be made toward de-duplication with the release cache, where a sentinel value represents "no change since the last released version", or we could just assume that gzip handles this well enough.
  • rosinstall_generator in source mode would default to just trusting the cache (for speed), but would have a flag to git ls-remote each repo down the chain and compare hashes to the ones reported in the cache, and make fetches to update the in-memory cache as required.
  • This will all still be best-effort, since a source .rosinstall by definition points to branches that are in flux. You could generate a workspace definition today based on deps in an up-to-date cache, and not actually clone the repos until tomorrow, by which point someone could have added a dependency that breaks it. The only way to be fully deterministic with this is by freezing a rosdistro snapshot, so that the source "versions" become immutable.
  • A big part of making all this realistic is giving the manifest_provider classes the ability to efficiently find package XMLs not in the root path of the repositories. This could look like downloading tarballs into memory and using the python tarfile module to find package XMLs? Or maybe using more provider-specific APIs?

I think that mostly covers it, at least with respect to changes in this repo.
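For the tarball idea in the last bullet, a rough sketch might look like the following (the archive URL pattern is GitHub-specific and would live behind the provider abstraction; the function name is hypothetical):

import io
import os
import tarfile
import urllib2  # Python 2, matching the other examples in this thread

def manifests_from_tarball(tarball_url):
    # Download the archive of a repo at a given ref into memory and return
    # every package.xml inside it, keyed by its path within the archive.
    data = urllib2.urlopen(tarball_url).read()
    tar = tarfile.open(fileobj=io.BytesIO(data), mode='r:gz')
    manifests = {}
    for member in tar.getmembers():
        if member.isfile() and os.path.basename(member.name) == 'package.xml':
            manifests[member.name] = tar.extractfile(member).read()
    return manifests

# e.g. manifests_from_tarball('https://github.com/ros/ros_comm/archive/indigo-devel.tar.gz')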

@dirk-thomas
Member

Regarding the content of the source cache: I think it might be helpful to also store the relative path of each package in the repo.

@mikepurvis
Contributor Author

Is the thinking there that it assists with future package.xml grabs? I'm unsure what's gained, since you always need to clone the whole repo anyway in order to check for new packages. That said, relative path could be helpful for a future rosinstall_generator/wstool arrangement capable of extracting subdirectories.

In any case, I'm thinking of a format something like:

source_repo_xmls: {
  'my_fancy_repo': {
    'b858cb282617fb0956d960215c8e84d1ccf909c6': {
      'pkgname1': ['path/to/pkgname1', '<package>....</package>'],
      'pkgname2': ['path/to/pkgname2', '<package>....</package>'],
      'pkgname3': ['path/to/pkgname3', '<package>....</package>'],
    }
  },
  'next_fancy_repo': {
    'da39a3ee5e6b4b0d3255bfef95601890afd80709': {
      'pkgname1': ['path/to/pkgname1', '<package>....</package>'],
      'pkgname2': ['path/to/pkgname2', '<package>....</package>'],
      'pkgname3': ['path/to/pkgname3', '<package>....</package>'],
    }
  }
}

Here there is provision for storing multiple versions, but my assumption is that the initial implementation would store only the newest, and it would be up to a policy on the consuming end to either a) trust that, or b) check each one and freshen as required.

@dirk-thomas
Member

The proposed format looks good to me.

@dirk-thomas
Member

Yes, I thought it might be helpful to pull newer manifests with the knowledge of the path.

@mikepurvis
Contributor Author

mikepurvis commented Apr 22, 2016

One potential argument for including both caches in the same file, and having switches on rosdistro_build_cache to control which are included: a massive amount of de-duplication can occur completely for free via YAML references, e.g.:

import yaml

# Allow PyYAML to emit anchors/aliases even for plain scalars like strings.
yaml.Dumper.ignore_aliases = lambda *args: False

print yaml.dump({
  1: "abcdef",
  2: "abcdef"
})

Result:

$ python test-yaml.py
{1: &id001 abcdef, 2: *id001}

EDIT: Looks like it's not completely for free: the above code benefits from Python interning the two identical string literals into a single object, which PyYAML then aliases by identity. In the real world, it would be like this:

import urllib2, yaml
yaml.Dumper.ignore_aliases = lambda *args : False

xml_release = urllib2.urlopen('https://raw.githubusercontent.com/ros-gbp/ros_comm-release/release/indigo/roscpp/package.xml').read()
xml_source = urllib2.urlopen('https://raw.githubusercontent.com/ros/ros_comm/indigo-devel/clients/roscpp/package.xml').read()

# De-duplication only happens with this line present.
if xml_release == xml_source: xml_source = xml_release

print yaml.dump({ 'rel': xml_release, 'src': xml_source })

@mikepurvis
Contributor Author

This ticket could be closed in the sense that the feature is now available (we've been using it for the better part of a year); however, it would be great to have the source cache max age option completed, as discussed in #84 (diff). That would enable the source cache to be turned on for the main buildfarm, which would significantly broaden the availability and exposure of the feature.
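A rough sketch of what such a max-age policy could look like when rebuilding the cache (the 'date_fetched' field name, timestamp format, and function name are placeholders, not the actual interface proposed in #84):

import datetime

def needs_refresh(cached_entry, max_age):
    # Only re-query a source repo if its cache entry is older than the
    # configured maximum age; otherwise trust the cached manifests as-is.
    fetched = datetime.datetime.strptime(cached_entry['date_fetched'], '%Y-%m-%dT%H:%M:%S')
    return datetime.datetime.utcnow() - fetched > max_age

# e.g. needs_refresh(entry, datetime.timedelta(days=7))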

@dirk-thomas dirk-thomas changed the title rosdistro_build_cache should have an option to cache source repo rosdistro_build_cache should have a max age option for the source repos cache Jun 19, 2017
@dirk-thomas
Member

I updated the title to cover the remaining task.

@dirk-thomas dirk-thomas added this to the untargeted milestone Jun 24, 2020