Make out/ folder contents (more) reproducible and filesystem layout agnostic (1500USD bounty) #3660

Open
lihaoyi opened this issue Oct 4, 2024 · 2 comments
lihaoyi commented Oct 4, 2024


From the maintainer Li Haoyi: I'm putting a 1500USD bounty on this issue, payable by bank transfer on a merged PR implementing this.


The goal of this ticket is to make the out/ folder contents more reproducible, such that it contains the same bytes and hashes regardless of the user's filesystem layout outside of that folder. This would allow re-using the out/ folder as a build cache between different machines that may have the checkout in different places (e.g. /Users/alice/my-repository vs /Users/charlie/my-repository), both coarse grained (e.g. by sending over a zip file) and fine grained (via the Bazel remote cache protocol).

The main thing that needs to happen is that every os.Path and mill.api.PathRef that is serialized within a "known" directory needs to be normalized to a path relative to an abstract reference to that known directory. e.g.

  • /Users/alice/my-repository/out/foo/bar.dest/qux should be serialized as $WORKSPACE/out/foo/bar.dest/qux
  • /Users/lihaoyi/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.13.14/scala-library-2.13.14.jar should be serialized as $COURSIER_CACHE/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.13.14/scala-library-2.13.14.jar
  • /Users/alice/thing-outside-repository should be serialized as $HOME/thing-outside-repository
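To make the mapping above concrete, here is a minimal sketch of the normalization, written against plain java.nio so it stands alone. The names KnownRoots, Root, normalize and resolve are hypothetical and not part of Mill. The sketch assumes the most specific (longest) matching root should win, since the workspace and the Coursier cache may themselves live under the home directory.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper, not Mill API: maps absolute paths to "$ROOT/relative" strings.
object KnownRoots {
  final case class Root(placeholder: String, path: Path)

  // Path segments as strings, so the result always uses '/' as separator.
  private def segments(p: Path): Seq[String] =
    (0 until p.getNameCount).map(i => p.getName(i).toString)

  /** Replace the longest matching known root with its placeholder. */
  def normalize(p: Path, roots: Seq[Root]): String = {
    val matching = roots.filter(r => p.startsWith(r.path))
    if (matching.isEmpty) p.toString // not under any known root: left as-is
    else {
      val best = matching.maxBy(_.path.getNameCount)
      if (p == best.path) best.placeholder
      else (best.placeholder +: segments(best.path.relativize(p))).mkString("/")
    }
  }

  /** Inverse of normalize: expand a leading placeholder back to a real path. */
  def resolve(s: String, roots: Seq[Root]): Path = {
    val (head, rest) = s.span(_ != '/')
    roots.find(_.placeholder == head) match {
      case Some(root) => root.path.resolve(rest.stripPrefix("/"))
      case None       => Paths.get(s)
    }
  }
}
```

With roots for $WORKSPACE, $COURSIER_CACHE and $HOME registered, the same serialized string would resolve to alice's or charlie's checkout depending on which machine reads it.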

AFAIK the necessary known roots should all be available globally (e.g. mill.api.workspace.WorkspaceRoot.workspaceRoot, os.home, sys.env("COURSIER_CACHE")). It should be easy enough to add to the serialization logic:

  • mill.api.PathRef serialization
    implicit def jsonFormatter: RW[PathRef] = upickle.default.readwriter[String].bimap[PathRef](
      p => p.toString(),
      s => {
        val Array(prefix, valid0, hex, pathString) = s.split(":", 4)
        val path = os.Path(pathString)
        val quick = prefix match {
          case "qref" => true
          case "ref" => false
        }
        val validOrig = valid0 match {
          case "v0" => Revalidate.Never
          case "v1" => Revalidate.Once
          case "vn" => Revalidate.Always
        }
        // Parsing to a long and casting to an int is the only way to make
        // round-trip handling of negative numbers work =(
        val sig = java.lang.Long.parseLong(hex, 16).toInt
        val pr = PathRef(path, quick, sig, revalidate = validOrig)
        validatedPaths.value.revalidateIfNeededOrThrow(pr)
        pr
      }
    )
  • os.Path serialization
    implicit val pathReadWrite: RW[os.Path] = upickle.default.readwriter[String]
      .bimap[os.Path](
        _.toString,
        os.Path(_)
      )
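
As a rough illustration of where the normalization could hook into the PathRef string, the sketch below rewrites only the path component and leaves the other three fields untouched. It assumes the written form mirrors the parser above, i.e. prefix:valid:hex:path. PathRefString, mapPath and parseSig are hypothetical names, and the rewriting function is passed in by the caller rather than taken from Mill.

```scala
// Hypothetical helper, not Mill API.
object PathRefString {
  /** Apply `f` to the path component of a "prefix:valid:hex:path" string. */
  def mapPath(serialized: String)(f: String => String): String =
    // A limit of 4 keeps any ':' inside the path itself (e.g. a Windows
    // drive letter) as part of the last field, as in the parser above.
    serialized.split(":", 4) match {
      case Array(prefix, valid, hex, path) =>
        Seq(prefix, valid, hex, f(path)).mkString(":")
      case _ =>
        throw new IllegalArgumentException(s"Not a PathRef string: $serialized")
    }

  /** Same trick as the parser above: go through Long so that hex values
    * with the top bit set round-trip to negative Ints. */
  def parseSig(hex: String): Int = java.lang.Long.parseLong(hex, 16).toInt
}
```

The same function could be applied in both directions: with a normalizing `f` on write and an expanding `f` on read, before the string reaches os.Path(...).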

Apart from PathRef and Path, we will also need to deal with:

  • Files in out/ which are naturally non-deterministic: mill-profile.json, mill-chrome-profile.json, mill-server/* and mill-no-server/*, etc.

  • Modified times are also expected to vary. These may need to be zeroed out in the process of making zip and jar files such that they do not affect the byte contents, and ignored as part of any equivalence comparison.

  • Any foo.json files belonging to workers can also be expected to differ since they contain the toString of the worker, and may need to be renamed to foo.worker.json or similar to make them identifiable.

  • There will also be inherent differences between files generated on different platforms (e.g. native binaries). This is fine for now, and likely unavoidable.

  • There may be other files that need to be made reproducible that are not listed here
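
For the mtime point above, a minimal sketch of writing a zip whose bytes do not depend on when or in what order the entries were added. This uses only java.util.zip and is not how Mill currently builds its jars. DeterministicZip and FixedTime are hypothetical names, the timestamp is an arbitrary choice, and setTimeLocal requires Java 9 or later (it is used here so that the stored time does not depend on the machine's time zone).

```scala
import java.io.ByteArrayOutputStream
import java.time.LocalDateTime
import java.util.zip.{ZipEntry, ZipOutputStream}

// Hypothetical helper, not Mill API.
object DeterministicZip {
  // Arbitrary fixed timestamp; any constant after 1980 works for this sketch.
  val FixedTime: LocalDateTime = LocalDateTime.of(2010, 1, 1, 0, 0, 0)

  /** Zip the given name -> content entries with sorted order and fixed mtimes. */
  def zipBytes(entries: Map[String, Array[Byte]]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zip = new ZipOutputStream(bytes)
    try {
      // Sort by name so the insertion order of the input does not leak into the output.
      for ((name, content) <- entries.toSeq.sortBy(_._1)) {
        val entry = new ZipEntry(name)
        entry.setTimeLocal(FixedTime) // same timestamp for every entry, every run
        zip.putNextEntry(entry)
        zip.write(content)
        zip.closeEntry()
      }
    } finally zip.close()
    bytes.toByteArray
  }
}
```

Compressed output can still differ between JDK or zlib versions, so this only removes the timestamp and ordering sources of variation.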

The success criteria would be a test in integration/feature/ that:

  • Copies the code in example/scalalib/web/5-webapp-scalajs-shared into two separate subfolders.
    • The choice of example/scalalib/web/5-webapp-scalajs-shared is somewhat arbitrary, but should give us good coverage of a variety of Mill module and task types, exercising a wide range of code paths
  • Runs ./mill runBackground && ./mill clean runBackground && ./mill jar && ./mill assembly in each folder
    • (one with a custom COURSIER_CACHE and -Duser.home passed in),
  • Does a file-by-file, byte-for-byte comparison of the two out/ folders under some normalization criteria (ignoring the expected-to-differ files and ignoring mtimes), to assert that their contents are identical
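
The comparison step could look roughly like the sketch below, which hashes every regular file under each folder and reports the relative paths whose contents differ or that exist on only one side. It uses plain java.nio rather than Mill's integration test utilities, the names OutFolderCompare, snapshot and diff are hypothetical, and the `ignored` predicate is where the expected-to-differ files would be listed. Since only file contents are hashed, mtimes are ignored by construction.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest
import scala.jdk.CollectionConverters._

// Hypothetical helper, not Mill API.
object OutFolderCompare {
  def sha256(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

  /** Relative path -> content hash for every non-ignored regular file under root. */
  def snapshot(root: Path, ignored: String => Boolean): Map[String, String] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .map(p => root.relativize(p).iterator().asScala.mkString("/") -> p)
        .filterNot { case (rel, _) => ignored(rel) }
        .map { case (rel, p) => rel -> sha256(Files.readAllBytes(p)) }
        .toMap // forces the lazy iterator before the stream is closed
    } finally stream.close()
  }

  /** Relative paths that differ in content or exist in only one snapshot. */
  def diff(a: Map[String, String], b: Map[String, String]): Seq[String] =
    (a.keySet ++ b.keySet).toSeq.sorted.filter(k => a.get(k) != b.get(k))
}
```

A test would then assert that `diff(snapshot(outA, ignored), snapshot(outB, ignored))` is empty, with `ignored` matching mill-profile.json, mill-chrome-profile.json, and anything under mill-server/ or mill-no-server/.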

Related issues with prior discussion:

@lihaoyi lihaoyi added the bounty label Oct 4, 2024
@lihaoyi lihaoyi changed the title Make out/ folder contents filesystem layout agnostic (1000USD bounty) Make out/ folder contents reproducible and filesystem layout agnostic (1000USD bounty) Oct 4, 2024
@lihaoyi lihaoyi changed the title Make out/ folder contents reproducible and filesystem layout agnostic (1000USD bounty) Make out/ folder contents (more) reproducible and filesystem layout agnostic (1000USD bounty) Oct 4, 2024
@lihaoyi lihaoyi changed the title Make out/ folder contents (more) reproducible and filesystem layout agnostic (1000USD bounty) Make out/ folder contents (more) reproducible and filesystem layout agnostic (1500USD bounty) Oct 4, 2024
@rahat2134

Trying this issue, it looks interesting...

@lihaoyi
Member Author

lihaoyi commented Oct 13, 2024

@rahat2134 go for it! Feel free to ask here if you have any questions
