Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow "non-repository" input to the analyzer #8803

Open
sschuberth opened this issue Jun 27, 2024 · 12 comments
Open

Allow "non-repository" input to the analyzer #8803

sschuberth opened this issue Jun 27, 2024 · 12 comments
Labels
analyzer About the analyzer tool enhancement Issues that are considered to be enhancements

Comments

@sschuberth
Copy link
Member

sschuberth commented Jun 27, 2024

As ORT takes transparency and reproducibility of results serious, currently only local directories that are under version control can serve as the input to the analyzer. This is because for such "working trees" it is machine-readable where they originate from, and also the state of the source code is encoded into the VCS revision. If a remote VCS repository is to be analyzed, it either needs to be manually cloned with the respective VCS tool or the ORT Downloader first.

However, there are use-cases where the source code to analyze never was checked into version control, and committing it to a "fake" repository just to be able to analyze it defeats the purpose of capturing genuine provenance information.

Thus the proposal is to relax / extend the valid input types for the analyzer to the following:

  • local directories under version control (currently the only supported input)
  • local directories not under version control (or an unsupported VCS; maybe use e.g. the SPDX package verification code as part of provenance information)
  • local ZIP files (use e.g. the SHA1 / SHA256 of the archive as part of provenance information)
  • Docker images (mount the image as a file system and analyze a directory inside it; use e.g. the Docker image tag / SHA1 as part of provenance information)
  • Remote ZIP files / Docker images referred to by URLs (the URL plus hashes could serve as provenance information)

Open questions:

  • How to ensure that following ORT tools, that may run on a different machine, has access to the source code of the project as well?
    • For local directories, it could simply be a requirement that all ORT tools have to run on the same machine (or have access to the same file system with the project source code).

Related issues / PRs:

@sschuberth sschuberth added enhancement Issues that are considered to be enhancements analyzer About the analyzer tool labels Jun 27, 2024
@sschuberth sschuberth pinned this issue Jun 27, 2024
@fviernau
Copy link
Member

fviernau commented Jul 2, 2024

What we have missed to discuss so far, but I believe is important to clarify is: Project.vcs / Project.vcsProcessed.
Previously, one could always set it. The feature implemented here should probably ensure we can still always set that property. So, its datatype would need to change?!

@sschuberth
Copy link
Member Author

Good point about a Project's provenance info. If the analysis is not performed on a repository, it also does not make sense to store VCS-related information for the Projects, and these should probably be migrated to KnownProvenance as well.

This relates a bit to @mnonnenmacher's questions about how follow-up ORT tools, that might run on a different machine / node, should get access to the provenance's / project's source code.

@pepper-jk
Copy link
Contributor

Would you like me to incorporate those changes to Project into #8764 as well before we proceed?

@sschuberth
Copy link
Member Author

Would you like me to incorporate those changes to Project into #8764 as well before we proceed?

I'm not sure. Your PR is already quite involving, so maybe we should not make it yet more complex, but roll out the changes in stages, if possible.

@pepper-jk
Copy link
Contributor

Yes, I see the problem: either we have breaking changes after every small pull request, or we have a large pull request, which is impossible to review.

Maybe we could create a feature branch for this topic and merge any PRs onto there first and only once it is finished you merge it into main? This way there would only be breaking changes in one release and you also had small PRs to review.

FYI: I have taken a look at the changes required for Project and they appear to be very similar to what I already did in Repository, so I believe adding those on top would be straight forward. It would just takes a bit more work.

@pepper-jk
Copy link
Contributor

@sschuberth Is there any news from the core developer meeting?

@sschuberth
Copy link
Member Author

@sschuberth Is there any news from the core developer meeting?

No, because we had to skip it for this week due to appointment conflicts, sorry.

@pepper-jk
Copy link
Contributor

pepper-jk commented Jul 18, 2024

Today, we had a meeting with @fviernau and discussed potential issues regarding the current refactoring approach.

1. Ambiguous VcsInfo source

While the vcs from Project is said to be defined in the project's "metadata" ¹,

* Original VCS-related information as defined in the [Project]'s metadata.

but the vcs from Repository originates from the analyzer root in the working tree.
* Original VCS-related information from the working tree containing the analyzer root.

Frank pointed out this ambiguity in the kdoc strings, which may have lead to inconsistent implementation regarding the source of the VcsInfo objects.
Further, any LocalFileProvenance would lack metadata and would not fit the given description. Frank suggests, the semantics of the VcsInfos source should be clarified, before further refactoring measures are taken.

2. KnownProvenance sub-classes

Frank suggested to devide the KnownProvenance class into two distinct sub-classes RemoteProvenance and LocalProvenance. This way a feature would be able to support RepositoryProvenance and ArtifactProvenance without the requirement to support all KnownProvenances including the new LocalProvances.

I believe that approach to be sensible. For one it separates the Provenances logically. It also allows granular control of which provenances are supported when, allowing distinction if necessary.

A scheme for such an inheritance could look like this:

  • KnownProvenance
    • RemoteProvenance
      • VcsProvenance
      • ArtifactProvenance
    • LocalProvancen
      • FileProvenance DirectoryProvenance

3. "Dummy" VersionControlSystem Plugin

We also briefly talked about the alternative solution, @sschuberth already suggested before. Although it was not Frank's area of expertise, he assumes such an implementation would be less complex, as it only requires implementation of a single class with little to no interfaces to the rest of the code base.

@sschuberth
Copy link
Member Author

1. Ambiguous VcsInfo source

I'm not sure I follow here about what the "ambiguity" is.

Let me start by saying that I believe the properties to be documented correctly regarding their intent. That is, IMO the Project class basically is a common model for the data in definition files from all supported package managers.

So for example, the vcs property is supposed to contain the data from a Maven project pom.yml's <scm> tag. However, data is not taken 1:1, but package-manager-specific syntax is resolved, like Maven's scm:git:... syntax of NPM's shortcut syntax.

On top of that, the vcsProcessed property is supposed to contain further normalized data, e.g. by mapping scp-type Git URLs consistently to full ssh URLs.

That said, I believe the implementation of the above is partly wrong in some package managers. For example, we have processProjectVcs() which enriches "VCS information determined from the project's directory with VCS information determined from the project's metadata", mixing two potentially unrelated sources of VCS information.

The current Repository class on the other hand is really only supposed to record the (repository) provenance of the input directory the analyzer was run on. It would also be present if no projects were found at all.

2. KnownProvenance sub-classes

The proposed structure makes sense to me.

@pepper-jk
Copy link
Contributor

That said, I believe the implementation of the above is partly wrong in some package managers. For example, we have processProjectVcs() which enriches "VCS information determined from the project's directory with VCS information determined from the project's metadata", mixing two potentially unrelated sources of VCS information.

Would that be something that needs to be fixed before our provenance implementation?

The current Repository class on the other hand is really only supposed to record the (repository) provenance of the input directory the analyzer was run on. It would also be present if no projects were found at all.

Sounds like the two are only indirectly related.

@sschuberth
Copy link
Member Author

Would that be something that needs to be fixed before our provenance implementation?

IMO, no.

Sounds like the two are only indirectly related.

IMO, yes.

sschuberth added a commit that referenced this issue Aug 28, 2024
This is until [1] is resolved.

[1]: #8803

Signed-off-by: Sebastian Schuberth <[email protected]>
sschuberth added a commit that referenced this issue Aug 28, 2024
This is until [1] is resolved.

[1]: #8803

Signed-off-by: Sebastian Schuberth <[email protected]>
@pepper-jk
Copy link
Contributor

pepper-jk commented Aug 29, 2024

Summarizing today's meeting between @sschuberth (ORT), @mnonnenmacher (ORT), @jens-erdmann (HELLA), and myself (HELLA).

TL:DR: We are still looking to contribute on the LocalProvenance topic and the maintainers are still interested in seeing the contribution to fruition.

1. Ambiguous VcsInfo source

With the absence of @fviernau, we could not conclude this topic. For now it should not be a blocker though and we hope we can resolve it once we get there.

2. KnownProvenance sub-classes

The proposed hierarchy seems like a good approach, but the refactoring overhead still has to be investigated. Specifically it is unclear, if conditional statements differentiating between RemoteProvenances and LocalProvances would outweigh the conditional statements, which would be required to limit interfaces to VcsProvenance and ArtifactProvenance, which would be required without the Provenance Hierarchy Refactoring.

Additional reason for such a refactoring however would be the later support for other possible Provenances, such as ZipProvenance or DockerProvenance. In light of that possibility, we updated the name of Local FileProvenance to DirectoryProvenance. The previous comment was amended.

We further surmised that if this task should be realized, it should be the first incremental contribution of the Local Provenance Refactoring. @sschuberth offered to support on this and perhaps implement this feature. He and I will stay in contact on this and I will contribute as he sees fit.

3. "Dummy" VersionControlSystem Plugin

I have now implemented a "Dummy" VCS plugin, which we will probably use as a workaround for the use case of scanning non-vcs software projects for the time being.

This allows us to take our time with the potenial Provenance Hierarchy Refactoring and the Local Provenance Reafactoing as a whole.

We also discussed the possibility to release said plugin as a vcs plugin example, similar to ort-package-manager-plugin.

4. Further Local Provenance Refactoring

Once the Provenance Hierarchy is clear, I can begin to rebase existing refactoring branches and we can continue the implementation. Existing branches are:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analyzer About the analyzer tool enhancement Issues that are considered to be enhancements
Projects
None yet
Development

No branches or pull requests

3 participants