
Perform network I/O in parallel when validating multiple files #455

Open
chris48s opened this issue Jun 16, 2024 · 1 comment

Comments

@chris48s
Owner

If I ask v8r to validate multiple files (e.g. v8r *.json), it will work through each of the files in sequence, fetching the schema and then validating the file. Caching the catalog and schemas speeds things up a bit - particularly if we are validating lots of files against the same schema. However, there is scope to make this a lot faster by doing things in parallel. The process of fetching and resolving schema references in particular is I/O-bound and lends itself to being done in parallel to speed things up.

Probably the ideal workflow here is something like:

  • Fetch and cache the catalog(s) we are using first
  • Query all the files we want to validate against the catalog and decide what schemas we need to fetch
  • Collate all the schemas and de-dupe
  • Kick off fetching all the schemas (in parallel)
  • Having fetched/resolved/cached all the schemas, iterate over each file and validate it
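The workflow above could be sketched roughly like this. This is only an illustration, not v8r's actual code: `getMatchingSchemaUrl`, `fetchSchema` and `validate` are hypothetical stand-ins, and a real implementation would also need to resolve `$ref`s within each schema.

```javascript
// Sketch of the proposed flow: plan, de-dupe, fetch in parallel, then validate.
async function validateAll(files, getMatchingSchemaUrl, fetchSchema, validate) {
  // Query each file against the (already cached) catalog to decide its schema
  const plan = files.map((file) => ({ file, url: getMatchingSchemaUrl(file) }));

  // Collate all the schema URLs and de-dupe
  const urls = [...new Set(plan.map((p) => p.url))];

  // Kick off fetching all the schemas in parallel
  const fetched = await Promise.all(urls.map((url) => fetchSchema(url)));
  const schemas = new Map(urls.map((url, i) => [url, fetched[i]]));

  // With everything fetched, iterate over each file and validate it
  return plan.map(({ file, url }) => validate(file, schemas.get(url)));
}
```

With this shape, ten files sharing one schema trigger a single fetch, and files needing different schemas no longer wait on each other's network round-trips.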

Some possible problems:

This is quite a big/fiddly project, but it could make v8r a lot faster in some situations.

@chris48s
Owner Author

I think race conditions on cache writes don't really matter that much. The most naive approach would be to just do nothing.

In that case, we might make 2 requests for the same thing at roughly the same time. In the most common case, we get the same result both times and write it twice. The second write sets a slightly later timestamp.

I guess the worst case scenario is that this happens just as the upstream resource changes. In that case, we get 2 different responses and different bits of the validation within the same run are using 2 slightly different versions of the same schema. In principle, you could probably reproduce that with the current setup using ttl=0 or a really short ttl though.

I think I'm happy enough with that edge case not to implement any special locking for the cache.
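For reference, one lockless middle ground is to de-dupe concurrent requests by memoising the in-flight promise, so two simultaneous requests for the same URL share one fetch and one cache write. This is just an illustration of that pattern, not something the comment commits to; `fetchAndCache` is a hypothetical stand-in for the real fetch-and-cache step.

```javascript
// Share in-flight fetches: concurrent callers asking for the same URL
// get the same promise, so the request (and cache write) happens once.
function makeDedupedFetcher(fetchAndCache) {
  const inFlight = new Map();
  return function fetchOnce(url) {
    if (!inFlight.has(url)) {
      // Drop the entry once settled so later calls hit the cache afresh
      const p = fetchAndCache(url).finally(() => inFlight.delete(url));
      inFlight.set(url, p);
    }
    return inFlight.get(url);
  };
}
```

This avoids the double-write race entirely without any file locking, though as noted above the consequences of the race are mild enough that doing nothing is also defensible.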
