
Perform network I/O in parallel when validating multiple files #455

Open
chris48s opened this issue Jun 16, 2024 · 1 comment

Comments

@chris48s
Owner

If I ask v8r to validate multiple files (e.g. v8r *.json), it will work through each of the files in sequence, fetching the schema and then validating the file. Caching the catalog and schemas speeds things up a bit - particularly if we are validating lots of files against the same schema. However, there is scope to make this a lot faster by doing things in parallel. The process of fetching and resolving schema references in particular is I/O-bound and lends itself to being done in parallel to speed things up.

Probably the ideal workflow here is something like:

  • Fetch and cache the catalog(s) we are using first
  • Query all the files we want to validate against the catalog and decide what schemas we need to fetch
  • Collate all the schemas and de-dupe
  • Kick off fetching all the schemas (in parallel)
  • Having fetched/resolved/cached all the schemas, iterate over each file and validate it
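The workflow above could be sketched roughly like this. This is only an illustration, not v8r's actual code: `getMatchingSchemaUrl`, `fetchSchema` and `validate` are hypothetical stand-ins, and a real implementation would also need to resolve `$ref`s within each schema.

```javascript
// Sketch of the proposed flow: plan, de-dupe, fetch in parallel, then validate.
async function validateAll(files, getMatchingSchemaUrl, fetchSchema, validate) {
  // Query each file against the (already cached) catalog to decide its schema
  const plan = files.map((file) => ({ file, url: getMatchingSchemaUrl(file) }));

  // Collate all the schema URLs and de-dupe
  const urls = [...new Set(plan.map((p) => p.url))];

  // Kick off fetching all the schemas in parallel
  const fetched = await Promise.all(urls.map((url) => fetchSchema(url)));
  const schemas = new Map(urls.map((url, i) => [url, fetched[i]]));

  // With everything fetched, iterate over each file and validate it
  return plan.map(({ file, url }) => validate(file, schemas.get(url)));
}
```

With this shape, ten files sharing one schema trigger a single fetch, and files needing different schemas no longer wait on each other's network round-trips.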

Some possible problems:

This is quite a big/fiddly project, but it could make v8r a lot faster in some situations.

@chris48s
Owner Author

I think race conditions on cache writes don't really matter that much. The most naive approach would be to just do nothing.

In that case, we might make 2 requests for the same thing at roughly the same time. In the most common case, we get the same result both times and write it twice. The second write sets a slightly later timestamp.

I guess the worst case scenario is that this happens just as the upstream resource changes. In that case, we get 2 different responses and different bits of the validation within the same run are using 2 slightly different versions of the same schema. In principle, you could probably reproduce that with the current setup using ttl=0 or a really short ttl though.

I think I'm happy enough with that edge case not to implement any special locking for the cache.
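For reference, one lockless middle ground is to de-dupe concurrent requests by memoising the in-flight promise, so two simultaneous requests for the same URL share one fetch and one cache write. This is just an illustration of that pattern, not something the comment commits to; `fetchAndCache` is a hypothetical stand-in for the real fetch-and-cache step.

```javascript
// Share in-flight fetches: concurrent callers asking for the same URL
// get the same promise, so the request (and cache write) happens once.
function makeDedupedFetcher(fetchAndCache) {
  const inFlight = new Map();
  return function fetchOnce(url) {
    if (!inFlight.has(url)) {
      // Drop the entry once settled so later calls hit the cache afresh
      const p = fetchAndCache(url).finally(() => inFlight.delete(url));
      inFlight.set(url, p);
    }
    return inFlight.get(url);
  };
}
```

This avoids the double-write race entirely without any file locking, though as noted above the consequences of the race are mild enough that doing nothing is also defensible.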
