-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE-6836] schema registry json internal references #23461
base: dev
Are you sure you want to change the base?
Conversation
auto maybe_draft4_id_it = this_obj.find("id"); | ||
auto maybe_id_it = this_obj.find("$id"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the dialect be lookup up first, to determine whether to use "$id" or "id"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried to come with a simpler form but the fact is that we know that we are in bundled schema if we find a "id" member, and then if we are in a bundled schema then "$schema" can be analyzed, and the dialect has agree to "$id" or "id. it's kinda a chicken/egg problem,
so that's why i just try to find both, and if at least one is found then i assume we are in bundled schema and i can check if it's correct or not.
77db262
to
9096718
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great effort, looks good so far. I'd love to see some tests for all this behaviour and various edge cases 😄
9096718
to
03ec3de
Compare
force push: addressed comments, added test cases |
new failures in https://buildkite.com/redpanda/redpanda/builds/55458#01924419-c191-4958-bfe6-605084731cb1:
new failures in https://buildkite.com/redpanda/redpanda/builds/55458#0192441a-3d8e-4787-b67a-3d90e154d27c:
|
…e uri type the base uri for a schema is saved with the "$id" or "id" key and any relative reference is relative to this uri. since the uri can contain a bunch of other parts like protocol (http vs https) or port, this type is meant to be a canonical representation as far as references are concerned. the form is host[/path] where /path is optional new type alias: id_to_schema_pointer it's a map to resolve references json_id_uri->{json_pointer, dialect} with a json_id_uri, we get the path to the actual json object for the bundled schema, and the dialect used by it. an absolute reference will first query the json_id_uri, get the path to the root object, and then reach for the specific object
03ec3de
to
2818ab0
Compare
force push: rebase on dev, switched from jsoncons::json to jsoncons::ojson to preserve insertion order of the json keys.
|
60e2e23
to
8fb9da1
Compare
force push: removed brittle heuristic, fixed a warning for an error message, tried to appease clang-format |
8fb9da1
to
92ca754
Compare
#23461 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Not reviewed all the tests yet.
bundled_schemas.insert_or_assign( | ||
{}, std::pair{json::Pointer{}, dialect}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is anything inserted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wanted to keep the invariant of bundled schemas always having at least an entry for the root.
in this specific case there is no practical purpose, but since in the rest of the file we try to treat true
as {}
i felt that it's no harm to extend this also here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.
for (auto i = 0u; i < value.size(); ++i) { | ||
if (value[i].is_object()) { | ||
collect_and_fix(i, value[i]); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose in theory, there could be arrays of arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😭 in the interest of getting this pr in, I'll rework this in the next pr for external refs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a second thought, I don't think it's possible meaningfully to have arrays of arrays with some reachable references:
"type": "array" requires prefixItems to be a schema of objects,
similarly allOf/oneOf/anyOf are schema arrays
fmt::format( | ||
"Unsupported merging of references with base object: '{}'", | ||
pj{candidate})}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will give a subset of the doc, but I wonder if the $ref
is more helpful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed in the following commit set (kept here to support bisecting)
for (auto& [k, v] : obj) { | ||
if (k != ref_key) { | ||
doc.AddMember( | ||
json::Value(k, alloc), json::Value(v, alloc), alloc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: I guess we don't need to copy any strings (json::Value
3rd param) since everything is kept alive in another doc somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried but that's not how the rapidjson api works, the Document has to own <key,value>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried but that's not how the rapidjson api works, the Document has to own <key,value>
I thought the default was to not copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically it requires lvalues Value::AddMember(Value&, Value&, Allocator&)
but the implementation will take ownership and set the inputs as null.
i haven't found another API to do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically it requires lvalues
Value::AddMember(Value&, Value&, Allocator&)
but the implementation will take ownership and set the inputs as null. i haven't found another API to do it
I was thinking of this: https://github.com/Tencent/rapidjson/blob/815e6e7e7e14be44a6c15d9aefed232ff064cad0/include/rapidjson/document.h#L742
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, you mean bool copyConstStrings = false)
do we dare? seems doable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, you mean
bool copyConstStrings = false)
do we dare? seems doable
Yes. Nothing to be done, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i added it
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cdc-6216-4f22-bddd-a8d4fadad868 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13c1-441c-85f1-9474004c556e ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13bd-4fcc-b7ca-c41b5c03981e ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55820#0192591e-d718-4be7-b33b-4e62a8ec48e3 |
switch roles between jsoncons<->rapidjson. jsoncons has a nicer api for the next commit
92ca754
to
bce34ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few minor suggestions/questions but it pretty much looks good.
bundled_schemas.insert_or_assign( | ||
{}, std::pair{json::Pointer{}, dialect}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.
if (dialect_it == this_obj.object_range().end()) { | ||
// we can keep using the parent dialect | ||
return dialect; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
if (dialect_it == this_obj.object_range().end()) { | |
// we can keep using the parent dialect | |
return dialect; | |
} | |
if (dialect_it == this_obj.object_range().end()) { | |
// If no $schema is declared in an embedded schema, it defaults to using the dialect of the parent schema. | |
// from https://json-schema.org/understanding-json-schema/structuring#bundling | |
return dialect; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// run validation since we are not a guaranteed to be in proper schema | ||
if (validate_json_schema(maybe_new_dialect.value(), this_obj) | ||
.has_error()) { | ||
// stop exploring this branch, the schema is invalid | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, I suggest throwing here. Afaik, this can only happen when we enter a bundled schema outside the standard (and metaschema-validated) "$defs/definitions" path. The behaviour is sufficiently "weird" in this case that I would be inclined to be defensive here, fail early and reject the schema altogether rather than accept a (partially) valid schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking the ref implementation behavior, will add a commit to match the result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking the stricter approach to throw if the bundled schema is invalid.
in a follow up this can be relaxed if the bundled is never actually referenced
if (maybe_new_dialect.has_value() == false) { | ||
// stop scanning this tree, we might be in a bundled schema but we | ||
// don't know the dialect. | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we throw here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking the ref implementation behavior, will add a commit to match the result
// keyword | $schema | is bundled schema | ||
// "$id" | | if root is >=draft6 | ||
// "$id" | >=draft6 | yes | ||
// "$id" | draft4 | no | ||
// "id" | | if root is draft4 | ||
// "id" | >=draft6 | no | ||
// "id" | draft4 | yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wasn't clear for me initially:
// keyword | $schema | is bundled schema | |
// "$id" | | if root is >=draft6 | |
// "$id" | >=draft6 | yes | |
// "$id" | draft4 | no | |
// "id" | | if root is draft4 | |
// "id" | >=draft6 | no | |
// "id" | draft4 | yes | |
// keyword | $schema | is bundled schema | |
// "$id" | missing | if parent is >=draft6 | |
// "$id" | >=draft6 | yes | |
// "$id" | draft4 | no | |
// "id" | missing | if parent is draft4 | |
// "id" | >=draft6 | no | |
// "id" | draft4 | yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I know that draft5 exists but it might be simpler to just write draft4
and >draft4
(or !=draft4
) here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
bundled_schemas.insert_or_assign( | ||
{}, std::pair{json::Pointer{}, dialect}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be equivalent? I'm wondering if (1) there's a difference in behaviour here between an empty object and a bool ("true") and (2) whether the creation of root_id
could be pulled out of the branches.
bundled_schemas.insert_or_assign( | |
{}, std::pair{json::Pointer{}, dialect}); | |
bundled_schemas.insert_or_assign( | |
json_id_uri{""}, std::pair{json::Pointer{}, dialect}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equivalent, but reworked the function to factor out this part as for (2)
// to use rapidjson we need to serialized schema again | ||
auto iobuf_os = iobuf_ostream{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// to use rapidjson we need to serialized schema again | |
auto iobuf_os = iobuf_ostream{}; | |
// to use rapidjson we need to serialize the schema again | |
// We take a copy of the jsoncons schema here because it has the fixed-up references that we want to use for compatibility checks | |
auto iobuf_os = iobuf_ostream{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 commits here:
- I didn't see tests for relative references so I added some
- Recursion results in stack-overflow (this should be fixed)
bce34ad
to
1f01a68
Compare
the switch to jsoncons::json in parse_json means that the function always performed key sorting, making the "normalize" flag redundant jsoncons can work in insertion-order mode, to do so we switch to the type alias jsoncons::ojson. this is done to preserve the original order of the input see tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize for an example where this can be observed externally from the schema registry api
"^b" is repeated twice, and this is not illegal but the result is not dependent on the strategy used by the json parse
before this commit, get_object_or_empty would get a context to pass to get_schema. this was done with the idea of resolving references in get_schema and benefit from that. however, this clash with the design of resolving references only in is_superset, and the standard prevents "properties" "patternProperties" "dependency" from being references (see tests). this means that the whole code can be simplified (at the cost of handing objects and bools directly in get_object_or_empty). keywords (like items, additionalItems, additionalProperties) that can be references where already passed directly to is_superset, so resolving early the references is a capability that's not needed
this recursive function crawl the schema to gather all bundled schemas. a bundled schema is defined as a schema with "$id" defined. the recursion is on the "$defs"/"definitions" members of a schema, since the schema validation ensures that children of these properties are valid schemas. some small notes: - it's possible that bundled schemas use a different dialect than the root object, so we collect this - the bundled schemas could be nested - we can't error in this function since it's called at parse time and it's not assured that a bundled schema is actually accessed
add helper to resolve references (local or to bundled schemas) add _ref_unit to keep track of how many references can be traversed in the current subtree, this will be used to prevent infinite recursion.
this helper can hold a json::Value const& or a json::Document, and can transparently convert to json::Value const&. it will be used to merge references, when to resolve a reference with siblings we need to create a new json schema internally is a variant of two pointers, because holding a json::Document is the unlikely case of and it would occupy 104 bytes
this function accepts list of json object to merge into a new schema with the form { "allOf": [ *input_list ] } the input_list is expected to be a chain of reference objects of size >=2, all of them with a $ref key except the last one the final schema will contain these objects, with the $ref key removed from each one the user is responsible to keep the original json::Document alive, since the non-ref strings will be referenced in the result. one optimization: if the chain contains only one non-empty object without the $ref key, this will be returned directly. this is to support the likely case of a chain of pure $ref pointing to the final schema
get_schema will use resolve_reference to traverse references until the final target is reached. it's expected for references to absolute, thanks to the preprocessing phase. retrieval is two step process: first retrieve the bundled schema, then retrieve the sub object. if the sub object contains a reference, repeat the process. the function uses schema_context to limit the number of iterations. the number of iterations is shared across children of an object to prevent indirect references to cause a stack overflow. external references are not supported different dialects in bundled schemas are not supported in this commit, the function throws if the dialects do not match
the test cases are positive/negative test around $ref resolution. we have local relative refs local absolute ref refs to bundled schemas multiple jumps refs recursive refs (rejection) array of refs refs with siblings this last one shows the transformation perfomed: a { $ref: uri siblings... } will be internally treated as { allOf: [ { //uri referenced schema} { siblings... } ] }
this test shows how the in-memory representation of $ref is absolute relative to the correct base uri. it uses jsoncons and the jsonpatch extension for easier parse/print/compare ops
1f01a68
to
cae592f
Compare
|
fixed stackoverflow due to unhandled infinite recursion
take the safe approach and reject schemas with invalid bundled schema. reason for rejection: the dialect is unknown, the dialect does not match the key used for "$id", or the bundled schema does not pass validation.
Note: in draft since I need to port tests and reference merging
Support for internal references for json schema compat checks
Implementation strategy:
prepoc phase:
unbundle schemas and save them in a map <$id value, <jsonpointer to schema, dialect>>
note: it’s possible to have nested bundled schemas, the code eagerly collects anything that looks like a bundled schema.
then
for each schema (main or bundled), resolve all relative $ref to absolute, using the correct base uri.
this is done by manually traversing the tree in parse_json, looking for "$id" or "id", and "$ref".
The object is then modified in memory.
No error is thrown in this phase, as it is relevant only if a reference is accessed during the compat check phase
Compat check phase:
Hook in get_schema to iteratively resolve references and use the previous map to handle references into bundled schemas.
Error out if:
To safeguard against recursion, a counter in schema_context is decremented every time a ref is resolved, and this counter is shared with all children of
is_superset
, the chosen value is 20get_object_or_empty make use of get_schema but it does not modify the counter for its siblings, need to think about this.
reference merging
$ref with siblings is transformed, during evaluation, to
allOf
schemas.This captures the semantics defined by the json schema standard and the implementation used by validators like jsoncons.
an example:
this schema
will be converted to
not implemented yet
Fixes https://redpandadata.atlassian.net/browse/CORE-6836
Backports Required
Release Notes