Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-6836] schema registry json internal references #23461

Open
wants to merge 14 commits into
base: dev
Choose a base branch
from

Conversation

andijcr
Copy link
Contributor

@andijcr andijcr commented Sep 24, 2024

Note: in draft since I need to port tests and reference merging

Support for internal references for json schema compat checks

Implementation strategy:

prepoc phase:

unbundle schemas and save them in a map <$id value, <jsonpointer to schema, dialect>>

note: it’s possible to have nested bundled schemas, the code eagerly collects anything that looks like a bundled schema.

then
for each schema (main or bundled), resolve all relative $ref to absolute, using the correct base uri.

this is done by manually traversing the tree in parse_json, looking for "$id" or "id", and "$ref".

The object is then modified in memory.

No error is thrown in this phase, as it is relevant only if a reference is accessed during the compat check phase

Compat check phase:

Hook in get_schema to iteratively resolve references and use the previous map to handle references into bundled schemas.

Error out if:

  • the local ref is invalid
  • the absolute schema uri does not exist
  • The maximum depth of references is reached. This is to safeguard against direct recursive references and indirect recursive references

To safeguard against recursion, a counter in schema_context is decremented every time a ref is resolved, and this counter is shared with all children of is_superset, the chosen value is 20

get_object_or_empty make use of get_schema but it does not modify the counter for its siblings, need to think about this.

reference merging

$ref with siblings is transformed, during evaluation, to allOf schemas.
This captures the semantics defined by the json schema standard and the implementation used by validators like jsoncons.

an example:

this schema

{ 
"$ref": "#/a_fragment",
"type": "number", 
"minimum": 10
}

will be converted to

{
  "allOf": [
    { //schema at #/a_fragment
    },
    { "type": "number", "minimum": 10 }
  ]
}

not implemented yet

  • different dialects for bundled schema: need to keep track of the dialect for each json::Value
  • external references: part of another jira ticket.

Fixes https://redpandadata.atlassian.net/browse/CORE-6836

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
Comment on lines +2029 to +2228
auto maybe_draft4_id_it = this_obj.find("id");
auto maybe_id_it = this_obj.find("$id");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the dialect be lookup up first, to determine whether to use "$id" or "id"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried to come with a simpler form but the fact is that we know that we are in bundled schema if we find a "id" member, and then if we are in a bundled schema then "$schema" can be analyzed, and the dialect has agree to "$id" or "id. it's kinda a chicken/egg problem,
so that's why i just try to find both, and if at least one is found then i assume we are in bundled schema and i can check if it's correct or not.

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 77db262 to 9096718 Compare September 26, 2024 16:09
Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great effort, looks good so far. I'd love to see some tests for all this behaviour and various edge cases 😄

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 9096718 to 03ec3de Compare September 30, 2024 16:16
@andijcr andijcr marked this pull request as ready for review September 30, 2024 16:37
@andijcr
Copy link
Contributor Author

andijcr commented Sep 30, 2024

force push: addressed comments, added test cases

@andijcr andijcr requested a review from a team September 30, 2024 16:56
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Sep 30, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#01924419-c191-4958-bfe6-605084731cb1:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#0192441a-3d8e-4787-b67a-3d90e154d27c:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"

…e uri type

the base uri for a schema is saved with the "$id" or "id" key and any
relative reference is relative to this uri.

since the uri can contain a bunch of other parts like protocol (http vs
https) or port, this type is meant to be a canonical representation as
far as references are concerned.

the form is

host[/path]

where /path is optional

new type alias: id_to_schema_pointer

it's a map to resolve references json_id_uri->{json_pointer, dialect}

with a json_id_uri, we get the path to the actual json object for the
bundled schema, and the dialect used by it.

an absolute reference will first query the json_id_uri, get the path to
the root object, and then reach for the specific object
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 03ec3de to 2818ab0 Compare October 1, 2024 20:08
@andijcr
Copy link
Contributor Author

andijcr commented Oct 1, 2024

force push: rebase on dev, switched from jsoncons::json to jsoncons::ojson to preserve insertion order of the json keys.

    the switch to jsoncons::json in parse_json means that the function always performed
    key sorting, making the "normalize" flag redundant

    jsoncons can work in insertion-order mode, to do so we switch to the
    type alias jsoncons::ojson.

    this is done to preserve the original order of the input
    see
    tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize
    for an example where this can be observed externally from the schema
    registry API

@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch 3 times, most recently from 60e2e23 to 8fb9da1 Compare October 2, 2024 09:37
@andijcr
Copy link
Contributor Author

andijcr commented Oct 2, 2024

force push: removed brittle heuristic, fixed a warning for an error message, tried to appease clang-format

@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 8fb9da1 to 92ca754 Compare October 2, 2024 09:41
@andijcr
Copy link
Contributor Author

andijcr commented Oct 2, 2024

#23461 (comment)
edit: issue was between keyboard and chair

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Not reviewed all the tests yet.

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
Comment on lines 2086 to 2317
bundled_schemas.insert_or_assign(
{}, std::pair{json::Pointer{}, dialect});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is anything inserted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wanted to keep the invariant of bundled schemas always having at least an entry for the root.
in this specific case there is no practical purpose, but since in the rest of the file we try to treat true as {} i felt that it's no harm to extend this also here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
Comment on lines +2072 to +2319
for (auto i = 0u; i < value.size(); ++i) {
if (value[i].is_object()) {
collect_and_fix(i, value[i]);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose in theory, there could be arrays of arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭 in the interest of getting this pr in, I'll rework this in the next pr for external refs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a second thought, I don't think it's possible meaningfully to have arrays of arrays with some reachable references:
"type": "array" requires prefixItems to be a schema of objects,
similarly allOf/oneOf/anyOf are schema arrays

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
Comment on lines 656 to 658
fmt::format(
"Unsupported merging of references with base object: '{}'",
pj{candidate})});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will give a subset of the doc, but I wonder if the $ref is more helpful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed in the following commit set (kept here to support bisecting)

src/v/pandaproxy/schema_registry/json.cc Show resolved Hide resolved
for (auto& [k, v] : obj) {
if (k != ref_key) {
doc.AddMember(
json::Value(k, alloc), json::Value(v, alloc), alloc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I guess we don't need to copy any strings (json::Value 3rd param) since everything is kept alive in another doc somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried but that's not how the rapidjson api works, the Document has to own <key,value>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried but that's not how the rapidjson api works, the Document has to own <key,value>

I thought the default was to not copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/Tencent/rapidjson/blob/815e6e7e7e14be44a6c15d9aefed232ff064cad0/include/rapidjson/document.h#L1381C9-L1387C36

basically it requires lvalues Value::AddMember(Value&, Value&, Allocator&) but the implementation will take ownership and set the inputs as null.
i haven't found another API to do it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/Tencent/rapidjson/blob/815e6e7e7e14be44a6c15d9aefed232ff064cad0/include/rapidjson/document.h#L1381C9-L1387C36

basically it requires lvalues Value::AddMember(Value&, Value&, Allocator&) but the implementation will take ownership and set the inputs as null. i haven't found another API to do it

I was thinking of this: https://github.com/Tencent/rapidjson/blob/815e6e7e7e14be44a6c15d9aefed232ff064cad0/include/rapidjson/document.h#L742

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, you mean bool copyConstStrings = false)
do we dare? seems doable

Copy link
Member

@BenPope BenPope Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, you mean bool copyConstStrings = false) do we dare? seems doable

Yes. Nothing to be done, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i added it

src/v/pandaproxy/schema_registry/json.cc Show resolved Hide resolved
switch roles between jsoncons<->rapidjson. jsoncons has a nicer api for
the next commit
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 92ca754 to bce34ad Compare October 2, 2024 13:23
Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few minor suggestions/questions but it pretty much looks good.

Comment on lines 2086 to 2317
bundled_schemas.insert_or_assign(
{}, std::pair{json::Pointer{}, dialect});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.

Comment on lines 2228 to 2244
if (dialect_it == this_obj.object_range().end()) {
// we can keep using the parent dialect
return dialect;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if (dialect_it == this_obj.object_range().end()) {
// we can keep using the parent dialect
return dialect;
}
if (dialect_it == this_obj.object_range().end()) {
// If no $schema is declared in an embedded schema, it defaults to using the dialect of the parent schema.
// from https://json-schema.org/understanding-json-schema/structuring#bundling
return dialect;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 2264 to 2282
// run validation since we are not a guaranteed to be in proper schema
if (validate_json_schema(maybe_new_dialect.value(), this_obj)
.has_error()) {
// stop exploring this branch, the schema is invalid
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, I suggest throwing here. Afaik, this can only happen when we enter a bundled schema outside the standard (and metaschema-validated) "$defs/definitions" path. The behaviour is sufficiently "weird" in this case that I would be inclined to be defensive here, fail early and reject the schema altogether rather than accept a (partially) valid schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking the ref implementation behavior, will add a commit to match the result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taking the stricter approach to throw if the bundled schema is invalid.

in a follow up this can be relaxed if the bundled is never actually referenced

Comment on lines 2238 to 2255
if (maybe_new_dialect.has_value() == false) {
// stop scanning this tree, we might be in a bundled schema but we
// don't know the dialect.
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking the ref implementation behavior, will add a commit to match the result

Comment on lines 2208 to 2225
// keyword | $schema | is bundled schema
// "$id" | | if root is >=draft6
// "$id" | >=draft6 | yes
// "$id" | draft4 | no
// "id" | | if root is draft4
// "id" | >=draft6 | no
// "id" | draft4 | yes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't clear for me initially:

Suggested change
// keyword | $schema | is bundled schema
// "$id" | | if root is >=draft6
// "$id" | >=draft6 | yes
// "$id" | draft4 | no
// "id" | | if root is draft4
// "id" | >=draft6 | no
// "id" | draft4 | yes
// keyword | $schema | is bundled schema
// "$id" | missing | if parent is >=draft6
// "$id" | >=draft6 | yes
// "$id" | draft4 | no
// "id" | missing | if parent is draft4
// "id" | >=draft6 | no
// "id" | draft4 | yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I know that draft5 exists but it might be simpler to just write draft4 and >draft4 (or !=draft4) here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 2316 to 2317
bundled_schemas.insert_or_assign(
{}, std::pair{json::Pointer{}, dialect});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be equivalent? I'm wondering if (1) there's a difference in behaviour here between an empty object and a bool ("true") and (2) whether the creation of root_id could be pulled out of the branches.

Suggested change
bundled_schemas.insert_or_assign(
{}, std::pair{json::Pointer{}, dialect});
bundled_schemas.insert_or_assign(
json_id_uri{""}, std::pair{json::Pointer{}, dialect});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equivalent, but reworked the function to factor out this part as for (2)

// to use rapidjson we need to serialized schema again
auto iobuf_os = iobuf_ostream{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// to use rapidjson we need to serialized schema again
auto iobuf_os = iobuf_ostream{};
// to use rapidjson we need to serialize the schema again
// We take a copy of the jsoncons schema here because it has the fixed-up references that we want to use for compatibility checks
auto iobuf_os = iobuf_ostream{};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

BenPope
BenPope previously requested changes Oct 3, 2024
Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 2 commits here:

  1. I didn't see tests for relative references so I added some
  2. Recursion results in stack-overflow (this should be fixed)

@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from bce34ad to 1f01a68 Compare October 4, 2024 14:07
the switch to jsoncons::json in parse_json means that the function always performed
key sorting, making the "normalize" flag redundant

jsoncons can work in insertion-order mode, to do so we switch to the
type alias jsoncons::ojson.

this is done to preserve the original order of the input
see
tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize
for an example where this can be observed externally from the schema
registry api
"^b" is repeated twice, and this is not illegal but the result is not
dependent on the strategy used by the json parse
before this commit, get_object_or_empty would get a context to pass to
get_schema. this was done with the idea of resolving references in
get_schema and benefit from that.

however, this clash with the design of resolving references only in
is_superset, and the standard prevents "properties" "patternProperties"
"dependency" from being references (see tests).

this means that the whole code can be simplified (at the cost of handing
objects and bools directly in get_object_or_empty).

keywords (like items, additionalItems, additionalProperties) that can be
references where already passed directly to is_superset, so resolving
early the references is a capability that's not needed
this recursive function crawl the schema to gather all bundled schemas.
a bundled schema is defined as a schema with "$id" defined.
the recursion is on the "$defs"/"definitions" members of a schema, since
the schema validation ensures that children of these properties are
valid schemas.

some small notes:
- it's possible that bundled schemas use a different
dialect than the root object, so we collect this
- the bundled schemas could be nested
- we can't error in this function since it's called at parse time and
  it's not assured that a bundled schema is actually accessed
add helper to resolve references (local or to bundled schemas)

add _ref_unit to keep track of how many references can be traversed in
the current subtree, this will be used to prevent infinite recursion.
this helper can hold a json::Value const& or a json::Document, and can
transparently convert to json::Value const&.

it will be used to merge references, when to resolve a reference with siblings we need to
create a new json schema

internally is a variant of two pointers, because holding a json::Document is the
unlikely case of and it would occupy 104 bytes
this function accepts list of json object to merge into a new schema
with the form

{
  "allOf": [ *input_list ]
}

the input_list is expected to be a chain of reference objects of size
>=2, all of them with a $ref key except the last one

the final schema will contain these objects, with the $ref key removed
from each one

the user is responsible to keep the original json::Document alive, since
the non-ref strings will be referenced in the result.

one optimization: if the chain contains only one non-empty object
without the $ref key, this will be returned directly.
this is to support the likely case of a chain of pure $ref pointing to
the final schema
get_schema will use resolve_reference to traverse references until the
final target is reached.

it's expected for references to absolute, thanks to the preprocessing
phase.
retrieval is two step process: first retrieve the bundled schema, then
retrieve the sub object.

if the sub object contains a reference, repeat the process.

the function uses schema_context to limit the number of iterations. the
number of iterations is shared across children of an object to prevent
indirect references to cause a stack overflow.

external references are not supported

different dialects in bundled schemas are not supported in this commit,
the function throws if the dialects do not match
the test cases are positive/negative test around $ref resolution.

we have local relative refs
local absolute ref
refs to bundled schemas
multiple jumps refs
recursive refs (rejection)
array of refs
refs with siblings
	this last one shows the transformation perfomed:

a
{
    $ref: uri
    siblings...
}
will be internally treated as
{
    allOf: [
      { //uri referenced schema}
      { siblings... }
    ]
}
this test shows how the in-memory representation of $ref is absolute
relative to the correct base uri.

it uses jsoncons and the jsonpatch extension for easier
parse/print/compare ops
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 1f01a68 to cae592f Compare October 4, 2024 15:34
@andijcr
Copy link
Contributor Author

andijcr commented Oct 4, 2024

I have 2 commits here:

1. I didn't see tests for relative references so I added some

2. Recursion results in stack-overflow (this should be fixed)
  1. added them
  2. fix for this is to ensure that ref resolution is done only in is_superset. to do so i decoupled get_object_or_empty from get_schema. get_object_or_empty does not need the context because it's used for keyword like "properties" that can't be a reference (added a test for this), or for objects that gets passed directly to is_superset, so there is no need to resolve references

@andijcr andijcr dismissed BenPope’s stale review October 4, 2024 15:53

fixed stackoverflow due to unhandled infinite recursion

take the safe approach and reject schemas with invalid bundled schema.

reason for rejection: the dialect is unknown, the dialect does not match
the key used for "$id", or the bundled schema does not pass validation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants