[CORE-6836] schema registry json internal references #23461

andijcr · 2024-09-24T16:29:24Z

Note: in draft since I need to port tests and reference merging

Support for internal references for json schema compat checks

Implementation strategy:

Preprocessing phase:

Unbundle schemas and save them in a map <$id value, <jsonpointer to schema, dialect>>.

Note: it's possible to have nested bundled schemas; the code eagerly collects anything that looks like a bundled schema.

Then, for each schema (main or bundled), resolve all relative $ref to absolute, using the correct base uri.

This is done by manually traversing the tree in parse_json, looking for "$id" or "id", and "$ref".

The object is then modified in memory.

No error is thrown in this phase, since an error is relevant only if a reference is accessed during the compat check phase.
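To make the "relative $ref to absolute" step concrete, here is a minimal sketch. It is not the code from this PR: resolve_ref is a hypothetical helper, it works on plain strings instead of the parsed json tree, and it ignores dot segments ("../") and query strings.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not taken from the PR: turn a "$ref" value into an
// absolute reference using the base uri of the enclosing schema.
// Dot segments ("../") and query strings are deliberately not handled.
std::string resolve_ref(std::string_view base_uri, std::string_view ref) {
    constexpr auto npos = std::string_view::npos;
    // A reference that names a scheme is already absolute.
    if (ref.find("://") != npos) {
        return std::string{ref};
    }
    // The fragment of the base never takes part in resolution.
    const auto base = base_uri.substr(0, base_uri.find('#'));
    // A fragment-only reference points inside the current schema; with an
    // empty base there is nothing to resolve against.
    if (ref.empty() || ref.front() == '#' || base.empty()) {
        return std::string{base} + std::string{ref};
    }
    const auto scheme_end = base.find("://");
    const auto authority_start = scheme_end == npos ? 0 : scheme_end + 3;
    // An absolute path replaces the whole path of the base.
    if (ref.front() == '/') {
        const auto path_start = base.find('/', authority_start);
        return std::string{base.substr(0, path_start)} + std::string{ref};
    }
    // A relative path replaces the last segment of the base path.
    const auto last_slash = base.rfind('/');
    if (last_slash == npos || last_slash < authority_start) {
        return std::string{base} + "/" + std::string{ref};
    }
    return std::string{base.substr(0, last_slash + 1)} + std::string{ref};
}
```

With a base of "https://example.com/schemas/root.json", the reference "other.json#/x" becomes "https://example.com/schemas/other.json#/x", while "#/definitions/a" simply gets the base prepended.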

Compat check phase:

Hook in get_schema to iteratively resolve references and use the previous map to handle references into bundled schemas.

Error out if:

the local ref is invalid
the absolute schema uri does not exist
the maximum depth of references is reached. This safeguards against both direct and indirect recursive references

To safeguard against recursion, a counter in schema_context is decremented every time a ref is resolved, and this counter is shared with all children of is_superset. The chosen value is 20.
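A minimal sketch of the counter idea, under stated assumptions: ref_budget and schema_store are hypothetical stand-ins (the real code keeps the counter in schema_context and walks json values), and only the budget of 20 comes from the description above.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the counter kept in schema_context.
struct ref_budget {
    int remaining = 20; // the value chosen in the PR description

    void consume() {
        if (remaining == 0) {
            throw std::runtime_error("maximum reference depth reached");
        }
        --remaining;
    }
};

// Toy schema store: a name maps either to another name (a "$ref") or to
// nothing (a concrete schema with no reference).
using schema_store = std::map<std::string, std::optional<std::string>>;

// Follow references until a concrete schema is reached. The budget is taken
// by reference so that every resolution in the same subtree shares it.
std::string
resolve(const schema_store& store, std::string name, ref_budget& budget) {
    for (;;) {
        const auto it = store.find(name);
        if (it == store.end()) {
            throw std::runtime_error("reference target not found: " + name);
        }
        if (!it->second.has_value()) {
            return name; // concrete schema, nothing left to resolve
        }
        budget.consume(); // one hop spent
        name = *it->second;
    }
}
```

Because the budget is shared rather than reset per call, a cycle such as x -> y -> x is rejected after at most 20 hops, and so is a long indirect chain spread across sibling keywords.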

get_object_or_empty makes use of get_schema, but it does not modify the counter for its siblings; need to think about this.

reference merging

$ref with siblings is transformed, during evaluation, to allOf schemas.
This captures the semantics defined by the json schema standard and the implementation used by validators like jsoncons.

an example:

this schema

{ 
"$ref": "#/a_fragment",
"type": "number", 
"minimum": 10
}

will be converted to

{
  "allOf": [
    { //schema at #/a_fragment
    },
    { "type": "number", "minimum": 10 }
  ]
}
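The same transformation as a small sketch. This is an illustration only: json_object is a toy model (ordered key / raw json text pairs), merge_ref_with_siblings is a hypothetical name, and the referenced schema is assumed to be already resolved to text.

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy object model: ordered list of (key, raw json text of the value).
using json_object = std::vector<std::pair<std::string, std::string>>;

// Serialize the toy object back to json text.
std::string to_text(const json_object& obj) {
    std::string out = "{";
    for (const auto& [key, value] : obj) {
        if (out.size() > 1) {
            out += ",";
        }
        out += "\"" + key + "\":" + value;
    }
    return out + "}";
}

// Hypothetical helper: referenced_schema is the text of the schema that the
// "$ref" member of obj points to.
std::string merge_ref_with_siblings(
  const json_object& obj, const std::string& referenced_schema) {
    json_object siblings;
    for (const auto& member : obj) {
        if (member.first != "$ref") {
            siblings.push_back(member);
        }
    }
    if (siblings.empty()) {
        // A pure "$ref" needs no wrapping.
        return referenced_schema;
    }
    return "{\"allOf\":[" + referenced_schema + "," + to_text(siblings) + "]}";
}
```

For the example above, if #/a_fragment held {"type":"integer"}, the result would be {"allOf":[{"type":"integer"},{"type":"number","minimum":10}]}.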

not implemented yet

different dialects for bundled schema: need to keep track of the dialect for each json::Value
external references: part of another jira ticket.

Fixes https://redpandadata.atlassian.net/browse/CORE-6836

Backports Required

Release Notes

none

BenPope · 2024-09-25T09:42:51Z

src/v/pandaproxy/schema_registry/json.cc

+    auto maybe_draft4_id_it = this_obj.find("id");
+    auto maybe_id_it = this_obj.find("$id");


Should the dialect be looked up first, to determine whether to use "$id" or "id"?

i tried to come up with a simpler form, but the fact is that we know we are in a bundled schema if we find an "id" member; only then can "$schema" be analyzed, and the dialect has to agree with "$id" or "id". it's kind of a chicken-and-egg problem,
so that's why i just try to find both, and if at least one is found then i assume we are in a bundled schema and i can check if it's correct or not.

pgellert

great effort, looks good so far. I'd love to see some tests for all this behaviour and various edge cases 😄

andijcr · 2024-09-30T16:55:00Z

force push: addressed comments, added test cases

vbotbuildovich · 2024-09-30T19:02:53Z

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#01924419-c191-4958-bfe6-605084731cb1:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#0192441a-3d8e-4787-b67a-3d90e154d27c:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"

…e uri type

the base uri for a schema is saved with the "$id" or "id" key, and any relative reference is relative to this uri. since the uri can contain other parts such as the protocol (http vs https) or a port, this type is meant to be a canonical representation as far as references are concerned. the form is host[/path], where /path is optional.

new type alias: id_to_schema_pointer. it's a map to resolve references: json_id_uri -> {json_pointer, dialect}. with a json_id_uri, we get the path to the actual json object for the bundled schema, and the dialect used by it. an absolute reference will first query the json_id_uri, get the path to the root object, and then reach for the specific object.
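A rough sketch of how such a canonical id and lookup map could fit together. Everything here is an assumption made for illustration: canonical_id and locate are hypothetical names, the map uses plain strings instead of json_id_uri and json pointers, the dialect enum is abbreviated, and IPv6 hosts are not handled.

```cpp
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Abbreviated, hypothetical dialect list.
enum class dialect { draft4, draft6, draft7 };

// Hypothetical canonical form: drop fragment, scheme and port, keep
// host[/path].
std::string canonical_id(std::string_view uri) {
    constexpr auto npos = std::string_view::npos;
    uri = uri.substr(0, uri.find('#'));
    if (const auto pos = uri.find("://"); pos != npos) {
        uri.remove_prefix(pos + 3);
    }
    const auto path_start = uri.find('/');
    auto authority = uri.substr(0, path_start);
    const auto path = path_start == npos ? std::string_view{}
                                         : uri.substr(path_start);
    authority = authority.substr(0, authority.find(':')); // drop the port
    return std::string{authority} + std::string{path};
}

// canonical id -> {pointer to the root of the bundled schema, dialect}
using id_to_schema_pointer
  = std::map<std::string, std::pair<std::string, dialect>>;

// Two-step lookup: find the bundled schema by id, then append the fragment
// of the reference to the pointer of its root object.
std::optional<std::string>
locate(const id_to_schema_pointer& bundled, std::string_view absolute_ref) {
    const auto hash = absolute_ref.find('#');
    const auto it = bundled.find(canonical_id(absolute_ref.substr(0, hash)));
    if (it == bundled.end()) {
        return std::nullopt;
    }
    const auto fragment = hash == std::string_view::npos
                            ? std::string_view{}
                            : absolute_ref.substr(hash + 1);
    return it->second.first + std::string{fragment};
}
```

With this form, "http://example.com/schemas/a.json" and "https://example.com:443/schemas/a.json" map to the same key, so a reference finds the bundled schema regardless of protocol or port.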

andijcr · 2024-10-01T20:09:38Z

force push: rebase on dev, switched from jsoncons::json to jsoncons::ojson to preserve insertion order of the json keys.

    the switch to jsoncons::json in parse_json means that the function always performed
    key sorting, making the "normalize" flag redundant

    jsoncons can work in insertion-order mode, to do so we switch to the
    type alias jsoncons::ojson.

    this is done to preserve the original order of the input
    see
    tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize
    for an example where this can be observed externally from the schema
    registry API

andijcr · 2024-10-02T09:37:44Z

force push: removed brittle heuristic, fixed a warning for an error message, tried to appease clang-format

andijcr · 2024-10-02T09:43:29Z

#23461 (comment)
edit: issue was between keyboard and chair

BenPope

Looks good. Not reviewed all the tests yet.

BenPope · 2024-10-02T10:25:04Z

src/v/pandaproxy/schema_registry/json.cc

+        bundled_schemas.insert_or_assign(
+          {}, std::pair{json::Pointer{}, dialect});


Why is anything inserted?

i wanted to keep the invariant of bundled schemas always having at least an entry for the root.
in this specific case there is no practical purpose, but since in the rest of the file we try to treat true as {} i felt that it's no harm to extend this also here.

I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.

BenPope · 2024-10-02T10:51:21Z

src/v/pandaproxy/schema_registry/json.cc

+            for (auto i = 0u; i < value.size(); ++i) {
+                if (value[i].is_object()) {
+                    collect_and_fix(i, value[i]);
+                }
+            }


I suppose in theory, there could be arrays of arrays?

😭 in the interest of getting this pr in, I'll rework this in the next pr for external refs

on second thought, I don't think it's possible to meaningfully have arrays of arrays with some reachable references:
"type": "array" requires prefixItems to be an array of schemas,
similarly allOf/oneOf/anyOf are schema arrays

BenPope · 2024-10-02T11:03:04Z

src/v/pandaproxy/schema_registry/json.cc

+          fmt::format(
+            "Unsupported merging of references with base object: '{}'",
+            pj{candidate})});


This will give a subset of the doc, but I wonder if the $ref is more helpful?

removed in the following commit set (kept here to support bisecting)

BenPope · 2024-10-02T11:41:17Z

src/v/pandaproxy/schema_registry/json.cc

+        for (auto& [k, v] : obj) {
+            if (k != ref_key) {
+                doc.AddMember(
+                  json::Value(k, alloc), json::Value(v, alloc), alloc);


note: I guess we don't need to copy any strings (json::Value 3rd param) since everything is kept alive in another doc somewhere.

i tried but that's not how the rapidjson api works, the Document has to own <key,value>

I thought the default was to not copy?

https://github.com/Tencent/rapidjson/blob/815e6e7e7e14be44a6c15d9aefed232ff064cad0/include/rapidjson/document.h#L1381C9-L1387C36

basically it requires lvalues Value::AddMember(Value&, Value&, Allocator&) but the implementation will take ownership and set the inputs as null.
i haven't found another API to do it

I was thinking of this: https://github.com/Tencent/rapidjson/blob/815e6e7e7e14be44a6c15d9aefed232ff064cad0/include/rapidjson/document.h#L742

ah, you mean bool copyConstStrings = false)
do we dare? seems doable

Yes. Nothing to be done, I think.

ok i added it

vbotbuildovich · 2024-10-02T12:21:28Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cdc-6216-4f22-bddd-a8d4fadad868

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13c1-441c-85f1-9474004c556e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13bd-4fcc-b7ca-c41b5c03981e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55820#0192591e-d718-4be7-b33b-4e62a8ec48e3

pgellert

I left a few minor suggestions/questions but it pretty much looks good.

src/v/pandaproxy/schema_registry/test/test_json_schema.cc

pgellert · 2024-10-02T18:20:40Z

src/v/pandaproxy/schema_registry/json.cc

+        bundled_schemas.insert_or_assign(
+          {}, std::pair{json::Pointer{}, dialect});


I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.

pgellert · 2024-10-03T10:42:33Z

src/v/pandaproxy/schema_registry/json.cc

+            if (dialect_it == this_obj.object_range().end()) {
+                // we can keep using the parent dialect
+                return dialect;
+            }


nit:

Suggested change

-            if (dialect_it == this_obj.object_range().end()) {
-                // we can keep using the parent dialect
-                return dialect;
-            }
+            if (dialect_it == this_obj.object_range().end()) {
+                // If no $schema is declared in an embedded schema, it defaults to using the dialect of the parent schema.
+                // from https://json-schema.org/understanding-json-schema/structuring#bundling
+                return dialect;
+            }

pgellert · 2024-10-03T10:46:40Z

src/v/pandaproxy/schema_registry/json.cc

+        // run validation since we are not a guaranteed to be in proper schema
+        if (validate_json_schema(maybe_new_dialect.value(), this_obj)
+              .has_error()) {
+            // stop exploring this branch, the schema is invalid
+            return;
+        }


As discussed, I suggest throwing here. Afaik, this can only happen when we enter a bundled schema outside the standard (and metaschema-validated) "$defs/definitions" path. The behaviour is sufficiently "weird" in this case that I would be inclined to be defensive here, fail early and reject the schema altogether rather than accept a (partially) valid schema.

checking the ref implementation behavior, will add a commit to match the result

taking the stricter approach to throw if the bundled schema is invalid.

in a follow-up this can be relaxed if the bundled schema is never actually referenced

pgellert · 2024-10-03T10:55:18Z

src/v/pandaproxy/schema_registry/json.cc

+        if (maybe_new_dialect.has_value() == false) {
+            // stop scanning this tree, we might be in a bundled schema but we
+            // don't know the dialect.
+            return;
+        }


Should we throw here?

checking the ref implementation behavior, will add a commit to match the result

pgellert · 2024-10-03T10:59:01Z

src/v/pandaproxy/schema_registry/json.cc

+    // keyword | $schema  | is bundled schema
+    //   "$id" |          | if root is >=draft6
+    //   "$id" | >=draft6 |       yes
+    //   "$id" |  draft4  |       no
+    //   "id"  |          | if root is draft4
+    //   "id"  | >=draft6 |       no
+    //   "id"  |  draft4  |       yes


Also, I know that draft5 exists but it might be simpler to just write draft4 and >draft4 (or !=draft4) here
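For reference, the table in the comment above can be written as one small predicate. This is a sketch, not the PR's code: is_bundled_schema is a hypothetical name and the dialects are collapsed to the draft4 / newer split suggested here.

```cpp
#include <optional>
#include <string_view>

// Collapsed dialect list, as suggested in the review comment.
enum class dialect { draft4, draft6_or_later };

// id_keyword: the key found in the object ("$id" or "id").
// declared:   the dialect named by "$schema" in the same object, if any.
// root:       the dialect of the root schema.
bool is_bundled_schema(
  std::string_view id_keyword, std::optional<dialect> declared, dialect root) {
    // Without "$schema" the object falls back to the dialect of the root.
    const auto effective = declared.value_or(root);
    if (id_keyword == "$id") {
        return effective == dialect::draft6_or_later;
    }
    if (id_keyword == "id") {
        return effective == dialect::draft4;
    }
    return false;
}
```

Each row of the table maps to one call, for example is_bundled_schema("id", dialect::draft6_or_later, dialect::draft4) is false because "id" only identifies a schema in draft4.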

pgellert · 2024-10-03T11:04:40Z

src/v/pandaproxy/schema_registry/json.cc

+        bundled_schemas.insert_or_assign(
+          {}, std::pair{json::Pointer{}, dialect});


Would this be equivalent? I'm wondering if (1) there's a difference in behaviour here between an empty object and a bool ("true") and (2) whether the creation of root_id could be pulled out of the branches.

Suggested change

-        bundled_schemas.insert_or_assign(
-          {}, std::pair{json::Pointer{}, dialect});
+        bundled_schemas.insert_or_assign(
+          json_id_uri{""}, std::pair{json::Pointer{}, dialect});

equivalent, but reworked the function to factor out this part as for (2)

pgellert · 2024-10-03T11:06:54Z

src/v/pandaproxy/schema_registry/json.cc

+    // to use rapidjson we need to serialized schema again
+    auto iobuf_os = iobuf_ostream{};


Suggested change

-    // to use rapidjson we need to serialized schema again
-    auto iobuf_os = iobuf_ostream{};
+    // to use rapidjson we need to serialize the schema again
+    // We take a copy of the jsoncons schema here because it has the fixed-up references that we want to use for compatibility checks
+    auto iobuf_os = iobuf_ostream{};

BenPope

I have 2 commits here:

I didn't see tests for relative references so I added some
Recursion results in stack-overflow (this should be fixed)

"^b" is repeated twice, and this is not illegal but the result is not dependent on the strategy used by the json parse

before this commit, get_object_or_empty would get a context to pass to get_schema. this was done with the idea of resolving references in get_schema and benefiting from that. however, this clashes with the design of resolving references only in is_superset, and the standard prevents "properties", "patternProperties" and "dependency" from being references (see tests). this means that the whole code can be simplified (at the cost of handling objects and bools directly in get_object_or_empty). keywords that can be references (like items, additionalItems, additionalProperties) were already passed directly to is_superset, so resolving the references early is a capability that's not needed

this recursive function crawls the schema to gather all bundled schemas. a bundled schema is defined as a schema with "$id" defined. the recursion is on the "$defs"/"definitions" members of a schema, since the schema validation ensures that children of these properties are valid schemas.
some small notes:
- it's possible that bundled schemas use a different dialect than the root object, so we collect this
- the bundled schemas could be nested
- we can't error in this function since it's called at parse time and it's not assured that a bundled schema is actually accessed

add helper to resolve references (local or to bundled schemas).
add _ref_unit to keep track of how many references can be traversed in the current subtree, this will be used to prevent infinite recursion.

this helper can hold a json::Value const& or a json::Document, and can transparently convert to json::Value const&. it will be used to merge references: to resolve a reference with siblings we need to create a new json schema. internally it is a variant of two pointers, because holding a json::Document is the unlikely case and it would occupy 104 bytes
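A minimal sketch of a holder with this shape, assuming stand-in types: value and document below are placeholders for json::Value and json::Document, and value_holder is a hypothetical name, not the helper added in this commit.

```cpp
#include <memory>
#include <utility>
#include <variant>

// Placeholders for json::Value and json::Document.
struct value {
    int payload = 0;
};
struct document : value {};

// Either borrows an existing value or owns a freshly built document, and
// exposes both as a value const&.
class value_holder {
public:
    // Borrow: the caller keeps the referenced value alive.
    explicit value_holder(const value& borrowed)
      : _v{&borrowed} {}
    // Own: the document lives on the heap, so the holder itself stays the
    // size of a pointer plus the variant tag.
    explicit value_holder(document owned)
      : _v{std::make_unique<document>(std::move(owned))} {}

    const value& get() const {
        return std::visit(
          [](const auto& ptr) -> const value& { return *ptr; }, _v);
    }
    // Transparent conversion, so the holder can be passed where a
    // value const& is expected.
    operator const value&() const { return get(); }

private:
    std::variant<const value*, std::unique_ptr<document>> _v;
};
```

The common case (a plain reference into the original document) costs no allocation; only a merged schema, which has to be materialized, pays for the heap-allocated document.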

this function accepts a list of json objects to merge into a new schema with the form { "allOf": [ *input_list ] }. the input_list is expected to be a chain of reference objects of size >=2, all of them with a $ref key except the last one. the final schema will contain these objects, with the $ref key removed from each one. the user is responsible for keeping the original json::Document alive, since the non-ref strings will be referenced in the result.
one optimization: if the chain contains only one non-empty object without the $ref key, this will be returned directly. this is to support the likely case of a chain of pure $ref pointing to the final schema

get_schema will use resolve_reference to traverse references until the final target is reached. references are expected to be absolute, thanks to the preprocessing phase. retrieval is a two-step process: first retrieve the bundled schema, then retrieve the sub object. if the sub object contains a reference, repeat the process. the function uses schema_context to limit the number of iterations. the number of iterations is shared across children of an object to prevent indirect references from causing a stack overflow.
external references are not supported.
different dialects in bundled schemas are not supported in this commit; the function throws if the dialects do not match

the test cases are positive/negative tests around $ref resolution. we have:
- local relative refs
- local absolute ref
- refs to bundled schemas
- multiple jumps refs
- recursive refs (rejection)
- array of refs
- refs with siblings
this last one shows the transformation performed: a { $ref: uri siblings... } will be internally treated as { allOf: [ { //uri referenced schema} { siblings... } ] }

this test shows how the in-memory representation of $ref is absolute relative to the correct base uri. it uses jsoncons and the jsonpatch extension for easier parse/print/compare ops

andijcr · 2024-10-04T15:41:27Z

I have 2 commits here:

1. I didn't see tests for relative references so I added some

2. Recursion results in stack-overflow (this should be fixed)

added them
the fix for this is to ensure that ref resolution is done only in is_superset. to do so i decoupled get_object_or_empty from get_schema. get_object_or_empty does not need the context because it's used for keywords like "properties" that can't be a reference (added a test for this), or for objects that get passed directly to is_superset, so there is no need to resolve references

fixed stackoverflow due to unhandled infinite recursion

take the safe approach and reject schemas with invalid bundled schema. reason for rejection: the dialect is unknown, the dialect does not match the key used for "$id", or the bundled schema does not pass validation.

github-actions bot added the area/redpanda label Sep 24, 2024

andijcr requested review from BenPope and pgellert September 24, 2024 16:30

BenPope reviewed Sep 25, 2024

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 77db262 to 9096718 Compare September 26, 2024 16:09

pgellert reviewed Sep 26, 2024

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 9096718 to 03ec3de Compare September 30, 2024 16:16

andijcr requested review from BenPope and pgellert September 30, 2024 16:37

andijcr marked this pull request as ready for review September 30, 2024 16:37

andijcr requested a review from a team September 30, 2024 16:56

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 03ec3de to 2818ab0 Compare October 1, 2024 20:08

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch 3 times, most recently from 60e2e23 to 8fb9da1 Compare October 2, 2024 09:37

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 8fb9da1 to 92ca754 Compare October 2, 2024 09:41

BenPope reviewed Oct 2, 2024

schema_registry/json: refactor parse_json

bf3a7fc

switch roles between jsoncons<->rapidjson. jsoncons has a nicer api for the next commit

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 92ca754 to bce34ad Compare October 2, 2024 13:23

BenPope mentioned this pull request Oct 3, 2024

[CORE-6836] schema_registry/json: Support Internal References #22815

Closed

7 tasks

pgellert reviewed Oct 3, 2024

BenPope previously requested changes Oct 3, 2024

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from bce34ad to 1f01a68 Compare October 4, 2024 14:07

andijcr added 11 commits October 4, 2024 16:39

schema_registry/test_json_schema: test fix

dfb852c

"^b" is repeated twice, and this is not illegal but the result is not dependent on the strategy used by the json parse

schema_registry/json: schema_context find_bundled and consume_ref_units

0f8019f

add helper to resolve references (local or to bundled schemas).
add _ref_unit to keep track of how many references can be traversed in the current subtree, this will be used to prevent infinite recursion.

schema_registry/test_json_schema: test_refs_fixing

cac344f

this test shows how the in-memory representation of $ref is absolute relative to the correct base uri. it uses jsoncons and the jsonpatch extension for easier parse/print/compare ops

schema_registry/json: remove unused headers

cae592f

andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 1f01a68 to cae592f Compare October 4, 2024 15:34

andijcr requested review from pgellert and BenPope October 4, 2024 15:37

schema_registry/json: reject invalid bundled schema

a620b5f

take the safe approach and reject schemas with invalid bundled schema. reason for rejection: the dialect is unknown, the dialect does not match the key used for "$id", or the bundled schema does not pass validation.
