Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug-1921849: fix deprecated type: multi_field from super_search_fields #6736

Merged
merged 1 commit into from
Oct 3, 2024

Conversation

relud
Copy link
Member

@relud relud commented Sep 30, 2024

multi_field was removed in elasticsearch 1.0, and in 1.x is automatically translated to the new format, but starting with 5.x it throws an error

also fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1905772

@relud relud requested a review from willkg September 30, 2024 17:49
@relud relud requested a review from a team as a code owner September 30, 2024 17:49
@@ -1718,9 +1708,8 @@ def apply_schema_properties(fields, schema):
"storage_mapping": {
"fields": {
"full": {"index": "not_analyzed", "type": "string"},
"os_name": {"type": "string"},
Copy link
Member Author

@relud relud Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the only multi-field where in_database_name doesn't match the field name, which was confusing at first but I think I got the change correct here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 1716 suggests this shouldn't be analyzed. However, the values are things like "Windows NT", "Mac OS X", etc and people don't search with those values--they're searching with "Windows" and "Mac". Further, the Super Search form values don't match the actual values, either.

I think you've got the right idea here in indexing the value with the string analyzer. We should remove my comment on line 1706.

Then I wondered whether we should change the data_validation_type and query_type values. Fun fact, for all supersearch fields, those match with the exception of addons_checked. That has a query_type of enum but that should be bool.

Weirdly, the query_type affects whether we can use the field for histograms and data_validation_type affects the operators you can use (which I would have expected query_type to be used for) and how the querystring parameter values are converted.

I think given that data_validation_type and query_type match for all the fields, we should merge those into a single data_type field used for all the things. That'll simplify the code a little more and make it easier to read and understand the search code.

We could do that work (merging data_validation_type and query_type fields and fixing addons_checked) in a separate bug/PR. Let me know if you want me to write that up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i'd definitely prefer to do that in a separate PR

@willkg
Copy link
Contributor

willkg commented Sep 30, 2024

I'm not sure how to review this. It's hard to double-check all the work and because it's applied to socorro/external/es/, there's no way to test it.

I think I would apply this to socorro/external/legacy_es/ so we can test it and verify it. If we don't want to do that, then I would defer this to the PR where we have Elasticsearch 8 working so we can test things.

@relud
Copy link
Member Author

relud commented Sep 30, 2024

I'm not sure how to review this. It's hard to double-check all the work and because it's applied to socorro/external/es/, there's no way to test it.

you can test locally by modifying mozilla_settings.py or running with ELASTICSEARCH_MODE=PREFER_NEW and ELASTICSEARCH_URL=$LEGACY_ELASTICSEARCH_URL like the CI tests, depending on whether you're testing the webapp or not.

I think I would apply this to socorro/external/legacy_es/ so we can test it and verify it. If we don't want to do that, then I would defer this to the PR where we have Elasticsearch 8 working so we can test things.

I'm fine applying it to socorro/external/legacy_es/ as well, but I got the impression this had been put off previously because it could make elasticsearch in stage/prod unhappy? Most of the changes should be fine because elasticsearch is essentially converting it to the new thing anyway in 1.X, but that may not be the case for the fields whose name wasn't in "fields" to start with?

@relud relud force-pushed the relud-bug-1921849-rm-multi-field branch 2 times, most recently from 1e80bcb to 56dcc64 Compare September 30, 2024 19:23
@willkg
Copy link
Contributor

willkg commented Oct 1, 2024

you can test locally by modifying mozilla_settings.py or running with ELASTICSEARCH_MODE=PREFER_NEW and ELASTICSEARCH_URL=$LEGACY_ELASTICSEARCH_URL like the CI tests, depending on whether you're testing the webapp or not.

In order to test this change, we need indexing and searching working and we don't have searching working, yet. That's why I think I can't test this.

If that's not true, can you walk me through how you tested indexing and searching?

I'm fine applying it to socorro/external/legacy_es/ as well, but I got the impression this had been put off previously because it could make elasticsearch in stage/prod unhappy? Most of the changes should be fine because elasticsearch is essentially converting it to the new thing anyway in 1.X, but that may not be the case for the fields whose name wasn't in "fields" to start with?

In our talk last week, we talked about making the multi_field change as a pre-migration step so it applied to both Elasticsearch crashstorage implementations because Elasticsearch is already doing it internally and then we had fewer things in the big PR coming later.

I think this PR makes additional changes to super_search_fields.py to fix things like bug 1905772 that we can't make without rebuilding indexes--those changes we should make as part of the big PR later and they should only be made to the ES 8 code.

Does that plan work? If not, then I think we should bail on this PR.

@willkg
Copy link
Contributor

willkg commented Oct 1, 2024

We talked about this in Zoom. Summary:

  1. In the local dev environment at the time of this PR, both the socorro/external/es/ and socorro/external/legacy_es/ directories have Elasticsearch 1.4 code in them. So I can review this and test things out.
  2. We talked about @relud 's concern about why I want to manually test things and whether that's an indication that our test suite is missing stuff. I think our test suite covers these bits well. However, this is a big change to mapping that I want to understand better and want to be able to explore the change. That's the reason I want to be able to index and search. That's doable with the PR in its current state.

I'm good to review this now.

When I'm done reviewing it, @relud will remove the changes to the socorro/external/legacy_es/ directory.

@relud ^^^ I think that captures everything for this PR. Let me know if I misunderstood anything.

Copy link
Contributor

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one minor thing that should get fixed and thoughts on a follow-up PR to simplify data_validation_type and query_type.

This is good! 💯

# object, and multi_field storages don't work with doc_values=True
if storage_type in ("object", "multi_field"):
# object storages don't work with doc_values=True
if storage_type in ("object"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if storage_type in ("object"):
if storage_type == "object":

socorro/external/es/super_search_fields.py Show resolved Hide resolved
@@ -1718,9 +1708,8 @@ def apply_schema_properties(fields, schema):
"storage_mapping": {
"fields": {
"full": {"index": "not_analyzed", "type": "string"},
"os_name": {"type": "string"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 1716 suggests this shouldn't be analyzed. However, the values are things like "Windows NT", "Mac OS X", etc and people don't search with those values--they're searching with "Windows" and "Mac". Further, the Super Search form values don't match the actual values, either.

I think you've got the right idea here in indexing the value with the string analyzer. We should remove my comment on line 1706.

Then I wondered whether we should change the data_validation_type and query_type values. Fun fact, for all supersearch fields, those match with the exception of addons_checked. That has a query_type of enum but that should be bool.

Weirdly, the query_type affects whether we can use the field for histograms and data_validation_type affects the operators you can use (which I would have expected query_type to be used for) and how the querystring parameter values are converted.

I think given that data_validation_type and query_type match for all the fields, we should merge those into a single data_type field used for all the things. That'll simplify the code a little more and make it easier to read and understand the search code.

We could do that work (merging data_validation_type and query_type fields and fixing addons_checked) in a separate bug/PR. Let me know if you want me to write that up.

@relud relud force-pushed the relud-bug-1921849-rm-multi-field branch from 20040be to c68e0ec Compare October 3, 2024 17:04
@relud relud enabled auto-merge October 3, 2024 17:04
@relud relud added this pull request to the merge queue Oct 3, 2024
Merged via the queue into main with commit 976725d Oct 3, 2024
1 check passed
@relud relud deleted the relud-bug-1921849-rm-multi-field branch October 3, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants