Optimized nested pagination is broken for m2m fields #650

SupImDos · 2024-10-25T05:46:58Z

Describe the Bug

Nested pagination through m2m fields is broken with the optimizer enabled. This appears to be because using .prefetch_related() with a QuerySet annotated with a Window function where partition_by is an m2m field causes an extra join and duplicate results.

Reproducible Example

See a minimal reproducible example here:

Branch: mre/nested-pagination-m2m
Commit: 5469f26

In the unit test above, the expected result of the query is:

{
  "tagConn": {
    "edges": [
      {
        "node": {
          "name": "Tag 1",
          "issues": {
            "edges": [
              {"node": {"name": "Issue 1"}},
              {"node": {"name": "Issue 2"}}
            ]
          }
        }
      },
      {
        "node": {
          "name": "Tag 2",
          "issues": {
            "edges": [
              {"node": {"name": "Issue 2"}},
              {"node": {"name": "Issue 3"}}
            ]
          }
        }
      }
    ]
  }
}

However, the test fails, and the actual result of the query is:

{
  "tagConn": {
    "edges": [
      {
        "node": {
          "name": "Tag 1",
          "issues": {
            "edges": [
              {"node": {"name": "Issue 1"}},
              {"node": {"name": "Issue 2"}},
              {"node": {"name": "Issue 2"}}  // Duplicate!
            ]
          }
        }
      },
      {
        "node": {
          "name": "Tag 2",
          "issues": {
            "edges": [
              {"node": {"name": "Issue 2"}},
              {"node": {"name": "Issue 2"}},  // Duplicate!
              {"node": {"name": "Issue 3"}}
            ]
          }
        }
      }
    ]
  }
}

System Information

Operating system: MacoS
Strawberry version: 0.247.0
Strawberry Django version: 0.49.1

Additional Context

It appears that this is caused by the implementation of apply_window_pagination().

As suggested above, when you annotate a .prefetch_related() QuerySet with a Window function and refer back to the other side of the m2m, you add another join - causing duplicate results. When Django introduced the ability to prefetch with sliced querysets (which Strawberry-Django isn't directly using) they had to do a bunch of hacks to combine calls to filter together (see QuerySet._next_is_sticky() and _filter_prefetch_queryset()).

To demonstrate the difference directly with the Django ORM see: a6f4e31.

In the unit test above, Strawberry-Django produces the following SQL query for the paginated prefetch:

SELECT *
FROM
  (SELECT ("projects_issue_tags"."tag_id") AS "_prefetch_related_val_tag_id",
          "projects_issue"."name" AS "col1",
          "projects_issue"."id" AS "col2",
          ROW_NUMBER() OVER (PARTITION BY "projects_issue_tags"."tag_id"
                             ORDER BY "projects_issue"."id" ASC) AS "_strawberry_row_number",
          COUNT(1) OVER (PARTITION BY "projects_issue_tags"."tag_id") AS "_strawberry_total_count"
   FROM "projects_issue"
   LEFT OUTER JOIN "projects_issue_tags" ON ("projects_issue"."id" = "projects_issue_tags"."issue_id")
   INNER JOIN "projects_issue_tags" T4 ON ("projects_issue"."id" = T4."issue_id")
   WHERE T4."tag_id" IN (1,
                         2)
   ORDER BY "projects_issue"."id" ASC) "qualify"
WHERE "_strawberry_row_number" <= 2
ORDER BY "col2" ASC

Whereas Django produces this SQL query for the paginated prefetch:

SELECT "_prefetch_related_val_tag_id",
       "col1",
       "col2"
FROM
  (SELECT *
   FROM
     (SELECT ("projects_issue_tags"."tag_id") AS "_prefetch_related_val_tag_id",
             "projects_issue"."name" AS "col1",
             "projects_issue"."id" AS "col2",
             ROW_NUMBER() OVER (PARTITION BY "projects_issue_tags"."tag_id"
                                ORDER BY "projects_issue"."id" ASC) AS "qual0"
      FROM "projects_issue"
      INNER JOIN "projects_issue_tags" ON ("projects_issue"."id" = "projects_issue_tags"."issue_id")
      WHERE "projects_issue_tags"."tag_id" IN (1,
                                               2)
      ORDER BY "projects_issue"."id" ASC) "qualify"
   WHERE ("qual0" > 0
          AND "qual0" <= 2)) "qualify_mask"
ORDER BY "col2" ASC

Note the extra LEFT OUTER JOIN which causes the issue.

Potential Solution

As noted in the apply_window_pagination() docstring, Django 4.2+ actually supports sliced QuerySets in .prefetch_related() now.

The solution here may be to actually use Django's inbuilt support for .prefetch_related() QuerySet slicing. However, this means we can't annotate _strawberry_total_count onto the nodes anymore. It is likely that we would need to refactor that functionality onto the parent records, maybe using a subquery count and OuterRef?

Upvote & Fund

We're using Polar.sh so you can upvote and help fund this issue.
We receive the funding once the issue is completed & confirmed by you.
Thank you in advance for helping prioritize & fund our backlog.

The text was updated successfully, but these errors were encountered:

SupImDos · 2024-10-25T07:38:19Z

For reference, our understanding is that (at a high level) the current nested pagination looks like this:

tag_queryset.prefetch_related(
    Prefetch(
        "issues",
        queryset=issue_queryset.annotate(
            _strawberry_row_number=Window(
                RowNumber(),
                partition_by=["tags"],
                order_by=[...],
            ),
            _strawberry_total_count=Window(
                Count(1),
                partition_by=["tags"],
            ),
        ).filter(
            _strawberry_row_number__gt=5,
            _strawberry_row_number__lte=10,
        ),
    )
)

But to work correctly, it could be updated to something like this:

tag_queryset.prefetch_related(
    Prefetch(
        "issues",
        queryset=issue_queryset[5:10],
        to_attr="_strawberry_optimized_issues",
    )
).annotate(
    _strawberry_total_count_issues=Subquery(
        issue_queryset.filter(tags=OuterRef("pk"))
        .values("tags")
        .annotate(total_count=Count("pk"))
        .values("total_count"),
    )
)

The main complexity here is how to get Strawberry / Strawberry-Django to:

Recognise that a tag's issues will actually be on the _strawberry_optimized_issues attribute (instead of issues)
Recognise that a tag's issue count will actually be on the _strawberry_total_count_issues attribute (instead of annotated on its issues as _strawberry_total_count)

The other complexity is how to calculate things like cursor / start_cursor / end_cursor / has_previous_page / has_next_page - but I think conceptually it should still possible.

bellini666 · 2024-12-22T13:48:45Z

Hey @SupImDos ,

I was doing some tests on this and discovered that the issue is with Django.

If I do this in your test_nested_pagination_m2m for example right after creating/assigning the issues/tags:

Tag.objects.prefetch_related(
    Prefetch("issues", queryset=Issue.objects.all().annotate(foo=F("tags")))
)

I get duplicated issues as well. This is the SQL generated:

SELECT ("projects_issue_tags"."tag_id") AS "_prefetch_related_val_tag_id",
       "projects_issue"."name",
       "projects_issue"."id",
       "projects_issue"."kind",
       "projects_issue"."priority",
       "projects_issue"."milestone_id",
       "projects_issue_tags"."tag_id" AS "foo"
FROM "projects_issue"
LEFT OUTER JOIN "projects_issue_tags" ON ("projects_issue"."id" = "projects_issue_tags"."issue_id")
INNER JOIN "projects_issue_tags" T4 ON ("projects_issue"."id" = T4."issue_id")
WHERE T4."tag_id" IN (3,
                      4)
ORDER BY "projects_issue"."id" ASC

The only thing I did here was to annotate "projects_issue_tags"."tag_id", which is also annotated by the prefetcher (see the first selected column), but Django is forcing a new join for that instead of using the same one, which it should.

We can try to find a related issue for this or report it to django.

In the meantime, I would like to find a way to workaround this without having to rewrite this, because as you already pointed out, the challenge will probably be even greater.

I was trying to play with the QuerySet.query object from the Prefetch object to see if I can force the join to be reused, but so far I got nothing. Will continue trying

bellini666 · 2024-12-23T20:37:30Z

Opened this issue on the Django bug tracker: https://code.djangoproject.com/ticket/36035

Eraldo · 2024-12-24T15:17:45Z

Opened this issue on the Django bug tracker: https://code.djangoproject.com/ticket/36035

Thank you for putting in the effort and figuring out what was going on there. 🙏

bellini666 · 2024-12-24T17:50:00Z

Btw, it seems that we might get a bug fix soon: https://code.djangoproject.com/ticket/36035#comment:1

I tested the patch they mentioned and it indeed fixes the issue. I'll try to workaround some solution for strawberry-django in the meantime

@SupImDos

When annotating something from the relation into the prefetch queryset for a m2m relation, Django will mistakenly not reuse the existing join and end up resulting in the generation of spurious results. There's an ongoing fix for this i this ticket: https://code.djangoproject.com/ticket/35677 This is monkey patching older versions of Django which doesn't contain the fix, and most likely won't (Django usually only backports security issues), to fix the issue. Thanks @SupImDos for providing an MRE in the form of a test for this! Fix #650

@SupImDos

When annotating something from the relation into the prefetch queryset for a m2m relation, Django will mistakenly not reuse the existing join and end up resulting in the generation of spurious results. There's an ongoing fix for this i this ticket: https://code.djangoproject.com/ticket/35677 This is monkey patching older versions of Django which doesn't contain the fix, and most likely won't (Django usually only backports security issues), to fix the issue. Thanks @SupImDos for providing an MRE in the form of a test for this! Fix #650

@SupImDos

When annotating something from the relation into the prefetch queryset for a m2m relation, Django will mistakenly not reuse the existing join and end up resulting in the generation of spurious results. There's an ongoing fix for this i this ticket: https://code.djangoproject.com/ticket/35677 This is monkey patching older versions of Django which doesn't contain the fix, and most likely won't (Django usually only backports security issues), to fix the issue. Thanks @SupImDos for providing an MRE in the form of a test for this! Fix #650

bellini666 · 2024-12-25T13:13:33Z

The only workaround I found was to monkey patch django =P

Although I prefer to avoid doing that, this seems safe as the monkey patched functions share the same code in all supported versions (4.2, 5.0 and 5.1) and we are not going to apply this for 5.2 which is going to contain the fix.

The PR is opened here: #681

SupImDos added the bug Something isn't working label Oct 25, 2024

SupImDos mentioned this issue Oct 28, 2024

Related types yield duplicate nodes and distinct filter fails #651

Open

bellini666 linked a pull request Dec 25, 2024 that will close this issue

fix(optimizer): Fix nested pagination optimization for m2m relations #681

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized nested pagination is broken for m2m fields #650

Optimized nested pagination is broken for m2m fields #650

SupImDos commented Oct 25, 2024 •

edited

Loading

SupImDos commented Oct 25, 2024 •

edited

Loading

bellini666 commented Dec 22, 2024

bellini666 commented Dec 23, 2024

Eraldo commented Dec 24, 2024

bellini666 commented Dec 24, 2024

bellini666 commented Dec 25, 2024

Optimized nested pagination is broken for m2m fields #650

Optimized nested pagination is broken for m2m fields #650

Comments

SupImDos commented Oct 25, 2024 • edited Loading

Describe the Bug

Reproducible Example

System Information

Additional Context

Potential Solution

Upvote & Fund

SupImDos commented Oct 25, 2024 • edited Loading

bellini666 commented Dec 22, 2024

bellini666 commented Dec 23, 2024

Eraldo commented Dec 24, 2024

bellini666 commented Dec 24, 2024

bellini666 commented Dec 25, 2024

SupImDos commented Oct 25, 2024 •

edited

Loading

SupImDos commented Oct 25, 2024 •

edited

Loading