Try to tune project listing query again #11620

agjohnson · 2024-09-25T19:21:13Z

Don't subquery for builds in project listing prefetch

This seems like it's unnecessary, but the prefetch is not accurate using Build.objects first.

agjohnson · 2024-09-25T22:44:11Z

readthedocs/projects/querysets.py

-            .values_list("id", flat=True)[:1]
+        # Get most recent and recent successful builds
+        builds_latest = (
+            Build.internal.filter(project__in=self)


It's not clear why Build.internal is needed in both the inner query and the prefetch query. It seems like one of them could be Build.objects at the very least? The performance is a little better with Build.objects.

I understand this is to avoid getting builds from PRs (external versions), and I'd say that .internal should perform better than .objects since it removes a lot of builds from consideration, and they should be removed using an index on Build.type. That's the theory, tho 😄

Heh same, that was my thought initially. There is some complexity added by this method though, which I think ultimately annoys the query planner.

agjohnson · 2024-09-25T22:44:53Z

readthedocs/projects/querysets.py

+            .annotate(latest=Max("pk"))
+            .values_list("latest", flat=True)
+        )
+        builds_success = (


This feels like it could be combined into the query above, saving ~400ms.
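One way to picture combining the two lookups is conditional aggregation: a single grouped pass that returns both the latest build id and the latest successful build id per project. The sketch below uses the standard-library sqlite3 module rather than the real models; the `build` table and its `success` column are assumptions for illustration only. In the Django ORM the rough equivalent would be a second `Max("pk", filter=Q(...))` annotation on the same grouped queryset, and whether that is faster here would need to be measured.

```python
import sqlite3

# Hypothetical, simplified stand-in for the builds table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE build (id INTEGER PRIMARY KEY, project_id INTEGER, success INTEGER);
    INSERT INTO build (id, project_id, success) VALUES
        (1, 10, 1), (2, 10, 0), (3, 20, 1), (4, 20, 1), (5, 30, 0);
    """
)

# One grouped pass: latest id and latest successful id per project.
# MAX() ignores NULLs, so the CASE expression only counts successful builds.
rows = conn.execute(
    """
    SELECT project_id,
           MAX(id) AS latest,
           MAX(CASE WHEN success = 1 THEN id END) AS latest_success
    FROM build
    GROUP BY project_id
    ORDER BY project_id
    """
).fetchall()

print(rows)  # [(10, 2, 1), (20, 4, 4), (30, 5, None)]
```

Project 30 has no successful build, so its second column is NULL (None in Python) instead of the row being dropped.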

agjohnson · 2024-09-25T23:04:31Z

This reduced the time needed for prefetch from 12s to 3s, but this is still not usable. It's really not clear why, but the planner was previously falling apart on this and triggered a sequential scan on builds_version for some reason.

This is most easily testable against one of our accounts:

In [1]: %time list(Project.objects.dashboard(User.objects.filter(is_staff=True).first()))
CPU times: user 28.8 ms, sys: 45 μs, total: 28.9 ms
Wall time: 2.44 s

The current explain looks different than it did (specifically, it no longer does a sequential scan on builds_version), but it is still overly complex:

Sort  (cost=32709.47..32709.91 rows=179 width=339) (actual time=1304.421..1304.425 rows=15 loops=1)
  Sort Key: builds_build.date DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Nested Loop Left Join  (cost=30818.25..32702.77 rows=179 width=339) (actual time=1304.218..1304.392 rows=15 loops=1)
        Filter: (((builds_version.type)::text <> 'external'::text) OR (builds_version.type IS NULL))
        ->  Nested Loop  (cost=30817.82..32514.43 rows=185 width=339) (actual time=1304.207..1304.307 rows=15 loops=1)
              ->  HashAggregate  (cost=30817.39..30819.39 rows=200 width=4) (actual time=1304.175..1304.183 rows=19 loops=1)
                    Group Key: max(v0.id)
                    ->  GroupAggregate  (cost=78.79..30715.36 rows=8162 width=8) (actual time=2.336..1304.159 rows=19 loops=1)
                          Group Key: v0.project_id
                          ->  Nested Loop Left Join  (cost=78.79..30592.93 rows=8162 width=8) (actual time=0.152..1261.318 rows=450264 loops=1)
                                Filter: (((v1.type)::text <> 'external'::text) OR (v1.type IS NULL))
                                Rows Removed by Filter: 1898
                                ->  Nested Loop  (cost=78.36..26593.19 rows=8462 width=12) (actual time=0.136..442.442 rows=452162 loops=1)
                                      ->  Unique  (cost=77.80..77.86 rows=11 width=4) (actual time=0.121..0.140 rows=19 loops=1)
                                            ->  Sort  (cost=77.80..77.83 rows=11 width=4) (actual time=0.121..0.129 rows=19 loops=1)
                                                  Sort Key: u0.id
                                                  Sort Method: quicksort  Memory: 25kB
                                                  ->  Nested Loop  (cost=0.85..77.61 rows=11 width=4) (actual time=0.019..0.108 rows=19 loops=1)
                                                        ->  Index Scan using projects_project_users_user_id on projects_project_users u1  (cost=0.42..24.74 rows=11 width=4) (actual time=0.007..0.027 rows=19 loops=1)
                                                              Index Cond: (user_id = 14481)
                                                        ->  Index Only Scan using projects_project_pkey on projects_project u0  (cost=0.42..4.81 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=19)
                                                              Index Cond: (id = u1.project_id)
                                                              Heap Fetches: 3
                                      ->  Index Scan using builds_build_project_id on builds_build v0  (cost=0.56..2402.78 rows=769 width=12) (actual time=0.010..20.851 rows=23798 loops=19)
                                            Index Cond: (project_id = u0.id)
                                ->  Index Only Scan using idx_builds_version_id_type on builds_version v1  (cost=0.43..0.46 rows=1 width=9) (actual time=0.001..0.001 rows=1 loops=452162)
                                      Index Cond: (id = v0.version_id)
                                      Heap Fetches: 410238
              ->  Index Scan using builds_build_pkey on builds_build  (cost=0.44..8.47 rows=1 width=339) (actual time=0.006..0.006 rows=1 loops=19)
                    Index Cond: (id = (max(v0.id)))
                    Filter: (project_id = ANY ('{487639,74581,689368,24458,170010,613422,714226,521174,256207,815321,527062,451683,17662,233368,489923}'::integer[]))
                    Rows Removed by Filter: 0
        ->  Index Only Scan using idx_builds_version_id_type on builds_version  (cost=0.43..1.01 rows=1 width=9) (actual time=0.005..0.005 rows=1 loops=15)
              Index Cond: (id = builds_build.version_id)
              Heap Fetches: 12
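When reading a plan like the one above, it helps to remember that `actual time` in EXPLAIN ANALYZE output is a per-loop average, so a node's total is roughly its end time multiplied by `loops`. The following is a minimal sketch of that arithmetic over three node lines copied from the plan; the `node_totals` helper and its regular expression are hypothetical and only handle this line shape.

```python
import re

# Three node lines copied from the plan above.
PLAN = """\
->  Nested Loop  (cost=78.36..26593.19 rows=8462 width=12) (actual time=0.136..442.442 rows=452162 loops=1)
->  Index Scan using builds_build_project_id on builds_build v0  (cost=0.56..2402.78 rows=769 width=12) (actual time=0.010..20.851 rows=23798 loops=19)
->  Index Only Scan using idx_builds_version_id_type on builds_version v1  (cost=0.43..0.46 rows=1 width=9) (actual time=0.001..0.001 rows=1 loops=452162)
"""

NODE = re.compile(
    r"(?:->\s+)?(?P<name>.+?)\s+\(cost=.*?"
    r"\(actual time=\d+\.\d+\.\.(?P<end>\d+\.\d+) rows=\d+ loops=(?P<loops>\d+)\)"
)


def node_totals(plan):
    """Return (node name, approximate total ms) pairs: end time times loops."""
    totals = []
    for line in plan.splitlines():
        match = NODE.search(line.strip())
        if match:
            total_ms = float(match["end"]) * int(match["loops"])
            totals.append((match["name"], round(total_ms, 3)))
    return totals


totals = node_totals(PLAN)
for name, total_ms in totals:
    print(f"{total_ms:>10.3f} ms  {name}")
```

Node times include their children, so the Nested Loop figure already contains the builds_build index scan beneath it. Read this way, the per-row lookup on builds_version (0.001 ms averaged over 452162 loops) and the 19 scans of builds_build each account for several hundred milliseconds.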

humitos

One thing to note here is that we are using .prefetch_related here, which does the joining on the Python side. That could explain why some of these queries are fast when testing them in the DB, but slow when using Django to access these views:

prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python

(from https://docs.djangoproject.com/en/5.0/ref/models/querysets/#prefetch-related)

Since we are using Max() to get the latest build and the latest successful build for each project, we could probably use select_related here instead, which would do everything in the DB.
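The "joining in Python" behavior quoted from the Django docs can be pictured with plain data structures. This is a simplified illustration, not Django's implementation: the `prefetch_builds` helper and the dictionaries are hypothetical stand-ins for the two separate queries and the in-memory matching step.

```python
from collections import defaultdict

# Stand-ins for the rows returned by two separate queries.
projects = [{"id": 1, "name": "docs"}, {"id": 2, "name": "api"}]
builds = [
    {"id": 11, "project_id": 1},
    {"id": 12, "project_id": 1},
    {"id": 13, "project_id": 2},
]


def prefetch_builds(projects, builds):
    """Attach builds to their projects by grouping in Python, with no SQL JOIN."""
    wanted = {project["id"] for project in projects}
    by_project = defaultdict(list)
    # Stands in for the second query: ... WHERE project_id IN (<wanted ids>)
    for build in builds:
        if build["project_id"] in wanted:
            by_project[build["project_id"]].append(build)
    # The "join": match each project to its builds in memory.
    return [{**project, "builds": by_project.get(project["id"], [])} for project in projects]


result = prefetch_builds(projects, builds)
print([(p["name"], [b["id"] for b in p["builds"]]) for p in result])
# [('docs', [11, 12]), ('api', [13])]
```

The cost of the in-memory step grows with the number of rows the second query returns, which is why narrowing that query to one or two builds per project matters.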

humitos · 2024-09-26T10:28:33Z

readthedocs/projects/querysets.py

-            .values_list("id", flat=True)[:1]
+        # Get most recent and recent successful builds
+        builds_latest = (
+            Build.internal.filter(project__in=self)


I understand this is to avoid getting builds from PRs (external versions), and I'd say that .internal should perform better than .objects since it removes a lot of builds from consideration, and they should be removed using an index on Build.type. That's the theory, tho 😄

humitos · 2024-09-26T10:29:19Z

readthedocs/projects/querysets.py

-            .values_list("id", flat=True)[:1]
+        # Get most recent and recent successful builds
+        builds_latest = (
+            Build.internal.filter(project__in=self)


Instead of __in, can't we just use project__pk=self.pk here, as I did in another PR? That worked pretty well there.

self is a Project.objects queryset, not an individual model instance.

humitos · 2024-09-26T10:42:10Z

readthedocs/projects/querysets.py

+        # Get most recent and recent successful builds
+        builds_latest = (
+            Build.internal.filter(project__in=self)
+            .values("project")


Do we need this line here? I understand the project value is not used.

It might not, but this is for grouping by project. The latest build day per project is what is needed here. If there is a different way to group this, we don't need the second query at all.
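The effect of that line can be seen in the plan earlier in the thread (`Group Key: v0.project_id`): calling `.values("project")` before `.annotate(latest=Max("pk"))` makes the aggregate run once per project rather than once overall. Below is a small sketch of the difference in plain SQL via sqlite3; the `build` table is a hypothetical stand-in, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE build (id INTEGER PRIMARY KEY, project_id INTEGER);
    INSERT INTO build (id, project_id) VALUES (1, 10), (2, 10), (3, 20), (4, 30), (5, 20);
    """
)

# With grouping: one maximum id per project, even though project_id is not selected.
grouped = conn.execute(
    "SELECT MAX(id) FROM build GROUP BY project_id ORDER BY project_id"
).fetchall()

# Without grouping: a single maximum across every build.
ungrouped = conn.execute("SELECT MAX(id) FROM build").fetchall()

print(grouped)    # [(2,), (5,), (4,)]
print(ungrouped)  # [(5,)]
```

The grouping column does not have to appear in the output, which matches the observation that the project value itself is never used.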

humitos · 2024-09-26T10:46:07Z

Since we are using Max() to get the latest build and the latest successful build for each project, we could probably use select_related here instead, which would do everything in the DB.

I quickly tested this and it's slower 😄

agjohnson added 3 commits September 25, 2024 12:19

Don't subquery for builds in project listing prefetch

e494414

Missing filter by projects

c89d84f

Use Build.internal queryset instead

87ca17e

This seems like it's unnecessary, but the prefetch is not accurate using Build.objects first.

agjohnson mentioned this pull request Sep 26, 2024

Try reverting prefetch changes for project/version listing views #11621

Merged
