Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-1577 / Optimize FIND_RUNS_WITH_URI query #1580

Closed
lampajr opened this issue Apr 9, 2024 · 1 comment
Closed

GH-1577 / Optimize FIND_RUNS_WITH_URI query #1580

lampajr opened this issue Apr 9, 2024 · 1 comment
Labels
area/backend priority/hold An issuee that we need to look at - but later type/enhancement An enhancement to an existing feature

Comments

@lampajr
Copy link
Member

lampajr commented Apr 9, 2024

Long term fix for of #1577

Describe the issue

Right now the query

protected static final String FIND_RUNS_WITH_URI = """
SELECT id, testid
FROM run
WHERE NOT trashed
AND (data->>'$schema' = ?1
OR (CASE
WHEN jsonb_typeof(data) = 'object' THEN ?1 IN (SELECT values.value->>'$schema' FROM jsonb_each(data) as values)
WHEN jsonb_typeof(data) = 'array' THEN ?1 IN (SELECT jsonb_array_elements(data)->>'$schema')
ELSE false
END)
OR (metadata IS NOT NULL AND ?1 IN (SELECT jsonb_array_elements(metadata)->>'$schema'))
)
""";
is really slow as it perform a full (sequential) scan over all the runs looking into all jsonbobjects searching for$schema` key.

This is the query plan (obtained in local machine using prod backup):

                                                                                                                                                                       QUERY PLAN                                                                                                                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on run  (cost=0.00..103758.07 rows=7746 width=8) (actual time=37775.492..37775.493 rows=0 loops=1)
   Filter: ((NOT trashed) AND (can_view(access, owner, token) OR has_role('horreum.system'::text)) AND (((data ->> '$schema'::text) = 'urn:achme:0.1'::text) OR CASE WHEN (jsonb_typeof(data) = 'object'::text) THEN (SubPlan 1) WHEN (jsonb_typeof(data) = 'array'::text) THEN (SubPlan 2) ELSE false END OR ((metadata IS NOT NULL) AND (SubPlan 3))))
   Rows Removed by Filter: 28414
   SubPlan 1
     ->  Function Scan on jsonb_each "values"  (cost=0.00..1.25 rows=100 width=32) (actual time=0.653..0.785 rows=6 loops=25827)
   SubPlan 2
     ->  Result  (cost=0.00..2.02 rows=100 width=32) (actual time=7.914..8.241 rows=2 loops=95)
           ->  ProjectSet  (cost=0.00..0.52 rows=100 width=32) (actual time=7.912..8.239 rows=2 loops=95)
                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=95)
   SubPlan 3
     ->  Result  (cost=0.00..2.02 rows=100 width=32) (actual time=0.001..0.001 rows=1 loops=6345)
           ->  ProjectSet  (cost=0.00..0.52 rows=100 width=32) (actual time=0.001..0.001 rows=1 loops=6345)
                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=6345)
 Planning Time: 0.470 ms
 Execution Time: 37775.565 ms
(15 rows)

Worth to note that the worst case happens when we are creating a new Schema that is not referenced anywhere, as in that case it has to look into every place for every run.

Moreover as the number of runs increase, the performance of this query can only get worse.

Describe the solution you'd like

After some discussion we came up to a possible solution which will require a bit of refactoring.
Here a summary of required changes that could speed up the overall process:

  1. The run_schemas table needs to be updated on creation/update of a Run
    • Add records even if the referenced schema does not exist yet, e.g., keep schemaId null but schemaUri set.
    • Ensure that Horreum can manage run_schemas.schemaId as null and no other parts wil break.
  2. Update the query to scan the run_schemas table rather than looking into every run's jsonb object
    • The run_schemas will contain all information that we need
  3. Update the run_schemas records accordingly, e.g., if we are creating a schema that is referenced there, we should update all records with new id.
  4. Make this process sync again, i.e., revert GH-1577 / Make processNewOrUpdatedSchema asynchronous #1579

Additional information

n/a

@lampajr lampajr changed the title Optimize FIND_RUNS_WITH_URI query GH-1577 / Optimize FIND_RUNS_WITH_URI query Apr 9, 2024
@johnaohara johnaohara added type/enhancement An enhancement to an existing feature priority/hold An issuee that we need to look at - but later area/backend labels May 9, 2024
@johnaohara
Copy link
Member

On Hold due to #1603

@lampajr lampajr closed this as not planned Won't fix, can't repro, duplicate, stale Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend priority/hold An issuee that we need to look at - but later type/enhancement An enhancement to an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants