GH-1577 / Optimize FIND_RUNS_WITH_URI query #1580

lampajr · 2024-04-09T07:17:04Z

Long term fix for of #1577

Describe the issue

Right now the query

Horreum/horreum-backend/src/main/java/io/hyperfoil/tools/horreum/svc/RunServiceImpl.java

Lines 91 to 103 in 3849f04

    
              protected static final String FIND_RUNS_WITH_URI = """ 
        
                    SELECT id, testid 
        
                    FROM run 
        
                    WHERE NOT trashed 
        
                       AND (data->>'$schema' = ?1 
        
                       OR (CASE 
        
                          WHEN jsonb_typeof(data) = 'object' THEN ?1 IN (SELECT values.value->>'$schema' FROM jsonb_each(data) as values) 
        
                          WHEN jsonb_typeof(data) = 'array' THEN ?1 IN (SELECT jsonb_array_elements(data)->>'$schema') 
        
                          ELSE false 
        
                          END) 
        
                       OR (metadata IS NOT NULL AND ?1 IN (SELECT jsonb_array_elements(metadata)->>'$schema')) 
        
                    ) 
        
                    """;

is really slow as it perform a full (sequential) scan over all the runs looking into all jsonbobjects searching for$schema` key.

This is the query plan (obtained in local machine using prod backup):

                                                                                                                                                                       QUERY PLAN                                                                                                                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on run  (cost=0.00..103758.07 rows=7746 width=8) (actual time=37775.492..37775.493 rows=0 loops=1)
   Filter: ((NOT trashed) AND (can_view(access, owner, token) OR has_role('horreum.system'::text)) AND (((data ->> '$schema'::text) = 'urn:achme:0.1'::text) OR CASE WHEN (jsonb_typeof(data) = 'object'::text) THEN (SubPlan 1) WHEN (jsonb_typeof(data) = 'array'::text) THEN (SubPlan 2) ELSE false END OR ((metadata IS NOT NULL) AND (SubPlan 3))))
   Rows Removed by Filter: 28414
   SubPlan 1
     ->  Function Scan on jsonb_each "values"  (cost=0.00..1.25 rows=100 width=32) (actual time=0.653..0.785 rows=6 loops=25827)
   SubPlan 2
     ->  Result  (cost=0.00..2.02 rows=100 width=32) (actual time=7.914..8.241 rows=2 loops=95)
           ->  ProjectSet  (cost=0.00..0.52 rows=100 width=32) (actual time=7.912..8.239 rows=2 loops=95)
                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=95)
   SubPlan 3
     ->  Result  (cost=0.00..2.02 rows=100 width=32) (actual time=0.001..0.001 rows=1 loops=6345)
           ->  ProjectSet  (cost=0.00..0.52 rows=100 width=32) (actual time=0.001..0.001 rows=1 loops=6345)
                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=6345)
 Planning Time: 0.470 ms
 Execution Time: 37775.565 ms
(15 rows)

Worth to note that the worst case happens when we are creating a new Schema that is not referenced anywhere, as in that case it has to look into every place for every run.

Moreover as the number of runs increase, the performance of this query can only get worse.

Describe the solution you'd like

After some discussion we came up to a possible solution which will require a bit of refactoring.
Here a summary of required changes that could speed up the overall process:

The run_schemas table needs to be updated on creation/update of a Run
- Add records even if the referenced schema does not exist yet, e.g., keep schemaId null but schemaUri set.
- Ensure that Horreum can manage run_schemas.schemaId as null and no other parts wil break.
Update the query to scan the run_schemas table rather than looking into every run's jsonb object
- The run_schemas will contain all information that we need
Update the run_schemas records accordingly, e.g., if we are creating a schema that is referenced there, we should update all records with new id.
Make this process sync again, i.e., revert GH-1577 / Make processNewOrUpdatedSchema asynchronous #1579

Additional information

n/a

The text was updated successfully, but these errors were encountered:

johnaohara · 2024-05-09T08:23:08Z

On Hold due to #1603

lampajr mentioned this issue Apr 9, 2024

Slow performance adding new Schema. #1577

Closed

lampajr changed the title ~~Optimize FIND_RUNS_WITH_URI query~~ GH-1577 / Optimize FIND_RUNS_WITH_URI query Apr 9, 2024

johnaohara mentioned this issue Apr 12, 2024

increase txn timeout to 120s #1568

Closed

johnaohara added type/enhancement An enhancement to an existing feature priority/hold An issuee that we need to look at - but later area/backend labels May 9, 2024

lampajr closed this as not planned Won't fix, can't repro, duplicate, stale Oct 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-1577 / Optimize FIND_RUNS_WITH_URI query #1580

GH-1577 / Optimize FIND_RUNS_WITH_URI query #1580

lampajr commented Apr 9, 2024

johnaohara commented May 9, 2024

GH-1577 / Optimize FIND_RUNS_WITH_URI query #1580

GH-1577 / Optimize FIND_RUNS_WITH_URI query #1580

Comments

lampajr commented Apr 9, 2024

johnaohara commented May 9, 2024