Suggested facets should only consider first 1000 rows #2406

simonw · 2024-08-21T19:37:37Z

Suggested facets cause a lot of performance problems. On large tables we end up running multiple SQL queries for every column (one for column facets, one for date facets, one for JSON facets), each capped by the 50ms facet_suggest_time_limit_ms time limit. Even with that limit in place these add up: 20 columns could take 20 * 3 * 50 = 3000ms, not including the overhead of the Python code that manages the queries.

Since these are really just suggestions, an optimization could be to only consider the first 1,000 rows in the table. This would be enough to spot likely date / JSON / column facets and should be much faster.
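A minimal sketch of both the worst-case arithmetic and the proposed optimization, using Python's sqlite3 module. The `items` table, its `category` column and the sample data are assumptions made up for illustration; none of this is Datasette code.

```python
import sqlite3

# Worst-case suggestion budget from the estimate above:
# 20 columns * 3 facet types * 50ms time limit each.
columns, facet_types, time_limit_ms = 20, 3, 50
worst_case_ms = columns * facet_types * time_limit_ms

# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("create table items (id integer primary key, category text)")
conn.executemany(
    "insert into items (category) values (?)",
    [("cat-{}".format(i % 5),) for i in range(10000)],
)

# Count facet values over at most the first 1,000 rows of the query,
# instead of grouping across the whole table.
sql = "select * from items"
limited_counts = conn.execute(
    "select category as value, count(*) as n "
    "from (select * from ({}) limit 1000) "
    "where value is not null group by value".format(sql)
).fetchall()
rows_considered = sum(n for _, n in limited_counts)
```

The counts returned this way only describe the sampled rows, which should be fine for deciding whether a column is worth suggesting as a facet.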

simonw · 2024-08-21T19:38:37Z

Here's a trace illustrating the problem (thanks to #2405):

simonw · 2024-08-21T19:39:24Z

I tried a patch to just do this for column string facets and it worked as expected:

diff --git a/datasette/facets.py b/datasette/facets.py
index ccd85461..1e091afd 100644
--- a/datasette/facets.py
+++ b/datasette/facets.py
@@ -170,9 +170,8 @@ class ColumnFacet(Facet):
             if column in already_enabled:
                 continue
             suggested_facet_sql = """
-                select {column} as value, count(*) as n from (
-                    {sql}
-                ) where value is not null
+                with limited as (select * from ({sql}) limit 1000)
+                select {column} as value, count(*) as n from limited where value is not null
                 group by value
                 limit {limit}
             """.format(

But that doesn't cover other facet types, so I'll try turning the SQL that gets fed to that method into the limited SQL first.
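One way to picture "turning the SQL into the limited SQL first" is a small wrapper applied before any facet-specific query is built. This is only a sketch: `limited_sql` is a hypothetical helper name, and the `events` table and its columns are invented for the example.

```python
import sqlite3


def limited_sql(sql, max_rows=1000):
    # Hypothetical helper: wrap any query so that facet SQL built on top
    # of it sees at most max_rows rows.
    return "select * from ({}) limit {}".format(sql, int(max_rows))


# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table events (id integer primary key, kind text, created text)"
)
conn.executemany(
    "insert into events (kind, created) values (?, ?)",
    [
        ("kind-{}".format(i % 3), "2024-08-{:02d}".format(i % 28 + 1))
        for i in range(5000)
    ],
)

base_sql = limited_sql("select * from events")

# A column-style facet query built on the limited SQL.
column_counts = conn.execute(
    "select kind as value, count(*) as n from ({}) "
    "where value is not null group by value".format(base_sql)
).fetchall()

# A date-style check built on the same limited SQL.
date_like = conn.execute(
    "select count(*) from ({}) where created glob '????-??-*'".format(base_sql)
).fetchone()[0]
```

Because every facet type starts from the same wrapped query, none of them can scan more than the first 1,000 rows.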

simonw · 2024-08-21T19:41:34Z

Given the structure of the current .facet() method:

datasette/datasette/facets.py

Lines 160 to 180 in 8a63cdc

    class ColumnFacet(Facet):
        type = "column"

        async def suggest(self):
            row_count = await self.get_row_count()
            columns = await self.get_columns(self.sql, self.params)
            facet_size = self.get_facet_size()
            suggested_facets = []
            already_enabled = [c["config"]["simple"] for c in self.get_configs()]
            for column in columns:
                if column in already_enabled:
                    continue
                suggested_facet_sql = """
                    select {column} as value, count(*) as n from (
                        {sql}
                    ) where value is not null
                    group by value
                    limit {limit}
                """.format(
                    column=escape_sqlite(column), sql=self.sql, limit=facet_size + 1
                )

datasette/datasette/facets.py

Lines 460 to 467 in 8a63cdc

    class DateFacet(Facet):
        type = "date"

        async def suggest(self):
            columns = await self.get_columns(self.sql, self.params)
            already_enabled = [c["config"]["simple"] for c in self.get_configs()]
            suggested_facets = []
            for column in columns:

It's going to be easier to modify each facet definition rather than the calling code to pass in that limit.

simonw · 2024-08-21T20:15:28Z

HUGE improvement:

simonw · 2024-08-21T20:19:10Z

DateFacet doesn't need this, it already only checks the first 100:

datasette/datasette/facets.py

Lines 470 to 477 in 8a63cdc

                # Does this column contain any dates in the first 100 rows?
                suggested_facet_sql = """
                    select date({column}) from (
                        {sql}
                    ) where {column} glob "????-??-*" limit 100;
                """.format(
                    column=escape_sqlite(column), sql=self.sql
                )
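A rough standalone sketch of that date check, run against a hypothetical `logs` table with sqlite3. The `looks_like_dates` helper, the bracket quoting of the column name and the "any non-null date" acceptance rule are assumptions for illustration, not the Datasette implementation. The sketch uses single quotes for the glob pattern, which is the standard SQL string literal form.

```python
import sqlite3

# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table logs (id integer primary key, created text, note text)"
)
conn.executemany(
    "insert into logs (created, note) values (?, ?)",
    [("2024-08-21", "note {}".format(i)) for i in range(500)],
)


def looks_like_dates(conn, sql, column):
    # Same query shape as the snippet above. The limit caps the number of
    # matching rows returned at 100.
    query = (
        "select date([{column}]) from ({sql}) "
        "where [{column}] glob '????-??-*' limit 100"
    ).format(column=column, sql=sql)
    values = [row[0] for row in conn.execute(query).fetchall()]
    # Assumed rule: suggest a date facet if any value parsed as a date.
    return any(v is not None for v in values), len(values)


created_result = looks_like_dates(conn, "select * from logs", "created")
note_result = looks_like_dates(conn, "select * from logs", "note")
```

Here `created` is flagged as date-like after 100 matching rows come back, while `note` produces no matches at all.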

simonw added enhancement faceting performance labels Aug 21, 2024

simonw closed this as completed in f28ff8e Aug 21, 2024

simonw added a commit that referenced this issue Sep 6, 2024

Release 1.0a16

0bc6a2a

Refs #2320, #2342, #2398, #2399, #2400, #2403, #2404, #2405, #2406, #2407, #2408, #2414, #2415, #2420
