Greatly improve database performance for esgpull update #47

Merged
merged 7 commits into from
Jul 17, 2024

svenrdz
Collaborator

@svenrdz svenrdz commented Jul 16, 2024

New

  • Database: add Database.commit_context() for easier bulk transactions
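To illustrate the idea behind a commit context, here is a minimal sketch using only the standard-library `sqlite3` module. It is not esgpull's actual implementation: the `Database` class below is a hypothetical stand-in, and the `file` table and its columns are assumptions made for the example. The point is that every statement issued inside the `with` block shares one transaction and one commit.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


class Database:
    """Hypothetical stand-in for esgpull's Database, built on stdlib sqlite3."""

    def __init__(self, path: str = ":memory:") -> None:
        # isolation_level=None puts sqlite3 in autocommit mode, so that
        # transactions are controlled explicitly with BEGIN/COMMIT below.
        self.conn = sqlite3.connect(path, isolation_level=None)

    @contextmanager
    def commit_context(self) -> Iterator[sqlite3.Connection]:
        """Group every statement in the `with` body into one transaction."""
        self.conn.execute("BEGIN")
        try:
            yield self.conn
        except BaseException:
            # Undo the whole batch if anything fails, then re-raise.
            self.conn.execute("ROLLBACK")
            raise
        else:
            self.conn.execute("COMMIT")


db = Database()
# Assumed table, for illustration only.
db.conn.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, name TEXT)")

# 1000 inserts, a single commit at the end of the block.
with db.commit_context() as conn:
    conn.executemany(
        "INSERT INTO file (name) VALUES (?)",
        [(f"file_{i}.nc",) for i in range(1000)],
    )

count = db.conn.execute("SELECT COUNT(*) FROM file").fetchone()[0]
```

If an exception is raised inside the block, the whole batch is rolled back, so the table never ends up with a partial bulk insert.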

Changed

  • Database: add a few sqlite PRAGMAs for more aggressive performance
  • cli.update: directly insert & delete into query_file table, instead of relying on the ORM
  • cli.update: bulk inserts for each query instead of one commit per file
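As a rough sketch of what performance-oriented sqlite PRAGMAs can look like, the example below applies a few common ones on connection. The specific PRAGMAs and values are assumptions chosen for illustration; this description does not list which ones the PR actually sets, and the `connect` helper is hypothetical.

```python
import os
import sqlite3
import tempfile

# Assumed PRAGMA choices, for illustration only.
PRAGMAS = {
    "journal_mode": "WAL",     # write-ahead log instead of a rollback journal
    "synchronous": "NORMAL",   # fewer fsync calls than the default FULL
    "temp_store": "MEMORY",    # keep temporary tables and indices in memory
    "cache_size": -20000,      # negative value = size in KiB (roughly 20 MB)
}


def connect(path: str) -> sqlite3.Connection:
    """Open a connection and apply the PRAGMAs above (hypothetical helper)."""
    conn = sqlite3.connect(path)
    for name, value in PRAGMAS.items():
        # PRAGMA names and values come from the constant dict above,
        # never from user input, so string formatting is acceptable here.
        conn.execute(f"PRAGMA {name} = {value}")
    return conn


# WAL requires a file-backed database, so use a temporary file.
path = os.path.join(tempfile.mkdtemp(), "sketch.db")
conn = connect(path)

journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
```

Settings such as `synchronous = NORMAL` trade some durability on power loss for speed, which is why they count as "more aggressive".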

Example

I created and filled this database in a few minutes using large queries that previously took hours to run. A good chunk of the time is now spent fetching metadata from index nodes:

$ esgpull show
<853631>
├── distrib:   True 
│   latest:    True 
│   replica:   None 
│   retracted: False
│   table_id:  fx   
└── <1124a9>
    └── distrib:       True 
        latest:        True 
        replica:       None 
        retracted:     False
        experiment_id: dcpp*
<c95ebd>
└── distrib:       True  
    latest:        True  
    replica:       None  
    retracted:     False 
    experiment_id: ssp245
    frequency:     day   
    variant_label: r1i*  
<cc2c09>
├── distrib:       True       
│   latest:        True       
│   replica:       None       
│   retracted:     False      
│   frequency:     day        
│   variable_id:   tas, tasmax
│   variant_label: r1i*       
└── <ef4f6f>
    └── distrib:       True                        
        latest:        True                        
        replica:       None                        
        retracted:     False                       
        experiment_id: ssp245                      
        files:         0 bytes / 338.8 GiB [0/2206]

$ time esgpull update -y
<1124a9> -> 60759 files.
<853631> -> 191146 files.
<c95ebd> -> 126748 files.
<cc2c09> -> 227623 files.
<ef4f6f> -> 6724 files.
613000 files found.
<1124a9> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:32
<853631> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:01:18
<c95ebd> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:48
<cc2c09> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:01:43
<ef4f6f> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
esgpull update -y  419.46s user 11.52s system 93% cpu 7:41.62 total

Running a similar test over a non-empty database (~1.4GB) produces no significant difference:

$ esgpull show cc2c -c
<cc2c09>
├── distrib:       True       
│   latest:        True       
│   replica:       None       
│   retracted:     False      
│   frequency:     day        
│   variable_id:   tas, tasmax
│   variant_label: r1i*       
└── <ef4f6f>
    └── distrib:       True  
        latest:        True  
        replica:       None  
        retracted:     False 
        experiment_id: ssp245

$ time esgpull update cc2c -c -y
<cc2c09> -> 227623 files.
<ef4f6f> -> 6724 files.
234347 files found.
<cc2c09> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:01:49
<ef4f6f> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01
esgpull update cc2c -c -y  170.65s user 4.87s system 87% cpu 3:21.09 total

Before this PR, updating a non-empty database took longer than updating an empty one. For several reasons, adding a new relation between a query and a file that already had relations to other queries produced very inefficient SQL. This is now a single insert in all cases, so it no longer matters whether the database is empty or not.
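The "single insert in all cases" idea can be sketched with plain `sqlite3`. The schema below (a `query_file` association table keyed on `query_sha` and `file_sha`) and the `link_files` / `unlink_files` helpers are assumptions for illustration, not the actual esgpull schema or code. What matters is that linking a file to a query is one row written to the association table, whether or not the file is already linked to other queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: two entity tables and one association table.
conn.executescript(
    """
    CREATE TABLE query (sha TEXT PRIMARY KEY);
    CREATE TABLE file (sha TEXT PRIMARY KEY);
    CREATE TABLE query_file (
        query_sha TEXT NOT NULL REFERENCES query (sha),
        file_sha  TEXT NOT NULL REFERENCES file (sha),
        PRIMARY KEY (query_sha, file_sha)
    );
    """
)
conn.executemany("INSERT INTO query VALUES (?)", [("cc2c09",), ("ef4f6f",)])
conn.executemany("INSERT INTO file VALUES (?)", [("f1",), ("f2",)])
conn.commit()


def link_files(conn: sqlite3.Connection, query_sha: str, file_shas: list[str]) -> None:
    """Bulk insert (query, file) pairs directly into the association table."""
    rows = [(query_sha, file_sha) for file_sha in file_shas]
    with conn:  # one transaction and one commit for the whole batch
        conn.executemany(
            "INSERT INTO query_file (query_sha, file_sha) VALUES (?, ?)", rows
        )


def unlink_files(conn: sqlite3.Connection, query_sha: str, file_shas: list[str]) -> None:
    """Bulk delete (query, file) pairs directly from the association table."""
    rows = [(query_sha, file_sha) for file_sha in file_shas]
    with conn:
        conn.executemany(
            "DELETE FROM query_file WHERE query_sha = ? AND file_sha = ?", rows
        )


link_files(conn, "cc2c09", ["f1", "f2"])
link_files(conn, "ef4f6f", ["f1"])    # f1 is already linked to another query
unlink_files(conn, "cc2c09", ["f2"])

links = conn.execute(
    "SELECT query_sha, file_sha FROM query_file ORDER BY query_sha, file_sha"
).fetchall()
```

Because no related objects are loaded before writing, the cost of adding a link does not depend on how many relations the file already has.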

svenrdz added 2 commits July 16, 2024 10:33
* new(db): Database.commit_context for bulk transactions
* changed(cli.update): direct insert/delete into query_file table
* changed(cli.update): bulk insert/delete instead of per file
@svenrdz svenrdz changed the title from "Dev db perf" to "Greatly improve database performance for esgpull update" Jul 16, 2024
@svenrdz svenrdz added the enhancement New feature or request label Jul 16, 2024
@svenrdz svenrdz merged commit 3a1da98 into main Jul 17, 2024
1 check passed