Database insertion optimization #1742
7 comments · 10 replies
-
Sounds like a bug to me. I tried to reproduce this using a toy project but didn't encounter slowdowns or increasing memory consumption; my system may just have been too simple. Can you identify which item specifically slows down while executing the project day by day? How large is the whole-year dataset? Would you expect it to fit into your system's memory? On another note, is there a specific reason you have the BidOffer Data Connection item between BidOfferController and BidOfferImport? BidOfferImport should be able to import the files generated by bidOfferController directly, without the extra item, as long as you have specified the output files in bidOfferController's Tool specification.
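One way to pinpoint which step is slowing down, assuming the day-by-day loop is driven by a plain Python script, is to time each iteration and track peak memory with the standard-library tracemalloc module. The insert_day function below is a hypothetical stand-in for whatever currently imports one day's file:

```python
import time
import tracemalloc


def insert_day(day: int) -> None:
    """Hypothetical placeholder for the real per-day import step."""
    pass  # replace with the actual insertion logic


tracemalloc.start()
for day in range(1, 61):  # e.g. two months' worth of days
    start = time.perf_counter()
    insert_day(day)
    elapsed = time.perf_counter() - start
    current, peak = tracemalloc.get_traced_memory()
    print(f"day {day:3d}: {elapsed:6.2f} s, "
          f"current {current / 1e6:7.1f} MB, peak {peak / 1e6:7.1f} MB")
```

A per-day time that keeps growing, or a current memory figure that never drops back down, points to state accumulating across iterations.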
-
Hi Antti,
Thanks for the information. I'll try it and let you know how it goes.
Best regards,
Rui Carvalho
…________________________________
From: Antti Soininen ***@***.***>
Sent: 26 August 2022 13:06
To: Spine-project/Spine-Toolbox ***@***.***>
Cc: Rui Gonçalves De Carvalho ***@***.***>; Author ***@***.***>
Subject: Re: [Spine-project/Spine-Toolbox] Database insertion optimization (Discussion #1742)
I managed to decrease the memory footprint of import operations somewhat, as well as speed them up a bit. If you installed Toolbox according to the "Installation from sources using Git" instructions, python -mpip install -U -r requirements.txt in the Toolbox directory should get you the updated spinedb_api module. Otherwise I need to make a new release to PyPI. Please try it out and let me know how it works.
If Toolbox still consumes too much memory or is too slow, perhaps you could zip your project and send it to me, or, if that is too much to ask, maybe you could provide the file(s) you're trying to import as well as the Importer specification? It would help me find the remaining bottlenecks.
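To confirm which spinedb_api version actually ends up installed after the upgrade, a minimal check from the same Python environment, using only the standard library:

```python
# Print the installed version of the spinedb_api distribution.
from importlib.metadata import version

print(version("spinedb_api"))
```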
-
Rui Gonçalves De Carvalho shared a OneDrive for Business file with you. To view it, click the link below.
populateSpineDB.rar<https://myisepipp-my.sharepoint.com/personal/rugco_isep_ipp_pt/Documents/Attachments/populateSpineDB.rar>
Hi Antti,
I tried your suggested solution and noticed some improvements, but it is still slowing down. I created a project with two of the twelve months, since that is enough to reproduce the slowdowns.
Thanks for your help.
Best regards,
Rui Carvalho
-
Hi Antti, [02-09-2022 14:54:52] Processing table Decommissioned
-
@rui-gcarvalho: I'm currently looking into the example project you kindly provided, see #1761. Indeed, importing is both slow and leads to out-of-memory errors. There is something easy I can do about the memory usage, but we'll see if it is enough. If it is OK with you, we can also try changing the data structure a bit - I don't think having a separate object for e.g. each bid id is the best way to populate the database. I need to do some tests in that regard, though.
-
Hi Antti,
Your suggestions worked out really well!
Thank you for your support.
Best regards,
Rui Carvalho
…________________________________
From: Antti Soininen ***@***.***>
Sent: 13 September 2022 08:17
To: Spine-project/Spine-Toolbox ***@***.***>
Cc: Rui Gonçalves De Carvalho ***@***.***>; Mention ***@***.***>
Subject: Re: [Spine-project/Spine-Toolbox] Database insertion optimization (Discussion #1742)
So, having hundreds of thousands of objects in a Spine database just doesn't scale well memory- and performance-wise. Additionally, it makes the Toolbox Database editor completely useless. I propose we decrease the number of objects by "packing" data such as BidOffers into arrays. In this scheme, each object would represent a single day and the arrays would contain the data for that particular day. For example, the bidOffer_Day_XX.csv files could look like this:
date,alternative,price,energy
2019-01-12,Base,180.30,5.8
2019-01-12,Base,180.30,0.5
2019-01-12,Base,180.30,2.7
[...]
Note that the first column always has the same data - this is what we'll map to the object name.
Import mappings for the data should look something like this:
[image]<https://user-images.githubusercontent.com/19147159/189834318-9d70b0e8-537c-49e6-ad8d-f36de3d294ac.png>
Note the "Array" choice in Value.
I've removed bidID from the data, as you can reconstruct it by combining the object name (which is the date) with the array index. Optionally, you could store the bidID in another array if needed, or replace "Array" with "Map", which allows you to index the price and energy values by bidID.
In my tests the above import mapping doesn't need nearly as much memory and is generally faster than the single-object-per-bidID approach.
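As a rough illustration of the packing idea (not how the Importer does it internally), the sketch below groups the rows of a bidOffer_Day_XX.csv-style file into per-day price and energy arrays in plain Python; the file name and column names follow the example above, and the numbers are assumed to be non-localized:

```python
import csv
from collections import defaultdict

# Pack price and energy values into one array per day, mirroring the
# "one object per day, values in arrays" scheme described above.
prices = defaultdict(list)
energies = defaultdict(list)

with open("bidOffer_Day_12.csv", newline="") as csv_file:
    for row in csv.DictReader(csv_file):
        date = row["date"]                       # becomes the object name
        prices[date].append(float(row["price"]))
        energies[date].append(float(row["energy"]))

for date, day_prices in prices.items():
    # The array index now plays the role of the old bidID.
    print(date, len(day_prices), "bids packed into arrays")
```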
A few additional recommendations:
1. If you are using a MySQL database over the network, you may want to consider importing into a local SQLite database file first, since that should be faster. Once you're happy with the imported data, you can copy it to the MySQL database using Merger.
2. The numbers in your CSV files are actually written as localized strings with double quotes around them, i.e. with dot (.) as the thousands separator and comma (,) as the decimal point. Is this intentional? If you want to see real numbers in the database instead of number-like strings, you should write non-localized numbers with dot (.) as the decimal point to the CSV files; a conversion sketch follows this list. Note that you also need to set the column type to float in Importer, as explained in my post above.
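If it is easier to convert the existing localized strings than to change how the CSV files are written, a minimal sketch of the conversion described in point 2 (dot as thousands separator, comma as decimal point) could be:

```python
def delocalize(number_string: str) -> float:
    """Turn a localized string like "1.234,56" into a float (1234.56)."""
    return float(number_string.replace(".", "").replace(",", "."))


print(delocalize("180,30"))    # 180.3
print(delocalize("1.234,56"))  # 1234.56
```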
-
Hi everyone,
I'm populating a database with a whole year of information about an energy market. For that, I created a loop that inserts the data for each day.
![2022-08-16 08_34_57-rugco gecad isep ipp pt - Ligação ao Ambiente de Trabalho Remoto](https://user-images.githubusercontent.com/68560340/184823944-64a95f26-dd90-428f-b5ee-3f7ace54cef1.png)
Although each file is approximately the same size, every new file takes longer than the previous one to insert. I tested with two months of data and it took a day to insert all of it, which leads me to believe that the whole year will take a lot longer and could cause a memory error due to the RAM usage.
My first approach was to insert the whole year in one file, but I ended up with a memory error due to not having enough RAM (my workstation has 16 GB of RAM).
Is there a way to optimize this?
Thanks in advance.
Best regards,
Rui Carvalho
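For reference, a whole-year file can be split into per-day files in a single streaming pass, which keeps memory use flat no matter how large the year is. The file name and the date column below are assumptions based on the examples elsewhere in the thread, and the rows are assumed to be ordered by date:

```python
import csv

# Stream the whole-year CSV once, writing each day's rows to its own file.
# "bidOffer_Year.csv" and the "date" column are assumed names.
current_date = None
day_file = None
writer = None

with open("bidOffer_Year.csv", newline="") as year_file:
    reader = csv.DictReader(year_file)
    for row in reader:
        if row["date"] != current_date:
            if day_file is not None:
                day_file.close()
            current_date = row["date"]
            day_file = open(f"bidOffer_Day_{current_date}.csv", "w", newline="")
            writer = csv.DictWriter(day_file, fieldnames=reader.fieldnames)
            writer.writeheader()
        writer.writerow(row)

if day_file is not None:
    day_file.close()
```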