Database insertion optimization #1742
7 comments · 10 replies
-
Sounds like a bug to me. I tried to reproduce this using a toy project but didn't encounter slowdowns or increasing memory consumption; my system may just have been too simple. Can you identify which item specifically slows down while executing the project day by day? How large is the whole-year dataset? Would you expect it to fit into your system's memory? On another note, is there a specific reason you have the BidOffer Data Connection item between BidOfferController and BidOfferImport? BidOfferImport should be able to import the files generated by bidOfferController directly, without the extra item, as long as you have specified the output files in bidOfferController's Tool specification.
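One way to pinpoint which step is slowing down, assuming the day-by-day loop is driven by a plain Python script, is to time each iteration and track peak memory with the standard-library tracemalloc module. The insert_day function below is a hypothetical stand-in for whatever currently imports one day's file:

```python
import time
import tracemalloc


def insert_day(day: int) -> None:
    """Hypothetical placeholder for the real per-day import step."""
    pass  # replace with the actual insertion logic


tracemalloc.start()
for day in range(1, 61):  # e.g. two months' worth of days
    start = time.perf_counter()
    insert_day(day)
    elapsed = time.perf_counter() - start
    current, peak = tracemalloc.get_traced_memory()
    print(f"day {day:3d}: {elapsed:6.2f} s, "
          f"current {current / 1e6:7.1f} MB, peak {peak / 1e6:7.1f} MB")
```

A per-day time that keeps growing, or a current memory figure that never drops back down, points to state accumulating across iterations.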
-
Hi Antti,
Thanks for the information. I'll try it and let you know how it goes.
Best regards,
Rui Carvalho
…________________________________
From: Antti Soininen ***@***.***>
Sent: 26 August 2022 13:06
To: Spine-project/Spine-Toolbox ***@***.***>
Cc: Rui Gonçalves De Carvalho ***@***.***>; Author ***@***.***>
Subject: Re: [Spine-project/Spine-Toolbox] Database insertion optimization (Discussion #1742)
I managed to decrease the memory footprint of import operations somewhat, as well as speed them up a bit. If you installed Toolbox according to the "Installation from sources using Git" instructions, python -mpip install -U -r requirements.txt in the Toolbox directory should get you the updated spinedb_api module. Otherwise I need to make a new release to PyPI. Please try it out and let me know how it works.
If Toolbox still consumes too much memory or is too slow, perhaps you could zip your project and send it to me, or, if that is too much to ask, maybe you could provide the file(s) you're trying to import as well as the Importer specification? It would help me find the remaining bottlenecks.
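To confirm which spinedb_api version actually ends up installed after the upgrade, a minimal check from the same Python environment, using only the standard library:

```python
# Print the installed version of the spinedb_api distribution.
from importlib.metadata import version

print(version("spinedb_api"))
```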
-
Rui Gonçalves De Carvalho shared a OneDrive for Business file with you. To view it, click the link below.
populateSpineDB.rar<https://myisepipp-my.sharepoint.com/personal/rugco_isep_ipp_pt/Documents/Attachments/populateSpineDB.rar>
Hi Antti,
I tried your suggested solution and noticed some improvements, but it is still slowing down. I created a project with two of the twelve months, since that is enough to reproduce the slowdowns.
Thanks for your help.
Best regards,
Rui Carvalho
-
Hi Antti, [02-09-2022 14:54:52] Processing table Decommissioned
-
@rui-gcarvalho: I'm currently looking into the example project you kindly provided, see #1761. Indeed, importing is both slow and leads to out-of-memory errors. There is something easy I can do about the memory usage, but we'll see if it is enough. If it is OK with you, we can also try changing the data structure a bit - I don't think having a separate object for e.g. each bid id is the best way to populate the database. I need to do some tests in that regard, though.
-
Hi Antti,
Your suggestions worked out really well!
Thank you for your support.
Best regards,
Rui Carvalho
…________________________________
From: Antti Soininen ***@***.***>
Sent: 13 September 2022 08:17
To: Spine-project/Spine-Toolbox ***@***.***>
Cc: Rui Gonçalves De Carvalho ***@***.***>; Mention ***@***.***>
Subject: Re: [Spine-project/Spine-Toolbox] Database insertion optimization (Discussion #1742)
So, having hundreds of thousands of objects in a Spine database just doesn't scale well memory- and performance-wise. Additionally, it makes the Toolbox Database editor completely useless. I propose we decrease the number of objects by "packing" data such as BidOffers into arrays. In this scheme, each object would represent a single day and the arrays would contain the data for that particular day. For example, the bidOffer_Day_XX.csv files could look like this:
date,alternative,price,energy
2019-01-12,Base,180.30,5.8
2019-01-12,Base,180.30,0.5
2019-01-12,Base,180.30,2.7
[...]
Note that the first column always has the same data - this is what we'll map to the object name.
Import mappings for the data should look something like this:
[image]<https://user-images.githubusercontent.com/19147159/189834318-9d70b0e8-537c-49e6-ad8d-f36de3d294ac.png>
Note the "Array" choice in Value.
I've removed bidID from the data, as you can reconstruct it by combining the object name (which is the date) with the array index. Optionally, you could store the bidID in another array if needed, or replace "Array" with "Map", which allows you to index the price and energy values by bidID.
In my tests the above import mapping doesn't need nearly as much memory and is generally faster than the single-object-per-bidID approach.
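As a rough illustration of the packing idea (not how the Importer does it internally), the sketch below groups the rows of a bidOffer_Day_XX.csv-style file into per-day price and energy arrays in plain Python; the file name and column names follow the example above, and the numbers are assumed to be non-localized:

```python
import csv
from collections import defaultdict

# Pack price and energy values into one array per day, mirroring the
# "one object per day, values in arrays" scheme described above.
prices = defaultdict(list)
energies = defaultdict(list)

with open("bidOffer_Day_12.csv", newline="") as csv_file:
    for row in csv.DictReader(csv_file):
        date = row["date"]                       # becomes the object name
        prices[date].append(float(row["price"]))
        energies[date].append(float(row["energy"]))

for date, day_prices in prices.items():
    # The array index now plays the role of the old bidID.
    print(date, len(day_prices), "bids packed into arrays")
```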
A few additional recommendations:
1. If you are using a MySQL database over the network, you may want to consider importing into a local SQLite database file first, since that should be faster. Once you're happy with the imported data, you can copy it to the MySQL database using Merger.
2. The numbers in your CSV files are actually written as localized strings with double quotes around them, i.e. with dot (.) as the thousands separator and comma (,) as the decimal point. Is this intentional? If you want to see real numbers in the database instead of number-like strings, you should write non-localized numbers with dot (.) as the decimal point to the CSV files; a conversion sketch follows this list. Note that you also need to set the column type to float in Importer, as explained in my post above.
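If it is easier to convert the existing localized strings than to change how the CSV files are written, a minimal sketch of the conversion described in point 2 (dot as thousands separator, comma as decimal point) could be:

```python
def delocalize(number_string: str) -> float:
    """Turn a localized string like "1.234,56" into a float (1234.56)."""
    return float(number_string.replace(".", "").replace(",", "."))


print(delocalize("180,30"))    # 180.3
print(delocalize("1.234,56"))  # 1234.56
```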
-
Hi everyone,
I'm populating a database with a whole year of information about an energy market. For that, I created a loop that inserts the data for each day.
![2022-08-16 08_34_57-rugco gecad isep ipp pt - Ligação ao Ambiente de Trabalho Remoto](https://user-images.githubusercontent.com/68560340/184823944-64a95f26-dd90-428f-b5ee-3f7ace54cef1.png)
Although each file is approximately the same size, every new file takes longer than the previous one to insert. I tested with two months of data and it took a day to insert all of it, which leads me to believe that the whole year will take a lot longer and could cause a memory error due to the RAM usage.
My first approach was to insert the whole year in one file, but I ended up with a memory error due to not having enough RAM (my workstation has 16 GB of RAM).
Is there a way to optimize this?
Thanks in advance.
Best regards,
Rui Carvalho
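For reference, a whole-year file can be split into per-day files in a single streaming pass, which keeps memory use flat no matter how large the year is. The file name and the date column below are assumptions based on the examples elsewhere in the thread, and the rows are assumed to be ordered by date:

```python
import csv

# Stream the whole-year CSV once, writing each day's rows to its own file.
# "bidOffer_Year.csv" and the "date" column are assumed names.
current_date = None
day_file = None
writer = None

with open("bidOffer_Year.csv", newline="") as year_file:
    reader = csv.DictReader(year_file)
    for row in reader:
        if row["date"] != current_date:
            if day_file is not None:
                day_file.close()
            current_date = row["date"]
            day_file = open(f"bidOffer_Day_{current_date}.csv", "w", newline="")
            writer = csv.DictWriter(day_file, fieldnames=reader.fieldnames)
            writer.writeheader()
        writer.writerow(row)

if day_file is not None:
    day_file.close()
```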