Performance of large files is really bad #382
Does excelize read the entire Excel file into memory, or is it stream based?
I use this library for fairly large Excel files as well. We're creating a document with images, styles, hyperlinks, merged cells, and about 50 sheets, some of which have about 50,000 rows. Most of our time is spent loading data through an API rather than in the excelize library itself. The 50,000-row sheet takes us about 5 seconds. Have you tried profiling this code?
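For anyone following along, a minimal way to capture such a profile is a sketch like the one below. The writeWorkbook function is just a stand-in for whatever code actually builds the file, and the import path assumes a current excelize release.

```go
package main

import (
	"os"
	"runtime/pprof"

	"github.com/xuri/excelize/v2"
)

// writeWorkbook is a placeholder for the real generation code being measured.
func writeWorkbook() {
	f := excelize.NewFile()
	for r := 1; r <= 50000; r++ {
		cell, _ := excelize.CoordinatesToCellName(1, r)
		_ = f.SetCellValue("Sheet1", cell, r)
	}
	_ = f.SaveAs("out.xlsx")
}

func main() {
	prof, err := os.Create("cpu.prof")
	if err != nil {
		panic(err)
	}
	defer prof.Close()

	// Capture a CPU profile around the workbook generation, then inspect it
	// with: go tool pprof cpu.prof
	if err := pprof.StartCPUProfile(prof); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	writeWorkbook()
}
```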
Your example didn't work out of the box, so I created a trimmed-down test: https://gist.github.com/mlh758/0a8daf41d63af7198beb3ed25ccb0ac2. I did notice an issue with scaling: 10,000 records took about 500 ms and 50,000 records took 11 seconds. It's not nearly as severe as what you are experiencing, though. I tried this against master and saw a tiny improvement, but only about 300 ms.
I refactored my gist so I could run it with a benchmark. At 50,000 items the Go runtime starts to show up in the profile, but most of the work is in prepareSheetXML.
The memory allocation profile looks similar.
It seems to be forced to spend a great deal of time allocating memory to grow the number of rows and the column count in those rows.
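A rough, self-contained version of that benchmark looks like the sketch below; the row/column counts and the sheet name are illustrative, and the import path assumes a current excelize release.

```go
package excelize_test

import (
	"testing"

	"github.com/xuri/excelize/v2"
)

// benchmarkSetCellValue fills an n-row, 30-column sheet so we can watch how
// the cost per row changes as n grows.
func benchmarkSetCellValue(b *testing.B, n int) {
	for i := 0; i < b.N; i++ {
		f := excelize.NewFile()
		for r := 1; r <= n; r++ {
			for c := 1; c <= 30; c++ {
				cell, _ := excelize.CoordinatesToCellName(c, r)
				if err := f.SetCellValue("Sheet1", cell, r*c); err != nil {
					b.Fatal(err)
				}
			}
		}
	}
}

func BenchmarkSetCellValue10k(b *testing.B) { benchmarkSetCellValue(b, 10000) }
func BenchmarkSetCellValue50k(b *testing.B) { benchmarkSetCellValue(b, 50000) }
```

Running it with `go test -bench=. -cpuprofile=cpu.out` also produces the kind of CPU profile discussed above.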
Note that I'm running closer to 500,000 rows, not 50,000 (also around 30 columns).
@jjmartin I see that, but you're getting significantly worse performance than I am even from the start. Edit: this comparison is meaningless; just having more columns in your CSV would change everything. Sorry about that.
We also may not actually need to. If you create an Excel document and set something in a cell partway down the sheet, it doesn't need every earlier row and column to be backfilled. I changed prepareSheetXML to this:

```go
func prepareSheetXML(xlsx *xlsxWorksheet, col int, row int) {
	rowCount := len(xlsx.SheetData.Row)
	if rowCount < row {
		// Grow the row slice with extra capacity instead of appending one row at a time.
		if cap(xlsx.SheetData.Row) < row {
			newRow := make([]xlsxRow, row-1, row*2)
			copy(newRow, xlsx.SheetData.Row)
			xlsx.SheetData.Row = newRow
		}
		xlsx.SheetData.Row = append(xlsx.SheetData.Row, xlsxRow{R: row})
	}
	rowData := &xlsx.SheetData.Row[row-1]
	cellCount := len(rowData.C)
	if cellCount < col {
		// Only fill cells up to the requested column in the requested row.
		for colIdx := cellCount; colIdx < col; colIdx++ {
			cellName, _ := CoordinatesToCellName(colIdx+1, row)
			rowData.C = append(rowData.C, xlsxC{R: cellName})
		}
	}
}
```

With that change, 50,000 rows dropped to 1 second and it still creates a valid xlsx document. 500,000 rows takes 10 seconds, so we're scaling linearly now instead of exponentially. Out-of-order insertion works with this as well, so you can assign cells in any order.
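To illustrate the out-of-order point, something like the following (the file name and cell addresses are arbitrary) still produces a valid workbook:

```go
package main

import (
	"log"

	"github.com/xuri/excelize/v2"
)

func main() {
	f := excelize.NewFile()

	// Write a far-off cell first, then an earlier one; the rows and columns
	// in between no longer need to be backfilled eagerly.
	if err := f.SetCellValue("Sheet1", "C7", "written first"); err != nil {
		log.Fatal(err)
	}
	if err := f.SetCellValue("Sheet1", "A1", "written second"); err != nil {
		log.Fatal(err)
	}

	if err := f.SaveAs("out-of-order.xlsx"); err != nil {
		log.Fatal(err)
	}
}
```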
I opened a PR to fix this and included some benchmarks to help prevent a performance regression. I ended up reverting part of my change: my custom copy logic to fill in rows was actually worse than just doing the necessary appends, and it was breaking other code. The PR makes the performance of writing cells scale linearly instead of exponentially.
Also, since my PR is going into master: if you're on 1.4.1, there have been a lot of changes to the excelize interface on master. A lot of functions now return errors in addition to values, and some things have shifted from 0-based to 1-based indexing to be consistent with Excel. If, like me, you don't have the option of refactoring your whole code base to adhere to that, this gist might help. I moved the functions I was using behind an interface and wrapped calls to them to adjust the indexes back or swallow the errors. Most of those new errors aren't pertinent if you're generating the Excel file yourself rather than allowing a user to provide input.
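A minimal sketch of that wrapping idea follows; the interface name, method set, and signatures here are illustrative, not the contents of the gist.

```go
package sheets

import "github.com/xuri/excelize/v2"

// Writer is the subset of excelize functionality the rest of the code base uses.
type Writer interface {
	SetCellValue(sheet, cell string, value interface{})
	SetRowHeight(sheet string, zeroBasedRow int, height float64)
}

type excelWriter struct {
	f *excelize.File
}

func NewWriter(f *excelize.File) Writer {
	return excelWriter{f: f}
}

// SetCellValue swallows the error, which is tolerable here because we generate
// the workbook ourselves rather than writing user-provided input.
func (w excelWriter) SetCellValue(sheet, cell string, value interface{}) {
	_ = w.f.SetCellValue(sheet, cell, value)
}

// SetRowHeight converts the caller's legacy 0-based row index to the 1-based
// rows excelize now expects.
func (w excelWriter) SetRowHeight(sheet string, zeroBasedRow int, height float64) {
	_ = w.f.SetRowHeight(sheet, zeroBasedRow+1, height)
}
```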
I'll wait for the release. I did already make my own interface for testing, but your gist is more exhaustive. It might be nice to have that in a separate repo that I could just include (similar to the AWS API interfaces).
Hm, I could, but my interface stub isn't complete and some of it is intended to be temporary. If I end up making a complete mock I'll put it in a repo and post it back here; it wouldn't be a bad idea to have.
Hi @nurzhannogerbek, please read and write large files with the stream API; see NewStreamWriter and Rows.
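For reference, a minimal sketch of those two APIs is below; the sheet name, row count, and file names are illustrative, and the import path assumes a current excelize release.

```go
package main

import (
	"fmt"
	"log"

	"github.com/xuri/excelize/v2"
)

func main() {
	// Stream-write rows instead of calling SetCellValue cell by cell.
	f := excelize.NewFile()
	sw, err := f.NewStreamWriter("Sheet1")
	if err != nil {
		log.Fatal(err)
	}
	for r := 1; r <= 100000; r++ {
		cell, _ := excelize.CoordinatesToCellName(1, r)
		if err := sw.SetRow(cell, []interface{}{r, "some", "values"}); err != nil {
			log.Fatal(err)
		}
	}
	if err := sw.Flush(); err != nil {
		log.Fatal(err)
	}
	if err := f.SaveAs("large.xlsx"); err != nil {
		log.Fatal(err)
	}

	// Stream-read rows back instead of loading the whole sheet into memory.
	in, err := excelize.OpenFile("large.xlsx")
	if err != nil {
		log.Fatal(err)
	}
	rows, err := in.Rows("Sheet1")
	if err != nil {
		log.Fatal(err)
	}
	for rows.Next() {
		cols, err := rows.Columns()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(cols)
	}
}
```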
@xuri thank you for your recommendations, but I am confused about the stream API. Right now my code looks like this:
I update the
Hi @NogerbekNurzhan, please use the master branch code if you can; stream writing will be in the next release, version 2.1.0.
…#383)
* Rewrite prepareSheetXML to scale linearly. We don't need to backfill columns into every row for most purposes. Provided makeContiguousColumns for setting styles where we do need it for a specific region. Added a benchmark to monitor progress. For 50,000 rows this went from about 11 seconds to 1 second. The improvements are more dramatic as the row/column count increases.
* Assigning that row value was redundant.
I see your performance metrics in the docs, but the files I'm dealing with can often be up to half a million rows.
I'm importing from a CSV, and as I repeatedly call SetCellValue while processing each line, the performance of adding those lines gets worse and worse.
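For context, the import loop is roughly the pattern sketched below; the file names, sheet name, and per-field loop are illustrative rather than my exact code.

```go
package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"

	"github.com/xuri/excelize/v2"
)

func main() {
	in, err := os.Open("input.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	f := excelize.NewFile()
	r := csv.NewReader(in)

	row := 1
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// One SetCellValue call per field; this is the part that slows down
		// as the sheet grows.
		for col, field := range record {
			cell, _ := excelize.CoordinatesToCellName(col+1, row)
			if err := f.SetCellValue("Sheet1", cell, field); err != nil {
				log.Fatal(err)
			}
		}
		row++
	}

	if err := f.SaveAs("output.xlsx"); err != nil {
		log.Fatal(err)
	}
}
```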
Describe the results you received:
In the log output below, each elapsed time is the difference from the previous log line. You can see that after about 10,000 rows it starts to get really bad in terms of how long it takes to process each subsequent batch of 10,000 rows.
Describe the results you expected:
If there were a direct way to import a CSV, or some method to speed this sort of import up, it would be really useful.
Excelize version or commit ID:
Environment details (OS, Microsoft Excel™ version, physical, etc.):
The above log was captured from an AWS Fargate Docker task running with 4096 CPU units and 30720 MiB of memory.