
Performance of large files is really bad #382

Closed
jjmartin opened this issue Apr 10, 2019 · 15 comments
Comments

@jjmartin

I see your performance metrics in the docs, but the files I'm dealing with can often be up to half a million rows.

I'm importing from a CSV, and as I repeatedly call SetCellValue while processing each line, the performance of adding those lines gets worse and worse.

	// xlsx is an *excelize.File created earlier; csvBody is an io.Reader over the CSV file.
	reader := csv.NewReader(csvBody)
	const headerRow = 1
	row := headerRow
	var headers []string
	rowstart := time.Now()
	for {
		line, err := reader.Read()
		if err == io.EOF {
			break
		} else if err != nil {
			log.Printf("Error Reading CSV: %+v ", err)
			return err
		}
		if row == headerRow {
			headers = line
		}
		for cellIndex, cellValue := range line {
			// Write the quantity and price columns as numbers; everything else as strings.
			if row > headerRow && stringInSlice(headers[cellIndex], []string{"quantity", "price"}) {
				cellParsedValue, err := strconv.ParseFloat(cellValue, 64)
				if err != nil {
					log.Printf("%s\ncell %d in line %d had invalid value %s", err, cellIndex, row, cellValue)
					return err
				}
				xlsx.SetCellValue("details", fmt.Sprintf("%s%d", excelize.ToAlphaString(cellIndex), row), cellParsedValue)
			} else {
				xlsx.SetCellValue("details", fmt.Sprintf("%s%d", excelize.ToAlphaString(cellIndex), row), cellValue)
			}

		}
		pow10row := findPow10(row)
		if row <= pow10row*10 && (row%pow10row == 0 || row%10000 == 0) {
			elapsed := time.Since(rowstart)
			log.Printf("Row %d, time elapsed %s", row, elapsed)
			rowstart = time.Now()
		}
		row++
	}
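(stringInSlice and findPow10 above aren't excelize functions, just small helpers of my own. Reconstructed roughly here for completeness, since they aren't in the snippet:)

func stringInSlice(needle string, haystack []string) bool {
	for _, s := range haystack {
		if s == needle {
			return true
		}
	}
	return false
}

// findPow10 returns the largest power of ten that is <= row, so the loop
// above logs rows 1-10, then every 10th row up to 100, every 100th up to
// 1,000, and so on, plus every 10,000th row after that.
func findPow10(row int) int {
	p := 1
	for p*10 <= row {
		p *= 10
	}
	return p
}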

Describe the results you received:
In the log output below, each "time elapsed" value is the difference from the previous log line. You can see that after about 10,000 rows, it starts to get really bad in terms of how long each successive 10,000 rows takes to process.

2019/04/08 21:38:37 Row 1, time elapsed 1.505084ms
2019/04/08 21:38:37 Row 2, time elapsed 96.097µs
2019/04/08 21:38:37 Row 3, time elapsed 101.482µs
2019/04/08 21:38:37 Row 4, time elapsed 94.35µs
2019/04/08 21:38:37 Row 5, time elapsed 94.585µs
2019/04/08 21:38:37 Row 6, time elapsed 96.621µs
2019/04/08 21:38:37 Row 7, time elapsed 97.552µs
2019/04/08 21:38:37 Row 8, time elapsed 99.631µs
2019/04/08 21:38:37 Row 9, time elapsed 108.849µs
2019/04/08 21:38:37 Row 10, time elapsed 74.078µs
2019/04/08 21:38:37 Row 20, time elapsed 545.239µs
2019/04/08 21:38:37 Row 30, time elapsed 513.501µs
2019/04/08 21:38:37 Row 40, time elapsed 532.816µs
2019/04/08 21:38:37 Row 50, time elapsed 564.326µs
2019/04/08 21:38:37 Row 60, time elapsed 669.845µs
2019/04/08 21:38:37 Row 70, time elapsed 1.508732ms
2019/04/08 21:38:37 Row 80, time elapsed 666.172µs
2019/04/08 21:38:37 Row 90, time elapsed 594.624µs
2019/04/08 21:38:37 Row 100, time elapsed 630.948µs
2019/04/08 21:38:37 Row 200, time elapsed 7.519094ms
2019/04/08 21:38:37 Row 300, time elapsed 6.852758ms
2019/04/08 21:38:37 Row 400, time elapsed 8.674476ms
2019/04/08 21:38:37 Row 500, time elapsed 8.159781ms
2019/04/08 21:38:37 Row 600, time elapsed 9.568621ms
2019/04/08 21:38:37 Row 700, time elapsed 8.916284ms
2019/04/08 21:38:37 Row 800, time elapsed 10.846477ms
2019/04/08 21:38:37 Row 900, time elapsed 9.282789ms
2019/04/08 21:38:37 Row 1000, time elapsed 12.92103ms
2019/04/08 21:38:37 Row 2000, time elapsed 128.488664ms
2019/04/08 21:38:37 Row 3000, time elapsed 189.107883ms
2019/04/08 21:38:37 Row 4000, time elapsed 278.586948ms
2019/04/08 21:38:38 Row 5000, time elapsed 391.341065ms
2019/04/08 21:38:38 Row 6000, time elapsed 471.830863ms
2019/04/08 21:38:39 Row 7000, time elapsed 530.416468ms
2019/04/08 21:38:39 Row 8000, time elapsed 602.603427ms
2019/04/08 21:38:40 Row 9000, time elapsed 652.277227ms
2019/04/08 21:38:41 Row 10000, time elapsed 729.849772ms
2019/04/08 21:38:52 Row 20000, time elapsed 10.977776474s
2019/04/08 21:39:10 Row 30000, time elapsed 18.55464695s
2019/04/08 21:39:37 Row 40000, time elapsed 26.336721766s
2019/04/08 21:40:10 Row 50000, time elapsed 33.489274657s
2019/04/08 21:40:51 Row 60000, time elapsed 40.729296603s
2019/04/08 21:41:39 Row 70000, time elapsed 47.928431496s
2019/04/08 21:42:34 Row 80000, time elapsed 55.148409674s
2019/04/08 21:43:36 Row 90000, time elapsed 1m2.4532031s
2019/04/08 21:44:46 Row 100000, time elapsed 1m9.608536367s
2019/04/08 21:46:03 Row 110000, time elapsed 1m16.990387462s
2019/04/08 21:47:27 Row 120000, time elapsed 1m24.146257207s
2019/04/08 21:48:59 Row 130000, time elapsed 1m31.308584865s
2019/04/08 21:50:37 Row 140000, time elapsed 1m38.654884213s
2019/04/08 21:52:23 Row 150000, time elapsed 1m46.000199696s
2019/04/08 21:54:16 Row 160000, time elapsed 1m53.238934707s
2019/04/08 21:56:17 Row 170000, time elapsed 2m0.485714266s
2019/04/08 21:58:25 Row 180000, time elapsed 2m7.897305904s
2019/04/08 22:00:40 Row 190000, time elapsed 2m15.234462928s
2019/04/08 22:03:03 Row 200000, time elapsed 2m23.134322152s
2019/04/08 22:05:34 Row 210000, time elapsed 2m30.40930936s
2019/04/08 22:08:11 Row 220000, time elapsed 2m37.873410076s
2019/04/08 22:10:59 Row 230000, time elapsed 2m47.92659603s
2019/04/08 22:13:58 Row 240000, time elapsed 2m58.625053178s
2019/04/08 22:17:08 Row 250000, time elapsed 3m10.348595584s
2019/04/08 22:20:29 Row 260000, time elapsed 3m20.726383957s
2019/04/08 22:24:08 Row 270000, time elapsed 3m38.840478421s
2019/04/08 22:28:12 Row 280000, time elapsed 4m4.294031488s
2019/04/08 22:32:43 Row 290000, time elapsed 4m30.85305806s
2019/04/08 22:37:45 Row 300000, time elapsed 5m2.183625905s
2019/04/08 22:43:18 Row 310000, time elapsed 5m33.135633645s
2019/04/08 22:49:22 Row 320000, time elapsed 6m3.47749514s
2019/04/08 22:55:56 Row 330000, time elapsed 6m33.647828s
2019/04/08 23:02:59 Row 340000, time elapsed 7m3.546443285s
2019/04/08 23:10:35 Row 350000, time elapsed 7m35.978277292s
2019/04/08 23:18:43 Row 360000, time elapsed 8m8.039533099s
2019/04/08 23:27:22 Row 370000, time elapsed 8m38.447390938s
2019/04/08 23:36:33 Row 380000, time elapsed 9m11.603785808s
2019/04/08 23:46:15 Row 390000, time elapsed 9m41.515021912s
2019/04/08 23:56:17 Row 400000, time elapsed 10m2.085553551s
2019/04/09 00:06:42 Row 410000, time elapsed 10m25.252517462s
2019/04/09 00:17:38 Row 420000, time elapsed 10m55.909756693s

Describe the results you expected:

If there were a direct way to import a CSV, or some method to speed up this sort of import, it would be really useful.

Excelize version or commit ID:

  digest = "1:9b67e96a030cc96a3bef1d7cb1143f1e13440f1087eee5999fa9ba5514c1027c"
  name = "github.com/360EntSecGroup-Skylar/excelize"
  packages = ["."]
  pruneopts = ""
  revision = "dea7ba0ec43a4c29a6642d02b6edc73b8b0369f0"
  version = "v1.4.1"

Environment details (OS, Microsoft Excel™ version, physical, etc.):
The above log was captured from an AWS Fargate Docker task running with 4096 CPU units and 30720 MiB of memory.

@asad-awadia

Does excelize read the entire Excel file into memory, or is it stream based?

@mlh758
Contributor

mlh758 commented Apr 11, 2019

I use this library for fairly large Excel files as well. We're creating a document with images, styles, hyperlinks, merged cells, and about 50 sheets, some of which have about 50,000 rows. Most of our time is spent loading data through an API rather than being limited by the excelize library. The 50,000-row sheet takes us about 5 seconds.

Have you tried profiling this code?
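For example, wrapping the import loop with the standard runtime/pprof package gives a CPU profile you can inspect afterwards. A minimal sketch, where importCSV is just a placeholder for your loop above:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func importCSV() {
	// The CSV-to-excelize loop from the issue goes here.
}

func main() {
	f, err := os.Create("cpu.out")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	importCSV()
}

Then run go tool pprof cpu.out and use top to see where the time goes.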

@mlh758
Contributor

mlh758 commented Apr 11, 2019

Your example didn't work out of the box, so I created a trimmed-down test:

https://gist.github.com/mlh758/0a8daf41d63af7198beb3ed25ccb0ac2

I did notice an issue with scaling: 10,000 records took about 500 ms and 50,000 records took 11 seconds. It's not nearly as severe as what you're experiencing, though.

I tried this against master and saw a tiny improvement, but only about 300 ms.

@mlh758
Contributor

mlh758 commented Apr 11, 2019

I refactored my gist so I could run it with a benchmark. At 50,000 items the Go runtime starts to show up in the profile, but most of the work is in excelize.prepareSheetXML. Here's the output from pprof:

Showing top 10 nodes out of 67
      flat  flat%   sum%        cum   cum%
     4.68s 40.14% 40.14%      9.27s 79.50%  github.com/360EntSecGroup-Skylar/excelize.prepareSheetXML
     2.53s 21.70% 61.84%      2.53s 21.70%  runtime.memmove
     1.92s 16.47% 78.30%      1.92s 16.47%  runtime.memclrNoHeapPointers
     0.64s  5.49% 83.79%      0.64s  5.49%  runtime.pthread_cond_signal
     0.54s  4.63% 88.42%      0.54s  4.63%  runtime.duffcopy
     0.21s  1.80% 90.22%      2.41s 20.67%  runtime.mallocgc
     0.18s  1.54% 91.77%      0.18s  1.54%  runtime.nanotime
     0.12s  1.03% 92.80%      0.12s  1.03%  runtime.(*mspan).init
     0.10s  0.86% 93.65%      0.33s  2.83%  runtime.convTstring
     0.08s  0.69% 94.34%      0.25s  2.14%  encoding/csv.(*Reader).readRecord

Memory allocation basically looks similar:

Showing nodes accounting for 309.34MB, 97.97% of 315.76MB total
Dropped 27 nodes (cum <= 1.58MB)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
  110.55MB 35.01% 35.01%   118.05MB 37.39%  github.com/360EntSecGroup-Skylar/excelize.prepareSheetXML
   81.51MB 25.81% 60.83%    92.01MB 29.14%  encoding/xml.(*printer).marshalAttr
   35.13MB 11.13% 71.95%    35.13MB 11.13%  bytes.makeSlice
   30.02MB  9.51% 81.46%    30.02MB  9.51%  github.com/360EntSecGroup-Skylar/excelize.trimCell
   11.81MB  3.74% 85.20%    11.81MB  3.74%  bytes.Replace
   11.81MB  3.74% 88.94%    11.81MB  3.74%  github.com/360EntSecGroup-Skylar/excelize.(*File).saveFileList
   10.50MB  3.33% 92.27%    10.50MB  3.33%  reflect.packEface
       6MB  1.90% 94.17%        6MB  1.90%  encoding/csv.(*Reader).readRecord
       6MB  1.90% 96.07%   313.72MB 99.35%  github.cerner.com/mh039333/exc_test.WriteExcel
       6MB  1.90% 97.97%     7.50MB  2.38%  github.com/360EntSecGroup-Skylar/excelize.CoordinatesToCellName

The library seems to be forced to spend a great deal of time allocating memory to grow the row slice and the column slices within those rows.
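For anyone who wants to reproduce this, a benchmark along these lines shows the same growth (a sketch assuming the master-branch API with CoordinatesToCellName, not the actual gist; writeRows just mimics the per-cell SetCellValue pattern from the issue). Put it in a _test.go file:

package bench

import (
	"fmt"
	"testing"

	"github.com/360EntSecGroup-Skylar/excelize"
)

// writeRows fills rows x cols cells one SetCellValue call at a time.
func writeRows(rows, cols int) {
	f := excelize.NewFile()
	for r := 1; r <= rows; r++ {
		for c := 1; c <= cols; c++ {
			cell, _ := excelize.CoordinatesToCellName(c, r)
			f.SetCellValue("Sheet1", cell, fmt.Sprintf("row %d col %d", r, c))
		}
	}
}

func BenchmarkWriteRows(b *testing.B) {
	for i := 0; i < b.N; i++ {
		writeRows(50000, 5)
	}
}

Running it with go test -bench WriteRows -cpuprofile cpu.out -memprofile mem.out produces profiles like the ones above.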

@jjmartin
Author

jjmartin commented Apr 11, 2019

Note that I'm running closer to 500,000 rows, not 50,000 (and around 30 columns).

@mlh758
Contributor

mlh758 commented Apr 11, 2019

@jjmartin I see that, but

2019/04/08 21:40:10 Row 50000, time elapsed 33.489274657s

You're getting significantly worse performance than I am even from the start.

Edit: This comparison is meaningless; just having more columns in your CSV would change everything. Sorry about that.

@mlh758
Contributor

mlh758 commented Apr 11, 2019

prepareSheetXML is essentially an O(m*n) operation: every time you add something to the worksheet, it steps through every column and row in the document to ensure the worksheet stays rectangular. We should do this once at the end instead of on every operation.

We also may not actually need to do it at all. If you create an Excel document and put something in C5, that's the only value Excel saves; it doesn't backfill anything.

I changed prepareSheetXML to:

func prepareSheetXML(xlsx *xlsxWorksheet, col int, row int) {
	rowCount := len(xlsx.SheetData.Row)
	if rowCount < row {
		// Grow the row slice only as far as the requested row instead of
		// rescanning the whole sheet on every write.
		if cap(xlsx.SheetData.Row) < row {
			newRow := make([]xlsxRow, row-1, row*2)
			copy(newRow, xlsx.SheetData.Row)
			xlsx.SheetData.Row = newRow
		}
		xlsx.SheetData.Row = append(xlsx.SheetData.Row, xlsxRow{R: row})
	}
	// Only the target row gets its cells filled out to the requested column.
	rowData := &xlsx.SheetData.Row[row-1]
	cellCount := len(rowData.C)
	if cellCount < col {
		for colIdx := cellCount; colIdx < col; colIdx++ {
			cellName, _ := CoordinatesToCellName(colIdx+1, row)
			rowData.C = append(rowData.C, xlsxC{R: cellName})
		}
	}
}

and 50,000 rows dropped to 1 second, and it creates a valid xlsx document. 500,000 rows takes 10 seconds, so we're scaling linearly now instead of quadratically. Out-of-order insertion works with this as well, so you can assign C5 and then A1 and get a valid Excel document. It should work for your CSV generation, but I'm getting test failures now, so I'm sure I've broken something else in the library.
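For example, with that change in place the following still produces a valid workbook even though the cells are written out of order (a trivial illustration, not code from the change itself):

f := excelize.NewFile()
f.SetCellValue("Sheet1", "C5", "written first")
f.SetCellValue("Sheet1", "A1", "written second")
f.SaveAs("Book1.xlsx") // opens fine; no backfilled cells between A1 and C5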

@mlh758
Contributor

mlh758 commented Apr 12, 2019

I opened a PR to fix this and included some benchmarks to help prevent a performance regression.

I ended up reverting part of my change; my custom copy logic to fill in rows was actually worse than just doing the necessary appends, and it was breaking other code. The PR makes the performance of writing cells scale linearly instead of O(m*n). Feedback is welcome on this; all the tests are passing and the documents I generated locally look okay, but it would be good to have some more eyes on the changes.

@mlh758
Contributor

mlh758 commented Apr 12, 2019

Also, since my PR is going into master: if you're on 1.4.1, there have been a lot of changes to the excelize interface on master. A lot of functions now return errors in addition to values, and some things have shifted from 0-based to 1-based indexing to be consistent with Excel. If, like me, you don't have the option of refactoring your whole code base to adhere to that, this gist might help.

I moved the functions I was using behind an interface and wrapped calls to them to adjust the indexes back or swallow the errors. Most of those new errors aren't pertinent if you're generating the Excel file yourself rather than allowing a user to provide input.
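The shape of that adapter is roughly the following (hypothetical names, not the actual gist), converting the old 0-based column index back and dropping errors the caller doesn't care about:

package exceladapter

import "github.com/360EntSecGroup-Skylar/excelize"

// cellWriter is the narrow slice of excelize behaviour the rest of the
// code base depends on, still expressed with 0-based column indexes.
type cellWriter interface {
	SetCellValueAt(sheet string, col0, row int, value interface{})
}

// adapter implements cellWriter on top of the newer excelize API.
type adapter struct {
	f *excelize.File
}

func (a adapter) SetCellValueAt(sheet string, col0, row int, value interface{}) {
	// Translate the old 0-based column index into a 1-based cell name.
	cell, err := excelize.CoordinatesToCellName(col0+1, row)
	if err != nil {
		return // coordinates are generated internally, so drop the error
	}
	// On master SetCellValue also returns an error; swallow it here since
	// we control the sheet names and value types ourselves.
	_ = a.f.SetCellValue(sheet, cell, value)
}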

@jjmartin
Author

I'll wait for the release. I did already make my own interface for testing, but your gist is more exhaustive; it might be nice to have that in a separate repo that I could just include (similar to the AWS API interfaces).

@mlh758
Contributor

mlh758 commented Apr 15, 2019

Hm, I could, but my interface stub isn't complete and some of it I intend to be temporary. If I end up making a complete mock I'll put it in a repo and post it back here; it wouldn't be a bad idea to have.

@nurzhannogerbek

I have the same problem. In my case I have ~100,000 rows, and excelize takes a little more than a minute to generate the Excel file, which is too long in my opinion. @jjmartin, how did you solve your problem? @mlh758, what can you recommend to solve this problem?

@xuri
Member

xuri commented Jan 12, 2020

Hi @nurzhannogerbek, please read and write large files with the streaming API; see NewStreamWriter and Rows.
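For reference, a minimal sketch of stream writing (assuming the StreamWriter API as it lands in v2.1.0 and the v2 module path; sheet name and row values are placeholders):

package main

import (
	"log"

	"github.com/360EntSecGroup-Skylar/excelize/v2"
)

func main() {
	f := excelize.NewFile()
	sw, err := f.NewStreamWriter("Sheet1")
	if err != nil {
		log.Fatal(err)
	}
	// Write a whole row per call instead of one SetCellValue per cell.
	for r := 1; r <= 100000; r++ {
		cell, _ := excelize.CoordinatesToCellName(1, r)
		if err := sw.SetRow(cell, []interface{}{"A", "B", "C", r}); err != nil {
			log.Fatal(err)
		}
	}
	// Flush must be called before saving so the buffered rows reach the sheet.
	if err := sw.Flush(); err != nil {
		log.Fatal(err)
	}
	if err := f.SaveAs("results.xlsx"); err != nil {
		log.Fatal(err)
	}
}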

@nurzhannogerbek

nurzhannogerbek commented Jan 12, 2020

@xuri thank you for your recommendations, but I am confused by NewStreamWriter. Is it some extension? I can't import it. The version of the excelize package I'm using right now is 1.4.1. My IDE shows the message: Unresolved reference 'NewStreamWriter'. Are there any working examples?

Right now my code looks like this:

rows, err = database.DBGORM.Raw("SELECT * FROM RESULTS;").Rows()
if err != nil {
	fmt.Println(err)
}

defer rows.Close()

file := excelize.NewFile()

sheetName := "Results"
file.SetSheetName(file.GetSheetName(1), sheetName)

file.SetCellValue(sheetName, "A1", "Column A1")
file.SetCellValue(sheetName, "B1", "Column B1")
file.SetCellValue(sheetName, "C1", "Column C1")
file.SetCellValue(sheetName, "D1", "Column D1")

style, err := file.NewStyle(`{"fill":{"type":"pattern","color":["#FFFF00"],"pattern":1}}`)
if err != nil {
	fmt.Println(err)
}

file.SetCellStyle(sheetName, "A1", "A1", style)
file.SetCellStyle(sheetName, "B1", "A1", style)
file.SetCellStyle(sheetName, "C1", "A1", style)
file.SetCellStyle(sheetName, "D1", "A1", style)

err = file.AutoFilter(sheetName, "A1", "I1", "")
if err != nil {
	fmt.Println(err)
}

index := 0

for rows.Next() {
	var result Result

	if err = rows.Scan(&result.A1, &result.B1, &result.C1, &result.D1); err != nil {
		fmt.Println(err)
	}

	file.SetCellValue(sheetName, fmt.Sprintf("A%d", index+2), result.A1)
	file.SetCellValue(sheetName, fmt.Sprintf("B%d", index+2), result.B1)
	file.SetCellValue(sheetName, fmt.Sprintf("C%d", index+2), result.C1)
	file.SetCellValue(sheetName, fmt.Sprintf("D%d", index+2), result.D1)

	index++
}

buffer, err := file.WriteToBuffer()
if err != nil {
	fmt.Println(err)
}

filename := "results.xlsx"

http.ServeContent(responseWriter, request, filename, time.Now(), strings.NewReader(buffer.String()))

I updated the excelize package to the latest version, 2.0.2. For some reason, the problem is still the same. Interestingly, after the version update my code above works faster: an Excel file with ~100,000 rows now generates in ~4 seconds.

@xuri
Member

xuri commented Jan 13, 2020

Hi @NogerbekNurzhan, please use the master branch code if you can; stream writing will be in the next release, version 2.1.0.

nullfy pushed a commit to nullfy/excelize that referenced this issue Oct 23, 2020
…#383)

* Rewrite prepareSheetXML to scale linearly

We don't need to backfill columns into every row for most purposes.
Provided makeContiguousColumns for setting styles where we do need it
for a specific region.

Added a benchmark to monitor progress. For 50,000 rows this went
from about 11 seconds to 1 second. The improvements are more
dramatic as the row/column count increases.

* Assigning that row value was redundant