Spanner *runStream* causes high memory consumption #934
@hadson19 I've tried to reproduce the problem with the following script:

```javascript
async function queryWithMemUsage(instanceId, databaseId, projectId) {
  // Imports the Google Cloud client library
  const {Spanner} = require('@google-cloud/spanner');

  // Creates a client
  const spanner = new Spanner({
    projectId: projectId,
  });

  // Gets a reference to a Cloud Spanner instance and database
  const instance = spanner.instance(instanceId);
  const database = instance.database(databaseId);

  const query = {
    sql: `SELECT *
          FROM TableWithAllColumnTypes
          ORDER BY ColInt64`,
    gaxOptions: {pageSize: 1000},
  };

  let count = 0;
  database
    .runStream(query)
    .on('data', row => {
      count++;
      if (count % 100 === 0) {
        console.log(
          `Current row: ${JSON.stringify(row.toJSON({wrapNumbers: true}))}`
        );
        console.log(`Processed ${count} rows so far`);
        const used = process.memoryUsage().heapUsed / 1024 / 1024;
        console.log(`Current mem usage: ${Math.round(used * 100) / 100} MB`);
      }
    })
    .on('error', console.log)
    .on('end', () => {
      console.log(`Finished processing ${count} rows`);
      database.close();
    });
}
```

When running that script against a table that contains huge rows, the memory usage printed for every 100 rows varies between roughly 30 MB and 80 MB: it increases steadily to approximately 80 MB, at which point a garbage collection reduces it back to about 30 MB.
@olavloite
Can you try to write the result to a file? Can you also check the pause method?
I've changed my test case to the following:

```javascript
async function queryWithMemUsage(instanceId, databaseId, projectId) {
  // Imports the Google Cloud client library
  const {Spanner} = require('@google-cloud/spanner');
  const fs = require('fs');
  const stream = require('stream');
  // eslint-disable-next-line node/no-extraneous-require
  const through = require('through');

  // Creates a client
  const spanner = new Spanner({
    projectId: projectId,
  });

  // Gets a reference to a Cloud Spanner instance and database
  const instance = spanner.instance(instanceId);
  const database = instance.database(databaseId);

  const query = {
    sql: `SELECT *
          FROM TableWithAllColumnTypes
          ORDER BY ColInt64`,
    gaxOptions: {pageSize: 1000},
  };

  let count = 0;
  const fileStream = fs.createWriteStream('/home/loite/rs.txt');
  const rs = database
    .runStream(query)
    .on('data', () => {
      count++;
      if (count % 100 === 0) {
        console.log(`Processed ${count} rows so far`);
        const used = process.memoryUsage().heapUsed / 1024 / 1024;
        console.log(`Current mem usage: ${Math.round(used * 100) / 100} MB`);
      }
    })
    .on('error', console.log)
    .on('end', () => {
      console.log('Finished writing file');
      database.close();
    });

  // eslint-disable-next-line node/no-unsupported-features/node-builtins
  await stream.promises.pipeline(
    rs,
    through(function (data) {
      return this.queue(
        `${JSON.stringify(data.toJSON({wrapNumbers: true}))}\n`
      );
    }),
    fileStream
  );
}
```

This generates a result file of 2.1 GB. The memory usage is even lower than in the initial test; it now varies between approximately 30 MB and 55 MB.
Hm... it seems that you have a fast SSD (the latency is low). Can you add a Transform that postpones each write (to create backpressure)? For example, something like the sketch below.
Because we have some latency with the storage (which emits pause), and this pause doesn't affect the nodejs-spanner library. Thanks
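A minimal sketch of such a delaying Transform, assuming an arbitrary 50 ms delay per row (the delay value and the exact wiring into the pipeline are assumptions, not taken from the original report):

```javascript
const {Transform} = require('stream');

// Hypothetical Transform that artificially slows the writer down so that
// its buffer fills up and backpressure propagates to the result stream.
const slowWriter = new Transform({
  objectMode: true,
  transform(row, _encoding, callback) {
    // Wait 50 ms (arbitrary) per row before passing it downstream.
    setTimeout(() => {
      callback(null, `${JSON.stringify(row.toJSON({wrapNumbers: true}))}\n`);
    }, 50);
  },
});

// It could replace the `through` step in the pipeline above, e.g.:
// await stream.promises.pipeline(rs, slowWriter, fileStream);
```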
@hadson19 Yep, that seems to trigger the problem for me as well.
When you see the increased memory usage, is it beyond what the system is capable of handling? I have noticed while researching stream memory consumption issues in the past that Node's internals are in control, and while the usage gets high, it's not causing a crash, for example.
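As a rough, generic illustration of what "Node's internals are in control" means (not Spanner-specific; the numbers are arbitrary assumptions): a readable stream buffers data internally until roughly `highWaterMark` is reached, after which `push()` returns false to ask the producer to stop.

```javascript
const {Readable} = require('stream');

// Generic object-mode readable that respects its own buffer limit.
const source = new Readable({
  objectMode: true,
  highWaterMark: 16, // buffer at most ~16 objects before signalling "stop"
  read() {
    // push() returns false once the internal buffer is full; a well-behaved
    // producer then waits until Node calls read() again instead of pushing on.
    const hasRoom = this.push({payload: 'x'.repeat(1024)});
    if (!hasRoom) {
      console.log(`Buffered ${this.readableLength} objects; pausing production`);
    }
  },
});
```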
I've been able to make it go out of memory on my local machine while testing this. |
I should note that this only happens when I try to write my 2GB result set to a file and deliberately add the artificial delay to the pipeline.
I have a solution locally that will fix this now, but I want to do some additional testing first.
The request stream should be paused if the downstream indicates that it cannot handle any more data at the moment. The request stream should be resumed once the downstream accepts data again. This reduces memory consumption and potential out-of-memory errors when a result stream is piped into a slow writer. Fixes googleapis#934
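A minimal sketch of the pause/resume pattern described above, using plain Node streams; the function and variable names are illustrative, not the library's internal API:

```javascript
// `requestStream` stands in for the upstream gRPC data source and
// `resultStream` for the downstream consumer; both names are made up here.
function forwardWithBackpressure(requestStream, resultStream) {
  requestStream.on('data', chunk => {
    // write() returns false when the downstream buffer is full.
    if (!resultStream.write(chunk)) {
      // Pause the request stream so rows stop piling up in memory...
      requestStream.pause();
      // ...and resume it once the downstream has drained its buffer.
      resultStream.once('drain', () => requestStream.resume());
    }
  });
  requestStream.on('end', () => resultStream.end());
}
```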
Well, both should handle backpressure correctly. If you were to add some custom transformer or stream to your pipeline that itself does not handle backpressure correctly, you could still run into problems. In this case it seems that one of the internal streams used by the library did not handle backpressure correctly.
@skuruppu @olavloite
@hadson19 What is the size of the result set that you are streaming, in terms of number of columns and number of rows, and what is the size of the data in the columns (or, maybe easier, the total size of the result set)?
@olavloite
and the library (nodejs-spanner) tries to consume a lot of memory.
@olavloite
What happens if you run the same query on the same table without this change? I agree that 900MB is a lot, and I'll look into that as well, but for me this change at least reduced the total memory consumption for this kind of very large result set to an amount that could be handled. In other words, it did not run out of memory.
@olavloite And after nearly one minute... Thanks
* fix: pause request stream on backpressure

  The request stream should be paused if the downstream is indicating that it cannot handle any more data at the moment. The request stream should be resumed once the downstream does accept data again. This reduces memory consumption and potentially out of memory errors when a result stream is piped into a slow writer. Fixes #934

* fix: do not retry stream indefinitely

  PartialResultSetStream should stop retrying to push data into the stream after a configurable number of retries have failed.

* fix: process review comments
* fix: remove unused code
* tests: add test for pause/resume
* fix: return after giving up retrying + add test
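As a rough illustration of the "do not retry stream indefinitely" item, a hedged sketch of giving up after a bounded number of attempts; `tryPushRow`, the retry count, and the delay are all assumptions and not the actual PartialResultSetStream implementation:

```javascript
// tryPushRow() stands in for whatever hands a row to the consumer-facing
// stream and is assumed to return false while the consumer cannot accept it.
async function pushRowWithRetries(tryPushRow, row, maxRetries = 10, delayMs = 100) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (tryPushRow(row)) {
      return; // the consumer accepted the row
    }
    // Give the consumer some time to drain before trying again.
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  // Bounded retries exhausted: fail instead of buffering data indefinitely.
  throw new Error(`Consumer did not accept data after ${maxRetries} retries`);
}
```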
Environment details

* @google-cloud/spanner version: @google-cloud/[email protected]

Steps to reproduce

The stream consumes a lot of memory (RAM); sometimes it takes 600 MB. The gaxOptions (pageSize, maxResults, retry) don't work. Thanks!