Ensure crawler can't run out of space with --diskUtilization param #264

Merged (2 commits) on Mar 31, 2023

@tw4l tw4l commented Mar 24, 2023

Fixes #242

The first commit implements --diskUtilization as described in the issue. In local testing, this stopped many crawls prematurely: the total size, even after combining WARCs and generating the WACZ, still would not have crossed the disk threshold.

The second commit modifies the routine: it keeps the disk utilization threshold high (90% by default) and instead projects whether that threshold is likely to be reached, based on the size of the current archive directory (x4 if both combineWARC and generateWACZ are passed, x2 if only one of them is).
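To make the projection concrete, here is a minimal sketch of how it could work. It is not the PR's actual code: the function names `projectedUtilization` and `shouldStopCrawl`, the `archiveDirBytes` parameter, and the rounding are hypothetical, and it assumes the scaled archive size is added on top of the space already in use.

```javascript
// Sketch only. combineWARC / generateWACZ mirror the crawler flags named above;
// every other name here is hypothetical.
function projectedUtilization({ kbUsed, kbTotal, archiveDirBytes, combineWARC, generateWACZ }) {
  // Archive directory size in KB, to match the 1K-block units reported by df.
  let kbArchive = Math.floor(archiveDirBytes / 1024);

  // Scale by the multipliers given in the PR description.
  if (combineWARC && generateWACZ) {
    kbArchive *= 4;
  } else if (combineWARC || generateWACZ) {
    kbArchive *= 2;
  }

  // Assumption: projected usage = space already used + scaled archive size.
  const projectedKb = kbUsed + kbArchive;
  return Math.round((projectedKb / kbTotal) * 100);
}

// True when the projected utilization reaches the threshold (90% by default).
function shouldStopCrawl(stats, thresholdPercent = 90) {
  return projectedUtilization(stats) >= thresholdPercent;
}
```

With a 100 MiB archive directory on a disk that is half full (500,000 of 1,000,000 KB), passing both flags projects enough extra usage to reach the 90% threshold, while passing only one of them stays well below it.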

@tw4l tw4l requested a review from ikreymer March 24, 2023 17:19
@@ -148,6 +150,21 @@ export async function getDirSize(dir) {
  return size;
}

export async function getDiskUsage(path="/") {
  const exec = util.promisify(child_process.exec);
  const result = await exec(`df ${path}`);
Member

I wonder if this is the best way to check disk usage... I guess maybe it is, rather than using some Node library for it?
I suppose if it works well, we can go with this for now.

Member Author

I looked for Node libraries first (like https://www.npmjs.com/package/check-disk-space and https://www.npmjs.com/package/diskusage), but they all seemed to be wrappers for OS tools like df or its Windows equivalent, so I figured I'd spare us another dependency and just implement it myself.

Member

fair enough!
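
The excerpt does not show how the `df` output becomes the object read as `diskUsage["Used"]` and `diskUsage["1K-blocks"]` in the diff. One possible approach, sketched here under assumptions, is to pair the header row with the first data row. The function name `parseDfOutput` and the `Mounted_on` key are hypothetical, and the sketch assumes single-row output with no wrapped filesystem names and no spaces in the mount point.

```javascript
// Sketch only: turn the text printed by `df <path>` into a plain object
// keyed by the header columns. Values are left as strings, matching the
// parseInt(...) calls seen in the diff.
function parseDfOutput(stdout) {
  const lines = stdout.trim().split("\n");

  // "Mounted on" is a single column whose name contains a space; join it
  // so that splitting on whitespace yields one key per column.
  const keys = lines[0].replace("Mounted on", "Mounted_on").trim().split(/\s+/);
  const values = lines[1].trim().split(/\s+/);

  const row = {};
  keys.forEach((key, i) => {
    row[key] = values[i];
  });
  return row;
}

// Example with typical GNU df output:
const sample = [
  "Filesystem     1K-blocks      Used Available Use% Mounted on",
  "/dev/sda1       61255492  31240000  26873780  54% /",
].join("\n");

const diskUsage = parseDfOutput(sample);
console.log(parseInt(diskUsage["Used"]), parseInt(diskUsage["1K-blocks"]));
```

Column names differ between `df` implementations, so keys such as `1K-blocks` should be checked against the platform the crawler runs on.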

const kbUsed = parseInt(diskUsage["Used"]);
const kbTotal = parseInt(diskUsage["1K-blocks"]);
let kbArchiveDirSize = Math.floor(size/1024);
if (this.params.combineWARC && this.params.generateWACZ) {
Member

Nice, yes, this is a better way to do this!

@ikreymer ikreymer merged commit 746d80a into main Mar 31, 2023
@ikreymer ikreymer deleted the issue-242-disk-space branch March 31, 2023 16:35
Development

Successfully merging this pull request may close these issues.

Ensure Crawler Can Not Run out of Disk Space / Stops at Disk Utilization
2 participants