Ensure crawler can't run out of space with --diskUtilization param #264

Merged
merged 2 commits into from
Mar 31, 2023
40 changes: 35 additions & 5 deletions crawler.js
@@ -13,7 +13,7 @@ import * as warcio from "warcio";

 import { HealthChecker } from "./util/healthcheck.js";
 import { TextExtract } from "./util/textextract.js";
-import { initStorage, getFileSize, getDirSize, interpolateFilename } from "./util/storage.js";
+import { initStorage, getFileSize, getDirSize, interpolateFilename, getDiskUsage } from "./util/storage.js";
 import { ScreenCaster, WSTransport, RedisPubSubTransport } from "./util/screencaster.js";
 import { Screenshots } from "./util/screenshots.js";
 import { parseArgs } from "./util/argParser.js";
@@ -553,11 +553,14 @@ export class Crawler {
   async checkLimits() {
     let interrupt = false;

-    if (this.params.sizeLimit) {
-      const dir = path.join(this.collDir, "archive");
-
-      const size = await getDirSize(dir);
+    let dir;
+    let size;
+    if (this.params.sizeLimit || this.params.diskUtilization) {
+      dir = path.join(this.collDir, "archive");
+      size = await getDirSize(dir);
+    }

+    if (this.params.sizeLimit) {
       if (size >= this.params.sizeLimit) {
         logger.info(`Size threshold reached ${size} >= ${this.params.sizeLimit}, stopping`);
         interrupt = true;
@@ -573,6 +576,33 @@ export class Crawler {
       }
     }

+    if (this.params.diskUtilization) {
+      // Check that disk usage isn't already at or above the threshold
+      const diskUsage = await getDiskUsage();
+      // df reports "Use%" as a string such as "42%"; strip the trailing "%"
+      const usedPercentage = parseInt(diskUsage["Use%"].slice(0, -1), 10);
+      if (usedPercentage >= this.params.diskUtilization) {
+        logger.info(`Disk utilization threshold reached ${usedPercentage}% >= ${this.params.diskUtilization}%, stopping`);
+        interrupt = true;
+      }
+
+      // Check that disk usage isn't likely to cross threshold
+      const kbUsed = parseInt(diskUsage["Used"], 10);
+      const kbTotal = parseInt(diskUsage["1K-blocks"], 10);
+      let kbArchiveDirSize = Math.floor(size / 1024);
+      if (this.params.combineWARC && this.params.generateWACZ) {
Member

Nice, yes, this is a better way to do this!

+        kbArchiveDirSize *= 4;
+      } else if (this.params.combineWARC || this.params.generateWACZ) {
+        kbArchiveDirSize *= 2;
+      }
+
+      const projectedTotal = kbUsed + kbArchiveDirSize;
+      // Projected usage as a percentage of total disk capacity
+      const projectedUsedPercentage = Math.floor((projectedTotal / kbTotal) * 100);
+      if (projectedUsedPercentage >= this.params.diskUtilization) {
+        logger.info(`Disk utilization projected to reach threshold ${projectedUsedPercentage}% >= ${this.params.diskUtilization}%, stopping`);
+        interrupt = true;
+      }
+    }
+
     if (interrupt) {
       this.gracefulFinish();
     }
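
The projection step in `checkLimits()` can be exercised in isolation. The following is a minimal sketch, not code from this PR: `projectDiskUtilization` and its parameter names are hypothetical. It keeps the PR's multipliers (4x when both `combineWARC` and `generateWACZ` are set, 2x when only one is) and expresses the result as projected kilobytes over total kilobytes, times 100, so that it can be compared directly against the `--diskUtilization` percentage.

```javascript
// Hypothetical helper (not part of the PR): projects disk utilization
// after the archive directory is post-processed into combined WARCs and/or a WACZ.
function projectDiskUtilization({ kbUsed, kbTotal, archiveBytes, combineWARC = false, generateWACZ = false }) {
  // Archive directory size in kilobytes, matching df's 1K-blocks unit
  let kbArchive = Math.floor(archiveBytes / 1024);

  // Same multipliers as in the diff: each post-processing step may
  // need additional copies of the archived data on disk
  if (combineWARC && generateWACZ) {
    kbArchive *= 4;
  } else if (combineWARC || generateWACZ) {
    kbArchive *= 2;
  }

  const projectedTotal = kbUsed + kbArchive;
  // Percentage of total capacity, rounded down
  return Math.floor((projectedTotal / kbTotal) * 100);
}

// 500,000 KB used on a 1,000,000 KB disk, with a 100 MiB archive directory
const onlyWacz = projectDiskUtilization({
  kbUsed: 500000,
  kbTotal: 1000000,
  archiveBytes: 100 * 1024 * 1024,
  generateWACZ: true,
});
console.log(onlyWacz); // 70
```

With both options enabled, the same inputs project to 90, which would trip the default threshold of 90.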
10 changes: 10 additions & 0 deletions util/argParser.js
@@ -291,6 +291,12 @@ class ArgParser {
         default: 0,
       },

+      "diskUtilization": {
+        describe: "If set, save state and exit if disk utilization exceeds this percentage value",
+        type: "number",
+        default: 90,
+      },
+
       "timeLimit": {
         describe: "If set, save state and exit after time limit, in seconds",
         type: "number",
@@ -465,6 +471,10 @@ class ArgParser {
       argv.statsFilename = path.resolve(argv.cwd, argv.statsFilename);
     }

+    // Fall back to the default (90) when the value is outside the 0-99 range
+    if (argv.diskUtilization < 0 || argv.diskUtilization > 99) {
+      argv.diskUtilization = 90;
+    }
+
     return true;
   }
 }
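
The range check above can be sketched as a standalone function. This is an illustration only: `normalizeDiskUtilization` is a hypothetical name, and the `Number.isFinite` guard is an addition that is not in the PR (as written, a `NaN` value would pass both comparisons in the PR's check). A value of 0 is kept as-is, which leaves the check in `checkLimits()` disabled because `if (this.params.diskUtilization)` is then falsy.

```javascript
// Hypothetical helper (not part of the PR): clamp-by-fallback for --diskUtilization.
function normalizeDiskUtilization(value, fallback = 90) {
  // Assumption beyond the PR: also reject NaN and Infinity
  if (!Number.isFinite(value) || value < 0 || value > 99) {
    return fallback;
  }
  return value;
}

console.log(normalizeDiskUtilization(75));  // 75
console.log(normalizeDiskUtilization(150)); // 90
console.log(normalizeDiskUtilization(0));   // 0
```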
17 changes: 17 additions & 0 deletions util/storage.js
@@ -1,5 +1,7 @@
+import child_process from "child_process";
 import fs from "fs";
 import fsp from "fs/promises";
+import util from "util";

 import os from "os";
 import { createHash } from "crypto";
@@ -148,6 +150,21 @@ export async function getDirSize(dir) {
   return size;
 }

+export async function getDiskUsage(path="/") {
+  const exec = util.promisify(child_process.exec);
+  const result = await exec(`df ${path}`);
Member

I wonder if this is the best way to check disk usage... I guess maybe it is, instead of using some Node library for it?
I suppose if it works well, we can go with this for now.

Member Author

I looked for node libraries first (like https://www.npmjs.com/package/check-disk-space and https://www.npmjs.com/package/diskusage) but they all seemed to be wrappers for OS tools like df or its Windows equivalent so I figured I'd spare us another dependency and just implement it myself.

Member

fair enough!
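
On the question raised in this thread, one dependency-free alternative to shelling out to `df` is the built-in `statfs` call, which to my knowledge is available in `fs/promises` from Node 18.15 onward. The sketch below is not what the PR does, and both function names are hypothetical. The percentage formula (used over used plus available, rounded up) is an approximation of how GNU `df` reports `Use%`, as far as I understand it.

```javascript
// Hypothetical helper: used percentage from a statfs-like object.
// blocks = total blocks, bfree = free blocks, bavail = blocks available to unprivileged users
function usedPercentFromStatfs({ blocks, bfree, bavail }) {
  const used = blocks - bfree;
  const total = used + bavail;
  // Guard against an empty or pseudo filesystem
  return total === 0 ? 0 : Math.ceil((used / total) * 100);
}

// Hypothetical wrapper: assumes a Node version that provides fs.promises.statfs
async function getDiskUsedPercent(path = "/") {
  const { statfs } = await import("node:fs/promises");
  return usedPercentFromStatfs(await statfs(path));
}

console.log(usedPercentFromStatfs({ blocks: 1000, bfree: 400, bavail: 350 })); // 64
```

Keeping the arithmetic in a pure function makes it testable without touching a real filesystem; only the wrapper depends on the Node version.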

+  const lines = result.stdout.split("\n");
+  // The first line of df output is the header row; use its columns as keys
+  const keys = lines[0].split(/\s+/);
+  const rows = lines.slice(1).map(line => {
+    const values = line.split(/\s+/);
+    return keys.reduce((o, k, index) => {
+      o[k] = values[index];
+      return o;
+    }, {});
+  });
+  // Only one path is queried, so the first data row is the one of interest
+  return rows[0];
+}
+
 function checksumFile(hashName, path) {
   return new Promise((resolve, reject) => {
     const hash = createHash(hashName);
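
The parsing half of `getDiskUsage()` can be tried without running `df`. This is a minimal sketch under stated assumptions: `parseDfOutput` is a hypothetical name, and the sample text imitates typical GNU `df` output rather than output captured from the crawler container. Note that the `Mounted on` header splits into two keys, so the mount point ends up under `Mounted`.

```javascript
// Hypothetical helper (not part of the PR): parse df's tabular output into an object
// keyed by the header columns, returning the first data row.
function parseDfOutput(stdout) {
  const lines = stdout.trim().split("\n");
  const keys = lines[0].trim().split(/\s+/);
  const rows = lines.slice(1).map((line) => {
    const values = line.trim().split(/\s+/);
    const row = {};
    keys.forEach((key, index) => {
      row[key] = values[index];
    });
    return row;
  });
  return rows[0];
}

// Sample input resembling `df /` on Linux (illustrative values)
const sample = [
  "Filesystem     1K-blocks     Used Available Use% Mounted on",
  "overlay         61255492 21394128  36717036  37% /",
].join("\n");

const usage = parseDfOutput(sample);
console.log(parseInt(usage["Use%"].slice(0, -1), 10)); // 37
console.log(parseInt(usage["1K-blocks"], 10));         // 61255492
```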