refactor(techStackGeneric): improve getProcNames #250

ee7 · 2024-03-20T15:31:26Z

Description

The getProcNames procedure was doing a couple of strange things:

It was trying to read files at /proc/foo/status, where foo is the name of a file, not directory, in /proc.
It scanned each file at /proc/[pid]/status more than once (specifically, once per digit of the pid).

This wasn't producing any incorrect results because:

It caught and ignored the exceptions from opening nonexistent files.
The name values are added to a HashSet[string], which deduplicates them.

It was also doing some further unnecessary work: it kept scanning lines of /proc/[pid]/status even after it found the Name value (which is specified to be on the first line).

This PR fixes those issues.

Note that /proc/[pid]/comm also exists in Linux since 2.6.33 (2010-02-24), but that's less portable.
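
As a rough illustration of the behavior described above (not the code merged in this PR), the following sketch visits only directories in /proc, reads each status file once, and stops after the first line. The names parseNameLine and procNamesSketch are hypothetical, and the sketch uses startsWith instead of the strscans-based parsing mentioned in the commits.

```nim
import std/[os, sets, strutils]

proc parseNameLine(line: string): string =
  ## Returns the value of a `Name:` line, or "" if `line` is not one.
  const prefix = "Name:"
  if line.startsWith(prefix):
    result = line[prefix.len .. ^1].strip()

proc procNamesSketch(procDir = "/proc"): HashSet[string] =
  ## Hypothetical stand-in for getProcNames; an assumption, not the merged code.
  for kind, path in walkDir(procDir):
    # Skip plain files such as /proc/uptime: only directories can hold `status`.
    if kind != pcDir:
      continue
    try:
      # One read per directory, and only the first line is inspected.
      for line in lines(path / "status"):
        let name = parseNameLine(line)
        if name.len > 0:
          result.incl(name)
        break
    except IOError, OSError:
      discard # no status file here, or the process exited in the meantime

when isMainModule:
  echo procNamesSketch()
```

On a system without /proc, walkDir yields nothing and the result is an empty set.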

Refs: #249 (to a tiny degree)

Testing

Add to plugins/techStackDetection.nim

when isMainModule:
  echo getProcNames()

and run nim r plugins/techStackDetection.nim

It'll be easier to add tests for this kind of thing when we're running unit tests in CI.

I find this more readable, and it avoids allocating a string for `head`.

Examples for lastPathPart():

    doAssert lastPathPart("foo/bar") == "bar"
    doAssert lastPathPart("foo/bar/") == "bar"

Examples for splitPath():

    doAssert splitPath("foo/bar") == (head: "foo", tail: "bar")
    doAssert splitPath("foo/bar/") == (head: "foo/bar", tail: "")

miki725 · 2024-03-20T20:15:54Z

src/plugins/techStackGeneric.nim

-            result.incl(name)
-      except:
-         continue
+      for line in data.split("\n"):


Why not splitLines stdlib function?

You can also read the file one line at a time if you use the file stream from fd_cache so that we don't need to read the full file and then split by lines

> Why not splitLines stdlib function?

Just to reduce the diff size.

I would personally have written splitLines though. Done in 2dbbf77.

> You can also read the file one line at a time if you use the file stream from fd_cache so that we don't need to read the full file and then split by lines

Generally if we wanted to optimize for performance, we could use mmap or just read a single line (taking advantage of the fact that the Name value is specified to be on the first line). Both avoid the stream overhead, and the readability cost of the withFileStream template that injects a variable.

But in this case, the data at /proc/[pid]/status is in memory, not on disk. And it's not a significant speedup to read a line from memory instead of the full data (which is about 1 KiB here). The file reads here are a completely negligible contributor to the execution time of chalk insert. Let's stick with tryToLoadFile for now?
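
For reference, a small sketch of how split("\n") and splitLines differ. The sample data is hypothetical: real /proc/[pid]/status content uses plain LF line endings, so both calls behave the same there, and the choice is mostly about intent.

```nim
import std/strutils

# Hypothetical input with CR LF endings, used only to show the difference.
let data = "Name:\tbash\r\nUmask:\t0022\r\n"

# split("\n") cuts on LF only, so the carriage return stays attached.
doAssert data.split("\n")[0] == "Name:\tbash\r"

# splitLines treats CR LF (and a lone CR or LF) as one line terminator.
doAssert data.splitLines()[0] == "Name:\tbash"

# The iterator form is lazy, so a loop can stop after the first line.
var first = ""
for line in data.splitLines():
  first = line
  break
doAssert first == "Name:\tbash"
```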

miki725 · 2024-03-20T20:19:07Z

src/plugins/techStackGeneric.nim

+  for kind, path in walkDir(directory):
    if inFileScope[category][subcategory]:
      break


Not following logic here. Why break out of the inner loop on a variable which is not controlled by the loop? Can we bypass loop if the condition doesn't match?

This PR doesn't touch the behavior here, and I'll have future PRs that continue to refactor this file. There are definitely some more things in this file that are unexpected and hard to read.

Can we leave this till later? This PR is aimed at getProcNames, and it's already a stretch to include the current walkDir refactoring in this PR unless I title the squashed commit something like "improve walkDir usage", which makes the changes sound stylistic only.

Edit: moved the walkDir changes to #252 to keep this PR focused on one thing.

miki725 · 2024-03-20T20:20:17Z

src/plugins/techStackGeneric.nim

-    if filePath.kind == pcDir:
-      scanDirectory(filePath.path, category, subcategory)
-      continue
+    if kind == pcFile:


Not sure what the Nim conventions are, but maybe use switch here, as it switches on the kind and there are no other conditions in the branches?

I kept the existing

if kind == pcFile:
  foo
elif kind == pcDir:
  bar

mainly to minimize the diff size. Whether that's better than the case version:

case kind
of pcFile:
  foo
of pcDir:
  bar
of pcLinkToFile, pcLinkToDir:
  discard

depends mostly on personal taste: does the reader prefer to minimize the number of lines, or to see all the enum members?

Recall that in Nim, it is a compile-time error if a case statement is not exhaustive.

Some factors that make it more likely for me personally to use case:

I'm doing something with most of the enum members

The case-based version is not significantly longer in relative terms

The enum is not defined by the Nim stdlib

There is no other condition on the branch (as you mention), and there is not likely to be in the future

There's little readability benefit from trying to express the likelihood of branches

For performance-critical code (not here) there is some subtlety: we can use the linearScanEnd pragma to express which branches are most likely. The ordering of an if statement's branches already suggests that, although Nim also provides likely and unlikely.

Let's just leave as-is for now? In the long term, I might propose a helper iterator in nimutils for walking with filtering.
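
To make the trade-off concrete, here is a minimal sketch of the case form over PathComponent, plus a variant using linearScanEnd. The procs describe and describeHot are hypothetical and exist only for illustration; nothing like them is in this PR.

```nim
import std/os

proc describe(kind: PathComponent): string =
  ## Exhaustive `case`: deleting any `of` branch is a compile-time error.
  case kind
  of pcFile: "file"
  of pcDir: "directory"
  of pcLinkToFile, pcLinkToDir: "link"

proc describeHot(kind: PathComponent): string =
  ## Same mapping, but hints that `pcFile` is the common case.
  case kind
  of pcFile:
    # Branches up to and including this one are tested linearly first.
    {.linearScanEnd.}
    result = "file"
  of pcDir:
    result = "directory"
  of pcLinkToFile, pcLinkToDir:
    result = "link"

when isMainModule:
  echo describe(pcDir)
  echo describeHot(pcFile)
```

The if/elif form stays shorter when only one or two members matter; the case form makes the ignored members explicit.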

Let's handle this in a separate PR.

ee7 · 2024-03-21T18:06:47Z

CI passed on the PR, but fails on the commit merged to main. I believe the failure is unrelated to the change:


main.go:62: error during command execution: unknown flag: --output-key-prefix

subprocess.CalledProcessError: Command '['cosign', 'generate-key-pair', '--output-key-prefix', 'chalk']' returned non-zero exit status 1.

[...]

Error: unknown flag: --tlog-upload

[...]

=========================== short test summary info ============================
FAILED test_command.py::test_setup_existing_keys[copy_files0-config1-False]
FAILED test_command.py::test_setup[config0-copy_files0] - subprocess.CalledPr...
FAILED test_command.py::test_setup[config1-copy_files0] - subprocess.CalledPr...
FAILED test_command.py::test_setup[config2-copy_files0] - subprocess.CalledPr...
FAILED test_command.py::test_setup_existing_keys[copy_files0-config0-True] - ...
======= 5 failed, 136 passed, 14 skipped, 1 warning in 273.03s (0:04:33) =======

ee7 added 15 commits March 20, 2024 16:15

refactor(techStackGeneric): fix multiple scans of same status

d9b576e

Each status file was previously scanned N times, where N is the length of its parent directory name.

refactor(techStackGeneric): scan only directories in /proc

9b2a610

refactor(techStackGeneric): check set membership, not string

efcf658

Use the more natural construct.

refactor(techStackGeneric): clarify a try/except

84bd26d

refactor(techStackGeneric): prefer let over var

e4f6797

refactor(techStackGeneric): inline once-used variable

221e8c1

Clarify that p_path isn't used later.

refactor(techStackGeneric): parse Name via strscans, not split

7bd12be

Improve readability (well, at least for me), and allocate less.

refactor(techStackGeneric): don't parse remainder of status

fde0c51

Previously, every line of each status file was scanned even after finding the Name value, which should be on the first line.

refactor(techStackGeneric): add doc comment

336aaf4

refactor(techStackGeneric): remove a useless continue

195aaae

refactor(techStackGeneric): reduce use of continue

2c531a3

Improve readability.

refactor(techStackGeneric): remove some more continue

be3e189

refactor(techStackGeneric): use consistent walkDir style

ed62b4f

refactor(techStackGeneric): declare var for splFile.ext

24f60ec

ee7 requested a review from miki725 March 20, 2024 15:31

ee7 requested a review from viega as a code owner March 20, 2024 15:31

nettrino previously approved these changes Mar 20, 2024

miki725 requested changes Mar 20, 2024

ee7 added 2 commits March 21, 2024 12:50

refactor(techStackGeneric): use nimutils tryToLoadFile

d7200e5

refactor(techStackGeneric): use splitLines(), not split("\n")

2dbbf77

ee7 dismissed nettrino’s stale review via 2dbbf77 March 21, 2024 11:53

ee7 requested a review from miki725 March 21, 2024 12:16

refactor(techStackGeneric): revert other walkDir changes

722058b

Let's handle this in a separate PR.

ee7 mentioned this pull request Mar 21, 2024

techStackGeneric: should be faster #249

Closed

11 tasks

miki725 approved these changes Mar 21, 2024

ee7 merged commit 7cf3029 into main Mar 21, 2024
2 checks passed

ee7 deleted the ee7/refactor-techStackGeneric branch March 21, 2024 17:28

ee7 mentioned this pull request Mar 21, 2024

Consider pinning CI dependencies more strictly #142

Open
