Scancode fails to inventory package manifest files #2408

Open
bmarsh9 opened this issue Feb 19, 2021 · 17 comments

@bmarsh9

bmarsh9 commented Feb 19, 2021

Description

Hey there,

Scancode appears to inventory package manifest files only if the file name matches AND the scan is run from the same directory. For example, if my Python manifest file is named dev.txt, scancode does not appear to inventory it. However, if I rename the file to requirements.txt AND run scancode from the same directory, scancode will inventory it and the package dependencies will populate in the JSON output, as shown below. There's also a possibility I am using it wrong and this is expected.

How To Reproduce

Here we run scancode against the entire directory (no package dependencies shown)

Command: scancode -clpeui --json-pp out.json flask-sqlalchemy

    {
      "path": "flask-sqlalchemy/requirements/dev.txt", ### name is dev.txt
      "type": "file",
      "name": "dev.txt",
      "base_name": "dev",
      "extension": ".txt",
      "size": 2363,
      "date": "2021-02-19",
      "sha1": "f3ec0c8082673263ff92cf965fa104bc9c11e078",
      "md5": "80d6553a1d0caf914fb7ff299d742b2a",
      "sha256": "53dd0a09ff4313129309e0231463226372482868af9601fedf8758d2f74cea74",
      "mime_type": "text/plain",
      "file_type": "ASCII text",
      "programming_language": null,
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "licenses": [],
      "license_expressions": [],
      "percentage_of_license_text": 0,
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [], ### NOTHING HERE
      "emails": [],
      "urls": [],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    },

Now we run scancode in the same directory against the file and we get package dependencies

Command: scancode -clpeui --json-pp out.json requirements.txt

      "packages": [ ### NOW WE HAVE DATA
        {
          "type": "pypi",
          "namespace": null,
          "name": null,
          "version": null,
          "qualifiers": {},
          "subpath": null,
          "primary_language": "Python",
          "description": null,
          "release_date": null,
          "parties": [],
          "keywords": [],
          "homepage_url": null,
          "download_url": null,
          "size": null,
          "sha1": null,
          "md5": null,
          "sha256": null,
          "sha512": null,
          "bug_tracking_url": null,
          "code_view_url": null,
          "vcs_url": null,
          "copyright": null,
          "license_expression": null,
          "declared_license": null,
          "notice_text": null,
          "root_path": null,
          "dependencies": [
            {
              "purl": "pkg:pypi/[email protected]",
              "requirement": "==0.7.12",
              "scope": "dependencies",

Tell us how to reproduce the issue.

scancode -clpeui --json-pp out.json flask-sqlalchemy

scancode -clpeui --json-pp out.json requirements.txt
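To compare the two runs, it can help to list which scanned files came back with an empty "packages" list. The snippet below is a minimal sketch, not part of ScanCode: the helper name files_without_packages is made up for illustration, and it assumes the per-file records shown above sit under a top-level "files" key in out.json.

```python
import json


def files_without_packages(scan):
    """Return the paths of scanned files whose "packages" list is empty.

    Assumes (not confirmed here) that per-file records sit under a
    top-level "files" key, each with "path", "type" and "packages".
    """
    return [
        record["path"]
        for record in scan.get("files", [])
        if record.get("type") == "file" and not record.get("packages")
    ]


# Tiny inline sample shaped like the records in this report.
sample = json.loads("""
{
  "files": [
    {"path": "flask-sqlalchemy/requirements/dev.txt", "type": "file", "packages": []},
    {"path": "requirements.txt", "type": "file", "packages": [{"type": "pypi"}]},
    {"path": "flask-sqlalchemy/requirements", "type": "directory", "packages": []}
  ]
}
""")

print(files_without_packages(sample))
```

For a real scan, the sample would be replaced by loading out.json with json.load.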

System configuration

Python 3.6.13

FROM python:3.6-slim-buster is the base docker image

Version: ScanCode version 21.2.9

@bmarsh9 bmarsh9 added the bug label Feb 19, 2021
@pombredanne
Member

@bmarsh9 hey! thanks for the report!
These are the file name patterns used to decide whether to start a parse: https://github.com/nexB/scancode-toolkit/blob/5174a3b21758d0d6d5c1db3c74494d56e49a8e74/src/packagedcode/pypi.py#L116

Hence dev.txt alone would not be picked, while requirements.txt or dev-requirements.txt would be. But looking at the code, requirements-dev.txt would not be picked either, which is a bug IMHO.

The point is that we need some trigger to decide to look in a file, especially since the syntax of a requirements file is pretty lax, so there is no obvious content-based way to detect that a text file is really a requirements file.
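The name-based trigger described above can be sketched roughly as follows. This is an illustration only: ASSUMED_SUFFIXES and is_requirements_candidate are hypothetical names, and the single suffix is an assumption chosen to reproduce the behaviour reported in this thread, not the actual list in pypi.py.

```python
from pathlib import PurePosixPath

# Assumed suffix list for illustration; the real patterns live in pypi.py.
ASSUMED_SUFFIXES = ("requirements.txt",)


def is_requirements_candidate(path):
    """Return True if the file name alone would trigger a parse."""
    file_name = PurePosixPath(path).name
    return any(file_name.endswith(suffix) for suffix in ASSUMED_SUFFIXES)


for candidate in (
    "flask-sqlalchemy/requirements/dev.txt",
    "requirements.txt",
    "dev-requirements.txt",
    "requirements-dev.txt",
):
    print(candidate, is_requirements_candidate(candidate))
```

With a suffix-only check like this, dev.txt and requirements-dev.txt are skipped, while requirements.txt and dev-requirements.txt are picked up.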

FYI, note that there are a few pending/in progress tickets related to Python packages in general:

@bmarsh9
Author

bmarsh9 commented Feb 19, 2021

Thanks for the quick response. That makes sense. I figured it was doing some pattern matching.

Is there a flag for specifying which packages to look for or a pattern? So for example if I wanted to tell scancode to treat dev.txt as a python manifest file.

If not, that would be a nice feature to have

Thanks!

@bmarsh9
Author

bmarsh9 commented Feb 19, 2021

However, I'm still seeing some issues when I run a scan against a directory. requirements.txt is in the directory but the dependencies aren't picked up.

Verify requirements.txt exists and run scan

root@2e3956739895:/api-scan/scancode# ls flask-sqlalchemy/requirements/requirements.txt
flask-sqlalchemy/requirements/requirements.txt

root@2e3956739895:/api-scan/scancode# scancode -clpeui --json-pp out.json flask-sqlalchemy
Setup plugins...
Collect file inventory...
Scan files for: info, licenses, copyrights, packages, emails, urls with 1 process(es)...
[####################] 0
Scanning done.
Summary:        info, licenses, copyrights, packages, emails, urls with 1 process(es)
Errors count:   0
Scan Speed:     1.61 files/sec. 6.04 KB/sec.
Initial counts: 111 resource(s): 88 file(s) and 23 directorie(s)
Final counts:   111 resource(s): 88 file(s) and 23 directorie(s) for 329.66 KB
Timings:
  scan_start: 2021-02-19T172341.229902
  scan_end:   2021-02-19T172437.725320
  setup_scan:licenses: 1.82s
  setup: 1.82s
  scan: 54.58s
  total: 56.60s
Removing temporary files...done.

Output with no dependency results

    {
      "path": "flask-sqlalchemy/requirements/requirements.txt",
      "type": "file",
      "name": "requirements.txt",
      "base_name": "requirements",
      "extension": ".txt",
      "size": 2363,
      "date": "2021-02-19",
      "sha1": "f3ec0c8082673263ff92cf965fa104bc9c11e078",
      "md5": "80d6553a1d0caf914fb7ff299d742b2a",
      "sha256": "53dd0a09ff4313129309e0231463226372482868af9601fedf8758d2f74cea74",
      "mime_type": "text/plain",
      "file_type": "ASCII text",
      "programming_language": null,
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "licenses": [],
      "license_expressions": [],
      "percentage_of_license_text": 0,
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    },

@bmarsh9
Author

bmarsh9 commented Feb 19, 2021

Looking at that code snippet you posted, it should work.

"name": "requirements.txt",

And the code is basically

if file_name.endswith(name):

@bmarsh9
Author

bmarsh9 commented Feb 19, 2021

More testing

Scanning directory with requirements.txt

from scancode import api

path = "flask-sqlalchemy/requirements.txt"
data = api.get_package_info(path)

print(data)
>> {'packages': []}

Scanning file in same directory

from scancode import api

path = "requirements.txt"
data = api.get_package_info(path)

print(data)
>> {'packages': [{'type': 'pypi', 'namespace': None, 'name': None, 'version': None, 'qualifiers': {}, 'subpath': None, 'primary_language': 'Python', 'desc 

@bmarsh9
Author

bmarsh9 commented Feb 19, 2021

Looks like it's throwing this error:

[Errno 2] No such file or directory: 'MANIFEST.in'

If I include MANIFEST.in in my local directory, it works

Error is thrown here: https://github.com/nexB/scancode-toolkit/blob/f3ef5b4ad823577e507d673a7fbc65d5efe4f6af/src/packagedcode/recognize.py#L44

@pombredanne
Member

Is there a flag for specifying which packages to look for or a pattern? So for example if I wanted to tell scancode to treat dev.txt as a python manifest file.

If not, that would be a nice feature to have

I do not like adding too many command line flags. Here the issue would be whether there is a need for a flag to look for any package type (one flag per type would be worse). The rationale is that each additional way to configure things is a sign, at some level, that we are not getting it right in the first place.

On the other hand, adding extra patterns such as dev.txt or prod.txt, solo or inside a /requirements/ dir, would make sense.
Worst case it may fail to parse and return nothing.
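One possible shape for such extra patterns is sketched below. It is a hypothetical illustration, not a proposed patch: the function name and the exact rules are assumptions, and it covers only the "inside a requirements directory" variant plus any .txt name containing "requirements".

```python
from pathlib import PurePosixPath


def is_requirements_candidate_extended(path):
    """Hypothetical broader trigger for requirements-style files.

    Accepts a .txt file whose name contains "requirements", or any
    .txt file that sits directly inside a directory named "requirements".
    """
    p = PurePosixPath(path)
    name = p.name.lower()
    if not name.endswith(".txt"):
        return False
    if "requirements" in name:
        return True
    return p.parent.name == "requirements"


for candidate in (
    "flask-sqlalchemy/requirements/dev.txt",
    "requirements-dev.txt",
    "docs/notes.txt",
):
    print(candidate, is_requirements_candidate_extended(candidate))
```

A broader trigger accepts more false positives; as noted above, the worst case is that parsing fails and nothing is returned.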

@pombredanne
Member

However, I'm still seeing some issues when I run a scan against a directory. requirements.txt is in the directory but the dependencies aren't picked up.

Can you attach your requirements file?

In https://github.com/nexB/scancode-toolkit/blob/5174a3b21758d0d6d5c1db3c74494d56e49a8e74/src/scancode/api.py#L299 we mute the exceptions thrown from failing to parse a package manifest as there was too much noise from that.
You can make that visible again for debugging with the SCANCODE_DEBUG_PACKAGE_API env var. (any value will work)
So:
SCANCODE_DEBUG_PACKAGE_API=yes scancode -clipeu ......
or in the virtualenv:
SCANCODE_DEBUG_PACKAGE_API=yes python
With this the scan results should have a detailed failure trace
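The muting behaviour described above follows a common pattern, sketched here under assumptions. This is not the code in scancode/api.py: get_info_quietly and failing_parser are hypothetical names, and only the SCANCODE_DEBUG_PACKAGE_API variable name comes from this thread.

```python
import os


def get_info_quietly(parse, location, environ=os.environ):
    """Run a parser, hiding its failures unless a debug variable is set.

    `parse` is any callable taking a location and returning a list of
    packages. In this sketch any non-empty value of the variable
    re-enables the traceback.
    """
    try:
        return {"packages": parse(location)}
    except Exception:
        if environ.get("SCANCODE_DEBUG_PACKAGE_API"):
            raise
        # Muted: the caller only sees an empty result.
        return {"packages": []}


def failing_parser(location):
    raise FileNotFoundError(2, "No such file or directory", location)


print(get_info_quietly(failing_parser, "requirements.txt", environ={}))
```

This is why a failing parse can look identical to "nothing detected" in the scan output: both end up as an empty packages list.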

@bmarsh9
Author

bmarsh9 commented Feb 20, 2021

Here is the file (renamed to requirements.txt) https://github.com/pallets/flask-sqlalchemy/blob/master/requirements/dev.txt

With the debug flag
Traceback (most recent call last):
  File "run.py", line 4, in <module>
    data = api.get_package_info(path)
  File "/usr/local/lib/python3.6/site-packages/scancode/api.py", line 296, in get_package_info
    recognized_packages = recognize_packages(location)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/recognize.py", line 71, in recognize_packages
    for recognized in package_type.recognize(location):
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 81, in recognize
    yield parse(location)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 132, in parse
    parse_dependencies(parent_directory, package)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 194, in parse_dependencies
    dependencies = parse_with_dparse(resource_location)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 228, in parse_with_dparse
    with open(location) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'

It says FileNotFound but I have tried the absolute and relative path in the variable

root@ac8daea8d689:/api-scan/scancode# ls flask-sqlalchemy/requirements.txt
flask-sqlalchemy/requirements.txt
from scancode import api

path = "/api-scan/scancode/flask-sqlalchemy/requirements.txt"
data = api.get_package_info(path)

print(data)

@bmarsh9
Author

bmarsh9 commented Feb 20, 2021

But it will work if I move the requirements.txt file into the same directory as the script. Also, I don't think it's the format of the file, because it parses correctly if it is in the same directory.

@pombredanne
Member

@bmarsh9 are you sure that your path is correct?
Does this pass?

import os
assert os.path.exists("/api-scan/scancode/flask-sqlalchemy/requirements.txt") is True

Also in all cases, we should detect .txt files in a requirements dir alright: That's a bug we need to fix.

@bmarsh9
Author

bmarsh9 commented Feb 23, 2021

@pombredanne Yep, path exists

>>> import os
>>> assert os.path.exists("/api-scan/scancode/flask-sqlalchemy/requirements.txt") is True
>>>

@pombredanne
Member

Yep, path exists

So this should work. Does it (assuming you have wget)?:

wget -O dev-requirements.txt https://raw.githubusercontent.com/pallets/flask-sqlalchemy/master/requirements/dev.txt

SCANCODE_DEBUG_PACKAGE_API=yes python

import os
from scancode import api

loc = "dev-requirements.txt"
assert os.path.exists(loc)
data = api.get_package_info(loc)
print(data)

@bmarsh9
Author

bmarsh9 commented Feb 24, 2021

Yep that works. But if the requirements file is placed in a sub directory, it will throw an error. See below

returns the error with absolute or relative path
mkdir subdir
wget -O subdir/dev-requirements.txt https://raw.githubusercontent.com/pallets/flask-sqlalchemy/master/requirements/dev.txt
SCANCODE_DEBUG_PACKAGE_API=yes python

>>> import os
>>> from scancode import api
>>>
>>> loc = "/root/subdir/dev-requirements.txt" #absolute path
>>> assert os.path.exists(loc)
>>> data = api.get_package_info(loc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/scancode/api.py", line 296, in get_package_info
    recognized_packages = recognize_packages(location)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/recognize.py", line 71, in recognize_packages
    for recognized in package_type.recognize(location):
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 81, in recognize
    yield parse(location)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 132, in parse
    parse_dependencies(parent_directory, package)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 194, in parse_dependencies
    dependencies = parse_with_dparse(resource_location)
  File "/usr/local/lib/python3.6/site-packages/packagedcode/pypi.py", line 228, in parse_with_dparse
    with open(location) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'dev-requirements.txt'
>>> print(data)

So I guess the parse_with_dparse function is not getting the correct location variable, or it is mangled by something else before it gets there.
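One way to end up with this traceback is to list a directory and then open the bare file names, which are resolved against the current working directory rather than the directory that was listed. The sketch below illustrates that class of bug. It is a guess consistent with the traceback, not a reading of the actual pypi.py source, and read_buggy and read_fixed are hypothetical names.

```python
import os


def read_buggy(parent_directory):
    """Open each match by bare name: only works if cwd == parent_directory."""
    for name in os.listdir(parent_directory):
        if name.endswith("requirements.txt"):
            with open(name) as f:  # resolved against the current directory
                return f.read()
    return None


def read_fixed(parent_directory):
    """Join the name back onto the listed directory before opening."""
    for name in os.listdir(parent_directory):
        if name.endswith("requirements.txt"):
            with open(os.path.join(parent_directory, name)) as f:
                return f.read()
    return None
```

Called on /root/subdir from /root, read_buggy would raise FileNotFoundError for 'dev-requirements.txt' even though the absolute path exists, which matches the symptom above. It would also explain why everything works when the file sits in the current directory.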

@pombredanne
Member

So this is a clear bug! Thank you for your patience.

@bmarsh9
Author

bmarsh9 commented Feb 25, 2021

Course. Let me know if you need any re-testing. Thanks!

@pombredanne
Member

@bmarsh9 we now have a brand new requirements.txt parser, derived from pip, that landed in the develop branch.

Running scancode --package --yaml new.yml.txt foobar/ yields the attached file 🎉
new.yml.txt

We are also revamping the dependencies handling across the board.
