Analyse yml - search for duplicates #664

Open
haesleinhuepf opened this issue Feb 16, 2025 · 5 comments

Comments

@haesleinhuepf
Member

Write Python code that recursively navigates the resources/nfdi4bioimage.yml file and searches for all URLs. They may appear as the string value of any "url" key in a dictionary, or as a list of values. Print out all duplicates (URLs that appear more than once).

git-bob try to do this.

Contributor

This message was generated by git-bob (version: 0.19.0, model: github_models:gpt-4o, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

A new script was added to analyze a YAML file for duplicate URLs. The script navigates through the resources/nfdi4bioimage.yml file recursively to locate URLs that may appear as dictionary values for "url" keys or as part of lists. It then identifies and outputs any duplicate URLs that occur more than once.
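The script itself is not shown in this thread, so the following is only a minimal sketch of the approach described above, not git-bob's actual code. The function names `collect_urls`, `find_duplicates` and `load_resources` are hypothetical, and it assumes that a "url" value is either a single string or a list of strings. PyYAML is imported lazily inside `load_resources`, so the traversal can be tried on plain Python data without it.

```python
from collections import Counter


def collect_urls(node):
    """Recursively gather every URL stored under a "url" key."""
    urls = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url":
                if isinstance(value, str):
                    # Single URL given as a plain string
                    urls.append(value)
                elif isinstance(value, list):
                    # Several URLs given as a list of strings
                    urls.extend(v for v in value if isinstance(v, str))
            else:
                # Any other key may hold nested dictionaries or lists
                urls.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(collect_urls(item))
    return urls


def find_duplicates(urls):
    """Return the sorted URLs that occur more than once."""
    counts = Counter(urls)
    return sorted(url for url, n in counts.items() if n > 1)


def load_resources(path):
    """Parse a YAML file (assumes PyYAML is installed)."""
    import yaml  # imported here so the functions above work without it

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)
```

Under these assumptions, usage would look like `find_duplicates(collect_urls(load_resources("resources/nfdi4bioimage.yml")))`.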

@haesleinhuepf
Member Author

Ok, I think a notebook would be better.

Git-bob try again

Contributor

This message was generated by git-bob (version: 0.19.0, model: github_models:gpt-4o, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

A new Jupyter notebook was created to replace the earlier Python script for analyzing a YAML file to detect duplicate URLs. The notebook is structured to recursively search for URLs in the specified file and identify duplicates. Additionally, a check was introduced to ensure that the output file's parent directory exists, using os.makedirs with the exist_ok=True flag.
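The directory check mentioned above might look roughly like the sketch below. The helper name `write_duplicates` and the one-URL-per-line output format are assumptions made for illustration; only the use of `os.makedirs` with `exist_ok=True` comes from the message.

```python
import os


def write_duplicates(duplicates, output_path):
    """Write one URL per line, creating the parent directory if needed."""
    parent = os.path.dirname(output_path)
    # os.path.dirname returns "" for a bare filename, and os.makedirs("")
    # raises an error, so the directory is only created when there is one.
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists
        os.makedirs(parent, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as handle:
        for url in duplicates:
            handle.write(url + "\n")
```

With `exist_ok=True`, calling the helper repeatedly for the same output path does not fail on the second run.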

While solving this task, the following errors occurred:

  • Error during {'action': 'delete', 'filename': 'scripts/find_duplicate_urls.py'}: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/repos/contents#get-repository-content", "status": "404"}
        Traceback (most recent call last):
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/git_bob/_ai_github_utilities.py", line 659, in solve_github_issue
          Config.git_utilities.delete_file_from_repository(repository, branch_name, filename)
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/git_bob/_github_utilities.py", line 787, in delete_file_from_repository
          file = get_file_in_repository(repository, branch_name, file_path)
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/git_bob/_github_utilities.py", line 487, in get_file_in_repository
          return repo.get_contents(file_path, ref=branch_name)
                 ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/github/Repository.py", line 2389, in get_contents
          headers, data = self._requester.requestJsonAndCheck(
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
              "GET",
              ^^^^^^
          ...<2 lines>...
              follow_302_redirect=True,
              ^^^^^^^^^^^^^^^^^^^^^^^^^
          )
          ^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/github/Requester.py", line 624, in requestJsonAndCheck
          return self.__check(
                 ~~~~~~~~~~~~^
              *self.requestJson(
              ^^^^^^^^^^^^^^^^^^
          ...<7 lines>...
              )
              ^
          )
          ^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/github/Requester.py", line 792, in __check
          raise self.createException(status, responseHeaders, data)
      github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/repos/contents#get-repository-content", "status": "404"}
      

@haesleinhuepf
Member Author

Ok, nice! Instead of writing duplicates to a file, just print them out. Also print out the total number of URLs and the number of duplicates.

git-bob try to do this

Contributor

This message was generated by git-bob (version: 0.19.0, model: github_models:gpt-4o, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

The task involved analyzing a YAML file to identify duplicate URLs. Initially, a Python script was created to process the file recursively and print duplicates. Following feedback, the script was converted into a Jupyter notebook for a more interactive and user-friendly approach. The notebook now includes functionality to extract all URLs, count the total URLs and duplicates, and simply print this information instead of writing it to a file. Additionally, minor improvements such as ensuring directory existence for potential output operations were incorporated.
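The notebook's reporting cell is not reproduced in this thread. As a hedged sketch, the printing step could be written as below, taking a flat list of already extracted URLs. The name `report_duplicates` is hypothetical, and "number of duplicates" is interpreted here as the number of distinct URLs that occur more than once, which is only one possible reading of the request.

```python
from collections import Counter


def report_duplicates(urls):
    """Print the totals and each duplicated URL with its occurrence count."""
    counts = Counter(urls)
    duplicates = {url: n for url, n in counts.items() if n > 1}
    print(f"Total number of URLs: {len(urls)}")
    print(f"Number of duplicated URLs: {len(duplicates)}")
    for url, n in sorted(duplicates.items()):
        print(f"{n}x {url}")
    # Returning the two numbers makes the result easy to check in a notebook
    return len(urls), len(duplicates)
```

If the surplus entries are wanted instead (how many entries would have to be removed), `len(urls) - len(counts)` gives that number.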

While solving this task, the following errors occurred:

  • Error during {'action': 'delete', 'filename': 'scripts/find_duplicate_urls.py'}: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/repos/contents#get-repository-content", "status": "404"}
        Traceback (most recent call last):
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/git_bob/_ai_github_utilities.py", line 659, in solve_github_issue
          Config.git_utilities.delete_file_from_repository(repository, branch_name, filename)
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/git_bob/_github_utilities.py", line 787, in delete_file_from_repository
          file = get_file_in_repository(repository, branch_name, file_path)
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/git_bob/_github_utilities.py", line 487, in get_file_in_repository
          return repo.get_contents(file_path, ref=branch_name)
                 ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/github/Repository.py", line 2389, in get_contents
          headers, data = self._requester.requestJsonAndCheck(
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
              "GET",
              ^^^^^^
          ...<2 lines>...
              follow_302_redirect=True,
              ^^^^^^^^^^^^^^^^^^^^^^^^^
          )
          ^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/github/Requester.py", line 624, in requestJsonAndCheck
          return self.__check(
                 ~~~~~~~~~~~~^
              *self.requestJson(
              ^^^^^^^^^^^^^^^^^^
          ...<7 lines>...
              )
              ^
          )
          ^
        File "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/github/Requester.py", line 792, in __check
          raise self.createException(status, responseHeaders, data)
      github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/repos/contents#get-repository-content", "status": "404"}
      
