[Programming question]: #238

Open
haesleinhuepf opened this issue Feb 19, 2025 · 1 comment

Comments

@haesleinhuepf
Owner

I need a Jupyter notebook that does the following to determine the copyright status of materials used in a slide deck saved as a PDF.

  • Opens a PDF file, extracts the text page by page, and stores the URLs found on each slide in a list per slide.
  • Goes through the slides and their corresponding URLs, loads each website, and extracts the text from the web page.
  • For each web page, the text is sent to a prompt(message: str) -> str function. The goal of using this LLM-based prompt function is to identify the license of the page.
  • Returns a list of licenses per page.
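
The four steps above could be sketched roughly as follows. This is a minimal sketch, not the notebook git-bob produced: it assumes the third-party `pypdf` package for reading the PDF, and all function names other than `prompt` (`read_pdf_pages`, `extract_urls`, `html_to_text`, `fetch_html`, `licenses_per_page`) are hypothetical. It only finds URLs that appear as visible text on a slide, so link annotations and URLs broken across lines would need extra handling.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def read_pdf_pages(path):
    """Return the text of each PDF page (assumption: pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily; third-party dependency
    return [page.extract_text() or "" for page in PdfReader(path).pages]


def extract_urls(text):
    """Find http(s) URLs in a page's text, dropping trailing punctuation."""
    return [url.rstrip(".,;:") for url in URL_PATTERN.findall(text)]


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to its text content."""
    collector = _TextCollector()
    collector.feed(html)
    return " ".join(collector.parts)


def fetch_html(url):
    """Download a web page and decode it leniently as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def licenses_per_page(page_texts, prompt, fetch=fetch_html, max_chars=4000):
    """Return one list of license answers per PDF page (one answer per URL)."""
    question = ("Which license applies to the content of this web page? "
                "Answer with the license name only, or 'unknown'.\n\n")
    results = []
    for text in page_texts:
        page_licenses = []
        for url in extract_urls(text):
            try:
                # Truncate so the message stays within the model's context.
                site_text = html_to_text(fetch(url))[:max_chars]
                answer = prompt(question + site_text)
            except Exception as error:  # keep going if one site fails
                answer = f"error: {error}"
            page_licenses.append(answer.strip())
        results.append(page_licenses)
    return results
```

With a real `prompt` function in scope, a call might look like `licenses_per_page(read_pdf_pages("slides.pdf"), prompt)`, where `slides.pdf` is a placeholder file name. Passing `fetch` as a parameter makes it possible to try the pipeline offline with canned HTML.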

git-bob try to do this

Contributor

This message was generated by git-bob (version: 0.19.0, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

I created a Jupyter notebook that analyzes PDF files to determine copyright information of materials used in slide decks. The notebook uses various Python libraries to extract text from PDFs and associated URLs, and then processes this information through an LLM-based prompt function to identify licenses for each page. I also generated a sample PDF file for testing and created a CSV file to store the license analysis results. Additionally, I updated the notebook's dependencies to include the fpdf2 library for improved PDF handling.

  • sample.pdf: Created a sample PDF file for testing the license analysis functionality.
  • license_analysis.csv: Generated a CSV file containing the license analysis results for each page.
  • pdf_license_analyzer.ipynb: Created a Jupyter notebook that extracts and analyzes license information from PDF files and their associated URLs.
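
For illustration, per-page license results could be turned into CSV along these lines. The column names (`page`, `license`) and the function name `licenses_to_csv` are assumptions; the actual layout of license_analysis.csv is not shown in this thread.

```python
import csv
import io


def licenses_to_csv(licenses):
    """Serialize per-page license lists as CSV text, one row per license.

    `licenses` is a list with one entry per PDF page; each entry is a list
    of license names found for that page. Pages without any result get a
    single 'none found' row so that every page appears in the output.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["page", "license"])  # assumed column names
    for page_number, page_licenses in enumerate(licenses, start=1):
        for license_name in page_licenses or ["none found"]:
            writer.writerow([page_number, license_name])
    return buffer.getvalue()
```

The returned string can then be written to disk, for example with `pathlib.Path("license_analysis.csv").write_text(csv_text, newline="")` on Python 3.10 or later.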
