Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: Increase Crawler standardization regarding Pipelines #4122

Merged
merged 28 commits into from
Feb 22, 2023

Conversation

danielbichuetti
Copy link
Contributor

@danielbichuetti danielbichuetti commented Feb 9, 2023

Related Issues

Proposed Changes:

Crawler implementation was changed to adhere to Pipeline flows and improved support for Agents. Now, its main function is to extract Documents. It can save to files and it, optionally, allows keeping track at the Document (meta).

+Output Documents primarily
+Optional file save
+Optional add file path to Document meta

How did you test it?

  • Current tests have been run under Python 3.7 and Python 3.10 environment.
  • One extra test

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added tests that demonstrate the correct behavior of the change
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

+Output Documents
+Optional file saving
+Optional Document meta about file path
@danielbichuetti danielbichuetti requested a review from a team as a code owner February 9, 2023 16:09
@danielbichuetti danielbichuetti requested review from masci and removed request for a team February 9, 2023 16:09
@danielbichuetti
Copy link
Contributor Author

danielbichuetti commented Feb 9, 2023

@vblagoje This is about that Issue.

Crawler will only save files if the user sets a parameter. It will, as most other Nodes, generate Documents.

@vblagoje
Copy link
Member

@danielbichuetti Looking good, much better than what we had before. @masci your turn.

@danielbichuetti
Copy link
Contributor Author

It appears that CI is failing to add type:documentation to this PR.

@vblagoje vblagoje added type:documentation Improvements on the docs and removed type:documentation Improvements on the docs labels Feb 15, 2023
@anakin87
Copy link
Member

It appears that CI is failing to add type:documentation to this PR.

These failures will be solved by #4145 and #4146.

@CLAassistant
Copy link

CLAassistant commented Feb 16, 2023

CLA assistant check
All committers have signed the CLA.

@silvanocerza
Copy link
Contributor

We merged #4146 so I took the freedom to update the PR with main to fix the failing job. @danielbichuetti

@danielbichuetti
Copy link
Contributor Author

danielbichuetti commented Feb 16, 2023

Hi @masci

Any updates to be made?

@danielbichuetti
Copy link
Contributor Author

danielbichuetti commented Feb 17, 2023

No worries. Just pinged here in case it was lost in the middle of so many PRs.

@masci masci requested a review from agnieszka-m February 18, 2023 09:45
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, just questions and nits.

haystack/nodes/connector/crawler.py Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
test/nodes/test_connector.py Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
@danielbichuetti danielbichuetti requested review from masci and removed request for agnieszka-m February 18, 2023 13:03
@danielbichuetti
Copy link
Contributor Author

@masci When I checked for a new review, GH removed @agnieszka-m from the PR automaticaly.

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some docs comments while Agnieszka is out, everything else is good to go and I'll merge it asap.

haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@masci masci changed the title feat!(Crawler): Increase Crawler standardization regarding Pipelines feat!: Increase Crawler standardization regarding Pipelines Feb 21, 2023
@masci
Copy link
Contributor

masci commented Feb 22, 2023

Oh my it's finally green 🎉 merging before the CI changes their mind

@masci masci merged commit e0b0fe1 into deepset-ai:main Feb 22, 2023
@danielbichuetti danielbichuetti deleted the crawler_output branch February 22, 2023 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crawler should return Documents for given URLs without first saving docs to disk
6 participants