feat!: Increase Crawler standardization regarding Pipelines #4122

danielbichuetti · 2023-02-09T16:09:48Z

Related Issues

Proposed Changes:

The Crawler implementation was changed to adhere to Pipeline flows and to improve support for Agents. Its main function is now to extract Documents. It can still save files, and it can optionally record each file's path in the Document meta.

+Output Documents primarily
+Optional file save
+Optional add file path to Document meta
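
To make the three behaviors above concrete, here is a minimal, self-contained sketch. It is not the Crawler's actual code and it does no crawling: `Document`, `pages_to_documents`, `output_dir`, and `file_path_meta_field` are hypothetical names chosen for illustration, and the pages are assumed to be already fetched. It only shows the intended flow: Documents are always returned, files are written only on request, and the file path lands in meta only when a field name is given.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class Document:
    """Stand-in for a pipeline Document: text content plus free-form metadata."""

    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def pages_to_documents(
    pages: Dict[str, str],
    output_dir: Optional[Path] = None,
    file_path_meta_field: Optional[str] = None,
) -> List[Document]:
    """Turn already-fetched pages (url -> text) into Documents.

    Files are written only when output_dir is given. The file path is
    recorded in meta only when file_path_meta_field is also given.
    """
    documents = []
    for index, (url, text) in enumerate(pages.items()):
        doc = Document(content=text, meta={"url": url})
        if output_dir is not None:
            output_dir.mkdir(parents=True, exist_ok=True)
            file_path = output_dir / f"page_{index}.json"
            payload = {"content": text, "meta": {"url": url}}
            file_path.write_text(json.dumps(payload), encoding="utf-8")
            if file_path_meta_field is not None:
                doc.meta[file_path_meta_field] = str(file_path)
        documents.append(doc)
    return documents


pages = {"https://example.com": "Example page text"}

# Default: Documents only, nothing is written to disk.
in_memory = pages_to_documents(pages)

# Opt in to saving files and to tracking the file path in meta.
with tempfile.TemporaryDirectory() as tmp:
    saved = pages_to_documents(
        pages, output_dir=Path(tmp), file_path_meta_field="file_path"
    )
    saved_file_exists = Path(saved[0].meta["file_path"]).exists()
```

Making both the file save and the meta field opt-in keeps the default output identical to what other Nodes produce, which is what lets the node sit in a Pipeline without a disk round trip.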

How did you test it?

The existing tests were run under Python 3.7 and Python 3.10 environments.
One extra test was added.

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue


danielbichuetti · 2023-02-09T19:44:22Z

@vblagoje This is about that Issue.

The Crawler will only save files if the user sets a parameter. Like most other Nodes, it will generate Documents.

vblagoje · 2023-02-10T09:09:29Z

@danielbichuetti Looking good, much better than what we had before. @masci your turn.

danielbichuetti · 2023-02-14T22:58:41Z

It appears that CI is failing to add type:documentation to this PR.

anakin87 · 2023-02-16T14:17:53Z

> It appears that CI is failing to add type:documentation to this PR.

These failures will be solved by #4145 and #4146.

CLAassistant · 2023-02-16T15:04:11Z

All committers have signed the CLA.

silvanocerza · 2023-02-16T15:05:08Z

We merged #4146, so I took the liberty of updating the PR with main to fix the failing job. @danielbichuetti

danielbichuetti · 2023-02-16T19:00:49Z

Hi @masci

Any updates to be made?

danielbichuetti · 2023-02-17T14:50:03Z

No worries. Just pinged here in case it was lost in the middle of so many PRs.

masci

Overall LGTM, just questions and nits.

haystack/nodes/connector/crawler.py

test/nodes/test_connector.py

haystack/nodes/connector/crawler.py

danielbichuetti · 2023-02-18T13:04:05Z

@masci When I requested a new review, GH automatically removed @agnieszka-m from the PR.

masci

I left some docs comments while Agnieszka is out, everything else is good to go and I'll merge it asap.

haystack/nodes/connector/crawler.py


masci

LGTM!

masci · 2023-02-22T16:34:06Z

Oh my it's finally green 🎉 merging before the CI changes their mind

danielbichuetti added 2 commits February 9, 2023 15:36

feat!(Crawler): Integrate Crawler in the Pipeline.

a01b666

+Output Documents +Optional file saving +Optional Document meta about file path

refactor: add Optional decl.

cabe91d

danielbichuetti requested a review from a team as a code owner February 9, 2023 16:09

danielbichuetti requested review from masci and removed request for a team February 9, 2023 16:09

github-actions bot added topic:crawler topic:tests labels Feb 9, 2023

Merge branch 'deepset-ai:main' into crawler_output

fe3f5fe

chore: dummy commit

5e17897

vblagoje mentioned this pull request Feb 10, 2023

feat: add parallel multiprocessing option to Crawler #4126

Closed

5 tasks

danielbichuetti added 4 commits February 12, 2023 00:23

Merge branch 'deepset-ai:main' into crawler_output

3fcb029

Merge branch 'deepset-ai:main' into crawler_output

772b82f

Merge branch 'main' into crawler_output

080b8c7

chore: dummy commit

9648b12

vblagoje added and removed the type:documentation label Feb 15, 2023

danielbichuetti added 2 commits February 16, 2023 00:18

Merge branch 'deepset-ai:main' into crawler_output

490247c

Merge branch 'deepset-ai:main' into crawler_output

1bbd3ff

Merge branch 'main' into crawler_output

f7ff40d

danielbichuetti added 2 commits February 16, 2023 12:44

Merge branch 'main' into crawler_output

50378e8

Merge branch 'main' into crawler_output

3b124f2

danielbichuetti mentioned this pull request Feb 17, 2023

Crawler .crawl() does not have the return_documents option #4188

Closed

masci requested a review from agnieszka-m February 18, 2023 09:45

masci suggested changes Feb 18, 2023


masci added the breaking change label Feb 18, 2023

danielbichuetti added 3 commits February 18, 2023 07:51

Merge branch 'deepset-ai:main' into crawler_output

899b40a

refactor: improve overwrite flow

2fae451

refactor: change custom file path meta logic + add test

09c3fc1

danielbichuetti requested review from masci and removed request for agnieszka-m February 18, 2023 13:03

masci reviewed Feb 20, 2023


danielbichuetti and others added 7 commits February 20, 2023 14:21

Update haystack/nodes/connector/crawler.py

d7f8e16

Co-authored-by: Massimiliano Pippi <[email protected]>

Update haystack/nodes/connector/crawler.py

8b56bdc

Co-authored-by: Massimiliano Pippi <[email protected]>

Update haystack/nodes/connector/crawler.py

8759e56

Co-authored-by: Massimiliano Pippi <[email protected]>

Update haystack/nodes/connector/crawler.py

2402d73

Co-authored-by: Massimiliano Pippi <[email protected]>

Update haystack/nodes/connector/crawler.py

74f2864

Co-authored-by: Massimiliano Pippi <[email protected]>

Merge branch 'main' into crawler_output

1e0bf59

Merge branch 'main' into crawler_output

c37142b

danielbichuetti requested a review from masci February 21, 2023 02:53

masci approved these changes Feb 21, 2023


masci changed the title ~~feat!(Crawler): Increase Crawler standardization regarding Pipelines~~ feat!: Increase Crawler standardization regarding Pipelines Feb 21, 2023

danielbichuetti added 5 commits February 21, 2023 07:21

Merge branch 'main' into crawler_output

bce02fa

Merge branch 'main' into crawler_output

c5afca3

Merge branch 'main' into crawler_output

5ce6f22

Merge branch 'deepset-ai:main' into crawler_output

376982c

Merge branch 'main' into crawler_output

f161972

masci linked an issue Feb 22, 2023 that may be closed by this pull request

Crawler should return Documents for given URLs without first saving docs to disk #4081

Closed

masci merged commit e0b0fe1 into deepset-ai:main Feb 22, 2023

danielbichuetti deleted the crawler_output branch February 22, 2023 16:51
