Add method to get URL Status (returns an URLItem) #92

Merged
merged 11 commits into from
Sep 3, 2024

@klockla klockla commented Aug 2, 2024

Add a new API method to retrieve information about a URL

 /** Get status of a particular URL 
     This does not take into account URL scheduling.
     Used to check current status of an URL within the frontier
 **/
 rpc GetURLStatus(URLStatusRequest) returns (URLItem) {}

Implemented only for MemoryFrontier and RocksDb
(may partially fulfill #57)

Unfortunately the internal storage doesn't make a distinction between Discovered and Known URLs which have to be refetched (or I have missed the point)

So all scheduled items will be returned as a KnownURLItem (with a refetch date equal to 0 for completed items)
If the URL is not in URLFrontier, the method will return io.grpc.Status.NOT_FOUND.asRuntimeException()
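
The lookup behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the URLFrontier implementation: the class `StatusLookupSketch`, the `KnownItem` type, and the `refetchDate` field are hypothetical names chosen for illustration, and `NoSuchElementException` merely stands in for the `io.grpc.Status.NOT_FOUND.asRuntimeException()` the real method returns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a frontier's storage; NOT the URLFrontier code. */
final class StatusLookupSketch {

    /** Simplified view of a known URL: the URL plus its refetch date. */
    static final class KnownItem {
        final String url;
        final long refetchDate; // epoch seconds; 0 means the URL is completed

        KnownItem(String url, long refetchDate) {
            this.url = url;
            this.refetchDate = refetchDate;
        }
    }

    // A single map for every URL: as in the PR description, the storage keeps
    // no flag telling "discovered" apart from "known and due for refetch".
    private final Map<String, Long> entries = new HashMap<>();

    /** Records a URL that is scheduled to be fetched at the given date. */
    void schedule(String url, long nextFetchDate) {
        entries.put(url, nextFetchDate);
    }

    /** Records a URL that will not be fetched again. */
    void complete(String url) {
        entries.put(url, 0L);
    }

    /**
     * Returns the stored state of a URL, always as a KnownItem.
     * Assumption: NoSuchElementException replaces the gRPC NOT_FOUND status.
     */
    KnownItem getURLStatus(String url) {
        Long date = entries.get(url);
        if (date == null) {
            throw new NoSuchElementException("URL not in frontier: " + url);
        }
        return new KnownItem(url, date);
    }

    public static void main(String[] args) {
        StatusLookupSketch frontier = new StatusLookupSketch();
        frontier.schedule("https://example.com/a", 1725000000L);
        frontier.complete("https://example.com/b");

        System.out.println(frontier.getURLStatus("https://example.com/a").refetchDate);
        System.out.println(frontier.getURLStatus("https://example.com/b").refetchDate);
        try {
            frontier.getURLStatus("https://example.com/missing");
        } catch (NoSuchElementException e) {
            System.out.println("not found");
        }
    }
}
```

Because both kinds of entry share one map, a caller of this sketch can only distinguish "completed" (date 0) from "scheduled" (non-zero date), which mirrors the limitation noted above.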

Signed-off-by: Laurent Klock [email protected]

@klockla klockla marked this pull request as ready for review August 2, 2024 14:58
@klockla klockla marked this pull request as draft August 2, 2024 15:01
@klockla klockla marked this pull request as ready for review August 2, 2024 15:07
@klockla klockla marked this pull request as draft August 6, 2024 12:07
Implemented only for MemoryFrontier and RocksDb

Unfortunately the internal storage doesn't make a distinction
between Discovered and Known URLs which have to be refetched

So the method will always return a KnownURLItem or a Status.NOT_FOUND runtime exception

Signed-off-by: Laurent Klock <[email protected]>
@klockla klockla marked this pull request as ready for review August 8, 2024 14:16

@jnioche jnioche left a comment

A few minor issues and questions. Great to have additional tests!

jnioche commented Aug 29, 2024

Thanks @klockla
Looks good at this stage but I think it needs an addition to the client so that we can query the new endpoint and display the status of a URL.

@jnioche jnioche left a comment

see comment in the conversation re-client side

klockla commented Sep 2, 2024

see comment in the conversation re-client side

Added the method in client.

API/urlfrontier.proto (resolved)
private String crawl;

@Option(
names = {"-k", "--key"},
Collaborator

see comment about key being generated by default on the server side. Should be optional?

Collaborator Author

ok

jnioche commented Sep 2, 2024

thanks a lot @klockla - I gave it a try and it seems to work fine
let me know what you think of my comments and suggestions above

Added missing license header

Signed-off-by: Laurent Klock <[email protected]>

jnioche commented Sep 3, 2024

Tested, works great! Thanks @klockla, this is a great contribution to the project

@jnioche jnioche merged commit 08f09c3 into crawler-commons:master Sep 3, 2024
2 checks passed
@jnioche jnioche added this to the 2.3 milestone Sep 3, 2024
@jnioche jnioche added the enhancement, API, and Client labels Sep 3, 2024
@jnioche jnioche mentioned this pull request Sep 3, 2024
@klockla klockla deleted the geturlstatus branch September 17, 2024 12:42
@jnioche jnioche modified the milestones: 2.3, 2.4.0 Sep 20, 2024