org.apache.http.NoHttpResponseException: The target server failed to respond #405
Comments
Could it be that you are at the limits of your bandwidth? How many fetch threads are you using?
I thought that at first, but I have the same error with just one URL in the seed list.
The 3 sites you mentioned earlier all point to the same server (193.252.138.58). Could it be that you got blacklisted by them?
I know, they are on the same server. It's the leading web hosting company for professionals in France.
I can't reproduce the issue.
I found what was going wrong in HttpProtocol.java by enabling AutomaticRetries. Here is the resulting log: Why disable AutomaticRetries and RedirectHandling?
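(For illustration only, not the actual code from HttpProtocol.java: a minimal sketch, assuming Apache HttpClient 4.x, of how automatic retries and redirect handling can be switched off on the client builder, and how a retry handler can be set instead. The retry count and class name are made up for the example.)

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;

public class RetryConfigSketch {
    public static void main(String[] args) throws Exception {
        HttpClientBuilder builder = HttpClients.custom();

        // A crawler may deliberately switch these off so that retries and
        // redirects are handled by the crawl logic rather than by httpclient:
        // builder.disableAutomaticRetries();
        // builder.disableRedirectHandling();

        // Re-enabling retries at the client level: retry up to 3 times,
        // including requests that were already sent, which covers the case
        // where a pooled connection was silently closed by the server.
        builder.setRetryHandler(new DefaultHttpRequestRetryHandler(3, true));

        try (CloseableHttpClient client = builder.build()) {
            System.out.println("client configured: " + client);
        }
    }
}
```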
I would not call that a mistake. Retrying the URL does not explain why it failed in the first place; as you pointed out initially, it worked after retrying.
You can set the schedule for fetch_errors to a low value so that the URL becomes eligible for re-fetching soon. It would be interesting to know why it fails on the first attempt.
Closing for now. Please reopen if necessary.
Note to self: seeing the same problem with . The explanation can be found here . The issue does not happen when specifying . Setting a retry value at the protocol level is one possible solution, but as pointed out earlier such URLs get retried by StormCrawler later on anyway, with the politeness delay. A better approach is suggested in http://stackoverflow.com/questions/10570672/get-nohttpresponseexception-for-load-testing/10680629, i.e. set a lower validate-before-reuse time;
but even if we set a low value for setValidateAfterInactivity(), this would not get applied unless we applied the politeness setting between the call to robots.txt and the fetching of the URL, for which I had opened (and closed) #343. As a quick test, I added a Thread.sleep call of a few seconds to the HttpRobotRulesParser and the fetches were successful after that! I will reopen #343 but make the behaviour configurable. Ideally, the FetcherBolt could, if configured to be polite after querying robots, pop the URL back into the queue and deal with another queue until the politeness delay has passed.
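(A minimal sketch, assuming Apache HttpClient 4.4+, of the lower validate-before-reuse setting mentioned above; the 1000 ms value and class name are only illustrative.)

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class ValidateBeforeReuseSketch {
    public static void main(String[] args) throws Exception {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Re-check pooled connections that have been idle for more than 1000 ms
        // before reusing them, so a connection closed by the server between the
        // robots.txt request and the page fetch is detected and replaced instead
        // of failing with NoHttpResponseException.
        cm.setValidateAfterInactivity(1000);

        try (CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build()) {
            System.out.println("client ready: " + client);
        }
    }
}
```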
Another way to retry: if you're using Spring, you can use @Retryable. Here is the code snippet:
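(The snippet itself was not preserved in this thread. Below is a hedged sketch of what such a Spring Retry based approach might look like; the class name, method and retry parameters are assumptions, not the original code.)

```java
import java.io.IOException;

import org.apache.http.NoHttpResponseException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

@Service
public class RetryableFetcher {

    // Retry up to 3 times, waiting 2 seconds between attempts, whenever the
    // server drops the connection. Requires spring-retry on the classpath and
    // @EnableRetry on a configuration class.
    @Retryable(value = NoHttpResponseException.class, maxAttempts = 3,
               backoff = @Backoff(delay = 2000))
    public String fetch(String url) throws IOException {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity());
        }
    }
}
```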
@jnioche why does robots.txt have anything to do with this issue? Is the underlying httpclient trying too hard to be smart?
Can't reproduce the problem; it was probably fixed by upgrading the version of httpclient.
I have a lot of FETCH_ERROR results (about ten percent of one million French URLs).
In the debug output I can see this error: org.apache.http.NoHttpResponseException: The target server failed to respond
Sometimes it works after many retries.
Here are some of the URLs:
http://www.serigraph-herault.fr/
http://www.courtage-mayenne.fr/
http://www.alur-diagnostics-sete.fr/
Is there something wrong with HttpProtocol.java and org.apache.http.impl.client.HttpClients?
I am continuing my investigation.
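(Not part of the original report: a small sketch one could use to try to reproduce the behaviour discussed in the comments above, issuing a robots.txt request and then the page fetch over the same pooled connection with an idle pause in between. The URL and timings are only illustrative.)

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class NoHttpResponseRepro {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "http://www.serigraph-herault.fr";
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // First request (robots.txt) opens a pooled connection to the host.
            try (CloseableHttpResponse r = client.execute(new HttpGet(host + "/robots.txt"))) {
                EntityUtils.consume(r.getEntity());
                System.out.println("robots.txt -> " + r.getStatusLine());
            }
            // If the server closes the idle connection during this pause, reusing
            // the stale connection for the next request may fail with
            // org.apache.http.NoHttpResponseException.
            Thread.sleep(10_000);
            try (CloseableHttpResponse r = client.execute(new HttpGet(host + "/"))) {
                System.out.println("/ -> " + r.getStatusLine());
            }
        }
    }
}
```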