
org.apache.http.NoHttpResponseException: The target server failed to respond #405

Closed
Laurent-Hervaud opened this issue Jan 10, 2017 · 12 comments

@Laurent-Hervaud

I am getting a lot of FETCH_ERROR statuses (about ten percent of one million French URLs).
In debug mode I can see this error: org.apache.http.NoHttpResponseException: The target server failed to respond
Sometimes it works after several retries.
Here are some of the URLs:
http://www.serigraph-herault.fr/
http://www.courtage-mayenne.fr/
http://www.alur-diagnostics-sete.fr/
Is there something wrong with HttpProtocol.java and org.apache.http.impl.client.HttpClients?
I will pursue my investigation.

@jnioche
Contributor

jnioche commented Jan 10, 2017

Could it be that you are at the limits of your bandwidth? How many fetch threads are you using?

@Laurent-Hervaud
Author

I thought that at first, but I get the same error with just one URL in the seed list.

@jnioche
Contributor

jnioche commented Jan 10, 2017

The 3 sites you mentioned earlier all point to the same server (193.252.138.58). Could it be that you got blacklisted by them?

@Laurent-Hervaud
Author

I know they are all on the same server; it is the largest web hosting company for professionals in France.
I ran several tests, both locally and on AWS, to check for blacklisting. Everything works with a simple curl. I also tried with Nutch and it works, and I tried multiple user agents.
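
For example, a plain request along these lines succeeds every time (user agent value illustrative, not the exact one I used):

curl -v -A "some-user-agent" http://www.serigraph-herault.fr/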

@jnioche
Contributor

jnioche commented Jan 11, 2017

I can't reproduce the issue.

6493 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.serigraph-herault.fr/ with status 200 in msec 363
6521 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.alur-diagnostics-sete.fr/ with status 200 in msec 393
6530 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.courtage-mayenne.fr/ with status 200 in msec 402

@Laurent-Hervaud
Author

I found the culprit in HttpProtocol.java: enabling automatic retries makes the errors go away:
builder = HttpClients.custom().setUserAgent(userAgent)
        .setConnectionManager(CONNECTION_MANAGER)
        .setConnectionManagerShared(true)
        .disableRedirectHandling();
        // .disableAutomaticRetries(); // commented out so that httpclient retries

Here is the resulting log:
12932 [Thread-48-fetch-executor[14 14]] INFO o.a.h.i.e.RetryExec - I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://www.serigraph-herault.fr:80: The target server failed to respond
12933 [Thread-48-fetch-executor[14 14]] INFO o.a.h.i.e.RetryExec - Retrying request to {}->http://www.serigraph-herault.fr:80
13300 [Thread-48-fetch-executor[14 14]] INFO c.d.s.b.SimpleFetcherBolt - [Fetcher #14] Fetched http://www.serigraph-herault.fr with status 200 in 371 after waiting 0
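
For comparison, here is a minimal sketch of what bounding the retries explicitly could look like (setRetryHandler and DefaultHttpRequestRetryHandler are standard httpclient 4.x API; userAgent and CONNECTION_MANAGER are assumed to be in scope as in HttpProtocol.java):

builder = HttpClients.custom().setUserAgent(userAgent)
        .setConnectionManager(CONNECTION_MANAGER)
        .setConnectionManagerShared(true)
        // retry up to 3 times; false = do not retry requests that were already sent
        .setRetryHandler(new DefaultHttpRequestRetryHandler(3, false))
        .disableRedirectHandling();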

Why disable automatic retries and redirect handling?

@jnioche
Contributor

jnioche commented Jan 11, 2017

I would not call that a mistake. Retrying the URL does not explain why it failed in the first place; as you pointed out initially, it works after retrying.

Why disable automatic retries and redirect handling?

  1. Retries: because we want to control politeness and be efficient. There is no point retrying right away when the request is likely to fail again and we could be fetching from a different server in the meantime.
  2. Redirects: politeness again, and also because the target URL could already be known, and perhaps even fetched.

You can set the schedule for fetch errors to a low value so that the URL becomes eligible for re-fetching soon.
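
As an illustration, lowering it in the crawler configuration could look like this (key name as in crawler-default.yaml; the value is only an example, in minutes):

fetchInterval.fetch.error: 10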

It would be interesting to know why it fails on the first attempt.

@jnioche
Contributor

jnioche commented Jan 13, 2017

Closing for now. Please reopen if necessary.

@jnioche jnioche closed this as completed Jan 13, 2017
@jnioche
Contributor

jnioche commented Mar 7, 2017

Note to self: seeing the same problem with

The explanation can be found here:
http://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception

The issue does not happen when specifying http.skip.robots=true. My interpretation is that the server closes the connection prematurely after we fetch the robots.txt, so when we query the main URL straight away over that connection, we get this exception.

Setting a retry value at the protocol level is one possible solution, but as pointed out earlier such URLs get retried by StormCrawler later on anyway, with politeness respected.

A better approach is suggested in http://stackoverflow.com/questions/10570672/get-nohttpresponseexception-for-load-testing/10680629, i.e. setting a lower validate-before-reuse time:

PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
// re-validate pooled connections that have sat idle longer than this limit (in ms) before reuse
connectionManager.setValidateAfterInactivity(connectionValidateLimit);

but even if we set a low value for setValidateAfterInactivity(), the validation would not kick in unless we also applied the politeness delay between the call to robots.txt and the fetching of the URL, for which I had opened (and closed) #343.

As a quick test, I added a Thread.sleep call of a few seconds to the HttpRobotRulesParser, and the fetches were successful after that! I will reopen #343 but make the behavior configurable. Ideally, the FetcherBolt could, if configured to be polite after querying robots, pop the URL back into the queue and deal with another queue until the politeness delay has passed.
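
Roughly, the sleep I added amounted to something like this (placement inside HttpRobotRulesParser and the duration are illustrative; a diagnostic hack rather than a fix):

// pause after fetching robots.txt so that the next request to the same host
// does not reuse a keep-alive connection the server is about to close
try {
    Thread.sleep(5000);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}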

@abhishekransingh

Another way to retry, if you are using Spring, is the @Retryable annotation. Here is the code snippet:

import java.io.IOException;
import org.apache.http.NoHttpResponseException;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;

@Retryable(value = {NoHttpResponseException.class},
        maxAttemptsExpression = "#{${startup.job.max.try}}",
        backoff = @Backoff(delayExpression = "#{${startup.job.delay}}",
                multiplierExpression = "#{${startup.job.multiplier}}"))
public void callHttpEndpoint() throws IOException {
    // your code to call the HTTP REST endpoint here
}
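
The SpEL placeholders resolve against application properties, for example (illustrative values):

startup.job.max.try=3
startup.job.delay=1000
startup.job.multiplier=2.0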

@ade90036

@jnioche why does robots.txt have anything to do with this issue? Is the underlying httpclient trying too hard to be smart?

@jnioche
Contributor

jnioche commented Jul 15, 2020

Can't reproduce the problem; it was probably fixed by upgrading the version of httpclient.
