
org.apache.http.NoHttpResponseException: The target server failed to respond #405

Closed
Laurent-Hervaud opened this issue Jan 10, 2017 · 12 comments

@Laurent-Hervaud

I am getting a lot of FETCH_ERROR statuses (about ten percent of one million French URLs).
In debug mode I can see this error: org.apache.http.NoHttpResponseException: The target server failed to respond
Sometimes it works after several retries.
Here are some of the URLs:
http://www.serigraph-herault.fr/
http://www.courtage-mayenne.fr/
http://www.alur-diagnostics-sete.fr/
Is there something wrong with HttpProtocol.java and org.apache.http.impl.client.HttpClients?
I will pursue my investigation.

@jnioche
Contributor

jnioche commented Jan 10, 2017

Could it be that you are at the limits of your bandwidth? How many fetch threads are you using?

@Laurent-Hervaud
Author

I thought that at first, but I get the same error with just one URL in the seed list.

@jnioche
Contributor

jnioche commented Jan 10, 2017

The 3 sites you mentioned earlier all point to the same server (193.252.138.58). Could it be that you got blacklisted by them?

@Laurent-Hervaud
Author

I know they are all on the same server; it is the largest web hosting company for professionals in France.
I ran several tests, both locally and on AWS, to check for blacklisting. Everything works with a simple curl. I also tried with Nutch and it works, and I tried multiple user agents.
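
For example, a plain request along these lines succeeds every time (user agent value illustrative, not the exact one I used):

curl -v -A "some-user-agent" http://www.serigraph-herault.fr/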

@jnioche
Contributor

jnioche commented Jan 11, 2017

I can't reproduce the issue.

6493 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.serigraph-herault.fr/ with status 200 in msec 363
6521 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.alur-diagnostics-sete.fr/ with status 200 in msec 393
6530 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.courtage-mayenne.fr/ with status 200 in msec 402

@Laurent-Hervaud
Author

I found the culprit in HttpProtocol.java: enabling automatic retries makes the errors go away:
builder = HttpClients.custom().setUserAgent(userAgent)
        .setConnectionManager(CONNECTION_MANAGER)
        .setConnectionManagerShared(true)
        .disableRedirectHandling();
        // .disableAutomaticRetries(); // commented out so that httpclient retries

Here is the resulting log:
12932 [Thread-48-fetch-executor[14 14]] INFO o.a.h.i.e.RetryExec - I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://www.serigraph-herault.fr:80: The target server failed to respond
12933 [Thread-48-fetch-executor[14 14]] INFO o.a.h.i.e.RetryExec - Retrying request to {}->http://www.serigraph-herault.fr:80
13300 [Thread-48-fetch-executor[14 14]] INFO c.d.s.b.SimpleFetcherBolt - [Fetcher #14] Fetched http://www.serigraph-herault.fr with status 200 in 371 after waiting 0
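
For comparison, here is a minimal sketch of what bounding the retries explicitly could look like (setRetryHandler and DefaultHttpRequestRetryHandler are standard httpclient 4.x API; userAgent and CONNECTION_MANAGER are assumed to be in scope as in HttpProtocol.java):

builder = HttpClients.custom().setUserAgent(userAgent)
        .setConnectionManager(CONNECTION_MANAGER)
        .setConnectionManagerShared(true)
        // retry up to 3 times; false = do not retry requests that were already sent
        .setRetryHandler(new DefaultHttpRequestRetryHandler(3, false))
        .disableRedirectHandling();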

Why disable automatic retries and redirect handling?

@jnioche
Contributor

jnioche commented Jan 11, 2017

I would not call that a mistake. Retrying the URL does not explain why it failed in the first place; as you pointed out initially, it works after retrying.

Why disable automatic retries and redirect handling?

  1. Retries: because we want to control politeness and be efficient. There is no point retrying right away when the request is likely to fail again and we could be fetching from a different server in the meantime.
  2. Redirects: politeness again, and also because the target URL could already be known, and perhaps even fetched.

You can set the schedule for fetch errors to a low value so that the URL becomes eligible for re-fetching soon.
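
As an illustration, lowering it in the crawler configuration could look like this (key name as in crawler-default.yaml; the value is only an example, in minutes):

fetchInterval.fetch.error: 10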

It would be interesting to know why it fails on the first attempt.

@jnioche
Contributor

jnioche commented Jan 13, 2017

Closing for now. Please reopen if necessary.

@jnioche jnioche closed this as completed Jan 13, 2017
@jnioche
Contributor

jnioche commented Mar 7, 2017

Note to self: seeing the same problem with

The explanation can be found here:
http://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception

The issue does not happen when specifying http.skip.robots=true. My interpretation is that the server closes the connection prematurely after we fetch the robots.txt, so when we query the main URL straight away over that connection, we get this exception.

Setting a retry value at the protocol level is one possible solution, but as pointed out earlier such URLs get retried by StormCrawler later on anyway, with politeness respected.

A better approach is suggested in http://stackoverflow.com/questions/10570672/get-nohttpresponseexception-for-load-testing/10680629, i.e. setting a lower validate-before-reuse time:

PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
// re-validate pooled connections that have sat idle longer than this limit (in ms) before reuse
connectionManager.setValidateAfterInactivity(connectionValidateLimit);

but even if we set a low value for setValidateAfterInactivity(), the validation would not kick in unless we also applied the politeness delay between the call to robots.txt and the fetching of the URL, for which I had opened (and closed) #343.

As a quick test, I added a Thread.sleep call of a few seconds to the HttpRobotRulesParser, and the fetches were successful after that! I will reopen #343 but make the behavior configurable. Ideally, the FetcherBolt could, if configured to be polite after querying robots, pop the URL back into the queue and deal with another queue until the politeness delay has passed.
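
Roughly, the sleep I added amounted to something like this (placement inside HttpRobotRulesParser and the duration are illustrative; a diagnostic hack rather than a fix):

// pause after fetching robots.txt so that the next request to the same host
// does not reuse a keep-alive connection the server is about to close
try {
    Thread.sleep(5000);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}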

@abhishekransingh

Another way to retry, if you are using Spring, is the @Retryable annotation. Here is the code snippet:

import java.io.IOException;
import org.apache.http.NoHttpResponseException;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;

@Retryable(value = {NoHttpResponseException.class},
        maxAttemptsExpression = "#{${startup.job.max.try}}",
        backoff = @Backoff(delayExpression = "#{${startup.job.delay}}",
                multiplierExpression = "#{${startup.job.multiplier}}"))
public void callHttpEndpoint() throws IOException {
    // your code to call the HTTP REST endpoint here
}
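
The SpEL placeholders resolve against application properties, for example (illustrative values):

startup.job.max.try=3
startup.job.delay=1000
startup.job.multiplier=2.0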

@ade90036

@jnioche why does robots.txt have anything to do with this issue? Is the underlying httpclient trying too hard to be smart?

@jnioche
Contributor

jnioche commented Jul 15, 2020

Can't reproduce the problem; it was probably fixed by upgrading the version of httpclient.
