Several new ZIMs appearing in the zimit directory on download.kiwix.org are too small #168

Closed
Jaifroid opened this issue Jan 21, 2023 · 12 comments

@Jaifroid

Here is a list of suspiciously small ZIMs in descending order of date (ignore www.ready.gov, which seems OK, and liberius). Most small ones I've tested have ended up being just one-page scrapes. Note that the previous bibnum_fr_all was 553MB, not 352KB!

[Image: screenshot of the list of suspiciously small ZIMs]
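A size check like the one done by eye here could be sketched roughly as follows. This is only an illustration, not part of zimit or the Zimfarm: the `ZimSize` type, the `suspiciously_small` helper, the 10% threshold and the second sample entry are all assumptions; only the bibnum_fr_all sizes (553MB before, 352KB now) come from this report.

```python
from dataclasses import dataclass


@dataclass
class ZimSize:
    """Sizes of the previous and current builds of one ZIM (hypothetical type)."""
    name: str
    previous_bytes: int
    current_bytes: int


def suspiciously_small(entries, min_ratio=0.1):
    """Return names whose new size is below min_ratio of the previous size.

    min_ratio is an arbitrary assumption; entries without a previous
    build (previous_bytes == 0) are skipped because there is no baseline.
    """
    flagged = []
    for entry in entries:
        if entry.previous_bytes > 0 and entry.current_bytes < entry.previous_bytes * min_ratio:
            flagged.append(entry.name)
    return flagged


entries = [
    # Sizes taken from this report: 553MB previously, 352KB now.
    ZimSize("bibnum_fr_all", 553 * 1000**2, 352 * 1000),
    # Made-up entry for contrast, not a real ZIM.
    ZimSize("hypothetical_stable_site", 100 * 1000**2, 98 * 1000**2),
]
print(suspiciously_small(entries))
```

A ratio against the previous build only helps for recipes that already produced a good ZIM; brand-new recipes would still need manual validation.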

@Jaifroid
Author

Jaifroid commented Jan 21, 2023

NB: raspberrypi_docs and coopmaths are OK.

@kelson42
Contributor

kelson42 commented Jan 21, 2023

@RavanJAltaie @Popolechien We should:

  • Get the list of these ZIM files and remove them
  • Put all the recipes (back?) to dev
  • Understand why they are there (were they properly validated first?)
  • Then start the technical analysis of why something went wrong.

@Jaifroid Thank you for the bug report!

@RavanJAltaie

I've checked the files below in the library and they were working properly:

1. https://master.download.kiwix.org/zim/zimit/coopmaths_2023-01.zim
   (The file in the library is working properly, all links and exercises are working, all links to outside tests are working properly)
2. https://master.download.kiwix.org/zim/zimit/liberius.net_fr_all_2023-01.zim
   (The file in the library is working properly, all links are working, all links to outside tests are working properly)
3. https://master.download.kiwix.org/zim/zimit/www.ready.gov_es_2023-01.zim
   (The file in the library is working properly, all links are working, all links to outside tests are working properly)

@RavanJAltaie

I investigated https://farm.openzim.org/recipes/raspberrypi_docs
I think the small size is explained by the recipe only covering the documentation section: if you click on documentation, everything under that section works.

@Jaifroid
Author

Thanks @RavanJAltaie for testing those properly. I had only managed to test a few, and the others were guesses based on anomalous size. Good to have it confirmed. Strange that some recipes that used to work fine are now broken.

@rgaudin
Member

rgaudin commented Jan 23, 2023

Deleted the ZIMs

@rgaudin
Member

rgaudin commented Jan 23, 2023

So from the discussion I understand that the only failing ZIM that used to work is bibnum_fr_all_2023-01.zim.
Indeed, the task completed early (but without failing, and that's the issue):

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:44.925 (running for 13.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.2% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
    at CDPSession.send (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:281:35)
    at ExecutionContext._ExecutionContext_evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/ExecutionContext.js:206:46)
    at ExecutionContext.evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/ExecutionContext.js:103:113)
    at IsolatedWorld.evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/IsolatedWorld.js:171:24)
Starting repair
Page Load Failed: https://journals.openedition.org/bibnum/, Reason: Error: Page crashed!

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:45.424 (running for 14.0 seconds)
== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)
== Remaining: 0.0 ms (@ 0.07 pages/second)
== Sys. load: 2.6% CPU / 3.1% memory
== Workers:   1
   #0 IDLE

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:45.439 (running for 14.0 seconds)
== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)
== Remaining: 0.0 ms (@ 0.07 pages/second)
== Sys. load: 2.6% CPU / 3.1% memory
== Workers:   1
   #0 IDLE
Waiting to ensure pending data is written to WARCs...
done

This would be a crawler bug. Reporting upstream.
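Until the crawler itself reports such runs as failed, a caller could inspect the last progress line of the log. The following is a minimal sketch that assumes the `== Progress:` line format seen in the log above; `last_progress` and `crawl_looks_failed` are hypothetical helpers, not functions of zimit or of the crawler.

```python
import re

# Matches lines such as "== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)".
# The pattern is not anchored, so a leading terminal escape sequence on the
# line does not prevent a match.
PROGRESS_RE = re.compile(
    r"== Progress:\s+(\d+)\s*/\s*(\d+)\s+\([\d.]+%\),\s+errors:\s+(\d+)"
)


def last_progress(log_text):
    """Return (done, total, errors) from the last progress line, or None."""
    matches = PROGRESS_RE.findall(log_text)
    if not matches:
        return None
    done, total, errors = (int(value) for value in matches[-1])
    return done, total, errors


def crawl_looks_failed(log_text):
    """Heuristic: the run is suspect if no page completed without error."""
    progress = last_progress(log_text)
    if progress is None:
        return True  # assumption: a log without progress lines is suspect too
    done, total, errors = progress
    return done == 0 or errors >= done


sample = """\
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
Page Load Failed: https://journals.openedition.org/bibnum/, Reason: Error: Page crashed!
== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)
"""
print(last_progress(sample))
print(crawl_looks_failed(sample))
```

Parsing console output is fragile, so this would only be a stopgap; a non-zero exit code from the crawler remains the proper fix.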

@rgaudin
Member

rgaudin commented Jan 23, 2023

@Jaifroid
Author

It would be interesting to know whether early completion is also the cause for the other ZIMs that only seemed to scrape a single page or a very small part of the site. I.e., is it the same issue, or a different one?

@rgaudin
Member

rgaudin commented Jan 23, 2023

Neither footballdatabase nor blockygames had errors.

@Jaifroid
Author

OK, so maybe we could put it down to some problematic recipes for new, untested sites, and a fluke early completion for a known site? The January ready.gov (Spanish) seems to have worked fine, plus coopmaths and raspberry pi. Feel free to close the issue and we can re-open if other known good sites fail.
