
Dataset.pushData() - items are not written to dataset #1811

Closed
1 task
AndreyBykov opened this issue Mar 5, 2023 · 3 comments · Fixed by #1865
Assignees
Labels
bug Something isn't working.

Comments

@AndreyBykov
Contributor

AndreyBykov commented Mar 5, 2023

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

Was reported on discord by @yellott:

Does anyone have an issue with Dataset.pushData when "maxConcurrency" is set to anything greater than 1? A few items are not written to the dataset. I can't find any guidance on how best to handle this. The only way I see is to write the data after the crawler has finished running.

There's a bit more discussion there, but there's definitely nothing like a missing await, at least nothing obvious to me. I tried the repro locally and it does happen randomly: never with maxConcurrency: 1, and roughly every second run without it. I verified that the items are produced - e.g. if you also push the items to an in-memory array and save it at the end of the run, everything is there. Another way to verify that the items are received: the dataset can be missing some items, but if I push the final array (the one kept in memory) to the Key-Value store, they are all there.

Reproduction here: https://github.com/yellott/crawlee-odd-behaviour-mre

Code sample

No response

Package version

3.2.2

Node.js version

18.14.2

Operating system

macOS

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@AndreyBykov added the bug label on Mar 5, 2023
@B4nan
Member

B4nan commented Mar 16, 2023

cc @vladfrangu, this sounds like another issue with waiting for the writes. The repro fails for me consistently, and when I put a 1s sleep after the run method resolves, it seems to help.

@B4nan
Member

B4nan commented Mar 16, 2023

It did fail once with the 1s sleep too. IIRC we are waiting for the storages to complete their writes in Actor.exit, but we need the same for crawler.run.

@yellott

yellott commented Mar 19, 2023

@B4nan Hey, I have just reported another issue of a similar nature, here. Are those related, in your opinion?
