[v1.x] V1.x CD blocked by test_gluon_data.test_list_dataset error #19918
Comments
@Zha0q1 Not seen this error before. Looks like it's fixed in default flavor (
I think it's a flaky test. It also fails on cu101. In this run three days earlier, native also passed.
We saw this error before. The problem happens when you create a new data loader while the previous data loader is not yet fully destroyed (including the data it produced in shared memory). Workers in the new data loader inherit those shared memory ndarrays without incrementing the usage counter, which lives in the shared memory region itself. Once Python's garbage collector decides to destroy them, the workers decrement the usage counter, so it ends up decremented too many times. Two things can happen then: either the workers destroy the ndarray and the main process gets this error, or the main process destroys it and the workers get this error and crash, which results in a hang. I made a small workaround for this in our container by inserting waitall and forcing Python GC before the fork. I will make a PR tomorrow with this workaround.
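For readers who want to see the shape of that workaround, here is a minimal sketch, not the actual patch. It assumes the idea is simply "drain pending work, then force a garbage collection pass, then fork". The names `prepare_for_fork`, `wait_for_pending`, and `run_workers` are hypothetical and used only for illustration; with MXNet 1.x one would presumably pass `mx.nd.waitall` as the callable.

```python
import gc
import multiprocessing as mp


def prepare_for_fork(wait_for_pending=None):
    """Hypothetical helper: drain pending work, then force a GC pass.

    wait_for_pending is an optional zero-argument callable. With MXNet 1.x
    one would presumably pass mx.nd.waitall here (assumption, not shown).
    Returns the number of unreachable objects the collector found.
    """
    if wait_for_pending is not None:
        wait_for_pending()
    # Collect unreachable objects now, in the parent, so that stale handles
    # are released before any worker process can inherit them.
    return gc.collect()


def _double(x):
    # Stand-in for real worker logic.
    return x * 2


def run_workers(items, num_workers=2, wait_for_pending=None):
    """Hypothetical wrapper: clean up first, then fork the worker pool."""
    prepare_for_fork(wait_for_pending)
    ctx = mp.get_context("fork")  # the fork start method is POSIX only
    with ctx.Pool(num_workers) as pool:
        return pool.map(_double, items)
```

The point of doing the cleanup in the parent is ordering: anything still alive at fork time is copied into every worker, so objects that are already garbage should be released before the workers exist.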
That's great, thanks!
Should be fixed now, thanks @ptrendx!
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1530/pipeline/222
@leezu @mseth10 @josephevans Have you seen this before?