Unable to use published datasets in a different client #2336
Does your code for publishing the dataset look like this:
Because |
Sorry for not being precise enough.

```python
names = dd.read_csv('/shared-store-path/names.csv')
names = names.set_index("ID")
names = client1.persist(names)
client1.publish_dataset(names=names)
```

I have attached a full example that can be used to reproduce the behavior. Shown below is the output from the notebook I used to test:

```python
from dask.distributed import Client
from dask.distributed import wait
import dask.dataframe as dd

client1 = Client("scheduler-address:8786")
client2 = Client("scheduler-address:8786")

names = dd.read_csv('/shared-store-path/names.csv')
names = names.set_index("ID")
names = client1.persist(names)
wait(names)
# DoneAndNotDoneFutures(done={<Future: status: finished, type: DataFrame,
#   key: ('sort_index-6f9583110f3b77c24727c1e970735470', 0)>}, not_done=set())

client1.publish_dataset(names=names)

roles = dd.read_csv('/shared-store-path/roles.csv')
roles = roles.set_index("ID")
names_dataset = client2.get_dataset("names")
roles = roles.join(names_dataset)
roles.head()
client2.persist(roles)
```
|
Okay, thanks for the full example. I will try to run it tomorrow and see what happens. |
Actually, I have reproduced your error, but not using your code. It is enough to publish a dataset with one client, then, using a second client (a different process), get the dataset, modify something, and persist it:

```python
client = Client('localhost:8786')
client.restart()
df = dd.read_parquet('test_data/data.parquet')
df = client.persist(df)
client.publish_dataset(test=df)
```

Now, you can open a new terminal:

```python
client = Client(...)
df = client.get_dataset('test')
new_df = df['some'] > 0.5
new_df = client.persist(new_df)
# error here...
```

Do you have any ideas @mrocklin?

Python: 3.6.6 |
Any updates/suggestions about this problem, @mrocklin @martindurant? |
(I am at a conference and unlikely to be able to look into this right now) |
Did anything ever happen with this ticket? I seem to have run into the same situation and don't see any information about it anywhere else. My situation is that I am persisting a dataset the first time I load it, and someone else is picking it up and adding more to the graph, something akin to the sketch below, and it is giving me this error. |
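A minimal sketch of that persist-then-extend flow; the scheduler address, file path, column name, and dataset name are all hypothetical:

```python
import dask.dataframe as dd
from distributed import Client

# First session: persist on first load and publish for others
client_a = Client("scheduler:8786")
df = client_a.persist(dd.read_parquet("data.parquet"))
client_a.publish_dataset(shared=df)

# Later session: pick the dataset up and add more to the graph
client_b = Client("scheduler:8786")
shared = client_b.get_dataset("shared")
extended = shared[shared["x"] > 0]  # "x" is a hypothetical column
client_b.persist(extended)          # the step that raises the error in this thread
```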
The original posting above used versions that are now somewhat out of date. Would you mind trying with the latest versions? I just ran a very similar test and didn't see any error. |
Sounds good. After continuing to look, I think I am getting something just slightly different; I will try to package up an example. I was able to get it working by not creating my own client and instead using get_client on my computer to make sure that the same client was being used with the client.compute call, but the behavior I was experiencing didn't seem to make sense to me. I will attempt to get something small to illustrate the situation shortly. |
We're also having trouble working with this check. There's a bunch of datasets published by a loader script, and another application is launching tasks that use them (something like the sketch below).
|
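A sketch of that pattern; the scheduler address and dataset name are hypothetical, and the task fetches the published dataset from inside the worker:

```python
from distributed import Client, get_client

client = Client("scheduler:8786")  # hypothetical address

def task():
    # Runs on a worker: fetch a dataset published earlier by the loader script
    worker_client = get_client()
    df = worker_client.get_dataset("events")  # hypothetical name
    return len(df)

print(client.submit(task).result())
```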
That's not the case; the futures from the retrieved dataframe have a client with a different id (which is different for every session).
|
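A quick way to observe those differing ids, sketched against the "names" dataset from the example above (futures_of collects the futures embedded in a collection):

```python
from distributed import futures_of

df = client2.get_dataset("names")
print(client2.id)                              # this session's client id
print({f.client.id for f in futures_of(df)})   # id(s) carried by the retrieved futures
```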
Can anyone provide a reproducible example? This doesn't do it, I suspect because we aren't getting references to the client in the task graph
|
Oh @tshatrov, can you repeat your test in #2336 (comment) after calling optimize on the graph? That's when things would be replaced: see distributed/distributed/client.py, line 2519 at commit 587be8d. A sketch of forcing that optimization is below.
|
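A sketch of what "calling optimize on the graph" looks like, assuming df is the collection retrieved in the test being discussed:

```python
import dask

# dask.optimize returns new collections backed by the optimized graph;
# this is the step where persisted keys would be swapped for futures
(df_opt,) = dask.optimize(df)
df_opt.compute()
```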
STR (steps to reproduce):
As a result I get
I can't reproduce this on
I have not figured out how to use optimize_insert_futures to make this not fail. As far as I can see, it doesn't change the dataframe. |
Thanks @tshatrov. FWIW, I can't reproduce the ValueError locally with that. Perhaps others can. |
Any updates on this issue? I am also facing the same problem mentioned by @tshatrov. |
Hm, the example code also works OK for me. I wonder what might be different on your system, @pranav-kohli - what exact versions are you using? |
I am using the following setup:
dask==1.1.0
distributed==1.25.2 |
You should maybe try updating Dask and distributed and see if the problem persists.
|
As I posted above, the problem persists on the most recently released versions. |
This is what I used to replicate the issue on the latest versions:
The printed ids are different, hence the error. |
Thanks @pranav-kohli. Unfortunately this is the output I get for a (lightly modified) version of your script :/

```python
import pandas as pd
from distributed import Client
import dask.dataframe as dd

def getCompute(df):
    # print(df.id)
    print(list(df.dask.values())[0].client.id)
    df.compute()

if __name__ == '__main__':
    daskClient = Client()
    df = pd.DataFrame(data=[[1]], columns=['a'])
    df = dd.from_pandas(df, npartitions=1)
    df = df.persist()
    print(daskClient.id)
    print(list(df.dask.values())[0].client.id)
    future = daskClient.submit(getCompute, df)
```
|
@TomAugspurger So actually, in the test case, we are printing the same id twice.
Can you check your dask worker log for the print from the getCompute function? |
Is anybody who's able to reproduce this issue locally able to debug further? |
Here is a reproducer:

```python
from distributed.deploy.ssh2 import SSHCluster
from distributed import Client
import pandas as pd
import dask.dataframe as dd

def getCompute(df):
    return df.compute()

async def f():
    async with SSHCluster(
        hosts=["localhost", "localhost", "localhost"],
        worker_kwargs={"nthreads": 4},
        connect_kwargs={"known_hosts": None},
        asynchronous=True,
    ) as cluster:
        client = await Client("localhost:8786", asynchronous=True)
        df = pd.DataFrame(data=[[1]], columns=["a"])
        df = dd.from_pandas(df, npartitions=1)
        df = df.persist()
        future = client.submit(getCompute, df)
        await future

if __name__ == "__main__":
    import asyncio
    asyncio.get_event_loop().run_until_complete(f())
```

Traceback:

```
Traceback (most recent call last):
  File "foo.py", line 28, in <module>
    asyncio.get_event_loop().run_until_complete(f())
  File "/Users/mrocklin/miniconda/envs/dev/lib/python3.7/asyncio/base_events.py", line 573, in run_until_complete
    return future.result()
  File "foo.py", line 23, in f
    await future
  File "/Users/mrocklin/workspace/distributed/distributed/client.py", line 232, in _result
    six.reraise(*exc)
  File "/Users/mrocklin/miniconda/envs/dev/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "foo.py", line 8, in getCompute
    return df.compute()
  File "/Users/mrocklin/workspace/dask/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/Users/mrocklin/workspace/dask/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/Users/mrocklin/workspace/distributed/distributed/client.py", line 2500, in get
    actors=actors,
  File "/Users/mrocklin/workspace/distributed/distributed/client.py", line 2389, in _graph_to_futures
    raise ValueError(msg)
ValueError: Inputs contain futures that were created by another client.
```
|
I am able to reproduce following @tshatrov's instructions. I've slightly modified it:

```python
import distributed
client = distributed.Client('tcp://localhost:8786')

import dask
ts = dask.datasets.timeseries()
ts = ts.persist()
client.publish_dataset(timeseries=ts)

def get_ts():
    from dask.distributed import get_client
    with get_client() as client:
        return client.get_dataset('timeseries').compute()

client.submit(get_ts).result()
```

Dask and distributed |
This helps to normalize scheduler addresses before comparison. Fixes dask#2336.
Here is a new minimum reproducible example. The key here is that the client connects to the scheduler on a different IP to the worker.

```bash
# Run the dask scheduler on a machine
$ dask-scheduler

# Connect to that scheduler with a worker on the same machine via localhost
$ dask-worker localhost:8786
```

```python
# Connect to the cluster on a different IP
import distributed
client = distributed.Client('10.1.2.3:8786')  # Or whatever the LAN IP is

# Persist some data and publish it as a dataset
import dask
df = dask.datasets.timeseries().persist()
client.publish_dataset(df=df)

# Try and grab the dataset from within a delayed task
@dask.delayed
def remote_head():
    client = distributed.get_client()
    df = client.get_dataset('df')
    return df.head()

remote_head().compute()
```

This results in an error:

```
ValueError: Inputs contain futures that were created by another client.
```
|
I'm stumbling on the same problem on distributed 2.7.0. Another minimal example:

```python
from dask import delayed
import distributed

client = distributed.Client("localhost:8786")
a, = client.persist([delayed(1, pure=True)])
print(a)                                # Delayed('int-62645d78d66e2508256b7ab60a38b944')
print(a.compute())                      # 1
print(client.compute([a])[0].result())  # 1

client.publish_dataset(foo=a)
b = client.get_dataset("foo")
print(b)                                # Delayed('int-62645d78d66e2508256b7ab60a38b944')
print(b.compute())                      # 1
print(client.compute([b])[0].result())  # ValueError: Inputs contain futures that were created by another client.
```

The issue disappears if I connect to a LocalCluster instead.

Workaround:

```python
print(client.gather([b.dask[b.key]]))  # 1
```

This issue is particularly troublesome when using the asynchronous client together with non-trivial collections (read: anything other than a delayed), since the compute() method does not work (dask/dask#5580), so one would have to reassemble the output by hand starting from the output of the futures. |
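For a multi-partition collection, a sketch that generalizes the single-key workaround above, assuming the published collection was persisted so its graph holds one future per partition; the dataset name is illustrative:

```python
import pandas as pd
from distributed import futures_of

ddf = client.get_dataset("some_ddf")    # hypothetical published dask dataframe
parts = client.gather(futures_of(ddf))  # fetch each persisted partition directly
result = pd.concat(parts)               # reassemble the pandas result by hand
```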
Still facing the same issue with Dask 2.6.0 and Distributed 2.6.0.
Error: |
@pranav-kohli I'm not able to reproduce the error with the latest versions:

```python
In [1]: import pandas as pd
   ...: import dask.dataframe as dd
   ...: from distributed import Client
   ...:
   ...: def getCompute(df):
   ...:     print("client id inside")
   ...:     print(list(df.dask.values())[0].client.id)
   ...:     df.compute()
   ...:
   ...: daskClient = Client()
   ...: df = pd.DataFrame(data=[[1]], columns=['a'])
   ...: df = dd.from_pandas(df, npartitions=1)
   ...: df = df.persist()
   ...: print(daskClient.id)
   ...: future = daskClient.submit(getCompute, df)
   ...: print(future.result())
Client-cb589a84-63a4-11ea-9e5d-a0999b120aab
client id inside
Client-worker-ccab6e5c-63a4-11ea-9e6b-a0999b120aab
None
```

Can you update those packages and see if the problem persists? |
Isn't this code using LocalCluster? We already know this bug does not reproduce on a LocalCluster. |
|
Ah, I missed that when reading through the previous comments. I am able to reproduce the error. |
How weird! Now that should give us something to diagnose by, but I am pretty mystified. |
The default multiprocessing method was updated to spawn in #3461. I checked that the problem does not occur with |
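For anyone comparing the two start methods, a sketch of pinning the worker start method through distributed's configuration; the config key is an assumption based on distributed's packaged defaults:

```python
import dask
from distributed import Client, LocalCluster

# Assumed key: distributed.worker.multiprocessing-method ("spawn" is the
# post-#3461 default; "fork" restores the earlier behavior for comparison)
with dask.config.set({"distributed.worker.multiprocessing-method": "fork"}):
    client = Client(LocalCluster(n_workers=2))
```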
I have this same issue with the latest. Anyone have any ideas? |
@pborgen my PR resolves the issue for get_dataset() specifically - which is what the opening post was about. What you're doing in that snippet is sending over an arbitrary python object which happens to contain Futures and expect them to be recreated correctly when the Worker deserializes them - which is a different problem, albeit related. |
Is there currently an issue to capture this? |
@pborgen there is one now: #3790. Note that, as a workaround, you can use publish, which is currently the only sanctioned way to move collections across clients (see the sketch after this comment):
The downside is that, if for any reason the Client is SIGKILL'ed or loses network connectivity before the end of the computation, you'll end up with a memory leak on the cluster. |
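A minimal sketch of that publish-based workaround, assuming df is a collection already persisted through client; the dataset and function names are illustrative:

```python
# Publish instead of shipping a future-bearing collection through submit
client.publish_dataset(my_data=df)

def task():
    from distributed import get_client
    worker_client = get_client()
    return worker_client.get_dataset("my_data").sum().compute()

result = client.submit(task).result()
client.unpublish_dataset("my_data")  # clean up, to limit the leak noted above
```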
Thanks for your help.....I upgraded my dev box and all my machines that are part of the cluster to 2.16.0. I am using dask cli with one machine dedicated to the scheduler and 3 other dask workers... When I run with one worker everything works fine. But if I run with 2 workers I get the below error: Error: ValueError: Inputs contain futures that were created by another client. |
@pborgen run what? the POC that you posted, or my latest one with publish_dataset/get_dataset? |
I am running my code...It is pretty similar to what you posted though..... Your code seems to run fine on my cluster.....I am going to try to run your code with a very large dask dataframe to see if I can reproduce... |
I should say that this error only happens after things seem to be running for a minute or so... I am also using a ddf that is created from a parquet file that is 140 MB and about 40 million rows. |
inside my task I am querying this very large parquet file |
@pborgen, if you can create a minimal reproducible example, please file a new issue and we can discuss there. |
|
Thanks for your help... I will try to get this to you today or tomorrow. |
Do you know of a way to programmatically create a very large dataframe? Like at least 40 million rows and 8 columns. |
There are some functions for doing this in dask.datasets. |
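For example, a sketch using dask.datasets.timeseries, which already appears in the reproducers above; the date range and column spec are illustrative and give roughly 47 million rows by 8 columns:

```python
import dask

# One row per second for ~18 months ≈ 47 million rows, with 8 float columns
df = dask.datasets.timeseries(
    start="2000-01-01",
    end="2001-07-01",
    freq="1s",
    partition_freq="7d",
    dtypes={f"col{i}": float for i in range(8)},
)
print(df)
```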
Created #3791 |
I get the error "Inputs contain futures that were created by another client" when I try to join a published dataset within a client different from the one that originally published it. The whole flow can be summarised as follows:

Client 1:

```python
df = dd.read_csv(...)
client.persist(df)
client.publish_dataset(ds_name=df)
```

Client 2:

```python
df = client.get_dataset("ds_name")
df2 = dd.read_csv(...)
df2.join(df)
client.persist(df2)
```

Is this expected behavior?
For reference, this check was added in commit c02ea63 (diff 96a27223dc91b5c9ea3d03684d79ad3f), which is part of pull request #2227.
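A paraphrased sketch of that check, based on the error message and the _graph_to_futures traceback shown earlier in this thread; this is not the verbatim source, and the helper name is invented:

```python
from distributed import Future

def reject_foreign_futures(dsk, client):
    # Walk the task graph and reject any Future owned by a different client
    for task in dsk.values():
        if isinstance(task, Future) and task.client is not client:
            raise ValueError(
                "Inputs contain futures that were created by another client."
            )
```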