DatasetAlreadyExistsError thrown when using ThreadRunner, dataset factories #3739
Comments
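Here is a temporary fix: create all the required catalog entries before the pipeline starts running.
from kedro.framework.hooks import hook_impl


class ResolveDatasetsHooks:
    @hook_impl
    def before_pipeline_run(self, pipeline, catalog):
        # Collect every dataset name used as an input or output by any node.
        data_sets = set()
        for node in pipeline.nodes:
            data_sets.update(node.outputs)
            data_sets.update(node.inputs)
        # Resolve each name up front so that no catalog entry is created lazily
        # while the pipeline is running.
        for ds in data_sets:
            catalog._get_dataset(ds)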
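To illustrate why resolving entries up front might help, here is a minimal sketch of a check-then-add race. It is a toy model, not Kedro's code: `LazyRegistry`, `AlreadyExistsError`, and `run_two_readers` are hypothetical names, and the thread does not confirm that this is the actual root cause. The barrier exists only to force both threads past the "is it registered?" check before either one adds the entry.

```python
import threading


class AlreadyExistsError(Exception):
    """Hypothetical stand-in for a 'dataset already exists' error."""


class LazyRegistry:
    """Toy registry that creates an entry on first lookup (not Kedro's DataCatalog)."""

    def __init__(self, barrier):
        self._entries = {}
        self._lock = threading.Lock()
        self._barrier = barrier  # only here to make the race reproducible

    def add(self, name, value):
        # add() itself is atomic, but it refuses to register a name twice.
        with self._lock:
            if name in self._entries:
                raise AlreadyExistsError(name)
            self._entries[name] = value

    def get(self, name):
        if name not in self._entries:  # check ...
            self._barrier.wait()  # both threads reach this point before either adds
            self.add(name, object())  # ... then act: the second caller fails
        return self._entries[name]


def run_two_readers(prewarm):
    """Have two threads look up the same name; return the errors they hit."""
    registry = LazyRegistry(threading.Barrier(2))
    if prewarm:
        # Same idea as the hook above: create the entry before any thread starts.
        registry.add("input_df", object())
    errors = []

    def worker():
        try:
            registry.get("input_df")
        except AlreadyExistsError as exc:
            errors.append(str(exc))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

In this model, `run_two_readers(prewarm=False)` reports one collision on `input_df`, while `run_two_readers(prewarm=True)` reports none because neither thread ever takes the lazy-creation path.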
Hey @melvinkokxw, I love that you provided a clean script instead of a scaffold project. It's very easy for me to run, and I appreciate the effort ✨! I suspect this is related to: Can you try to change
I managed to run this successfully, so the problem is more likely in your script. Was it copied from an old version of Kedro? Can you explain a little about what you are trying to do? That may give us more context to come up with a better solution. There are a few problems:
import yaml
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import ThreadRunner
from kedro.runner.parallel_runner import ParallelRunner
from kedro.runner.sequential_runner import SequentialRunner

if __name__ == "__main__":
    catalog_yml = """
"{name}":
  type: MemoryDataset
"""
    from kedro.io.memory_dataset import MemoryDataset

    MemoryDataset._load = lambda x: print("lambda!")
    catalog = yaml.safe_load(catalog_yml)
    io = DataCatalog.from_config(catalog)

    def return_dataframe(input_df):
        return "return!"

    pipeline = Pipeline(
        [
            node(
                func=return_dataframe, inputs="input_df", outputs="output_df1", name="node1"
            ),
            node(
                func=return_dataframe, inputs="input_df", outputs="output_df2", name="node2"
            ),
        ]
    )
    runner = ThreadRunner()
    # runner = ThreadRunner()
    runner.run(pipeline, io)
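The `"{name}"` entry in the script above is a catch-all dataset factory pattern: any dataset name that has no explicit catalog entry matches it, so `input_df`, `output_df1`, and `output_df2` are all resolved to a `MemoryDataset`. A rough sketch of that matching idea follows. It uses plain regular expressions and hypothetical helper names (`factory_pattern_to_regex`, `match_factory`); it is not how Kedro implements pattern resolution.

```python
import re


def factory_pattern_to_regex(pattern):
    """Turn a pattern such as "{name}" or "raw_{name}" into a compiled regex.

    Hypothetical helper for illustration; assumes each placeholder appears once.
    """
    # With one capture group, re.split alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # literal text between placeholders
        else:
            regex += f"(?P<{part}>.+)"  # placeholder becomes a named group
    return re.compile(f"^{regex}$")


def match_factory(pattern, dataset_name):
    """Return the placeholder values if dataset_name fits pattern, else None."""
    match = factory_pattern_to_regex(pattern).match(dataset_name)
    return match.groupdict() if match else None
```

Under this sketch, `match_factory("{name}", "input_df")` yields `{"name": "input_df"}`, whereas a narrower pattern such as `"raw_{name}"` does not match `input_df` at all.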
I am closing this issue due to inactivity. I tried to reproduce this last time and it worked as expected. Please reopen with a valid example.
Description
Using ThreadRunner with dataset factories leads to a DatasetAlreadyExistsError.
Context
I have a pipeline that has two nodes using the same input, both inputs should be loaded using dataset factories. When using ThreadRunner with my pipeline, kedro throws a DatasetAlreadyExistsError.
Steps to Reproduce
Here is a minimal reproducible example:
Expected Result
Pipeline should run successfully with no errors
Actual Result
Full error logs here
Your Environment
Kedro version used (pip show kedro or kedro -V): 0.18.14, also reproducible on 0.19.3
Python version used (python -V): 3.9.18