Make IncrementalDataset's confirms "namespaced" #4039
Comments
I believe this is also hiding a bug. If the incremental dataset is namespaced and the confirms argument is not explicitly set as per the workaround, no checkpoint file is created. I would guess that this is because, if confirms is not provided, it is set to the incremental dataset name without the namespace, and that dataset does not actually exist.
Thanks @gtauzin and sorry for the slow response. We will investigate the issue you mention first.
Example

The related example of the pipeline shared by user in #4164:

```python
def create_pipeline(**kwargs) -> Pipeline:
    def get_pipeline(namespace: str):
        template_pipeline = pipeline(
            [
                node(
                    concatenate_increment,
                    inputs="data_increment",
                    outputs=["concatenated_data_increment", "data_increment_concatenated"],
                    name="concatenate_increment",
                    confirms=f"{namespace}.data_increment",  # This is needed as the incremental dataset is namespaced
                ),
                node(
                    concatenate_partition,
                    inputs=[
                        "partitioned_concatenated_data",
                        "data_increment_concatenated",
                    ],
                    outputs="extracted_data",
                    name="concatenate_partition",
                ),
            ],
        )
        return template_pipeline

    pipelines = pipeline(
        pipe=get_pipeline(namespace=SOURCES[0]),
        namespace=SOURCES[0],
    )
    for source in SOURCES[1:]:
        pipelines += pipeline(
            pipe=get_pipeline(namespace=source),
            namespace=source,
        )
    return pipelines
```

```yaml
"{source}.data_increment":
  type: partitions.IncrementalDataset
  path: data/01_raw/{source}/
  dataset:
    type: pandas.CSVDataset
  filename_suffix: ".csv"
```

Explanation

After the node execution we check if the dataset should be confirmed and then confirm it (Line 95 in 9c70bae).
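That flow can be sketched in plain Python (a simplified model, not kedro's actual implementation; `apply_namespace` and `run_and_confirm` are hypothetical helpers standing in for the pipeline and runner code): `pipeline(namespace=...)` prefixes a node's inputs and outputs but leaves `confirms` untouched, so the runner then looks up a dataset name that does not exist in the catalog and silently confirms nothing.

```python
def apply_namespace(node: dict, namespace: str) -> dict:
    """Mimic pipeline(namespace=...): prefix inputs/outputs, but not confirms."""
    prefixed = dict(node)
    prefixed["inputs"] = [f"{namespace}.{n}" for n in node["inputs"]]
    prefixed["outputs"] = [f"{namespace}.{n}" for n in node["outputs"]]
    return prefixed  # note: node["confirms"] is left untouched


def run_and_confirm(node: dict, catalog: dict) -> list:
    """Mimic the runner: after node execution, confirm the listed datasets."""
    confirmed = []
    for name in node.get("confirms", []):
        if name in catalog:  # only datasets that exist in the catalog can be confirmed
            catalog[name] = "checkpoint written"
            confirmed.append(name)
        # otherwise: silently skipped, no checkpoint file is ever created
    return confirmed


catalog = {"device1.data_increment": "no checkpoint yet"}
node = {
    "inputs": ["data_increment"],
    "outputs": ["concatenated_data_increment"],
    "confirms": ["data_increment"],  # un-namespaced, as in the reported bug
}
print(run_and_confirm(apply_namespace(node, "device1"), catalog))  # -> []

# The workaround: namespace confirms by hand, as in the pipeline example above.
node["confirms"] = ["device1.data_increment"]
print(run_and_confirm(apply_namespace(node, "device1"), catalog))
# -> ['device1.data_increment']
```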
The related logic is on Line 151 and Line 331 in 9c70bae.

@gtauzin currently there's no way to apply a namespace dynamically to what you pass into `confirms`.

Possible solution

Since we know the node namespace and node name at the level of the node, we can resolve `confirms` there the same way inputs are resolved.
@ElenaKhaustova Thanks for looking into this!
Description

I have a namespace-based incremental dataset and wish to use the `confirms` attribute to trigger a checkpoint update further down my pipeline. However, based on discussions on Slack, it seems that incremental datasets are not meant to be used within namespaces, and so `confirms` is not "namespaced" by design.

Following a discussion with @noklam on Slack, it seems that my use case could justify having "namespaced" `confirms`.

Context
I have many devices that regularly record event files and push them to an S3 bucket. I would like to run a preprocessing pipeline that is different for each device. Then, I use the concatenation of all recorded preprocessed events seen so far for data science purposes.

The way I achieve this with Kedro is:

1. New event files are loaded with an `IncrementalDataset`, and the concatenated dataframe is saved using a versioned `ParquetDataset`.
2. A `PartitionedDataset` is able to find all preprocessed recorded events computed so far (with `load_args`, `withdirs`, and `max_depth` set accordingly).

Those steps are done for each device, so I use namespaces to reuse the same logic for all of them, varying the S3 bucket path. I need the `confirms` to be at step 2 because only then can I consider new files to have been processed.
Workaround

@noklam suggested putting the namespace in the argument, e.g. `confirms=namespace.data`, as a workaround, and I can confirm this worked.