[FEA] Standardize a Workflow's representation of its output columns #372

Closed

alecgunny opened this issue Oct 21, 2020 · 2 comments

@alecgunny (Contributor)

In a robust training pipeline, an online preprocessing Workflow will need to know the columns output by the (offline) preprocessing Workflow, and the model will need to know the columns output by the online Workflow (or by the preprocessing Workflow, if there is no online preprocessing) in order to build the appropriate input tensors. Moreover, both will need to know which columns are list-like (sparse, multi-hot, or whatever you want to call it) in order to know what types of inputs to build. A standard and intuitive representation of these columns that can be accessed from, saved by, and loaded into a Workflow would be invaluable in making pipelines simpler and more robust.
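
For concreteness, a minimal sketch of what such a representation could look like (the ColumnSchema name and its attributes here are hypothetical, not an existing NVTabular API):

from dataclasses import dataclass

import numpy as np


@dataclass
class ColumnSchema:
    """Hypothetical per-column metadata a Workflow could expose."""
    name: str
    dtype: np.dtype
    is_list: bool = False  # list-like / multi-hot columns need different input tensors


# a Workflow might then report its outputs as, e.g.
output_columns = [
    ColumnSchema("item_id", np.dtype("int64")),
    ColumnSchema("genres", np.dtype("int64"), is_list=True),
    ColumnSchema("price", np.dtype("float32")),
]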

I understand that right now there is a method, Workflow.create_final_cols, that builds a representation like this. It would be good to better document what that representation is, and possibly even to build an object for it so that its attributes can be better codified and e.g. tab-completed. Depending on how much work it takes to generate this final representation, I also wonder whether it might be better exposed as a property like Workflow.output_columns that simply computes the "final" representation from whatever Ops are currently in the Workflow.
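
As a sketch of the property idea (assuming the hypothetical ColumnSchema above; per this issue, create_final_cols is the only method that exists today):

class Workflow:
    # ... existing Workflow internals ...

    def create_final_cols(self):
        # existing method (per the discussion above) that builds the final
        # column representation; implementation details omitted here
        raise NotImplementedError

    @property
    def output_columns(self):
        # recompute the "final" representation on demand from whatever Ops
        # are currently in the Workflow, so it can never go stale
        return self.create_final_cols()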

Once we have something like this, I can even envision a TensorFlow framework_util which can map from this representation to a set of input tensors (and even take care of constructing the SparseTensor multi-hot embedding inputs from values and nnz input tensors). Such a pipeline might look like:

input_columns = load_column_schema("path/to/columns/exported/by/preprocessing/workflow/columns.extension")
online_proc = Workflow(input_columns)
online_proc.add_cat_preprocess(HashBucket(...))

inputs = framework_utils.tensorflow.build_inputs(online_proc.output_columns) # probably return a dict of {name: tensor}
outputs = build_whatever_model(**model_params)(inputs)
model = tf.keras.Model(inputs=list(inputs.values()), outputs=outputs)
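
A rough illustration of what that framework_util might do (a sketch only; build_inputs and the __values/__nnz naming are assumptions, as is the ColumnSchema-style interface with .name/.dtype/.is_list):

import tensorflow as tf


def build_inputs(output_columns):
    # hypothetical sketch: one Keras input per scalar column, plus a
    # (values, nnz) pair of inputs per list-like column that a downstream
    # layer can combine into a SparseTensor / multi-hot embedding input
    inputs = {}
    for col in output_columns:
        if col.is_list:
            inputs[col.name + "__values"] = tf.keras.Input(
                name=col.name + "__values", shape=(1,), dtype=col.dtype
            )
            inputs[col.name + "__nnz"] = tf.keras.Input(
                name=col.name + "__nnz", shape=(1,), dtype="int64"
            )
        else:
            inputs[col.name] = tf.keras.Input(
                name=col.name, shape=(1,), dtype=col.dtype
            )
    return inputs  # {name: tensor}, as suggested above
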
@jperez999 jperez999 self-assigned this Nov 2, 2020
@jperez999 jperez999 linked a pull request Nov 2, 2020 that will close this issue
@jperez999 jperez999 removed a link to a pull request Nov 9, 2020
@benfred (Member) commented Jan 7, 2021

We're part of the way there with the API overhaul - we now have a complete representation of the output column names at least, but we still need to be able to get the dtypes reliably.

@benfred benfred assigned benfred and unassigned jperez999 Jan 7, 2021
@benfred benfred added the P1 label Jan 7, 2021
@viswa-nvidia viswa-nvidia added this to the NVTabular v0.7 milestone Apr 26, 2021
@karlhigley karlhigley assigned marcromeyn and unassigned benfred Jun 22, 2021
@benfred benfred added P0 and removed P1 labels Jul 13, 2021
@benfred (Member) commented Jul 27, 2021

tracked in #943 now

@benfred benfred closed this as completed Jul 27, 2021