[FEA] Standardize a Workflow's representation of its output columns #372

Closed

alecgunny opened this issue Oct 21, 2020 · 2 comments

@alecgunny (Contributor)

In a robust training pipeline, an online preprocessing Workflow will need to know the columns output by the (offline) preprocessing Workflow, and the model will need to know the columns output by the online Workflow (or by the preprocessing Workflow, if there is no online preprocessing) in order to build the appropriate input tensors. Moreover, both will need to know which columns are list-like (sparse, multi-hot, or whatever you want to call it) in order to know what types of inputs to build. A standard and intuitive representation of these columns that can be accessed from, saved by, and loaded into a Workflow would be invaluable in making pipelines simpler and more robust.
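
For concreteness, a minimal sketch of what such a representation could look like (the ColumnSchema name and its attributes here are hypothetical, not an existing NVTabular API):

from dataclasses import dataclass

import numpy as np


@dataclass
class ColumnSchema:
    """Hypothetical per-column metadata a Workflow could expose."""
    name: str
    dtype: np.dtype
    is_list: bool = False  # list-like / multi-hot columns need different input tensors


# a Workflow might then report its outputs as, e.g.
output_columns = [
    ColumnSchema("item_id", np.dtype("int64")),
    ColumnSchema("genres", np.dtype("int64"), is_list=True),
    ColumnSchema("price", np.dtype("float32")),
]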

I understand that right now there is a method, Workflow.create_final_cols, that builds a representation like this. It would be good to better document what that representation is, and possibly even to build an object for it so that its attributes can be better codified and e.g. tab-completed. Depending on how much work it takes to generate this final representation, I also wonder whether it might be better exposed as a property like Workflow.output_columns that simply computes the "final" representation from whatever Ops are currently in the Workflow.
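
As a sketch of the property idea (assuming the hypothetical ColumnSchema above; per this issue, create_final_cols is the only method that exists today):

class Workflow:
    # ... existing Workflow internals ...

    def create_final_cols(self):
        # existing method (per the discussion above) that builds the final
        # column representation; implementation details omitted here
        raise NotImplementedError

    @property
    def output_columns(self):
        # recompute the "final" representation on demand from whatever Ops
        # are currently in the Workflow, so it can never go stale
        return self.create_final_cols()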

Once we have something like this, I can even envision a TensorFlow framework_util which can map from this representation to a set of input tensors (and even take care of constructing the SparseTensor multi-hot embedding inputs from values and nnz input tensors). Such a pipeline might look like:

input_columns = load_column_schema("path/to/columns/exported/by/preprocessing/workflow/columns.extension")
online_proc = Workflow(input_columns)
online_proc.add_cat_preprocess(HashBucket(...))

inputs = framework_utils.tensorflow.build_inputs(online_proc.output_columns) # probably return a dict of {name: tensor}
outputs = build_whatever_model(**model_params)(inputs)
model = tf.keras.Model(inputs=list(inputs.values()), outputs=outputs)
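
A rough illustration of what that framework_util might do (a sketch only; build_inputs and the __values/__nnz naming are assumptions, as is the ColumnSchema-style interface with .name/.dtype/.is_list):

import tensorflow as tf


def build_inputs(output_columns):
    # hypothetical sketch: one Keras input per scalar column, plus a
    # (values, nnz) pair of inputs per list-like column that a downstream
    # layer can combine into a SparseTensor / multi-hot embedding input
    inputs = {}
    for col in output_columns:
        if col.is_list:
            inputs[col.name + "__values"] = tf.keras.Input(
                name=col.name + "__values", shape=(1,), dtype=col.dtype
            )
            inputs[col.name + "__nnz"] = tf.keras.Input(
                name=col.name + "__nnz", shape=(1,), dtype="int64"
            )
        else:
            inputs[col.name] = tf.keras.Input(
                name=col.name, shape=(1,), dtype=col.dtype
            )
    return inputs  # {name: tensor}, as suggested above
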
@jperez999 jperez999 self-assigned this Nov 2, 2020
@jperez999 jperez999 linked a pull request Nov 2, 2020 that will close this issue
@jperez999 jperez999 removed a link to a pull request Nov 9, 2020
@benfred (Member) commented Jan 7, 2021

We're part of the way there with the API overhaul - we now have a complete representation of the output column names at least, but we still need to be able to get the dtypes reliably.

@benfred benfred assigned benfred and unassigned jperez999 Jan 7, 2021
@benfred benfred added the P1 label Jan 7, 2021
@viswa-nvidia viswa-nvidia added this to the NVTabular v0.7 milestone Apr 26, 2021
@karlhigley karlhigley assigned marcromeyn and unassigned benfred Jun 22, 2021
@benfred benfred added P0 and removed P1 labels Jul 13, 2021
@benfred (Member) commented Jul 27, 2021

tracked in #943 now

@benfred benfred closed this as completed Jul 27, 2021