In a robust training pipeline, online preprocessing Workflows will need to know the columns output by the preprocessing Workflow, and the model will need to know the columns output by the online Workflow (or the preprocessing Workflow if there is no online preprocessing) in order to build the appropriate input tensors. Moreover, they will need to know which columns are list-like (or sparse or multi-hot or whatever you want to call it) in order to know what types of inputs to build. Creating a standard and intuitive representation of these columns that can be accessed from, saved by, and loaded to a Workflow will be invaluable in making pipelines simpler and more robust.
I understand that right now there is a method Workflow.create_final_cols that builds a representation like this. It would be good to document what that representation is, and possibly even to build an object for it so that its attributes can be better codified and, e.g., tab-completed. I also wonder, depending on how much work it takes to generate this final representation, whether this might be better done by a property like Workflow.output_columns that just computes the "final" representation from whatever Ops are currently in the Workflow.
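As a concrete (hypothetical) sketch of what such an object might look like — the class name, fields, and helper functions below are assumptions for illustration, not an existing API:

```python
# Hypothetical column-schema object; names and fields are illustrative only.
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ColumnSchema:
    name: str               # column name as produced by the Workflow
    dtype: str              # e.g. "int64", "float32"
    is_list: bool = False   # True for list-like / multi-hot columns


def save_column_schema(columns, path):
    """Serialize a list of ColumnSchema objects to JSON for a downstream Workflow."""
    with open(path, "w") as f:
        json.dump([asdict(c) for c in columns], f)


def load_column_schema(path):
    """Load a previously exported schema so it can seed another Workflow."""
    with open(path) as f:
        return [ColumnSchema(**c) for c in json.load(f)]
```

Something this small already lets the online Workflow and the model agree on names, dtypes, and list-ness without re-deriving them from the Ops.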
Once we have something like this, I can even envision a TensorFlow framework_util which can map from this representation to a set of input tensors (and even take care of constructing the SparseTensor multi-hot embedding inputs from values and nnz input tensors). Such a pipeline might look like:
input_columns = load_column_schema("path/to/columns/exported/by/preprocessing/workflow/columns.extension")
online_proc = Workflow(input_columns)
online_proc.add_cat_preprocess(HashBucket(...))
inputs = framework_utils.tensorflow.build_inputs(online_proc.output_columns) # probably returns a dict of {name: tensor}
outputs = build_whatever_model(**model_params)(inputs)
model = tf.keras.Model(inputs=list(inputs.values()), outputs=outputs)
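To make the SparseTensor part concrete, here is the bookkeeping (shown in plain Python, without TensorFlow, so the indexing logic is visible) that a helper like the hypothetical build_inputs would do to turn a flat values tensor plus per-row nnz counts into the arguments tf.SparseTensor expects — indices, values, and dense_shape:

```python
# Plain-Python illustration of assembling SparseTensor constructor arguments
# from a flat `values` array and per-row `nnz` counts; in a real pipeline
# these would feed tf.SparseTensor(indices, values, dense_shape).

def values_nnz_to_sparse_args(values, nnz, max_cols=None):
    """Compute (indices, values, dense_shape) for a ragged multi-hot column.

    `values` is the flat concatenation of every row's entries; `nnz[i]` says
    how many of those entries belong to row i.
    """
    assert len(values) == sum(nnz), "values/nnz lengths disagree"
    indices = []
    for row, count in enumerate(nnz):
        for col in range(count):
            indices.append((row, col))  # position of each value in the dense matrix
    width = max_cols if max_cols is not None else max(nnz, default=0)
    return indices, list(values), (len(nnz), width)
```

For example, values [5, 7, 7, 9] with nnz [1, 3, 0] describes three rows where row 0 holds [5], row 1 holds [7, 7, 9], and row 2 is empty; the resulting sparse tensor could then be fed into an embedding lookup for the multi-hot column.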
We're part of the way there with the API overhaul - we now have a complete representation of the output column names at least, but we still need to be able to get the dtypes reliably.