feat: add support to sklearn TargetEncoder #1137

Merged
merged 14 commits into from
Feb 20, 2025

Conversation

boccaff
Contributor

@boccaff boccaff commented Nov 7, 2024

This PR implements a converter and a shape calculator for the TargetEncoder class introduced in scikit-learn 1.3. The code closely follows the implementation of the converter for OrdinalEncoder.

A partial suite of tests is already implemented, but there are at least a couple of additional tests that I would like to add (missing values, and using the smooth parameter from sklearn, even though I think it shouldn't matter).

@xadupre
Collaborator

xadupre commented Nov 14, 2024

Thanks for the contribution. One line should be removed. Everything else looks good.

[("input", StringTensorType([None, X.shape[1]]))],
target_opset=TARGET_OPSET,
)
self.assertTrue(model_onnx is not None)

Check notice

Code scanning / CodeQL

Imprecise assert Note test

assertTrue(a is not b) cannot provide an informative message. Using assertIsNot(a, b) instead will give more informative messages.
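
As an illustration of the CodeQL note, here is a minimal sketch using only the standard unittest module. The `_Probe` class, the `failure_message` helper, and the `model_onnx = None` placeholder are hypothetical and not part of the PR; they only show how the two assertion styles differ in the message produced on failure. For a comparison against None, `assertIsNotNone` is the specialised form of the `assertIsNot` call that CodeQL suggests.

```python
import unittest


class _Probe(unittest.TestCase):
    """Hypothetical helper: gives access to the assert* methods outside a test run."""

    def runTest(self):
        pass


def failure_message(check):
    """Run `check` and return the AssertionError message, or None if it passed."""
    try:
        check()
    except AssertionError as exc:
        return str(exc)
    return None


probe = _Probe()
model_onnx = None  # placeholder for a conversion that unexpectedly returned None

# assertTrue only receives a boolean, so the message says nothing about the value.
imprecise = failure_message(lambda: probe.assertTrue(model_onnx is not None))

# assertIsNotNone receives the object itself and reports what went wrong.
precise = failure_message(lambda: probe.assertIsNotNone(model_onnx))

print(imprecise)  # False is not true
print(precise)    # unexpectedly None
```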
target_opset=TARGET_OPSET,
)
self.assertTrue(model_onnx is not None)
self.assertTrue(model_onnx.graph.node is not None)

Check notice

Code scanning / CodeQL

Imprecise assert Note test

assertTrue(a is not b) cannot provide an informative message. Using assertIsNot(a, b) instead will give more informative messages.
[("input", Int64TensorType([None, X.shape[1]]))],
target_opset=TARGET_OPSET,
)
self.assertTrue(model_onnx is not None)

Check notice

Code scanning / CodeQL

Imprecise assert Note test

assertTrue(a is not b) cannot provide an informative message. Using assertIsNot(a, b) instead will give more informative messages.
target_opset=TARGET_OPSET,
)
self.assertTrue(model_onnx is not None)
self.assertTrue(model_onnx.graph.node is not None)

Check notice

Code scanning / CodeQL

Imprecise assert Note test

assertTrue(a is not b) cannot provide an informative message. Using assertIsNot(a, b) instead will give more informative messages.
model_onnx = convert_sklearn(
model, "ordinal encoder two string cats", inputs, target_opset=TARGET_OPSET
)
self.assertTrue(model_onnx is not None)

Check notice

Code scanning / CodeQL

Imprecise assert Note test

assertTrue(a is not b) cannot provide an informative message. Using assertIsNot(a, b) instead will give more informative messages.
@boccaff boccaff force-pushed the feat-target_encoder_support branch from e945a85 to e243865 Compare November 15, 2024 00:53
@boccaff
Contributor Author

boccaff commented Nov 15, 2024

Thanks for the comments @xadupre. I've removed the line and resolved a couple of the CodeQL suggestions (removed an unused import and an unused variable). The remaining CodeQL suggestions would diverge from the other implementations. For the .assertTrue calls I can just follow the suggestion, but the new tests would then differ from the existing ones. Is that OK? For the except: pass, maybe we can add a warning (example above on the respective CodeQL comment).

@khoover

khoover commented Feb 7, 2025

Hi @boccaff, @xadupre, I'm wondering whether there's a timeline on getting this merged in. My team's been interested in using ONNX, but almost all of our models use target encoding.

@xadupre
Collaborator

xadupre commented Feb 7, 2025

The PR looks OK. I'll merge it when the CI tests pass.

@boccaff
Contributor Author

boccaff commented Feb 18, 2025

@xadupre, do you have any pointers on how to define/specify the opset to solve the current error?

@khoover, I plan to have another pass on this PR by the end of the week.

edit: add link to error

@xadupre
Collaborator

xadupre commented Feb 18, 2025

You should call convert_sklearn(..., target_opset={"": 18, "ai.onnx.ml": 3}) or to_onnx(..., target_opset={"": 18, "ai.onnx.ml": 3}). That will fix the opset.

@boccaff
Contributor Author

boccaff commented Feb 19, 2025

@xadupre, I've noticed that the converter for OrdinalEncoder sets op_version=2 when adding nodes. Would that be preferable? If so, should I raise an error, as is done for LabelEncoder?

PS: should we also raise on OrdinalEncoder? I could follow up on a different issue.

@xadupre
Collaborator

xadupre commented Feb 19, 2025

Whether you set 2 or 3 should not matter. If you set 3, onnx won't find any operator version defined at opset 3, so it chooses the latest available one, which is 2.
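
The fallback rule described in this comment can be sketched in a few lines. This is only an illustration of the rule as stated here, not onnx's actual code: the function name `resolve_op_version` and the list of available versions are assumptions made for the example.

```python
def resolve_op_version(requested_opset, available_versions):
    """Pick the newest operator version that is not newer than the requested opset.

    Hypothetical helper that mirrors the rule from the comment above:
    if no definition exists at exactly `requested_opset`, fall back to
    the latest earlier one.
    """
    candidates = [v for v in available_versions if v <= requested_opset]
    if not candidates:
        raise ValueError(
            f"No operator version available at or below opset {requested_opset}."
        )
    return max(candidates)


# Assumption for illustration: the operator is defined at versions 1 and 2 only.
print(resolve_op_version(3, [1, 2]))  # 2
print(resolve_op_version(2, [1, 2]))  # 2
```

Under this rule, requesting 2 or 3 selects the same operator definition, which is why the choice does not matter here.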

@@ -419,3 +419,48 @@ def _run(self, x, int64_vocabulary=None, string_vocabulary=None):
return (np.array(res),)

raise TypeError(f"x must be iterable not {type(x)}.") # pragma: no cover

class TargetEncoder(OpRun):
Collaborator

This should not be needed. TargetEncoder is not part of onnx specs and your converter does not produce such a node.

Contributor Author

Removed that and the reference to it in tests/test_utils/utils_backend_onnx.py. Thanks for the review.

@boccaff
Contributor Author

boccaff commented Feb 20, 2025

I'll try to set up an Ubuntu Docker image with the failing config (sklearn 1.3.2 and Python 3.11) to debug this failing test.
The .classes_ attribute was introduced later.
My local tests ran OK, but on a more recent environment (sklearn 1.6.1 and Python 3.13.1).

Python 3.12 with sklearn==1.4.2 appears OK, as do 1.5.2 and 1.6.2.

@xadupre
Collaborator

xadupre commented Feb 20, 2025

According to this https://scikit-learn.org/1.3/modules/generated/sklearn.preprocessing.TargetEncoder.html, the attribute does not exist in that version. You can either retrieve the information from another attribute or state that the converter only works with sklearn>=1.4. Since it is a new converter, I don't see any issue with that. We may address older versions if other users post an issue on GitHub. So feel free to raise an exception if the attribute is missing, with a message that gives the user the information needed to understand why it does not work.
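
A minimal sketch of the kind of guard suggested in this comment, assuming a hypothetical helper named `require_fitted_attribute` and a stub class standing in for an estimator fitted with an older scikit-learn release. Neither name comes from the PR or from sklearn-onnx.

```python
class _OldTargetEncoderStub:
    """Hypothetical stand-in for an estimator fitted with an older release:
    it exposes target_type_ but not classes_."""

    target_type_ = "continuous"


def require_fitted_attribute(estimator, name, min_version):
    """Return a fitted attribute, or raise an error that explains the version requirement."""
    try:
        return getattr(estimator, name)
    except AttributeError as exc:
        raise RuntimeError(
            f"{type(estimator).__name__} has no attribute {name!r}; "
            f"this converter requires scikit-learn>={min_version}."
        ) from exc


stub = _OldTargetEncoderStub()
print(require_fitted_attribute(stub, "target_type_", "1.4"))  # continuous

try:
    require_fitted_attribute(stub, "classes_", "1.4")
except RuntimeError as err:
    # The message names the missing attribute and the required version.
    print(err)
```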

The classes_ attribute was introduced later, in 1.4. Changed the test to
not rely on it.

Signed-off-by: boccaff <[email protected]>
@boccaff
Contributor Author

boccaff commented Feb 20, 2025

Opted for checking only target_type_, which should now be only 'continuous' or 'binary'.

scikit-learn 1.3.2 does not allow fitting multiclass targets.

Signed-off-by: boccaff <[email protected]>
@xadupre xadupre merged commit 58b62c6 into onnx:main Feb 20, 2025
22 of 23 checks passed
@boccaff
Contributor Author

boccaff commented Feb 20, 2025

Thank you for the support and patience through this PR, @xadupre.
