[python-package] Separately check whether `pyarrow` and `cffi` are installed #6785
Conversation
The only tests I can think of would require a runner with `pyarrow` but not `cffi` installed.

Could you try adding tests that mock `cffi` not being available by mocking `sys.modules`? I tried that in an environment with `scikit-learn` (another optional dependency of `lightgbm`) installed and it seemed to work ok:
import sys
from unittest import mock
with mock.patch.dict(sys.modules, {'sklearn': None}):
import lightgbm as lgb
print(lgb.compat.SKLEARN_INSTALLED)
# False
import lightgbm as lgb
print(lgb.compat.SKLEARN_INSTALLED)
# True
We don't have any examples of that in `lightgbm`'s test suite, but I think it'd be interesting to try.
It appears this does not work. I don't really understand why.
@jameslamb How would you like to continue here?
I feel it'd be easy to accidentally undo this work in future refactorings. I will try to find a way to add a test covering this.
python-package/lightgbm/basic.py (Outdated)
-if not PYARROW_INSTALLED:
-    raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` installed.")
+if not (PYARROW_INSTALLED and CFFI_INSTALLED):
+    raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.")
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.") | |
raise LightGBMError("Cannot init Dataset from Arrow without `pyarrow` and `cffi` installed.") |
This really should be `Dataset`, not `dataframe`... I'll make that change when I push testing changes.
Fixed in e72d5e2.
In that commit, I also removed backticks from these log messages, in favor of single quotes. Special characters in log messages can occasionally be problematic.
I know these things were already there before this PR, but might as well fix them right here while we're touching these lines.
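So, for example, the message discussed above now reads:

raise LightGBMError("Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.")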
def missing_module_cffi(monkeypatch):
    """Mock 'cffi' not being importable"""
    monkeypatch.setattr(lightgbm.compat, "CFFI_INSTALLED", False)
    monkeypatch.setattr(lightgbm.basic, "CFFI_INSTALLED", False)
Came up with this based on https://docs.pytest.org/en/stable/reference/reference.html
I'm hoping that this could establish a pattern we re-use in other tests in future PRs.
It's not perfect (for example, if setting `CFFI_INSTALLED` is done incorrectly in `compat.py`, then this approach wouldn't catch that), but it's a lightweight and simple way to ensure we always cover code like the changes introduced in this PR.
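For context, a test exercising that fixture looks roughly like the one added in this PR (sketch only; `generate_dummy_arrow_table()` and `dummy_dataset_params()` are existing helpers in `test_arrow.py`):

import pytest
import pyarrow as pa
import lightgbm as lgb

def test_dataset_construction_from_pa_table_without_cffi_raises_informative_error(missing_module_cffi):
    # the fixture has already flipped CFFI_INSTALLED to False via monkeypatch
    with pytest.raises(
        lgb.basic.LightGBMError,
        match="Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.",
    ):
        lgb.Dataset(
            generate_dummy_arrow_table(),
            label=pa.array([0, 1, 0, 0, 1]),
            params=dummy_dataset_params(),
        ).construct()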
@jmoralez @borchero @StrikerRUS what do you think about this approach?
This approach looks quite fragile. I think a more complicated but correct approach would be one of the ones described here: https://stackoverflow.com/a/51048604.
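One of the variants there patches `builtins.__import__` so that importing a specific module raises `ImportError`. A rough standalone sketch of the idea (not wired into `lightgbm`'s conftest):

import builtins
from unittest import mock

_real_import = builtins.__import__

def _block_cffi(name, *args, **kwargs):
    # pretend 'cffi' / 'pyarrow.cffi' are not installed
    if name == "cffi" or name.startswith("pyarrow.cffi"):
        raise ImportError(f"No module named {name!r} (mocked)")
    return _real_import(name, *args, **kwargs)

with mock.patch("builtins.__import__", side_effect=_block_cffi):
    try:
        import pyarrow.cffi  # noqa: F401
    except ImportError as err:
        print(err)  # No module named 'pyarrow.cffi' (mocked)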
That is a lot more complicated :/

I'll try it though, I do agree that it'd be a stronger test to go all the way into getting the import to literally raise an `ImportError`.
I've been trying to implement something like those approaches, but just cannot get it working :/

`lightgbm` does a LOT of stuff at import time, like `dlopen()`-ing a shared library:
_LIB = ctypes.cdll.LoadLibrary(_find_lib_path()[0])
and registering a logging callback in it (python-package/lightgbm/basic.py, lines 291 to 297 in 9f1af05):
# connect the Python logger to logging in lib_lightgbm
if not environ.get("LIGHTGBM_BUILD_DOC", False):
    _LIB.LGBM_GetLastError.restype = ctypes.c_char_p
    callback = ctypes.CFUNCTYPE(None, ctypes.c_char_p)
    _LIB.callback = callback(_log_callback)  # type: ignore[attr-defined]
    if _LIB.LGBM_RegisterLogCallback(_LIB.callback) != 0:
        raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
Also, the test files (which are treated as Python modules too) end up storing their own references to `lightgbm`.
Here's what I tried that got the closest:
diff --git a/tests/python_package_test/conftest.py b/tests/python_package_test/conftest.py
index 1f4a7943..e97ed82d 100644
--- a/tests/python_package_test/conftest.py
+++ b/tests/python_package_test/conftest.py
@@ -1,14 +1,37 @@
+import importlib
+import sys
+
import numpy as np
import pytest
-import lightgbm
+
+def _reload_lightgbm():
+ """
+ Re-process ``lightgbm.compat`` conditional imports,
+ then reload all other modules that use them.
+ """
+ importlib.reload(sys.modules["lightgbm.compat"])
+ importlib.reload(sys.modules["lightgbm.libpath"])
+ importlib.reload(sys.modules["lightgbm.basic"])
+ importlib.reload(sys.modules["lightgbm.callback"])
+ importlib.reload(sys.modules["lightgbm.engine"])
+ importlib.reload(sys.modules["lightgbm.sklearn"])
+ importlib.reload(sys.modules["lightgbm.dask"])
+ importlib.reload(sys.modules["lightgbm.plotting"])
+ importlib.reload(sys.modules["lightgbm"])
@pytest.fixture(scope="function")
def missing_module_cffi(monkeypatch):
- """Mock 'cffi' not being importable"""
- monkeypatch.setattr(lightgbm.compat, "CFFI_INSTALLED", False)
- monkeypatch.setattr(lightgbm.basic, "CFFI_INSTALLED", False)
+ monkeypatch.setitem(sys.modules, "pyarrow.cffi", None)
+ _reload_lightgbm()
+
+
+@pytest.fixture(autouse=True)
+def reset_imports():
+ """Re-load ``lightgbm`` modules, undoing any temporary modifications made by tests"""
+ yield
+ _reload_lightgbm()
@pytest.fixture(scope="function")
diff --git a/tests/python_package_test/test_arrow.py b/tests/python_package_test/test_arrow.py
index b592d733..a51fbd41 100644
--- a/tests/python_package_test/test_arrow.py
+++ b/tests/python_package_test/test_arrow.py
@@ -444,6 +444,8 @@ def test_arrow_feature_name_auto():
def test_arrow_feature_name_manual():
+ # just testing this import works
+ assert lgb.compat.CFFI_INSTALLED
data = generate_dummy_arrow_table()
dataset = lgb.Dataset(
data,
@@ -457,6 +459,7 @@ def test_arrow_feature_name_manual():
def test_dataset_construction_from_pa_table_without_cffi_raises_informative_error(missing_module_cffi):
+ assert not lgb.compat.CFFI_INSTALLED
with pytest.raises(
lgb.basic.LightGBMError, match="Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed."
):
@@ -468,6 +471,7 @@ def test_dataset_construction_from_pa_table_without_cffi_raises_informative_erro
def test_predicting_from_pa_table_without_cffi_raises_informative_error(missing_module_cffi):
+ assert not lgb.compat.CFFI_INSTALLED
data = generate_random_arrow_table(num_columns=3, num_datapoints=1_000, seed=42)
labels = generate_random_arrow_array(num_datapoints=data.shape[0], seed=42)
bst = lgb.train(
diff --git a/tests/python_package_test/test_import_mocking.py b/tests/python_package_test/test_import_mocking.py
new file mode 100644
index 00000000..50640a60
--- /dev/null
+++ b/tests/python_package_test/test_import_mocking.py
@@ -0,0 +1,13 @@
+import sys
+import importlib
+import lightgbm as lgb
+from unittest.mock import patch
+
+def test_has_cffi():
+ assert lgb.basic.CFFI_INSTALLED
+
+def test_imports_patch(missing_module_cffi):
+ assert not lgb.basic.CFFI_INSTALLED
+
+def test_has_cffi_again():
+ assert lgb.basic.CFFI_INSTALLED
With that patch applied on this branch, these new Arrow tests do pass.
pytest tests/python_package_test/test_arrow.py
# === 143 passed in 2.23s ===
But a LOT of other tests fail, with a variety of errors.
pytest tests/python_package_test
# === 275 failed, 516 passed, 33 skipped, 6 xfailed, 192 warnings in 67.89s (0:01:07) ===
Including many that look like `isinstance()` checks failing, which is maybe a result of there being multiple competing copies of `lightgbm` loaded (??? I'm not sure about my understanding here).
if stage == "fit":
if self._objective is None:
if isinstance(self, LGBMRegressor):
self._objective = "regression"
elif isinstance(self, LGBMClassifier):
if self._n_classes > 2:
self._objective = "multiclass"
else:
self._objective = "binary"
elif isinstance(self, LGBMRanker):
self._objective = "lambdarank"
else:
> raise ValueError("Unknown LGBMModel type.")
E ValueError: Unknown LGBMModel type.
and
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-3-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-3-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-3-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-None-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-None-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-None-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-3-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-3-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-3-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-None-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-None-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-None-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence_get_data[1] - AssertionError:
FAILED tests/python_package_test/test_basic.py::test_sequence_get_data[2] - TypeError: Data list can only be of ndarray or Sequence
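For what it's worth, here's a minimal standalone illustration (using a stdlib module, nothing `lightgbm`-specific) of why reload-based fixtures can break `isinstance()` checks on objects created before the reload:

import importlib
import json.decoder

dec = json.decoder.JSONDecoder()
importlib.reload(json.decoder)

# after the reload, json.decoder.JSONDecoder is a brand-new class object,
# so instances created from the old class no longer pass the check
print(isinstance(dec, json.decoder.JSONDecoder))  # False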
@StrikerRUS do you want to try?
Thanks a lot for trying this approach! Yes, I'd like to check some things but I don't want to block merging of this PR and next release.
Alright thanks! I'll merge this after CI runs, then.
The Stack Overflow link you shared was excellent. In case you do try this, here are some other links I consulted:
    generate_dummy_arrow_table(),
    label=pa.array([0, 1, 0, 0, 1]),
    params=dummy_dataset_params(),
).construct()
`.construct()` is necessary here... `__init_from_pyarrow_table()` is not run as part of `lgb.Dataset()`.
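In other words, `lgb.Dataset()` only records its inputs; the Arrow-specific code path (and so the new check) runs when the Dataset is materialized. A rough sketch with toy data in place of the test helpers:

import lightgbm as lgb
import pyarrow as pa

data = pa.table({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})        # toy stand-in for generate_dummy_arrow_table()
ds = lgb.Dataset(data, label=pa.array([0, 1, 0, 0, 1]))  # lazy: nothing is passed to the library yet
ds.construct()  # only here does __init_from_pyarrow_table() run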
Ok, think I found a pattern that'll work for this testing! I just pushed e72d5e2, proposing that and adding some other small fixes. Let me know if it looks ok to you @mlondschien. I've also dismissed my review... now that I've made such significant edits here, my review shouldn't count towards a merge. @StrikerRUS @borchero @jmoralez could one of you help with a review?
Just some minor suggestions below.
    CFFI_INSTALLED = True
except ImportError:
    CFFI_INSTALLED = False

    class arrow_cffi:  # type: ignore
        """Dummy class for pyarrow.cffi.ffi."""
Why do we need the `CData = None`, `addressof = None`, `cast = None`, and `new = None` class members?
`CData` is needed for type hinting (python-package/lightgbm/basic.py, line 416 in 9f1af05):

chunks: arrow_cffi.CData

But I think the others could be safely removed. It's only showing up in the diff in this PR because this code is being moved around... so this was missed in earlier PRs (I guess #6034).
Removed all but `CData` in 7396613.
LGTM!
> The only tests I can think of would require a runner with `pyarrow` but not `cffi` installed.

Note that the `LightGBMError`s will only raise when `pyarrow` is installed, but `cffi` is not. If `pyarrow` is not installed, `pa_Table` is a dummy class and `isinstance(data, pa_Table)` returns `False`.

This is a breaking change for users who didn't install `lightgbm[arrow]`, but rather just installed `lightgbm` and `pyarrow` separately. Even if not intended, they could previously train a model on a `pyarrow.Table`, as this was converted via `scipy.sparse.csr_matrix(data)`. The fix is simply to install `cffi` or to transform manually with `scipy.sparse.csr_matrix`.

Still, it is good to inform people that they are not "natively" training from a `pyarrow.Table`, incurring an unnecessary copy.

As already suggested in #6782, an alternative would be to raise a warning.
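A rough sketch of that manual workaround (assuming an all-numeric table and that pandas is available for the conversion):

import pyarrow as pa
from scipy.sparse import csr_matrix
import lightgbm as lgb

table = pa.table({"x1": [1.0, 0.0, 2.0], "x2": [0.0, 3.0, 0.0]})
X = csr_matrix(table.to_pandas().to_numpy())  # explicit copy, instead of relying on the old implicit conversion
ds = lgb.Dataset(X, label=[0, 1, 0])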