[python-package] Separately check whether `pyarrow` and `cffi` are installed #6785
Conversation
The only tests I can think of would require a runner with `pyarrow` but not `cffi` installed.

Could you try adding tests that mock `cffi` not being available by mocking `sys.modules`? I tried that in an environment with `scikit-learn` (another optional dependency of `lightgbm`) installed and it seemed to work ok:
import sys
from unittest import mock
with mock.patch.dict(sys.modules, {'sklearn': None}):
import lightgbm as lgb
print(lgb.compat.SKLEARN_INSTALLED)
# False
import lightgbm as lgb
print(lgb.compat.SKLEARN_INSTALLED)
# True
We don't have any examples of that in `lightgbm`'s test suite, but I think it'd be interesting to try.
It appears this does not work. I don't really understand why.
@jameslamb How would you like to continue here?
I feel it'd be easy to accidentally undo this work in future refactorings. I will try to find a way to add a test covering this.
python-package/lightgbm/basic.py (Outdated)
-if not PYARROW_INSTALLED:
-    raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` installed.")
+if not (PYARROW_INSTALLED and CFFI_INSTALLED):
+    raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.")
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.") | |
raise LightGBMError("Cannot init Dataset from Arrow without `pyarrow` and `cffi` installed.") |
This really should be `Dataset`, not `dataframe`... I'll make that change when I push testing changes.
Fixed in e72d5e2.
In that commit, I also removed backticks from these log messages, in favor of single quotes. Special characters in log messages can occasionally be problematic.
I know these things were already there before this PR, but might as well fix them right here while we're touching these lines.
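So, for example, the message discussed above now reads:

raise LightGBMError("Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.")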
def missing_module_cffi(monkeypatch):
    """Mock 'cffi' not being importable"""
    monkeypatch.setattr(lightgbm.compat, "CFFI_INSTALLED", False)
    monkeypatch.setattr(lightgbm.basic, "CFFI_INSTALLED", False)
Came up with this based on https://docs.pytest.org/en/stable/reference/reference.html
I'm hoping that this could establish a pattern we re-use in other tests in future PRs.
It's not perfect (for example, if setting `CFFI_INSTALLED` is done incorrectly in `compat.py`, then this approach wouldn't catch that), but it's a lightweight and simple way to ensure we always cover code like the changes introduced in this PR.
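For context, a test exercising that fixture looks roughly like the one added in this PR (sketch only; `generate_dummy_arrow_table()` and `dummy_dataset_params()` are existing helpers in `test_arrow.py`):

import pytest
import pyarrow as pa
import lightgbm as lgb

def test_dataset_construction_from_pa_table_without_cffi_raises_informative_error(missing_module_cffi):
    # the fixture has already flipped CFFI_INSTALLED to False via monkeypatch
    with pytest.raises(
        lgb.basic.LightGBMError,
        match="Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.",
    ):
        lgb.Dataset(
            generate_dummy_arrow_table(),
            label=pa.array([0, 1, 0, 0, 1]),
            params=dummy_dataset_params(),
        ).construct()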
@jmoralez @borchero @StrikerRUS what do you think about this approach?
This approach looks quite fragile. I think a more complicated but correct approach would be one of the ones described here: https://stackoverflow.com/a/51048604.
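One of the variants there patches `builtins.__import__` so that importing a specific module raises `ImportError`. A rough standalone sketch of the idea (not wired into `lightgbm`'s conftest):

import builtins
from unittest import mock

_real_import = builtins.__import__

def _block_cffi(name, *args, **kwargs):
    # pretend 'cffi' / 'pyarrow.cffi' are not installed
    if name == "cffi" or name.startswith("pyarrow.cffi"):
        raise ImportError(f"No module named {name!r} (mocked)")
    return _real_import(name, *args, **kwargs)

with mock.patch("builtins.__import__", side_effect=_block_cffi):
    try:
        import pyarrow.cffi  # noqa: F401
    except ImportError as err:
        print(err)  # No module named 'pyarrow.cffi' (mocked)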
That is a lot more complicated :/

I'll try it though, I do agree that it'd be a stronger test to go all the way into getting the import to literally raise an `ImportError`.
I've been trying to implement something like those approaches, but just cannot get it working :/

`lightgbm` does a LOT of stuff at import time, like `dlopen()`-ing a shared library:
_LIB = ctypes.cdll.LoadLibrary(_find_lib_path()[0])
and registering a logging callback in it (python-package/lightgbm/basic.py, lines 291 to 297 in 9f1af05):
# connect the Python logger to logging in lib_lightgbm
if not environ.get("LIGHTGBM_BUILD_DOC", False):
    _LIB.LGBM_GetLastError.restype = ctypes.c_char_p
    callback = ctypes.CFUNCTYPE(None, ctypes.c_char_p)
    _LIB.callback = callback(_log_callback)  # type: ignore[attr-defined]
    if _LIB.LGBM_RegisterLogCallback(_LIB.callback) != 0:
        raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
Also, the test files (which are treated as Python modules too) end up storing their own references to `lightgbm`.
Here's what I tried that got the closest:
diff --git a/tests/python_package_test/conftest.py b/tests/python_package_test/conftest.py
index 1f4a7943..e97ed82d 100644
--- a/tests/python_package_test/conftest.py
+++ b/tests/python_package_test/conftest.py
@@ -1,14 +1,37 @@
+import importlib
+import sys
+
import numpy as np
import pytest
-import lightgbm
+
+def _reload_lightgbm():
+ """
+ Re-process ``lightgbm.compat`` conditional imports,
+ then reload all other modules that use them.
+ """
+ importlib.reload(sys.modules["lightgbm.compat"])
+ importlib.reload(sys.modules["lightgbm.libpath"])
+ importlib.reload(sys.modules["lightgbm.basic"])
+ importlib.reload(sys.modules["lightgbm.callback"])
+ importlib.reload(sys.modules["lightgbm.engine"])
+ importlib.reload(sys.modules["lightgbm.sklearn"])
+ importlib.reload(sys.modules["lightgbm.dask"])
+ importlib.reload(sys.modules["lightgbm.plotting"])
+ importlib.reload(sys.modules["lightgbm"])
@pytest.fixture(scope="function")
def missing_module_cffi(monkeypatch):
- """Mock 'cffi' not being importable"""
- monkeypatch.setattr(lightgbm.compat, "CFFI_INSTALLED", False)
- monkeypatch.setattr(lightgbm.basic, "CFFI_INSTALLED", False)
+ monkeypatch.setitem(sys.modules, "pyarrow.cffi", None)
+ _reload_lightgbm()
+
+
+@pytest.fixture(autouse=True)
+def reset_imports():
+ """Re-load ``lightgbm`` modules, undoing any temporary modifications made by tests"""
+ yield
+ _reload_lightgbm()
@pytest.fixture(scope="function")
diff --git a/tests/python_package_test/test_arrow.py b/tests/python_package_test/test_arrow.py
index b592d733..a51fbd41 100644
--- a/tests/python_package_test/test_arrow.py
+++ b/tests/python_package_test/test_arrow.py
@@ -444,6 +444,8 @@ def test_arrow_feature_name_auto():
def test_arrow_feature_name_manual():
+ # just testing this import works
+ assert lgb.compat.CFFI_INSTALLED
data = generate_dummy_arrow_table()
dataset = lgb.Dataset(
data,
@@ -457,6 +459,7 @@ def test_arrow_feature_name_manual():
def test_dataset_construction_from_pa_table_without_cffi_raises_informative_error(missing_module_cffi):
+ assert not lgb.compat.CFFI_INSTALLED
with pytest.raises(
lgb.basic.LightGBMError, match="Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed."
):
@@ -468,6 +471,7 @@ def test_dataset_construction_from_pa_table_without_cffi_raises_informative_erro
def test_predicting_from_pa_table_without_cffi_raises_informative_error(missing_module_cffi):
+ assert not lgb.compat.CFFI_INSTALLED
data = generate_random_arrow_table(num_columns=3, num_datapoints=1_000, seed=42)
labels = generate_random_arrow_array(num_datapoints=data.shape[0], seed=42)
bst = lgb.train(
diff --git a/tests/python_package_test/test_import_mocking.py b/tests/python_package_test/test_import_mocking.py
new file mode 100644
index 00000000..50640a60
--- /dev/null
+++ b/tests/python_package_test/test_import_mocking.py
@@ -0,0 +1,13 @@
+import sys
+import importlib
+import lightgbm as lgb
+from unittest.mock import patch
+
+def test_has_cffi():
+ assert lgb.basic.CFFI_INSTALLED
+
+def test_imports_patch(missing_module_cffi):
+ assert not lgb.basic.CFFI_INSTALLED
+
+def test_has_cffi_again():
+ assert lgb.basic.CFFI_INSTALLED
With that patch applied on this branch, these new Arrow tests do pass.
pytest tests/python_package_test/test_arrow.py
# === 143 passed in 2.23s ===
But a LOT of other tests fail, with a variety of errors.
pytest tests/python_package_test
# === 275 failed, 516 passed, 33 skipped, 6 xfailed, 192 warnings in 67.89s (0:01:07) ===
Including many that look like `isinstance()` checks failing, which is maybe a result of there being multiple competing copies of `lightgbm` loaded (??? I'm not sure about my understanding here).
if stage == "fit":
if self._objective is None:
if isinstance(self, LGBMRegressor):
self._objective = "regression"
elif isinstance(self, LGBMClassifier):
if self._n_classes > 2:
self._objective = "multiclass"
else:
self._objective = "binary"
elif isinstance(self, LGBMRanker):
self._objective = "lambdarank"
else:
> raise ValueError("Unknown LGBMModel type.")
E ValueError: Unknown LGBMModel type.
and
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-3-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-3-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-3-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-None-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-None-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-False-None-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-3-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-3-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-3-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-None-11] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-None-100] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence[3-True-None-None] - TypeError: Data list can only be of ndarray or Sequence
FAILED tests/python_package_test/test_basic.py::test_sequence_get_data[1] - AssertionError:
FAILED tests/python_package_test/test_basic.py::test_sequence_get_data[2] - TypeError: Data list can only be of ndarray or Sequence
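For what it's worth, here's a minimal standalone illustration (using a stdlib module, nothing `lightgbm`-specific) of why reload-based fixtures can break `isinstance()` checks on objects created before the reload:

import importlib
import json.decoder

dec = json.decoder.JSONDecoder()
importlib.reload(json.decoder)

# after the reload, json.decoder.JSONDecoder is a brand-new class object,
# so instances created from the old class no longer pass the check
print(isinstance(dec, json.decoder.JSONDecoder))  # False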
@StrikerRUS do you want to try?
Thanks a lot for trying this approach! Yes, I'd like to check some things but I don't want to block merging of this PR and next release.
Alright thanks! I'll merge this after CI runs, then.
The Stack Overflow link you shared was excellent. In case you do try this, here are some other links I consulted:
    generate_dummy_arrow_table(),
    label=pa.array([0, 1, 0, 0, 1]),
    params=dummy_dataset_params(),
).construct()
`.construct()` is necessary here... `__init_from_pyarrow_table()` is not run as part of `lgb.Dataset()`.
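In other words, `lgb.Dataset()` only records its inputs; the Arrow-specific code path (and so the new check) runs when the Dataset is materialized. A rough sketch with toy data in place of the test helpers:

import lightgbm as lgb
import pyarrow as pa

data = pa.table({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})        # toy stand-in for generate_dummy_arrow_table()
ds = lgb.Dataset(data, label=pa.array([0, 1, 0, 0, 1]))  # lazy: nothing is passed to the library yet
ds.construct()  # only here does __init_from_pyarrow_table() run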
Ok, think I found a pattern that'll work for this testing! I just pushed e72d5e2, proposing that and adding some other small fixes. Let me know if it looks ok to you @mlondschien. I've also dismissed my review... now that I've made such significant edits here, my review shouldn't count towards a merge. @StrikerRUS @borchero @jmoralez could one of you help with a review?
Just some minor suggestions below.
    CFFI_INSTALLED = True
except ImportError:
    CFFI_INSTALLED = False

    class arrow_cffi:  # type: ignore
        """Dummy class for pyarrow.cffi.ffi."""
Why do we need the `CData = None`, `addressof = None`, `cast = None`, and `new = None` class members?
`CData` is needed for type hinting (python-package/lightgbm/basic.py, line 416 in 9f1af05):

chunks: arrow_cffi.CData

But I think the others could be safely removed. It's only showing up in the diff in this PR because this code is being moved around... so this was missed in earlier PRs (I guess #6034).
Removed all but `CData` in 7396613.
LGTM!
> The only tests I can think of would require a runner with `pyarrow` but not `cffi` installed.

Note that the `LightGBMError`s will only raise when `pyarrow` is installed, but `cffi` is not. If `pyarrow` is not installed, `pa_Table` is a dummy class and `isinstance(data, pa_Table)` returns `False`.

This is a breaking change for users who didn't install `lightgbm[arrow]`, but rather just installed `lightgbm` and `pyarrow` separately. Even if not intended, they could previously train a model on a `pyarrow.Table`, as this was converted via `scipy.sparse.csr_matrix(data)`. The fix is simply to install `cffi` or to transform manually with `scipy.sparse.csr_matrix`.

Still, it is good to inform people that they are not "natively" training from a `pyarrow.Table`, incurring an unnecessary copy.

As already suggested in #6782, an alternative would be to raise a warning.
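A rough sketch of that manual workaround (assuming an all-numeric table and that pandas is available for the conversion):

import pyarrow as pa
from scipy.sparse import csr_matrix
import lightgbm as lgb

table = pa.table({"x1": [1.0, 0.0, 2.0], "x2": [0.0, 3.0, 0.0]})
X = csr_matrix(table.to_pandas().to_numpy())  # explicit copy, instead of relying on the old implicit conversion
ds = lgb.Dataset(X, label=[0, 1, 0])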