Optimize memory peak for _preprocess_state_vector in LightningTensor (#943)

LuisAlfredoNu · ringo-but-quantum · mlxd · web-flow · commit e1ec3ad516b4 · 2024-10-15T19:28:31.000-04:00
### Before submitting

Please complete the following checklist when submitting a PR:

- [ ] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory!
- [X] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- [X] Ensure that the test suite passes, by running `make test`.
- [X] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change, and including a link back to the PR.
- [X] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.

------------------------------------------------------------------------------------------------------------

**Context:** After profiling Lightning Tensor with the [scalene profiler](https://github.com/plasma-umass/scalene), we found a memory bottleneck in the function `_preprocess_state_vector` ([code](https://github.com/PennyLaneAI/pennylane-lightning/blob/4945ed08d5475d04add8d69a3cf5978ba31d1b39/pennylane_lightning/lightning_tensor/_tensornet.py#L208-L240)), which allocates 3 arrays with dimension `2 ** wires * wires`.

**Description of the Change:** Optimize the cartesian product to reduce the amount of memory necessary to set the `StatePrep` with LightningTensor.

**Benefits:** Halves the memory peak for large systems close to 30 qubits.

![image](https://github.com/user-attachments/assets/431c7b1c-3877-472c-b2d7-e4bc38293d20)

Benchmark code

``` python
import pennylane as qml
import numpy as np

wires = 27
# Random normalized state on all wires except the first one.
state = np.random.rand(2**(wires-1))
state = state / np.linalg.norm(state)

dev = qml.device('lightning.tensor', wires=wires)
dev_wires = dev.wires.tolist()

@qml.qnode(dev)
def circuit(state=state, dev_wires=dev_wires):
    qml.StatePrep(state, wires=dev_wires[1:])
    return qml.expval(qml.Z(0)), qml.state()

# Run the circuit; this call is what the profiler measures.
circuit(state, dev_wires)
```

**Possible Drawbacks:** This change reduces readability, in exchange for a substantial reduction in peak memory.

**Related GitHub Issues:** [sc-75692]

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Lee James O'Riordan <mlxd@users.noreply.github.com>
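The equivalence between the old and new index computation can be checked in isolation. The following is a minimal standalone sketch, not part of this PR: the function names `indices_via_product` and `indices_via_base_row` are hypothetical, and wires are assumed to be plain Python lists of integers rather than `Wires` objects. The first function mirrors the removed cartesian product code, the second mirrors the loop added to `_preprocess_state_vector`.

```python
from itertools import product

import numpy as np


def indices_via_product(device_wires, num_wires):
    """Removed approach: materialize all basis states, then ravel them."""
    n = len(device_wires)
    # (2**n, n) array holding every bit string on the target wires.
    basis_states = np.array(list(product([0, 1], repeat=n)))
    # (2**n, num_wires) array with the bits placed on the full register.
    unravelled = np.zeros((2**n, num_wires), dtype=int)
    unravelled[:, device_wires] = basis_states
    return np.ravel_multi_index(unravelled.T, [2] * num_wires)


def indices_via_base_row(device_wires, num_wires):
    """Added approach: accumulate indices from one reusable bit-pattern row."""
    wires = list(device_wires)[::-1]
    n = len(wires)
    # Start with the least significant bit pattern [0 1 0 1 ...].
    base = np.tile([0, 1], 2 ** (n - 1)).astype(np.int64)
    indexes = np.zeros(2**n, dtype=np.int64)
    for i, wire in enumerate(wires):
        # Each wire contributes its bit weighted by its position in the register.
        indexes += base * 2 ** (num_wires - 1 - wire)
        if i == n - 1:
            continue
        two_n = 2 ** (i + 1)
        # Reverse the middle block of each row in place, which turns the
        # pattern for bit i into the pattern for bit i + 1.
        base = base.reshape(-1, two_n * 2)
        a = two_n // 2
        b = a + two_n
        base[:, a:b] = base[:, a:b][:, ::-1]
        base = base.reshape(-1)
    return indexes


if __name__ == "__main__":
    for wires, total in [([1, 2, 3], 4), ([0, 2], 3), ([2, 0, 3], 5)]:
        old = indices_via_product(wires, total)
        new = indices_via_base_row(wires, total)
        assert np.array_equal(old, new)
    print("both approaches agree")
```

In this sketch the second function only ever holds two 1-D arrays of length `2 ** n` (plus a temporary for the product), whereas the first holds 2-D arrays with `2 ** n` rows, which is consistent with the memory reduction described above.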
diff --git a/.github/CHANGELOG.md b/.github/CHANGELOG.md
@@ -43,6 +43,9 @@
 
 ### Improvements
 
+* Optimize the cartesian product to reduce the amount of memory necessary to set the StatePrep with LightningTensor. 
+  [(#943)](https://github.com/PennyLaneAI/pennylane-lightning/pull/943)
+
 * The `prob` data return `lightning.gpu` C++ layer is aligned with other state-vector backends and `lightning.gpu` supports out-of-order `qml.prob`.
     [(#941)](https://github.com/PennyLaneAI/pennylane-lightning/pull/941)
 
diff --git a/.github/workflows/wheel_linux_aarch64.yml b/.github/workflows/wheel_linux_aarch64.yml
@@ -123,8 +123,13 @@ jobs:
           mkdir Kokkos
           cp -rf ${{ github.workspace }}/Kokkos_install/${{ matrix.exec_model }}/* Kokkos/
 
+      - name: Install Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
       - name: Install dependencies
-        run: python -m pip install cibuildwheel~=2.20.0 tomlkit
+        run: python3.10 -m pip install cibuildwheel~=2.20.0 tomlkit
 
       - name: Configure pyproject.toml file
         run: PL_BACKEND="${{ matrix.pl_backend }}" python scripts/configure_pyproject_toml.py
diff --git a/.github/workflows/wheel_linux_aarch64_cuda.yml b/.github/workflows/wheel_linux_aarch64_cuda.yml
@@ -48,8 +48,13 @@ jobs:
       - name: Checkout PennyLane-Lightning
         uses: actions/checkout@v4
 
+      - name: Install Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
       - name: Install cibuildwheel
-        run: python -m pip install cibuildwheel~=2.20.0 tomlkit
+        run: python3.10 -m pip install cibuildwheel~=2.20.0 tomlkit
 
       - name: Configure pyproject.toml file
         run: PL_BACKEND="${{ matrix.pl_backend }}" python scripts/configure_pyproject_toml.py
diff --git a/pennylane_lightning/core/_version.py b/pennylane_lightning/core/_version.py
@@ -16,4 +16,4 @@
    Version number (major.minor.patch[-label])
 """
 
-__version__ = "0.39.0-dev42"
+__version__ = "0.39.0-dev43"
diff --git a/pennylane_lightning/lightning_tensor/_tensornet.py b/pennylane_lightning/lightning_tensor/_tensornet.py
@@ -21,8 +21,6 @@
 except ImportError:
     pass
 
-from itertools import product
-
 import numpy as np
 import pennylane as qml
 from pennylane import BasisState, DeviceError, StatePrep
@@ -223,20 +221,46 @@ def _preprocess_state_vector(self, state, device_wires):
         if len(device_wires) == self._num_wires and Wires(sorted(device_wires)) == device_wires:
             return np.reshape(state, output_shape).ravel(order="C")
 
-        # generate basis states on subset of qubits via the cartesian product
-        basis_states = np.array(list(product([0, 1], repeat=len(device_wires))))
+        local_dev_wires = device_wires.tolist().copy()
+        local_dev_wires = local_dev_wires[::-1]
+
+        # generate basis states on subset of qubits via broadcasting as substitute of cartesian product.
+
+        # Allocate a single row as a base to avoid a large array allocation with
+        # the cartesian product algorithm.
+        # Initialize the base with the pattern [0 1 0 1 ...].
+        base = np.tile([0, 1], 2 ** (len(local_dev_wires) - 1)).astype(dtype=np.int64)
+        # Allocate the array where it will accumulate the value of the indexes depending on
+        # the value of the basis.
+        indexes = np.zeros(2 ** (len(local_dev_wires)), dtype=np.int64)
+
+        max_dev_wire = self._num_wires - 1
+
+        # Iterate over all device wires.
+        for i, wire in enumerate(local_dev_wires):
+
+            # Accumulate indexes from the basis.
+            indexes += base * 2 ** (max_dev_wire - wire)
+
+            if i == len(local_dev_wires) - 1:
+                continue
+
+            two_n = 2 ** (i + 1)  # Compute the value of the base.
 
-        # get basis states to alter on full set of qubits
-        unravelled_indices = np.zeros((2 ** len(device_wires), self._num_wires), dtype=int)
-        unravelled_indices[:, device_wires] = basis_states
+            # Update the value of the base without reallocating a new array.
+            # Reshape the basis to swap the internal columns.
+            base = base.reshape(-1, two_n * 2)
+            swapper_A = two_n // 2
+            swapper_B = swapper_A + two_n
 
-        # get indices for which the state is changed to input state vector elements
-        ravelled_indices = np.ravel_multi_index(unravelled_indices.T, [2] * self._num_wires)
+            base[:, swapper_A:swapper_B] = base[:, swapper_A:swapper_B][:, ::-1]
+            # Flatten the base array
+            base = base.reshape(-1)
 
         # get full state vector to be factorized into MPS
         full_state = np.zeros(2**self._num_wires, dtype=self.dtype)
         for i, value in enumerate(state):
-            full_state[ravelled_indices[i]] = value
+            full_state[indexes[i]] = value
         return np.reshape(full_state, output_shape).ravel(order="C")
 
     def _apply_state_vector(self, state, device_wires: Wires):
@@ -285,7 +309,7 @@ def _apply_MPO(self, gate_matrix, wires):
             None
         """
         # TODO: Discuss if public interface for max_mpo_bond_dim argument
-        max_mpo_bond_dim = 2 ** len(wires)  # Exact SVD decomposition for MPO
+        max_mpo_bond_dim = self._max_bond_dim
 
         # Get sorted wires and MPO site tensor
         mpos, sorted_wires = gate_matrix_decompose(