Optimize the input state-vector copy into the LGPU (#1071)
**Context:**
Memory profiling of different algorithms run with LGPU revealed a bottleneck in the Python layer: the peak memory usage is 3 times what the computation actually needs.

**Description of the Change:**
Remove temporary allocations and skip the index computation for the common cases.
* Remove the temporary GPU allocations for input values and indices.
* The input state vector is copied directly from the host if **the
target wires are contiguous and start at the most/least significant
wires** (which are the most common cases).
* For custom target wires, LGPU follows the previous algorithm,
but with a speedup in the index computation through parallel computing
with OpenMP.
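To illustrate the dispatch logic described above, here is a small NumPy sketch (hypothetical helper name `target_indices`; the shipped implementation is C++, and the general case is parallelized with OpenMP rather than NumPy vectorization):

```python
import numpy as np

def target_indices(num_wires, target_wires):
    """Flat state-vector indices touched by StatePrep on `target_wires`
    (big-endian wire ordering, wire 0 most significant).
    Hypothetical sketch, not the shipped C++ code."""
    n_targets = len(target_wires)
    # Fast-path check: contiguous targets anchored at the most- or
    # least-significant wires.
    contiguous = list(target_wires) == list(
        range(target_wires[0], target_wires[0] + n_targets)
    )
    if contiguous and target_wires[0] == 0:
        # Most-significant block: evenly strided indices, so the input
        # can be copied with a single strided transfer.
        stride = 2 ** (num_wires - n_targets)
        return np.arange(2 ** n_targets) * stride
    if contiguous and target_wires[-1] == num_wires - 1:
        # Least-significant block: the first 2**n_targets amplitudes,
        # i.e. a plain contiguous copy.
        return np.arange(2 ** n_targets)
    # General case: scattered indices (computed in parallel with OpenMP
    # in the C++ implementation).
    shifts = [num_wires - 1 - w for w in target_wires]
    idx = np.zeros(2 ** n_targets, dtype=np.int64)
    for bit, shift in enumerate(reversed(shifts)):
        idx |= ((np.arange(2 ** n_targets) >> bit) & 1) << shift
    return idx
```

For example, with 3 wires, targets `[0, 1]` hit indices `[0, 2, 4, 6]` (strided copy), targets `[1, 2]` hit `[0, 1, 2, 3]` (contiguous copy), and targets `[0, 2]` fall through to the scattered path.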
**Benefits:**
Using a test algorithm with 31 qubits produces the following memory
profile:

Reduction of the memory peak from 100 GB to 66 GB.
Note: `memray` measures all memory allocations, including the GPU
`cudaMallocX` calls.
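The reduction is consistent with a back-of-the-envelope estimate (my assumption, not stated in the PR: the old path held the device state vector, a temporary device buffer, and the host input copy simultaneously, while the new fast path drops the temporary buffer):

```python
# Rough peak-memory estimate for a 31-qubit complex128 state vector.
num_wires = 31
amp_bytes = 16  # bytes per complex128 amplitude
sv_gb = (2 ** num_wires) * amp_bytes / 1e9  # one state-vector copy, in GB

old_peak = 3 * sv_gb  # state vector + temporary GPU buffer + host copy
new_peak = 2 * sv_gb  # state vector + host copy (direct transfer)
print(f"{sv_gb:.1f} GB per copy; old peak ~{old_peak:.0f} GB, new peak ~{new_peak:.0f} GB")
# prints: 34.4 GB per copy; old peak ~103 GB, new peak ~69 GB
```

These ballpark figures line up with the measured 100 GB to 66 GB reduction above.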
Using the following toy circuit
```python
import pennylane as qml

num_wires = 31
wires = range(num_wires)
input_state = random_normalize_sv(num_wires - 1)  # helper: random normalized state vector
target_wires = wires[:-1]
dev = qml.device("lightning.gpu", wires=wires)

@qml.qnode(dev)
def circuit():
    qml.StatePrep(input_state, wires=target_wires)
    return qml.expval(qml.PauliZ(0))
```
produces the following times


**Possible Drawbacks:**
**Related GitHub Issues:**
[sc-58833]
---------
Co-authored-by: ringo-but-quantum <[email protected]>