Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add merging of Interners #13752

Merged
merged 4 commits into from
Jan 30, 2025
Merged

Add merging of Interners #13752

merged 4 commits into from
Jan 30, 2025

Conversation

jakelishman
Copy link
Member

@jakelishman jakelishman commented Jan 28, 2025

Summary

This adds ways of merging two interners together, including ways that re-use temporary allocations. The intention here is to make it much cheaper to rebuild DAGs and circuits from many small parts (like synthesis passes, basis-translation, etc), and to avoid many small-scale allocations when doing qarg remapping. The small-scale allocations are comparatively costly of themselves, and can often have amplified effects in threaded contexts (though it depends on the system allocator in use).

There's quite a few places that this could be used, such as UnitarySynthesis and BasisTranslator, but both of those will be substantial amounts of work to convert, and more appropriate for separate patches. Given the performance improvements in this PR, I suspect that they should both see decent improvements too.

Details and comments

Commit messages:

Add merging of Interner instances

This makes it possible to merge one interner into another, using a map-and-filter function to explain how the values should be considered afterwards. The main use-case for this is in DAG and circuit substitution methods, where we want to be able to map PackedInstructions from a small circuit to new ones on a larger circuit, without having to unwrap and re-hash the interned slices once for instruction; instead, we do that once per interner key, and then have a no-hash direct lookup table to remap the interner keys directly.

The methods here are designed to directly allow the reuse of the heap allocations used to track the interner key mappings, to help reduce the number of small-scale allocations done in substitution methods.

The filtering part of the mapping is because one cannot typically be certain that an Interner contains the minimally necessary keys, especially for DAGs; it can well be that substitutions and bit removals cause old data to be present.

Use new Interner merging in DAGCircuit::from_circuit

This converts one obvious candidate for the new interner-merging functionality. There are several others in the codebase (all uses of DAGCircuit::extend, for example, and several other full rebuilds), but the structure of their code wasn't designed with this in mind, so the modifications would typically be far more involved and more suitable for separate patches.

Using a timing script:

import itertools
from qiskit.circuit import QuantumCircuit
from qiskit.converters import circuit_to_dag

qc = QuantumCircuit(100, 100)
for pair in itertools.permutations(qc.qubits, 2):
    qc.rz(0, pair[0])
    qc.sx(pair[0])
    qc.rz(0, pair[0])
    qc.rz(0, pair[1])
    qc.sx(pair[1])
    qc.rz(0, pair[1])
    qc.cx(*pair)
    
%timeit circuit_to_dag(qc, copy_operations=False)

this patch showed an improvement from 18.5(6)ms to 14.4(5)ms on my machine, which is a 22% speedup, though admittedly this particular function was doing larger-scale allocation work that could be removed than other candidate replacements are.

This makes it possible to merge one interner into another, using a
map-and-filter function to explain how the values should be considered
afterwards.  The main use-case for this is in DAG and circuit
substitution methods, where we want to be able to map
`PackedInstruction`s from a small circuit to new ones on a larger
circuit, without having to unwrap and re-hash the interned slices once
for instruction; instead, we do that once _per interner key_, and then
have a no-hash direct lookup table to remap the interner keys directly.

The methods here are designed to directly allow the reuse of the heap
allocations used to track the interner key mappings, to help reduce the
number of small-scale allocations done in substitution methods.

The filtering part of the mapping is because one cannot typically be
certain that an `Interner` contains the minimally necessary keys,
especially for DAGs; it can well be that substitutions and bit removals
cause old data to be present.
This converts one obvious candidate for the new interner-merging
functionality.  There are several others in the codebase (all uses of
`DAGCircuit::extend`, for example, and several other full rebuilds), but
the structure of their code wasn't designed with this in mind, so the
modifications would typically be far more involved and more suitable for
separate patches.

Using a timing script:

```python
        import itertools
        from qiskit.circuit import QuantumCircuit
from qiskit.converters import circuit_to_dag

qc = QuantumCircuit(100, 100)
for pair in itertools.permutations(qc.qubits, 2):
    qc.rz(0, pair[0])
    qc.sx(pair[0])
    qc.rz(0, pair[0])
    qc.rz(0, pair[1])
    qc.sx(pair[1])
    qc.rz(0, pair[1])
    qc.cx(*pair)

%timeit circuit_to_dag(qc, copy_operations=False)
```

this patch showed an improvement from 18.5(6)ms to 14.4(5)ms on my
machine, which is a 22% speedup, though admittedly this particular
function was doing larger-scale allocation work that could be removed
than other candidate replacements are.
@jakelishman jakelishman added performance Changelog: None Do not include in changelog Rust This PR or issue is related to Rust code in the repository labels Jan 28, 2025
@jakelishman jakelishman requested a review from a team as a code owner January 28, 2025 16:42
@qiskit-bot
Copy link
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

@jakelishman jakelishman changed the title Interner merge Add merging of Interners Jan 28, 2025
@coveralls
Copy link

coveralls commented Jan 28, 2025

Pull Request Test Coverage Report for Build 13041462412

Details

  • 123 of 253 (48.62%) changed or added relevant lines in 2 files are covered.
  • 5 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.1%) to 88.815%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/circuit/src/interner.rs 54 184 29.35%
Files with Coverage Reduction New Missed Lines %
crates/qasm2/src/lex.rs 1 92.98%
crates/accelerate/src/two_qubit_decompose.rs 1 92.08%
crates/circuit/src/circuit_data.rs 3 94.94%
Totals Coverage Status
Change from base Build 13010967771: -0.1%
Covered Lines: 79759
Relevant Lines: 89804

💛 - Coveralls

Copy link
Contributor

@kevinhartman kevinhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one definitely seems like a slam dunk to me 😄.

Most of my comments are just for clarification.

The FromIterator implementation for Interner<T> is interesting. If it were implemented for T instead, would there really be any downside to that? It seems like implementing for ::Owned would mean equivalent values would still need to be duplicated by the caller.

work.reserve(slice.len());
for value in slice {
let Some(scalar) = scalar_map_fn(value) else {
break 'slice None;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a feature to be careful of (feels like it could easily get spaghetti-ish), but when it works, it can really tidy up code. I'd originally tried to write this as a closure in order to use return in the same way (via ?), but it ended up really gross and I swapped to this.

// to allocate. This method is lower leveel, so in this case we ask them to tell us; if
// they're optimising to the point of re-using the return allocations, they probably have a
// good idea about the maximum slice size of the interner they'll be merging in.
let mut work = SmallVec::<[T; N]>::with_capacity(N);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very neat. From what I can tell, we only do a heap allocation if every scalar maps and the slice doesn't already exist? And the latter is possible because of Equivalent working for Vec<T> and &[T].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do heap allocations in two situations:

  • we have to insert into the interner (the situation you described) - this work thing here is a tacit assumption that that happens relatively rarely, though the cost of being wrong about that is very low in practice - it's a small memcpy.
  • we need more space for a slice than we allocated on the stack (so the SmallVec spills to the heap)

The Vec and the slice are Equivalent, but in this case, they also support PartialEq between borrows of each other, so it'd work in the default hashmaps without Equivalent too, I think.

// We're specifying the stack space here. This is just a guess, but it's not hugely
// important; we'll safely spill from the stack to the heap if needed, and this function is
// an API convenience at the cost of optimal allocation performance anyway.
self.merge_map_slice_using::<4>(other, scalar_map_fn, &mut out);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is stack space, are you choosing a low count like 4 for cache locality concerns?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sort of, though I didn't really put much thought into the exact value - it's unlikely to matter in this method, because of the cost of allocating the map anyway.

Some(ordered_vec)
// The `Vec::get` use is because an arbitrary interner might contain old references to
// bit instances beyond `num_qubits`, such as if it's from a DAG that had wires removed.
new_dag.merge_qargs(qc_data.qargs_interner(), |bit| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, the new_dag does not yet contain any entries (other than the default), unless I'm mistaken. So it's less of a "merge" here and more of an efficient clone that skips over now-dead keys in the source interner, correct?

Separately, I just noticed that DAGCircuit::with_capacity preallocates its interners to have num_qubits and num_clbits entries. I'm not sure if that was an intentional choice, or due to a misread of the docstring when DAGCircuit::with_capacity was added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah exactly, this particular merge isn't so much a merge as an efficient mapping. In other situations we might want it (like substitute_node_with_dag), it's much more of an actual merge-with-map.

Co-authored-by: Kevin Hartman <[email protected]>
@jakelishman
Copy link
Member Author

The FromIterator implementation for Interner<T> is interesting. If it were implemented for T instead, would there really be any downside to that? It seems like implementing for ::Owned would mean equivalent values would still need to be duplicated by the caller.

In general (and in almost all the cases we care about), T isn't Sized (it's usually [T]), so you can't hold direct values of it - you've got to indirect somehow. So then we have to choose either the borrowed type (e.g. &[T]) or the owned (e.g. Vec<T>), or something that can Into<<T as ToOwned>::Owned>. The FromIterator impl takes ownership of the values coming in, so we don't need to clone the values in the way I wrote it (we just move them into the IndexSet), but the caller of FromIterator might in order to produce the iterator.

In practice, I mostly wrote that impl to make the tests shorter to write, and mostly I thought that the chances were that if you wanted to use that method, you'd have the owned types anyway.

Copy link
Contributor

@raynelfss raynelfss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very neat idea that should hopefully make conversion between circuits much easier. I will try to do a more in depth review of this tomorrow. But what I've seen so far looks great!

Copy link
Contributor

@raynelfss raynelfss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a quick passthrough of the code, this looks like a very useful addition that I can appreciate. Other than that, the additions to the Interner methods are fairly welcome as these were issues I was running into myself, and had to, in come cases, add them too. I like everything being done here. LGTM!

I left a couple of questions and comments, would suggest that we also wait for @kevinhartman's in depth review.

@kevinhartman
Copy link
Contributor

kevinhartman commented Jan 29, 2025

would suggest that we also wait for @kevinhartman's in depth review

@raynelfss I've already reviewed to a suitable depth. Feel free to sign off when your comments are addressed 🙂.

Co-authored-by: Raynel Sanchez <[email protected]>
@jakelishman
Copy link
Member Author

Hate to be doing this but...

you'll be here forever if you check the spelling in my code comments haha - I try and write public documentation with American spelling, but I write comments straight from my head.

(I don't mind changing it to American, I just won't spot it myself easily.)

Copy link
Contributor

@raynelfss raynelfss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for adding these changes and making the Interner much more flexible. It will be a helpful addition to #13686 and others.

@raynelfss raynelfss added this pull request to the merge queue Jan 30, 2025
Merged via the queue into Qiskit:main with commit 315d6ad Jan 30, 2025
17 checks passed
@jakelishman jakelishman deleted the interner-merge branch January 30, 2025 15:44
raynelfss added a commit to raynelfss/qiskit that referenced this pull request Jan 30, 2025
- Remove stale methods `contains` and `get_interned` from `Interner` in favor of the changes introduced by Qiskit#13752.
emilkovacev pushed a commit to emilkovacev/qiskit that referenced this pull request Feb 7, 2025
* Add merging of `Interner` instances

This makes it possible to merge one interner into another, using a
map-and-filter function to explain how the values should be considered
afterwards.  The main use-case for this is in DAG and circuit
substitution methods, where we want to be able to map
`PackedInstruction`s from a small circuit to new ones on a larger
circuit, without having to unwrap and re-hash the interned slices once
for instruction; instead, we do that once _per interner key_, and then
have a no-hash direct lookup table to remap the interner keys directly.

The methods here are designed to directly allow the reuse of the heap
allocations used to track the interner key mappings, to help reduce the
number of small-scale allocations done in substitution methods.

The filtering part of the mapping is because one cannot typically be
certain that an `Interner` contains the minimally necessary keys,
especially for DAGs; it can well be that substitutions and bit removals
cause old data to be present.

* Use new `Interner` merging in `DAGCircuit::from_circuit`

This converts one obvious candidate for the new interner-merging
functionality.  There are several others in the codebase (all uses of
`DAGCircuit::extend`, for example, and several other full rebuilds), but
the structure of their code wasn't designed with this in mind, so the
modifications would typically be far more involved and more suitable for
separate patches.

Using a timing script:

```python
        import itertools
        from qiskit.circuit import QuantumCircuit
from qiskit.converters import circuit_to_dag

qc = QuantumCircuit(100, 100)
for pair in itertools.permutations(qc.qubits, 2):
    qc.rz(0, pair[0])
    qc.sx(pair[0])
    qc.rz(0, pair[0])
    qc.rz(0, pair[1])
    qc.sx(pair[1])
    qc.rz(0, pair[1])
    qc.cx(*pair)

%timeit circuit_to_dag(qc, copy_operations=False)
```

this patch showed an improvement from 18.5(6)ms to 14.4(5)ms on my
machine, which is a 22% speedup, though admittedly this particular
function was doing larger-scale allocation work that could be removed
than other candidate replacements are.

* Fix typo

Co-authored-by: Kevin Hartman <[email protected]>

* Fix typos

Co-authored-by: Raynel Sanchez <[email protected]>

---------

Co-authored-by: Kevin Hartman <[email protected]>
Co-authored-by: Raynel Sanchez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: None Do not include in changelog performance Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants