Special GC (Incremental GC with Central Object Table) #3894

Closed
wants to merge 292 commits

Conversation

@luc-blaeser (Contributor) commented Mar 22, 2023

Do not merge - this is only an experiment for comparison.

Special GC (Incremental GC with Central Object Table)

Incremental, generational, and compacting garbage collector based on a central indirection table for object addresses.

Objective: An incremental GC with a different design than the evacuation-compacting GC (#3837).

  • No partitioned heap like in the evacuation-compact GC.
  • Easily combinable with generational collection.
  • Simple in-place full-heap compaction.
  • No pointer update phase.

Properties:

  • Fast reclamation of short-lived objects with generational collection.
  • Full-heap incremental snapshot-at-the-beginning marking.
  • Incremental compaction enabled by a central object table.

Design

The key concept of this garbage collector is a central object table through which all references to an object are indirected. This allows atomic movement of objects without having to search for and update their incoming pointers. Based on this mechanism, the GC can perform incremental heap compaction, similar to a simple mark-and-compact GC.

Central Object Table

Each object has an object id that serves as its reference. All references are thus indirected via a central object table that maps an object id (object reference) to the corresponding object address.
This enables fast object movement in the incremental GC: only the address of the moved object needs to be updated in the table. Objects also carry their id in the header to allow fast reverse lookup, i.e. determining the object id for a given object address.
The table itself is allocated in the heap and can also be moved (e.g. when growing). A global pointer denotes the current table location. Allowing relocation of the table makes object address lookup more expensive (an extra indirection via the global object table pointer), but permits a significantly simpler implementation, since the table can grow at any time without other objects having to be moved.
The ids of reclaimed garbage objects are recycled in the object table using an inlined free list.
When running out of free object ids, the table is currently relocated to a new table of double size at a different location in the heap. Pre-amortized growth could be used in the future to obtain O(1) worst-case cost per insertion.
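The indirection, atomic object movement, and id recycling described above can be sketched as follows. This is a simplified model in Rust (the language of the Motoko RTS); the flat table, the address representation, and all names are illustrative, not the actual runtime code:

```rust
/// Sentinel marking the end of the inlined free list (illustrative).
const NO_FREE: usize = usize::MAX;

/// Simplified central object table: an entry holds either an object
/// address or, for a freed id, the next free id (inlined free list).
struct ObjectTable {
    entries: Vec<usize>, // object id -> object address
    free_head: usize,    // head of the inlined free list of recycled ids
}

impl ObjectTable {
    fn new() -> Self {
        ObjectTable { entries: Vec::new(), free_head: NO_FREE }
    }

    /// Assign a recycled or fresh id to a new object at `address`.
    fn insert(&mut self, address: usize) -> usize {
        if self.free_head != NO_FREE {
            let id = self.free_head;
            self.free_head = self.entries[id]; // pop the inlined free list
            self.entries[id] = address;
            id
        } else {
            self.entries.push(address); // in the RTS: may trigger table growth
            self.entries.len() - 1
        }
    }

    /// Dereference an id: the extra lookup every object access pays.
    fn address_of(&self, id: usize) -> usize {
        self.entries[id]
    }

    /// Atomically "move" an object by updating only its table entry.
    fn move_object(&mut self, id: usize, new_address: usize) {
        self.entries[id] = new_address;
    }

    /// Recycle the id of a reclaimed garbage object into the free list.
    fn free(&mut self, id: usize) {
        self.entries[id] = self.free_head;
        self.free_head = id;
    }
}
```

Note that `move_object` is O(1) regardless of how many incoming pointers reference the object, which is exactly what enables incremental compaction without a pointer update phase.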

Garbage Collection

Garbage collection is performed in two steps:

Young Generation Collection (Blocking)

The young generation collection runs non-incrementally, blocking the mutator. This is acceptable as the generation tends to be small and the GC work should adapt to the allocation rate.

It is scheduled:

  • Before each GC increment of the old generation, to simplify the incremental collection.
  • After a certain amount of new allocations, i.e. when the young generation has exceeded a size threshold.

Young generation collection aims at fast reclamation of short-lived objects, to reduce GC latency, which is particularly relevant in an incremental GC.

The young generation collection requires an extra root set of old-to-young pointers. Those pointers are caught by a write barrier and recorded in a remembered set. The remembered set lives in the young generation and is freed during the young generation collection.
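A post-update barrier of this kind can be sketched as follows, assuming a simplified heap model where the young generation occupies addresses at or above `young_start` (the names and the set-based remembered set are illustrative, not the RTS implementation):

```rust
use std::collections::HashSet;

/// Simplified remembered set: locations of old-generation slots that
/// currently hold pointers into the young generation.
struct RememberedSet {
    entries: HashSet<usize>,
}

/// Post-update write barrier: runs after a pointer store. Records the
/// written location only if it creates an old-to-young pointer, i.e. the
/// slot lies in the old generation and the value points into the young one.
fn post_update_barrier(
    remembered: &mut RememberedSet,
    location: usize,
    new_value: usize,
    young_start: usize,
) {
    if location < young_start && new_value >= young_start {
        remembered.entries.insert(location);
    }
}
```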

The compaction phase can use the simple object movement enabled by the central object table provided for the incremental GC.

Old Generation Collection (Incremental)

Old generation collection runs incrementally in multiple GC increments, where the mutator can resume work in between. It is scheduled to run after young generation collection, such that it can ignore generational aspects.

New objects are always allocated in the young generation first and only become visible to the incremental GC after they have been promoted to the old generation (i.e. survived the preceding young generation collection).

The incremental collection of the old generation is based on two phases:

  1. Mark: Incremental snapshot-at-the-beginning marking. The GC marks at least all objects that were reachable at the start of the incremental GC run. A pre-update write barrier detects relevant concurrent pointer overwrites by the mutator between GC increments and marks the corresponding objects to guarantee snapshot-at-the-beginning consistency. Concurrent allocations to the old generation (being promotions from the young generation) are conservatively marked.
  2. Compact: Incremental compaction supported by the central object table. After the mark phase, the marked objects of the old generation are compacted towards the bottom of the generation. Objects can easily be moved one by one, since all incoming references to a moved object can be atomically updated in the central object table. Concurrent allocations to the old generation during this phase (promotions from the young generation) are also marked, because they have to be retained and compacted during this incremental GC run.
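The compaction step can be illustrated on a simplified heap model (a vector of objects, each carrying its id and mark bit; illustrative names, not the RTS code). Moving a marked object only rewrites its entry in the object table, and unmarked objects yield their ids back for recycling:

```rust
/// Simplified heap object: its table id, mark bit, and payload.
struct Obj {
    id: usize,
    marked: bool,
    payload: Vec<u8>,
}

/// Compact the heap towards the bottom. `table` maps object ids to
/// simplified "addresses" (here: positions in the compacted heap).
/// Returns the compacted heap and the ids freed for recycling.
fn compact(heap: Vec<Obj>, table: &mut Vec<usize>) -> (Vec<Obj>, Vec<usize>) {
    let mut compacted = Vec::new();
    let mut freed_ids = Vec::new();
    for mut obj in heap {
        if obj.marked {
            obj.marked = false;              // reset mark for the next GC run
            table[obj.id] = compacted.len(); // atomic address update in the table
            compacted.push(obj);             // slide object towards the bottom
        } else {
            freed_ids.push(obj.id);          // garbage: recycle the object id
        }
    }
    (compacted, freed_ids)
}
```

This also illustrates why the compaction phase must visit unmarked objects: their ids have to be returned to the table's free list.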

GC increments are currently only scheduled on an empty call stack, such that pointers on the call stack can be ignored for the root set.

In-Header Mark Bit

The GC uses a mark bit in the object header instead of mark bitmaps, since there is no performance advantage in using a mark bitmap to skip garbage objects: the object ids of garbage objects need to be freed in the object table, so the compaction phase must visit all objects (marked and unmarked).

GC Configuration

  • Increment limit: Regular increment bounded to 4,000,000 steps (approximately 400 million instructions).
  • Young generation collection: When exceeding 8 MB young generation size.
  • Old generation collection: When the old generation has grown to more than 2.5 times its previous size or passed the critical limit of 3.5 GB (of the 4 GB heap size).

The configuration can be adjusted to tune the GC.

Measurement

The following results have been measured on the GC benchmark with dfx 0.12.1. Special denotes this new garbage collector, while Incremental here refers to the other incremental evacuation-compact GC (#3837).

The Copying, Compacting, and Generational GC are based on the original runtime system without the header extension (for the object id). No denotes disabled GC, based on the runtime system with the header extension. Measurement results are rounded to two significant figures.

Scalability

Summary: Like the other incremental GC, the special GC scales up to the full heap size.

Average amount of allocations for the benchmark limit cases, until reaching a limit (instruction limit, heap limit, dfx cycles limit).

| GC          | Avg. Allocation Limit |
| ----------- | --------------------- |
| Special     | 140e6                 |
| Incremental | 140e6                 |
| No          | 47e6                  |
| Generational | 33e6                 |
| Compacting  | 37e6                  |
| Copying     | 47e6                  |

Performance

Total number of instructions (mutator + GC), average across all benchmark cases:

| GC          | Avg. Total Instructions |
| ----------- | ----------------------- |
| Special     | 2.9e10                  |
| Incremental | 2.0e10                  |
| Generational | 1.9e10                 |
| Compacting  | 2.2e10                  |
| Copying     | 2.0e10                  |

45% slower than the other incremental and copying GC. 31% slower than the compacting GC. 52% slower than the basic generational GC.

Memory Size

Allocated WASM memory space, benchmark average:

| GC          | Avg. Memory Size |
| ----------- | ---------------- |
| Special     | 260 MB           |
| Incremental | 290 MB           |
| No          | 500 MB           |
| Generational | 190 MB          |
| Compacting  | 190 MB           |
| Copying     | 270 MB           |

12% smaller than the other incremental GC. 4% smaller than the copying GC. 37% larger than the compacting and generational GC.

Overheads

Additional mutator costs implied by the incremental GC:

  • Object table indirection
    • Dereferencing requires an extra lookup in the object table.
    • Two more instructions (add and load) for each field or array element access (and object header access).
  • Write barrier:
    • Pre-update barrier during the mark phase: Marking the target of overwritten pointers.
    • Post-update barrier: Recording old-to-young generation pointers in the remembered set.
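The pre-update barrier can be sketched on a simplified mark state (illustrative names; a real implementation would typically also push the newly marked object onto a mark stack for later tracing):

```rust
/// Simplified mark state: one mark flag per object id, plus a flag
/// indicating whether the incremental mark phase is currently active.
struct MarkState {
    marked: Vec<bool>,
    mark_phase_active: bool,
}

/// Pre-update (snapshot-at-the-beginning) barrier: before a pointer field
/// is overwritten during the mark phase, mark the old target so that the
/// object set reachable at the start of the GC run is fully marked.
fn pre_update_barrier(state: &mut MarkState, old_target: Option<usize>) {
    if state.mark_phase_active {
        if let Some(id) = old_target {
            state.marked[id] = true;
        }
    }
}

/// A pointer store instrumented with the barrier: intercept the overwrite
/// first, then perform the actual write.
fn write_field(state: &mut MarkState, field: &mut Option<usize>, new_value: Option<usize>) {
    pre_update_barrier(state, *field);
    *field = new_value;
}
```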

Around 45% mutator overhead for the object table indirection and write barrier.

Testing

The special GC is enabled by the --incremental-gc compiler flag (like the other incremental GC).

1. RTS unit tests

    In Motoko repo folder `rts`:
    ```
    make test
    ```

2. Motoko test cases

    In Motoko repo folder `test/run` and `test/run-drun`:
    ```
    export EXTRA_MOC_ARGS="--sanity-checks --incremental-gc"
    make
    ```

3. GC Benchmark cases

    In `gcbench` repo:
    ```
    ./measure-all.sh
    ```

Discussion

Advantage

  • Low-latency reclamation: Fast memory reclamation due to the use of generational collection on top of the incremental collection. No update phase is needed in the incremental GC.

Limitations

  • Object table growth: Currently, for implementation simplicity, the object table is extended by copying the entries to a double-sized copy, which causes an unbounded interruption of the incremental GC. This could be improved with a pre-amortized table growth implementation.
  • Object table shrinking: Due to fragmentation, the object table cannot easily be compacted without updating all references.
  • Indirection costs: The performance cost of the central object table indirection is noticeable, at around 45%.
  • Iterating objects: In contrast to GCs using a mark bitmap, garbage objects must also be visited to free their object ids.

Conclusion

The runtime costs of the central object table indirection are not extremely high but relevant at around 45%. Moreover, the growth of the central object table is non-trivial. Both incremental GCs have similar code size complexity at around 5,200 - 5,500 LOC (additions and deletions): while the special GC avoids the partitioned heap, it introduces the central object table.

Except for the higher memory footprint, the incremental evacuation-compact GC shows superior performance according to the benchmark.

Related PRs

@luc-blaeser luc-blaeser mentioned this pull request Mar 23, 2023
@github-actions

Comparing from 87f9371 to 49e47a0:
In terms of gas, 4 tests regressed and the mean change is +7.9%.
In terms of size, 4 tests regressed and the mean change is +9.6%.

@luc-blaeser luc-blaeser marked this pull request as ready for review March 23, 2023 13:57
@luc-blaeser luc-blaeser reopened this Mar 27, 2023
@luc-blaeser luc-blaeser closed this Apr 5, 2023
@crusso (Contributor) commented Apr 9, 2023

I wonder if this approach would be much more viable if we could just use a second, isolated memory for the object table. Unfortunately, the replica doesn't support this I think, but wasm does. Or use stable memory, once fast enough, but probably not efficient enough in combination with regions. @luc-blaeser

mergify bot pushed a commit that referenced this pull request May 12, 2023
### Incremental GC PR Stack
The Incremental GC is structured in three PRs to ease review:
1. #3837 **<-- this PR**
2. #3831
3. #3829

# Incremental GC

Incremental evacuating-compacting garbage collector.

**Objective**: Scalable memory management that allows full heap usage.

**Properties**:
* All GC pauses have bounded short time.
* Full-heap snapshot-at-the-beginning marking.
* Focus on reclaiming high-garbage partitions.
* Compacting heap space with partition evacuations.
* Incremental copying enabled by forwarding pointers.
* Using **mark bitmaps** instead of a mark bit in the object headers.
* Limiting number of evacuations on memory shortage.

## Design

The incremental GC distributes its workload across multiple steps, called increments, that each pause the mutator (user's program) for only a limited amount of time. As a result, the GC appears to run concurrently (although not in parallel) with the mutator and thus allows scalable heap usage, where the GC work fits within the instruction-limited IC messages.

Similar to the recent Java Shenandoah GC [1], the incremental GC organizes the heap in equally-sized partitions and selects high-garbage partitions for compaction by using incremental evacuation and the Brooks forwarding pointer technique [2].

The GC runs in three phases:
1. **Incremental Mark**: The GC performs full-heap incremental tri-color marking with snapshot-at-the-beginning consistency. For this purpose, write barriers intercept mutator pointer overwrites between GC mark increments. The target object of an overwritten pointer is thereby marked. Concurrent new object allocations are also conservatively marked. To remember the mark state per object, the GC uses partition-associated mark bitmaps that are temporarily allocated during a GC run. The phase additionally needs a mark stack, a growable linked list of tables in the heap that can be recycled as garbage during the active GC run. Full-heap marking has the advantage that it can also deal with arbitrarily large cyclic garbage, even if spread across multiple partitions. As a side activity, the mark phase also maintains the bookkeeping of the amount of live data per partition. Conservative snapshot-at-the-beginning marking and retaining new allocations are necessary because the WASM call stack cannot be inspected for the root set collection. Therefore, the mark phase must also only start on an empty call stack.

2. **Incremental Evacuation**: The GC prioritizes partitions with a larger amount of garbage for evacuation, based on the available free space. It also requires a defined minimum amount of garbage for a partition to be evacuated. Subsequently, marked objects inside the selected partitions are evacuated to free partitions and thereby compacted. To allow incremental object moving and incremental updating of pointers, each object carries redirection information in its header: a forwarding pointer, also called Brooks pointer. For non-moved objects, the forwarding pointer reflexively points back to the object itself, while for moved objects, the forwarding pointer refers to the new object location. Each object access and equality check has to be redirected via this forwarding pointer. During this phase, evacuated partitions are still retained and the original locations of evacuated objects are forwarded to their corresponding new object locations. Therefore, the mutator can continue to use old incoming pointers to evacuated objects.

3. **Incremental Updates**: All pointers to moved objects have to be updated before free space can be reclaimed. For this purpose, the GC performs a full-heap scan and updates all pointers in live objects to their forwarded addresses. As the mutator may perform concurrent pointer writes behind the update scan line, a write barrier catches such pointer writes and resolves them to the forwarded locations. The same applies to new object allocations that may have old pointer values in their initialized state (e.g. originating from the call stack). Once this phase is completed, all evacuated partitions are freed and can later be reused for new object allocations. At the same time, the GC also frees the mark bitmaps stored in temporary partitions. The update phase can only be completed when the call stack is empty, since the GC does not access the WASM stack. No remembered sets are maintained for tracking incoming pointers to partitions.
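The Brooks forwarding indirection of the evacuation phase can be sketched on a simplified heap model (objects addressed by index into a vector; illustrative names, not the RTS code). A non-moved object forwards to itself; an evacuated object forwards to its new copy:

```rust
/// Simplified heap object with a Brooks-style forwarding field.
struct FwdObj {
    forward: usize, // index of self, or of the new location after evacuation
    value: i64,
}

/// Every read goes through the forwarding pointer: one extra, unconditional
/// indirection per object access.
fn read_value(heap: &[FwdObj], obj: usize) -> i64 {
    heap[heap[obj].forward].value
}

/// Evacuate an object: copy it to a new location (here: the end of the
/// vector) and redirect the old copy's forwarding field to the new copy.
/// Old incoming pointers remain valid because reads follow the forward.
fn evacuate(heap: &mut Vec<FwdObj>, obj: usize) -> usize {
    let new_idx = heap.len();
    let value = heap[heap[obj].forward].value;
    heap.push(FwdObj { forward: new_idx, value }); // new copy forwards to itself
    heap[obj].forward = new_idx;                   // old copy forwards to the new one
    new_idx
}
```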

**Humongous objects**:
* Objects with a size larger than a partition require special handling: A sufficient amount of contiguous free partitions is searched and reserved for a large object. Large objects are not moved by the GC. Once they have become garbage (not marked by the GC), their hosting partitions are immediately freed. Both external and internal fragmentation can only occur for huge objects. Partitions storing large objects do not require a mark bitmap during the GC.
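Finding space for a humongous object amounts to searching for a run of contiguous free partitions, which can be sketched as follows (assuming a simple boolean free map over partitions; illustrative, not the RTS implementation):

```rust
/// Find the index of the first run of `needed` contiguous free partitions,
/// or None if no such run exists. `free[i]` is true if partition i is free.
fn find_contiguous_free(free: &[bool], needed: usize) -> Option<usize> {
    let mut run = 0;
    for (i, &is_free) in free.iter().enumerate() {
        if is_free {
            run += 1;
            if run == needed {
                return Some(i + 1 - needed); // start of the run
            }
        } else {
            run = 0; // run interrupted by an occupied partition
        }
    }
    None
}
```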

**Increment limit**:
* The GC maintains a synthetic deterministic clock by counting work steps, such as marking an object, copying a word, or updating a pointer. The clock serves to limit the duration of a GC increment: the GC increment is stopped whenever the limit is reached, such that the GC later resumes its work in a new increment. To keep within the limit for large objects as well, large arrays are marked and updated in incremental slices. Moreover, huge objects are never moved.
For simplicity, the GC increment is only triggered at the compiler-instrumented scheduling points when the call stack is empty. The increment limit is increased depending on the amount of concurrent allocations, to reduce the reclamation latency at a high allocation rate during garbage collection.
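The synthetic work-step clock can be sketched as follows. The base limit of 3,500,000 steps and the 20 extra steps per concurrent allocation mirror the configuration below; the struct and names are illustrative:

```rust
const INCREMENT_LIMIT: usize = 3_500_000; // base work steps per GC increment
const STEPS_PER_ALLOCATION: usize = 20;   // extra budget per concurrent allocation

/// Deterministic clock bounding one GC increment by counting work steps.
struct IncrementClock {
    steps: usize,
    limit: usize,
}

impl IncrementClock {
    /// Start a new increment; the limit grows with the number of
    /// allocations performed since the last increment.
    fn new(concurrent_allocations: usize) -> Self {
        IncrementClock {
            steps: 0,
            limit: INCREMENT_LIMIT + concurrent_allocations * STEPS_PER_ALLOCATION,
        }
    }

    /// Charge `work` steps (e.g. one marked object or one copied word).
    /// Returns false once the increment must stop and yield to the mutator.
    fn tick(&mut self, work: usize) -> bool {
        self.steps += work;
        self.steps < self.limit
    }
}
```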

**Memory shortage**:
* If memory is scarce during garbage collection, the GC limits the amount of evacuations to the available free space of free partitions. This prevents the GC from running out of memory while copying live objects to new partitions.

## Configuration

* **Partition size**: 32 MB.

* **Increment limit**: Regular increment bounded to 3,500,000 steps (approximately 600 million instructions). Each allocation during GC increases the next scheduled GC increment by 20 additional steps.

* **Survival threshold**: If 85% of a partition space is alive (marked), the partition is not evacuated.

* **GC start**: Scheduled when the growth (new allocations since the last GC run) accounts for more than 65% of the heap size. When passing the critical limit of 3.25GB (on the 4GB heap size), the GC is started already when the growth exceeds 1% of the heap size.

The configuration can be adjusted to tune the GC.

## Measurement

The following results have been measured on the GC benchmark with `dfx` 0.13.1. The `Copying`, `Compacting`, and `Generational` GC are based on the original runtime system ***without*** the forwarding pointer header extension. `No` denotes the disabled GC based on the runtime system ***with*** the forwarding pointer header extension. 

### Scalability

**Summary**: The incremental GC allows full 4GB heap usage without exceeding the message instruction limit. It therefore scales much higher than the existing stop-and-go GCs, and naturally also higher than running without a GC.

Average amount of allocations for the benchmark limit cases, until reaching a limit (instruction limit, heap limit, `dfx` cycles limit). Rounded to two significant figures.

| GC                | Avg. Allocation Limit   |
| ----------------- | ----------------------- |
| **Incremental**   | **150e6**               |
| No                | 47e6                    |
| Generational      | 33e6                    |
| Compacting        | 37e6                    |
| Copying           | 47e6                    |

3x higher than the other GCs and also than no GC.

Currently, the following limit benchmark cases do not reach the 4GB heap maximum due to GC-independent reasons:
* `buffer` applies exponential array list growth where the copying to the larger array exceeds the instruction limit.
* `rb-tree`, `trie-map`, and `btree-map` are so garbage-intensive that they run out of `dfx` cycles or suffer from a sudden `dfx` network connection interruption.

### GC Pauses

Longest GC pause, maximum of all benchmark cases:

| GC                | Longest GC Pause          |
| ----------------- | ------------------------- |
| **Incremental**   | **0.712e9**               |
| Generational      | 1.19e9                    |
| Compacting        | 8.41e9                    |
| Copying           | 5.90e9                    |

Shorter than all the other GCs.

### Performance

Total number of instructions (mutator + GC), average across all benchmark cases:

| GC                | Avg. Total Instructions | 
| ----------------- | ----------------------- | 
| **Incremental**   | **1.85e10**             | 
| Generational      | 1.91e10                 | 
| Compacting        | 2.20e10                 | 
| Copying           | 2.05e10                 | 

Faster than all the other GCs.

Mutator utilization on average:

| GC                | Avg. Mutator Utilization |
| ----------------- | ------------------------ |
| **Incremental**   | **94.6%**                |
| Generational      | 85.4%                    |
| Compacting        | 75.8%                    |
| Copying           | 78.7%                    |

Higher than the other GCs.

### Memory Size

Occupied heap size at the end of each benchmark case, average across all cases:

| GC                | Avg. Final Heap Occupation |
| ----------------- | -------------------------- |
| **Incremental**   | **176 MB**                 |
| No                | 497 MB                     |
| Generational      | 156 MB                     |
| Compacting        | 144 MB                     |
| Copying           | 144 MB                     |

Up to 22% higher than the other GCs.

Allocated WASM memory space, benchmark average:

| GC                | Avg. Memory Size        |
| ----------------- | ----------------------- |
| **Incremental**   | **296 MB**              |
| No                | 499 MB                  |
| Generational      | 191 MB                  |
| Compacting        | 188 MB                  |
| Copying           | 271 MB                  |

9% higher than the copying GC. 57% higher (worse) than the generational and the compacting GC.

## Overheads

Additional mutator costs implied by the incremental GC:
* **Write barrier**: 
    - During the mark and evacuation phase: Marking the target of overwritten pointers.
    - During the update phase: Resolving forwarding of written pointers.
* **Allocation barrier**:
    - During the mark and evacuation phase: Marking new allocated objects.
    - During the update phase: Resolving pointer forwarding in initialized objects.
* **Pointer forwarding**:
    - Indirect each object access and equality check via the forwarding pointer.

Runtime costs for the barrier are reported in #3831.
Runtime costs for the forwarding pointers are reported in #3829.

## Testing

1. RTS unit tests

    In Motoko repo folder `rts`:
    ```
    make test
    ```

2. Motoko test cases

    In Motoko repo folder `test/run` and `test/run-drun`:
    ```
    export EXTRA_MOC_ARGS="--sanity-checks --incremental-gc"
    make
    ```

3. GC Benchmark cases

    In `gcbench` repo: 
    ```
    ./measure-all.sh
    ```

4. Extensive memory sanity checks

    Adjust `Cargo.toml` in `rts/motoko-rts` folder:
    ```
    default = ["ic", "memory-check"]
    ```

    Run selected benchmark and test cases. Some of the tests will exceed the instruction limit due to the expensive checks.

## Extension to 64-Bit Heaps

The partition information would need to be stored dynamically instead of in a static allocation. For example, the information could be stored in a reserved space at the beginning of a partition (except if the partition contains static data or serves as an extension hosting a huge object). Apart from that, the GC should be portable and scalable to 64-bit memory without significant design changes.

## Design Alternatives

* **Free list**: See the prototype in #3678. The free-list-based incremental GC shows higher reclamation latency, slower performance (free list selection), and potentially higher external fragmentation (no compaction, just free neighbor merging).
* **Mark bit in object header**: See implementation in #3756. Storing the mark bit in the object header instead of using a mark bitmap saves memory space, but is more expensive for scanning sparsely marked partitions. Moreover, it increases the amount of dirty pages.
* **Remembered set**: Inter-partition pointers could be stored in a remembered set to allow more selective and faster pointer updates. However, the write barrier would become more expensive, as it would need to detect and record the relevant pointers, and the remembered set would occupy additional memory.
* **Allocation increments**: At a high allocation rate, the GC could also perform a short GC increment during an allocation. This design is however more complicated, as it prevents the compiler from storing low-level pointers on the stack across an allocation (e.g. during assignments or array tabulation). It is also slower than the current solution, where allocation increments are postponed to the next regularly scheduled GC increment, running when the call stack is empty.
* **Special incremental GC**: Analyzed in PR #3894. An incremental GC based on a central object table that allows easy object movement and incremental compaction. Compared to this PR, the special GC has 35% worse runtime performance.
* **Combining tag and forwarding pointer**: #3904. This seems to be less efficient than the Brooks pointer technique, with a runtime performance degradation of 27.5%, while offering only a small memory saving of around 2%.

## References

[1] C. H. Flood, R. Kennke, A. Dinn, A. Haley, and R. Westrelin. Shenandoah. An Open-Source Concurrent Compacting Garbage Collector for OpenJDK. Intl. Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, PPPJ'16, Lugano, Switzerland, August 2016.

[2] R. A. Brooks. Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware. ACM Symposium on LISP and Functional Programming, LFP'84, New York, NY, USA, 1984.