[Test] C-API #4

Draft
wants to merge 110 commits into
base: master
st0012 (Member) commented Feb 24, 2025

RBS C Parser Library Refactoring

This PR refactors the RBS parser and related components into a standalone C library that no longer depends on the Ruby runtime. This architectural change enables direct integration with static analysis tools like Sorbet while potentially improving performance.

Sorbet's RBS support already runs on this new architecture, and we haven't discovered any major issues with it.

Key Improvements

  • Ruby-Independent Implementation: Extracted the parser from the ext folder into a standalone C library with a clean API, so it can be embedded in non-Ruby tools (e.g. Sorbet, JRuby) without a Ruby runtime dependency
  • Enhanced Memory Management: Implemented an arena allocator to manage parser object lifecycles efficiently
  • Improved Architecture: Clear separation between public API (headers) and implementation
  • Performance: Potential performance gains from custom memory management and reduced overhead

Enhanced Memory Management

The arena allocator handles all memory for parser objects, including the parser itself, the lexer, the constant pool, and strings. When the parser is freed by calling `rbs_parser_free`, the allocator frees every object it allocated. This eliminates the need to free individual objects manually and reduces the risk of memory leaks.
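To make the ownership model concrete, here is a minimal sketch of a page-based arena allocator. It is not the RBS implementation: every name (`arena_t`, `arena_alloc`, `arena_free`, `ARENA_PAGE_SIZE`) is hypothetical, and the page size is an arbitrary choice for illustration. The point it shows is that callers never free individual objects; a single call releases every page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is an illustration, not the RBS API. */
#define ARENA_PAGE_SIZE 4096

typedef struct arena_page {
    /* Aligning the header keeps the payload that follows it max-aligned. */
    _Alignas(max_align_t) struct arena_page *next;
    size_t used;
    size_t capacity;
} arena_page_t;

typedef struct {
    arena_page_t *head; /* most recently created page */
} arena_t;

static arena_page_t *arena_page_new(size_t capacity, arena_page_t *next) {
    arena_page_t *page = malloc(sizeof(arena_page_t) + capacity);
    if (page == NULL) abort();
    page->next = next;
    page->used = 0;
    page->capacity = capacity;
    return page;
}

/* Bump-allocates `size` bytes; starts a new page when the current one is full. */
void *arena_alloc(arena_t *arena, size_t size) {
    const size_t align = _Alignof(max_align_t);
    size = (size + align - 1) & ~(align - 1);

    arena_page_t *page = arena->head;
    if (page == NULL || page->capacity - page->used < size) {
        size_t capacity = size > ARENA_PAGE_SIZE ? size : ARENA_PAGE_SIZE;
        page = arena_page_new(capacity, arena->head);
        arena->head = page;
    }

    void *ptr = (unsigned char *)(page + 1) + page->used;
    page->used += size;
    return ptr;
}

/* Releases every page, and with it every object allocated from the arena. */
void arena_free(arena_t *arena) {
    arena_page_t *page = arena->head;
    while (page != NULL) {
        arena_page_t *next = page->next;
        free(page);
        page = next;
    }
    arena->head = NULL;
}
```

In this sketch a parser would own one `arena_t`, allocate its lexer, constant pool, and strings from it, and call `arena_free` once when it is torn down.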

Component Architecture

graph TD
    RubyClient[Ruby Client] --> RubyAPI[Ruby API]
    CClient[C Client] --> CAPI[C API]

    RubyAPI --> CExtension[C Extension]
    CExtension --> CLibrary
    CAPI --> CLibrary

    subgraph CLibrary[C Library]
        subgraph Parser1[Parser Instance 1]
            direction TB
            ConstantPool1[Constant Pool]
            Lexer1[Lexer]
            ArenaAllocator1[Arena Allocator]
        end

        subgraph Parser2[Parser Instance 2]
            direction TB
            ConstantPool2[Constant Pool]
            Lexer2[Lexer]
            ArenaAllocator2[Arena Allocator]
        end
    end

    subgraph "Public API"
        RubyAPI
        CAPI
    end

    %% Parser1 --> ConstantPool1
    %% Parser1 --> Lexer1
    %% Parser1 --> ArenaAllocator1
    %% Parser2 --> ConstantPool2
    %% Parser2 --> Lexer2
    %% Parser2 --> ArenaAllocator2

st0012 (Member, Author) commented Feb 25, 2025

I'll leave this PR open because ruby/rbs only runs CI on pull requests, so we need this PR to have CI constantly running against changes in c-api.

amomchilov and others added 27 commits March 13, 2025 20:55
Initial template for C structs

Use allocator in node constructors
Signed-off-by: Alexandre Terrasa <[email protected]>

Add linked list implementation

Signed-off-by: Alexandre Terrasa <[email protected]>

Type `Class#super_class` field

Signed-off-by: Alexandre Terrasa <[email protected]>

Type fields of `RBS::Types::Block`

Signed-off-by: Alexandre Terrasa <[email protected]>

Type `block` fields

Signed-off-by: Alexandre Terrasa <[email protected]>

Type `RBS::Types::Proc#self_type` field

Signed-off-by: Alexandre Terrasa <[email protected]>

Refactor `parse_function`

Signed-off-by: Alexandre Terrasa <[email protected]>

Copy value in `rbs_struct_to_ruby_value`

Remove usages of `rbs_loc` from `parser.c`

Extract `rbs_location.h`

Migrate `RBS::Types::Function::Param` fields

Signed-off-by: Alexandre Terrasa <[email protected]>

Type `RBS::Types::UntypedFunction` fields

Signed-off-by: Alexandre Terrasa <[email protected]>

Type fields of `RBS::AST::TypeParam`

Signed-off-by: Alexandre Terrasa <[email protected]>

Type some more fields of `RBS::AST::Members::Attr`

Signed-off-by: Alexandre Terrasa <[email protected]>

Type fields in `RBS::AST::Members::MethodDefinition`

Signed-off-by: Alexandre Terrasa <[email protected]>

Type `RBS::AST::Directives::Use::SingleClause#new_name`

Signed-off-by: Alexandre Terrasa <[email protected]>

Type `RBS::Namespace#absolute`

Signed-off-by: Alexandre Terrasa <[email protected]>

Temporary handle nil types

Signed-off-by: Alexandre Terrasa <[email protected]>

Handle `bool` type

Signed-off-by: Alexandre Terrasa <[email protected]>

Type all fields of `RBS::Types::Variable`

Signed-off-by: Alexandre Terrasa <[email protected]>

Migrate `RBS::TypeName`

Signed-off-by: Alexandre Terrasa <[email protected]>

Migrate `parse_use_clauses`

Signed-off-by: Alexandre Terrasa <[email protected]>

Migrate `class_instance_name`

Signed-off-by: Alexandre Terrasa <[email protected]>

Handle overloads as a rbs_node_list

Signed-off-by: Alexandre Terrasa <[email protected]>

Remove more `builds_ruby_object_internally` flags

Signed-off-by: Alexandre Terrasa <[email protected]>

Invert `builds_ruby_object_internally` default value

Signed-off-by: Alexandre Terrasa <[email protected]>

Introduce `rbs_location_t`

Signed-off-by: Alexandre Terrasa <[email protected]>

Store C structs instead of Ruby `VALUE`s

Introduce +rbs_ast_symbol_t and migrate to it

Signed-off-by: Alexandre Terrasa <[email protected]>

Remove ZzzTmpNotImplemented node

Signed-off-by: Alexandre Terrasa <[email protected]>

Remove one more instance of EMPTY_ARRAY

Signed-off-by: Alexandre Terrasa <[email protected]>

Migrate from VALUE array to rbs_node_list_t

Signed-off-by: Alexandre Terrasa <[email protected]>

Migrate `method_params` from taking a VALUE arrays

Signed-off-by: Alexandre Terrasa <[email protected]>

Migrate `parse_type_list` from taking a VALUE array

Signed-off-by: Alexandre Terrasa <[email protected]>

Forward all C-typed params as-is

Get types on constructor params

Handle mix of C types and Ruby VALUE

Move Ruby object construction into `new` functions

Conditionally construct `ruby_value` internally

Type Attr* field `ivar_name`

Signed-off-by: Alexandre Terrasa <[email protected]>

Add `AST::Bool`

Signed-off-by: Alexandre Terrasa <[email protected]>

Use two less VALUE values

Signed-off-by: Alexandre Terrasa <[email protected]>

Use more instance of `bool`

Signed-off-by: Alexandre Terrasa <[email protected]>

Add Hash implementation

Signed-off-by: Alexandre Terrasa <[email protected]>

Use C hash for `check_key_duplication`

Signed-off-by: Alexandre Terrasa <[email protected]>

Use C hash to represent Record fields

Signed-off-by: Alexandre Terrasa <[email protected]>

Migrate `memo` to using a C hash

Signed-off-by: Alexandre Terrasa <[email protected]>

Uses C hashes for keyword parameters

Signed-off-by: Alexandre Terrasa <[email protected]>

Remove parser call to `todo!`

Signed-off-by: Alexandre Terrasa <[email protected]>

Remove calls to `rbs_struct_to_ruby_value`

Signed-off-by: Alexandre Terrasa <[email protected]>

TMP symbol

Signed-off-by: Alexandre Terrasa <[email protected]>

Replace 2 fake nodes by one

Signed-off-by: Alexandre Terrasa <[email protected]>

Set fields for `Record::FieldType`

Signed-off-by: Alexandre Terrasa <[email protected]>

Make comment use a `rbs_ast_comment_t` instead of a `VALUE`

Signed-off-by: Alexandre Terrasa <[email protected]>

Add `rbs_ast_string_t`

Add `rbs_ast_integer_t`

Migrate `literal` to store C nodes

Remove `cached_ruby_string`

Remove useless templating stuff

Signed-off-by: Alexandre Terrasa <[email protected]>

Remove `cached_ruby_value` from `rbs_node_list`

Signed-off-by: Alexandre Terrasa <[email protected]>

Remove `cached_ruby_value` from `rbs_hash`

Signed-off-by: Alexandre Terrasa <[email protected]>

Add `rbs_string`, and use it for annotations

Add `rbs_ast_symbol_t` to model symbols in the AST

Co-Authored-By: Alexander Momchilov <[email protected]>
And rename it to `class_constants` to disambiguate it from `rbs_constant_id`, `rbs_constant_pool`, etc.
Signed-off-by: Alexandre Terrasa <[email protected]>

Do not create comments using a VALUE

Use a rbs_string instead

Signed-off-by: Alexandre Terrasa <[email protected]>
st0012 and others added 9 commits March 13, 2025 21:46
* Remove unnecessary re-allocations from child-adding functions

We always pre-allocate enough space for a location's children with
`rbs_loc_alloc_children` in `parser.c`. So we should never hit the
case where we need to re-allocate the children array.

This change removes the unnecessary re-allocations and makes the code
simpler.

* Update src/rbs_location.c

Co-authored-by: Alexander Momchilov <[email protected]>

---------

Co-authored-by: Alexander Momchilov <[email protected]>
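The "pre-allocate once, never re-allocate" idea from the commit message above can be sketched as follows. The types and function names here (`location_t`, `loc_alloc_children`, `loc_add_child`) are simplified, hypothetical stand-ins rather than the actual definitions in `src/rbs_location.c`, and the sketch uses `malloc` where the library would use its allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for a location and its child ranges. */
typedef struct {
    int start;
    int end;
} range_t;

typedef struct {
    size_t len;
    size_t cap;
    range_t entries[]; /* flexible array member, sized at allocation time */
} children_t;

typedef struct {
    range_t range;
    children_t *children;
} location_t;

/* Sizes the children array once, up front. */
void loc_alloc_children(location_t *loc, size_t cap) {
    loc->children = malloc(sizeof(children_t) + cap * sizeof(range_t));
    if (loc->children == NULL) abort();
    loc->children->len = 0;
    loc->children->cap = cap;
}

/* Adding a child only asserts there is room; there is no growth path. */
void loc_add_child(location_t *loc, range_t child) {
    assert(loc->children != NULL && "children must be pre-allocated");
    assert(loc->children->len < loc->children->cap && "capacity is fixed up front");
    loc->children->entries[loc->children->len++] = child;
}
```

Because the caller always knows how many children a node has before adding them, the add path stays a bounds assertion plus a store.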
* Allocate parser error with allocator

* Allocate parse result with allocator
Since we don't manually allocate/free strings anymore, we don't need the string type enum
and all the complexity that comes with it.
* Remove unnecessary rbs_buffer_init_with_capacity function

* Manage buffer allocation with allocator

* Set default capacity for buffer to 128

Most comments are less than 128 bytes, so this should help reduce memory waste.

See #13 (comment)
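A rough sketch of why a 128-byte default helps: a buffer that starts at that capacity absorbs a typical short comment without growing and only doubles for unusually long input. The names (`buffer_t`, `buffer_init`, `buffer_append`) are hypothetical, and the sketch grows with `realloc` for brevity, whereas the commits above move buffer storage onto the parser's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch; the real buffer is backed by the parser's allocator. */
#define BUFFER_DEFAULT_CAPACITY 128

typedef struct {
    char *value;
    size_t length;
    size_t capacity;
} buffer_t;

void buffer_init(buffer_t *buffer) {
    buffer->value = malloc(BUFFER_DEFAULT_CAPACITY);
    if (buffer->value == NULL) abort();
    buffer->length = 0;
    buffer->capacity = BUFFER_DEFAULT_CAPACITY;
}

/* Appends `n` bytes, doubling the capacity only when the data does not fit. */
void buffer_append(buffer_t *buffer, const char *src, size_t n) {
    size_t needed = buffer->length + n;
    if (needed > buffer->capacity) {
        size_t capacity = buffer->capacity;
        while (capacity < needed) capacity *= 2;
        char *grown = realloc(buffer->value, capacity);
        if (grown == NULL) abort();
        buffer->value = grown;
        buffer->capacity = capacity;
    }
    memcpy(buffer->value + buffer->length, src, n);
    buffer->length += n;
}
```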
Add `rbs_allocator_alloc_many` to avoid unnecessary zero-initialization of memory
### Avoid declaring unused `node` variables in `rbs_node_destroy`'s template

Such declarations break Sorbet's build, which treats unused variables as errors:

```
prism/templates/src/ast.c.erb:203:31: error: unused variable 'node' [-Werror,-Wunused-variable]
        rbs_ast_annotation_t *node = (rbs_ast_annotation_t *)any_node;
                              ^
prism/templates/src/ast.c.erb:203:28: error: unused variable 'node' [-Werror,-Wunused-variable]
        rbs_ast_comment_t *node = (rbs_ast_comment_t *)any_node;
                           ^
prism/templates/src/ast.c.erb:203:28: error: unused variable 'node' [-Werror,-Wunused-variable]
        rbs_ast_integer_t *node = (rbs_ast_integer_t *)any_node;
                           ^
prism/templates/src/ast.c.erb:203:27: error: unused variable 'node' [-Werror,-Wunused-variable]
        rbs_ast_string_t *node = (rbs_ast_string_t *)any_node;
                          ^
```

### Use `size_t` instead of `int` for `capacity` in `rbs_loc_alloc_children`

This fixes a warning about comparing `int` and `size_t` in `rbs_loc_alloc_children`'s assertion.

### Avoid declaring the `max` variable that's just used in assertions

When the `assert` is stripped out, the variable becomes unused and triggers a warning.

```
external/rbs_parser/src/rbs_location.c:8:10: warning: unused variable 'max' [-Wunused-variable]
  size_t max = sizeof(rbs_loc_entry_bitmap) * 8;
         ^
```
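The assertion-only variable problem and its fix can be illustrated with a small sketch. `entry_bitmap_t` is a hypothetical stand-in for `rbs_loc_entry_bitmap`, and `checked_capacity` is an invented function; only the pattern matters. Using `size_t` for `capacity` also keeps the comparison between operands of the same type, which is the point of the `int` to `size_t` change described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t entry_bitmap_t; /* hypothetical stand-in for rbs_loc_entry_bitmap */

/*
 * Before: `max` is only read inside assert(), so a build with -DNDEBUG
 * leaves it unused and triggers -Wunused-variable.
 *
 *     size_t max = sizeof(entry_bitmap_t) * 8;
 *     assert(capacity <= max);
 */
size_t checked_capacity(size_t capacity) {
    /* After: the bound lives inside the assertion, so nothing is left
     * behind when assert() compiles away. */
    assert(capacity <= sizeof(entry_bitmap_t) * 8 && "capacity exceeds bitmap width");
    return capacity;
}
```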
st0012 and others added 20 commits March 17, 2025 17:21
… some cleanups (#26)

Part of Shopify/team-ruby-dx#1436

In addition to merging the files, I also did some cleanups:
- Remove an unused function declaration
- Make helper functions `static` when they aren't referenced outside the C
library and don't look like public API
- Clean up a leftover from #25

I recommend reviewing commit by commit.
- `parserstate` -> `rbs_parser_t`
    - Variable names `state` -> `parser`
- `comment` -> `rbs_comment_t`
- `error` -> `rbs_error_t`
These names are too generic and can easily conflict in projects. New
names are:

- `rbs_position_t`
- `rbs_range_t`
- `rbs_token_t`
… files (#32)

Camel case names in `config.yml`, like `MethodDefinition`, are currently
generated as `methoddefinition`, which is not ideal. This PR updates
`template.rb` so that they're generated as `method_definition` instead.

I also changed `rbs_typename` to `rbs_type_name`, since the old name had
the same problem.

No Ruby files are touched, so this shouldn't be a breaking change for end users.
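The naming rule described above (insert an underscore at each lower-to-upper boundary, then lowercase everything) can be sketched in a few lines. The actual change lives in the Ruby template generator, so this C function, `camel_to_snake`, is purely a hypothetical illustration of the rule and makes no claim about how runs of capitals such as acronyms are handled there.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes the snake_case form of `name` into `out`.
 * Returns false if `out` (of `out_size` bytes) is too small. */
bool camel_to_snake(const char *name, char *out, size_t out_size) {
    size_t j = 0;
    for (size_t i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];
        unsigned char prev = i > 0 ? (unsigned char)name[i - 1] : 0;

        /* A word boundary is an uppercase letter that follows a lowercase
         * letter or a digit. */
        if (isupper(c) && (islower(prev) || isdigit(prev))) {
            if (j + 1 >= out_size) return false;
            out[j++] = '_';
        }

        if (j + 1 >= out_size) return false;
        out[j++] = (char)tolower(c);
    }
    if (out_size == 0) return false;
    out[j] = '\0';
    return true;
}
```

With this rule, `MethodDefinition` becomes `method_definition` and `TypeName` becomes `type_name`.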
Assuming the top-level `include/rbs` holds public components (e.g.
`parser.h`), and `include/rbs/util` holds internal components, more
stuff should be moved under `util`.

This follows the same convention `prism` uses.

### Before

```
include
├── rbs
│   ├── ast.h
│   ├── defines.h
│   ├── encoding.h
│   ├── lexer.h
│   ├── parser.h
│   ├── rbs_buffer.h
│   ├── rbs_encoding.h
│   ├── rbs_location.h
│   ├── rbs_location_internals.h
│   ├── rbs_string.h
│   ├── rbs_strncasecmp.h
│   ├── rbs_unescape.h
│   └── util
│       ├── rbs_allocator.h
│       ├── rbs_assert.h
│       └── rbs_constant_pool.h
└── rbs.h
```

### After

```
include
├── rbs
│   ├── ast.h
│   ├── defines.h
│   ├── lexer.h
│   ├── parser.h
│   ├── rbs_location.h
│   ├── rbs_string.h
│   └── util
│       ├── rbs_allocator.h
│       ├── rbs_assert.h
│       ├── rbs_buffer.h
│       ├── rbs_constant_pool.h
│       ├── rbs_encoding.h
│       └── rbs_unescape.h
└── rbs.h
```
…g.*` (#34)

This aligns with the naming convention of Prism:

- Private headers are prefixed with `rbs_` and placed under
`include/rbs/util/`.
- Public headers are placed under `include/rbs/` without the `rbs_`
prefix.
1. `lexerstate` should be `rbs_lexer_t` instead
2. Params representing `rbs_lexer_t` should be named `lexer` instead of
`state`
3. `rbsparser_next_token` should be called `rbs_lexer_next_token`

After this PR, all structs and functions we want to expose should have
the `rbs_` prefix.
`is_power_of_two` was only used in `assert` macros, so it only needed to be
defined in debug mode. But since `rbs_assert` is a function, `is_power_of_two`
now always needs to be defined.

Without this change, Sorbet fails to build with `c-api`.
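A small sketch of why the helper can no longer sit behind a debug-only guard: the standard `assert` macro removes its argument under `NDEBUG`, but a function-style assertion is an ordinary call whose arguments are compiled in every build. `my_assert` and `align_up` are hypothetical names, and the signature of `is_power_of_two` is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical function-style assertion, standing in for rbs_assert. */
static void my_assert(bool condition, const char *message) {
    if (!condition) {
        fprintf(stderr, "assertion failed: %s\n", message);
        abort();
    }
}

/* Must be defined unconditionally: the call below is compiled whether or
 * not NDEBUG is set, so hiding this behind `#ifndef NDEBUG` would break
 * release builds. */
static bool is_power_of_two(size_t value) {
    return value != 0 && (value & (value - 1)) == 0;
}

/* Rounds `size` up to the next multiple of `alignment`. */
size_t align_up(size_t size, size_t alignment) {
    my_assert(is_power_of_two(alignment), "alignment must be a power of two");
    return (size + alignment - 1) & ~(alignment - 1);
}
```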
Now we can change the macro (back) to `&parser->allocator` if need be.
This has the handy effect of making allocations nearly free while
unfortunately having the side effect of crashing your process if you
write more than the arena size. However, if you are allocating more than
4 GiB, you likely have other problems.
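The trade-off described above can be sketched as a bump allocator over one large up-front reservation. The commit message does not say how the region is obtained, so the use of `mmap` here is an assumption, as are all names (`bump_arena_t`, `bump_arena_init`, `bump_arena_alloc`, `bump_arena_free`) and the 16-byte rounding. Allocation is a pointer increment with no bounds check, which is what makes it cheap and also why running past the reservation would typically fault.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* exposes MAP_ANONYMOUS on glibc under strict -std= modes */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical sketch; assumes a POSIX system with anonymous mmap. */
typedef struct {
    unsigned char *base;
    size_t used;
    size_t reserved;
} bump_arena_t;

/* Reserves `reserve` bytes of address space in one call. */
bool bump_arena_init(bump_arena_t *arena, size_t reserve) {
    void *base = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return false;
    arena->base = base;
    arena->used = 0;
    arena->reserved = reserve;
    return true;
}

/* Allocation is a pointer bump. There is deliberately no bounds check:
 * exceeding the reservation is treated as a fatal condition. */
void *bump_arena_alloc(bump_arena_t *arena, size_t size) {
    size = (size + 15) & ~(size_t)15;
    void *ptr = arena->base + arena->used;
    arena->used += size;
    return ptr;
}

/* Returns the whole reservation to the operating system at once. */
void bump_arena_free(bump_arena_t *arena) {
    munmap(arena->base, arena->reserved);
    arena->base = NULL;
    arena->used = 0;
    arena->reserved = 0;
}
```

The usage below reserves 64 MiB rather than 4 GiB only to keep the example friendly to constrained environments: `bump_arena_init(&arena, (size_t)64 << 20)`.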
`rbs_node_destroy`, `rbs_hash_free`, and `rbs_node_list_free` only call
each other recursively and contain no real freeing logic.

This is the result of previous efforts to allocate all nodes on the
arena. So we don't need these functions anymore.

Discovered while working on #41
4 participants