AGS 4: RTTI for AGS script #1259

Closed
ivan-mogilko opened this issue Apr 14, 2021 · 8 comments
Labels: ags 4, context: script compiler, type: enhancement


@ivan-mogilko

ivan-mogilko commented Apr 14, 2021

NOTE: RTTI stands for "RunTime Type Information".

Overview

When compiled, AGS script currently loses all information about types. This is tolerable with regular structs, because their handling is fully controlled by the compiler, and the interpreter may rely on its instructions. But when it comes to managed structs, the lack of knowledge on the engine side restricts their functionality and use.
The related problems include, but may not be limited to:

  • Managed pointers in managed structs. While the engine handles managed struct disposal, it cannot decrease the refcount of other managed objects referenced from that struct, simply because it does not know that the struct has these pointers, nor at which offsets in its data. This is why pointers in managed structs were disabled: to prevent memory leaks.
  • Dynamic pointer casting. Currently only a child-to-parent cast works; a parent-to-child cast cannot work because there's no way to tell what type the pointed object really is.
  • C++/C#-style virtual struct methods cannot be implemented: as there's no type info, there are no virtual tables and no place to store them. (C-style virtual tables are a separate issue, as these would first of all require a function pointer type in script, or another way to reference functions.)
  • Debugging user objects would become easier if we knew what they are.
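
To illustrate the dynamic casting point, here is a minimal sketch of how a parent-to-child cast could be checked once each object carries a type ID and each type record names its parent. The `TypeInfo` layout, the `kNoParent` sentinel and the `IsInstanceOf` function are assumptions made up for this example; they are not taken from the engine.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sentinel meaning "this type has no parent".
constexpr uint32_t kNoParent = UINT32_MAX;

// Hypothetical, simplified type record: only what a cast check needs.
struct TypeInfo {
    uint32_t parent; // ID of the parent type, or kNoParent
};

// Returns true if an object whose real type is `actual` may be viewed
// as `target`, i.e. `target` is `actual` itself or one of its ancestors.
bool IsInstanceOf(const std::vector<TypeInfo> &types,
                  uint32_t actual, uint32_t target) {
    for (uint32_t id = actual; id != kNoParent; id = types[id].parent) {
        assert(id < types.size());
        if (id == target)
            return true;
    }
    return false;
}
```

A cast from a base pointer to an extended type would then succeed only when `IsInstanceOf(types, objectTypeId, requestedTypeId)` holds; otherwise the script could receive a null pointer.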

While each of the above problems could be solved individually by its own workaround, my belief is that RTTI would provide a more consistent and complete solution.

RTTI may come in the form of a table containing information about types. It's generated and written by the compiler and/or an accompanying tool. This information may at first be limited to our basic needs and expanded later (thus, when serialized, it also has to carry a format number).
Each type comes with a convenient ID, and managed objects would store this type ID when created. Thus, when the engine needs to know something about an object's type, it would use this ID to retrieve the type from the table.

Several years ago we had a conversation about this with other developers, and the following are the points we came to as a result.


Building the RTTI table.

Before we know how to use it or what information to store there, the first question is how to build it.

When the compiler compiles a script unit, it gathers all the types visible to that script, along with the information on struct contents which lets it calculate memory offsets for all operations. So we already have this info from the perspective of a single compiled unit. Let's say we append this information as an extra segment to the compiled script format. Let's say each table entry would have a numeric type ID which we could pass into the "new" operator for user objects.
This sounds fine at first, but having just local tables per script is not going to be sufficient:

  • Type duplication: if we rely on local tables, we will have the same types included as many times as there are script units, in separate tables of course.
  • The engine will not be able to know which of those are the same type. So if there's any operation that requires type comparison, it won't work between objects allocated by different scripts.
  • There may be situations where the script that allocated an object is no longer in memory while the object still is. This may happen if there's a global pointer to a base type, and a room script creates an instance of an extended type and assigns it to that pointer.
    This problem on its own could be worked around by keeping type tables separate from script data in the engine, so that when a script is unloaded the type data remains in memory.

One could speculate that, because the order in which script headers are currently included in scripts is always fixed, the order of types (and their numeric indexes) will also always be fixed. So it seems like we could treat equal local type indexes found in multiple scripts as referring to the same type.
In truth that is not correct, as scripts may have types declared inside their bodies, and then their indexes would conflict with types declared in script headers following them in the script list.
But even if the above did work, in the long term this would not be a good solution, because it would impose limitations on how scripts are used. Consider what happens if we change the way script headers are included, especially with stand-alone compiler tools: that would break the whole system.

This means that we cannot rely on local type tables alone and need a global type table.
But how may it be constructed if the compiler knows only about the one script it compiles at a time?

If I remember correctly, we already have a similar mechanism for function binding. Function binding works this way: every script has a list of functions it knows / uses, and when writing call ops it puts their local index in the byte code. The engine uses this local function index to look up the function name in the local table, under which name the function is registered in the global table.
Perhaps we could use the same solution for types. Each compiled script module would have a local table of types which maps a local type index to something that defines its global key. So we would use the local numeric ID to get a global (e.g. string) ID from the script's table of types, and this global ID is then used to find the type description in the global table.

What could be the unique global type ID? For instance, it could be a string constructed from a pair of a "scope name" (header/script name?) and a type name.
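
The local-to-global mapping described above can be sketched roughly as follows. This is only an illustration under assumptions: `GlobalTypeTable`, `Intern` and `LinkScriptTypes` are hypothetical names, and the `"scope::Type"` key format is just one possible way to combine a scope name with a type name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical global table keyed by a qualified name such as "Header.ash::Base".
struct GlobalTypeTable {
    std::vector<std::string> names;                   // global ID -> qualified name
    std::unordered_map<std::string, uint32_t> lookup; // qualified name -> global ID

    // Returns the existing global ID for a name, or registers a new one.
    uint32_t Intern(const std::string &qualifiedName) {
        auto it = lookup.find(qualifiedName);
        if (it != lookup.end())
            return it->second;
        assert(names.size() < UINT32_MAX);
        uint32_t id = static_cast<uint32_t>(names.size());
        names.push_back(qualifiedName);
        lookup.emplace(qualifiedName, id);
        return id;
    }
};

// Called when a script is loaded: `localTypes[i]` is the global key of the
// script's local type i. The result maps local index -> global ID.
std::vector<uint32_t> LinkScriptTypes(GlobalTypeTable &global,
                                      const std::vector<std::string> &localTypes) {
    std::vector<uint32_t> localToGlobal;
    localToGlobal.reserve(localTypes.size());
    for (const std::string &name : localTypes)
        localToGlobal.push_back(global.Intern(name));
    return localToGlobal;
}
```

Two scripts that both include the same header type end up with the same global ID for it, even if their local indexes differ, which is what type comparison across scripts needs.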

Now, how is the global table built? I think there may be two approaches here.

  1. The first is simply building one as we load individual script tables. The engine is capable of doing this. Obviously, objects may be allocated only by loaded scripts, so we will always have their type infos ready.
    The downside is that each compiled unit will contain the full info of all the included types. I don't think it will be large in practice, but it is still worth noting.
  2. An alternative is to have a separate tool that would do the same thing as a post-compilation step, and write this global table into a special file, called e.g. script.rtti. This would provide a full type table for the whole game, ready to be read at once.
    In such a case, individual compiled units will only have the ID mapping in their tables, and it's the global table file that would contain the type descriptions.
    The upside of this is that we have a type table for the whole project at once, and no data duplication.
    The downside, though, is that we would need extra handling for this file inside a project, need to know when and how to update it, and so forth.

So even if method 2 may be beneficial as an option, I guess we'd rather try method 1 first.
It may also be possible to set up a compiler switch to toggle between saving full type info and ID mapping only, if we'd like to support method 2 in the future.


Using Type IDs in script (bytecode)

Assuming we have a local type table, where entries are identified with numeric IDs, we may pass these IDs as an argument when creating a user object. The current SCMD_NEWUSEROBJECT command accepts 1 arg, which is the object size. Because object size may also be a part of type info, there are various ways we may go from here:

  • Keep SCMD_NEWUSEROBJECT but interpret its argument as either size or type ID depending on a "script format" or something similar.
  • Introduce a new updated command, e.g. SCMD_NEWUSEROBJECT2, which accepts a type ID. This will work faster (no "if" branch), but adds an extra opcode.

Reading the old discussion on this topic, I also found a curious proposal to try to merge SCMD_NEWUSEROBJECT and SCMD_NEWARRAY into one new opcode. Because "dynamic array of T" may also be considered a type on its own and written into the type table, type info could distinguish "plain" types and "array" types. Or the array may be indicated by a flag in the object itself.
If either of these seems feasible and convenient, then we'll only need one command for allocating anything managed in script.
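
As a rough sketch of the second option (a separate opcode taking a type ID), the interpreter's handling might look like the code below. The opcode numbers are placeholders and not the engine's real values; `SCMD_NEWUSEROBJECT2` is the name proposed above, while `UserObject`, `TypeInfo`, `kNoType` and `ExecNew` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Placeholder opcode values for illustration only.
enum Opcode : uint32_t {
    SCMD_NEWUSEROBJECT  = 1, // arg: object size in bytes (current behaviour)
    SCMD_NEWUSEROBJECT2 = 2  // arg: local type ID (proposed)
};

constexpr uint32_t kNoType = UINT32_MAX; // hypothetical "no type info" marker

struct TypeInfo {
    uint32_t size; // instance size in bytes
};

struct UserObject {
    uint32_t typeId;           // global type ID, or kNoType for legacy objects
    std::vector<uint8_t> data; // zero-initialized instance data
};

UserObject ExecNew(Opcode op, uint32_t arg,
                   const std::vector<uint32_t> &localToGlobal,
                   const std::vector<TypeInfo> &types) {
    switch (op) {
    case SCMD_NEWUSEROBJECT:
        // Legacy path: the argument is the size, the type stays unknown.
        return UserObject{kNoType, std::vector<uint8_t>(arg)};
    case SCMD_NEWUSEROBJECT2: {
        // New path: resolve the script-local ID, take the size from type info.
        assert(arg < localToGlobal.size());
        uint32_t globalId = localToGlobal[arg];
        assert(globalId < types.size());
        return UserObject{globalId, std::vector<uint8_t>(types[globalId].size)};
    }
    }
    assert(false && "unknown opcode");
    return UserObject{kNoType, {}};
}
```

Keeping the two opcodes separate means the legacy path stays untouched, and no per-call check of the script format is required.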


Type Info in the engine

As described above, the engine would construct the global type table either by adding entries as it loads various scripts, or by reading a single file.

Supposing SCMD_NEWUSEROBJECT or an additional command has a local type ID as an argument, the engine creates a user object (currently the ScriptUserObject struct) storing three members: the global type info ID / index (or a pointer to the type info), the size, and a buffer for instance data.

The necessary contents of type info are dictated by what we want to use it for. For example, if we'd like to be able to release managed pointers, we'll need at least a list of offsets of those pointers (handles).
How the engine will deal with recursive object release, or with circular dependencies, is I believe another topic entirely and may be discussed separately.
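
A minimal sketch of the "list of offsets" idea follows, assuming that a managed pointer is stored in instance data as a 32-bit handle and that handle 0 means null. `TypeInfo`, `UserObject`, `ManagedPool` and `ReleaseReferences` are hypothetical stand-ins for this example, not the engine's actual classes, and the sketch deliberately does not address recursive release or circular references.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

struct TypeInfo {
    uint32_t size;
    std::vector<uint32_t> handleOffsets; // byte offsets of managed handles
};

struct UserObject {
    uint32_t typeId;           // index into the global type table
    std::vector<uint8_t> data; // instance data
};

// Toy stand-in for a managed pool: handle -> reference count.
struct ManagedPool {
    std::unordered_map<int32_t, int> refCount;

    void SubRef(int32_t handle) {
        if (handle == 0)
            return; // null pointer, nothing to release
        auto it = refCount.find(handle);
        assert(it != refCount.end());
        if (--it->second == 0)
            refCount.erase(it); // a real pool would dispose the object here
    }
};

// On disposal, drop one reference for every handle stored inside the object.
void ReleaseReferences(const UserObject &obj,
                       const std::vector<TypeInfo> &types, ManagedPool &pool) {
    assert(obj.typeId < types.size());
    for (uint32_t off : types[obj.typeId].handleOffsets) {
        assert(off + sizeof(int32_t) <= obj.data.size());
        int32_t handle;
        std::memcpy(&handle, obj.data.data() + off, sizeof(handle));
        pool.SubRef(handle);
    }
}
```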

If I missed or forgot any potential problems - these may be added to this ticket as we realize them.

@ivan-mogilko ivan-mogilko added type: enhancement a suggestion or necessity to have something improved ags 4 related to the ags4 development context: script compiler labels Apr 14, 2021
@ivan-mogilko ivan-mogilko changed the title AGS4: RTTI for AGS script AGS 4: RTTI for AGS script Apr 14, 2021
@rofl0r

rofl0r commented Apr 15, 2021

> If I remember correctly, we already have similar mechanism for function binding.

indeed, if you refer to the same thing i think you're talking about, this is the fixup mechanism used by compiler and engine. it's basically what a linker traditionally does to create a final executable from a bunch of object files (in toolchain lingo this is called relocation, not fixup). from your description it sounds likely this could be used for the RTTI table too.

> Introduce new updated command, e.g. SCMD_NEWUSEROBJECT2 which accepts type ID. This will work faster (no if switch), but has an extra opcode.

sounds like the cleaner way. since ags opcodes are 32bit, and only ~70 used so far, we still have about 4 billion possible opcodes available...

@fernewelten

fernewelten commented Apr 15, 2021

> this is the fixup mechanism used by compiler and engine. it's basically what traditionally a linker does,

Yes, if we create dedicated code to manage, consolidate and distribute the function, variable and type information that is collected by the individual compiler runs -- then we create a "linker" (in the modern meaning of the word).

It's the concept of a linker, even if it doesn't run in one go or as a separate phase.

@fernewelten

fernewelten commented Apr 15, 2021

> Because "dynamic array of T" may also be considered a type on its own, and written into type table, then type info could distinguish "plain" types and "array" types.

That's what the new compiler does internally. It has atomic types, such as int, and builds on them with modifiers such as "dynamic array of", "dynamic pointer to", and "constant" to form composite types. A "struct" is also a composite type, built from a sequence of named component types. The component types can be composite in their turn.

Then, the individual variables are assigned a type. All the characteristics are stored with the type and not with the variable. For instance, size: A variable has a size only indirectly. Instead, it has a type and that type has a size.

The number of elements of a static array is kept with the array type, too, not with the array name. For instance, the variable "arr" has the type, e.g. "int[20]", that has the number of elements of 20. So from the vantage point of the compiler, the static array is completely subsumed by the general type concept. There's no such thing as a special array variable. It's just one of the cases of a variable.

However, with dynamic arrays the number of elements can change at runtime. We can't make the number of elements a part of a dynamic array type because that can change lots of times. So in my opinion we need an opcode to specify just how many elements we want to request.

@ivan-mogilko

> However, with dynamic arrays the number of elements can change at runtime. We can't make the number of elements a part of a dynamic array type because that can change lots of times. So in my opinion we need an opcode to specify just how many elements we want to request.

True, I guess if the new opcode includes array creation, then we need at least 2 args: the type and the number of elements (we may also add the type size for a tiny bit of extra speed, but size is also a part of the type, so that's optional).

@ivan-mogilko

ivan-mogilko commented Jul 15, 2021

This is a very very dirty draft, but technically... it's working!
https://github.com/ivan-mogilko/ags-refactoring/commits/experiment--rtti

(the type table is written by the old compiler for now)

Tested by repeatedly allocating and disposing very big managed structs containing pointers to more big managed structs.
With this code enabled it seems to work well.
With this code disabled the program quickly runs out of memory.

I think some cases may still not be working, such as when a managed struct contains a regular struct which in turn contains a managed pointer. But it's a matter of completing this properly.

@ivan-mogilko

ivan-mogilko commented Feb 1, 2023

The proposed RTTI format, based on my older experimental branch and the comments to commits in that branch (branch link).

The table is designed with multiple potential uses and possible expansion in mind, not only the issue of nested managed pointers.

As per rofl0r's suggestion, I tried to design the table in such a way that general entries have a fixed size. This supposedly allows simpler parsing and a faster jump to a particular index in the table, without having to read through the whole data. But as a consequence this requires storing variable-length data, such as type names (strings) and member lists, in separate tables, while the entries only store indexes that reference these. (This is similar to how script strings are stored, for instance.) Such an approach may seem complicated at first but, in my opinion, is quite viable once you learn how it works. It also allows for much easier expansion, especially if we need to add something completely new.

RTTI header

| field | type / size | comment |
|---|---|---|
| format | uint32 | for expanding the rtti format |
| header size | uint32 | size in bytes of a header (counting from "format" field, until header ends) |
| full rtti size | uint32 | size in bytes of a rtti data (counting from "format" field, until data ends) |
| type entry size | uint32 | fixed size of a type's description in bytes (may depend on format) |
| types table offset | uint32 | a relative position of a types table (in file) |
| num types | uint32 | total number of type entries |
| member entry size | uint32 | fixed size of a type member's description in bytes (may depend on format) |
| members table offset | uint32 | a relative position of a type members table (in file) |
| num type members | uint32 | total number of all members in all types |
| string table offset | uint32 | a relative position of a strings table (in file) |
| string table size | uint32 | total size of a string table, in bytes |
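
For illustration, the header above maps naturally onto a plain struct of eleven 32-bit fields. The sketch below reads and sanity-checks such a header from a byte buffer. It assumes little-endian storage and that all offsets are relative to the "format" field (the start of the RTTI data); neither is stated explicitly above, and `RttiHeader`, `ReadU32LE` and `ReadRttiHeader` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the header fields listed above, in the same order.
struct RttiHeader {
    uint32_t format, headerSize, fullSize;
    uint32_t typeEntrySize, typesOffset, numTypes;
    uint32_t memberEntrySize, membersOffset, numMembers;
    uint32_t stringsOffset, stringsSize;
};

constexpr size_t kHeaderFields = 11;

// Assumption: values are stored little-endian.
uint32_t ReadU32LE(const uint8_t *p) {
    assert(p != nullptr);
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Reads the header from the start of `buf`; returns false if the buffer is
// too short or any table would fall outside the declared RTTI data.
bool ReadRttiHeader(const std::vector<uint8_t> &buf, RttiHeader &out) {
    if (buf.size() < kHeaderFields * 4)
        return false;
    uint32_t f[kHeaderFields];
    for (size_t i = 0; i < kHeaderFields; ++i)
        f[i] = ReadU32LE(buf.data() + i * 4);
    out = RttiHeader{f[0], f[1], f[2], f[3], f[4], f[5],
                     f[6], f[7], f[8], f[9], f[10]};

    // 64-bit arithmetic avoids overflow when multiplying counts by sizes.
    auto fits = [&out](uint64_t offset, uint64_t size) {
        return offset >= out.headerSize && offset + size <= out.fullSize;
    };
    return out.headerSize >= kHeaderFields * 4
        && out.fullSize <= buf.size()
        && fits(out.typesOffset, uint64_t(out.numTypes) * out.typeEntrySize)
        && fits(out.membersOffset, uint64_t(out.numMembers) * out.memberEntrySize)
        && fits(out.stringsOffset, out.stringsSize);
}
```

Because the entry sizes are stored in the header, a reader that knows only an older format could still step over larger entries written by a newer one.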

RTTI tables

| field | type / size | comment |
|---|---|---|
| type table | num types * type entry size | see "Type description" below |
| members table | num type members * member entry size | see "Member description" below |
| string table | string table size | all RTTI null-terminated strings packed in a single array (separated by 0s) |

Type description

| field | type / size | comment |
|---|---|---|
| fully qualified name | uint32 | an offset of a name in a string table |
| local id | uint32 | local ID of this type |
| parent type | uint32 | local type ID; 0? if no parent |
| type flags | uint32 | may contain helper flags which simplify analyzing this type |
| size | uint32 | in bytes |
| num members | uint32 | number of member fields this type has, 0 if none |
| member table index | uint32 | index of the first member in members table |

Member description

| field | type / size | comment |
|---|---|---|
| offset | uint32 | relative offset of this field, in bytes |
| name | uint32 | an offset of a name in a string table |
| type | uint32 | this field's local type ID |
| type flags | uint32 | may contain helper flags which simplify analyzing this member |
| num elements | uint32 | number of (array) elements; also see comment 1 below! |

So, there is a "Types" table, a "Fields/Members" table and a "Strings" table. Types and Members may reference each other using indexes and type IDs, and may reference the Strings table using offsets.

How is this data read? Suppose you read a Type entry, and it has:

  • the fully qualified name's offset in the string table, let's call it NameOffset;
  • the member table index and the number of members, let's call these MemberIndex and MemberNum.

After you read the above, you know that you need to go to MemberIndex in the member table, or jump to the (MemberIndex * member entry size) offset from the member table start if you're still parsing from a stream. Then read MemberNum consecutive entries from there.

Similarly, to query the actual name string, you go to NameOffset from the string table start and read a null-terminated string from there.

The engine (and any other tools that work with this data) will have options on how to organize the final rtti storage in memory. It may keep the data split into 3 tables and keep using cross-reference indexes, or it may use a nested structure, with type members and strings inside the type descriptions; whatever seems more convenient.
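
The lookup steps described above can be sketched as follows, working on the raw members and string tables held in memory. The field order follows the "Member description" table; the little-endian assumption and the names `RttiBlob`, `Member`, `ReadName` and `ReadMembers` are hypothetical choices for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct RttiBlob {
    std::vector<uint8_t> members; // raw members table
    uint32_t memberEntrySize;     // taken from the header
    std::vector<char> strings;    // raw string table
};

struct Member {
    uint32_t offset;
    std::string name;
    uint32_t type, flags, numElements;
};

// Assumption: values are stored little-endian.
uint32_t ReadU32LE(const uint8_t *p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Reads a null-terminated string starting at `nameOffset`, never running
// past the end of the string table.
std::string ReadName(const RttiBlob &rtti, uint32_t nameOffset) {
    assert(nameOffset < rtti.strings.size());
    auto first = rtti.strings.begin() + nameOffset;
    auto last = std::find(first, rtti.strings.end(), '\0');
    return std::string(first, last);
}

// Jumps to (memberIndex * member entry size) and reads memberNum entries.
std::vector<Member> ReadMembers(const RttiBlob &rtti,
                                uint32_t memberIndex, uint32_t memberNum) {
    std::vector<Member> result;
    for (uint32_t i = 0; i < memberNum; ++i) {
        size_t pos = (size_t(memberIndex) + i) * rtti.memberEntrySize;
        assert(pos + 5 * 4 <= rtti.members.size());
        const uint8_t *p = rtti.members.data() + pos;
        result.push_back(Member{ReadU32LE(p), ReadName(rtti, ReadU32LE(p + 4)),
                                ReadU32LE(p + 8), ReadU32LE(p + 12),
                                ReadU32LE(p + 16)});
    }
    return result;
}
```

This sketch produces the nested form (names resolved into each member); an engine could equally keep the three tables as they are and resolve the indexes on demand.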


Comments.

  1. In regards to describing a non-dynamic array member of a struct (meaning that a struct has a regular array inside, e.g. int arr[10];): originally my intention was to describe these as type + number of elements (e.g. type: int, num elems: 10).
    But according to @fernewelten, his new compiler does not store information like that; instead it merges the size of the array into the type (virtually a separate entry for each of int[5], int[66], int[100], and so forth, if I understand correctly). Personally, I believe that, for the purposes of using the rtti table in the engine, it may be more convenient to store the "raw type" and any other qualifiers separately. That would require deconstructing the compiler's type description into a type and a number of elements when writing rtti.

@ivan-mogilko

ivan-mogilko commented Feb 7, 2023

Preliminary branch that gathers rtti in both the old and the new compiler:
https://github.com/ivan-mogilko/ags-refactoring/tree/ags4--rtti

There are a few things that could probably be optimized, but I'll deal with that later, after opening a proper PR draft.

In regards to scanning for managed pointers, the following is currently done:

  • plain fields;
  • nested regular structs;

missing:

  • nested regular arrays (either of pointers or of regular structs).
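
As an illustration of what a complete scan might involve, including the missing array case, the sketch below walks a type's members recursively and collects the byte offsets of all managed handles. It assumes a managed pointer occupies a 4-byte handle, that a hypothetical `kManagedPtr` member flag marks such fields, and that "num elements" of 0 means "not an array"; none of these are confirmed by the format description above, and the function is not taken from the branch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum MemberFlags : uint32_t {
    kManagedPtr = 1 // hypothetical flag: this member is a managed pointer
};

struct Member {
    uint32_t offset;      // byte offset inside the owning struct
    uint32_t type;        // type ID of the member
    uint32_t flags;
    uint32_t numElements; // assumption: 0 means "not an array"
};

struct TypeInfo {
    uint32_t size;
    std::vector<Member> members; // empty for primitive types
};

// Appends to `out` the offsets (relative to the outermost object) of all
// managed handles found in `typeId`, descending into nested regular structs
// and regular arrays of either pointers or structs.
void CollectHandleOffsets(const std::vector<TypeInfo> &types, uint32_t typeId,
                          uint32_t base, std::vector<uint32_t> &out) {
    assert(typeId < types.size());
    for (const Member &m : types[typeId].members) {
        assert(m.type < types.size());
        const bool isPtr = (m.flags & kManagedPtr) != 0;
        const uint32_t count = m.numElements ? m.numElements : 1;
        const uint32_t stride = isPtr ? 4 : types[m.type].size;
        for (uint32_t i = 0; i < count; ++i) {
            const uint32_t at = base + m.offset + i * stride;
            if (isPtr)
                out.push_back(at);
            else
                CollectHandleOffsets(types, m.type, at, out);
        }
    }
}
```

The resulting flat list could be computed once per type when the table is loaded, so that object disposal only has to iterate over precomputed offsets.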

@ivan-mogilko

RTTI generation was implemented by #1922.

From now on it's a matter of extending this information and using it in the engine as necessary.
