sizeof and lea operations #84

Open
KOLANICH opened this issue Jan 15, 2017 · 17 comments
@KOLANICH

KOLANICH commented Jan 15, 2017

It would be nice to have a sizeof operator for getting the amount of space occupied by a set of structures, and a lea operator for getting the offsets of properties, so that relative addressing modes become possible.
The specification is:
lea() returns the offset of the struct that forms the current scope, i.e. the stream position where parsing of the value of this type started.
lea(property_name) returns the offset of the named property. It is either the stream position where parsing of the property started, or a value predicted from the known sizes of the already parsed fields.
sizeof() returns the size of the struct that forms the current scope, as known at that moment. If the size is unknown, it is a compile error. The size is known when any of the following is satisfied:

  • the type is fixed size
  • the property size is stored in some variable and can be accessed and modified (modification can have serialization implications, such as truncating arrays)
  • all the subproperties have known size

sizeof(property_name) returns the size of the named property.
sizeof(property_name1, property_name2) is equivalent to
lea(property_name2) - lea(property_name1) + sizeof(property_name2), where lea(property_name2) >= lea(property_name1) must hold.
An example:
seq:
 - id: str
   type: strz
   encoding: ASCII
 - id: revision
   type: u1
 - id: len
   type: u1
 - id: struct_offset0
   type: u2
 - id: struct_offset1
   type: u2
 - id: data
   size: len - sizeof(str, struct_offset1)  #equivalently len - sizeof(str) - sizeof(revision) - sizeof(len) -sizeof(struct_offset0) - sizeof(struct_offset1)
instances:
  str0:
    type: superstruct
    pos: lea(data)+struct_offset0
  str1:
    type: superstruct
    pos: lea(data)+struct_offset1

It creates a struct of size len with a variable-length string at the beginning. Then it places two superstructs in data, addressed relative to the beginning of data. Variable-length string is taken as an example of a case where sizes and offsets are not defined on compile time.
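
To make the arithmetic of the example concrete, here is a minimal sketch in Java that redoes it by hand for one input. Nothing in it is Kaitai Struct API: the class LeaSizeofSketch and the methods fieldOffsets and sizeofRange are hypothetical names, and the 16-byte buffer is made-up test data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Kaitai Struct: it only redoes the offset
// arithmetic of the example seq by hand for one concrete input.
final class LeaSizeofSketch {
    // Offsets of the fields in declaration order:
    // str, revision, len, struct_offset0, struct_offset1, data.
    static int[] fieldOffsets(byte[] buf) {
        int pos = 0;
        int[] lea = new int[6];
        lea[0] = pos;                 // str: strz, ends after the 0 byte
        while (buf[pos] != 0) pos++;
        pos++;                        // the terminator belongs to str
        lea[1] = pos; pos += 1;       // revision: u1
        lea[2] = pos; pos += 1;       // len: u1
        lea[3] = pos; pos += 2;       // struct_offset0: u2
        lea[4] = pos; pos += 2;       // struct_offset1: u2
        lea[5] = pos;                 // data starts here
        return lea;
    }

    // sizeof(first, last) = lea(last) - lea(first) + sizeof(last)
    static int sizeofRange(int leaFirst, int leaLast, int sizeofLast) {
        if (leaLast < leaFirst) {
            throw new IllegalArgumentException("last field precedes first field");
        }
        return leaLast - leaFirst + sizeofLast;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        byte[] s = "abc".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(s, 0, buf, 0, s.length); // buf[3] stays 0 and terminates str
        buf[5] = 16;                              // len: size of the whole struct

        int[] lea = fieldOffsets(buf);
        int header = sizeofRange(lea[0], lea[4], 2); // sizeof(str, struct_offset1)
        System.out.println("sizeof(str, struct_offset1) = " + header);
        System.out.println("size of data = " + (buf[5] - header));
    }
}
```

With the string "abc" the header occupies 10 bytes, so data gets the remaining 6 of the 16 bytes given by len. A longer string would shift every later offset, which is why these values cannot be compile-time constants here.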

@KOLANICH
Author

KOLANICH commented Aug 20, 2017

@GreyCat, when can I expect this to land? I'm implementing a ksy for a format where a payload sits in the middle of fixed-size fields, and the size of that payload is defined by the container. Surely we cannot use "size: eos" here.

@koczkatamas
Member

Could you show the file format? I think this can be worked around by splitting the type into two different subtypes.

@KOLANICH
Author

KOLANICH commented Aug 21, 2017

Here is the spec for the payload. It is a separate type whose size is defined by the size of the field it is placed in. It can be placed into different types, so I'd rather not use hacks like _parent.payload_len. In the Python source the size is also calculated from the length of the payload buffer.

@GreyCat
Member

GreyCat commented Aug 21, 2017

There are many different proposals here, so let's break it into some parts. First of all, sizeof. I totally agree that we need some functionality akin to C-like sizeof, but there are some important differences in KS vs traditional C-like implementation:

  • Not every structure and/or data type in KS would have a definite fixed size, unlike C
  • In C, there are actually two different kinds of sizeof operator: sizeof some_var vs sizeof char. The first kind takes a value expression as its argument, the second one takes a type (actually, a type expression). Your proposal revolves around the first kind, and even extends it, but in KS even that could result in undefined behavior, e.g. when asking for things like sizeof(1), sizeof(some_member + 1), sizeof(some_member * 2.0), sizeof(x) where x is type: b26, sizeof(true), etc.
  • We don't have global functions, and I don't really like the idea of introducing them. Actually, even languages like C implement sizeof as a special operator due to its duality and special compile-time definition.

So, I can propose that we'd start from the basic sizeof. I can see that we could implement 2 basic forms:

  • using type name (dynamic type size, non-user type => error)
  • using expression (calculated expression type, boolean type, non-fixed type size, IO stream type => error)

Given that we support addressing in both bytes and bits, probably we'd need two implementations of these, like bitsize and bytesize.

I have a relatively easy syntax proposal for "using expression": do something like some_stuff_here._size and some_stuff_here._bits, akin to the existing size and proposed bits (see #112) YAML keys.

I have no idea right now on how to implement type-based sizeof operator, except for introducing yet another keyword, which is probably doable, but definitely not very pretty...

@KOLANICH
Author

KOLANICH commented Aug 21, 2017

Do we really need type-based syntax? I mean if the field of the type is never used, we don't need its size, if it is used, we can use expression-based syntax.

Updated the proposal a bit with impl details.

@GreyCat
Member

GreyCat commented Aug 21, 2017

Do we really need type-based syntax? I mean if the field of the type is never used, we don't need its size, if it is used, we can use expression-based syntax.

Well, yes and no. Sometimes it's just very inconvenient to go through all that _parent._parent.blah.blah.blah stuff to name a particular element to take the size of. Sometimes (and that's yet another question to decide) you want the size of the basic structure by itself, and your only use of the structure is in an array, i.e.:

seq:
  - id: things
    type: thing
    repeat: eos

things._size would probably return the size of the whole array, but you want the size of one individual element.

@GreyCat
Member

GreyCat commented Aug 23, 2017

I've just committed a preliminary implementation of a precompile stage that calculates the fixed size of all types and members, where that's possible.

This already improved Graphviz output a lot: now it correctly prints out offsets/sizes table even for one fixed sized structure embedded into the other, i.e. stuff like BMP, Blender .blend, cramfs, etc.

Also, it obviously opens the way for sizeof / lea implementation.
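
As an illustration of what such a fixed-size pass has to decide, here is a minimal sketch in Java. It is not the compiler's actual code: the class FixedSizeSketch and its methods are hypothetical, and it assumes the simplest possible rule, namely that a sequence has a fixed size only when every member does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical model, not the actual kaitai-struct-compiler implementation.
final class FixedSizeSketch {
    // A member either has a size known at compile time (in bytes) or does not.
    static OptionalInt fixed(int bytes) { return OptionalInt.of(bytes); }
    static OptionalInt dynamic() { return OptionalInt.empty(); }

    // A sequence has a fixed size only if every member has one.
    static OptionalInt sizeOfSeq(List<OptionalInt> members) {
        int total = 0;
        for (OptionalInt m : members) {
            if (!m.isPresent()) return OptionalInt.empty();
            total += m.getAsInt();
        }
        return OptionalInt.of(total);
    }

    public static void main(String[] args) {
        // u1 + u1 + u2 + u2
        OptionalInt header = sizeOfSeq(Arrays.asList(fixed(1), fixed(1), fixed(2), fixed(2)));
        // A fixed-size type embedded into another one keeps the outer size fixed.
        OptionalInt outer = sizeOfSeq(Arrays.asList(header, fixed(4)));
        // A single strz member makes the whole sequence size unknown at compile time.
        OptionalInt withStrz = sizeOfSeq(Arrays.asList(dynamic(), header));
        System.out.println(header + " " + outer + " " + withStrz);
    }
}
```

A real pass would also have to account for bit-sized fields, if conditions, repetitions and type switching, all of which this sketch ignores.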

@bsagal

bsagal commented Nov 28, 2018

If the type is not fixed size, could you add the possibility of getting the min and max size of the type?

@KOLANICH
Author

@bsagal, it is a more complex derivation. Could you show use cases for it when the actual size used by a field is not enough?

@bsagal

bsagal commented Nov 28, 2018

@KOLANICH I have non-fixed-size structs and would like to add logic to check that the size of the stream is valid before starting the parsing
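
One way such a check could look, as a rough sketch in Java under stated assumptions: every member contributes a [min, max] byte range, ranges of consecutive members add up, and a stream is plausible only if its size falls inside the total range. The class SizeRangeSketch and its methods are hypothetical and do not exist in Kaitai Struct.

```java
// Hypothetical sketch of min/max size bookkeeping; not a Kaitai Struct feature.
final class SizeRangeSketch {
    static final long UNBOUNDED = Long.MAX_VALUE; // marks "no upper limit"

    final long min;
    final long max;

    SizeRangeSketch(long min, long max) {
        if (min < 0 || max < min) throw new IllegalArgumentException("invalid range");
        this.min = min;
        this.max = max;
    }

    static SizeRangeSketch fixed(long bytes) { return new SizeRangeSketch(bytes, bytes); }
    static SizeRangeSketch atLeast(long bytes) { return new SizeRangeSketch(bytes, UNBOUNDED); }

    // Range of this member followed by the next one (overflow is not handled here).
    SizeRangeSketch then(SizeRangeSketch next) {
        long newMax = (max == UNBOUNDED || next.max == UNBOUNDED) ? UNBOUNDED : max + next.max;
        return new SizeRangeSketch(min + next.min, newMax);
    }

    // True if a stream of this size could hold the structure at all.
    boolean admits(long streamSize) {
        return streamSize >= min && streamSize <= max;
    }

    public static void main(String[] args) {
        // A terminated string needs at least its terminator byte, then 6 fixed bytes follow.
        SizeRangeSketch record = atLeast(1).then(fixed(1)).then(fixed(1)).then(fixed(2)).then(fixed(2));
        System.out.println(record.admits(6));
        System.out.println(record.admits(7));
    }
}
```

In this example a 6-byte stream is rejected before any parsing starts, while 7 bytes or more pass the check.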

@KOLANICH
Author

I have non-fixed-size structs and would like to add logic to check that the size of the stream is valid before starting the parsing

I guess it should be done by the code generated by KSC, not by a ksy dev.

@GreyCat
Member

GreyCat commented Apr 25, 2019

Implemented type-based sizeof in byte/bit flavors, as special keyword operator:

sizeof<some_type>
bitsizeof<some_type>
sizeof<foo::bar::baz>
bitsizeof<foo::bar::baz>

See expr_sizeof_type_1 for example.

@GreyCat
Member

GreyCat commented Apr 25, 2019

Also implemented value-based sizeof, using virtual _sizeof attribute/operator: see expr_sizeof_value_0 for examples.

@KOLANICH
Author

Implemented type-based sizeof in byte/bit flavors, as special keyword operator
Also implemented value-based sizeof, using virtual _sizeof attribute/operator: see expr_sizeof_value_0 for examples.

Thank you.

@generalmimon
Member

generalmimon commented May 2, 2020

@KOLANICH I have a question regarding your initial example in #84 (comment). The first field in your seq, str, is a variable-length, terminator-delimited string, and you're apparently applying the sizeof operator to it when calling sizeof(str, struct_offset1).

Does it imply that you don't want the sizeof operators to just yield a compile-time constant, but to generate expressions that will be calculated at runtime as well (as I described in #721 (comment))? It seems so, and your last sentence confirms it:

Variable-length string is taken as an example of a case where sizes and offsets are not defined on compile time.

But then it seems to me that you're contradicting yourself. I think terminator delimited string doesn't meet the condition of known size:

If the size is unknown - compile error. The size is known when any of the following is satisfied:

  • the type is fixed size
  • size attr is used in property
  • all the subproperties have known size

If the sizeof operator is expected to work even if the size is derivable only at runtime, I recommend using the _io.pos attribute (#721 (comment)). The null-terminated string is actually a great example of where this approach really shines. We don't have any size key available that would tell us the byte size of the str field. We can only get the character length of the string, but that depends on the encoding, so we can't use it as the byte size: we can't expect every encoding to use 1 byte per character (in fact, most don't). So the only way would be to convert the string back to bytes using the original encoding, but that might be quite an expensive operation that you wouldn't want to do.
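
The gap between character length and byte size is easy to see with plain JDK calls. This is only a small illustration; the class name CharVsByteLength and the sample string are made up for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: character count vs. encoded byte count of the same string.
final class CharVsByteLength {
    // Re-encodes the string, which is the costly step one would like to avoid.
    static int byteSize(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "na\u00efve"; // 5 characters, one of them outside ASCII
        System.out.println(s.length());                               // 5 characters
        System.out.println(byteSize(s, StandardCharsets.ISO_8859_1)); // 5 bytes
        System.out.println(byteSize(s, StandardCharsets.UTF_8));      // 6 bytes
        System.out.println(byteSize(s, StandardCharsets.UTF_16LE));   // 10 bytes
    }
}
```

The same five characters take 5, 6 or 10 bytes depending on the encoding, so the parsed string's length alone cannot stand in for the field's size in the stream.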

Querying _io.pos before and after is clean and simple, and it always works, no matter what's been parsed between the calls. It's the same approach we're used to for time measurement: startTime = now(); /* ... */ print(now() - startTime);.

All we'd have to do is collect the fields that are used in a sizeof operator anywhere in the KSY and insert the _io.pos queries before and after the method call that reads them. This could be done both in the _read() method for seq attributes and in the methods that read parse instances. Once calculated, the sizes would be stored in the struct's properties. Like this:

public void _read() {
    int _ofsStr = this._io.pos();
    this.str = new String(this._io.readBytesTerm(0, false, true, true), Charset.forName("ASCII"));
    this._sizeofStr = this._io.pos() - _ofsStr;
}
private Integer _sizeofStr;
public Integer _sizeofStr() { return _sizeofStr; }

Actually, we can use this approach on all value-based _sizeofs and it would be perfectly valid, no need for compile-time calculations. The compile-time calculations are more likely to be wrong, if they're not careful about bit-byte alignment, if conditions, repetitions, type-switching etc.

And using _io.pos would make the lea implementation much easier as well: when requesting the offset of the last field of the seq, we wouldn't have to sum all the preceding field sizes, it could be done with a single _io.pos call.
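
For anyone who wants to try the idea without the Kaitai runtime, here is a self-contained sketch in Java. It assumes java.nio.ByteBuffer as a stand-in for _io (position() plays the role of _io.pos), and the class PosBracketSketch with its fields sizeofStr and leaRevision is hypothetical, not generated code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for generated code; ByteBuffer plays the role of _io.
final class PosBracketSketch {
    String str;
    int revision;
    int sizeofStr;    // run-time size of str in bytes, terminator included
    int leaRevision;  // offset of the field that follows str

    // Expects an array-backed buffer that contains a 0 terminator;
    // a missing terminator ends in a BufferUnderflowException.
    void read(ByteBuffer io) {
        int ofsStr = io.position();          // position before the read
        while (io.get() != 0) { /* consume up to and including the terminator */ }
        int end = io.position();             // position after the read
        str = new String(io.array(), io.arrayOffset() + ofsStr,
                end - ofsStr - 1, StandardCharsets.US_ASCII);
        sizeofStr = end - ofsStr;            // size = position after - position before

        leaRevision = io.position();         // lea of the next field: one position query
        revision = io.get() & 0xff;          // u1
    }

    public static void main(String[] args) {
        PosBracketSketch s = new PosBracketSketch();
        s.read(ByteBuffer.wrap(new byte[] {'a', 'b', 'c', 0, 7}));
        System.out.println(s.str + " " + s.sizeofStr + " " + s.leaRevision + " " + s.revision);
    }
}
```

Neither value depends on knowing the encoding or on summing the sizes of earlier fields; both fall out of position queries that the parser can make at no real cost.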

@KOLANICH
Author

@generalmimon, yes, it is meant to generate something that gives us the byte size (certainly not the code point count), or maybe a struct our_size {uint64_t byte; uint8_t bit; ....}; object, at compile time if possible and otherwise at run time. I have updated the top message for more clarity. If I remember right, the current implementation of terminated fields scans the stream for the terminator, then puts the content into a separate stream, then parses that stream, so the size is known beforehand and is stored somewhere within the stream object. We surely shouldn't do a strlen on each call.

@generalmimon
Member

I'm taking this out of the 0.9 milestone. This feature currently works only for compile-time calculations, so it's not yet finished, but I'd say that's fine for purposes of 0.9 version. We can always improve it in the future.

generalmimon removed this from the v0.9 milestone Aug 2, 2020