proposal: type for null terminated pointer #265

Closed
andrewrk opened this issue Feb 23, 2017 · 42 comments
Labels
accepted This proposal is planned. enhancement Solving this issue will likely involve adding new logic or components to the codebase. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Comments

@andrewrk
Member

andrewrk commented Feb 23, 2017


Currently the type of c"aoeu" is *const u8.

Instead, the type should indicate that the pointer is null terminated. Here are two ideas to represent that:

  • *0 const u8
  • *null const u8

This type would be implicitly castable to *const u8. You can explicitly cast the other way, and in debug mode this inserts a safety check to make sure there actually is a null byte there.

It should probably work for any type that supports T == 0 or T == null.

We want to steer users away from this type and toward []const u8, which includes a pointer and a length. However, we still have to deal with null-terminated data from C land and from some kernel interfaces, which makes this type useful. For example, we currently have this:

pub fn open_c(path: *const u8, flags: usize, perm: usize) -> usize {
    arch.syscall3(arch.SYS_open, usize(path), flags, perm)
}

pub fn open(path: []const u8, flags: usize, perm: usize) -> usize {
    const buf = @alloca(u8, path.len + 1);
    @memcpy(&buf[0], &path[0], path.len);
    buf[path.len] = 0;
    return open_c(buf.ptr, flags, perm);
}

Having the open_c prototype be *0 const u8 would make it more type-safe. Further, we could provide an open function that supported either type for path, and if it happened to be null terminated then it could avoid the stack allocation.

We could also make the type of string literals be []0 const u8 meaning that the pointer value for the slice has a 0 after the last byte. The length would still indicate the memory before the null byte. If you slice this type then the pointer component would change from *0 const u8 to *const u8.

It would be extra helpful if automatic .h import could identify when a pointer in a function is supposed to be null-terminated, and we could emit a compile error if the user passes a pointer that is not null terminated. I'm not sure how we could detect this automatically though.
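To make the distinction concrete, here is a rough C sketch, for illustration only. `Slice` and `ZSlice` are hypothetical stand-ins for []const u8 and the proposed []0 const u8, and all helper names are invented here. It assumes the guarantee described above: the length excludes the terminator, and the byte at ptr[len] is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for []const u8: pointer plus length, no terminator promised. */
typedef struct { const char *ptr; size_t len; } Slice;

/* Hypothetical stand-in for []0 const u8: same layout, but ptr[len] == 0 is promised. */
typedef struct { const char *ptr; size_t len; } ZSlice;

/* Sub-slicing may end before the terminator, so the result is only a plain Slice. */
static Slice zslice_sub(ZSlice s, size_t start, size_t end) {
    assert(start <= end && end <= s.len);
    Slice out = { s.ptr + start, end - start };
    return out;
}

/* A plain slice has to be copied to gain a terminator. This sketch uses the
   heap where the original open() uses a stack allocation. Caller frees. */
static char *slice_to_cstr(Slice s) {
    char *buf = malloc(s.len + 1);
    if (buf == NULL) return NULL;
    memcpy(buf, s.ptr, s.len);
    buf[s.len] = '\0';
    return buf;
}

/* A terminated slice can be handed to a C API directly, with no copy. */
static const char *zslice_to_cstr(ZSlice s) {
    return s.ptr;
}
```

The two conversion helpers correspond to the two paths an open function accepting either type could take: copy when the terminator is not guaranteed, pass the pointer through when it is.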

@andrewrk andrewrk added the enhancement Solving this issue will likely involve adding new logic or components to the codebase. label Feb 23, 2017
@andrewrk andrewrk modified the milestone: 0.2.0 Mar 26, 2017
@thejoshwolfe
Contributor

in debug mode this inserts a safety check to make sure there actually is a null byte there.

How would that work? Wouldn't that be an unbounded linear search?

@andrewrk
Member Author

andrewrk commented Apr 3, 2017

Maybe if you cast from []T to []0 T, then we check for 0 in the last spot (since we have len), and in the new slice, len is one less.

That's true though, for pointers we can't really have this check.
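A minimal C sketch of that slice-to-terminated-slice check, assuming the semantics just described (the type and function names are hypothetical). Because the slice carries a length, the check is a single comparison instead of an unbounded search:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { const unsigned char *ptr; size_t len; } Slice;   /* like []T   */
typedef struct { const unsigned char *ptr; size_t len; } ZSlice;  /* like []0 T */

/* Checked conversion: O(1), because the length says where the terminator must be.
   On success the new length excludes the terminator, so it is one less. */
static bool slice_to_zslice(Slice s, ZSlice *out) {
    if (s.len == 0 || s.ptr[s.len - 1] != 0) return false;
    out->ptr = s.ptr;
    out->len = s.len - 1;
    return true;
}
```

No equivalent function can be written for a bare pointer, since there is no length to tell it where to stop looking.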

@andrewrk
Member Author

andrewrk commented Apr 3, 2017

Proposal rejected in order to discourage use of null terminated things.

@andrewrk andrewrk closed this as completed Apr 3, 2017
@AndreaOrru
Contributor

How do you plan to interface with C-style null-terminated strings?

@andrewrk
Member Author

andrewrk commented May 9, 2017

Can you elaborate on this question? We have the cstr module for getting string length and converting to a slice.

@AndreaOrru
Contributor

Solved thanks to that.

@tiehuis tiehuis added proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. rejected labels Sep 15, 2017
@thejoshwolfe
Contributor

Reopening for #518.

My proposal is to leave the type unnamed; you have to refer to it as @typeOf(c""). This discourages people from using the type, and it clearly associates the type with the c"" syntax.

Instances of the type should not be usable in any way, except for implicitly converting to &const u8. No slicing syntax x[0..5], no array subscripting syntax x[0], no field access x.foo.

Literals with c"foo" syntax should obviously be of this type. Nothing implicitly casts to this type. Creating an instance of this type should be done through @typeOf(c"")(x), where x is of type []const u8. This is the only explicit conversion that should be allowed. In safety mode, the conversion should do this safety check:

assert(x.len > 0);
for (x[0..x.len - 1]) |b| {
    assert(b != 0);
}
assert(x[x.len - 1] == 0);

There is no way to create a @typeOf(c"") directly from a pointer. You have to slice the pointer, explicitly giving it a length, and then convert it. This way the safety check is always bounded.

There are many cases where you don't want to bother with this safety check, but you still want a @typeOf(c""). For that we can add these functions to the cstr module:

pub fn fromSliceUnsafe(x: []const u8) -> @typeOf(c"") {
    @setDebugSafety(this, false);
    return @typeOf(c"")(x);
}
pub fn fromPointerUnsafe(x: &const u8) -> @typeOf(c"") {
    return fromSliceUnsafe(x[0..0]);
}

The name "Unsafe" should properly inform people that they're taking a risk using these utilities.

I'm not sure we need any solution to null-terminated arrays of things other than u8. These cases exist, but I don't think we need special support for them.

@thejoshwolfe thejoshwolfe reopened this Oct 2, 2017
@andrewrk andrewrk removed the rejected label Oct 2, 2017
@andrewrk
Member Author

andrewrk commented Oct 2, 2017

I think this counter proposal has 2 competing ideas:

  • Introduce a type safety feature that makes it more ergonomic to have safer code.
  • Make the feature intentionally ugly so that using it is unattractive

These don't go well together. I think the original proposal is better, where the type is named.

There's also no reason to limit it to u8. A null terminated array is not inherently an evil C concept that is intruding in the Zig language. It's a general data storage technique that is valid for some memory constrained use cases.

I can probably find a Windows API that uses NULL as a sentinel for an array of pointers. The argv in libc main would be represented with &null ?&null u8. (And not just libc - this is how it is represented in the official x86_64 ABI specification).
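For reference, this is what the argv convention looks like from C, where argv[argc] is guaranteed to be a null pointer. The helper name below is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Counts the entries of a NULL-terminated array of string pointers,
   which is the layout argv uses. The sentinel here is a pointer, not a byte. */
static size_t count_until_null(char *const *items) {
    size_t n = 0;
    while (items[n] != NULL) n++;
    return n;
}
```

Called on a real argv, this returns the same value as argc.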

There is no way to create a @typeOf(c"") directly from a pointer. You have to slice the pointer, explicitly giving it a length, and then convert it. This way the safety check is always bounded.

So the generated code would first have to do essentially a strlen to find the length, use that to convert to a slice, then cast that to the null-terminated type, and then the debug safety check would do a redundant strlen? That doesn't seem right.

If we had the []null u8 type as originally proposed, this would be more straightforward. I think a better way this can go is:

  • Set the null terminator in memory.
  • Cast the &u8 to []null u8. This cast does linear search for the terminator and sets len appropriately.
  • Now @typeOf(slice.ptr) == &null u8
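The steps above can be sketched in C roughly as follows. This is an assumption-laden illustration, not the planned implementation: `ZSlice16` and `zslice16_from_ptr` are hypothetical names, and a 16-bit element type is used to show that nothing here is specific to u8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for []null u16: ptr[len] == 0 is promised. */
typedef struct { uint16_t *ptr; size_t len; } ZSlice16;

/* The "cast": one linear search for the terminator, which also yields len.
   The caller must have stored a terminator in the buffer beforehand. */
static ZSlice16 zslice16_from_ptr(uint16_t *ptr) {
    size_t len = 0;
    while (ptr[len] != 0) len++;
    ZSlice16 s = { ptr, len };
    return s;
}
```

The search happens once, at the conversion, so there is no second redundant scan for a safety check.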

@andrewrk andrewrk added the accepted This proposal is planned. label Oct 14, 2017
@andrewrk
Member Author

From llvm/include/llvm/Support/MemoryBuffer.h

/// This interface provides simple read-only access to a block of memory, and
/// provides simple methods for reading files and standard input into a memory
/// buffer.  In addition to basic access to the characters in the file, this
/// interface guarantees you can read one character past the end of the file,
/// and that this character will read as '\0'.
///
/// The '\0' guarantee is needed to support an optimization -- it's intended to
/// be more efficient for clients which are reading all the data to stop
/// reading when they encounter a '\0' than to continually check the file
/// position to see if it has reached the end of the file.
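A small C sketch of the optimization that comment describes, using a made-up digit-scanning task (neither function comes from LLVM). With a guaranteed '\0' after the data, the loop condition needs no separate position check:

```c
#include <assert.h>
#include <stddef.h>

/* Counts leading ASCII digits. Relies on a '\0' after the data:
   '\0' is not a digit, so it ends the loop without an index comparison. */
static size_t count_digits_sentinel(const char *p) {
    const char *start = p;
    while (*p >= '0' && *p <= '9') p++;
    return (size_t)(p - start);
}

/* Same scan without a terminator guarantee: every step also compares against len. */
static size_t count_digits_bounded(const char *p, size_t len) {
    size_t i = 0;
    while (i < len && p[i] >= '0' && p[i] <= '9') i++;
    return i;
}
```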

@jido

jido commented Mar 27, 2018

That may be a better place to put my comment -
Does the null/0 have to be at index x.len? C strings are stored in a fixed-length array, but the string length can vary; it is not necessarily equal to the array size. The same applies to null-terminated C arrays.

@Hejsil
Contributor

Hejsil commented Mar 27, 2018

@jido In C, nothing prevents you from removing all null terminators from a char array and then passing it to strlen. This is the motivation for the null-terminated type. The compiler will insert a null terminator at x.len to prevent this bug. You, as a user of a null-terminated type, can still insert null terminators between 0 and x.len, and strlen will stop at your null terminator instead of the one at x.len. The one at x.len is there for safety, and trying to overwrite it is undefined behavior (aka a runtime crash in debug mode).
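A short C illustration of that point, with a hypothetical `ZSlice` struct standing in for the null-terminated slice type. The terminator at index len is always present, yet an earlier zero byte makes strlen report a shorter length:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a null-terminated slice: ptr[len] == 0 is promised. */
typedef struct { const char *ptr; size_t len; } ZSlice;

/* Length as C code sees it: up to the first zero byte,
   which may come before s.len but never after it. */
static size_t c_visible_len(ZSlice s) {
    return strlen(s.ptr);
}
```

For a slice over the bytes "ao\0eu" with len 5, c_visible_len returns 2; for "aoeu" with len 4, it returns 4.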

@andrewrk
Member Author

andrewrk commented Jun 6, 2018

Moving some work from Pointer Reform (#770) to here:

  • add syntax for [N]null T, []null T, and [*]null T
  • add implicit casting
    • []null T to []T
    • [*]null T to [*]T
    • [N]null T to [N]T
    • *[N]null T to *[N]T
    • []null T to [*]null T
    • [N]null T to []null const T
    • [N]null T to [*]null const T
  • make string literals [N]null u8
  • remove C string literals and C multiline string literals

@andrewrk
Member Author

I would definitely like to see that happen whichever way null-terminated pointers are implemented.

This is definitely an interesting proposal which I would invite you to file separately from this null terminated pointers issue (and indeed if it was accepted, it could potentially reverse the decision on this one).

I think you've made a pretty strong case for userland null-terminated pointer. It's enough to make me go back and consider all the use cases that I have for it. I don't know if I'm ready to reverse the "accepted" label here yet, but I'll admit that I'm going back and questioning this decision.

@matthew-mcallister
Contributor

matthew-mcallister commented Feb 16, 2019

This is definitely an interesting proposal which I would invite you to file separately from this null terminated pointers issue (and indeed if it was accepted, it could potentially reverse the decision on this one).

That sounds good! I'd be happy to write up a proposal and work on implementing it as well.

Also, I just tried this out in Compiler Explorer (updated) and it seems like a one-member extern struct already gets passed the same way as a plain int on x86_64, so the suggestion here could be tentatively evaluated outside stdlib without waiting for that idea to be implemented.

@matthew-mcallister
Contributor

matthew-mcallister commented Feb 16, 2019

In principle, you can always make the type template function more flexible and add typedefs for common cases.

For the record, what I was thinking by this was that the template would take a struct specifying pointer attributes and you'd write one typedef per set of parameters. Without working out details, something like

const PtrAttrs = struct {
    is_const: bool,
    is_volatile: bool,
    // etc.
};
const c_str_const = TermPtr(u8, '\0', PtrAttrs { .is_const = true, .is_volatile = false });
const w_str_mut = TermPtr(u16, '\0', PtrAttrs { .is_const = false, .is_volatile = false });
// etc.

This should generalize to various custom pointer types.

@Rocknest
Contributor

Rocknest commented Feb 16, 2019

The ability to define types such as 0-terminated pointers would be a powerful addition to the language. It would also probably solve #1595.

How it could possibly look:

pub fn TermPtr(comptime T: type, comptime termValue: T) type {
    return @TransparentStruct(struct { // builtin ensures that there is only one field
        ptr: [*]T, // could be any type, for example 'c_int' to pass type safe values to some c api
        fn from(comptime slice: []T) @This() {
            return @This() { .ptr = (slice ++ termValue).ptr };
        }
    });
}

Not sure about modifiers such as const, align. Maybe something like this:

pub fn TermPtr(comptime T: type, comptime termValue: T) type {
    return @TransitiveStruct(struct { // allows to forward type modifiers from the outside
        ptr: [*]T,
        // methods
    });
}
///////
const c_str = TermPtr(u8, 0x00);
extern fn someApi(path: const c_str); // the type here is '[*]const T', but it is type safe

It could even allow operators to be used on such a type, though that should be optional. For example, the [index] operator on c_str is ok, but + on a type-safe c_int wrapper for a C API is not ok.

@andrewrk
Member Author

@matthew-mcallister It seems important to me that types that can be represented in C can be represented in the Zig language without having to import any code. This precedent is set for C integer types, float types, and C pointer types. There must be a C string literal, or Zig string literals must work for C APIs.

@daurnimator
Contributor

daurnimator commented Feb 19, 2019

Something I hadn't considered yet:

  • Would the null-terminated slice/array types assert that a null/0 byte does not occur in any of the elements before the len index?

If so, this would guarantee the property that after casting a []null T to [*]null T, finding the length based on the null termination would give the same len value as before. However this would mean that casting []null T to [*]null T should have a runtime safety check, which probably means it shouldn't be an implicit cast. Hmm.

Or the type could have a weaker guarantee, which is only that there shall be a null/0 byte at the len index, and makes no guarantees about items otherwise. However, then the "length" may change when implicitly casting from []null T to [*]null T.

Continuing from IRC:

To echo the common usage of null terminated strings, I think the length should always be computed at runtime (at least before the optimizer kicks in). I propose:

null slices

  • []null T (null slice) should be struct { ptr: [*]T } where .len performs a strlen-like operation.
  • []null T implicitly casts to [*]T
  • []null T can be 'cast' to [*]T by simply doing nullslice.ptr
  • [*]T to []null T needs an explicit cast (via @ptrCast)
  • []null T can be 'cast' to []T via nullslice[0..nullslice.len]. The .len here is invoking a strlen-like operation.
  • For safety, you could make indexing a null slice do a length check: null_u8_slice[5] could emit code that does: assert(strnlen(null_u8_slice.ptr, 5) == 5) before the access.

null arrays

  • [N]null T is an array of max size N where the first null should be considered the length.
  • It is similar to []null T except uses a strnlen-like instead of a strlen-like.
  • It is valid to read [N]null T at N. You are guaranteed to get null.
  • [N]null T uses @sizeOf(T)*(N+1) space (or possibly less depending on array packing if T is e.g. u1?)
  • [N]null T implicitly casts to []null T
  • [N]null T implicitly casts to [*]T
  • It is a compile error to write to index N.
  • It is valid to read or write to a [N]null T at any index in the range [0..N)

null literals

  • c"aoeu" is of type [4]null const u8
  • (c"aoeu").len == 4
  • @sizeOf(c"aoeu") == 5
  • (c"ao\0eu").len == 2
  • @sizeOf(c"ao\0eu") == 6

misc notes

  • [*]null T doesn't exist. The length of a null array is always "known": it's before the first null!
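The null array rules in this comment can be approximated in C as below. This is a sketch under the stated rules only: `ZArray4`, `ZARRAY4_CAP` and both helpers are hypothetical names, and the compile error for writing index N is replaced by a runtime assert, since C cannot express it.

```c
#include <assert.h>
#include <stddef.h>

#define ZARRAY4_CAP 4

/* Stand-in for [4]null u8: N + 1 bytes of storage, data[N] is always 0. */
typedef struct { char data[ZARRAY4_CAP + 1]; } ZArray4;

/* strnlen-like length: the first zero within the first N elements, else N. */
static size_t zarray4_len(const ZArray4 *a) {
    size_t i = 0;
    while (i < ZARRAY4_CAP && a->data[i] != '\0') i++;
    return i;
}

/* Writes are allowed only in [0..N); index N holds the terminator. */
static void zarray4_set(ZArray4 *a, size_t i, char c) {
    assert(i < ZARRAY4_CAP);
    a->data[i] = c;
}
```

Initialized from "aoeu", such an array has length 4 and occupies 5 bytes, matching the c"aoeu" figures in the list. Writing a zero at index 2 drops the length to 2, matching the c"ao\0eu" case.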

@matthew-mcallister
Contributor

matthew-mcallister commented Feb 20, 2019

@daurnimator Why not keep [*]null T/[*c]null T and dispense with the others? If the intended purpose of this feature is to decorate pointers in C FFI definitions, then that would be the minimal solution, plus it would encourage use of "real" slices in Zig code and discourage overuse of strlen.

I feel like null-terminated literals can still conceivably be handled well by a builtin function. Say c"hello" produces a raw [*]null u8 pointer. Then a hypothetical @cstrToSlice function can safely check for the null and make a Zig slice at compile time. Or all string literals can have an implicit terminating null and @sliceToCstr will check that its argument has the null (and no interior nulls?) and return a [*]null u8.

@andrewrk
Member Author

Why would we need @cstrToSlice? std.mem.len already works fine; it'll just have an improved prototype that uses [*]null T instead of [*]const T. @sliceToCstr as you described it could also easily be a userland function.

@matthew-mcallister
Contributor

I meant those names as placeholders; I only added the @ to satisfy the requirement that you wouldn't need to import from std. As far as my suggestion goes, it's agnostic to how they're implemented.

@andrewrk andrewrk modified the milestones: 0.4.0, 0.5.0 Mar 1, 2019
@Androthi

Moving some work from Pointer Reform (#770) to here:

  • add syntax for [N]null T, []null T, and [*]null T

  • add implicit casting

    • []null T to []T
    • [*]null T to [*]T
    • [N]null T to [N]T
    • *[N]null T to *[N]T
    • []null T to [*]null T
    • [N]null T to []null const T
    • [N]null T to [*]null const T
  • make string literals [N]null u8

  • remove C string literals and C multiline string literals

Coming in late to this, but wouldn't it be easier to create a hybrid string type?
The HLA language, for example, uses a hybrid string type that is null terminated, with a header containing the max length, the current length, and a pointer to the start of the string. This really simplified interactions between C libraries and internal libraries.

@marler8997
Contributor

marler8997 commented Jun 18, 2019

Thought it might be helpful to share my experience with this problem. In D I implemented a module to support null-terminated strings.

https://github.com/dragon-lang/mar/blob/master/src/mar/sentinel.d

The general term I've found for arrays that end in a particular value is "sentinel array" and/or "sentinel ptr". In D I just implemented them as wrapper structs around pointers and arrays.

The 2 main benefits I see from having sentinel pointers as a part of the type system are:

  1. functions that take sentinel pointers can declare this requirement, meaning that if a client passes a non-sentinel pointer then they will get a compile error rather than a runtime bug
  2. it allows the application to control when and how sentinel arrays are allocated, rather than having to convert normal zig arrays to sentinel arrays every time a C function is called

My library solution for this in D would be equivalent to something like this in Zig:

pub fn SentinelPtr(comptime T: type) type {
    return struct {
        ptr: [*]T,
        // create a sentinel pointer from `ptr`, assume it ends in a sentinel value
        pub fn init(ptr: [*]T) @This() {
            return @This() {
                .ptr = ptr,
            };
        }
    };
}
pub fn SentinelSlice(comptime T: type) type {
    return struct {
        array: []T,
        // create a sentinel slice from `slice`
        pub fn init(slice: []T) @This() {
            std.debug.assert(slice.ptr[slice.len] == 0);
            return assume(slice);
        }
        // create a sentinel slice from `slice`, assume it ends in a sentinel value
        pub fn assume(slice: []T) @This() {
            return @This() {
                .array = slice,
            };
        }
    };
}

It just boils down to wrapping the pointers/slices inside structs and creating helper functions to create/unwrap them.

This is one way to implement it. If you do it in a library like this, though, it will be a bit more verbose than a language solution, and you'll probably want to find a way to allow the types to perform automatic const conversion, i.e.

var chars: [2]u8 = undefined;
chars[0] = 'a';
chars[1] = '\0';
var x = SentinelPtr(u8).init(&chars);
var y : SentinelPtr(const u8) = x; // is there a way to support this in zig?

@hryx
Contributor

hryx commented Jul 3, 2019

Just to extend and visualize @Rocknest's syntactical proposal:

[]const u8
[*]const u8
[5]const u8
[_]const u8
[*c]const u8
[null]const u8
[*null]const u8
[5 null]const u8
[_ null]const u8

I appreciate having the null inside the brackets because it describes a property of the array/slice type itself (like 5/_/*), as opposed to the const which qualifies the element type.
(Minor thing) 🐤

@daurnimator
Contributor

I appreciate having the null inside the brackets because it describes a property of the array/slice type itself (like 5/_/*), as opposed to the const which qualifies the element type.

But e.g. align is on the outside of the []

@hryx
Contributor

hryx commented Jul 3, 2019

But e.g. align is on the outside of the []

Ah, true

