What to do about the zero terminator of a string in a Span? #273

ericsink · 2019-06-24T18:11:57Z

The use of Span has raised some interesting questions.

In the SQLite API, when a parameter is a string, it is usually just a pointer, with the (documented) expectation that the string ends with a zero terminator byte.

So, at a higher layer, if we use a ReadOnlySpan<byte> for that string, the Length or the span is ignored. By the time it gets to SQLite, it's just a pointer. Is that a signal that we shouldn't be using span?

Suppose the string is "hello". That's 5 bytes of actual string plus one byte for the zero terminator. We could pass a span of length 5 or a span of length 6. It doesn't matter to SQLite, because it's just going to look for that zero byte anyway.

Heck, we could pass a span of length 1, and it still wouldn't matter. But I mention that just to illustrate the problem. Clearly, if we're going to use a span, we should have the length be correct.

But which is more correct? Should the Length include the zero terminator or not?

Including the zero byte in the length seems wrong because the zero is not part of the string data. If we called strlen() on the pointer, it would return a length which did not include the zero. If we passed this pointer to System.Text methods to convert it from UTF-8 to System.String, the length we provide should not include the zero byte. The length of the string "hello" is 5.

However, it also seems wrong to not include the zero byte in the length. The zero byte may not be part of the string, but it is a documented and required part of the block of bytes the function expects. And if we don't include the zero byte, then we are passing a span to a function which is going to walk beyond the end of that span's length, which seems wrong. The length of the zero-terminated block of memory that represents "hello" is 6.

At the moment, the solution I have chosen is to include the zero byte in the span length. And the provider code checks to make sure the last byte of the span is a zero.

But this is only half the story. We have a similar but slightly different set of problems for cases where SQLite is returning a string or passing one in a callback.

In this case, SQLite itself is responsible for putting that zero byte in place, and it does. But when we construct a ReadOnlySpan<byte> for that piece of memory, again, do we include the zero byte in the length or not?

In this case, a big difference is that the code that receives that span is likely to use the Length and ignore the presence of any zero terminator For example, using System.Text functions to convert from UTF-8 to a string, the span Length must be correct and must not include the zero terminator.

If we did include the zero terminator in this upward-moving set of strings, then most cases that use the spans would need to subtract one from the Length before using it. And should it check to make sure the last byte actually is a zero before doing so?

So for the moment, the solution I have chosen is to NOT include the zero terminator when constructing a span for a string that is coming from SQLite.

Neither of these decisions seems correct. Each one merely seems to be the least incorrect option.

And then there is the inconsistency. I have two very similar questions, but I came up with two different answers. That too seems wrong. For example, as things stand right now, if you receive a string from SQLite as a span and then pass that span directly back to another SQLite function, it won't work.

The text was updated successfully, but these errors were encountered:

ericsink · 2019-06-24T18:13:32Z

@bricelam Have you folks at Microsoft developed any recommended guidelines for using Span with unmanaged libraries that deal in zero-terminated strings?

bricelam · 2019-06-24T19:19:55Z

@divega might know who to ask...

divega · 2019-06-24T19:31:46Z

I would start with @jkotas.

jkotas · 2019-06-24T19:54:31Z

For public .NET Core APIs that take or return Span<byte> or Span<char> :

Span is not assumed to be zero terminated
Length does not include the zero terminator (because of there may not even be one)

The internal .NET Core implementation has a few places that create and operate on Spans that are zero terminated. This is local internal implementation detail that does not leak out through public APIs.

ericsink · 2019-06-24T19:59:50Z

@jkotas That makes sense. For a .NET-only API.

But what if you were passing a Span through P/Invoke to a C function that expects the zero terminator? Would you include that zero byte in the span length?

jkotas · 2019-06-24T21:27:48Z

P/Invoke methods cannot take Spans directly today.

I think you are really asking whether you should include zero byte in the Span length in your internal implementation details that prepare string passed to PInvoke. It is up to you and what makes sense for the type of code that you are writing.

For public APIs, I would recommend to be on the same plan as public .NET Core APIs.

ericsink · 2019-06-24T21:38:57Z

Okay. I admit this issue is arcane and my explanation of it above is tedious.

Anyway, at some point I will eventually cross paths with somebody else who is weighing the pros and cons of this.

Thanks for the replies.

ericsink · 2019-06-24T22:58:26Z

For others who might visit this issue, I want to add a bit more explanation for why "be on the same plan as public .NET Core APIs" doesn't seem helpful in this context.

The underlying C function requires a zero terminated string. That's just the way it is.

So let's say I expose a C# wrapper for that function which takes a Span, and to be consistent with .NET practices, the caller is not expected to include a zero terminator.

What that means is that I have to make a copy of the data in order to append the zero terminator before passing it down to the unmanaged function.

And this would defeat the whole point of using Span.

tmds · 2019-06-25T11:30:10Z

Maybe you can write a ref struct that encapsulates Span<byte> and represents a zero-terminated string?

ericsink · 2019-06-25T14:44:31Z

@tmds I've been wondering about that possibility myself.

ericsink · 2019-06-25T14:47:01Z

A tidbit of good news noticed by sgjennings on Twitter:

This writeup on the upcoming Utf8String type:

dotnet/corefxlab#2368

contains this bit:

"ensuring a null terminator (important for p/invoke scenarios)"

which indicates an awareness of this issue.

I was aware of Utf8String, but had not yet seen any indication that dealing with zero terminator cases was (or was not) being considered as part of the design.

…simply encapsulates (and hides) a ReadOnlySpan of byte, with the additional property of knowing for certain that it includes a zero terminator. it also supports the fixed pattern. and it has a constructor that takes a string. change ISQLite3Provider.sqlite3_open() to accept one of these instead of ROS byte. #273

ericsink · 2019-06-26T03:02:38Z

@tmds So yeah, as a first attempt, that idea was easier than I thought. See 17785ea

So sz (which needs a better name) is a ref struct type that simply encapsulates (and hides) a ReadOnlySpan<byte>, with the additional property of knowing for certain that it includes a zero terminator, because its only constructor ensures that. It also supports the fixed pattern. In fact, right now that's its whole API, a constructor and GetPinnableReference().

The notion here is that sz would be used in cases where the utf8 string is required to have a zero terminator, leaving ReadOnlySpan<byte> to be used in places where the full expectations associated with span are appropriate.

As mentioned in the commit message, this is an experimental idea. I'm just throwing some code around to see what works.

…d of ROS byte. requires sz to have new ways of constructing, while continuing to ensure the zero terminator. requires sz.ToString(), called by extension method utf8_to_string. #273

…delegate to use sz instead of ROS byte. #273

ericsink · 2019-06-26T05:01:13Z

After continuing the sz experiment for a return value and a callback parameter, it continues to work well.

I'm starting to believe the best practice here is to NOT publicly use ReadOnlySpan<byte> to describe a pointer that needs a zero terminator. Using ReadOnlySpan<byte> sets certain expectations, such as that slicing works, and there may not be a zero terminator, and "if there is a zero terminator it is certainly not included in the Length", and so on.

As an alternative, the sz concept (or something like it) gives the benefits of a span (it's still a ref struct) but gives type safety for the use cases where the underlying C-ish API requires a zero terminator. It is basically a constrained span with very limited functionality.

This may in fact be what @jkotas meant when he said "in your internal implementation details that prepare string passed to PInvoke", but I initially didn't understand it because I had not yet implemented my own ref struct type.

jkotas · 2019-06-26T05:33:12Z

I agree that the type safe wrapper makes in your case since you are passing the value around. I have not thought about it, but you have figure it out. It looks pretty neat.

tmds · 2019-06-26T07:30:42Z

as a first attempt, that idea was easier than I thought.

cool!

This writeup on the upcoming Utf8String type:

Compared to the Span approach, afaik, Utf8String won't allow you to wrap native memory. Nor can you stackalloc the string.
The advantage of Utf8String over string is you no longer convert between between UTF-8 (SQLLite) and UTF-16 (.NET string).
With the Span approach, you can also avoid memory allocations in some cases.
It makes sense to use the Span approach where those allocations can be avoided, and Utf8String otherwise.
Last I heard, Utf8String isn't happening for 3.0.

After continuing the sz experiment for a return value and a callback parameter

In a callback the sz will be valid during the callback. But with a return value, there is an issue of scope when the sz is wrapping native memory.

ericsink · 2019-06-26T13:46:38Z

Utf8String won't allow you to wrap native memory.
The advantage of Utf8String over string is you no longer convert between between UTF-8 (SQLLite) and UTF-16 (.NET string).

Yeah, Utf8String (someday) will be complementary, but occupies a different place in the world. I'm trying to make it possible to use SQLitePCLRaw in a situation where everything is UTF-8, with minimal conversion and copying. For now, if the enclosing app or library is built around System.String, it won't be able to take full advantage, but I'll be more ready for the day when Utf8String is available.

In a callback the sz will be valid during the callback. But with a return value, there is an issue of scope when the sz is wrapping native memory.

Yeah, the method I experimentally converted over was already returning ReadOnlySpan<byte>, so that same limitation applied. The caller is expected to copy it if needed. For example, sqlite3_column_text() returns char *. If C# could overload on return type, I would provide one overload returning sz and one returning string to do the conversion. But since I can only have one method with that name, I choose the lower-level approach and let the caller do the extra work.

Thanks. Cheers.

…hich uses the length) or sz (which does not), since the underlying sqlite3 APIs allow that. #273

…vel, since the sqlite3 API allows either approach #273

…ion method instead of sz.FromString() #273

ericsink mentioned this issue Jun 24, 2019

Incorrect tail from sqlite3_prepare_v2 #274

Closed

ericsink added a commit that referenced this issue Jun 26, 2019

continuing the experiment to try an sz in a callback. chg trace_hook …

9aadafb

…delegate to use sz instead of ROS byte. #273

ericsink added a commit that referenced this issue Jun 26, 2019

make sz more robust in handling edge cases, null pointers #273

a9aebc0

ericsink added a commit that referenced this issue Jun 26, 2019

chg the parameters of sqlite3_open_v2() from ROS byte to sz. #273

f75c0c0

ericsink added a commit that referenced this issue Jun 26, 2019

add sz.FromPtrLen #273

cf0ca36

ericsink added a commit that referenced this issue Jun 26, 2019

chg sqlite3_column_text() to return sz #273

cf0fe11

ericsink added a commit that referenced this issue Jun 26, 2019

chg the string params for delegate_update from ROS byte to sz #273

c579ca6

ericsink added a commit that referenced this issue Jun 27, 2019

in the ISQLiteProvider interface, chg a bunch more ROS byte to sz #273

f88e608

ericsink added a commit that referenced this issue Jun 27, 2019

chg bind_text, result_text, and result_error to accept either span (w…

0d74c6b

…hich uses the length) or sz (which does not), since the underlying sqlite3 APIs allow that. #273

ericsink added a commit that referenced this issue Jun 27, 2019

for prepare_v2/3, add overloads to allow ROS or sz at the provider le…

1689e7f

…vel, since the sqlite3 API allows either approach #273

ericsink added a commit that referenced this issue Jun 27, 2019

rm some dead code in the provider file as part of the sz transition #273

d1670cf

ericsink added a commit that referenced this issue Jun 27, 2019

rm more dead code #273

d6eee00

ericsink added a commit that referenced this issue Jun 27, 2019

rm utf8_to_string() for sz, since it's just .ToString() now. #273

bacf673

ericsink added a commit that referenced this issue Jun 27, 2019

for consistency, chg sqlite3_open() in raw to use the .to_sz() extens…

d76aa16

…ion method instead of sz.FromString() #273

ericsink added a commit that referenced this issue Jun 28, 2019

rename sz to utf8z #273

b526ad2

ericsink closed this as completed Sep 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What to do about the zero terminator of a string in a Span? #273

What to do about the zero terminator of a string in a Span? #273

ericsink commented Jun 24, 2019 •

edited

Loading

ericsink commented Jun 24, 2019

bricelam commented Jun 24, 2019

divega commented Jun 24, 2019

jkotas commented Jun 24, 2019 •

edited

Loading

ericsink commented Jun 24, 2019

jkotas commented Jun 24, 2019

ericsink commented Jun 24, 2019

ericsink commented Jun 24, 2019

tmds commented Jun 25, 2019

ericsink commented Jun 25, 2019

ericsink commented Jun 25, 2019 •

edited

Loading

ericsink commented Jun 26, 2019

ericsink commented Jun 26, 2019

jkotas commented Jun 26, 2019

tmds commented Jun 26, 2019

ericsink commented Jun 26, 2019

What to do about the zero terminator of a string in a Span? #273

What to do about the zero terminator of a string in a Span? #273

Comments

ericsink commented Jun 24, 2019 • edited Loading

ericsink commented Jun 24, 2019

bricelam commented Jun 24, 2019

divega commented Jun 24, 2019

jkotas commented Jun 24, 2019 • edited Loading

ericsink commented Jun 24, 2019

jkotas commented Jun 24, 2019

ericsink commented Jun 24, 2019

ericsink commented Jun 24, 2019

tmds commented Jun 25, 2019

ericsink commented Jun 25, 2019

ericsink commented Jun 25, 2019 • edited Loading

ericsink commented Jun 26, 2019

ericsink commented Jun 26, 2019

jkotas commented Jun 26, 2019

tmds commented Jun 26, 2019

ericsink commented Jun 26, 2019

ericsink commented Jun 24, 2019 •

edited

Loading

jkotas commented Jun 24, 2019 •

edited

Loading

ericsink commented Jun 25, 2019 •

edited

Loading