Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What to do about the zero terminator of a string in a Span? #273

Closed
ericsink opened this issue Jun 24, 2019 · 16 comments
Closed

What to do about the zero terminator of a string in a Span? #273

ericsink opened this issue Jun 24, 2019 · 16 comments

Comments

@ericsink
Copy link
Owner

ericsink commented Jun 24, 2019

The use of Span has raised some interesting questions.

In the SQLite API, when a parameter is a string, it is usually just a pointer, with the (documented) expectation that the string ends with a zero terminator byte.

So, at a higher layer, if we use a ReadOnlySpan<byte> for that string, the Length or the span is ignored. By the time it gets to SQLite, it's just a pointer. Is that a signal that we shouldn't be using span?

Suppose the string is "hello". That's 5 bytes of actual string plus one byte for the zero terminator. We could pass a span of length 5 or a span of length 6. It doesn't matter to SQLite, because it's just going to look for that zero byte anyway.

Heck, we could pass a span of length 1, and it still wouldn't matter. But I mention that just to illustrate the problem. Clearly, if we're going to use a span, we should have the length be correct.

But which is more correct? Should the Length include the zero terminator or not?

Including the zero byte in the length seems wrong because the zero is not part of the string data. If we called strlen() on the pointer, it would return a length which did not include the zero. If we passed this pointer to System.Text methods to convert it from UTF-8 to System.String, the length we provide should not include the zero byte. The length of the string "hello" is 5.

However, it also seems wrong to not include the zero byte in the length. The zero byte may not be part of the string, but it is a documented and required part of the block of bytes the function expects. And if we don't include the zero byte, then we are passing a span to a function which is going to walk beyond the end of that span's length, which seems wrong. The length of the zero-terminated block of memory that represents "hello" is 6.

At the moment, the solution I have chosen is to include the zero byte in the span length. And the provider code checks to make sure the last byte of the span is a zero.

But this is only half the story. We have a similar but slightly different set of problems for cases where SQLite is returning a string or passing one in a callback.

In this case, SQLite itself is responsible for putting that zero byte in place, and it does. But when we construct a ReadOnlySpan<byte> for that piece of memory, again, do we include the zero byte in the length or not?

In this case, a big difference is that the code that receives that span is likely to use the Length and ignore the presence of any zero terminator For example, using System.Text functions to convert from UTF-8 to a string, the span Length must be correct and must not include the zero terminator.

If we did include the zero terminator in this upward-moving set of strings, then most cases that use the spans would need to subtract one from the Length before using it. And should it check to make sure the last byte actually is a zero before doing so?

So for the moment, the solution I have chosen is to NOT include the zero terminator when constructing a span for a string that is coming from SQLite.

Neither of these decisions seems correct. Each one merely seems to be the least incorrect option.

And then there is the inconsistency. I have two very similar questions, but I came up with two different answers. That too seems wrong. For example, as things stand right now, if you receive a string from SQLite as a span and then pass that span directly back to another SQLite function, it won't work.

@ericsink
Copy link
Owner Author

@bricelam Have you folks at Microsoft developed any recommended guidelines for using Span with unmanaged libraries that deal in zero-terminated strings?

@bricelam
Copy link
Contributor

@divega might know who to ask...

@divega
Copy link

divega commented Jun 24, 2019

I would start with @jkotas.

@jkotas
Copy link

jkotas commented Jun 24, 2019

For public .NET Core APIs that take or return Span<byte> or Span<char> :

  • Span is not assumed to be zero terminated
  • Length does not include the zero terminator (because of there may not even be one)

The internal .NET Core implementation has a few places that create and operate on Spans that are zero terminated. This is local internal implementation detail that does not leak out through public APIs.

@ericsink
Copy link
Owner Author

@jkotas That makes sense. For a .NET-only API.

But what if you were passing a Span through P/Invoke to a C function that expects the zero terminator? Would you include that zero byte in the span length?

@jkotas
Copy link

jkotas commented Jun 24, 2019

P/Invoke methods cannot take Spans directly today.

I think you are really asking whether you should include zero byte in the Span length in your internal implementation details that prepare string passed to PInvoke. It is up to you and what makes sense for the type of code that you are writing.

For public APIs, I would recommend to be on the same plan as public .NET Core APIs.

@ericsink
Copy link
Owner Author

Okay. I admit this issue is arcane and my explanation of it above is tedious.

Anyway, at some point I will eventually cross paths with somebody else who is weighing the pros and cons of this.

Thanks for the replies.

@ericsink
Copy link
Owner Author

For others who might visit this issue, I want to add a bit more explanation for why "be on the same plan as public .NET Core APIs" doesn't seem helpful in this context.

The underlying C function requires a zero terminated string. That's just the way it is.

So let's say I expose a C# wrapper for that function which takes a Span, and to be consistent with .NET practices, the caller is not expected to include a zero terminator.

What that means is that I have to make a copy of the data in order to append the zero terminator before passing it down to the unmanaged function.

And this would defeat the whole point of using Span.

@tmds
Copy link

tmds commented Jun 25, 2019

Maybe you can write a ref struct that encapsulates Span<byte> and represents a zero-terminated string?

@ericsink
Copy link
Owner Author

@tmds I've been wondering about that possibility myself.

@ericsink
Copy link
Owner Author

ericsink commented Jun 25, 2019

A tidbit of good news noticed by sgjennings on Twitter:

This writeup on the upcoming Utf8String type:

dotnet/corefxlab#2368

contains this bit:

"ensuring a null terminator (important for p/invoke scenarios)"

which indicates an awareness of this issue.

I was aware of Utf8String, but had not yet seen any indication that dealing with zero terminator cases was (or was not) being considered as part of the design.

ericsink added a commit that referenced this issue Jun 26, 2019
…simply encapsulates (and hides) a ReadOnlySpan of byte, with the additional property of knowing for certain that it includes a zero terminator. it also supports the fixed pattern. and it has a constructor that takes a string. change ISQLite3Provider.sqlite3_open() to accept one of these instead of ROS byte. #273
@ericsink
Copy link
Owner Author

@tmds So yeah, as a first attempt, that idea was easier than I thought. See 17785ea

So sz (which needs a better name) is a ref struct type that simply encapsulates (and hides) a ReadOnlySpan<byte>, with the additional property of knowing for certain that it includes a zero terminator, because its only constructor ensures that. It also supports the fixed pattern. In fact, right now that's its whole API, a constructor and GetPinnableReference().

The notion here is that sz would be used in cases where the utf8 string is required to have a zero terminator, leaving ReadOnlySpan<byte> to be used in places where the full expectations associated with span are appropriate.

As mentioned in the commit message, this is an experimental idea. I'm just throwing some code around to see what works.

ericsink added a commit that referenced this issue Jun 26, 2019
…d of ROS byte. requires sz to have new ways of constructing, while continuing to ensure the zero terminator. requires sz.ToString(), called by extension method utf8_to_string. #273
ericsink added a commit that referenced this issue Jun 26, 2019
@ericsink
Copy link
Owner Author

After continuing the sz experiment for a return value and a callback parameter, it continues to work well.

I'm starting to believe the best practice here is to NOT publicly use ReadOnlySpan<byte> to describe a pointer that needs a zero terminator. Using ReadOnlySpan<byte> sets certain expectations, such as that slicing works, and there may not be a zero terminator, and "if there is a zero terminator it is certainly not included in the Length", and so on.

As an alternative, the sz concept (or something like it) gives the benefits of a span (it's still a ref struct) but gives type safety for the use cases where the underlying C-ish API requires a zero terminator. It is basically a constrained span with very limited functionality.

This may in fact be what @jkotas meant when he said "in your internal implementation details that prepare string passed to PInvoke", but I initially didn't understand it because I had not yet implemented my own ref struct type.

@jkotas
Copy link

jkotas commented Jun 26, 2019

I agree that the type safe wrapper makes in your case since you are passing the value around. I have not thought about it, but you have figure it out. It looks pretty neat.

@tmds
Copy link

tmds commented Jun 26, 2019

as a first attempt, that idea was easier than I thought.

cool!

This writeup on the upcoming Utf8String type:

Compared to the Span approach, afaik, Utf8String won't allow you to wrap native memory. Nor can you stackalloc the string.
The advantage of Utf8String over string is you no longer convert between between UTF-8 (SQLLite) and UTF-16 (.NET string).
With the Span approach, you can also avoid memory allocations in some cases.
It makes sense to use the Span approach where those allocations can be avoided, and Utf8String otherwise.
Last I heard, Utf8String isn't happening for 3.0.

After continuing the sz experiment for a return value and a callback parameter

In a callback the sz will be valid during the callback. But with a return value, there is an issue of scope when the sz is wrapping native memory.

@ericsink
Copy link
Owner Author

Utf8String won't allow you to wrap native memory.
The advantage of Utf8String over string is you no longer convert between between UTF-8 (SQLLite) and UTF-16 (.NET string).

Yeah, Utf8String (someday) will be complementary, but occupies a different place in the world. I'm trying to make it possible to use SQLitePCLRaw in a situation where everything is UTF-8, with minimal conversion and copying. For now, if the enclosing app or library is built around System.String, it won't be able to take full advantage, but I'll be more ready for the day when Utf8String is available.

In a callback the sz will be valid during the callback. But with a return value, there is an issue of scope when the sz is wrapping native memory.

Yeah, the method I experimentally converted over was already returning ReadOnlySpan<byte>, so that same limitation applied. The caller is expected to copy it if needed. For example, sqlite3_column_text() returns char *. If C# could overload on return type, I would provide one overload returning sz and one returning string to do the conversion. But since I can only have one method with that name, I choose the lower-level approach and let the caller do the extra work.

Thanks. Cheers.

ericsink added a commit that referenced this issue Jun 26, 2019
ericsink added a commit that referenced this issue Jun 27, 2019
…hich uses the length) or sz (which does not), since the underlying sqlite3 APIs allow that. #273
ericsink added a commit that referenced this issue Jun 27, 2019
…vel, since the sqlite3 API allows either approach #273
ericsink added a commit that referenced this issue Jun 27, 2019
ericsink added a commit that referenced this issue Jun 27, 2019
ericsink added a commit that referenced this issue Jun 28, 2019
@ericsink ericsink closed this as completed Sep 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants