Base-16 string literals for concise definition of arbitrary data blobs #7077
Unanswered
hrumhurum asked this question in Language Ideas
Replies: 3 comments · 3 replies
-
I have created a source generator for this. The source generator was originally made for UTF-8, but now also supports Base-16 and Base-64. (However, it has not been tested much and is unstable.)
-
I feel there's an endianness issue that this proposal hasn't said anything about.
Summary
The existing UTF-8 String Literals proposal adds the ability to write UTF-8 string literals in C# and have them automatically encoded into their UTF-8 `byte` representation. However, the language does not provide a concise way to create `byte` blobs of arbitrary data. If we develop the idea of string literals further, it can become the basis for defining the contents of arbitrary data blobs.

Background and Motivation

The existing way of defining a `byte` blob in C# is to use the array creation syntax. While the array syntax works, it is not always convenient in practice. Oftentimes a more concise approach is desirable when well-known byte sequences are being encoded in source code.
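For comparison, here is a minimal sketch of that array creation syntax; the sample bytes (the well-known 8-byte PNG file signature) are assumed for illustration:

```csharp
// A well-known byte sequence (the PNG file signature) written with today's array creation syntax.
byte[] pngSignature = new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
```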
Proposed Feature
Base-16 encoding is a widespread convention that has been in use since the 1970s or even earlier. It is also widely known as HEX encoding. The basic idea behind Base-16 encoding is that a sequence of bytes is represented by a series of HEX-encoded [00..FF] numbers, like so:
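(The sample data here is assumed for illustration: the Base-16 encoding of the ASCII text "Hello, world!".)

```
48656C6C6F2C20776F726C6421
```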
The encoding is case-insensitive, so the following string works equally well:
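(The same assumed sample data, in lowercase.)

```
48656c6c6f2c20776f726c6421
```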
Both string representations carry the same data as the following C# array:
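(The equivalent C# array for the assumed sample data.)

```csharp
byte[] data = new byte[]
{
    0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x77,
    0x6F, 0x72, 0x6C, 0x64, 0x21
};
```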
The proposed Base-16 string literals can use the approach pioneered by the already-implemented UTF-8 String Literals feature. For example, to get the `byte` representation of a Base-16 encoded string, the C# syntax sketched below can be used. Correspondingly, the rules applied to UTF-8 string literals also apply to Base-16 string literals.
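A sketch of what that syntax could look like, assuming the `b16` suffix discussed below and a `ReadOnlySpan<byte>` result type mirroring `u8` literals:

```csharp
// Proposed syntax (not valid C# today): a Base-16 string literal with the b16 suffix.
// Assumed to produce the same 13 bytes as the array above.
ReadOnlySpan<byte> data = "48656C6C6F2C20776F726C6421"b16;
```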
Separators in Encoded Literals
Base-16 encoding generally allows the use of whitespace separators between the symbols, like so:
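(The same assumed sample data, with a space between each encoded byte.)

```
48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21
```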
This makes the encoded data more recognizable and comprehensible to the human eye. It also means that the following widespread syntax and its variations are allowed as well:
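(A sketch of one such convention; the particular grouping shown, four bytes per group, is an assumption.)

```
48656C6C 6F2C2077 6F726C64 21
```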
An interesting consequence of this is that the classical "memory dump" notation becomes readily available:
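A sketch of how that might look, assuming the `b16` suffix also applies to verbatim strings and that newlines count as whitespace separators; the sample bytes are the ASCII text "Hello, world! This is a data blob.":

```csharp
// Proposed syntax (not valid C# today): a multi-line Base-16 literal laid out like a memory dump,
// 16 bytes per row.
ReadOnlySpan<byte> blob = @"
    48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 20 54 68
    69 73 20 69 73 20 61 20 64 61 74 61 20 62 6C 6F
    62 2E
    "b16;
```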
Or, even more tidily, using the raw string syntax:
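(The same assumed data as a raw string literal, again under the assumption that the `b16` suffix combines with raw strings the way `u8` does.)

```csharp
// Proposed syntax (not valid C# today): the same dump expressed as a raw string literal.
ReadOnlySpan<byte> blob =
    """
    48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 20 54 68
    69 73 20 69 73 20 61 20 64 61 74 61 20 62 6C 6F
    62 2E
    """b16;
```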
Compare that with the existing array creation syntax, which is somewhat clunky for this purpose:
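(The same assumed 34 bytes written with today's array creation syntax.)

```csharp
byte[] blob = new byte[]
{
    0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x77, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x20, 0x54, 0x68,
    0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x64, 0x61, 0x74, 0x61, 0x20, 0x62, 0x6C, 0x6F,
    0x62, 0x2E
};
```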
Another benefit of Base-16 string literals is that they can be copied directly from and to specs, RFCs, listings, and so on, making the embedded data less prone to transcription errors. Base-16 is already an established lingua franca that needs no translation.
Invalid Literals
Some inputs are not valid Base-16 and are therefore prohibited:
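A sketch of the kinds of inputs that would presumably be rejected; the exact rule set is assumed here from general Base-16 conventions:

```csharp
// Presumably compile-time errors (proposed syntax, not valid C# today):
var odd    = "48656"b16;  // odd number of hex digits, so the final byte is incomplete
var nonHex = "48GZ"b16;   // contains characters outside [0-9A-Fa-f]
```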
Endianness
Base-16 encoding is endianness-agnostic and always produces consistent results regardless of the host platform. In contrast, `System.BitConverter` produces different results depending on the host CPU architecture, as illustrated by the following code:
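A minimal sketch of the kind of comparison intended; the specific sample value is assumed:

```csharp
using System;

int value = 0x11223344;
byte[] bytes = BitConverter.GetBytes(value);

// Prints "44-33-22-11" on a little-endian CPU (e.g. x86/x64),
// but "11-22-33-44" on a big-endian CPU.
Console.WriteLine(BitConverter.ToString(bytes));
```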
That's why the current proposal specifically revolves around Base-16 encoding and not hexadecimal numbers. Those are different things, and it is good practice to keep them separate by being specific. Endianness concerns are the main reason why the `b16` literal suffix is preferred to the more user-friendly `hex`: it helps to avoid the confusion.

Other Encodings
The presented approach opens the way for potential support of other encodings that have widespread use, e.g. Base-64.