Base-16 string literals for concise definition of arbitrary data blobs #7077
Unanswered
hrumhurum asked this question in Language Ideas
Replies: 3 comments · 3 replies
-
I have created a source generator for this. The source generator was originally made for UTF-8, but now also supports Base-16 and Base-64. (However, it has not been tested much and is unstable.)
-
I feel there's an endianness issue that this proposal hasn't said anything about.
Summary
The existing UTF-8 String Literals proposal adds the ability to write UTF-8 string literals in C# and have them automatically encoded into their UTF-8 `byte` representation. However, the language does not provide a concise way to create `byte` blobs of arbitrary data. If we develop the idea of string literals further, it can become the basis for defining the contents of arbitrary data blobs.

Background and Motivation

The existing way of defining a `byte` blob in C# is to use the array creation syntax. While the array syntax works, it is not always convenient in practice. Oftentimes a more concise approach is desirable when well-known byte sequences are being encoded in source code.
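For comparison, here is a minimal sketch of that array creation syntax; the sample bytes (the well-known 8-byte PNG file signature) are assumed for illustration:

```csharp
// A well-known byte sequence (the PNG file signature) written with today's array creation syntax.
byte[] pngSignature = new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
```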
Proposed Feature
Base-16 encoding is a widespread convention that has been in use since the 1970s or even earlier. It is also widely known as HEX encoding. The basic idea behind Base-16 encoding is that a sequence of bytes is represented by a series of HEX-encoded [00..FF] numbers, like so:
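(The sample data here is assumed for illustration: the Base-16 encoding of the ASCII text "Hello, world!".)

```
48656C6C6F2C20776F726C6421
```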
The encoding is case-insensitive, so the following string works equally well:
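(The same assumed sample data, in lowercase.)

```
48656c6c6f2c20776f726c6421
```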
Both string representations carry the same data as the following C# array:
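(The equivalent C# array for the assumed sample data.)

```csharp
byte[] data = new byte[]
{
    0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x77,
    0x6F, 0x72, 0x6C, 0x64, 0x21
};
```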
The proposed Base-16 string literals can use the approach pioneered by the already-implemented UTF-8 String Literals feature. For example, to get the `byte` representation of a Base-16 encoded string, the C# syntax sketched below can be used. Correspondingly, the rules applied to UTF-8 string literals also apply to Base-16 string literals.
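A sketch of what that syntax could look like, assuming the `b16` suffix discussed below and a `ReadOnlySpan<byte>` result type mirroring `u8` literals:

```csharp
// Proposed syntax (not valid C# today): a Base-16 string literal with the b16 suffix.
// Assumed to produce the same 13 bytes as the array above.
ReadOnlySpan<byte> data = "48656C6C6F2C20776F726C6421"b16;
```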
Separators in Encoded Literals
Base-16 encoding generally allows the use of whitespace separators between the symbols, like so:
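(The same assumed sample data, with a space between each encoded byte.)

```
48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21
```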
This makes the encoded data more recognizable and comprehensible to the human eye. It also means that the following widespread syntax and its variations are allowed as well:
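(A sketch of one such convention; the particular grouping shown, four bytes per group, is an assumption.)

```
48656C6C 6F2C2077 6F726C64 21
```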
An interesting consequence of this is that the classical "memory dump" notation becomes readily available:
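A sketch of how that might look, assuming the `b16` suffix also applies to verbatim strings and that newlines count as whitespace separators; the sample bytes are the ASCII text "Hello, world! This is a data blob.":

```csharp
// Proposed syntax (not valid C# today): a multi-line Base-16 literal laid out like a memory dump,
// 16 bytes per row.
ReadOnlySpan<byte> blob = @"
    48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 20 54 68
    69 73 20 69 73 20 61 20 64 61 74 61 20 62 6C 6F
    62 2E
    "b16;
```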
Or, even more tidily, using the raw string syntax:
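(The same assumed data as a raw string literal, again under the assumption that the `b16` suffix combines with raw strings the way `u8` does.)

```csharp
// Proposed syntax (not valid C# today): the same dump expressed as a raw string literal.
ReadOnlySpan<byte> blob =
    """
    48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 20 54 68
    69 73 20 69 73 20 61 20 64 61 74 61 20 62 6C 6F
    62 2E
    """b16;
```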
Compare that with the existing array creation syntax, which is somewhat clunky for this purpose:
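(The same assumed 34 bytes written with today's array creation syntax.)

```csharp
byte[] blob = new byte[]
{
    0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x77, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x20, 0x54, 0x68,
    0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x64, 0x61, 0x74, 0x61, 0x20, 0x62, 0x6C, 0x6F,
    0x62, 0x2E
};
```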
Another benefit of Base-16 string literals is that they can be copied directly from and to specs, RFCs, listings, and so on, making the embedded data less prone to transcription errors. Base-16 is already an established lingua franca that needs no translation.
Invalid Literals
Some inputs are not valid Base-16 and are therefore prohibited:
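A sketch of the kinds of inputs that would presumably be rejected; the exact rule set is assumed here from general Base-16 conventions:

```csharp
// Presumably compile-time errors (proposed syntax, not valid C# today):
var odd    = "48656"b16;  // odd number of hex digits, so the final byte is incomplete
var nonHex = "48GZ"b16;   // contains characters outside [0-9A-Fa-f]
```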
Endianness
Base-16 encoding is endianness-agnostic and always produces consistent results regardless of the host platform. In contrast, `System.BitConverter` produces different results depending on the host CPU architecture, as illustrated by the following code:
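A minimal sketch of the kind of comparison intended; the specific sample value is assumed:

```csharp
using System;

int value = 0x11223344;
byte[] bytes = BitConverter.GetBytes(value);

// Prints "44-33-22-11" on a little-endian CPU (e.g. x86/x64),
// but "11-22-33-44" on a big-endian CPU.
Console.WriteLine(BitConverter.ToString(bytes));
```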
That's why the current proposal specifically revolves around Base-16 encoding and not hexadecimal numbers. Those are different things, and it is good practice to keep them separate by being specific. Endianness concerns are the main reason why the `b16` literal suffix is preferred to the more user-friendly `hex`: it helps to avoid the confusion.

Other Encodings
The presented approach opens the way for potential support of other encodings that have widespread use, e.g. Base-64.