add doc with considerations on the language and its future (#3)
# Considerations on the Protocol Buffers Language

There are several parts of the [specification](./LanguageSpec.md) that are complicated and
inconsistent. These parts exist as unintentional side effects of implementation
details in Google's reference compiler, `protoc`.

This demonstrates why having a clear language spec is valuable: if the spec had
been written first, the resulting compiler implementation would have been clearer.
These peculiar behaviors in the compiler could then have been categorized as bugs
and fixed. However, because there was no language spec at the outset of
developing Protobuf, these implementation details have become the de facto spec.
These quirks cannot be clearly categorized as bugs because there may be users
and source code that _rely_ on this behavior. (This is a sort of corollary to
[Hyrum's Law](https://www.hyrumslaw.com/).)

The following sections describe potential _changes_ to the specification that
would make it much simpler, more straightforward, and more consistent. These
changes are not backwards compatible, so introducing them would require a new
syntax (e.g. "proto3-strict" or "proto4").

## Resolving Relative References
The most complicated and least consistent part of the specification is the section
on [resolving references](./LanguageSpec.md#reference-resolution).

There are four inconsistencies in particular that ideally would be corrected:
1. The way an option name is resolved vs. a field type name inside a message is
   inconsistent. A field type name may refer to a sibling nested message or
   enum without any qualifiers. However, an option name may not refer to a
   sibling nested extension: it must qualify the extension name with the name
   of the enclosing message.
2. An unqualified name for the extendee of an "extend" block or for a field
   type name will not fail if there is a sibling extension with the same name,
   as long as a type with the correct name exists in an ancestor scope. But
   this is not true for the other kind of type reference: the input and output
   type names of a method. For methods, if a nearer scope has an eponymous
   extension (for example), the reference is considered invalid.
3. Similarly, resolving a custom option name will fail if a nearer scope has
   an eponymous element that is not an extension. But the logic for skipping
   non-type elements when resolving a field type reference could be
   generalized: resolving a custom option name could skip elements that are
   not extensions.
4. Unqualified names are handled differently from qualified names. When the
   name is qualified, the behavior described in the previous bullets does not
   apply (where a matching symbol in a nearby scope is ignored if it is the
   wrong kind, in favor of a matching symbol in a farther scope). The
   behavior in the qualified case has more conditions (like matching the
   first name component only to a composite element), which makes it more
   error-prone in practice. An alternative is to always treat qualified names
   as if they were unqualified. Combined with the change in bullet 3 above,
   this would improve the ergonomics of all relative references, as there
   would be far fewer cases where the resolution incorrectly finds an element
   of the wrong kind.

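The first inconsistency in the list above can be seen in a small proto2 file. This is only a sketch based on the rule as stated in item 1: the package `demo`, the names `Outer`, `Nested`, and `label`, and the field number `50001` are made up for illustration.

```protobuf
syntax = "proto2";

package demo;

import "google/protobuf/descriptor.proto";

message Outer {
  message Nested {}

  // A custom message option declared as a sibling of Nested.
  extend google.protobuf.MessageOptions {
    optional string label = 50001;
  }

  // Accepted: a field type may name a sibling nested message
  // without any qualifiers.
  optional Nested nested = 1;

  // Accepted: the extension name is qualified with the name of
  // the enclosing message.
  option (Outer.label) = "example";

  // Rejected, per item 1: the unqualified sibling extension is
  // not found when resolving the option name.
  // option (label) = "example";
}
```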
Another aspect that is unergonomic and a candidate for change is the existence
of the service scope. None of the elements in this scope (services and methods)
can actually be referenced by other declarations. So resolution of names that
appear inside a service (such as service or method options and methods' input
and output types) could behave as if the names appeared in the enclosing package
scope. This way, some other method would never even be considered when resolving
a method's input or output type name. Admittedly, if a feature were ever added
to the language that allowed for referring to a service or method, then such a
change would be counter-productive.

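A short sketch of how the service scope gets in the way, assuming resolution of a method's input type begins in the scope of the enclosing service. The names `demo`, `Pinger`, `Ping`, and `Pong` are hypothetical.

```protobuf
syntax = "proto3";

package demo;

message Ping {}
message Pong {}

service Pinger {
  // Problematic today: the search for "Ping" starts in the service
  // scope, where "Ping" names this method rather than the message.
  // rpc Ping(Ping) returns (Pong);

  // Workaround: a fully-qualified name bypasses the service scope.
  rpc Ping(.demo.Ping) returns (Pong);
}
```

If names inside a service were resolved from the enclosing package scope instead, the commented-out declaration would refer to the message `demo.Ping` without any qualification.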
## Lack of Coherence in Option Values
It is an unfortunate inconsistency that one cannot use array literal notation
(i.e. a sequence of values enclosed in brackets, `[` and `]`) to define the
value for a custom option field that is repeated. This notation is only allowed
inside a message literal.

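The contrast can be sketched with two custom file options, one repeated scalar and one message. The names `demo`, `Meta`, `tags`, and `meta` and the field numbers are assumptions for the example.

```protobuf
syntax = "proto2";

package demo;

import "google/protobuf/descriptor.proto";

message Meta {
  repeated string tags = 1;
}

extend google.protobuf.FileOptions {
  repeated string tags = 50001;
  optional Meta meta = 50002;
}

// Not allowed: an array literal as the value of a repeated option.
// option (tags) = ["a", "b"];

// Instead, the option is declared once per element.
option (tags) = "a";
option (tags) = "b";

// Allowed: an array literal for a repeated field inside a message literal.
option (meta) = { tags: ["a", "b"] };
```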
Inside a message literal, the syntax switches to that of the Protobuf text
format. Instead, the Protobuf language could use a subset of the text format (or
an alternate syntax that is similar) that is more streamlined and more
consistent. The following aspects of the text format are particularly inconsistent
with the rest of the language:

1. The Protobuf IDL uses curly braces (`{` and `}`) for block elements. But
   the Protobuf text format also allows for the use of `<` and `>`.
2. The text format allows for eliding the colon between a field name and its
   value if the value is a message literal (or a list of message literals).
   While this may be convenient, the inconsistency is confusing (especially
   since the colon may be elided only for lists of messages, not for lists of
   scalar values). Two alternatives for increasing the consistency follow:
   1. Always require the colon. This makes the message literal more
      closely resemble formats like JSON and YAML.
   2. Allow the colon to be elided for any list value, with the observation
      that the `[` preceding the value is effective as the separator from
      the field name (just as `{` or `<` is for message literal values).
3. The text format encloses extension names in brackets (`[` and `]`). But
   other aspects of the IDL, such as option naming, use parentheses to
   enclose extension names (`(` and `)`). Allowing message literals to use
   parentheses (and perhaps omitting support for brackets) would make the
   syntax and punctuation in the language more internally consistent.
4. The text format allows `,` or `;` as a separator, but neither is
   required. The IDL syntax in other places is not as flexible: you must use
   a comma in compact options and in range lists, and you must use a
   semicolon for separating most other elements. So requiring a comma (and
   not allowing a semicolon) would make the syntax and punctuation in the
   language more internally consistent.
5. Though the text format has no context or scope (since it can be used as
   a data exchange format), message literals in the Protobuf IDL **do** have
   such a context. So it would be convenient if extension names in message
   literals supported the same kinds of references as option names: relative
   references allowed and a leading dot (`.`) allowed to indicate the name
   is fully-qualified.

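Several of the points above show up together in a single message literal. The following is a sketch, with hypothetical names (`demo`, `Inner`, `Config`, `level`, `cfg`) and field numbers; every entry in the literal uses a different but currently accepted spelling.

```protobuf
syntax = "proto2";

package demo;

import "google/protobuf/descriptor.proto";

message Inner {
  optional string name = 1;
}

message Config {
  repeated Inner inner = 1;
  extensions 100 to 199;
}

extend Config {
  optional int32 level = 100;
}

extend google.protobuf.FileOptions {
  optional Config cfg = 50001;
}

option (cfg) = {
  inner < name: "a" >     // angle brackets, colon elided, no separator
  inner { name: "b" };    // curly braces, colon elided, semicolon separator
  inner: { name: "c" },   // colon present, comma separator
  [demo.level]: 1         // extension name in brackets, fully qualified
};
```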
Addressing these issues would go a long way towards making this part of the
language syntax more coherent.

## Overloading Keywords
In the Protobuf IDL, keywords are allowed as identifiers for user-defined
elements, such as messages, fields, enums, etc. There are a handful of
places where they may not be used, such as the first component of a type name
for a field declaration that omits the cardinality. But this is only to
prevent ambiguity for the parser (and is mostly due to an implementation detail
of the hand-written recursive descent parser in `protoc`).

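A small proto2 sketch of keywords used as identifiers; the package name `demo` is hypothetical. The closing comment describes the restricted case as I understand it from the paragraph above: a proto3 field with no cardinality whose type name starts with a keyword.

```protobuf
syntax = "proto2";

package demo;

// Accepted today: keywords double as message names and field names.
message message {
  optional string string = 1;
  optional message optional = 2;
}

// In a proto3 file, where the cardinality may be omitted, a field such as
//   message child = 3;
// is not accepted, because a leading "message" reads as the start of a
// nested message declaration. A leading dot avoids the ambiguity:
//   .demo.message child = 3;
```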
A stronger stance on keywords would be to prevent their use in user-defined
identifiers. This is how many languages are specified, and it can lead to
simpler parsing, as well as making the source easier to read since language
keywords aren't overloaded.

If keywords were strictly disallowed in identifiers, a new category of
"predeclared identifiers" could be created, which would be a subset of the
current keywords. This distinction would accommodate any words in the language
that _should_ remain usable as user-defined identifiers.
Keywords cannot be used this way; predeclared identifiers can be.

## Extensions and Any
Both extensions (in proto2) and the `google.protobuf.Any` type attempt to solve
similar problems: the ability to extend the content of a message to include
other user-defined types, but without the message needing a priori (compile-time)
knowledge of those user-defined types. This is most useful when writing generic
container and envelope types.

Extensions effectively reverse the dependency: instead of the base message
definition needing to import the user-defined field type, the user-defined field
type needs to import the base message in order to extend it.

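The reversed dependency can be sketched with two files. The file names, packages, and message names here (`envelope.proto`, `audit.proto`, `Envelope`, `Trail`) are hypothetical.

```protobuf
// envelope.proto: the base message imports nothing from its extenders.
syntax = "proto2";

package demo;

message Envelope {
  optional string id = 1;

  // Field numbers set aside for extensions declared elsewhere.
  extensions 100 to max;
}
```

```protobuf
// audit.proto: the extender imports the base message.
syntax = "proto2";

package audit;

import "envelope.proto";

message Trail {
  optional string actor = 1;
}

extend demo.Envelope {
  optional Trail trail = 100;
}
```

Every extender of `demo.Envelope` has to pick a number from the same range, which is where the coordination problem comes from.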
* The "good":
  * Extensions are easily serialized just like any other field of the base
    message would be.
  * Extensions have semantic names, which aids readability.
  * Extensions are treated just like any other unrecognized field when the
    consumer of a message is not aware of all extensions.
  * Extensions are declared in the IDL, so the association of the user-defined
    type with the base message is part of the schema.
* The "bad":
  * Extensions require some form of coordination amongst all extenders to avoid
    a tag conflict.
  * If an extension is present in the JSON format but the consumer of the data
    does not recognize it, it is discarded. (Though the same is true of normal
    fields, too.)
  * Transcoding from binary to JSON and vice versa requires knowledge of
    the extensions. There is no way to translate these formats without
    a descriptor registry of some sort.

`Any` messages completely decouple the base message and any user-defined types:
they are completely unrelated in the Protobuf IDL. They are, for the most part,
implemented completely in the runtime and not necessarily part of the language.
The only place they appear in the language is for their custom text format, for
use with declaring option values whose type is `google.protobuf.Any`.

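A sketch of both uses of `Any`: as an ordinary field type, and as an option value written in the custom text format. The names `demo`, `Note`, `detail`, and `Envelope` and the field number `50001` are assumptions for the example.

```protobuf
syntax = "proto3";

package demo;

import "google/protobuf/any.proto";
import "google/protobuf/descriptor.proto";

message Note {
  string text = 1;
}

// A generic container: nothing in the schema says what payload may hold.
message Envelope {
  string id = 1;
  google.protobuf.Any payload = 2;
}

extend google.protobuf.FileOptions {
  google.protobuf.Any detail = 50001;
}

// The bracketed type URL names the concrete message type of the value.
option (detail) = {
  [type.googleapis.com/demo.Note] { text: "hello" }
};
```

The value is identified only by the type name `demo.Note`; there is no counterpart to an extension's own field name.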
* The "good":
  * Implemented almost entirely in the runtime; very little concern needed in
    the language specification itself.
  * When using the binary format, unrecognized message types can conveniently
    be ignored, similar to unrecognized fields.
  * Fields of type `Any` can be easily serialized to the proto binary format.
* The "bad":
  * The data is identified by the fully-qualified name of the user-defined
    message type. This means there is no semantic name, unless the message
    type itself has a semantic name. For scalar values, a custom wrapper
    type must be created with a semantic name. Use of generic message types,
    such as the well-known types, is bad practice due to lack of a semantic
    message name.
  * Since they are not part of the language, there is no way to define the
    "schema", such as what kinds of messages are allowed in any given field
    of type `google.protobuf.Any`. So a field of type `Any` is a total
    "free for all".
  * Transcoding from binary to JSON and vice versa requires knowledge of
    the concrete types inside. There is no way to translate these formats
    without a descriptor registry of some sort.
  * Unlike with extensions, where unrecognized extensions are ignored
    when serializing to JSON, trying to serialize an `Any` that contains
    an unrecognized type to JSON results in runtime errors. One must
    manually strip unrecognized message types (if the desired outcome
    is to ignore the unrecognized data).

Given the downsides to each of these, there may be room for a different
feature that could suffice as a replacement for extensions in a "proto4"
syntax. (No specific ideas yet.)