-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raw content was changed by jq '.' #1741
Comments
This is a known behavior: see #1652 and https://github.com/stedolan/jq/wiki/FAQ#numbers Internally, we represent json numbers as IEEE754 doubles, which aren't precise at very large or very small magnitudes. Unfortunately, changing this behavior will impose a heavy cost on performance (bignum/bigmath libraries are slow), so we're taking a careful/slow approach to the problem, and haven't really addressed it yet. |
Thanks for your fast reply. |
Hey, just a suggestion here. |
There's actually a test branch we have lying around that does just that. My
main concern is that it could lead to a similar confusion. Also it makes
numbers a bit more complex internally. It's something we're looking at,
though.
…On Fri, Oct 19, 2018, 06:49 Leonid S. Usov ***@***.***> wrote:
Hey, just a suggestion here.
Do you think it would be a good idea to only convert numbers to doubles
when some math is about to take place, and otherwise preserve their string
representation?
I'm subscribed to the repository and in most cases I can recall the
problem of large integers was about some IDs being truncated. The "no math
no double" approach would make it work in all of those, and in case the
lossy conversion is required it may spit out a warning and proceed as it
does now.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1741 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ADQ4V71ocbGXoNfn7jhUQXr0vemI9r4kks5uma5FgaJpZM4XjoLa>
.
|
oops, i've already pushed my version of the same thing. But it was fun! |
Wait, what's that branch? Can't detect it by name...
|
It's here: #1327 |
Nice... i like the Nevertheless, I'd say that the pull requests are quite different. Your approach keeps I understand that integer IDs > 64 bits seem quite unlikely, but on the other hand it's hard for me to tell if using int64 is giving a dramatic increase in the speed compared to the full string literal implementation. The trick is that the only slowdown is full comparison of two literals (I don't know the stats, but assume it's a rare case) and potentially slower Also, what I would definitely improve on my version, and what is already part of your code, is the initial decision to whether it's even needed to store the literal. That could basically select between the best fitting subkind, including the integer, double and raw string literal, and eventually even maybe a bignumber. |
True, and there's a reason nico opted for that solution. We didn't want to introduce an additional Relatedly, I was going to point this out on your PR. Your linux-based tests are failing for memory leak issues (The macOS tests don't fail for memory issues because valgrind doesn't work on macOS). If you want to try and find all the places where we didn't
I might consider this a requirement to merge, though I'd have to think on it a bit.
I'm not too worried about this, especially with the logic described above, though I'd be interested in some potential benchmarks. |
I'll be happy to hear some more specific comments on the PR itself. re. valgrind - that kind of explains why I haven't noticed any speed time increase when it was enabled in the config. I have actually thought about the memory issues, and have changed the places I could spot by statically analyzing the code, but I didn't have much time to go over all other cases, just few hours I have dedicated to it yesterday. It seemed like going through the few functions in the Memory Management sections should do the job, namely But what I couldn't possibly check over other parts of the code, like the execution, and just assumed, that the above functions are called on all objects, including the JV_KIND_NUMBER. |
I was at work yesterday, so didn't have much time to look at the PR in detail. I'll try to get to that today.
Yep! Those tests are slooooow in valgrind. You'll know when it's running! I recommend making a cup of tea or something.
So the issue here is specifically that in builtins/execution/compilation/etc, if we know that a else if (jv_get_kind(a) == JV_KIND_NUMBER && jv_get_kind(b) == JV_KIND_NUMBER) {
return jv_number(jv_number_value(a) +
jv_number_value(b)); If |
@leonid-s-usov - Thanks for getting the ball rolling again on this long-simmering and important topic. Perhaps I should have chimed in earlier, but I am very uncomfortable with the
Oddly enough, though, in your implementation (jq-1.6rc1-23-g00c00a3-dirty):
So I am a bit confused about your intentions, not only with respect to 1 and 1.0 In any case, I would like to argue that jq should NOT take JSON literal numbers literally In fact, I'd like to propose two principles regarding "numbers" that I believe
where it is to be understood that:
Of course the JSON specifications of "number" are primarily syntactic, In any case, for jq, the two principles are uncontroversial for Formal PresentationUsing Sem(_) to denote the "semantic value" function, backticks to denote
Note that the second principle (2) only applies to JSON texts. Once Thus (2) requires that the jq expression The two principles can be stated more formally as follows, it being understood that
Since Sem(_) is defined in terms of mathematical real numbers, some
Extension to jq numeric valuesFor jq, we will want to define equality (
Once again, I would prefer that the semantics be based on the natural mapping to real numbers. What about jq numbers for which "isinfinite" is true?
Currently, this evaluates to true because both the left and and right are 'isinfinite', but Sem( |
@pkoppstein thanks for the response! It's definitely not too late. Actually on the PR itself I have mentioned this non-backward compatible change and also mentioned that it's not hard at all to consider the equality operator like any other arithmetic operator and actually compare the double representation of the number. Just a side note about
The formal presentation above looks promising, but I feel there are some gaps still. I will get back to this in the evening and hope to better understand what you propose. My main point about "literal" number comparison was that if one has a number value in their JSON which would overflow a double, and we go for preserving it given that there's no math involved, we should also support matching that number to a literally provided constant. |
Just a thought... what if we simply introduce a new binary operator to compare numbers semantically? Something like
I'm also thinking that maybe we can provide (3) by the means of number to string conversion preserving the number literal where available, to save on expanding the syntax with a new binary operator. |
Sure, but I would prefer to continue discussing the general requirements here (at #1741). Even the title of PR#1743 ("Save literal value of the parsed number to preserve it for the output") presupposes the pre-eminence of "literal values", and in any case, the discussion at PRs is usually about implementation details.
If there is a need to preserve some kind of "literal" representation, then it would be far, far better simply to support a filter for converting it to a string. If
These approaches can of course be combined in various ways, but assuming some combination of them, I see very little value in adding support for preserving each literal JSON string that is presented, whereas I do see large downsides, especially given that each real number has infinitely many valid JSON representations. Since jq advertises itself as a kind of
|
@pkoppstein
This was taken from my comment on the PR about a very specific issue, namely the strange behaviour of the equivalence operator I wasn't expecting, and I have described my finding around that issue there. Pretty much PR-scoped topic.
If you read a little bit lower in the same post of mine I'm actually questioning if a "tostring" conversion will be sufficient there. Let me just repeat the portion of that message here
So, in essence, what I eventually suggested was to fully maintain today's number behaviour, adding just two things:
To me this looks quite aligned with your guidelines. If you feel that way as well then let's try to move on.
I'm not sure if there are any requirement of a
WDYT? UPDATE I have just realised that due to how |
@leonid-s-usov - Let me emphasize at the outset that I am drawing a distinction between "JSON number literals" in general, and "the literal specification of a JSON number literal" in particular. To avoid tedium, I will use "JSON number" as a synonym for "JSON number literal". Any references to jq numeric values will be to the IEEE 64-bit float that jq currently uses. m == nPart (2) of my proposal requires that if m and n are JSON numbers, then m==n should evaluate to true iff m and n represent the same real number.
That's the easy part. The hardest part is the handling of m == y m == yIn order to formulate my proposal regarding So my basic proposal can be extended as follows:
Incidentally, using
Literal preservation of JSON numbersAs I've said, I think the PROs of having jq store JSON number |
Let me address your comment in three parts:
|
@leonid-s-usov - It looks like we're making good progress, but I believe we still have some way to go.
That's a tricky question, because at present one can only use a) a "raw value" corresponding to a JSON numeric literal In my previous posts, I have generally proposed that the "raw value" would be based on "real number" semantics, but now may be a good time to mention that I am not opposed to introducing a distinction between integer and non-integer numeric values. With this variation:
However whether or not this variation is supported, I would still want real-number semantics for Regarding:
Yes, of course that should have read literal/1 vs toliteral/0
jq already has an overwhelming bias towards having filters acting on steams of values, as illustrated most glaringly by |
@leonid-s-usov wrote:
There is a slight ambiguity in your question, so let me emphasize that a) the canonicalization algorithm I have in mind will preserve the mathematical real-number value of ALL JSON number literals. b) The canonical value will (at least logically) always be used UNTIL an arithmetic operation is performed. (Of course, as an optimization, if the IEEE 64-bit representation can be used as the canonical value, then so be it.) In a nutshell, my proposal and hope is that:
but: 1.0E1000 == (0+1.0E10000) #=> false # because the RHS becomes an IEEE 64-bit float As I also mentioned, I have no objection to the expression |
@leonid-s-usov - Unfortunately, github somehow completely lost my most recent post. Since I don't have time to try to reproduce it just now, let me just highlight some salient points that I hope are still relevant: integersAlthough I have proposed and hope for a "real-number semantics" for
literal/1 vs toliteral/0jq is very strongly oriented, both in theory and in practice, to filters, especially for transformations such as Of course there are cases when there is no choice but to provide the stream of inputs as an argument, e.g.
Even here, though, it could be argued that a better version would take the input as the initial value:
The deciding factor, perhaps, is that by encapsulating the to-literal functionality as an arity-0 function, it follows necessarily that each input will be processed independently of the others, whereas that is not the case for functions that have arguments that can be streams. Thus |
@leonid-s-usov - github won't let me post at PR#1743 so let me try here. Here's a test case which highlights a discrepancy between the current implementation of jq (jq-1.6rc1-25-g56f6124-dirty) and the principles I'm advocating:
By the principle:
the result should of course be false. However, if your goal is to have a number-related PR accepted for jq 1.5.x, then the "real-number semantics" for == that I have advocated will probably be judged unacceptable. Indeed the same might be true of all the changes (as opposed to additions) that we're proposing, so it would be good to hear from @wtlangford and @nicowilliams on this point. For their benefit, here in a nutshell, are the main questions:
|
@leonid-s-usov - github is currently malfunctioning rather badly. Some of my posts have simply disappeared, or rather, they disappear, reappear, and then disappear again. For now, I'd just like to add that if your goal is to have a number-related PR accepted for jq 1.5.x then the "real-number semantics" for
|
@leonid-s-usov wrote:
In general, jq avoids raising errors if at all possible, and in this case, I think it would be advisable to avoid raising an error, though of course there are some interesting edge cases. Since I believe that ultimately the "to-literal" function will probably be named something like First, in the ordinary case, I think it would be rather odd if What then about Consider:
Since jq effectively regards |
OK, picked up the Thanks for the Practically, This explains why |
I'll chime back in here with a few notes:
|
1. Hooray, 1.6 !
2. What a pity that this can’t be included.. Maybe you could review this point after checking the branch?
Please make sure to review the latest version of the code (note that GitHub is glitching, maybe fetching it from my repo directly would be more reliable)
[email protected]:leonid-s-usov/jq.git branch `literal_number`
The last commit there was a large refactor so you may want to exclude it to see the actual changes.
I believe that the PR is exactly what you are looking for in the first point of your list.
Re. exposing the internals… Well if there was a more generic way of dumping jv to string I’d use it and this way encapsulate the implementation details about literals. But currently the only way to control the output is to expose the needed data to the jv_print.c, so I don’t think there’s much choice here. It’s pretty much in line with jv_string_value and jv_invalid_msg
… On 22 Oct 2018, at 16:21, William Langford ***@***.***> wrote:
I'll chime back in here with a few notes:
I'm currently only interested in a solution to the ". changed my number!" problem, and that if you do anything on/with the number (even ==), it falls into a IEEE754 double. I'm willing to punt the remainder of that problem to a later improvement (bignums?).
I'm not keen on exposing the internals of the literal. I think this is an implementation detail only. I see this sort of change as an extension underneath the jv type, where regardless if you got a jvp_number_literal or a double as the underlying storage, it still acts just the same! The only difference is that the jvp_number_literal doesn't get approximated if you don't do anything to it.
I'm going to be releasing 1.6 this week. I've frozen the feature set at this point and am just working on a few final administrative tasks. This change won't be in it. The goal from here is to move to a more regular/typical release cycle. Since this isn't about to be part of a proper, tagged release, I'm a bit more willing to do some... smaller, incremental changes that we can all live with and play with.
I'll have some comments to share on the PR shortly.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#1741 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AC_851rV0nUiUS1l6oedNJEQO7qgnfpnks5uncZPgaJpZM4XjoLa>.
|
It's primarily a desire to let something like this live on master for a while before committing to it in a release. It's not really a comment on the branch itself. Again, the goal is to not go another 3 years before 1.7.
Ah, that comment was specifically about exposing the details of this to someone's jq program. The thought being: "if we later come up with a solution that makes this unnecessary and we remove the functionality, I don't want to have to maintain a |
It's a pity.. What is the licence you would accept? The more I think about it the more I feel that using a decimal instead of string for the literal representation is the best thing we can and maybe even will have to do. |
There is also this distribution of the library, They are having an ICU licence. Would that work? https://github.com/unicode-org/icu/blob/master/icu4c/LICENSE, see paragraph 1. |
Well, there's some logic behind that. First of all, given proper decimal representation we can really compare literals, like for real! No reason we'd not want to do that, cause the Now, imagine that I have a long 128 bit identifier as an integer. Using jq i'd want to compare it to another id or even find a distance between them. Or match another id with an offset from this id. On the other hand, when multiplying two high precision numbers I should understand that my precision should become twice as big for the result. This is something i'd hardly expect if I knew that any other math calculations would give me IEEE double precision, - and that is a really nice precision for an everyday math. Well, we could also consider making other non-lossy operations on two literals, and fall back to doubles when functions are called or one of the operands is already a double. That could also do, but then we're in the danger of making it a long computation if one decides to reduce an array of literal numbers.... |
@wtlangford Please comment on whether the ICU license is acceptable
|
Quickly scanning through this issue, I don't see that anyone has explicitly mentioned the safe integer limit of a IEEE 754 double-precision binary floating-point. The significand is 53 bits, so the maximum safely representable integer is 2^53 = 9007199254740992. You can try this experiment in GCC.
A reasonable thing to do in the short term is to error and exit (perhaps optionally with a command line option) when reading an integer that exceeds 9,007,199,254,740,992. See https://en.wikipedia.org/wiki/Double-precision_floating-point_format |
@sjackman That might not actually be a reasonable thing to do though because there almost certainly are users who do understand the limitations of IEEE754 and are using real numbers and large integers in JSON and jq. There's a pending PR that adds a bigreal implementation, but even that might need to be an option (perhaps the default) for performance reasons. |
Using real numbers is okay, and reading a real number that exceeds 9e15 is also okay. Reading an integer that exceeds 2^53 however is going to result in subtly corrupting the data. In that case, it'd be better to produce an error. That could be optional behaviour, but I'd definitely prefer an error rather than corrupting the data. |
I don't agree. jq must be able to read numbers too big to represent exactly because it does today and because users can reasonably expect it to continue to. Many JSON parsers use IEEE754 and so users are fairly used to this danger. |
I'd suggest an option (say |
That would work. First we should get the bigreal contribution integrated. |
Speaking of which, what's the plan here? There was some progress when @wtlangford pushed some thread local additions to the PR on January 5. |
@leonid-s-usov, @wtlangford told me it'd be ready very soon. I'll talk to him this week. |
Another example: https://jqplay.org/s/9ZW4XcEwqX This:
Gets changed to
|
@peterlynch one may easily overflow this thread with such examples. They are literally infinite. |
It's quite surprising to get something like this, what happened to my minor version number! $ echo '{"version": 0.0}' | jq .version
0 |
|
I'd like to further comment on this- While the changes on the master branch should have the side effect of preserving your precision, I don't consider that to be the primary intent of those changes- those changes are intended to avoid, when possible, floating-point approximations on numbers that aren't precisely representable by IEEE754 doubles (our internal representation for numbers in jq). |
@wtlangford - A small point, but although RFC7159 does mention that "This specification allows implementations to set limits on the range and precision of numbers accepted", I don't see anything that requires that 0 and 0.0 be mapped to the same value or be regarded as equivalent or equal. Am I missing something? |
The way I read the above suggests to me that
|
Put Discord Channel ID bettwen 2 ", or it will be changed by jq. More detail: github.com/jqlang/jq/issues/1741 Add line to detect if Discord Channel ID is default
Is there any suggestions about convert json stream with uinit now? |
jq 1.7 released with the fix. closing |
Bad case as attachment
bad_case.json.txt
The text was updated successfully, but these errors were encountered: