Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort handles Incomparable types by default #5742

Closed
Akirathan opened this issue Feb 22, 2023 · 33 comments · Fixed by #5998
Closed

sort handles Incomparable types by default #5742

Akirathan opened this issue Feb 22, 2023 · 33 comments · Fixed by #5998
Assignees
Labels
--data corruption or loss Important: data corruption or loss -libs Libraries: New libraries to be implemented p-medium Should be completed in the next few sprints

Comments

@Akirathan
Copy link
Member

Akirathan commented Feb 22, 2023

Problem Definition

The following snippet should not fail with Incomparable_Values error:

[Nothing, Number.nan, 3, 2, 1].sort

The problem is, that Nothing and Number.nan are incomparable, i.e., they return Nothing from their compare method, whereas integers and decimals are comparable.

User Wanted Behavior

  • User can sort vectors with Nothing: Nothing ends up together at one end. No warning.
  • User can sort vectors with NaN: NaN ends up next to floats above Nothing. Attach a warning .
  • If incomparable types (with different Comparator.from) found: Parcel into groups, sort independently and attach a warning. Bigger parcel first, smaller then.

Warn, Fail or Nothing

Introduce new parameter to sort: (on_incomparable = Warn) : Fail | Warn | Accept.

Re. "...ends up at one end..." - (optionally in @JaroslavTulach opinion) introduce a new argument in sort method that specifies how these incomparable values should be handled - whether they should be sorted at the very beginning of the vector, or at the very end.

Technical Specification

Comment below describes the desired behavior. Few notes:

  • Nothing has a special treatment - and ends up at one end
  • Proper topological sort is a future work and will be tracked as Topological sort for partially comparable values #5834 issue.
  • We'll have compare: Any -> Any -> (Ordering | Nothing)
  • NaN compares to Nothing with everything - it'd be great to not treat it as a special case, but as a regular one - if possible
  • Without topological sort - as soon as there is something that compares to Nothing - leave that at its position sort only the other elements(!?).

Related

Comparators were introduced in #4067, see Ordering.enso for docs.

Related:

@Akirathan Akirathan added -libs Libraries: New libraries to be implemented --data corruption or loss Important: data corruption or loss labels Feb 22, 2023
@Akirathan Akirathan self-assigned this Feb 22, 2023
@Akirathan Akirathan moved this from ❓New to 📤 Backlog in Issues Board Feb 22, 2023
@jdunkerley jdunkerley added this to the Design Partners milestone Feb 24, 2023
@jdunkerley jdunkerley added the p-medium Should be completed in the next few sprints label Feb 24, 2023
@jdunkerley
Copy link
Member

Wanted behavior:

  • User can sort vectors with Nothing: Nothing ends up together at one end. No warning.
  • User can sort vectors with NaN: NaN ends up next to floats above Nothing. No warning.
  • Other Incomparable types should end up at one end with a warning attached.

@radeusgd
Copy link
Member

Do we specify what happens for Vectors like v = ["c", 2, "a", "b", 3, 1, Nothing]?

Implicitly, I think the idea is to group by type? So v.sort == ["a", "b", "c", 1, 2, 3, Nothing]? I think it would be good to clarify if we do that or something else.

If that's the solution, how do we determine the relative order between types?

@JaroslavTulach
Copy link
Member

JaroslavTulach commented Feb 24, 2023

I'd suggest a new parameter to sort:

(on_incomparable = Warn) : Fail | Warn | Accept

The behavior is clear when Fail. We may have a special treatment for fixed set of special types like Nothing or even Nan, but in general we have following issue: There are two groups of objects which aren't comparable. Which one shall be first and which one shall be second?

The only idea I have is to put the bigger one as first and the smaller one second. The size of a group can either be size of set or it can be size with duplicates. Each may have some nice parameters.

how do we determine the relative order between types

I don't think we can do it in general - Unless you want to do it by comparing their fully qualified names. We can treat few fixed types especially and hardcode their order.

@Akirathan
Copy link
Member Author

Other Incomparable types should end up at one end with a warning attached.

You mean the warning should be attached to the vector, not to the incomparable value, right? Attaching it to the incomparable value seems not useful, as it would be hard for the user to spot that there is some warning attached, whereas with the warning on the vector it would/should be visible straight away.

how do we determine the relative order between types

I don't think we can do it in general - Unless you want to do it by comparing their fully qualified names. We can treat few fixed types especially and hardcode their order.

I think sorting of a vector with mixed types makes sense as long as the types are "primitive", i.e., numbers, text, nothing - and for them, we can determine that, e.g., numbers come before text or vice versa. If there is a mixture of other types, trying to sort it should IMHO fail anyway. Users should provide their own comparator functions for such a non-standard collection of values.

I'd suggest a new parameter to sort:
(on_incomparable = Warn) : Fail | Warn | Accept
The behavior is clear when Fail. We may have a special treatment for fixed set of special types like Nothing or even Nan, but in general we have following issue: There are two groups of objects which aren't comparable. Which one shall be first and which one shall be second?
The only idea I have is to put the bigger one as first and the smaller one second. The size of a group can either be size of set or it can be size with duplicates. Each may have some nice parameters.

According to my previous note, the only incomparable types that we would sort by default would be just NaN and Nothing. Mixing other incomparable types, including the ones from stdlib, should result in failure anyway. What do you think? In general: Do we want to sort as much stuff as possible, or rather be explicit about it?

@JaroslavTulach
Copy link
Member

Another interesting topic from a discussion with @radeusgd:

what if I define a type Set where ordering is by inclusion
then I don't really have good answers for which items are to be deemed comparable which not
some pairs will compare and some not, every item is comparable to emptyset

Clearly such sets can only have "partial order". If you have type Set with a comparator based on set's elements, then all objects of type Set will be grouped together (as they share the same comparator) and they shall end up in some topological order. Of course that would require compare : T -> T -> Ordering | Nothing as originally proposed in

@radeusgd
Copy link
Member

With the above in mind, we also need to be aware that the user may define an ordering which may not necessarily be symmetric, reflexive or transitive.

In these cases, we cannot really do much - probably should keep the order of elements as-is, maybe report some warning. But ideally we shouldn't for example enter an infinite loop if we get an example where a < b < c < a. So I think we should have tests for such adversarial cases. Not needing to do miracles, but ensuring that they do not blow up in our face.

@JaroslavTulach
Copy link
Member

JaroslavTulach commented Feb 25, 2023

If there is a mixture of other types, trying to sort it should IMHO fail anyway. Users should provide their own comparator

@Akirathan, such attitude leads towards the "Rust approach":

  • according to @Frizi one cannot sort array of double values without a custom comparator
  • because doubles can also contain NaN and that means doubles aren't "sortable"
  • that's proper rationalistic approach - seeking truth, beauty, correctness - I like it and it fits Rust mindset of computer science degree programmers well
  • however Enso targets completely different set of people and as such a more pragmatic approach is necessary

The goal of Enso sort is to always sort and do it well without any users providing custom comparator to sort as an argument. Of course, we need to find a balance to do it right. However everytime we force an Enso user to loose Cluelessness by writing a comparator to overcome a sort failing to do anything - we did a suboptimal job!

even more than Cluelessness, I'd like to strive for Correctness - ensure we don't return a "sorted" value to the user which is not really sorted the way they intended because they mixed the types for example - in that case I think warn is ideal, fail is OK

You are right @radeusgd, a warning must be attached whenever the sort encounters partial ordering issues. However completely bailing out signals our own failure, so I wouldn't do it (unless user explicitly requests such a strict mode) - I am convinced a meaningful sort order is always possible.

With the above in mind, we also need to be aware that the user may define an ordering which may not necessarily be symmetric, reflexive or transitive.

Right, we shall know what to do when the ordering constraints decay. Here is a classification of various cases and proposal how to handle such situation.

Linear Ordering

If all values are subject to linear order we just sort them. They all must have the same Comparator.from obviously. All pairs return real value from Ordering.compare. We use the Ordering values and run classical sort. No warnings attached.

Default Comparator

Many of the builtin Enso types have the same "default comparator". That is beneficial for the purposes of performance and allows the engine to perform various optimizations. However, from a logical perspective - different kinds of builtin are supposed to be treated as having "different comparator".

It is up to the engine sort method to treat Text and Number as having (logically) "different comparator" and treat them according to the Differerent Comparators section.

Different Comparators

Values that don't have the same Comparator.from aren't comparable by definition. Different types are usually not comparable (unless they share the same Comparator.from like Integer and Decimal types). We parcel the data into groups with the same comparator. Each group is sorted independently and then we somehow (discussed here for example) place the sorted groups one after another.

Partial Order

The current design of comparators tries hard to avoid "partial ordering" (inside of a single Comparator.from) - that's proper rationalistic approach - however it has unwanted consequences:

I don't like Nan type - it feels so artificial. I believe there are no real benefits of avoiding partial ordering for the sort operation, so let me assume we have compare : T -> T -> (Ordering | Nothing) comparators and analyze the consequences.

Topological Sort

Objects with partial ordering can still be sorted by topological sort. We will optimistically speculate on linear order of all values, but as soon as we find out it is not there, we resort into topological sorting and attach warnings to notify the user.

Topological sort may produce different results, but every result it produces respects all the specified Ordering conditions. As a result the typical Enso user shall be satisfied with the result. Any two values in the result will either be in the right order or the user will not care about their order.

Violated Transitivity or Anti-Symmetry

Custom Comparator.from may violate various partial ordering constraints, so a regular topological sort may not be possible. Usually that happens if there is a cycle (e.g. a < b < c < a ) in the ordering.

Even such situation has a reasonable solution (and I implemented it once in the past) - just detect each cycle and collapse the cycle into a single element. Perform a topological sort on the simplified graph (succeeds as all cycles have been eliminated) and then expand the collapsed elements back.

The elements that violated anti-symmetry are in random order, but all the other elements are properly sorted. Typical Enso users shall welcome such result especially with a handful warning describing which elements couldn't be sorted because they formed a cycle.

Violating Reflexivity

It is not possible in Enso for x == x not being True. If Meta.is_same_object x x then Any.== returns True immediately.

Violating Stability

What if two subsequent calls to compare a b yield different result? That shall never happen - especially not in a functional language like Enso - should that happen it is a bug in the Comparator.from implementation and needs to be fixed there.

Benefits of Trying to Sort Hard

There are reasonable data structures that cannot have total order, just partial one. @radeusgd mentioned:

what if I define a type Set where ordering is by inclusion
then I don't really have good answers for which items are to be deemed comparable which not
some pairs will compare and some not, every item is comparable to emptyset

Having a flexible sort that can deal even with partially ordered structures like a Set expands the number of use-cases users may easily use Enso for. Supporting decayed situations (like topological sort with cycles use-case) while optimistically speculating on having linear order leads to a flexible sort support without any compromises of fast sorting speed in the most common case.

@radeusgd
Copy link
Member

radeusgd commented Mar 1, 2023

...
Right, we shall know what to do when the ordering constraints decay. Here is a classification of various cases and proposal how to handle such situation.
...

Thanks @JaroslavTulach for this amazing writeup! The new sorting design looks really really great IMO

@enso-bot
Copy link

enso-bot bot commented Mar 14, 2023

Pavel Marek reports a new STANDUP for today (2023-03-14):

Progress: Analyzing the impact of changes. Thinking about how to introduce an optimized, default version of Vector.sort / Array.sort without changing the signature of Vector.sort. Thinking about generalization of "sorting incomparable values". It should be finished by 2023-03-21.

@enso-bot
Copy link

enso-bot bot commented Mar 15, 2023

Pavel Marek reports a new STANDUP for today (2023-03-15):

Progress: Developing a new, optimized sort node for the most common case when all the elements of a vector have the default comparator, and on and by parameters are default. It should be finished by 2023-03-21.

@JaroslavTulach
Copy link
Member

optimized sort node for the most common case when all the elements of a vector have the default comparator

After discussion with Pavel I've modified the specification to include:

Default Comparator

Many of the builtin Enso types have the same "default comparator". That is beneficial for the purposes of performance and allows the engine to perform various optimizations. However, from a logical perspective - different kinds of builtin are supposed to be treated as having "different comparator".

It is up to the engine sort method to treat Text and Number as having (logically) "different comparator" and treat them according to the Differerent Comparators section.

@Akirathan Akirathan linked a pull request Mar 17, 2023 that will close this issue
5 tasks
@enso-bot
Copy link

enso-bot bot commented Mar 17, 2023

Pavel Marek reports a new STANDUP for yesterday (2023-03-16):

Progress: Implemented the optimized Vector.sort_builtin for the most common case - when the vector contains only primitive values with the default comparator. It should be finished by 2023-03-21.

@enso-bot
Copy link

enso-bot bot commented Mar 17, 2023

Pavel Marek reports a new STANDUP for today (2023-03-17):

Progress: Published a PR, basic tests for the primitive values work, even for vectors with elements with different comparators. It should be finished by 2023-03-21.

@enso-bot
Copy link

enso-bot bot commented Mar 20, 2023

Pavel Marek reports a new STANDUP for today (2023-03-20):

Progress: Dealing with sorting of values with different comparators. The groups of sorted values with different comparators will be merged together by the order of FQN of comparators. Writing more tests. It should be finished by 2023-03-21.

@enso-bot
Copy link

enso-bot bot commented Mar 21, 2023

Pavel Marek reports a new 🔴 DELAY for today (2023-03-21):

Summary: There is 3 days delay in implementation of the sort handles Incomparable types by default (#5742) task.
It will cause 3 days delay for the delivery of this weekly plan.

Delay Cause: Blowout of complex Enso code handling, trying to merge all the sort functionality into a single builtin node. Need more time for that.

@enso-bot
Copy link

enso-bot bot commented Mar 21, 2023

Pavel Marek reports a new STANDUP for today (2023-03-21):

Progress: Bumped into a complex failure case when given a vector of custom types with custom comparators, that either return Nothing or fail with Incomparable_Values - difficult to handle that in Enso - blowout of Enso code. Trying to merge the sort functionality into a single builtin. For that, it is theoretically needed to only precompute some values, and everything should be handled by the builtin. Also the performance should be better. It should be finished by 2023-03-24.

@enso-bot
Copy link

enso-bot bot commented Mar 22, 2023

Pavel Marek reports a new STANDUP for today (2023-03-22):

Progress: Consolidating all the sorting functionality into a single builtin node. So far, have not bumped into major issues. It should be finished by 2023-03-24.

@enso-bot
Copy link

enso-bot bot commented Mar 24, 2023

Pavel Marek reports a new STANDUP for yesterday (2023-03-23):

Progress: Discussed other performance issue with Dmitry. Bumped into an issue with warnings not being stripped from self argument in the builtin method generated code. Discussing comparators implementation with James. It should be finished by 2023-03-24.

@enso-bot
Copy link

enso-bot bot commented Mar 24, 2023

Pavel Marek reports a new STANDUP for today (2023-03-24):

Progress: Gave up the issue with proper warnings handling, and created a separate issue for that - #6070. Got Ordering_Spec test working. Have not tried to run all the other tests yet. It should be finished by 2023-03-24.

@enso-bot
Copy link

enso-bot bot commented Mar 24, 2023

Pavel Marek reports a new 🔴 DELAY for today (2023-03-24):

Summary: There is 7 days delay in implementation of the sort handles Incomparable types by default (#5742) task.
It will cause 7 days delay for the delivery of this weekly plan.

Got slowed down a bit by the issue with warnings on the vector not being stripped away, but rather duplicated. Moreover, the sorting issue turned out to be more complex than anticipated. Right now, I at least have all my created tests passing, but I still need to deal with other tests, and probably also run benchmarks.

Delay Cause: I expect that I will need roughly another week to finish this task.

@enso-bot
Copy link

enso-bot bot commented Mar 27, 2023

Pavel Marek reports a new STANDUP for today (2023-03-27):

Progress: Continuing with fixing some corner cases. Handle incorrect by parameter. Handle custom by methods - they either take self argument or not. It should be finished by 2023-03-31.

@enso-bot
Copy link

enso-bot bot commented Mar 28, 2023

Pavel Marek reports a new STANDUP for today (2023-03-28):

Progress: Struggling with the BC 7, will probably finish it quite late and postpone it for yet another week. Implementing direct call of on function from the SortVectorNode builtin. Fails for on=(.field) input, will have to look into that. It should be finished by 2023-03-31.

@enso-bot
Copy link

enso-bot bot commented Apr 4, 2023

Pavel Marek reports a new 🔴 DELAY for yesterday (2023-04-03):

Summary: There is 7 days delay in implementation of the sort handles Incomparable types by default (#5742) task.
It will cause 7 days delay for the delivery of this weekly plan.

Delay Cause: I had longer vacation than I expected, catching up...

@enso-bot
Copy link

enso-bot bot commented Apr 4, 2023

Pavel Marek reports a new STANDUP for yesterday (2023-04-03):

Progress: Fixing some tests, catching up after vacation. It should be finished by 2023-04-07.

@enso-bot
Copy link

enso-bot bot commented Apr 5, 2023

Pavel Marek reports a new STANDUP for today (2023-04-05):

Progress: Fixing more tests, fixing how by function parameter is called - UnresolvedSymbol vs. function with default arguments. Add yet another jUnit theories test for vector sorting. It should be finished by 2023-04-07.

@enso-bot
Copy link

enso-bot bot commented Apr 6, 2023

Pavel Marek reports a new STANDUP for today (2023-04-06):

Progress: Removing Array.sort_builtin - it is now replaced with Vector.sort_builtin. Resolving some import issues after removing all deprecated number comparison operators. Fixing more tests, there is OutOfMemory test now, investigating... It should be finished by 2023-04-07.

@enso-bot
Copy link

enso-bot bot commented Apr 7, 2023

Pavel Marek reports a new STANDUP for today (2023-04-07):

Progress: Found out that the OutOfMemoryError is caused by too frequent reassignment of a single warning attached to the sorted vector. Every InvokeMethodNode creates a new Warning object with a new reassignment location. This is a potential bug in how we process warnings. Next week, I will try to reproduce this and create a separate issue for that. It should be finished by 2023-04-07.

@radeusgd
Copy link
Member

radeusgd commented Apr 7, 2023

Pavel Marek reports a new STANDUP for today (2023-04-07):

Progress: Found out that the OutOfMemoryError is caused by too frequent reassignment of a single warning attached to the sorted vector. Every InvokeMethodNode creates a new Warning object with a new reassignment location. This is a potential bug in how we process warnings. Next week, I will try to reproduce this and create a separate issue for that. It should be finished by 2023-04-07.

What could be related to this issue is that our warnings get duplicated easily. I was advocating to do something about it a while ago, but I think after all no issue was created, maybe it is time.

Here is a related discussion: https://discord.com/channels/401396655599124480/951879392743796757

Discord
Discord is the easiest way to communicate over voice, video, and text. Chat, hang out, and stay close with your friends and communities.

@jdunkerley jdunkerley moved this from 🔧 Implementation to 👁️ Code review in Issues Board Apr 11, 2023
@enso-bot
Copy link

enso-bot bot commented Apr 13, 2023

Pavel Marek reports a new 🔴 DELAY for yesterday (2023-04-12):

Summary: There is 10 days delay in implementation of the sort handles Incomparable types by default (#5742) task.
It will cause 10 days delay for the delivery of this weekly plan.

All tests seem to be passing now, except for two that can be fixed immediately. Need some time for review, and possibly benchmark comparison. I expect this to be the very last delay for this issue. Need to merge it ASAP.

Delay Cause: Problems with OOM, warnings, need to introduce new parameter, some problems after merge with develop - builtin methods cannot be called as static methods, only instance methods.

@enso-bot
Copy link

enso-bot bot commented Apr 13, 2023

Pavel Marek reports a new STANDUP for yesterday (2023-04-12):

Progress: Dealing with new on_incomparable option. Configuring my local env for filtering of some truffle logs - I cannot now find the bugs in sorting functionality with standard java debugger, need to see some logs. It should be finished by 2023-04-17.

@enso-bot
Copy link

enso-bot bot commented Apr 13, 2023

Pavel Marek reports a new STANDUP for today (2023-04-13):

Progress: Almost all test seems to be passing now. Bumped into issue with builtin methods not callable as static methods. Some last cleanups before final reviews. It should be finished by 2023-04-17.

@enso-bot
Copy link

enso-bot bot commented Apr 14, 2023

Pavel Marek reports a new STANDUP for today (2023-04-14):

Progress: Integrated all review suggestions. Fixed all the tests. Scheduled benchmarks. Run locally some benchmarks that suggest that Vector.sort got a speedup, with one exception of Table.order_by objects. Reviewed some other issues. It should be finished by 2023-04-17.

@enso-bot
Copy link

enso-bot bot commented Apr 15, 2023

Pavel Marek reports a new STANDUP for today (2023-04-15):

Progress: Compared benchmarks, seems very decent, but had to revert one commit that caused some performance issues. Integrated all the other reviews. Going to merge ASAP, hopefully until tomorrow. It should be finished by 2023-04-17.

@github-project-automation github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Apr 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
--data corruption or loss Important: data corruption or loss -libs Libraries: New libraries to be implemented p-medium Should be completed in the next few sprints
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants