-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Real Time Mutable FST #8016
Conversation
...in/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/utils/MutableFSTUtils.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RealTimeRegexpMatcher.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RealTimeRegexpMatcher.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RealTimeRegexpMatcher.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/MutableFSTImpl.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/MutableFSTImpl.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/MutableFSTImpl.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the neatness of the algorithm. There are simple paths forward towards optimising it which could be made now if you want, or later if it proves necessary.
I have two general points about the layout of the code (I don't like focussing on this sort of thing, but feel it's necessary here):
- It needs to be formatted properly: https://github.com/apache/pinot/blob/master/config/codestyle-intellij.xml
- There are too many blank lines. In my humble opinion, these do not aid the reader at all. On many occasions I look at 15 lines of code and notice they compress to 5. It really obscures the logic by occupying screen space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome.
Can you provide some info about heap usage ? Thanks
Codecov Report
@@ Coverage Diff @@
## master #8016 +/- ##
=============================================
- Coverage 71.24% 30.73% -40.51%
=============================================
Files 1607 1603 -4
Lines 83409 83269 -140
Branches 12458 12457 -1
=============================================
- Hits 59426 25595 -33831
- Misses 19941 55419 +35478
+ Partials 4042 2255 -1787
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkMutableFST.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/MutableFSTConcurrentTest.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RealTimeRegexpMatcher.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/utils/MutableFSTUtils.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/utils/MutableFSTUtils.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RealTimeRegexpMatcher.java
Outdated
Show resolved
Hide resolved
* 3) This process is bound to complete since we are making progression on the FST (which is a DAG) towards final | ||
* nodes. | ||
*/ | ||
public void regexMatchOnFST() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove 9 blank lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the blank lines, may I suggest that while these are a bit subjective, there are some practical implications of these:
- Readability of code on whether it looks like a giant wall of text, vs too sparse such that a method won't fit in a page.
- Next person touching the code may alter these, leading to unnecessary diffs.
My $0.02 would be to enforce this via formatter, or guidelines (if formatter rules cannot be customized).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this scenario, we are at an impasse because we neither have formatter rules nor guidelines. I will concede and remove all blank lines, per Richard's command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayankshriv the formatter leaves it up to the author for a few reasons:
- it's difficult to make the tools enforce a guideline like this only within method bodies - the only tool available is RegexpMultiline which is a blunt instrument
- it's expensive (in terms of build time) to apply the numerous RegexpMultiline rules necessary
- A lot of the time, a line here and there makes code more readable (though it usually indicates that the enclosing scope is too large and needs to be broken down), but this PR consistently takes this grey area to the extreme. For instance, consider the constructor below: what do the blank lines mean? Are they intentional? Why is initialising
_pathState
different to_state
,_node
, or_fstArc
? As a peer reviewing this change set, I think it's inconsistent and a waste of screen space.
public Path(State state, MutableState node, MutableArc fstArc, List<Character> pathState) {
_state = state;
_node = node;
_fstArc = fstArc;
_pathState = pathState;
_pathState.add(node.getLabel());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardstartin First off, I really appreciate your in-depth look into the PR. As usual, your comments are invaluable.
I agree with you that the spacing on the PR was a bit too much -- and I have fixed it per your comments. However, I would also ask yourself to be a bit considerate for expression of a personal style. The example you made above about inconsistent spacing makes perfect sense, and I am happy to fix that (and would have fixed it in a blink, had it been the primary comment that you made about spacing). However, you have commented on nearly every blank line that I added in this PR, including all the places where I was consistent. For eg., I added a blank space after variable declaration and before usage, and this is a consistent pattern. You did not agree with the same and asked me to revert that in mostly every file that this PR consists of. I accepted your comments and changed the same. All I am requesting you (humbly) is to allow me a bit of room (within reasonable limits, of course). I tend to find adding a blank line soothing to my eyes, and while I understand that this PR went overboard with the same, I have tried my best to fix it now. I request you to kindly not allege that I am being intentional by adding too many lines and trying to obfuscate the project (quoting your latest comment). I sincerely apologise if I have crossed any boundaries or have been detrimental to the project in any form -- that is not my intention.
And, as you must have noticed, I have followed your advise to not add spaces in my text anymore. This comment is a testament to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I would also ask yourself to be a bit considerate for expression of a personal style.
...
All I am requesting you (humbly) is to allow me a bit of room (within reasonable limits, of course)
I don't believe there is room for personal expression in a shared codebase - it's engineering, not poetry. Ideally it shouldn't be possible to tell who wrote the code without looking at git blame. More time will be spent reading this code than was spent writing it, so it all needs to look the same for the sake of the person who may need to make changes in the future.
I request you to kindly not allege that I am being intentional by adding too many lines and trying to obfuscate the project (quoting your latest comment)
I don't know what you are referring to - no allegations have been made about intent to obfuscate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for jumping in the discussion late. I myself have the habit to add many line separators for readability as well so glad to see some discussions about this. In general I found it hard to enforce a checkstyle on readability as I agree it is rather subjective. (other than those obvious ones such as https://google.github.io/styleguide/javaguide.html).
my own opinion here is to follow as close to the OSS codebase standard from as possible - this gives a consistency sense to users as well as first-time contributors. for example i don't see other code separate variable definition from its first usage immediately after; However this should not incur too much overhead, I don't think we want to discourage contribution because I need to read how others format their code first.
but i saw richard here provided many constructive feedback as well beside the formatting; in my experience ppl who reviewed extensively on such a large PR will also be the ones who maintain it and collaborate together going forward. it would be great to reach a reasonable readability middle ground. I don't think we want to discourage reviews either because of readability.
This is just my personal opinion, I am also pretty new to Pinot so please disregard if it doesn't seem to make sense. But definitely will use this as a standard for my contribution going forward if we can reach a general guideline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to resurrect this thread but none of the underlying issues (profuse blank lines) have been addressed in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to best of my knowledge
...rc/main/java/org/apache/pinot/segment/local/utils/nativefst/utils/RealTimeRegexpMatcher.java
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/MutableFSTImpl.java
Outdated
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/utils/nativefst/mutablefst/MutableState.java
Outdated
Show resolved
Hide resolved
I would certainly prefer to automate style checking because I want code to be readable automatically without needing to review it for readability. If a simple method doesn't fit on a screen without scrolling, it's unreadable. I've proposed using checkstyle to ban blank lines within functions to @Jackie-Jiang to help specifically with this case, but he wasn't too keen on the idea and reasoned that the occasional blank line can help readability. Regarding M1s, I think @dongxiaoman uses M1 and managed to build Pinot, maybe he can help? |
Thanks, I will check with @dongxiaoman. I would argue that the spacing around blanks is a subjective thing, unless somebody is putting 3 blanks consecutively (in which case, our check style does catch it). Beyond that, the definition of readability is YMMV. As long as it is not obviously and immediately evident that something is broken, we should IMO not let subjectivity take control |
This is not a subjective matter but a question of how much code fits on screen. Objectively, given finite screen space, large numbers of blank lines will prevent methods from fitting on screen at some point. One method in this PR contains 15 blank lines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for adding this powerful feature. It would be great to see a proposal on high this ties in end-to-end (as in would the interfaces still hold).
* does it require the input to be sorted. Single word additions work well with | ||
* mutable FST. | ||
* | ||
* The reason as to why normal FST and mutable FST have different interfaces is because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that there will have to be separate code to use mutable vs immutable FST? An abstraction that avoids that would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for adding this powerful feature. It would be great to see a proposal on high this ties in end-to-end (as in would the interfaces still hold).
Added in the documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that there will have to be separate code to use mutable vs immutable FST? An abstraction that avoids that would be great.
That's a great point. However, here is the catch. Immutable and Mutable FSTs, while providing the same functionality, follow different data structures. It is pretty hard to get them to agree on one interface without significantly compromising ImmutableFST's serialised state optimisations or Mutable FST's speed of addition of new paths.
However, any layer above pinot-segment will not be aware of this difference. Referring to the proposal document above, the integration path for mutable FST will include a native FST reader which will propagate a query to both mutable and immutable FSTs(within a segment) at the same time.
Hi @atris, I will review this PR tomorrow |
This PR will be merged in 24 hours from now, barring objections |
I saw some unsolved comments. Please resolve them and wait for all tests pass before merging. |
There was one unresolved comment around class import -- have fixed that and merged master. Waiting for tests to complete |
@Warmup(iterations = 3, time = 30) | ||
@Measurement(iterations = 5, time = 30) | ||
@Fork(1) | ||
@State(Scope.Benchmark) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class still has several unnecessarily fully qualified class names, one of them has been fixed but I expected you might generalise from the comment made on this class and fix the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remaining ones are explicitly left so since the method names are identical between Lucene FST and mutable FST. In order for the reader to be able to clearly disambiguate between the two, I felt it was necessary to use fully qualified names when creating the references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about line 93?
return org.apache.pinot.segment.local.utils.fst.RegexpMatcher.regexMatch(_regex, _fst);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason -- RegexpMatcher and RegexpMatcher.regexMatch are present for all three -- Lucene, Native and Mutable FST.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not the case - are you looking at the same file as I am?
This benchmark measures org.apache.pinot.segment.local.utils.nativefst.utils.RealTimeRegexpMatcher
and org.apache.pinot.segment.local.utils.fst.RegexpMatcher
. There is no name clash.
org.apache.lucene.util.fst.FST
does not clash with any other referenced type in this class, so does not need to be fully qualified either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this, let's avoid full class name if there is no name clash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@richardstartin Would you have any further comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thank you for the review! |
This PR implements real time mutable FST, allowing Pinot to perform real time full text searches. Mutable FST can be written to, one word at a time without needing to have the entire set available beforehand, nor does it require the input to be sorted.
Mutable FST supports concurrent read and write i.e a single thread can write to the FST and other thread can read from the same.
Mutable FST supports real time updates to searches, reflecting data as it is added. It does not require a flush operation.
Please see the design document at:
https://docs.google.com/document/d/1O2ttsplFAflkM1Q-8-7yRNrD9EgCCWYm-W63NdC7ghE/edit?usp=sharing
Follow up to convert mutable FST to immutable FST and integration with FST index.