-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exhaustive MySQL Parser #157
Merged
Merged
Changes from 1 commit
Commits
Show all changes
169 commits
Select commit
Hold shift + click to select a range
ccc341b
MySQL AST Parser
adamziel 78fdf69
Fix parser overriding parts of the parse tree as it constructs them.
adamziel c8652d5
Output ParseTree using a class, not an array for much simpler processing
adamziel 137d6ca
Manually factor left recursion into right recursion in the grammar fi…
adamziel 0a2440c
Explore support for SQL_CALC_FOUND_ROWS
adamziel 0406d71
Support VALUES() call
adamziel 87573f2
Extract queries from MySQL test suite and test the parser against them
JanJakes 9629702
Implement handling for manually added lexer symbols
JanJakes d63bc6e
Fix passing nulls to "ctype_" functions
JanJakes 8e7e2e8
Add support for hex format x'ab12', X'ab12', and bin format x'01' and…
JanJakes ebcc17e
Fix wrong MySQL version conditions (AI hallucinations)
JanJakes 1551b0e
Implement the checkCharset() placeholder function
JanJakes cdd84b4
Document manual grammar factoring
JanJakes f50b515
Fix "alterOrderList" that has a wrong definition in the original grammar
JanJakes e267f67
Fix "createUser" that was incorrectly converted from ANTLR to EBNF
JanJakes cd543af
Fix "castType" that was incomplete in the original grammar
JanJakes 135f29f
Fix "SELECT ... WHERE ... INTO @var" using a negative lookahead
JanJakes 27524dd
Fix "EXPLAIN FORMAT=..." by reordering grammar rules
JanJakes 069342f
Fix special "WINDOW" and "OVER" cases by adjusting grammar rules
JanJakes 9bfc977
Fix "GRANT" and "REVOKE" by adjusting grammar rules to solve conflicts
JanJakes ca4de77
Use ebnfutils to dump grammar conflicts
JanJakes cd3504d
Implement the determineFunction() placeholder function, unify SQL modes
JanJakes 81bbde0
Fix processing NOW() synonyms in lexer
JanJakes 71292fb
Match mysqltest commands case-insensitively
JanJakes 1ab3723
Add a script to test lexer on all the testing queries
JanJakes 42ffc1b
Replace lexer switch/case and function calls with lookup tables
JanJakes 01241b8
Fix unicode handling when extracting test queries
JanJakes f5266f1
Fix identifier matching, improve lexer performance by ~25%
JanJakes a2ac60b
Unify charset matching with identifier matching, remove non-existent …
JanJakes 90e2af6
Fix quoted text/identifier matching, improve lexer performance by ~10%
JanJakes dfcad3e
Remove dependency on ctype, improve lexer performance by ~4%
JanJakes 0f23dd3
Determine token names lazily, reduce LOC to < 3000, improve lexer per…
JanJakes 6bb8ff9
Finish manual token pass, use MySQL Workbench token IDs, add comments
JanJakes ec55c10
Fix wrong token type
JanJakes 21f3f16
Inline some simple single-use methods, remove unused methods
JanJakes b4f0e08
Move token iteration loop outside nextToken() method
JanJakes f658751
Fix and simplify lookahead logic, improve lexer performance by ~6%
JanJakes 4eee991
Fix date and time literals by reordering grammar rules
JanJakes 64e2068
Implement handling of unquoted user variable suffixes
JanJakes f1c56cc
Fix number vs identifier matching, add some tests
JanJakes 883c08a
Fix handling of identifiers preceded by a “.”
JanJakes f56f036
Implement lexeer support for NCHAR literals
JanJakes 972f7e0
Fix administration statements to support both "TABLE" and "TABLES" ke…
JanJakes df65873
Use MySQL 8.0.38 test suite, the same version as the grammar
JanJakes f25b4fa
Use the grammar MySQL version 8.0.38 as a default also in the lexer
JanJakes 1b53df5
Add a missing charset from MySQL 5, improve comments
JanJakes 905a7d1
Sort version-specific keyword list alphabetically
JanJakes 7d1cdd7
Fix conflict between “<index> USING” and “<index> TYPE” by adjusting …
JanJakes 4e5e499
Fix “LIKE …” and “LIKE (…)” by reordering grammar rules
JanJakes 4eb9dd5
Fix "sumExpr" (AVG, SUM, COUNT, …) by reordering grammar rules
JanJakes 4f968cd
Fix "REVOKE ALL ON ... FROM ..." by adding a missing rule segment
JanJakes 6188d90
Add full MySQLParser.g4 grammar
JanJakes b371daf
Add manual fixes to MySQLParser.g4 grammar
JanJakes 1765e4f
Add a script to parse grammar directly from MySQLParser.g4
JanJakes 50ef591
Handle empty branches during grammr parsing
JanJakes 28e44f5
Place and document grammar fixes below the original rules
JanJakes cd3c335
Skip perl, append_file, and write_file commands in query extraction
JanJakes 09bd0c7
Fix “histogram” rule by adding missing “USING DATA” clause
JanJakes d06be19
Fix conflict in "explainStatement" by reordering grammar rules
JanJakes cebfdd7
Handle escaped quotes when parsing test queries
JanJakes 94b7bd2
Fix "alterCommandList" to solve conflicts between "alterCommandsModif…
JanJakes ba43fc7
Add missing column visibility settings to the grammar
JanJakes 03fbb98
Fix "fieldDefinition" to solve conflict between "columnAttribute" and…
JanJakes ed15ffa
Fix "replicationStatement" to correctly support the "RESET PERSIST" s…
JanJakes 0b16f44
Implement missing "EXCEPT" and "INTERSECT" operators in the grammar
JanJakes 8df7294
Skip if and while blocks when parsing tests, skip mysqltest.test
JanJakes 0a35989
Fix processing “--“ commands and “--delimiter”
JanJakes 442a682
Fix "alterTableActions" to solve conflicts between "alterCommandsModi…
JanJakes 207078a
Fix REVOKE statement grammar
JanJakes 56c7b8d
Add missing branches to “accountLockPasswordExpireOptions”
JanJakes 9bdd088
Fix ALTER USER statement grammar
JanJakes 8f54a3f
Fix "userIdentifierOrText" to support omitting sequence after "@"
JanJakes 724e1c3
Fix conflict within "queryExpressionParens"
JanJakes 6e985b2
Fix CHARACTER VARYING support by reordering “dataType” subrules
JanJakes e2e7501
Fix ALL/ANY support by reordering subrules
JanJakes 61d2bcc
Implement MySQL-specific and version comments: /*!… */
JanJakes 3118f11
Fix SHOW GRANTS … USING by reordering grammar rules
JanJakes cb44ea7
Fix query expressions followed by “INTO …”
JanJakes 0812288
Fix “SELECT … FOR UPDATE … INTO …” by adding a missing rule part
JanJakes d5e1abb
Fix “DO … AS …” by reordering rules
JanJakes e3555d2
Fix PARTITION being defined as a non-reserved keyword
JanJakes abdbe47
Fix handling —error in MySQL test files
JanJakes 5faef36
Fix ignoring MySQL test commands without params
JanJakes 53cb169
Add missing MySQL test commands
JanJakes ea6d067
Fix handling prefixed quotes in MySQL tests
JanJakes 22ac89b
Add support for missing "INSTALL COMPONENT" statement "SET ..." suffix
JanJakes ce35d1d
Fix "indexHintList" to use only whitespace as a separator (not commas)
JanJakes a33723d
Fix and simplify line comment handling in lexer
JanJakes eb1e756
Add support for "CHANGE REPLICATION SOURCE ..." statement
JanJakes 96accd0
Mark ATTRIBUTE keyword as non-reserved
JanJakes f0c97bd
Fix START TRANSACTION statement with multiple characteristics
JanJakes d861ba7
Fix handling column attributes by reordering rules
JanJakes c74dbca
Fix GRANT “role with @“ by reordering grammar rules
JanJakes 143c421
Fix GRANT CREATE/DROP ROLE by reordering rules
JanJakes a48810c
Add support for COMMENT and ATTRIBUTE to CREATE USER
JanJakes a80dbf1
Fix FLUSH TABLES with qualified identifiers
JanJakes f6ebd7a
Fix temporal literals to support double quotes
JanJakes 09cdac7
Add support for CAST(... AT TIME ZONE ... AS DATETIME ...)
JanJakes 05d02ec
Fix transaction characteristics separator
JanJakes 8022fc5
Add support for IF NOT EXISTS for function, procedure, and trigger
JanJakes e98caad
Fix SET PASSWORD statement conflicts
JanJakes 47b7ec4
Fix handling nchar, national character, etc.
JanJakes 522867d
Ignore “query_attributes” MySQL client command
JanJakes 0d80a16
Fix "leadLagInfo" to support identifiers and variables as well
JanJakes d12d01e
Fix WEIGHT_STRING(… AS BINARY(…)) by reordering grammar rules
JanJakes ad62b8a
Add support for "ALTER INSTANCE ..." statement
JanJakes 560064a
Fix CAST(… AS FLOAT(N)) by fixing a wrong grammar rule
JanJakes d004dad
Add missing CREATE TABLE options
JanJakes 52d9ed1
Add support for ON DELETE SET DEFAULT
JanJakes fb1e5cf
Add support for ENGINE_ATTRIBUTE to tablespaces
JanJakes e5f59fe
Add support for "JSON_VALUE(..., '...' RETURNING <type>)
JanJakes 106a6e5
Skip disabled MySQL tests
JanJakes e5d3a03
Detect & convert charset when extracting tests
JanJakes 88e3ec2
Fix lexing numbers and identifiers
JanJakes c934f37
Fix CREATE/ALTER TABLE … UNION = ()
JanJakes dfee145
Improve RESET PERSIST fix
JanJakes 1c7c18c
Fix ALTER DABATASE to support optional schema name
JanJakes ae0b31a
Fix support for “:=“ assignment in “updateElement”
JanJakes 839d807
Add “phpize-grammar” logic to grammar conversion script
JanJakes 495e227
Delete all no longer necessary grammar tools
JanJakes d9827ac
Use WP coding styles for all new files
JanJakes a19d439
Ignore coding styles for compressed grammar file
JanJakes 8d810fe
Use WP coding styles for filenames, use file per class, improve direc…
JanJakes d869785
Exclude new parser tests for now
JanJakes f3f0c0f
Remove no longer needed rule name processing
JanJakes d993041
Omit grammar tools and WIP work from plugin builds
JanJakes 8c2d227
Omit all test files from plugin builds
JanJakes 2cf39d4
Omit new lexer & parser from plugin builds for now
JanJakes 09dda8a
Improve class naming and directory structure
JanJakes e42bf52
Move tokenize_query to WP_MySQL_Lexer::tokenize()
JanJakes c42ee15
Move tests under mysql directory
JanJakes 7b84cfc
Move manual lexer tests to PHPUnit
JanJakes be7257f
Move test tools to tests/tools and ignore them in PHPUnit
JanJakes 9c34178
Polish and document test downloading and extraction scripts
JanJakes c2cfd45
Skip queries testing parser stack overflow
JanJakes 03cc66a
Run the full MySQL server test suite in unit tests
JanJakes 5b2f488
Repurpose testing scripts for benchmarking, add single-parse testing …
JanJakes 1e3e254
Use strcspn to handle all types of quoted text
JanJakes 7d5869b
Use strspn to handle unquoted user-defined variables
JanJakes 04bf6e5
Replace preg_match with strspn and UTF-8 decoder when handling identi…
JanJakes 43357d0
Simplify lexer state, use and advance only byte position
JanJakes 6e3b31e
Replace all other byte scanning loops with strspn
JanJakes 3b789fe
Inline bin and hex numbers to number method
JanJakes ad4cf36
Simplify version comment processing
JanJakes 3730d80
Simplify matching EOF
JanJakes 833ca68
Remove channel, token, and token instance properties
JanJakes e0dfccd
Replace type property with local variables and function returns
JanJakes dff4649
Improve property and method naming and documentation
JanJakes 17641f7
Check for the U+0080-U+FFFF range manually, add test coverage
JanJakes ee9d59d
Improve lexer docs and naming
JanJakes b32a9dd
Unify identifier matching with other "read_" methods
JanJakes 432bf1f
Reorder methods to group "read_" methods together
JanJakes 73c6c0a
Implement integer type detection
JanJakes 4464cea
Move empty rule (ε) constant to WP_Parser_Grammar
JanJakes 79b2622
Cleanup, add docs and TODOs
JanJakes 43cec5d
Rename WP_Parser_Tree to WP_Parser_Node and document it
JanJakes 634635e
Explain the "ε" rule in a comment
JanJakes 96dd467
Rename $rule to $branches for better clarity
JanJakes dec0e3f
Use `false` rather than `null` when a parser subtree doesn't match
JanJakes afc70bb
Declare mbstring in dev dependencies
JanJakes 33d55c4
Use more descriptive file name for MySQL test queries
JanJakes dd31359
Read quote from the SQL payload instead of passing it as parameter
JanJakes ae42217
Improve code comments
JanJakes 5942fd6
Simplify next_token() loop condition
JanJakes c567a3f
Inline is-digit and is-whitespace logic
JanJakes 60d995d
Fix 0b and 0x indentifiers
JanJakes ccebf37
Fix unclosed b'...' and x'...' numbers
JanJakes 7155b7b
Rename parser testing script to dump-ast.php
JanJakes 9426969
Implement "next_token()" & "get_next_token" API
JanJakes File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Simplify next_token() loop condition
- Loading branch information
commit 5942fd6298fac505b1d900c024c3679c4bd45ef2
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is called multiple times after the tokenizing is finished, would it return an EOF token each time? What do you think about returning
false
when there's nothing else to consume?I also had a brief thought about returning
true
on success as that would make it consistent with the wayWP_HTML_Tag_Processor
works, but I don't have strong opinions here. It seems that would only require an additionalget_token()
method without further benefits.In case of HTML processor, we also have setters and we can also inspect a bunch of metadata about the current token directly on the processor instance. There's no token class in there because the parser only moves forward. In here, the parser backtracks a lot so it makes sense to reuse the tokenization results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamziel Yeah, I agree it's strange to return
EOF
forever.What about returning
null
when there are no more tokens?Hm, but what about invalid input? Now it uses the
INVALID_INPUT
token, but we could also stop as soon as an error occurs, and reflect that in the return value. Such an input wouldn't be then passed on to the parser at all. Or, are there use cases, where it would be useful to keep anINVALID_INPUT
token and try to continue the lexing process?Then we could consider some of these:
WP_MySQL_Token|null
, throw exception on invalid input.WP_MySQL_Token|null|false
(null
for no more tokens,false
for an error).next_token()
&get_token()
scenario).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking forward, there's also a "not enough input" scenario. Perhaps we could take a page from the
WP_XML_Processor
book here and returnfalse
whenever there's no available token, and then have specific methods to check what's the failure mode, as inget_last_error()
,is_paused_on_incomplete_input()
,is_finished_successfully()
etcThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamziel
So, do you think something like
WP_MySQL_Token|null|false
for now?What would "not enough input" mean in this case, and what would be the use case? The lexer just reads the input and creates tokens. When it's an invalid token, it creates an
INVALID_TOKEN
type, and when there is no more input, it'sEOF
. I'm not sure if the incompleteness is deterministic here, like in the case of an unclosed HTML tag. Isn't that more of a parser-level feature? I think thatWP_XML_Processor
is different in the way that it's also a parser at the same time.As for the
INVALID_TOKEN
, I think it makes sense to never output anything like that and stop at the first error. TheINVALID_TOKEN
approach would decodeb'01xyz'
asINVALID_INPUT
,IDENTIFIER
,INVALID_INPUT
,EOF
, which doesn't really make much sense and is rather random.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The primary use-case would be importing SQL.
Imagine this SQL file:
We can't read all the data at once. However, if we only read a chunk we'll likely get invalid SQL, e.g. just
insert into wp_posts
. So the future lexer needs a way to differentiate between an actual syntax error and not having the entire query yet:Again, it doesn't need to happen in this iteration. However, we can already use this requirement to inform the design. There's at least three ways to not return a token: a syntax error, incomplete input, and parsing finished. This is why I'm not a fan of using
false | null
– it can only express two states and it's also highly implicity.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all! At first we may only have
SELECT 1
. If we tokenize that without waiting for more input, we'll get everything mixed up by the time we learn it's actuallySELECT 123, post_content FROM wp_posts
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamziel Ah, right, thank you for the detailed explanation! It didn't occur to me that if we need to chunk a big file, we have no idea how to chunk it before we actually tokenize it.
Looking at the code example, I have one question. Is there an advantage to iterating the chunks from outside the lexer, rather than passing a stream into it? In other words, do we need to expose
is_paused...
and other methods? E.g., is it better because it's a more generic API, and a stream would be just one possible use case?In this case, I agree that we should probably go with the
next_token()
andget_token()
API.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried using PHP streams as the primary API for this kind of operations in AsyncHttp\Client and I failed miserably. I ended up removing that in WordPress/blueprints-library#113.
Tl;dr – every stream wrapper (https:// vs file:// vs memory://) handles nuances differently. Some won't work with stream filters or buckets under specific circumstances or in specific PHP versions and there's almost zero documentation on these very deep use-cases. Unfortunately it's not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also explored a generic stream interface in adamziel/wxr-normalize#1 and a use-case where that saves more complexity than it creates is yet to emerge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamziel Oh, wow, very interesting! I had no idea PHP streams are such a pain. A generic interface sounds great, but looking at the explorations, it also looks pretty complex, so I guess at this stage, we need to work with the
WP_XML_Processor
-like APIs in mind.I pushed a commit for the
next_token()
andget_token()
combo. Now, theremaining_tokens
method makes a little bit less sense, I guess, sided with the next-token/get-token API, but for the moment I kept it in. Let me know what do you think.