Skip to content

Commit

Permalink
regex: avoid using literal optimizations when whitespace is detected
Browse files Browse the repository at this point in the history
If a literal is entirely whitespace, then it's quite likely that it is
very common. So when that case occurs, just don't do (inner) literal
optimizations at all.

The regex engine may still make sub-optimal decisions here, but that's a
problem for another day.

Fixes #1087
  • Loading branch information
BurntSushi committed Mar 15, 2020
1 parent 9dd4bf8 commit e772a95
Show file tree
Hide file tree
Showing 3 changed files with 30 additions and 2 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@ Deprecations:

Performance improvements:

* [PERF #1087](https://github.com/BurntSushi/ripgrep/pull/1087):
ripgrep is smarter when detected literals are whitespace.
* [PERF #1381](https://github.com/BurntSushi/ripgrep/pull/1381):
Directory traversal is sped up with speculative ignore-file existence checks.
* [PERF cd8ec38a](https://github.com/BurntSushi/ripgrep/commit/cd8ec38a):
Expand Down
1 change: 1 addition & 0 deletions crates/regex/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ An implementation of `grep-matcher`'s `Matcher` trait for Rust's regex engine.
#![deny(missing_docs)]

extern crate aho_corasick;
extern crate bstr;
extern crate grep_matcher;
#[macro_use]
extern crate log;
Expand Down
29 changes: 27 additions & 2 deletions crates/regex/src/literal.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ the regex engine doesn't look for inner literals. Since we're doing line based
searching, we can use them, so we need to do it ourselves.
*/

use bstr::ByteSlice;
use regex_syntax::hir::literal::{Literal, Literals};
use regex_syntax::hir::{self, Hir, HirKind};

Expand Down Expand Up @@ -99,7 +100,12 @@ impl LiteralSets {
// helps with case insensitive matching, which can generate lots of
// inner required literals.
let any_empty = req_lits.iter().any(|lit| lit.is_empty());
if req.len() > lit.len() && req_lits.len() > 1 && !any_empty {
let any_white = has_only_whitespace(&req_lits);
if req.len() > lit.len()
&& req_lits.len() > 1
&& !any_empty
&& !any_white
{
debug!("required literals found: {:?}", req_lits);
let alts: Vec<String> = req_lits
.into_iter()
Expand All @@ -121,7 +127,7 @@ impl LiteralSets {
// with the highest minimum length and use that to build our "fast"
// regex.
//
// This manifest in fairly common scenarios. e.g.,
// This manifests in fairly common scenarios. e.g.,
//
// rg -w 'foo|bar|baz|quux'
//
Expand Down Expand Up @@ -159,12 +165,20 @@ impl LiteralSets {
};

debug!("prefix/suffix literals found: {:?}", lits);
if has_only_whitespace(lits) {
debug!("dropping literals because one was whitespace");
return None;
}
let alts: Vec<String> =
lits.into_iter().map(|x| util::bytes_to_regex(x)).collect();
// We're matching raw bytes, so disable Unicode mode.
Some(format!("(?-u:{})", alts.join("|")))
} else {
debug!("required literal found: {:?}", util::show_bytes(lit));
if lit.chars().all(|c| c.is_whitespace()) {
debug!("dropping literal because one was whitespace");
return None;
}
Some(format!("(?-u:{})", util::bytes_to_regex(&lit)))
}
}
Expand Down Expand Up @@ -328,6 +342,17 @@ fn count_byte_class(cls: &hir::ClassBytes) -> u32 {
cls.iter().map(|r| 1 + (r.end() as u32 - r.start() as u32)).sum()
}

/// Returns true if and only if any of the literals in the given set is
/// entirely whitespace.
fn has_only_whitespace(lits: &[Literal]) -> bool {
for lit in lits {
if lit.chars().all(|c| c.is_whitespace()) {
return true;
}
}
false
}

#[cfg(test)]
mod tests {
use super::LiteralSets;
Expand Down

0 comments on commit e772a95

Please sign in to comment.