Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More extensible syntax mapping mechanism #2755

Merged
merged 49 commits into from
Jan 21, 2024

Conversation

cyqsimon
Copy link
Contributor

@cyqsimon cyqsimon commented Oct 31, 2023

Will close #2747.

Checklist

  • Ratify specification
  • Implementation
  • Add tests

@cyqsimon
Copy link
Contributor Author

@keith-hall would you please take a look at README.md and see if there are any obvious problems with this architecture? Thanks.

@keith-hall
Copy link
Collaborator

I don't see any problems with this approach 👍

@cyqsimon cyqsimon mentioned this pull request Oct 31, 2023
@cyqsimon
Copy link
Contributor Author

I'll leave it here for a week or two for people to take a look. If there are no major objections I will go ahead with the implementation.

@Enselic
Copy link
Collaborator

Enselic commented Nov 2, 2023

I like the idea in general. My main concern is startup time, but in the linked issues you are talking about doing clever stuff at compile time, which has the potential to even improve startup time.

I don't think we can at this point "we promise to include your solution in bat", but I am very interested in looking at a prototype and evaluating that, then we can decide on the next steps.

I hope am not sounding discouraging! Would be great if we could solve syntax mapping more elegantly.

@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 2, 2023

I was investigating the possibility of doing this with proc macros, because I suspected that would make the code somewhat more elegant. In conclusion I decided that using a build script is probably still the way to go (at least for the moment). This comment is just to document my research; I will add more info if they are worthy of documenting.

  1. I initially thought I would be using phf_codegen, but I soon realised that a hashmap would be poorly suited for this job because glob patterns are not appropriate keys. Therefore probably the best way forward is to generate a static array, something like this: static STATIC_RULES: [(&'static str, &'static str); N].
  2. Seems like someone has tried to do similar (unsurprisingly) and has written an article about it. It is a bit old (2020-09) but most things in it are still valid.
  3. I tried to look for a crate that does similar to phf_codegen but for arrays. Unfortunately I was not able to find any, but honestly I suspect it's because it's just too simple.
  4. Proc macros are (still) not appropriate for this in 2023, because it lacks the build script's capability to watch/track a specific file/directory. See Tracking Issue for proc_macro::{tracked_env, tracked_path} rust-lang/rust#99515.

@cyqsimon cyqsimon force-pushed the syntax-mapping-refactor branch from 5c863c1 to f8349a8 Compare November 2, 2023 08:23
@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 2, 2023

Rebased onto #2756.

@cyqsimon cyqsimon force-pushed the syntax-mapping-refactor branch 2 times, most recently from 34567fe to e856d7f Compare November 2, 2023 08:29
@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 2, 2023

Okay, I have an initial prototype of the codegen part working. Right now it makes something like this:

// I formatted it manually to make it easier for the eye
static STATIC_RULES: [(&str, MappingTarget); 6] = [
    (
        r#"${XDG_CONFIG_HOME}/git/config"#,
        MappingTarget::MapTo(r#"Git Config"#),
    ),
    (
        r#"${HOME}/.config/git/config"#,
        MappingTarget::MapTo(r#"Git Config"#),
    ),
    (
        r#"${XDG_CONFIG_HOME}/git/ignore"#,
        MappingTarget::MapTo(r#"Git Ignore"#),
    ),
    (
        r#"${HOME}/.config/git/ignore"#,
        MappingTarget::MapTo(r#"Git Ignore"#),
    ),
    (
        r#"${XDG_CONFIG_HOME}/git/attributes"#,
        MappingTarget::MapTo(r#"Git Attributes"#),
    ),
    (
        r#"${HOME}/.config/git/attributes"#,
        MappingTarget::MapTo(r#"Git Attributes"#),
    ),
];

I'm thinking, I can actually take this one step further by moving the string searches to compile time. So instead of storing a string as the matcher, we store something like this:

enum CompileTimeMatcher {
    Static(&'static str),
    Dynamic(Vec<MatcherSegment>),
}

// ... where
enum MatcherSegment {
    Text(&'static str),
    Env(&'static str),
}

I assume this is faster, but the benefit might be minimal.

@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 2, 2023

Okay, I think I have a decent implementation of the aforementioned idea. This is what the generated code currently looks like:

// formatted
static STATIC_RULES: [(BuiltinMatcher, MappingTarget); 7] = [
    (
        BuiltinMatcher::Dynamic(Lazy::new(|| {
            join_segments(&[
                MatcherSegment::Env(r#"XDG_CONFIG_HOME"#),
                MatcherSegment::Text(r#"/git/config"#),
            ])
        })),
        MappingTarget::MapTo(r#"Git Config"#),
    ),
    (
        BuiltinMatcher::Dynamic(Lazy::new(|| {
            join_segments(&[
                MatcherSegment::Env(r#"HOME"#),
                MatcherSegment::Text(r#"/.config/git/config"#),
            ])
        })),
        MappingTarget::MapTo(r#"Git Config"#),
    ),
    (
        BuiltinMatcher::Dynamic(Lazy::new(|| {
            join_segments(&[
                MatcherSegment::Env(r#"XDG_CONFIG_HOME"#),
                MatcherSegment::Text(r#"/git/ignore"#),
            ])
        })),
        MappingTarget::MapTo(r#"Git Ignore"#),
    ),
    (
        BuiltinMatcher::Dynamic(Lazy::new(|| {
            join_segments(&[
                MatcherSegment::Env(r#"HOME"#),
                MatcherSegment::Text(r#"/.config/git/ignore"#),
            ])
        })),
        MappingTarget::MapTo(r#"Git Ignore"#),
    ),
    (
        BuiltinMatcher::Dynamic(Lazy::new(|| {
            join_segments(&[
                MatcherSegment::Env(r#"XDG_CONFIG_HOME"#),
                MatcherSegment::Text(r#"/git/attributes"#),
            ])
        })),
        MappingTarget::MapTo(r#"Git Attributes"#),
    ),
    (
        BuiltinMatcher::Dynamic(Lazy::new(|| {
            join_segments(&[
                MatcherSegment::Env(r#"HOME"#),
                MatcherSegment::Text(r#"/.config/git/attributes"#),
            ])
        })),
        MappingTarget::MapTo(r#"Git Attributes"#),
    ),
    (
        BuiltinMatcher::Fixed(r#"*.conf"#),
        MappingTarget::MapExtensionToUnknown,
    ),
];

The downside is that this uses a couple of new types and functions, which doesn't feel quite right wherever I put them:

  • They are only used by the generated code, so I put them in src/syntax_mapping.rs, future maintainers will probably find it odd/confusing.
  • They are static code blocks, so they don't really belong in the generated code.

Ended up going with option 1. I guess it's just a minor nuisance at the end of the day. It's worth it if there are tangible performance improvements.

@cyqsimon cyqsimon closed this Nov 2, 2023
@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 2, 2023

So I fat-fingered tab in the middle of typing, right before I hit space... And that closed the PR 🤦. What a UI design disaster.

@cyqsimon cyqsimon reopened this Nov 2, 2023
@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 2, 2023

I previously said I'll leave it for a week or two, but I had time today so I thought I'd just do it.

Maintainers, please don't hesitate to ask for big changes if you think they are warranted.

@cyqsimon cyqsimon force-pushed the syntax-mapping-refactor branch from 3f0a772 to cbe755f Compare November 2, 2023 17:44
@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 2, 2023

Rebased onto master.

@Enselic
Copy link
Collaborator

Enselic commented Nov 4, 2023

I skimmed over the code and it looks like you know what you are doing. Nothing jumped out at me as a red flag. I didn't look very close at the code though.

I don't want to risk wasting your time, but how much work would it be to port all of the existing syntax mappings to the new system? So that we can see how it impacts performance. Or maybe you can partly port it? So that half of syntax mappings use new and half use old. Maybe that is enough to git a sense for how this impacts performance.

To check performance, do this (approximate instructions):

cargo install --locked hyperfine
cargo build --release
tests/benchmarks/run-benchmarks.sh --release

@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 4, 2023

Thanks for the compliment.

I'm not ready to start porting just yet - currently the codegen is working, but the include!-ed mappings are not yet being used. I need to add that bit of code first; shouldn't be too difficult though.

As of the porting, there's not actually that much to port. Maybe 20 files (which can be done in like 30 minutes). I'll run the benchmark after it's all done.

@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 4, 2023

Impl is all done. Tests are failing because mappings haven't been ported. Will do said porting tomorrow.

@cyqsimon cyqsimon force-pushed the syntax-mapping-refactor branch from 7127f0e to fc8eb15 Compare November 4, 2023 18:49
src/syntax_mapping.rs Outdated Show resolved Hide resolved
@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 5, 2023

I ran the benchmark like so (using patch in #2768):

cd "$(git rev-parse --show-toplevel)/tests/benchmarks"
rm -rf results
mkdir -p results/{pre,post}

git checkout master
cargo build -r
RUN_COUNT=50 ./run-benchmarks.sh --release
cp benchmark-results/report.md results/pre/

git checkout syntax-mapping-refactor
cargo build -r
RUN_COUNT=50 ./run-benchmarks.sh --release
cp benchmark-results/report.md results/post/

bat results/*/*.md

Current master

Command Mean [ms] Min [ms] Max [ms] Relative
bat 3.9 ± 0.9 2.4 7.8 1.00
bat … small-CpuInfo-file.cpuinfo 20.4 ± 1.7 18.6 28.4 1.00
bat … small-Markdown-file.md 25.4 ± 1.8 23.5 34.1 1.00
bat … --language=txt numpy_test_multiarray.py 5.4 ± 2.4 3.7 43.3 1.00
bat … grep-output-ansi-sequences.txt 38.6 ± 1.8 36.3 44.0 1.00
bat … jquery.js 520.4 ± 5.8 514.5 533.6 1.00
bat … miniz.c 50.1 ± 1.7 48.1 56.0 1.00
bat … numpy_test_multiarray.py 688.9 ± 5.0 683.3 697.1 1.00
bat … grep-output-ansi-sequences.txt 36.1 ± 2.3 33.3 45.2 1.00
bat … jquery.js 515.6 ± 5.0 511.1 527.1 1.00
bat … miniz.c 49.6 ± 2.3 47.6 59.6 1.00
bat … numpy_test_multiarray.py 688.0 ± 6.1 680.7 697.3 1.00
bat … --language=txt *.txt 4.6 ± 1.0 3.3 8.8 1.00

syntax-mapping-refactor

Command Mean [ms] Min [ms] Max [ms] Relative
bat 2.2 ± 0.7 1.1 5.6 1.00
bat … small-CpuInfo-file.cpuinfo 20.3 ± 1.6 18.6 28.2 1.00
bat … small-Markdown-file.md 25.9 ± 1.6 24.3 31.7 1.00
bat … --language=txt numpy_test_multiarray.py 3.6 ± 1.1 2.2 8.3 1.00
bat … grep-output-ansi-sequences.txt 38.3 ± 1.4 36.6 45.6 1.00
bat … jquery.js 522.3 ± 4.7 517.0 532.5 1.00
bat … miniz.c 48.9 ± 2.5 46.4 59.8 1.00
bat … numpy_test_multiarray.py 690.9 ± 4.2 685.4 700.7 1.00
bat … grep-output-ansi-sequences.txt 36.4 ± 1.8 34.1 42.3 1.00
bat … jquery.js 517.1 ± 5.6 513.0 532.4 1.00
bat … miniz.c 48.8 ± 1.8 46.1 53.6 1.00
bat … numpy_test_multiarray.py 689.0 ± 3.7 683.8 694.9 1.00
bat … --language=txt *.txt 3.1 ± 1.1 1.7 7.8 1.00

Unsurprisingly, the simpler the task, the bigger the speedup. For heavier tasks there is barely a difference.

@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 5, 2023

One thing I was somewhat concerned about was a possible increase of the binary size due to the pervasive use of once_cell::sync::Lazy. Fortunately it seems like the increase is not too alarming.

cd "$(git rev-parse --show-toplevel)"
rm -rf res
mkdir -p res/{pre,post}

git checkout master
cargo build -r
cp target/release/bat res/pre/

git checkout syntax-mapping-refactor
cargo build -r
cp target/release/bat res/post/

du --summarize --bytes res/*/bat
5379928	res/post/bat
5359128	res/pre/bat

Eh. 20KB difference seems fine.

@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 5, 2023

This is what the codegen looks like right now (Linux). I think this is pretty much what we will end up with:

/// Generated by build script from /src/syntax_mapping/builtins/.
pub(crate) static BUILTIN_MAPPINGS: [(Lazy<Option<GlobMatcher>>, MappingTarget); 56] = [
(Lazy::new(|| Some(build_matcher_fixed(r#"httpd.conf"#))), MappingTarget::MapTo(r#"Apache Conf"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/apache2/**/*.conf"#))), MappingTarget::MapTo(r#"Apache Conf"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/apache2/sites-*/**/*"#))), MappingTarget::MapTo(r#"Apache Conf"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"**/bat/config"#))), MappingTarget::MapTo(r#"Bourne Again Shell (bash)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"Containerfile"#))), MappingTarget::MapTo(r#"Dockerfile"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.h"#))), MappingTarget::MapTo(r#"C++"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#".clang-format"#))), MappingTarget::MapTo(r#"YAML"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.fs"#))), MappingTarget::MapTo(r#"F#"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"fish_history"#))), MappingTarget::MapTo(r#"YAML"#)),
(Lazy::new(|| build_matcher_dynamic(&[MatcherSegment::Env(r#"XDG_CONFIG_HOME"#), MatcherSegment::Text(r#"/git/config"#)])), MappingTarget::MapTo(r#"Git Config"#)),
(Lazy::new(|| build_matcher_dynamic(&[MatcherSegment::Env(r#"HOME"#), MatcherSegment::Text(r#"/.config/git/config"#)])), MappingTarget::MapTo(r#"Git Config"#)),
(Lazy::new(|| build_matcher_dynamic(&[MatcherSegment::Env(r#"XDG_CONFIG_HOME"#), MatcherSegment::Text(r#"/git/ignore"#)])), MappingTarget::MapTo(r#"Git Ignore"#)),
(Lazy::new(|| build_matcher_dynamic(&[MatcherSegment::Env(r#"HOME"#), MatcherSegment::Text(r#"/.config/git/ignore"#)])), MappingTarget::MapTo(r#"Git Ignore"#)),
(Lazy::new(|| build_matcher_dynamic(&[MatcherSegment::Env(r#"XDG_CONFIG_HOME"#), MatcherSegment::Text(r#"/git/attributes"#)])), MappingTarget::MapTo(r#"Git Attributes"#)),
(Lazy::new(|| build_matcher_dynamic(&[MatcherSegment::Env(r#"HOME"#), MatcherSegment::Text(r#"/.config/git/attributes"#)])), MappingTarget::MapTo(r#"Git Attributes"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.jsonl"#))), MappingTarget::MapTo(r#"JSON"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.ksh"#))), MappingTarget::MapTo(r#"Bourne Again Shell (bash)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/var/spool/mail/*"#))), MappingTarget::MapTo(r#"Email"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/var/mail/*"#))), MappingTarget::MapTo(r#"Email"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"nginx.conf"#))), MappingTarget::MapTo(r#"nginx"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"mime.types"#))), MappingTarget::MapTo(r#"nginx"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/nginx/**/*.conf"#))), MappingTarget::MapTo(r#"nginx"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/nginx/sites-*/**/*"#))), MappingTarget::MapTo(r#"nginx"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.nse"#))), MappingTarget::MapTo(r#"Lua"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/os-release"#))), MappingTarget::MapTo(r#"Bourne Again Shell (bash)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/usr/lib/os-release"#))), MappingTarget::MapTo(r#"Bourne Again Shell (bash)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/initrd-release"#))), MappingTarget::MapTo(r#"Bourne Again Shell (bash)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/usr/lib/extension-release.d/extension-release.*"#))), MappingTarget::MapTo(r#"Bourne Again Shell (bash)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/usr/share/libalpm/hooks/*.hook"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/pacman.d/hooks/*.hook"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.pac"#))), MappingTarget::MapTo(r#"JavaScript (Babel)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.ron"#))), MappingTarget::MapTo(r#"Rust"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.sarif"#))), MappingTarget::MapTo(r#"JSON"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"/etc/profile"#))), MappingTarget::MapTo(r#"Bourne Again Shell (bash)"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"**/.ssh/config"#))), MappingTarget::MapTo(r#"SSH Config"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"**/systemd/**/*.conf"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"**/systemd/**/*.example"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.automount"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.device"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.dnssd"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.link"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.mount"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.netdev"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.network"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.nspawn"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.path"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.service"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.scope"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.slice"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.socket"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.swap"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.target"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.timer"#))), MappingTarget::MapTo(r#"INI"#)),
(Lazy::new(|| Some(build_matcher_fixed(r#"*.conf"#))), MappingTarget::MapExtensionToUnknown),
(Lazy::new(|| Some(build_matcher_fixed(r#"build"#))), MappingTarget::MapToUnknown),
(Lazy::new(|| Some(build_matcher_fixed(r#"rails"#))), MappingTarget::MapToUnknown)
];

@cyqsimon
Copy link
Contributor Author

cyqsimon commented Nov 5, 2023

Maintainers, can you please take a look at my porting? I've classified the rules according to their applicable platforms; please help check if all mappings are in their correct subdirs. Thanks.

@cyqsimon cyqsimon force-pushed the syntax-mapping-refactor branch from d90cdb2 to 04ce25c Compare November 5, 2023 15:32
This was referenced Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need a more sustainable way to manage known syntax mappings
4 participants