Improve lexer performance by 5-10% overall, improve string lexer performance 15% #149689
Conversation
r? @nnethercote

rustbot has assigned @nnethercote.
This is the benchmark library I use to track performance changes:
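The suite itself is external; as a rough illustration of the idea, a self-contained micro-benchmark of a scanning hot loop can be built with nothing but `std::time::Instant` (all names here are hypothetical, not from the linked suite):

```rust
use std::time::Instant;

// Toy workload standing in for a lexer hot loop: count newline bytes.
fn count_newlines(s: &str) -> usize {
    s.bytes().filter(|&b| b == b'\n').count()
}

fn main() {
    let input = "fn main() {}\n".repeat(10_000);
    let start = Instant::now();
    let mut total = 0;
    for _ in 0..100 {
        total += count_newlines(&input);
    }
    println!("counted {} newlines in {:?}", total, start.elapsed());
    assert_eq!(total, 1_000_000);
}
```

A real suite would use a statistics-aware harness (such as Criterion) rather than a single `Instant` measurement, since one-shot timings are noisy.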
@bors try @rust-timer queue |
Finished benchmarking commit (e0cf684): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request means it may be perf-sensitive, so we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never.

- Instruction count: Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
- Max RSS (memory usage): Results (secondary 2.1%). A less reliable metric. May be of interest, but not used to determine the overall result above.
- Cycles: Results (primary 3.1%, secondary 1.2%). A less reliable metric. May be of interest, but not used to determine the overall result above.
- Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 470.249s -> 469.703s (-0.12%)
Thanks for looking into this. I like the attempt and the careful measurements. Unfortunately it doesn't seem to help full-compiler performance, as measured by rust-timer, and it slightly regresses a few benchmarks. Lexing is a really small component of overall compilation time, and it's already been micro-optimized quite a bit, so it's hard for it to have much effect on performance. I suspect adding #[inline] to all those functions had the biggest effect and caused the regressions.
compiler/rustc_lexer/src/cursor.rs (outdated)

```rust
/// Bumps the cursor if the next character is either of the two expected characters.
#[inline]
pub(crate) fn bump_if2(&mut self, expected1: char, expected2: char) -> bool {
```
I would call this bump_if_either. bump_if2 makes me think that expected1 must be followed by expected2.
Sure, I can rename it.
I think it would be logical to rename eat_past2 to eat_past_either too, and to use byte1 and byte2 to match the already existing eat_until style. What do you suggest?
compiler/rustc_lexer/src/cursor.rs (outdated)

```diff
 #[inline]
 pub(crate) fn bump_bytes(&mut self, n: usize) {
-    self.chars = self.as_str()[n..].chars();
+    self.chars = self.as_str().get(n..).unwrap_or("").chars();
```
What's the thinking behind this change?
It removes the panic-handling code generation and branching; in my experiments this is always faster, even when the panic never actually happens. If LLVM can prove that the index never panics, the unwrap_or is optimized away just like the panic handling would be.
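A minimal sketch of the difference, using standalone functions with hypothetical names (the actual change lives in Cursor::bump_bytes):

```rust
// Indexing version: the compiler must emit a bounds check plus
// panic-handling machinery for the out-of-range case.
fn rest_indexed(s: &str, n: usize) -> &str {
    &s[n..]
}

// get + unwrap_or version: the out-of-range case collapses to "",
// so no panic path is generated. If LLVM can prove n is always in
// range, the fallback branch is optimized away entirely.
fn rest_checked(s: &str, n: usize) -> &str {
    s.get(n..).unwrap_or("")
}

fn main() {
    assert_eq!(rest_indexed("hello", 2), "llo");
    assert_eq!(rest_checked("hello", 2), "llo");
    // The checked version silently returns "" instead of panicking.
    assert_eq!(rest_checked("hello", 99), "");
}
```

Note that the two versions also differ in behavior, not just codegen: the checked one masks out-of-range indices (and non-char-boundary indices) instead of panicking, which is only acceptable when that case is known to be unreachable.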
It seems I was wrong and this doesn't apply everywhere: even in my own benchmark suite it reduces performance in cursor_eat_until/eat_until_newline. I'm going to remove this.
It seems that it only works if the function is #[inline].
compiler/rustc_lexer/src/lib.rs (outdated)

```diff
-let nl_fence_pattern = format!("\n{:-<1$}", "", length_opening as usize);
-if let Some(closing) = self.as_str().find(&nl_fence_pattern) {
+#[inline]
+fn find_closing_fence(s: &str, dash_count: usize) -> Option<usize> {
```
Micro-optimizing frontmatter lexing doesn't seem worthwhile. It's just a tiny fraction of general lexing.
I agree that frontmatter lexing doesn't need to be heavily optimized. That said, the memchr version eliminates a possible heap allocation (rare in practice, but still nice to avoid) and gives a ~4× speedup on that path. It improves the overall tone and consistency of the lexer code, and there's no real harm in keeping it. Totally your call, though; if you'd rather drop it for simplicity, I can remove it without issue.
I removed it in my last commit, but let me know if you are interested in having it back.
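For reference, the allocation-free search can be sketched roughly like this, using std's byte search as a stand-in for `memchr` (the `find_closing_fence` name and `dash_count` parameter come from the diff above; the exact fence semantics are an assumption, not the real rustc logic):

```rust
// Find the byte offset of a '\n' that is followed by at least
// `dash_count` dashes. Unlike the format!-based version, this never
// allocates a pattern string such as "\n----".
fn find_closing_fence(s: &str, dash_count: usize) -> Option<usize> {
    let bytes = s.as_bytes();
    let mut pos = 0;
    // Stand-in for memchr: locate each newline, then check the dashes after it.
    while let Some(nl) = bytes[pos..].iter().position(|&b| b == b'\n') {
        let nl = pos + nl;
        let after = &bytes[nl + 1..];
        if after.len() >= dash_count && after[..dash_count].iter().all(|&b| b == b'-') {
            return Some(nl);
        }
        pos = nl + 1;
    }
    None
}

fn main() {
    let src = "title: x\n---\nrest";
    assert_eq!(find_closing_fence(src, 3), Some(8)); // '\n' before "---"
    assert_eq!(find_closing_fence(src, 4), None);    // fence too short
}
```

With the real `memchr` crate, the `position` call would become `memchr::memchr(b'\n', &bytes[pos..])`, which is SIMD-accelerated.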
Thank you for reviewing this PR. I'm exploring the rustc codebase in my spare time, and the lexer was the first part I dove into. I'm just trying to contribute what I can to help improve the compiler's performance. I'm happy to drop the
…per function names for better readability
Let's do another perf run just for completeness:

@bors try @rust-timer queue

What to do will depend on the result there. Overall this does add more code without particularly improving readability, IMO, so if there's no perf improvement the impetus to merge is low.

It is cool that you are looking at performance, though. If you have a Linux machine then I would recommend trying out rustc-perf and using that as a starting point for investigating compiler performance.
Finished benchmarking commit (1566a60): comparison URL.

Overall result: ❌ regressions - please read the text below

Benchmarking this pull request means it may be perf-sensitive, so we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never.

- Instruction count: Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
- Max RSS (memory usage): Results (secondary 3.2%). A less reliable metric. May be of interest, but not used to determine the overall result above.
- Cycles: Results (primary -2.7%, secondary -0.4%). A less reliable metric. May be of interest, but not used to determine the overall result above.
- Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 471.709s -> 470.591s (-0.24%)
Thanks for taking the time to review this PR!

```rust
fn double_quoted_string(&mut self) -> bool {
    debug_assert!(self.prev() == '"');
    while let Some(c) = self.eat_past_either(b'"', b'\\') {
        match c {
            b'"' => {
                return true;
            }
            b'\\' => _ = self.bump_if_either('\\', '"'),
            _ => unreachable!(),
        }
    }
    false
}
```

vs

```rust
fn double_quoted_string(&mut self) -> bool {
    debug_assert!(self.prev() == '"');
    while let Some(c) = self.bump() {
        match c {
            '"' => {
                return true;
            }
            '\\' if self.first() == '\\' || self.first() == '"' => {
                // Bump again to skip escaped character.
                self.bump();
            }
            _ => (),
        }
    }
    // End of file reached.
    false
}
```
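As a sanity check that the two loops agree, both strategies can be modeled on a plain byte slice. This is a simplified stand-in for the real `Cursor`, with `eat_past_either` approximated by a linear scan rather than `memchr2`; the function names are hypothetical:

```rust
// Model of the jump-scanning version: leap to the next '"' or '\\'.
fn scan_jump(s: &[u8]) -> bool {
    let mut i = 0;
    // Stand-in for eat_past_either(b'"', b'\\').
    while let Some(off) = s[i..].iter().position(|&b| b == b'"' || b == b'\\') {
        let c = s[i + off];
        i += off + 1;
        if c == b'"' {
            return true; // closing quote found
        }
        // c == b'\\': skip an escaped '\\' or '"', like bump_if_either.
        if i < s.len() && (s[i] == b'\\' || s[i] == b'"') {
            i += 1;
        }
    }
    false // end of input: unterminated string
}

// Model of the original byte-at-a-time version.
fn scan_bump(s: &[u8]) -> bool {
    let mut i = 0;
    while i < s.len() {
        let c = s[i];
        i += 1;
        match c {
            b'"' => return true,
            b'\\' if i < s.len() && (s[i] == b'\\' || s[i] == b'"') => i += 1,
            _ => {}
        }
    }
    false
}

fn main() {
    let cases: [&[u8]; 4] = [b"abc\"", b"a\\\"still open", b"no close", b"\\\\\""];
    for case in cases {
        assert_eq!(scan_jump(case), scan_bump(case));
    }
}
```

The observable behavior is identical; the performance argument is purely that the jump version lets a vectorized two-byte search replace the per-character loop.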
For the quoted example, the new code is slightly shorter but it requires the separate function, and it's clunkier in a way because it does a double comparison of the char: first in eat_past_either and then again in the match.

If you want to morph this PR into a readability-oriented one instead of a performance-oriented one, that's fine, but at the moment it feels a bit like an attempt to do both and it's not quite working on either front.
…ic and improve readability in lexer
Thank you for your response. I've removed the

As I said before, the decision is entirely yours and I fully respect it. I've invested a lot of time into this PR, and naturally I'd love to see it merged, but my only real motivation is to help make Rust faster. If you feel it doesn't belong here, it's better not to merge it at all. I'm currently reading the compiler source code and will follow your guidance by using

P.S. Now I think it is actually much cleaner:

```rust
fn double_quoted_string(&mut self) -> bool {
    debug_assert!(self.prev() == '"');
    while let Some(c) = self.eat_past_either(b'"', b'\\') {
        if c == b'"' {
            return true;
        }
        // Current is '\\', bump again if next is an escaped character.
        self.bump_if_either('\\', '"');
    }
    // End of file reached.
    false
}
```
Hi, this PR improves lexer performance by ~5-10% when lexing the entire standard library. It specifically targets the string lexer, comment lexer, and frontmatter lexer.
- Added an `eat_past2` function that leverages `memchr2`.
- Removed the `format!` and rewrote the frontmatter lexer using `memchr`-based scanning, which is roughly 4× faster.
- I also applied a few minor optimizations in other areas.
I’ll send the benchmark repo in the next message. Here are the results on my x86_64 laptop (AMD 6650U):