
Doc tests action takes too long #118891

Closed
Privat33r-dev opened this issue May 10, 2024 · 26 comments
Labels
docs Documentation in the Doc dir

Comments

@Privat33r-dev
Contributor

Privat33r-dev commented May 10, 2024

Issue

Usually the doctest action's "Run documentation doctest" [1] job takes around 12 minutes [2] to execute. Most of this time is spent in the internal framework doctest.py [3]. The cause of the issue is a GC regression in Python 3.13.

Potential Solution

Fix #124567

Additional Improvements

Improve doctest performance

Since the tests are not interconnected (between different doc files), they can easily be parallelized. I suggest using ProcessPoolExecutor, which is suitable for "bypassing" the GIL (assuming the tasks are CPU-bound). GitHub runners have 4 cores available [4], so this could potentially speed up execution by up to 4 times. A rough sketch of the idea follows.
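A minimal sketch of the parallelization idea, assuming (for illustration only) that each doc file's examples could be run directly with the standard doctest module; the real CI job goes through Sphinx's doctest builder, so the actual change would have to live there:

# Hypothetical sketch: run per-file doctests in separate worker processes.
import doctest
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def run_file(path: str) -> tuple[str, int]:
    # doctest.testfile() returns TestResults(failed, attempted).
    results = doctest.testfile(path, module_relative=False)
    return path, results.failed


if __name__ == "__main__":
    rst_files = [str(p) for p in Path("Doc").rglob("*.rst")]
    # GitHub-hosted runners provide 4 cores, hence max_workers=4.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for path, failed in pool.map(run_file, rst_files):
            if failed:
                print(f"{path}: {failed} failing example(s)")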

@AlexWaygood @erlend-aasland

Footnotes

  1. https://github.com/python/cpython/blob/7ac933e2609b2ef9b08ccf9c815b682b0e1ede2a/.github/workflows/reusable-docs.yml#L108

  2. https://github.com/python/cpython/actions/runs/8922903843/job/24506021604

  3. https://github.com/python/cpython/blob/main/Lib/doctest.py

  4. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

@Privat33r-dev Privat33r-dev added the docs Documentation in the Doc dir label May 10, 2024
@erlend-aasland
Contributor

Sounds good; do you have a PR ready? :)

@Privat33r-dev
Contributor Author

I was waiting for feedback because I might've missed something :) My schedule is a bit tight lately, but I will try to find time to implement it (hopefully in a week or two).

@Privat33r-dev
Contributor Author

Sorry, this is going to be delayed. I am currently loaded with work, but hopefully I can start on the PR soon.

@hugovk
Member

hugovk commented May 31, 2024

No rush, thanks for the update :)

@willingc
Contributor

@Privat33r-dev Do you still wish to work on a PR for this? If you are busy, would you mind someone else picking up the work on the PR?

@Privat33r-dev
Contributor Author

Privat33r-dev commented Jul 26, 2024 via email

@willingc
Contributor

No worries about the delay, @Privat33r-dev. Seems like it would be a nice addition since the doctests are slow 😉

@Privat33r-dev
Contributor Author

Privat33r-dev commented Jul 28, 2024

After further investigation, I understand the following.
The command that runs the doctests, and takes around 13-16 minutes, is:
xvfb-run make -C Doc/ PYTHON=../python SPHINXERRORHANDLING="-W --keep-going" doctest
xvfb-run is needed to simulate a graphical environment (my guess is that it's needed to test turtle and some other modules).
make -C Doc/ doctest launches Doc/Makefile with the doctest builder. The other flags are less relevant; in short, they point to the Python installation and tell Sphinx how to handle errors.

Inside the Makefile, sphinx-build is executed with the doctest builder.
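Roughly, that boils down to an invocation like this (a simplified sketch; the actual Makefile passes extra options such as -W --keep-going and the configured Python):

# Simplified sketch of what "make -C Doc/ doctest" ends up doing:
# invoke Sphinx with the "doctest" builder over the Doc/ sources.
import subprocess

subprocess.run(
    ["python", "-m", "sphinx", "-b", "doctest", "Doc", "Doc/build/doctest"],
    check=True,
)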

So, basically, the Sphinx doctest wrapper is used (source).

The parallelization should likely be applied there; I will continue, and I may update this comment or add a new one with further findings.


However, what's actually interesting is that the speed of the doctest run depends on the Python version: the Python 3.13 tests run about 3 times slower than 3.12.
Example #1 (Python 3.13): 14m 57s, ran yesterday
Example #2 (Python 3.12): 5m 18s, also ran yesterday

Search request:
https://github.com/python/cpython/pulls?q=is%3Apr+label%3Adocs+%5B3.12%5D+


Also, caching does not work efficiently, which might contribute to the problem, but that's a different topic.

@Privat33r-dev
Contributor Author

Privat33r-dev commented Aug 5, 2024

I would be glad if someone could join in resolving this issue, since I am, unfortunately, still occupied with other work. This weekend I will try to continue.

General conclusions from earlier findings:

  1. The multiprocessing change needs to be made in Sphinx, rather than in Python's core library. However, given the speed degradation compared to the earlier Python version, I think it would be highly beneficial to investigate that first, instead of just masking the issue by applying multiprocessing.
  2. There is a major speed regression in the Python 3.13 doctest run on the GH runner compared to Python 3.12. I have a few theories (all require further testing to be confirmed or rejected):
    1. Due to some changes, 3.13 is slower on certain operations. I just hope it's not PEP 703-related. How to check: profile doctest.py performance on both versions.
    2. There were changes in the doctest library that cause the slowdown. How to check: diff doctest.py between the two versions.

I can't think of another reason off the top of my head, but it might also be caused by Python API changes, build flags, or something else. However, if we can fix this and return to near-3.12 speed, that would already be a major win. Applying multiprocessing on top of that would just skyrocket the speed.

The Plan

  1. Investigate the performance issue
  2. https://github.com/sphinx-doc/sphinx/blob/master/sphinx/ext/doctest.py#L363C1-L368C1 -> multiprocessing

@hugovk
Member

hugovk commented Sep 24, 2024

@AA-Turner Do you have any ideas?

The doctest job on CPython CI is much slower on 3.13 and 3.14 (main) than on 3.12. And it's not specifically doctest: the main Sphinx build is slow even before the doctests start.

For example: with the example builds, when the full 3.12 Sphinx build + doctest had finished, the 3.13 build was only at reading sources... [ 73%] library/stdtypes .. library/trace.

@AA-Turner
Member

Didn't we previously reason that this is because pre-releases don't use an optimised build of Python? What happens if we run CI with the released rc2?

A

@hugovk
Member

hugovk commented Sep 24, 2024

For both of these, we use non-optimised Python built from the branch -- because we're testing the branch's Python.

https://github.com/python/cpython/blob/f4997bb3ac961d6aaf07ce650cd074e28ce6ccd0/.github/workflows/reusable-docs.yml#L95_L118

@willingc
Contributor

Hmm... I'm wondering if we could use pytest to run these doctests instead of the Sphinx build plus doctest. We could then parallelize the doctests if needed. @hugovk @AA-Turner

@AA-Turner
Member

For both of these, we use non-optimised Python built from the branch -- because we're testing the branch's Python.

Right, I was just wondering if the slowdown was solely due to this (which we have no real choice over) or if other factors in Python 3.13+ have caused Sphinx to slow down (which we might be able to resolve).

I'm wondering if we use pytest to run these doctests instead of sphinx build plus doctest.

I don't think pytest supports the test{setup,cleanup,code} directives, which we make heavy use of.

A

@Privat33r-dev
Contributor Author

@AA-Turner

Maybe we can do the test parallelization in Sphinx itself? If you think it will work, I can find time this week to implement it.

https://github.com/sphinx-doc/sphinx/blob/master/sphinx/ext/doctest.py#L363C1-L368C1 -> multiprocessing

@AlexWaygood
Member

I wrote the following script, with which I can reproduce locally a significant performance regression over the course of Python 3.13 development when using Sphinx to build a single documentation file:

Script
import contextlib
import shutil
import subprocess
import time
import venv
from pathlib import Path


def run(args):
    try:
        subprocess.run(args, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(e.stdout)
        print(e.stderr)
        raise


with contextlib.chdir("Doc"):
    try:
        # Blank out every .rst file except library/typing.rst so that the
        # Sphinx build effectively only has one real document to process.
        for path in Path(".").iterdir():
            if path.is_dir() and not str(path).startswith("."):
                for doc_path in path.rglob("*.rst"):
                    if doc_path != Path("library/typing.rst"):
                        doc_path.write_text("foo")

        # Install the Doc requirements into a fresh venv created with the
        # interpreter running this script (i.e. the Python under test).
        venv.create(".venv", with_pip=True)

        run([
            ".venv/bin/python",
            "-m",
            "pip",
            "install",
            "-r",
            "requirements.txt",
            "--no-binary=':all:'",
        ])

        # Time a single-file HTML build of library/typing.rst.
        start = time.perf_counter()

        run([
            ".venv/bin/python",
            "-m",
            "sphinx",
            "-b",
            "html",
            ".",
            "build/html",
            "library/typing.rst",
        ])

        print(time.perf_counter() - start)
        shutil.rmtree(".venv")
        shutil.rmtree("build")
    finally:
        # Restore the truncated .rst files whether or not the build succeeded.
        subprocess.run(["git", "restore", "."], check=True, capture_output=True)

If I save the script as bisect_sphinx.py in the root of my CPython clone, and then run ./configure --with-pydebug && make -j10 && ./python.exe bisect_sphinx.py, the script reports that building library/typing.rst takes roughly 8 seconds on main, but only around 3 seconds with the v3.13.0a1 tag checked out.

I'll now try to use this script to see if I can bisect the performance regression to a specific commit.

@AA-Turner
Member

Thanks Alex. You prompted a memory of sphinx-doc/sphinx#12174, which might be relevant.

A

@AlexWaygood
Member

Thanks Alex. You prompted a memory of sphinx-doc/sphinx#12174, which might be relevant.

Indeed! The results of the bisection are in:

15309329b65a285cb7b3071f0f08ac964b61411b is the first bad commit
commit 15309329b65a285cb7b3071f0f08ac964b61411b
Author: Mark Shannon <[email protected]>
Date:   Wed Mar 20 08:54:42 2024 +0000

    GH-108362: Incremental Cycle GC (GH-116206)

 Doc/whatsnew/3.13.rst                              |  30 +
 Include/internal/pycore_gc.h                       |  41 +-
 Include/internal/pycore_object.h                   |  18 +-
 Include/internal/pycore_runtime_init.h             |   8 +-
 Lib/test/test_gc.py                                |  72 +-
 .../2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst |  12 +
 Modules/gcmodule.c                                 |  25 +-
 Objects/object.c                                   |  21 +
 Objects/structseq.c                                |   5 +-
 Python/gc.c                                        | 806 +++++++++++++--------
 Python/gc_free_threading.c                         |  23 +-
 Python/import.c                                    |   2 +-
 Python/optimizer.c                                 |   2 +-
 Tools/gdb/libpython.py                             |   7 +-
 14 files changed, 684 insertions(+), 388 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst
  • On the commit prior to 1530932, the script reported that building library/typing.rst took 4.47s
  • On 1530932, the script reported that it took 17.61s
  • On 909c6f7, it reports that it takes 8.38s

I did another bisect to find out what caused the performance improvement between 1530932 and 909c6f7, and the result was e28477f (git bisect reports it as the "first bad commit", but in this case it's actually the first good commit):

e28477f214276db941e715eebc8cdfb96c1207d9 is the first bad commit
commit e28477f214276db941e715eebc8cdfb96c1207d9
Author: Mark Shannon <[email protected]>
Date:   Fri Mar 22 18:43:25 2024 +0000

    GH-117108: Change the size of the GC increment to about 1% of the total heap size. (GH-117120)

 Include/internal/pycore_gc.h                       |  3 +-
 Lib/test/test_gc.py                                | 35 +++++++++++++++-------
 .../2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst |  3 ++
 Modules/gcmodule.c                                 |  2 +-
 Python/gc.c                                        | 30 +++++++++----------
 Python/gc_free_threading.c                         |  2 +-
 6 files changed, 47 insertions(+), 28 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst
  • On e2e0b4b, the script reports 16.46s
  • On e28477f, the script reports 6.78s -- still much worse than prior to 1530932, but a substantial improvement over e2e0b4b

So, we now know which CPython commits are the root cause of our Doctest CI job taking so much longer on the 3.13 branch than on the 3.12 branch! I'm now going to see if the same performance regression can be reproduced with a non-debug build. (So far, I've only been testing with a debug build, as this is what we use for our Doctest CI job, so this is what was relevant for understanding why that job might be taking so much longer on the 3.13 branch.)

@nascheme
Member

Here is what linux-perf shows: 47% of the time is spent in the cyclic GC. I don't see any other obvious hot spots to optimize.

(perf profile screenshot)

I did a second run with the GC completely disabled and without --with-pydebug. Again no obvious hot spots.

(perf profile screenshot)

@AlexWaygood
Member

The good news is that the bisection script is hugely faster on a PGO-optimized, non-debug build. The bad news is that main is still a fair amount slower than v3.13.0a1:

  • With v3.13.0a1 checked out, the script consistently takes between 1.27s and 1.29s on a PGO-optimized, non-debug build
  • On ede1504, it consistently takes between 1.76 and 1.82s

That's around a 40% performance regression on my bisection script with a PGO-optimized build, which is still quite bad!

@nascheme
Member

I tried running the tests with Python 3.12 and with the gc.DEBUG_STATS flag turned on. The output suggests that this workload creates a lot of cyclic garbage, and the gen 0 (youngest) collection of the cyclic GC is dealing with it. The "unreachable" number counts the cyclic garbage objects found by the GC.

I don't know enough about the "incremental" collector to comment on how it might be better optimized for this workload. I do suspect that there are many more similar workloads out there, and the GC that ships with the 3.13 release will hopefully not be much slower for them.

gc: collecting generation 0...
gc: objects in each generation: 1175 3804 1663292
gc: objects in permanent generation: 375
gc: done, 520 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 976 4408 1663292
gc: objects in permanent generation: 375
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 982 5336 1663289
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 959 5866 1663289
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 948 6397 1663288
gc: objects in permanent generation: 375
gc: done, 116 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 813 7178 1663288
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 1...
gc: objects in each generation: 951 7537 1663273
gc: objects in permanent generation: 375
gc: done, 5219 unreachable, 0 uncollectable, 0.0010s elapsed
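For reference, the statistics above come from turning on the GC debug flag before running the workload; a minimal sketch (run_workload() is a placeholder, not a real function from this thread):

import gc

# Print a line of collection statistics to stderr each time the cyclic GC
# runs, producing output like the log above.
gc.set_debug(gc.DEBUG_STATS)

run_workload()  # placeholder for the Sphinx build / doctest run being measured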

@AlexWaygood
Member

AlexWaygood commented Sep 25, 2024

For colour, I created a variant of my script that measures the execution time of parsing and building all of CPython's docs with a PGO-optimized build. (The script I used for bisecting the perf regression only parsed and built library/typing.rst.) Here are the results on v3.13.0a1 and ede1504:

  • v3.13.0a1: 45.5s
  • main: 62.7s

That's a 38% regression; pretty similar to what we saw with the smaller sample (and again, very bad!).

Variant of my script that parses and builds all of CPython's docs
import contextlib
import shutil
import subprocess
import time
import venv
from pathlib import Path

def run(args):
    subprocess.run(args, check=True, text=True)


with contextlib.chdir("Doc"):
    venv.create(".venv", with_pip=True)

    run([
        ".venv/bin/python",
        "-m",
        "pip",
        "install",
        "-r",
        "requirements.txt",
        "--no-binary=':all:'",
    ])

    start = time.perf_counter()

    run([
        ".venv/bin/python",
        "-m",
        "sphinx",
        "-b",
        "html",
        ".",
        "build/html",
    ])

    print(time.perf_counter() - start)
    shutil.rmtree(".venv")
    shutil.rmtree("build")

@AlexWaygood
Member

I created a new issue summarising our findings here. It seems clear that the new incremental GC causes significant performance regressions for workloads like Sphinx that create lots of reference cycles; that seems like something worth investigating further.
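As a rough illustration of the kind of workload involved (a hypothetical micro-benchmark, not Sphinx's actual code): docutils/Sphinx document trees hold parent/child references in both directions, which form reference cycles that only the cyclic GC can reclaim.

import time


class Node:
    """Toy stand-in for a doctree node: children refer back to their parent,
    so every parent/child pair forms a reference cycle."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


start = time.perf_counter()
for _ in range(100_000):
    root = Node()
    for _ in range(10):
        Node(parent=root)
    # Dropping `root` each iteration leaves cyclic garbage that only the
    # cyclic GC can collect.
print(f"{time.perf_counter() - start:.2f}s")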

@Privat33r-dev
Contributor Author

@AlexWaygood amazing findings, thanks for your investigation :) I felt that the issue was likely with Python itself (because the GH Actions runners and many other things are the same), but I couldn't do a proper investigation due to time constraints.

Thanks to everyone involved; I am glad that you found the root cause.

@Privat33r-dev
Contributor Author

Privat33r-dev commented Oct 1, 2024

The issue was mostly resolved by #124770.

However, there is still room for improvement, mostly on Sphinx's side. I don't see any obvious improvements that can be made in this repository, so the issue is resolved.

I am glad that the performance regression in 3.13 was prevented as a result of teamwork. Thanks to all the participants.

@AlexWaygood
Member

AlexWaygood commented Oct 1, 2024

@Privat33r-dev, thanks so much for opening the issue and for spotting how much slower the job was on 3.13 than 3.12. You helped prevent a new version of Python being released with a serious performance regression that could have affected lots of workloads (not just Sphinx!) — it's great that we identified it before the release went out! I'm only sorry it took us a while to investigate.
