Compare & plot

Two (or more) saved runs in, one comparison out — as a table (benchmem compare) or an interactive view (benchmem plot). Both work over --metric time or any memory metric, and both group by the dims your tests carry.

Setup

A scratch dir, and a baseline run to diff against. Plotly figures render inline; the notebook_connected renderer loads plotly.js from the CDN.

import os
import sys
import tempfile
from pathlib import Path

import plotly.io as pio

os.environ["FORCE_COLOR"] = "1"
os.environ["PATH"] = f"{Path(sys.executable).parent}{os.pathsep}{os.environ['PATH']}"
pio.renderers.default = "notebook_connected"
_tmp = Path(tempfile.mkdtemp(prefix="pytest-benchmem-"))

suite = _tmp / "test_sortbench.py"
suite.write_text("""
import pytest

@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_sort(benchmark_memory, n):
    benchmark_memory(sorted, list(range(n, 0, -1)))
""")
baseline, candidate = _tmp / "baseline.json", _tmp / "candidate.json"
!pytest {suite} --benchmark-only --benchmark-json={baseline}  --benchmark-columns=min,median -q -p no:cacheprovider

.                                                                     [100%]

Wrote benchmark data in: <_io.BufferedWriter name='/tmp/pytest-benchmem-znu40z1c/baseline.json'>

benchmark: 4 tests                                                                                         
                                                                                                           
  Name (time in us)                  Min               Median   │   peak (MiB)   allocated (MiB)   allocs  
 ───────────────────────────────────────────────────────────────────────────────────────────────────────── 
  test_sort[10000]         50.2910 (1.0)        51.1720 (1.0)   │         0.08              0.08        1  
  test_sort[50000]       257.3480 (5.12)      262.5930 (5.13)   │         0.38              0.38        1  
  test_sort[200000]   1,030.7030 (20.49)   1,043.9730 (20.40)   │         1.53              1.53        1  
  test_sort[500000]   2,656.7050 (52.83)   2,758.1240 (53.90)   │         3.81              3.81        1  
                                                                                                           
memory (right of │): a separate, untimed pass — single shot, not the timed rounds                          
4 passed in 4.69s

On a real change you'd run the suite on main, then on your branch. Here we just run it twice on the same code, so the deltas below are pure measurement noise.

!pytest {suite} --benchmark-only --benchmark-json={candidate} --benchmark-columns=min,median -q -p no:cacheprovider

.                                                                     [100%]

Wrote benchmark data in: <_io.BufferedWriter name='/tmp/pytest-benchmem-znu40z1c/candidate.json'>

benchmark: 4 tests                                                                                         
                                                                                                           
  Name (time in us)                  Min               Median   │   peak (MiB)   allocated (MiB)   allocs  
 ───────────────────────────────────────────────────────────────────────────────────────────────────────── 
  test_sort[10000]         49.4710 (1.0)        55.3320 (1.0)   │         0.08              0.08        1  
  test_sort[50000]       254.3580 (5.14)      265.9380 (4.81)   │         0.38              0.38        1  
  test_sort[200000]   1,030.6030 (20.83)   1,044.4030 (18.88)   │         1.53              1.53        1  
  test_sort[500000]   2,768.1390 (55.95)   3,407.1190 (61.58)   │         3.81              3.81        1  
                                                                                                           
memory (right of │): a separate, untimed pass — single shot, not the timed rounds                          
4 passed in 4.46s

`benchmem compare`: the delta table

A per-id delta table with percent change, for whichever --metric you ask for. Ids in only one run show —.

!benchmem compare {baseline} {candidate} --metric peak

peak (MiB)                                           
                                                     
  id                  baseline   candidate   change  
 ─────────────────────────────────────────────────── 
  test_sort[10000]        0.08        0.08    +0.0%  
  test_sort[200000]       1.53        1.53    +0.0%  
  test_sort[500000]       3.81        3.81    +0.0%  
  test_sort[50000]        0.38        0.38    +0.0%

!benchmem compare {baseline} {candidate} --metric time

time (s)                                              
                                                      
  id                   baseline   candidate   change  
 ──────────────────────────────────────────────────── 
  test_sort[10000]    5.029e-05   4.947e-05    -1.6%  
  test_sort[200000]    0.001031    0.001031    -0.0%  
  test_sort[500000]    0.002657    0.002768    +4.2%  
  test_sort[50000]    0.0002573   0.0002544    -1.2%
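The change column is a plain relative difference against the baseline. As a sanity check, the sketch below recomputes it from the Min timings printed by the two pytest runs above. That the time metric tracks the Min column is an inference from the numbers matching, and `pct_change` is a hypothetical helper, not part of benchmem.

```python
# Min timings in microseconds, copied from the two pytest runs above.
baseline_us = {
    "test_sort[10000]": 50.2910,
    "test_sort[50000]": 257.3480,
    "test_sort[200000]": 1030.7030,
    "test_sort[500000]": 2656.7050,
}
candidate_us = {
    "test_sort[10000]": 49.4710,
    "test_sort[50000]": 254.3580,
    "test_sort[200000]": 1030.6030,
    "test_sort[500000]": 2768.1390,
}


def pct_change(old: float, new: float) -> float:
    """Relative change of new against old, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# Format with an explicit sign and one decimal, like the table's change column.
changes = {
    name: f"{pct_change(baseline_us[name], candidate_us[name]):+.1f}%"
    for name in baseline_us
}
for name, change in changes.items():
    print(f"{name:<20} {change:>7}")
```

The four results reproduce the table: -1.6%, -1.2%, -0.0% and +4.2%. Working from the rounded values shown in the compare table instead would give -1.1% for `test_sort[50000]`, which suggests the tool computes the percentage from the unrounded numbers.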

For timing comparisons you can also use pytest-benchmark's own tooling directly — pytest-benchmark compare, --benchmark-histogram. pytest-benchmem doesn't reimplement those; it adds the memory-aware, dims-aware views.

Order the rows with --sort, which takes name, value (largest in the last run first), or change, and write the raw numbers for another tool with --csv out.csv:

benchmem compare {baseline} {candidate} --metric peak --sort value --csv peak.csv

Gate on a regression with --fail-on — it exits non-zero past a threshold. Here baseline and candidate are the same code, so nothing trips it (exit 0); on a real regression the offending ids print and the command exits 1:

!benchmem compare {baseline} {candidate} --metric peak --fail-on peak:10% --fail-on allocations:5%; echo "exit: $?"

peak (MiB)                                           
                                                     
  id                  baseline   candidate   change  
 ─────────────────────────────────────────────────── 
  test_sort[10000]        0.08        0.08    +0.0%  
  test_sort[200000]       1.53        1.53    +0.0%  
  test_sort[500000]       3.81        3.81    +0.0%  
  test_sort[50000]        0.38        0.38    +0.0%  
                                                     
no regressions over thresholds

exit: 0

Thresholds are percent (peak:10%) or absolute (peak:5MiB), on peak, allocated, or allocations. The next section wires this into CI; the reference has the full grammar.
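To make the two threshold forms concrete, here is a minimal sketch of how a spec such as peak:10% or peak:5MiB could be parsed and checked. It is not benchmem's implementation: `Threshold`, `parse_threshold` and `exceeds` are hypothetical names, only the two forms shown above are handled, values are assumed to be in MiB, and treating "past a threshold" as strictly greater than the limit is an assumption.

```python
import re
from typing import NamedTuple


class Threshold(NamedTuple):
    metric: str
    limit: float
    relative: bool  # True for percent limits, False for absolute MiB limits


def parse_threshold(spec: str) -> Threshold:
    """Parse 'peak:10%' or 'peak:5MiB' (only these two forms; an assumption)."""
    match = re.fullmatch(r"(\w+):(\d+(?:\.\d+)?)(%|MiB)", spec)
    if match is None:
        raise ValueError(f"unrecognised threshold: {spec!r}")
    metric, number, unit = match.groups()
    return Threshold(metric, float(number), unit == "%")


def exceeds(threshold: Threshold, baseline: float, candidate: float) -> bool:
    """True when growth from baseline to candidate is over the limit.

    Assumes baseline is non-zero for percent limits.
    """
    growth = candidate - baseline
    if threshold.relative:
        return growth / baseline * 100.0 > threshold.limit
    return growth > threshold.limit


ten_percent = parse_threshold("peak:10%")
print(exceeds(ten_percent, baseline=3.81, candidate=3.81))  # same code: False
print(exceeds(ten_percent, baseline=3.81, candidate=4.50))  # about +18%: True
```

A percent limit scales with the size of each id, while an absolute limit catches large jumps on ids that are already big; combining both on one run is a reasonable way to cover each case.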

Gate CI on regressions

Two ways to fail a PR when memory regresses.

A) Two saved JSON files — save a baseline (e.g. on main), then compare the PR run against it with benchmem compare --fail-on. The baseline file is just the --benchmark-json from an earlier run, restored from cache or a base-branch build:

# on the PR branch:
pytest --benchmark-only --benchmark-json=pr.json
benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%

B) Inline, via pytest-benchmark storage — no separate files. Save a baseline into storage once with pytest-benchmark's own --benchmark-save (or --benchmark-autosave every run), then gate the next run against it. --benchmark-memory-compare-fail implies --benchmark-memory-compare, so the PR run compares against the latest saved run automatically:

# on main — record the baseline into .benchmarks/ storage:
pytest --benchmark-only --benchmark-memory --benchmark-save=main

# on the PR branch — fail if peak grows >10% vs that baseline:
pytest --benchmark-only --benchmark-memory --benchmark-memory-compare-fail=peak:10%

Without a prior saved run, the inline gate is a no-op — it prints "no prior run with memory to compare against" and passes. Save a baseline first.

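The steps of a minimal GitHub Actions job using approach A, restoring the baseline from cache. This assumes an earlier run on the base branch wrote main.json and saved it under the same cache key: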

- uses: actions/cache@v4
  with:
    path: main.json
    key: benchmem-baseline-${{ github.base_ref }}
- run: pytest --benchmark-only --benchmark-json=pr.json
- run: benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%

allocations is the steadiest tripwire — near-deterministic, so a move there is almost always a real behaviour change rather than measurement noise.

`benchmem plot`: the interactive views

benchmem plot writes an interactive plotly view to standalone HTML. It picks the view by run count — but each view answers a different question, so override with --view when you want a specific one:

Runs	Default view	Answers
1	`scaling`	how does cost grow with input size?
2	`scatter`	which ids moved, and were they already big?
2	`compare` (`--view compare`)	ranked — what moved most, in native units?
3+	`sweep`	fold-change across versions, one cell per (id, run)

!benchmem plot --metric peak {baseline} {candidate} -o {_tmp / "scatter.html"}

scatter (peak): 4 ids → /tmp/pytest-benchmem-znu40z1c/scatter.html

Every view is a plot_* function over the same load_long_df seam — call it directly to render the same figure inline, no HTML round-trip. Each takes a metric, returns (figure, n_ids), and shares three options: facet (small-multiple by a dim), labels (name the series, defaulting to file stems), and clip (clamp the colour scale so one outlier doesn't wash the rest out).

Scaling — a single run, cost vs. size. plot_scaling auto-infers the x-axis from the numeric n dim (override with x=), and auto-picks log/linear (force with log=). The baseline alone draws sorted's peak-memory curve:

from pytest_benchmem import plotting

plotting.plot_scaling([baseline], metric="peak")[0]
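The question the scaling view answers, how cost grows with input size, can also be put as a single number. The sketch below fits a straight line to log(peak) against log(n) using the baseline figures printed earlier; the slope is the exponent k in peak ~ n**k. This is an illustration in plain Python, not something plot_scaling is documented to compute.

```python
import math

# n and peak (MiB) as printed for the baseline run above.
sizes = [10_000, 50_000, 200_000, 500_000]
peak_mib = [0.08, 0.38, 1.53, 3.81]

# Least-squares slope of log(peak) against log(n): the exponent k in peak ~ n**k.
xs = [math.log(n) for n in sizes]
ys = [math.log(p) for p in peak_mib]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
print(f"growth exponent ~ {slope:.2f}")
```

The exponent comes out close to 1, so peak memory grows linearly with n. That is consistent with sorted allocating one new list of n references, at 8 bytes each on a 64-bit CPython build.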

Scatter: two runs. x = baseline cost (log), y = candidate/baseline ratio, colour = absolute Δ. The top-right is the "big and got bigger" corner, where a regression actually costs you. Here on memory:

plotting.plot_scatter([baseline, candidate], metric="peak")[0]
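The three quantities the scatter view encodes are simple to derive by hand. Peak memory is identical between these two runs, so every point would sit at ratio 1.0 with zero delta; the sketch below uses the Min timings instead so the points actually move. It is a plain-Python illustration of the axes described above, and the `points` mapping is a hypothetical structure, not benchmem's.

```python
# Min timings in microseconds from the baseline and candidate runs above.
baseline_us = {
    "test_sort[10000]": 50.2910,
    "test_sort[50000]": 257.3480,
    "test_sort[200000]": 1030.7030,
    "test_sort[500000]": 2656.7050,
}
candidate_us = {
    "test_sort[10000]": 49.4710,
    "test_sort[50000]": 254.3580,
    "test_sort[200000]": 1030.6030,
    "test_sort[500000]": 2768.1390,
}

# One point per id present in both runs: (x, y, colour).
points = {
    name: (
        baseline_us[name],                       # x: baseline cost (log axis)
        candidate_us[name] / baseline_us[name],  # y: candidate/baseline ratio
        candidate_us[name] - baseline_us[name],  # colour: absolute delta
    )
    for name in baseline_us.keys() & candidate_us.keys()
}

# Print from cheapest to most expensive baseline, i.e. left to right on the plot.
for name, (x, ratio, delta) in sorted(points.items(), key=lambda kv: kv[1][0]):
    print(f"{name:<20} x={x:>9.1f}  ratio={ratio:.3f}  delta={delta:+.1f} us")
```

Here `test_sort[500000]` has both the largest baseline and the largest ratio, which puts it in the top-right corner, even though the movement is only noise in this case.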

Compare — two runs, ranked. A bar per id sorted by absolute delta, diverging colour around zero — the "did anything regress, biggest first" view. Pass sort="relative" to rank by percent instead. On timing this time:

plotting.plot_compare([baseline, candidate], metric="time")[0]

Sweep — three or more runs. A heatmap of log₂ fold-change vs the first run, one column per run, one row per id — the natural picture for a version sweep. A third run to make one:

third = _tmp / "third.json"
!pytest {suite} --benchmark-only --benchmark-json={third} --benchmark-columns=min,median -q -p no:cacheprovider
plotting.plot_sweep([baseline, candidate, third], metric="peak")[0]

.                                                                     [100%]

Wrote benchmark data in: <_io.BufferedWriter name='/tmp/pytest-benchmem-znu40z1c/third.json'>

benchmark: 4 tests                                                                                         
                                                                                                           
  Name (time in us)                  Min               Median   │   peak (MiB)   allocated (MiB)   allocs  
 ───────────────────────────────────────────────────────────────────────────────────────────────────────── 
  test_sort[10000]         50.5820 (1.0)        51.9920 (1.0)   │         0.08              0.08        1  
  test_sort[50000]       255.2780 (5.05)      260.2580 (5.01)   │         0.38              0.38        1  
  test_sort[200000]   1,034.5030 (20.45)   1,049.2440 (20.18)   │         1.53              1.53        1  
  test_sort[500000]   2,683.0960 (53.04)   2,967.5860 (57.08)   │         3.81              3.81        1  
                                                                                                           
memory (right of │): a separate, untimed pass — single shot, not the timed rounds                          
4 passed in 4.29s
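Each heatmap cell is the log₂ fold-change of a run against the first run, so the first column is always 0, +1 means doubled and -1 means halved. Since peak is identical across these three runs, every cell of the figure above is 0; the sketch below uses the Min timings from the three runs to show non-zero cells. It is a plain-Python illustration, and `heatmap` is a hypothetical name.

```python
import math

# Min timings in microseconds for baseline, candidate, third (from the runs above).
mins_us = {
    "test_sort[10000]": [50.2910, 49.4710, 50.5820],
    "test_sort[50000]": [257.3480, 254.3580, 255.2780],
    "test_sort[200000]": [1030.7030, 1030.6030, 1034.5030],
    "test_sort[500000]": [2656.7050, 2768.1390, 2683.0960],
}

# One row per id, one cell per run: log2(run / first run).
heatmap = {
    name: [math.log2(value / runs[0]) for value in runs]
    for name, runs in mins_us.items()
}
for name, row in heatmap.items():
    print(f"{name:<20}", "  ".join(f"{cell:+.3f}" for cell in row))
```

The log scale keeps speed-ups and slow-downs symmetric around zero, which is why a diverging colour scale reads naturally on this view.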

Naming the series

By default each run is labelled by its file stem (baseline, candidate, …). Pass labels= to name them yourself — the API behind plot's -l/--label — which is what you want when the filenames are version numbers or commit shas:

plotting.plot_sweep([baseline, candidate, third], metric="peak",
                    labels=["v0.6", "v0.7", "v0.8"])[0]

Faceting by a dim

facet= (CLI --facet) splits any view into small multiples by a dim — including the node.* structural dims. With one test function per operation, for instance, --facet node.func gives one panel per operation:

benchmem plot run.json --metric peak --facet node.func