Load Testing Tool Benchmarks: Methodology
This page documents how we benchmark HTTP load testing tools — the target service, hardware, scenarios, metrics, and how raw data is published. Benchmark numbers without methodology are not benchmarks; they are claims. The goal of publishing the methodology separately is to make the runs reproducible by anyone with a Hetzner account and a few hours.
Once results exist, we will add a results section to this page and link the raw CSVs. We will not edit the methodology after a result is published; if it changes, that is a new benchmark.
What this benchmark is and is not
The benchmark answers one question: given identical conditions, how many requests per second can each tool sustain against a known target before the generator itself becomes the bottleneck, and what does its latency reporting look like under that load?
It is not:
- A claim about which tool is best — that depends on workflow, scripting, reporting, and team fit. See the Best Load Testing Tools guide for that conversation.
- A test of how realistic each tool's traffic patterns are. We use a single endpoint with a fixed payload because that isolates load-generator behavior. Multi-step scenario fidelity is a separate evaluation.
- A test of cost or scalability across distributed worker fleets. We benchmark single-node generators only.
Target service
The target is a minimal HTTP/1.1 server written in Go that does the cheapest thing possible: returns 200 OK with a fixed 256-byte JSON body, no logging, no allocation per request beyond what the standard library forces. Source is published in the benchmark repo.
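For orientation, here is a minimal sketch of what a target with that shape looks like in Go. The canonical source in the repo is authoritative; the flag names mirror the build instructions below, and the padding arithmetic is illustrative:

```go
package main

import (
	"flag"
	"log"
	"net/http"
	"strings"
)

func main() {
	addr := flag.String("addr", ":8080", "listen address")
	size := flag.Int("body", 256, "response body size in bytes")
	flag.Parse()

	// Build the fixed JSON body once so the per-request path does no extra
	// work. The 24 bytes of scaffolding around the padding is illustrative.
	body := []byte(`{"status":"ok","pad":"` + strings.Repeat("x", *size-24) + `"}`)

	http.HandleFunc("/api/health", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write(body) // implicit 200 OK; no logging on the hot path
	})
	log.Fatal(http.ListenAndServe(*addr, nil))
}
```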
Why a synthetic target instead of a real application? Because the question we are answering is about the load-generator, not the system under test. A real application introduces variance from the database, the runtime, the OS scheduler, and external dependencies. Those are interesting questions, but they belong to a different benchmark.
The target runs on a separate machine from each tool, on a 1 Gbps private network, and is warmed for 60 seconds before each run. Server CPU is monitored; any run where target CPU exceeds 70% is discarded as a target-bound run.
Target reproducibility
```bash
git clone https://github.com/cloud-native/loadtester-benchmarks
cd loadtester-benchmarks/target
go build -o target .
./target -addr :8080 -body 256
```
Hardware
Both the target and the load generator run on Hetzner Cloud CCX33 instances (8 dedicated vCPU, 32 GB RAM, Ubuntu 24.04 LTS) in the same datacenter (HEL1). The network is documented as 10 Gbps shared, but realistic sustained throughput is closer to 1–2 Gbps; we cap test rates well below that.
| Role | Instance | vCPU | RAM | OS |
|---|---|---|---|---|
| Target | CCX33 | 8 | 32 GB | Ubuntu 24.04 |
| Generator | CCX33 | 8 | 32 GB | Ubuntu 24.04 |
Kernel and TCP tunables (applied identically to both):
```
net.core.somaxconn = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 1048576
ulimit -n 1048576
```
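A mistuned box silently skews results, so it is worth verifying the tunables before each run. A minimal pre-flight check in Go, assuming the /proc paths above; this helper is our illustration, not part of the published repo:

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// want maps /proc paths to the expected values from the tunables above.
// Whitespace is normalized so tab- vs space-separated values compare equal.
var want = map[string]string{
	"/proc/sys/net/core/somaxconn":           "65535",
	"/proc/sys/net/ipv4/ip_local_port_range": "1024 65535",
	"/proc/sys/net/ipv4/tcp_tw_reuse":        "1",
	"/proc/sys/fs/file-max":                  "1048576",
}

func main() {
	for path, expect := range want {
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("FAIL %s: %v\n", path, err)
			continue
		}
		got := strings.Join(strings.Fields(string(raw)), " ")
		if got != expect {
			fmt.Printf("FAIL %s: got %q, want %q\n", path, got, expect)
		}
	}

	// ulimit -n is a process limit, not a sysctl, so check it separately.
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err == nil && lim.Cur < 1048576 {
		fmt.Printf("FAIL ulimit -n: got %d, want >= 1048576\n", lim.Cur)
	}
}
```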
Scenarios
Each tool runs the same three scenarios. All use HTTP/1.1, keep-alive enabled, GET requests against /api/health.
| Scenario | Target RPS | Duration | Connections | Purpose |
|---|---|---|---|---|
| S1 — Steady 1k | 1,000 | 120s | 50 | Baseline: every modern tool should handle this without strain |
| S2 — Steady 10k | 10,000 | 120s | 500 | Stress: separates lightweight CLIs from production-grade generators |
| S3 — Burst | 0 → 20,000 → 0 ramp | 180s | 1,000 | Tests how each tool handles ramp behavior and reporting under varying load |
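S3's ramp is triangular. A sketch of the intended rate curve, assuming a symmetric linear ramp up and down; the exact shape each tool uses is defined by its committed scenario config:

```go
package main

import "fmt"

// targetRate returns the intended request rate at a given elapsed time for
// a linear 0 -> peak -> 0 ramp over the full duration.
func targetRate(elapsed, duration, peak float64) float64 {
	half := duration / 2
	if elapsed <= half {
		return peak * (elapsed / half) // ramp up
	}
	return peak * ((duration - elapsed) / half) // ramp down
}

func main() {
	for _, t := range []float64{0, 45, 90, 135, 180} {
		fmt.Printf("t=%3.0fs rate=%6.0f rps\n", t, targetRate(t, 180, 20000))
	}
}
```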
Each scenario runs 5 times per tool, with a 30-second cooldown between runs. We report median, p5, and p95 across the 5 runs (not within a single run — that is a separate metric we also publish).
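Cross-run aggregation is plain order statistics over five values. A sketch, assuming linear interpolation between order statistics; the repo's analysis scripts are authoritative:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-quantile (0 <= p <= 1) of samples using linear
// interpolation on rank p*(n-1); with n=5, p=0.5 is exactly the 3rd value.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := p * float64(len(s)-1)
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	frac := rank - float64(lo)
	return s[lo]*(1-frac) + s[hi]*frac
}

func main() {
	runs := []float64{9950, 9980, 10010, 9920, 9990} // e.g., achieved RPS across 5 runs
	fmt.Printf("median=%.0f p5=%.0f p95=%.0f\n",
		percentile(runs, 0.50), percentile(runs, 0.05), percentile(runs, 0.95))
}
```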
Tools and versions
| Tool | Version | Invocation style |
|---|---|---|
| LoadTester | TBD | API-triggered run, single-worker config |
| k6 | TBD | k6 run script.js |
| JMeter | TBD | Non-GUI jmeter -n -t plan.jmx |
| Vegeta | TBD | vegeta attack -rate=N -duration=120s |
| hey | TBD | hey -z 120s -c 50 -q N |
| wrk | TBD | wrk -t8 -c500 -d120s |
| ApacheBench | TBD | ab -n N -c 50 |
Exact invocations for each tool are committed in the benchmark repo under scenarios/<tool>/ alongside any required config.
What we measure
From the load generator (per tool):
- Achieved RPS (mean across run)
- Latency: min, p50, p90, p95, p99, p99.9, max
- Error rate (non-2xx responses, connection errors, timeouts — counted separately; see the tally sketch after this list)
- Wall-clock duration
- Generator CPU and memory (sampled at 1 Hz via pidstat)
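The error-tally sketch referenced above, assuming the three failure classes are kept distinct rather than folded into one rate; the type and field names are illustrative:

```go
package main

import "fmt"

// ErrorTally keeps the three failure classes distinct: a timeout is not a
// connection error, and a 500 is not a transport failure.
type ErrorTally struct {
	Non2xx     uint64 // response received, status outside 200-299
	ConnErrors uint64 // dial/reset failures, no response at all
	Timeouts   uint64 // request exceeded the per-request deadline
}

// Rate is the combined error rate over the total requests attempted.
func (t ErrorTally) Rate(total uint64) float64 {
	if total == 0 {
		return 0
	}
	return float64(t.Non2xx+t.ConnErrors+t.Timeouts) / float64(total)
}

func main() {
	t := ErrorTally{Non2xx: 12, ConnErrors: 3, Timeouts: 1}
	fmt.Printf("error rate: %.6f\n", t.Rate(1_200_000))
}
```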
From the target (validation):
- Requests received per second (target's view)
- Bytes in/out
- Target CPU / memory (we discard runs where target was the bottleneck)
Cross-validation: we compare the generator's "achieved RPS" with the target's "received RPS." Discrepancies above 2% are flagged in the published results — they usually mean the generator is over- or under-counting.
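The check itself is trivial. A sketch, assuming the 2% threshold is measured relative to the target's count (the target has no reason to miscount):

```go
package main

import (
	"fmt"
	"math"
)

// flagged reports the relative discrepancy between the generator's achieved
// RPS and the target's received RPS, and whether it exceeds the 2% threshold.
func flagged(generatorRPS, targetRPS float64) (float64, bool) {
	disc := math.Abs(generatorRPS-targetRPS) / targetRPS
	return disc, disc > 0.02
}

func main() {
	if d, bad := flagged(10050, 9800); bad {
		fmt.Printf("discrepancy %.1f%%, flagged for review\n", d*100)
	}
}
```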
Latency is harder to cross-validate because tools measure it at different points in the request lifecycle, so we collect an independent server-side view (via tcp_info sampling) where possible. Published latency tables include a "measurement model" column so the comparison is honest.
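For reference, server-side TCP_INFO sampling looks roughly like this on Linux, using golang.org/x/sys/unix. This is a sketch of the mechanism, not the repo's sampler:

```go
package main

import (
	"fmt"
	"log"
	"net"

	"golang.org/x/sys/unix"
)

// sampleRTT reads the kernel's smoothed RTT (in µs) for an established TCP
// connection via the TCP_INFO socket option. Linux-only.
func sampleRTT(c *net.TCPConn) (uint32, error) {
	raw, err := c.SyscallConn()
	if err != nil {
		return 0, err
	}
	var rtt uint32
	var opErr error
	if err := raw.Control(func(fd uintptr) {
		info, e := unix.GetsockoptTCPInfo(int(fd), unix.IPPROTO_TCP, unix.TCP_INFO)
		if e != nil {
			opErr = e
			return
		}
		rtt = info.Rtt
	}); err != nil {
		return 0, err
	}
	return rtt, opErr
}

func main() {
	// Self-contained demo: sample a loopback connection from the accept side.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go func() { net.Dial("tcp", ln.Addr().String()) }()
	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	rtt, err := sampleRTT(conn.(*net.TCPConn))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("kernel smoothed RTT: %dµs\n", rtt)
}
```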
How results are published
For every published benchmark, we commit:
- Raw CSVs for every run (per-request timing where available, per-second buckets where not).
- The exact tool invocation, scenario script, and config files used.
- Generator and target syslog excerpts for the run window.
- Hetzner instance ID and snapshot label so the run is auditable.
- Discarded-run log with the reason for each discard (target-bound, network blip, etc.).
All of this lives in cloud-native/loadtester-benchmarks. Issues and pull requests are open. If you find a methodology flaw or believe we have under-tuned a specific tool, file an issue with the proposed change and a re-run will follow.
Conflict of interest disclosure
This benchmark is run by LoadTester, which is one of the tools being measured. That is the unavoidable bias. The mitigations are: (1) the methodology is fixed before any run; (2) all raw data is published; (3) the per-tool invocations are reviewable and tunable by the community; (4) when LoadTester does worse than another tool on a given metric, the result section says so without softening. We treat the benchmark as a credibility instrument, not a marketing instrument — a benchmark that can only ever favor the publisher is worse than no benchmark at all.
Results
Pending first run. When published, results will appear below as a dated subsection (e.g., 2026-06 — initial run) with the tables, charts, and raw-data links. We do not back-fill or quietly update old results; corrections are appended as new dated subsections with a note explaining what changed.
Methodology changelog
- 2026-05-06 — methodology published, no runs yet.