Load Testing Tool Benchmarks: Methodology
This page documents how we benchmark HTTP load testing tools — the target service, hardware, scenarios, metrics, and how raw data is published. Benchmark numbers without methodology are not benchmarks; they are claims. The goal of publishing the methodology separately is to make the runs reproducible by anyone with a Hetzner account and a few hours.
Once results exist, we will add a results section to this page and link the raw CSVs. We will not edit the methodology after a result is published; if it changes, that is a new benchmark.
What this benchmark is and is not
The benchmark answers one question: given identical conditions, how many requests per second can each tool sustain against a known target before the generator itself becomes the bottleneck, and what does its latency reporting look like under that load?
It is not:
- A claim about which tool is best — that depends on workflow, scripting, reporting, and team fit. See the Best Load Testing Tools guide for that conversation.
- A test of how realistic each tool's traffic patterns are. We use a single endpoint with a fixed payload because that isolates load-generator behavior. Multi-step scenario fidelity is a separate evaluation.
- A test of cost or scalability across distributed worker fleets. We benchmark single-node generators only.
Target service
The target is a minimal HTTP/1.1 server written in Go that does the cheapest thing possible: returns 200 OK with a fixed 256-byte JSON body, no logging, no allocation per request beyond what the standard library forces. Source is published in the benchmark repo.
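For orientation, here is a minimal sketch of what a target with that shape looks like in Go. The canonical source in the repo is authoritative; the flag names mirror the build instructions below, and the padding arithmetic is illustrative:

```go
package main

import (
	"flag"
	"log"
	"net/http"
	"strings"
)

func main() {
	addr := flag.String("addr", ":8080", "listen address")
	size := flag.Int("body", 256, "response body size in bytes")
	flag.Parse()

	// Build the fixed JSON body once so the per-request path does no extra
	// work. The 24 bytes of scaffolding around the padding is illustrative.
	body := []byte(`{"status":"ok","pad":"` + strings.Repeat("x", *size-24) + `"}`)

	http.HandleFunc("/api/health", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write(body) // implicit 200 OK; no logging on the hot path
	})
	log.Fatal(http.ListenAndServe(*addr, nil))
}
```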
Why a synthetic target instead of a real application? Because the question we are answering is about the load-generator, not the system under test. A real application introduces variance from the database, the runtime, the OS scheduler, and external dependencies. Those are interesting questions, but they belong to a different benchmark.
The target runs on a separate machine from each tool, on a 1 Gbps private network, and is warmed for 60 seconds before each run. Server CPU is monitored; any run where target CPU exceeds 70% is discarded as a target-bound run.
Target reproducibility
```bash
git clone https://github.com/cloud-native/loadtester-benchmarks
cd loadtester-benchmarks/target
go build -o target .
./target -addr :8080 -body 256
```
Hardware
Both the target and the load generator run on Hetzner Cloud CCX33 instances (8 dedicated vCPU, 32 GB RAM, Ubuntu 24.04 LTS) in the same datacenter (HEL1). The network is documented as 10 Gbps shared, but realistic sustained throughput is closer to 1–2 Gbps; we cap test rates well below that.
| Role | Instance | vCPU | RAM | OS |
|---|---|---|---|---|
| Target | CCX33 | 8 | 32 GB | Ubuntu 24.04 |
| Generator | CCX33 | 8 | 32 GB | Ubuntu 24.04 |
Kernel and TCP tunables (applied identically to both):
```
net.core.somaxconn = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 1048576
ulimit -n 1048576
```
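A mistuned box silently skews results, so it is worth verifying the tunables before each run. A minimal pre-flight check in Go, assuming the /proc paths above; this helper is our illustration, not part of the published repo:

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// want maps /proc paths to the expected values from the tunables above.
// Whitespace is normalized so tab- vs space-separated values compare equal.
var want = map[string]string{
	"/proc/sys/net/core/somaxconn":           "65535",
	"/proc/sys/net/ipv4/ip_local_port_range": "1024 65535",
	"/proc/sys/net/ipv4/tcp_tw_reuse":        "1",
	"/proc/sys/fs/file-max":                  "1048576",
}

func main() {
	for path, expect := range want {
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("FAIL %s: %v\n", path, err)
			continue
		}
		got := strings.Join(strings.Fields(string(raw)), " ")
		if got != expect {
			fmt.Printf("FAIL %s: got %q, want %q\n", path, got, expect)
		}
	}

	// ulimit -n is a process limit, not a sysctl, so check it separately.
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err == nil && lim.Cur < 1048576 {
		fmt.Printf("FAIL ulimit -n: got %d, want >= 1048576\n", lim.Cur)
	}
}
```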
Scenarios
Each tool runs the same three scenarios. All use HTTP/1.1, keep-alive enabled, GET requests against /api/health.
| Scenario | Target RPS | Duration | Connections | Purpose |
|---|---|---|---|---|
| S1 — Steady 1k | 1,000 | 120s | 50 | Baseline: every modern tool should handle this without strain |
| S2 — Steady 10k | 10,000 | 120s | 500 | Stress: separates lightweight CLIs from production-grade generators |
| S3 — Burst | 0 → 20,000 → 0 ramp | 180s | 1,000 | Tests how each tool handles ramp behavior and reporting under varying load |
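S3's ramp is triangular. A sketch of the intended rate curve, assuming a symmetric linear ramp up and down; the exact shape each tool uses is defined by its committed scenario config:

```go
package main

import "fmt"

// targetRate returns the intended request rate at a given elapsed time for
// a linear 0 -> peak -> 0 ramp over the full duration.
func targetRate(elapsed, duration, peak float64) float64 {
	half := duration / 2
	if elapsed <= half {
		return peak * (elapsed / half) // ramp up
	}
	return peak * ((duration - elapsed) / half) // ramp down
}

func main() {
	for _, t := range []float64{0, 45, 90, 135, 180} {
		fmt.Printf("t=%3.0fs rate=%6.0f rps\n", t, targetRate(t, 180, 20000))
	}
}
```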
Each scenario runs 5 times per tool, with a 30-second cooldown between runs. We report median, p5, and p95 across the 5 runs (not within a single run — that is a separate metric we also publish).
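Cross-run aggregation is plain order statistics over five values. A sketch, assuming linear interpolation between order statistics; the repo's analysis scripts are authoritative:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-quantile (0 <= p <= 1) of samples using linear
// interpolation on rank p*(n-1); with n=5, p=0.5 is exactly the 3rd value.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := p * float64(len(s)-1)
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	frac := rank - float64(lo)
	return s[lo]*(1-frac) + s[hi]*frac
}

func main() {
	runs := []float64{9950, 9980, 10010, 9920, 9990} // e.g., achieved RPS across 5 runs
	fmt.Printf("median=%.0f p5=%.0f p95=%.0f\n",
		percentile(runs, 0.50), percentile(runs, 0.05), percentile(runs, 0.95))
}
```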
Tools and versions
| Tool | Version | Invocation style |
|---|---|---|
| LoadTester | TBD | API-triggered run, single-worker config |
| k6 | TBD | k6 run script.js |
| JMeter | TBD | Non-GUI jmeter -n -t plan.jmx |
| Vegeta | TBD | vegeta attack -rate=N -duration=120s |
| hey | TBD | hey -z 120s -c 50 -q N |
| wrk | TBD | wrk -t8 -c500 -d120s |
| ApacheBench | TBD | ab -n N -c 50 |
Exact invocations for each tool are committed in the benchmark repo under scenarios/<tool>/ alongside any required config.
What we measure
From the load generator (per tool):
- Achieved RPS (mean across run)
- Latency: min, p50, p90, p95, p99, p99.9, max
- Error rate (non-2xx responses, connection errors, timeouts — counted separately; see the tally sketch after this list)
- Wall-clock duration
- Generator CPU and memory (sampled at 1 Hz via pidstat)
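The error-tally sketch referenced above, assuming the three failure classes are kept distinct rather than folded into one rate; the type and field names are illustrative:

```go
package main

import "fmt"

// ErrorTally keeps the three failure classes distinct: a timeout is not a
// connection error, and a 500 is not a transport failure.
type ErrorTally struct {
	Non2xx     uint64 // response received, status outside 200-299
	ConnErrors uint64 // dial/reset failures, no response at all
	Timeouts   uint64 // request exceeded the per-request deadline
}

// Rate is the combined error rate over the total requests attempted.
func (t ErrorTally) Rate(total uint64) float64 {
	if total == 0 {
		return 0
	}
	return float64(t.Non2xx+t.ConnErrors+t.Timeouts) / float64(total)
}

func main() {
	t := ErrorTally{Non2xx: 12, ConnErrors: 3, Timeouts: 1}
	fmt.Printf("error rate: %.6f\n", t.Rate(1_200_000))
}
```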
From the target (validation):
- Requests received per second (target's view)
- Bytes in/out
- Target CPU / memory (we discard runs where target was the bottleneck)
Cross-validation: we compare the generator's "achieved RPS" with the target's "received RPS." Discrepancies above 2% are flagged in the published results — they usually mean the generator is over- or under-counting.
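The check itself is trivial. A sketch, assuming the 2% threshold is measured relative to the target's count (the target has no reason to miscount):

```go
package main

import (
	"fmt"
	"math"
)

// flagged reports the relative discrepancy between the generator's achieved
// RPS and the target's received RPS, and whether it exceeds the 2% threshold.
func flagged(generatorRPS, targetRPS float64) (float64, bool) {
	disc := math.Abs(generatorRPS-targetRPS) / targetRPS
	return disc, disc > 0.02
}

func main() {
	if d, bad := flagged(10050, 9800); bad {
		fmt.Printf("discrepancy %.1f%%, flagged for review\n", d*100)
	}
}
```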
Latency is harder to cross-validate because tools measure it at different points in the request lifecycle, so we collect an independent server-side view (via tcp_info sampling) where possible. Published latency tables include a "measurement model" column so the comparison is honest.
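For reference, server-side TCP_INFO sampling looks roughly like this on Linux, using golang.org/x/sys/unix. This is a sketch of the mechanism, not the repo's sampler:

```go
package main

import (
	"fmt"
	"log"
	"net"

	"golang.org/x/sys/unix"
)

// sampleRTT reads the kernel's smoothed RTT (in µs) for an established TCP
// connection via the TCP_INFO socket option. Linux-only.
func sampleRTT(c *net.TCPConn) (uint32, error) {
	raw, err := c.SyscallConn()
	if err != nil {
		return 0, err
	}
	var rtt uint32
	var opErr error
	if err := raw.Control(func(fd uintptr) {
		info, e := unix.GetsockoptTCPInfo(int(fd), unix.IPPROTO_TCP, unix.TCP_INFO)
		if e != nil {
			opErr = e
			return
		}
		rtt = info.Rtt
	}); err != nil {
		return 0, err
	}
	return rtt, opErr
}

func main() {
	// Self-contained demo: sample a loopback connection from the accept side.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go func() { net.Dial("tcp", ln.Addr().String()) }()
	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	rtt, err := sampleRTT(conn.(*net.TCPConn))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("kernel smoothed RTT: %dµs\n", rtt)
}
```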
How results are published
For every published benchmark, we commit:
- Raw CSVs for every run (per-request timing where available, per-second buckets where not).
- The exact tool invocation, scenario script, and config files used.
- Generator and target syslog excerpts for the run window.
- Hetzner instance ID and snapshot label so the run is auditable.
- Discarded-run log with the reason for each discard (target-bound, network blip, etc.).
All of this lives in cloud-native/loadtester-benchmarks. Issues and pull requests are open. If you find a methodology flaw or believe we have under-tuned a specific tool, file an issue with the proposed change and a re-run will follow.
Conflict of interest disclosure
This benchmark is run by LoadTester, which is one of the tools being measured. That is the unavoidable bias. The mitigations are: (1) the methodology is fixed before any run; (2) all raw data is published; (3) the per-tool invocations are reviewable and tunable by the community; (4) when LoadTester does worse than another tool on a given metric, the result section says so without softening. We treat the benchmark as a credibility instrument, not a marketing instrument — a benchmark that can only ever favor the publisher is worse than no benchmark at all.
Results
Pending first run. When published, results will appear below as a dated subsection (e.g., 2026-06 — initial run) with the tables, charts, and raw-data links. We do not back-fill or quietly update old results; corrections are appended as new dated subsections with a note explaining what changed.
Methodology changelog
- 2026-05-06 — methodology published, no runs yet.