delphij's Chaos


27 Jan 2004

phk's benchmarking hints

It’s important to take a scientific approach when doing any benchmarks. Today, phk and Robert Watson have posted several hints to the -current list.


A number of people have started to benchmark things seriously now, and run into the usual problem of noisy data preventing any conclusions. Rather than repeat myself many times, I decided to send this email.

I experimented with micro-benchmarking some years back; here are some bullet points with a lot of the stuff I found out. You will not be able to use them all every single time, but the more you use, the better your ability to test small differences will be.

* Disable APM and any other kind of clock fiddling (ACPI ?)

* Run in single user mode. Cron(8) and other daemons
only add noise.

* If syslog events are generated, run syslogd with an empty
syslogd.conf, otherwise, do not run it.

* Minimize disk-I/O, avoid it entirely if you can.

* Don’t mount filesystems you do not need.

* Mount / and /usr and any other filesystem possible as read-only.
This removes atime updates to disk (etc.) from your I/O picture.
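As a sketch, read-only mounts might look like this in /etc/fstab; the device names and the scratch mount point are assumptions for illustration, not a prescription:

```
# /etc/fstab sketch for a benchmark box: everything read-only except
# the scratch filesystem under test. Device names are hypothetical.
/dev/ad0s1a   /        ufs   ro   1   1
/dev/ad0s1d   /usr     ufs   ro   2   2
/dev/ad1s1e   /bench   ufs   rw   2   2
```

With / and /usr read-only, atime updates and stray writes disappear from the I/O picture entirely.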

* Newfs your R/W test filesystem and populate it from a tar or
dump file before every run. Unmount and mount it before starting
the test. This results in a consistent filesystem layout. For
a worldstone test this would apply to /usr/obj (just newfs and
mount). If you want 100% reproducibility, populate your filesystem
from a dd(1) file (ie: dd if=myimage of=/dev/ad0s1h bs=1m)
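A sketch of that per-run reset cycle, with a dry-run wrapper so the commands are only printed rather than executed; the device, mount point, and tar file are assumptions for illustration:

```shell
#!/bin/sh
# Recreate the R/W test filesystem before each run for a consistent
# on-disk layout. DRY-RUN: "run" prints each command instead of
# executing it; remove the echo to do it for real.
DEV=/dev/ad1s1e    # hypothetical scratch device
MNT=/bench         # hypothetical mount point
run() { echo "+ $*"; }

run newfs -U "$DEV"                        # fresh filesystem every run
run mount "$DEV" "$MNT"
run tar -xpf /root/testdata.tar -C "$MNT"  # repopulate the test data
run umount "$MNT"                          # unmount/mount before testing
run mount "$DEV" "$MNT"
```

The unmount/mount pair at the end matches the hint above: it flushes caches so every run starts from the same state.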

* Use malloc backed or preloaded MD(4) partitions.
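A minimal sketch of setting up a malloc-backed md(4) scratch disk; the size, unit number, and mount point are assumptions, and the wrapper only prints the commands:

```shell
#!/bin/sh
# Malloc-backed memory disk: scratch space with no physical I/O jitter.
# DRY-RUN: commands are printed, not executed; md0 and 128m are
# hypothetical values.
run() { echo "+ $*"; }

run mdconfig -a -t malloc -s 128m   # attaches a new unit, e.g. md0
run newfs /dev/md0
run mount /dev/md0 /mnt
```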

* Reboot between individual iterations of your test, this gives
a more consistent state.

* Remove all non-essential device drivers from the kernel. For
instance, if you don’t need USB for the test, don’t put USB in
the kernel. Drivers which attach often have timeouts ticking.

* Unconfigure hardware you don’t use. Detach disks with atacontrol
and camcontrol if you do not use them for the test.

* Do not configure the network unless you are testing it (or after
your test to ship the results off to another computer.)

* Do not run NTPD.

* Put each filesystem on its own disk. This minimizes jitter from
head-seek optimizations.

* Minimize output to serial or VGA consoles. Running output into
files gives less jitter. (Serial consoles easily become a
bottleneck.) Do not touch the keyboard while the test is running;
even that shows up in your numbers.

* Make sure your test is long enough, but not too long. If your
test is too short, timestamping is a problem. If it is too
long, temperature changes and drift will affect the frequency of
the quartz crystals in your computer. Rule of thumb: more than
a minute, less than an hour.

* Try to keep the temperature as stable as possible around the
machine. This affects both quartz crystals and disk drive
algorithms. If you really want to get nasty, consider stabilized
clock injection. (Get an OCXO + PLL, inject its output into the clock
circuits instead of the motherboard xtal. Send me an email.)

* Run at least 3, but better >20, iterations of both the “before”
and “after” code. Try to interleave if possible (ie: do not run
20xbefore then 20xafter); this makes it possible to spot
environmental effects. Do not interleave 1:1, but 3:3; this makes
it possible to spot interaction effects.

My preferred pattern: bababa{bbbaaa}*. This gives a hint after
the first 1+1 runs (so you can stop it if it goes entirely the
wrong way), a stddev after the first 3+3 (a good indication of
whether it is going to be worth a long run), and trending and
interaction numbers later on.
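That run order can be generated mechanically; here is a small sh sketch, where “b” labels a before-code run and “a” an after-code run (the function name and labels are my own, not from the original mail):

```shell
#!/bin/sh
# Emit the first $1 run labels of the bababa{bbbaaa}* schedule,
# one per line: a 1:1 interleave first, then 3:3 blocks.
schedule() {
    n=$1; i=0
    emit() { if [ "$i" -lt "$n" ]; then echo "$1"; i=$((i+1)); fi; }
    for l in b a b a b a; do emit "$l"; done      # opening bababa
    while [ "$i" -lt "$n" ]; do
        for l in b b b a a a; do emit "$l"; done  # repeating bbbaaa
    done
}
schedule 12    # prints b a b a b a b b b a a a, one label per line
```

Feed each emitted label to whatever script launches the corresponding build, rebooting between iterations as suggested above.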

* Use /usr/src/tools/tools/ministat to see if your numbers are
significant. Consider buying “Cartoon Guide to Statistics”,
ISBN 0062731025, highly recommended, if you’ve forgotten or
never learned about stddev and Student’s T.
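Where ministat is not at hand, a quick sample mean and stddev can be had with awk; the timing numbers below are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample: wall-clock seconds from six "before" runs.
printf '61.2\n60.8\n61.0\n61.5\n60.9\n61.1\n' > before.txt

# Sample mean and standard deviation (n-1 denominator), a crude
# stand-in for ministat's summary line.
awk '{ s += $1; q += $1 * $1; n++ }
     END { m = s / n; sd = sqrt((q - n * m * m) / (n - 1));
           printf "n=%d mean=%.3f stddev=%.3f\n", n, m, sd }' before.txt
# prints: n=6 mean=61.083 stddev=0.248
```

ministat itself goes further and runs Student’s T for you, which is why the text recommends it over eyeballing means.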

Enjoy, and please share any other tricks you might develop!


Robert Watson replied:

On Mon, 26 Jan 2004, Poul-Henning Kamp wrote:

* Run in single user mode. Cron(8) and other daemons
only add noise.

Of particular interest here is sshd – either disable its SSHv1 key regeneration, or kill the parent sshd (or don’t run sshd) during the tests.

A few obvious ones just worth remembering:

(1) Don’t use bgfsck unless you want to benchmark with bgfsck. Also,
disable background_fsck in rc.conf unless you plan not to start the
benchmark for at least 60+fudge seconds after the boot, as rc wakes up
and checks to see if anything needs fscking if it is enabled.
Likewise, make sure you don’t have snapshots lying around unless you
mean to test with snapshots.

(2) If you must leave the system connected to a public network, watch out
for spikes of broadcast traffic. Even though you don’t notice it, it
will eat your CPU. Multicast has similar caveats.

(3) If you see unexpectedly bad performance, check for things like
high interrupt volume from an unexpected source. I’ve seen several
reports of ACPI “misbehaving” and generating excess interrupts under
some of the ACPI code drops. If you’re seeing odd results, take a few
snapshots of vmstat -i and look for anything unusual.

(4) Make sure to be careful about optimization parameters for kernel and
userspace, likewise debugging. It’s easy to let something slip
through and realize later you weren’t comparing the same thing.

(5) And finally, do not ever benchmark with WITNESS and INVARIANTS unless
you are really just interested in benchmarking those features. :-)
I’ve seen 400%+ drops in performance from running with WITNESS.
Likewise, userspace malloc parameters default differently in -CURRENT
from the way they’d ship in production.
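As a sketch, a trimmed benchmark kernel config might strip those debugging options explicitly; this assumes a config(8) that supports include/nooptions and uses GENERIC as the base, with a made-up ident:

```
# Hypothetical benchmark kernel config: GENERIC minus the debugging
# options that distort measurements.
include         GENERIC
ident           BENCH
nooptions       WITNESS
nooptions       INVARIANTS
nooptions       INVARIANT_SUPPORT
```

Double-check the resulting config against your source tree; the point is simply that WITNESS and INVARIANTS must not be present unless they are what you are benchmarking.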

Also, it would be very helpful if, while benchmarking, people would do the same benchmark with SCHED_ULE and SCHED_4BSD, and see what impact it has.
Scheduler changes strongly influence performance, and we want to make sure we catch the edge cases and glitches that broader testing will now expose.

Robert N M Watson FreeBSD Core Team, TrustedBSD Projects Senior Research Scientist, McAfee Research