While RRDtool performs very well for small- and medium-sized
installations, the RRD update mechanism requires a surprising
number of I/O syscalls for each operation. These syscalls, in
turn, result in cache-unfriendly (random-seeming) I/O access, which
defeats most OS and hardware caching algorithms.
The main focus of this page is to examine
how RRDtool performs at large scale and to propose some solutions.
This amounts to about 500,000 SNMP variables being monitored. We are using MRTG to poll the data and stuff it
into RRD files. Since MRTG puts "in" and "out" variables into the same RRD file, we have about 250,000
RRD files. We also only monitor ports that are currently active or have been active in the last 7 days, so
closer to 175,000 RRD files are active at any one time.
It turns out that MRTG, properly configured, is an amazingly robust SNMP poller. We run 6,000 "targets"
per MRTG process, and each MRTG process has 4 forks for SNMP polling. There are usually about 28-30
MRTG processes running (times 4 forks each, during polling). Our polling interval is 300 seconds (5 minutes).
Each MRTG process's SNMP polling cycle completes in about 30 seconds, or
sometimes a bit longer if a device is unreachable or slow to respond.
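As a rough consistency check on these figures, the arithmetic can be sketched in Python. The numbers are the ones quoted above; the variable names are only illustrative, and the sketch assumes that each MRTG target corresponds to one RRD file.

```python
# Figures quoted in the text; names are illustrative only.
snmp_variables = 500_000
variables_per_file = 2        # MRTG stores "in" and "out" in one RRD file
targets_per_process = 6_000
process_range = (28, 30)      # MRTG processes usually running
forks_per_process = 4

total_files = snmp_variables // variables_per_file

# Assumption: one MRTG target corresponds to one RRD file.
targets_low = targets_per_process * process_range[0]
targets_high = targets_per_process * process_range[1]

pollers_low = process_range[0] * forks_per_process
pollers_high = process_range[1] * forks_per_process

print(f"RRD files in total: {total_files}")
print(f"targets covered: {targets_low} to {targets_high}")
print(f"concurrent SNMP pollers: {pollers_low} to {pollers_high}")
```

Under that assumption, the 168,000 to 180,000 targets handled by the running processes bracket the roughly 175,000 active RRD files.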
Graph of mrtg polling performance, 2006-12-19
So, here's the problem: writing data into RRD files for one polling interval was taking about 350 seconds on average, up to 450
seconds once an hour when the 1 hour consolidation RRA was being written, and up to 500 seconds
every other hour when the 2 hour consolidation RRA was written. To keep up, the write cycle needs to complete in under 200
seconds.
Graph of mrtg write performance, 2006-12-19
The load average on the machine was about
20, the CPUs were all stuck in iowait, and the disks were all 100% busy.
We more or less tackled the problem starting with the easiest parts first, given that our time
was the scarcest resource. We upgraded the machine from RHEL 3 to RHEL 4 in order to try the different
I/O schedulers available in Linux kernel 2.6. As our workload was still so overwhelming, this didn't make
a noticeable difference. When we got the dedicated disk array (we were on shared disks before), that doubled
the amount of disk I/O we could do, and brought MRTG's write cycle times closer to our goals.
Graph of mrtg write performance, 2006-12-25
However, it was still not enough,
particularly when writing the consolidation RRAs.
Furthermore, our sysadmin team informed us that our one host was presenting more transactions per second
to the storage system than the rest of the datacenter combined. This clearly means that the I/O workload
we are giving the disks is unreasonable.
Sure, opening and closing 175,000 files every 5 minutes isn't free, but some testing showed that this
was not a significant bottleneck. Another concern was filesystem directory performance. We split
the RRD files up into multiple directories and also use ext3's
dir_index support, and some simple testing showed that this also
was not a bottleneck.
An strace of a running MRTG process showed the system hanging on reads:
> sudo strace -r -p 6355 -c
Password:
Process 6355 attached - interrupt to quit
Process 6355 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.68   15.891513        2027      7839           read
  1.42    0.236465          34      6870           write
  1.16    0.192803         197       980           stat64
  0.99    0.164568          10     15680           _llseek
  0.20    0.032697          33       980           munmap
  0.16    0.026266          27       980           open
  0.10    0.016665          17       980           close
  0.10    0.015863          16       980           mmap2
  0.08    0.012510          13       980           fcntl64
  0.06    0.010652          11       980           time
  0.05    0.008951           9       980           fstat64
------ ----------- ----------- --------- --------- ----------------
100.00   16.608953                 38229           total
At first glance, I found the number of reads surprising. I had always casually thought that rrdtool ...
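To put the strace summary in per-file terms, here is a small Python sketch that divides the call counts by the number of files touched. The counts are copied from the summary; treating the 980 open() calls as 980 updated files is an assumption, not something the trace states.

```python
# Call counts copied from the strace -c summary in the text.
calls = {
    "read": 7839, "write": 6870, "stat64": 980, "_llseek": 15680,
    "munmap": 980, "open": 980, "close": 980, "mmap2": 980,
    "fcntl64": 980, "time": 980, "fstat64": 980,
}
read_seconds = 15.891513

# Assumption: each update opens its RRD file exactly once, so the
# number of open() calls equals the number of files updated.
files_updated = calls["open"]

per_file = {name: count / files_updated for name, count in calls.items()}
total_calls = sum(calls.values())
read_ms_per_file = read_seconds / files_updated * 1000

for name in ("read", "write", "_llseek"):
    print(f"{name}: {per_file[name]:.1f} calls per file")
print(f"all syscalls: {total_calls / files_updated:.1f} per file")
print(f"time spent in read(): {read_ms_per_file:.1f} ms per file")
```

In this sample that works out to roughly 8 reads, 7 writes and 16 seeks per file, with about 16 ms per file spent waiting on reads.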
What is rrdtool doing?
For each file:
    Open.
    Read some of the headers.
    For each RRA to be updated (typically 2):
        If this is not the 1st RRA:
            Seek.
            Read.
        Seek.
        Write.
    Seek back to beginning of file.
    Read some of the headers.
    Seek.
    Write the "live" headers.
    Close.
open("file.rrd", O_RDWR) = 4
read(4, "RRD\0000001\0\0\0\0/%\300\307C+\37[\2\0\0\0\10\0\0\0,\1"..., 4096) = 4096
_llseek(4, 0, [4096], SEEK_CUR) = 0
_llseek(4, 4096, [4096], SEEK_SET) = 0
_llseek(4, 4096, [4096], SEEK_SET) = 0
_llseek(4, -1324, [2772], SEEK_CUR) = 0
write(4, "\2557Q$<\314\0@\303k\327.8\316\363?", 16) = 16
_llseek(4, 53248, [53248], SEEK_SET) = 0
read(4, "\0\0\370\377\0\0\0\0\0\0\370\377\0\0\0\0\0\0\370\377\0"..., 4096) = 4096
_llseek(4, -3372, [53972], SEEK_CUR) = 0
write(4, "\2557Q$<\314\0@\303k\327.8\316\363?", 16) = 16
_llseek(4, 0, [0], SEEK_SET) = 0
read(4, "RRD\0000001\0\0\0\0/%\300\307C+\37[\2\0\0\0\10\0\0\0,\1"..., 4096) = 4096
_llseek(4, -2880, [1216], SEEK_CUR) = 0
write(4, "t \370E832936\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1540) = 1540
close(4) = 0
If you are updating an RRA with a consolidating function, add in another seek and then more reading.
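The sequence in the trace can be replayed against a scratch file to count the calls involved in one update. This is only a sketch of the access pattern, not of RRDtool's code: the function name `replay_update` is hypothetical, and the offsets and sizes are simply the ones that appear in the trace.

```python
import os
import tempfile

PAGE = 4096  # read size seen in the strace output


def replay_update(path, row_offsets, header_offset=1216, header_len=1540):
    """Replay the read/seek/write pattern of one update and count the calls.

    row_offsets are the byte offsets of the row written in each RRA. The
    offsets and sizes are taken from the trace, not from the RRD format.
    """
    counts = {"read": 0, "seek": 0, "write": 0}

    def read(fd):
        counts["read"] += 1
        return os.read(fd, PAGE)

    def seek(fd, offset):
        counts["seek"] += 1
        os.lseek(fd, offset, os.SEEK_SET)

    def write(fd, data):
        counts["write"] += 1
        os.write(fd, data)

    fd = os.open(path, os.O_RDWR)
    try:
        read(fd)                                  # some of the headers
        for index, offset in enumerate(row_offsets):
            if index > 0:                         # not the 1st RRA
                seek(fd, offset - offset % PAGE)  # page holding the row
                read(fd)
            seek(fd, offset)
            write(fd, b"\0" * 16)                 # 16 bytes, as in the trace
        seek(fd, 0)                               # back to the beginning
        read(fd)                                  # headers again
        seek(fd, header_offset)
        write(fd, b"\0" * header_len)             # the "live" headers
    finally:
        os.close(fd)
    return counts


# Scratch file of zeros, large enough for the offsets seen in the trace.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"\0" * (53248 + PAGE))
counts = replay_update(scratch.name, [2772, 53972])
os.unlink(scratch.name)
print(counts)
```

For two RRAs the sketch issues three reads, five seeks and three writes. The real trace shows the same reads and writes plus two extra seeks that do not move the file position.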
Now do this 175,000 times every 5 minutes, and guess what this looks like to the system. From the
strace, the writes are going into cache on the host and/or the storage system. The seek and
read patterns from this many files probably look like random I/O to the host and storage systems.
Any intelligent reordering or read-ahead to reduce the amount of seeking would mostly be luck.
With the RRD file headers being constantly updated (and those pages marked as dirty), the system
has a hard time caching this data (and doesn't know that it could hold on to it forever).
So, the easiest hack for now would be to hold onto 1 hour's worth of data, and write it all down at once.
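A minimal sketch of that idea in Python, assuming nothing about how RRDCache is actually written: updates are held in memory per file, and once the oldest one is an hour old they are handed over one file at a time. The class `UpdateBuffer` and the `write_batch` callback are hypothetical names. In a real setup the callback would pass all of a file's values to a single update call, which works because rrdtool update accepts several timestamped values at once.

```python
from collections import defaultdict


class UpdateBuffer:
    """Hold updates in memory and hand them over one file at a time."""

    def __init__(self, max_age=3600):
        self.max_age = max_age            # seconds of data to hold (one hour)
        self.pending = defaultdict(list)  # path -> [(timestamp, values), ...]
        self.oldest = None                # timestamp of the oldest held update

    def add(self, path, timestamp, values):
        self.pending[path].append((timestamp, values))
        if self.oldest is None or timestamp < self.oldest:
            self.oldest = timestamp

    def due(self, now):
        """True once the oldest held update is at least max_age old."""
        return self.oldest is not None and now - self.oldest >= self.max_age

    def flush(self, write_batch):
        """Call write_batch(path, updates) once per file, then forget the data."""
        for path, updates in sorted(self.pending.items()):
            write_batch(path, sorted(updates))
        self.pending.clear()
        self.oldest = None


# Twelve 5-minute samples for two files, written out in two batches.
written = []
buffer = UpdateBuffer()
for step in range(12):
    buffer.add("a.rrd", step * 300, (step, step))
    buffer.add("b.rrd", step * 300, (step, step))
if buffer.due(now=3600):
    buffer.flush(lambda path, updates: written.append((path, len(updates))))
print(written)
```

Each file is then opened once per hour instead of twelve times, and its header pages are read and rewritten once per batch rather than once per sample.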
RRDCache_d has two parts: a client library that you load in place of RRDs, and a daemon which runs
in the background to cache data and feed the threads that write data out to disk. In your client application,
such as MRTG, put require RRDCache in place of require RRDs, and force all RRD update calls to
use RRDCache instead of RRDs. Furthermore, make sure that your poller does not do any threshold
checking or make any other calls to RRDs::last or RRDs::fetch.
RRDCache_d is available here and requires this patch for RRDtool 1.2.19 for threaded perl bindings,
called RRDts, which provides the RRDts::update_r function.
RRDCache_d showed that on our system with 6 threads we could write one hour's worth of data in about
10-15 minutes, which allowed each MRTG polling and write cycle to complete on time. However, it exposed
some problems related to threading. There is a deadlock which locks up the entire perl interpreter, and the
suspicion is that perl's threading implementation is not all that robust. Of course, I won't rule out a
programming error on my part.
This system has proven to work remarkably well, and I should have started with this simpler approach initially.
The log files could be written to a local disk, but you probably don't want to put these logs on the same
spindles as the rest of your RRD files or your polling may suffer during the commit cycle. We chose to write
the log files to a ram disk for extra speed, and because we are very interested in using the data for checking
thresholds while it is in RAM, as opposed to writing it out to disk and reading it back in. Keeping the data
on the ramdisk also makes flushing out specific RRDs cheaper. Of course, you have to be comfortable
with losing data.
Download rrdcache which contains the RRDCache.pm library to be used in place of RRDs, and
RRDCachewriter.pl which commits data from the logs to disk.
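The split between a poller that only appends to a log and a writer that commits the log to disk can be sketched as follows. This is not the RRDCache code: the function names, the one-line-per-update log format and the `update_rrd` callback are all assumptions made for illustration, and a real writer would need log rotation or locking so that it does not race with the poller.

```python
import os
import tempfile
from collections import defaultdict


def log_update(log_path, rrd_path, timestamp, values):
    """Poller side: append one line per update instead of touching the RRD file."""
    update = f"{timestamp}:{':'.join(str(v) for v in values)}"
    with open(log_path, "a") as log:
        log.write(f"{rrd_path} {update}\n")


def read_log(log_path):
    """Writer side: group the logged updates by RRD file, oldest first."""
    grouped = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            rrd_path, update = line.rstrip("\n").rsplit(" ", 1)
            grouped[rrd_path].append(update)
    for updates in grouped.values():
        updates.sort(key=lambda u: int(u.split(":", 1)[0]))
    return grouped


def commit(log_path, update_rrd, only=None):
    """Apply logged updates via update_rrd(path, updates); keep the rest logged.

    With only=None everything is committed; passing a path flushes that file alone.
    Rewriting the log like this is not safe while a poller is appending to it.
    """
    remaining = []
    for rrd_path, updates in read_log(log_path).items():
        if only is None or rrd_path == only:
            update_rrd(rrd_path, updates)
        else:
            remaining.extend(f"{rrd_path} {u}\n" for u in updates)
    with open(log_path, "w") as log:
        log.writelines(remaining)


log_path = os.path.join(tempfile.mkdtemp(), "updates.log")
for ts in (300, 600):
    log_update(log_path, "a.rrd", ts, (1, 2))
    log_update(log_path, "b.rrd", ts, (3, 4))

committed = {}
commit(log_path, committed.__setitem__, only="a.rrd")  # flush one file early
after_first = dict(committed)
commit(log_path, committed.__setitem__)                # regular commit cycle
print(committed)
```

The poller's cost per update drops to a single append, and all the seeking and reading is deferred to the writer, which touches each RRD file once per commit.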
Now MRTG completes the write cycle in about 6 seconds on average. We are running RRDCachewriter
with 8 forks, and the average write cycle for 1 hour of data is under 9 minutes. With the load on the host
reduced, our MRTG system is completing the entire 5 minute poll and write cycle for 350,000 SNMP
variables in about 20 seconds total, and graphing the data is nearly instantaneous. Since data on disk can be more
than an hour old, RRDCache implements a "flush" for any RRD file called through RRDCache's fetch or
graph command.
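For a sense of scale, the update rates implied by these figures can be estimated as below. The inputs are the numbers quoted in the text; the sketch assumes that every active file receives one update per polling interval and that the earlier 350 second write cycle covered the same number of files, neither of which the text states outright.

```python
# Figures quoted in the text; the derived rates are approximate.
snmp_variables = 350_000
active_files = snmp_variables // 2     # "in" and "out" share one RRD file
poll_interval = 300                    # seconds
intervals_per_hour = 3600 // poll_interval

# Before: one write cycle of about 350 seconds per polling interval.
direct_rate = active_files / 350

# After: one hour of logged data committed in under 9 minutes,
# so this rate is a lower bound.
batched_updates = active_files * intervals_per_hour
batched_rate = batched_updates / (9 * 60)

print(f"direct writes: about {direct_rate:.0f} updates per second")
print(f"batched commit: at least {batched_rate:.0f} updates per second")
print(f"ratio: about {batched_rate / direct_rate:.1f}x")
```

Under those assumptions the batched writer commits on the order of several thousand updates per second, compared with roughly 500 per second when each sample was written directly.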
The most robust caching implementation is JRobin, which is a Java rewrite of RRDtool. While it
natively supports threading, queueing of data, and other such Java-ness, its file format is not
compatible with RRDtool. Although RRDtool's file format "suffers" from not being machine
independent, JRobin's incompatibility prevents it from being used with the multitudes of tools and
frontends that work with and visualize data from RRDtool. Outside of OpenNMS, JRobin is too
hard to integrate into any site currently using RRDtool.
Another unique project is RTG. It has the most robust poller
available, and an interesting backend. Instead
of using RRD files or even just appending to log files like RRDCache, it stuffs its time-series data into an
SQL database. There is no free lunch, however, and getting time-series data out of SQL is expensive. In
addition, the burden of any data consolidation or other database maintenance is placed on the admin.
Development on the project has stagnated, and the archived mailing lists are now unavailable, which is
disappointing.
RRD Accelerator as part of RRDTool 1.3
A queueing strategy for the RRD Interface from OpenNMS
JRobin NIO backend, JRobin is now part of OpenNMS
Notes from the rrd-developers mailing list:
[rrd-developers] Re: status of threaded RRDTool and Perl
[rrd-developers] How to get the most performance when using lots of RRD files
[rrd-developers] Re: How to get the most performance when using lots of RRD files
[rrd-developers] Announce RRDtool Performance Tester