RRDtool Scalability

1) What is meant by RRDtool Scalability?

    RRDtool is more or less the de facto way to store time-series data in the information technology
world. Many sites are now monitoring more and more systems while also increasing how much data they
collect per system. The ability to gather and store this much data is a testament to rrdtool's ease of use.

    While RRDtool performs very well for small- and medium-sized installations, the RRD update mechanism
requires a surprising number of I/O syscalls for each operation. These syscalls, in turn, result in
cache-unfriendly (random-seeming) I/O access, which defeats most OS and hardware caching algorithms.
The main focus of this page is to examine how RRDtool performs at large scale and to propose some solutions.

2) The Problem

    At the University of Wisconsin at Madison, the central IT department is responsible for the majority
of the campus network infrastructure, which consists of approximately 30 routers, 2,000 switches and
1,500 wireless access points. For each interface we use SNMP to monitor byte counters, unicast,
broadcast, and multicast packet counters, and interface errors. There are other SNMP variables
monitored, but the majority of our monitoring is looking at these variables on the edge of the network.

This amounts to about 500,000 snmp variables being monitored. We are using MRTG to poll the data and stuff it
into rrd files. Since MRTG puts "in" and "out" variables into the same rrd file, we have about 250,000
RRD files. We also only monitor ports that are currently active or have been active in the last 7 days, so
it is closer to 175,000 RRD files being actively updated at any one time.

It turns out that MRTG, properly configured, is an amazingly robust snmp poller. We run 6,000 "targets"
per MRTG process, and each MRTG process has 4 forks for snmp polling. There are usually about 28-30
MRTG processes running (times 4 forks each, during polling). Our polling interval is 300 seconds (5 minutes).
Each MRTG's snmp polling cycle completes in about 30 seconds, or sometimes a bit longer if a device is
unreachable or slow to respond.

[Graph: MRTG polling performance, 2006-12-19]

So, here's the problem: writing data into RRD files for one polling interval was taking about 350 seconds on
average, up to 450 seconds once an hour when the 1 hour consolidation RRA was being written, and up to 500
seconds every other hour when the 2 hour consolidation RRA was also written. To keep up, the write cycle
needs to complete in under 200 seconds. The load average on the machine was about 20, the CPUs were all
stuck in iowait, and the disks were all 100% busy.

[Graph: MRTG write performance, 2006-12-19]

3) I/O performance problem characteristics

    The system is an IBM xSeries 445 with (8) 2.7 GHz Xeon Hyperthreaded CPUs and 16 GB of
RAM. Storage is attached with a QLogic QLA2342 dual port 2 Gbps fibre channel HBA to two parallel Brocade SANs
running to an EMC DMX-3 disk array. On the DMX-3, the host has its own front-end adapters (FAs) and
a dedicated array of (30) 15K RPM disks set up as RAID 10.

We tackled the problem more or less starting with the easiest parts first, given that our time was the
scarcest resource. We upgraded the machine from RHEL 3 to RHEL 4 in order to try the different I/O
schedulers available in linux kernel 2.6. As our workload was still so overwhelming, this didn't make a
noticeable difference. When we got the dedicated disk array (we were on shared disks before), that doubled
the amount of disk I/O we could do and brought MRTG's write cycle times closer to our goals. However,
it was still not enough, particularly when writing the consolidation RRAs.

[Graph: MRTG write performance, 2006-12-25]

Furthermore, our sysadmin team informed us that our one host is presenting more transactions per second
to the storage system than the rest of the datacenter combined. This clearly means that the I/O workload
we are giving the disks is unreasonable.

Sure, opening and closing 175,000 files every 5 minutes isn't free, but some testing showed that this
was not a significant bottleneck. Another concern was filesystem directory performance. We split the rrd
files across multiple directories and use ext3's dir_index support, and some simple testing showed that
this was not a bottleneck either.
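
For illustration, here is a minimal sketch of one way to spread rrd files across subdirectories; the
two-character hash prefix and the paths are made up, not our exact layout:

#!/usr/bin/perl
# Sketch: map a target name to a subdirectory so that no single
# directory has to hold all 250,000 rrd files.  The md5-prefix bucket
# scheme below is only an example of such a split.
use strict;
use warnings;
use File::Path qw(mkpath);
use Digest::MD5 qw(md5_hex);

sub rrd_path {
    my ($workdir, $target) = @_;
    my $bucket = substr(md5_hex($target), 0, 2);    # 256 possible subdirectories
    mkpath("$workdir/$bucket") unless -d "$workdir/$bucket";
    return "$workdir/$bucket/$target.rrd";
}

print rrd_path('/data/rrd', 'switch42_gi0-1'), "\n";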

An strace of a running MRTG process showed it spending nearly all of its time in reads:

> sudo strace -r -p 6355 -c
Password:
Process 6355 attached - interrupt to quit
Process 6355 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.68   15.891513        2027      7839           read
  1.42    0.236465          34      6870           write
  1.16    0.192803         197       980           stat64
  0.99    0.164568          10     15680           _llseek
  0.20    0.032697          33       980           munmap
  0.16    0.026266          27       980           open
  0.10    0.016665          17       980           close
  0.10    0.015863          16       980           mmap2
  0.08    0.012510          13       980           fcntl64
  0.06    0.010652          11       980           time
  0.05    0.008951           9       980           fstat64
------ ----------- ----------- --------- --------- ----------------
100.00   16.608953                 38229           total

At first glance, I found the number of reads surprising. I had always casually thought that rrdtool
should be writing, not reading. (The failed stat64's were from mrtg looking for .log files to turn
into .rrd files, which I ended up commenting out of our version of mrtg.)

What is rrdtool doing?

For each file:
  Open.
  Read some of the headers.
  For each RRA to be updated (typically 2):
    If this is not the 1st RRA:
      Seek.
      Read.
    Seek.
    Write.
  Seek back to beginning of file.
  Read some of the headers.
  Seek.
  Write the "live" headers.
  Close.

Here is what that sequence looks like in an strace of a single rrd update:

open("file.rrd", O_RDWR) = 4
read(4, "RRD\0000001\0\0\0\0/%\300\307C+\37[\2\0\0\0\10\0\0\0,\1"..., 4096) = 4096
_llseek(4, 0, [4096], SEEK_CUR)         = 0
_llseek(4, 4096, [4096], SEEK_SET)      = 0
_llseek(4, 4096, [4096], SEEK_SET)      = 0
_llseek(4, -1324, [2772], SEEK_CUR)     = 0
write(4, "\2557Q$<\314\0@\303k\327.8\316\363?", 16) = 16
_llseek(4, 53248, [53248], SEEK_SET)    = 0
read(4, "\0\0\370\377\0\0\0\0\0\0\370\377\0\0\0\0\0\0\370\377\0"..., 4096) = 4096
_llseek(4, -3372, [53972], SEEK_CUR)    = 0
write(4, "\2557Q$<\314\0@\303k\327.8\316\363?", 16) = 16
_llseek(4, 0, [0], SEEK_SET)            = 0
read(4, "RRD\0000001\0\0\0\0/%\300\307C+\37[\2\0\0\0\10\0\0\0,\1"..., 4096) = 4096
_llseek(4, -2880, [1216], SEEK_CUR)     = 0
write(4, "t \370E832936\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1540) = 1540
close(4)                                = 0

If you are updating an RRA with a consolidating function, add in another seek and then more reading
of those data points before updating that particular RRA.

Now do this 175,000 times every 5 minutes, and guess what this looks like to the system. From the
strace, the writes are going into cache on the host and/or the storage system. The seek and
read patterns from this many files probably look like random I/O to the host and storage systems.
Any intelligent reordering or read-ahead to reduce the amount of seeking would mostly be luck.
With the rrd file headers being constantly updated (and those pages marked as dirty), the system
has a hard time caching this data (and doesn't know that it could hold on to it forever).

4) What can be done to make RRDtool scale?

    Much of the overhead is dealing with the rrd headers, updating fields like last update time, etc. This
has to be done on every rrd update. If rrdtool were more like a "real database", this header information
would be updated and cached in the memory of a daemon. Writes could also be journaled and committed
to disk only as needed. Others have thought about this too; see the References section below.

So, the easiest hack for now would be to hold onto 1 hour's worth of data and write it all out at once.
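
As a concrete illustration of what batching buys, here is a minimal sketch (file names and values are
made up) that commits a queue of datapoints with one RRDs::update call per file; rrdtool accepts
multiple timestamp:value arguments in a single update:

#!/usr/bin/perl
# Sketch: commit an hour's worth of queued datapoints with a single
# update call per rrd file, instead of one open/seek/write/close cycle
# every 5 minutes.  %pending would be filled by the poller.
use strict;
use warnings;
use RRDs;

my %pending = (
    '/data/rrd/example-interface.rrd' => [
        '1166500800:1234567', '1166501100:1250000', '1166501400:1263500',
    ],
);

for my $rrd (keys %pending) {
    RRDs::update($rrd, @{ $pending{$rrd} });
    my $err = RRDs::error;
    warn "problem updating $rrd: $err\n" if $err;
}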

Attempt #1: RRDCache_d

Since we are using MRTG as our poller, it would be very easy to modify it to talk to a daemon in place of
calling RRDs::update. So, that is the first approach I took. Each MRTG process would make RPC calls to
a daemon, which would update its internal data structures. In addition, this daemon would have 'n' disk
writer threads that would actually call RRDtool's thread-safe update_r function. To do this, we need to
bind update_r into perl. Luckily, this has been partly worked on (see "Re: status of threaded RRDTool
and Perl" in References below).
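
A rough sketch of the daemon's writer-thread side, assuming the RRDts::update_r binding mentioned
below; the queue format and thread handling are simplified, and the RPC front end that feeds the
queue is omitted:

#!/usr/bin/perl
# Sketch: 'n' disk writer threads draining a shared queue of update
# jobs and committing them with the thread-safe update_r binding.
# Each queued job is a single string: "<rrd file> <timestamp:value> ...".
use strict;
use warnings;
use threads;
use Thread::Queue;
use RRDts;    # threaded perl binding from the RRDts patch

my $nthreads = 6;
my $queue    = Thread::Queue->new();
my @writers  = map { threads->create(\&writer) } 1 .. $nthreads;

sub writer {
    while (defined(my $job = $queue->dequeue())) {
        my ($rrd, @datapoints) = split ' ', $job;
        RRDts::update_r($rrd, @datapoints);
    }
}

# The RPC side of the daemon would enqueue work like this:
$queue->enqueue('/data/rrd/example.rrd 1166500800:42 1166501100:43');

# At shutdown, wake each writer with an undef job and wait for it.
$queue->enqueue(undef) for 1 .. $nthreads;
$_->join() for @writers;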

RRDCache_d has two parts: a client library that you load in place of RRDs, and a daemon which runs
in the background to cache data and feed the threads that write the data out to disk. In your client application,
such as MRTG, put "require RRDCache" in place of "require RRDs", and force all RRD update calls to
use RRDCache instead of RRDs. Furthermore, make sure that your poller does not do any threshold
checking or otherwise call RRDs::last or RRDs::fetch.
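
In the poller, the swap itself is tiny; something like the following, assuming RRDCache::update takes
the same arguments as RRDs::update:

#!/usr/bin/perl
# Sketch of the client-side swap.  RRDCache::update is assumed to
# mirror RRDs::update and hand the datapoint to the caching daemon.
use strict;
use warnings;

my $rrdfile = '/data/rrd/example-interface.rrd';
my ($in, $out) = (1234567, 7654321);    # made-up counter values

# Before:  require RRDs;      RRDs::update($rrdfile, "N:$in:$out");
# After:
require RRDCache;
RRDCache::update($rrdfile, "N:$in:$out");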

RRDCache_d is available here and requires this patch for RRDtool 1.2.19 for threaded perl bindings, called
RRDts, which provides the RRDts::update_r function.

RRDCache_d showed that on our system with 6 threads we could write one hour's worth of data in about
10-15 minutes, which allowed each MRTG polling and write cycle to complete on time. However, it exposed
some problems related to threading. There is a deadlock which locks up the entire perl interpreter, and the
suspicion is that perl's threading implementation is not all that robust. Of course, I won't rule out a
programming error on my part.

Attempt #2: RRDCache

The second try was to create a simpler system. A client application would still use RRDCache in place of
RRDs, but there would be no daemon, and none of the RPC needed to talk to it. Instead, each client process
would simply append rrd updates to its own log file. Then RRDCachewriter, currently run by cron,
reads in these log files, fires up 'n' forks, and commits the data to the rrd files by calling RRDs::update with
12 datapoints at a time.
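
A stripped-down sketch of the commit side; the log location and its format (one "<rrd file>
<timestamp:value...>" line per update) are assumptions here, and the real RRDCachewriter also handles
forking and locking:

#!/usr/bin/perl
# Sketch: read a poller's update log and commit it, batching a dozen
# datapoints into each RRDs::update call.
use strict;
use warnings;
use RRDs;

my %pending;    # rrd file => list of queued "timestamp:value" strings

open(my $log, '<', '/dev/shm/rrdcache/poller1.log') or die "open log: $!";
while (my $line = <$log>) {
    chomp $line;
    my ($rrd, $datapoint) = split ' ', $line, 2;
    push @{ $pending{$rrd} }, $datapoint;
}
close $log;

for my $rrd (sort keys %pending) {
    my @queue = @{ $pending{$rrd} };
    while (my @batch = splice(@queue, 0, 12)) {    # 12 datapoints per update call
        RRDs::update($rrd, @batch);
        my $err = RRDs::error;
        warn "update of $rrd failed: $err\n" if $err;
    }
}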

This system has proven to work remarkably well, and I should have started with this simpler approach in the
first place. The log files could be written to a local disk, but you probably don't want to put these logs on
the same spindles as the rest of your rrd files, or your polling may suffer during the commit cycle. We chose
to write the log files to a ram disk for extra speed, and because we are very interested in using the data for
checking thresholds while it is in RAM, as opposed to writing it out to disk and reading it back in. Keeping
the data on the ramdisk also makes flushing out specific rrd's cheaper. Of course, you have to be comfortable
with losing the uncommitted data if the machine goes down.

Download rrdcache, which contains the RRDCache.pm library to be used in place of RRDs, and
RRDCachewriter.pl, which commits data from the logs to disk.

Now MRTG completes the write cycle in about 6 seconds on average. We are running RRDCachewriter
with 8 forks, and the average write cycle for 1 hour of data is under 9 minutes. With the load on the host
reduced, our MRTG system is completing the entire 5 minute poll and write cycle for 350,000 snmp
variables in about 20 seconds total, and graphing the data is nearly instantaneous. Since the data on disk
can be up to an hour behind, RRDCache implements a "flush" for any RRD file accessed through RRDCache's
fetch or graph commands.
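
The flush-before-read idea, in sketch form; the function names and in-memory cache layout here are
illustrative, not the actual RRDCache internals:

#!/usr/bin/perl
# Sketch: before reading an rrd file, commit any datapoints for it that
# are still sitting in the cache, so fetch/graph see current data.
use strict;
use warnings;
use RRDs;

my %cache;    # rrd file => list of uncommitted "timestamp:value" strings

sub flush_rrd {
    my ($rrd) = @_;
    my $queued = delete $cache{$rrd} or return;
    RRDs::update($rrd, @$queued);
    my $err = RRDs::error;
    warn "flush of $rrd failed: $err\n" if $err;
}

sub fetch {
    my ($rrd, @args) = @_;
    flush_rrd($rrd);                  # make sure the file is current first
    return RRDs::fetch($rrd, @args);  # then defer to the normal RRDs call
}

# Example: fetch the AVERAGE RRA for the last hour.
my @result = fetch('/data/rrd/example-interface.rrd',
                   'AVERAGE', '--start', 'end-1h');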

What Others are doing

Some other folks on the rrd-developers mailing list have implemented caching systems for their sites,
(see References, below).

The most robust caching implementation is JRobin, which is a Java rewrite of RRDtool. While it
natively supports threading, queueing of data, and other such Java-ness, its file format is not
compatible with RRDtool. Although RRDtool's file format "suffers" from not being machine
independent, JRobin's incompatibility prevents it from being used with the multitude of tools and
frontends that work with and visualize data from RRDtool. Outside of OpenNMS, JRobin is too
hard to integrate into any site currently using RRDtool.

Another unique project is RTG. It has the most robust poller available, and an interesting backend. Instead
of using RRD files or even just appending to log files like RRDCache, it stuffs its time-series data into an
SQL database. There is no free lunch, however, and getting time-series data out of SQL is expensive. In
addition, the burden of any data consolidation or other database maintenance is placed on the admin.
Development on the project has stagnated, plus the archived mailing lists are now unavailable, which is
disappointing.

What I think should be done






5) References:

RRD Accelerator as part of RRDTool 1.3
A queueing strategy for the RRD Interface from OpenNMS
JRobin NIO backend, JRobin is now part of OpenNMS

Notes from the rrd-developers mailing list:
[rrd-developers] Re: status of threaded RRDTool and Perl
[rrd-developers] How to get the most performance when using lots of RRD files
[rrd-developers] Re: How to get the most performance when using lots of RRD files
[rrd-developers] Re: How to get the most performance when using lots of RRD files
[rrd-developers] Re: How to get the most performance when using lots of RRD files
[rrd-developers] Re: How to get the most performance when using lots of RRD files
[rrd-developers] Re: How to get the most performance when using lots of RRD files
[rrd-developers] Re: How to get the most performance when using lots of RRD files
[rrd-developers] Announce RRDtool Performance Tester

6) Thanks:

Dave Plonka, for convincing me that there was no inherent reason why this much data collection shouldn't work.
Bob Plankers, Kevin Kettner, and the rest of the UW DoIT sys admin and enterprise storage teams
Philippe Simonet of Swisscom, for submitting some example code to cache data (that I looked at, but it was too site-specific to use)
Ole Bjørn Hessen of Telenor Networks, for getting threads in perl to work with rrdtool
Tobias Oetiker, for RRDtool of course.