An Analysis of Napster and Other IP Flow Sizes


Dave Plonka - University of Wisconsin - Madison

Table of Contents

An Analysis of Napster and Other IP Flow Sizes
    Introduction
    Our Flow Definition
    A Look at Average Flow Sizes
            Figure 1. Average Flow Sizes, Fall 1999 through Fall 2000
    About the Flow Samples
            Table 1.
            Figure 2. Daily Average Campus IP Traffic Rate, Fall 1999 through Fall 2000
    About the Flow Size Distribution Graphs
    Napster Flows
    IP Flow Size Distributions
            Figure 3. Cumulative % Distribution of Flow Sizes
            Figure 4. Content % Distribution of Flow Sizes
            Figure 5. Fall 1999, Spring 2000, Fall 2000: Content % vs. Flow Size
    Potential Problems
    Summary
    Future Directions
    Acknowledgements
    Analysis Tools
    References

Introduction

This is a report on an informal investigation of IP flow sizes that was initiated in response to questions posed in July, 2000, paraphrased as follows:
  1. Is there a peak in the distribution of flow sizes which approximates the size of a typical MP3 file?
  2. Do flows of Napster traffic exhibit a characteristic signature in terms of their sizes?
  3. While the use of "sharing" applications such as Napster is increasing the bandwidth used, is it also changing the typical size of IP flows?
We also wondered if flow size analysis might provide a useful indication of general trends in Internet workload.

Our Flow Definition

The way in which IP flows are defined varies amongst investigators and vendors. This work utilized Cisco NetFlow export data collected by CAIDA's cflowd and post-processed by FlowScan [cflowd, FlowScan]. Therefore our IP flow is defined by the NetFlow V5 flow export PDU [NetFlow]. In short, a flow is a unidirectional series of IP packets of a given protocol, from a source address and port to a destination address and port, within a certain duration. The tunable NetFlow "timeout active" value was set to one minute. This means that active flows were expired and exported in as little as one minute after being instantiated.
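As a rough illustration of this definition, the following Python sketch groups packets into unidirectional flows and expires a flow once it has been active for the timeout. The Packet record, the export_flows function, and the choice of key fields are hypothetical simplifications made for this example; the flow cache and export logic in an actual NetFlow implementation are more involved.

```python
from collections import namedtuple

# Hypothetical packet record: timestamp (seconds), addresses, protocol,
# ports, and IP length in bytes.
Packet = namedtuple("Packet", "ts src dst proto sport dport length")

ACTIVE_TIMEOUT = 60  # seconds, matching the one-minute "timeout active" value


def export_flows(packets, active_timeout=ACTIVE_TIMEOUT):
    """Aggregate time-ordered packets into unidirectional flow records."""
    cache = {}     # flow key -> [first_ts, packet_count, byte_count]
    exported = []  # finished flow records: (key, packets, bytes)
    for p in packets:
        # The key is directional: the reverse path is a separate flow.
        key = (p.src, p.dst, p.proto, p.sport, p.dport)
        entry = cache.get(key)
        if entry is not None and p.ts - entry[0] >= active_timeout:
            # The flow has been active for the full timeout: expire it.
            exported.append((key, entry[1], entry[2]))
            entry = None
        if entry is None:
            entry = cache[key] = [p.ts, 0, 0]
        entry[1] += 1
        entry[2] += p.length
    # Flush whatever is still cached at the end of the trace.
    exported.extend((k, e[1], e[2]) for k, e in cache.items())
    return exported
```

In this sketch, one stream of packets arriving every 10 seconds for 150 seconds produces three flow records rather than one, which is the behavior that matters for the size distributions discussed later.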

A Look at Average Flow Sizes

During the 1999 and 2000 academic years, the average IP flow size appears to have increased slightly. This is evidenced by time-series plots of average flow size such as figure 1. Furthermore, the average size of Napster flows is significantly larger than that of IP flows in general. Note that because measurement of Napster traffic at the University of Wisconsin - Madison began in March of 2000, no earlier Napster flow data is shown in figure 1.

Although Napster flows are larger than average IP flows, other popular well-known applications such as ftp-data transfers have average flow sizes that far exceed those of Napster. While not included here, similar plots show the average ftp-data flow usually contains more than 200 kilobytes.

Figure 1. Average Flow Sizes, Fall 1999 through Fall 2000

plot of average flow sizes, Fall 1999 through Fall 2000

About the Flow Samples

This analysis uses raw flow data collected at the border of the campus network at the University of Wisconsin - Madison. This data was collected in three 24 hour periods, each sample roughly representing the traffic during one day of the Fall 1999, Spring 2000, and Fall 2000 semesters of the academic calendar. Table 1 summarizes the average rate of inbound and outbound IP traffic for the campus as a whole during each 24 hour sample. The rate of Napster traffic is unknown for the Fall 1999 sample because it predates the implementation of Napster flow identification in FlowScan.

Table 1.

                              Total                            Napster
Semester     Sample "Day"     Flows       Inbound  Outbound    Flows      Inbound  Outbound
Fall 1999    September 15-16  56,625,942  26 Mb/s  45 Mb/s     unknown    unknown  unknown
Spring 2000  May 12-13        75,315,768  45 Mb/s  73 Mb/s     7,509,938  7 Mb/s   21 Mb/s
Fall 2000    November 16      98,366,891  60 Mb/s  110 Mb/s    unknown    13 Mb/s  31 Mb/s

There was no rhyme or reason to our selection of these particular sample "days". The first two samples were simply the only contiguous 24 hour periods for which data was available since we did not systematically retain detailed logs of campus traffic from so long ago. We retrieved these samples from the backup tapes of infrequent manual backups of our analysis machine. These backups were performed only for disaster recovery. As such, it was just by chance that they might coincide with interesting points in time regarding the use of "file sharing" applications. For this investigation we assume that traffic during these days is somewhat representative of traffic during each semester as a whole.

Figure 2 is a time-series graph of daily average outbound and inbound IP traffic rates throughout the entire range in which the samples were taken. This figure is included to illustrate the overall growth trend of the campus traffic throughout the time in which the samples were collected. Note that the sample days are marked as red, green, and blue vertical rules in the graph. Throughout the figures herein, the colors red, green, and blue are used for the three samples: Fall 1999, Spring 2000, and Fall 2000, respectively.

Figure 2. Daily Average Campus IP Traffic Rate, Fall 1999 through Fall 2000

plot of daily average IP traffic rate, Fall 1999 through Fall 2000

Consider figure 2. At the time of the Fall 1999 sample, the amount of Napster traffic is not known; however, it is generally believed to be negligible by comparison to the later samples. At the time of the Spring 2000 sample, Napster traffic represented a significant portion of the campus traffic as a whole. Specifically, Napster is thought to have represented 29% of our campus outbound traffic, and 15% of the inbound traffic. Also in the Spring 2000 sample, Napster flows represented 13% of the outbound flows and 7% of the inbound flows. By the time of the Fall 2000 sample, Napster usage was responsible for even more inbound and outbound traffic.
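The traffic percentages quoted here can be checked against the rates in Table 1. The short Python sketch below does that arithmetic; the napster_share function and the dictionary layout are names introduced only for this example, and because the table's rates are rounded to whole Mb/s, the results should be expected to agree only approximately with the percentages in the text.

```python
# Average rates in Mb/s, copied from Table 1: (total, Napster) per direction.
TABLE_1 = {
    "Spring 2000": {"inbound": (45, 7), "outbound": (73, 21)},
    "Fall 2000": {"inbound": (60, 13), "outbound": (110, 31)},
}


def napster_share(napster_rate, total_rate):
    """Napster traffic as a percentage of total traffic in one direction."""
    return 100.0 * napster_rate / total_rate


for sample, directions in TABLE_1.items():
    for direction, (total, napster) in directions.items():
        share = napster_share(napster, total)
        print(f"{sample} {direction}: {share:.1f}% Napster")
```

For Spring 2000 this gives about 28.8% outbound and 15.6% inbound, in line with the 29% and 15% figures above.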

In table 1 we see that the Fall 1999, Spring 2000, and Fall 2000 24-hour samples contained 56, 75, and 98 million flows respectively. This increase in the number of flows is indicative of the increased Internet usage that our campus observed throughout the 1999/2000 school year. This continuous increase in data traffic has been recently investigated and reported by others [CoffmanO].

About the Flow Size Distribution Graphs

In figures 3 through 5 we use two methods to visualize flow size distributions.

The first method, used in figure 3, employs a line graph to plot the distribution of the cumulative percentage of total flows across 32 flow size intervals in units of packets and bytes. Subsequent size intervals are incremented by consecutive powers-of-two. That is, the first interval along the horizontal axis represents flows of sizes 1 and 2 (2^1), the second represents 2 through 4 (2^2), then 4 through 8 (2^3), 8 through 16 (2^4), and so on, up to the maximum size representable (2^31 through 2^32): approximately those between 2 billion and 4 billion. This plot is similar to that used in past investigations which examined the distribution of packet sizes [MCI], except that our horizontal axis compresses 4 billion discrete sizes into a manageable set of only 32 intervals.
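A minimal Python sketch of this binning and of the cumulative percentage follows. The function names are hypothetical, and the handling of sizes that fall exactly on a power of two is an assumption (here interval i holds sizes up to and including 2^i), since the text does not say which interval owns the boundary values.

```python
def interval_index(size):
    """Return the 1-based power-of-two interval for a flow size.

    Assumption: interval i covers sizes up to and including 2**i, so sizes
    1 and 2 fall in interval 1, sizes 3 and 4 in interval 2, and so on.
    """
    if size < 1:
        raise ValueError("flow size must be at least 1")
    return max(1, (size - 1).bit_length())


def cumulative_percent_of_flows(sizes, intervals=32):
    """Cumulative % of flows at or below each interval's upper bound."""
    counts = [0] * intervals
    for s in sizes:
        counts[min(interval_index(s), intervals) - 1] += 1
    total = len(sizes)
    result, running = [], 0
    for c in counts:
        running += c
        result.append(100.0 * running / total if total else 0.0)
    return result
```

The same function applies whether the sizes are counted in packets or in bytes. A single 28-byte packet, for example, lands in the fifth interval, which tops out at 32 bytes.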

A second method is used in figures 4 and 5. Like the previous, figure 4 employs a line graph joining points plotted across the 32 size intervals. However, the percentage of total content delivered is plotted rather than cumulative percentage of total flows. Figure 5 contains histograms which plot the percentage of total bytes delivered for each size interval.
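The difference between the two methods is easiest to see in code. The sketch below weights each power-of-two interval by the bytes its flows delivered rather than by the number of flows. The function name and the boundary handling (interval i holds sizes up to and including 2^i) are assumptions made for this example.

```python
def content_percent_by_interval(flow_bytes, intervals=32):
    """Percentage of total bytes delivered by flows in each size interval."""
    delivered = [0] * intervals
    for size in flow_bytes:
        # Assumed binning: interval i holds sizes up to and including 2**i.
        i = min(max(1, (size - 1).bit_length()), intervals)
        delivered[i - 1] += size
    total = sum(delivered)
    return [100.0 * d / total if total else 0.0 for d in delivered]


# Hypothetical sample: many small flows and two multi-megabyte flows.
sample = [500] * 1000 + [5_000_000] * 2
percent = content_percent_by_interval(sample)
```

In this made-up sample the two large flows are only 0.2% of the flows, yet they account for roughly 95% of the bytes, which is why a content-weighted plot can reveal peaks that a flow-count plot hides.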

Napster Flows

Because this investigation looked for evidence of Napster's influence on IP flow sizes overall, it is useful to understand what sort of flows are produced by Napster. IP flows produced by the Napster application and work-alike clones are of, at least, these types:
  1. TCP initial connections from client user to "redirect" server
  2. TCP responses from "redirect" server to client user (specifying address of an "index" server)
  3. TCP commands/requests from client user to "index" server
  4. TCP responses from "index" server to client user
  5. ICMP ECHO from client user to candidate "server" user (28 byte packets)
  6. ICMP ECHOREPLY from candidate "server" user to client user (28 byte packets)
  7. TCP request from client user to "server" user (request and subsequent ACKs)
  8. TCP responses from "server" user to client user (possibly containing MP3 content)

The term "NapUser" is used below to label traffic believed to be generated by an application using the Napster protocol. Unless otherwise specified in the following discussion, NapUser flows comprise all of those Napster flow types. These NapUser flows were identified by a method implemented in FlowScan [Plonka].

Throughout the figures, Napster values are plotted in purple and magenta. Figure 3 contains plots of the sizes of Napster application flows in terms of packets and bytes, both for ICMP and TCP combined and for TCP alone. The "NapUser TCP bytes" and "NapUser TCP packets" plots represent just the Napster TCP flows, and therefore emphasize the content-carrying flows: those representing the interaction with Napster index servers and the bidirectional TCP data streams which carry the MP3 data.

Figure 4 shows that Napster flow sizes, when measured in packets, peak in the 512-1000 packet interval and again in the 4000-8000 packet interval. When measured in bytes, they peak in the 512KB to 1MB range, and again in the 4MB to 8MB range. These byte measurements are roughly the product of those peak packet counts and the 1500-byte MTU commonly used for Ethernet. As such, it is likely that those peaks are caused by the type of Napster TCP flow that carries most of the MP3 content.
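The product mentioned above can be worked through directly. This Python sketch multiplies packet counts by the 1500-byte MTU to get an upper bound on flow size, assuming every packet is full-sized. The bytes_for_packets helper is a name invented for this example, and the exact power-of-two interval bounds (512 to 1024 and 4096 to 8192 packets) are an assumption standing in for the rounded figures quoted in the text.

```python
MTU = 1500  # bytes; the Ethernet MTU mentioned in the text


def bytes_for_packets(packets, mtu=MTU):
    """Upper-bound byte count for a flow made of full-size packets."""
    return packets * mtu


# Assumed power-of-two bounds for the two packet-count peaks.
for low, high in ((512, 1024), (4096, 8192)):
    low_bytes = bytes_for_packets(low)
    high_bytes = bytes_for_packets(high)
    print(f"{low}-{high} packets: {low_bytes:,} to {high_bytes:,} bytes")
```

The lower bounds come to 768,000 and 6,144,000 bytes, which fall inside the 512KB to 1MB and 4MB to 8MB byte intervals respectively. Flows near the upper packet bounds would spill into the next byte interval, so the correspondence is only approximate.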

From examination of the raw Napster flows themselves we know that most Napster-related flows are actually the small ICMP flows from Napster clients to candidate servers. As such, well over half the flows produced by the application carry a trivial amount of content as measured in bytes or packets. However, the average Napster-produced TCP flow is larger than the average flow amongst all types and therefore Napster does appear to have increased the size of the average Internet IP flow.

IP Flow Size Distributions

Considering figure 3 and the size distribution amongst IP flows of all types, there is similarity among the percentages of flows of particular sizes amongst all samples (red, green, and blue). For instance, in each sample, about half of the flows are of sizes less than 512 bytes. The specific numeric results and the finding that the distributions are stable over a long period of time (about a year) are similar to those reported following recent investigations of flow "lifetimes" by [Brownlee].

One curiosity visible when considering the cumulative percentage of flows vs. flow size is that, in Spring 2000, 5.7% of the flows were less than 32 bytes, in contrast with 0.1% the previous fall. The Fall 2000 statistic remained similar to that of Spring 2000. This is evident in Figure 3 where the solid lines, representing flow size in bytes, leave the zero value on the vertical axis. Remembering that a full 27% of the traffic during the Spring sample was Napster traffic, it is likely that those small flows represent the 28-byte "ping" packets generated by Napster.

Figure 3. Cumulative % Distribution of Flow Sizes

plot of flow sizes

Figure 4. Content % Distribution of Flow Sizes

plot of flow sizes

Considering Figure 4, the distribution of flow sizes based upon the percentage of total IP content delivered, both the packet and byte plots shift slightly toward larger flow sizes as we progress from Fall 1999 (red), to Spring 2000 (green), and to Fall 2000 (blue). Also, there was an increase in the percentage of byte content delivered in flows between 4MB and 16MB in size. Specifically, the Spring 2000 sample shows a spike in those two intervals, somewhat mimicking the pure NapUser flow size distribution.

Figure 5. Fall 1999, Spring 2000, Fall 2000: Content % vs. Flow Size

plot of Fall 1999, Spring 2000, and Fall 2000 flow sizes

Figure 5 shows the distribution of flow sizes in bytes, according to the percentage of the total content in bytes that were delivered by flows of that size. The median intervals have been identified and labeled. This figure seems to show that flow sizes are increasing in that a number of larger intervals (such as those between 4 and 32 megabytes) are responsible for having delivered more of the content. Since earlier figures showed that the percentage of flows of those sizes has not noticeably increased, it appears that the flows within those intervals are increasing in size, and are sometimes promoted up to the larger size intervals. This increase has resulted in the median shifting from the 2MB to 4MB interval, up into the 4MB to 8MB interval.
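The text does not spell out how the median interval was computed. One plausible reading, sketched below in Python as an assumption, is that it is the interval in which the running total of content percentage first reaches 50%. The function name is hypothetical.

```python
def median_content_interval(content_percent):
    """Return the 1-based interval where cumulative content % first reaches 50.

    Assumption: this is how a "median interval" would be located; the
    original analysis may have used a different rule.
    """
    running = 0.0
    for i, pct in enumerate(content_percent, start=1):
        running += pct
        if running >= 50.0:
            return i
    raise ValueError("percentages do not sum to at least 50")
```

Under this reading, the median moves to a higher interval whenever enough content shifts into the larger size intervals, even if the number of flows in those intervals stays about the same.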

Potential Problems

In part, Cisco NetFlow defines its flows based upon timeouts. This user-configurable timeout has been configured in such a way that a TCP data stream will time out after approximately one minute regardless of whether or not the TCP stream is still active. So, a single TCP stream may be, and often is, represented by more than one flow in each direction. For this reason it is somewhat difficult to correlate TCP streams with our flow-based measurements.

Additionally, if the offered load of IP traffic does not change but the usable bandwidth to the users increases, one would expect to see an increase in flow size simply because bulk data transfer applications should be able to transfer more content before the flow expires. We have not yet looked for trends in this flow rate, so we can't necessarily enumerate all the factors contributing to the increases in flow size.

Summary

Napster flows are clearly larger than average, even when Napster's numerous but often overlooked "ping" flows participate in the calculation. Furthermore, Napster flows appear to follow a characteristic pattern - namely that most Napster content is delivered in flows of sizes between 4 megabytes and 16 megabytes. This is perhaps not surprising since it is a reasonable estimate of the range of the MP3 files typically exchanged.

In response to the questions posed at the beginning of this investigation, we can respond as follows:

  1. For IP flows of approximately one minute or less in duration, there is a peak in the distribution of general IP flow sizes that matches a peak in Napster flow sizes. This peak shows that over 30% of the Internet traffic (in bytes) is transferred by flows of sizes between 4MB and 16MB.
  2. The cumulative Napster flow size distribution exhibits a signature quite distinct from the cumulative flow size distribution for general IP flows.
  3. During the time that the bandwidth utilized by Napster and other "sharing" applications has increased dramatically, the size of flows that carry the largest percentage of Internet traffic has subtly increased as well.

Considering these results, it seems likely that use of popular file-sharing applications such as Napster will not only continue to increase bandwidth usage by virtue of their popularity, but will also shift the distribution of flow sizes higher. The operational implications of such a shift in Internet workload and usage characteristics warrant further study.

Future Directions

It is possible that the increasing amount of Internet content being transferred by large flows is due to an increase in the effective throughput observed by applications. If Internet bulk data transfers are enjoying an increasing bit transfer rate, the flow "active timeout" may cause those flows to expire prior to termination of the underlying application session, thus causing a larger number of flows to accumulate within the size interval containing the product of that active timeout value and bit transfer rate. An investigation of flow durations and flow rates is needed to determine whether flow rates are increasing.
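To make the product of timeout and transfer rate concrete, the Python sketch below computes how many bytes a steady transfer can accumulate before its flow record is expired, and which power-of-two size interval that amount falls into. The function names and the example rates are hypothetical and are not measurements from this study.

```python
ACTIVE_TIMEOUT_S = 60  # the one-minute "timeout active" value used in this work


def max_flow_bytes(rate_bits_per_s, timeout_s=ACTIVE_TIMEOUT_S):
    """Bytes a steady transfer accumulates before its flow record expires."""
    return rate_bits_per_s * timeout_s // 8


def size_interval(nbytes):
    """Power-of-two interval (lower, upper), in bytes, containing nbytes."""
    upper = 1 << max(1, (nbytes - 1).bit_length())
    return upper // 2, upper


# Hypothetical sustained transfer rates in bits per second.
for rate in (128_000, 1_000_000, 10_000_000):
    size = max_flow_bytes(rate)
    print(rate, size, size_interval(size))
```

Under these assumptions, a transfer sustaining 1 Mb/s fills a flow record with 7,500,000 bytes, which lands in the 4MB to 8MB interval, while one sustaining 128 kb/s produces 960,000-byte records in the 512KB to 1MB interval. Long transfers at a steady rate would therefore pile up in a single interval regardless of the size of the file being transferred.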

The time-series measurement and visualization of flow rates may be a useful feature to add to existing flow-based passive measurement systems. If the observed flow rates are found to correlate with the user's perceived application performance, we would have a new tool by which to passively measure quality of service in near real time.

Acknowledgements

I thank these folks for their answers and questions:

k claffy <kc@caida.org>
Michael Hare <mhare@doit.wisc.edu>

Analysis Tools

The following tools were used during this analysis:

References


$Id: index.wml,v 1.1 2001/01/03 22:21:24 plonka Exp plonka $