Scott A. Crosby
Dan S. Wallach
scrosby@cs.rice.edu
dwallach@cs.rice.edu
Department of Computer Science, Rice University
When analyzing the running time of algorithms, a common technique is to differentiate best-case, common-case, and worst-case performance. For example, an unbalanced binary tree will be expected to consume O(n log n) time to insert n elements, but if the elements happen to be sorted beforehand, then the tree would degenerate to a linked list, and it would take O(n^2) time to insert all n elements. Similarly, a hash table would be expected to consume O(n) time to insert n elements. However, if each element hashes to the same bucket, the hash table will also degenerate to a linked list, and it will take O(n^2) time to insert n elements.
While balanced tree algorithms, such as red-black trees [11], AVL trees [1], and treaps [17] can avoid predictable input which causes worst-case behavior, and universal hash functions [5] can be used to make hash functions that are not predictable by an attacker, many common applications use simpler algorithms. If an attacker can control and predict the inputs being used by these algorithms, then the attacker may be able to induce the worst-case execution time, effectively causing a denial-of-service (DoS) attack.
Such algorithmic DoS attacks have much in common with other low-bandwidth DoS attacks, such as stack smashing [2] or the ping-of-death, wherein a relatively short message causes an Internet server to crash or misbehave. While a variety of techniques can be used to address these DoS attacks, common industrial practice still allows bugs like these to appear in commercial products. However, unlike stack smashing, attacks that target poorly chosen algorithms can function even against code written in safe languages. One early example was discovered by Garfinkel [10], who described nested HTML tables that induced the browser to perform super-linear work to derive the table's on-screen layout. More recently, Stubblefield and Dean [8] described attacks against SSL servers, where a malicious web client can coerce a web server into performing expensive RSA decryption operations. They suggested the use of crypto puzzles [9] to force clients to perform more work before the server does its work. Provably requiring the client to consume CPU time may make sense for fundamentally expensive operations like RSA decryption, but it seems out of place when the expensive operation (e.g., HTML table layout) is only expensive because a poor algorithm was used in the system. Another recent paper [16] describes a toolkit that allows programmers to inject sensors and actuators into a program; when a resource abuse is detected, an appropriate action is taken.
This paper focuses on DoS attacks that may be mounted from across a network, targeting servers with the data that they might observe and store in a hash table as part of their normal operation. Section 2 details how hash tables work and how they can be vulnerable to malicious attacks. Section 3 describes vulnerabilities in the Squid web cache, the DJB DNS server, and Perl's built-in hash tables. Section 4 describes vulnerabilities in the Bro intrusion detection system. Section 5 presents some possible solutions to our attack. Finally, Section 6 gives our conclusions and discusses future work.
Hash tables are widely used throughout computer systems. They are used internally in compilers to track symbol tables. They are used internally in operating systems for everything from IP fragment reassembly to filesystem directory lookup. Hash tables are so common that programming languages like Perl provide syntactic sugar to represent hash tables as ``associative arrays,'' making them easy for programmers to use. Programmers clearly prefer hash tables for their constant-time expected behavior, despite their O(n) worst-case per-operation running time. After all, what are the odds that a hash table will degenerate to its worst-case behavior?
In typical usage, objects to be inserted into a hash table are first reduced to a 32-bit hash value. Strings might be hashed using a checksum function like CRC32 or MD5, but are usually hashed by much simpler algorithms. More complex objects might have custom-written hash-value operators. The hash table then takes this hash value, modulo the bucket count (the size of the array of pointers to data being stored in the hash table), determining the bucket that will hold the reference to the object being inserted. When two inputs map to the same bucket, a collision has occurred. To deal with this case, each hash bucket holds a linked list of all inserted objects whose hash value, modulo the bucket count, maps to that particular bucket (see Figure 1). These linked lists are referred to as hash chains. When the total number of objects in the hash table grows too large, resulting in long average chain length, the size of the array of hash buckets is typically increased, perhaps multiplied by a constant factor, and the entries in the table are reinserted, taking their hash values modulo the new bucket count.
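As a concrete illustration, the following C sketch (with illustrative names and a toy string hash, not taken from any of the applications discussed later) shows a chained insert that checks for duplicates. Each insert walks the entire chain for its bucket, which is exactly the behavior the attacks described below exploit: n inserts into a single chain cost O(n^2) in total.

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct entry {
    char *key;
    struct entry *next;            /* hash chain link */
};

struct table {
    struct entry **buckets;        /* array of chain heads */
    size_t nbuckets;               /* the bucket count */
};

static uint32_t toy_hash(const char *s) {
    uint32_t h = 0;
    while (*s)
        h = h * 33 + (unsigned char)*s++;   /* simple multiplicative hash */
    return h;
}

/* Returns 0 if the key was already present, 1 if it was inserted. */
int table_insert(struct table *t, const char *key) {
    size_t b = toy_hash(key) % t->nbuckets;

    /* Duplicate check: scans every entry already in this bucket's chain. */
    for (struct entry *e = t->buckets[b]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return 0;

    struct entry *e = malloc(sizeof *e);
    e->key = strdup(key);
    e->next = t->buckets[b];       /* prepend to the chain */
    t->buckets[b] = e;
    return 1;
}
```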
There are other methods of implementing hash tables, including open addressing, where collisions are not resolved using hash chains. Instead, the system follows a deterministic strategy to probe for an empty hash bucket, where the object is then inserted. Although this paper focuses on hash chaining, the attacks described here will be at least as effective on open addressing hash tables.
The worst case (see Figure 2) can occur for two reasons: either the 32-bit hash values are identical, or the hash values modulo the bucket count become identical. Of course, for randomly chosen input, the odds of every object hashing to the same bucket are vanishingly small -- approximately (1/n)^(k-1) for n buckets and k objects. For maliciously chosen input, however, it becomes entirely feasible. If the hash table is checking for duplicates when a new object is inserted, perhaps to guarantee that it acts as a mapping from object keys to values, then it will need to scan every entry in the hash bucket. This will induce the worst-case O(n) behavior for each insert.
There are only a few requirements in order to engage in such an attack. First, the hash function being used must be deterministic and known to the attacker. Second, the attacker needs the ability to predict or supply all of the input being used by the hash function. Third, the attacker needs to ensure that a sufficient volume of attack input gets to the victim such that they experience a performance degradation.
The attacker must understand how raw data, initially read by the application from the network, is processed before it is inserted into the hash table. Knowing this, the attacker must compute objects that will eventually collide, either in the 32-bit hash-value space, or only in the eventual hash buckets. Section 2.1 will describe how these collisions can be efficiently computed for some hash functions. At worst, computing hash collisions requires an attacker to exhaustively search within the space of possible inputs. While expensive, the attacker can do this work ahead of time. Ultimately, the question is whether the victim will accept enough attack-input for the worst-case behavior to manifest itself. Furthermore, some victims may enforce various limits on the growth of their hash tables, making them robust against this class of attack. We describe such limits in Section 2.2.
The first step in analyzing a program's vulnerability to this attack is to determine where hash tables are being used and to identify whether external, untrusted input can be fed directly into them. This can be time consuming. As an example, the Bind DNS server places some four different abstraction layers between the network and the ultimate hash table storing DNS bindings. Tracing this can be tedious work for an attacker unfamiliar with the source code.
An attacker may not know the bucket count exactly; many implementations change the bucket count based on the number of objects stored in the hash table. However, given the application's source code, an attacker may be able to guess possible values for the bucket count. This leads to two avenues of attack: those where you don't care about the bucket count and those where you know or guess the bucket count.
If collisions can be computed in the full 32-bit hash-value space, then the bucket count is irrelevant; the hash table will exhibit worst-case behavior regardless of how many buckets it has. More formally, we wish to derive inputs x1, x2, ..., xn such that Hash(x1) = Hash(x2) = ... = Hash(xn). We refer to these as hash collisions. If the inputs have different hash values, but still collide into the same bucket (e.g., after a modulo operation has taken place), we refer to these as bucket collisions. Formally, a bucket collision occurs when we derive inputs x1, x2, ..., xn such that f(x1) = f(x2) = ... = f(xn), where f is the function mapping from inputs to buckets. In many cases, f(x) = Hash(x) mod m, with m being the number of buckets.
While hash collisions would seem straightforward, they do not always result in a feasible attack. For example, consider an attacker who wishes to attack an intrusion detection system (IDS) scanning TCP/IP SYN packets to detect SYN flooding activity. If the IDS is remembering packets based purely on the source and destination IP addresses and port numbers, this would give the attacker a 96-bit search space. However, the destination address must be close enough to the IDS for the IDS to observe the traffic. Likewise, the attacker's service provider may do egress filtering that prevents forged source IP addresses. This could reduce the attacker to as little as 48 bits of freedom in selecting packets. If the hash function reduces these packets to 32-bit hash values, then there will be, on average, 2^16 packets that an attacker can possibly send which will collide in the 32-bit hash-value space. 2^16 values stored in the same hash bucket may or may not be enough to noticeably degrade the IDS's performance.
Conversely, suppose the attacker wishes to compute bucket collisions rather than hash collisions. Because the bucket count is much smaller than the size of the hash-value space, it will be easier to find bucket collisions. Thus, if the attacker can predict the precise bucket count, then many more possible collisions can be computed. This flexibility may allow effective attacks on applications hashing inputs as short as 32 bits. However, if there are several possible bucket counts, then the attacker has several options:
Computing collisions that work for multiple bucket counts is not practical; the search space grows proportionally to the least common multiple of the candidate bucket counts. This can easily exceed the 32-bit space of hash values, making hash collisions more attractive to compute than bucket collisions. However, if the number of candidate bucket counts (call it k) is small enough, then the attacker can compute k separate attack streams, one focused on each potential bucket count. If the attacker sends n objects of attack data, then most of the attack data will be distributed throughout the hash table, with an expected O(1) insert per object. The remaining n/k objects, however, will all collide in one bucket, causing an expected total running time of O((n/k)^2). Furthermore, if the hash table happens to be resized and one of the attack streams corresponds to the new bucket count, then the resulting hash table will still exhibit quadratic performance problems.
For simplicity, the remainder of this paper focuses on computing hash collisions. Later, when we describe attacks against an actual IDS (see Section 4), we will show that the hash collisions we can compute, all of which fall in a single bucket, are more than sufficient to mount an effective attack.
The hash functions used by typical programs for their hash tables are generally not cryptographically strong functions like MD5 or SHA-1. Instead, they tend to be functions with 32 bits of internal state, designed primarily for speed. Because this state is limited, we need only find inputs such that the internal state after hashing is the same as the initial state.
Consider a hash function with an initial state of 0. Imagine we can find generators: inputs g such that hashing g takes the internal state from 0 back to 0. Then the concatenation of any number of these generators, in any combination and any order, also hashes to 0. So, given generators g1, g2, and g3, the string g1 g2 also hashes to 0, as will g2 g1 or g1 g3 g2. Thus, by finding three inputs g1, g2, g3 via exhaustive search and concatenating them combinatorially, we can generate a large number of collisions without requiring any additional searching. The number of possible collisions is bounded only by the maximum length to which we are willing to allow concatenated generators to grow. This process can be generalized by finding a set of generators closed over a small number of hash states (i.e., searching for inputs that take hash states less than some small integer to other hash states less than the same small integer).
In simple tests, attacking the Perl 5.6.1 hash function on a 450MHz Pentium-2 processor, 30 minutes of CPU time enumerating and hashing all 8-character alphabetic strings was sufficient to find 46 generators that hash to zero. By concatenating three of them combinatorially, we derive 46^3 (approximately 97k) alphabetic inputs, 24 characters long, that all hash to the same 32-bit hash value.
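The following C sketch illustrates the structure of such a search against a toy byte-at-a-time hash (the mixing step shown is a placeholder, not Perl's actual function, and the names are ours). Strings are enumerated recursively while carrying the partial hash state, so each character is mixed in only once; any string whose final state returns to zero is recorded as a generator, and generators can then be concatenated combinatorially to mass-produce collisions.

```c
#include <stdio.h>
#include <stdint.h>

#define GEN_LEN 8
static const char ALPHABET[] = "abcdefghijklmnopqrstuvwxyz";

/* Placeholder mixing step for a 32-bit byte-at-a-time hash. */
static uint32_t step(uint32_t h, unsigned char c) {
    return h * 33 + c;
}

/* Recursively enumerate GEN_LEN-character strings, carrying the partial
 * hash state h; print any string whose state returns to 0. */
static void search(uint32_t h, char *buf, int depth) {
    if (depth == GEN_LEN) {
        if (h == 0) {                      /* found a generator */
            buf[GEN_LEN] = '\0';
            printf("generator: %s\n", buf);
        }
        return;
    }
    for (const char *p = ALPHABET; *p; p++) {
        buf[depth] = *p;
        search(step(h, (unsigned char)*p), buf, depth + 1);
    }
}

int main(void) {
    char buf[GEN_LEN + 1];
    search(0, buf, 0);   /* exhaustive: 26^8 strings, so expect a long run */
    return 0;
}
```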
Hash collisions can be efficiently computed for a number of other common applications. The Linux protocol stack and the Bro intrusion detection system simply XOR their input together, 32 bits at a time. Thus, collisions may be directly computed from the algebraic structure of the hash function.
Many applications are sensitive about their overall memory usage, and thus have limits designed to control how large their hash tables might grow. If a hash table can never have enough elements in it for the worst-case behavior to dominate, then our attack will fail.
Some applications have explicit limits on their hash tables. We first consider the Linux IP fragment reassembly code. In response to earlier attacks, Linux currently allows at most 256 kbytes of storage toward reassembling incomplete packets. If we wish to attack the hash table being used to store these packet fragments, the longest hash chain we can induce will still be under 256 kbytes in total. We can still force Linux to repeatedly scan this chain, increasing the CPU load on the kernel, but we are unsure whether we can cause enough slowdown to be interesting.
(Florian Weimer reports that he found an exploitable hashing vulnerability in the Linux route cache, allowing 400 packets per second from an attacker to overload a quad-processor Pentium Xeon server, despite the size limits present in the route cache's hash table [20].)
The Apache web server collects fields from HTTP request headers into a vector (auto-sizing array). If there are multiple header fields with the same type, Apache concatenates them with an O(n^2) operation. This was a target for attack [19]; however, Apache now imposes a limit on the number of fields that can appear in an HTTP request (100 fields, by default). Even with 100 entries naming the same field, the worst-case running time will still be small, because n = 100 is too small for the quadratic performance to become noticeable.
There are many other places where there are limits on the attacker's ability to influence a hash table. For instance, as discussed in Section 2.1.1, the freedom of an attacker to construct arbitrary inputs may be limited. In the case of network packets intended to attack a network sniffer, the attacker is limited both by the packet fields being watched by the sniffer, and by the packet headers necessary to route the packet toward the targeted machine. More generally, many applications operate on restricted data types, or otherwise place limits on an attacker's ability to generate arbitrary input for the targeted hash table. In some sense, these applications are lucky, but they could be vulnerable to attack in the future if their environment changes (e.g., moving from IPv4 to IPv6 will increase the size of IP addresses, giving more freedom to attack tables that hash IP addresses).
We performed a short analysis of three programs to determine how vulnerable they are to attack. We analyzed and attacked the hash tables used by two versions of the Perl interpreter and by the Squid web proxy cache. We also investigated the DJB DNS cache and found it less vulnerable to these attacks.
The Squid Internet object cache [14] is intended to reduce network bandwidth by caching frequently used objects [7]. We analyzed the hash tables used within version 2.5STABLE1.
While we have not performed an exhaustive audit of Squid, we did discover a hash table used to track objects cached in memory. The hash table is keyed with an integer counter, the HTTP request method (i.e., GET, HEAD, etc.), and the URL in question. When Squid is operating as part of a caching cluster, it omits the integer counter and only hashes the HTTP request method and URL. (For reasons that escape us, Squid calls this ``private key'' vs. ``public key'' mode; this seems to have nothing to do with the traditional cryptographic senses of those terms.) An MD5 cryptographic checksum is performed over these values, and the resulting 128-bit value is truncated to 13 bits, identifying the hash bucket.
As an attacker, we cannot necessarily predict the value of the counter, making it difficult to compute hash collisions. However, Squid can be tricked into believing that it is part of a cluster by sending it a single UDP packet, an Internet Caching Protocol (ICP) MISS_NO_FETCH message [21]. This packet is accepted by the default configuration, and it's unclear whether this packet could be easily filtered, even using Squid's access control features. Regardless, any Squid cluster would already be forgoing the use of the ``private key'' mode, and thus would be vulnerable to attack.
A full benchmarking environment for Squid would require multiple web servers and clients to simulate the load experienced by the Squid web cache. To simplify things, we ran Squid on a stand-alone machine, where the URL requests were parsed from a local file and were satisfied with constant-sized web page results, served by a local proxy server. This environment is undeniably not suitable for making general remarks about Squid's general-purpose throughput, but it allows us to place pressure on this particular hash table and observe the effects.
We measured the wall-clock time necessary for Squid, in our restrictive configuration, to load approximately 143k URLs. We compared the performance of loading randomly chosen URLs with URLs carefully chosen to collide with Squid's hash function. Squid took 14.57 minutes to process the attack URLs versus 10.55 minutes to process the randomly chosen URLs. Thus, our attack added, on average, approximately 1.7ms of latency to each request serviced by the Squid cache.
This attack does not represent a ``smoking gun'' for algorithmic complexity attacks, but it does illustrate how common network services may be sensitive to these attacks. Furthermore, this attack demonstrates how seemingly innocuous features (e.g., Squid's ``private key'' mechanism, whatever it actually does) may have an effect on an application's resistance to these attacks.
Dan Bernstein's DNS server is designed to have several independent programs serving different duties. His DNS cache is one program in this collection. If we can pollute the cache with requests for domains under our control (e.g., ``x1.attacker.org'', ``x2.attacker.org'', etc.), we may be able to mount an algorithmic complexity attack against the DNS cache's hash table.
Upon code inspection, we found that DJBDNS uses a deterministic hash function in its implementation of a DNS cache. Interestingly, the lookup code has an explicit check for being subject to ``hash flooding;'' after following a chain for 100 entries, it gives up and treats the request as a cache miss. We presume this design is intended to prevent the DNS cache from burning an excessive amount of CPU time on any given request. Bernstein essentially anticipated a version of our attack, although, as we discuss in Section 5, his fix could be improved.
Perl is a widely-used programming language with built-in support for hash tables (called ``associative arrays''). While attacking a large number of Perl scripts is beyond the scope of this paper, we expect that many deployed Perl scripts take untrusted input and store it directly in associative arrays. We demonstrate attacks against the associative arrays in Perl, versions 5.6.1 and 5.8.0; the hash function was changed between these two versions.
The hash functions in both versions of Perl form state machines. The internal state is the 32-bit accumulated hash value. The input being hashed is mixed in, one byte at a time, using a combination of addition, multiplication, and shift operations. The structure of the hash functions in both Perl 5.6.1 and 5.8.0 allows us to efficiently compute generators (see Section 2.1.2). Spending around one CPU hour attacking both hash functions, we were able to find 46 generators for Perl 5.6.1 and 48 generators for Perl 5.8.0, yielding 97k-110k colliding inputs of 24 characters in length. We then loaded these strings directly into associative arrays in both interpreters. The results are presented in Table 1. When an interpreter is fed the input designed to collide with its hash function, the running time was three orders of magnitude worse (2 seconds vs. almost two hours) than when fed the data designed to attack the other Perl version. This demonstrates how devastating an algorithmic complexity attack can be: one hour of pre-computed CPU work, on the client, can cause almost two hours of online work for a server. Doubling the number of inputs, by either finding new generators or using longer inputs, would quadruple the victim's work. The exponent in the victim's worst-case behavior is clearly dominant.
Bro [15] is a general-purpose network intrusion detection system (IDS) that can be configured to scan for a wide variety of possible attacks. Bro is open-source and is used in production at a number of commercial and academic sites. This makes it an attractive target, particularly because we can directly study its source code. Also, given that Bro's job is to scan and record network packets, correlating events in real time to detect attacks, we imagine it has numerous large hash tables with which it tracks these events. If we can peg Bro's CPU usage, we can potentially knock the IDS off the air, clearing the way for other attacks to go undetected.
In order to keep up with traffic, Bro uses packet filters [13] to select and capture desired packets, as a function of its configuration. Following this, Bro implements an event-based architecture. New packets raise events to be processed. Synthetic events can also be timed to occur in the future, for example, to track the various time-outs that occur in the TCP/IP protocol. A number of Bro modules exist to process specific protocols, such as FTP, DNS, SMTP, Finger, HTTP, and NTP.
Bro contains approximately 67,000 lines of C++ code that implement low-level mechanisms to observe network traffic and generate events. Bro also provides a wide selection of scripts, comprising approximately 9000 lines of code in its own interpreted language, that use the low-level mechanisms to observe network behavior and react appropriately. While we have not exhaustively studied the source code to Bro, we did observe that Bro uses a simple hash table whose hash function simply XORs together its inputs. This makes collisions exceptionally straightforward to derive. The remaining issue for an attacker is to determine how and when incoming network packets are manipulated before hash table entries are generated.
We decided to focus our efforts on Bro's port scanning detector, primarily due to its simplicity. For each source IP address, Bro needs to track how many distinct destination ports have been contacted. It uses a hash table to track, for each (source IP address, destination port) tuple, whether any internal machine has been probed on a given port from that particular source address. To attack this hash table, we observe that the attacker has 48 bits of freedom: a 32-bit source IP address and a 16-bit destination port number. (We're now assuming the attacker has the freedom to forge arbitrary source IP addresses.) If our goal is to compute 32-bit hash collisions (i.e., before the modulo operation to determine the hash bucket), then for any good hash function, we would expect there to be approximately 2^16 possible collisions we might be able to find for any given 32-bit hash value. In a hypothetical IPv6 implementation of Bro, there would be significantly more possible collisions, given the larger space of source IP addresses.
Deriving these collisions with Bro's XOR-based hash function requires understanding the precise way that Bro implements its hash function. In this case, the hash function is the source IP address, in network byte order, XORed with the destination port number, in host order. This means that on a little-endian computer, such as an x86 architecture CPU, the high-order 16 bits of the hash value are taken straight from the last two octets of the IP address, while the low-order 16 bits of the hash value result from the first two octets of the IP address and the port number. Hash collisions can be derived by flipping bits in the first two octets of the IP address in concert with the matching bits of the port number. This allows us, for every 32-bit target hash value, to derive precisely 2^16 input packets that will hash to the same value.
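The sketch below illustrates the idea, modeling the hash as a plain 32-bit XOR of address and port and ignoring the byte-order details described above (the names and base values are illustrative). For any base (address, port) pair, flipping the same low 16 bits in both fields leaves the XOR unchanged, yielding 2^16 inputs with an identical hash value.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified model of an XOR-based packet hash. */
static uint32_t xor_hash(uint32_t addr, uint16_t port) {
    return addr ^ (uint32_t)port;
}

int main(void) {
    uint32_t base_addr = 0x0a000001;   /* 10.0.0.1, an arbitrary example */
    uint16_t base_port = 80;
    uint32_t target = xor_hash(base_addr, base_port);

    /* Flip each 16-bit mask in both the address and the port; the two
     * flips cancel in the XOR, so every pair collides with the base. */
    for (uint32_t mask = 0; mask < 0x10000; mask++) {
        uint32_t addr = base_addr ^ mask;
        uint16_t port = base_port ^ (uint16_t)mask;
        if (xor_hash(addr, port) != target)
            return 1;                  /* never happens */
    }
    printf("65536 (address, port) pairs hash to 0x%08x\n", target);
    return 0;
}
```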
We could also have attempted to derive bucket collisions directly, which would allow us to derive more than 2^16 collisions in a single hash bucket. While we could guess the bucket count, or even generate parallel streams designed to collide under a number of different bucket counts as discussed in Section 2.1.1, this would require sending a significant amount of additional traffic to the Bro server. If the hash collisions are sufficient to cause a noticeable quadratic explosion inside Bro, then this would be the preferable attack.
We have designed attack traffic that can make Bro saturate the CPU and begin to drop traffic within 30 seconds during a 160kb/s, 500 packets/second flood, and within 7 minutes with a 16kb/s flood.
Our experiments were run over an idle Ethernet, with a laptop computer transmitting the packets to a Bro server, version 0.8a20, running on a dual-CPU Pentium-2 machine, running at 450MHz, with 768MB of RAM, and running the Linux 2.4.18 kernel. Bro only uses a single thread, allowing other processes to use the second CPU. For our experiments, we configured Bro exclusively to track port scanning activity. In a production Bro server, where it might be tracking many different forms of network misbehavior, the memory and CPU consumption would be strictly higher than we observed in our experiments.
We first present the performance of Bro, operating in an off-line mode, consuming packets only as fast as it can process them. We then present the latency and drop-rate of Bro, operating online, digesting packets at a variety of different bandwidths.
Normally, on this hardware, Bro can digest about 1200 SYN packets per second. We note that this is only 400kb/s, so Bro would already be vulnerable to a simple flood of arbitrary SYN packets. We also note that Bro appears to use about 500 bytes of memory per packet when subject to random SYN packets. At a rate of 400kb/s, our Bro system, even if it had 4GB of RAM, would run out of memory within two hours.
We have measured the offline running time for Bro to consume 64k randomly chosen SYN packets. We then measured the time for Bro to consume the same 64k randomly chosen packets, to warm up the hash table, followed by 64k attack packets. This minimizes rehashing activity during the attack packets and more closely simulates the load that Bro might observe had it been running for a long time and experienced a sudden burst of attack packets. The CPU times given in Table 2 present the results of benchmarking Bro under this attack. The results show that the attack packets introduce two orders of magnitude of overhead to Bro, overall, and three orders of magnitude of overhead specifically in Bro's hash table code. Under this attack, Bro can only process 24 packets per second instead of its normal rate of 1200 packets per second.
In the event that Bro was used to process an extended amount of data, perhaps captured for later offline analysis, then an hour of very low bandwidth attack traffic (16kb/s, 144k packets, 5.8Mbytes of traffic) would take Bro 1.6 hours to analyze instead of 3 minutes. An hour of T1-level traffic (1.5Mb/s) would take a week instead of 5 hours, assuming that Bro didn't first run out of memory.
As described above, our attack packets cannot be processed by Bro in real time, even at very modest transmission rates. For offline analysis, this simply means that Bro will take a while to execute. For online analysis, it means that Bro will fall behind. The kernel's packet queues will fill because Bro isn't reading the data, and eventually the kernel will start dropping the packets. To measure this, we constructed several different attack scenarios. In all cases, we warmed up Bro's hash table with approximately 130k random SYN packets. We then transmitted the attack packets at one of three different bandwidths (16kb/s, 64kb/s, and 160kb/s). We constructed attacks that transmitted all attack packets sequentially, multiple times. We also constructed a ``clever'' attack scenario, where we first sent a portion of our attack packets and then repeatedly retransmitted the remainder. The clever attack forces more of the chain to be scanned before the hash table code discovers that the new value is already present in the hash chain.
Table 3 shows the approximate drop rates for four attack scenarios. We observe that an attacker with even a fraction of a modem's bandwidth, transmitting for less than an hour, can cause Bro to drop, on average, 71% of its incoming traffic. This would make an excellent precursor to another network attack that the perpetrator did not wish to be detected.
Bro's drop rate is not constant. In fact, Bro manifests interesting oscillations in its drop rate, which are visible in Figures 3 through 6. These graphs present Bro's packet processing latency and cumulative packet drop rate for attack packets being transmitted at 16 kb/sec and 64 kb/sec.
At time t0, the latency (the time between packet arrival and packet processing) starts increasing as the total processing cost per packet begins to exceed the packet inter-arrival time.
At time t1, Bro is sufficiently back-logged that the kernel has begun to drop packets. As a result, Bro starts catching up on its backlogged packets. During this phase, the Bro server is dropping virtually all of its incoming traffic.
At time t2, Bro has caught up on its backlog, and the kernel is no longer dropping packets. The cycle can now start again. However, the hash chain under attack is now larger than it was at time t0. This will cause subsequent latencies to rise even higher than they were at time t0.
This cyclic behavior occurs because Bro only adds entries to this hash table after it has determined there will be no response to the SYN packet. Bro normally uses a five-minute timeout. We reduced this to 30 seconds to reduce our testing time and make it easier to illustrate our attacks. We anticipate that, if we were to run with the default 5-minute timeout, the latency swings would have a longer period and a greater amplitude, due to the ten-times-larger queues of unprocessed events that would accumulate.
Our attack on Bro has focused on its port scanning detector. Bro and other IDS systems almost certainly have other hash tables which may grow large enough to be vulnerable to algorithmic complexity attacks. For example, Bro has a module to detect network scans, determining how many destination hosts have been sent packets by a given source host. This module gives the attacker 32 + h bits of freedom, where h is the number of host bits in the destination network monitored by the IDS. h is unlikely to be greater than 16 except for a handful of sites. However, in an IPv6 network, the sky is the limit. For that matter, IPv6 gives the attacker a huge amount of freedom for any attack where IP addresses are part of the values being hashed.
Part of any hash table design is the need to expand the bucket count when the table occupancy exceeds some threshold. When the hash table has a large number of objects which hash to the same bucket after the rehashing operation, then the rehashing operation could be as bad as O(n^2), if the hash table uses its normal insertion operation that checks for duplicates. As it turns out, Bro does exactly this. In our regular experimental runs, we ``warmed up'' the hash tables to prevent any rehashing during the experiment. Before we changed our experimental setup to do this, we saw large spikes in our latency measurements that indicated rehashing was occurring. When rehashing, Bro takes 4 minutes to process a table with 32k attack entries, and 20 minutes to process a table with 64k attack entries. Without IPv6 or the use of bucket collisions, we cannot create more collisions than this, although making the IDS server unresponsive for 20 minutes is certainly an effective attack.
Although rehashing attacks are extremely potent, they are not necessarily easy to use; attackers cannot exploit this window of opportunity unless they know exactly when it is occurring. Furthermore, Bro's hash table will rehash itself at most 12 times as it grows from 32k entries to 64M entries.
When analyzing algorithmic complexity attacks, we must assume the attacker has access to the source code of the application, so security through obscurity is not acceptable. Instead, either the application must use algorithms that do not have predictable worst-case inputs, or the application must be able to detect when it is experiencing worst-case behavior and take corrective action.
A complete survey of algorithms used in common systems is beyond the scope of this paper. We focus our attention on binary trees and on hash tables.
While it is trivial for an attacker to generate worst-case input for binary trees, many other data structures like red-black trees [11] and splay trees [18] have runtime bounds that are guaranteed, regardless of their input. A weaker but sufficient condition is to use an algorithm that does not have predictable worst-case inputs. For example, treaps [17] are trees where each node is assigned a randomly chosen number upon creation. The tree nodes are rotated such that a tree property is preserved on the input data, as would be expected of any tree, but a heap property is also maintained on the random numbers, yielding a tree that is probabilistically balanced. So long as the program is designed to prevent the attacker from predicting the random numbers (i.e., the pseudo-random number generator is ``secure'' and is properly initialized), the attacker cannot determine what inputs would cause the treap to exhibit worst-case behavior.
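As a concrete illustration, the following C sketch (illustrative, not taken from any particular library) shows a treap insert: keys obey the binary-search-tree ordering, the randomly assigned priorities obey a max-heap ordering, and rotations restore the heap property after each insert. A production implementation would draw priorities from a properly seeded, cryptographically secure generator rather than rand().

```c
#include <stdlib.h>

typedef struct node {
    int key;
    unsigned pri;              /* random priority; max-heap ordered */
    struct node *l, *r;
} node;

static node *rotate_right(node *t) { node *l = t->l; t->l = l->r; l->r = t; return l; }
static node *rotate_left(node *t)  { node *r = t->r; t->r = r->l; r->l = t; return r; }

/* Insert key into the treap rooted at t and return the new root. */
node *treap_insert(node *t, int key) {
    if (!t) {
        node *n = malloc(sizeof *n);
        n->key = key;
        n->pri = (unsigned)rand();   /* use a secure RNG in practice */
        n->l = n->r = NULL;
        return n;
    }
    if (key < t->key) {
        t->l = treap_insert(t->l, key);
        if (t->l->pri > t->pri)      /* restore heap property */
            t = rotate_right(t);
    } else if (key > t->key) {
        t->r = treap_insert(t->r, key);
        if (t->r->pri > t->pri)
            t = rotate_left(t);
    }
    return t;                        /* duplicate keys are ignored */
}
```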
When attacking hash tables, an attacker's goal is to efficiently compute second pre-images to the hash function, i.e., given an input x that hashes to h(x), it should be infeasible for the attacker to derive a second input y such that h(y) = h(x). Cryptographically strong hash functions like MD5 and SHA-1 are resistant, in general, to such attacks. However, when used in hash tables, the 128 or 160 bits of output from MD5 or SHA-1 must eventually be reduced to the bucket count, making it feasible for an attacker to mount a brute-force search on the hash function to find bucket collisions. Some simple benchmarking on a 450MHz Pentium-2 allowed us to compute approximately five such collisions per second in a hash table with 512k buckets. This weakness can be addressed by using keyed versions of MD5 or SHA-1 (e.g., HMAC [12]). The key, chosen randomly when the program is initialized, will not be predictable by an attacker; as a result, the attacker will not be able to predict the hash values used when the program is actually running. When keyed, MD5 and SHA-1 become pseudo-random functions, which, like treaps, become unpredictable to the attacker. When unkeyed, MD5 and SHA-1 are deterministic functions and subject to bucket collisions.
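As a sketch of the keyed approach (assuming OpenSSL's one-shot HMAC() interface and RAND_bytes(); the function and variable names here are ours, and this is not the library described later in this section), a table can derive its bucket index from HMAC-SHA1 under a key drawn once at startup, so an attacker cannot predict where any given input will land. Link with -lcrypto.

```c
#include <openssl/hmac.h>
#include <openssl/evp.h>
#include <openssl/rand.h>
#include <stdint.h>
#include <string.h>

static unsigned char table_key[16];

/* Call once at program startup; RAND_bytes is OpenSSL's CSPRNG. */
void hash_key_init(void) {
    RAND_bytes(table_key, sizeof table_key);
}

/* Keyed bucket selection: HMAC-SHA1 over the input, truncated to 32 bits
 * and reduced modulo the bucket count. */
uint32_t keyed_bucket(const void *data, size_t len, uint32_t nbuckets) {
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digest_len = 0;
    uint32_t h;

    HMAC(EVP_sha1(), table_key, (int)sizeof table_key,
         (const unsigned char *)data, len, digest, &digest_len);
    memcpy(&h, digest, sizeof h);     /* take 32 bits of the MAC */
    return h % nbuckets;
}
```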
Replacing deterministic hash functions with pseudo-random functions gives probabilistic guarantees of security. However, a stronger solution, which can also execute more efficiently, is available. Universal hash functions were introduced in 1979 [5] and are cited by common algorithm textbooks (e.g., Cormen, Leiserson, and Rivest [6]) as a solution suitable for adversarial environments. It has not been standard practice to follow this advice, but it should be.
Where MD5 and SHA-1 are designed to be resistant to the computation of second pre-images, universal hash functions are families of functions (with the specific function selected by a key) with the property that, for any two distinct messages M1 and M2, the odds of h(M1) = h(M2) are less than some small value epsilon when h is chosen at random from the family. This property is sufficient for our needs, because an attacker who does not know the specific hash function has guaranteed low odds of computing hash collisions.
Carter and Wegman's original construction of a universal hash function computes the sum of a fixed chosen constant with the dot product of a fixed chosen vector and the input, modulo a large prime number. The constant and the vector are chosen randomly at initialization time, typically pre-computed using a keyed pseudo-random function, and reused for every string being hashed. The only performance issue is that this vector must either be pre-computed up to the maximum expected input length, or it must be recomputed when it is used, causing a noticeable performance penalty. More recent constructions, including UMAC [4] and hash127 [3], use a fixed amount of space despite supporting arbitrary-length arguments. UMAC, in particular, is carefully engineered to run fast on modern processors, using adds, multiplies, and SIMD multimedia instructions for increased performance.
Some software designers are unwilling to use universal hashing, afraid that it will introduce unacceptable performance overheads in critical regions of their code. Other software designers simply need a fast, easy-to-integrate library to solve their hashing needs. Borrowing code from UMAC and adding variants hand-optimized for small, fixed-length inputs, we have implemented a portable C library suitable for virtually any program's needs.
Our library includes two different universal hash functions: the UHASH function, submitted as part of the (currently expired) UMAC Internet draft standard [4], and the Carter-Wegman dot-product construction. We also include a hand-tuned variant of the Carter-Wegman construction, optimized to support fixed-length, short inputs, as well as an additionally tuned version that only yields a 20 bit result, rather than the usual 32 bits. This may be appropriate for smaller hash tables, such as used in Squid (see Section 3.1).
Our Carter-Wegman construction processes the value to be hashed one byte at a time. These bytes are multiplied by 32 bits from the fixed vector, yielding 40-bit intermediate values that are accumulated in a 64-bit counter. One 64-by-32-bit modulo operation is used at the end to yield the 32-bit hash value. This construction supports inputs of up to 2^24 bytes in length. (A maximum length is declared by the programmer when a hashing context is allocated, causing the fixed vector to be initialized by AES in counter mode, keyed from /dev/random. Hash inputs longer than this are rejected.)
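The following C sketch illustrates this byte-at-a-time construction; it is a simplification, not an excerpt from our library. The vector and constant are assumed to have been filled with random words at initialization (the paper derives them from AES in counter mode keyed from /dev/random), and the prime shown, 2^32 - 5, stands in for whatever modulus an implementation actually selects.

```c
#include <stdint.h>
#include <stddef.h>

#define CW_MAX_LEN  1024          /* declared maximum input length */
#define CW_PRIME    4294967291u   /* a prime just below 2^32 (2^32 - 5) */

static uint32_t cw_vector[CW_MAX_LEN];   /* filled with random words at init */
static uint32_t cw_constant;             /* fixed random constant, set at init */

/* Dot-product hash: each byte is multiplied by a 32-bit vector entry, the
 * roughly 40-bit products accumulate in a 64-bit counter, and one 64-by-32
 * modulo at the end yields the 32-bit result. */
uint32_t cw_hash(const unsigned char *buf, size_t len) {
    if (len > CW_MAX_LEN)
        return 0;                 /* a real library would reject long inputs */
    uint64_t acc = cw_constant;
    for (size_t i = 0; i < len; i++)
        acc += (uint64_t)buf[i] * cw_vector[i];   /* each term < 2^40 */
    return (uint32_t)(acc % CW_PRIME);
}
```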
Our initial tests showed that UHASH significantly outperformed the Carter-Wegman construction for long inputs, but Carter-Wegman worked well for short inputs. Since many software applications of hash functions know, a priori, that their inputs are small and of fixed length (e.g., in-kernel network stacks that hash portions of IP headers), we wished to provide carefully tuned functions to make such hashing faster. By fixing the length of the input, we could fully unroll the internal loop and avoid any function calls; GCC inlines our hand-tuned function. Furthermore, the Carter-Wegman construction can be implemented with a smaller accumulator. Without changing the mathematics of Carter-Wegman, we can multiply 8-bit input values with 20-bit entries in the fixed vector and use a 32-bit accumulator. For inputs of 16 bytes or fewer, the accumulator will not overflow, and we get a 20-bit hash value as a result. The inputs are passed as separate formal arguments, rather than in an array. This gives the compiler ample opportunity for inlining and specializing the function.
We performed our microbenchmarking on a Pentium 2, 450MHz computer. Because hash tables tend to use a large amount of data, but only read it once, working set size and its resultant impact on cache miss rates cannot be ignored. Our microbenchmark is designed to let us measure the effects of hitting or missing in the cache. We pick an array size, then fill it with random data. We then hash a random sub-range of it. Depending on the array size chosen and whether it fits in the L1 cache, L2 cache, or not, the performance can vary significantly.
In our tests, we microbenchmarked two conventional algorithms, four universal hashing algorithms, and one cryptographic hash algorithm:
Perl | Perl 5.8.0 hash function
MD5 | cryptographic hash function
UHASH | UMAC universal hash function
CW | Carter-Wegman, one-byte processing, variable-length input, 64-bit accumulator, 32-bit output
CW12 | Carter-Wegman, two-byte processing, 12-byte fixed input, 64-bit accumulator, 32-bit output
CW12-20 | Carter-Wegman, one-byte processing, 12-byte fixed input, 32-bit accumulator, 20-bit output
XOR12 | four-byte processing, 12-byte fixed input, 32-bit output
In addition to Perl, MD5, UHASH, and three variants of the Carter-Wegman construction, we also include a specialized function that simply XORs its input together, four bytes at a time. This simulates the sort of hash function used by many performance-paranoid systems.
Figure 7 shows the effects of changing the working set size on hash performance. All the hash functions are shown hashing 12-byte inputs, chosen from an array whose sizes have been chosen to fit within the L1 cache, within the L2 cache, and to miss both caches. The largest size simulates the effect that will be seen when the data being hashed is freshly read from a network buffer or has otherwise not yet been processed. We believe this most accurately represents the caching throughput that will be observed in practice, as hash values are typically only computed once, and then written into an object's internal field, somewhere, for later comparison.
As one would expect, the simplistic XOR12 hash function radically outperforms its rivals, but the ratio shrinks as the working set size increases. With a 6MB working set, XOR12's throughput is 50 MB/sec, whereas CW12-20 is 33 MB/sec. This relatively modest difference says that universal hashing, with its strong security guarantees, can approach the performance of even the weakest hash functions. We can also see that universal hashing is competitive with Perl's hash function and radically outperforms MD5.
As the cache hit rate increases with a smaller working set, XOR12 radically outperforms its competition. We argue that this case is unlikely to occur in practice, as the data being hashed is likely to incur cache misses while it is being read from the network hardware into main memory and then into the CPU. Secondly, we are microbenchmarking hash function performance; the hash function accounts for only a fraction of overall hash table runtime, which is itself only a fraction of application runtime. Of course, individual application designers will need to try universal hashing out to see how it impacts their own systems.
Some applications hash longer strings and require general-purpose hash functions. Figure 8 uses the 6MB working set and varies the length of the input to be hashed. We can no longer use the specialized 12-byte functions, but the other functions are shown. (We did not implement a generalization of our XOR12 hash function, although such a function would be expected to beat the other hash functions in the graph.) With short strings, we see the Perl hash function outperforms its peers. However, with strings longer than around 44-bytes, UHASH dominates all the other hash functions, due in no small part to its extensive performance tuning and hand-coded assembly routines.
We have some preliminary benchmarks from integrating universal hashing into Perl. We benchmarked the change with two Perl scripts, both of which do little other than hash table operations: the first is a histogramming program, and the second simply inserts text into a hash table. Our results indicate that the application-level performance difference between UHASH and Perl's default hash function is within plus or minus 10%.
We conclude that our customized Carter-Wegman construction, for short fixed-length strings, and UHASH, for arbitrary strings, are sufficiently fast that there is no excuse for them not to be used, or at least benchmarked, in production systems. Our code is available online with a BSD-style license.
We have presented algorithmic complexity attacks, a new class of low-bandwidth denial of service attacks. Where traditional DoS attacks focus on small implementation bugs, such as buffer overflows, these attacks focus on inefficiencies in the worst-case performance of algorithms used in many mainstream applications. We have analyzed a variety of common applications, finding weaknesses which range from increasing an application's workload by a small constant factor to causing the application to completely collapse under the load caused by its algorithm's unexpectedly bad behavior.
Algorithmic complexity attacks against hash tables, in particular, count on the attacker having sufficient freedom over the space of possible inputs to find a sufficient number of hash collisions to induce worst-case behavior in an application's hash table. When the targeted system is processing network packets, the limited address space of IPv4 offers some limited protection against these attacks, but future IPv6 systems will greatly expand an attacker's ability to find collisions. As such, we strongly recommend that network packet processing code be audited for these vulnerabilities.
Common applications often choose algorithms based on their common-case behavior, expecting the worst-case to never occur in practice. This paper shows that such design choices introduce vulnerabilities that can and should be patched by using more robust algorithms. We showed how universal hashing demonstrates impressive performance, suitable for use in virtually any application.
While this paper has focused on a handful of software systems, an interesting area for future research will be to study the algorithms used by embedded systems, such as routers, firewalls, and other networking devices. For example, many ``stateful'' firewalls likely use simple data structures to store this state. Anyone who can learn the firewall's algorithms may be able to mount DoS attacks against those systems.
This work is supported by NSF Grant CCR-9985332 and Texas ATP grant #03604-0053-2001 and by gifts from Microsoft and Schlumberger.