The first portion of this writeup is based on [1], below. The second part is my own analysis.

Why did Slammer propagate so quickly?

Slammer's rate of infection (see Jetifi's writeup above) was unparalleled. The entire worm was 376 bytes. Infection was the simple matter of receiving a UDP packet containing the worm on the right UDP port of a vulnerable host. Upon infection, the worm would immediately start scanning IP addresses randomly (Almost; the PRNG was apparently coded incorrectly, leading to Slammer to scan only 1/4 of all IP addresses; this analysis also shows why this did little to limit the damage. Additionally, even with a correct PRNG, the seed was partially time-based the tremendous rate of infection would mean that many copies would use the same seed, leading to their using the same selection of IPs. This would act to lower the rate of infection during the peak spread of the worm.). The worm would transmit a UDP packet containing itself to each IP address, at the right port.

If a vulnerable host were present at the scanned IP address, it would immediately enter the same loop of scanning IPs randomly.

None of which is enough to explain the rate of infection: Slammer was (to quote [1]) "bandwidth-limited", not "latency-limited". That is, Slammer could (quite easily) saturate a 100Mbps ethernet link. The reason for this is the trivial nature of the UDP packets. TCP is a connection-oriented protocol: first a "connection" is set up, then data is streamed along it. Setting up a connection requires the TCP three-way handshake (aka SYN - ACK). Even if the remote host refuses a connection on the port, a round trip of packets is required. In other words, had Slammer utilized a TCP connection to spread, every attempt to infect an IP address would have to wait at least one round-trip time to execute. From where I'm sitting, pinging Google has a round-trip time of 1/4 second.

Slammer used UDP packets, without having to wait for a response from the host. These were literally "fire and forget". A 404-byte UDP packet was enough to infect -- and it immediately caused the infected hosts to fire 404-byte UDP packets, usually as fast as it could. Thus, Slammer wasn't limited by latency (the length of the pipe to the target), but by bandwidth (the width of the pipe, mostly at the source). On an unloaded machine (and Slammer hit during a weekend, of course), it could easily fill the entire 100 Mbps ethernet link with its traffic.

A TCP-based worm could also conceivably be bandwidth-limited, if it initiated numerous TCP connections in parallel. This would likely require rewriting portions of the TCP/IP stack, making the worm larger (threading isn't up to the task of handling so many simultaneous connection attempts; neither, for that matter, is the operating system). State would probably also have to be kept. Even if it worked (and someone were willing to code it all, the worm would be much larger -- saturating a 100Mbps link with a 400-byte worm means 25,000 IPs attacked every second; saturating it with a 40Kbyte worm means 250 IPs a second...

Why did Slammer spread at all?

Having explained above the extraordinary effectiveness of Slammer's infection, I'll now turn to the other side, and examine why it could spread at all.

Every machine infected by Slammer uses up a huge amount of bandwidth. Eventually, the machine will get stopped (even a reboot is enough to stop the current infection by Slammer -- though of course re-infection is a definite possibility). Either the sysadmin will switch it off, or it will be rebooted, or the ISP will disconnect it, or ... Let's assume an infected machine runs for 24 hours.

A simple probabilistic model for the early stages of infection is this: In early stages, we can ignore re-infection, as so few machines are already infected. An infected machine attacks IPs, infecting machines at some of them, until it is stopped. Assume that true random numbers are used (untrue, but reasonable), and also that IPs chosen by different machines are independent (untrue, and unreasonable; in truth, this factor lowers the infection rate even more). Finally, assume all machines in the network are alike -- connected by a 100Mbps link to the Internet (again, untrue -- but machines on dialups are probably not going to contribute significantly to infection, and machines with larger links are probably in heavy use, and won't manage to utilise much more than 100Mbps for worm traffic).

So the number of different machines infected by one machine is given by some distribution z on the natural numbers; this distribution is the same for all machines, and the infections of each machine are independent. The random process describing this scenario is the Galton-Watson process. Without going into much detail, the process dies out if the average number of "children" of an infected machine is below 1 (μ(z)≤1). If μ(z)>1, then there is some positive probability of infection continuing forever; in this case, the infection grows roughly exponentially, with rate μ(z).

OK, we're all set! An infected machine will send 25,000 packets (400 bytes each) a second, for a total of 100Mbps. In a day, it will scan around 2*109 IPs, hitting half the IPs on the Internet. Clearly, in this mode μ(z)>>1 if there are any vulnerable machines out there at all.

Suppose now we only allow 1 minute before an infected machine is shut down. An infected machine scans 1,500,000 IPs before being shut down. Recall that many IPs are unused (this is unfortunately true, regardless of the IP address shortage). If only 1 in 1000 IPs is used (i.e. ~2 million hosts on the Internet) and the probability of a host being vulnerable is 1/5000 (i.e., running a vulnerable SQL Server or similar product, not behind a firewall or behind a firewall not blocking the dangerous port), then μ(z)<1, and we're safe -- the infection will die off by itself. But, even with these optimistic numbers, if infected machines are allowed to run for 10 minutes, μ(z)=3.333... > 1, and we'll be hit badly (the number of infected machines will roughly triple every 10 minutes).

Slammer had it too easy. It could have been 10 or 100 times larger, maybe even utilizing some primitive TCP stack, and still have made an impact.

Bibliography

  1. The Spread of the Sapphire/Slammer Worm, David Moore (CAIDA & UCSD), Vern Paxson (ICIR & LBNL), Stefan Savage (UCSD CSE), Colleen Shannon (CAIDA), Stuart Staniford (Silicon Defense) and Nicholas Weaver (Silicon Defense & UC Berkeley EECS).