Are you a Linux user? Have you ever wondered what happens when your Linux box is out of memory? I'm not talking about "Hmmm, OpenOffice seems a bit slow" out of memory, or even "I have to restart the X server" out of memory, but the kind of "out of memory" that causes your computer to start throwing spastic fits. The kind of "out of memory" which generally indicates a Malicious Malloc in a disgruntled user's looping C program.
But I'm babbling.
What I'm trying to ask is, would you like to know what happens when there really is no more RAM and no more swap space left? If so, read on.
Unsurprisingly, it's the kernel (I'm referring to version 2.4.20 here) which really holds the reins when it comes to memory management, and as such it also handles extreme memory situations - one of which is being totally out of memory.
The thing which handles a no memory situation is known as the OOM killer. You can probably guess what the OOM stands for, and you'll see why it's a "killer" in a minute. The code which makes up the OOM killer is located in linux/mm/oom_kill.c. Go ahead and take a look; it's probably one of the cleanest and easiest to understand bits of code in the kernel.
But what does it actually do? Let's break it down into its functions and see what each of them does.
The OOM killer's functions
- unsigned int int_sqrt(unsigned int x)
- int badness(struct task_struct *p)
- struct task_struct * select_bad_process()
- void oom_kill_task(struct task_struct *p)
- void oom_kill()
- void out_of_memory()
And that's pretty much it. I've listed them in the order they appear in the source code, but that's not quite the order in which they're used. In fact, they're generally called in the reverse of the above order: out_of_memory goes first, checking to make sure that we really have run out of memory.
First: are we really out of memory?
The OOM killer can take pretty drastic steps to stop the machine from keeling over, so we want to make sure that the system is actually ready to collapse before we do anything big. So, we check for swap space. If there's any swap space, we're not out of memory. If we've had a long gap between temporary out-of-memory warnings, then we're not out of memory. If we haven't been short of memory for long enough, we're not out of memory. The point is, we have to wait a surprisingly long while before breaking out the kernel equivalent of a defibrillator.
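The chain of "not really out of memory" tests can be sketched in user-space C. This is only an illustration of the idea, not the kernel's code: the names really_out_of_memory and struct oom_state are made up, and every threshold below is an assumed value picked for the example.

```c
#include <stdbool.h>

/* All of these thresholds are assumptions made for this sketch. */
#define TICKS_PER_SEC 100UL                  /* stand-in for a clock tick rate */
#define LONG_GAP      (5 * TICKS_PER_SEC)    /* "a long gap" between warnings */
#define MIN_DURATION  (1 * TICKS_PER_SEC)    /* "long enough" to be in trouble */
#define MIN_FAILURES  10UL                   /* warnings needed before acting */

/* Hypothetical bookkeeping kept between calls. */
struct oom_state {
    unsigned long first;  /* tick at which the current run of warnings began */
    unsigned long last;   /* tick of the most recent warning */
    unsigned long count;  /* warnings counted in the current run */
};

/* Returns true only when every "we're not out of memory" test has failed. */
bool really_out_of_memory(struct oom_state *s, unsigned long now,
                          unsigned long free_swap_pages)
{
    unsigned long since_last;

    /* Any swap space left? Then we're not out of memory. */
    if (free_swap_pages > 0)
        return false;

    since_last = now - s->last;
    s->last = now;

    /* A long gap since the previous warning: start a fresh run. */
    if (since_last > LONG_GAP) {
        s->first = now;
        s->count = 0;
        return false;
    }

    /* We haven't been in trouble for long enough yet. */
    if (now - s->first < MIN_DURATION)
        return false;

    /* Too few warnings in this run to be sure. */
    if (++s->count < MIN_FAILURES)
        return false;

    /* Really out of memory: reset the run and tell the caller to act. */
    s->first = now;
    s->count = 0;
    return true;
}
```

A caller would invoke this on every failed allocation and only go hunting for a victim when it returns true.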
The function that does this is out_of_memory. If out_of_memory decides that OOM has to do something, it then calls oom_kill, which now finds itself being run. First it has to find some way to free up memory, and that way is to kill a process. But first it has to pick a process. It outsources this task to...
select_bad_process
This function runs through the list of all tasks running and gets a score on how much we have to gain by killing it. It ranks them all in descending order to find which one we will benefit from killing the most (the one with greatest score), and it returns the process which OOM decides needs to be sacrificed the most. However, select_bad_process doesn't do the number-crunching to find the points score of each process. That is the job of...
badness
The only argument badness takes is a pointer to a task structure; basically something which stores a process' identifying information. badness takes this information and uses it to work out what we have to gain from killing it. It assigns it an integer point score, based on the following things in order:
- the memory used by the process (this forms the basis of the score)
- how long the process has been running: divide the current score by the square root of the CPU time used, then by the square root of the square root of the run time
- the priority of a process; if it has a high nice value (i.e. low priority) then double its score since it's probably unimportant
- if it's a superuser process, divide its score by four
- if the process is accessing hardware directly, divide its score by four
Having done all this, an integer score pops out which is then returned by badness to select_bad_process so it can compare processes.
Back in select_bad_process, now that badness has returned points scores for each process currently running, we order them to find out what process will be deaded. We pass back a structure containing this process' information to oom_kill. Note that by the time this information reaches oom_kill, we may have decided we can't kill anything with some degree of safety so it gets NULL.
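The scoring rules for badness, and the way select_bad_process uses them, can be sketched in user-space C. Treat this as a rough model rather than the kernel's code: struct task and its fields are a simplified stand-in for the real task structure, the time units are arbitrary, and the helper names (isqrt_nonzero, badness_sketch, select_bad_sketch) are invented for the example. The sketch tracks a running maximum instead of sorting, which picks the same top scorer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's task structure. */
struct task {
    int pid;
    unsigned int mem_pages;  /* memory used by the process */
    unsigned int cpu_time;   /* CPU time used, arbitrary units */
    unsigned int run_time;   /* time since the process started, arbitrary units */
    int nice;                /* > 0 means low priority */
    bool superuser;
    bool raw_hw_access;      /* process accesses hardware directly */
};

/* Integer square root by bisection; returns at least 1 so it is a safe divisor. */
static unsigned int isqrt_nonzero(unsigned int x)
{
    unsigned int lo = 0, hi = 65535;
    while (lo < hi) {
        unsigned int mid = lo + (hi - lo + 1) / 2;
        if ((unsigned long long)mid * mid <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo ? lo : 1;
}

/* Apply the five rules in order and return the resulting score. */
int badness_sketch(const struct task *t)
{
    unsigned int points = t->mem_pages;                   /* basis of the score */

    points /= isqrt_nonzero(t->cpu_time);                 /* sqrt of CPU time */
    points /= isqrt_nonzero(isqrt_nonzero(t->run_time));  /* 4th root of run time */

    if (t->nice > 0)
        points *= 2;   /* low priority: probably unimportant */
    if (t->superuser)
        points /= 4;   /* superuser: probably important */
    if (t->raw_hw_access)
        points /= 4;   /* direct hardware access: risky to kill */

    return (int)points;
}

/* Return the task with the highest score, or NULL if nothing scores above 0. */
const struct task *select_bad_sketch(const struct task *tasks, size_t n)
{
    const struct task *chosen = NULL;
    int max_points = 0;

    for (size_t i = 0; i < n; i++) {
        int points = badness_sketch(&tasks[i]);
        if (points > max_points) {
            max_points = points;
            chosen = &tasks[i];
        }
    }
    return chosen;
}
```

With identical memory and timing figures, a niced task ends up with twice the score of a plain one, while a superuser task with hardware access ends up with a sixteenth of it, so the niced task is the one selected.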
oom_kill will now check to see whether it's got a process to kill, or a NULL. If it's a NULL, we couldn't find anything decent to kill and as such are SOL; as a result we have to do a kernel panic to say that we can't kill anything and we're OOM. Otherwise, we call oom_kill_task to kill our sacrificial process, along with any other tasks sharing its memory (its threads, in other words).
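The decision oom_kill makes can be sketched in user-space C. This is a toy model with invented names (struct victim_task, oom_kill_sketch, enum oom_result). In particular, it assumes the tasks that go down with the victim are those sharing its address space, modelled here by a made-up mm_id field, and it reports a would-be panic as a return value instead of halting.

```c
#include <stddef.h>

/* Hypothetical stand-in for a task: just enough to show the control flow. */
struct victim_task {
    int pid;
    int mm_id;   /* tasks with the same mm_id share one address space (assumed) */
    int killed;  /* set to 1 once the task has been told to die */
};

enum oom_result { OOM_PANIC, OOM_KILLED };

/*
 * chosen is whatever the selection step handed back: a task, or NULL
 * if nothing could be killed safely.
 */
enum oom_result oom_kill_sketch(struct victim_task *tasks, size_t n,
                                const struct victim_task *chosen)
{
    int mm;

    /* Nothing killable: the real thing panics here; the sketch reports it. */
    if (chosen == NULL)
        return OOM_PANIC;

    /* Take down the victim and everything sharing its address space. */
    mm = chosen->mm_id;
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].mm_id == mm)
            tasks[i].killed = 1;
    }
    return OOM_KILLED;
}
```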
We know what process we're going to kill, so we're on the home stretch of recovery. But first we throw out an error with printk to say what we're doing, then give the process a high priority and access to all the memory it wants so we can kill it really, really fast. If the process has hardware access, we send it a SIGTERM (signal 15) to be slightly more gentle, otherwise we kill -9 it.
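The steps in that paragraph can be sketched in user-space C. Everything here is illustrative: struct doomed, its fields, BOOSTED_TIMESLICE and oom_kill_task_sketch are invented names, the message text is made up, and the sketch only records which signal it would send rather than delivering one.

```c
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>

#define BOOSTED_TIMESLICE 500  /* assumed value: a generous scheduling boost */

/* Hypothetical model of the process we are about to kill. */
struct doomed {
    int pid;
    bool hw_access;         /* process accesses hardware directly */
    bool may_use_reserves;  /* allowed to dip into reserved memory while exiting */
    int timeslice;          /* bigger value = scheduled sooner and for longer */
    int signal_sent;        /* which signal the sketch would deliver */
};

void oom_kill_task_sketch(struct doomed *p)
{
    /* Say what we're doing (the kernel uses printk for this). */
    fprintf(stderr, "out of memory: killing process %d\n", p->pid);

    /* High priority and all the memory it wants, so it can exit fast. */
    p->timeslice = BOOSTED_TIMESLICE;
    p->may_use_reserves = true;

    /* Be gentler with processes that have hardware access. */
    p->signal_sent = p->hw_access ? SIGTERM : SIGKILL;
}
```

SIGTERM can be caught, which gives a process holding hardware a chance to clean up after itself; SIGKILL cannot.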
And hopefully we should now be in the clear with the memory we stole from the dead process. If not, we wait a bit and then we can try killing something else.
This part of the Linux kernel was written by Rik van Riel in 1998 and 2000, who claims to have been goaded into writing it and was inspired by Claus Fischer. Just a little heads up so you know who wrote this part of the kernel.
The contents of this writeup are in the public domain.