A context switch is a computer science/computer engineering term for the actions taken to remove a process from a CPU and place another process onto it. A context switch involves saving the state of the old process to some form of storage, then restoring the saved state of the new process and placing it on the processor. It is an important concept in multiprogramming, and the ability to perform context switches greatly increases the efficiency of a computer system in almost all circumstances.
There are a few things that make context switches costly in terms of performance. One of the most commonly discussed (at least in my classes) but least important is that the registers in the processor must be saved to some manner of storage, along with the program state of the currently running process. This can take a noticeable amount of time, given how slow DRAM is compared to cache on a modern system, but overall it is a predictable speed hit that is workable. Some architectures (such as the PDP-11, I think) have an instruction that does this whole dump and archive as one atomic operation, which greatly reduces the overhead.
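To make the save-and-restore concrete, here is a small user-space illustration using the POSIX <ucontext.h> interface (marked obsolescent in the standard, but still available on Linux). This is not a kernel context switch, but swapcontext() performs exactly the register dump-and-restore described above in a single call; the stack size and the names here are just for the demo.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, worker_ctx;
    static char worker_stack[64 * 1024];   /* stack for the second context */

    static void worker(void) {
        puts("worker: running on restored register state");
        /* Save the worker's registers and reload main's saved state. */
        swapcontext(&worker_ctx, &main_ctx);
    }

    int main(void) {
        /* Capture the current CPU state, then repoint it at its own
           stack and entry function. */
        getcontext(&worker_ctx);
        worker_ctx.uc_stack.ss_sp = worker_stack;
        worker_ctx.uc_stack.ss_size = sizeof worker_stack;
        worker_ctx.uc_link = &main_ctx;
        makecontext(&worker_ctx, worker, 0);

        puts("main: saving my state, loading the worker's");
        swapcontext(&main_ctx, &worker_ctx);   /* the "context switch" */
        puts("main: my state was restored, resuming here");
        return 0;
    }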
The real problem with context switching is cache state. When a process is first placed onto a processor, it suffers a long string of cache misses, which makes things very slow for that process until the cache fills back up. The switch also defeats locality and typically invalidates the translation lookaside buffer (TLB) used for logical memory addressing. As a simple example, if a process is looping through a single structure and gets taken off the processor, stalled, and then restarted, it will (likely) have to rebuild its cache state, which will make that loop considerably slower: because the process that was brought in has been using the cache, much of what the cache holds is useless to the original process when it is switched back in. Depending on the size of your various levels of cache, this can be softened, because a large cache won't be completely thrashed. Some computers have no level 3 cache at all and only a small level 2 cache; these sorts of computers exhibit the problem more readily.
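The cache-rebuild penalty is easy to observe from user space. The sketch below times a pass over a small array, evicts it by streaming through a large buffer (standing in for the other process's memory traffic), and times the small pass again; the second timing is typically several times worse. The buffer sizes are assumptions, so tune them to your machine's cache hierarchy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define HOT  4096              /* 32 KB of longs: fits in L1/L2 */
    #define COLD (8 * 1024 * 1024) /* 64 MB of longs: larger than most last-level caches */

    static volatile long sink;     /* defeat dead-code elimination */

    /* Time one summing pass over the array, in nanoseconds. The
       volatile qualifier keeps the compiler from deleting the loads. */
    static double time_pass(volatile long *a, size_t n) {
        struct timespec t0, t1;
        long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = sum;
        return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void) {
        volatile long *hot  = malloc(HOT  * sizeof(long));
        volatile long *cold = malloc(COLD * sizeof(long));
        if (!hot || !cold) return 1;
        for (size_t i = 0; i < HOT;  i++) hot[i]  = i;
        for (size_t i = 0; i < COLD; i++) cold[i] = i;

        time_pass(hot, HOT);                  /* warm the cache */
        double warm = time_pass(hot, HOT);    /* data is cache-resident */
        time_pass(cold, COLD);                /* evict it, as a competing process would */
        double rebuilt = time_pass(hot, HOT); /* must refetch from DRAM */

        printf("warm pass:      %8.0f ns\n", warm);
        printf("after eviction: %8.0f ns\n", rebuilt);
        return 0;
    }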
This is even worse in a multiprocessor system, because processes may be moved from one processor to another quite often. While a single-processor system may still have some remnants of the old state floating around in cache, it is unlikely that any useful cache state exists on an entirely different processor. If the scheduler takes processor affinity into account, as modern schedulers do, this particular speed hit is avoided much of the time. Other tricks, such as pinning a process to one CPU (sketched below), can also reduce the cache thrashing, but the problem still remains.
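On Linux, the pinning trick looks like this: sched_setaffinity() restricts a process to a chosen CPU so the scheduler cannot migrate it away from its warm cache. Modern schedulers already prefer the previous CPU (soft affinity); this just makes the preference hard. The call is a real Linux interface, but treat this as an illustrative sketch rather than a tuning recommendation.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);              /* permit execution on CPU 0 only */

        /* A pid of 0 means "the calling process". */
        if (sched_setaffinity(0, sizeof mask, &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        puts("pinned to CPU 0; cache and TLB state stay on one core");
        return 0;
    }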
The other, smaller speed hits lie in things like interrupt processing (the timer interrupt that triggers the scheduler, for example), which is necessary for multiprogramming to occur.
References: Computer architecture classes at my university.