What a RISC CPU actually does is execute each of its instructions in (ideally) one cycle. Instructions are also of a fixed length, commonly 32 bits regardless of the width of the data path; the 32-bit PowerPC G3 uses 32-bit instructions, and so does the 64-bit Alpha. Hence, aside from copies to and from off-CPU memory, which may take longer (unless your memory is synchronous and as fast as your CPU), no operation you execute should take longer than one cycle. This lends itself much more readily to parallel execution (see: superscalar) than a CISC architecture does, because you know exactly how long every operation will take to complete, and reassembling the results in the proper order is no trouble. If you have a superscalar architecture, which means your instruction decoder can dispatch to more than one unit in a cycle, and you have multiple blocks of computing logic in your CPU, then you can pass multiple instructions off at once. Superscalar CISC designs get into all kinds of crazy long pipelines which get complicated (and potentially inefficient) quickly.
CISC CPUs (like the Intel x86 series) have fruity instructions like STOSW (store string word), which is basically a mnemonic for MOV WORD PTR ES:[DI],AX followed by INC DI twice. (Literally: move the contents of AX, an entire 16-bit general-purpose register, into the word-sized memory location addressed by ES (Extra Segment) and the offset in DI (Destination Index, wrapped in brackets to dereference it), then increment DI twice so that, in a perfect world, ES:DI now points at the next word.) RISC chips tend to have simpler instructions, keeping things down to plain loads and stores. More complex instructions like STOSW (which really gain you nothing except some brevity) are removed in favor of the DIY approach of spelling out each step explicitly. This doesn't really present a problem, as you can create procedures (subroutines, which are invoked on x86 with the CALL instruction) or macros, which can be even simpler to use.
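To make that equivalence concrete, here's a sketch in NASM-style 16-bit assembly (it assumes the direction flag is clear, so the string operation moves forward):

    ; the one-instruction CISC version:
    stosw                 ; store AX at ES:DI, then advance DI by 2

    ; the spelled-out equivalent:
    mov [es:di], ax       ; copy AX into the word at ES:DI
    inc di                ; step past the low byte...
    inc di                ; ...and the high byte; ES:DI now names the next word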
Generally speaking, CPUs have a broad enough system bus (AKA the memory bus) to receive an instruction every cycle. There have been some exceptions, but that is the general pattern. Some CPUs (like early x86 CPUs) have more address lines than their 16-bit registers can span, however, and this is why they have a segment and an offset. Your offset can only reach 64K (or, when treated as a signed integer, -32768 to 32767, due to how signed values are stored -- see two's complement), so a "near" jump, which only takes one operand (the offset) rather than two (the segment and the offset), can only land within that range. Therefore, a larger binary is potentially slower not only during load (more to load into memory) but also during execution (you have to make longer jumps).
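For the curious, here's roughly how the 8086 stretches 16-bit registers across its 20-bit address space, plus what near versus far looks like, sketched in NASM-style assembly (the label and values here are made up for illustration):

    ; physical address = (segment * 16) + offset, giving 20 bits total
    ; e.g. ES = 0x1234 and DI = 0x0010 names physical address 0x12350

    jmp short nearby      ; near jump: one operand, only IP changes
    jmp 0x2000:0x0100     ; far jump: two operands, reloads CS and IP
    nearby: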
Most of what a computer does is based on addition; even subtraction is just addition in disguise. (Again, see two's complement.) A basic adder circuit is as simple as having an AND and an XOR (exclusive or) in the same circuit. Both the AND and the XOR are hooked up to the same two inputs. The output of the XOR is the sum (the digit we write in this place) and the output of the AND is the carry (what we carry to the next digit). This is actually called a half adder because it only does half the job: it's suitable for the least significant digit, because nothing ever carries into it, but it won't work for any other. A full adder is made up of five gates (two XORs, two ANDs, and an OR, in the usual arrangement), and I won't bother to go into it here.
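You can watch a half adder work using nothing but those two bitwise instructions; here's a sketch in NASM-style assembly, assuming the two input bits arrive in AL and BL:

    mov cl, al            ; keep a copy of the first input bit
    xor cl, bl            ; CL = sum bit: 1 when exactly one input is 1
    and al, bl            ; AL = carry bit: 1 only when both inputs are 1

Feed it 1 and 1 and you get sum 0, carry 1 -- which is just "1 + 1 = 10" in binary.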
Without an FPU (floating point unit), multiplication and division are only repeated addition and subtraction, whether that's done at the CPU level or the assembler level, and as such you derive no benefit in an integer-only CPU from having a MUL (multiply) instruction, so we could even remove that. The only important instructions in any instruction set are those which have a 1:1 relationship between a function and an instruction. As we saw above, STOSW is made up of three instructions: a MOV and two INCs. (INCing twice is faster than ADDing 2, at least on the early x86 CPUs.)
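As a sketch of what "multiplication is repeated addition" means in practice, here it is as a loop of ADDs in NASM-style 16-bit assembly (registers chosen arbitrarily; it computes AX = BX times CX for unsigned values):

        xor ax, ax        ; running total starts at zero
        jcxz done         ; multiplying by zero? nothing to add
    addloop:
        add ax, bx        ; total += multiplicand
        loop addloop      ; decrement CX, repeat while it's nonzero
    done: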
So when we pare things down, we are left with a relatively small set of instructions. As many as 40-60% of the instructions in any given program will be compares (CMP on x86, or TEST, or simply the flags left behind by ADD, SUB, XOR, and so on) and conditional jumps. For example, the x86 CPU has JA (jump if above) and JB (jump if below), JZ (jump if zero) and JNZ (jump if not zero), among many other conditional jumps, all of which use the status register to decide whether they should jump to another address or not. We will also have PUSH and POP instructions for stack manipulation, which also adjust the SP (stack pointer) register: PUSH decrements it and POP increments it. Then we have our basic mathematical and logical operations (math is logical, so I suppose that's redundant) such as ADD, SUB, MUL, and DIV, which work on bytes or words (16 bits), and then the various bitwise operators AND, OR, XOR. We of course have MOV, which copies, not moves, pieces of memory around. This is everything a modern computer must do: you must have registers which store data you are about to do something with, you must have someplace to put data and get it back from so you can work with larger data sets than what will fit in your registers, and you must be able to carry out calculations. Finally, you must be able to carry out different instructions based on the results of those earlier instructions.
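Here's a little sketch that exercises nearly all of those at once: a compare-and-branch loop that sums 1 through 10, followed by a PUSH/POP round trip (NASM-style 16-bit assembly, registers chosen arbitrarily):

        xor ax, ax        ; running sum = 0
        mov cx, 1         ; counter
    next:
        add ax, cx        ; sum += counter (this sets the flags too)
        inc cx
        cmp cx, 11        ; compare counter against the limit...
        jb next           ; ...and jump back while it's below
        push ax           ; PUSH: SP -= 2, then store AX on the stack
        pop bx            ; POP: load BX from the stack, then SP += 2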
Just think: most of what a CPU does is based on adding, or on one of the three basic parts of logic: AND, OR, and NOT. Even an ADD is just made up of those basic things, and so really everything comes down to "if both of these things are affirmative, then so am I," "if either of these things is affirmative, then I am too," and "I'm just feeling contrary."