Software which deliberately modifies the machine instructions comprising itself in memory. This technique obviously only works when writing in assembly language. It is of questionable value unless working in an environment where memory is so restricted that incredibly tight code is a necessity. It is a bitch to debug.

A simple example: say you have a program containing a variable foo which is only ever used by two instructions, one of which loads it into a register, the other of which increments it (we'll assume the architecture has an inc instruction which can be applied to a memory location):

start:	load r1, foo
	; other stuff...
	inc foo
	jmp start
Then instead of wasting memory by having foo stored somewhere outside the code, we can instead change things so that the load instruction contains a literal value (which we set to foo's initial value), and the inc instruction references the actual memory location of the load instruction's operand:
start:	load r1, 42
	; other stuff...
	inc (start+1)
	jmp start
Thus after execution of inc (start+1) the load instruction stored in memory will actually be load r1, 43--the code has self-modified.
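The same effect can be seen in a high-level language by simulating the machine. This is a minimal Python sketch: the instruction encoding and all names are invented for illustration, with instructions held in a mutable "memory" so the inc instruction can rewrite the load instruction's operand field in place.

```python
# Toy machine: instructions live in mutable memory, so inc can rewrite
# the operand field of the load instruction at memory[0].
memory = [
    ["load", "r1", 42],   # start: load r1, 42  (literal operand)
    ["inc", (0, 2)],      # inc (start+1): bump memory[0]'s operand field
]
registers = {"r1": 0}

def step(instr):
    op = instr[0]
    if op == "load":
        registers[instr[1]] = instr[2]
    elif op == "inc":
        addr, field = instr[1]
        memory[addr][field] += 1   # the self-modification

# three passes through the loop body
for _ in range(3):
    for instr in memory:
        step(instr)
```

Each pass loads the current operand into r1 and then increments the operand stored inside the load instruction itself, so the "code" the next pass executes is different: after three passes r1 holds 44 and the load instruction reads load r1, 45.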

This, of course, is a very weak example. Much more interesting things happen when you start modifying the actual opcodes, so that completely different instructions will be executed on each pass through the loop.

This technique is sometimes used in polymorphic viruses. It's fairly common to see one modify its own code to remain as small as possible, or change its signature. It could also be used to disguise the code as something harmless as a virus scanner passes by...

Self modifying code is a programming technique where the program modifies itself as it runs. The technique is generally frowned upon except when used in extremely limited ways, and has been largely made impossible, undesirable, or useless by modern computer architectures. Self modifying code was most useful on architectures with a very limited number of registers and limited (less than 64K) RAM.

  • ways to self modify code:
    store loop index in instruction
    save memory & registers
    modify instruction as a flag
    replace NOPs with instructions or vice versa to add or remove operations
  • Problems with self modifying code when used fully
    • Self modifying code can be difficult to read. Sometimes this was done intentionally, as job security or as part of copy protection to make cracking the software harder.
    • self modifying code can be tricky to debug, since it may do different things each time you run it
    • self modifying code is tricky to reuse, since it is not reentrant; what one run does depends on what the last one did
  • current architecture obstacles
    cpu instruction cache
    Instructions that are modified in memory are not modified in the cpu instruction cache, and thus the modifications are ignored until the cache line expires. This could be exploited, of course, but then you have to totally understand how the instruction cache works.
    read only text segments
    Executable code in memory may be marked as read only by the operating system so it can be shared...
    shared text segments
    Executable pages may be shared between separate processes, and thus modifying one page would affect other users' processes. This is generally not allowed in multiuser operating systems.
    compiled code vs. machine language
    The instructions generated by the compiler are not necessarily known when the code is written, making it difficult to modify code that isn't generated yet.
  • modern uses of self modifying code
    runtime linker
    The linker may patch unresolved jump statements in a jump table or in the code itself at or immediately before runtime; an unresolved symbol may be expressed as a jump to a routine that would backpatch the original jump to the correct address, thus allowing demand linking.
    patch kernel to match cpu features available (fpu, etc.)
    The Linux kernel does (or at one time did) include cpu instructions and features, such as math instructions, that were not available on all CPUs. When such an instruction is encountered the first time, a trap is generated and a handler patches the instruction into a more efficient subroutine call that emulates it, so subsequent executions avoid the trap.
    bank switching and overlays
    On-the-fly generation of temporary code which may load or switch banks to run another piece of code; this was especially popular on bank-switched machines, where the addressable memory was smaller than the available memory, and in systems that used overlays.
    overflow exploits
    Many security holes are exploited by using potential buffer overruns in buggy code and modifying either the stack or the running code, sometimes even by putting a trampoline on the stack.
    polymorphic viruses or stealth viruses
    So called "polymorphic viruses" work by modifying their own code to attempt to prevent virus checkers from finding them.
    genetic algorithms
    Genetic algorithms are inherently self modifying; "code" fragments are mixed, matched, and mutated by a search algorithm (random search is common) until an ideal combination is found.
  • Structured languages have better methods that give the same advantages of self modifying code without actually modifying existing code:
    eval
    Many languages, especially interpreted languages, have eval, which takes a pregenerated string and runs it as program code, thus generating new code rather than modifying existing code.
    function pointers
    Rather than modifying code in place, the code is invoked through a function pointer (an indirect jump in assembly) which is given a value at runtime. This has the advantage that type checking can still be done, but may be less efficient on some architectures.
    dynamic linking, using DLLs to add functions
    Some operating systems have support for linking in additional code at runtime, either via the use of function pointers to activate the code once linked in, or via unresolved symbols that cause the additional code to be automatically linked. (This uses the same mechanism as shared libraries.)
    function overloading
    Some object oriented languages allow functions to be overloaded (defined multiple times in different ways), and the linking of overloaded functions may actually change at runtime depending on what modules are loaded or the current context.
    thunk or closure or lazy evaluation
    Some languages (Java, Lisp, Perl, others) allow code to be stored in or with a variable; the key is that the thunk may be created and passed to another piece of code (carrying along with it some of its execution environment) where it is later executed, similar to a trampoline.
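The runtime-linker item above (an unresolved symbol that backpatches itself on first use) can be sketched in Python. The library, symbol names, and jump-table layout here are all invented for illustration:

```python
# Lazy binding sketch: the jump table starts out pointing at a resolver
# stub; the first call backpatches the slot with the real routine, so
# later calls go straight to it.
def real_sqrt(x):
    return x ** 0.5

library = {"sqrt": real_sqrt}   # symbols the "linker" can resolve

jump_table = {}

def make_stub(name):
    def stub(*args):
        target = library[name]      # resolve the symbol...
        jump_table[name] = target   # ...backpatch the table slot...
        return target(*args)        # ...and tail into the real routine
    return stub

jump_table["sqrt"] = make_stub("sqrt")

first = jump_table["sqrt"](9.0)    # routed through the stub; patches slot
second = jump_table["sqrt"](16.0)  # now calls real_sqrt directly
```

This is the same demand-linking trick described above, and the same shape as the kernel trap-and-patch item: pay the resolution cost once, then the "code" that runs afterwards is different.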
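The genetic-algorithms item above, reduced to its core: here the "code" is just a bit string, and the target, fitness function, and mutate-and-select search are toy stand-ins for a real GA.

```python
# Minimal mutate-and-select search toward an invented "ideal" program.
import random

random.seed(0)                 # fixed seed keeps the run repeatable
TARGET = [1] * 16              # the ideal combination we search for

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome):
    child = genome[:]
    i = random.randrange(len(child))
    child[i] ^= 1              # flip one "instruction"
    return child

genome = [0] * 16
while fitness(genome) < len(TARGET):
    child = mutate(genome)
    if fitness(child) > fitness(genome):   # keep only improvements
        genome = child
```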
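Three of the structured alternatives above, side by side in Python; all the names here are invented for illustration:

```python
# 1. eval/exec: generate new code from a string instead of patching old code
source = "def triple(x):\n    return 3 * x\n"
namespace = {}
exec(source, namespace)
nine = namespace["triple"](3)

# 2. function pointers: an indirect jump whose target is set at run time
def add(a, b):
    return a + b

def mul(a, b):
    return a * b

operation = add              # the "pointer" gets its value at run time
result_add = operation(2, 3)
operation = mul              # retargeted without touching any code
result_mul = operation(2, 3)

# 3. thunk/closure: code stored with part of its environment, created in
#    one place and executed later somewhere else
def make_counter():
    count = 0
    def bump():
        nonlocal count
        count += 1
        return count
    return bump

counter = make_counter()
counter()
counter()
third = counter()
```

In each case new behaviour appears at run time, but no existing instruction is ever overwritten.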

    This was brought to you by the Save Our Archaic Technical Terms Society.

Self-modifying code is not limited to assembly.

A couple of high-level languages offer the (usually evil) choice to change your code at run time.

The best example is most likely COBOL. The ALTER statement works in the following way:

.  (some thousands of lines snipped)


. (some more lines snipped)

        MOVE 0 TO MY-SUM.
        READ INFILE.

What it does is simple: read records from an input file until the sum of the values in a certain field of the input records is greater than 700.

The ALTER statement changes the statement


during run-time.

Most people who wrote COBOL compilers apart from IBM decided not to implement ALTER.

Some lesser-known languages like NODAL and POCAL also offered self-modifying code features, easier to implement because both were interpreted languages.

A sample in POCAL:

10.10 SET I=0
10.20 SET POCLIN(10.10)="SET I=" I+1
10.30 DO 20!30
10.40 TYPE "Program used " I " times" !
10.50 END


30.10 TYPE "Saving failed - tough luck !"
This program not only modifies itself, it also saves itself back to disk.

The conclusion might be that languages which allow for self-modifying code are also self-obfuscating.

One fairly interesting application of self-modifying code is in copy protection. It's possible, with some knowledge of assembler and a good debugger and/or disassembler, to change the code of a crippled shareware (or a commercial) program to unlock all the features. Apparently1 self-modifying, self-encrypting code is one of the techniques against these brute-force cracks. Essentially, the code for the advanced features of the program is encrypted. The license or the serial number provides the decryption key, which makes it possible to decrypt the code containing the more advanced features. This would make cracking practically impossible.
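The scheme can be sketched in a few lines of Python. The XOR cipher, the serial number, and the feature code below are toy stand-ins (a real scheme would use a proper cipher and native code, not exec):

```python
# The vendor ships only the encrypted form of the advanced feature;
# the correct serial number reconstructs the key and unlocks it.
SECRET_SOURCE = "def advanced_feature():\n    return 'unlocked'\n"

def xor_bytes(data, key):
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"SERIAL-1234"
encrypted = xor_bytes(SECRET_SOURCE.encode(), key)

# ... later, at run time, with the serial number the user entered:
decrypted = xor_bytes(encrypted, b"SERIAL-1234").decode()
namespace = {}
exec(decrypted, namespace)
unlocked = namespace["advanced_feature"]()
```

Without the right key, no amount of patching the visible code reveals the feature, because the feature's instructions simply aren't there in usable form.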

1I must admit, this is little more than theory to me. This information was gleaned from a Russian book called Theory and Psychology of Hacker Attacks. If I have messed up somewhere please let me know and I'll correct it.
Oddly, no-one has mentioned Lisp. Lisp is the original self-modifying programming language. Sometimes it's elegant, sometimes it's just dirty. Lisp code is just data, so you could store the text of a function in a variable, modify it as needed, and then redefine the function in the global context if you want; or keep the compiled version of the function in a variable and call that (as ever in Common Lisp, other options probably exist). Alternatively, to access the text of a function foo, you could do:

(function-lambda-expression #'foo)

If you define a function:

(lambda (x) (foo x))

If you then change the definition of foo, the above function will call the new version of foo. This still works if the function is compiled.

All behaviour tested under Corman Lisp version 1.42

The standard for Common Lisp allows function-lambda-expression to return nil whenever it wants.
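The same late binding can be seen outside Lisp. A minimal Python analogue (function names invented): bar looks up foo by name at each call, so redefining foo changes bar's behaviour without touching bar.

```python
def foo(x):
    return x + 1

def bar(x):
    return foo(x)      # the name foo is resolved at call time

before = bar(10)       # uses the first foo -> 11

def foo(x):            # rebind foo; bar itself is untouched
    return x * 2

after = bar(10)        # bar now calls the new foo -> 20
```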
