Ted Leung points out the interesting feature and proven results of the Squeak Smalltalk VM that are apparently not apparent in the state-of-the-industry JVM or CLR.
I learned a few things about Squeak. The Squeak VM runs on about 30 platforms and runs the same (bit for bit) image on all of them. Apparently the VM is auto-generated in some fashion. Given the effort required to port the Java VM to a different platform (I did some of the work to get Java running on the Newton at Apple), this is pretty impressive.Here's a description of how the Squeak VM is "auto-generated"...
The key to both practical performance and portability is to translate the Smalltalk description of the virtual machine into C. To be able to do this translation without having to emulate all of Smalltalk in the C runtime system, the virtual machine was written in a subset of Smalltalk that maps directly onto C constructs. This subset excludes blocks (except to describe a few control structures), message sending, and even objects! Methods of the interpreter classes are mapped to C functions and instance variables are mapped to global variables. For byte code and primitive dispatches, the special message dispatchOn:in: is mapped to a C switch statement. (When running in Smalltalk, this construct works by perform:-ing the message selector at the specified index in a case array; since a method invocation is much less efficient than a branch operation, this dispatch is one of the main reasons that the interpreter runs so much faster when translated to C).What's really interesting about this is the ability for anyone to extend Squeak using the same technique. Do you want to try making a stretch of your Smalltalk go faster? Write it using the subset language and generate the C, essentially extending the "VM" in an application-specific way.
The translator first translates Smalltalk into parse trees, then uses a simple table-lookup scheme to generate C code from these parse trees. There are only 42 transformation rules, as shown in Table 3. Four of these are for integer operations that more closely match those of the underlying hardware, such as unsigned shifts, and the last three are macros for operations so heavily used that they should always be inlined. All translated code accesses memory through six C macros that read and write individual bytes, 4-byte words, and 8-byte floats. In the early stages of development, every such reference was checked against the bounds of object memory.
Our first translator yielded a two orders of magnitude speedup relative to the Smalltalk simulation, producing a system that was immediately usable. However, one further refinement to the translator yielded a significant additional speedup: inlining. Inlining allows the source code of the virtual machine to be factored into many small, precisely defined methods, thus increasing code-sharing and simplifying debugging, without paying the penalty in extra procedure calls. Inlining is also used to move the byte code service routines into the interpreter byte code dispatch loop, which both reduces byte code dispatch overhead and allows the most critical VM state to be kept in fast, register-based local variables. All told, inlining increases VM performance by a factor of 3.4 while increasing the overall code size of the virtual machine by only 13%.