> I honestly don't know why erlang vm is faster than python vm.
One reason is that on platforms that support computed goto, when a BEAM file is loaded, the list of bytecodes for each function is translated into a list of addresses of the machine-code handlers that implement each bytecode (a.k.a. threaded code). Bytecode dispatch then skips one layer of indirection vs. CPython: CPython looks up each bytecode's handler address in an array every time it dispatches a bytecode. (Or, if the platform doesn't support computed goto, CPython falls back to a regular switch statement, which hopefully gets compiled to a jump table.)
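Roughly what that looks like in C, as a minimal sketch (not BEAM's actual loader): the opcode names and toy program are invented, and the `&&label` / `goto *addr` syntax is GCC's labels-as-values extension. The load-time loop replaces each bytecode with the address of its handler, so dispatch is a single indirect jump with no per-instruction table lookup.

```c
/* Minimal sketch of direct-threaded dispatch (BEAM-style), using GCC's
 * labels-as-values extension. Opcodes and program are invented; the point
 * is that the load-time pass replaces each opcode with the address of its
 * handler, so dispatch is one indirect jump with no table lookup. */
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

int main(void) {
    /* Handler addresses, taken with the &&label extension. */
    static void *handlers[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };

    /* "Bytecode" as it might sit on disk: computes 1 + 1 and prints it. */
    unsigned char bytecode[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    /* Load-time translation: bytecodes -> handler addresses (threaded code). */
    void *threaded[sizeof bytecode];
    for (size_t i = 0; i < sizeof bytecode; i++)
        threaded[i] = handlers[bytecode[i]];

    int stack[16], *sp = stack;
    void **ip = threaded;

    /* Dispatch: one indirect jump per instruction, no per-dispatch lookup. */
    goto **ip++;

op_push1: *sp++ = 1;              goto **ip++;
op_add:   sp--; sp[-1] += sp[0];  goto **ip++;
op_print: printf("%d\n", sp[-1]); goto **ip++;
op_halt:  return 0;
}
```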
Interesting, so it's very implementation-specific. Could CPython do something similar? Or is it the Erlang language semantics that allow for this optimization?
There's nothing stopping CPython from using a threaded interpreter, though doing it in C requires a compiler that supports GCC's nonstandard labels-as-values[0]/computed goto, plus undefined behavior (using goto to jump across functions).
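For contrast, here is roughly the shape of table-indexed computed-goto dispatch, which is what CPython's ceval loop already does when computed gotos are available; the opcodes and program are again invented. The difference from the threaded sketch above is the `dispatch_table[*ip++]` lookup on every single dispatch, which a load-time translation pass would eliminate.

```c
/* Sketch of table-indexed computed-goto dispatch, roughly the shape of
 * CPython's ceval loop with computed gotos enabled. The bytecode is
 * interpreted as-is, with a table lookup on every dispatch. */
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

#define DISPATCH() goto *dispatch_table[*ip++]

int main(void) {
    static void *dispatch_table[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };

    /* No load-time translation step: raw bytecode is the program. */
    unsigned char bytecode[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    unsigned char *ip = bytecode;

    int stack[16], *sp = stack;
    DISPATCH();

op_push1: *sp++ = 1;              DISPATCH();
op_add:   sp--; sp[-1] += sp[0];  DISPATCH();
op_print: printf("%d\n", sp[-1]); DISPATCH();
op_halt:  return 0;
}
```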
Threaded code is a pretty standard implementation strategy for some languages (notably Forth), and my understanding is that it was even a fairly common compiler implementation strategy in the 1970s/early 1980s. It's an easy optimization, particularly for a stack-based VM or a VM with a small number of registers.
The main downside is memory usage: plain bytecode can be memory-mapped directly from disk, shared across processes, and discarded by the OS under memory pressure rather than written out to swap, whereas the translated threaded code is per-process writable memory. There's obviously also a bit of startup overhead doing the translation at load time.
Though I'm a bit surprised that most stack-based interpreters don't initially load each function as a dead-simple threaded stub: a pointer to the start of a regular bytecode interpreter plus a pointer into a memory-mapped buffer of the on-disk bytecode. Based on performance counters, they could then JIT the hotspots into regular threaded code.
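A hypothetical sketch of that scheme follows; all names here (Function, interp_bytecode, translate_to_threaded, HOT_THRESHOLD) are invented for illustration, the interpreter and translator are just print stubs, and a plain call counter stands in for real performance counters.

```c
/* Hypothetical lazy scheme: every function starts out as "run the plain
 * bytecode interpreter over this (notionally mmapped) bytecode", and only
 * functions that get hot are translated into threaded code. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const unsigned char *bytecode;   /* would point into the mmapped file */
    size_t               len;
    void               **threaded;   /* NULL until the function gets hot */
    unsigned             call_count;
} Function;

enum { HOT_THRESHOLD = 3 };          /* unrealistically low, for the demo */

/* Stand-ins for a real table-lookup interpreter and a load-time translator
 * like the ones sketched earlier in the thread. */
static void interp_bytecode(const unsigned char *code, size_t len) {
    (void)code;
    printf("interpreting %zu bytes of shared bytecode\n", len);
}
static void interp_threaded(void **code) {
    (void)code;
    printf("running threaded code\n");
}
static void **translate_to_threaded(const unsigned char *code, size_t len) {
    (void)code;
    printf("translating %zu bytes to threaded code\n", len);
    return malloc(len * sizeof(void *));   /* private, non-shared copy */
}

static void call_function(Function *f) {
    if (f->threaded) {                       /* hot path: already threaded */
        interp_threaded(f->threaded);
        return;
    }
    if (++f->call_count >= HOT_THRESHOLD)    /* counter stands in for perf counters */
        f->threaded = translate_to_threaded(f->bytecode, f->len);
    interp_bytecode(f->bytecode, f->len);    /* cold path: shared bytecode */
}

int main(void) {
    static const unsigned char code[] = { 0, 1, 2, 3 };
    Function f = { code, sizeof code, NULL, 0 };
    for (int i = 0; i < 5; i++)              /* gets hot partway through */
        call_function(&f);
    free(f.threaded);
    return 0;
}
```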