Many parts of the code are purely functional in C but have side effects in the transpiled version.
Look, ultimately what we have is a C version and a cgo version that are roughly the same speed, and a transpiled version that is 1/6th the speed, and the caller can be on any thread in any of those. Then there's a different API, where the caller has to manage storage, that's on par with the first two, but that's not the same thing. "If you jump through these hoops you can be on par" is a different claim from the one the blog author made.
Now, it's possible that the wrapper functions could be made fast by storing the TLS object in Go thread-local storage, so the API stays the same, but the author didn't do this.
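To make that concrete, here is a rough sketch of the kind of wrapper I mean. Everything in it is made up for illustration: `tlsState`, `checksumWith` and `Checksum` are hypothetical names, not anything from the blog post. Go also doesn't expose real thread-local storage, so I'm using `sync.Pool` as the nearest stand-in. Whether this would actually close the speed gap is untested.

```go
package main

import (
	"fmt"
	"sync"
)

// tlsState is a hypothetical stand-in for the per-thread state object
// that the transpiled code needs on every call.
type tlsState struct {
	scratch [256]byte
}

// statePool hands out reusable state objects. This is an assumption:
// sync.Pool is used in place of thread-local storage, which Go lacks.
var statePool = sync.Pool{
	New: func() any { return new(tlsState) },
}

// checksumWith is the "caller manages storage" shape of the API.
// It sums up to len(st.scratch) bytes of data using the supplied state.
func checksumWith(st *tlsState, data []byte) uint32 {
	n := copy(st.scratch[:], data)
	var sum uint32
	for _, b := range st.scratch[:n] {
		sum += uint32(b)
	}
	return sum
}

// Checksum keeps the original call shape (no state argument) by
// borrowing a state object from the pool and returning it afterwards.
func Checksum(data []byte) uint32 {
	st := statePool.Get().(*tlsState)
	defer statePool.Put(st)
	return checksumWith(st, data)
}

func main() {
	fmt.Println(Checksum([]byte("abc"))) // prints 294
}
```

The point is only that callers of `Checksum` never see the state object, so the API matches the C and cgo versions, while the per-call allocation goes away in the common case.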