I've only dabbled in Taichi, but I've found its magic has limits. I took one of the provided examples, just increased the length of the loop and bam! It crashed the Windows driver. Obviously it ran out of memory, but I have no idea how to adjust for that except by experimenting with different values. If it has information about the GPU and its memory, I thought it could adjust the block size automatically, but apparently not. There is a config command to fine-tune how the for loop is parallelized, but the docs say we normally don't need to use it.
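
For what it's worth, the manual workaround I have in mind looks roughly like the sketch below: split the long loop into pieces that each fit a memory budget and launch them one at a time. This is plain Python, not a Taichi API. `chunk_ranges`, the 16 MiB budget and `my_kernel` are all made-up names and numbers for illustration, and the right budget for a given GPU is still something you would have to find by experiment.

```python
def chunk_ranges(total_items, bytes_per_item, budget_bytes):
    """Split range(total_items) into (start, stop) pieces that each fit the budget."""
    if total_items < 0 or bytes_per_item <= 0 or budget_bytes < bytes_per_item:
        raise ValueError("budget must hold at least one item")
    # Largest number of items whose combined size stays within the budget.
    per_chunk = budget_bytes // bytes_per_item
    return [(start, min(start + per_chunk, total_items))
            for start in range(0, total_items, per_chunk)]


# Hypothetical numbers: 10 million float32 values, 16 MiB working budget.
chunks = chunk_ranges(10_000_000, 4, 16 * 1024 * 1024)
for start, stop in chunks:
    pass  # one launch per chunk would go here, e.g. a hypothetical my_kernel(start, stop)
print(len(chunks), chunks[0], chunks[-1])
```

Each launch then covers at most `budget_bytes // bytes_per_item` iterations, so a longer loop just means more launches rather than one bigger one. As far as I can tell, the config command I mentioned is `ti.loop_config` (with options such as `block_dim`), which tunes how a single loop is parallelized and would be a separate knob from this kind of chunking.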