If you really want an isometric look but need more complicated rendering than layered tiles, it seems like it would be much easier to render in 3D and lock the camera into an orthographic projection.
Unity and Unreal Engine, to name two off the top of my head, both have this built in; it's basically just a projection setting you toggle on the camera.
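
To make the idea concrete, here is a rough sketch of what an orthographic "isometric" camera does to a 3D point. It is not how either engine implements it; the function name `project_isometric` and the y-up axis convention are my own assumptions. The angles are the classic true-isometric ones: a 45° turn about the vertical axis, then a tilt of atan(1/√2), roughly 35.26°.

```python
import math

# Assumed camera angles for a true isometric view (y is "up" here).
YAW = math.radians(45.0)                  # turn about the vertical axis
PITCH = math.atan(1.0 / math.sqrt(2.0))   # tilt down, about 35.26 degrees


def project_isometric(x, y, z):
    """Return (screen_x, screen_y, depth) for a 3D point.

    Orthographic projection: after rotating into camera space we simply
    keep x and y and never divide by depth, so distant objects do not
    shrink. Depth is returned only so a renderer could sort or z-test.
    """
    # Rotate about the vertical (y) axis.
    xr = x * math.cos(YAW) - z * math.sin(YAW)
    zr = x * math.sin(YAW) + z * math.cos(YAW)
    # Tilt about the horizontal (x) axis.
    yr = y * math.cos(PITCH) - zr * math.sin(PITCH)
    depth = y * math.sin(PITCH) + zr * math.cos(PITCH)
    return xr, yr, depth


if __name__ == "__main__":
    for name, axis in (("x", (1, 0, 0)), ("y", (0, 1, 0)), ("z", (0, 0, 1))):
        sx, sy, _ = project_isometric(*axis)
        print(name, round(math.hypot(sx, sy), 4))
```

With these angles all three unit axes come out the same length on screen, which is what "isometric" means. Because the depth value is kept rather than thrown away, the renderer can handle occlusion with an ordinary depth buffer instead of hand-sorting tile layers, which is most of the appeal over the 2D approach.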