This is very impressive, especially the performance per LUT! Did I overlook frequency spec on a given target or did you not specify?
Will the execute stage pipeline effectively to reach higher f_max? (Of course there will be a small logic penalty, and a larger FF penalty, but the core is small enough that it would probably be tolerable.) Or is the core's whole architecture predicated on a two stage design?
This core is targeted at "smaller-is-better" applications with few actual instruction-throughput requirements. If it reaches 200 MHz on a Xilinx KU060, I will be delighted. (That specific clock frequency on that specific part carries heavy hints about what this core is intended for.)
With that in mind: the single instruction-per-clock design is for simplicity's sake, not performance's sake. If the execution stage were pipelined, it'd be a different core. If performance is the goal, I'd start by ripping out some of the details that distinguish this core from other (excellent) RISC-V cores.
Will the execute stage pipeline effectively to reach higher f_max? (Of course there will be a small logic penalty, and a larger FF penalty, but the core is small enough that it would probably be tolerable.) Or is the core's whole architecture predicated on a two stage design?