As in many things, it's a combination of both. For example:
- We wanted no more than one recurrent layer, since recurrence is a major bottleneck to parallelization.
- The recurrent layer should go "higher" in the network, because it propagates long-range context more effectively over the network's learned feature representations than over raw input values.
Other decisions are guided by a combination of trial and error and intuition. We started on much smaller datasets, which give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.
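To make the layer-ordering idea concrete, here's a minimal pure-Python sketch (all names and sizes are hypothetical, not the actual model): feed-forward layers apply to each time step independently and so parallelize trivially, while a recurrent layer must run sequentially over time, which is why we keep exactly one of them and put it on top of the learned features.

```python
import math
import random

random.seed(0)

def matvec(w, v):
    # w is a list of rows; returns w @ v.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def ff_layer(xs, w, b):
    # Feed-forward ReLU layer: each time step is independent,
    # so all steps could be computed in parallel.
    return [[max(0.0, h + bi) for h, bi in zip(matvec(w, x), b)] for x in xs]

def rnn_layer(xs, w_in, w_rec, b):
    # Simple recurrent layer: step t depends on step t-1's hidden
    # state, so it must run sequentially -- the parallelization
    # bottleneck mentioned above.
    h = [0.0] * len(b)
    out = []
    for x in xs:
        pre = [a + c + bi
               for a, c, bi in zip(matvec(w_in, x), matvec(w_rec, h), b)]
        h = [math.tanh(p) for p in pre]
        out.append(h)
    return out

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

# Hypothetical sizes, just to make the structure runnable.
T, d_in, d_hid = 20, 5, 8
xs = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(T)]

# Parallel-friendly feed-forward layers first ...
h1 = ff_layer(xs, rand_matrix(d_hid, d_in), [0.0] * d_hid)
h2 = ff_layer(h1, rand_matrix(d_hid, d_hid), [0.0] * d_hid)
# ... then the single recurrent layer on top of the learned features.
ys = rnn_layer(h2, rand_matrix(d_hid, d_hid), rand_matrix(d_hid, d_hid),
               [0.0] * d_hid)
```

Because the recurrent layer sees `h2` (learned features) rather than `xs` (raw input), its hidden state carries long-range context over a much more compact representation.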