As in many things, it's a combination of both. For example:
- We wanted no more than one recurrent layer, since recurrence is a major bottleneck to parallelization.
- The recurrent layer should go "higher" in the network, because it propagates long-range context more effectively over the network's learned feature representations than over raw input values.
Other decisions are guided by a combination of trial and error and intuition. We started on much smaller datasets, which give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.
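To make the layer-ordering idea concrete, here's a minimal pure-Python sketch (all names and sizes are hypothetical, not the actual model): feed-forward layers apply to each time step independently and so parallelize trivially, while a recurrent layer must run sequentially over time, which is why we keep exactly one of them and put it on top of the learned features.

```python
import math
import random

random.seed(0)

def matvec(w, v):
    # w is a list of rows; returns w @ v.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def ff_layer(xs, w, b):
    # Feed-forward ReLU layer: each time step is independent,
    # so all steps could be computed in parallel.
    return [[max(0.0, h + bi) for h, bi in zip(matvec(w, x), b)] for x in xs]

def rnn_layer(xs, w_in, w_rec, b):
    # Simple recurrent layer: step t depends on step t-1's hidden
    # state, so it must run sequentially -- the parallelization
    # bottleneck mentioned above.
    h = [0.0] * len(b)
    out = []
    for x in xs:
        pre = [a + c + bi
               for a, c, bi in zip(matvec(w_in, x), matvec(w_rec, h), b)]
        h = [math.tanh(p) for p in pre]
        out.append(h)
    return out

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

# Hypothetical sizes, just to make the structure runnable.
T, d_in, d_hid = 20, 5, 8
xs = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(T)]

# Parallel-friendly feed-forward layers first ...
h1 = ff_layer(xs, rand_matrix(d_hid, d_in), [0.0] * d_hid)
h2 = ff_layer(h1, rand_matrix(d_hid, d_hid), [0.0] * d_hid)
# ... then the single recurrent layer on top of the learned features.
ys = rnn_layer(h2, rand_matrix(d_hid, d_hid), rand_matrix(d_hid, d_hid),
               [0.0] * d_hid)
```

Because the recurrent layer sees `h2` (learned features) rather than `xs` (raw input), its hidden state carries long-range context over a much more compact representation.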