Forgive my naive question, but what if the neural network's output range is -1 to +1? If the activation functions are ReLU, doesn’t that mean a negative value cannot be produced?
A ReLU on its own cannot produce a negative value, but in the hidden layers the output of one neuron is multiplied by the next layer's weights, and those weights can be negative, so the sign can be encoded there.
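To make that concrete, here is a minimal sketch of a two-unit ReLU hidden layer feeding a plain linear output. The weights and the name `tiny_net` are made up for illustration, not taken from any real model.

```python
def relu(x):
    # ReLU clamps negative inputs to zero, so its output is never negative.
    return max(0.0, x)

def tiny_net(x):
    # Hidden layer: two ReLU units; both activations are >= 0.
    h1 = relu(1.0 * x)
    h2 = relu(-1.0 * x)
    # Linear output layer (no activation): the negative weight on h1
    # is what lets the final value drop below zero.
    return -1.0 * h1 + 1.0 * h2

print(tiny_net(2.0))   # hidden activations (2.0, 0.0), output -2.0
print(tiny_net(-3.0))  # hidden activations (0.0, 3.0), output 3.0
```

Every hidden activation is non-negative, yet the output takes either sign because the sign lives in the output weights.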
The strategy for most of these things is that the bulk of the network builds up a bunch of "shapes", and a final layer projects those appropriately into the output space. The intermediate layers can use basically any activation that has desirable convergence properties, and at the end you might have a linear projection (or a full MLP layer) followed by a sigmoid or some other reshaping. The GPT family uses "softmax": an exponentially weighted normalizing function that scales all the outputs to sum to 1 (since they represent a probability for each candidate next token).
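For reference, a minimal softmax sketch in plain Python. The input values are arbitrary, and subtracting the maximum first is a common numerical-stability trick that I am assuming here, not something specific to GPT.

```python
import math

def softmax(logits):
    # Subtract the max so math.exp never sees a large positive argument;
    # this does not change the result because the shift cancels out.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    # Dividing by the total makes the outputs sum to 1.
    return [e / total for e in exps]

# The inputs may be negative, but every output is a positive probability.
probs = softmax([2.0, -1.0, 0.5])
print(probs, sum(probs))
```

Note that the raw scores going in can be any sign; only the final reshaping forces them into the 0 to 1 range.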
If your output range is bounded like that then you probably want a tanh activation on the output layer, since it already spans -1 to +1, or a sigmoid with some shifting and scaling, since it spans 0 to 1. But the hidden layers can still use ReLU without issue.
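A rough sketch of both options for squashing a raw output `z` into the -1 to +1 range. The function names are hypothetical helpers, not from any library.

```python
import math

def sigmoid(z):
    # Standard logistic function, output in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def bounded_tanh(z):
    # tanh already maps any real number into (-1, 1).
    return math.tanh(z)

def bounded_sigmoid(z):
    # Scale (0, 1) by 2 and shift down by 1 to get (-1, 1).
    return 2.0 * sigmoid(z) - 1.0

for z in (-5.0, -0.3, 0.0, 1.2, 4.0):
    print(z, bounded_tanh(z), bounded_sigmoid(z))
```

The two are closely related: `tanh(z)` equals `2 * sigmoid(2 * z) - 1`, so the choice mostly changes how steep the curve is around zero.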
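A 'leaky ReLU' keeps a small slope for negative inputs, which retains some of the negative signal and helps prevent neurons from "dying" (getting stuck at zero output). I just googled around and now see "PReLU", which seems to address negative values too, by making that slope a learned parameter.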
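A small sketch of the difference, assuming a negative slope of 0.01 (a commonly used default, but still an assumption here). PReLU has the same shape, except the slope would be learned during training rather than fixed.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Positive inputs pass through; negative inputs are scaled down
    # instead of being zeroed, so some signal (and gradient) survives.
    return x if x >= 0 else negative_slope * x

def relu_grad(x):
    # Zero gradient for negative inputs: this is how a unit can "die".
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, negative_slope=0.01):
    # The gradient never drops to zero, so the unit can still recover.
    return 1.0 if x > 0 else negative_slope

print(relu(-2.0), leaky_relu(-2.0))
print(relu_grad(-2.0), leaky_relu_grad(-2.0))
```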