Well, it depends on the size of the hash table and the particular memory access patterns. Lookups into many small hash tables, or workloads in which most threads in a threadgroup fetch the same entry, can be very efficient on GPUs. Sparse virtual texturing is often implemented with a hash table and works well on GPUs because the hash table involved has both of these properties.
Yes, a very good point. I'm assuming the tables are quite large, given the workload. If a table is large enough to benefit from large pages through fewer DTLB misses, it's likely too large for threadgroup shared memory :)
(I'm sure you know this, just wanted to clarify.)
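To put rough numbers on that, here is a back-of-the-envelope sketch. Every figure in it (64 DTLB entries, 4 KiB and 2 MiB pages, 48 KiB of shared memory per threadgroup, a 64 MiB table) is an assumption picked for illustration, not a measurement of any particular chip, and the helper names are made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumed, illustrative hardware parameters (not taken from any specific part).
DTLB_ENTRIES = 64
SMALL_PAGE = 4 * KIB
LARGE_PAGE = 2 * MIB
SHARED_MEM_PER_GROUP = 48 * KIB

# Assumed size of the hash table in the workload under discussion.
TABLE_BYTES = 64 * MIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space a fully populated TLB can map at once."""
    return entries * page_size


def benefits_from_large_pages(table_bytes: int) -> bool:
    """True if the table overflows small-page TLB reach, so random
    lookups would miss the DTLB unless larger pages are used."""
    return table_bytes > tlb_reach(DTLB_ENTRIES, SMALL_PAGE)


def fits_in_shared_memory(table_bytes: int) -> bool:
    """True if the whole table could live in one threadgroup's shared memory."""
    return table_bytes <= SHARED_MEM_PER_GROUP


small_reach = tlb_reach(DTLB_ENTRIES, SMALL_PAGE)
large_reach = tlb_reach(DTLB_ENTRIES, LARGE_PAGE)

print(f"small-page reach: {small_reach // KIB} KiB")
print(f"large-page reach: {large_reach // MIB} MiB")
print(f"benefits from large pages: {benefits_from_large_pages(TABLE_BYTES)}")
print(f"fits in shared memory:     {fits_in_shared_memory(TABLE_BYTES)}")
```

With these assumed numbers, small-page reach (256 KiB) is already bigger than the shared memory budget (48 KiB), so any table big enough to care about large pages can't sit in shared memory. The conclusion flips only if the real hardware has much more on-chip memory or a much smaller TLB than assumed here.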