It does seem a bit funny to do it in two stages, but it's the standard way. It might seem more natural if you did the resampling stage first: resample by a factor of 2 and you have the same sound, but an octave lower and twice as long. A 50% time-stretch then gives you back the original duration while keeping the new pitch.
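To make the bookkeeping of the two stages concrete, here is a small sketch of how the factors relate. The function name `two_stage_plan` and its semitone parameter are just made up for illustration, and the 12-semitones-per-octave conversion assumes equal temperament.

```python
def two_stage_plan(semitones):
    """Return (pitch_ratio, length_after_resample, stretch_needed).

    Hypothetical helper, for illustration only. Lengths are relative to
    the original duration (1.0 = unchanged).
    """
    # Frequency ratio for the requested shift (equal temperament assumed).
    pitch_ratio = 2.0 ** (semitones / 12.0)
    # Stage 1: resampling changes pitch and duration together, so the
    # sound becomes 1/pitch_ratio times as long.
    length_after_resample = 1.0 / pitch_ratio
    # Stage 2: time-stretch by pitch_ratio to get back to length 1.0
    # without touching the pitch again.
    stretch_needed = pitch_ratio
    return pitch_ratio, length_after_resample, stretch_needed


# One octave down: twice as long after resampling, then a 50% time-stretch.
print(two_stage_plan(-12))
```

The two stages commute as far as the final duration and pitch go, which is why the order is mostly a matter of taste.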

The phase vocoder does use an FFT on each window internally, so that it can ensure the phases remain continuous when everything is merged back together. There are variants that let you monkey around with the FFT coefficients before the merge, so you can pitch-shift that way, but I believe when you stitch it back together you end up with the exact same artifacts as the two-step way. I think people have concentrated on perfecting time-stretching since pitch-shifting can be derived from it.
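For reference, a minimal sketch of the time-stretching half might look roughly like this. It is a textbook-style phase vocoder written with plain NumPy, not the code from the question; the function name, the `n_fft`/`hop` defaults and the Hann window are my own assumptions, and the output gain is not normalised.

```python
import numpy as np

def phase_vocoder_stretch(x, stretch, n_fft=1024, hop=256):
    """Return x stretched to roughly `stretch` times its length, same pitch."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    # Analysis: one FFT per overlapping window, taken every `hop` samples.
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.array([np.fft.rfft(window * x[i * hop:i * hop + n_fft])
                       for i in range(n_frames)])
    # Phase advance per hop expected at each bin's centre frequency.
    omega = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft
    # Fractional analysis-frame position read for each output frame.
    steps = np.arange(0, n_frames - 1, 1.0 / stretch)
    phase = np.angle(frames[0])
    out = np.zeros(len(steps) * hop + n_fft)
    for k, t in enumerate(steps):
        i = int(t)
        frac = t - i
        # Interpolate magnitudes between the two neighbouring frames.
        mag = (1 - frac) * np.abs(frames[i]) + frac * np.abs(frames[i + 1])
        # Synthesis: overlap-add a frame built from the accumulated phase.
        frame = np.fft.irfft(mag * np.exp(1j * phase), n=n_fft)
        out[k * hop:k * hop + n_fft] += window * frame
        # Measured phase advance between the analysis frames, with the
        # deviation from the expected advance wrapped into [-pi, pi].
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        # Accumulating it is what keeps the phases continuous.
        phase = phase + omega + dphi
    return out


sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
longer = phase_vocoder_stretch(tone, 2.0)  # about twice as long, still 440 Hz
```

Following this with a resampling step (or preceding it with one) gives the pitch shift.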

The problem with doing an FFT on the entire length of the sample is that scaling all the frequencies would then simply speed it up as well as changing the pitch ;) Chopping it up into short windows is key to separating the fundamental frequencies that we perceive as the general "pitch" from all the time-varying harmonics that we perceive as "timbre".

Edit: what is maybe a bit hackish is the crude resampling here. Having gone to the trouble of building a phase vocoder, at least some linear interpolation would be appropriate rather than just dropping/repeating samples ;)
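Something along these lines would do for the linear interpolation, using `np.interp`. The function name and the `speed` parameter are only illustrative, and there is no anti-aliasing filter here, so treat it as a sketch rather than a proper resampler.

```python
import numpy as np

def resample_linear(samples, speed):
    """Resample so the sound plays `speed` times faster (pitch scales by `speed`).

    speed=0.5 gives twice as many samples, i.e. an octave lower.
    """
    samples = np.asarray(samples, dtype=float)
    n_out = int(round(len(samples) / speed))
    # Fractional read positions into the original signal.
    positions = np.arange(n_out) * speed
    # np.interp blends the two neighbouring samples linearly; positions
    # past the last sample are clamped to the last value.
    return np.interp(positions, np.arange(len(samples)), samples)


ramp = resample_linear([0.0, 1.0, 2.0, 3.0], 0.5)
```

Compared with dropping or repeating samples, this fills in the in-between values instead of producing a staircase.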