That's not the dataset used for training. From the paper:
> We train our T2V model on a dataset containing 30M videos along with their text caption. [...] We evaluate our model on a collection of 113 text prompts describing diverse objects and scenes. The prompt list consists of 18 prompts assembled by us and 95 prompts used by prior works (Singer et al., 2022; Ho et al., 2022a; Blattmann et al., 2023b) (see App. B). Additionally, we employ a zero-shot evaluation protocol on the UCF101 dataset
>
[0] https://paperswithcode.com/