Yes, but how different would those 7K frames really be? Same lighting, same background, same surrounding objects, the exact same condition of the vehicle's interior and exterior, the same quirks of the camera's color profile, and so on. It would be an interesting experiment to actually try this, but I have a feeling the results wouldn't be all that good. Point being, you probably wouldn't get most of the benefits of deep learning, and you might as well use the same approach the author used.
No, they won't. Every picture will carry the lighting, time of day, and weather conditions of the moment and place it was taken, and the same goes for the background. If I want my neural network to identify the make and model of cars, but every picture I have of a Mazda3 was taken at noon on a sunny day in suburbia, then it is reasonably likely to train on the wrong features and either identify trucks on sunny days in suburbia as Mazda3s or fail to recognize a Mazda3 photographed on a rainy night.
A human might have difficulty recognizing a Mazda 3 on a rainy night as well. You can adjust color temperature and white balance in post-processing, or film a couple of minutes at night too. The point is, generating 7K images is not insurmountable, especially in this case, where it only has to recognize one particular car.
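For what it's worth, the post-processing idea is basically standard color-jitter augmentation. A minimal sketch, assuming torchvision is installed and using a hypothetical frame filename; jittering brightness, saturation, and hue roughly approximates the color-temperature and white-balance shifts described above:

```python
from PIL import Image
from torchvision import transforms

# Randomly perturb brightness, contrast, saturation, and hue to simulate
# different times of day, weather, and camera color profiles without reshooting.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.3, saturation=0.3, hue=0.05),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

img = Image.open("frame_0001.jpg")  # hypothetical path to one of the ~7K extracted frames
augmented = augment(img)            # a different tensor each call, so one frame yields many variants
```

This doesn't add genuinely new viewpoints or backgrounds, so it only partially addresses the overfitting concern above, but it does cheaply widen the lighting distribution the network sees.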