It's not particularly efficient, but efficiency isn't a design goal of the system. The tradeoff is robustness (you can't lose or corrupt data as long as the master dataset is safe) and flexibility (you can generate whatever views you like on the data whenever you choose). The design comes from Twitter's analytics system, which was running this sort of thing over a 27TB raw dataset using Hadoop, so apparently it scales if you throw more hardware at it.
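To make the batch-layer idea concrete, here's a minimal sketch in Python: the master dataset is an append-only log of immutable facts, and a view is always recomputed from scratch over the whole log. The record format and the pageview-counting view are invented for illustration; the real system runs the equivalent as Hadoop jobs over the raw data.

```python
from collections import Counter

# Append-only master dataset of immutable facts (fields invented for the example).
master_dataset = [
    {"event": "pageview", "url": "/home",  "ts": 1},
    {"event": "pageview", "url": "/about", "ts": 2},
    {"event": "pageview", "url": "/home",  "ts": 3},
]

def recompute_pageviews_by_url(records):
    """Rebuild the whole view from the raw facts. Nothing is incremental,
    so a bug in the view logic never corrupts the master data; you fix
    the code and recompute."""
    view = Counter()
    for r in records:
        if r["event"] == "pageview":
            view[r["url"]] += 1
    return dict(view)

print(recompute_pageviews_by_url(master_dataset))
# {'/home': 2, '/about': 1}
```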
There's a second layer using Storm (not written up in the book yet, so I don't know the details) that handles all data newer than the most recent batch run, and you somehow merge that new data with the old data (also not written up yet). I don't need to have this sort of system implemented immediately, so I'm content to sit around and wait for new book chapters rather than try to muddle through.
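Since that part of the book isn't written yet, the following is just a guess at the general shape of the merge, not how the book actually does it: the speed layer keeps a small view over events newer than the last batch run, and a query combines both. All names and data here are made up for illustration.

```python
# Hypothetical merge of a batch view with a realtime (speed-layer) view.
batch_view    = {"/home": 2, "/about": 1}   # produced by the last batch run
realtime_view = {"/home": 1, "/blog": 4}    # events since that run (Storm's job)

def query_pageviews(url):
    """Answer a query by summing the batch result with whatever the
    speed layer has accumulated since the batch view was produced."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

print(query_pageviews("/home"))  # 3
print(query_pageviews("/blog"))  # 4
```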