Read the last paragraph. You still have humans, but their input is more akin to a movie reviewer's than a movie director's, writer's, or actor's. It still takes skill, but it takes a lot less time.
RLHF typically employs humans, and that can be time-consuming in itself, but it's less time-consuming than creating content. And their efforts can be amplified if the raters are actually unpaid humans, that is, users who are willing to provide feedback and are also prompting the system. Plenty of people are happy to do this for free, and some of it happens simply as a byproduct of them doing what they're already doing: creating content and choosing which output comes out good and which doesn't. Every time I work through a coding problem with ChatGPT, and it makes mistakes and I tell it about those mistakes, it can be learning from that.
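To make that concrete, here's a minimal sketch, in Python with made-up names, of how a single human comparison (preferring one answer over another) can become a training signal for a reward model, the usual mechanism behind RLHF. This is an illustration of the idea, not any lab's actual code:

```python
# A minimal sketch of how one pairwise human rating can train a reward
# model. The scores here are hypothetical outputs of some scoring model.
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry-style loss: push the score of the response the human
    preferred above the score of the one they rejected."""
    # Probability the reward model assigns to the human's actual choice.
    p_chosen = 1.0 / (1.0 + math.exp(score_rejected - score_chosen))
    return -math.log(p_chosen)

# One "thumbs up" on answer A over answer B becomes one training signal.
loss = preference_loss(score_chosen=2.1, score_rejected=0.4)
print(f"{loss:.3f}")  # small loss: the reward model already agrees with the rater
```

The point is the leverage: a rating takes the user a second, but it produces a gradient the system can learn from.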
People can also come up with coding problems that it can run and test itself on. As a simple example, imagine it's trying to write a sorting algorithm. It can also write a testing function that simply checks that the output is correctly sorted. It can time its results and count how many steps it took. In that sense it can work just like AlphaZero: there is an objective goal, which is to do it in the fewest clock cycles, and there's a way to test whether and how well it is achieving that goal. While there may be a limited number of programming problems this works for, by practicing on that type of problem it will presumably get better at other types of problems, just like humans do.
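Here's a rough sketch of what that self-grading loop could look like, with a deliberately naive candidate sort and a hypothetical grading function; none of the names come from a real system:

```python
# A minimal sketch of the self-grading loop described above: a candidate
# sorting function is checked for correctness and scored by how many
# comparisons it used. Both functions are illustrative placeholders.
import random

def bubble_sort(xs: list[int]) -> tuple[list[int], int]:
    """A naive candidate; returns the sorted list plus a comparison count,
    the objective signal an AlphaZero-style loop would try to minimize."""
    xs, steps = list(xs), 0
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            steps += 1
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs, steps

def grade(candidate) -> tuple[bool, int]:
    """The automatically written test: correctness is objective, and the
    step count gives a score to improve on, with no human in the loop."""
    data = [random.randint(0, 999) for _ in range(200)]
    result, steps = candidate(data)
    return result == sorted(data), steps

ok, steps = grade(bubble_sort)
print(f"correct={ok}, comparisons={steps}")
```

A better candidate (say, a merge sort) would pass the same test with fewer steps, so the objective ranks solutions without anyone having to judge them.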
This is exactly what large language models do: they find a way to objectively test their writing ability, which is by having them predict words in text they've never seen before. In a sense that's different from actually writing new creative content, but it is practicing skills that you need to tap into when you are creating new content. Interestingly, a lot of people will dismiss them as simply being word predictors, but that's not really what they're doing. They're predicting words when they're training, but when they're actually generating new content, they're not "predicting" words (you can't predict your own decisions; that doesn't make sense), they are choosing words.
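A toy sketch of that training-versus-generation distinction, using a hard-coded three-word distribution in place of an actual model:

```python
# Training vs. generation, in miniature. The toy distribution stands in
# for a real model's next-word probabilities; everything here is made up.
import math, random

vocab_probs = {"cat": 0.6, "dog": 0.3, "hat": 0.1}  # toy next-word distribution

# Training: the next word in the held-out text is known, and the loss
# objectively measures how well the model predicted it.
actual_next_word = "dog"
loss = -math.log(vocab_probs[actual_next_word])

# Generation: there is no "correct" next word to predict. The model
# samples from its own distribution, i.e. it chooses a word.
chosen = random.choices(list(vocab_probs), weights=vocab_probs.values())[0]
print(f"training loss={loss:.3f}, generated word={chosen!r}")
```

Same distribution both times; the difference is whether it's being scored against someone else's text or committing to its own.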