I just tried asking ChatGPT to rate various BBC and NYT articles out of 10, and it consistently gave all of them a 7 or 8. Then I tried today's featured Wikipedia article, which got a 7, which it revised to an 8 after regenerating the respose. Then I tried the same but with BuzzFeeds hilariously shallow AI-generated travel articles[1] and it also gave those 7 or 8 every time. Then I asked ChatGPT to write a review of the iPhone 20, fed it back, and it gave itself a 7.5 out of 10.
I personally give this experiment a 7, maybe 8 out of 10.
ChatGPT has a giant system prompt that you have no control over. Try using Llama and create a system prompt with clear instructions and examples. If you were going to use a model in a production system you would also want to either fine tune it or train a BERT-like model as a classifier that just outputs a score. Maybe even more than one for ranking along different dimensions.
I personally give this experiment a 7, maybe 8 out of 10.
[1] https://www.buzzfeed.com/astoldtobuzzy