
> As the developer of a very high-powered tool, I may well wish to limit its use in many contexts. But, I never wish to limit the tool's usefulness ahead of time.

Exactly. Content moderation is largely an application-layer problem, not a foundation-layer one.

Imagine the problems of MySQL trying to perform content moderation for Facebook.




(the year is 2048. The camera pans across an office at Quantico, which is eerily serene. A messenger knocks on an important-looking door with a plaque that reads 'DIRECTOR')

Director: Come in

Messenger: Message from the Tulsa field office, sir. They're reporting that they've found a sex trafficking ring, but they're not sure what to do about it.

Director: Not sure? Arrest them, obviously. What's the problem?

Messenger: Well, they can't seem to secure a warrant. Some technical issue with the system.

Director: I know we migrated to a new system recently. Let's see if we can get this sorted.

(Director thwacks at the keyboard briefly)

Computer: Your request for "Child Sex Trafficking Warrant" has been found to contain content marked "Not Safe For Work". This violation has been reported.

Director: What the hell.

Messenger: Yeah, we tried to email you about it but the filters dropped the message. That's why they sent me.

Director: I'll deal with this. Let me make a call.

(Director picks up phone and dials)

Director: Hello? Hi, Paul. Yeah, we're having some issues with the new warrant system... No, it's doing everything as advertised... yes, it's a lot faster and we've managed to lay off a ton of our data staff. The problem is with getting warrants; me and my guys have been trying to get one but it keeps getting rejected... Oh, you know, some sex trafficking ring in Tulsa... Hello?

Phone: Your call cannot be completed as spoken. Our automated systems have detected content related to sex trafficking. This incident will be reported.

Director: Goddammit.

(as the director holds the phone trembling in frustration, the power goes out and they are enveloped in darkness in the windowless room. Roll credits)


You jest, but this is actually how frustrating it is to try to use ChatGPT in the domains of crime/fraud/cybersecurity.

It called me out recently as attempting to write malware. Which is true, but it wouldn't accept the plain explanation that I am authorized to do this by my employer, for deployment on their own machines. Stonewalling is just making everyone better at carefully crafting their inquiries so as not to arouse suspicion. ("As an AI language model, I cannot help you with your task in writing arousing malware...")

Unless you dial it back to a Swadesh list or something, language is too complicated to be used as a firewall for itself. People have always been able to talk their way into anything. Our prevention efforts are just training better social engineers, who call themselves "prompt engineers" now.


TIL about Swadesh lists.

It's not just a matter of complexity, either. Especially with English, you can say pretty much anything using any words - if you use the right combination of euphemism, analogy, poetic structure, context, etc.

As always, attempts at censorship produce awkward to hilarious to depressing results.


I wish I could upvote this more than once. It truly feels like the direction we're headed in.


Yes, great analogy!



