I apologize for the lack of updates to the BLNS. (since I'm free today and this is on the HN front page, I'll do a cleanup pass).
Even though it's a GitHub repository with 12.3k stars, there's not much to say or improve on what is effectively a .txt file based around a good idea (I recently removed mentions of my maintainership of the BLNS from my resume for that reason, despite its crazy popularity).
I happened across it this afternoon and thought it was great!
Do you know of any automation around this? I was thinking a script that grabbed your list and then hammered a given input-filtering library would be awesome. It's not something you'd want to run all the time, but pre-major-release it could be useful.
That is the primary purpose of the JSON files and the parser to convert the .txt to JSON; get the list, run it against a text input field, see what happens.
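Something like this is all it takes (a minimal Python sketch; `validate_username` is a made-up stand-in for whatever input filter you're actually testing, and in practice you'd `json.load` the real blns.json instead of the inline sample):

```python
import json

def validate_username(s):
    """Hypothetical input filter under test; swap in the real one."""
    return s.strip() != ""

def run_blns(strings, check):
    """Feed every naughty string to `check`, collecting any that raise."""
    crashes = []
    for s in strings:
        try:
            check(s)
        except Exception as exc:
            crashes.append((s, exc))
    return crashes

# In practice: strings = json.load(open("blns.json", encoding="utf-8"))
sample = json.loads('["", " ", "undefined", "\\u200b"]')
failures = run_blns(sample, validate_username)
```

The interesting output isn't pass/fail so much as which strings make the filter blow up rather than return a clean rejection.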
Similar to the other comments, another voice here appreciating that a "pre-built" version is available for quick use. For repos/sources like this, I tend to think of the prebuilt formats as letting me play around with things without any hassle. Once I'm happy, I'll invest the time to build it locally for the control.
Now you can just copy and paste. Usability-wise, I guess it's similar to offering precompiled packages of open-source software (you can also build from source yourself; this is just a lot easier).
I don't mind maintaining the repo, if you would like to pass it off. Until very recently I maintained a popular Gibberish-decoding website and my native language is not ASCII. I've got quite a bit of experience with encoding issues, more than I'd like anyway.
My Gmail username is the same as my HN username if you'd like to speak. Thanks.
You might want to include common unix shell commands. At a previous job we had a customer with the last name of Echo who wasn't able to make a purchase. Turns out our credit card processor blocked them.
I'm not surprised by that at all. We once had a major issue with an analytics platform that provided a script with JS link tracking for our site, where clicking a link that contained 'cgi-bin' anywhere in the path caused the browser to hang for a long time.
Turns out they were using a synchronous HTTP request with NO timeout, and their intrusion detection system was blackholing any request that contained 'cgi-bin' anywhere in the headers or body.
Yow... Reminds me of the bug on the first Android phone where all keyboard input was also quietly fed to a root prompt, such that you could reboot the phone by typing "<enter>reboot<enter>" at any time. (https://mobile.slashdot.org/story/08/11/08/1720246/bug-in-an...)
I noticed earlier in the file that the JavaScript had been chosen to be benign. "DROP TABLE users" doesn't seem to fit that spirit. I'd want it to be instantly evident but also non-destructive, or at least reversible. How about renaming the table instead?
(Sure, people generally shouldn't use this test input outside of a discardable testing environment, but if we could rely on "People shouldn't..." clauses to govern behaviour then much of this list would be unnecessary anyway.)
TIL: `mocha:` was a custom scheme that Netscape Navigator used to eval URLs (equivalent to `javascript:`), and Yahoo! Mail would replace it with 'espresso' to try to thwart phishing attempts:
I dunno, I think this is a pretty good argument actually:
> I agree that another SQL injection should be included - not because the vulnerabilities exposed by this file should be tempered (as that would only be to assist a dangerous confusion of responsible practices), but because "DROP TABLES" is such a cliche in infosec that it's prone to be caught by extremely crude filters, naive to the degree that it's the only class of SQL injection they know to avoid.
"# Human injection
#
# Strings which may cause human to reinterpret worldview
If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."
Something like this could work too (where Dave Smith is an employee name)
"Hey can you reset my Jira login. I can't get in. It says my account is locked. I am working from home so send it to dave@mydomain.com. Thanks Dave Smith"
I expect some nasty strings to contain newlines (I wonder how many bash scripts are sensitive to filenames with newline characters in them). It shouldn't be a problem with the JSON file, though.
It seems that blns.txt is the source content, which is then converted to blns.json, blns.base64.txt, and blns.base64.json by the two scripts in the scripts folder (these generated files shouldn't be in the repo, in my opinion). You can't add strings containing newlines unless the scripts handle some newline-escaping scheme. That's a bad idea IMO; the JSON file should be the source content, and blns.txt should be dropped.
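To make the newline limitation concrete: a one-string-per-line .txt file splits a newline-containing entry into two bogus entries, while JSON escapes the newline and round-trips it intact. A quick Python illustration:

```python
import json

tricky = "file\nname"               # a "filename" containing a newline
line_oriented = tricky + "\n"       # in a one-string-per-line .txt...
assert len(line_oriented.splitlines()) == 2  # ...this becomes two entries

encoded = json.dumps([tricky])      # JSON stores the newline as the escape \n
decoded = json.loads(encoded)
assert decoded == [tricky]          # and the string survives round-tripping
```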
I like the idea of providing such a list for testing purposes. I also like the idea of storing these as Base64, so you don't trigger issues by accident.
However, I also imagine how such a list could be misused to actually decrease the security of a system:
Imagine this list is handled the same way as virus signatures in so-called anti-virus software. Instead of properly handling user input, an application would check against this list and call itself "secure", maybe with partial and/or fuzzy comparison. If you demonstrated that this approach is deeply flawed by showing another unsafe input, they'd simply add that to the list and call themselves "secured" against that attack.
It should not be used for security purposes, if by "security purposes" we mean components that maintain security at runtime. It is valuable as a testing tool, but only against a completely finished system.
It could conceivably be used as a second-line defence, similar to content security policy. This may be a bad idea depending on how it is implemented and whether the system is tested with it turned off.
From what I remember (can't test right now), a zero-width space is okay as long as there are other (printable) characters in there too. This seems reasonable, because allowing a tweet to be a single zero-width space would make it appear to be empty and probably lead to some confusing display issues.
I'm pretty sure I've used it to "end" a hashtag early, like in this made-up example:
I've eaten two #banana<ZWS>s today!
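You can see why that works with a naive hashtag matcher: `\w` matches word characters, and ZERO WIDTH SPACE (U+200B) is a format character (category Cf), so the match stops right before it. A quick Python check:

```python
import re

tweet = "I've eaten two #banana\u200bs today!"  # U+200B ZWS after "banana"

# \w matches word characters only; the zero-width space is a format
# character, so a naive #(\w+) matcher ends the hashtag at "banana".
hashtag = re.search(r"#(\w+)", tweet).group(1)
```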
In my language, the possessive form doesn't take an apostrophe ("Alices Adventures" instead of "Alice's"), so for hashtags and user names it can be desirable to use the ZWS as an invisible apostrophe.
Ligatures. It's easy not to notice them at all in English (ﬂ vs. fl, or ﬂy vs. fly), but some languages use them very extensively, and the combinations are more significant.
ZWJ and ZWNJ are also common in Indic scripts. They're basically used to control the appearance of glyphs, for example half-forms and consonant clusters (क्ष vs. क्‍ष, both kṣa). As usual, Wikipedia has good examples, and the Unicode Standard covers the details.
ZW[N]J as a standalone character or at the beginning of a word is very unusual on a day-to-day basis, so it's understandable that Twitter fails to recognize this pattern.
> When a ZWJ is placed between two
> emoji characters, it can also result
> in a new form being shown, such as
> the family emoji, made up of two adult
> emoji and one or two child emoji
That makes a lot of sense too; I hadn't put much thought into how that's implemented. In retrospect it makes perfect sense.
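You can see the mechanism directly in Python: the family emoji is one rendered glyph on ZWJ-aware platforms, but several code points joined by U+200D underneath.

```python
# MAN, ZWJ, WOMAN, ZWJ, GIRL: one glyph on ZWJ-aware renderers,
# five code points underneath.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
assert len(family) == 5              # Python counts code points
assert family.count("\u200D") == 2   # two zero-width joiners
```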
I noticed that with the new emoji on my MacBook. Some of them render as "guy behind a MacBook" on my PC, but on phones without the new emoji they fall back to a separate "guy" emoji and "computer" emoji.
Same for the male version of the raised-hand emoji. On phones without it, it's just a "male" emoji plus a "female raising hand" emoji.
Not OP, but in Norwegian the correct way to write "Tom's car" is "Tom sin bil", the car of Tom. But the creep of English and laziness allows for "Toms car", esp. in informal writing.
I think that sequence is an escape for Hayes modems; do you mean that Hayes modems were less vulnerable to attacks involving it because of their guard interval feature?
Related: a list of names that probably should be reserved (for example, to prevent someone setting up a user-profile page at a URL you don't want them to control):
I remember the struggles I had trying to book a hotel in Essex Junction, Vermont, to visit IBM. NetNanny had serious issues with that town's name. I, OTOH, thought the town and the people working at the IBM ASIC plant were very nice.
I'm sad about how many literature teachers give Shakespeare's works a treatment as dry as unbuttered toast.
Some of the best teaching of Shakespeare I've seen used the actual lines mixed with a little extemporaneity to better get the intent across. "Nay, gentle Romeo, we must have you dance. Come on, stop being so emo! There are like a million other girls out there."
The commit message explains that the terms are verbatim from Wikipedia [1].
Wikipedia [2] attributes it to a Yahoo email filter "which automatically replaced Javascript-related strings with alternate versions, to prevent the possibility of cross-site scripting in HTML email".
No it doesn't. I believe it can be used for JavaScript injections like 'eval', as 'mocha' is/was a common test framework. At least that's the ostensible reason Yahoo replaced 'eval' with 'review', 'mocha' with 'expresso', and 'expression' with 'statement' way back in 2002 [0].
"In February 2006, Linda Callahan, a resident of Ashfield, Massachusetts, was initially prevented from registering her name with Yahoo! as an e-mail address as it contained the substring allah. Yahoo! later reversed the ban."
I got nothing for "mocha", though. Edit: apparently (from below) there was a Yahoo! mail filter that replaced "expresso" [sic] with "mocha"; but either the story was misreported or the mail filter was wildly misconfigured. So the entry should be "expresso" [sic], perhaps.
Is it wise to just take this list "as is" as a blacklist for, say, valid usernames?
I interpret this as a list of input that you should accept, and it's test-data to verify that the input is correctly handled.
After all, I imagine Linda Callahan would be upset if she couldn't use her name when registering, especially if she couldn't flip a table in comments afterwards. (╯°□°)╯︵ ┻━┻)
Not really, since a lot of the lines are examples of classes of input -> good for testing, but if you have an actual problem with one of them blacklisting them only protects you against this single example.
Definitely not -- these are examples of classes of strings that should be OK but might potentially cause issues, that can be used for testing.
But the issues they might cause are not all malicious: some are people's names, added to the list because an over-zealous profanity or offensiveness filter once choked on them.
So my suggestion is that you shouldn't block any of the strings in this file, but should use the file to make sure your code handles every one of them successfully. "Successfully" is naturally context-dependent: you may have a policy that messages may not consist solely of whitespace, in which case the correct response to a whitespace-only string is a proper error telling the user so, avoiding Twitter's internal-server-error example in that case.
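That kind of check might look like this (a sketch; the `ZERO_WIDTH` set is an illustrative, incomplete sample, since `str.strip()` does not treat zero-width format characters as whitespace):

```python
# Incomplete, illustrative sample of zero-width/format characters; a real
# implementation would consult Unicode category data (e.g. via unicodedata).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def validate_message(text):
    """Return an error string for unacceptable input, or None if it's OK."""
    visible = "".join(ch for ch in text if ch not in ZERO_WIDTH).strip()
    if not visible:
        return "Message must contain at least one visible character."
    return None
```

Given "\u200b" or "   " this returns a user-facing error; given real text it returns None, and the request proceeds instead of 500ing.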
>Although this is not a malicious error, and typical users aren't Tweeting weird unicode, an "internal server error" for unexpected input is never a positive experience for the user
What would the user expect from inputting "U+200B ZERO WIDTH SPACE" into a form, anyway?
I've observed ZWSes appearing in user input for an application I maintain. It appears in text pasted from either Outlook or OWA, I believe. In our case, it is necessary that the application handle them gracefully - indeed, the user has no reason to know anything is amiss.
That internal server error only appears if you paste the ZWS by itself, without any valid text in the tweet at all. So yes, the user knows perfectly well what he's doing.
An HTTP 5xx error indicates something abnormal happened on the server that wasn't handled. The server should respond with a 400 if the data is something it shouldn't accept.
But yeah like others said I would expect this to turn into some sort of validation message on the client and never show them the backend error.
Once I had a form that required a minimum number of words. Instead of trying to write more verbosely, I simply inserted ZWS (or maybe ZWJ, I can't remember) randomly into the text to fool the word-count checker.
I have to agree here. While a collection of "naughty strings" isn't wrong per se, the growing number of "killer regexes to escape HTML" and other magic approaches to injection attacks on GitHub only serve lazy devs who want post-facto excuses for their injection-prone web apps, or project managers who want to tick items off security checklists.
It's wrong because it de-emphasizes the importance of HTML-aware template languages, such as some that are available for golang, or SGML, the natural template language for HTML. There's no such thing as a collection of regexes for sanitizing HTML; it all depends on the context into which strings are inserted.
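A quick illustration of the context point, using Python's `html.escape` and a made-up payload: a string that needs no escaping at all in a text node becomes an attack the moment the template puts it in an unquoted attribute.

```python
import html

payload = "x onmouseover=alert(1)"

# In a text node this string is harmless; html.escape has nothing to change.
text_node = "<span>" + html.escape(payload) + "</span>"

# Dropped into an unquoted attribute, the very same string sprouts an event
# handler: escaping alone can't rescue a context the template got wrong.
attr_unquoted = "<input value=" + payload + ">"
```

No regex over the payload distinguishes these two cases; only the insertion context does.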
But wouldn't you want a decent set of cases to work on for learning purposes?
I think it's also good in that while you may not know all the latest tricks, this can help you reveal what you don't know. It can get you really thinking about the possibilities of what a simple string can do to your code if not properly handled.
No, you don't want cases. You want real specifications that you can understand before setting out to write a program. "Corner cases" only exist due to a lack of understanding.
Also, explicit HTML (or SQL or whatever) string handling in normal application code is just a failure to separate concerns: you haven't distinguished the level at which HTML has an abstract syntax and the level at which HTML's abstract syntax is linearized into strings in one particular way.
Real specifications being, "save the user's text and display it back", or "save user input that is in English ASCII, excluding special characters, and no larger than 160 characters"? I get a lot of the first, with the emphasis being on the user's perspective.
I do know to consider things like SQL injection and JS being injected into the site. But I don't know what a special whitespace character from the Persian alphabet will do to my server; until today I hadn't actually thought about it. Not every language handles strings the same, as you pointed out.
I still think it's good to have around for helping you reveal what you don't know, about what you don't know.
Real specifications relate preconditions to postconditions. Preconditions and postconditions, in turn, are predicates on the program state. The mathematical techniques for writing programs that meet their formal specifications have been known for a few decades already.
---
Replying as an edit, because HN complains that “I'm submitting too fast”:
Sure, what you said applies to entire applications. But something relatively stable and small, like, um, the definitions of HTML, JSON, SQL, etc. (do they become larger every time your boss requests a new feature?) surely should have formal specifications.
I would love "real" specifications. But right now I'm already dealing with a boss that has no idea what he wants in terms of the UI. Simultaneously demanding I "know" what should be done without "taking on things nobody asked for."
Alas, I don't work at NASA where these formalities exist. I'm given a rough sketch that I'm expected to bring into life, throw away and recreate again on a whim.
Please note that I am not complaining, nor making excuses. I'm only pointing out that our expectations, environments, and programming languages are different. Each can massively affect how a program should handle input. Adding checks helps, but doesn't eliminate the need for a nice set of test data to verify that everything behaves the way we expect.
Exactly. Security check lists become unnecessary when the program is designed to be correct right from the start.
> (first paragraph)
The real problem is that we do a very poor job of embedding languages inside each other. For example, HTML parsers must contain special provisions to handle that embedded JavaScript. </tag> might no longer be a terminator, because it could appear inside a JavaScript string literal. This is terrible design! I don't even like Lisp, but Lispers do have a point when they say using S-expressions would avoid all of these issues.
Worse, actually: <script> content is only terminated by </script> and not other end-element tags, but <!-- --> comments within script content are treated as JavaScript comments [1] (though I'm not aware of template approaches that need to compose the content of script elements).
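The usual mitigation when embedding JSON inside a `<script>` element is to break up the terminator sequence itself; a hedged sketch (the `embed_in_script` helper is made up for illustration):

```python
import json

def embed_in_script(data):
    """Serialize `data` for inline <script> use, escaping the "</" sequence
    that would otherwise let a string end the script element early."""
    return json.dumps(data).replace("</", "<\\/")

naughty = {"comment": "</script><script>alert(1)</script>"}
safe = embed_in_script(naughty)
# "\/" is a legal JSON escape, so the payload still parses back unchanged,
# but the literal "</script>" no longer appears in the markup.
```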
Why can't there be both? Yes, the code is naive if it doesn't handle all these strings correctly. But at the same time the strings are naughty because they purposefully try to exploit common weaknesses. Sometimes both sides are guilty.
A social engineer is just talking to you; if you listen to him, you must act accordingly. A lock-picking set is just a bit of metal; if it fits into the keyway, the lock has to handle it correctly.
Yes, writing parsers is a lot easier than those examples. But so far society has always ruled that inputs that purposefully try to abuse flaws are not freed from responsibility just because the flaw shouldn't be there.
A computer program is a mathematical object. If you want to rule out misbehaviors, you prove that such misbehaviors won't arise - just like any other theorem. And that's it.
I'm not a malicious person. I don't purposefully abuse any system's flaws. But if anyone else does, my sympathies aren't with the designer of the flawed system.
P.S.: Appeals to authority won't help make your case.
These so-called “naughty strings” expose implementation errors in code that processes widely used formal languages such as numeric literals, URLs, HTML, JSON, SQL, etc. These languages are so widely used that it's criminal not to have formal specifications for them. And the mathematical techniques for constructing programs that meet their formal specifications are very well known.
Fine, imagine we have a formal specification for SQL. Now how do I make sure my parser is compliant with the spec without testing it? Formal verification is a very active research area, I don't think this is quite as easy as you're implying. How do I avoid "implementation errors" without exposing them?
(0) Design languages so that implementation errors are harder to make and easier to detect. For example, avoid context-sensitive grammars like the plague.
(1) Design implementations (parsers, code generators, etc.) with the logical argument for their correctness in mind. That is, don't attempt to verify an existing possibly incorrect program - write it to be correct right from the beginning! This is greatly aided by designs that meet criterion (0).
These are obviously useful ideas, but "write it to be correct from the beginning"? Are you serious? This is the oldest joke in software engineering. "Don't worry about testing it, I don't make mistakes." No matter how idiot-proof your languages and frameworks are, it is grossly irresponsible to not test work that a human has done. Until developers are themselves replaced by formally verified programs, testing is an absolute necessity.
I doubt human programmers can be fully replaced, and I'm not saying testing is completely useless. But the sheer number of “naughty strings” in that list is an indictment of our languages: They have way too many corner cases, way too many traps for us to fall into.
I still don't understand your logic. Are you saying once a program passes a test, we should stop using that test? The point of this list is to cover all classes of input in general, not just ones that a specific framework has issues with.
These are corner cases in the concept of user input, not just corner cases of any specific parser. What if it's a number, what if it's not? What if it's the same alphabet as the code, what if it's not? What if it is valid code? What if it's empty, what if it's not? etc. Even if you've written the perfect parser in the perfect language, you still need to have unit tests for all of this stuff. They are traps caused by human definitions of "input" and "string", which cannot be formally verified.
> Are you saying once a program passes a test, we should stop using that test?
No. I'm saying that programs have to be proven correct. Then you can use tests to rule out other pesky problems that have nothing to do with your design being incorrect. (For example, you could prove a program correct on paper, then transcribe it incorrectly to a computer. It has happened to me before.)
> These are corner cases in the concept of user input
“undefined” and “null” aren't special cases in the concept of user input - they're special cases in languages that happen to have “undefined” and “null”.
Octal numeric literals aren't special cases in the concept of number - they're special cases in languages where octal literals begin with the prefix “0”, rather than something more sensible like “0o”.
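Python 3 actually made exactly that change, which is a nice concrete check of the claim: the `0o` prefix is required, and the old bare-leading-zero form is now a syntax error rather than a trap.

```python
# Python 3 requires the explicit 0o prefix for octal:
assert 0o17 == 15

# The old C-style form "017" was removed because it reads like seventeen;
# it now fails to even compile.
try:
    compile("017", "<example>", "eval")
    raised = False
except SyntaxError:
    raised = True
assert raised
```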
Failing to distinguish between escaped and unescaped strings is also a language problem - they should have different types!
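A minimal sketch of that "different types" idea in Python, similar in spirit to MarkupSafe's `Markup` type (`SafeHtml` is a made-up name for illustration): the type records that escaping has already happened, which makes escaping idempotent and keeps raw and escaped strings from being mixed up.

```python
import html

class SafeHtml(str):
    """Marker type: the contents are already HTML-escaped."""

def escape(value):
    """Escape plain text; pass already-safe values through unchanged."""
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(str(value), quote=True))

raw = '<b>"hi"</b>'
once = escape(raw)
twice = escape(once)   # idempotent: no double-escaping
assert once == twice
```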