I apologize for the lack of updates to the BLNS. (since I'm free today and this is on the HN front page, I'll do a cleanup pass).
Even though it's a GitHub repository with 12.3k stars, there's not much to say or improve on what is effectively a .txt file based around a good idea (I recently removed mentions of my maintainership of the BLNS from my resume for that reason, despite its crazy popularity).
I happened across it this afternoon and thought it was great!
Do you know of any automation around this? I was thinking a script that grabbed your list and then hammered a given input-filtering library would be awesome. It's not something you'd want to run all the time, but pre-major-release it could be useful.
That is the primary purpose of the JSON files and the parser to convert the .txt to JSON; get the list, run it against a text input field, see what happens.
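Something like this is all it takes (a minimal Python sketch; `validate_username` is a made-up stand-in for whatever input filter you're actually testing, and in practice you'd `json.load` the real blns.json instead of the inline sample):

```python
import json

def validate_username(s):
    """Hypothetical input filter under test; swap in the real one."""
    return s.strip() != ""

def run_blns(strings, check):
    """Feed every naughty string to `check`, collecting any that raise."""
    crashes = []
    for s in strings:
        try:
            check(s)
        except Exception as exc:
            crashes.append((s, exc))
    return crashes

# In practice: strings = json.load(open("blns.json", encoding="utf-8"))
sample = json.loads('["", " ", "undefined", "\\u200b"]')
failures = run_blns(sample, validate_username)
```

The interesting output isn't pass/fail so much as which strings make the filter blow up rather than return a clean rejection.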
Similar to the other comments, another voice here appreciating that a "pre-built" version is available for quick use. For repos/sources like this, I tend to think of the prebuilt formats as letting me play around with things without any hassle. Once I'm happy, I'll invest the time to build it locally for the control.
Now you can just copy and paste. Usability-wise, I guess it's similar to offering precompiled packages of open-source software (you can also build from source yourself; this is just a lot easier).
I don't mind maintaining the repo, if you would like to pass it off. Until very recently I maintained a popular Gibberish-decoding website and my native language is not ASCII. I've got quite a bit of experience with encoding issues, more than I'd like anyway.
My Gmail username is the same as my HN username if you'd like to speak. Thanks.
You might want to include common unix shell commands. At a previous job we had a customer with the last name of Echo who wasn't able to make a purchase. Turns out our credit card processor blocked them.
I'm not surprised by that at all. We once had a major issue with an analytics platform that provided a script with JS link tracking for our site, where clicking a link that contained 'cgi-bin' anywhere in the path caused the browser to hang for a long time.
Turns out they were using a synchronous HTTP request with NO timeout, and their intrusion detection system was blackholing any request that contained 'cgi-bin' anywhere in the headers or body.
Yow... Reminds me of the bug on the first Android phone where all keyboard input was also quietly fed to a root prompt, such that you could reboot the phone by typing "<enter>reboot<enter>" at any time. (https://mobile.slashdot.org/story/08/11/08/1720246/bug-in-an...)
I noticed earlier in the file that the JavaScript had been chosen to be benign. "DROP TABLE users" doesn't seem to fit that spirit. I'd want it to be instantly evident but also non-destructive, or at least reversible. How about renaming the table instead?
(Sure, people generally shouldn't use this test input outside of a discardable testing environment, but if we could rely on "People shouldn't..." clauses to govern behaviour then much of this list would be unnecessary anyway.)
TIL: `mocha:` was a custom scheme that Netscape Navigator used to eval URLs (equivalent to `javascript:`), and Yahoo! Mail would replace it with 'espresso' to try to thwart phishing attempts:
I dunno, I think this is a pretty good argument actually:
> I agree that another SQL injection should be included - not because the vulnerabilities exposed by this file should be tempered (as that would only be to assist a dangerous confusion of responsible practices), but because "DROP TABLES" is such a cliche in infosec that it's prone to be caught by extremely crude filters, naive to the degree that it's the only class of SQL injection they know to avoid.
"# Human injection
#
# Strings which may cause human to reinterpret worldview
If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."
Something like this could work too (where Dave Smith is an employee name)
"Hey can you reset my Jira login. I can't get in. It says my account is locked. I am working from home so send it to dave@mydomain.com. Thanks Dave Smith"
I expect some nasty strings to contain newlines (I wonder how many bash scripts are sensitive to filenames with newline characters in them). It shouldn't be a problem with the JSON file, though.
It seems that blns.txt is the source content, which is then converted to blns.json, blns.base64.txt, and blns.base64.json by the two scripts in the scripts folder (these generated files shouldn't be in the repo, in my opinion). You can't add strings containing newlines unless the scripts handle some newline-escaping scheme. That's a bad idea IMO; the JSON file should be the source content, and blns.txt should be dropped.
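To make the newline limitation concrete: a one-string-per-line .txt file splits a newline-containing entry into two bogus entries, while JSON escapes the newline and round-trips it intact. A quick Python illustration:

```python
import json

tricky = "file\nname"               # a "filename" containing a newline
line_oriented = tricky + "\n"       # in a one-string-per-line .txt...
assert len(line_oriented.splitlines()) == 2  # ...this becomes two entries

encoded = json.dumps([tricky])      # JSON stores the newline as the escape \n
decoded = json.loads(encoded)
assert decoded == [tricky]          # and the string survives round-tripping
```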
I like the idea of providing such a list for testing purposes. I also like the idea of storing these as Base64, so you don't trigger issues by accident.
However, I also imagine how such a list could be misused to actually decrease the security of a system:
Imagine this list is handled the same way as virus signatures in so-called anti-virus software. Instead of properly handling user input, an application would check against this list and call itself "secure", maybe with partial and/or fuzzy comparison. If you demonstrated that this approach is deeply flawed by showing another unsafe input, they'd simply add that to the list and call themselves "secured" against that attack.
It should not be used for security purposes, if by "security purposes" we mean components that maintain security at runtime. It is valuable as a testing tool, but only against a completely finished system.
It could conceivably be used as a second-line defence, similar to content security policy. This may be a bad idea depending on how it is implemented and whether the system is tested with it turned off.
From what I remember (can't test right now), a zero-width space is okay as long as there are other (printable) characters in there too. This seems reasonable, because allowing a tweet to be a single zero-width space would make it appear to be empty and probably lead to some confusing display issues.
I'm pretty sure I've used it to "end" a hashtag early, like in this made-up example:
I've eaten two #banana<ZWS>s today!
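You can see why that works with a naive hashtag matcher: `\w` matches word characters, and ZERO WIDTH SPACE (U+200B) is a format character (category Cf), so the match stops right before it. A quick Python check:

```python
import re

tweet = "I've eaten two #banana\u200bs today!"  # U+200B ZWS after "banana"

# \w matches word characters only; the zero-width space is a format
# character, so a naive #(\w+) matcher ends the hashtag at "banana".
hashtag = re.search(r"#(\w+)", tweet).group(1)
```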
In my language, the possessive form doesn't take an apostrophe ("Alices Adventures" instead of "Alice's"), so for hashtags and user names it can be desirable to use the ZWS as an invisible apostrophe.
Ligatures. It's easy not to notice them at all in English (ﬂ vs. fl, or ﬂy vs. fly), but some languages use them very extensively, and the combinations are more significant.
ZWJ and ZWNJ are also common in Indic scripts. They're basically used to control the appearance of glyphs, for example half-forms and consonant clusters (क्ष vs. क्‍ष, both kṣa). As usual, Wikipedia has good examples, and the Unicode Standard covers the details.
ZW[N]J as a standalone character or at the beginning of a word is very unusual on a day-to-day basis, so it's understandable that Twitter fails to recognize this pattern.
> When a ZWJ is placed between two
> emoji characters, it can also result
> in a new form being shown, such as
> the family emoji, made up of two adult
> emoji and one or two child emoji
That makes a lot of sense too; I hadn't put much thought into how that's implemented. In retrospect it makes perfect sense.
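You can see the mechanism directly in Python: the family emoji is one rendered glyph on ZWJ-aware platforms, but several code points joined by U+200D underneath.

```python
# MAN, ZWJ, WOMAN, ZWJ, GIRL: one glyph on ZWJ-aware renderers,
# five code points underneath.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
assert len(family) == 5              # Python counts code points
assert family.count("\u200D") == 2   # two zero-width joiners
```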
I noticed that with the new emoji on my MacBook. Some of them render as "guy behind a MacBook" on my PC, but on phones without the new emoji they fall back to a separate "guy" emoji and "computer" emoji.
Same for the male version of the raised-hand emoji. On phones without it, it's just a "male" emoji plus a "female raising hand" emoji.
Not OP, but in Norwegian the correct way to write "Tom's car" is "Tom sin bil", the car of Tom. But the creep of English and laziness allows for "Toms car", esp. in informal writing.
I think that sequence is an escape for Hayes modems; do you mean that Hayes modems were less vulnerable to attacks involving it because of their guard interval feature?
Related: a list of names that probably should be reserved (for example, to prevent someone setting up a user-profile page at a URL you don't want them to control):
I remember the struggles I had trying to book a hotel in Essex Junction, Vermont, to visit IBM. NetNanny had serious issues with that town's name. I, OTOH, thought the town and the people working at the IBM ASIC plant were very nice.
I'm sad about how many literature teachers give Shakespeare's works a treatment as dry as unbuttered toast.
Some of the best teaching of Shakespeare I've seen used the actual lines mixed with a little extemporaneity to better get the intent across. "Nay, gentle Romeo, we must have you dance. Come on, stop being so emo! There are like a million other girls out there."
The commit message explains that the terms are verbatim from Wikipedia [1].
Wikipedia [2] attributes it to a Yahoo email filter "which automatically replaced Javascript-related strings with alternate versions, to prevent the possibility of cross-site scripting in HTML email".
No it doesn't. I believe it can be used for JavaScript injections like 'eval', as 'mocha' is/was a common test framework. At least that's the ostensible reason Yahoo replaced 'eval' with 'review', 'mocha' with 'expresso', and 'expression' with 'statement' way back in 2002 [0].
"In February 2006, Linda Callahan, a resident of Ashfield, Massachusetts, was initially prevented from registering her name with Yahoo! as an e-mail address as it contained the substring allah. Yahoo! later reversed the ban."
I got nothing for "mocha", though. Edit: apparently (from below) there was a Yahoo! mail filter that replaced "expresso" [sic] with "mocha"; but either the story was misreported or the mail filter was wildly misconfigured. So the entry should be "expresso" [sic], perhaps.
Is it wise to just take this list "as is" as a blacklist for, say, valid usernames?
I interpret this as a list of input that you should accept, and it's test-data to verify that the input is correctly handled.
After all, I imagine Linda Callahan would be upset if she couldn't use her name when registering, especially if she couldn't flip a table in comments afterwards. (╯°□°)╯︵ ┻━┻)
Not really, since a lot of the lines are examples of classes of input -> good for testing, but if you have an actual problem with one of them blacklisting them only protects you against this single example.
Definitely not -- these are examples of classes of strings that should be OK but might potentially cause issues, that can be used for testing.
But the issues they might cause are not all malicious: some are people's names, added to the list because an over-zealous profanity or offensiveness filter once choked on them.
So my suggestion is that you shouldn't block any of the strings in this file, but should use the file to make sure your code handles every one of them successfully. "Successfully" is naturally context-dependent: you may have a policy that messages may not consist solely of whitespace, in which case the correct response to a whitespace-only string is a proper error telling the user so, avoiding Twitter's internal-server-error example in that case.
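That kind of check might look like this (a sketch; the `ZERO_WIDTH` set is an illustrative, incomplete sample, since `str.strip()` does not treat zero-width format characters as whitespace):

```python
# Incomplete, illustrative sample of zero-width/format characters; a real
# implementation would consult Unicode category data (e.g. via unicodedata).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def validate_message(text):
    """Return an error string for unacceptable input, or None if it's OK."""
    visible = "".join(ch for ch in text if ch not in ZERO_WIDTH).strip()
    if not visible:
        return "Message must contain at least one visible character."
    return None
```

Given "\u200b" or "   " this returns a user-facing error; given real text it returns None, and the request proceeds instead of 500ing.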
>Although this is not a malicious error, and typical users aren't Tweeting weird unicode, an "internal server error" for unexpected input is never a positive experience for the user
What would the user expect from inputting "U+200B ZERO WIDTH SPACE" into a form, anyway?
I've observed ZWSes appearing in user input for an application I maintain. It appears in text pasted from either Outlook or OWA, I believe. In our case, it is necessary that the application handle them gracefully - indeed, the user has no reason to know anything is amiss.
That internal server error only appears if you paste the ZWS by itself, without any valid text in the tweet at all. So yes, the user knows perfectly well what he's doing.
An HTTP 5xx error indicates something abnormal happened on the server that wasn't handled. The server should respond with a 400 if the data is something it shouldn't accept.
But yeah like others said I would expect this to turn into some sort of validation message on the client and never show them the backend error.
Once I had a form that required a minimum number of words. Instead of trying to write more verbosely, I simply inserted ZWS (or maybe ZWJ, I can't remember) randomly into the text to fool the word-count checker.
I have to agree here. While a collection of "naughty strings" isn't wrong per se, the growing number of "killer regexes to escape HTML" and other magic approaches to injection attacks on GitHub only serve lazy devs who want post-facto excuses for their injection-prone web apps, or project managers who want to tick items off security checklists.
It's wrong because it de-emphasizes the importance of HTML-aware template languages, such as some that are available for golang, or SGML, the natural template language for HTML. There's no such thing as a collection of regexes for sanitizing HTML; it all depends on the context into which strings are inserted.
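A quick illustration of the context point, using Python's `html.escape` and a made-up payload: a string that needs no escaping at all in a text node becomes an attack the moment the template puts it in an unquoted attribute.

```python
import html

payload = "x onmouseover=alert(1)"

# In a text node this string is harmless; html.escape has nothing to change.
text_node = "<span>" + html.escape(payload) + "</span>"

# Dropped into an unquoted attribute, the very same string sprouts an event
# handler: escaping alone can't rescue a context the template got wrong.
attr_unquoted = "<input value=" + payload + ">"
```

No regex over the payload distinguishes these two cases; only the insertion context does.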
But wouldn't you want a decent set of cases to work on for learning purposes?
I think it's also good in that while you may not know all the latest tricks, this can help you reveal what you don't know. It can get you really thinking about the possibilities of what a simple string can do to your code if not properly handled.
No, you don't want cases. You want real specifications that you can understand before setting out to write a program. "Corner cases" only exist due to a lack of understanding.
Also, explicit HTML (or SQL or whatever) string handling in normal application code is just a failure to separate concerns: you haven't distinguished the level at which HTML has an abstract syntax and the level at which HTML's abstract syntax is linearized into strings in one particular way.
Real specifications being, "save the user's text and display it back", or "save user input that is in English ASCII, excluding special characters, and no larger than 160 characters"? I get a lot of the first, with the emphasis being on the user's perspective.
I do know to consider things like SQL injection and JS being injected into the site. But I don't know what a special whitespace character from the Persian alphabet will do to my server; until today I hadn't actually thought about it. Not every language handles strings the same, as you pointed out.
I still think it's good to have around for helping you reveal what you don't know, about what you don't know.
Real specifications relate preconditions to postconditions. Preconditions and postconditions, in turn, are predicates on the program state. The mathematical techniques for writing programs that meet their formal specifications have been known for a few decades already.
---
Replying as an edit, because HN complains that “I'm submitting too fast”:
Sure, what you said applies to entire applications. But something relatively stable and small, like, um, the definitions of HTML, JSON, SQL, etc. (do they become larger every time your boss requests a new feature?) surely should have formal specifications.
I would love "real" specifications. But right now I'm already dealing with a boss that has no idea what he wants in terms of the UI. Simultaneously demanding I "know" what should be done without "taking on things nobody asked for."
Alas, I don't work at NASA where these formalities exist. I'm given a rough sketch that I'm expected to bring into life, throw away and recreate again on a whim.
Please note that I am not complaining, nor making excuses. I'm only pointing out that our expectations, environments, and programming languages are different. Each can massively affect how a program should handle input. Adding checks helps, but doesn't eliminate the need for a nice set of test data to verify that everything behaves the way we expect.
Exactly. Security check lists become unnecessary when the program is designed to be correct right from the start.
> (first paragraph)
The real problem is that we do a very poor job of embedding languages inside each other. For example, HTML parsers must contain special provisions to handle that embedded JavaScript. </tag> might no longer be a terminator, because it could appear inside a JavaScript string literal. This is terrible design! I don't even like Lisp, but Lispers do have a point when they say using S-expressions would avoid all of these issues.
Worse, actually: <script> content is only terminated by </script> and not other end-element tags, but <!-- --> comments within script content are treated as JavaScript comments [1] (though I'm not aware of template approaches that need to compose the content of script elements).
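The usual mitigation when embedding JSON inside a `<script>` element is to break up the terminator sequence itself; a hedged sketch (the `embed_in_script` helper is made up for illustration):

```python
import json

def embed_in_script(data):
    """Serialize `data` for inline <script> use, escaping the "</" sequence
    that would otherwise let a string end the script element early."""
    return json.dumps(data).replace("</", "<\\/")

naughty = {"comment": "</script><script>alert(1)</script>"}
safe = embed_in_script(naughty)
# "\/" is a legal JSON escape, so the payload still parses back unchanged,
# but the literal "</script>" no longer appears in the markup.
```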
Why can't there be both? Yes, the code is naive if it doesn't handle all these strings correctly. But at the same time the strings are naughty because they purposefully try to exploit common weaknesses. Sometimes both sides are guilty.
A social engineer is just talking to you; if you listen to him, you must act accordingly. A lock-picking set is just a bit of metal; if it fits into the keyway, the lock has to handle it correctly.
Yes, writing parsers is a lot easier than those examples. But so far society has always ruled that inputs that purposefully try to abuse flaws are not freed from responsibility just because the flaw shouldn't be there.
A computer program is a mathematical object. If you want to rule out misbehaviors, you prove that such misbehaviors won't arise - just like any other theorem. And that's it.
I'm not a malicious person. I don't purposefully abuse any system's flaws. But if anyone else does, my sympathies aren't with the designer of the flawed system.
P.S.: Appeals to authority won't help make your case.
These so-called “naughty strings” expose implementation errors in code that processes widely used formal languages such as numeric literals, URLs, HTML, JSON, SQL, etc. These languages are so widely used that it's criminal not to have formal specifications for them. And the mathematical techniques for constructing programs that meet their formal specifications are very well known.
Fine, imagine we have a formal specification for SQL. Now how do I make sure my parser is compliant with the spec without testing it? Formal verification is a very active research area, I don't think this is quite as easy as you're implying. How do I avoid "implementation errors" without exposing them?
(0) Design languages so that implementation errors are harder to make and easier to detect. For example, avoid context-sensitive grammars like the plague.
(1) Design implementations (parsers, code generators, etc.) with the logical argument for their correctness in mind. That is, don't attempt to verify an existing possibly incorrect program - write it to be correct right from the beginning! This is greatly aided by designs that meet criterion (0).
These are obviously useful ideas, but "write it to be correct from the beginning"? Are you serious? This is the oldest joke in software engineering. "Don't worry about testing it, I don't make mistakes." No matter how idiot-proof your languages and frameworks are, it is grossly irresponsible to not test work that a human has done. Until developers are themselves replaced by formally verified programs, testing is an absolute necessity.
I doubt human programmers can be fully replaced, and I'm not saying testing is completely useless. But the sheer number of “naughty strings” in that list is an indictment of our languages: They have way too many corner cases, way too many traps for us to fall into.
I still don't understand your logic. Are you saying once a program passes a test, we should stop using that test? The point of this list is to cover all classes of input in general, not just ones that a specific framework has issues with.
These are corner cases in the concept of user input, not just corner cases of any specific parser. What if it's a number, what if it's not? What if it's the same alphabet as the code, what if it's not? What if it is valid code? What if it's empty, what if it's not? etc. Even if you've written the perfect parser in the perfect language, you still need to have unit tests for all of this stuff. They are traps caused by human definitions of "input" and "string", which cannot be formally verified.
> Are you saying once a program passes a test, we should stop using that test?
No. I'm saying that programs have to be proven correct. Then you can use tests to rule out other pesky problems that have nothing to do with your design being incorrect. (For example, you could prove a program correct on paper, then transcribe it incorrectly to a computer. It has happened to me before.)
> These are corner cases in the concept of user input
“undefined” and “null” aren't special cases in the concept of user input - they're special cases in languages that happen to have “undefined” and “null”.
Octal numeric literals aren't special cases in the concept of number - they're special cases in languages where octal literals begin with the prefix “0”, rather than something more sensible like “0o”.
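Python 3 actually made exactly that change, which is a nice concrete check of the claim: the `0o` prefix is required, and the old bare-leading-zero form is now a syntax error rather than a trap.

```python
# Python 3 requires the explicit 0o prefix for octal:
assert 0o17 == 15

# The old C-style form "017" was removed because it reads like seventeen;
# it now fails to even compile.
try:
    compile("017", "<example>", "eval")
    raised = False
except SyntaxError:
    raised = True
assert raised
```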
Failing to distinguish between escaped and unescaped strings is also a language problem - they should have different types!
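A minimal sketch of that "different types" idea in Python, similar in spirit to MarkupSafe's `Markup` type (`SafeHtml` is a made-up name for illustration): the type records that escaping has already happened, which makes escaping idempotent and keeps raw and escaped strings from being mixed up.

```python
import html

class SafeHtml(str):
    """Marker type: the contents are already HTML-escaped."""

def escape(value):
    """Escape plain text; pass already-safe values through unchanged."""
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(str(value), quote=True))

raw = '<b>"hi"</b>'
once = escape(raw)
twice = escape(once)   # idempotent: no double-escaping
assert once == twice
```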