Using Multi-Byte Characters to Nullify SQL Injection Sanitizing (hackallthethings.com)
56 points by wlrm on June 26, 2016 | hide | past | favorite | 36 comments



In order for this to take effect, you need:

1. To "sanitize" SQL injection by quoting parameters and building SQL strings manually.

2. A quoting function implemented in a language that iterates over multibyte characters properly (i.e., you're not operating naively on binary strings).

3. An output multibyte charset that is one of the afflicted charsets (mostly CJK charsets).

4. A DBMS that is incapable of properly handling the relevant charset, and is instead checking the string in a different charset.

Point #4 makes it really hard to envision this ever being a problem, since if you're using a CJK charset on your platform already, you're quite likely to notice very quickly that something is horribly, horribly wrong.
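For anyone who wants to see the failure mode concretely, here's a minimal sketch in Python. It assumes a naive byte-level sanitizer that prefixes quotes with a backslash, and uses GBK as the afflicted charset (the classic 0xBF 0x27 example):

```python
# User input: a GBK lead byte (0xbf) followed by an apostrophe.
payload = b"\xbf' OR 1=1 -- "

# A naive byte-level sanitizer prefixes the quote with a backslash (0x5c).
escaped = payload.replace(b"'", b"\\'")

# But 0xbf 0x5c happens to be one valid GBK character, so a GBK-aware
# parser consumes the backslash as the trail byte of that character --
# and the apostrophe survives, live and unescaped.
decoded = escaped.decode("gbk")
assert "'" in decoded
```

The "sanitized" bytes decode to a single CJK character followed by a bare quote, which is exactly the injection the escaping was supposed to prevent.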


I was thinking the same thing, and do people really still build SQL strings programmatically in the server application?

Seems like this bad practice should be dead by now. Not only does it open the door to injection attacks, it also prevents the database from optimizing the query through precompilation, caching, and building smarter execution plans.


It happens a lot in junk WordPress plugins. Even with built-in support for doing it better [1], far too many just use straight concatenation.

[1]: https://developer.wordpress.org/reference/classes/wpdb/prepa...


It should be dead by now, but I had to move a website built by an agency to a new server. I don't really write PHP, but I could see that it was full of concatenated SQL strings.


As always, just use parameterized queries. They are the best, and only, defense you need against SQL injection.


This is absolutely good advice, but people giving it rarely acknowledge that it isn't always possible.

A particular irritant is that most (all?) db APIs make it impossible to use parametrized queries with IN clauses. I find myself having to implement string escaping functions sooner or later in every project, because of stuff like that.


That's when you combine parameterized queries and concatenated SQL:

    sql = '... col_name IN ('
    first = true
    foreach(var arg in args)
      if (!first)
        sql.append(', ')
      first = false
      sql.append('?')
      params.add(arg)
    sql.append(')')
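In Python, for instance, the same pattern collapses to a couple of lines. A sketch using the stdlib sqlite3 module (table and column names invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col_name INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,)])

args = [2, 4]
# Build one "?" placeholder per value; the values themselves still
# travel out-of-band as bound parameters, so nothing is concatenated.
placeholders = ", ".join("?" for _ in args)
rows = con.execute(
    f"SELECT col_name FROM t WHERE col_name IN ({placeholders})", args
).fetchall()
assert rows == [(2,), (4,)]
```

Only the placeholder list is concatenated, never user data, so the injection surface stays closed.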


> A particular irritant is that most (all?) db APIs make it impossible to use parametrized queries with IN clauses

Psycopg2 has a generic adaptation of iterables to a form suitable for IN, but it's hand-rolled: https://github.com/psycopg/psycopg2/blob/732ea90a4ff85f6d0cc...

Real convenient though.


This is such an irritant. AFAIK every engine supports a variable-size IN clause in the form of a sub-select, so it should be possible to abstract that away and offer an API for it. Yet for years it's been missing, and you end up with all sorts of garbage, under-performing code.

Our current solution is to use a temporary table, because the other solutions (including dynamically adding parameters to the query) make every query unique and crash performance.


At least in Postgres, you can use ANY and pass an array of values. From what I've read, the generated query plan is equivalent.


> From what I've read, the generated query plan is equivalent.

I believe at one point IN (values) was just sugar for = ANY(ARRAY(values)), don't know if that's still the case.

Still requires that your DB adapter supports array parameters though.


Alternatively, you can use a temp table and a join. Not ideal, but still safe from injection and should have roughly the same overhead in the database.


This isn't the case with Oracle, MySQL, or HSQLDB, at least when using JDBC. I believe it's also not true for SQLite and Postgres.


select * from Person where email = any(?)


This.

I really don't get why anyone in this day and age doesn't use parameterised queries.

It's such a big security win and comes with such little programming overhead that it boggles my mind to think people still use manual string escaping.


In Postgres until 9.2, the query plan for prepared statements is fixed at prepare time rather than time of execution, so you might end up with worse query plans. It's worth noting that 9.1 is still the version in Ubuntu 12.04, it's only 14.04 that has 9.3.


It should be noted that Postgres offers a repository with the latest versions even for Ubuntu 12.04, so you don't need to upgrade the distro or compile PG yourself to use them.


This has been an annoyance in many DBMS engines; I have previously solved it with recompiles, either triggered or nightly/hourly.


> It's worth noting that 9.1 is still the version in Ubuntu 12.04, it's only 14.04 that has 9.3.

And the most recent LTS, 16.04, has 9.5.


In which cases would different values end up with different query plans? NULL values?


That's one of the main points of gathering statistics, so the query plan can depend on the values being queried. At least in Oracle, not sure about Postgres. For long running queries it can be a big win.


Here is an example [1] of something as simple as an equality comparison on an indexed column falling back to a sequential table scan.

[1] http://blog.endpoint.com/2014/04/custom-plans-prepared-state...


Skewed distribution.


People may use frameworks which sometimes default to concatenation, even when the ORM makes it look like they're using parameterized queries.

People care more about convenience and fast iteration than security - concatenation is faster and easier, and insecure sites can still make money, so the incentives are on the side of fast, dirty code that works now rather than secure code that works later.

People (this is especially true in the PHP world) may not even know parameterized queries exist.


I too wish more people in the PHP world would use parameters, but alas, your second paragraph sums up the common mentality. If it's still going to make money, why go to the effort? Personally, I love sending a query with an object and just letting the engine do the work.

I see this sort of thing on a daily basis, but it's not from PHP developers but from our "lead" Delphi developer. I've tried to convince him of the benefits of parameterized queries, but he believes they would be a performance hit (a performance hit for him, maybe; he'd have to struggle to learn something new).


Not just security, but also readability (in most languages) and a performance gain (in popular RDBMSes and drivers).


"Web applications sanitize the apostrophe (') character in strings coming from user input being passed to SQL statements using an escape (\) character."

Please, please don't say this. In the SQL standard, backslash is NOT the escape character for a string literal.

PostgreSQL, starting in v8.2, began transitioning from C-style escapes (using backslash) to SQL standard escapes (where a single quote is escaped with another single quote). Standard behavior is the default, but can be controlled with the variable standard_conforming_strings.
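The standard escape is easy to see in any engine that follows it; a quick check with the stdlib sqlite3 module, which implements standard string literals:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Standard SQL: a quote inside a string literal is doubled, not backslashed.
value = con.execute("SELECT 'it''s'").fetchone()[0]
assert value == "it's"

# And a backslash has no special meaning at all inside the literal.
backslash = con.execute("SELECT '\\'").fetchone()[0]
assert backslash == "\\"
```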

But you shouldn't have to know that anyway. Use out-of-band parameters that are passed in the wire protocol separately from the string. Web frameworks should already ensure this, and if they don't, they are likely broken.

If you are writing a web framework and you need to use escaping for some reason: first, make sure you can't use parameters on the wire instead; then read the product-specific literal parsing rules very carefully, considering things like multibyte characters.


In-band: bad. Out-of-band: good.

The same goes for JavaScript, escape sequences, SQL string concatenation and so on. It always seems like a great idea but it is just about impossible to get it airtight.

Better do your signaling, commands and meta-data through another channel.


This is a pretty well-known problem, but mostly irrelevant if you use UTF-8.


Or, you know, just doing the right thing(tm) and using parameterized queries


Yes. If you are building SQL by concatenating user inputs (escaped or not) you are doing it wrong.


IMO building SQL by concatenating anything feels wrong.

I still do it, and I haven't used an ORM yet that is actually useful, but it still feels wrong.


Section 3.1 of Unicode technical report 36 describes a couple of similar things specific to UTF-8: overlong sequences, and ill-formed subsequences.

Does your software conform to Unicode 3.0 and earlier, 3.1 through 5.1, or 5.2 and later? And do your server and client software agree? If you don't know the answer (and you depend on string sanitization), you may be at risk.
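For example, 0xC0 0xA7 is an overlong two-byte encoding of the apostrophe (U+0027). A lenient decoder accepts it, so a sanitizer scanning the raw bytes for 0x27 never sees a quote to escape; a conforming modern decoder rejects the sequence outright. A sketch in Python, whose UTF-8 codec follows the strict rule:

```python
overlong_quote = b"\xc0\xa7"  # overlong UTF-8 encoding of "'" (0x27)

# A byte-level sanitizer looking for 0x27 finds nothing to escape...
assert b"'" not in overlong_quote

# ...while a conforming decoder refuses the overlong sequence entirely.
try:
    overlong_quote.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert decoded_ok is False
```

If any component in the chain still decodes overlong forms leniently, that smuggled quote comes back to life downstream.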


What is particularly fun is the history of GB2312/GBK/GB18030. I wonder how easy it would have been to change Win95 to use UTF-8 for example.


I wonder: does that 0x5c byte have to be the last one in the character, or can it also be the first or second one, followed by the byte value of ';', the statement terminator in MySQL?


String literals in SQL were a mistake.



