June 23, 2016

Using Multi-byte Characters To Nullify SQL Injection Sanitizing

There are a number of hazards that using multiple character sets and multi-byte character sets can expose web applications to. This article will examine the normal method of sanitizing strings in SQL statements, research into multi-byte character sets, and the hazards they can introduce.

SQL Injection and Sanitizing

Web applications commonly sanitize strings that flow from user input into SQL statements by escaping the apostrophe (') character with a backslash (\), the escape character, whose hex code is 0x5c. When an attacker puts an apostrophe into user input, the ' is turned into \' during sanitizing. The DBMS does not treat \' as a string delimiter, so under normal circumstances the attacker cannot terminate the string and inject malicious SQL into the statement.
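A minimal sketch of this kind of byte-level sanitizing is shown here. The function name naive_escape, the table name, and the sample input are hypothetical and not taken from any particular framework. Escaping existing backslashes is an added assumption, since the article only describes apostrophe escaping.

```python
def naive_escape(raw: bytes) -> bytes:
    """Byte-level sanitizing: escape backslashes, then apostrophes."""
    # Assumption: existing backslashes are escaped first so they cannot
    # cancel out the escape that is added in front of an apostrophe.
    raw = raw.replace(b"\\", b"\\\\")
    # Turn each apostrophe (0x27) into backslash + apostrophe (0x5c 0x27).
    return raw.replace(b"'", b"\\'")


# Hypothetical attacker input that tries to close the string early.
user_input = b"x' OR '1'='1"
query = b"SELECT * FROM users WHERE name = '" + naive_escape(user_input) + b"'"
print(query.decode("ascii"))
```

Every apostrophe supplied by the user now arrives at the DBMS preceded by 0x5c, so the only unescaped apostrophes left are the two delimiters written by the application.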

If a multi-byte character supported by the server ends in the byte 0x5c, an attacker can insert that character's leading byte (its prefix) immediately before the apostrophe. The escape byte added by sanitizing then combines with the prefix to form a different character altogether, and the apostrophe passes through unescaped and terminates the string. While this idea isn't new, finding research online that includes a complete list of the affected character sets and characters is cumbersome at best. This article attempts to put the research and tools in one place.
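The effect can be sketched in Python under a few assumptions: the server interprets the statement as GBK, the lead byte 0xbf is used as the prefix, and sanitizing is done byte by byte by a hypothetical naive_escape function. None of these specifics come from the article; they are chosen only to illustrate the mechanism.

```python
def naive_escape(raw: bytes) -> bytes:
    """Hypothetical byte-level sanitizer: escape backslashes, then apostrophes."""
    return raw.replace(b"\\", b"\\\\").replace(b"'", b"\\'")


# The attacker places a multi-byte lead byte (0xbf) right before the apostrophe.
payload = b"\xbf' OR 1=1 -- "

# Sanitizing inserts 0x5c before the apostrophe: bf 27 becomes bf 5c 27.
escaped = naive_escape(payload)

# A server reading the bytes as GBK pairs 0xbf with the inserted 0x5c,
# so the escape byte is consumed as the tail of a two-byte character.
as_seen_by_server = escaped.decode("gbk")

print(escaped)
print(as_seen_by_server)
print("backslash survived:", "\\" in as_seen_by_server)
```

In the decoded text the first character is a single CJK character and the second is a bare apostrophe, which is the condition the attacker needs to terminate the string.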

Researching Multi-byte Character Sets

A small Python script was devised to determine which character sets contain multi-byte characters ending in 0x5c. The script iterates over all installed character sets and inspects the hexadecimal value of each character. A list of the character sets found to contain valid multi-byte characters ending in 0x5c is provided in Figure A. Additionally, a video of the script running is provided in Figure B to show what the output should look like.

Figure A: Character sets containing valid multi-byte characters ending in 0x5c
Used in Taiwan, Hong Kong, and Macau for "Traditional Chinese"
Hong Kong's Big5 Supplementary Character Set
Windows-31J (Japanese)
Microsoft's implementation of Big5
Chinese National Character Set
Simplified Chinese
Korean Legacy Encoding
Shift Japanese Industrial Standards
Figure B: Multi-byte Inspection Script Video
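The original script is not reproduced in this article. The following is a rough sketch of how such an inspection could be written, not the author's code. It assumes that the codecs shipped with Python's standard library stand in for the "installed character sets", that only two-byte sequences need checking, and that only ASCII-compatible codecs are of interest. The names leads_absorbing_backslash and affected are hypothetical.

```python
from encodings.aliases import aliases

BACKSLASH = 0x5C


def leads_absorbing_backslash(codec: str) -> list[int]:
    """Return lead bytes that form a valid two-byte character with 0x5c."""
    try:
        # Keep only codecs in which ' and \ are ordinary single bytes.
        if b"'\\".decode(codec) != "'\\":
            return []
    except (LookupError, ValueError):
        # Unknown codec, non-text codec, or not ASCII-compatible.
        return []
    leads = []
    for lead in range(0x80, 0x100):
        try:
            text = bytes([lead, BACKSLASH]).decode(codec)
        except ValueError:
            continue  # not a valid sequence in this codec
        # If no backslash remains, 0x5c was consumed as a trailing byte.
        if "\\" not in text:
            leads.append(lead)
    return leads


affected = {}
for codec in sorted(set(aliases.values())):
    leads = leads_absorbing_backslash(codec)
    if leads:
        affected[codec] = leads

for codec, leads in affected.items():
    print(f"{codec}: {len(leads)} lead bytes, e.g. 0x{leads[0]:02x}5c")
```

The exact set of codecs reported depends on the Python version and platform, so the output is best treated as a starting point for checking a specific server configuration.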


In conclusion, there are hundreds of multi-byte characters that could potentially allow attackers to perform SQL injection despite sanitizing. It is interesting to note that each of these character sets is intended for use in a specific region of the world. The vulnerability only occurs when multiple, different character sets are in use, so it can be fixed by forcing both the web server and the SQL server to use the same character set. Those looking to do so may find this research interesting.
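One way to picture the "same character set on both sides" idea is a sanitizer that first decodes the input using the character set the database connection is assumed to use, and only then escapes at the character level. This is a sketch under that assumption, with a hypothetical function name, and is not a drop-in replacement for a database driver's own escaping.

```python
def charset_aware_escape(raw: bytes, charset: str) -> bytes:
    """Escape input after decoding it in the charset the database will use."""
    try:
        # Malformed sequences, such as a dangling lead byte, fail here.
        text = raw.decode(charset)
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid {charset}") from exc
    # Escaping whole characters cannot split or complete a multi-byte sequence.
    text = text.replace("\\", "\\\\").replace("'", "\\'")
    return text.encode(charset)


# Well-formed input is escaped as usual.
print(charset_aware_escape("O'Brien".encode("gbk"), "gbk"))

# The lone 0xbf prefix followed by an apostrophe is not valid GBK, so it is rejected.
try:
    charset_aware_escape(b"\xbf' OR 1=1 -- ", "gbk")
except ValueError as err:
    print("rejected:", err)
```

The charset argument only helps if it really matches what the SQL server uses for the connection. If the two differ, the mismatch described in this article is still present.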


  1. Hey there. Nice article!

    Question: To your knowledge, are any default English database configurations affected? Or do the configs need to actively be modified by admins to support non-English character sets?

  2. To my knowledge, default English databases are not affected. However, I've seen certain situations in which, for some reason, the default locale or language wasn't specified for the entire operating system (like a missing file from /etc). When this happens, there are times that MySQL will randomly support a large number of character sets and default to whichever one is indexed first, whether alphabetically or by its position on the filesystem.

