1213486160 has a friend: 1195725856 (rachelbythebay.com)
522 points by TimWolla on Dec 8, 2016 | 100 comments



Hah, one of the side effects of doing embedded programming is spending a lot of time staring at hex dumps with ASCII in them: hex on the left, actual characters on the right. As a result you start recognizing a lot of ASCII characters when you see the hex codes for them.

I was debugging a 'native' library for a scripting language, and the code seemed to have a much bigger runtime footprint than I expected. It kept allocating an odd-sized buffer, a bit over 13,000 bytes. Walking it back through the scripting-language-to-C interface, the buffer it wanted was 32 bytes long, but the scripting language was passing the length as the string '32', so the C side saw 0x3332 bytes. Oops! Reading hex and seeing ASCII is a very useful skill to develop.
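Something like this, as a rough C sketch (names are made up; it just shows how the ASCII digits '3' and '2' turn into 0x3332 when a string is misread as a binary length):

  #include <stdio.h>

  int main(void) {
      /* The C side wanted a 32-byte buffer... */
      const char *len_from_script = "32";        /* the bytes 0x33, 0x32 */

      /* ...but the glue code read those two ASCII bytes as a binary length. */
      unsigned misread = ((unsigned char)len_from_script[0] << 8)
                       |  (unsigned char)len_from_script[1];

      printf("intended: 32 bytes, allocated: 0x%x = %u bytes\n",
             misread, misread);                  /* 0x3332 = 13106 */
      return 0;
  }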


The structure of ASCII itself is very useful for this, if you know the ordinal positions of letters in the alphabet.

Uppercase letters are 0x40 + the position of the letter in the alphabet, so "E", being the 5th letter, is 0x45, "I", being the 9th letter, is 0x49, and so on.

Lowercase letters are 0x60 + the position of the letter in the alphabet, so "e" is 0x65, "i" is 0x69, and so on.

That also means that you can swap case by flipping a single bit (XOR with 0x20).

Finally, digits are 0x30 + the digit's numeric value (including 0), so the digit "5" is 0x35.

(All of these properties were very intentional on the part of ASCII's creators.)
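A quick C sketch of those properties, nothing clever, just the bit tricks spelled out:

  #include <stdio.h>

  int main(void) {
      unsigned char c = 'E';
      printf("'E'  = 0x%02X  (0x40 + 5)\n", c);           /* 0x45 */
      printf("flip = '%c'    (XOR 0x20)\n", c ^ 0x20);    /* 'e', one bit away */
      printf("'5'  = %d      ('5' - 0x30)\n", '5' - 0x30);
      return 0;
  }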


Relatedly, if you want to impress people with the ability to "read binary", and you know that something is plain ASCII text represented in binary, just look at the rightmost 5 bits of every byte. They will be the ordinal position of the letter.

"Hello" is

01001000 8 (h)

01100101 5 (e)

01101100 12 (l)

01101100 12 (l)

01101111 15 (o)

And when you see all zeroes, it's probably 00100000, the space character.
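In code that's just a mask with 0x1F; a tiny sketch:

  #include <stdio.h>

  int main(void) {
      const char *s = "Hello World";
      for (const char *p = s; *p; p++)
          printf("%c -> %2d\n", *p, *p & 0x1F);  /* letters -> 1..26, space -> 0 */
      return 0;
  }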


I did a talk at the DocklandsLJC on the history of Unicode, and covered the reason for the bit patterns in ASCII. The video was recorded and is available at

http://www.docklandsljc.co.uk/2016/06/unicode-cuddly-applica...

The specific slide regarding ASCII code points is here:

https://speakerdeck.com/alblue/a-brief-history-of-unicode?sl...


Aha, THAT explains ^H, ^C, ^D, ^[ and so forth. I can't believe this eluded me for so long


If 01001000 is 'h' what is 01101000?


I deliberately wrote the "h" in lowercase, even though the ASCII character is uppercase, because I was advising looking only at the five least-significant bits, which won't tell you the case. Sorry for the confusion.


Good clarification thank you :)


01001000 is H

01101000 is h

Try playing around with http://www.asciitohex.com/, it's fun.


I think 01001000 was supposed to be 'H'


01001000 is 'H' and 01101000 is 'h'


Conveniently, if you started with a Dragon 32/TRS-80 Color keyboard (and presumably some others of that time?), the symbol above digit N was ASCII symbol 0x2N: https://upload.wikimedia.org/wikipedia/commons/5/50/PIC_0119...

The standard US QWERTY keyboard does not quite follow this, though it is close (there are some insertions and substitutions ;)).


I see ![1] at 21, #[3] at 23, $[4] at 24, %[5] at 25, and that's all that literally match.

&[7] at 26, ([9] at 28, and )[0] at 29 are off by one in their current QWERTY keyboard positions. If we didn't have ^ and * where they are, then &, (, and ) would be in the right places to continue your pattern.

@, ^, and * don't fit the pattern at all.

I should also have mentioned this amazingly scholarly piece by Tom Jennings, which explains probably everything there is to know about where everything we've been talking about came from:

https://web.archive.org/web/20030201161943/http://www.wps.co...

Unfortunately it looks like someone else is now running wps.com so you can't get this directly at its original home anymore.


Yes, some others too. The BBC Micro, for instance. See https://en.wikipedia.org/wiki/Bit-paired_keyboard for some of the history.


Reminds me of doing development for old Macs. Apple would combine four characters and cast them to an integer - https://en.wikipedia.org/wiki/FourCC

They used these four-char-codes for error codes, file formats, return values and other random things. After a while I could read and recognize the text from the HEX value of the integers. Even now I still see them pop up sometimes on iOS as error codes from random networking or filesystem errors.
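The general idea looks roughly like this in C (a sketch of the packing, not Apple's actual headers, which define their own macros and constants for it):

  #include <stdio.h>
  #include <stdint.h>

  /* Pack four characters into one 32-bit value, first char in the high byte. */
  #define FOURCC(a, b, c, d) \
      (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
       ((uint32_t)(c) << 8)  |  (uint32_t)(d))

  int main(void) {
      uint32_t code = FOURCC('T', 'E', 'X', 'T');
      printf("'TEXT' = 0x%08X = %u\n", code, code);   /* 0x54455854 */
      return 0;
  }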


It's also a side effect of playing The Talos Principle, which has a lot of secret messages in hex-encoded ASCII (and they have to be decoded character-by-character, because you can't copy and paste text from a 3D rendered image).


I used the Google Goggles app on my phone to transcribe these to text, then pasted the result into a hex-to-ASCII website, and voila :)


I want to play this game even more now.


Assuming you like puzzles, it's well worth the time and money! Makes a nice complement to SOMA, which touches on very similar themes from a quite different perspective.


> Reading hex and seeing ASCII is a very useful skill to develop.

I found it to be quite draining as well when I was reverse engineering an old proprietary software system to develop a tool to import/export data to it. It definitely put my brain in a weird place after spending hours poring over network dumps and determining how it all worked. It did kind of feel like that scene from The Matrix where what's-his-name is staring at the streaming code on the screen, though :P


This might be stating the obvious, but I think that probably has more to do with the quantity of ASCII codes, and less to do with the "translation" itself. I have significantly more difficulty reading hex dumps than I ever had reading G-code (that's the machine code that runs CNC mills, most 3d printers, etc). After a few years of working in a shop, I was fluent enough in G-code to be literally talking in it (on occasion, when appropriate) while debugging things. Hex dumps are far, far more intimidating for me to manage.


On the other hand, reading octal and seeing ASCII is really tricky. One thing I've realized from working with the Xerox Alto is that octal is really awful and hex is much better.

In octal a 16-bit word doesn't split up evenly into bytes, so even recognizing ASCII is difficult. For example, the string "AA" in hex is 0x4141, where each 41 is 'A' - pretty easy. But in octal, it's 040501; 0101 is 'A' but gets multiplied by 4 in the upper byte. (One thing in defense of octal: the 8080/Z-80 instruction set makes much more sense if you look at it in octal.)
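A throwaway sketch that prints the same 16-bit word both ways, just to make the difference visible:

  #include <stdio.h>

  int main(void) {
      unsigned word = ('A' << 8) | 'A';      /* the 16-bit word "AA" */
      printf("hex:   %#x\n", word);          /* 0x4141 - each byte readable */
      printf("octal: %#o\n", word);          /* 040501 - the 'A's get smeared */
      return 0;
  }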


As I recall, some DEC utilities supported 'split octal', which was base 8 on 8-bit boundaries: a 16-bit value of 0xffff was represented as 377377 rather than 177777. But their (DEC's) love of octal was why you got 18-bit and 36-bit machines.

It's a good example of how language affects design: the use of hexadecimal vs. octal and its impact on computer architecture.


x86 is also octal-structured, and I can mentally assemble and disassemble most of the basic instructions from a hexdump by converting between hex and octal:

https://news.ycombinator.com/item?id=13045558

> In octal a 16-bit word doesn't split up evenly into bytes, so even recognizing ASCII is difficult.

That's true only if your hexdump is in 16-bit words; in bytes, it's just as straightforward: A-Z is 101 through 132, and a-z is 141 through 172. Incidentally, these are also where x86 puts the single-byte inc/dec/push/pop instructions.


I believe most linux distros come with 'od', which can dump octal by byte.

  $ echo 'ABC' | od -bc
  0000000 101 102 103 012
          A   B   C  \n
  0000004


If you debug math for a while, you get similar skills in recognizing the decimal versions of factors like 1/(2 pi) and 1/e. You get a little tingle: that looks familiar ...

Sometimes it can save a lot of time chasing down constant factors.


Same goes for me when doing CAD and having to constantly convert between standard & metric -- 254, 508, etc show up in numbers a lot and it gives you a nice hint that it's a "clean" value in the other system. Super useful for reading IC footprints off datasheets that only use one of the two systems


> Hah, one of the side effects of doing embedded programming

or reverse engineering hardware or software :)


Also obscure or undocumented file formats.


I do the opposite. I often have to identify file types in an environment where we're not allowed to install the tools to read them, so I've gotten good at identifying, say, Word 97 documents by opening them in Notepad.


That sounds strange. Since you have Notepad, I guess you are running Windows, which usually uses file extensions to identify files. Can you not see the extension?

*idle programmer mode enabled* It seems that it would be possible to have special tools built and installed, which would claim those file types and have an icon with a 'no entry' symbol superimposed on the original type (i.e. a Word icon with a red circle and line through it, for .doc, etc.). The tool itself could be a simple program that just opened a notification saying '<tool> not installed'.


> can you not see the extension

If you see a file that's called "something.doc", can you tell whether it's a flyer for someone's holiday party versus another sample of that malicious RTF that's been going around this week? All the extension does is let Windows put an icon on it and dispatch it to the right application, and if the application supports multiple file formats it does the actual identification by the file header.


"...there's way too much information to decode the Matrix. You get used to it, though. Your brain does the translating. I don't even see the code. All I see is blonde, brunette, redhead."


Passing the very first 4 bytes you receive straight to malloc with no sanity checking? I suspect that application is riddled with other vulnerabilities!


Oh, now that you mention it, I finally understand the connection between those numbers and malloc. It wasn't obvious from the article:

Length-prefixed protocols are often used on top of TCP: a few (e.g. 4) bytes giving the length of a "packet" are sent first, and the packet itself is sent afterwards. That way you can create the notion of messages and message boundaries on top of stream-oriented TCP.

In the receiver implementation you first read the length, then allocate a buffer for the packet, then read the remaining message into the buffer. If someone sends an HTTP request to such a receiver, it will interpret the first bytes of the request as the length and try to allocate a buffer of that size.

The most sensible way to avoid this is to check the first bytes against a maximum message size. However, I have to admit that in my very first implementation of such a protocol I didn't think about this either.
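A minimal sketch of a receive loop with that sanity check in place (the names and the 1 MB cap are made up for illustration):

  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define MAX_MSG_LEN (1u << 20)   /* arbitrary 1 MB cap for this sketch */

  /* Read exactly n bytes, or return -1 on error/EOF. */
  static int read_full(int fd, void *buf, size_t n) {
      char *p = buf;
      while (n > 0) {
          ssize_t r = read(fd, p, n);
          if (r <= 0) return -1;
          p += r;
          n -= (size_t)r;
      }
      return 0;
  }

  /* Returns a malloc'd message (caller frees), or NULL on any failure. */
  char *recv_message(int fd, uint32_t *out_len) {
      uint8_t hdr[4];
      if (read_full(fd, hdr, sizeof hdr) != 0) return NULL;

      uint32_t len = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
                     ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];

      /* The check this thread is about: "GET " decodes to 1195725856 and
         would otherwise sail straight into malloc. */
      if (len == 0 || len > MAX_MSG_LEN) return NULL;

      char *buf = malloc(len);
      if (buf == NULL) return NULL;
      if (read_full(fd, buf, len) != 0) { free(buf); return NULL; }

      *out_len = len;
      return buf;
  }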


I suspect this kind of thing is common in "in-house" applications where the assumption is that the app will never be exposed to the open internet.


It is, but it creates the 'soft chewy center' of production infrastructure that hackers so enjoy.

People are supposed to do defense in depth, where there are multiple layers such that a compromise of a single one only leads to limited damage, but too often it's more like there's a hard outer shell surrounding a soft chewy center where they thoroughly and completely own you once the outer layer has been compromised.


Yep. But sometimes you don't get to choose where to hold your last stand. You were not PLANNING for the abandoned kindergarten to be your command post against the rebels but that is where you ended up. Same goes for corporate security sometimes. The chewy soft center was written in happier, more cheerful times. By undergrads. Who now are fully vested and long since left the building. Then the mercenaries are called in to defend it against the Internet. But I digress.


Yes, but the natural thing would be to not have anything speaking HTTP in its path

(Unless of course some IT department has transparent proxies that try to be too smart)

And that's why some sanity checks are important (magic numbers on the protocol, size limits, etc)


Another possibility is that someone in IT or corporate security bought a fancy vulnerability scanner and is blindly running it against everything on the internal network. They might even change any existing firewall rules to allow it to contact every host


Vulnerability scanners tend not to do anything with unknown custom services, running on ports they probably don't even bother to scan.


Major enterprise vulnerability scanner author here (15 years ago). Yes, we scanned the port, and yes, we ran a lot of fingerprinting rules against it until we figured out what it was (or ran out of rules), so that we'd know what vulnerabilities to test for.

This had some amusing side effects when we encountered some services we'd never seen before, like the port on HP printers that sends every byte straight to print... apparently expecting PCL or PostScript but if it didn't understand it, it just printed the ASCII. Came into the office one morning to find all printers out of paper and 500 sheets sitting in the output tray. Oops.


It worked, you found a vulnerability.


> the port on HP printers that sends every byte straight to print

That seems like a design flaw. :)


I remember my first Epson dot matrix printer receiving data from my Z80.

Printing a graphic was done by sending it every single dot: 1 to put a dot on the paper, 0 to leave it blank. Encoded as bytes...


At a minimum, it seems like a way to DOS an HP printer...


Now that's "Resource Exhaustion".


And not just the paper. You might be able to DOS the ink as well.


If we are going there, toner


Still not an excuse to trust data you're getting across an I/O boundary.


...and the first thing that happens when it ever happens to be connected to the internet is that it gets broken and has to be rewritten. That's why it's good to care about the quality in the first place...


That's the best-realistic-case scenario. The most likely case is that even if a few problems are found and patched, some serious vulnerabilities will be left in place, unnoticed until someone from outside, without your interests at heart, goes looking for them.


Chances are that if you're connecting a formerly in-house application to the Internet, you want to force a rewrite. It likely has a lot of other security vulnerabilities that would let an attacker pwn your corporate network. You build software for the open Internet very differently from how you build it for a trusted environment; you don't want to fool yourself into thinking the latter is good enough for the former.


What are the differences really? What's a "trusted environment" anyway?


Every program needs to make some assumptions about what operations it's going to trust and what it's going to verify. Most programs, for example, assume that (length + 1) [where 'length' is a simple 32-bit integer] will have no side effects, even though there's no absolute guarantee that the NSA hasn't backdoored your compiler or a Chinese supplier hasn't backdoored your CPU to open a rooted shell remotely. Most programs will also assume that (length + 1 > length), even though in many languages this isn't universally true because of integer overflow. If your program never deals with integers that large, it doesn't matter.

For most public programs, the trust boundary is generally assumed to be the process. Any I/O the process does is assumed to be untrusted; it could do anything. But anything inside the process is assumed to work as the language says it does, because the OS is assumed to provide memory protection that prevents other processes from tampering with it. (Some big companies go a step further and dictate that you're not to trust 3rd-party libraries unless the code has been specifically audited; this is generally a sensible practice security-wise, but a huge drag on developer velocity.) If you couldn't trust the basic machine operators, you'd never get anything done - you'd have to write sanity checks everywhere, and then you have no guarantee that the sanity checks themselves aren't backdoored.

For many internal apps, the network is inside the trust boundary. It's assumed that any network connection comes from a trusted source, because otherwise the firewall would've rejected this. And being able to assume this saves a lot in developer velocity; it becomes feasible to write one-off internal tools without the devs having to carefully audit all the I/O & cross-process code for vulnerabilities. If you didn't have this trust, most of these apps wouldn't get written, because the productivity benefit they provide isn't greater than the cost of writing a hardened, secure system. It's not just networking calls; if you can assume that your users are non-malicious employees, you also don't need to worry about XSS or XSRF, pathological regexps, DOS attacks based on large payloads, etc.


And sometimes you don't trust the operations. I've heard tales of flight computer software running on 3 computers from 3 vendors running 3 implementations of software from 3 independent teams. Then the 3 computers voted on a course of action.


You may be interested in this PDF[1] which describes what you're talking about.

Title: Triple-Triple Redundant 777 Primary Flight Computer

[1] http://www.citemaster.net/get/db3a81c6-548e-11e5-9d2e-00163e...


Thank you. I heard it about the Saab Gripen fighter/attack/recon craft in the 90s, but for me it's a faded memory of hearsay, so it's cool to read a more substantial reference. :)


If the format is 4 bytes of length followed by that number of bytes -- how exactly do you sanity check it, if you intend to occasionally send some really big whatevers at the start of a stream?


Most of the time you could set reasonable bounds on what you expect the length to be. How often does your custom protocol require multi-gigabyte messages?

If you can't, sanity check against negative numbers, make sure you check the return value of malloc(), and set a timeout on actually reading that much data. If the malloc fails, close the connection. If the timeout expires, close the connection and free the resources. On the public Internet, you probably don't want to send an error message, since it's just exposing internal system information an attacker could exploit. (In debug mode, running internally, you probably do want to expose this.)
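For the timeout part, a hedged sketch of what that might look like with a plain socket receive timeout (SO_RCVTIMEO; the wrapper name is made up):

  #include <sys/socket.h>
  #include <sys/time.h>

  /* Give up on a peer that claims a huge length but never sends the data. */
  static int set_recv_timeout(int fd, int seconds) {
      struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
      return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
  }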


This sounds like it will lead to annoying arbitrary limits.

You'll be just as angry when you find out some application only lets you send it messages up to 1 MB. What a stupid restriction, you'll think!


If you use 4 bytes for size, which is fairly convenient, you get 16 megabytes before any 1s show up in the most significant byte.

If you expect that it might be an accidental ASCII character, the smallest one you are likely to accidentally receive is space. 32 x 16MB = 512MB.

If that's still not a large enough atomic message to satisfy you, allocate 8 bytes for the size. If sneaky ASCII creeps into the most significant byte, the smallest value you'll see is 0x20 shifted up 56 bits, i.e. 2^61 bytes, or roughly 2 exabytes.

Plus, this is just the first layer of defense. It's an easy and cheap one that keeps you from overallocating, but it's not the end of the story. You still need to examine the rest of the message for validity, and throw it out as soon as possible if you find it's invalid.

And, finally, if you are taking the "allocate a buffer to hold the message" approach, and you might be receiving multiple messages at a time, you have to consider the possibility of receiving multiple spurious messages at once. The only thing worse than a process spuriously allocating a gigabyte and trying to parse stuff into it is a process spuriously trying to do that a few thousand times per second.


I can just imagine a HN article 'Check out the crazy reason that this protocol is limited to 512MB!'


There's barely any difference between a limit of 512MB vs. a limit of 2GB/4GB. Almost nobody is going to split that hair. You either design for reasonably-small messages, or you design for 'unlimited' size and go with a 64 bit size field. Either way shaving a few bits off the top is harmless.

But you shouldn't be allocating 100MB+ upfront anyway. No matter how big of a message you allow.


This reminds me of the other article on HN recently. Using calloc makes it cheap because it's using the kernel's CoW. Of course you probably want to use that CoW functionality explicitly in this case.


Yeah, everybody goes "640K OUGHT TO BE ENOUGH FOR ANYBODY, LOL." But if you're designing (say) a custom in-house protocol for transmitting telemetry, then "what happens in 20 years when someone tries use it for bulk data transfer for some reason" really shouldn't be higher on your list of priorities than "what happens when it receives literally any data besides what it was expecting." Hypothetical future expandability is good to consider, but not as important as present-day intended use.


Which is more annoying? Being unable to send your 1G message, or the service being down because it's being DOSed by 1G messages?


A DoS attack could be performed using any length of message, although the effort required to kill the server by making it allocate all its memory would be much lower.


Nah. I don't think it's stupid at all that email is generally limited to 10MB, for example.


Set limit = 10M, log request sizes, when you start seeing requests with size >1M, consider raising limit.


I suppose you could start each message with a magic constant, followed by the length of the remainder, then the payload.

Any messages that don't start with the magic constant just get ignored.


Cool idea, BUT nobody uses SCTP (which is packet-based), UDP is not reliable (most applications dislike this), and TCP does not have "messages", so you actually have to send the length in some form... sort of like these guys did...


Unless you have a fixed sized header which includes the size.

That's all that is needed to avoid people having to pull up a debugger to figure out why things are going bad in production.


> Unless you have a fixed sized header which includes the size.

Isn't that actually part of the problem here? It's getting an erroneous size and trying to allocate a big buffer so it can read the data, even though the data isn't really that big.

One solution might be a fixed header size, and a header checksum. Allocate space for the header, read what should be the header, including the checksum, and if the checksum is correct, then allocate the space requested for the data. A fixed size header doesn't really help unless you are actually checking that it's valid before proceeding.
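A sketch of that idea, assuming a made-up 12-byte header of magic + length + a CRC-32 of the first 8 bytes (zlib's crc32() used purely for illustration; byte-order handling omitted):

  #include <stdint.h>
  #include <string.h>
  #include <zlib.h>                /* crc32() */

  #define PROTO_MAGIC 0xC0DEF00Du  /* made-up magic for this sketch */

  /* Header layout: magic (4) | length (4) | crc32 of the first 8 bytes (4). */
  int header_looks_sane(const uint8_t hdr[12], uint32_t *len_out) {
      uint32_t magic, len, crc;
      memcpy(&magic, hdr,     4);
      memcpy(&len,   hdr + 4, 4);
      memcpy(&crc,   hdr + 8, 4);

      if (magic != PROTO_MAGIC) return 0;          /* not our protocol */
      if (crc != crc32(0L, hdr, 8)) return 0;      /* garbled header */

      *len_out = len;              /* only now is it worth allocating */
      return 1;
  }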


The problem here was the memory leak, because it didn't free the enormous buffer in the failure path.


No, the original post was about a memory leak. This is really just a follow-on about finding weird values popping up in protocols.

Also, even if we do constrain this to memory leaks, I would say that leaks just make the problem worse (a really bad bug that causes a persistent DoS rather than just an ephemeral one), but it's still a problem without them.

Actually allocating the memory before confirming you need it is bad because if you get enough such requests quickly enough, you eat up all the memory. If you aren't freeing the memory, you just don't have to be as quick. Considering that the crazy value being passed to malloc here represents over 1 GB of memory, it wouldn't take many requests at all.


It could help in obvious cases, but it sounds a lot like "security by obscurity".


No, it's just a bit of defensive programming against silliness and misconfiguration. Not every errant connection is an attacker trying to pwn you; some are just honest mistakes. Picking up on wrong magic numbers and logging "missing magic from <ip>" would help immensely in debugging this issue. All the other advice about avoiding mallocing an unsanitised amount of memory still stands, but this would just make it all easier to figure out.


Right, it does nothing to protect from targeted DoS attacks. The obvious answer is that you usually want a fixed-size buffer per connection, and any message larger than that needs to be read and handled chunk by chunk.


I disagree. It's a well known method of sanity checking serialized data, and it's used by lots and lots of file formats and "over the wire" data formats.

It's not really a security feature, and it doesn't have to be a secret value. It also doesn't mean you shouldn't sanity check the other data fields. The sole purpose is to quickly rule out data that's blatantly incorrect.


Most database drivers have a binary protocol that does this. It's a pretty standard practice. It's not for security, it's for protocol negotiation.


Perhaps the best solution would be to avoid sanity checking at all, and to instead read into a dynamically-expanding buffer. If the server really does send down 1.2 GB, you'll be ready for it, but a malicious attacker trying to DoS you won't do damage by sending "1GB" followed by a connection close.
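A rough sketch of that approach: ignore any claimed length and just grow the buffer as bytes actually arrive (growth factor and starting size are arbitrary):

  #include <stdlib.h>
  #include <unistd.h>

  /* Read until EOF, growing the buffer only as real data shows up.
     A peer that promises 1.2 GB but sends nothing costs almost nothing. */
  char *read_all(int fd, size_t *out_len) {
      size_t cap = 4096, len = 0;
      char *buf = malloc(cap);
      if (!buf) return NULL;

      for (;;) {
          if (len == cap) {
              char *bigger = realloc(buf, cap *= 2);
              if (!bigger) { free(buf); return NULL; }
              buf = bigger;
          }
          ssize_t r = read(fd, buf + len, cap - len);
          if (r < 0) { free(buf); return NULL; }
          if (r == 0) break;                     /* peer closed: done */
          len += (size_t)r;
      }
      *out_len = len;
      return buf;
  }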


One dead simple way is to have a 2 or 4 byte header field that's always a known value. If it doesn't match then bail out immediately.


Fixed-length header + HMAC (or simply a CRC).

If the HMAC/CRC doesn't check out, don't process the packet.


CRC doesn't protect against malicious activity, and HMAC requires key distribution or exchange (way more overhead than a simple in-house protocol warrants). True, a CRC protects against accidental crap like this service regularly exploding, but in a case like this where the client can't talk to the server properly, letting the client explode is probably a good outcome.


Header and CRC


I mean, even in low-latency trading where error checking is viewed as a sign of weakness (I kid, mostly) at least people have the common sense to add sensible limits to message size (like MTU) when using length-prefixed protocols. Even in these situations where the I/O is across all-ASIC optical and electrical paths corrupt messages do happen.


Graphite's pickle format does this: http://graphite.readthedocs.io/en/latest/feeding-carbon.html

We ran into a similar issue when someone was trying to send line-protocol data to a pickle port.

Yes this format stinks.


If you search either of these numbers on google you see a ton of errors and people asking befuddled questions. We're literally doing a public service to future versions of ourselves by juicing the google results for this post. For once, it's totally appropriate to upvote for visibility. Upvotes to the left.


This comment should be the top-voted comment, so that next time I land on this story I'll remember why it had so many upvotes.


The lesson from this would be: when creating a network protocol, always start the stream or packet with a magic number, in both directions. If the magic number doesn't match, drop the packet or close the connection.

In fact, one could say that these are HTTP's magic numbers: 'HTTP/' for the response, and a few ('GET ', 'HEAD ', 'POST ', 'PUT ', and so on) for the request. IIRC, one trick web servers use to speed up parsing a request is to treat the first four bytes as an integer, and switch on its value to determine the HTTP method.
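Roughly like this, as a sketch (the helper and function names are invented; real servers build the 32-bit constants explicitly to avoid endianness surprises):

  #include <stdint.h>

  /* Pack the first four request bytes big-endian, as in the article:
     "GET " -> 0x47455420 = 1195725856, "HTTP" -> 0x48545450 = 1213486160. */
  static uint32_t first4(const char *p) {
      return ((uint32_t)(uint8_t)p[0] << 24) | ((uint32_t)(uint8_t)p[1] << 16) |
             ((uint32_t)(uint8_t)p[2] << 8)  |  (uint32_t)(uint8_t)p[3];
  }

  const char *guess_method(const char *req) {
      switch (first4(req)) {
      case 0x47455420: return "GET";    /* "GET " */
      case 0x504F5354: return "POST";   /* "POST" */
      case 0x48454144: return "HEAD";   /* "HEAD" */
      case 0x50555420: return "PUT";    /* "PUT " */
      default:         return "unknown or not HTTP";
      }
  }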


Not sure about this. "magic" in the sense of file type recognition has no cryptographic/security ambitions. On the other hand, when implementing a network protocol, I'd say one needs to be really careful and expect the unexpected at any point (not just in the "magic" phase). What does it buy you (in terms of robustness) to have a fakeable magic number at the start of the stream?


For completeness, we should add 542393671 (0x20544547) and 1347703880 (0x50545448), i.e., the little endian versions, to the list. Googling those numbers also turns up a lot of people with strange error messages (caused by deserializing "GET " or "HTTP" as a 32-bit integer).
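A quick sketch showing where both sets of numbers come from, i.e. the same four bytes read big- vs. little-endian:

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      const uint8_t get[4] = { 'G', 'E', 'T', ' ' };
      uint32_t be = ((uint32_t)get[0] << 24) | ((uint32_t)get[1] << 16) |
                    ((uint32_t)get[2] << 8)  |  (uint32_t)get[3];
      uint32_t le = ((uint32_t)get[3] << 24) | ((uint32_t)get[2] << 16) |
                    ((uint32_t)get[1] << 8)  |  (uint32_t)get[0];
      printf("big-endian:    %u\n", be);   /* 1195725856 */
      printf("little-endian: %u\n", le);   /* 542393671  */
      return 0;
  }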


I was personally expecting a piece about a baffling resurgence in the use of ICQ.


Previously, Go's TLS library would report "tls oversized record received with length 20527" when the remote address was not actually speaking TLS. The magic number is explained in https://github.com/golang/go/issues/11111 (20527 is 0x502F, the bytes "P/" from a plain "HTTP/1.x" response landing in the TLS record-length field). Even better, when you google that error, you get all docker-related issues. Poor Docker.


From the title I was expecting this to be some math post about strange factors or something. I was really disappointed. Is the youth of today really fascinated by merely translating ASCII strings into their numeric values?


The context is what makes this interesting, that the number showed up in an unexpected place.


Except it didn't. It's only unexpected if you're ignorant of what it means, in which case any number is unexpected. You could literally write this same inane blog post about dozens of different common phrases in binary, starting with POST, DELETE, etc.


Did you read it?


me too: more specifically something about https://en.wikipedia.org/wiki/Amicable_numbers



