SpamMimic: encode your message into something innocent-looking

Smerity · on Dec 9, 2014

For my favourite variation of this, see "Practical Linguistic Steganography using Contextual Synonym Substitution and a Novel Vertex Coding Method"[1].

Chang et al. use synonym substitution to encode hidden data into standard text, resulting in perfectly readable and sensible output. This is far less suspicious than someone keeping spam. Appendix B gives an example of how innocuous the output can be. The best part is that the scheme stays secure even when the attacker knows which system is being used (Kerckhoffs's principle).

[1]: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00176

sytelus · on Dec 9, 2014

The approach in this paper looks much cooler. A major thing missing from OP's website is a way to add cover text (the equivalent of a "key" for encryption). However, I like the fact that the message comes out as spam, so it (hopefully) stays out of the recipient's inbox and one would need to know what to look for in the spam folder. This is very cool because I'd never thought of the utility of making a message purposely look like spam.

praptak · on Dec 9, 2014

A similar idea applies to executable code. Hydan is a tool that hides messages in x86 executables by using code polymorphism: it takes an executable and produces one that has the same size and behaves exactly the same way, but additionally carries a hidden message.

nullc · on Dec 9, 2014

This would be much more powerful with wet paper codes, I suspect. A problem with it is that the changes are all forced, so it likely has a huge impact on the statistics of the text.

pmoriarty · on Dec 9, 2014

I encourage anyone interested in this to read about Peter Wayner's Mimic Functions.[1]

From the abstract:

  A mimic function changes a file A so it assumes the statistical
  properties of another file B. That is, if p(t,A) is the
  probability of some substring t occurring in A, then a mimic
  function f, recodes A so that p(t,f(A)) approximates p(t,B) for
  all strings t of length less than some n. This paper describes
  the algorithm for computing mimic functions and compares the
  algorithm with its functional inverse, Huffman coding. It also
  provides a description of more robust mimic functions which can
  be defined using context-free grammars.

Using mimic functions, one could mimic spam or any other text (or non-text, for that matter) corpus.
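To make the "functional inverse of Huffman coding" remark concrete, here is a minimal Python sketch of the simplest case only: single-character statistics, no context-free grammar. It is an illustration under my own assumptions, not Wayner's implementation, and the names `huffman_table`, `mimic` and `unmimic` are made up for this example. The trick is to push arbitrary bits through a Huffman decoder built from a sample of B, so the output picks up B's letter frequencies; recovering the bits is then ordinary Huffman encoding.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_table(sample):
    """Return {character: prefix-free code} built from the sample's frequencies."""
    freqs = Counter(sample)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct characters in the sample")
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(n, next(tiebreak), {ch: ""}) for ch, n in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c0.items()}
        merged.update({ch: "1" + code for ch, code in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))
    return heap[0][2]


def mimic(bits, table):
    """Huffman-decode an arbitrary bit string into characters of the sample."""
    by_code = {code: ch for ch, code in table.items()}
    out, cur, i = [], "", 0
    while i < len(bits) or cur:
        # Once the input runs out, pad the last codeword with zeros.
        cur += bits[i] if i < len(bits) else "0"
        i += 1
        if cur in by_code:
            out.append(by_code[cur])
            cur = ""
    return "".join(out)


def unmimic(text, table):
    """Inverse direction: plain Huffman encoding of the mimicked text."""
    return "".join(table[ch] for ch in text)
```

The recovered bit string can carry a few trailing zero bits of padding, so a real message would need a length prefix or terminator. Mimicking longer substrings (the "length less than some n" part of the abstract) would need one table per context, which this sketch leaves out.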

The two main challenges are deciding which statistical properties one wants to mimic (for an adversarial steganalyst's mind is not always readily available for perusal) and then actually mimicking them. In other words, it's easier said than done.

[1] - http://www.nic.funet.fi/pub/crypt/old/mimic/mimic.text

anon4 · on Dec 9, 2014

  Dear Friend ; Thank-you for your interest in our publication 
  . If you no longer wish to receive our publications 
  simply reply with a Subject: of "REMOVE" and you will 
  immediately be removed from our club ! This mail is 
  being sent in compliance with Senate bill 1626 ; Title 
  3 , Section 308 . THIS IS NOT MULTI-LEVEL MARKETING 
  . Why work for somebody else when you can become rich 
  as few as 10 WEEKS ! Have you ever noticed more people 
  than ever are surfing the web plus nearly every commercial 
  on television has a .com on in it ! Well, now is your 
  chance to capitalize on this . We will help you process 
  your orders within seconds and deliver goods right 
  to the customer's doorstep ! You are guaranteed to 
  succeed because we take all the risk . But don't believe 
  us ! Prof Simpson who resides in Illinois tried us 
  and says "Now I'm rich, Rich, RICH" . This offer is 
  100% legal ! We BESEECH you - act now . Sign up a friend 
  and you'll get a discount of 20% . God Bless ! Dear 
  Friend , Especially for you - this amazing news ! We 
  will comply with all removal requests . This mail is 
  being sent in compliance with Senate bill 1618 ; Title 
  2 , Section 301 . This is not multi-level marketing 
  ! Why work for somebody else when you can become rich 
  in 58 weeks ! Have you ever noticed people will do 
  almost anything to avoid mailing their bills plus most 
  everyone has a cellphone ! Well, now is your chance 
  to capitalize on this ! We will help you SELL MORE 
  and increase customer response by 170% ! You are guaranteed 
  to succeed because we take all the risk . But don't 
  believe us . Mr Jones of Georgia tried us and says 
  "Now I'm rich many more things are possible" ! This 
  offer is 100% legal ! So make yourself rich now by 
  ordering immediately ! Sign up a friend and you'll 
  get a discount of 60% . Best regards !

wofo · on Dec 9, 2014

Dear Friend ; Especially for you - this red-hot announcement . This is a one time mailing there is no need to request removal if you won't want any more . This mail is being sent in compliance with Senate bill 2216 , Title 9 ; Section 303 ! THIS IS NOT A GET RICH SCHEME . Why work for somebody else when you can become rich within 41 days ! Have you ever noticed more people than ever are surfing the web & how many people you know are on the Internet ! Well, now is your chance to capitalize on this . We will help you SELL MORE and sell more ! The best thing about our system is that it is absolutely risk free for you ! But don't believe us ! Mr Anderson of South Carolina tried us and says "I was skeptical but it worked for me" ! We are a BBB member in good standing ! We IMPLORE you - act now ! Sign up a friend and you get half off . Thanks ! Dear Cybercitizen ; Your email address has been submitted to us indicating your interest in our letter . If you no longer wish to receive our publications simply reply with a Subject: of "REMOVE" and you will immediately be removed from our mailing list ! This mail is being sent in compliance with Senate bill 1621 ; Title 4 , Section 302 ! This is not a get rich scheme . Why work for somebody else when you can become rich in 41 DAYS . Have you ever noticed people love convenience plus most everyone has a cellphone . Well, now is your chance to capitalize on this . We will help you deliver goods right to the customer's doorstep & turn your business into an E-BUSINESS . The best thing about our system is that it is absolutely risk free for you ! But don't believe us . Ms Anderson of Hawaii tried us and says "I was skeptical but it worked for me" ! We are licensed to operate in all states ! We BESEECH you - act now ! Sign up a friend and you get half off ! God Bless .

hartator · on Dec 9, 2014

Dear Friend , Especially for you - this amazing announcement . This is a one time mailing there is no need to request removal if you won't want any more . This mail is being sent in compliance with Senate bill 2116 ; Title 3 ; Section 304 ! This is a ligitimate business proposal ! Why work for somebody else when you can become rich within 31 weeks . Have you ever noticed nearly every commercial on television has a .com on in it plus more people than ever are surfing the web . Well, now is your chance to capitalize on this . We will help you increase customer response by 130% plus process your orders within seconds . You can begin at absolutely no cost to you . But don't believe us ! Mr Jones of Alaska tried us and says "I was skeptical but it worked for me" . We are licensed to operate in all states ! If not for you then for your LOVED ONES - act now ! Sign up a friend and your friend will be rich too . Warmest regards ! Dear Cybercitizen , You made the right decision when you signed up for our mailing list ! This is a one time mailing there is no need to request removal if you won't want any more ! This mail is being sent in compliance with Senate bill 2416 , Title 4 , Section 302 . THIS IS NOT A GET RICH SCHEME . Why work for somebody else when you can become rich in 99 days . Have you ever noticed more people than ever are surfing the web & how long the line-ups are at bank machines . Well, now is your chance to capitalize on this ! WE will help YOU increase customer response by 150% and sell more . The best thing about our system is that it is absolutely risk free for you ! But don't believe us ! Ms Simpson who resides in Indiana tried us and says "My only problem now is where to park all my cars" . We are licensed to operate in all states ! Do not delay - order today ! Sign up a friend and you'll get a discount of 50% . Thanks ! Dear Friend ; Your email address has been submitted to us indicating your interest in our newsletter . 
If you no longer wish to receive our publications simply reply with a Subject: of "REMOVE" and you will immediately be removed from our mailing list ! This mail is being sent in compliance with Senate bill 2416 ; Title 3 ; Section 302 ! This is not a get rich scheme ! Why work for somebody else when you can become rich as few as 58 DAYS ! Have you ever noticed more people than ever are surfing the web & how many people you know are on the Internet ! Well, now is your chance to capitalize on this ! WE will help YOU deliver goods right to the customer's doorstep and decrease perceived waiting time by 140% . You can begin at absolutely no cost to you ! But don't believe us . Ms Anderson of Georgia tried us and says "Now I'm rich many more things are possible" ! We are licensed to operate in all states ! We IMPLORE you - act now ! Sign up a friend and your friend will be rich too . Thanks .

yoha · on Dec 9, 2014

This is an interesting concept but this implementation seems rather inefficient. It should be possible to exploit spacing, punctuation and case more effectively.

Related web comic with a similar idea: http://cube-drone.com/2013_06_05-Cube_Drone_37_The_Often_Ins...

mobiuscog · on Dec 9, 2014

That comic describes approximately 95% of HN.

tiler · on Dec 9, 2014

SpamMimic works by using a context free probabilistic grammar to derive its output. Each production of the grammar is translated into a Huffman tree based on the probabilities assigned to each variable or terminal symbol in the production.

For example:

  S -> A(.25) | B(.75)
  A -> aS(1.0)
  B -> bS(.75) | b(.25)

You simply feed the mimic function an encoded message (as a binary string) until you consume all the bits. Of course you can also pad the bit string so that it always terminates on a terminal symbol.
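Here is a rough Python sketch of that idea using the toy grammar above. It is not SpamMimic's code; the helper names and the padding rule (once the bits run out, take the alternative that finishes the derivation soonest) are my own assumptions. Each derivation that ends while bits remain starts a new space-separated sentence, and the decoder is a small backtracking parser that reads the Huffman codes back off the parse.

```python
import heapq
from itertools import count

# Toy grammar from above: nonterminal -> [(symbols, probability)].
GRAMMAR = {
    "S": [(("A",), 0.25), (("B",), 0.75)],
    "A": [(("a", "S"), 1.0)],
    "B": [(("b", "S"), 0.75), (("b",), 0.25)],
}


def huffman_codes(alternatives):
    """Map each alternative index to a prefix-free bit string."""
    if len(alternatives) == 1:
        return {0: ""}  # a forced choice carries no information
    tiebreak = count()
    heap = [(p, next(tiebreak), {i: ""}) for i, (_, p) in enumerate(alternatives)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {i: "0" + c for i, c in c0.items()}
        merged.update({i: "1" + c for i, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


CODES = {nt: huffman_codes(alts) for nt, alts in GRAMMAR.items()}

# Minimum derivation height per nonterminal, used only for safe padding.
HEIGHT = {nt: float("inf") for nt in GRAMMAR}


def alt_height(symbols):
    return 1 + max((HEIGHT.get(s, 0) for s in symbols), default=0)


for _ in GRAMMAR:  # enough passes for the heights to settle
    for nt, alts in GRAMMAR.items():
        HEIGHT[nt] = min(alt_height(syms) for syms, _ in alts)


def choose(sym, rest):
    """Pick an alternative for sym; return (index, number of bits consumed)."""
    codes = CODES[sym]
    for i, code in codes.items():
        if (code and rest.startswith(code)) or len(codes) == 1:
            return i, len(code)
    # Out of bits (or only part of a code left): pad with the compatible
    # alternative that terminates soonest.
    ok = [i for i, code in codes.items() if code.startswith(rest)]
    return min(ok, key=lambda i: alt_height(GRAMMAR[sym][i][0])), len(rest)


def encode(bits):
    sentences, pos = [], 0
    while pos < len(bits) or not sentences:
        out, stack = [], ["S"]
        while stack:  # leftmost derivation
            sym = stack.pop()
            if sym not in GRAMMAR:
                out.append(sym)
                continue
            i, used = choose(sym, bits[pos:])
            pos += used
            stack.extend(reversed(GRAMMAR[sym][i][0]))
        sentences.append("".join(out))
    return " ".join(sentences)


def parse(symbols, text):
    """Yield (bits, remaining text) for each way symbols match a prefix of text."""
    if not symbols:
        yield "", text
        return
    head, tail = symbols[0], symbols[1:]
    if head not in GRAMMAR:
        if text.startswith(head):
            yield from parse(tail, text[len(head):])
        return
    for i, (alt, _) in enumerate(GRAMMAR[head]):
        for b1, rest1 in parse(alt, text):
            for b2, rest2 in parse(tail, rest1):
                yield CODES[head][i] + b1 + b2, rest2


def decode(text):
    return "".join(
        next(b for b, rest in parse(("S",), sentence) if rest == "")
        for sentence in text.split()
    )
```

Because of the padding, `decode(encode(bits))` returns the original bits followed by a few extra ones, so the payload needs its own length field. With only two alternatives per rule every choice costs exactly one bit; the probabilities start to matter once a rule has three or more alternatives.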

I wrote a program not too long ago that took some inspiration from SpamMimic and linguistic steganography in general. For fun I used the comments from this thread as input to my program:

  So why not send it as spam? The key here is hiding in this approximately 95% of HN.

  So why not send spammer--not a spy.

  For my favourite variation seems rather inefficient. It should be possible output 
  can be used already, and you just look for in spam thousands of people with it. 2. 
  Also send lots of receiving person and a Novel Vertex Coding and identify the 
  system being used, it's used to encoding and identify the fake spam. Appendix 
  B gives an enemy (Kerckhoffs's principle).

  [1]: https://github.com/rw/tweetfs

  Plainsight uses each byte of the

The encoded message is: 'meet at 3'

carmaa · on Dec 9, 2014

This is pretty cool.

Closest thing I've seen to text steganography [1]. It's probably more efficient to encode your message in a picture, though (I tried to encode a 1,000-character message, and the encoded message ended up being 80,000 characters long, which is a long spam email), although I can see use cases where text-only may be desirable.

[1] http://en.wikipedia.org/wiki/Steganography

Xoxox · on Dec 9, 2014

My encode for the word "combinator"

Dear Professional , This letter was specially selected to be sent to you . If you no longer wish to receive our publications simply reply with a Subject: of "REMOVE" and you will immediately be removed from our mailing list . This mail is being sent in compliance with Senate bill 2416 ; Title 4 , Section 302 . THIS IS NOT A GET RICH SCHEME ! Why work for somebody else when you can become rich as few as 69 days ! Have you ever noticed people love convenience and more people than ever are surfing the web ! Well, now is your chance to capitalize on this ! We will help you process your orders within seconds and SELL MORE ! You can begin at absolutely no cost to you ! But don't believe us ! Prof Ames who resides in North Carolina tried us and says "Now I'm rich many more things are possible" ! This offer is 100% legal ! We BESEECH you - act now ! Sign up a friend and you'll get a discount of 50% . Thanks .

peterwaller · on Dec 9, 2014

Neat idea. I know it's not a serious proposal, but the problem with this sort of approach will be that the message will be identifiable from the fact that it hasn't actually been sent as spam to many people. So an attacker can identify a suspect message just by considering its distribution.

antihero · on Dec 9, 2014

So why not send it as spam? The key here is hiding in the this can be done in multiple ways.

1. Encode fake spam and spam thousands of people with it. 2. Also send lots of real spam to your intended targets.

Their spam filter could then attempt decoding and identify the fake spam from the real spam, and you just look like a big nasty old spammer.

The thing I like about this approach is that for all we know, it's used already, and some of those junk emails that we've got, you know, maybe even a classic, could have actually been messages from some spy agency that contained a message!

peterwaller · on Dec 9, 2014

If only then it were a solution which scaled and didn't have harmful effects..

Karunamon · on Dec 9, 2014

Depends on how you define "harm", I suppose. Considering it's not real spam and there's no real scam company on the other side of the message waiting to grab your cash, the "damage" is a few hundred kilobytes of text ending up in the spam can along with all the other legitimate spammers.

rw · on Dec 9, 2014

Previously: https://news.ycombinator.com/item?id=6427525

Pardon the essay, but I've written a tool in this space before.

Back in 2011, I wrote a textual steganography library and command-line application, called Plainsight: https://github.com/rw/plainsight

Additionally, @workmajj and I wrote TweetFS using Plainsight. It lets you recursively pack up directories and post them as an encoded linked list of Tweets to Twitter: https://github.com/rw/tweetfs

Plainsight uses each byte of the input message to generate tokens. Bits are used to decide how to traverse the token tree, weighted by frequency. The drawbacks are 1) verbosity and 2) incorrect grammar.
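For anyone who wants the gist without reading the repo, here is a stripped-down Python sketch of the bit-driven traversal. It is not Plainsight's actual code: it uses a one-word context instead of a longer one, trims each successor list to a power of two rather than weighting by frequency, and adds a 16-bit length header so padding can be dropped. The function names are invented for this example.

```python
from collections import Counter, defaultdict


def build_model(cover_text):
    """Return (start word, {word: successors ranked by frequency})."""
    words = cover_text.split()
    follow = defaultdict(Counter)
    # Wrap around so the last word also has a successor (no dead ends).
    for cur, nxt in zip(words, words[1:] + words[:1]):
        follow[cur][nxt] += 1
    model = {}
    for cur, counts in follow.items():
        ranked = sorted(counts, key=lambda w: (-counts[w], w))
        k = len(ranked).bit_length() - 1  # bits this context can carry
        model[cur] = ranked[: 2 ** k]
    return words[0], model


def encipher(message, cover_text):
    context, model = build_model(cover_text)
    # 16-bit length header, then the message bytes as bits.
    bits = format(len(message), "016b") + "".join(format(b, "08b") for b in message)
    out, pos, idle = [], 0, 0
    while pos < len(bits):
        choices = model[context]
        k = len(choices).bit_length() - 1
        chunk = bits[pos:pos + k].ljust(k, "0")  # zero-pad the final chunk
        context = choices[int(chunk, 2) if k else 0]
        out.append(context)
        pos += k
        idle = 0 if k else idle + 1
        if idle > len(model):  # stuck in a loop of forced words
            raise ValueError("cover text too repetitive to carry data")
    return " ".join(out)


def decipher(stego_text, cover_text):
    context, model = build_model(cover_text)
    bits = []
    for word in stego_text.split():
        choices = model[context]
        k = len(choices).bit_length() - 1
        if k:
            bits.append(format(choices.index(word), f"0{k}b"))
        context = word
    bits = "".join(bits)
    n = int(bits[:16], 2)
    return bytes(int(bits[16 + 8 * i:24 + 8 * i], 2) for i in range(n))
```

Both sides must build the model from exactly the same cover text, which is why the cover file effectively acts as a shared key. The verbosity drawback is visible here too: contexts with a single successor emit a word while carrying zero bits.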

One of the lessons of writing Plainsight is that spam can be used to contain secret messages. Send enough gibberish to enough people, with your intended recipient included, and you'll look like a spammer--not a spy.

-- Example 1 (regular text)

Type your message to encode:

   echo 'Meet at Union Square at noon. The password is FuriousGreen.' > cleartext

Then, pipe it through Plainsight:

   cat cleartext | plainsight -m encipher -f sherlock.txt > ciphertext

The output will be Doyle-esque gibberish:

   cat ciphertext | fold -s
   which was the case, of a light. And, his hand. "BALLARAT." only applicant?" 
   decline be walking we do, the point of the little man in a strange, her 
   husband's hand, going said road, path but you do know what I have heard of you, 
   I found myself to get away from home and for the ventilator little cold night, 
   and I he had left my friend Sherlock of our visitor and he had an idea was not 
   to abuse step I of you, I knew what I was then the first signs it is the 
   daughter, at least a fellow-countryman. had come. as I have already explained, 
   the garden. what you can see a of importance. your hair. a picture upon of the 
   money which had brought a you have a little good deal in way: out to my wife 
   and hurry." made your hair. a charge me a series events, and excuse no sign his 
   note-book has come away and in my old Sherlock was already down to do with the 
   twisted

Now, decipher that ciphertext:

   cat ciphertext | plainsight -m decipher -f sherlock.txt > deciphered
   cat deciphered
   Meet at Union Square at noon. The password is FuriousGreen.

-- Example 2 (binary data)

   $ dd if=/dev/urandom of=/dev/stdout bs=1 count=10 | plainsight -m encipher -f 1984.txt
   10+0 records in
   10+0 records out
   10 bytes (10 B) copied, 9e-05 s, 111 kB/s
   Adding models:
   Model: 1984.txt added in 0.89s (context == 2)
   input is "<stdin>", output is "<stdout>"

   enciphering: 100%|#####################################################################################################################################################################|474.67  B/s | Time: 0:00:00
   
   which is a war is real, the proles used mind on the telescreen. He could see through all right to. You have read what said. 'Yes,' only in the Ministry

----

One serious use case is to seed the generator with a spam email corpus. This lets you generate messages that look like spam. Example:

   wget https://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2
   tar -jxvf 20030228_spam.tar.bz2
   cat spam/0* > spam-corpus.txt

   echo "The Magic Words are Squeamish Ossifrage" | plainsight -m encipher -f spam-corpus.txt > spam_ciphertext
   
   $ cat spam_ciphertext
   (8.11.6/8.11.6) 3 (Normal) Internet can send e-mails until to transfer 26 10 [127.0.0.1] also include address from the most logical, mail business for your Car have a many our portals ESMTP Thu, 29 1.0 this letter on internet, <a style=3D"color: 0px; text/plain; cellspacing=3D"0" how quoted-printable about receiving you would like width=3D"15%" width=3D"15%" border="0" width="511" Date: Tue, 27 Thu, 19 26 because zzzz@localhost.spamassassin.taint.org for
   
   $ cat spam_ciphertext | plainsight -m decipher -f spam-corpus.txt
   Adding models:
   Model: spam-corpus.txt added in 2.57s (context == 2)
   input is "<stdin>", output is "<stdout>"
   
   deciphering: 100%|#####################################################################################################################################################################|543.84  B/s | Time: 0:00:00
   
   The Magic Words are Squeamish Ossifrage

leephillips · on Dec 9, 2014

This (awesome work) makes me wonder if people are using systems like this in the wild. Because I've gotten plenty of spam email, and stumbled across websites (thanks, Google!) that read just like this.

skygazer · on Dec 11, 2014

Hmm. That's intriguing. Spam could be modern day Numbers Stations, broadcast to our inboxes.

reitanqild · on Dec 9, 2014

Another explanation is someone having fun with search engines or attempting SEO.

Nursie · on Dec 9, 2014

I kinda-sorta wrote one of these about 15 years ago. In VB 6!

It simply took your data stream and encoded the message in the first letters of each word in some generated gibberish. You transform the arbitrary 8-bit byte stream into a base-26 representation (the letters a-z) to give you your list of first letters.

The gibberish was generated by choosing randomly from a list of common structures. That last sentence would have been encoded as [a,N,V,G,P,V,Av,P,a,N,P,Aj,N] - article, noun, verb ... Each word category (articles are skipped) had a dictionary containing one or more of each type of word starting with each letter.
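The original VB 6 code is long gone, so treat this as a loose Python reconstruction of just the first-letter part. It skips the sentence structures and word categories entirely, and the vocabulary and function names are placeholders made up for the example. The bytes are read as one big number and rewritten in base 26; a leading 0x01 sentinel byte keeps leading zero bytes from being lost.

```python
import random

# Placeholder vocabulary: two words for each initial letter a-z.
VOCAB = (
    "apple angry bakes blue cat calls dog digs eagle eats fish finds goat "
    "gives hat hides ink is jam jumps kite keeps lamp likes moon makes nest "
    "needs owl opens pig paints queen quits rain runs sun sees tree takes "
    "urn uses van visits wolf wants xylophone xeroxes yak yells zebra zips"
).split()

WORDS = {}
for word in VOCAB:
    WORDS.setdefault(word[0], []).append(word)


def to_letters(data):
    """Rewrite the bytes as base-26 digits, spelled with the letters a-z."""
    n = int.from_bytes(b"\x01" + data, "big")  # sentinel keeps leading zeros
    letters = []
    while n:
        n, digit = divmod(n, 26)
        letters.append(chr(ord("a") + digit))
    return "".join(reversed(letters))


def from_letters(letters):
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("a"))
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the sentinel byte


def encode(data, rng=random):
    """Emit one word per base-26 digit; any word with that initial will do."""
    return " ".join(rng.choice(WORDS[ch]) for ch in to_letters(data))


def decode(text):
    return from_letters("".join(word[0] for word in text.split()))
```

Since only the first letter matters, the word choice is free, which is the room a grammar template like the one described above would use to make the output read more like sentences.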

Wasn't quite as convincing as the fake spam! I was rather pleased with it though, and it was far more interesting than the work I was supposed to be doing, as is writing this post. Back to work....

skygazer · on Dec 11, 2014

You just reminded me -- I once wrote a script that decoded and then eval'd a hidden command encoded within the whitespace of the script file itself. My goal was to create an entirely benign looking script that would hold up to visual scrutiny, but still be possibly malicious - in the final variant, it downloaded an additional remote script. Not that I ever used it for anything, but it did temper my natural trust in cursory inspection of benign-seeming open source code.
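A harmless Python sketch of the encoding half, for the curious. It only hides and recovers bytes and deliberately does no eval; the scheme (eight trailing spaces or tabs per line, space for 0 and tab for 1) and the function names are assumptions for illustration, not a reconstruction of the original script.

```python
def hide(cover, secret):
    """Append one secret byte per line as eight trailing spaces/tabs."""
    lines = [line.rstrip() for line in cover.splitlines()]
    if len(secret) > len(lines):
        raise ValueError("cover text has too few lines")
    for i, byte in enumerate(secret):
        pattern = format(byte, "08b").replace("0", " ").replace("1", "\t")
        lines[i] += pattern
    return "\n".join(lines)


def reveal(stego):
    """Read back every line that ends in exactly eight spaces/tabs."""
    out = bytearray()
    for line in stego.splitlines():
        tail = line[len(line.rstrip(" \t")):]
        if len(tail) == 8:
            out.append(int(tail.replace(" ", "0").replace("\t", "1"), 2))
    return bytes(out)
```

Most editors and diff viewers render the result identically to the original, although anything that strips or highlights trailing whitespace destroys or exposes the payload.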

nemasu · on Dec 9, 2014

This is pretty cool, reminds me a bit of something I made a while back: https://github.com/nemasu/utf8encode

It encodes/decodes data into valid UTF-8 characters, please ignore the terrible interface & coding (was more proof of concept) -_-

Can use it to post more information using Twitter (albeit, completely human unreadable).

john2x · on Dec 9, 2014

Heh, the mailto link when encoding defaults to billg@microsoft.

stonewhite · on Dec 9, 2014

It is rather interesting that this site never mentions the word Steganography.

But also it is an intelligent way of implementing it, masking it as spam.

rjaco31 · on Dec 9, 2014

They actually do, in the Credits page.

hayksaakian · on Dec 9, 2014

Can this be layered on top of a PGP signed and encrypted email message?

I wouldn't really trust steganography by itself to secure anything.

Sure, it's a good diversion.

fit2rule · on Dec 9, 2014

Add a pinch of NLP and you can not only set your enemies on a course to hell, but re-invent religion while you're at it.

jmnicolas · on Dec 9, 2014

Interesting, but if the recipient's ISP classifies it as spam, the mail may never reach their mailbox.

wauter · on Dec 9, 2014

So, is there an explanation somewhere of what they do?