For my favourite variation of this, see "Practical Linguistic Steganography using Contextual Synonym Substitution and a Novel Vertex Coding Method"[1].
Chang et al. use synonym substitution to encode hidden data in ordinary text, so the output remains perfectly readable and sensible. This is far less suspicious than someone keeping spam. Appendix B gives an example of how innocuous the output can be. The best part is that the scheme stays secure even when the attacker knows the system being used (Kerckhoffs's principle).
The approach in this paper looks much cooler. A major thing missing from OP's website is a way to add cover text (the equivalent of a "key" in encryption). However, I like the fact that the message comes out as spam, so it (hopefully) stays out of the recipient's inbox, and one would need to know what to look for in the spam folder. This is very cool because I'd never thought of the utility of making a message purposely look like spam.
A similar idea applies to executable code. Hydan is a tool that hides messages in x86 executables using code polymorphism: it takes an executable and produces one that is the same size and behaves exactly the same way, but additionally carries a hidden message.
This would be much more powerful with wet paper codes, I suspect. A problem with it is that the changes are all forced, so it likely has a huge impact on the statistics of the text.
I encourage anyone interested in this to read about Peter Wayner's Mimic Functions.[1]
From the abstract:
A mimic function changes a file A so it assumes the statistical
properties of another file B. That is, if p(t,A) is the
probability of some substring t occurring in A, then a mimic
function f, recodes A so that p(t,f(A)) approximates p(t,B) for
all strings t of length less than some n. This paper describes
the algorithm for computing mimic functions and compares the
algorithm with its functional inverse, Huffman coding. It also
provides a description of more robust mimic functions which can
be defined using context-free grammars.
Using mimic functions, one could mimic spam or any other text (or non-text, for that matter) corpus.
The two main challenges are deciding which statistical properties one wants to mimic (for an adversarial steganalyst's mind is not always readily available for perusal) and then actually mimicking them. In other words, it's easier said than done.
Dear Friend ; Thank-you for your interest in our publication
. If you no longer wish to receive our publications
simply reply with a Subject: of "REMOVE" and you will
immediately be removed from our club ! This mail is
being sent in compliance with Senate bill 1626 ; Title
3 , Section 308 . THIS IS NOT MULTI-LEVEL MARKETING
. Why work for somebody else when you can become rich
as few as 10 WEEKS ! Have you ever noticed more people
than ever are surfing the web plus nearly every commercial
on television has a .com on in it ! Well, now is your
chance to capitalize on this . We will help you process
your orders within seconds and deliver goods right
to the customer's doorstep ! You are guaranteed to
succeed because we take all the risk . But don't believe
us ! Prof Simpson who resides in Illinois tried us
and says "Now I'm rich, Rich, RICH" . This offer is
100% legal ! We BESEECH you - act now . Sign up a friend
and you'll get a discount of 20% . God Bless ! Dear
Friend , Especially for you - this amazing news ! We
will comply with all removal requests . This mail is
being sent in compliance with Senate bill 1618 ; Title
2 , Section 301 . This is not multi-level marketing
! Why work for somebody else when you can become rich
in 58 weeks ! Have you ever noticed people will do
almost anything to avoid mailing their bills plus most
everyone has a cellphone ! Well, now is your chance
to capitalize on this ! We will help you SELL MORE
and increase customer response by 170% ! You are guaranteed
to succeed because we take all the risk . But don't
believe us . Mr Jones of Georgia tried us and says
"Now I'm rich many more things are possible" ! This
offer is 100% legal ! So make yourself rich now by
ordering immediately ! Sign up a friend and you'll
get a discount of 60% . Best regards !
Dear Friend ; Especially for you - this red-hot announcement
. This is a one time mailing there is no need to request
removal if you won't want any more . This mail is being
sent in compliance with Senate bill 2216 , Title 9
; Section 303 ! THIS IS NOT A GET RICH SCHEME . Why
work for somebody else when you can become rich within
41 days ! Have you ever noticed more people than ever
are surfing the web & how many people you know are
on the Internet ! Well, now is your chance to capitalize
on this . We will help you SELL MORE and sell more
! The best thing about our system is that it is absolutely
risk free for you ! But don't believe us ! Mr Anderson
of South Carolina tried us and says "I was skeptical
but it worked for me" ! We are a BBB member in good
standing ! We IMPLORE you - act now ! Sign up a friend
and you get half off . Thanks ! Dear Cybercitizen ;
Your email address has been submitted to us indicating
your interest in our letter . If you no longer wish
to receive our publications simply reply with a Subject:
of "REMOVE" and you will immediately be removed from
our mailing list ! This mail is being sent in compliance
with Senate bill 1621 ; Title 4 , Section 302 ! This
is not a get rich scheme . Why work for somebody else
when you can become rich in 41 DAYS . Have you ever
noticed people love convenience plus most everyone
has a cellphone . Well, now is your chance to capitalize
on this . We will help you deliver goods right to the
customer's doorstep & turn your business into an E-BUSINESS
. The best thing about our system is that it is absolutely
risk free for you ! But don't believe us . Ms Anderson
of Hawaii tried us and says "I was skeptical but it
worked for me" ! We are licensed to operate in all
states ! We BESEECH you - act now ! Sign up a friend
and you get half off ! God Bless .
Dear Friend , Especially for you - this amazing announcement
. This is a one time mailing there is no need to request
removal if you won't want any more . This mail is being
sent in compliance with Senate bill 2116 ; Title 3
; Section 304 ! This is a ligitimate business proposal
! Why work for somebody else when you can become rich
within 31 weeks . Have you ever noticed nearly every
commercial on television has a .com on in it plus more
people than ever are surfing the web . Well, now is
your chance to capitalize on this . We will help you
increase customer response by 130% plus process your
orders within seconds . You can begin at absolutely
no cost to you . But don't believe us ! Mr Jones of
Alaska tried us and says "I was skeptical but it worked
for me" . We are licensed to operate in all states
! If not for you then for your LOVED ONES - act now
! Sign up a friend and your friend will be rich too
. Warmest regards ! Dear Cybercitizen , You made the
right decision when you signed up for our mailing list
! This is a one time mailing there is no need to request
removal if you won't want any more ! This mail is being
sent in compliance with Senate bill 2416 , Title 4
, Section 302 . THIS IS NOT A GET RICH SCHEME . Why
work for somebody else when you can become rich in
99 days . Have you ever noticed more people than ever
are surfing the web & how long the line-ups are at
bank machines . Well, now is your chance to capitalize
on this ! WE will help YOU increase customer response
by 150% and sell more . The best thing about our system
is that it is absolutely risk free for you ! But don't
believe us ! Ms Simpson who resides in Indiana tried
us and says "My only problem now is where to park all
my cars" . We are licensed to operate in all states
! Do not delay - order today ! Sign up a friend and
you'll get a discount of 50% . Thanks ! Dear Friend
; Your email address has been submitted to us indicating
your interest in our newsletter . If you no longer
wish to receive our publications simply reply with
a Subject: of "REMOVE" and you will immediately be
removed from our mailing list ! This mail is being
sent in compliance with Senate bill 2416 ; Title 3
; Section 302 ! This is not a get rich scheme ! Why
work for somebody else when you can become rich as
few as 58 DAYS ! Have you ever noticed more people
than ever are surfing the web & how many people you
know are on the Internet ! Well, now is your chance
to capitalize on this ! WE will help YOU deliver goods
right to the customer's doorstep and decrease perceived
waiting time by 140% . You can begin at absolutely
no cost to you ! But don't believe us . Ms Anderson
of Georgia tried us and says "Now I'm rich many more
things are possible" ! We are licensed to operate in
all states ! We IMPLORE you - act now ! Sign up a friend
and your friend will be rich too . Thanks .
This is an interesting concept but this implementation seems rather inefficient. It should be possible to exploit spacing, punctuation and case more effectively.
SpamMimic works by using a probabilistic context-free grammar to derive its output. Each production of the grammar is translated into a Huffman tree based on the probabilities assigned to each variable or terminal symbol in the production.
For example:
S -> A(.25) | B(.75)
A -> aS(1.0)
B -> bS(.75) | b(.25)
You simply feed the mimic function an encoded message (as a binary string) until you consume all the bits. Of course you can also pad the bit string so that it always terminates on a terminal symbol.
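Here is a minimal sketch of that idea in Python, using the three-production grammar above. It is not SpamMimic's actual code. The names (`huffman`, `encode`, `decode`, and so on) are made up for illustration, and three things are my own assumptions: a new sentence is started (separated by a space) whenever a derivation ends while bits remain, padding picks whichever alternative terminates soonest, and the decoder is told the message length in bits.

```python
import heapq

# Grammar from the comment above: nonterminal -> [(right-hand side, probability)].
GRAMMAR = {
    "S": [(("A",), 0.25), (("B",), 0.75)],
    "A": [(("a", "S"), 1.0)],
    "B": [(("b", "S"), 0.75), (("b",), 0.25)],
}
START = "S"


def huffman(alternatives):
    """Give each right-hand side a prefix-free codeword ('' if it is the only one)."""
    heap = [(p, i, {rhs: ""}) for i, (rhs, p) in enumerate(alternatives)]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p0, _, low = heapq.heappop(heap)
        p1, _, high = heapq.heappop(heap)
        merged = {rhs: "0" + code for rhs, code in low.items()}
        merged.update({rhs: "1" + code for rhs, code in high.items()})
        heapq.heappush(heap, (p0 + p1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def shortest_yields(grammar):
    """Fewest terminals each nonterminal can expand to (used only for padding)."""
    best = {nt: float("inf") for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs, _ in alternatives:
                length = sum(best.get(sym, 1) for sym in rhs)
                if length < best[nt]:
                    best[nt], changed = length, True
    return best


CODES = {nt: huffman(alts) for nt, alts in GRAMMAR.items()}
MIN_LEN = shortest_yields(GRAMMAR)


def encode(bits):
    """Turn a bit string into space-separated sentences of the grammar."""
    sentences, pos = [], 0
    while pos < len(bits):
        out, stack = [], [START]
        while stack:
            sym = stack.pop()
            if sym not in GRAMMAR:          # terminal: emit it
                out.append(sym)
                continue
            rest = bits[pos:]
            rhs = next((r for r, c in CODES[sym].items() if rest.startswith(c)), None)
            if rhs is None:                 # too few bits left: pad with the shortest ending
                rhs = min((r for r, c in CODES[sym].items() if c.startswith(rest)),
                          key=lambda r: sum(MIN_LEN.get(s, 1) for s in r))
                pos = len(bits)
            else:
                pos += len(CODES[sym][rhs])
            stack.extend(reversed(rhs))     # leftmost symbol is expanded first
        sentences.append("".join(out))
    return " ".join(sentences)


def parse(symbols, text):
    """Backtracking leftmost parse; returns the codewords used, or None."""
    if not symbols:
        return "" if not text else None
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:
        return parse(rest, text[len(head):]) if text.startswith(head) else None
    for rhs, code in CODES[head].items():
        tail = parse(list(rhs) + rest, text)
        if tail is not None:
            return code + tail
    return None


def decode(text, nbits):
    """Recover the first nbits bits; the bit length is assumed to be known."""
    bits = ""
    for sentence in text.split():
        parsed = parse([START], sentence)
        if parsed is None:
            raise ValueError("not a sentence of the grammar: " + sentence)
        bits += parsed
    return bits[:nbits]


print(encode("0110"))             # abab
print(decode(encode("0110"), 4))  # 0110
```

With a grammar this small every choice carries at most one bit, so the Huffman step is trivial. A grammar with many weighted alternatives per production would give the more likely alternatives shorter codewords.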
I wrote a program not too long ago that took some inspiration from SpamMimic and linguistic steganography in general. For fun I used the comments from this thread as input to my program:
So why not send it as spam? The key here is hiding in this approximately 95% of HN.
So why not send spammer--not a spy.
For my favourite variation seems rather inefficient. It should be possible output
can be used already, and you just look for in spam thousands of people with it. 2.
Also send lots of receiving person and a Novel Vertex Coding and identify the
system being used, it's used to encoding and identify the fake spam. Appendix
B gives an enemy (Kerckhoffs's principle).
[1]: https://github.com/rw/tweetfs
Plainsight uses each byte of the
Closest thing I've seen to text steganography [1]. It would probably be more efficient to encode your message in a picture, though: I tried to encode a 1,000-character message, and the encoded message ended up being 80,000 characters long (that's a long spam email). Still, I can see use cases where text only may be desirable.
Dear Professional , This letter was specially selected
to be sent to you . If you no longer wish to receive
our publications simply reply with a Subject: of "REMOVE"
and you will immediately be removed from our mailing
list . This mail is being sent in compliance with Senate
bill 2416 ; Title 4 , Section 302 . THIS IS NOT A GET
RICH SCHEME ! Why work for somebody else when you can
become rich as few as 69 days ! Have you ever noticed
people love convenience and more people than ever are
surfing the web ! Well, now is your chance to capitalize
on this ! We will help you process your orders within
seconds and SELL MORE ! You can begin at absolutely
no cost to you ! But don't believe us ! Prof Ames who
resides in North Carolina tried us and says "Now I'm
rich many more things are possible" ! This offer is
100% legal ! We BESEECH you - act now ! Sign up a friend
and you'll get a discount of 50% . Thanks .
Neat idea. I know it's not a serious proposal, but the problem with this sort of approach will be that the message will be identifiable from the fact that it hasn't actually been sent as spam to many people. So an attacker can identify a suspect message just by considering its distribution.
So why not send it as spam? The key here is hiding among other spam, and this can be done in multiple ways.
1. Encode fake spam and spam thousands of people with it.
2. Also send lots of real spam to your intended targets.
Their spam filter could then attempt decoding to tell the fake spam from the real spam, and you just look like a big nasty old spammer.
The thing I like about this approach is that, for all we know, it's already in use, and some of the junk emails we've all received, maybe even a classic, could actually have carried a hidden message from some spy agency!
Depends on how you define "harm", I suppose. Since it's not real spam, there's no real scam company on the other side of the message waiting to grab your cash; the "damage" is a few hundred kilobytes of text ending up in the spam can along with all the other legitimate spam.
Pardon the essay, but I've written a tool in this space before.
Back in 2011, I wrote a textual steganography library and command-line application, called Plainsight: https://github.com/rw/plainsight
Additionally, @workmajj and I wrote TweetFS using Plainsight. It lets you recursively pack up directories and post them as an encoded linked list of Tweets to Twitter: https://github.com/rw/tweetfs
Plainsight uses each byte of the input message to generate tokens. Bits are used to decide how to traverse the token tree, weighted by frequency. The drawbacks are 1) verbosity and 2) incorrect grammar.
One of the lessons of writing Plainsight is that spam can be used to contain secret messages. Send enough gibberish to enough people, with your intended recipient included, and you'll look like a spammer--not a spy.
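To make the traversal idea concrete, here is a rough Python sketch of a bit-driven walk over a bigram table. This is not Plainsight's actual code or data structure. It is a simplified stand-in: at each step only the most frequent 2^k followers of the current token are kept, and k message bits pick among them. The function names, the toy corpus, and the assumption that the decoder knows the bit length are all mine.

```python
from collections import Counter, defaultdict


def build_model(corpus):
    """Map each token to its followers, most frequent first (corpus wraps around)."""
    tokens = corpus.split()
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:] + tokens[:1]):
        followers[prev][nxt] += 1
    model = {
        tok: [w for w, _ in sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))]
        for tok, cnt in followers.items()
    }
    return model, tokens[0]


def usable(options):
    """Keep the top 2**k followers so that exactly k bits select one of them."""
    k = len(options).bit_length() - 1
    return k, options[: 1 << k]


def encipher(bits, model, start):
    """Walk the model, spending message bits at every branching token.

    Assumes at least one token in the corpus has two or more distinct
    followers; otherwise no bits are ever consumed.
    """
    out, ctx, pos = [], start, 0
    while pos < len(bits):
        k, choices = usable(model[ctx])
        chunk = bits[pos:pos + k].ljust(k, "0")   # zero-pad the final chunk
        ctx = choices[int(chunk, 2)] if k else choices[0]
        out.append(ctx)
        pos += k
    return " ".join(out)


def decipher(text, model, start, nbits):
    """Replay the walk and read back which follower was chosen each time."""
    bits, ctx = "", start
    for tok in text.split():
        k, choices = usable(model[ctx])
        if k:
            bits += format(choices.index(tok), "0{}b".format(k))
        ctx = tok
    return bits[:nbits]


CORPUS = "the cat sat on the mat and the dog sat on the cat and the cat saw the dog"
model, start = build_model(CORPUS)
hidden = encipher("1011", model, start)
print(hidden)
print(decipher(hidden, model, start, 4))
```

Even this toy version shows both drawbacks: tokens with a single follower carry no bits at all (verbosity), and nothing stops the walk from producing ungrammatical text.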
-- Example 1 (regular text)
Type your message to encode:
echo 'Meet at Union Square at noon. The password is FuriousGreen.' > cleartext
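Encipher it and view the ciphertext:
cat cleartext | plainsight -m encipher -f sherlock.txt > ciphertext
cat ciphertext | fold -s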
which was the case, of a light. And, his hand. "BALLARAT." only applicant?"
decline be walking we do, the point of the little man in a strange, her
husband's hand, going said road, path but you do know what I have heard of you,
I found myself to get away from home and for the ventilator little cold night,
and I he had left my friend Sherlock of our visitor and he had an idea was not
to abuse step I of you, I knew what I was then the first signs it is the
daughter, at least a fellow-countryman. had come. as I have already explained,
the garden. what you can see a of importance. your hair. a picture upon of the
money which had brought a you have a little good deal in way: out to my wife
and hurry." made your hair. a charge me a series events, and excuse no sign his
note-book has come away and in my old Sherlock was already down to do with the
twisted
Now, decipher that ciphertext:
cat ciphertext | plainsight -m decipher -f sherlock.txt > deciphered
cat deciphered
Meet at Union Square at noon. The password is FuriousGreen.
-- Example 2 (binary data)
$ dd if=/dev/urandom of=/dev/stdout bs=1 count=10 | plainsight -m encipher -f 1984.txt
10+0 records in
10+0 records out
10 bytes (10 B) copied, 9e-05 s, 111 kB/s
Adding models:
Model: 1984.txt added in 0.89s (context == 2)
input is "<stdin>", output is "<stdout>"
enciphering: 100%|#####################################################################################################################################################################|474.67 B/s | Time: 0:00:00
which is a war is real, the proles used mind on the telescreen. He could see through all right to. You have read what said. 'Yes,' only in the Ministry
----
One serious use case is to seed the generator with a spam email corpus. This lets you generate messages that look like spam. Example:
wget https://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2
tar -jxvf 20030228_spam.tar.bz2
cat spam/0* > spam-corpus.txt
echo "The Magic Words are Squeamish Ossifrage" | plainsight -m encipher -f spam-corpus.txt > spam_ciphertext
$ cat spam_ciphertext
(8.11.6/8.11.6) 3 (Normal) Internet can send e-mails until to transfer 26 10 [127.0.0.1] also include address from the most logical, mail business for your Car have a many our portals ESMTP Thu, 29 1.0 this letter on internet, <a style=3D"color: 0px; text/plain; cellspacing=3D"0" how quoted-printable about receiving you would like width=3D"15%" width=3D"15%" border="0" width="511" Date: Tue, 27 Thu, 19 26 because zzzz@localhost.spamassassin.taint.org for
$ cat spam_ciphertext | plainsight -m decipher -f spam-corpus.txt
Adding models:
Model: spam-corpus.txt added in 2.57s (context == 2)
input is "<stdin>", output is "<stdout>"
deciphering: 100%|#####################################################################################################################################################################|543.84 B/s | Time: 0:00:00
The Magic Words are Squeamish Ossifrage
This (awesome work) makes me wonder if people are using systems like this in the wild. Because I've gotten plenty of spam email, and stumbled across websites (thanks, Google!) that read just like this.
I kinda-sorta wrote one of these about 15 years ago. In VB 6!
It simply took your data stream and encoded the message in the first letter of each word in some generated gibberish. You transform the arbitrary 8-bit byte stream into a base-26 representation (one letter per digit) to give you your list of first letters.
The gibberish was generated by choosing randomly from a list of common structures. That last sentence would have been encoded as [a,N,V,G,P,V,Av,P,a,N,P,Aj,N] - article, noun, verb ... Each word category (articles are skipped) had a dictionary containing one or more of each type of word starting with each letter.
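A small Python sketch of the first-letter trick, for the curious. It is a reconstruction of the idea only, not the original VB 6 program: it skips the sentence templates and word categories and uses a made-up one-word-per-letter dictionary. The leading `0x01` sentinel byte is my own addition so that leading zero bytes survive the base-26 conversion.

```python
import string

ALPHABET = string.ascii_lowercase

# Hypothetical dictionary: one cover word per first letter.
WORDS = dict(zip(ALPHABET, (
    "apple bring cloud dance eagle fresh grape house ivory jolly kneel lemon mango "
    "north ocean piano quiet river stone table under vivid water xenon young zebra"
).split()))


def to_letters(data):
    """Rewrite bytes as base-26 digits, one letter per digit."""
    n = int.from_bytes(b"\x01" + data, "big")   # sentinel keeps leading zero bytes
    digits = []
    while n:
        n, d = divmod(n, 26)
        digits.append(ALPHABET[d])
    return "".join(reversed(digits))


def from_letters(letters):
    """Inverse of to_letters: base-26 digits back to bytes."""
    n = 0
    for ch in letters:
        n = n * 26 + ALPHABET.index(ch)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]                              # drop the sentinel byte


def hide(data):
    """Emit one cover word per base-26 digit; only first letters matter."""
    return " ".join(WORDS[ch] for ch in to_letters(data))


def reveal(text):
    """Read the first letter of every word and convert back to bytes."""
    return from_letters("".join(word[0] for word in text.split()))


cover = hide(b"hi")
print(cover)
print(reveal(cover))
```

Each word carries log2(26), roughly 4.7 bits, so the output needs a little under two words per byte of input.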
Wasn't quite as convincing as the fake spam! I was rather pleased with it though, and it was far more interesting than the work I was supposed to be doing, as is writing this post. Back to work....
You just reminded me -- I once wrote a script that decoded and then eval'd a hidden command encoded within the whitespace of the script file itself. My goal was to create an entirely benign looking script that would hold up to visual scrutiny, but still be possibly malicious - in the final variant, it downloaded an additional remote script. Not that I ever used it for anything, but it did temper my natural trust in cursory inspection of benign-seeming open source code.
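The whitespace channel itself is easy to sketch in Python. This is only an illustration of hiding bytes in trailing whitespace, not the script described above, and it deliberately does nothing with the recovered bytes beyond returning them. The one-byte-per-line layout (eight trailing characters, space for 0 and tab for 1) and the function names are assumptions of mine.

```python
def hide(cover, secret):
    """Append 8 trailing spaces/tabs (one secret byte) to each leading line."""
    lines = cover.splitlines()
    if len(secret) > len(lines):
        raise ValueError("cover text needs at least one line per secret byte")
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip(" \t")               # remove any existing trailing whitespace
        if i < len(secret):
            bits = format(secret[i], "08b")
            line += "".join("\t" if b == "1" else " " for b in bits)
        out.append(line)
    return "\n".join(out)


def reveal(stego):
    """Collect one byte from every line that ends in exactly 8 spaces/tabs."""
    data = bytearray()
    for line in stego.splitlines():
        tail = line[len(line.rstrip(" \t")):]
        if len(tail) == 8:
            data.append(int("".join("1" if c == "\t" else "0" for c in tail), 2))
    return bytes(data)


COVER = "#!/bin/sh\n# tidy up temp files\nset -e\nrm -f /tmp/example.*\necho done"
stego = hide(COVER, b"hello")
print(reveal(stego))
```

The visible text is unchanged, which is exactly why it survives a cursory read. Any editor or linter that highlights or strips trailing whitespace exposes or destroys it.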
[1]: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00176