Oh, this is just the tip of the iceberg when it comes to abusing Unicode! You can use a similar technique to overflow the buffer on loads of systems that accept Unicode strings. Normally it just produces an error and/or a crash, but sometimes you get lucky and it'll do all sorts of fun things! :)
I remember doing penetration testing waaaaaay back in the day (before Python 3 existed) and using mere diacritics to turn a single character into many bytes that would then overflow the buffer of a back-end web server. This only ever caused it to crash (and usually auto-restart) but I could definitely see how this could be used to exploit certain systems/software with enough fiddling.
Yeah. Zalgo text is a common test for input fields on websites, but it usually doesn't do anything interesting. Maybe it triggers an exception on some database length limit. It typically doesn't even kill any processes; the exception is normally just in your thread. You can often trigger it just by disabling JS, even on modern forms, but at best you're maybe leaking a bit of info if they left debug on and print the stack trace or a query.
Another common slip-up is failing to count \n vs \r\n in text strings, since JS usually counts a line break as one character, but the HTTP spec requires the two-byte \r\n.
unescape(encodeURIComponent("ç")).length is the quick and dirty way to do a JS byte length check. The \r\n thing can be done just by cleaning up the string before length counting.
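For reference, a sketch of both (the TextEncoder route avoids the deprecated unescape; the function names here are mine):

    // Byte length of a JS string when encoded as UTF-8.
    // Old trick: unescape(encodeURIComponent(s)).length
    // Modern equivalent:
    function utf8ByteLength(s: string): number {
      return new TextEncoder().encode(s).length;
    }

    // Normalize bare \n to \r\n first if you need to match what an
    // HTTP body will contain on the wire (lone \r not handled here):
    function wireByteLength(s: string): number {
      return utf8ByteLength(s.replace(/\r?\n/g, "\r\n"));
    }

    console.log(utf8ByteLength("ç"));    // 2
    console.log(wireByteLength("a\nb")); // 4, i.e. "a\r\nb"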
A few months ago I made a post which I (should've) named "Unicode codepoints that expand or contract when case is changed in UTF-8". A decent parser shouldn't have any issues with things like this, but software that makes bad Unicode assumptions might.
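A couple of concrete cases you can check in any modern JS runtime (my illustrative picks, not necessarily the ones from that post):

    const bytes = (s: string) => new TextEncoder().encode(s).length;

    // U+FB01 LATIN SMALL LIGATURE FI: 3 bytes, uppercases to 2 ASCII letters
    console.log("ﬁ".toUpperCase(), bytes("ﬁ"), bytes("ﬁ".toUpperCase())); // FI 3 2

    // U+00DF SHARP S: one code point in, two out
    console.log("ß".toUpperCase()); // SS

    // U+0130 I WITH DOT ABOVE: 2 bytes, lowercases to i + U+0307 (3 bytes)
    console.log(bytes("İ"), bytes("İ".toLowerCase())); // 2 3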
This is cute but unnecessary - Unicode includes a massive range called PUA: the private use area. The codes in this range aren’t mapped to anything (and won’t be mapped to anything) and are for internal/custom use, not to be passed to external systems (for example, we use them in fish-shell to safely parse tokens into a string, turning an unescaped special character into just another Unicode code point in the string, but in the PUA area, then intercept that later in the pipeline).
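A minimal sketch of that sentinel pattern (a toy, not fish-shell's actual implementation):

    // During tokenization, swap an unescaped special character for a PUA
    // code point so the rest of the pipeline can treat the string as plain
    // text, then give it meaning again later.
    const GLOB_STAR = "\uE000"; // arbitrary pick from the U+E000-U+F8FF block

    function tokenize(input: string): string {
      return input
        .replace(/(?<!\\)\*/g, GLOB_STAR) // unescaped * becomes the sentinel
        .replace(/\\\*/g, "*");           // escaped \* becomes a literal *
    }

    function expandLater(token: string): string {
      // Downstream, the sentinel is unambiguous - no user input collides
      // with it, as long as PUA characters are rejected at the boundary.
      return token.replace(new RegExp(GLOB_STAR, "g"), "<glob>");
    }

    console.log(expandLater(tokenize("a*b \\*c"))); // "a<glob>b *c"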
You’re not supposed to expose them outside your API boundary, but when you encounter them you are prescribed to pass them through as-is, and that’s what most systems and libraries do. It’s a clear potential exfiltration avenue, but given that most sane developers don’t know much more about Unicode than “always use Unicode to avoid internationalization issues”, it’s often left wide open.
I just tested and private use characters render as boxes for me (), the point here was to encode them in a way that they are hidden and treated as "part of" another character when copy/pasting.
People immediately began discussing the applications for criminal use given the constraint that only emoji are accepted by the API. So for that use case the PUA wouldn't be an option, you have to encode it in the emoji.
Isn't this more what the designated noncharacters are for, rather than the private-use area? Given how the private-use area sometimes gets used for unofficial encodings of scripts not currently in Unicode (or for things like the Apple logo and such), I'd be worried about running into collisions if I used the PUA in such a way.
Note that the designated noncharacters include not only U+FFFE and U+FFFF, and not only the final two code points of every plane, but also a range in the middle of Arabic Presentation Forms-A (U+FDD0..U+FDEF) that was at some point added to the list of noncharacters specifically so that there would be more noncharacters for people using them this way!
I'll be h󠄾󠅟󠅠󠅕󠄜󠄐󠅞󠅟󠄐󠅣󠅕󠅓󠅢󠅕󠅤󠅣󠄐󠅘󠅕󠅢󠅕onest, I pasted this comment in the provided decoder thinking no one could miss the point this badly and there was probably a hidden message inside it, but either you really did or this website is stripping them.
You can't invisibly watermark an arbitrary character (I did it to one above! If this website isn't stripping them, try it out in the provided decoder and you'll see) with unrecognized PUA characters, because the renderer won't treat them as combining characters. You will cause separately rendered placeholder-box characters to appear. Like this one: (may not be a placeholder-box if you're privately-using the private use area yourself).
j󠄗󠅄󠅧󠅑󠅣󠄐󠅒󠅢󠅙󠅜󠅜󠅙󠅗󠄜󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅣󠅜󠅙󠅤󠅘󠅩󠄐󠅤󠅟󠅦󠅕󠅣󠄴󠅙󠅔󠄐󠅗󠅩󠅢󠅕󠄐󠅑󠅞󠅔󠄐󠅗󠅙󠅝󠅒󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅧󠅑󠅒󠅕󠄫󠄱󠅜󠅜󠄐󠅝󠅙󠅝󠅣󠅩󠄐󠅧󠅕󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠅒󠅟󠅢󠅟󠅗󠅟󠅦󠅕󠅣󠄜󠄱󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅝󠅟󠅝󠅕󠄐󠅢󠅑󠅤󠅘󠅣󠄐󠅟󠅥󠅤󠅗󠅢󠅑󠅒󠅕󠄞󠄒󠄲󠅕󠅧󠅑󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄜󠄐󠅝󠅩󠄐󠅣󠅟󠅞󠄑󠅄󠅘󠅕󠄐󠅚󠅑󠅧󠅣󠄐󠅤󠅘󠅑󠅤󠄐󠅒󠅙󠅤󠅕󠄜󠄐󠅤󠅘󠅕󠄐󠅓󠅜󠅑󠅧󠅣󠄐󠅤󠅘󠅑󠅤󠄐󠅓󠅑󠅤󠅓󠅘󠄑󠄲󠅕󠅧󠅑󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠄺󠅥󠅒󠅚󠅥󠅒󠄐󠅒󠅙󠅢󠅔󠄜󠄐󠅑󠅞󠅔󠄐󠅣󠅘󠅥󠅞󠅄󠅘󠅕󠄐󠅖󠅢󠅥󠅝󠅙󠅟󠅥󠅣󠄐󠄲󠅑󠅞󠅔󠅕󠅢󠅣󠅞󠅑󠅤󠅓󠅘󠄑󠄒󠄸󠅕󠄐󠅤󠅟󠅟󠅛󠄐󠅘󠅙󠅣󠄐󠅦󠅟󠅢󠅠󠅑󠅜󠄐󠅣󠅧󠅟󠅢󠅔󠄐󠅙󠅞󠄐󠅘󠅑󠅞󠅔󠄪󠄼󠅟󠅞󠅗󠄐󠅤󠅙󠅝󠅕󠄐󠅤󠅘󠅕󠄐󠅝󠅑󠅞󠅨󠅟󠅝󠅕󠄐󠅖󠅟󠅕󠄐󠅘󠅕󠄐󠅣󠅟󠅥󠅗󠅘󠅤󠇒󠅰󠆄󠅃󠅟󠄐󠅢󠅕󠅣󠅤󠅕󠅔󠄐󠅘󠅕󠄐󠅒󠅩󠄐󠅤󠅘󠅕󠄐󠅄󠅥󠅝󠅤󠅥󠅝󠄐󠅤󠅢󠅕󠅕󠄜󠄱󠅞󠅔󠄐󠅣󠅤󠅟󠅟󠅔󠄐󠅑󠅧󠅘󠅙󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅟󠅥󠅗󠅘󠅤󠄞󠄱󠅞󠅔󠄐󠅑󠅣󠄐󠅙󠅞󠄐󠅥󠅖󠅖󠅙󠅣󠅘󠄐󠅤󠅘󠅟󠅥󠅗󠅘󠅤󠄐󠅘󠅕󠄐󠅣󠅤󠅟󠅟󠅔󠄜󠅄󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄜󠄐󠅧󠅙󠅤󠅘󠄐󠅕󠅩󠅕󠅣󠄐󠅟󠅖󠄐󠅖󠅜󠅑󠅝󠅕󠄜󠄳󠅑󠅝󠅕󠄐󠅧󠅘󠅙󠅖󠅖󠅜󠅙󠅞󠅗󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠄐󠅤󠅘󠅕󠄐󠅤󠅥󠅜󠅗󠅕󠅩󠄐󠅧󠅟󠅟󠅔󠄜󠄱󠅞󠅔󠄐󠅒󠅥󠅢󠅒󠅜󠅕󠅔󠄐󠅑󠅣󠄐󠅙󠅤󠄐󠅓󠅑󠅝󠅕󠄑󠄿󠅞󠅕󠄜󠄐󠅤󠅧󠅟󠄑󠄐󠄿󠅞󠅕󠄜󠄐󠅤󠅧󠅟󠄑󠄐󠄱󠅞󠅔󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠅄󠅘󠅕󠄐󠅦󠅟󠅢󠅠󠅑󠅜󠄐󠅒󠅜󠅑󠅔󠅕󠄐󠅧󠅕󠅞󠅤󠄐󠅣󠅞󠅙󠅓󠅛󠅕󠅢󠄝󠅣󠅞󠅑󠅓󠅛󠄑󠄸󠅕󠄐󠅜󠅕󠅖󠅤󠄐󠅙󠅤󠄐󠅔󠅕󠅑󠅔󠄜󠄐󠅑󠅞󠅔󠄐󠅧󠅙󠅤󠅘󠄐󠅙󠅤󠅣󠄐󠅘󠅕󠅑󠅔󠄸󠅕󠄐󠅧󠅕󠅞󠅤󠄐󠅗󠅑󠅜󠅥󠅝󠅠󠅘󠅙󠅞󠅗󠄐󠅒󠅑󠅓󠅛󠄞󠄒󠄱󠅞󠅔󠄐󠅘󠅑󠅣󠅤󠄐󠅤󠅘󠅟󠅥󠄐󠅣󠅜󠅑󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄯󠄳󠅟󠅝󠅕󠄐󠅤󠅟󠄐󠅝󠅩󠄐󠅑󠅢󠅝󠅣󠄜󠄐󠅝󠅩󠄐󠅒󠅕󠅑󠅝󠅙󠅣󠅘󠄐󠅒󠅟󠅩󠄑󠄿󠄐󠅖󠅢󠅑󠅒󠅚󠅟󠅥󠅣󠄐󠅔󠅑󠅩󠄑󠄐󠄳󠅑󠅜󠅜󠅟󠅟󠅘󠄑󠄐󠄳󠅑󠅜󠅜󠅑󠅩󠄑󠄒󠄸󠅕󠄐󠅓󠅘󠅟󠅢󠅤󠅜󠅕󠅔󠄐󠅙󠅞󠄐󠅘󠅙󠅣󠄐󠅚󠅟󠅩󠄞󠄗󠅄󠅧󠅑󠅣󠄐󠅒󠅢󠅙󠅜󠅜󠅙󠅗󠄜󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅣󠅜󠅙󠅤󠅘󠅩󠄐󠅤󠅟󠅦󠅕󠅣󠄴󠅙󠅔󠄐󠅗󠅩󠅢󠅕󠄐󠅑󠅞󠅔󠄐󠅗󠅙󠅝󠅒󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅧󠅑󠅒󠅕󠄫󠄱󠅜󠅜󠄐󠅝󠅙󠅝󠅣󠅩󠄐󠅧󠅕󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠅒󠅟󠅢󠅟󠅗󠅟󠅦󠅕󠅣󠄜󠄱󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅝󠅟󠅝󠅕󠄐󠅢󠅑󠅤󠅘󠅣󠄐󠅟󠅥󠅤󠅗󠅢󠅑󠅒󠅕󠄞 is for Jabberwocky. Does this decode?
10 years or so ago I shocked coworkers by using U+202E RIGHT-TO-LEFT OVERRIDE in the middle of filenames on Windows. So funnypicturegnp.exe became funnypictureexe.png
Combined with a custom icon for the program that mimics a picture preview it was pretty convincing.
I worked in phishing detection. This was a common pattern used by attackers, although .exe attachments are blocked automatically most of the time; .html is the new malicious extension (often hosting an obfuscated window.location redirect to a fake login page).
RTL abuse like cute-cat-lmth.png was relatively common, but also trivial to detect. We would immediately flag such an email as phishing.
Basically it's possible to hide some code that looks like comments but compiles like code. I seem to recall the CVE status was disputed since many text editors already make these suspicious comments visible.
I’d never heard of this particular trick but I’m glad my decades of paranoia-fueled “right click -> open with” treatment of any potentially sketchy media file was warranted! :D
For a real-world use case: Sanity used this trick[0] to encode Content Source Maps[1] into the actual text served on a webpage when it is in "preview mode". This allows an editor to easily trace some piece of content back to a potentially deep content structure just by clicking on the text/content in question.
It has its drawbacks/limitations - e.g. you want to avoid adding it to things that need to be parsed/used verbatim, like dates/timestamps, URLs, "ids", etc. - but it's still a pretty fun trick.
I love the idea of using this for LLM output watermarking. It hits the sweet spot - it will catch 99% of slop generators with no fuss, since they only copy and paste anyway, with almost no impact on other core use cases.
I wonder how much you’d embed with each letter or token that’s output - userid, prompt ref, date, token number?
I also wonder how this is interpreted in a terminal. Really cool!
Why does anybody think AI watermarking will ever work? Of course it will never work, any watermarking can be instantly & easily stripped...
The only real AI protection is to require all human interaction to be signed by a key verified by IRL identity, and even then that will (a) never happen, and (b) be open to abuse by countries with corrupt governments and by countries whose governments are heavily influenced by private industry (like the US).
> any watermarking can be instantly & easily stripped...
I think it took a while before printer watermarking (yellow dots) was discovered. It certainly cannot be stripped. It was possibly developed in the mid-80s but not known to the public until the mid-2000s.
In most Linux terminals, what you pass in is just a sequence of bytes that gets passed through unmangled. And since this technique is UTF-8 compliant and doesn't use any extra glyphs, it is invisible to humans in Unicode-compliant terminals. I tried it on a few.
It shows up if you echo the sentence to, say, xxd ofc.
(unlike the PUA suggestion in the currently top voted comment which shows up immediately ofc)
Some additional test results and corrections:
While xxd shows the message passing through completely unmangled when pasting it into the terminal, selecting from the terminal was another story: I echoed the sentence, verified it unmangled in xxd, then selected and pasted the result of the echo, and it was truncated to a few words using X select in MATE Terminal and Konsole. I'm not sure where that truncation happens, whether it's the terminal or X.
In xterm, the final e was mangled, and the selection was even more truncated.
The sentence is written unmangled to files though, so I think it's more about copying out of the terminal dropping some data. Verified by echoing the sentence to a test file, opening it in a browser, and copying the text from there.
On MacOS, kitty shows an empty box, then an a for the "h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a" post below. I think this is fair and even appreciated. Mac Terminal shows "ha". That "h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a" (and this one!) can be copied and pasted into the decoder successfully.
There are other possible approaches to LLM watermarking that would be much more robust and harder to detect. They exploit the fact that LLMs work by producing a probability distribution that gives a probability for each possible next token. These are then sampled randomly to produce the output. To add fingerprints when generating, you could do some trickery in how you do that sampling that would then be detectable by re-running the LLM and observing its outputs. For example, you could alternate between selecting high-probability and low-probability tokens. (A real implementation of this would be much more sophisticated than that obviously, but hopefully you get the idea)
This is not a great method in a world with closed models and highly diverse open models and samplers. It's intellectually appealing, for sure! But it will always be at best a probabilistic method, and that's if you have the LLM weights at hand.
What makes it not a good method? Of course if a model's weights are publicly available, you can't compel anyone using it to add fingerprinting at the sampler stage or later. But I would be shocked if OpenAI was not doing something like this, since it would be so easy and couldn't hurt them, but could help them if they don't want to train on outputs they generated. (Although they could also record hashes of their outputs or something similar as well – I would be surprised if they don't.)
That's already happening - my kids have had papers unfairly blamed on ChatGPT by automated tools. Protect yourselves, kids: use an editor that can show letter-by-letter history.
Two people I worked with had this happen, and one of them is going to war over it, as it was enough to lower the kid's grade for college or something. Crazy times.
There are of course human writers who are less communicative than AI, called "shit writers", and humans who are less accurate than AI, called "liars".
The difference is humans are responsible for what they write, whereas the human user who used an AI to generate text is responsible for what the computer wrote.
It's worth noting, just as a curiosity, that screen readers can detect these variation selectors when I navigate by character. For example, if I arrow over the example he provided (I can't paste it here lol), I hear: "Smiling face with smiling eyes", "Symbol e zero one five five", "Symbol e zero one five c", "Symbol e zero one five c", "Symbol e zero one five f". This is unfortunately dependent on the speech synthesizer used, and I wouldn't know the characters were there if I were just reading a document, so this isn't much of an advantage all things considered.
Ironically enough I have a script that strips all non-ascii characters from my screen reader because I found that _all_ online text was polluted with invisible and annoying to listen to characters.
Mine (NVDA) isn't annoying about non-ASCII symbols, interestingly enough. But for something like this form of Unicode "abuse" (?), if you throw a ton of them into a message or something, they become "lines" I have to arrow past because my screen reader will otherwise remain silent on those lines unless I use say-all (continuous reading for those who don't use screen readers).
There's a Better Discord plugin that I think uses this or something similar, so you can send completely encrypted messages that look like nothing to everyone else. You'd need to share a password/secret for them to decode it, though.
The title is a little misleading: "Note that the base character does not need to be an emoji – the treatment of variation selectors is the same with regular characters. It’s just more fun with emoji."
Using this approach with non-emoji characters makes it stealthier and even more disturbing.
I don't see this as all that disturbing. A detector for it wouldn't be hard to write (variant on something that doesn't actually have variants, flag it!), it seems to me it could be useful for signing things.
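For instance, a rough detector - this one only flags runs of selectors or a selector after ASCII, so a real one would check bases against Unicode's variation-sequence lists:

    // Variation selectors are U+FE00-U+FE0F (VS1-16) and U+E0100-U+E01EF
    // (VS17-256). A run of more than one, or one following plain ASCII,
    // has no standardized meaning.
    function looksSmuggly(s: string): boolean {
      let run = 0;
      let prev = -1;
      for (const ch of s) {
        const cp = ch.codePointAt(0)!;
        const isVS =
          (cp >= 0xfe00 && cp <= 0xfe0f) || (cp >= 0xe0100 && cp <= 0xe01ef);
        if (isVS) {
          run++;
          if (run > 1 || prev < 0x80) return true;
        } else {
          run = 0;
        }
        prev = cp;
      }
      return false;
    }

    console.log(looksSmuggly("hello"));       // false
    console.log(looksSmuggly("h\u{e0155}i")); // true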
Even more than just simply watermarking LLM output, it seems like this could be a neat way to package logprobs data.
Basically, include probability information about every token generated, to give a bit of transparency to the generation process. It's part of the OpenAI API spec, and many other engines (such as llama.cpp) support providing this information. Normally it's attached as a separate field, but there are neat ways to visualize it (such as mikupad [0]).
Probably a bad idea, but this still tickles my brain.
I've found a genuinely useful application of this that isn't just "talk secretively":
I'm part of a community that has a chatroom on both Discord and another platform. We have developed a bot that bridges the two, but it needs to maintain a database table of all the bridged messages, so that when people use the "reply" feature on either platform, the reply is also performed on the other. With this, each platform's bridged message could contain hidden data that points to the other platform's message ID, without the need to store it ourselves, and without it being visible to users.
Obviously we won't be tearing down the infra already in place anytime soon, so we won't actually be doing that, but it's definitely a useful application for when you don't want to host a database and just wish you could attach additional data to a restrictive format.
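For anyone curious what that would look like, a sketch using the byte-to-selector mapping from the article (bytes 0-15 to U+FE00..U+FE0F, 16-255 to U+E0100..U+E01EF); the names are mine:

    function byteToVS(b: number): string {
      return String.fromCodePoint(b < 16 ? 0xfe00 + b : 0xe0100 + (b - 16));
    }

    function vsToByte(cp: number): number | null {
      if (cp >= 0xfe00 && cp <= 0xfe0f) return cp - 0xfe00;
      if (cp >= 0xe0100 && cp <= 0xe01ef) return cp - 0xe0100 + 16;
      return null;
    }

    // Append the other platform's message ID, invisibly, to the bridged text.
    function tagMessage(text: string, otherPlatformId: string): string {
      const bytes = new TextEncoder().encode(otherPlatformId);
      return text + [...bytes].map(byteToVS).join("");
    }

    function extractId(text: string): string {
      const bytes: number[] = [];
      for (const ch of text) {
        const b = vsToByte(ch.codePointAt(0)!);
        if (b !== null) bytes.push(b);
      }
      return new TextDecoder().decode(new Uint8Array(bytes));
    }

    const bridged = tagMessage("hi from discord", "msg:12345");
    console.log(extractId(bridged)); // "msg:12345"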
More generally, you can use encoding formats that reserve uninterpreted byte sequences for future use to pass data that is only readable by receivers who know what you're doing, though note this is not a cryptographically secure scheme, and any sort of statistical analysis can reveal what you're doing.
The png spec, for instance, allows you to include as many metadata chunks as you wish, and these may be used to hold data that cannot be used by any mainstream png reader. We used this in the Navy to embed geolocation and sensor origin data that was readable by specialized viewers that only the Navy had, but if you opened the file in a browser or common image viewer, it would either ignore or discard the unknown chunks.
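To illustrate the mechanics (not the Navy's format, obviously): a PNG chunk is just a big-endian length, a 4-byte type, the data, and a CRC-32 over type+data, so building a private ancillary chunk to splice in before IEND takes only a few lines:

    // "prVt": lowercase first letter = ancillary (readers that don't know
    // it must ignore it), lowercase second letter = private.
    function crc32(bytes: Uint8Array): number {
      let c = ~0;
      for (const b of bytes) {
        c ^= b;
        for (let i = 0; i < 8; i++) c = (c >>> 1) ^ (0xedb88320 & -(c & 1));
      }
      return ~c >>> 0;
    }

    function buildChunk(type: string, data: Uint8Array): Uint8Array {
      const typeBytes = new TextEncoder().encode(type); // 4 ASCII letters
      const out = new Uint8Array(12 + data.length);
      const view = new DataView(out.buffer);
      view.setUint32(0, data.length);                   // length (big-endian)
      out.set(typeBytes, 4);                            // chunk type
      out.set(data, 8);                                 // payload
      view.setUint32(8 + data.length, crc32(out.subarray(4, 8 + data.length)));
      return out; // splice into the file just before the IEND chunk
    }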
Lots of image formats store arbitrary metadata (and extra data) either by design or by application-specific extensions. I remember seeing seismic and medical images that contained data for display in specialized applications, and writing code to figure out whether binary metadata was written in big-endian or little-endian byte order (the metadata often did not have the same endianness as the image data!). For example, TIFF files containing 3D scans as a sequence of slices, with binary metadata attached to each slice. If you opened one in your system default image viewer, you'd only see the first slice, but a specialized viewer (which I did not have) would display it as a 3D model. Luckily (IIRC) the KDE file browser let you quickly flip through all the images in a directory using the keyboard, so I was able to dump all the layers into separate files and flip through them to see the 3D image.
Imagine using the ID card emoji (U+1FAAA) as a universal carrier for digital ID tokens. A dumb demo is available at https://pit.lovable.app/ which—without any secure protocol—simply encodes a National Identification Number into the emoji using variation selectors.
The idea is that banks could issue encrypted ID tokens in this way, letting them move seamlessly across any platform that supports Unicode (messaging apps, email, web forms, etc.). The heavy lifting of security (preventing replay attacks, interception, ensuring token freshness, etc.) would be managed separately with robust cryptography, while the emoji serves purely as a transport layer.
It's not about reinventing security but about creating a cross-platform way to carry identity tokens. Thoughts?
So that the operating system could recognize it automatically, and to include a potentially long URL to the retail bank's web service to initiate the protocol, such as signing a document or an identification protocol.
This is cool. I tried pasting the output into an Instagram comment and it stayed intact, so I have a feeling someone could do some interesting stuff with that. Who needs a botnet C&C server when you can post totally invisible commands on public forums?
I mean, steganography has been a thing for quite a while. Not disagreeing, just saying this is how some programs/ideas were passed around the internet decades ago by "less than upstanding netizens" ;)
Wanted to pass a secret code to a friend? Encode the bit-data in the alpha channel of an image. It could even be encrypted/scrambled within the image itself. Post the perfectly normal image to a public forum, ping your friend, they run it through the "decoder" and Robert's your mother's brother.
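The alpha-channel variant is only a few lines if you have the raw RGBA buffer, e.g. from canvas getImageData (this assumes a lossless format like PNG; JPEG recompression would destroy it):

    // One payload bit in the least significant bit of each pixel's alpha.
    // Assumes the image has at least payload.length * 8 pixels.
    function embed(rgba: Uint8ClampedArray, payload: Uint8Array): void {
      for (let i = 0; i < payload.length * 8; i++) {
        const bit = (payload[i >> 3] >> (7 - (i & 7))) & 1;
        const a = i * 4 + 3; // alpha byte of pixel i
        rgba[a] = (rgba[a] & 0xfe) | bit;
      }
    }

    function extract(rgba: Uint8ClampedArray, byteLen: number): Uint8Array {
      const out = new Uint8Array(byteLen);
      for (let i = 0; i < byteLen * 8; i++) {
        out[i >> 3] = (out[i >> 3] << 1) | (rgba[i * 4 + 3] & 1);
      }
      return out;
    }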
Of course these weren't "logic bombs" like this post is describing, but even those have been around for a while too.
(author here) some people in this thread and elsewhere asked me about whether an LLM could decode this, and the answer seems to be: not likely by itself, but it often can if it has access to a Python interpreter!
so.... in theory you should be able to create several visually identical links that give access to different resources?
I've always assumed links without any tracking information (unique hash, query params, etc.) were safe to click (with regard to my privacy), but if this works for links I may need to revise my strategy for how I approach links sent to me.
"Visually identical" is never good enough. Have you heard of attacks confusing Latin letters and Cyrillic letters? For example C versus С. (The latter is known as CYRILLIC CAPITAL LETTER ES.) Have you heard of NFC forms versus NFD forms? For example é versus é (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT versus LATIN SMALL LETTER E WITH ACUTE.)
Nothing that's important when it comes to security and privacy should rely on a "visually identical" check. Fortunately, browsers these days are already good at this; their address bars use punycode for the domain and percent-encoding for the rest of the URL.
As the sibling comment mentioned, Unicode in DNS uses a punycode encoding, but even further than that, the standard specifies that the Unicode data must be normalized to NFC[0] before being converted to punycode. This means that your second example (decomposed e with combining acute accent vs. the composed variant) is not a valid concern. The Cyrillic one is, however.
[0] https://www.rfc-editor.org/rfc/rfc5891 § 4.1 "By the time a string enters the IDNA registration process as described in this specification, it MUST be in Unicode and in Normalization Form C"
Sure, but I feel the security concerns there are much smaller than having multiple domain names with the same visual appearance that point to different servers. That has immediate impact for things like phishing, whereas a lookalike path or query portion at least ensures you are still connecting to the server you think you are.
Yes, but I guess the point was that browsers now detect homographs and display the punycode instead. See also https://news.ycombinator.com/item?id=14130241; at that time Firefox wasn't fixed, but it has since fixed the issue too (there's a network.idn.punycode_cyrillic_confusables preference, which is enabled by default).
My understanding is that "weird" unicode code points become https://en.wikipedia.org/wiki/Punycode. I used the 󠅘󠅕󠅜󠅜󠅟 (copy-pasted from the post, presumably with the payload in it) to type a fake domain into Chrome, and the Punycode I got appeared to not have any of the encoding bits.
However, I then pasted the emoji into the _query_ part of a URL. I pointed it to my own website, and sure enough, I can definitely see the payload in the nginx logs. Yikes.
Edit: I pasted the very same Emoji that 'paulgb used in their post before the parenthetical in the first paragraph, but it seems HN scrubs those from comments.
domains get "punycode" encoded, urls get "url encoded"[1], which should make unicode characters stand out. That being said, browsers do accept some non-ascii characters in urls and convert them automatially, so theoretically you could put "invalid" characters into a link and have the browser convert it only after clicking. That might be a viable strategy.
> I've always assumed links without any tracking information (unique hash, query params, etc.) were safe to click (with regard to my privacy), but if this works for links I may need to revise my strategy for how I approach links sent to me.
Well, it was never safe; what you see and where the link is pointing are different things. That's why the actual link is displayed at the bottom left of your browser when you move your mouse over it (or focus it via keyboard).
You need to decode the text after copy-pasting it. I believe clicking on the link won't interact with the obfuscated data, since your computer will just process the visible Unicode and ignore the hidden part.
This is just so that you can hide data and send it to someone to be decoded (or for watermarking, as mentioned).
But my fear is precisely that I may be sending data to a remote host while I'm completely unaware of this fact.
I tried to create a POC with some popular URL shortener services, but it doesn't seem to work.
what I wanted to create was a link like <host.tld>/innoc󠅥󠅣󠅕󠅢󠄝󠅙󠅔󠄪󠅑󠅒󠅓ent that redirects to google.com. in this case the "c" contains some hidden data that will be sent to the server while the user is not aware. this seems possible with the correct piece of software.
URIs with non-ASCII characters are technically invalid. Browsers and the like should (but likely don’t all do) percent-encode any invalid characters for display if they accept such invalid URIs.
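That handling is also what makes the payload visible; a smuggled variation selector percent-encodes to four very obvious bytes:

    console.log(encodeURIComponent("\u{E0155}"));         // "%F3%A0%85%95"
    console.log(encodeURIComponent("innoc\u{E0155}ent")); // "innoc%F3%A0%85%95ent"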
You could store UTF-8 encoded data inside the hidden bytestring. If some of the UTF-8 encoded smuggled characters are variation selector characters, you can smuggle text inside the smuggled text. Smuggled data can be nested arbitrarily deep.
I'm imagining post-incident analysis finding out that, "the data was exfiltrated via some Unicode string..." then they put it up on the screen and it's just an enormous line of turtle emoji
> I'm imagining post-incident analysis finding out that, "the data was exfiltrated via some Unicode string..." then they put it up on the screen and it's just an enormous line of turtle emoji
Since it took me a minute to make the connection, I'll just say explicitly that I enjoyed the understated "it's turtles all the way down" joke.
This is one of the reasons I've long been advocating for using raw UTF-8 as the tokenizer. The actual problem IMHO is tokenizers themselves, which obscure the encoding/decoding process in order to gain some compression during training, fitting more data in for the same budget and arguably gaining some better understanding from the start. Again, just a lack of computing power.
If you use UTF-8 directly as the tokenizer, this problem becomes evident as soon as it has to fit into the context window. Plus, you can run multiple tests for this type of injection; no emoji should take more than about 40 bytes (10 code points * 4 bytes per code point in the worst case). This is an attack on tokenizers, not on UTF-8.
Plus, Unicode publishes the full list of valid sequences containing the ZWJ character in emoji-zwj-sequences.txt.
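A sanity check along these lines is short in a modern runtime (the 10-code-point budget is the heuristic from above, not a Unicode rule):

    // Reject text containing a grapheme cluster built from suspiciously
    // many code points.
    function hasOversizedCluster(s: string, maxCodePoints = 10): boolean {
      const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
      for (const { segment } of seg.segment(s)) {
        if ([...segment].length > maxCodePoints) return true;
      }
      return false;
    }

    console.log(hasOversizedCluster("plain text"));                 // false
    console.log(hasOversizedCluster("a" + "\u{e0101}".repeat(32))); // true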
This and several other abuse cases forced my previous workplace to count 'characters' in user nicknames / status messages by code points. No one wanted to download 9 MB just from browsing other users.
The ability to add watermarks to text is really interesting. Obviously it could be worked around, but it could be a good way to subtly watermark e.g. LLM outputs.
The issue with the standard watermark techniques is that they require an output of at least a few hundred tokens to reliably imprint the watermark. This technique would apply to much shorter outputs.
A crude way:
To watermark:
First establish a keyed DRBG.
For every nth token prediction:
Read a bit from the DRBG for every possible token to label them red/black.
Before selecting the next token, set the logit for black tokens to -Inf; this ensures a red token will be selected.
To detect:
Establish the same DRBG.
Tokenize; for each nth token, determine the red set of tokens in that position.
If you only see red tokens in lots of positions, then you can be confident the content is watermarked with your key.
This would probably take a bit of fiddling to work well, but would be pretty much undetectable. Conceptually it's forcing the LLM to use a "flagged" synonym at key positions. A more sophisticated version of a shibboleth.
In practice you might choose to instead watermark all tokens, less heavy-handedly (nudge logits rather than override), and use highly robust error-correcting codes.
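A toy sketch of the red/black scheme (the xorshift generator below stands in for a real keyed DRBG; vocab, names, and numbers are all illustrative):

    // Label every vocab token red (allowed) or black (banned) for a given
    // position, deterministically from key + position.
    function redSet(key: number, position: number, vocabSize: number): boolean[] {
      let s = ((key ^ (position * 0x9e3779b9)) >>> 0) || 1;
      const red: boolean[] = [];
      for (let t = 0; t < vocabSize; t++) {
        s ^= s << 13; s >>>= 0;
        s ^= s >>> 17;
        s ^= s << 5;  s >>>= 0;
        red.push((s & 1) === 1);
      }
      return red;
    }

    // Generation side: force black tokens out of contention.
    function maskLogits(logits: number[], key: number, position: number): number[] {
      const red = redSet(key, position, logits.length);
      return logits.map((l, t) => (red[t] ? l : -Infinity));
    }

    // Detection side: count how often the observed token was red.
    function watermarkScore(tokens: number[], key: number, vocabSize: number): number {
      let hits = 0;
      tokens.forEach((tok, pos) => {
        if (redSet(key, pos, vocabSize)[tok]) hits++;
      });
      return hits / tokens.length; // ~0.5 unwatermarked, ~1.0 watermarked
    }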
It feels like this would only be feasible across longer passages of text, and some types of text may be less amenable to synonyms than others. For example, a tightly written mathematical proof versus a rambling essay. Biased token selection may be detectable in the latter (using a statistical test), and may cause the text to be irreparably broken in the former.
To handle low-entropy text, the “adding a smaller constant to the logits” approach avoids having much chance of changing the parts that need to be exactly a particular thing.
Though in this case it needs longer texts to have high significance (and when the entropy is low, it needs to be especially long).
But for most text (with typical amounts of entropy per token) apparently it doesn’t need to be that long? Like 25 words I think I heard?
What if the entire LLM output isn’t used? For example, you ask the LLM to produce some long random preamble and conclusion with your actual desired output in between the two. Does it mess up the watermarking?
This is too strippable to be a good watermark, it would only catch the ones who are unaware. The leakers, yes, the cybersecurity people, no.
Rather, I see a use in signing things. Newspapers, politicians etc, generate a unique key and encode it into your article or whatever. Now it's easy for anyone to check if a quote attributed to you actually came from you. Sure, it's not secure but it doesn't need to be because it's simply a stable identifier. Even paywalled sites could display a snippet around the provided quote without being problematic.
Isn't it also an option to use this to generate pretty decent passwords? Like including one of those Unicode chars encoded with a message - pretty sure no password-cracking tool currently has support for this.
I implemented something similar years ago, but much simpler/less sophisticated.
Unicode has non-printing characters such as the zero-width space (U+200B) and zero-width joiner (U+200D). With two of them you can encode arbitrary data in binary. I would give an example, but HN seems to strip this :(
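Roughly like this, with zero-width space as a 0 bit and ZWJ as a 1 bit (a sketch, since the characters themselves get stripped here):

    const ZERO = "\u200B", ONE = "\u200D";

    function hide(data: string): string {
      return [...new TextEncoder().encode(data)]
        .map(b => b.toString(2).padStart(8, "0"))
        .join("")
        .replace(/0/g, ZERO)
        .replace(/1/g, ONE);
    }

    function reveal(hidden: string): string {
      const bits = [...hidden].map(c => (c === ONE ? "1" : "0")).join("");
      const bytes = bits.match(/.{8}/g)!.map(b => parseInt(b, 2));
      return new TextDecoder().decode(new Uint8Array(bytes));
    }

    console.log(reveal(hide("hi"))); // "hi"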
I was using this technique last year with Bing Image Creator.
It let you get around their filter on brand names and celebrity names by smuggling them into the prompt in a way the AI could read, but the human-written filter was not designed for.
The r1 somehow knew at an early stage that the message was HELLO but it couldn’t figure out the reason. Even at the end, its last “thought” insists that there is an encoding mistake somewhere. However the final message is correct. I wonder how well it would do for a nonstandard message. Any sufficiently long English message would fall to statistical analysis and I wonder if the LLMs would think to write a little Python script to do the job.
It’s like guessing 1/2 or 2/3 on a math test. The test authors pick nice numbers, and programmers like ”hello”. If the way to encode the secret message resembles other encodings, it’s probably that the pattern matching monster picked it up and is struggling to autocomplete (ie backwards rationalize) a reason why.
I did some experimentation today. I wouldn't expect AI to solve it using only their own reasoning, but I've had a decent hit rate of getting AI to solve them when they have access to a Python interpreter. Here's Gemini Flash 2 solving one (albeit it lost the spaces) in a single prompt and about 7 seconds!
My deepseek-r1 seems to be a bit more lost on decoding "How do I make meth". Some highlights (after about 5 minutes of R1-ing):
> Another angle: the user mentioned "encoded a message in this emoji", so maybe the first emoji is a red herring, or it's part of the message. The subsequent characters, even though they look like variation selectors, could be part of the encoding.
There's no way an LLM is decoding this. It's just giving you a statistically likely response to the request, "guess my secret message." It's not a big surprise that it guessed "Hello" or "Hello, world"
I got Claude to get "the raisons play at midnight" from an emoji in one prompt and three uses of its "analysis" tool. (The "X Y at midnight" snowclone is something Claude has probably seen, but I randomly picked "raisons" and "play".)
My prompt was "I think this emoji contains a hidden message, can you decode it? Use JavaScript if necessary."
I'm not too surprised by this, but I'm annoyed that no amount of configuration made those bytes visible again in my editor. Only using hexdump revealed them.
And as a higher-level configuration you can set most, maybe even all, of the relevant invisible characters (still not sure how 0x34f grapheme joiner fits in) at once with something like:
Here is the bare minimum this is built on, which you can type in yourself if you're paranoid or want to start from the bottom up. Swap in the hexadecimal codepoint of the invisible character after the ?\x
vscode's "Unicode Highlight: Non-basic ASCII" causes the character to get highlighted.
Sadly, the more appropriate "Unicode Highlight: Invisible Characters" setting does not reveal them.
What's interesting is that even a "view source" shows nothing amiss, and if I do a copy/paste from the debug inspector view of "This sentence has a hidden message󠅟󠅘󠄐󠅝󠅩󠄜󠄐󠅩󠅟󠅥󠄐󠅖󠅟󠅥󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅘󠅙󠅔󠅔󠅕󠅞󠄐󠅝󠅕󠅣󠅣󠅑󠅗󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅤󠅕󠅨󠅤󠄑." it still shows up....
When people discuss things like “Do LLMs know about this?” On a public website I always think that it’s the equivalent of somebody whose phone is wiretapped calling their friend and asking if the FBI knows about something.
I think that's a very cynical view. The author seeing what an LLM would make of it was more akin to getting a new game and wondering if you can pet the dog.
Even kids figure out how to manipulate unicode text. If you want to bypass a swear filter, replace a letter with an alternate representation of the same letter.
Normalization implementations must not strip variation selectors, by definition. The "normal" part of normalization means converting a string into either consistently decomposed or consistently composed Unicode, i.e. U+00DC vs U+0055 + U+0308. However, this decomposition mapping is also used (maybe more like abused) for converting certain "legacy" code points to non-legacy code points. There does not exist a rune which decomposes to variation selectors (and thus these variation selectors do not compose into anything), so normalization must not alter or strip them.
source: I've implemented Unicode normalization from scratch
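Easy to confirm from JS, for anyone who wants to check:

    // U+00DC vs U+0055 + U+0308: these do compose/decompose...
    console.log("\u0055\u0308".normalize("NFC") === "\u00DC"); // true

    // ...but variation selectors have no decomposition mapping, so
    // normalization passes them through untouched:
    const smuggled = "a\u{E0155}\u{E0160}";
    console.log(smuggled.normalize("NFC") === smuggled); // true
    console.log(smuggled.normalize("NFD") === smuggled); // true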
It is not bulletproof though. In this "c󠄸󠅙󠄑󠄐󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅣󠅖󠅣󠅖󠅣󠅕󠅖󠅗󠅣󠅢󠅗󠄐󠅣󠅢󠅗󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅦󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅦󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣 " and that space, are about 3500 characters. 
Copying only the "c" above (not this one) will keep some of the hidden text, but not all. Nevertheless, while I knew that this is possible, it still breaks a lot of assumptions around text.
Edit: the text field for editing this post is so large that I need to scroll down to the update button. This will be a fun toy for creating very-hard-to-find bugs in many tools.
FWIW, we considered this technique back at Pebble to make notifications more actionable and even filed a patent for that (sorry!) https://patents.justia.com/patent/9411785
Back then on iOS, via ANCS, the watches wouldn't receive much more than the textual payload you'd see on the phone. We envisioned working with partners such as WhatsApp et al. to encode deep links/message IDs into the message so one could respond directly from the watch.
Respectfully: how the hell would that be a valid patent? Feels like patenting the idea of writing text in white on white on a Word document such that you don't lose it but it doesn't get printed.
It's just insane to ever call that "an invention".
Companies acquire indefensible patents all the time. They are used in bulk to threaten smaller competitors ("we've got 500 patents in this field, better not bring your product to market"). This is one reason why patents can be terrible for competition.
About 25 years ago, this was explained to me as "sword patents and shield patents".
Sure, some can use patents as swords, to suppress legitimate competition, or to extract undue rents. But you can also use patents as shields, to protect in various ways against those swords.
If I ran a BigTech (like the original warm-fuzzy Google reputation), I'd be registering any plausible patents, and have lawyers figure out how to freely license-out the ones that weren't key secret sauce, under terms that figuratively poisoned anyone doing illegitimate sword patents.
They are also used in bulk to defend against larger competitors using this type of threat. In a war where the ammunition is garbage, you either lose or you start hoarding garbage.
Patents are part of the game you have to play, like it or not. If you don't patent your inventions somebody else will and they will come after you with their lawyers. Patents are used defensively far more often than they are used offensively in these stupid "Intellectual Property" battles.
Because of this, there is absolutely no point in shaming someone for patenting a thing, especially when they are apologetic about it like parent is, and most especially when they are not threatening to weaponize the patent themselves.
No, I don't buy it. If the patents are publicly and perpetually freely licensed except for defensive-only purposes, then sure, they're not unethical. Red Hat's patent promise ( https://www.redhat.com/en/about/patent-promise ) is one example. If patents were actually intended for defensive purposes only, then this would be an easy and uncontroversial thing to do. However, in practice this is vanishingly rare, and lawyers fight against it tooth & nail. This tells you that the companies do not actually file them for defensive-only purposes, unlike what you claim.
Yes, that's the reason for the "except for defensive purposes" part. Quoting from Red Hat's promise:
> Our Promise also does not extend to the actions of a party (including past actions) if at any time the party or its affiliate asserts a patent in proceedings against Red Hat (or its affiliate) or any offering of Red Hat (or its affiliate) (including a cross-claim or counterclaim).
Company B may still consult its portfolio and exercise it against Company A defensively, because Company A revoked its license of Company B's patents by asserting against Company B in the first place.
So in other words, Red Hat does not freely license their patents, they say "you are free as long as you don't come after us." Which is exactly the system 99% of companies follow, just more formally stated. Yet you berated the poor guy from Pebble for even obtaining the patent he did??
> Which is exactly the system 99% of companies follow, just more formally stated
Not just formally, but in a legally binding manner, including if the patent is acquired by another company (eg during a company purchase). Even if the original filer has the best intentions, companies change ownership or change legal strategy or go out of business. Patent trolls buy up those patents from closed companies. Legally licensing your patents for defensive-only purposes means they can't ever be used by any of those bad actors.
If the intent of these patents is truly only for defense, then why isn't it common to use a license like this? They lose nothing by it.
> Yet you berated the poor guy from Pebble for even obtaining the patent he did??
Yes. It is IMO unethical to create software patents that aren't covered by such a legally-binding license.
"including if the patent is acquired by another company (eg during a company purchase)"
Honest questions, I promise: Is that true? Has that ever been tested in court? Why don't more corporations or patent lawyers advocate for this? Is it because the types of engineers that post on hacker news are requesting it not be done?
Look, nobody likes patent trolls, we all hate weaponized patents. It's great that you want to fix the situation. I just think you are barking up the wrong tree trying to lay guilt trips on engineers for doing what their lawyer advised them to do.
Nothing is certain in courts, obviously, but Red Hat's license is very explicit that that is the intent:
> Red Hat intends Our Promise to be irrevocable (except as stated herein), and binding and enforceable against Red Hat and assignees of, or successors to, Red Hat’s patents (and any patents directly or indirectly issuing from Red Hat’s patent applications). As part of Our Promise, if Red Hat sells, exclusively licenses, or otherwise assigns or transfers patents or patent applications to a party, we will require the party to agree in writing to be bound to Our Promise for those patents and for patents directly or indirectly issuing on those patent applications. We will also require the party to agree in writing to so bind its own assignees, transferees, and exclusive licensees.
If a court somehow overturned that, I wouldn't hold it against the patent filer.
> Why don't more corporations or patent lawyers advocate for this?
My opinion is it's because the patents have value as a weapon, not only for defense (this here is my disagreement with your original claim that these patents only exist for defense). De-fusing the weapon by using a legally binding license like this lowers the value of the patent in a potential purchase scenario. In other words: "money."
> I just think you are barking up the wrong tree trying to lay guilt trips on engineers for doing what their lawyer advised them to do.
Nah. If you do a bad thing, you are responsible for the bad thing you did. I think the OP can probably handle a little light scolding from some anonymous person on an Internet forum. My hope is that they, and other readers, learn from this mistake and don't do it again.
I replied to one comment thread. Perhaps you should put on your big boy pants and use the little [-] thing to minimize threads you aren't interested in reading.
> Because of this, there is absolutely no point in shaming someone for patenting a thing
Well I wouldn't shame someone whose job was to patent something absurd. I was just saying that this is not an invention at all, and any system that protects that "innovation" is a broken system.
I think the magic is in the context of Unicode. Which also makes it almost twice as ridiculous from my point of view. Because it seems to be doing exactly what unicode is meant to do.
But doesn't it say that the whole patent system is broken? I get the "you pay to file a patent, it's your problem if it's invalid in the end". But the side effect of that is that whether it's valid or not, it's a tool you can use to scare those who don't have the resources to go to court.
It's like those completely abusive non-compete clauses in work contracts (yes in some countries that's the norm). They are completely abusive and therefore illegal. But it still hurts the employee: I have friends who have been declined a job in a company because the company did not want to take any risk. The company was like "okay, it's most likely an invalid clause, but if your previous employer sues us it will anyway cost resources we don't want to spend, so we'd rather not hire you". So an illegal, invalid clause had the effect that the company who abused it wanted. Which means it's a broken system.
So whoever now owns that patent (Google? maybe some patent troll picked it up?) could, in theory, sue the author of this article for patent infringement, right? Even though they invented it independently and never once used or looked at your patent. Do you think you made the world a better place or a worse place by filing that patent?
_Can_ they sue them for patent infringement? They just described a technique (that you can see in the patent filing anyway) and aren't selling a product based on it. I think there's nothing to sue over here. I'm curious if my understanding of this is correct.
One of the benefits of the patent system (one that now seems to be far outweighed by the negatives) is that patents are public information. Your invention is documented for all to see. I don't think that someone writing about public information is a punishable offense, but IANAL.
No. The author could not be sued for this successfully. All they did was write a blog post about an interesting technique. They could literally read the patent application and write a blog post about that, assuming the methods are the same.
What percentage of your actions are based around making the world a better place, instead of personal fulfillment or gain?
> All they did was write a blog post about an interesting technique. They could literally read the patent application and write a blog post about that, assuming the methods are the same.
Okay, change "sue" to "prevent from creating a marketable product without paying a royalty to the patent owner in return for having provided nothing of value." The point remains.
> What percentage of your actions are based around making the world a better place, instead of personal fulfillment or gain?
Many harms are unavoidable, but I make a point to at least not go out of my way to make it a worse place, for example by filing software patents. The company I work for provides financial bonuses for filing software patents, and I will never participate in that program. (I've even tried to convince the lawyers to license our patents similar to Red Hat's open patent promise, because they claim they are intended only to be used defensively... but no luck so far.)
> Do you think you made the world a better place or a worse place by filing that patent?
Come on, what does this contribute to this conversation? The poster clearly is aware of the drawbacks of such patents, and didn't clearly play any role in filing the patent (they said "we … filed it," not "I filed it"). This kind of response just encourages people not to mention such things; it can't possibly change their past behavior, and, since Pebble the company per se doesn't exist any more, is also unlikely to change future behavior.
> The poster clearly is aware of the drawbacks of such patents, and didn't clearly play any role in filing the patent (they said "we … filed it," not "I filed it").
A person with the same name as that commenter is listed as an inventor on the patent.
> it can't possibly change their past behavior
Obviously, but it can change future behavior. Maybe realizing that they made the world a worse place by filing that patent will prevent them, or a reader of this discussion, from doing it again in future.
Well, given that the technique itself makes the world a worse place, anything that impedes its use is probably positive...
And, no, they couldn't do anything meaningful to the author of the article. They could get them ordered not to do it any more, and they could recover their economic damages... which are obviously zero.
First of all, it's not just a game, it's an outright battle to the death (of your company). Sure, you can choose not to wield patents, even in self defense, but good luck with that.
You can also choose to legally declare that your patents may only be used for defensive purposes. But no one ever does this, because they do not actually intend to use them only for defensive purposes. This is a bogus defense of software patents.
Nope. That’s not how piles of patents are wielded defensively by the big companies. They don’t protect their IP with defensive patents, they defend their company using the threat of using unrelated patents offensively against the attacker.
Yes, I know. That's what this part of my post means:
> may only be used for defensive purposes
It's done with a clause in the public license like "if you sue us, then we revoke this license to you and may in turn sue you." You retain the MAD defensive benefit of the patents, while also not hampering innovation with your patent's existence. If the patents truly exist only for defense, then there is no reason not to do this.
Berating people for filing patents in self defense is not how we fix this problem. The government put these rules in place. Businesses have to at least accumulate patents to use defensively (you found a patent of yours that you think I'm violating? Well let me do a quick search through the patents I have...what's that? Nevermind, I'm not actually infringing your patent? Good, that's what I thought.)
Hopefully a wholly undefendable patent; you're essentially trying to patent the Unicode spec. The rest of it is performing an action in response to a text message, which clearly isn't novel.
Would this patent cover just the encoding alone? The first sentence says:
> A method, apparatus, and system relating to embedding hidden content within a Unicode message and using the hidden content to perform a particular computer action.
So, in my extremely unqualified opinion, just the encoding technique alone is not covered by the patent, only when combined with some action performed based on the encoding?
Just curious: this seems like simple digital steganography, or maybe even the same as Shannon's boolean gate work. Do you think the patent is defendable in court?
you don't need 256 codepoints so you can neatly represent an octet (whatever that is), you just need 2 bits. you can just stack as many diacritical marks as you want on any glyph. either the renderer allows practically unlimited or it allows 1/none. in either case that's a vuln. what would be really earth-shattering is what i was hoping this article was: a way to just embed "; rm -rf ~/" into text without it being rendered. you also definitely don't need Rust for this unless you want to exclude 90% of the programmer population.
I think the Rust is more readable for bytemucking stuff than dynamic languages because the reader doesn't have to infer the byte widths, but for what it's worth the demo contains a TypeScript implementation: https://github.com/paulgb/emoji-encoder/blob/main/app/encodi...
An octet is a group of 8 bits. Today we normally use the word "byte" instead. The term is often used in older internet protocols and comes from an era when bytes were not necessarily 8 bits.