Hacker News
Reverse-Engineering Apple Dictionary (2020) (fmentzer.github.io)
278 points by goranmoomin on Sept 12, 2021 | 59 comments



Another approach for this is to explore the format through Apple's tools for building dictionaries – as they provide a "Dictionary Development Kit" in Xcode's downloadable "Additional Tools" package (which has documentation for the XML format and a bunch of scripts/binaries for building the bundle).

I wound up doing this a while ago for a similar toy project. After some poking around, it turned out that dictionary bundles are entirely supported by system APIs in CoreServices! The APIs are private, but Apple accidentally shipped a header file with documentation for them in the 10.7 SDK [1]. You can load a dictionary with `IDXCreateIndexObject()`, read through its indices with the search methods (and the convenient `kIDXSearchAllMatch`), and get pointers to its entry data with `IDXGetFieldDataPtrs()`.

It takes a bit of fiddling to figure out the structure (there are multiple indices for headwords, search keywords, cross-references, etc., and the API is a general-purpose trie library) and request the right fields, but those property lists in the bundle are there to help! (As the author of this article discovered, the entries are compressed and are preceded by a 4-byte length marker.)

[1] https://github.com/phracker/MacOSX-SDKs/blob/master/MacOSX10...
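Based on the structure described above (each entry zlib-compressed and preceded by a 4-byte little-endian length), a minimal parser can be sketched in Python. The exact framing here (the length counts only the compressed payload, records are packed back-to-back) is an assumption, and the data is synthetic:

```python
import struct
import zlib

def parse_entries(data: bytes) -> list:
    """Walk a buffer of length-prefixed zlib records.

    Assumed layout: [4-byte little-endian length][zlib stream], repeated,
    where the length counts only the compressed payload.
    """
    entries = []
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        entries.append(zlib.decompress(data[offset:offset + length]))
        offset += length
    return entries

def make_entry(payload: bytes) -> bytes:
    """Build one synthetic record in the same layout, for demonstration."""
    compressed = zlib.compress(payload)
    return struct.pack("<I", len(compressed)) + compressed

blob = make_entry(b"<d:entry>apple</d:entry>") + make_entry(b"<d:entry>pear</d:entry>")
print(parse_entries(blob))  # [b'<d:entry>apple</d:entry>', b'<d:entry>pear</d:entry>']
```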


I have memories of using the Dictionary Development Kit to create custom dictionaries (I remember creating one for medical jargon) about ten years ago. (At that time custom dictionaries were placed in ~/Library/Dictionaries, and system dictionaries in /System/Library/Dictionaries, not some obfuscated path like now.)

To find the Kit in question, simply Google "com.apple.TrieAccessMethod" and you should find it online.


For anyone else needing to tackle something like this, it's definitely worth checking out Binwalk [1]. It is meant for extracting firmware, but it works decently well on most files-in-files type data formats.

[1] https://github.com/ReFirmLabs/binwalk


This was my first instinct as well! Running binwalk on the binary dictionary file immediately outputs that it's a series of zlib-compressed values.


It seems highly likely (given that this is a dictionary and requires fast lookups) that you're reverse engineering something like CEVFS (i.e. a virtual file system for compressing a database). Which is why the dictionary is broken into chunks... these are the compressed pages of the database.


Another way to extract all of those compressed zip files would have been to use binwalk.

    binwalk --dd='.*' *.asset
Edit: should have read the comments before I posted this, because enragedcacti already mentioned this tool an hour ago.


Thank you, I didn't know about Binwalk.

I used it and was able to figure out the remaining bits of the file format thanks to you and other tips in this thread.

https://github.com/solarmist/apple-peeler


Thank you for posting this code on Github! There has been some reverse-engineering done on the language dictionaries bundled with Mac OS, and it's nice to know that the same model is being used on the Apple Watch! I look forward to seeing your dictionary app.

https://josephg.com/blog/reverse-engineering-apple-dictionar...

There's also a command-line tool that can query the dictionary:

https://github.com/takumakei/osx-dictionary

Something I haven't yet reverse-engineered is Apple's word segmentation. I can get the word breaks in Chinese by pressing option + right arrow + space, repeatedly. But I have no idea how the backend for that works.


>Apple's word segmentation

Unless they changed it, it's probably similar to CFStringTokenizer which used ICU Boundary Analysis (and maybe mecab for Japanese).


Thank you! The ICU Boundary Analysis documentation says it uses a dictionary to split Chinese, Japanese, Thai or Khmer.

https://unicode-org.github.io/icu/userguide/boundaryanalysis...

Is that the same as the macOS dictionary being parsed here? It seems like a pretty big file to grep every time!
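For intuition about how a dictionary-based break iterator works, here is a deliberately simplified greedy longest-match segmenter in Python. ICU's actual implementation compiles the word list into a trie and weighs candidate splits by frequency, so this is only a sketch of the idea:

```python
def segment(text: str, lexicon: set) -> list:
    """Greedy longest-match segmentation over a word set.

    At each position, take the longest lexicon entry that matches;
    characters not covered by the lexicon are emitted on their own.
    """
    words = []
    i = 0
    max_len = max(map(len, lexicon))
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon:
                words.append(text[i:i + length])
                i += length
                break
        else:
            words.append(text[i])  # unknown character: emit as-is
            i += 1
    return words

lexicon = {"我们", "喜欢", "学习", "中文"}
print(segment("我们喜欢学习中文", lexicon))  # ['我们', '喜欢', '学习', '中文']
```

Greedy matching fails on genuinely ambiguous strings, which is exactly why ICU's segmenter ranks alternative splits instead of committing to the longest match.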


No, the ICU dictionaries can be seen at: https://github.com/unicode-org/icu/tree/main/icu4c/source/da...

I assume they're converted at compile time into a more efficient query format.


I have thought about building a vocabulary learning tool for learning Japanese on top of Apple Dictionary. My idea is simple: the user collects dictionary items and the tool offers lookup / spaced repetition.

However, I'm concerned that the dictionary is copyrighted. Is there any precedent that says whether such a tool would be legal/illegal?


If your tool doesn't come with bundled data from Apple, extracts the data locally, and doesn't leak it to another party, it's probably safe to do so (like emulators).


This was an amazingly delightful read. I was hoping it would shed some light on a long-time project I had which was reversing the Oxford language dictionaries that were included on CDROM with the big printed texts. (I already did it, but with a debugger instead of by reversing the binary format). Alas, it did not, but it was super encouraging to see the enthusiasm and interest in language dictionaries.


would be really really cool if someone could make a small script to convert these into a format understandable by dict://

https://en.wikipedia.org/wiki/DICT


I've created a first step towards that. A general purpose tool for extracting the data back into XML.

https://github.com/solarmist/apple-peeler


I bet it'd be pretty easy to make an awk script to convert XML into something like dictfmt(1)'s -j Jargon File format text.

https://linux.die.net/man/1/dictfmt http://www.catb.org/jargon/oldversions/jarg447.txt
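As a rough illustration (in Python rather than awk), here is a converter producing something like the `:headword:definition` lines that dictfmt's -j (Jargon File format) mode expects. The flat `<entry>/<headword>/<definition>` element names are hypothetical; real Apple entries use the namespaced XML documented in the Dictionary Development Kit:

```python
import xml.etree.ElementTree as ET

def to_jargon(xml_text: str) -> str:
    """Flatten simple dictionary XML into ':headword:definition' lines."""
    root = ET.fromstring(xml_text)
    out = []
    for entry in root.iter("entry"):
        head = entry.findtext("headword", "").strip()
        defn = entry.findtext("definition", "").strip()
        out.append(f":{head}:{defn}")
    return "\n".join(out)

sample = """<dictionary>
  <entry><headword>apple</headword><definition>a round fruit</definition></entry>
  <entry><headword>pear</headword><definition>a sweet fruit</definition></entry>
</dictionary>"""
print(to_jargon(sample))
```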


Maybe. Don't know. I've never really used dictfmt or the like.

What's it good for?


Once you get the dictd server set up, all you need to do to get a definition is run curl. Google 'curl dict' for some stuff.


The "seemingly random bytes" look like a small 32-bit little endian number to me, probably the length of the subsequent payload.


Author here, funny to see this popping up on HN :) It was definitely a fun ride making this.


Thanks for sharing this.

I've used your article and other previous attempts to clear up the remaining unknown bits of the file format and build apple-peeler, a tool for extracting the XML from the dictionary files.

https://github.com/solarmist/apple-peeler


Nice read!

Just to mention that Core Foundation includes tokenisers (CFStringTokenizer) and lemmatisers (NSLinguisticTagger) of its own in case you ever wanted to avoid the Python processing!


That's awesome!

But be aware (i.e. "beware") that Apple can pull the rug out of unpublished APIs, without warning.

I have been caught out by this myself.


This is fun. I have another idea: I'd be interested in calling Siri from the command line, even if using private APIs (but without hacky fake drivers or accessibility tools).


The closest I've managed to this has been through AppleScript (which likely violates your "no accessibility tools" criterion): https://forum.keyboardmaestro.com/t/type-to-siri/13328/17


Yeah, I guess another requirement, it should be as "invisible" as possible. So you could programmatically interact with Siri in the background and perhaps not notice.


The knowledge part is powered by Wolfram Alpha, which has an API, I believe


I assume one reason Apple has made it more challenging to extract the dictionary resources is in order to satisfy licensing constraints with the dictionary authors. I wonder if they'd block an app like this through the App Store submission process, if submitted.


That’s highly unlikely. The author is incorrect about it being a Zip file; it’s actually a simple gzip/zlib stream. Because it’s not possible to seek within such a stream, they are always chunked in any file that requires random access. Elsewhere there will be an index file which maps words to their corresponding chunks, so that the definition can be quickly loaded without having to decompress the entire file - or even load it all into memory. This is very normal stuff in the world of file formats.
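The scheme described above can be sketched in a few lines of Python: compress entries in fixed-size chunks, keep a word-to-chunk index, and decompress only one chunk per lookup. This is an illustration of the general technique, not Apple's actual layout:

```python
import zlib

def build_chunked(entries: dict, chunk_size: int = 2):
    """Compress entries in fixed-size chunks and build a word -> (chunk, slot)
    index, so a lookup only has to decompress one chunk.

    Definitions are NUL-joined inside a chunk, so they must not contain NULs.
    """
    chunks, index = [], {}
    items = list(entries.items())
    for start in range(0, len(items), chunk_size):
        chunk_no = len(chunks)
        block = []
        for word, definition in items[start:start + chunk_size]:
            index[word] = (chunk_no, len(block))
            block.append(definition)
        chunks.append(zlib.compress("\x00".join(block).encode()))
    return chunks, index

def lookup(word: str, chunks: list, index: dict) -> str:
    """Decompress only the chunk containing the word."""
    chunk_no, slot = index[word]
    return zlib.decompress(chunks[chunk_no]).decode().split("\x00")[slot]

entries = {"apple": "a round fruit", "pear": "a sweet fruit", "plum": "a stone fruit"}
chunks, index = build_chunked(entries)
print(lookup("plum", chunks, index))  # a stone fruit
```

A real format would store the index and chunk offsets in the file itself; here they just live in memory to keep the sketch short.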


I suspect it’s more likely they have a bunch of internal frameworks for creating, accessing, and distributing simple databases (something akin to SQLite).

So the difficulty we’re seeing here isn’t deliberate obfuscation, but rather just a dev using a database structure to make app design and word lookup easier.

If there really were licensing concerns, I would expect there to at least be some basic encryption, not just some basic compression.


Apple uses regular SQLite for a whole bunch of other applications, such as Apple Photos. I wonder why they didn't use it here?


The underlying file format and associated libraries are probably largely unchanged from the NeXTStep era. If it ain’t broke, don’t fix it.

http://toastytech.com/guis/ns20mail.png

(Edit: beaten by 30 seconds!)


The dictionary app (and hence its database) may go all the way back to NeXTStep/Openstep.

Maybe not all of it. But I wouldn't doubt if bits of it went all the way back to the early days where a fancy dictionary was one of the star features of NeXTStep.


I’d definitely assume this would be a copyright problem if used in an app.


Yes, a lot of people involved in dictionary processing are worried about copyright!

After I wrote my post about Apple's dictionary files, I got a mysterious email showing up in my inbox. The email was from someone who's spent some time writing code to do the same thing, but doesn't want to post it under his own name in case he falls foul of his country's DMCA equivalent. Crazy. He said I could post his code under the condition that I took his name off it.

https://josephg.com/blog/apple-dictionaries-part-2/


I have helped with the writing and editing of a number of dictionaries over the years. It's difficult, highly specialized work. For some of the jobs, the only compensation has been a share of the royalties from future sales. I doubt if many people like me are enthusiastic about the dictionary data becoming accessible for free.


Yes, dictionary content is the revenue source of dictionary vendors, so of course they don't want anyone to use it without permission. On the other hand, there are more and more open-data projects (I started one myself), often based on printed dictionaries that fell into the public domain.


I started one myself, too, more than twenty years ago, in a naive burst of enthusiasm about the potential of online collaboration. It was an attempt to create a new comprehensive Japanese-English dictionary from scratch, not based on existing dictionaries [1]. Other volunteers were equally enthusiastic, but the immensity of the task before us, and the competing time pressures of paying work, caused us to gradually stop working on it.

[1] https://t-nex.jp/dictionaries/jekai/index.html


Did you start it before JEDict from Breen existed? It's great for English, but other Japanese-* pairs are poorly endowed. There is also focused work on neologisms that could be done with a few people and could really improve existing free solutions.

I'm personally using the route of digitizing an existing dictionary, albeit manually because that can't be automated in this case.


JEDict already existed, but it was just a glossary, i.e., Japanese words with English equivalents. But there aren’t many one-to-one correspondences in meaning between Japanese and English words, and such glossaries, while useful, can also be frustrating and misleading to users. My idea for jeKai was to create a dictionary with explanatory definitions, like those that appear in monolingual dictionaries. That turned out to be much more work than we were ready to do, though.

A paper I wrote nine years ago on related issues is here, in case you are interested:

“Kokugo Dictionaries as Tools for Learners: Problems and Potential”

https://researchmap.jp/multidatabases/multidatabase_contents...


Well, a source-language word paired with one or more translations is the minimal structure for a dictionary, and some printed dictionaries are indeed like this.

I browsed some entries of your dictionary, and indeed there are (sometimes quite elaborate) explanations. I can easily see why it ran out of steam, especially since even basic compiling is a daunting task. Also, contributors are few. On the Jibiki.fr (Japanese-French) project, most of the corrections are made by two members. They're starting from an existing dictionary, and it still took years just to check the headwords.

Thanks very much for the paper. I'll read it. I actually already had one of your papers on my machine (Asialex 2011 proceedings) but haven't read it yet.


Thank you. I tried to e-mail you at the address on your profile page to suggest that we continue this discussion privately, but Gmail replied with “Your message wasn't delivered to [that address] because the address couldn't be found, or is unable to receive mail.” If you would like to chat about these issues further, please e-mail me at the address on my paper or at my personal website.


What’s the state of Wiktionary like in your opinion?


I hadn’t used Wiktionary for a few years, so I just spent some time looking through it. It was pretty good a few years ago, and now it looks even better. I’m sure many people find it very useful. The amount of information on each page, though, might make it a bit intimidating to some users.

It also seems to have some unevenness in coverage. For example, the entry for the word “anecdata” (a word discussed recently at [1]) has five illustrative quotations, which are quite handy [2]. The entry for the more mundane “anecdote,” however, has none [3]. Such unevenness might be inevitable in volunteer dictionary projects, as volunteers like to work on the more interesting words.

[1] https://news.ycombinator.com/item?id=28375767

[2] https://en.wiktionary.org/wiki/anecdata

[3] https://en.wiktionary.org/wiki/anecdote


I use Wiktionary pretty often, and it has come in particularly useful this past week!

We're translating some strings on our software user interface, and checking the abbreviations and acronyms used. Sometimes there are amusing or [nsfw] connotations in other languages! Thank you Wiktionary for warning us about abbreviating "low pressure" as "LP" in Taiwan.

https://en.wiktionary.org/wiki/LP#Noun_2


I don't really use it often enough, either as a user or in my projects, to have a definite opinion. There are some pairs of words (about 5K) in Sino-Vietnamese that came with their chu nom writing, which was very helpful to one of my projects. Otherwise I think it lacks structure and can't easily be harvested automatically (I don't think Wikidata integrates it all, and that website is a non-starter for me). Also, every language is structured differently, so Wiktionary can hardly be commented on as a whole.


https://en.wiktionary.org/wiki/Help:FAQ:

Q: Is it possible to download Wiktionary?

A: Yes. https://dumps.wikimedia.org/enwiktionary/ should have the latest copy of the main namespace. The cleanest navigation page is https://dumps.wikimedia.org/. Just download a -articles.xml.bz2 file and some software to read it (for nix, for Windows).

Q: Can I use data from Wiktionary in my program?

A: As long as you meet the conditions of the GNU Free Documentation License or Creative Commons Attribution/Share-Alike License, certainly.

Latest dump for English is from September 1. I wouldn’t know whether it has all the data or how easy it is to parse it.
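Parsing is workable with a streaming parser, since the dump is ordinary MediaWiki export XML: `<page>` elements containing a `<title>` and a `<revision><text>`. A hedged Python sketch using stdlib `iterparse`; the namespace URI below is illustrative of the export schema, which is why the code matches on local tag names rather than full ones:

```python
import io
import xml.etree.ElementTree as ET

def local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(xml_stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump, streaming."""
    for _, elem in ET.iterparse(xml_stream):
        if local(elem.tag) == "page":
            title = text = None
            for node in elem.iter():
                if local(node.tag) == "title":
                    title = node.text
                elif local(node.tag) == "text":
                    text = node.text
            yield title, text
            elem.clear()  # keep memory bounded on multi-gigabyte dumps

sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>anecdote</title>
    <revision><text>==English== A short account of an incident.</text></revision>
  </page>
</mediawiki>"""
for title, text in iter_pages(io.StringIO(sample)):
    print(title)  # anecdote
```

For the real -articles.xml.bz2 file you would wrap the stream in `bz2.open(path, "rb")` instead of `io.StringIO`.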


> Otherwise I think it lacks structure and can't be harvested automatically easily

Indeed, it depends on the language and your goals - I had a very high success rate plucking out Russian grammatical tables from English Wiktionary with a few hours of scripting the data cleaning (https://github.com/thombles/declensions). I have a theory that you could get better results using an offline archive of the page sources but haven't tried this yet.


I have a vague recollection of reading somewhere that you're explicitly forbidden from creating a dictionary app using the OS dictionaries. You can do dictionary lookup within apps (so, for example, you're free to use the dictionary to look up definitions in your word processor or ePub app, but not to have an app which lets a user enter a word and get a definition back).


At the moment, I'm using a slight modification of the gist plus `sed 's/<[^>]*>//g'` to look up words from the shell on Linux. It would be nice to have some XML parsing into plain text, but it kind of works.
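A small step up from the sed one-liner is Python's stdlib `html.parser`, which also decodes character entities; a minimal sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content of a marked-up entry."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc. for us
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)
    def text(self):
        return "".join(self.parts)

def strip_tags(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing data
    return parser.text()

print(strip_tags("<d:entry><b>apple</b> &amp; pear</d:entry>"))  # apple & pear
```

Unlike the regex, this could also be extended per-tag, e.g. inserting newlines when a block-level element closes.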


I also found this Dictionary API which imports the dictionaries into NodeJS by utilizing a utility called "dedict".

https://github.com/nikvdp/dictionary-api/blob/master/convert...


Unfortunately it is not possible to implement a dynamic add-on (additional source) for Dictionary.app, like the built-in Wikipedia one (for example, to query urbandictionary.com); only static offline content is possible. I investigated this years ago, and I don't think anything has changed since then.


Has anyone reverse-engineered the Apple emoji dictionary that maps keywords to emojis? Last time I checked, they only shipped binaries on the newest macOS. Would love to use that mapping to improve search in my custom emoji picker.


I think it’s just the name data from the Unicode standard? (possibly the CLDR data but I might be wrong)


It's more than just the CLDR keywords (https://unicode.org/emoji/charts/emoji-list.html), since text replacement or autocomplete shows up for words not on that list.
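For anyone who wants at least the CLDR part of the mapping, the annotation files are plain XML (`<annotation cp="😀">face | grin</annotation>`, with a separate `type="tts"` entry for the emoji's name) and are easy to index. A sketch with an inline sample standing in for CLDR's en.xml:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def keyword_index(cldr_xml: str) -> dict:
    """Build a keyword -> [emoji] map from CLDR-style annotation XML."""
    index = defaultdict(list)
    root = ET.fromstring(cldr_xml)
    for ann in root.iter("annotation"):
        if ann.get("type") == "tts":  # skip the name entry, keep keywords
            continue
        for keyword in ann.text.split("|"):
            index[keyword.strip()].append(ann.get("cp"))
    return index

sample = """<ldml><annotations>
  <annotation cp="😀">face | grin | happy</annotation>
  <annotation cp="😀" type="tts">grinning face</annotation>
</annotations></ldml>"""
idx = keyword_index(sample)
print(idx["grin"])  # ['😀']
```

As noted above, Apple's autocomplete matches words beyond this list, so whatever they ship is at best a superset of the CLDR data.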


You may try assigning text replacements for emojis: https://www.iphonehacks.com/2016/04/text-replacement-mac.htm...


I'm trying to find the source data Apple uses for text replacements / suggestions.


Is this a plist?



