The algorithm works as follows: Read 5 characters. Is that in the dictionary? No. What about the first 4? No. 3? No. 2? Yes! Copy to the output string, add a space, slide along 2 characters. Repeat.
Searching a 5 MB dictionary repeatedly like this was too slow. I made it a little faster by splitting the dictionary based on the number of characters in the word (fiveWords, fourWords, threeWords, twoWords). It's still slow, though. I have to run it in local JavaScript, because I don't have a server. Any suggestions?
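For concreteness, the inner loop is roughly this (a simplified sketch, with the dictionary assumed to be an array of words; putting each length bucket into a Set instead of an array makes every membership test O(1), which may be most of the win):

    // Greedy longest-match loop, as described above.
    // Assumes `words` is the dictionary as an array of strings.
    const byLength = new Map();                  // word length -> Set of words
    for (const w of words) {
      if (!byLength.has(w.length)) byLength.set(w.length, new Set());
      byLength.get(w.length).add(w);             // Set.has() is O(1); scanning an array is O(n)
    }

    function segment(text) {
      let out = "";
      for (let i = 0; i < text.length; ) {
        let len = Math.min(5, text.length - i);
        // Try 5 characters, then 4, 3, 2 (falling through to 1 when nothing matches).
        while (len > 1 && !byLength.get(len)?.has(text.slice(i, i + len))) len--;
        out += text.slice(i, i + len) + " ";     // copy to the output string, add a space
        i += len;                                // slide along by the match length
      }
      return out;
    }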
At the other extreme, the iTunes Store's EPF data (freely available!) is 55 GB of text. I'd like to sort out Artist names by Producer. Grepping the file takes 30 minutes. Is there a faster way?
Third, my algorithm for offline StackOverflow was optimised for disk space. I search a 200 MB plain-text index of titles to get the post ID. Then I use dd to pull a 4 KB chunk out of the tar.bz2 archive, which I can read with bzip2recover. Then I check whether the post ID is in that chunk, and binary-search like that. It's slow (about 5 seconds to load a post), but it doesn't waste my precious disk space, and I can be patient when I need offline StackOverflow.
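(The dd step is just a byte-range read, so it can also be done in-process. A minimal Node sketch, with the byte offset assumed to come from the binary search; decompressing the recovered block is still bzip2recover's job:)

    // Read a 4 KB chunk at a given byte offset:
    // the in-process equivalent of `dd bs=4096 skip=N count=1` (offset = N * 4096).
    const fs = require("fs");

    function readChunk(path, offset, size = 4096) {
      const fd = fs.openSync(path, "r");
      const buf = Buffer.alloc(size);
      const n = fs.readSync(fd, buf, 0, size, offset);  // read `size` bytes starting at `offset`
      fs.closeSync(fd);
      return buf.subarray(0, n);
    }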
I would put your dictionary into a prefix tree of some sort, and the first implementation I would reach for is a ternary search tree [1]. As you scan your input, walk the tree. Each time you get to the end of the tree, insert a space and go back to the root. Once the tree is built, segmentation is O(length of input) [2], regardless of dictionary size.
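To make that concrete, here's a rough sketch in plain JavaScript, using an object-based trie rather than a ternary search tree (the walk-and-reset idea is the same):

    // Rough sketch: object trie + walk-and-reset segmentation.
    function buildTrie(words) {
      const root = Object.create(null);          // null prototype avoids key collisions
      for (const w of words) {
        let node = root;
        for (const ch of w) node = node[ch] ??= Object.create(null);
      }
      return root;
    }

    function segment(text, root) {
      let out = "", node = root, start = 0, i = 0;
      while (i < text.length) {
        const next = node[text[i]];
        if (next) {                              // keep walking down the tree
          node = next;
          i++;
        } else if (node !== root) {              // end of the tree: insert a space, reset
          out += text.slice(start, i) + " ";
          node = root;
          start = i;
        } else {                                 // character not in the dictionary at all
          out += text[i] + " ";
          i++;
          start = i;
        }
      }
      return out + text.slice(start);
    }

One thing this glosses over: it splits wherever the walk runs out, so a word that is a prefix of a longer word can swallow the following characters; a fuller version would mark word endings in the trie and backtrack to the last complete word.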
As for offline StackOverflow, you should use a database engine like SQLite or Redis to manage the indexing into smaller blocks of compressed post data on disk, or store the posts directly in the database and keep the database file on a filesystem like Btrfs that supports transparent compression.
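A minimal sketch of the SQLite route in Node, assuming the better-sqlite3 package (the schema and file names are made up for illustration):

    // Posts stored directly in SQLite, looked up by ID.
    // Assumes `npm install better-sqlite3`.
    const Database = require("better-sqlite3");
    const db = new Database("stackoverflow.db");

    db.exec(`CREATE TABLE IF NOT EXISTS posts (
      id    INTEGER PRIMARY KEY,  -- the primary key IS the index; no hand-rolled binary search
      title TEXT,
      body  BLOB                  -- optionally compressed per post, e.g. with gzip
    )`);

    const getPost = db.prepare("SELECT title, body FROM posts WHERE id = ?");
    console.log(getPost.get(4));  // B-tree lookup inside one file; milliseconds, not seconds

You keep the single-file, space-friendly property, and the ID lookup becomes a B-tree search instead of dd arithmetic.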
https://pingtype.github.io