Hacker News new | past | comments | ask | show | jobs | submit login
What are the lesser known but useful data structures? (stackoverflow.com)
347 points by mck- on Jan 18, 2014 | hide | past | favorite | 75 comments



There's a whole set of interesting data structures that are not very well known: succinct data structures[1]. The idea is simple: we want to store data in a compressed form, but also perform certain operations quickly without uncompressing.

These can be very useful for certain applications. The article on "Cramming 80,000 Words into a JavaScript File"[2] is a nice example. It shows you how you can store a compressed trie in memory but still use it. I also like this[3] series of blog posts leading up to wavelet trees.

These certainly count as obscure data structures, unlike many of the ones listed on SO. I had never even considered the idea of compressing data in memory like this, much less encountered actual examples of succinct data structures! I have to thank Edward Kmett for introducing me to the whole field.

These data structures are important not just because they're neat themselves, but because they got me to think a new way. In particular, I realized that using pointers all over the place--to represent things like trees--is not always efficient. Instead of parsing data, it might be better to store it as a blob of some sort with a binary index. Just starting to consider details like that is valuable all on its own.

[1]: http://en.wikipedia.org/wiki/Succinct_data_structure

[2]: http://stevehanov.ca/blog/index.php/?id=120

[3]: http://alexbowe.com/rrr/ and http://alexbowe.com/wavelet-trees/


The problem with succinct data structures (and wavelet trees especially, but also rank-select dictionaries) is that while they have excellent performance characteristics when you look at them from a theoretical perspective, they tend to perform relatively poorly in practice.

This seems to be due to how much memory is accessed when checking a single bit, and the difficulty in predicting branches.

Of course it could just be that all my implementations have sucked, but even in playing around with libcds[1] didn't yield the kind of performance expected.

If anyone knows of a fast implementation of Wavelet Trees, I'd love to see it.

[1] https://github.com/fclaude/libcds


I'm one of the authors of a more complete c++11 succinct data structure library [1] which includes several engineered implementations of wavelet trees, compressed suffix arrays and compressed suffix trees. while you are right that random access is one of the main bottlenecks of succinct data structures, we found that sometimes SDS can be faster than their uncompressed counterparts [2]

The github URL contains some additional resources such as tutorial slides and cheatsheets if you want to check it out.

[1] https://github.com/simongog/sdsl-lite/ [2] http://people.eng.unimelb.edu.au/sgog/optimized.pdf


> Instead of parsing data, it might be better to store it as a blob of some sort with a binary index.

This is exactly something I did for JSON, I call it semi-indexing: instead of parsing it into a tree of pointers, I create a succinct representation of the parsing tree, which is orders of magnitude smaller than the original JSON. Construction is much faster than parsing because there are basically no memory allocations, and access is not that much slower.

About performance of succinct data structures in general, it is true that they have shown poor practical performance, but things are changing, both because we have better CPUs (while memory latency is pretty much unchanged), and better algorithms are being found. I did my Ph.D. on practical succinct data structures. We found that in some applications, the access times are competitive or faster than the non-succinct counterparts, while the space is much smaller. One example is tries: in my thesis [2] there are experiments for string dictionaries, and for query autocompletion (for example for search engines).

Another area where (quasi-)succinct data structures are having some success is inverted indexes: recently proposed posting lists based on Elias-Fano [3] have been shown to outperform standard delta-encoded posting lists for queries with sparse intersection, and are used in Facebook's graph search [4].

Finally, the biggest success story of SDS has historically been molecular biology, because the size of the DNA sequences processed is so large that non-succinct data structures are impractical. Many sequence assemblers/aligners use variants of FM-indexes and Compressed Suffix Arrays, that are self-indexes based on the Burrows-Wheeler Transform and Wavelet Trees.

[1] https://github.com/ot/semi_index

[2] http://www.di.unipi.it/~ottavian/files/phd_thesis.pdf

[3] http://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pdf

[4] http://www.vldb.org/pvldb/vol6/p1150-curtiss.pdf


IMHO, the FM-index deserves to be highlighted. That it is possible to store a string in a compressed format which can answer length-P substring queries in O(P) time (with good constant factors) is quite surprising at first sight.

I recently wrote a few words about this here: http://ocfnash.wordpress.com/2014/01/03/dna-of-a-password-di... By the time I finished I decided the whole area was exciting and seems not at all as well known as it should be. I also think there are plenty of related, as-yet-undiscovered ideas, waiting for somebody to find them!


A nice application for the FM-index: it has massively accelerated alignment of short DNA sequences against large databases.


Seriously, speaking about interesting data structures, Burrows-Wheeler Transform and its reverse are borderline "magic". Definitively worth understanding them in detail.

Speaking of which: What are efficient ways to insert data into BWT structures without reconstructing the whole thingy or to merge two BWT structures? Any hints?


> This question exists because it has historical significance, but it is not considered a good, on-topic question for this site, so please do not use it as evidence that you can ask similar questions here.

Yet it's one of the best questions on SO. Something's very wrong with SO if this isn't considered a good, on-topic question for a programming Q&A site.


How is that one of the best questions on SO? There is always someone who wants SO to be what they want it to be. I'm not saying everything is perfect in SO but I think the standard answer is that if you want a site that is about subjective discussions related to programming you should make one...


How is it not one of the better questions on SO? It's educational, it's on-topic, it's more cerebral than the usual questions about jQuery, it has high quality answers, the question itself is as clear as day, it's interesting, and it's very popular in terms of views and upvotes.

There's a degree of opinion in most answers anyway especially as there's often multiple ways to do the same thing and every answerer will have a preference of some kind. I don't necessarily see this one as a particularly subjective discussion anyway, or why you think I'm trying to make SO be something it's not. In any event, why are marginally subjective discussions are a horrible thing anyway in a Q&A site?


How can anyone answer "really useful but are unknown to most programmers"? They are so useful we don't know about them? Aren't the answers, by definition, known, to a lot of programmers? Useful for what exactly?

I think I heard this criticism of SO for about 1 million times and this is the 1 million + 1 that tripped my fuse, so apologies for that. This question, and the answers, though IMO are bits of trivia and opinions more suited to a blog post.

Does it have some marginal usefulness in exposing some people without background to some random data structures or algorithms- maybe. The question is still there, it wasn't deleted, so SO is acting as the resource that you'd like it to be. If it's your first exposure to data structures I think there are better options out there (text books, online courses). If you are looking to solve some particular problem this is probably not the best resource. So it just stands as a (maybe interesting) bit of computer science trivia.

There are endless subjective topics that are tangentially related to programming and may be fun/amusing/interesting but SO is about things that have answers, not about discussions. I like it that when I have a question I can search and get a good answer for it, not just someone's opinion. For entertainment value and interesting things I go to other places, such as HN...


> The question is still there, it wasn't deleted, so SO is acting as the resource that you'd like it to be.

Try asking such a question today, or any time in the last couple of years!

> So it just stands as a (maybe interesting) bit of computer science trivia.

Just because you label it trivia doesn't necessarily mean it wasn't important to the asker or the thousands of people that read, commented, and upvoted. Just like people asking about jQuery, the asker had a question in mind and wanted to crowdsource the answer.

> but SO is about things that have answers

An opinion can be an answer too.

> For entertainment value and interesting things I go to other places, such as HN...

Why must the answers on SO be bland as tofu? You seem to give the impression that learning and the SO answers must be plain and dull, and any hint of entertainment must be quickly quashed. I see nothing wrong with being entertained and educated at the same time, especially as things that are entertaining tend to be more readily recalled.


> Why must the answers on SO be bland as tofu?

Why do we have vegetarian restaurants when the majority of the world wants meat?


Why do meat eaters have such a desire to disparage vegetarians in completely irrelevant contexts?


I didn't disparage vegetarians, and not even their diets. Tofu on its own is bland and that's really hard to refute. Tofu with other stuff is delicious.


The problem is that SO is not for entertainment. It's not even really for education. They (the site's operators) want it to be an injective mapping of google queries to solutions. You don't need to become a better programmer. You need to keep coming back to SO to ask more questions. On the other side they need to supply a steady stream of middling difficulty questions, because everyone isn't John Skeet.

Nobody is going to be working on a project and say "My client requires the use of obscure data structures, better Google some!". You can bash the multitude of jQuery questions, but having worked in a variety of settings, some people need things spelled out for them. You need a lot of simple questions with alternate wordings, different permutations of libraries, etc. Even though you can generalize these answers, they're hard to find from Google, and a lot of people will read "Undefined variable foo? Nonsense, my undefined variable is baz. Better open a new ticket".


It is a good question, but it would have been better suited for the "CS Theory" StackExchange site. For example, a similar question was asked there:

http://cstheory.stackexchange.com/questions/1539/whats-new-i...


SO has critical mass which would be difficult to replicate in a new site; remember that it was started by Jeff Atwood and Joel Spolsky, who were already established bloggers. "Make your own SO for subjective discussions" would be a good suggestion for someone who already has a large online following, but not for a general programmer.

I have yet to see a good argument why these types of questions detract from the main site and asked on a separate site (http://programmers.stackexchange.com/). If the problem is cluttering up search results with opinions, then they could be marked as "subjective" and people could be given the option to avoid these results.


Maybe you could get Joel to refer people from SO to your site?

I guess my problem is that Joel and the others running the site don't want it to be a forum for those discussions. This topic has been beaten to death already and the outcome isn't going to change. So "we" need to let go of it. It's like Go and generics... (just kidding)

I can think of a few reasonable arguments. A lot of subjective discussions tend to degenerate into holy wars or popularity contests. It's hard to tell what's a good answer and what's not and it can get annoying both for people participating and people reading.

Even if you mark those as subjective they would clutter the front page. Whenever you give people the option you need to consider defaults, if the default is not to see subjective questions then no one will ever see them, if the default is to see them we're back to square 1. What about search engines?

EDIT: It also so happens that some of the "good" subjective questions do seem to hang around, get attention, and get cool answers, so the system does seem to be working.


> Even if you mark those as subjective they would clutter the front page.

Really simple solution: put a checkbox on the right: "[ ] Hide posts marked as subjective" Make it bold and obvious, and far above the fold.


We tried a variation on that fairly early on. It led to endless arguments over how subjective something had to be to be branded subjective. See: http://meta.stackoverflow.com/questions/51627/should-we-perm...

In the end, it was a huge distraction from what we should've been talking about instead: which questions are actually resulting in the creation of something useful?


Even though they are included in the GNU C library, most people do not seem to know about Obstacks:

An "obstack" is a pool of memory containing a stack of objects. You can create any number of separate obstacks, and then allocate objects in specified obstacks. Within each obstack, the last object allocated must always be the first one freed, but distinct obstacks are independent of each other.

Aside from this one constraint of order of freeing, obstacks are totally general: an obstack can contain any number of objects of any size. They are implemented with macros, so allocation is usually very fast as long as the objects are usually small. And the only space overhead per object is the padding needed to start each object on a suitable boundary.

https://www.gnu.org/software/libc/manual/html_node/Obstacks....

Sure, they’re not very interesting, but the point is that you get them for free in the GNU C standard library.


The FIFO equivalent can also be useful.


I’m not sure what you are referring to, do you have a link?


I don't have a link, but it's an obvious transformation - basically, using a ring buffer for allocations where I can guarantee that older things will be freed before newer stuff.


It's not highlighting one thing, but Chris Okasaki's book on Purely Functional Data Structures, and this brilliant top answer to a question about functional data structures published since the book will keep you in reading material for a while: http://cstheory.stackexchange.com/a/1550

(it was all 'lesser known' to me when I started using haskell not so long ago)


I second the plug for Purely Functional Data Structures - brilliant stuff!


I use a Hierarchical Triangular Mesh for indexing gamma ray events from the universe. The data is partitioned in the database according to it's HTM id.

http://arxiv.org/pdf/cs/0701164.pdf

Currently I use this for indexing ~11 billion gamma ray events. Researchers typically supply a region in the sky, a search radius, and some cuts (energy, event quality, etc...)


HTM is generally good. We used it to index white light data from out satellite. It makes for a very flexible way to store spherical surface data without losing context or information when projecting.


Looking at these lists, I strongly suspect that people upvote based on whether they personally recognize the data structure.

It goes against the intent of the original question, but iIt's almost ideally designed to make you feel good--you get the rush of knowledge then nerd sniped as you head to wikipedia.


Lest people think this is useless negativity, the point is that you need to scroll down and look at the other pages to see the really interesting ones. :)


I love how the question was locked because "it is not considered a good, on-topic question for this site". It's crazy. Unless an extremely specific concrete answer can be given, a question immediately gets killed. SO has turned into such a turd of a website.


> Unless an extremely specific concrete answer can be given, a question immediately gets killed

Isn't that pretty much SO's explicit goal from day one? Complaining about SO not being a site for open discussion is like complaining that HN doesn't cover celebrity gossip.


Well it's not like that. If you tried submitting celebrity gossip here you wouldn't get much traction to put it mildly. On the other hand a lot of blocked questions on SO were very popular, people produced very valuable and detailed answers to them and they (and the questions) got a lot of upvotes. It's clearly a policy of blocking content the community is very interested in. Better analogy would be blocking politics at HN.


Not only that but SO in general is a rather bad source for programming advice - at least for beginners.

What you generally see on SO happening:

- Statements are being marked as "correct" even though they did not answer the question. It happens all the time that people as a Cocoa/OS X related question and get a iOS/UIKit answer which is then marked as correct. - Wrong answers are marked as correct all the time. - There are a couple of people who reformat my answers on SO all the time. They add emphasis, reformat my code/paragraphs etc. I don't know why they are doing it - probably to get a couple of points easily. This is so annoying. - It seems to me that collective intelligence sometimes does not produce the best answers... My friend and myself often begin our days by telling us about funny "answers" on SO, how we tried to argue but then got "overruled" by the majority even though the majority is wrong.

This is probably not the right place to rant about SO...


> They add emphasis, reformat my code/paragraphs etc.

That's the reason I left SO while still in my noob phase - they'd make really strange changes, such as adding typos or removing paragraph breaks. And it's obvious why, too: yes, they get points for it and their name is now right next to the answer. I also found they had generally insane amounts of karma. Yeah, I really don't get SO, for lots of reasons.


No, you don't get any points for making edits unless you're under 2000, and even then it has to be robo-approved blindly by three other users.


No matter why some users do this: It sucks. I mean I am the one giving the answer - I am the expert. It should be the expert who decides which term/sentence should be emphasised and which don't. I have nothing against correcting my bad english but when it comes to the content or formatting I don't want to be messed with it unless you want me to leave.


You forget your <oblig> tag. Its pretty obvious that the success of SO as a Q/A site is thanks to all the mechanisms in place to make it not another shitty forum. And its actual function as a Q/A site has far more value than a handful of cute lists.


> SO has turned into such a turd of a website.

Yeah, not imposing any kind of quality on questions definitely gave Quora a distinct unturdlike quality.

"What are some things that make you really sad?" - real world example from this week's Quora Weekly Digest.

http://www.quora.com/Sadness/What-are-some-things-that-make-...


Just a note: Quora tailors the digests on a per-user basis. I'm not certain what the algorithms involved are, but I'd be willing to bet that they use information about what you and your friends spend time reading about in constructing it.

For example, my most recent weekly digest does not have "What are some things that make you really sad?", but does have "What are your favourite quotes from Harry Potter?" -- it often seems to include Harry Potter. Maybe not a very great question, but it does fit my social network's reading interests pretty snugly.


Being required to register an account with Quora to view that makes me sad :(



Speaking of which, is there a good alternative site for questions that are forbidden on SO?


There are other sites in the StackExchange network like Programmers and Code Review and some others for math and other things. But I'm not sure which site (SE or otherwise) would be the best for this particular topic.


Probably the CS stackexchange (http://cs.stackexchange.com/), or CStheory (http://cs.stackexchange.com/). Unfortunately, neither are anywhere near as popular as StackOverflow, so the chances of getting a good discussion are lower.


Most times this comes up on HN, the SE team points to http://www.slant.co/


Perhaps Quora?


Yahoo answers.


I was shooting for 'brevity is the soul of wit' but fear I might have hit 'sarcasm is the lowest form of humour' and completely failed to get my point across, to wit, stackoverflow is designed for productive, informative, objective q&a, and alternatives that are less prescriptive, such as Yahoo answers, often end up as a source of unending crap.


I'm happy to see finger trees got mentioned. Finger trees[0] are extremely useful and general data structure that can be used to implement persistent sequences, priority queues, search trees and priority search queues. (Haskell's Data.Sequence[1] uses specialized 2-3 finger trees internally) They can form the basis of all sorts of interesting custom structures by supplying the appropriate monoid[3], but this does make them harder to approach if you are not familiar with the abstractions.

[3] A monoid is any structure that has members that can combine associatively. In addition, it must have an element that can combine with any other element and result in the other element. Some examples: (strings, string concatenation, the empty string); (integers, addition, 0); (natural numbers, max, 0); (booleans, and, True); (functions, composition, the identity function). The functional pearl[2] that describes the design of Haskell's diagrams library[4] goes into much more detail if you are interested in their application to programming.

[0] http://apfelmus.nfshost.com/articles/monoid-fingertree.html

[1] http://hackage.haskell.org/package/containers-0.5.4.0/docs/D...

[2] http://www.cis.upenn.edu/~byorgey/pub/monoid-pearl.pdf

[4] http://projects.haskell.org/diagrams/


"In addition, it must have an element that can combine with any other element and result in the other element."

Aka, an identity element.

Monoids that have inverses (that is, every element has another element that, when combined, produces the identity element) are "groups". Of the examples you gave:

"Strings over composition" does not form a group - there's nothing you can concatenate with a non-empty string to get an empty string.

"Integers over addition" does form a group. The inverse of x is -x.

"Natural numbers over max" is not a group. Once you get above 0 you cannot get back to it just by applying max.

"Booleans over and" is not a group. Once you have false you can't get back to true.

"Functions over composition" is not a group. Many functions have inverses, but some do not. If you restrict the set to "functions with inverses" then you do have a group.


Exactly. I was trying to avoid any mathematical jargon while describing monoids because they are such general and pervasive structures in programming. Further, being able to construct useful new monoids is crucial for making full use of a finger tree and I didn't want to have someone find an interesting structure unapproachable because they had to walk through an abstract algebra jargon storm in order to understand it. I realize I referenced associativity without describing it, so I didn't quite achieve my goal, but hopefully the examples got the idea across to someone who might otherwise have dismissed it due to the perception that it was too academic and unapproachable.

As an aside, trying to remove the jargon from a definition like this is surprisingly hard and really expands the size of the definition. I think it is still worthwhile to try when writing for a general audience because algebraic structure is everywhere and very useful to recognize for people who are interested in rigorously manipulating abstraction, i.e. programmers.


I agree; my "aka" wasn't meant as criticism. It's easy to get lost in the longer definitions, so I was just restating it directly in the hopes that one or the other - or the combination - will be clear to most people.

Though I'm not sure "abstract algebra jargon storm" properly describes use of "associative" and "identity" - I remember learning about the "associative property of multiplication" and such in elementary school and my wife confirms she had similar experience.



I don't really consider tries and bloom filters all that poorly known. These commonly come up in interviews for fresh graduates at Google/Facebook. Zippers, skip lists, ropes, round-robin databases, etc. are more genuinely not known I think.


XOR linked list: http://en.wikipedia.org/wiki/XOR_linked_list

It's a double-linked list with just one link per node. However, to start traversing it you have to know at least two adjacent nodes.

PS. May not be useful per se, but interesting nonetheless.


The XOR swap is interesting, too. But I wouldn't want to see one in code someone wrote today, and which wasn't targeted at an embedded platform -- and a heavily constrained embedded platform, at that. (MSP430? Sure. ARM Cortex-M3? You've got enough space to do it properly.)


It's not just "you've got enough space" - it's only relevant if both operands are in registers already. If they're both in memory, then the xors are entirely redundant (first things you'd have to do is read values into two registers, at which point you can just write it back the other way).

I wouldn't object to seeing it in a critical section in high performance code with substantial register pressure, but ideally it could be inserted by the compiler!


I was going to say Trie but it was the first response to the SO thread so I guess it wasn't as little known as I thought.

I implemented it because I was making a game that had scrabble elements in it and needed to check ahead to see if the player had a word that could still take letters (a prefix) or if they'd hit a dead end. Fit the whole SOWPODS into a remarkably tiny space with millisecond lookups. Probably my favourite part of the project.


I found the following page in the Concatenative wiki particularly interesting: http://concatenative.org/wiki/view/Exotic%20Data%20Structure... (Note that the page itself is not related to the concatenative languages.)



Extendible hashing is amazing in space utilization while retaining the performance of hashing.


Does anyone know of an implementation of rope in golang? Something a little more feature complete than https://github.com/christianvozar/rope.


Let me just drop this video that is on my watchlist: http://www.youtube.com/watch?v=-sEdiFMntMA&feature=share&lis... (Erik Dermaine is the lecturer)


I recently learned about the spatial index tree family in connection with data mining. I hope to implement a data-mining centric X tree (n-dimensional) solution for a data analytics package I'm writing soon. That family is is how you efficiently handle KNN lookups, afaict.


Me and my friend were pretty serious about creating a new data structure called "drum". A drum is a one way store. You write to it but can't read from it. We put it off till we figured a practical use.


/dev/null is a nice implementation.


An append only log? The value of writing data is that you can read it, otherwise your structure is semantically equivalent to not writing at all.


Or reading some information computed from it, I suppose. Something needs to read it, ever, for that of course, but it doesn't necessarily need to be exposed in the interface.


Backups? People tend to treat them that way anyways.


imho most DS & Algo are like kihon - the basic principles t'werk just fine, it's about augments and applications.


Cup-a-bits


It's mentioned in the link, but circular/ring buffers.

I've been grappling with decoding/playing back an audio stream and wouldn't have gotten it working if I hadn't found out about boosts lockfree ring buffer.


Recently I learned that if you're using a ring buffer for message passing, the clflush opcode (after sending and after receiving) can dramatically reduce cache misses.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: