Querying Riak Just Got Easier: Introducing Secondary Indices (slideshare.net)
76 points by rbranson on July 26, 2011 | 15 comments



Awesome feature, but why these terrible URLs?

  # Query for category_bin = "armor"
  curl http://127.0.0.1:8098/buckets/loot/index/category_bin/eq/armor
  {"keys":["gauntlet24"]}

  # Query for price_int between 300 and 500
  curl http://127.0.0.1:8098/buckets/loot/index/price_int/range/300/500
  {"keys":["gauntlet24"]}
Why not use URI query parameters for the query parameters?

  /buckets/loot/index?category_bin=armor
  /buckets/loot/index?price_int=[300..500]


Here are two (minor) considerations:

1) If you want to be pedantic about HTTP, RFC 2616 states with respect to responses from URIs containing query strings that "caches MUST NOT treat responses to such URIs as fresh unless the server provides an explicit expiration time." Although this clause is broadly ignored by the vast majority of middleware, it could be argued that slashes are more correct. http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13...

2) Middleware like Squid ships with strip_query_terms on by default, so if you put Squid in front of a Riak cluster and want to see what was actually being run, you'd have to change the config file. Otherwise the request URI in the logs just reads "/buckets/loot/index?"
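
For the record, the squid.conf change is a one-liner (directive name from Squid's docs; check your version's defaults):

  # squid.conf -- log full query strings instead of stripping them;
  # by default Squid logs only "/buckets/loot/index?" for privacy reasons
  strip_query_terms off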


That is still under debate AFAIK. The latter form would allow you to compose multiple index lookups as well, which is why I'm for it (although the query planner might not support it yet).


What about

  /buckets/loot/index/category_bin,armor/price_int,300,500
Looks a lot like link walk syntax, don't it?


I don't follow how these don't always end up being distributed queries since the index key doesn't include the partition key. Where is the locality coming from in the index scanning and value retrieval?

If the key is individual armor type, there could (and usually will) be values in any price range at every partition.

Is the index partitioned separately on its own key? Is a copy of the value stored in the index or is it then retrieved separately from the index scan?


It leverages merge_index from Riak Search, which stores the index data on the same partition as the object. Queries are only performed against a subset of partitions that the "query planner" thinks will contain the object.
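
For anyone wondering how the index entries get there in the first place: as I understand the 2i HTTP API, you attach them as headers when you write the object, so they're stored alongside the object on the same partitions (header names and the example body are my sketch, not from the slides):

  # store gauntlet24 with two index entries riding along with the object
  curl -X PUT http://127.0.0.1:8098/buckets/loot/keys/gauntlet24 \
    -H "Content-Type: application/json" \
    -H "x-riak-index-category_bin: armor" \
    -H "x-riak-index-price_int: 400" \
    -d '{"name": "gauntlet"}'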


If the index key does not start with the partition key, then won't that end up being everything in the majority of cases?

In the examples given (price, license plate) there is no locality between the partition keys (armor id, person) and the index key. A query for all armor priced between 200 and 400 would end up touching every partition that contains such armor. Unless the set of armor is small, you will end up needing to scan every partition.


Currently, the entire keyspace is queried, but querying the entire keyspace does not require touching every partition. Only a covering subset, which is influenced by your N-value (number of replicas), needs to be queried, because the index is replicated alongside your k/v data.

For example, in a 4-partition ring with N=2, keys mapping to p1 are replicated on p1,p2; p2 on p2,p3; p3 on p3,p4; and p4 on p4,p1. As such, you only need to query p1,p3 or p2,p4 to cover the entire keyspace.
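
If it helps, here's a toy sketch in Python (definitely not Riak's actual coverage planner) of why querying every Nth partition is enough when replicas live on N consecutive partitions:

  # Toy model: primary partition p is replicated on p, p+1, ..., p+n-1,
  # so querying partition q sees keys whose primaries are q, q-1, ..., q-n+1.
  # Picking every Nth partition therefore covers the whole ring.
  def covering_partitions(ring_size, n):
      chosen = list(range(0, ring_size, n))
      covered = set()
      for q in chosen:
          covered.update((q - i) % ring_size for i in range(n))
      assert covered == set(range(ring_size))
      return chosen

  covering_partitions(4, 2)  # -> [0, 2], i.e. p1 and p3 in the example above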

In general, approximately RingSize / N partitions need to be queried. The new smart coverage code figures this out as well as deals with routing around failed nodes and other issues.

EDIT: Since the replicas value (N) is settable per bucket in Riak, there are some interesting extreme cases you could envision here. For example, you could have a bucket where N = RingSize, in which case the index is replicated to every node and you only need to query a single partition to look up values. Of course, you then lose the ability to perform multiple queries in parallel across a more partitioned/distributed index space (which would be more useful for large result sets). As with database systems in general, the best configuration here depends on the data and use case.


I assume queries are done over R=1 consistency then? Is W=N the only way to keep writes consistent with these indexes at all times?


As far as I know, R=1. Rusty is likely the best to comment on this and things may change before/after release, but currently there is no way to specify R for index lookups, and only the minimal set of replicas is queried.

Technically, when you perform a write, Riak will always dispatch to N replicas. W simply requires Riak to confirm W writes before responding to the client. So W=N allows you to know N index sets have been updated, but it's not strictly necessary. At the end of the day, indexes are eventually consistent like the rest of Riak.


Right. I'm just saying that if I write and I want to assume a query after that write will include it, I will need W=N since R=1. Which is fine... but tricky. W=2,R=2,N=3 has been my favorite combination but I guess there are always cases to try other setups.
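
Concretely, the read-your-own-index-writes setup being discussed would look something like this (the bucket props URL and the w query parameter are as I recall the HTTP API, so treat this as a sketch):

  # N=3 for the bucket, then write with w=3 so all replicas (and their
  # index entries) are confirmed before the next R=1 index query
  curl -X PUT http://127.0.0.1:8098/buckets/loot/props \
    -H "Content-Type: application/json" \
    -d '{"props": {"n_val": 3}}'

  curl -X PUT "http://127.0.0.1:8098/buckets/loot/keys/gauntlet24?w=3" \
    -H "Content-Type: application/json" \
    -H "x-riak-index-price_int: 400" \
    -d '{"name": "gauntlet"}'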


I get it, very cool.


This presentation is particularly interesting because I think it really touches on why secondary indices should be implemented at the datastore level and are impractical to simply graft onto a distributed K/V store.


The candor about the trade-offs involved in picking one database vs. another is definitely appreciated and is a tone the community needs to take on the subject in general.


http://www.dotkam.com/2011/07/06/noram-db-if-it-does-not-fit...

I agree: PR and lots of BUZZ make it really difficult to choose the right DB.

My latest obsession is Riak => Basho is VERY honest, and quick to help you out and tell you where and how Riak can help, and most importantly what would NOT be a good fit for Riak. (I am not working for them :)

/Anatoly



