More

usman-m · on July 7, 2017

Curious how you guys use contexts?

marcua · on July 7, 2017

Good question! When a designer is extracting content from a customer's old website, they save all of the structured information about that customer in a single context, keyed by the customer's ID.

usman-m · on April 13, 2016

Almost. Currently continuous views that contain an ORDER BY or LIMIT clause are not supported.

usman-m · on April 13, 2016

They work better for non-sliding window queries. Triggers on sliding window queries are much more resource intensive, both for CPU and memory. Essentially, the trigger process has to keep track of tuples for each step in the window and combine them whenever any tuple is updated to get the new value.

usman-m · on Sept 29, 2015

Great stuff! For now it seems like this leaves out the job of finding crowd workers. Could it be possible to do something like designing microtasks in MTurk and "stitching" the workflow together using Orchestra?

dhaas · on Sept 29, 2015

(Orchestra dev here too!) That's a great question. We're hoping that emerging online expert communities like Upwork or Dribbble will make it easier and easier to find experts suited to the work, but it's definitely not a solved problem (or tech recruiting would be so much easier!).

Re: MTurk, there's nothing stopping you from stitching microtasks together with Orchestra, but I think that undersells Orchestra's strengths (e.g., collaboration among workers) a bit.

usman-m · on July 30, 2015

Our native implementations of all probabilistic data structures use MurmurHash3, so this isn't a problem. The dumbloom implementation is in no way a good Bloom filter, as the name suggests :)

usman-m · on July 30, 2015

It probably uses a HyperLogLog--the 2% error rate kind of gives it away. Bloom filters approximate set membership queries, HyperLogLogs approximate set cardinality queries. COUNT DISTINCT is a set cardinality query.

We actually support a HyperLogLog backed COUNT DISTINCT aggregate too: http://docs.pipelinedb.com/aggregates.html#general-aggregate...

striking · on July 30, 2015

Consider my metaphorical hat eaten. Thanks for the cool tools! I'm currently working with Postgres and this looks like a great thing to add to the mix.

anarazel · on July 31, 2015

There's a postgres extension that implements hll for postgres. Rather useful: https://github.com/aggregateknowledge/postgresql-hll

usman-m · on July 30, 2015

Oops, it was meant to be:

  SELECT <user_id> FROM (SELECT DISTINCT user_id FROM user_actions);

You're absolutely right that both those queries will give the same result. I guess I was trying to motivate the basic problem of finding whether some user exists in a set of users, and `SELECT DISTINCT` is the SQL way of representing a set.

Fixed the post, thanks!

gleb · on July 30, 2015

I'd put more effort into setting up a believable problem in these kind of posts, before presenting a solution. Much like in a company pitch, it's hard to understand the value of product if you don't understand what problem it is trying to solve.

It doesn't help that using unnecessary DISTINCTs is subqueries is a common performance problem in novice SQL. Why people do that I don't really understand, but they do.

That's the thing about probabilistic data structures - I've never seen a real-world performance problem in SQL where they would have been helpful. I really would like to have an "aha" moment where somebody shows me one.

Probabilistic data structures do seem like a natural match for streaming databases, but that's different.

usman-m · on July 30, 2015

ahachete, I'm not sure if I totally understand your question.

Continuous views are consumers for streams. You can think of them as high throughput real-time materialized views. The source of data for the stream can be practically anything. Logical decoding on the other hand is a producer of streaming data--it's basically a human readable replication log. So you could potentially stream the logically decoded log into PipelineDB and build some continuous views in front of it.

ahachete · on July 30, 2015

I was thinking of a system where data is extracted from the source database and then the (changes) data are processed real-time by a software that consumes this stream. So other than the obvious differences (need to write the software, SQL support) what would be the real advantage of using PipelineDB over a system with PostgreSQL+logical decoding+stream processing of that data?

Thanks!

usman-m · on July 30, 2015

Mostly that. I've also been thinking about how we could incorporate some machine learning algorithms, like online perceptrons.

usman-m · on July 7, 2015

Not right now, but it's definitely a feature we're thinking about.

Awesome--let us know what you think about it!

djupblue · on July 7, 2015

Will do, I love PostgreSQL so I'm pretty excited to se what the possibilities are!

Found this gem in the docs :D

Unsupported Aggregates: xmlagg ( xml )

:(