With CDNs, there are definitely privacy (and GDPR) concerns, since the complete traffic often passes through the CDN. In order to cache, rewrite and optimize responses, SSL connections are typically terminated in the CDN, which implies that the CDN provider processes and potentially sees sensitive user data in plain text. Due to the distributed nature of CDNs with many edge locations, one can never be really sure about the exact legal circumstances when user data passes through CDN nodes in different countries. And even though CDN providers won't leak user data on purpose, Cloudbleed [1] has shown that this can happen by accident and at massive scale.
This is one of the reasons why at Baqend we opted for a different approach: by using Service Workers, we can make sure that as a service provider we only ever see public data. Personally identifiable information like cookies and sensitive session data does not leave the user's browser. If you think about it, that makes a lot of sense, since the public data is really what makes a site fast or slow. User-specific data (e.g., a profile, payment data, etc.) is hardly ever the cause of performance problems. Nonetheless, we do use CDNs for public data, as they are an indispensable tool for achieving low latency.
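To make that concrete, here is a minimal Service Worker sketch of the idea, assuming public content is served via credential-free GET requests; the cache name and the heuristics are made up for illustration and are not Baqend's actual worker:

// Hypothetical Service Worker: only public, credential-free GET requests are
// served through the shared cache layer; anything personalized bypasses it.
self.addEventListener('fetch', (event: any) => {
  const request: Request = event.request;
  const isPublic =
    request.method === 'GET' &&
    request.credentials === 'omit' &&         // no cookies attached
    !request.headers.has('Authorization');    // no session tokens
  if (!isPublic) return; // fall through to the network untouched

  event.respondWith(
    caches.open('public-content').then(async (cache) => {
      const cached = await cache.match(request);
      if (cached) return cached;               // serve public data from the local cache
      const response = await fetch(request);   // otherwise fetch (possibly via a CDN)
      if (response.ok) cache.put(request, response.clone());
      return response;
    })
  );
});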
"What if you could have a fully managed database service that's consistent, scales horizontally across data centers and speaks SQL?"
Looks like Google forgot to mention one central requirement: latency.
This is a hosted version of Spanner and F1. Since both systems are published, we know a lot about their trade-offs:
Spanner (see the OSDI'12 and TODS'13 papers) evolved from the observation that Megastore's guarantees - though useful - come at a performance penalty that is prohibitive for some applications. Spanner is a multi-version database system that, unlike Megastore (the system behind the Google Cloud Datastore), provides general-purpose transactions. The authors argue: "We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions." Spanner automatically groups data into partitions (tablets) that are synchronously replicated across sites via Paxos and stored in Colossus, the successor of the Google File System (GFS). Transactions in Spanner are based on two-phase locking (2PL) and two-phase commit (2PC) executed over the leaders of each partition involved in the transaction. In order for transactions to be serialized according to their global commit times, Spanner introduces TrueTime, an API for high-precision timestamps with uncertainty bounds based on atomic clocks and GPS. Each transaction is assigned a commit timestamp from TrueTime, and using the uncertainty bounds, the leader can wait until the transaction is guaranteed to be visible at all sites before releasing locks. This also enables efficient read-only transactions that can read a consistent snapshot for a given timestamp across all data centers without any locking.
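As a rough illustration of the commit-wait rule (simplified from the paper; the uncertainty bound and function names here are made up, not the real API):

// TrueTime-style interval: real time is guaranteed to lie within [earliest, latest].
const EPSILON_MS = 7; // assumed clock uncertainty; the paper reports single-digit ms on average

function ttNow() {
  const now = Date.now();
  return { earliest: now - EPSILON_MS, latest: now + EPSILON_MS };
}

// Commit wait: pick a timestamp s >= the real commit time, then block until s is
// guaranteed to be in the past everywhere before releasing locks.
async function commitWait(applyWrites: () => Promise<void>): Promise<number> {
  const s = ttNow().latest;           // commit timestamp taken from TrueTime
  await applyWrites();                // writes replicated via Paxos (not shown)
  while (ttNow().earliest <= s) {
    await new Promise((res) => setTimeout(res, 1));
  }
  return s;                           // now safe to release locks and acknowledge the commit
}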
F1 (see the VLDB'13 paper) builds on Spanner to support SQL-based access for Google's advertising business. To this end, F1 introduces a hierarchical schema based on Protobuf, a rich data encoding format similar to Avro and Thrift. To support both OLTP and OLAP queries, it uses Spanner's abstractions to provide consistent indexing. A lazy protocol for schema changes allows non-blocking schema evolution. Besides pessimistic Spanner transactions, F1 supports optimistic transactions. Each row bears a version timestamp that is used at commit time, in a short-lived pessimistic transaction, to validate the transaction's read set. Optimistic transactions in F1 suffer from the abort rate problem of optimistic concurrency control, as the read phase is latency-bound and the commit requires slow, distributed Spanner transactions, which increases the vulnerability window for potential conflicts.
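A hedged sketch of that validation step (names and types are illustrative, not F1's actual interfaces):

// Commit-time check: re-read each row's version inside a short pessimistic
// transaction and abort if anything changed since the lock-free read phase.
type Version = number;
interface ReadSetEntry { key: string; versionSeen: Version; }

async function commitOptimistic(
  readSet: ReadSetEntry[],
  currentVersion: (key: string) => Promise<Version>,
  applyWrites: () => Promise<void>
): Promise<boolean> {
  for (const { key, versionSeen } of readSet) {
    if ((await currentVersion(key)) !== versionSeen) {
      return false;    // conflicting write in the vulnerability window: abort
    }
  }
  await applyWrites(); // validation passed: apply the buffered writes
  return true;
}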
While Spanner and F1 are highly influential system designs, they do come at a cost that Google does not mention in its marketing: high latency. Consistent geo-replication is expensive even for single operations, and both optimistic and pessimistic transactions increase these latencies further.
It will be very interesting to see the first benchmarks. My guess is that operation latencies will be on the order of 80-120 ms and therefore much slower than what can be achieved on database clusters distributed only over local replicas.
> they do come at a cost Google does not tell in its marketing: high latency
Poor latency is fundamental to a CP system; it's kind of a given. It would be nice if they explained it to users, though: why achieving consensus is necessary for global consistency, and why it is impossible to do that without waiting.
> So here are some facts and trivia that are not so well-known or published that I collected by talking to Parse engineers that now work at Facebook. As I am unsure about whether they were allowed to share this information, I will not mention them by name.
It's lessons learned both by Parse engineers and by us, so I think the intended ambiguity is okay.
Do you have a source for that? It is what the Parse engineers told me, but that kind of information is notoriously hard to corroborate. Whenever I asked MongoDB engineers this question, they evaded giving concrete numbers for deployment sizes.
I can assure you that you are incorrect. I am the primary source and very aware of Parse's scale.
I worked on building a feature that allowed Parse to list that many databases (the response did not fit into the 16 MB document size limit, requiring multiple cursors / command cursors).
I am a collaborator on the parse-server project, love the Parse team, respect what they built, and have no incentive to deceive you.
I worked at MongoDB for 4 years in various roles (consulting, engineering, etc.) and saw all sorts of deployments of various shapes and sizes.
I do believe you, but can you share some more hard facts?
I've changed the passage to "one of the greatest". As in the disclaimer above the article, I'm mainly presenting assertions by Parse engineers, who were very sure about being the largest MongoDB user in terms of cluster size.
Author of the post here. If you have additional key insights about the Parse infrastructure, please post them here and I will directly add them to the article.
We were a Parse user for (many) apps, and tried to run Parse Server briefly before just letting all the features built on it die.
I think one of the major instabilities not mentioned is the request fanout. On startup, the Parse iOS client code could fan out to upwards of 10 requests, generally all for individual pieces of installation data. Your global rate limit was 30 by default, so even barely used apps would get constantly throttled.
It was even worse running Parse Server, since on MongoDB those 10 writes per user end up sitting in a collection write queue, blocking reads until everything is processed. It was easier to kill the entire feature set that was based on it.
I know there were a ton of other issues but I've forced myself to forget them.
What was the throughput (per server) after the Go rewrite? I'd imagine an async rewrite would be a lot more efficient than the 15-30 req/s that the Rails version was getting.
How often (if ever) did you experience data loss due to the MongoDB write concern = 1?
I was the lead for Parse Push. I reverse-engineered Resque in Go so we could cleanly switch any async stage over to Go. I added a few features like multiple workers per Redis connection and soft acks, which let the framework pull the next job while still keeping the server alive during a soft shutdown until the job completed.
After moving the APNs (v2) server to Go, I was able to map the wonky networking onto Resque much better. My show & tell that week was literally "this MacBook is now running higher throughput than our cluster of >100 push servers". For APNs, the rewrite was a gain of over 500x per server.
As a great fringe benefit, fewer Resque servers meant less Redis polling. CPU dropped from 97% to <10% IIRC, which helped save us the next Black Friday, since Redis falling over from push load had often caused problems before.
I don't recall how often data loss was a problem. One point that hasn't been covered in much detail is when certain ops tricks were applied. For example, I recall that reading from secondaries was reserved for larger push customers only. Those queries could be wildly inefficient, have results in the tens of millions, and, due to Mongo bugs, would lock up an entire database if they had a geo component in the query. In later versions of Mongo, where the bugs were fixed, and after another generation of the auto-indexer, we were able to send all traffic back to the primaries.
Oh, and our ResqueGo outperformed Ruby so badly that we couldn't do a reasonable X% server transition from Ruby to Go by rebalancing the number of workers. It had to be done by making a new Go queue and rolling out the percentage of work sent to each queue. We learned that the hard way once (though luckily there was no regression IIRC).
> Some problems are caused by DBMS to backend and backend-to-frontend network utilization, which is generally not something a specialized DBMS can solve.
I absolutely agree. RDBMSs are the kings of one-size-fits-most, and in the NoSQL ecosystem, MongoDB is growing into this role. Of course, many performance problems are not caused by the backend but rather by the network (an average website makes 108 HTTP requests) or the frontend (Angular 2 without AOT, I'm looking at you). We took an end-to-end approach to tackle this problem:
1. As objects, files and queries are fetched via a REST API, we can make them cacheable at the HTTP level to transparently use browser caches, forward proxies, ISP interception proxies, CDNs and reverse proxies.
2. In order to guarantee cache coherence, we invalidate queries and objects in our CDNs and reverse proxy caches when they change, using an Apache Storm-based real-time query matching pipeline.
3. To keep browsers up to date, we ship a Bloom filter to the client indicating any data whose TTL has not yet expired but which was recently written. This happens inside the SDK (a rough sketch follows below). Optionally, clients can subscribe to queries via WebSockets (as in Firebase or Meteor).
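Here is a rough sketch of that client-side check, assuming the server ships a Bloom filter of recently written keys; the hash function and the fetch endpoint are illustrative, not the actual SDK:

// Bloom filter of keys that were written recently but may still sit unexpired in
// some cache between browser and CDN (false positives possible, false negatives not).
class BloomFilter {
  constructor(private bits: Uint8Array, private k: number) {}
  private hash(key: string, seed: number): number {
    let h = 2166136261 ^ seed;                       // FNV-1a-style hash, purely illustrative
    for (let i = 0; i < key.length; i++) {
      h = Math.imul(h ^ key.charCodeAt(i), 16777619);
    }
    return (h >>> 0) % (this.bits.length * 8);
  }
  mayContain(key: string): boolean {
    for (let s = 0; s < this.k; s++) {
      const pos = this.hash(key, s);
      if ((this.bits[pos >> 3] & (1 << (pos & 7))) === 0) return false;
    }
    return true;
  }
}

// If the key might be stale, force revalidation at the origin; otherwise any
// HTTP cache along the way may answer from its unexpired copy.
function loadObject(key: string, staleness: BloomFilter): Promise<Response> {
  const cache: RequestCache = staleness.mayContain(key) ? 'no-cache' : 'default';
  return fetch(`/db/${encodeURIComponent(key)}`, { cache });
}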
The above scheme is the core of Baqend and quite effective in optimizing end-to-end web performance. Coupled with HTTP/2 this gives us a performance edge of >10x compared to Firebase and others. The central lesson learned for us is that it's not enough to focus on optimizing one thing (e.g. database performance) when the actual user-perceived problems arise at a different level of the stack.
I think the comparison to Firebase is a bit disingenuous. Their core offering is real-time synchronization of large, arbitrary data structures. From perusing Baqend briefly, it seems like your architecture is much more poll-based.
Hi, I'm the real-time guy at Baqend and I'd like to object.
Baqend has streaming queries whose results stay up-to-date over time; purely push-based, no pulling involved. You just register your query as a streaming query and receive relevant events or the updated result every time anything of interest happens. In our opinion, this is not only comparable to but goes beyond the synchronization feature of Firebase, because our real-time queries can be pretty complex. A Baqend streaming query can look something like this:
SELECT * FROM table
WHERE forename LIKE 'Joh%'
AND age > 23
ORDER BY surname, job DESC
We are going to publish this feature in the next few weeks, but you can already read the API docs here to get a feeling for its expressiveness: https://www.baqend.com/guide/#streaming-queries
I work on Firebase now, though not on the DB team anymore. I'd love more data on how the DB is being outperformed if you're willing to share. Was it write throughput? Read throughput? Latency? Were you using small or large object sizes? Were indexes and ACLs used in both cases? Are you sure your Firebase layout used best practices? Have you run the test lately to verify whether the findings are still true?
We also do have some benchmarks comparing real-time performance, which are under submission at a scientific conference, so we cannot share them yet.
Unfortunately, testing against Firebase using established benchmarks such as YCSB is not very useful, as it's unclear which limits are enforced on purpose and which are inherent to the database design. Would you be open to providing a Firebase setup for heavy load testing w.r.t. throughput, latency and consistency?
Wow, the presentation here is really polished. Kudos!
For any readers, the test AFAICT is time to last byte for a single reader. It's a bit unfortunate that local caching seems to affect the Baqend result by the time you switch the comparison to Firebase (though in full disclosure, Firebase's local cache isn't available on JS, so that's a bit on us too).
I could give a few perf tips on the data layout for Firebase but it isn't clear whether the toy app has intentional indirection to model something more real.
The core problem was that the sustained number of requests did not really cause any bottlenecks. Actual problems were:
- On the Parse side: expensive database queries on the shared cluster that could not be optimized, since developers had no control over indexing and sharding.
- On the customer side: any load peak (e.g. a front-page story on Hacker News) caused the Parse API to run into rate limits and drop your traffic.
I feel like indexing could be done dynamically, based on performance analysis and query profiling. This way you can also do it without really understanding the application logic at all.
That's actually exactly what Parse did. They used a slow query log to automatically create up to 5 indexes per collection. Unfortunately this did not work that well, especially for larger apps.
I guess 5 indexes might be a little short for some apps. On the other hand, too many or too large indexes can become a bottleneck too. In essence, you want to be quite careful when choosing indexes for large applications.
Also, some queries tend to get complicated, and choosing the best indexes to speed them up can be extremely difficult, especially if you want your algorithms to choose them automatically.
We created more than 5 indices per collection if necessary. But fundamentally some queries can't be indexed, and if you allow your customers to make unindexable queries, they'll run them. Think of queries with inequality as the primary predicate, or queries where an index can only satisfy one of the constraints like SELECT * FROM Foo WHERE x > ? ORDER BY y DESC LIMIT 100, etc.
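For example, translated to MongoDB terms (a sketch with made-up collection and field names, not Parse's actual indexes):

declare const db: any;  // stand-in for the mongo shell's global

// An index can serve either the range on x or the sort on y, but not both at once:
db.foo.createIndex({ x: 1, y: -1 });  // range on x first: results are no longer sorted by y
db.foo.createIndex({ y: -1, x: 1 });  // sorted by y: the x > ? predicate is filtered entry by entry

// Either way, this query touches far more index entries than the 100 documents it returns:
db.foo.find({ x: { $gt: 42 } }).sort({ y: -1 }).limit(100);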
That is absolutely right. You can easily write queries that can never be executed efficiently even with great indexing. Especially in MongoDB if you think about what people can do with the $where operator.
What would in retrospect be your preferred approach to prevent users from executing inefficient queries?
We are currently investigating whether deep reinforcement learning is a good approach for detecting slow queries and making them more efficient by trying different combinations of indices.
It's hard to say. Most customers want to do the right thing (though some just don't feel that provable tradeoffs in design are their problem because they outsourced).
I did some deep diving into large-customer performance near the end of my tenure at Parse to help with some case studies. Frankly, it took the full power of Facebook's observability tools (Scuba) to catch some big issues. My top two lessons were:
1. Fixing a bug in our indexer for queries like {a: X, b: {$in: Y}}. The naive assumption says you can index a or b first in a compound index and there's no problem. The truth is that indexing a before b gave a 40x boost in read performance due to locality (see the sketch after this list).
2. The Mongo query engine uses probers to pick the best index per query. If the same query shape is used by different populations, the selected index would bounce, and each population would alternately get preferred treatment for the next several thousand queries. If data analysis shows you have multiple populations, you can add fake terms to your query to split the index strategy.
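A hedged mongo-shell sketch of both lessons (collection and field names are invented; the fake term is just one way to get separate plan-cache entries):

declare const db: any;  // stand-in for the mongo shell's global

// Lesson 1: for {a: X, b: {$in: Y}}, put the equality field first so all matching
// documents live in one contiguous index range instead of one range per $in value.
db.events.createIndex({ a: 1, b: 1 });  // a-first: much better read locality
db.events.createIndex({ b: 1, a: 1 });  // naive alternative: scattered ranges

// Lesson 2: give each query population its own shape so one population's winning
// plan is not reused for the other.
db.events.find({ a: "x", b: { $in: ["p", "q"] } });
db.events.find({ a: "x", b: { $in: ["p", "q"] }, _population: { $exists: false } }); // no-op fake term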
FWIW, the Google model is to just cut unindexable queries from the feature set. IIRC, you can only have one sort or range field in your query in Datastore.
The Google Datastore is built on Megastore. Megastore's data model is based on entity groups, which represent fine-grained, application-defined partitions (e.g. a user's message inbox). Transactions are supported per co-located entity group, each of which is mapped to a single row in BigTable, which offers row-level atomicity. Transactions spanning multiple entity groups are not encouraged, as they require expensive two-phase commits. Megastore uses synchronous wide-area replication. The replication protocol is based on Paxos consensus over positions in a shared write-ahead log.
The reason the Datastore only allows very limited queries is that it seeks to target each query to an entity group in order to be efficient. Queries within an entity group are fast, auto-indexed and consistent. Global indexes, on the other hand, are explicitly defined and only eventually consistent (similar to DynamoDB). Any query on unindexed properties simply returns empty results, and each query can only have one inequality condition [1].
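To illustrate the inequality restriction, here is a toy check of that rule (not Google's API; filter representation is made up):

// Datastore-style rule: inequality operators may only be used on a single property.
type Filter = { property: string; op: '=' | '<' | '<=' | '>' | '>='; value: unknown };

function allowedByDatastore(filters: Filter[]): boolean {
  const inequalityProps = new Set(filters.filter(f => f.op !== '=').map(f => f.property));
  return inequalityProps.size <= 1;
}

allowedByDatastore([{ property: 'priority', op: '>', value: 3 }]);  // true
allowedByDatastore([
  { property: 'priority', op: '>', value: 3 },
  { property: 'created', op: '<', value: Date.now() },
]);  // false: inequalities on two different properties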