It's 2017, and despite all this parallel stuff, query planners are still dumb as bricks. Every day I get annoyed at various obvious-to-the-eye WHERE clause pushdown opportunities, or situations where I can manually copy a view to a table, add indexes, and build a query on top of that quicker than letting the database fill up the entire disk with temporary files trying to do it as a subquery. It's utterly maddening if you spend your life on ad-hoc analytics/reporting workloads.
I strongly believe there are huge opportunities for AI to come along and, given a logical description of the data and query, do a better job of physically organising and serving the data. I realise some database management systems do this to an extent, but it's pretty weak sauce in my experience.
I'd like to see your proposal get a bit more concrete. Your request seems a bit more for "magic" than "AI".
You're not likely to succeed with "here's the query and schema" and have AI somehow figure out what the best plan there is. There's a lot of rules to observe for correctness, and you can't just try to execute arbitrary valid plans, because the initial query plans are going to be really bad. So you again need a model of what valid query plans are, and a cost model to evaluate how efficient a plan is, without having executed it.
I think there are plenty of opportunities for ML-type improvements, but they're in individual parts rather than a wholesale query planner replacement. I think the most likely beneficiaries are around statistics collection and evaluation, incremental improvements to query plans, and some execution time -> plan time feedback loops.
> So you again need a model of what valid query plans are, and a cost model to evaluate how efficient a plan is, without having executed it.
Surely query optimizers do this already. What about that sounds like "magic"?
Computers can beat Lee Sedol at Go, it hardly seems like "magic" for a computer to come up with a query plan that is obvious to a trained human.
For example, OP mentioned the case of creating a temporary table with an index instead of emitting a lot of temporary files executing a subquery. Surely you could statically verify that such a transformation was legal, and it seems likely that a cost model could accurately predict that such a transformation would be cheaper. So why doesn't the query optimizer perform this transformation? Presumably because it doesn't yet explore this part of the search space of possible transformations.
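To make that concrete, here's roughly what the manual version of that transformation looks like in Postgres (table and column names invented for illustration):

```sql
-- Materialize the subquery once into an indexed temp table, then join against it.
CREATE TEMP TABLE recent_customers AS
    SELECT id
    FROM customers
    WHERE signup_date > now() - interval '30 days';

CREATE INDEX ON recent_customers (id);
ANALYZE recent_customers;

SELECT o.*
FROM orders o
JOIN recent_customers r ON r.id = o.customer_id;
```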
EDIT: apparently this post is making some people mad. I'm not saying it's easy or obvious, but with ML being applied to so many problems very successfully, it seems very strange to say it's expecting "magic" to suggest that ML could improve on obviously sub-optimal query plans.
We have a cost function and we have a search space of valid transformations. Sounds like a lot of room to experiment with ML-based approaches to me.
> > So you again need a model of what valid query plans are, and a cost model to evaluate how efficient a plan is, without having executed it.
> Surely query optimizers do this already.
Sure. But not in a form that's applicable for ML / AI.
> What about that sounds like "magic"?
It's naming a buzzword and expecting revolutionary results, without actually suggesting how $buzzword can be applied to achieve those results.
I'm asking for concrete proposals because I think it's an interesting field.
> Computers can beat Lee Sedol at Go, it hardly seems like "magic" for a computer to come up with a query plan that is obvious to a trained human.
Just because a technology worked in one field doesn't mean it's trivially applicable to other fields (not that the original application to Go was trivial in the least).
And this is exactly why I hate trendy buzzwords, especially when they involve fields that are large, complex, and been around forever. The Dunning-Kruger effect is running at 11 here.
Assuming that there’s a single “AI” technology that’s used and widely applicable is deeply flawed, as is the idea that all of Google’s computational power for such a peacock-oriented game can make its way into a query planner. Postgres already uses genetic algorithms in its planner, so it’s already got “AI” baked in.
I’m sure there is plenty of room for improvement with query planners, including new paradigms, but you can’t just throw such a generic buzzword at it and have that be intelligible. You might as well say “it just needs better algorithms.” Well, ya. It has zero substance.
Postgres uses GEQO for really long lists of joins. For lots of workloads, if you turn it off, you'll get better performance because the query planning time is dwarfed by the runtime of a bad plan, so you're happy to wait and brute force it. But even then, the constraints in which that brute forcing applies are pretty limited (by the schema and by potentially suboptimal SQL).
I'm sorry that you think my suggestion has zero substance. If you think that the sub-second query plans that Postgres currently generates for ad-hoc workloads are adequate, I wish you luck. I personally think there are lots of missing heuristics that an ML approach could pick up, either in the general case, or over time in a specific database.
I'm interested in how you think they could best be applied. I know I've pushed the threshold at which it switches to the genetic algorithm higher before, to ensure I got better plans for larger join lists. But in what ways could we use ML to generate optimal plans for different query parameters? An optimal plan for one set of parameters could be totally suboptimal for another set. How could we handle that? I'd actually love to see this type of discussion happening on the mailing list.
I don't have the answers - I would honestly like to see training done on real workloads. In at least _some_ situations, it seems like live-calculating statistics would be quicker and I know ML work has been done on estimating statistics more accurately.
I'll be honest, for a lot of stuff I can live without any real intelligence. For my workloads, I'd be happy to wait for the query planner to gain full knowledge of the physical layout of the database before trying anything. My other main annoyance, as mentioned, is WHERE clause pushdowns. I think there is a simple, if not efficient, decision procedure for this stuff and I'm happy to wait for the planner to do the work.
Are the pushdown problems you're experiencing with postgres, or something else? If with PG, I'd very much welcome you to report them. While we know about some weaknesses of the predicate pushdown logic (some more complicated group bys, lateral + subqueries), it'd be good to know about more of them.
FWIW, to me pushdown doesn't sound at all like a field where ML would be helpful...
Yeah, I'm not really proposing some specialised ML algorithm within Postgres to handle these edge cases, I'm more hoping that someone comes along with an entirely new paradigm for describing data and queries, that allows a platform to evolve on top of that for maximum performance given the described workload.
In a particular case that annoyed me this week, casting an IN clause with a subquery to an array (on a query over a view with a GROUP BY) resulted in wildly better performance, despite setting the statistics target to the maximum value, indexing everything etc. I'm happy to demonstrate a trivial reproduction of this issue, and I'm sure I'm a crap DBA, but every day it's the same fight - SQL is _not_ a declarative language. It is incredibly sensitive to implementation details, in ways that are far less obvious than in any other programming language.
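For the curious, the rewrite was essentially this shape (names invented; the real case was a view with a GROUP BY underneath):

```sql
-- Slow for me: the planner picked a terrible plan for the IN (subquery) form.
SELECT *
FROM order_totals_view
WHERE customer_id IN (SELECT id FROM flagged_customers);

-- Wildly faster after forcing the subquery into an array:
SELECT *
FROM order_totals_view
WHERE customer_id = ANY (ARRAY(SELECT id FROM flagged_customers));
```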
Postgres could do a better job in all sorts of situations, but I personally think that finding all these edge cases algorithmically or analytically is futile. A platform built on a paradigm more closely aligned to the logical descriptions of data and queries could, over time, learn to perform better.
Yes; as we all know, superoptimizers were revolutionary and now form the core of compilers.
Also, superoptimizers are stochastic search algorithms, which are only tangentially related to ML, which is again only tangentially related to AI. The relation between AI and superoptimization is close to nil.
I wonder how hard it would be to give Postgres a pluggable query planner, so that extensions can try to improve on its results. Then people could try things out without Postgres committing to something too soon.
I'm still learning what Postgres extensions are capable of. I've written a few but only to define functions & types. It seems like I've seen more "interesting" achievements though, like CitusDB seems able to rewrite queries. Can an extension even add syntax?
> I wonder how hard it would be to give Postgres a pluggable query planner, so that extensions can try to improve on its results. Then people could try things out without Postgres committing to something too soon.
You can quite easily replace the planner, it's just a single hook. Building a replacement's the hard part ;)
> Can an extension even add syntax?
No, not really. There's no good way to achieve extensibility with a bison based parser. At least none that we PG devs know and "believe" in.
What I'm saying is that I, as a human, can spot simple situations under which semantically-correct optimizations are available. I can also create queries where literally cutting and pasting IDs over and over again into a WHERE clause is quicker than joining or using an IN clause. These things are dumb.
I personally think there are massive opportunities for data description languages and query languages that make expressing a single set of semantics simpler. SQL is supposedly a declarative language, but almost always requires you to understand _all_ the underlying implementation details of a database to get good performance.
Beyond that, I'd be more than happy with ML that enhanced, heuristically, some parts of the query optimiser. Hell, I'd be happy with a query optimiser that just went away and optimised, 24 hours a day, or at least didn't just spend 19ms planning a query that is then going to run for 8 hours.
ML with human-in-the-loop (HITL). There is a feedback loop that learns based on human expectation. Peloton[1] is already learning based on data access patterns, but doesn't consider subjective feedback. Maybe there could be a query planner that learns which query plans are better via reinforcement learning (the reward is given by the human).
I think named entity recognition may help if detailed labels were used. The parser can already label the where clause, select clause, etc. However a component of a where clause in a given query may be better executed as a join clause. Given enough training data, NER should be able to recognize the join clause within the where clause in that specific context and the resulting labels would give the optimizer a hint on how to rewrite that specific query.
Now I have no idea how much training data you would need or if a single, trained model would work for most use cases. But I think it could work in theory.
If you want to take part in a community effort to fix this problem, consider contributing to the Apache Calcite project. It has a relational algebra that spans many different databases and dialects, including NoSQL databases like MongoDB and Elasticsearch. I haven't been very involved with the project directly, but I've worked on several projects that heavily depended on it, and have written extensions for those projects.
I took a peek into calcite some time ago and was disappointed. The project seems to be overly bureaucratic, and the code quality quite low. (Side note, I've noticed a steady decline in quality of Apache projects for at least the past decade or so.)
I wanted to perform some introspection on queries in-flight, and potentially re-write them, but this seemed like an awful chore with calcite. One, calcite didn't implement my dialect -- it looked non-trivial to add new dialects. Two, going from query to parse tree (with metadata) back to query doesn't seem to be something that's intended to be supported.
Calcite is really a replacement frontend with its own SQL dialect, where you can tweak the compiled output (SQL, CQL, etc.). It has its own warts and is not generally applicable to other projects. It works well within the Apache ecosystem but I don't envision much adoption outside of it (especially non-Java, like Postgres).
If you were writing a Tableau replacement, you might consider using Calcite to generate your queries. I don't see many other use cases.
I don't think it is fair to say the code quality is low. There are a lot of complex concepts in the optimizer and the community is always working to add tooling and documentation to make it easier for newer developers.
There are some organizational quirks to how the code is written. It has been around for a long time, so it still has some design elements from working within older Java versions. Like many projects it is reliant on integration tests, likely more than would be desirable.
As far as the wider Apache ecosystem, I know there are projects that get by with pretty low code quality. Unfortunately there aren't many central voices at Apache enforcing specific policies around how codebases are managed, they are more focused on community development. I think they may be better off trying to take a closer look at codebases during incubation. Then again, there is no excess of resources waiting around to review code, and a lot of budding communities that want to take part.
> Side note, I've noticed a steady decline in quality of Apache projects for at least the past decade or so.
That's a weird trend to notice, given that none of the same developers are working on "Apache projects" generally. The ASF isn't like e.g. Mozilla; it isn't a monolithic org with its own devs. It's a meta-bureaucracy that takes ownership of projects for legal (copyright + patent assignment) reasons, and then offers project-management charters to member projects, giving them process and structure for contribution without a canonical owner.
An ASF project is sort of like a "working group" (e.g. Khronos, WHATWG), except that it's usually one or two large corporations and a bunch of FOSS devs, rather than being exclusively a bunch of corporations.
---
On the other hand, if there is a trend, it might be because of the increasing reliance on the "open core" strategy for corporate FOSS contributors to make money.
My own experience is with Cloudant, who contribute to Apache CouchDB, but also have their own private fork of CouchDB running on their SaaS. Originally, Cloudant did what they liked in their own fork, and then tried to push the results upstream. The ASF didn't like this very much, since these contributions weren't planned in the open, so Cloudant increasingly tried instead to mirror its internal issues as ASF Bugzilla issues and get public-member sign-off before going forward on private solutions. Which is kind of a shadow play, since many of the founding members of the CouchDB ASF project either have always worked for Cloudant, or have since been hired by Cloudant, so it's Cloudant employees (in their roles as ASF project members) signing off on these Cloudant issues. But it still slows things down.
A good comparison that people might be familiar with, with the same dynamics, is WebKit. The WebKit project has its own separate development organization, but most of the developers happen to also work for Apple.
Previously, WebKit was Apple and Google, but even two corporate contributors were too big for the small pond. Which, to me, shows that they were each there expecting to dominate the decision-making process, rather than find consensus with the FOSS contributors; and having an equally-powerful player that they had to form consensus with was too much for them.
How's that dispositive of a declining average code quality? ASF has influence over which projects get incubated, and even some over how projects are managed.
This is interesting. When you see these 'obvious to the eye' opportunities, do you stop and treat them as an exercise in generalizing the optimization for a query planner?
I'm curious how much query planners have to make tradeoffs between effective optimizations and not overloading the analysis phase.
It's not that simple. Server side prepared statements are used to great effect in OLTPish scenarios too.
A serviceable approach would be to perform additional optimizations based on the query's cost - but that's hard to do without incurring a complete replan, because of the bottom up dynamic programming approach of a good chunk of the planning.
I care about the total amount of time until I get to do real work with the data returned. Subsecond query plans for ad-hoc workloads are pointless for me if I then wait 10 hours for results.
Here's a concrete example: a view with a GROUP BY. Give it a concrete ID to filter by, and it will push it down to the underlying table and return quick results. I can then script a loop to do this to get the full dataset for a list of IDs. However, if I supply a `WHERE id IN (...)` or a JOIN, the query plan will be totally different and will take forever. This is dumb, and this is with up-to-date statistics etc. I'm happy to accept I'm probably not the target market, but just having the option to leave the query planner running for minutes instead of milliseconds would be great (if it indeed has work to _do_ in those minutes).
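The shape of it, roughly (names invented):

```sql
CREATE VIEW daily_totals AS
    SELECT account_id, date_trunc('day', created_at) AS day, sum(amount) AS total
    FROM payments
    GROUP BY account_id, date_trunc('day', created_at);

-- Fast: the filter gets pushed down below the GROUP BY.
SELECT * FROM daily_totals WHERE account_id = 42;

-- Slow in my case: a completely different plan that aggregates far more than it needs to.
SELECT * FROM daily_totals WHERE account_id IN (42, 43, 44);
```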
That was going to be my next question: is there any tunability to the query planner? Obviously you're going to be more tolerant of planner overhead than someone maintaining the backend for a big CRUD site.
ETA: some quick googling suggests no, since most tuning documentation is on the initial query or the indexing.
For example one of the expensive steps is exploring the possible join tree - deciding in what order the tables will be joined (and then using which join algorithm). That's inherently exponential, because for N tables there are N! orderings. By default PostgreSQL will only do exhaustive search for up to 8 tables, and then switch to GEQO (genetic algorithm), but you can increase join_collapse_limit (and possibly also from_collapse_limit) to increase the threshold.
Another thing is you may make the statistics more detailed/accurate by increasing `default_statistics_target` - by default it's 100, so we have histograms with 100 buckets and up to 100 most common values, giving us ~1% resolution. But you can increase it up to 10000 to get more accurate estimates. It may help, but of course it makes ANALYZE and planning more expensive.
And so on ...
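Concretely, for an ad-hoc session that kind of tuning might look like this (the settings are real, the table/column names and values purely illustrative):

```sql
-- Do exhaustive join-order search for bigger join lists, and never fall back to GEQO.
SET join_collapse_limit = 16;
SET from_collapse_limit = 16;
SET geqo = off;

-- More detailed statistics for one particularly skewed column, then refresh them.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;
```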
But in general, we don't want too many low-level knobs - we prefer to make the planner smarter in general.
What you can do fairly easily is to replace the whole planner, and tweak it in various ways - that's why we have the hook/callback system, after all.
My default statistics target is 10000. I have turned GEQO off entirely, although I rarely hit the threshold at which it's relevant. But there's still no way of telling the planner that I am sat, in a psql session, doing an ad-hoc query with a deadline. The fact that there are situations where I have to disable sequential scans entirely proves to me that Postgres doesn't care about performance for interactive use. Simple SELECT count(*)s could save weeks of people's lives, but there's no way to tell the planner that you don't mind leaving it running over a coffee break.
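The blunt hammers I end up reaching for look like this (session-level, table name invented):

```sql
-- Per-session workaround when the planner insists on a sequential scan:
SET enable_seqscan = off;
EXPLAIN (ANALYZE) SELECT count(*) FROM events WHERE status = 'active';
```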
Please report some of these cases to the list. We're much more likely to fix things that we hear are practical problems than the ones that we know theoretically exist but are much more likely a waste of plan time for everyone, not to speak of development and maintenance overhead.
I find "disable sequential scans entirely proves to me that Postgres doesn't care about performance for interactive use." in combination with not reporting the issue a bit contradictory. We care about stuff we get diagnosable reports for.
A lower-hanging fruit may be a dynamic profiler that continuously analyzes query patterns, and automatically reorganizes indexes, table clustering, partitioning, and query planning based on the actual database activity.
This could even be something that operates as a layer on top of Postgres, although it would probably need more precise data than what is currently available through stats tables, and since Postgres doesn't have query hints there's no way to modify query plans. It would also need to look at logs to see what the most frequent queries are, and possibly require a dedicated slave to "test run" queries on to gauge whether it has found an optimal solution (or maybe do it on the actual database during quiet hours).
> a dynamic profiler that continuously analyzes query patterns, and automatically reorganizes indexes, table clustering, partitioning, and query planning based on the actual database activity.
Kinda wish someone would package up capabilities like these as nicely as the SQL Server Profiler and Index Tuning Wizard. But even so, real fixes almost always lie a level above - you can write semantically identical SQL (just like any other language) that has wildly different performance characteristics. That's annoying for a supposedly declarative language. And nothing will save you if you have just messed up your schema - something that tells you to denormalise stuff to get it on the same page of data because of such-and-such a query load would also help.
I've always wondered why there doesn't exist something like this, but architected like Apache's mod_security: something you run for a while in "training" mode to let it learn your usage patterns, which then spits out a hints file (in mod_security's case, a behaviour whitelist; in this case, a migration to add a set of indexes + views).
As a bonus, it could (like mod-security) also have a "production mode" where the DB consumes that hints file and applies the instructions in it as an overlay, rather than as a permanent patch to your schema; that way, you'd be able to re-train after changes and then swap out the hints in production by just swapping out the file and HUPping the DB server (which would hopefully do the least-time migration to turn the old indexes into the new indexes), rather than needing to scrub out all the old indexes before building new ones.
This is also exactly what Microsoft SQL Server has done _since 2005_ with dynamic management views. It shows on a running system what indexes would be optimal for an actual workload.
There is a lot of somewhat-researchy work on this topic, the primary one being the effort at Microsoft Research, called AutoAdmin, that started about 20 years ago. They looked at automatically creating indexes, views, histograms, and stuff like that. Peloton@CMU is a more recent incarnation of the same idea with newer tradeoffs.
Although it might sound easy, this turns out to be a very challenging problem for many technical reasons.
A primary reason for it not getting traction in practice is also that database administrators don't like an automated tool messing with their setup and potentially nullifying all the tricks they might have played to improve performance. This was especially true in big DB2/Oracle deployments, but is increasingly less true, which has opened it up for innovation in the last few years.
I would prefer they worked on something much simpler for now: the ability to mark a table as "log all full table scans on this table unless the SQL query contains the string xxxxx".
This would make it easy to catch missing indexes without logging all queries and without the nightly batch updates which are expected to do full table scans.
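There's nothing quite like that today as far as I know; the closest cheap approximation is to watch the statistics collector for tables that keep accumulating sequential scans, something like:

```sql
-- Tables sorted by how many rows sequential scans have read since stats were last reset.
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 20;
```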
Are the statistics wrong on your data? Or is it that the subqueries modify the data set such that the statistics are unknown for the outer query?
Arbitrary ad-hoc query support is hard: it feels like you want indices on everything, but then data insertion is dog slow. Even with perfect stats I've seen Vertica and PostgreSQL order aspects of queries wrong, so I know the pain you speak of. I've not convinced myself that it's always solvable, or obvious until you do it wrong, though; I never spent much time EXPLAINing queries before executing them, or EXPLAINing "fast" queries.
It's not a big problem to build a query planner that generates close-to-perfect plans. The problem is the amount of time it would take for that query planner to generate a plan.
Let me decide the amount of time, then. If I can cut and paste a bunch of queries together to make them faster than the current planner, I'd be happy to let the query planner run for tens of minutes.
I feel like they’re burying the lede a bit in this one:
“PostgreSQL 10 provides better support for parallelized queries by allowing more parts of the query execution process to be parallelized. Improvements include additional types of data scans that are parallelized as well as optimizations when the data is recombined, such as pre-sorting. These enhancements allow results to be returned more quickly.”
This sounds HUGE. I want to see detailed benchmarks and examples of the kind of workloads that are faster now.
We've recently moved from Oracle (after using Oracle for 15 years) and we have found PostgreSQL's documentation to be superior to Oracle's, and its SQL dialect to adhere closer to the standard and to make more sense (Oracle has been around for a long time and has a lot of old and non-standard SQL built in). Overall things are simpler, yet everything runs as fast or faster than it did on Oracle. Better price, better performance, better documentation, better SQL, better install process... what's not to like?
> yet everything runs as fast or faster than it did on Oracle.
Be careful with that statement, Oracle’s license disallows users of the database from making benchmarks or any kind of performance comparisons (another reason to move to PGSQL)
The beauty of course being that the mere existence of such a clause tells you more than enough about the performance characteristics.
Now, I'm not going to discount Oracle entirely. From what I've learned from some truly hardcore DBAs, there are certain workloads and scenarios where Oracle is still unbeatable.[ß] But outside those special cases, postgres is the rational, correct choice.
ß: For example, if you have insane performance requirements on a single system, it has been explained to me that Oracle can be used with its own "fs" layer, which in turn runs directly on top of the block device layer. Essentially, a bare-bones file system designed for nothing but their own DB usage.
I'll play devil's advocate: it's easy to benchmark, but it's hard to demonstrate that the outcome is significant, accurate, or even meaningful. And for whatever reason[0], misinformation tends to be stickier than truths. If I were Oracle, I'd get tired of dispelling bad benchmarks real fast, so much so that I'd go as far as prohibiting publishing benchmarks in the ToS, just to avoid dealing with it.
For example, APFS vs. HFS+ (filesystems are a kind of hierarchical database, so it's not that big a stretch). Multiple secondary sources, citing the same outdated benchmark from months ago, declare that APFS is slower than HFS+. Here's one from HN[1], with some back & forth arguments and further ad-hoc measurements. Yet nobody bothered to run or even mention fio.
Or the "ZFS will eat your data if you don't use ECC memory" meme that refuse to die.
[0] Maybe it's because, in order to argue that "X is bad" is wrong, it's hard not to state "X is bad" first, and biased / less astute audiences tend to stop paying attention after absorbing "X is bad"[2].
Imagine if CPU makers prevented you from running PassMark.
Hell, imagine if car's manufacturer warranty was voided if you took it to a dynamometer.
Why do we as a society allow absurd rent-seeking practices like these? What possible social good could come from preventing discussion about an enterprise software product?
> Hell, imagine if car's manufacturer warranty was voided if you took it to a dynamometer.
Ferrari has the right to force you to give your car back if you do anything unauthorized with it, including any modification, or any unapproved testing. This was why Deadmau5 had to give back his Purrari (he modified the logo and theme to a nyancat theme)
This sounds like bullshit to me. If you own the car you have every right to modify it. It's your property.
I can see Ferrari not selling you any new cars because you did something they don't like, but I don't see how they have the right to force you to give your car back (which you paid for) if you modify it.
It's not a license. It's a lease and the car is yours at the end of the lease period.
> The lucky few selected to get an F50 would make a down payment of $240,000. Yes, that's right: a down payment of $240,000... on a car. After that, monthly payments were $5,600. Five thousand. Six hundred. Per month. For 24 months. And then, at the end of the lease, you owned the car -- assuming you could come up with the final payment of $150,000. Total payments over the 2-year span: $534,400. Only then could you resell your F50 and make money off the wild speculation.
Worth noting that (I'm 99% sure) you're absolutely allowed to benchmark Oracle, just not publish the results publicly. It's pretty common for DB papers to compare against postgres and a number of "mystery" databases because the authors' license agreement with Oracle prevents them from naming it.
I think the purpose is to try and prevent benchmarks being posted that are on improperly configured or optimized systems and therefore misleading, not to "hide" the performance characteristics of the database. That said, Oracle sucks and I can never imagine using their software.
It is, but it is also a very true one. I've seen way too many terrible benchmarks of MySQL and Postgres to remember them all, because of misconfiguration, not understanding best practices which apply to each, etc. I can see why they'd restrict them.
Even if they are an ex-user of Oracle products, they would have had to benchmark the software at a time they were using it, or have someone who is using it benchmark it. Either way a benchmark would be made at some point while they were an Oracle product user, which, taking kuschku's word for it, is against the license agreement.
Looking at the license agreement, it doesn't prohibit running benchmarks and recording their results, so if you ask me they wouldn't be breaking the agreement at the time they were bound by it.
(Of course nobody in danger of getting involved in a court case with Oracle is going to be asking me.)
Unbelievable. Unfortunately Oracle will stick around thanks to the non-technical people in large organisations that control the money and like products they can "trust".
1) Oracle is really lacking in modern features/usability. Features were frozen in roughly 1999 and are still pretty much the same (mostly). (You are STILL limited to 30 chars for a table name FFS). They do add new stuff from time to time but anything existing isn't modified. Works but not fun to work with.
2) MSSQL needs NOLOCK everywhere (I've seen codebases with this on EVERY query). The default locking really sucks (see the sketch after this list). I'm sure a DBA can make the locking sane system-wide, but I've never seen this on any of the DBs I've worked with. Also, SQL Manager is a PITA IMHO. Toad it is not. Doing almost all DB interactions via a 1 GB Windows-only install is a "bad idea".
3) MySQL is nice but will just silently eat your data from time to time. Auto truncate is evil, as is missing enums. These have both hit me multiple times in production. Note: Not sure if this is still the case since I avoid it now for this reason.
4) Postgres. Lots of nice features and easy to work with, but the optimizer will sometimes do something silly. Sometimes you have to cast your bound variable just to use an index. (id=? => id=?::NUMBER just because you do a setObject in JDBC)
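Re point 2, the NOLOCK pattern I mean looks like this in T-SQL (table names invented), and the usual system-wide fix I've heard of is flipping the database to read-committed snapshot isolation:

```sql
-- The hint sprinkled on every query (effectively READ UNCOMMITTED):
SELECT o.id, o.total
FROM dbo.Orders o WITH (NOLOCK)
JOIN dbo.Customers c WITH (NOLOCK) ON c.id = o.customer_id;

-- The saner database-level alternative a DBA can enable:
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON;
```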
Thanks, I didn't realize they finally improved that.
Of course I doubt I'll ever notice, since I doubt my current employer will ever upgrade that far. We have currently frozen our large Oracle database as a way to force long-term migration off it.
Pretty much everything is moving to Postgres on AWS with a bit of other databases thrown in for spice.
> MySQL is nice but will just silently eat your data from time to time. Auto truncate is evil, as is missing enums. These have both hit me multiple times in production. Note: Not sure if this is still the case since I avoid it now for this reason.
Looks like I am forced to use MySQL (or one of its variants) in the near future. This thing about MySQL eating data is a statement I have read occasionally. Is there any way to identify and be wary of use cases where this could happen? Would there be any more thorough documentation of this issue anywhere?
Modern MySQL is extremely well suited for data that cannot be lost - as is I'm sure, Postgres.
That said, if you're pre 5.6, I _strongly_ suggest upgrading to 5.6 or all the way to MariaDB 10. The performance, safety, stability, etc have skyrocketed in recent years.
Older versions would silently allow you to insert 75 chars into a 50-char column. The extra was just gone. Of course, without an error nobody notices until somebody notices the missing data. This is usually an end user in production, and the data is just gone.
You just silently lost all the xx-small values, they have been promoted to different values that exist. (Unless this has been fixed as well). Migration scripts are the real issue as they don't know about any custom values that may have been added out of band.
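For what it's worth, strict SQL mode turns most of those silent coercions into hard errors; older setups just didn't enable it (the sql_mode values are real, the table is made up):

```sql
SET SESSION sql_mode = 'STRICT_ALL_TABLES';

CREATE TABLE shirts (
    label VARCHAR(5),
    size  ENUM('small', 'medium', 'large')
);

-- Without strict mode: label is silently cut to 'abcde' and size becomes '' (data gone).
-- With strict mode: the statement fails with an error instead.
INSERT INTO shirts VALUES ('abcdefghij', 'xx-small');
```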
I cannot comment on every single aspect of Postgres vs MSSQL, but there are a few things I like in MSSQL that I don't believe exist in Postgres:
1) SQL Server Management Studio (SSMS) - the default GUI is a decent free front-end that integrates well and lets you do advanced stuff like graphically live-profile queries (live execution plans), easily setup security, build indices, setup various things like linked servers, compression, etc. Although I'm a text-editor sort of person, I don't have SQL syntax memorized for infrequent tasks like creating indices so a GUI (or an IDE) can really help productivity in these instances.
Postgres's default GUI, pgAdmin is comparatively weak, and the good ones are third-party payware.
2) Columnar indices - MSSQL has a fairly good implementation called columnstore indices, which creates a derived column-oriented data structure that speeds up analytic queries quite a bit (see the sketch after this list).
3) Speed - SQL Server is very performant and optimized, and doesn't choke on very large datasets. Postgres is decent, but on my datasets it doesn't seem to be very performant.
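Per point 2, a columnstore index is just another index DDL statement in T-SQL (table invented for illustration):

```sql
-- Column-oriented secondary structure that speeds up large analytic scans and aggregations.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_sales_columnstore
    ON dbo.Sales (customer_id, sold_on, amount);
```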
Also, MSSQL locking is a boon and a bane. It's not the best for environments with high contention, but it is ok for most analytic database use cases. On the other hand, Postgres' MVCC (and oh the vacuuming) can be annoying.
pgAdmin is a surprisingly bad GUI for a database as good as Postgres, yet I see folks recommending it on forums and such. I don't quite understand the reasoning behind that -- I can only surmise that these folks have never used a decent SQL GUI nor experienced how a decent SQL GUI can massively increase their productivity.
I wonder if rewriting pgAdmin in Electron might help.
Others recommend using psql from the command-line. Now I spend most of my time on the command-line, and psql is great for one-off stuff, but when you have to edit, test and profile complex queries, the cmd-line interface just doesn't cut it for rapid iteration.
I think this is a huge gap in the Postgres world. But I also think that DataGrip is very promising. I have a pretty high view of Jetbrains' tools.
What are the good third-party payware GUIs for Postgres? I've looked and never seen any that came anywhere close to SSMS, much less SSMS + SQL Prompt combo.
Navicat is poor. Their buggy "native Linux app" is a horribly slow .NET 2.0 app bundled with Wine; we felt scammed on that but managed to get our money back.
Aside from tooling, those systems often perform much better than PostgreSQL for large queries or transactions, as they feature much better optimizations. Even outside of newer optimizations like "columnar" storage, several of those systems do code generation from queries to avoid function calls, branches, etc., which can have huge performance implications. I worked on the internals of PostgreSQL once, and the number of function calls in the innermost loops was very high.
PostgreSQL also used to be (is?) single-threaded, which limited performance of a single query on multi-core machines -- I haven't looked into it to see if there has been any fundamental change in the architecture in the last 4-5 years.
Yes, I was just reading through that. The server is still single-threaded though -- they are getting the parallelism by starting multiple processes to do independent chunks of work. This makes sense for PostgreSQL, but has some fundamental limitations (e.g., it requires duplicated copies of a hash table to parallelize a hash join).
>The server is still single-threaded though -- they are getting the parallelism by starting multiple processes to do independent chunks of work.
So...it isn't single threaded then? I mean that is exactly how the most advanced competitors operate (Oracle, SQL Server) as well -- a given connection stays on one thread, with the advantages that confers, unless the planner decides to parallelize.
To be technical, MSSQL uses its own bespoke scheduling, and will preempt the thread for I/O. All I/O is non-blocking. The physical thread can vary for this reason. PGSQL really does use synchronous I/O and a single thread though. The former is probably more scalable but the latter has been serving PGSQL fine, too.
In the specific case of hashjoins, it does build them independently right now. There's a patch to rectify that though, by putting the hashtable also into shared memory. The coordination necessary to make multi phase batch joins and other such funny bits work, unfortunately made it infeasible to get into 10.
The biggest issue with SQL Server is that it is myopic. The tooling and everything around it is geared toward only SQL Server. The database itself is also geared around only SQL Server... making it a huge pain to get your data out to use it with something else like Elasticsearch. It's geared towards being comfortable enough to lock you in and hold your data hostage.
For 99% of standard SQL, they all work the same today.
The commercial databases are still faster since they have more advanced algorithms and optimizations, as well as better scale out options and tooling - but Postgres is quickly catching up and will be fine for the majority of scenarios. Postgres also has better general usability with robust JSON support, CSV handling, foreign data(base) access, lots of extensions and other features that help make it a powerful data platform.
Today the real difference will be for companies that have some combination of existing Oracle/Microsoft tools and services, advanced clustering needs, complex security requirements, or a dependency on the more advanced features like MSSQL's in-memory OLTP.
The tools to access Microsoft SQL Server are really good since they have a Visual Studio-based tool. MySQL Workbench is almost at that quality, but I'm not sure what a Postgres alternative would be.
IntelliJ Ultimate has pretty good built in tools as well that handle a lot of different dialects. I've used MSSQL and PG with pretty good results, e.g. code complete that uses the db schemas is really nice to have when exploring databases.
I cut my teeth on Oracle when I was first getting started in technology. All I did was write ad hoc queries all day long. My next DB heavy job was using SQL Server, where I built an analytics engine to do bootstrapping for sparse datasets. After that I used it to run the back-end data layer for a credit card processing company. I used MySQL at a different finance company that was doing similar things but at a smaller scale. Ever since then, I’ve been using Postgres.
Based on that experience, I’d rank them in this order:
1. Postgres
2. SQL Server
3. Oracle
99. MySQL
Postgres often lags behind the others in features, but the dev team chooses their battles wisely and ends up with better implementations.
Postgres is a real beacon of light in the open source world. Solid community. Many projects claim the benefits of open source, but they are never fully realized. Also, because Postgres is not operated by a freemium model, you always have access to the latest and greatest features. The extensibility is fantastic and well-leveraged by the community. I’ve never experienced a case where Postgres tried to figure out what I was doing and decided to do the wrong thing. Postgres fails early and loudly when there’s a problem with what I’m asking it to do, which is exactly what I want it to do. I don’t ever want to have to second guess the integrity of my data.
I haven’t run explicit benchmarks between any of these databases. But when I do similar things across two different systems, I feel like they are generally on par. But like I said, I can’t prove that with any numbers. There are probably specific work profiles that people can come up with that would show better performance for one platform over the other. But I don’t think there’s a realistic difference in performance in general. Not one that’s big enough to push your decision.
The real moment of revelation though, is when you find out that you can run your preferred programming language inside of Postgres. When you actually get to the point that transformations are outside of what you want to do in SQL, and you can just write a Python function and have it execute inside your database instead of having to do I/O, process the data, and then push it back . . . it is life-changing.
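A minimal sketch of what that looks like with PL/Python, assuming the extension is installed (function and table names are made up):

```sql
CREATE EXTENSION IF NOT EXISTS plpythonu;   -- PostgreSQL 10 also ships plpython3u

-- Push a small transformation into the database instead of round-tripping the data.
CREATE FUNCTION clamp(x numeric, lo numeric, hi numeric) RETURNS numeric AS $$
    return max(lo, min(hi, x))
$$ LANGUAGE plpythonu;

SELECT clamp(reading, 0, 100) FROM sensor_readings;
```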
The only reason SQL Server isn’t tied for first place is because of the lack of extensibility and because it’s expensive to use in production. But it is rock solid, and has some nice things that Postgres doesn’t have, like hinting queries. Again, the Postgres community has discussed this, and it may never actually happen, but there are reasonable points as to why not. But it is really handy in SQL Server to be able to guide the query planner on the fly like that. SQL Server has also had solid pub-sub for a long time, though we’re getting that now with this version of Postgres.
I’m not a huge fan of Microsoft in general, but you absolutely have to give them props for their tooling. There is nothing even close to SSMS for any other database system. It is by far the gold standard for a visual interface to your data.
Obviously, if you’re throwing down money, you’re also getting a certain level of support for the product. I’m not convinced this should be a deciding factor between Postgres and SQL Server because, again, the Postgres community is amazing.
I should also point out that there’s a free version of SQL Server that will suffice for the needs of a great many people. Its features are limited (no pub/sub, and there’s a size limit on your total dataset), but it’s totally functional for a lot of use cases. Even though I use Postgres for everything in production, I will always keep a PC around to run SQL Server for one-off things that are just easier to do there.
Oracle is mostly fine. I was so new to everything when I was using it that I probably can’t speak that well to its strengths and weaknesses. Other people who have used it more recently can probably do it better than me. I just can’t for the life of me understand why anyone would pay their prices when SQL Server and Postgres exist, unless it’s for the support contract. And where I’m kind of meh about Microsoft, I’m actively against Oracle and Sun Microsystems. I’m pretty sure that Larry Ellison’s personal motto is, “Just go ahead and be evil.” But that’s kind of a tangent and not really all that relevant.
MySQL is a different animal. It has a different design philosophy than the others. It’s more like MongoDB in principle than the others are. Its main goal is to be friendly to the developer. And to entice you into upgrading to the paid tier.
Which is all fine. But one consequence of that is that it tries really hard to succeed under any circumstance, even if “success” means corrupting or losing your data. So it fails rarely, late, and quietly under certain conditions.
For that reason, I don’t think of it as even being in the same category as the other three. As in, it would never be an option for me, similar to MongoDB. I want my dev tools and programming languages to be focused on the developer. And I want my data store to be focused on my data. I think that this is a fundamental and deadly flaw with MySQL.
Different use cases have different requirement though, so your mileage will vary. I’m an incredible pedant about data integrity because the work I do requires it. There are legitimate cases where it just doesn’t matter all that much.
But in terms of feature parity and performance, they are all pretty close in general terms. Each will have specific cases that they really excel at and have been optimized for.
> The real moment of revelation though, is when you find out that you can run your preferred programming language inside of Postgres. When you actually get to the point that transformations are outside of what you want to do in SQL, and you can just write a Python function and have it execute inside your database instead of having to do I/O, process the data, and then push it back
I'm sorry but I just can't take MSSQL Server seriously when it cannot export valid CSV. It does not escape commas for CSV or tabs for TSV, and has no option to. If they had their own flat file export that could be re-imported I could forgive it because I could write my own library for that format, but there simply is no way to cleanly export data from MSSQL. I had to write a SQL query that was ~2000 lines of 80 columns in order to export a database to CSV and properly escape each field manually, and it took FOREVER.
Not just exporting though. I've seen automated systems that attempt to import csv files directly through SQL Server, and it would always break at quoted fields and fields with newlines. And nobody could figure out why it broke all the time, and fields would be out of order or shifted or missing.
I wasn't able to convince people to fix it but I ended up writing a Go utility to reliably import/export csv into sql server using Bulk Insert for my own use (and sanity). And it ended up being faster than other methods to boot.
> I'm sorry but I just can't take MSSQL Server seriously when it cannot export valid CSV.
“Valid CSV” is a dubious phrase, since the closest thing CSV has to a spec is an RFC that tried to map out the space of the wide variety of implementations then existing.
Anyhow, SQL Server is a database server; there are a wide variety of ETL tools that will export from the server to any common (or not, really) format you like, including just about any flavor of CSV/TSV you might be interested in.
There is no such thing as “Valid CSV”. It’s an ambiguous format, with dozens of variants.
SQL Server supports CSV with exactly the same semantics as Excel. Which is what people expect 99.9% of the time, because most CSV data goes to or from Excel in the real world.
If you’re doing DB-to-DB imports and exports, use a sane file format with a sane delimiter, such as the Unicode INFORMATION SEPARATOR and RECORD SEPARATOR characters which were inherited from ASCII.
Not sure if you're trolling here but... have you tried bcp?
"The bulk copy program utility (bcp) bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files."
A missing small feature here or there is kind of a lame reason to not take a large and well-respected product seriously. That said, bcp and PowerShell (piping the output of Invoke-Sqlcmd through Export-CSV) are among the more common ways that can be done with SQL Server.
If they can't get a fundamentally simple feature right, why would I give them the benefit of the doubt that other features are thought out? bcp does _not_ escape commas or newlines in a field, you have to write sql to replace() those characters on select.
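For the record, the kind of hand-rolled escaping I mean, in T-SQL (hypothetical table and column):

```sql
-- Quote the field, double embedded quotes, and strip newlines -- per column, by hand.
SELECT '"' + REPLACE(REPLACE(REPLACE(notes, '"', '""'), CHAR(13), ''), CHAR(10), ' ') + '"' AS notes_csv
FROM dbo.Tickets;
```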
> I feel like Microsoft SQL Server is the easiest to use and has the best tooling.
Agreed. SQL Server Management Studio is fantastic and is one of the main reasons I enjoy working with MSSQL.
Unfortunately Postgres severely lacks in the tooling department. PgAdmin 3 used to be good, but PgAdmin 4 is simply horrendous. It makes me dread interacting with Postgres.
A filesystem snapshot is only reliable if you stop the database to do it, or if there is some sort of cooperation between the database and the tool that triggers the FS snapshot creation.
File system and LVM snapshots are atomic, so if you cannot get a working database from that your database won't survive a power failure either. You can also do a file system backup of PostgreSQL without atomic snapshots but then you will need to inform PostgreSQL of the backup and then copy files in the right order. The pg_basebackup tool and various third party tools handle this for you.
pgAdmin 3 still exists and afaik remains compatible with newer Postgres releases. You can still use it.
Postage, a tool developed by a family of software devs, was gaining popularity but recently became unmaintained without explanation (afaik). [0]
I started using IntelliJ DataGrip on a trial basis and it's good, but I probably won't pay for it. Sick of paying monthly subscription fee for every little tool I need from JetBrains, especially when I put down a project and don't need that tool for another x months.
Used DBeaver briefly but it's so many clicks just to set up a primary key that I shelved it for now. Will probably come back to it when DataGrip trial is over.
Not a conventional management tool but pgModeler [1] is a cool project IMO. Open-source, but they put a limit on the Windows binaries they distribute to try to get people to fund development. Can build from source yourself, install on Linux, or probably find free third-party builds elsewhere.
I think that most devs are just sticking with pgadmin3.
You can just buy JetBrains' tools; you don't get updates, but you can always use them. (Specifically, if you subscribe even once you get the latest version, and can use it forever, even if you unsubscribe.)
I'm sure DataGrip is better for writing queries in.
I really doubt it's better for managing SQL Agent jobs, doing backup/restore, or for managing users.
I have yet to find anything quite as nice as SSMS for Postgres, but Navicat for PostgreSQL is pretty good. It's not free, but its not expensive and has features I feel are worth paying for such as being able to diff schemas or data across DBs on different servers.
SSMS may well be fantastic, but as the vast majority of our dev team uses Macs, we have been converting our SS databases whenever possible to PostgreSQL over the years and have found only performance benefits, in addition to the (obvious) cost benefits.
We are (much slower than I'd like lol) going through the same migration. We've been using DataGrip for some time for both. I used to run Navicat and occasionally spin up a Windows VM for SSMS, but since moving to DataGrip, found that it does everything I need for both database engines.
Postgres continues to astound me and the dedication of the people working on it is wonderful. Thank you to everyone who ever contributed a line of code or to testing or to documentation. Everything is top-notch!
As someone who's familiar and uses postgres, but not familiar with the more detailed things databases/postgres related where would be a good place to start?
I don't know what I don't know and am not familiar with situations in which the new functionality introduced here should be used.
The manuals are really well written and easy to access. I would start at the release notes from the manual and then look at the corresponding manual pages.
I find they are well written for some things, but not others. They are good manuals for documentation, but not guides. I was trying to read about certain lock events recently, and understand their impacts, and aside from a single reference in the manual there was absolutely nothing.
Not specific to Postgres, but I found that knowing how to write complex and efficient SQL queries can be overlooked but is a very useful skill, and then understanding indexes and query plans allows you to drastically improve performance and thus reduce scaling needs.
Understanding your database configuration is important as well. I spent way longer than I'd like to admit trying to improve the performance of a complex query that would not execute the way I wanted it to no matter what: PostgreSQL absolutely refused to use a calculated index on one table and then do a HASH JOIN to a second, instead devolving into some really hairy execution plan I can't even remember at this point. Turns out I had done everything right in my index design and query, but due to the sheer volume of data the query planner didn't use the "correct" plan because work_mem was set too low.
Always be mindful of how your database is tuned, you could be doing everything right but the query planner will seem to act insane regardless.
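The eventual fix was embarrassingly small; a sketch of it, with purely illustrative values and made-up tables:

```sql
-- Give the session enough sort/hash memory that the planner stops avoiding the hash join.
SET work_mem = '512MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.id, sum(l.amount)
FROM accounts a
JOIN ledger l ON l.account_id = a.id
GROUP BY a.id;
```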
How deep do you want to go? Bruce Momjian has a ton of talks on internals and other advanced topics, so if you want to dig deep I'd highly recommend them:
https://momjian.us/
I fully support learning for learning’s sake. But you could also view the new features as solutions to problems. If you don’t feel you’re having problems then you’re probably not missing out. There is no inherent value in using the new functionality.
If you do have a specific problem then you could ask the community about that and they will help you determine whether there is new functionality that addresses it. Then go read about it in the manual, try it out, and see if it works for you. I find learning from solving my own problems easier than reading a manual cover to cover.
I’ve seen devs thinking like that for decades. Since they already know Fortran 77, which is a Turing complete language, they can solve any computing problem.
Still, I’d argue that it’s valuable to explore new things, at least at the level where you get an overview of which tool to use for which job.
Congratulations to the PG team! Another great release. Does anyone know when will PostgreSQL 10 be available on public cloud (AWS in particular) as a managed service (RDS)?
PostgreSQL 10 will be available in Aiven (managed Postgres and other databases in multiple clouds including AWS and GCP, see https://aiven.io) within a couple of days after we've completed our validation process with the final version.
Native partitioning, as well as the advancements on the replication and scaling side of things look like good first steps for a distributed SQL system.
Can anyone speak to how much closer this brings Postgres to being able to work like Spanner/CockroachDB? Partitioning is great but having to define my own partitions ahead of time isn’t a great solution for us folks who want to be able to spin up new databases on the fly (without the overhead of assigning new ranges for the tables on these new instances.)
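(For context, PostgreSQL 10's declarative partitioning still has you spell the partitions out by hand, something like this sketch with invented tables:)

```sql
CREATE TABLE measurements (
    tenant_id int         NOT NULL,
    logged_at timestamptz NOT NULL,
    reading   numeric
) PARTITION BY RANGE (logged_at);

-- Each partition has to be declared explicitly, ahead of the data arriving.
CREATE TABLE measurements_2017q4 PARTITION OF measurements
    FOR VALUES FROM ('2017-10-01') TO ('2018-01-01');
```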
Obviously CockroachDB has a long way to go before it works as well as a single-instance Postgres DB. But how far does Postgres have to go before it works as well as a multi-instance CockroachDB?
I think PG is already there and beyond if you consider using Citus.
We've got a pretty large deployment in production with 45 TB of data, across 1,280 cores and 4.8 TB of RAM. Most of our tables have hundreds of billions of rows that are under write-heavy pressure. Citus is currently handling around 60k writes per second. Most of our SELECT queries run under 200ms, and in some cases those are fairly complex.
The current setup is a result of long tests done across many different database solutions. We never considered CockroachDB seriously for our production, but we did use or test MongoDB, Cassandra (including ScyllaDB), HBase (including Big Table) and Spanner.
We struggled with most, but we used HBase for a year before we moved to PG/Citus. Our expenses dropped by half once we did, as PG is just better suited for our setup. We tested Spanner fairly heavily as GCP folks tried to convince us to give it a chance. The performance however wasn't even comparable.
Citus has its own quirks and it's not perfect, so definitely do your research before you decide to use it. The best part however is that you get the best SQL engine underneath, with all the powerful features and tools that come with it. We for instance heavily utilize triggers and pub/sub directly from PG. A huge portion of our application logic is done at the DB level through UDFs. The part we like most about Citus, though, is the team behind it. They are incredibly helpful and go beyond the scope of traditional support even when they don't need to.
That's the bit. Not schema-free, but column-oriented instead of row-oriented. I evaluated Citus a while back, comparing it to Impala with Parquet on HDFS and a couple of other systems. Citus was underwhelming in that context, and the syntax for the FDW was awkward owing to keeping base Postgres stock.
Understood. Different use cases. We like schema, gives us an opportunity to keep data in check. We also communicate through protobuf so there is no benefit in schema-free database for us.
I know that Heap Analytics is using jsonb to have schema-free setup on citus and I think it's working well for them.
That's the second time you've thought I was talking about schema free. Column stores have nothing to do with schema free. Completely orthogonal.
Column-oriented stores normally have just as much schema as row-oriented stores. What they're faster at is scans where the full row tuple is not required to compute the result. Storing the data in columnar format rather than row also means much better compression: similar data is brought closer together.
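To make that concrete, this is the kind of query (hypothetical table) where a columnar layout shines: only two narrow columns of a wide row are touched, so a column store reads a small fraction of the data a row store would, and compresses it better too:
SELECT date_trunc('day', created_at) AS day,
       AVG(amount)
FROM orders              -- imagine dozens of other wide columns on this table
GROUP BY 1
ORDER BY 1;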
Column stores are great for some tasks and have disadvantages in others. Redshift was, I believe, the first implementation on top of PG, so it's doable; however, I'm not sure that PG itself is a good engine for it.
In our case, where we need to have access to multiple columns at once, column stores (including the ones that clump them as families/groups) proved to be slower and required more resources than the row-oriented stores.
I think this really goes case by case based on your needs. If we would benefit from a column-oriented DB we would not choose PG/Citus but something else (probably we would have stayed on HBase).
Hi all--My name is Andy and I'm a PM here at Cockroach labs. We are currently working through partitioning as a key feature of our 1.2 release scheduled for April 2018. Our solution can be found at this RFC (https://github.com/cockroachdb/cockroach/pull/18683). One thing to note, we don't require you to define your partitions up front. We'd love to get your feedback on this feature--please feel free to comment on the RFC or reach out to me directly at andy@cockroachlabs.com.
I can't talk about the largest production database, but our largest internal cluster we've tested on so far is 128 machines and hit each one with our own continuous load generators and a chaos monkey (who turns on and off machines at random).
No idea! It’s still in its infancy as a DB. Functional, but probably not worth the switch, unless you really really don’t want to deal with scaling a DB manually (which is where I’m at.)
Right, but to give a meaningful answer to your question one would need to know specifics, right? As an example, a Spanner setup that would perform on par with the largest reasonable PG setup you could have would be in the 100K-150K/month range.
I haven't worked with databases in a while; at my employer we are moving to MariaDB (from MySQL). Is there some reason why we wouldn't be considering PostgreSQL? Is there some drawback to Postgres?
While the other comments talk about the benefits of switching to pgsql (and I concur whole heartedly) it seems no one has addressed your specific case. Your company isn’t really “switching” to anything, MariaDB was a fork of MySQL when the license kerfuffle was going on and there were issues with the stewardship of the project. It’s more of upgrading to a newer release of MySQL than switching to a different database engine.
i.e. “switching” to MariaDB is probably just a sysadmin upgrading the software, with no changes to your code or database queries (unless replication is involved), but pgsql will certainly require more involved changes to the software.
PostgreSQL is really better at executing arbitrarily complex queries, and the documentation is so much better.
PostgreSQL also provides additional features like LISTEN/NOTIFY, Row Level Security, transactional DDL, table functions (like generate_series), numbered query parameters ($1 instead of ?), and many others.
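A quick taste of a couple of those, in case anyone hasn't seen them (channel name and payload are made up):
LISTEN price_changes;                        -- subscribe to a channel from plain SQL
NOTIFY price_changes, 'sku-42 repriced';     -- any session can publish a payload
SELECT n FROM generate_series(1, 5) AS n;    -- table function: one row per value 1..5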
In my opinion, the only reasons for choosing MySQL/MariaDB are if you absolutely need clustered indexes (also known as index-organized tables; PostgreSQL uses heap-organized tables) or if your architecture relies on specific MySQL replication features (for example using Vitess to shard your database).
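For anyone unfamiliar with the distinction: in InnoDB the table itself is the primary-key B-tree, so rows are always stored in PK order. Postgres keeps rows in a heap; the CLUSTER command can reorder them once, but nothing maintains that order afterwards (names below are made up):
CREATE INDEX orders_customer_idx ON orders (customer_id);
CLUSTER orders USING orders_customer_idx;  -- one-time physical reorder; takes an exclusive lock
-- later inserts/updates land wherever there is free space, so the clustering decays over time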
Technology choices are complex, so there are always some plausible reasons.
But at this point, I think PostgreSQL should be the default choice for most people for new systems, and then move away if you have some real problem with it. The chance of regret is a lot lower with postgres.
There's a drawback with any choice of technology, but PostgreSQL's are pretty well known.
Replication can be a bit of a pain to set up compared to anything in the MySQL family since the tooling to manage it isn't part of the core project (there are tools out there, like repmgr from 2ndQuadrant).
Similar story with backups, bring your own tooling - again, 2ndQuadrant has a great solution with barman, there's also WAL-E if you want to backup to S3 along with many others.
Uber certainly presented a valid pain point with the way indexes are handled compared to the MySQL family: any update to an indexed field requires an update to all indexes on the table (compared to MySQL, which uses clustered indexes, so only the affected indexes need to be updated). If you have a lot of indexes on your tables and update indexed values frequently, you're going to see an increase in disk I/O.
Someone else can probably come up with a more exhaustive list, but the first two are things I've personally been frustrated with - even with the tooling provided by 2ndQuadrant I still have to admit other solutions (namely Microsoft SQL Server) have better stories around replication and backup management, though the edge is in user-friendly tooling and not so much underlying technology.
On the other hand, PostgreSQL has a lot of great quality of life features for database developers. pl/pgsql is really great to work with when you need to do heavy lifting in the database; composite types, arrays and domains are extremely useful for complex models and general manipulation; full-fat JSON support can be extremely useful for a variety of reasons, as can the XML features; PostGIS is king when it comes to spatial data; and a whole hell of a lot more.
PostgreSQL is hands-down my favorite database because it focuses on making my life, when wearing the database developer hat, a lot nicer. With the DBA hat on it complicates things some compared to other products, but the tooling out there is at least decent so it's not a huge deal.
Regarding replication, the logical replication in pg10 should make this significantly easier. Between that and "Quorum Commit for Synchronous Replication" I'm really excited to see how these pan out in production. Pg doesn't normally hype or talk about (or let into production) things that don't work.
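For reference, quorum commit in 10 is exposed through synchronous_standby_names; roughly something like this (standby names are made up):
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (standby_a, standby_b, standby_c)';
SELECT pg_reload_conf();  -- commits now wait for acknowledgement from any 2 of the 3 standbys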
Logical replication really isn't going to fix the pain points I have with replication from a DevOps/Sysadmin perspective.
The biggest issue with replication in PostgreSQL is restoring a failed master. pg_rewind provides the low-level plumbing to do this and it's somewhat integrated into repmgr now, but it's far from being easy to use compared to something like SQL Server Availability Groups. Being the sole Linux sysadmin / PostgreSQL DBA in my organization means I have to take responsibility for this, since the tools are complex enough that I can't easily throw them over to our Windows admins or SQL Server DBAs in an emergency. This is partially a staffing issue; if the tooling were a little easier to understand and robust enough in fully-automatic operation I could just leave common troubleshooting steps in a runbook, but right now when replication breaks it REALLY breaks.
That's because they are Windows admins and not *nix admins. They probably don't have a good understanding of the command line, and if they do, not of the rich *nix command line.
I have a single pg admin in current project and he can throw most tasks to the Oracle (running on Linux) admins or even junior Linux admins.
While "complex" they are still fairly basic to command line admins.
Anything with less than 3 pipes or one regex should be simple to command line *nix person. Granted to excludes you including a complex AWK script or inline perl execution on the command line instead of in a file like a normal sane person.
The major issue with logical replication is the lack of DDL replication, meaning any changes to your table structure will cause issues until the replicas are updated to match. Also, an all-tables subscription will not automatically pick up new tables; it needs to be refreshed too. At this stage, I'd recommend against using logical replication unless absolutely necessary, because it doesn't really do what people would normally expect it to.
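For context, the PG 10 workflow being described looks roughly like this (connection string and object names are made up), and the REFRESH step is the part that's easy to forget:
-- on the publisher
CREATE PUBLICATION app_pub FOR ALL TABLES;
-- on the subscriber
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator'
    PUBLICATION app_pub;
-- after creating a new table by hand on BOTH sides (DDL is not replicated),
-- tell the subscriber to start syncing it:
ALTER SUBSCRIPTION app_sub REFRESH PUBLICATION;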
Migrating to Postgres will certainly be significantly more work than going to MariaDB for you. But I would still recommend you consider it, because Postgres is safer, more powerful and has fewer edge-case behaviors. If your systems are highly-wedded to MySQL's idiosyncrasies, the difference in cost may be too high for you to do it right now, but I would seriously consider it.
MariaDB compared to Postgres is a difficult choice; both have a multitude of features. MariaDB 10 was a game changer for the MySQL landscape.
Making Postgres 10 as easy to scale as MariaDB 10, or getting MariaDB 10's query semantics to allow the complexity that Postgres 10 allows, would both require trade-offs in management tooling, development practices and understanding.
Neither of them are a silver bullet. They both require effort, like any RDBMS. But both are great choices.
Not so much a reason to not use it, but something to keep in mind: queries such as `SELECT COUNT(*)` tend to be a bit more expensive in PostgreSQL compared to MySQL/MariaDB. This doesn't necessarily mean they're always slower, but it's something you should take into account.
Another thing to take into account is that upgrading between major versions (which looked like minor version bumps before the 10.x numbering change) is a bit tricky since IIRC the on-disk/WAL format can change. This means that upgrading from e.g. 10.x to 11.0 requires you to either take your cluster offline, or use something like pglogical. This is really my only complaint, but again it's not really a reason to _not_ use PostgreSQL.
`SELECT COUNT(*)` is fast only on the non-transactional MyISAM storage engine, which locks the whole table for a single writer and has no ACID support. It's not what people usually mean when they say they want an RDBMS.
When ACID support is required, you use InnoDB, and that has the same `SELECT COUNT(*)` performance as other database engines.
What workload do you have that you need exact count of rows in a big table? Because if the table is not big or if inexact count would suffice, there are solutions (e.g. HyperLogLog).
Since 9.2 Postgres has had index-only scans which can substantially improve the performance of `SELECT COUNT(*)` queries. If you need to get exact counts, and your query plan is not showing an index-only scan, you should try adding an index on the columns you are filtering on to see if you can get the query to run on an index-only scan.*
* <all the usual caveats about attempting to tune query performance>
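A minimal sketch of what that looks like (table and column names are made up; whether you actually get an index-only scan depends on the visibility map being up to date, hence the VACUUM):
CREATE TABLE events (id BIGSERIAL PRIMARY KEY, user_id INT NOT NULL, created_at TIMESTAMPTZ NOT NULL);
CREATE INDEX events_user_id_idx ON events (user_id);
VACUUM ANALYZE events;
EXPLAIN SELECT COUNT(*) FROM events WHERE user_id = 42;
-- hopefully shows a plan along the lines of:
--   Aggregate
--     ->  Index Only Scan using events_user_id_idx on events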
> queries such as `SELECT COUNT(*)` tend to be a bit more expensive in PostgreSQL compared to MySQL/MariaDB
As far as I know, that's true when you use the MyISAM storage engine (which is non transactional), but not when you use InnoDB (which has been the default for years now).
Hm... looks like it's mainly focused on distributed stuff. I would hope they also focus on the bare essentials.
I've recently started working with PostgreSQL and stumbled upon a problem with their UPSERT implementation.
Basically, I want to store an id -> name mapping in a table, with the usual characteristics.
CREATE TABLE entities (id SERIAL PRIMARY KEY, name VARCHAR(10) UNIQUE NOT NULL);
Then, I want to issue a query (or queries) with some entity names that would return their indices, inserting them if necessary.
It appears that's impossible to do with PostgreSQL. The closest I can get is this:
INSERT INTO entities (name) VALUES ('a'), ('b'), ('c')
ON CONFLICT (name) DO NOTHING
RETURNING id, name;
However, this (1) only returns the newly inserted rows, and (2) it increments the `id` counter even in case of conflicts (i.e. if the name already exists). While (2) is merely annoying, (1) means I have to issue at least one more SELECT query.
We like to mock Java, C++ and Go for being stuck in the past, but that's nothing compared to the state of SQL. Sure, there are custom extensions that each database provider implements, but they're often lacking in obvious ways. I really wish there was some significant progress, e.g. TypeSQL or CoffeeSQL or something.
I'm having trouble imagining why you need this. Often when people come up with "interesting" scenarios involving UPSERT, they are trying to use the database as persistent memory for some algorithm, rather than as a relational database. What are you actually trying to do? Why would it be insufficient to do something like:
BEGIN;
INSERT INTO entities (name)
(VALUES ('a'), ('b'), ('c') EXCEPT SELECT name FROM entities);
SELECT id, name FROM entities WHERE name IN ('a', 'b', 'c');
COMMIT;
That still executes two queries under the hood. The basic issue is a misunderstanding of what INSERT RETURNING does.
I've actually seen this specific misunderstanding many times on forums and on IRC. I wonder if it'd be possible to change the syntax to something like "RETURNING [ ALL | NEW ] * | output_expression [ [ AS ] output_name ]", where an empty specifier would mean "NEW" and be equivalent to the current semantics, while "ALL" would also include rows that were not inserted due to a conflict. "ALL" would have no different meaning with "ON CONFLICT DO UPDATE" or when no conflict resolution was specified.
So you're saying that a SELECT statement that happens strictly after an INSERT statement (on the same connection) can possibly not see the data that was written by the INSERT statement?
I guess that would be possible, but I would be very surprised if that were so... I would also expect writes to rows to be atomic (i.e. you wouldn't see a row with half-written data, or a row that has an `id` but not (yet) a `name`). Again, that kind of behaviour would be possible, but surprising to the point of the DB being unusable.
This is why almost all ORMs create a transaction and have a "unit-of-work" that encompasses a single transaction.
Race conditions are always surprising. Fortunately, we have a simple remedy for that: transactions. :)
Suppose you have another connection and it runs DELETE FROM t WHERE id = (SELECT MAX(id) FROM t). If that winds up in the middle, it could screw things up for you. Is it likely to happen? No, but again... the remedy is simple. Also, the scenario you describe seems simple enough, but what happens when you bulk insert a few thousand rows?
By the way, with Postgres, there really is no way to not have a transaction. If you don't issue BEGIN, it treats the statement as a small transaction. So you're getting the same overhead on a single statement, if your plan was to somehow avoid the overhead.
If this makes you question the database's usability, I have some bad news for you: all ACID compliant databases work this way. The "atomic" unit is the transaction, not the connection. The consistency guarantees pertain to the transaction, not the connection. This is not a weird thing about Postgres, this is how they all work (except, of course, that MySQL can't always roll back aspects of a transaction, like DDL).
Most people do read (sometimes even in another thread / other connection) -> transaction -> write.
Also, the default transaction isolation level in PG would still allow inconsistencies like phantom reads, non-repeatable reads and serialization anomalies.
There's a difference between not using transactions because you don't know better and not using transactions for reasons that arise out of a deep understanding of what is going on. 99% of the time, it's the former. The argument that transactions aren't perfect is a shitty reason not to use them, the kind of bad advice that doesn't hurt you today, but may hurt you in the future.
Well, I'm not against using transactions; I use them a lot.
But most often I read data on other threads where I do not need a transaction. Especially for validating data or just fetching data there is not a big need for transactions, and especially if your data can be dirty (not 100% accurate).
Basically, yes, you need transactions to guard against dirty reads, but most often you don't care about dirty data. If you're not dealing with financial data or data that needs to be handled with care, say you have a timeline with posts, you don't care if the last read was dirty, and the user probably doesn't care anyway; they'll just refresh the page once they need more up-to-date data.
For now. There are optimizations planned for the query planner that would collapse this into a single query in upcoming releases.
Some commercial databases already do that.
INSERT RETURNING only returns the values of the changed rows, but it does not return anything if you redirect writes with triggers or to child tables, including partitioning.
I'm having trouble justifying semantics that return rows which haven't been inserted or updated.
Actually, I find SQL pretty nice. It's strongly typed in non-MySQL engines. It's pretty expressive, and can be highly optimized without having to update queries. Analytical features found in non-MySQL databases, such as window functions, are also invaluable for many types of analysis.
The biggest issue I've ever had is that nesting queries when doing some analytical work can be a pain. WITH does handle this to some extent, but it doesn't seem quite as nice as something like Pig, where results can "flow" through the query.
The other issue is transformations like pivot tables where you don't know the column names ahead of time. pg has extensions for that, but they aren't quite as nice as a native SQL solution would be.
If I understand what you're looking for, you can likely accomplish this using a WITH expression, something along the lines of (untested):
WITH
input_rows (name) AS
(VALUES ('a'), ('b'), ('c')),
insert_rows AS
(INSERT INTO entities (name)
SELECT ir.name
FROM input_rows AS ir
ON CONFLICT DO NOTHING
RETURNING id, name)
-- returned when insert was successful
SELECT 'new' AS src, i.id, i.name
FROM insert_rows AS i
UNION ALL
-- return when there's a conflict
SELECT 'existing' AS src, e.id, e.name
FROM entities AS e
JOIN input_rows USING (name);
As for the sequence incrementing, that's the nature of sequences: it's a trade-off between getting fast, unique values versus the overhead required to track whether the id is actually used.
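You can see the gap behaviour directly with the entities table from upthread; assuming 'a' already exists, something like this should show the counter advancing even though nothing was inserted:
INSERT INTO entities (name) VALUES ('a') ON CONFLICT (name) DO NOTHING;  -- conflicts, inserts nothing
SELECT currval('entities_id_seq');  -- the default sequence still handed out a value in this session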
I think this is a fair assessment, though it's not like the postgres team is known for "move fast and break things" - they do slow and incremental. Plus, in the press release a DBA from Yandex says they're excited about it so we should know soon :-)
I think the parallelism can be trusted. This is just more work on the infrastructure released in 9.6. Partitioning on the other hand is something which is new in this release.
I'm super excited for the better parallelism, and all the little quality of life improvements (like enums / uuids being able to participate in exclusion constraints). There was an amazing amount of work that has gone into this release.
A heartfelt thank you to all PostgreSQL contributors for such a wonderful solid, well documented and versatile tool. One of my absolute favorite pieces of software.
Just upgraded and it's been rocky so far. (I'll spare you the upgrade details, just make sure you don't try to upgrade using apt if you initially installed the enterprisedb.com bundle)
But now that I've got v10 running I'm unable to do a fairly simple CREATE TABLE command that worked fine in 9.6: it maxes out my RAM and my swap until the db crashes. (which never happened once in 9.6)
(This is not directed at you personally, but a lot of people just complain about bugs on public forums like this one without filing a bugreport, thus not giving the developers a chance to address the problem. I'm guilty of this myself way too often.)
You can practice with pg_upgrade as well: back up your PG_DATADIR, restore it on your new machine, and run the upgrade. I typically don't bother doing it on a separate VM myself; I just run the upgrade without the --link flag so it copies all the data files and log segments, test, and then open it up.
I have 100GB of data in Postgres 9.2 on a single machine. I have it backed up to Tarsnap, and a tonne of spare disk space locally, and can make a snapshot of the machine at the beginning.
I really need to know: What is the general flow for upgrading... can I go from 9.2 to 10? Or do I need to install and migrate to every intermediate version? If there are multiple approaches to this, which is considered the safest (even if the downtime is a little longer)?
Then I create a new server and install the new version of Postgres, apply all custom configurations, close the database to external access, use pg_dump / pg_dumpall to create logical backups, restore them to the new server, configure a new replica, and test my application internally; then, when I'm happy, I'll turn everything back on for external access.
> I have 100GB of data in Postgres 9.2 on a single machine. I have it backed up to Tarsnap, and a tonne of spare disk space locally, and can make a snapshot of the machine at the beginning.
Are you backing up the database data directory via Tarsnap while the database is running? If so, it's possible your backup isn't a consistent snapshot. Snapshotting the data directory is an option, but you can't blindly do it without stopping the cluster first.
If you haven't already, I'd suggest setting up automated jobs to backup the database via pg_dump for a logical backup. That'd be a fallback in case you run into any issues upgrading the database directly.
Not necessarily true. Postgres is designed to always have a consistent (or rebuildable) on-disc format. That is, at an instant the entire fileset is consistent with itself.
The main problem when backing up the entire database directory is that a backup program will read the files over an extended period of time, and you can't guarantee that Postgres won't have changed stuff in-between.
The main problem is with the WAL. Everything[1] gets written to the WAL before going to the actual database tables. Stuff that has definitely made it to the actual database tables then drops off the end of the WAL, because Postgres knows that it is safe.
However, you can tell Postgres that a backup is in progress[2], and it will stop stuff dropping off the WAL. Then, no matter how many database writes happen, nothing disappears from the database directory[3]. You can take a database backup by just copying the files. Just make sure you tell Postgres when you have finished taking your backup[4], and it will then drop the old WAL entries. Otherwise, the WAL will grow until you run out of disc space.
When you restore your backup, Postgres will rerun all the actions recorded in the WAL in the state it was in when the backup program read it.
[1] Well, not necessarily everything, but close enough.
[3] Well, yes, stuff will change in the database directory, but the WAL says that it is going to change, and that record is still there, so it's all good.
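Footnotes [2] and [4] presumably refer to the low-level backup functions; a rough sketch of the flow being described (the label is arbitrary):
SELECT pg_start_backup('nightly');  -- [2] tell Postgres a file-level backup is starting
-- ... copy the whole data directory with tarsnap/rsync/tar while the server keeps running ...
SELECT pg_stop_backup();            -- [4] let Postgres drop the old WAL again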
I'd go with a pg_dump to a plaintext SQL file, then run `psql < backup.sql` and see what happens. I'd run pg10 concurrently in a docker container or something to try before trying it live.
If 10 fails, try 9.6 (docker run postgres:9.6) and do the same, to see if it's an issue exclusively with 10.
I will say that after going through their download form and download button, I learned a bit about how not to design a download form and download button.
Does anyone have a tutorial on how to upgrade the version on Debian? If I install it from apt, the service stays connected to 9.6 and not 10.
Plus, I have to migrate all the configs; is there anything that helps with doing this?
Hmm... I have some tables partitioned using the popular pg_partman extension. I wonder how that will interact with pg 10's new partitioning functionality.
I half expect everything to be borked if I try to upgrade to 10 :/
Good, that makes it less 'urgent' at least. Though, I really hate the clunky way partitioning works in Postgres<10... I might have to take the time to rework my partitioning at some point.
There are tons of ways to do that; the lazy way would probably be to pay someone like Citus (https://www.citusdata.com/product). Alternatively, use a 3rd party tool, do manual sharding, or use the PG FDW, or ... the list is pretty endless. It depends a lot on your needs, requirements, how much sharding you have to do, etc. It's not a simple topic.
First I'd make sure your DB sizes really NEED that before going down that road; it's a lot of hassle, and I'd personally push for just sizing your DB box up and up and up before sharding if at all possible. Simple solutions (read: not sharding) usually win, especially when production falls over and you have to prop it back up.