It's 2017, and despite all this parallel stuff, query planners are still dumb as bricks. Every day I get annoyed at various obvious-to-the-eye WHERE clause pushdown opportunities, or situations where I can manually copy a view to a table, add indexes, and build a query on top of that quicker than letting the database fill up the entire disk with temporary files trying to do it as a subquery. It's utterly maddening if you spend your life on ad-hoc analytics/reporting workloads.
I strongly believe there are huge opportunities for AI to come along and, given a logical description of the data and query, do a better job of physically organising and serving the data. I realise some database management systems do this to an extent, but it's pretty weak sauce in my experience.
I'd like to see your proposal get a bit more concrete. Your request seems a bit more for "magic" than "AI".
You're not likely to succeed with "here's the query and schema" and have AI somehow figure out what the best plan there is. There's a lot of rules to observe for correctness, and you can't just try to execute arbitrary valid plans, because the initial query plans are going to be really bad. So you again need a model of what valid query plans are, and a cost model to evaluate how efficient a plan is, without having executed it.
I think there are plenty of opportunities for ML-type improvements, but they're in individual parts rather than a wholesale query planner replacement. I think the most likely beneficiaries are around statistics collection and evaluation, incremental improvements to query plans, and some execution time -> plan time feedback loops.
> So you again need a model of what valid query plans are, and a cost model to evaluate how efficient a plan is, without having executed it.
Surely query optimizers do this already. What about that sounds like "magic"?
Computers can beat Lee Sedol at Go, it hardly seems like "magic" for a computer to come up with a query plan that is obvious to a trained human.
For example, OP mentioned the case of creating a temporary table with an index instead of emitting a lot of temporary files executing a subquery. Surely you could statically verify that such a transformation was legal, and it seems likely that a cost model could accurately predict that such a transformation would be cheaper. So why doesn't the query optimizer perform this transformation? Presumably because it doesn't yet explore this part of the search space of possible transformations.
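To make that concrete, here's roughly what the manual version of that transformation looks like in Postgres (table and column names invented for illustration):

```sql
-- Materialize the subquery once into an indexed temp table, then join against it.
CREATE TEMP TABLE recent_customers AS
    SELECT id
    FROM customers
    WHERE signup_date > now() - interval '30 days';

CREATE INDEX ON recent_customers (id);
ANALYZE recent_customers;

SELECT o.*
FROM orders o
JOIN recent_customers r ON r.id = o.customer_id;
```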
EDIT: apparently this post is making some people mad. I'm not saying it's easy or obvious, but with ML being applied to so many problems very successfully, it seems very strange to say it's expecting "magic" to suggest that ML could improve on obviously sub-optimal query plans.
We have a cost function and we have a search space of valid transformations. Sounds like a lot of room to experiment with ML-based approaches to me.
> > So you again need a model of what valid query plans are, and a cost model to evaluate how efficient a plan is, without having executed it.
> Surely query optimizers do this already.
Sure. But not in a form that's applicable for ML / AI.
> What about that sounds like "magic"?
It's naming a buzzword and expecting revolutionary results, without actually suggesting how $buzzword can be applied to achieve those results.
I'm asking for concrete proposals because I think it's an interesting field.
> Computers can beat Lee Sedol at Go, it hardly seems like "magic" for a computer to come up with a query plan that is obvious to a trained human.
Just because a technology worked in one field doesn't mean it's trivially applicable to other fields (not that the original application to Go was trivial in the least).
And this is exactly why I hate trendy buzzwords, especially when they involve fields that are large, complex, and been around forever. The Dunning-Kruger effect is running at 11 here.
Assuming that there’s a single “AI” technology that’s used and widely applicable is deeply flawed, as is the idea that all of Google’s computational power for such a peacock-oriented game can make its way into a query planner. Postgres already uses genetic algorithms in its planner, so it’s already got “AI” baked in.
I’m sure there is plenty of room for improvement with query planners, including new paradigms, but you can’t just throw such a generic buzzword at it and have that be intelligible. You might as well say “it just needs better algorithms.” Well, ya. It has zero substance.
Postgres uses GEQO for really long lists of joins. For lots of workloads, if you turn it off, you'll get better performance because the query planning time is dwarfed by the runtime of a bad plan, so you're happy to wait and brute force it. But even then, the constraints in which that brute forcing applies are pretty limited (by the schema and by potentially suboptimal SQL).
I'm sorry that you think my suggestion has zero substance. If you think that the sub-second query plans that Postgres currently generates for ad-hoc workloads are adequate, I wish you luck. I personally think there are lots of missing heuristics that an ML approach could pick up, either in the general case, or over time in a specific database.
I'm interested in how you think they could best be applied. I know I've pushed the threshold at which it switches to the genetic algorithm higher before, to ensure I got better plans for larger join lists. But in what ways could we use ML to generate optimal plans for different query parameters? An optimal plan for one set of parameters could be totally suboptimal for another set. How could we handle that? I'd actually love to see this type of discussion happening on the mailing list.
I don't have the answers - I would honestly like to see training done on real workloads. In at least _some_ situations, it seems like live-calculating statistics would be quicker and I know ML work has been done on estimating statistics more accurately.
I'll be honest, for a lot of stuff I can live without any real intelligence. For my workloads, I'd be happy to wait for the query planner to gain full knowledge of the physical layout of the database before trying anything. My other main annoyance, as mentioned, is WHERE clause pushdowns. I think there is a simple, if not efficient, decision procedure for this stuff and I'm happy to wait for the planner to do the work.
Are the pushdown problems you're experiencing with postgres, or something else? If with PG, I'd very much welcome you to report them. While we know about some weaknesses of the predicate pushdown logic (some more complicated group bys, lateral + subqueries), it'd be good to know about more of them.
FWIW, to me pushdown doesn't sound at all like a field where ML would be helpful...
Yeah, I'm not really proposing some specialised ML algorithm within Postgres to handle these edge cases, I'm more hoping that someone comes along with an entirely new paradigm for describing data and queries, that allows a platform to evolve on top of that for maximum performance given the described workload.
In a particular case that annoyed me this week, casting an IN clause with a subquery to an array (on a query over a view with a GROUP BY) resulted in wildly better performance, despite setting the statistics target to the maximum value, indexing everything etc. I'm happy to demonstrate a trivial reproduction of this issue, and I'm sure I'm a crap DBA, but every day it's the same fight - SQL is _not_ a declarative language. It is incredibly sensitive to implementation details, in ways that are far less obvious than in any other programming language.
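For the curious, the rewrite was essentially this shape (names invented; the real case was a view with a GROUP BY underneath):

```sql
-- Slow for me: the planner picked a terrible plan for the IN (subquery) form.
SELECT *
FROM order_totals_view
WHERE customer_id IN (SELECT id FROM flagged_customers);

-- Wildly faster after forcing the subquery into an array:
SELECT *
FROM order_totals_view
WHERE customer_id = ANY (ARRAY(SELECT id FROM flagged_customers));
```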
Postgres could do a better job in all sorts of situations, but I personally think that finding all these edge cases algorithmically or analytically is futile. A platform built on a paradigm more closely aligned to the logical descriptions of data and queries could, over time, learn to perform better.
Yes; as we all know, superoptimizers were revolutionary and now form the core of compilers.
Also, superoptimizers are stochastic search algorithms, which are only tangentially related to ML, which is again only tangentially related to AI. The relation between AI and superoptimization is close to nil.
I wonder how hard it would be to give Postgres a pluggable query planner, so that extensions can try to improve on its results. Then people could try things out without Postgres committing to something too soon.
I'm still learning what Postgres extensions are capable of. I've written a few but only to define functions & types. It seems like I've seen more "interesting" achievements though, like CitusDB seems able to rewrite queries. Can an extension even add syntax?
> I wonder how hard it would be to give Postgres a pluggable query planner, so that extensions can try to improve on its results. Then people could try things out without Postgres committing to something too soon.
You can quite easily replace the planner, it's just a single hook. Building a replacement's the hard part ;)
> Can an extension even add syntax?
No, not really. There's no good way to achieve extensibility with a bison based parser. At least none that we PG devs know and "believe" in.
What I'm saying is that I, as a human, can spot simple situations under which semantically-correct optimizations are available. I can also create queries where literally cutting and pasting IDs over and over again into a WHERE clause is quicker than joining or using an IN clause. These things are dumb.
I personally think there are massive opportunities for data description languages and query languages that make expressing a single set of semantics simpler. SQL is supposedly a declarative language, but almost always requires you to understand _all_ the underlying implementation details of a database to get good performance.
Beyond that, I'd be more than happy with ML that enhanced, heuristically, some parts of the query optimiser. Hell, I'd be happy with a query optimiser that just went away and optimised, 24 hours a day, or at least didn't just spend 19ms planning a query that is then going to run for 8 hours.
ML with human-in-the-loop (HITL). There is a feedback loop that learns based on human expectation. Peloton[1] is already learning based on data access patterns, but doesn't consider subjective feedback. Maybe there could be a query planner that learns which query plans are better via reinforcement learning (the reward is given by the human).
I think named entity recognition may help if detailed labels were used. The parser can already label the where clause, select clause, etc. However a component of a where clause in a given query may be better executed as a join clause. Given enough training data, NER should be able to recognize the join clause within the where clause in that specific context and the resulting labels would give the optimizer a hint on how to rewrite that specific query.
Now I have no idea how much training data you would need or if a single, trained model would work for most use cases. But I think it could work in theory.
If you want to take part in a community effort to fix this problem, consider contributing to the Apache Calcite project. It has a relational algebra that spans many different databases and dialects, including NoSQL databases like MongoDB and Elasticsearch. I haven't been very involved with the project directly, but I've worked on several projects that heavily depended on it, and have written extensions for those projects.
I took a peek into calcite some time ago and was disappointed. The project seems to be overly bureaucratic, and the code quality quite low. (Side note, I've noticed a steady decline in quality of Apache projects for at least the past decade or so.)
I wanted to perform some introspection on queries in-flight, and potentially re-write them, but this seemed like an awful chore with calcite. One, calcite didn't implement my dialect -- it looked non-trivial to add new dialects. Two, going from query to parse tree (with metadata) back to query doesn't seem to be something that's intended to be supported.
Calcite is really a replacement frontend with its own SQL dialect, where you can tweak the compiled output (SQL, CQL, etc.). It has its own warts and is not generally applicable to other projects. It works well within the Apache ecosystem but I don't envision much adoption outside of it (especially non-Java, like Postgres).
If you were writing a Tableau replacement, you might consider using Calcite to generate your queries. I don't see many other use cases.
I don't think it is fair to say the code quality is low. There are a lot of complex concepts in the optimizer and the community is always working to add tooling and documentation to make it easier for newer developers.
There are some organizational quirks to how the code is written. It has been around for a long time, so it still has some design elements from working within older Java versions. Like many projects it is reliant on integration tests, likely more than would be desirable.
As far as the wider Apache ecosystem, I know there are projects that get by with pretty low code quality. Unfortunately there aren't many central voices at Apache enforcing specific policies around how codebases are managed, they are more focused on community development. I think they may be better off trying to take a closer look at codebases during incubation. Then again, there is no excess of resources waiting around to review code, and a lot of budding communities that want to take part.
> Side note, I've noticed a steady decline in quality of Apache projects for at least the past decade or so.
That's a weird trend to notice, given that none of the same developers are working on "Apache projects" generally. The ASF isn't like e.g. Mozilla; it isn't a monolithic org with its own devs. It's a meta-bureaucracy that takes ownership of projects for legal (copyright + patent assignment) reasons, and then offers project-management charters to member projects, giving them process and structure for contribution without a canonical owner.
An ASF project is sort of like a "working group" (e.g. Khronos, WHATWG), except that it's usually one or two large corporations and a bunch of FOSS devs, rather than being exclusively a bunch of corporations.
---
On the other hand, if there is a trend, it might be because of the increasing reliance on the "open core" strategy for corporate FOSS contributors to make money.
My own experience is with Cloudant, who contribute to Apache CouchDB, but also have their own private fork of CouchDB running on their SaaS. Originally, Cloudant did what they liked in their own fork, and then tried to push the results upstream. The ASF didn't like this very much, since these contributions weren't planned in the open, so Cloudant increasingly tried instead to mirror its internal issues as ASF Bugzilla issues and get public-member sign-off before going forward on private solutions. Which is kind of a shadow play, since many of the founding members of the CouchDB ASF project either have always worked for Cloudant, or have since been hired by Cloudant, so it's Cloudant employees (in their roles as ASF project members) signing off on these Cloudant issues. But it still slows things down.
A good comparison that people might be familiar with, with the same dynamics, is WebKit. The WebKit project has its own separate development organization, but most of the developers happen to also work for Apple.
Previously, WebKit was Apple and Google, but even two corporate contributors were too big for the small pond. Which, to me, shows that they were each there expecting to dominate the decision-making process, rather than find consensus with the FOSS contributors; and having an equally-powerful player that they had to form consensus with was too much for them.
How's that dispositive of a declining average code quality? ASF has influence over which projects get incubated, and even some over how projects are managed.
This is interesting. When you see these 'obvious to the eye' opportunities, do you stop and treat them as an exercise in generalizing the optimization for a query planner?
I'm curious how much query planners have to make tradeoffs between effective optimizations and not overloading the analysis phase.
It's not that simple. Server side prepared statements are used to great effect in OLTPish scenarios too.
A serviceable approach would be to perform additional optimizations based on the query's cost - but that's hard to do without incurring a complete replan, because of the bottom up dynamic programming approach of a good chunk of the planning.
I care about the total amount of time until I get to do real work with the data returned. Subsecond query plans for ad-hoc workloads are pointless for me if I then wait 10 hours for results.
Here's a concrete example: a view with a GROUP BY. Give it a concrete ID to filter by, and it will push it down to the underlying table and return quick results. I can then script a loop to do this to get the full dataset for a list of IDs. However, if I supply a `WHERE id IN (...)` or a JOIN, the query plan will be totally different and will take forever. This is dumb, and this is with up-to-date statistics etc. I'm happy to accept I'm probably not the target market, but just having the option to leave the query planner running for minutes instead of milliseconds would be great (if it indeed has work to _do_ in those minutes).
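The shape of it, roughly (names invented):

```sql
CREATE VIEW daily_totals AS
    SELECT account_id, date_trunc('day', created_at) AS day, sum(amount) AS total
    FROM payments
    GROUP BY account_id, date_trunc('day', created_at);

-- Fast: the filter gets pushed down below the GROUP BY.
SELECT * FROM daily_totals WHERE account_id = 42;

-- Slow in my case: a completely different plan that aggregates far more than it needs to.
SELECT * FROM daily_totals WHERE account_id IN (42, 43, 44);
```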
That was going to be my next question: is there any tunability to the query planner? Obviously you're going to be more tolerant of planner overhead than someone maintaining the backend for a big CRUD site.
ETA: some quick googling suggests no, since most tuning documentation is on the initial query or the indexing.
For example one of the expensive steps is exploring the possible join tree - deciding in what order the tables will be joined (and then using which join algorithm). That's inherently exponential, because for N tables there are N! orderings. By default PostgreSQL will only do exhaustive search for up to 8 tables, and then switch to GEQO (genetic algorithm), but you can increase join_collapse_limit (and possibly also from_collapse_limit) to increase the threshold.
Another thing is you may make the statistics more detailed/accurate by increasing `default_statistics_target` - by default it's 100, so we have histograms with 100 buckets and up to 100 most common values, giving us ~1% resolution. But you can increase it up to 10000 to get more accurate estimates. It may help, but of course it makes ANALYZE and planning more expensive.
And so on ...
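Concretely, for an ad-hoc session that kind of tuning might look like this (the settings are real, the table/column names and values purely illustrative):

```sql
-- Do exhaustive join-order search for bigger join lists, and never fall back to GEQO.
SET join_collapse_limit = 16;
SET from_collapse_limit = 16;
SET geqo = off;

-- More detailed statistics for one particularly skewed column, then refresh them.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;
```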
But in general, we don't want too many low-level knobs - we prefer to make the planner smarter in general.
What you can do fairly easily is to replace the whole planner, and tweak it in various ways - that's why we have the hook/callback system, after all.
My default statistics target is 10000. I have turned GEQO off entirely, although I rarely hit the threshold at which it's relevant. But there's still no way of telling the planner that I am sat, in a psql session, doing an ad-hoc query with a deadline. The fact that there are situations where I have to disable sequential scans entirely proves to me that Postgres doesn't care about performance for interactive use. Simple SELECT count(*)s could save weeks of people's lives, but there's no way to tell the planner that you don't mind leaving it running over a coffee break.
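The blunt hammers I end up reaching for look like this (session-level, table name invented):

```sql
-- Per-session workaround when the planner insists on a sequential scan:
SET enable_seqscan = off;
EXPLAIN (ANALYZE) SELECT count(*) FROM events WHERE status = 'active';
```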
Please report some of these cases to the list. We're much more likely to fix things that we hear are practical problems than the ones that we know theoretically exist but are much more likely a waste of plan time for everyone, not to speak of development and maintenance overhead.
I find "disable sequential scans entirely proves to me that Postgres doesn't care about performance for interactive use." in combination with not reporting the issue a bit contradictory. We care about stuff we get diagnosable reports for.
A lower-hanging fruit may be a dynamic profiler that continuously analyzes query patterns, and automatically reorganizes indexes, table clustering, partitioning, and query planning based on the actual database activity.
This could even be something that operates as a layer on top of Postgres, although it would probably need more precise data than what is currently available through stats tables, and since Postgres doesn't have query hints there's no way to modify query plans. It would also need to look at logs to see what the most frequent queries are, and possibly require a dedicated slave to "test run" queries on to gauge whether it has found an optimal solution (or maybe do it on the actual database during quiet hours).
> a dynamic profiler that continuously analyzes query patterns, and automatically reorganizes indexes, table clustering, partitioning, and query planning based on the actual database activity.
Kinda wish someone would package up capabilities like these as nicely as the SQL Server Profiler and Index Tuning Wizard. But even so, real fixes almost always lie a level above - you can write semantically identical SQL (just like any other language) that has wildly different performance characteristics. That's annoying for a supposedly declarative language. And nothing will save you if you have just messed up your schema - something that tells you to denormalise stuff to get it on the same page of data because of such-and-such a query load would also help.
I've always wondered why there doesn't exist something like this, but architected like Apache's mod_security: something you run for a while in "training" mode to let it learn your usage patterns, which then spits out a hints file (in mod_security's case, a behaviour whitelist; in this case, a migration to add a set of indexes + views).
As a bonus, it could (like mod-security) also have a "production mode" where the DB consumes that hints file and applies the instructions in it as an overlay, rather than as a permanent patch to your schema; that way, you'd be able to re-train after changes and then swap out the hints in production by just swapping out the file and HUPping the DB server (which would hopefully do the least-time migration to turn the old indexes into the new indexes), rather than needing to scrub out all the old indexes before building new ones.
This is also exactly what Microsoft SQL Server has done _since 2005_ with dynamic management views. It shows on a running system what indexes would be optimal for an actual workload.
There is a lot of somewhat-researchy work on this topic, the primary one being the effort at Microsoft Research, called AutoAdmin, that started about 20 years ago. They looked at automatically creating indexes, views, histograms, and stuff like that. Peloton@CMU is a more recent incarnation of the same idea with newer tradeoffs.
Although it might sound easy, this turns out to be a very challenging problem for many technical reasons.
A primary reason for it not getting traction in practice is also that database administrators don't like an automated tool messing with their setup and potentially nullifying all the tricks they might have played to improve performance. This was especially true in big DB2/Oracle deployments, but is increasingly less true, which has opened it up for innovation in the last few years.
I would prefer they worked on something much simpler for now: the ability to mark a table as "log all full table scans on this table unless the SQL query contains the string xxxxx".
This would make it easy to catch missing indexes without logging all queries and without the nightly batch updates which are expected to do full table scans.
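There's nothing quite like that today as far as I know; the closest cheap approximation is to watch the statistics collector for tables that keep accumulating sequential scans, something like:

```sql
-- Tables sorted by how many rows sequential scans have read since stats were last reset.
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 20;
```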
Are the statistics wrong on your data? Or is it that the subqueries modify the data set such that the statistics are unknown for the outer query?
Arbitrary ad-hoc query support is hard: it feels like you want indices on everything, but then data insertion is dog slow. Even with perfect stats I've seen Vertica and PostgreSQL order aspects of queries wrong, so I know the pain you speak of. I've not convinced myself that it's always solvable, or obvious until you do it wrong, though; I never spent much time EXPLAINing queries before executing them, or EXPLAINing "fast" queries.
It's not a big problem to build a query planner that generates close-to-perfect plans. The problem is the amount of time it would take for that query planner to generate a plan.
Let me decide the amount of time, then. If I can cut and paste a bunch of queries together to make them faster than the current planner, I'd be happy to let the query planner run for tens of minutes.
I feel like they’re burying the lede a bit in this one:
“PostgreSQL 10 provides better support for parallelized queries by allowing more parts of the query execution process to be parallelized. Improvements include additional types of data scans that are parallelized as well as optimizations when the data is recombined, such as pre-sorting. These enhancements allow results to be returned more quickly.”
This sounds HUGE. I want to see detailed benchmarks and examples of the kind of workloads that are faster now.
We've recently moved from Oracle (after using Oracle for 15 years) and we have found PostgreSQL's documentation to be superior to Oracle's, and its SQL dialect to adhere closer to the standard and to make more sense (Oracle has been around for a long time and has a lot of old and non-standard SQL built in). Overall things are simpler, yet everything runs as fast or faster than it did on Oracle. Better price, better performance, better documentation, better SQL, better install process... what's not to like?
> yet everything runs as fast or faster than it did on Oracle.
Be careful with that statement, Oracle’s license disallows users of the database from making benchmarks or any kind of performance comparisons (another reason to move to PGSQL)
The beauty of course being that the mere existence of such a clause tells you more than enough about the performance characteristics.
Now, I'm not going to discount Oracle entirely. From what I've learned from some truly hardcore DBAs, there are certain workloads and scenarios where Oracle is still unbeatable.[ß] But outside those special cases, postgres is the rational, correct choice.
ß: For example, if you have insane performance requirements on a single system, it has been explained to me that Oracle can be used with its own "fs" layer, which in turn runs directly on top of the block device layer. Essentially, a bare-bones file system designed for nothing but their own DB usage.
I'll play devil's advocate: it's easy to benchmark, but it's hard to demonstrate that the outcome is significant, accurate, or even meaningful. And for whatever reason[0], misinformation tends to be stickier than truths. If I were Oracle, I'd get tired of dispelling bad benchmarks real fast, so much so that I'd go as far as prohibiting publishing benchmarks in the ToS, just to avoid dealing with it.
For example, APFS vs. HFS+ (filesystems are a kind of hierarchical database, so it's not that big a stretch). Multiple secondary sources, citing the same outdated benchmark from months ago, declare that APFS is slower than HFS+. Here's one from HN[1], with some back & forth arguments and further ad-hoc measurements. Yet nobody bothered to run or even mention fio.
Or the "ZFS will eat your data if you don't use ECC memory" meme that refuse to die.
[0] Maybe it's because, in order to argue that "X is bad" is wrong, it's hard not to state "X is bad" first, and biased / less astute audiences tend to stop paying attention after absorbing "X is bad"[2].
Imagine if CPU makers prevented you from running PassMark.
Hell, imagine if car's manufacturer warranty was voided if you took it to a dynamometer.
Why do we as a society allow absurd rent-seeking practices like these? What possible social good could come from preventing discussion about an enterprise software product?
> Hell, imagine if car's manufacturer warranty was voided if you took it to a dynamometer.
Ferrari has the right to force you to give your car back if you do anything unauthorized with it, including any modification, or any unapproved testing. This was why Deadmau5 had to give back his Purrari (he modified the logo and theme to a nyancat theme)
This sounds like bullshit to me. If you own the car you have every right to modify it. It's your property.
I can see Ferrari not selling you any new cars because you did something they don't like, but I don't see how they have the right to force you to give your car back (which you paid for) if you modify it.
It's not a license. It's a lease and the car is yours at the end of the lease period.
> The lucky few selected to get an F50 would make a down payment of $240,000. Yes, that's right: a down payment of $240,000... on a car. After that, monthly payments were $5,600. Five thousand. Six hundred. Per month. For 24 months. And then, at the end of the lease, you owned the car -- assuming you could come up with the final payment of $150,000. Total payments over the 2-year span: $534,400. Only then could you resell your F50 and make money off the wild speculation.
Worth noting that (I'm 99% sure) you're absolutely allowed to benchmark Oracle, just not publish the results publicly. It's pretty common for DB papers to compare against postgres and a number of "mystery" databases because the authors' license agreement with Oracle prevents them from naming it.
I think the purpose is to try and prevent benchmarks being posted that are on improperly configured or optimized systems and therefore misleading, not to "hide" the performance characteristics of the database. That said, Oracle sucks and I can never imagine using their software.
It is, but it is also a very true one. I've seen way too many terrible benchmarks of MySQL and Postgres to remember them all, because of misconfiguration, not understanding best practices which apply to each, etc. I can see why they'd restrict them.
Even if they are an ex-user of Oracle products, they would have had to benchmark the software at a time they were using it, or have someone who is using it benchmark it. Either way a benchmark would be made at some point while they were an Oracle product user, which, taking kuschku's word for it, is against the license agreement.
Looking at the license agreement, it doesn't prohibit running benchmarks and recording their results, so if you ask me they wouldn't be breaking the agreement at the time they were bound by it.
(Of course nobody in danger of getting involved in a court case with Oracle is going to be asking me.)
Unbelievable. Unfortunately Oracle will stick around thanks to the non-technical people in large organisations that control the money and like products they can "trust".
1) Oracle is really lacking in modern features/usability. Features were frozen in roughly 1999 and are still pretty much the same (mostly). (You are STILL limited to 30 chars for a table name FFS). They do add new stuff from time to time but anything existing isn't modified. Works but not fun to work with.
2) MSSQL needs NOLOCK everywhere (I've seen codebases with this on EVERY query). The default locking really sucks (see the sketch after this list). I'm sure a DBA can make the locking sane system-wide, but I've never seen this on any of the DBs I've worked with. Also, SQL Manager is a PITA IMHO. Toad it is not. Doing almost all DB interactions via a 1 GB Windows-only install is a "bad idea".
3) MySQL is nice but will just silently eat your data from time to time. Auto truncate is evil, as is missing enums. These have both hit me multiple times in production. Note: Not sure if this is still the case since I avoid it now for this reason.
4) Postgres. Lots of nice features and easy to work with, but the optimizer will sometimes do something silly. Sometimes you have to cast your bound variable just to use an index. (id=? => id=?::NUMBER just because you do a setObject in JDBC)
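Re point 2, the NOLOCK pattern I mean looks like this in T-SQL (table names invented), and the usual system-wide fix I've heard of is flipping the database to read-committed snapshot isolation:

```sql
-- The hint sprinkled on every query (effectively READ UNCOMMITTED):
SELECT o.id, o.total
FROM dbo.Orders o WITH (NOLOCK)
JOIN dbo.Customers c WITH (NOLOCK) ON c.id = o.customer_id;

-- The saner database-level alternative a DBA can enable:
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON;
```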
Thanks, I didn't realize they finally improved that.
Of course I doubt I'll ever notice, since I doubt my current employer will ever upgrade that far. We have currently frozen our large Oracle database as a way to force long-term migration off it.
Pretty much everything is moving to Postgres on AWS with a bit of other databases thrown in for spice.
> MySQL is nice but will just silently eat your data from time to time. Auto truncate is evil, as is missing enums. These have both hit me multiple times in production. Note: Not sure if this is still the case since I avoid it now for this reason.
Looks like I am forced to use MySQL (or one of its variants) in the near future. This thing about MySQL eating data is a statement I have read occasionally. Is there any way to identify and be wary of use cases where this could happen? Would there be any more thorough documentation of this issue anywhere?
Modern MySQL is extremely well suited for data that cannot be lost - as is I'm sure, Postgres.
That said, if you're pre 5.6, I _strongly_ suggest upgrading to 5.6 or all the way to MariaDB 10. The performance, safety, stability, etc have skyrocketed in recent years.
Older versions would silently allow you to insert 75 chars into a 50-char column. The extra was just gone. Of course, without an error nobody notices until somebody notices the missing data. This is usually an end user in production, and the data is just gone.
You just silently lost all the xx-small values, they have been promoted to different values that exist. (Unless this has been fixed as well). Migration scripts are the real issue as they don't know about any custom values that may have been added out of band.
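For what it's worth, strict SQL mode turns most of those silent coercions into hard errors; older setups just didn't enable it (the sql_mode values are real, the table is made up):

```sql
SET SESSION sql_mode = 'STRICT_ALL_TABLES';

CREATE TABLE shirts (
    label VARCHAR(5),
    size  ENUM('small', 'medium', 'large')
);

-- Without strict mode: label is silently cut to 'abcde' and size becomes '' (data gone).
-- With strict mode: the statement fails with an error instead.
INSERT INTO shirts VALUES ('abcdefghij', 'xx-small');
```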
I cannot comment on every single aspect of Postgres vs MSSQL, but there are a few things I like in MSSQL that I don't believe exist in Postgres:
1) SQL Server Management Studio (SSMS) - the default GUI is a decent free front-end that integrates well and lets you do advanced stuff like graphically live-profile queries (live execution plans), easily setup security, build indices, setup various things like linked servers, compression, etc. Although I'm a text-editor sort of person, I don't have SQL syntax memorized for infrequent tasks like creating indices so a GUI (or an IDE) can really help productivity in these instances.
Postgres's default GUI, pgAdmin is comparatively weak, and the good ones are third-party payware.
2) Columnar indices - MSSQL has a fairly good implementation called columnstore indices, which creates a derived column-oriented data structure that speeds up analytic queries quite a bit (see the sketch after this list).
3) Speed - SQL Server is very performant and optimized, and doesn't choke on very large datasets. Postgres is decent, but on my datasets it doesn't seem to be very performant.
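Per point 2, a columnstore index is just another index DDL statement in T-SQL (table invented for illustration):

```sql
-- Column-oriented secondary structure that speeds up large analytic scans and aggregations.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_sales_columnstore
    ON dbo.Sales (customer_id, sold_on, amount);
```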
Also, MSSQL locking is a boon and a bane. It's not the best for environments with high contention, but it is ok for most analytic database use cases. On the other hand, Postgres' MVCC (and oh the vacuuming) can be annoying.
pgAdmin is a surprisingly bad GUI for a database as good as Postgres, yet I see folks recommending it on forums and such. I don't quite understand the reasoning behind that -- I can only surmise that these folks have never used a decent SQL GUI nor experienced how a decent SQL GUI can massively increase their productivity.
I wonder if rewriting pgAdmin in Electron might help.
Others recommend using psql from the command-line. Now I spend most of my time on the command-line, and psql is great for one-off stuff, but when you have to edit, test and profile complex queries, the cmd-line interface just doesn't cut it for rapid iteration.
I think this is a huge gap in the Postgres world. But I also think that DataGrip is very promising. I have a pretty high view of Jetbrains' tools.
What are the good third-party payware GUIs for Postgres? I've looked and never seen any that came anywhere close to SSMS, much less SSMS + SQL Prompt combo.
Navicat is poor. Their buggy "native Linux app" is a horribly slow .NET 2.0 app bundled with Wine; we felt scammed on that but managed to get our money back.
Aside from tooling, those systems often perform much better than PostgreSQL for large queries or transactions, as they feature much better optimizations. Even outside of newer optimizations like "columnar" storage, several of those systems do code generation from queries to avoid function calls, branches, etc., which can have huge performance implications. I worked on the internals of PostgreSQL once, and the number of function calls in the innermost loops was very high.
PostgreSQL also used to be (is?) single-threaded, which limited performance of a single query on multi-core machines -- I haven't looked into it to see if there has been any fundamental change in the architecture in the last 4-5 years.
Yes, I was just reading through that. The server is still single-threaded though -- they are getting the parallelism by starting multiple processes to do independent chunks of work. This makes sense for PostgreSQL, but has some fundamental limitations (e.g., it requires duplicated copies of a hash table to parallelize a hash join).
>The server is still single-threaded though -- they are getting the parallelism by starting multiple processes to do independent chunks of work.
So...it isn't single threaded then? I mean that is exactly how the most advanced competitors operate (Oracle, SQL Server) as well -- a given connection stays on one thread, with the advantages that confers, unless the planner decides to parallelize.
To be technical, MSSQL uses its own bespoke scheduling, and will preempt the thread for I/O. All I/O is non-blocking. The physical thread can vary for this reason. PGSQL really does use synchronous I/O and a single thread though. The former is probably more scalable but the latter has been serving PGSQL fine, too.
In the specific case of hashjoins, it does build them independently right now. There's a patch to rectify that though, by putting the hashtable also into shared memory. The coordination necessary to make multi phase batch joins and other such funny bits work, unfortunately made it infeasible to get into 10.
The biggest issue with SQL Server is that it is myopic. The tooling and everything around it is geared toward only SQL Server. The database itself is also geared around only SQL Server... making it a huge pain to get your data out to use it with something else like Elasticsearch. It's geared towards being comfortable enough to lock you in and hold your data hostage.
For 99% of standard SQL, they all work the same today.
The commercial databases are still faster since they have more advanced algorithms and optimizations, as well as better scale out options and tooling - but Postgres is quickly catching up and will be fine for the majority of scenarios. Postgres also has better general usability with robust JSON support, CSV handling, foreign data(base) access, lots of extensions and other features that help make it a powerful data platform.
Today the real difference will be for companies that have some combination of existing Oracle/Microsoft tools and services, advanced clustering needs, complex security requirements, or a dependency on the more advanced features like MSSQL's in-memory OLTP.
The tools to access Microsoft SQL Server are really good since they have a Visual Studio-based tool. MySQL Workbench is almost at that quality, but I'm not sure what a Postgres alternative would be.
IntelliJ Ultimate has pretty good built in tools as well that handle a lot of different dialects. I've used MSSQL and PG with pretty good results, e.g. code complete that uses the db schemas is really nice to have when exploring databases.
I cut my teeth on Oracle when I was first getting started in technology. All I did was write ad hoc queries all day long. My next DB heavy job was using SQL Server, where I built an analytics engine to do bootstrapping for sparse datasets. After that I used it to run the back-end data layer for a credit card processing company. I used MySQL at a different finance company that was doing similar things but at a smaller scale. Ever since then, I’ve been using Postgres.
Based on that experience, I’d rank them in this order:
1. Postgres
2. SQL Server
3. Oracle
99. MySQL
Postgres often lags behind the others in features, but the dev team chooses their battles wisely and ends up with better implementations.
Postgres is a real beacon of light in the open source world. Solid community. Many projects claim the benefits of open source, but they are never fully realized. Also, because Postgres is not operated by a freemium model, you always have access to the latest and greatest features. The extensibility is fantastic and well-leveraged by the community. I’ve never experienced a case where Postgres tried to figure out what I was doing and decided to do the wrong thing. Postgres fails early and loudly when there’s a problem with what I’m asking it to do, which is exactly what I want it to do. I don’t ever want to have to second guess the integrity of my data.
I haven’t run explicit benchmarks between any of these databases. But when I do similar things across two different systems, I feel like they are generally on par. But like I said, I can’t prove that with any numbers. There are probably specific work profiles that people can come up with that would show better performance for one platform over the other. But I don’t think there’s a realistic difference in performance in general. Not one that’s big enough to push your decision.
The real moment of revelation though, is when you find out that you can run your preferred programming language inside of Postgres. When you actually get to the point that transformations are outside of what you want to do in SQL, and you can just write a Python function and have it execute inside your database instead of having to do I/O, process the data, and then push it back . . . it is life-changing.
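A minimal sketch of what that looks like with PL/Python, assuming the extension is installed (function and table names are made up):

```sql
CREATE EXTENSION IF NOT EXISTS plpythonu;   -- PostgreSQL 10 also ships plpython3u

-- Push a small transformation into the database instead of round-tripping the data.
CREATE FUNCTION clamp(x numeric, lo numeric, hi numeric) RETURNS numeric AS $$
    return max(lo, min(hi, x))
$$ LANGUAGE plpythonu;

SELECT clamp(reading, 0, 100) FROM sensor_readings;
```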
The only reason SQL Server isn’t tied for first place is because of the lack of extensibility and because it’s expensive to use in production. But it is rock solid, and has some nice things that Postgres doesn’t have, like hinting queries. Again, the Postgres community has discussed this, and it may never actually happen, but there are reasonable points as to why not. But it is really handy in SQL Server to be able to guide the query planner on the fly like that. SQL Server has also had solid pub-sub for a long time, though we’re getting that now with this version of Postgres.
I’m not a huge fan of Microsoft in general, but you absolutely have to give them props for their tooling. There is nothing even close to SSMS for any other database system. It is by far the gold standard for a visual interface to your data.
Obviously, if you’re throwing down money, you’re also getting a certain level of support for the product. I’m not convinced this should be a deciding factor between Postgres and SQL Server because, again, the Postgres community is amazing.
I should also point out that there’s a free version of SQL Server that will suffice for the needs of a great many people. Its features are limited (no pub/sub, and there’s a size limit on your total dataset), but it’s totally functional for a lot of use cases. Even though I use Postgres for everything in production, I will always keep a PC around to run SQL Server for one-off things that are just easier to do there.
Oracle is mostly fine. I was so new to everything when I was using it that I probably can’t speak that well to its strengths and weaknesses. Other people who have used it more recently can probably do it better than me. I just can’t for the life of me understand why anyone would pay their prices when SQL Server and Postgres exist, unless it’s for the support contract. And where I’m kind of meh about Microsoft, I’m actively against Oracle and Sun Microsystems. I’m pretty sure that Larry Ellison’s personal motto is, “Just go ahead and be evil.” But that’s kind of a tangent and not really all that relevant.
MySQL is a different animal. It has a different design philosophy than the others. It’s more like MongoDB in principle than the others are. Its main goal is to be friendly to the developer. And to entice you into upgrading to the paid tier.
Which is all fine. But one consequence of that is that it tries really hard to succeed under any circumstance, even if “success” means corrupting or losing your data. So it fails rarely, late, and quietly under certain conditions.
For that reason, I don’t think of it as even being in the same category as the other three. As in, it would never be an option for me, similar to MongoDB. I want my dev tools and programming languages to be focused on the developer. And I want my data store to be focused on my data. I think that this is a fundamental and deadly flaw with MySQL.
Different use cases have different requirement though, so your mileage will vary. I’m an incredible pedant about data integrity because the work I do requires it. There are legitimate cases where it just doesn’t matter all that much.
But in terms of feature parity and performance, they are all pretty close in general terms. Each will have specific cases that they really excel at and have been optimized for.
> The real moment of revelation though, is when you find out that you can run your preferred programming language inside of Postgres. When you actually get to the point that transformations are outside of what you want to do in SQL, and you can just write a Python function and have it execute inside your database instead of having to do I/O, process the data, and then push it back
I'm sorry but I just can't take MSSQL Server seriously when it cannot export valid CSV. It does not escape commas for CSV or tabs for TSV, and has no option to. If they had their own flat file export that could be re-imported I could forgive it because I could write my own library for that format, but there simply is no way to cleanly export data from MSSQL. I had to write a SQL query that was ~2000 lines of 80 columns in order to export a database to CSV and properly escape each field manually, and it took FOREVER.
Not just exporting though. I've seen automated systems that attempt to import csv files directly through SQL Server, and it would always break at quoted fields and fields with newlines. And nobody could figure out why it broke all the time, and fields would be out of order or shifted or missing.
I wasn't able to convince people to fix it but I ended up writing a Go utility to reliably import/export csv into sql server using Bulk Insert for my own use (and sanity). And it ended up being faster than other methods to boot.
> I'm sorry but I just can't take MSSQL Server seriously when it cannot export valid CSV.
“Valid CSV” is a dubious phrase, since the closest thing CSV has to a spec is an RFC that tried to map out the space of the wide variety of implementations then existing.
Anyhow, SQL Server is a database server; there are a wide variety of ETL tools that will export from the server to any common (or not, really) format you like, including just about any flavor of CSV/TSV you might be interested in.
There is no such thing as “Valid CSV”. It’s an ambiguous format, with dozens of variants.
SQL Server supports CSV with exactly the same semantics as Excel. Which is what people expect 99.9% of the time, because most CSV data goes to or from Excel in the real world.
If you’re doing DB-to-DB imports and exports, use a sane file format with a sane delimiter, such as the Unicode INFORMATION SEPARATOR and RECORD SEPARATOR characters which were inherited from ASCII.
Not sure if you're trolling here but... have you tried bcp?
"The bulk copy program utility (bcp) bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files."
A missing small feature here or there is kind of a lame reason to not take a large and well-respected product seriously. That said, bcp and PowerShell (piping the output of Invoke-Sqlcmd through Export-CSV) are among the more common ways that can be done with SQL Server.
If they can't get a fundamentally simple feature right, why would I give them the benefit of the doubt that other features are thought out? bcp does _not_ escape commas or newlines in a field, you have to write sql to replace() those characters on select.
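For the record, the kind of hand-rolled escaping I mean, in T-SQL (hypothetical table and column):

```sql
-- Quote the field, double embedded quotes, and strip newlines -- per column, by hand.
SELECT '"' + REPLACE(REPLACE(REPLACE(notes, '"', '""'), CHAR(13), ''), CHAR(10), ' ') + '"' AS notes_csv
FROM dbo.Tickets;
```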
> I feel like Microsoft SQL Server is the easiest to use and has the best tooling.
Agreed. SQL Server Management Studio is fantastic and is one of the main reasons I enjoy working with MSSQL.
Unfortunately Postgres severely lacks in the tooling department. PgAdmin 3 used to be good, but PgAdmin 4 is simply horrendous. It makes me dread interacting with Postgres.
A filesystem snapshot is only reliable if you stop the database to do it, or if there is some sort of cooperation between the database and the tool that triggers the FS snapshot creation.
File system and LVM snapshots are atomic, so if you cannot get a working database from that your database won't survive a power failure either. You can also do a file system backup of PostgreSQL without atomic snapshots but then you will need to inform PostgreSQL of the backup and then copy files in the right order. The pg_basebackup tool and various third party tools handle this for you.
pgAdmin 3 still exists and afaik remains compatible with newer Postgres releases. You can still use it.
Postage, a tool developed by a family of software devs, was gaining popularity but recently became unmaintained without explanation (afaik). [0]
I started using IntelliJ DataGrip on a trial basis and it's good, but I probably won't pay for it. Sick of paying monthly subscription fee for every little tool I need from JetBrains, especially when I put down a project and don't need that tool for another x months.
Used DBeaver briefly but it's so many clicks just to set up a primary key that I shelved it for now. Will probably come back to it when DataGrip trial is over.
Not a conventional management tool but pgModeler [1] is a cool project IMO. Open-source, but they put a limit on the Windows binaries they distribute to try to get people to fund development. Can build from source yourself, install on Linux, or probably find free third-party builds elsewhere.
I think that most devs are just sticking with pgadmin3.
You can just buy JetBrains' tools; you don't get updates, but you can always use them. (Specifically, if you subscribe even once you get the latest version, and can use it forever, even if you unsubscribe.)
I'm sure DataGrip is better for writing queries in.
I really doubt it's better for managing SQL Agent jobs, doing backup/restore, or for managing users.
I have yet to find anything quite as nice as SSMS for Postgres, but Navicat for PostgreSQL is pretty good. It's not free, but its not expensive and has features I feel are worth paying for such as being able to diff schemas or data across DBs on different servers.
SSMS may well be fantastic, but as the vast majority of our dev team uses Macs, we have been converting our SS databases whenever possible to PostgreSQL over the years and have found only performance benefits, in addition to the (obvious) cost benefits.
We are (much slower than I'd like lol) going through the same migration. We've been using DataGrip for some time for both. I used to run Navicat and occasionally spin up a Windows VM for SSMS, but since moving to DataGrip, found that it does everything I need for both database engines.
Postgres continues to astound me and the dedication of the people working on it is wonderful. Thank you to everyone who ever contributed a line of code or to testing or to documentation. Everything is top-notch!
As someone who's familiar and uses postgres, but not familiar with the more detailed things databases/postgres related where would be a good place to start?
I don't know what I don't know and am not familiar with situations in which the new functionality introduced here should be used.
The manuals are really well written and easy to access. I would start at the release notes from the manual and then look at the corresponding manual pages.
I find they are well written for some things, but not others. They are good manuals for documentation, but not guides. I was trying to read about certain lock events recently, and understand their impacts, and aside from a single reference in the manual there was absolutely nothing.
Not specific to Postgres, but I found that knowing how to write complex and efficient SQL queries can be overlooked but is a very useful skill, and then understanding indexes and query plans allows you to drastically improve performance and thus reduce scaling needs.
Understanding your database configuration is important as well. I spent way longer than I'd like to admit trying to improve the performance of a complex query that would not execute the way I wanted it to no matter what: PostgreSQL absolutely refused to use a calculated index on one table and then do a HASH JOIN to a second, instead devolving into some really hairy execution plan I can't even remember at this point. Turns out I had done everything right in my index design and query, but due to the sheer volume of data the query planner didn't use the "correct" plan because work_mem was set too low.
Always be mindful of how your database is tuned, you could be doing everything right but the query planner will seem to act insane regardless.
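The eventual fix was embarrassingly small; a sketch of it, with purely illustrative values and made-up tables:

```sql
-- Give the session enough sort/hash memory that the planner stops avoiding the hash join.
SET work_mem = '512MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.id, sum(l.amount)
FROM accounts a
JOIN ledger l ON l.account_id = a.id
GROUP BY a.id;
```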
How deep do you want to go? Bruce Momjian has a ton of talks on internals and other advanced topics, so if you want to dig deep I'd highly recommend them:
https://momjian.us/
I fully support learning for learning’s sake. But you could also view the new features as solutions to problems. If you don’t feel you’re having problems then you’re probably not missing out. There is no inherent value in using the new functionality.
If you do have a specific problem then you could ask the community about that and they will help you determine whether there is new functionality that addresses it. Then go read about it in the manual, try it out, and see if it works for you. I find learning from solving my own problems easier than reading a manual cover to cover.
I’ve seen devs thinking like that for decades. Since they already know Fortran 77, which is a Turing complete language, they can solve any computing problem.
Still, I’d argue that it’s valuable to explore new things, at least at the level where you get an overview of which tool to use for which job.
Congratulations to the PG team! Another great release. Does anyone know when will PostgreSQL 10 be available on public cloud (AWS in particular) as a managed service (RDS)?
PostgreSQL 10 will be available in Aiven (managed Postgres and other databases in multiple clouds including AWS and GCP, see https://aiven.io) within a couple of days after we've completed our validation process with the final version.
Native partitioning, as well as the advancements on the replication and scaling side of things look like good first steps for a distributed SQL system.
Can anyone speak to how much closer this brings Postgres to being able to work like Spanner/CockroachDB? Partitioning is great but having to define my own partitions ahead of time isn’t a great solution for us folks who want to be able to spin up new databases on the fly (without the overhead of assigning new ranges for the tables on these new instances.)
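(For context, PostgreSQL 10's declarative partitioning still has you spell the partitions out by hand, something like this sketch with invented tables:)

```sql
CREATE TABLE measurements (
    tenant_id int         NOT NULL,
    logged_at timestamptz NOT NULL,
    reading   numeric
) PARTITION BY RANGE (logged_at);

-- Each partition has to be declared explicitly, ahead of the data arriving.
CREATE TABLE measurements_2017q4 PARTITION OF measurements
    FOR VALUES FROM ('2017-10-01') TO ('2018-01-01');
```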
Obviously CockroachDB has a long way to go before it works as well as a single-instance Postgres DB. But how far does Postgres have to go before it works as well as a multi-instance CockroachDB?
I think PG is already there and beyond if you consider using Citus.
We've got a pretty large deployment in production with 45 TB of data, across 1,280 cores and 4.8 TB of RAM. Most of our tables have hundreds of billions of rows that are under write-heavy pressure. Citus is currently handling around 60k writes per second. Most of our SELECT queries run under 200ms, and in some cases those are fairly complex.
The current setup is a result of long tests done across many different database solutions. We never considered CockroachDB seriously for our production, but we did use or test MongoDB, Cassandra (including ScyllaDB), HBase (including Big Table) and Spanner.
We struggled with most, but we used HBase for a year before we moved to PG/Citus. Our expenses dropped by half once we did, as PG is just better suited for our setup. We tested Spanner fairly heavily as GCP folks tried to convince us to give it a chance. The performance however wasn't even comparable.
Citus has its own quirks and it's not perfect, so definitely do your research before you decide to use it. The best part however is that you get the best SQL engine underneath, with all the powerful features and tools that come with it. We for instance heavily utilize triggers and pub/sub directly from PG. A huge portion of our application logic is done at the DB level through UDFs. The part we like most about Citus, though, is the team behind it. They are incredibly helpful and go beyond the scope of traditional support even when they don't need to.
That's the bit. Not schema-free, but column-oriented instead of row-oriented. I evaluated Citus a while back, comparing it to Impala with Parquet on HDFS and a couple of other systems. Citus was underwhelming in that context, and the syntax for the FDW was awkward owing to keeping base Postgres stock.
Understood. Different use cases. We like schema, gives us an opportunity to keep data in check. We also communicate through protobuf so there is no benefit in schema-free database for us.
I know that Heap Analytics is using jsonb to have schema-free setup on citus and I think it's working well for them.
That's the second time you've thought I was talking about schema free. Column stores have nothing to do with schema free. Completely orthogonal.
Column-oriented stores normally have just as much schema as row-oriented stores. What they're faster at is scans where the full row tuple is not required to compute the result. Storing the data in columnar format rather than row also means much better compression: similar data is brought closer together.
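To make that concrete, this is the kind of query (hypothetical table) where a columnar layout shines: only two narrow columns of a wide row are touched, so a column store reads a small fraction of the data a row store would, and compresses it better too:
SELECT date_trunc('day', created_at) AS day,
       AVG(amount)
FROM orders              -- imagine dozens of other wide columns on this table
GROUP BY 1
ORDER BY 1;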
Column stores are great for some tasks and have disadvantages in others. Redshift was, I believe, the first implementation on top of PG, so it's doable; however, I'm not sure that PG itself is a good engine for it.
In our case, where we need to have access to multiple columns at once, column stores (including the ones that clump them as families/groups) proved to be slower and required more resources than the row-oriented stores.
I think this really goes case by case based on your needs. If we would benefit from a column-oriented DB we would not choose PG/Citus but something else (probably we would have stayed on HBase).
Hi all--My name is Andy and I'm a PM here at Cockroach labs. We are currently working through partitioning as a key feature of our 1.2 release scheduled for April 2018. Our solution can be found at this RFC (https://github.com/cockroachdb/cockroach/pull/18683). One thing to note, we don't require you to define your partitions up front. We'd love to get your feedback on this feature--please feel free to comment on the RFC or reach out to me directly at andy@cockroachlabs.com.
I can't talk about the largest production database, but our largest internal cluster we've tested on so far is 128 machines and hit each one with our own continuous load generators and a chaos monkey (who turns on and off machines at random).
No idea! It’s still in its infancy as a DB. Functional, but probably not worth the switch, unless you really really don’t want to deal with scaling a DB manually (which is where I’m at.)
Right, but to give a meaningful answer to your question one would need to know specifics, right? As an example, a Spanner setup that would perform on par with the largest reasonable PG setup you could have would be in the 100K-150K/month range.
I haven't worked with databases in a while; at my employer we are moving to MariaDB (from MySQL). Is there some reason why we wouldn't be considering PostgreSQL? Is there some drawback to Postgres?
While the other comments talk about the benefits of switching to pgsql (and I concur whole heartedly) it seems no one has addressed your specific case. Your company isn’t really “switching” to anything, MariaDB was a fork of MySQL when the license kerfuffle was going on and there were issues with the stewardship of the project. It’s more of upgrading to a newer release of MySQL than switching to a different database engine.
i.e. “switching” to MariaDB is probably just a sysadmin upgrading the software, with no changes to your code or database queries (unless replication is involved), but pgsql will certainly require more involved changes to the software.
PostgreSQL is really better at executing arbitrarily complex queries, and the documentation is so much better.
PostgreSQL also provides additional features like LISTEN/NOTIFY, Row Level Security, transactional DDL, table functions (like generate_series), numbered query parameters ($1 instead of ?), and many others.
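A quick taste of a couple of those, in case anyone hasn't seen them (channel name and payload are made up):
LISTEN price_changes;                        -- subscribe to a channel from plain SQL
NOTIFY price_changes, 'sku-42 repriced';     -- any session can publish a payload
SELECT n FROM generate_series(1, 5) AS n;    -- table function: one row per value 1..5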
In my opinion, the only reasons for choosing MySQL/MariaDB are if you absolutely need clustered indexes (also known as index-organized tables; PostgreSQL uses heap-organized tables) or if your architecture relies on specific MySQL replication features (for example using Vitess to shard your database).
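For anyone unfamiliar with the distinction: in InnoDB the table itself is the primary-key B-tree, so rows are always stored in PK order. Postgres keeps rows in a heap; the CLUSTER command can reorder them once, but nothing maintains that order afterwards (names below are made up):
CREATE INDEX orders_customer_idx ON orders (customer_id);
CLUSTER orders USING orders_customer_idx;  -- one-time physical reorder; takes an exclusive lock
-- later inserts/updates land wherever there is free space, so the clustering decays over time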
Technology choices are complex, so there are always some plausible reasons.
But at this point, I think PostgreSQL should be the default choice for most people for new systems, and then move away if you have some real problem with it. The chance of regret is a lot lower with postgres.
There's a drawback with any choice of technology, but PostgreSQL's are pretty well known.
Replication can be a bit of a pain to set up compared to anything in the MySQL family since the tooling to manage it isn't part of the core project (there are tools out there, like repmgr from 2ndQuadrant).
Similar story with backups, bring your own tooling - again, 2ndQuadrant has a great solution with barman, there's also WAL-E if you want to backup to S3 along with many others.
Uber certainly presented a valid pain point with the way indexes are handled compared to the MySQL family: any update to an indexed field requires an update to all indexes on the table (compared to MySQL, which uses clustered indexes, so only the affected indexes need to be updated). If you have a lot of indexes on your tables and update indexed values frequently, you're going to see an increase in disk I/O.
Someone else can probably come up with a more exhaustive list, but the first two are things I've personally been frustrated with - even with the tooling provided by 2ndQuadrant I still have to admit other solutions (namely Microsoft SQL Server) have better stories around replication and backup management, though the edge is in user-friendly tooling and not so much underlying technology.
On the other hand, PostgreSQL has a lot of great quality of life features for database developers. pl/pgsql is really great to work with when you need to do heavy lifting in the database; composite types, arrays and domains are extremely useful for complex models and general manipulation; full-fat JSON support can be extremely useful for a variety of reasons, as can the XML features; PostGIS is king when it comes to spatial data; and a whole hell of a lot more.
PostgreSQL is hands-down my favorite database because it focuses on making my life, when wearing the database developer hat, a lot nicer. With the DBA hat on it complicates things some compared to other products, but the tooling out there is at least decent so it's not a huge deal.
Regarding replication, the logical replication in pg10 should make this significantly easier. Between that and "Quorum Commit for Synchronous Replication" I'm really excited to see how these pan out in production. Pg doesn't normally hype or talk about (or let into production) things that don't work.
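For reference, quorum commit in 10 is exposed through synchronous_standby_names; roughly something like this (standby names are made up):
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (standby_a, standby_b, standby_c)';
SELECT pg_reload_conf();  -- commits now wait for acknowledgement from any 2 of the 3 standbys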
Logical replication really isn't going to fix the pain points I have with replication from a DevOps/Sysadmin perspective.
The biggest issue with replication in PostgreSQL is restoring a failed master. pg_rewind provides the low-level plumbing to do this and it's somewhat integrated into repmgr now, but it's far from being easy to use compared to something like SQL Server Availability Groups. Being the sole Linux sysadmin / PostgreSQL DBA in my organization means I have to take responsibility for this, since the tools are complex enough that I can't easily throw them over to our Windows admins or SQL Server DBAs in an emergency. This is partially a staffing issue; if the tooling were a little easier to understand and robust enough in fully-automatic operation I could just leave common troubleshooting steps in a runbook, but right now when replication breaks it REALLY breaks.
That's because they are Windows admins and not *nix admins. They probably don't have a good understanding of the command line, and if they do, not of the rich *nix command line.
I have a single pg admin in current project and he can throw most tasks to the Oracle (running on Linux) admins or even junior Linux admins.
While "complex" they are still fairly basic to command line admins.
Anything with less than 3 pipes or one regex should be simple to command line *nix person. Granted to excludes you including a complex AWK script or inline perl execution on the command line instead of in a file like a normal sane person.
The major issue with logical replication is the lack of DDL replication, meaning any changes to your table structure will cause issues until the replicas are updated to match. Also, an all-tables subscription will not automatically pick up new tables; it needs to be refreshed too. At this stage, I'd recommend against using logical replication unless absolutely necessary, because it doesn't really do what people would normally expect it to.
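For context, the PG 10 workflow being described looks roughly like this (connection string and object names are made up), and the REFRESH step is the part that's easy to forget:
-- on the publisher
CREATE PUBLICATION app_pub FOR ALL TABLES;
-- on the subscriber
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator'
    PUBLICATION app_pub;
-- after creating a new table by hand on BOTH sides (DDL is not replicated),
-- tell the subscriber to start syncing it:
ALTER SUBSCRIPTION app_sub REFRESH PUBLICATION;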
Migrating to Postgres will certainly be significantly more work than going to MariaDB for you. But I would still recommend you consider it, because Postgres is safer, more powerful and has fewer edge-case behaviors. If your systems are highly-wedded to MySQL's idiosyncrasies, the difference in cost may be too high for you to do it right now, but I would seriously consider it.
MariaDB compared to Postgres is a difficult choice; both have a multitude of features. MariaDB 10 was a game changer for the MySQL landscape.
Making Postgres 10 as easy to scale as MariaDB 10, or getting MariaDB 10's query semantics to allow the complexity that Postgres 10 allows, would both require trade-offs in management tooling, development practices and understanding.
Neither of them are a silver bullet. They both require effort, like any RDBMS. But both are great choices.
Not so much a reason to not use it, but something to keep in mind: queries such as `SELECT COUNT(*)` tend to be a bit more expensive in PostgreSQL compared to MySQL/MariaDB. This doesn't necessarily mean they're always slower, but it's something you should take into account.
Another thing to take into account is that upgrading between major versions (which looked like minor version bumps before the 10.x numbering change) is a bit tricky since IIRC the on-disk/WAL format can change. This means that upgrading from e.g. 10.x to 11.0 requires you to either take your cluster offline, or use something like pglogical. This is really my only complaint, but again it's not really a reason to _not_ use PostgreSQL.
`SELECT COUNT(*)` is fast only on the non-transactional MyISAM storage engine, which locks the whole table for a single writer and has no ACID support. It's not what people usually mean when they say they want an RDBMS.
When ACID support is required, you use InnoDB, and that has the same `SELECT COUNT(*)` performance as other database engines.
What workload do you have that you need exact count of rows in a big table? Because if the table is not big or if inexact count would suffice, there are solutions (e.g. HyperLogLog).
Since 9.2 Postgres has had index-only scans which can substantially improve the performance of `SELECT COUNT(*)` queries. If you need to get exact counts, and your query plan is not showing an index-only scan, you should try adding an index on the columns you are filtering on to see if you can get the query to run on an index-only scan.*
* <all the usual caveats about attempting to tune query performance>
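A minimal sketch of what that looks like (table and column names are made up; whether you actually get an index-only scan depends on the visibility map being up to date, hence the VACUUM):
CREATE TABLE events (id BIGSERIAL PRIMARY KEY, user_id INT NOT NULL, created_at TIMESTAMPTZ NOT NULL);
CREATE INDEX events_user_id_idx ON events (user_id);
VACUUM ANALYZE events;
EXPLAIN SELECT COUNT(*) FROM events WHERE user_id = 42;
-- hopefully shows a plan along the lines of:
--   Aggregate
--     ->  Index Only Scan using events_user_id_idx on events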
> queries such as `SELECT COUNT(*)` tend to be a bit more expensive in PostgreSQL compared to MySQL/MariaDB
As far as I know, that's true when you use the MyISAM storage engine (which is non transactional), but not when you use InnoDB (which has been the default for years now).
Hm... looks like it's mainly focused on distributed stuff. I would hope they also focus on the bare essentials.
I've recently started working with PostgreSQL and stumbled upon a problem with their UPSERT implementation.
Basically, I want to store an id -> name mapping in a table, with the usual characteristics.
CREATE TABLE entities (id SERIAL PRIMARY KEY, name VARCHAR(10) UNIQUE NOT NULL);
Then, I want to issue a query (or queries) with some entity names that would return their indices, inserting them if necessary.
It appears that's impossible to do with PostgreSQL. The closest I can get is this:
INSERT INTO entities (name) VALUES ('a'), ('b'), ('c')
ON CONFLICT (name) DO NOTHING
RETURNING id, name;
However, this (1) only returns the newly inserted rows, and (2) it increments the `id` counter even in case of conflicts (i.e. if the name already exists). While (2) is merely annoying, (1) means I have to issue at least one more SELECT query.
We like to mock Java, C++ and Go for being stuck in the past, but that's nothing compared to the state of SQL. Sure, there are custom extensions that each database provider implements, but they're often lacking in obvious ways. I really wish there was some significant progress, e.g. TypeSQL or CoffeeSQL or something.
I'm having trouble imagining why you need this. Often when people come up with "interesting" scenarios involving UPSERT, they are trying to use the database as persistent memory for some algorithm, rather than as a relational database. What are you actually trying to do? Why would it be insufficient to do something like:
BEGIN;
INSERT INTO entities (name)
(VALUES ('a'), ('b'), ('c') EXCEPT SELECT name FROM entities);
SELECT id, name FROM entities WHERE name IN ('a', 'b', 'c');
COMMIT;
That still executes two queries under the hood. The basic issue is a misunderstanding of what INSERT RETURNING does.
I've actually seen this specific misunderstanding many times on forums and on IRC. I wonder if it'd be possible to change the syntax to something like "RETURNING [ ALL | NEW ] * | output_expression [ [ AS ] output_name ]", where an empty specifier would mean "NEW" and be equivalent to the current semantics, while "ALL" would also include rows that were not inserted due to a conflict. "ALL" would have no different meaning with "ON CONFLICT DO UPDATE" or when no conflict resolution was specified.
So you're saying that a SELECT statement that happens strictly after an INSERT statement (on the same connection) can possibly not see the data that was written by the INSERT statement?
I guess that would be possible, but I would be very surprised if that were so... I would also expect writes to rows to be atomic (i.e. you wouldn't see a row with half-written data, or a row that has an `id` but not (yet) a `name`). Again, that kind of behaviour would be possible, but surprising to the point of the DB being unusable.
This is why almost all ORMs create a transaction and have a "unit-of-work" that encompasses a single transaction.
Race conditions are always surprising. Fortunately, we have a simple remedy for that: transactions. :)
Suppose you have another connection and it runs DELETE FROM t WHERE id = (SELECT MAX(id) FROM t). If that winds up in the middle, it could screw things up for you. Is it likely to happen? No, but again... the remedy is simple. Also, the scenario you describe seems simple enough, but what happens when you bulk insert a few thousand rows?
By the way, with Postgres, there really is no way to not have a transaction. If you don't issue BEGIN, it treats the statement as a small transaction. So you're getting the same overhead on a single statement, if your plan was to somehow avoid the overhead.
If this makes you question the database's usability, I have some bad news for you: all ACID compliant databases work this way. The "atomic" unit is the transaction, not the connection. The consistency guarantees pertain to the transaction, not the connection. This is not a weird thing about Postgres, this is how they all work (except, of course, that MySQL can't always roll back aspects of a transaction, like DDL).
Most people do read (sometimes even in another thread / other connection) -> transaction -> write.
Also, the default transaction isolation level in PG would still allow inconsistencies like phantom reads, non-repeatable reads and serialization anomalies.
There's a difference between not using transactions because you don't know better and not using transactions for reasons that arise out of a deep understanding of what is going on. 99% of the time, it's the former. The argument that transactions aren't perfect is a shitty reason not to use them, the kind of bad advice that doesn't hurt you today, but may hurt you in the future.
Well, I'm not against using transactions; I use them a lot.
But most often I read data on other threads where I do not need a transaction. Especially for validating data or just fetching data there is not a big need for transactions, and especially if your data can be dirty (not 100% accurate).
Basically, yes, you need transactions to guard against dirty reads, but most often you don't care about dirty data. If you're not dealing with financial data or data that needs to be handled with care, say you have a timeline with posts, you don't care if the last read was dirty, and the user probably doesn't care anyway; they'll just refresh the page once they need more up-to-date data.
For now. There are optimizations planned for the query planner that would collapse this into a single query in upcoming releases.
Some commercial databases already do that.
INSERT RETURNING only returns the values of the changed rows, but it does not return anything if you redirect writes with triggers or to child tables, including partitioning.
I'm having trouble justifying semantics that return rows which haven't been inserted or updated.
Actually, I find SQL pretty nice. It's strongly typed in non-MySQL engines. It's pretty expressive, and can be highly optimized without having to update queries. Analytical features found in non-MySQL databases, such as window functions, are also invaluable for many types of analysis.
The biggest issue I've ever had is that nesting queries when doing some analytical work can be a pain. WITH does handle this to some extent, but it doesn't seem quite as nice as something like Pig, where results can "flow" through the query.
The other issue is transformations like pivot tables where you don't know the column names ahead of time. pg has extensions for that, but they aren't quite as nice as a native SQL solution would be.
If I understand what you're looking for, you can likely accomplish this using a WITH expression, something along the lines of (untested):
WITH
input_rows (name) AS
(VALUES ('a'), ('b'), ('c')),
insert_rows AS
(INSERT INTO entities (name)
SELECT ir.name
FROM input_rows AS ir
ON CONFLICT DO NOTHING
RETURNING id, name)
-- returned when insert was successful
SELECT 'new' AS src, i.id, i.name
FROM insert_rows AS i
UNION ALL
-- return when there's a conflict
SELECT 'existing' AS src, e.id, e.name
FROM entities AS e
JOIN input_rows USING (name);
As for the sequence incrementing, that's the nature of sequences: it's a trade-off between getting fast, unique values versus the overhead required to track whether the id is actually used.
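You can see the gap behaviour directly with the entities table from upthread; assuming 'a' already exists, something like this should show the counter advancing even though nothing was inserted:
INSERT INTO entities (name) VALUES ('a') ON CONFLICT (name) DO NOTHING;  -- conflicts, inserts nothing
SELECT currval('entities_id_seq');  -- the default sequence still handed out a value in this session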
I think this is a fair assessment, though it's not like the postgres team is known for "move fast and break things" - they do slow and incremental. Plus, in the press release a DBA from Yandex says they're excited about it so we should know soon :-)
I think the parallelism can be trusted. This is just more work on the infrastructure released in 9.6. Partitioning on the other hand is something which is new in this release.
I'm super excited for the better parallelism, and all the little quality of life improvements (like enums / uuids being able to participate in exclusion constraints). There was an amazing amount of work that has gone into this release.
A heartfelt thank you to all PostgreSQL contributors for such a wonderful solid, well documented and versatile tool. One of my absolute favorite pieces of software.
Just upgraded and it's been rocky so far. (I'll spare you the upgrade details, just make sure you don't try to upgrade using apt if you initially installed the enterprisedb.com bundle)
But now that I've got v10 running I'm unable to do a fairly simple CREATE TABLE command that worked fine in 9.6: it maxes out my RAM and my swap until the db crashes. (which never happened once in 9.6)
(This is not directed at you personally, but a lot of people just complain about bugs on public forums like this one without filing a bugreport, thus not giving the developers a chance to address the problem. I'm guilty of this myself way too often.)
You can practice with pg_upgrade as well: back up your PG_DATADIR, restore it on your new machine, and run the upgrade. I typically don't bother doing it on a separate VM myself; I just run the upgrade without the --link flag so it copies all the data files and log segments, test, and then open it up.
I have 100GB of data in Postgres 9.2 on a single machine. I have it backed up to Tarsnap, and a tonne of spare disk space locally, and can make a snapshot of the machine at the beginning.
I really need to know: What is the general flow for upgrading... can I go from 9.2 to 10? Or do I need to install and migrate to every intermediate version? If there are multiple approaches to this, which is considered the safest (even if the downtime is a little longer)?
Then I create a new server and install the new version of Postgres, apply all custom configurations, close the database to external access, use pg_dump / pg_dumpall to create logical backups, restore them to the new server, configure a new replica, and test my application internally; then, when I'm happy, I'll turn everything back on for external access.
> I have 100GB of data in Postgres 9.2 on a single machine. I have it backed up to Tarsnap, and a tonne of spare disk space locally, and can make a snapshot of the machine at the beginning.
Are you backing up the database data directory via Tarsnap while the database is running? If so, it's possible your backup isn't a consistent snapshot. Snapshotting the data directory is an option, but you can't blindly do it without stopping the cluster first.
If you haven't already, I'd suggest setting up automated jobs to backup the database via pg_dump for a logical backup. That'd be a fallback in case you run into any issues upgrading the database directly.
Not necessarily true. Postgres is designed to always have a consistent (or rebuildable) on-disc format. That is, at an instant the entire fileset is consistent with itself.
The main problem when backing up the entire database directory is that a backup program will read the files over an extended period of time, and you can't guarantee that Postgres won't have changed stuff in-between.
The main problem is with the WAL. Everything[1] gets written to the WAL before going to the actual database tables. Stuff that has definitely made it to the actual database tables then drops off the end of the WAL, because Postgres knows that it is safe.
However, you can tell Postgres that a backup is in progress[2], and it will stop stuff dropping off the WAL. Then, no matter how many database writes happen, nothing disappears from the database directory[3]. You can take a database backup by just copying the files. Just make sure you tell Postgres when you have finished taking your backup[4], and it will then drop the old WAL entries. Otherwise, the WAL will grow until you run out of disc space.
When you restore your backup, Postgres will rerun all the actions recorded in the WAL in the state it was in when the backup program read it.
[1] Well, not necessarily everything, but close enough.
[3] Well, yes, stuff will change in the database directory, but the WAL says that it is going to change, and that record is still there, so it's all good.
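Footnotes [2] and [4] presumably refer to the low-level backup functions; a rough sketch of the flow being described (the label is arbitrary):
SELECT pg_start_backup('nightly');  -- [2] tell Postgres a file-level backup is starting
-- ... copy the whole data directory with tarsnap/rsync/tar while the server keeps running ...
SELECT pg_stop_backup();            -- [4] let Postgres drop the old WAL again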
I'd go with a pg_dump to a plaintext SQL file, then run `psql < backup.sql` and see what happens. I'd run pg10 concurrently in a docker container or something to try before trying it live.
If 10 fails, try 9.6 (docker run postgres:9.6) and do the same, to see if it's an issue exclusively with 10.
I will say that after going through their download form and download button, I learned a bit about how not to design a download form and download button.
Does anyone have a tutorial on how to upgrade the version on Debian? If I install it from apt, the service stays connected to 9.6 and not 10.
Plus, I have to migrate all the configs; is there anything that helps with doing this?
Hmm... I have some tables partitioned using the popular pg_partman extension. I wonder how that will interact with pg 10's new partitioning functionality.
I half expect everything to be borked if I try to upgrade to 10 :/
Good, that makes it less 'urgent' at least. Though, I really hate the clunky way partitioning works in Postgres<10... I might have to take the time to rework my partitioning at some point.
There are tons of ways to do that; the lazy way would probably be to pay someone like Citus (https://www.citusdata.com/product). Alternatively, use a 3rd party tool, do manual sharding, or use the PG FDW, or ... the list is pretty endless. It depends a lot on your needs, requirements, how much sharding you have to do, etc. It's not a simple topic.
First I'd make sure your DB sizes really NEED that before going down that road; it's a lot of hassle, and I'd personally push for just sizing your DB box up and up and up before sharding if at all possible. Simple solutions (read: not sharding) usually win, especially when production falls over and you have to prop it back up.