I've come to the conclusion that the problem in tech is that all the people doing the work are in their early twenties and have no idea what they are doing. Once they get some experience they are quickly promoted to the CTO position. Rinse and repeat.
What we have here is a classic DBMS problem and no one at Movio seems to know how to deal with it. Instead of migrating from MySQL to something serious (Postgres) they move to some columnar DB no one has heard of. Never mind that Postgres, a reasonably priced DBA, and a little thought put into their data model/queries could probably handle all their issues.
Sorry for the snark, cheers on a successful product.
Can't help but also think "WTF are they doing there..." - we're doing exactly the same (user segmentation, targeting and campaign execution for cinema & movie users, disclaimer: we're more or less their only competitor, albeit indirect), but our solution is running at ~30k/year total at 10 times their user base. No magic in there, just good architecture and solid Computer Science. Boring technology (Go/Redshift/Postgres/S3).
The only thing I'd fully agree on is that using Go saved us a lot of resources as well. It's an awesome choice for stuff like this that needs to be reasonably performant as well as being simple, understandable and reasonably fast to build.
Well, a problem in tech certainly. An alternative possibility (and another problem in tech): some mid-level developer could have figured out the problem at the start, but because of artificial time pressure to deliver they didn't have the time to, so they went with the first bad idea that popped into their head without taking the necessary time to evaluate it.
The fact this had to happen in a hackathon suggests a typical disconnect between management and development (and probably poor prioritization by management). Because development knew this was a problem and how to fix it (evidenced by the fact they fixed it), but it took removing management (aka a hackathon) to give development the space to fix it. And now the company pats itself on the back for having the vision to host a hackathon instead of structuring and prioritizing correctly in the first place so this would just get fixed on the clock.
I do think the author's takeaway about the value of simplicity and pragmatism is on point, but that applies not just to code but to management as well.
It is worth noting that on their website, their management team doesn't include a CTO, even though their main product is basically a software solution. They have a few sales people represented though, so management might not be great techwise.
The CTO has moved to another company with a not-so-shiny tech stack; as far as I know, they were the main adopter of Go and other solutions to replace Java and then Scala. Perhaps Movio has not yet found the best fit for the company.
I think it is not necessarily the age but the mindset of focusing on solutions instead of understanding the problem first.
It often goes like this:
Oh snap, we encountered a problem! Let's find a tool, framework, language that promises to solve a similar sounding problem.
Now we have a problem with a layer of abstraction on top. Soon to be two problems.
Let's find a tool, framework, language to solve both of them ...
It is a spaghetti-at-the-wall approach, where you just throw a bunch of things at your problem hoping that something sticks. And who cares how long it will stick.
Secondly, as a developer, I think dedicated DB experts are way underrated in start-ups. Sure, your fullstack devs can cobble together some tables, changing them 15 times a day to accommodate business requests and slapping indexes on everything that gets slow. That is also the way to get into trouble once you scale, and instead of reflecting on why this is, people reach for the bowl of pasta.
I was no different when just starting out. I thought my biggest strength was how quickly I could come up with easy "solutions" for any problem the company had. It took me years to realize how silly an approach this is.
>It often goes like this: Oh snap, we encountered a problem! Let's find a tool, framework, language that promises to solve a similar sounding problem. Now we have a problem with a layer of abstraction on top. Soon to be two problems. Let's find a tool, framework, language to solve both of them ...
is absolutely real; I've actually seen it happen, both in projects I was in and in ones I've heard or read about.
Yeah, my opinions on this are "slightly" influenced by the Rich Hickey talks "Simple Made Easy" and "Hammock Driven Development" :)
The difficulty I find is identifying the moment to leave the hammock again in a startup environment. To what degree do you need to understand a problem before you take action? If you try to understand it 100%, you'll never get anything out there.
But I'm already very happy that I was able to convince the business side of the company of the approach in a brief talk about it, and they now refer to "the hammock" themselves :)
>The difficulty I find is identifying the moment to leave the hammock again in a startup environment. To what degree do you need to understand a problem before you take action? If you try to understand it 100%, you'll never get anything out there.
Agreed. The problem, though (and I'm painting with a broad brush here), is that the erring tends to be much more on the side of not trying to understand the problem much, or at all, before jumping into action. I think a lot of it is due to peer pressure and wanting to be "seen" by peers and bosses (and VCs) to be doing stuff, as opposed to really getting things done better in the medium term, even if in the short term it looks like you are not acting but "only" thinking or analyzing or designing stuff. Hence my comment in that post I linked to, about "we have to ship next week". All too common - been there, seen a good amount of that. In fact, this subthread between HN user jacquesm and me just recently is basically about the same point, although described in different words:
>But I'm already very happy that I was able to convince the business side of the company of the approach in a brief talk about it, and they now refer to "the hammock" themselves :)
Somewhat related: I've seen this problem exacerbated by the presence of "architects" who don't seem to have implemented running systems in a long time, and especially have no experience with running newer technologies; or limited experience which breaks at scale. Not saying this applies to all software architects, but I've seen this often enough.
e.g. I remember using a dedicated Jenkins environment to run continuous, scheduled integration tests for my service. When the architect found out, he immediately sent me links to software packages that are dedicated to running continuous tests. I asked whether he had any experience running these new packages and if he would be willing to set them up/maintain them... radio silence.
Some time ago, I thought it was <easy> to write code to do things. By now, I mostly ponder how I put things into postgres/kafka|rabbitmq|../memcache|redis|.../elasticsearch/neo4j so I can reduce everything to good queries into these systems.
You're right. I started using MySQL in 2000 when it was still a toy. It's an outdated bias :) I was more flabbergasted that they would choose InfiniDB when there are so many other great options out there.
Well, MySQL still doesn't implement the SQL standard from 1999 (20 years ago), because it's missing common table expressions. Although the next release will support them, thankfully. Sorry, I just had that axe to grind.
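For anyone who hasn't run into them, a CTE is just a named subquery; a minimal sketch (table and column names here are made up for illustration) of the kind of query the standard has allowed for ages:
-- Works in PostgreSQL (and MySQL 8.0+); older MySQL forces you to inline
-- the subquery or stage it in a temporary table instead.
WITH recent_spend AS (
    SELECT customer_id, SUM(total) AS spend
    FROM orders
    WHERE ordered_at >= '2018-01-01'
    GROUP BY customer_id
)
SELECT c.name, r.spend
FROM customers c
JOIN recent_spend r ON r.customer_id = c.id
WHERE r.spend > 100;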
I disagree with your assertion that MySQL /is/ a serious database.
The questions I usually ask myself when evaluating database solutions are:
* Does it accept invalid data?
* Does it change data on error?
* Does the query planner change drastically between minor versions?
* How strong is transaction isolation? Can I create constraints, columns or tables in a transaction?
* Does it scale vertically above 40~ CPU threads and 1M IOPS?
For MySQL, the answer to every one of these questions lands on the wrong side. You could argue the value of some of them, but a lot of them highlight architectural or development-process shortcomings.
Not the OP, but one example of accepting invalid data is MySQL defaulting values to null/0/0000-00-00 when no value is assigned and there is no default on the column - unless one is in strict mode.
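A minimal sketch of that behaviour with a made-up table (exact results vary by MySQL version and sql_mode):
SET SESSION sql_mode = '';  -- strict mode off, the historical default
CREATE TABLE people (
    name VARCHAR(10) NOT NULL,
    age  INT         NOT NULL,
    born DATE        NOT NULL
);
-- Omits two NOT NULL columns that have no default. Instead of failing,
-- this succeeds with warnings: age becomes 0 and born becomes '0000-00-00'.
INSERT INTO people (name) VALUES ('Alice');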
I appreciate that it might sound like that to someone who hasn't used MySQL in production for 10+ years.
To start with, this is still true today: https://vimeo.com/43536445 - despite the video being 6 years old, strict mode is still required.
Anything prior to MySQL 5.7 will accept "0000-00-00 00:00:00" as a valid date; 5.7 will not (which is sane), but this means migrating from 5.6 -> 5.7 just got a little harder.
In fact it wouldn't validate /any/ date, so it would assume every year was a leap year and February always had 29 days.
Regarding performance:
This is what I found from my own experience: I was given the task of testing the limits of MySQL. MySQL was the chosen technology and I was not involved in making that decision, so - whatever.
We were given 10 servers with 40 cores (2014-2015-ish), 128G of DDR3/ECC, and 8 SATA SSDs in RAID-0 with 1G of RAID cache for write-back.
We managed to get MySQL to bottleneck pretty quickly. Our queries involved a lot of binary data, so we should have been raw-IOPS bound, but we weren't; we were memory bound. So we replaced the memory allocator with a faster one (jemalloc) and got a 30% performance improvement. We suspected that the kernel sockets implementation was slowing us down, so we compiled a custom "fastsockets" Linux kernel. The improvement was around 4%, but we were still bottlenecked on memory. After doing a full trace of what MySQL was doing, we saw that InnoDB was spinning on a lock quite a lot.
I asked if we could try other SQL solutions (MSSQL/PostgreSQL). PostgreSQL was chosen first because we could just install it, no license and no OS change... it was twice as fast as the optimised MySQL installation out of the box with a stock CentOS 6 kernel.
We never even bothered testing MSSQL because PostgreSQL met our performance targets, we were now IOPS bound.
--
More anecdatum:
Regarding data consistency: we tried to migrate to PostgreSQL for performance reasons in 2014 (my previous company), and failed because MySQL had been corrupting our data very slowly and silently for many years (corrupting meaning not honouring NOT NULL, not honouring type safety, allowing invalid dates, inserting data on error). So far gone, in fact, that reimporting the output of `mysqldump` would not work.
Isn't it? I thought that today()-(2018 years, 4 months and 10 days) would be approximately that date? Maybe you prefer +0000 vs just 0000?
'ISO 8601 prescribes, as a minimum, a four-digit year [YYYY] to avoid the year 2000 problem. It therefore represents years from 0000 to 9999, year 0000 being equal to 1 BC and all others AD. However, years prior to 1583 are not automatically allowed by the standard. Instead "values in the range [0000] through [1582] shall only be used by mutual agreement of the partners in information interchange."
To represent years before 0000 or after 9999, the standard also permits the expansion of the year representation but only by prior agreement between the sender and the receiver.[19] An expanded year representation [±YYYYY] must have an agreed-upon number of extra year digits beyond the four-digit minimum, and it must be prefixed with a + or − sign[20] instead of the more common AD/BC (or CE/BCE) notation; by convention 1 BC is labelled +0000, 2 BC is labeled −0001, and so on.'
I think a lot of what you have written can be solved software-side. A good database should be no excuse for bad code.
I do not think MySQL is technical debt, as for 80% of startups moving to a different solution is cheap and non-problematic. The LAMP stack is good enough and the quickest/cheapest option for the majority of tech companies.
I'm being perfectly fair in being critical of software that claims to be doing those things.
You can solve issues in your application if you know there will be issues like these; knowing the pitfalls and drawbacks of a technology is certainly noble - but if you do, then why not choose something that follows the principle of least surprise? (There might be reasons.)
I would never claim that you should move everything from MySQL if you use it. However, if you care about data consistency, ensure that you change the defaults, engage strict mode, and ensure that your application has no bugs in handling data.
This is actually hard to do correctly; it's overhead in development that you shouldn't be caring about. Just choose something that has sane error conditions and the problem vanishes.
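For reference, "changing the defaults" mostly means pinning sql_mode; a sketch of the commonly recommended settings (normally put in my.cnf so it survives restarts - check which of these your MySQL version supports):
-- STRICT_ALL_TABLES: reject invalid or missing values instead of coercing them
-- NO_ZERO_DATE / NO_ZERO_IN_DATE: refuse '0000-00-00' and zero month/day parts
-- ERROR_FOR_DIVISION_BY_ZERO, NO_ENGINE_SUBSTITUTION: fail loudly rather than guess
SET GLOBAL sql_mode = 'STRICT_ALL_TABLES,NO_ZERO_DATE,NO_ZERO_IN_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION';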
Considering many, many of the world's largest tech companies use MySQL or MySQL compatible databases, it's rather absurd to say that MySQL isn't a serious database. Regardless of whether it matches yours or someone else's personal list of capabilities.
To be perfectly fair with you, you can make bad choices and still get something useful done.
Most companies are not alive "because they chose mysql over something else" they're alive because they have "good enough" tech to get the job done. The job that they're trying to accomplish is the thing that makes them successful.
Uber isn't super huge because it used a specific database technology. It's huge because it's good at marketing, it's providing some value to people.
If it got the work done at a reasonable cost and performed reasonably well (i.e. it served the purpose it was meant to serve), how “bad” a choice could it have been?
I'm reminded of an article about zombie companies I read recently: they're companies which are inefficient/poorly-managed/poorly-executing, but due to market/regulatory inefficiencies they're not dead yet. Companies which use MySQL are in a similar situation: they're not doing as well as they could be, and all other things being equal they ought to be put out of business by their competitors — but all other things are rarely equal.
Still, if you are making choices for yourself, you don't choose mediocrity and hope to muddle through: you choose excellence. Choosing MySQL isn't choosing excellence.
“Excellence” is a poor criterion for comparative analysis because it is (a) subjective and (b) unquantifiable.
Do you have objective or quantifiable data and references upon which your opinion is based, _and_ is universally applicable to any arbitrary problem that a SQL database might be an appropriate solution for?
If it costs development time because they need to be extra careful about not sending queries that ERROR and 'wipe' data.
If it silently corrupts data over years and gets discovered much later. (As was the case with my previous company, an e-commerce retailer that lost large chunks of order history)
Are those problems still unresolved in MySQL today? How do you know that similar or worse problems did not exist in alternative solutions at the time it was implemented?
MySQL is making strides to fix these kinds of issues ever since the Oracle acquisition for sure.
> How do you know that similar or worse problems did not exist in alternative solutions at the time it was implemented?
Because I've been working on database solutions for over 10 years, there are problems in other software but I consider data loss to be worse than any of them. For example the autovacuum in postgresql 8.3 and before was mostly garbage which ended up bloating highly transactional databases. But deleting data when you fail a constraint is worse.
I have 15 years of experience and can build a decent, clean system using "boring" technologies. But all the decent paid work where I live is maintaining big balls of mud built with tech that was obviously at peak hype when it was chosen, and nothing done according to best practices, because that would require sticking with a tech and learning it properly. It's quite frustrating.
Then we have the interview process where people expect me to give up my weekend for their coding test and can't even be bothered to give you feedback afterwards. Or some ridiculous algorithmic nonsense that has no relevance to the job. Getting bored of it all.
If I'm ever the CEO of a company/startup, the one criterion I'd set is that either I decide on all the technologies we use, or there is no CTO, so I make those decisions.
And that criterion could be summed up in one sentence: use something boring. No hyped programming languages / DBs / tools allowed.
Of course some would argue you could still be doing it wrong even when using old tech / languages / tools. Well yes, but there you have a sea of resources and expertise to ask for help, instead of spending energy and time figuring it out yourself.
Of course if your company is all about tech innovation, AI or something cutting edge, then surely you will have to try something new. But 80% of those startups aren't.
Quoting Dan McKinley's "choose boring technology" [cbt]:
> Embrace Boredom.
> Let's say every company gets about three innovation tokens. You can spend these however you want, but the supply is fixed for a long while. You might get a few more after you achieve a certain level of stability and maturity, but the general tendency is to overestimate the contents of your wallet. Clearly this model is approximate, but I think it helps.
> If you choose to write your website in NodeJS, you just spent one of your innovation tokens. If you choose to use MongoDB, you just spent one of your innovation tokens. If you choose to use service discovery tech that's existed for a year or less, you just spent one of your innovation tokens. If you choose to write your own database, oh god, you're in trouble.
> Any of those choices might be sensible if you're a javascript consultancy, or a database company. But you're probably not. You're probably working for a company that is at least ostensibly rethinking global commerce or reinventing payments on the web or pursuing some other suitably epic mission. In that context, devoting any of your limited attention to innovating ssh is an excellent way to fail. Or at best, delay success.
I'm a fan of taking a 'one new technology' approach. When I'm building something new, I get to choose zero or one new technologies to play with, depending on whether I want to get shit done or learn something new.
By choosing at most one new thing, you can better control for how your stack should work and how you expect it to respond to certain unexpected circumstances, which means you should be able to more effectively solve issues as they crop up than you'd be able to if you were using multiple new technologies.
I agree with the main idea: working with hyped technologies is not a solution and you can build most of the things out there with boring technology.
But then ... you have to find, attract and hire good developers. That's already difficult, adding an extra layer of 'boring technology' will make this task even more challenging.
"In June 1970, E. F. Codd of IBM Research published a paper [1] defining the relational data model and
introducing the concept of data independence. Codd's thesis was that queries should be expressed in terms
of high-level, nonprocedural concepts that are independent of physical representation."
The key, the whole key, and nothing but the key so help me Codd.
Also said as... "In Codd we trust."
If none of these DB jokes mean anything to you, take a DB concepts class at a CS university. There's a lot of great research going back 50 years and you can learn a great deal about why things are the way they are (tuple algebra and calculus). And before changing anything for something you think may be better, you should fully understand what you are giving up.
This talks a lot about 5.5 and mentions that 5.6 is “due out soon”. The current release series is 5.7. How much of this is outdated and how much has stayed the same?
"Some statements cannot be rolled back. In general, these include data definition language (DDL) statements, such as those that create or drop databases, those that create, drop, or alter tables or stored routines." [0]
I like the part "We used it because it was there. Please hire some fucking software developers and go back to writing elevator pitches and flirting with Y Combinator."
Well, you should school those fools at Google, Facebook, Twitter, Pinterest, Amazon... and tell them how they are wasting time with their toy MySQL databases.
> those fools at Google, Facebook, Twitter, Pinterest, Amazon
... have dedicated hundreds of engineers and millions of dollars to nothing more than keeping MySQL up, running, and not crapping the bed every time someone looks at it funny. If you can afford that resource expenditure, by all means go nuts with MySQL. Most companies can't and would be far better served by something which doesn't need that amount of handholding to serve its basic purpose.
There is one thing that bugs me about all the talk of postgres' superiority: why haven't these companies switched to postgresql? Surely they weren't all too far invested in mysql before a "more knowledgeable" DBA came along saying postgresql is better.
Following on from that, I suspect a lot of large companies use MySQL because they always have, not because it's actually any good. For example, Basecamp used MySQL while I was there, but I never met a single Sysadmin there who would use it over Postgres if they were to start a new project.
PHP was built for the web and has been successful at that job. It is easy to use because the core developers have made some good design choices for the task at hand. For example no threads, stateless requests, core functionality focused on outputting HTML, etc.
MySQL and PHP are good. They do the job they were designed for in a cost-effective way, and of course that means there will be trade-offs.
PHP is crap. It's actively hard to write good code in it. Not good code like SOLID or pretty code that's self documenting, it's hard to write code that's not going to break in unique and interesting ways.
Sure, you can knock up a contact form in it really quickly, but that ease of use hides significant dangers.
This might have been true 10 years ago with versions like PHP 4. But remember many companies, including Facebook, have invested a lot into PHP. In the newest versions of PHP what you said no longer applies: there's a type system, OOP features like traits, class inheritance, etc.
> Surely they weren't all too far invested in mysql
I think by the time Postgres sorted itself out into a more user/admin-friendly system (which is still fairly recent, really), MySQL had pretty much conquered the "quick and easy" mindshare and was deeply embedded almost everywhere.
And if you've spent millions of dollars architecting your systems such that MySQL's flaws aren't killer issues, there's very little financial benefit to switching, I guess.
MySQL for years was far easier to install, configure and run than PostgreSQL, especially features like replication which was much better than other options. Big companies use these databases as more like simple key/value systems rather than complex relational schemas so strong replication and operational simplicity was favored over the rich featureset of Postgres.
Eventually Postgres caught up in most things, and the delay was in some part because of implementing those features "correctly" and with more thought, but it's still a delay that hurt the uptake in the early days.
You are right. It is a solid DB. I was more pointing out that if you have to move off Mysql, there are excellent options other than adopting a new columnar datastore.
Can you demonstrate that Postgres is significantly faster than MySQL on average? I highly doubt it. The problem is in the way the data model was architected and implemented.
PG isn't usually faster than MySQL on any naive database (which is 90% of databases in the world, I suppose).
Most E-Shops will be fine running MySQL or MariaDB.
The one thing PG excels at, however, is that you can tune it much more finely to your workload than MySQL/MariaDB. That, and the ability to extend PG arbitrarily (try adding native functions to MySQL without recompiling) via the C FFI it offers. You can write and define your own index methods that give you an index perfect for the workload, or you can add a new data type to support a new input with validation.
You can sink a lot of work into getting the most out of a PG database, MySQL not so much. But again, for most people MySQL will provide the same (or even better) performance than PG. (I still trust PG over MySQL after MySQL nulled out all entries of a table with only NOTNULL columns after a nasty crash)
Apples to oranges. You need to compare PostgreSQL with other databases that don't take shortcuts around ACID for performance (for example, DDL forcing an implicit commit in transactions).
Seriously? You don't need to do DDL operations in day-to-day use, but you'll need to when you're doing development work.
For example, say you had a "color" column on a table, and for a new feature you're now adding the ability to have multiple colors. You're going to create a new column, create a new table, populate that table, and drop the old column. If anything fails during that process you'd like to be able to roll back.
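In PostgreSQL that whole migration can sit inside one transaction, which is exactly the property being asked for; a rough sketch assuming a hypothetical products(id, color) table:
BEGIN;
-- 1. new table: one row per (product, colour)
CREATE TABLE product_colors (
    product_id INT  NOT NULL REFERENCES products(id),
    color      TEXT NOT NULL
);
-- 2. seed it from the old single-value column
INSERT INTO product_colors (product_id, color)
SELECT id, color FROM products WHERE color IS NOT NULL;
-- 3. drop the old column
ALTER TABLE products DROP COLUMN color;
COMMIT;  -- if any step above fails, ROLLBACK and the schema is untouched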
There is a concept of "forward-compatible change". Basically, you don't do things that will break your software.
Example, you don't add a NOT-NULL column unless you can give it a good DEFAULT value, to make it work.
Also don't drop columns until the software is ready for it, etc.
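A sketch of what that looks like in SQL, with a hypothetical users table and column:
-- Forward-compatible: old code that never mentions the column keeps working,
-- because inserts that omit it get the default.
ALTER TABLE users ADD COLUMN newsletter_opt_in BOOLEAN NOT NULL DEFAULT FALSE;
-- Not forward-compatible: NOT NULL with no default breaks every existing writer.
-- ALTER TABLE users ADD COLUMN newsletter_opt_in BOOLEAN NOT NULL;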
If you have a decent ORM, it will compare your "how it needs to be" sql schema with the "how it is" schema. Then it will generate appropriate "ALTER TABLE ...." "CREATE INDEX " etc statements. Note that this is automated and you never need to type SQL statements to achieve that.
Altogether, in the last XXX years, I did not really need to do a rollback on a DML statement.
To be fair, we used to die at 40. Now you can get a CS degree at 21 or 22, and you still "have no idea what you are doing". Maybe it's just that technology is really complicated, the world is really complicated, and everything is changing fast.
I don't mean it to be conformist, but it's easy to forget some things are actually hard when you are very clever, or old enough to have forgotten what it was like when you were still learning too.
But at the same time it's easier than ever to find information on nearly everything. Asking experts also is easier than ever before. I don't think it's feasible to dismiss GP's claim with yours.
Sorry if it came across like that. Not trying to dismiss his claim, just trying to give a wider perspective. Similarly, in the same way you are right saying it's easier to find information on nearly everything nowadays, it might also be relevant to remember that we still have limited time and attention spans.
I barely ever see an engineer knowing everything that happens from the top to the bottom on any platform. Very few companies are forced to get to know their stack in depth; usually they throw more money at the problem.
> I've come to the conclusion that the problem in tech is that all the people doing the work are in their early twenties and have no idea what they are doing.
Not that I disagree, but to be fair, I've seen plenty of tech ignorance with experienced and older engineers as well that has been pretty crippling.
Agree. I'm not sure per se this is an age thing: the field is just so big and apparently (but perhaps less so in reality) in a constant state of tech churn, that it is hard to anchor to consistent proven techniques and practices.
I don't quite get this. How fast was running this query:
Select loyaltyMemberID
from table
WHERE gender = x
AND (age = y OR censor = z)
Why the random complexity with individual unions and a group? Of course that's going to be dog slow.
Sure, the filters can be arbitrary but with an ORM it's really really simple to build them up from your app code. The Django ORM with Q objects is particularly great at this.
Obviously I'm armchairing hard here but it smells like over engineering from this post alone. Stuff like this is bread and butter SQL.
Edit: I've just read the query in the post again and I really can't understand why you would write it like that. Am I missing something here?
Seems like a fundamental misunderstanding of SQL rather than a particularly hard problem to solve.
Ten or fifteen years ago, sure - a DBA would look at a query plan and figure out how to do it properly. Worst case you'd slap a materialized view in and query that.
But this is 2018! Programmers don't want to treat the database as anything but one big key value store ;)
Yeah, sadly, this is not too much of an exaggeration. I've worked on teams that insisted they needed DynamoDB, because, well, Dynamo is for "Big Data", and they certainly wouldn't work somewhere that had "Small Data"! Replace the buzzwords/products as applicable; you could actually probably just scramble them and it'd work just as well, since someone out there thinks "RabbitMQ means Web Scale", etc.
SQL databases are amazing, robust examples of engineering. They are your friends and they're the appropriate choice for the vast majority of software. They are not outmoded or passe. Though I acknowledge there is a separate use case for K-V stores, I almost want to make policy preventing their use just because I know so many developers will abuse them badly and then stare back at you blankly during the semi-annual massive downtime event, muttering something like "Well, it's based on research at Google, so I'm sure there's a way to recover the data..."
I think this is a case for the return of the traditional "sysadmin", as "devops"/"SRE" is now the role of unblocking deploying a solution instead of questioning its complexity/fitness.
While I agree, I also think that overt gating and approval processes create a high tension dynamic that frequently breaks down, whether it's ops v. dev, security v. dev, or others. It's easy for people to get their pride wounded, and they end up encouraged to find workarounds to the process. The simple answers to this are pretty much imaginary, unfortunately.
Sure, of course. I was hoping to point out that there's an increasingly overlooked value in having someone question complexity. The "no, you don't need React" of the frontend dev or the "our data is actually relational" of the back end dev.
If your “SRE” team is only “unblocking deploying a solution” then I’m sad to say they are an operations team who has rebranded themselves to appear more relevant.
That's most "SRE" -- it's a title arms race in that field between the underqualified and those that wish to convey they know how to do more than write system scripts in DSLs
The column type would have to be the type of the encrypted value. The type of the unencrypted data could not be enforced by the DB and you would have to rely on code doing the correct thing.
I am however extremely wary of doing it that way. I don't know your requirements of course.
Here is the thing - encrypted stuff is just a weird encoded string. So I can’t really use columns normally.
What I really need is just a huge table with two fields: “token”, “content”
And the token is basically the primary key but encrypted with whatever encryption.
You could even do foreign keys this way.
Hmm I suddenly have an idea. What about a layer above the database that basically enforces foreign keys and joins in this way to support end to end encryption? The content would reference ENCRYPTED foreign keys. Only clients would decrypt stuff.
>What I really need is just a huge table with two fields: “token”, “content”
Sounds more like a key value store and less like a relational database. Although you can store key value data in a relational db of course, there may be a better tool for the job.
Do you have any recommendations for resources to learn best database practices? I'm currently designing my first database and I'm not sure what information is worth storing (like calculations) and how to choose which data to group in tables.
SQL Antipatterns is good as another commenter recommended. But my favourite book on the topic is Markus Winand's SQL Performance Explained. Most of it is online here: https://use-the-index-luke.com/ but I recommend buying it since it's tiny and worth its weight in gold.
It's short so you actually read it and possibly reread it. It's to the point. It has pretty pictures. And it has directly applicable advice.
I'm going to go out on a limb and suggest you start writing the application without a database first.
My early education on databases always seemed to follow a "how do we make a database do this?" rationale rather than "what data do we need to store to support these features?", which I think leads to a software design that is too strongly coupled with the database. Software modules end up dependent on database features, or table structure, and refactoring or switching data stores becomes more costly.
Instead, start with a simple in-memory data store - a list of objects with some interface for accessing them will probably be your starting point. Add some basic serialisation/deserialisation features (CSV, JSON, etc) when you get past initial testing and require some persistent data. Then, once you have your API in place and your software design is stabilising, you should be able to map that data to a database fairly easily:
* The primary structure maps to your main table
* Child structures become additional tables, with foreign keys
* Data used to lookup records can be indexed for better performance
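As a concrete (entirely made-up) illustration of that mapping, say the in-memory model is an order with line items:
-- primary structure -> main table
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    created_at  TIMESTAMP NOT NULL
);
-- child structure -> additional table with a foreign key
CREATE TABLE order_lines (
    id       INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    sku      TEXT    NOT NULL,
    quantity INTEGER NOT NULL
);
-- field used to look up records -> index
CREATE INDEX idx_orders_customer ON orders (customer_id);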
Beyond that, you should profile/benchmark your application to find what needs to be optimised, and then investigate whether your software design or your data store should be doing the optimisation.
Let your software's features influence the design of your database. Don't let the database's features influence the design of your software.
Ok, first you need to decide: is this going to be a purely transactional database (for business processes) or do you also plan to do data analysis straight inside this database (meaning you won’t be extracting, transforming and loading data into another database and analyzing it there).
If it’s transactional, I recommend keeping calculations only if you need to access summarized data frequently. For example, if you are tracking inventory by storing the history of transactions that occur into and out of inventory, it’s trivial to find out how much of each item you have in stock at any point in time by doing a sum of the change in quantities for each item type up to that point.
If you were usually interested in the "current" count, it would be expensive to perform this sum every time, so instead you could keep a separate table for calculating the running total of inventory per item and refer to that. Keep this table up to date through the use of triggers on insert events. (Note that your log of inventory transactions would thus be an immutable stream of events.)
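A rough Postgres-flavoured sketch of that pattern, with hypothetical table names (the transactions table is the immutable log, the levels table is the derived running total):
CREATE TABLE inventory_transactions (
    item_id     INT NOT NULL,
    change_qty  INT NOT NULL,   -- positive for receipts, negative for issues
    occurred_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE inventory_levels (
    item_id INT PRIMARY KEY,
    on_hand INT NOT NULL DEFAULT 0
);
CREATE FUNCTION apply_inventory_change() RETURNS trigger AS $$
BEGIN
    -- keep the running total in sync with every appended transaction
    INSERT INTO inventory_levels (item_id, on_hand)
    VALUES (NEW.item_id, NEW.change_qty)
    ON CONFLICT (item_id)
    DO UPDATE SET on_hand = inventory_levels.on_hand + NEW.change_qty;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_inventory_change
AFTER INSERT ON inventory_transactions
FOR EACH ROW EXECUTE PROCEDURE apply_inventory_change();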
An example of something not worth storing is derived data that is a combination of separate columns in a table. For most queries it’s probably trivial to be lazy and wait to perform such a calculation until you actually need it. If you still want to have a ready made “table” that has all the computations you want already entered as columns for easy querying, use a view. If you find yourself making liberal use of views, you’re on the right track.
If you are using a separate data warehouse for data analysis, then precompute and denormalize as much as you can.
*Disclaimer: most of what I’m talking about is from a Postgres perspective.
I would describe myself full stack under that definition. Most "full stack" people I know that sit in my Uni courses have mostly learned Java EE + Oracle DB or Javascript + /dev/null^w^w MongoDB. Most of them would probably not be able to construct a relational database or libc from scratch.
Granted, such knowledge isn't immediately useful, since it's not something I or anyone is likely to do, but it grants insight into systems. I know roughly how a query optimizer works and what it can, and more importantly, can't do.
When you know a system you can optimize for it. When you don't know a system you can only follow someone else's advice on how to optimize for it.
Isn't the current quest in quantum mechanics (to find a grand unified theory) to find an abstraction that does leak? At the moment it's too self contained and doesn't explain anything about the macroscopic world.
In any case, having a basic understanding of the next level up (the electron) has proved quite useful to my career, otherwise I wouldn't know how turning things off then on again affects the machines I'm working with.
Them, or the manager that hired them? If no one up the chain brings on a DBA what are they supposed to do? I hear ya. But this is as much a symptom of naive (and budget stretched) leadership as it is of the hands on deck.
1) It took a weekend to complete. The friction was building for far longer. There's a cost to that, esp if it affects customer satisfaction and retention. They didn't refactor for fun, did they :) How many dev teams aren't so lucky? Is this article a no-choice outlier, or a best practice?
2) My comment wasn't directed at the article but on another comment that blamed the developers. These problems should be owned by ownership / leadership / management more and engineers less.
3) That said, hire a DBA? I don't think that's necessary.
>Seems like a fundamental misunderstanding of SQL rather than a particularly hard problem to solve.
Without knowing the rest of their stack, or what their data ingestion looks like, I think your query is oversimplified. If they are doing a union, then it's likely they aren't querying one table, but they are querying multiple tables. The article mentions that individual customers had as many as 500 million rows. Likely each customer has their own set of data they also pipe into the system. Next their own custom query language may support more complex algebra than standard equality.
IMO, the article doesn't sufficiently describe the problem for us to understand why their solution works. To you and I there are 100 other solutions they could have tried that seem simpler than the one they presented.
It's less likely that they overengineered - we are probably just underinformed.
>It's less likely that they overengineered - we are probably just underinformed.
Based on 15 years in software companies in the valley it's much less likely that this isn't over-engineered. Nearly every decision I've seen chasing technology hype has been based on ignorance of existing solutions.
This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
Entity attribute value anti pattern - this has been well known for at least 20 years. It can be tempting when you want to design a "flexible" system but really needs to be used sparingly. I was BI team lead on a product where the architect insisted that it be used on every entity (>300) as you never knew when you might want to add some bit of data. It led to some interesting (multi-page) sqls and the project ultimately failed. This was one of the reasons. Slow performance and often runtime errors when expected data wasn't present and the application layer couldn't cope. It was a good learning experience.
https://mikesmithers.wordpress.com/2013/12/22/the-anti-patte...
We have this as a "meta" field (because MySQL is balls at adding new columns to big tables without multi-hour downtime) with some arcane nonsense format. Totally unqueryable with any efficiency.
The EAV pattern has trade-offs you need to compensate for (performance). Production systems that use EAV have flat tables and heavy caching to stay flexible with acceptable performance.
Oh gosh this pattern. The first time I encountered it was in my first job where we used Magento. Super flexible. Also super slow. Does anyone have any advice how to make a db design like this work faster? Generally I thought when data is arranged like this it might be a prime candidate for document based storage. But I'm no dba so I have no idea if that would be correct.
If you are using Postgres, the JSONB datatype will let you do exactly this while still using the full power of SQL. Simply create a column where you keep a JSON object full of random user properties, if flexibility is what you want. You can even index properties.
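A minimal sketch of that (column and property names invented for illustration):
-- Well-known attributes stay as ordinary columns; the long tail of
-- per-customer properties goes into one JSONB column.
ALTER TABLE users ADD COLUMN props JSONB NOT NULL DEFAULT '{}';
-- One GIN index covers containment queries on any property.
CREATE INDEX idx_users_props ON users USING GIN (props);
-- "female members in the gold tier":
SELECT loyalty_member_id
FROM users
WHERE props @> '{"gender": "female", "tier": "gold"}';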
Or just create ad hoc tables with user fields. Quite often it's not that a customer has n different fields for n entities, but a few that apply to the majority (like internal ERP ids, classification etc.). Put them in a few tables, index them, join them. If you don't want to parse internal DB descriptors, create a set of "schema" tables to build queries from.
The question is whether it can be stored like this while allowing for fast queries. For example, unless it changed recently, Postgres doesn't calculate statistics to help the query planner on jsonb fields.
IIRC JSONB still has problems with index statistics
So values in JSONB columns can be indexed nicely, but the statistics can be much worse than for non-JSONB columns, which can lead the query planner astray.
This just isn't that hard. They don't have that much data. It is really late for me, but, put it all in memory and figure it out. These just aren't hard problems. DBAs have been solving performance issues with a clever index on the right column for 30+ years. Sorry if this is get-off-my-lawn-ish, but I have been on too many projects where I made a DB index and solved a major bottleneck. Too many new developers are ignorant of the nuances of RDBMS tuning. I am not even a DBA.
If they are using some sort of middleware orm, which they may well be because of their model, they are most likely using an EAV[0] schema which, although flexible for writes, is horrendous for reads. The join plus pivot is a disaster on virtually any relational system.
Hmm, that does seem probable. In fact that could make the SQL even more efficient as you'd only need a combined index on the 'prop' and 'value' columns, rather than N arbitrary combinations of indexes that may or may not be used.
Edit: Had some bad attempt at writing this query but it's rather late and it made no sense.
You would need to have a new join for each new property
SELECT DISTINCT m.loyaltyMemberID
FROM members AS m
INNER JOIN properties AS p1 ON m.id = p1.user_id
INNER JOIN properties AS p2 ON m.id = p2.user_id
INNER JOIN properties AS p3 ON m.id = p3.user_id
WHERE (p1.prop = 'gender' AND p1.value = x)
  AND ((p2.prop = 'age' AND p2.value = y) OR (p3.prop = 'censor' AND p3.value = z))
I'm not sure what you're asking - could you give me an example of what you're envisioning that couldn't be satisfied with a combination of Boolean expressions in the WHERE clause ?
Especially with partial indexes, I still feel like this structure will be significantly faster than the original UNION ALL ... GROUP BY on calculated fields.
And they mention in the post that most queries don't use that many fields.
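For what it's worth, a sketch of the partial indexes meant here, against the hypothetical properties(user_id, prop, value) table from the join example above (PostgreSQL; MySQL doesn't have partial indexes):
-- One small index per hot property: each contains only the rows for that
-- property, so it stays compact and selective.
CREATE INDEX idx_props_gender ON properties (value, user_id) WHERE prop = 'gender';
CREATE INDEX idx_props_age    ON properties (value, user_id) WHERE prop = 'age';
-- Each join in the query above can then hit a small targeted index instead of
-- one huge composite index covering every property.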
Confession time: in my first job, I built something like this (and it worked pretty well in the sense that it was very flexible), but then I also had to do a 'select' based on iirc 14 of such properties. I don't really recall the exact problem I had at first, but my solution was to create two separate (temporary) tables, select 7 of the properties into one and 7 into the other, run a select on both of those tables, then join the results in code. This ran at an acceptable speed (I must have done something so that adding criteria made the run time increase non-linearly - doing it on 14 was orders of magnitude slower than on 7).
Then years later I ran into the guy who had to do some work on it after I left that company. I must have scarred him pretty badly, because he remembered it enough to bring it up as pretty much the first topic after the obligatory 'hey so what are you up to nowadays'. When I think back about it now, it was a cringey solution - then again, this was at a company where nobody had ever heard of a 'database index' (or if they did, never mentioned or implemented them).
This is a pretty popular pattern known as Entity-Attribute-Value [0]. It's used by many products where a) data model needs to be very flexible and allow new attributes without schema changes, or b) a typical entity has a large number of possible attributes that may or may not be set for all entities ("sparse" attributes). WordPress uses this to store post metadata, Magento uses this to store product attributes and most of other data, Drupal uses a variation of this to store all the posts and other content you create… I have too much experience with this model to be surprised.
hstore querying is quite slow (and GIN indexes on hstores are pretty massive). I'd always go jsonb over hstores these days, but jsonb has the same indexing problem. JSON has a well-optimized/spec compliant serializer/deserializer in every language you can imagine as a baseline, whereas hstore does not.
I once implemented a variation of this where there was a column called 'data_type', the valid values were the various SQL data types, and in code I would do a switch() on the (string) value of that column and then cast the contents of the 'value' column based on that... Ah the folly of youth...
> This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
I think you're right. Oh ... my ... god ...
I wish I could say this is the worst example of a database schema I've ever seen, but it isn't.
Technology cycle:
X gets invented -> idiots abuse it -> X "is bad" -> Y (strictly worse than X) is "so much better" -> idiots abuse it -> Y "is bad" -> ...
EAV is a valid pattern if the keys are dynamic. For example, in a CRM, the user might want to store properties of their clients that you haven't thought of. In our platform, we use different schemas for each company, so we can actually do an ADD COLUMN ..., but you don't want to do that if you have a multi-tenant DB :)
Using the built-in type for that purpose is going to work way better. This depends on the DB you're using but is generally referred to as a "JSON" field (why? Because they're a response to MongoDB, which calls it that). Oracle and SQL Server have very similar things.
In Mysql, it is JSON data type [1], in Postgres JSON/JSONB [2].
Creating indexes across them is doable, through a workaround (involving what is generally referred to as "VIEWS", but can be called calculated columns or something like that).
And, frankly, in the worst case for indexing, these databases still perform comparably to key-value stores in speed (especially SQLite).
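The MySQL workaround being alluded to is usually a generated column rather than a view; a sketch assuming a hypothetical attrs JSON column (MySQL 5.7+):
-- Extract one JSON property into a virtual generated column...
ALTER TABLE users
    ADD COLUMN gender VARCHAR(16)
    GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(attrs, '$.gender'))) VIRTUAL;
-- ...and index it; the optimizer can then use the index for filters
-- on that JSON property.
CREATE INDEX idx_users_gender ON users (gender);
SELECT id FROM users WHERE gender = 'female';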
They may generally be a better option, but they have their own disadvantages. For example, JSONB fields in Postgres won't deduplicate keys, so if you have large keys, your table size will increase quite a bit (which also makes it harder to keep in memory).
Using an EAV, you can have a "keys (id, key_name)" table, and then only use the IDs in the values table, reducing that waste.
By the way, you don't need views for indexing on JSONB fields, it's supported out of the box in PG.
At least in MySQL, json field types are rather new. MySQL 5.7 is not yet an option with AWS Aurora or Google cloud SQL even.
And I don't think you will necessarily get better performance with JSON fields vs an EAV model. Yes, you can index JSON fields by creating virtual (generated) columns, but that requires that you know the field ahead of time. With an EAV model, you can have your values table indexed and then join.
But I am excited to start using the json field types. In many cases, it will really simplify things over the traditional EAV stuff.
The UI they showed in the blog post looks like it has enough data available to generate that kind of query, too. Like, the ands/ors/nots are right there on the page, the filters are already there too getting translated to SQL as well, just mash them together and you get the same "algebra of sets" stuff right in the WHERE clause.
As it stands the SQL query is quite silly. It gets a list of every user ID that is included by each filter and compares which ones are in the filters you want and not the filters you don't want. Much better is to pass the filters into SQL, let it figure out which users match the filters you want and not the filters you don't, and just use that result.
Most Enterprise CRM-like solutions store tables of customer-property-value instead of using one column per property.
This leads to lots of unions in advanced queries, and makes filtering harder. Some databases even calculate column block statistics to optimize these queries by doing less IO, even for what look like table scans.
Why not one table with all customers and one column per property?
There are a few reasons, having to do with anything from MySQL sucking at schema alters for really big tables, to expectations of Enterprise customers.
I think that JSONB may not be as performant as EAV. You don't need joins or unions, but if you are dealing with dynamic fields, you need to know the fields ahead of time and set indexes for them in JSONB. For EAV you just have to index your values table.
PostgreSQL can use multiple indexes so you don't need to worry about needing to know about the fields ahead of time.
Likewise you can get away with a full document GIN index.
I played around with some basic report stuff at work last year, the EAV data on my local machine, the report took ~7 seconds to run. I shoved the same data into PostgreSQL as JSONB, indexed it just as full doc cos I was lazy, the same report took ~80ms.
Obviously this isn't 'proof' my dataset was only 1.5m by 15m records. But with my limited knowledge i do believe it would perform better, I don't know how much better... but I think better...
I’m not the author of the post. Your comment assumes a well known schema. My understanding from the post is that this solution can join and filter on “custom” datasets of arbitrary schema that each of their customers upload.
I've never played with this, but couldn't you create a table based on the dataset that the customers upload, and let your database engine handle filtering those queries? From the looks of it, even if they were doing full table scans for each query, it'd still be faster than all those unions...
I think the point is they don't know in advance what the query is, and they didn't think they had a good solution for optimizing all user-entered variants across the range of possible groupings, so they wanted a solution that was easier to optimize globally.
The general form of this is:
Select loyaltyMemberID
from table
WHERE (V1_1 = x_1 OR ... OR V1_n = x_n)
AND (V2_1 = x_2_1 OR V2_2=x_2_2 OR ... V2_n=x_2_n)
AND ...
AND (Vn_1 = x_n_1 OR ... OR Vn_n= x_n_n)
(some of these n's should actually be m_i's but I was lazy)
There may be some ability to optimize this in a number of ways but optimizing one example is not optimizing the general form. I can easily see how technology change could be a cleaner solution.
> There may be some ability to optimize this in a number of ways but optimizing one example is not optimizing the general form.
I totally get that, but isn't that the point of the query optimizer within the database itself? Why are you trying to outwit it? It should select the right indexes, provided the columns are indexed, and "do the right thing(tm)". It might take a bit of cajoling but they seem pretty good at this. Postgres collects statistics about the distribution of values themselves within the table to guide its choice of index, so in theory it could rewrite the boolean logic to use a specific index if it's sure that it will eliminate a higher % of the rows than another plan.
In any case, it seems the SQL they posted is a bit off. Why nest each individual filter as a UNION? If you wanted to go down the UNION route couldn't you do each individual group as a UNION, with standard WHERE filters?
I blame ORMs: if you don't understand SQL and how databases work, you should not be allowed to use an ORM. If you do know how databases work, in many cases you will not use an ORM except for the most simple CRUD operations.
I use both. I know the Django ORM and I know where its limits are. For getting data in it saves a load of time. When the query gets complex it starts adding time, or making things impossible (multiple join conditions weren't possible unless it's been updated in the most recent version).
OK, the part about not using ORM if you know SQL is a bit of an exaggeration. At least when you know SQL you know when to use an ORM and when to not use it. If all you know is ORM then you will always use it, and ORM seems to lead to many developers not learning SQL
I think it depends... Who are we talking about here? Juniors, even intermediates, in my experience, haven't had enough time on the job to have learned enough to be writing raw SQL statements or query objects unless they're actively punching up on a daily basis. I am unfortunately talking from experience here.
What I am saying is, I really do not want a situation on my hands where the juniors that I work with, or most of the intermediates, and even a few of the seniors and leads, are writing raw SQL or query objects. Most of these folks have n years of experience in web and desktop application development and couldn't give you a passable answer to simple questions like, "What's a database index?" I know this isn't isolated to my current employer, or former employers, and I've seen it in other organizations where I've done some consulting on the side, and all of these folks I'm talking about here have largely worked else where in the past, too. And this in itself leads to other third-order effects, like the "SQL wizards" who get asked all of the "tough" SQL / database questions.
I want to stress that I understand the point that you're making, and I do agree with it, and of course, so do many (all?) ORM authors themselves, but I think the advice is wrong and is prone to take you to a much worse situation. I think we have an obligation as people who do grok SQL and databases to gently introduce our less experienced co-workers to the idea that ORMs are not a panacea to all database interactions, but until the companies we work have enough of an incentive to give us that sort of time and empowerment then I, for one, am going to recommend ORMs for everyone for everything unless they really, absolutely, demonstrably know what they're doing.
I do also understand your points and think we agree on most. I think that if a "developer" can't write SQL, I would not trust them to set up the ORM correctly either. For basic usage, sure, they will get it to work and all is good. But when you want to join tables or run aggregate functions, the same people who write bad SQL could also write bad ORM code with N+1 queries. ORM has its place and optimized, beautiful SQL has its place; a craftsman knows which tool to use where, and when to ask for help.
One of the problems, in my opinion, is that SQL isn't "cool" or hip and by many seen as not important to learn. While the new fancy JavaScript-based language or framework which nobody uses and that will be replaced next week is much more important to learn.
Oh, they absolutely do, and when we're lucky they actually catch them on their own before they get to code review. Some folks reach for tools like Bullet [0] and, that's great, but unfortunately, sometimes they treat that tooling like the Holy Gospel. They develop an over-reliance on them as if those tools exist to offload critical thinking. Drives me crazy... in my experience, it's been hard to combat this type of thing, too. The pace of "agile," the calculus between paying down technical debt and mentoring and progress, I don't really know why but I haven't had a lot of long-term luck.
> One of the problems, in my opinion, is that SQL isn't "cool" or hip and by many seen as not important to learn.
I think you're really right about that. I happen to like writing SQL quite a bit and I take a little bit of pride in that I kind of sort of actually understand a little about what is going on in the database and even then I neglect that skill. I picked up copies of both "SQL Anti-Patterns" and "SQL Performance Explained" based on recommendations from this thread and am eager to get in to them this weekend. Still lots to learn... And, I have some SQL problems that I can see coming up over the horizon today and I hope this gives me the edge I need to start grappling with them sooner rather than later.
> Edit: I've just read the query in the post again and I really can't understand why you would write it like that. Am I missing something here?
Oh, I've seen this happen a lot. Somewhere along the line, often from a DBA, it is decided that SQL in an app is evil and that everything must be in a stored proc. Then, instead of some simple string concatenation, you have to jump through hoops like this.
One of the things I've done to harden an app is to revoke all permissions other than EXEC on a particular schema, then make sure everything is done via parameterised stored procedures - no chance of SQL injection then.
But that creates situations like this where you have to jump through hoops to solve simple problems. You solved one potential issue at the cost of creating many more.
> no chance of SQL injection then.
You know you can have SQL injection attacks inside stored procedures? If you think stored procedures are a panacea then you don't understand the problem you're solving.
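The protection comes from parameterization, not from where the SQL lives. Here's a small, hypothetical Go sketch of the distinction (table and column names are invented); a stored procedure that concatenates its arguments into dynamic SQL has exactly the same problem as the first function.

    // Sketch only: safety comes from sending values separately from SQL text.
    package example

    import (
        "database/sql"
        "fmt"
    )

    // injectable builds SQL by string concatenation: a name like
    // "x'; DROP TABLE members; --" breaks out of the string literal.
    func injectable(db *sql.DB, name string) (*sql.Rows, error) {
        q := fmt.Sprintf("SELECT id FROM members WHERE name = '%s'", name)
        return db.Query(q)
    }

    // parameterized sends the value as a bind parameter, so user input is
    // never parsed as SQL. The same rule applies to dynamic SQL built
    // *inside* a stored procedure: concatenation there is just as injectable.
    func parameterized(db *sql.DB, name string) (*sql.Rows, error) {
        return db.Query("SELECT id FROM members WHERE name = ?", name)
    }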
I am building a rule engine quite similar to this. An AST parser runs over all the Python DSLs and generates the list of tables to INNER JOIN, then all the tables' data is SELECTed out with the filters applied in one pass, and finally all the results are run through the Python code.
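For the "one pass" part, a rough sketch of the idea in Go (the actual engine described above is Python-based, and the table/column names here are invented): collect the tables the rules reference, join them once on a shared key, and AND all the predicates into a single WHERE clause.

    // Rough sketch: compile a list of filter rules into one SELECT.
    package example

    import (
        "fmt"
        "strings"
    )

    // Rule is a filter extracted from the DSL: which table it touches and
    // the predicate it contributes.
    type Rule struct {
        Table     string
        Predicate string        // e.g. "transactions.amount > ?"
        Args      []interface{} // bind parameters for the predicate
    }

    // BuildQuery emits one SELECT that INNER JOINs every referenced table on
    // a shared member_id and ANDs all predicates, so the data comes out in a
    // single pass.
    func BuildQuery(rules []Rule) (string, []interface{}) {
        seen := map[string]bool{"members": true}
        joins := []string{"FROM members m"}
        var preds []string
        var args []interface{}

        for _, r := range rules {
            if !seen[r.Table] {
                seen[r.Table] = true
                joins = append(joins, fmt.Sprintf(
                    "INNER JOIN %s ON %s.member_id = m.id", r.Table, r.Table))
            }
            preds = append(preds, r.Predicate)
            args = append(args, r.Args...)
        }

        q := "SELECT m.id " + strings.Join(joins, " ")
        if len(preds) > 0 {
            q += " WHERE " + strings.Join(preds, " AND ")
        }
        return q, args
    }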
Age/gender are probably simple, but I'm guessing censor is probably derived from a transaction table. If they are letting users select arbitrary time ranges to filter the transactions, then you can't store a precomputed censor value for each user. But seeing that they are talking about caching, maybe a lot of stuff can be precomputed.
Damn, you're not kidding. I wonder why they needed more than one query here; plus, UNION is slowwwwwwww. They never mention how frequently this query needs to run either, only the amounts of data involved in some aspects of this table.
I was going to ask how to optimize the SQL in your post as it seems like the obvious/naive implementation of the query. If you're missing something, so am I. I can only imagine it was built up over time from googling specific terms that already missed the point, e.g. "rds mySQL union query"
Gotta agree with others and say that they’re clearly skimming over the facts that:
- they didn’t have the expertise to actually fix the SQL. That query smells bad. The data model smells bad. For some reason HN is always superstitiously afraid of letting developers touch the database, but if you don’t let devs touch the database enough you end up with this sort of thing; or that crap data model with properties in rows instead of columns, because oh god, we can’t let devs actually do DDL so we’d better make it all really flexible (and incredibly slow because it’s a misuse of the database). I mean, implementing your own result caching mechanism? I don’t know about MySQL but surely it has its own caching mechanism (Oracle does) that isn’t being used because the query is bad.
- project management probably had no interest in fixing the performance/incorrect data problems, and devs were expected to do it in their own time.
In a way though this makes me feel better, other people are dealing with these problems too and their overengineered solutions work and keep the company running, I guess mine will too :)
I'm all for companies releasing technical blog posts, but there's some really strange framing here.
This is actually a story about how decisions get made, and how better ones can be made. Reading a company's mea culpa tells you they are well-informed and well-intentioned.
This is not a story in which good decisions were made and then "the (tiny, startup) database company went bust" - but that's their framing. Yikes.
What else it shows is how expensive AWS hardware is versus hosting your own. I guess you have to consider how often you have to scale, but Hetzner offers dedicated servers with 64 GB of RAM and NVMe drives starting from 54 euros per month - https://www.hetzner.com/dedicated-rootserver?country=us - compare that to the $580 per month these guys were paying for an i3.2xlarge instance.
I’ve heard mention of Hetzner no less than a dozen times in the last couple days. What’s their deal? I’m not quite sure I grok this server auction thing they do, or how they’re so cheap.
Hetzner got in the news recently because they now offer a "cloud" product for VPS. Not in the AWS sense, where you can shut down instances and pay less, but in the sense that you can buy, provision, and delete VPSes via an API and pay per hour. They are also dirt cheap and offer 20 TB of egress traffic with even their cheapest VPS.
How do they do it? I don't know. They use Xeon processors and not i7s like some others do.
If all you need is virtual machines or dedicated machines to host your databases, services, etc., then AWS is very expensive. You could literally buy from 3-4 different vendors to maintain availability in a disaster scenario and still be cheaper. Hetzner is one of the most affordable providers, and both their dedicated and cloud offerings are fast enough.
I am not talking about the auction thing here, just their regular dedicated server offerings. They are bare metal and it is up to you to set them up, but the savings are in the 5x-10x range as long as you avoid wasting money on underutilized hardware (i.e. cases where you have to scale from 1 to 100 instances daily might not be financially advantageous).
Well... it depends on what you are doing. If you are indeed using the majority of AWS services, then yeah, getting off them will be hard to impossible.
But if all you use is RDS, EC2 and S3, it is quite easy to move to dedicated hardware once your service is up and running on AWS. Unless your load jumps around 10x all the time, of course (then you will need dynamic scaling, which is not really possible with dedicated hardware).
You can see there 64 GB boxes with NVMe drives starting around $160 USD per month. Not as much savings as with Hetzner, but it still beats the 1:1 price of AWS.
Unless you can directly see how a query can be optimized, the first thing you do is get the execution plan (e.g. EXPLAIN <query>).
The execution plan will tell you how expensive each bit of your query is and help you adjust it.
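If you want that from application code rather than a console, a tiny Go helper like the hypothetical one below is enough to dump the plan; the exact columns depend on the database and version, but for MySQL you're looking out for things like "type: ALL" (full table scan), a missing "key", huge "rows" estimates, or "Using temporary; Using filesort".

    // Minimal sketch of pulling the plan from application code (MySQL-style
    // EXPLAIN shown; output columns vary by database and version).
    package example

    import (
        "database/sql"
        "fmt"
    )

    func explain(db *sql.DB, query string, args ...interface{}) error {
        rows, err := db.Query("EXPLAIN "+query, args...)
        if err != nil {
            return err
        }
        defer rows.Close()

        cols, err := rows.Columns()
        if err != nil {
            return err
        }
        vals := make([]sql.RawBytes, len(cols))
        ptrs := make([]interface{}, len(cols))
        for i := range vals {
            ptrs[i] = &vals[i]
        }
        for rows.Next() {
            if err := rows.Scan(ptrs...); err != nil {
                return err
            }
            for i, c := range cols {
                // Print each plan column as name=value for quick inspection.
                fmt.Printf("%s=%s ", c, vals[i])
            }
            fmt.Println()
        }
        return rows.Err()
    }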
From there, if things are not getting better, you have a lot of alternatives:
- Consider creating an index
- If the value doesn't change often, consider writing it into another table or caching it.
- Replication, partitioning, sharding, changing the schema.
- Reconsider the requirement being implemented in order to have a more scoped query or to perform the query less often.
Then... OLAP is not OLTP. If you can, do reporting in another database.
Finally, creating your own project in the end may not save you $50,000. How about maintenance? tooling built around it? integration costs? documentation? usability? new hires having to learn about it? You can hire people that already know SQL without having to incur that cost yourself. All the tooling is built, battle-tested and readily available. Plus, skills related to internal tools are harder to trade in the market because they're harder to verify and less transferable.
Elasticsearch works for this use-case quite well, you'd store a fairly straightforward representation of the MySQL row as a document, query by the fields you're interested in and ask for aggregations on the matching documents. Common bitsets get cached automatically.
This is exactly how we implemented the rules engine in Kevy. We construct an Elasticsearch query based on the rules selected in the UI, then use the scroll API to retrieve the matching documents.
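For anyone who hasn't used Elasticsearch this way, a hedged sketch of the shape of such a request in Go: a bool filter over member documents plus an aggregation, sent with net/http. The index and field names are invented, and real code would more likely use an official client library than raw HTTP.

    // Sketch: filter members and aggregate counts by gender in one request.
    package example

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func countByGender(esURL string, minSpend float64) (*http.Response, error) {
        body := map[string]interface{}{
            "size": 0, // aggregations only; no document hits needed
            "query": map[string]interface{}{
                "bool": map[string]interface{}{
                    "filter": []interface{}{
                        map[string]interface{}{"range": map[string]interface{}{
                            "lifetime_spend": map[string]interface{}{"gte": minSpend},
                        }},
                    },
                },
            },
            "aggs": map[string]interface{}{
                "by_gender": map[string]interface{}{
                    "terms": map[string]interface{}{"field": "gender"},
                },
            },
        }
        buf, err := json.Marshal(body)
        if err != nil {
            return nil, err
        }
        url := fmt.Sprintf("%s/members/_search", esURL)
        return http.Post(url, "application/json", bytes.NewReader(buf))
    }

Filters like this are cacheable bitsets on the Elasticsearch side, which is where a lot of the speed for repeated segmentation queries comes from.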
This is cool but it seems strange (to me) that this was a “Hackathon” project as opposed to just a stand-alone problem to be addressed as a normal course of doing business. It doesn’t make the solution less cool. It just seems like a strange distinction on what a Hackathon is.
A hackathon is something that used to be a cool party for geeks (i.e., a Mathletics competition or an ACM programming contest) until the corporate overlords bastardized it and converted a good thing into unpaid overtime with free beer.
Well, what would be the argument against that - if in fact it does deliver code that solves a problem in a short period of time? Why can't you just do that all the time?
Having a fairly small, well defined problem, working in a small self-selected team, with no outside interference and typically no clients outside the team. No managers, no PMs, no bug reports. This is not how day-to-day projects get done.
Regardless of all of that, in my experience a typical impressive hackathon project is still just a barely working demo that benefitted from a significant amount of research and planning beforehand, and will require an even greater amount of hardening and polish afterwards.
There is no magic, it's just a vastly different kind of work environment with both inputs and outputs incomparable to day-to-day work.
The place where I'm working has done a corporate "hackathon". It took place on a normal day and I don't think there was any overtime; we got free food, but all the projects that delivered something relied on existing research and needed a good deal of polish afterward.
Same reason you can't drug up or taze the special goose to produce more golden eggs.
Long term high intensity output will lead to burnout, even if the salary is 10x people would struggle and crash. Pushing at 100% full enthusiasm is like sprinting, it is not possible to maintain that intensity for very long. It can be fun, it can be productive, but the wiser approach has the long-term and end in mind.
Sounds like a good argument for always introducing an artificial fatal flaw into your project before presenting it... so proj management can’t see it and yell “my god, it works! Let’s put this cobbled together, coffee-fuelled mess right onto prod!”
Depending on the type of company, it could have been blocked by the manager or PM. It is easy for them to reject "tech stuff", when your time could be spent adding a feature from the product owner.
A proposal like this could easily have been seen as the developers wanting to test out a technology that was not approved or with a good business case. That business case is usually something that only sales/product can sell. The barrier to listening to developers is higher because they are assumed to not know enough about business.
You may have an "agile" environment, but you often need a very good reason to not pick the next item from the backlog, which was not created and maybe not even prioritized by you.
In those companies, the hackathon may be the only time developers can present their ideas.
My guess would be a dev who knew that something was wrong with the performance of their old system, but for whatever reason it wasn't something that anyone higher up prioritized. If that's the case, I can see why one might come up with the idea of trying to solve it outside the bounds of the normal production/SQL environment. At least in my company, the only time we devs ever get to specifically look at the performance of our website is when we have "hackathon" days where we choose our own projects. I oftentimes feel like my regular time would be much better spent trying to optimize our 2-second+ initial page load times instead of all the other small tasks/tweaks/bugfixes that get sent my way. But performance is something very few people higher up seem to care about. Or maybe it's a case of users and managers becoming so accustomed to something being slow that they don't notice anymore.
Cool post - Definitely an improvement and a good fit for Go services. I'm curious - did you run performance comparisons on optimizing the SQL itself as compared to adding this additional service?
Maybe I'm crazy, but just looking at that query it seems like there's definitely room for improvement with the SQL alone. Unless the "..." is hiding something I'm missing?
It's not that a Go microservice solved their problem; it's the different algorithm they use for querying. That has nothing to do with Go or microservices.
This is a great example of how to do things wrong. I'm surprised they couldn't find any other columnstore database to take the place of InfiniDB in 2018.
The numbers they quote (5M members, 100M transactions) are tiny for any modern data warehouse. Many solutions would run these in sub-second speeds without changing the SQL at all, and it would be far better than building a quasi-SQL engine in Go.
Actually, for the occasional querying + caching that they have, something like BigQuery or Snowflake would be even cheaper, with basically zero operational effort.
It seems like it from his explanation. He says he'll explain why he thinks only Go could have done it, but then nothing specific to Go really materializes. This is how it seems to usually go with posts like these.
It's like those infamous enterprise benchmarks from yesteryear.
"NYSE moves from Solaris to RHEL and gains a 800% performance benefit".
While I don't doubt a brand-new RHEL has more performance optimizations than what is effectively SunOS 5.2, the people doing the benchmarking should have also said that the original hardware was the equivalent of a PIII and that they were moving to the latest Xeons.
I'm not kidding, I've actually seen a press release like this.
Okay, before I've even read the article, I'm going to guess that they were doing something egregiously expensive in the cloud, and the microservice helped them do it more efficiently - but still much more expensively than doing it in a simple, old-fashioned way.
Now I'll read the article ...
EDIT: I would say I'm no more than 30% right. They were doing heavyweight data crunching in the cloud, and so paying more for it than if they were doing it on rented hardware. But that's a constant-factor thing; it's not like they were downloading gigabytes of CSVs from S3 on every request or some such. Their query looks suspect to me: couldn't it be written to do one big scan, rather than unioning a load of things? Or is this the right way to write queries on column stores? Still, there is no glaringly obvious (to me) old-school fix for this.
Caches could be stored in the database in a materialized view, in an external service like memcache or redis, or even in the application itself.
Expiry can take a few different forms. Some caches have a defined space and use a replacement scheme like "fill the cache up, then remove the least recently accessed value". Some don't have defined sizes but instead remove entries based on timestamps (cache for n minutes). Some depend on invalidation messages from the application. It all depends on the applications needs.
The most important thing to remember is that caching means your system becomes inherently a distributed one. State can become split across multiple sources, the cache can return stale data, invalidation might not happen when you expect, ...
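As a toy illustration of the timestamp-based flavour (the names here are invented, and production systems more often reach for memcached or Redis as mentioned above), an in-process TTL cache in Go is only a map, a mutex and a clock - which also makes the staleness caveat easy to see:

    // Toy in-process cache with timestamp-based expiry.
    package example

    import (
        "sync"
        "time"
    )

    type entry struct {
        value   interface{}
        expires time.Time
    }

    type TTLCache struct {
        mu   sync.Mutex
        ttl  time.Duration
        data map[string]entry
    }

    func NewTTLCache(ttl time.Duration) *TTLCache {
        return &TTLCache{ttl: ttl, data: make(map[string]entry)}
    }

    func (c *TTLCache) Set(key string, v interface{}) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.data[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
    }

    // Get can hand back data that is up to one TTL out of date -- the
    // "cache can return stale data" point from the comment above.
    func (c *TTLCache) Get(key string) (interface{}, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        e, ok := c.data[key]
        if !ok || time.Now().After(e.expires) {
            delete(c.data, key)
            return nil, false
        }
        return e.value, true
    }

    // Invalidate is the hook for application-driven invalidation messages.
    func (c *TTLCache) Invalidate(key string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.data, key)
    }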
Blogpost author here. Thank you so much for all the attention, comments, upvotes, likes, retweets, etc!
I've done a pass over the comments and can't really answer them all but I'd like to clarify a few things:
There seems to be a general opinion trend that the queries generated by the group builder algorithm are very inefficient, that it'd be easy to come up with a solution with much better response times, and that that would be achievable in any reasonable programming language in roughly the same time with similar results.
The language argument will always be controversial and I won't address it here; we have a point of view that is expressed in the Conclusion and on this blogpost: https://movio.co/en/blog/migrate-Scala-to-Go/
I can imagine that seeing a query with JOINs, subqueries, GROUP BYs and UNIONs can raise some eyebrows, but there is some lacking context in that story, and that's on me. Here's some of that context:
* The schema that the group builder algorithm operates on is not uniform in nature or composed of simple yes/no fields; it's an incredibly complex legacy schema that to a large degree wasn't even up to Movio: it's been up to the film industry as a whole, and it has evolved over the years, as is the case everywhere. Note that every different kind of filter translates to a very different kind of query, and we have more than 120 different filters, sometimes with dynamic parameters, and sometimes even bespoke for a particular customer!
* The group builder algorithm predates the team that built this service (myself included), as well as predating the first commercial release of Elasticsearch, MariaDB, mainstream Go success, etc. Nevertheless, it's still very fast and is being used today by ~88% of our customers (i.e. all the non-behemoths). It's been successful for many years, and continues to be, for the most part.
* But I don't like it because it's fast: I like it because it's simple and flexible. It allows our customers to build a really complex (and arbitrary) tree of filters to segment their loyalty member base, and it compiles all of that into one big SQL query, that in most cases is quite performant. That's pretty awesome. But yes; it doesn't scale to several million members.
* Migrating the very engine of the main product of a company is not a decision that is taken lightly. As is the case with every big company I can remember (e.g. Twitter, SoundCloud), behind a big success story there's always a legacy monolith, and our case is no exception. From that standpoint, achieving such a breakthrough (i.e. cost reduction + significant response time improvement) within one hackathon day is really not all that common in my experience. Definitely something worth sharing, IMO.
Hopefully that clarifies some of the questions :) Cheers.
We had the same issue where I work and we are doing a very similar thing but on a way larger scale (adtech) for audience building and we actually resorted to compressed bitmaps since postgres was not cutting it.
It's fairly easy to just come on a forum and say hey: just use postgres/mysql/sql server without reading the full article and understanding what you guys are dealing with.
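For readers unfamiliar with the bitmap approach mentioned above, here's a deliberately plain (uncompressed) bitset sketch in Go just to show the idea; at adtech scale you would use a compressed format such as Roaring rather than this, and the helper names are invented.

    // Plain bitset sketch: one bit per member ID for a given audience.
    package example

    import "math/bits"

    // Bitset maps member IDs (0..n-1) to one bit each.
    type Bitset []uint64

    func NewBitset(n int) Bitset { return make(Bitset, (n+63)/64) }

    func (b Bitset) Set(id int) { b[id/64] |= 1 << (uint(id) % 64) }

    // And intersects two equally-sized audiences ("bought a ticket last
    // month" AND "age 18-25") 64 members at a time.
    func And(x, y Bitset) Bitset {
        out := make(Bitset, len(x))
        for i := range x {
            out[i] = x[i] & y[i]
        }
        return out
    }

    // Count returns the audience size via popcount.
    func Count(b Bitset) int {
        n := 0
        for _, w := range b {
            n += bits.OnesCount64(w)
        }
        return n
    }

Segment combinations become bitwise ANDs/ORs over these, which is why the approach scales so well once the bitmaps are built.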
I once built a micro service in Go that saved about as much. Took me a couple of days (meetings and all that) from start to deployed. It still runs on the same cheap aws instance. Go is really good at that sort of thing.
Right. The blog post is a good narrative over time but isn't super clear about how they solved the problem. If I understand correctly they broke down the big MySQL query into separate queries that the Go service processes/caches?
Blogpost author here. That is correct. Sorry if the explanation isn't super clear; for this particular question, you can consider the two diagrams as a before and after. They pretty much convey what you have explained here.
MySQL is very slow when it comes to joins, groups, etc. You are always better off with simple SELECTs. If you are using, for example, PHP, the only viable solution is to have the DB crunch it. But when using Go, Node.js, et al., you can pull the data out as a stream/array and apply filter/map/reduce, and the logic would probably be easier to manage than generating a complex SQL query. It would also allow you to stream the result to the client, instead of having the user wait for it all before they see anything. A lot of money could probably also be saved by having the data on the client side, for example in a web DB, and only using the servers for backups and syncing the data between clients.
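A hedged sketch of that "simple SELECT, then filter and stream in code" approach in Go, using database/sql and a streaming JSON encoder; the table, columns and filter predicate are invented, and this is just the shape of the idea rather than a drop-in solution.

    // Sketch: one simple SELECT, filter in Go, stream rows to the client.
    package example

    import (
        "database/sql"
        "encoding/json"
        "net/http"
    )

    type Member struct {
        ID     int64  `json:"id"`
        Age    int    `json:"age"`
        Gender string `json:"gender"`
    }

    // streamMembers applies the filter in application code and writes each
    // matching row as soon as it is scanned, so the user starts seeing
    // results before the full scan finishes.
    func streamMembers(w http.ResponseWriter, db *sql.DB, keep func(Member) bool) error {
        rows, err := db.Query(`SELECT id, age, gender FROM members`)
        if err != nil {
            return err
        }
        defer rows.Close()

        enc := json.NewEncoder(w)
        for rows.Next() {
            var m Member
            if err := rows.Scan(&m.ID, &m.Age, &m.Gender); err != nil {
                return err
            }
            if !keep(m) { // filter in Go instead of in SQL
                continue
            }
            if err := enc.Encode(m); err != nil {
                return err
            }
        }
        return rows.Err()
    }

The obvious trade-off is that you ship the whole table over the wire to the app, so this only pays off when the filtering logic is genuinely hard to express or optimize in SQL.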
I have a 10+ year old project on MySQL doing 6+ way joins, groups, etc. against 100-million-row tables in sub-second time. It all depends on the indexing, disk layout, size of intermediate products, etc. We don't know why they did "UNION ALL" with huge intermediate products (it seems like the query could be a single index/table scan on its face), but that is likely the slowdown; "EXPLAIN SELECT" would tell us.
Stuff like the group builder in the article is hard to reason about as there are so many combinations. MySQL is bad at optimizing O(n^2) work; no matter how you design the schema it will be slow. The solution, as they probably did in the article, is to break it down into separate, simpler O(n) queries.
So, instead of fixing your messed up data model, you wrote a service (sorry, microservice) which tries to keep your main DB in sync with a columnar cache, so you can keep using that awful query with multiple UNIONs. Throw in some "Go" and boy, you have your Medium post going!