I've come to the conclusion that the problem in tech is that all the people doing the work are in their early twenties and have no idea what they are doing. Once they get some experience they are quickly promoted to the CTO position. Rinse and repeat.
What we have here is a classic DBMS problem and no one at Movio seems to know how to deal with it. Instead of migrating from MySQL to something serious (Postgres) they move to some columnar DB no one has heard of. Never mind that Postgres, a reasonably priced DBA, and a little thought put into their data model/queries could probably handle all their issues.
Sorry for the snark, cheers on a successful product.
Can't help but also think "WTF are they doing there..." - we're doing exactly the same (user segmentation, targeting and campaign execution for cinema & movie users, disclaimer: we're more or less their only competitor, albeit indirect), but our solution is running at ~30k/year total at 10 times their user base. No magic in there, just good architecture and solid Computer Science. Boring technology (Go/Redshift/Postgres/S3).
The only thing I'd fully agree on is that using Go saved us a lot of resources as well. It's an awesome choice for stuff like this that needs to be reasonably performant as well as being simple, understandable and reasonably fast to build.
Well, a problem in tech certainly. An alternative possibility (and another problem in tech): some mid-level developer could have figured out the problem at the start, but because of artificial time pressure to deliver they didn't have the time to, so they went with the first bad idea that popped into their head without taking the necessary time to evaluate it.
The fact this had to happen in a hackathon suggests a typical disconnect between management and development (and probably poor prioritization by management). Because development knew this was a problem and how to fix it (evidenced by the fact they fixed it), but it took removing management (aka a hackathon) to give development the space to fix it. And now the company pats itself on the back for having the vision to host a hackathon instead of structuring and prioritizing correctly in the first place so this would just get fixed on the clock.
I do think the author's takeaway about the value of simplicity and pragmatism is on point, but that applies not just to code but to management as well.
It is worth noting that on their website, their management team doesn't include a CTO, even though their main product is basically a software solution. They have a few sales people represented though, so management might not be great techwise.
The CTO has moved to another company with a not-so-shiny tech stack; as far as I know, they were the main adopter of Go and other solutions to replace Java and then Scala. Perhaps Movio has not yet found the best fit for the company.
I think it is not necessarily the age but the mindset of focusing on solutions instead of understanding the problem first.
It often goes like this:
Oh snap, we encountered a problem! Let's find a tool, framework, language that promises to solve a similar sounding problem.
Now we have a problem with a layer of abstraction on top. Soon to be two problems.
Let's find a tool, framework, language to solve both of them ...
It is a spaghetti-at-the-wall approach, where you just throw a bunch of things at your problem hoping that something sticks. And who cares how long it will stick.
Secondly, as a developer, I think dedicated DB experts are way underrated in start-ups. Sure, your fullstack devs can cobble together some tables, changing them 15 times a day to accommodate business requests and slapping indexes on everything that gets slow. That is also the way to get into trouble once you scale, and instead of reflecting on why this is, people reach for the bowl of pasta.
I was no different when just starting out. I thought my biggest strength was how quickly I could come up with easy "solutions" for any problem the company had. It took me years to realize how silly an approach this is.
>It often goes like this: Oh snap, we encountered a problem! Let's find a tool, framework, language that promises to solve a similar sounding problem. Now we have a problem with a layer of abstraction on top. Soon to be two problems. Let's find a tool, framework, language to solve both of them ...
is absolutely real; I've actually seen it happen, both in projects I was in and in ones I've heard or read about.
Yeah, my opinions on this are "slightly" influenced by the Rich Hickey talks "Simple Made Easy" and "Hammock Driven Development" :)
The difficulty I find is identifying the moment to leave the hammock again in a startup environment. To what degree do you need to understand a problem before you take action? If you try to understand it 100%, you'll never get anything out there.
But I'm already very happy that I was able to convince the business side of the company of the approach in a brief talk about it, and they now refer to "the hammock" themselves :)
>The difficulty I find is identifying the moment to leave the hammock again in a startup environment. To what degree do you need to understand a problem before you take action? If you try to understand it 100%, you'll never get anything out there.
Agreed. The problem, though (and I'm painting with a broad brush here), is that the erring tends to be much more on the side of not trying to understand the problem much, or at all, before jumping into action. I think a lot of it is due to peer pressure and wanting to be "seen" by peers and bosses (and VCs) to be doing stuff, as opposed to really getting things done better in the medium term, even if in the short term it looks like you are not acting but "only" thinking or analyzing or designing stuff. Hence my comment in that post I linked to, about "we have to ship next week". All too common - been there, seen a good amount of that. In fact, this subthread between HN user jacquesm and me just recently is basically about the same point, although described in different words:
>But I'm already very happy that I was able to convince the business side of the company of the approach in a brief talk about it, and they now refer to "the hammock" themselves :)
Somewhat related: I've seen this problem exacerbated by the presence of "architects" who don't seem to have implemented running systems in a long time, and especially have no experience with running newer technologies; or limited experience which breaks at scale. Not saying this applies to all software architects, but I've seen this often enough.
e.g. I remember using a dedicated Jenkins environment to run continuous, scheduled integration tests for my service. When the architect found out, he immediately sent me links to software packages that are dedicated to running continuous tests. I asked whether he had any experience running these new packages and if he would be willing to set them up/maintain them... radio silence.
Some time ago, I thought it was <easy> to write code to do things. By now, I mostly ponder how I put things into postgres/kafka|rabbitmq|../memcache|redis|.../elasticsearch/neo4j so I can reduce everything to good queries into these systems.
You're right. I started using MySQL in 2000 when it was still a toy. It's an outdated bias :) I was more flabbergasted that they would choose InfiniDB when there are so many other great options out there.
Well, MySQL still doesn't implement the SQL standard from 1999 (20 years ago), because it's missing common table expressions. Although the next release will support them, thankfully. Sorry, I just had that axe to grind.
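For anyone who hasn't run into them, a CTE is just a named subquery; a minimal sketch (table and column names here are made up for illustration) of the kind of query the standard has allowed for ages:
-- Works in PostgreSQL (and MySQL 8.0+); older MySQL forces you to inline
-- the subquery or stage it in a temporary table instead.
WITH recent_spend AS (
    SELECT customer_id, SUM(total) AS spend
    FROM orders
    WHERE ordered_at >= '2018-01-01'
    GROUP BY customer_id
)
SELECT c.name, r.spend
FROM customers c
JOIN recent_spend r ON r.customer_id = c.id
WHERE r.spend > 100;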
I disagree with your assertion that MySQL /is/ a serious database.
The questions I usually ask myself when evaluating database solutions are:
* Does it accept invalid data?
* Does it change data on error?
* Does the query planner change drastically between minor versions?
* How strong is transaction isolation? Can I create constraints, columns or tables in a transaction?
* Does it scale vertically above 40~ CPU threads and 1M IOPS?
For MySQL, the answer to every one of these questions lands on the wrong side. You could argue the value of some of them, but a lot of them highlight architectural or development-process shortcomings.
Not the OP, but one example of accepting invalid data is MySQL defaulting values to null/0/0000-00-00 when no value is assigned and there is no default on the column - unless one is in strict mode.
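A minimal sketch of that behaviour with a made-up table (exact results vary by MySQL version and sql_mode):
SET SESSION sql_mode = '';  -- strict mode off, the historical default
CREATE TABLE people (
    name VARCHAR(10) NOT NULL,
    age  INT         NOT NULL,
    born DATE        NOT NULL
);
-- Omits two NOT NULL columns that have no default. Instead of failing,
-- this succeeds with warnings: age becomes 0 and born becomes '0000-00-00'.
INSERT INTO people (name) VALUES ('Alice');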
I appreciate that it might sound like that to someone who hasn't used MySQL in production for 10+ years.
To start with, this is still true today: https://vimeo.com/43536445 - despite the video being 6 years old, strict mode is still required.
Anything prior to MySQL 5.7 will accept "0000-00-00 00:00:00" as a valid date; 5.7 will not (which is sane), but this means migrating from 5.6 -> 5.7 just got a little harder.
In fact it wouldn't validate /any/ date, so it would assume every year was a leap year and February always had 29 days.
Regarding performance:
This is what I found from my own experience: I was given the task of testing the limits of MySQL. MySQL was the chosen technology and I was not involved in making that decision, so - whatever.
We were given 10 servers with 40 cores (2014-2015-ish), 128G of DDR3/ECC, and 8 SATA SSDs in RAID-0 with 1G of RAID cache for write-back.
We managed to get MySQL to bottleneck pretty quickly. Our queries involved a lot of binary data, so we should have been raw-IOPS bound, but we weren't; we were memory bound. So we replaced the memory allocator with a faster one (jemalloc) and got a 30% performance improvement. We suspected that the kernel sockets implementation was slowing us down, so we compiled a custom "fastsockets" Linux kernel. The improvement was around 4%, but we were still bottlenecked on memory. After doing a full trace of what MySQL was doing, we saw that InnoDB was spinning on a lock quite a lot.
I asked if we could try other SQL solutions (MSSQL/PostgreSQL). PostgreSQL was chosen first because we could just install it, no license and no OS change... it was twice as fast as the optimised MySQL installation out of the box with a stock CentOS 6 kernel.
We never even bothered testing MSSQL because PostgreSQL met our performance targets, we were now IOPS bound.
--
More anecdatum:
Regarding data consistency: we tried to migrate to PostgreSQL for performance reasons in 2014 (my previous company), and failed because MySQL had been corrupting our data very slowly and silently for many years (corrupting meaning not honouring NOT NULL, not honouring type safety, allowing invalid dates, inserting data on error). So far gone, in fact, that reimporting the output of `mysqldump` would not work.
Isn't it? I thought that today()-(2018 years, 4 months and 10 days) would be approximately that date? Maybe you prefer +0000 vs just 0000?
'ISO 8601 prescribes, as a minimum, a four-digit year [YYYY] to avoid the year 2000 problem. It therefore represents years from 0000 to 9999, year 0000 being equal to 1 BC and all others AD. However, years prior to 1583 are not automatically allowed by the standard. Instead "values in the range [0000] through [1582] shall only be used by mutual agreement of the partners in information interchange."
To represent years before 0000 or after 9999, the standard also permits the expansion of the year representation but only by prior agreement between the sender and the receiver.[19] An expanded year representation [±YYYYY] must have an agreed-upon number of extra year digits beyond the four-digit minimum, and it must be prefixed with a + or − sign[20] instead of the more common AD/BC (or CE/BCE) notation; by convention 1 BC is labelled +0000, 2 BC is labeled −0001, and so on.'
I think a lot of what you have written can be solved software-side. A good database should be no excuse for bad code.
I do not think MySQL is technical debt, as for 80% of startups moving to a different solution is cheap and non-problematic. The LAMP stack is good enough and the quickest/cheapest option for the majority of tech companies.
I'm being perfectly fair in being critical of software that claims to be doing those things.
You can solve issues in your application if you know there will be issues like these; knowing the pitfalls and drawbacks of a technology is certainly noble - but if you do, then why not choose something that follows the principle of least surprise? (There might be reasons.)
I would never claim that you should move everything from MySQL if you use it. However, if you care about data consistency, ensure that you change the defaults, engage strict mode, and ensure that your application has no bugs in handling data.
This is actually hard to do correctly; it's overhead in development that you shouldn't be caring about. Just choose something that has sane error conditions and the problem vanishes.
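For reference, "changing the defaults" mostly means pinning sql_mode; a sketch of the commonly recommended settings (normally put in my.cnf so it survives restarts - check which of these your MySQL version supports):
-- STRICT_ALL_TABLES: reject invalid or missing values instead of coercing them
-- NO_ZERO_DATE / NO_ZERO_IN_DATE: refuse '0000-00-00' and zero month/day parts
-- ERROR_FOR_DIVISION_BY_ZERO, NO_ENGINE_SUBSTITUTION: fail loudly rather than guess
SET GLOBAL sql_mode = 'STRICT_ALL_TABLES,NO_ZERO_DATE,NO_ZERO_IN_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION';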
Considering many, many of the world's largest tech companies use MySQL or MySQL compatible databases, it's rather absurd to say that MySQL isn't a serious database. Regardless of whether it matches yours or someone else's personal list of capabilities.
To be perfectly fair with you, you can make bad choices and still get something useful done.
Most companies are not alive "because they chose mysql over something else" they're alive because they have "good enough" tech to get the job done. The job that they're trying to accomplish is the thing that makes them successful.
Uber isn't super huge because it used a specific database technology. It's huge because it's good at marketing, it's providing some value to people.
If it got the work done at a reasonable cost and performed reasonably well (i.e. it served the purpose it was meant to serve), how “bad” a choice could it have been?
I'm reminded of an article about zombie companies I read recently: they're companies which are inefficient/poorly-managed/poorly-executing, but due to market/regulatory inefficiencies they're not dead yet. Companies which use MySQL are in a similar situation: they're not doing as well as they could be, and all other things being equal they ought to be put out of business by their competitors — but all other things are rarely equal.
Still, if you are making choices for yourself, you don't choose mediocrity and hope to muddle through: you choose excellence. Choosing MySQL isn't choosing excellence.
“Excellence” is a poor criterion for comparative analysis because it is (a) subjective and (b) unquantifiable.
Do you have objective or quantifiable data and references upon which your opinion is based, _and_ is universally applicable to any arbitrary problem that a SQL database might be an appropriate solution for?
If it costs development time because they need to be extra careful about not sending queries that ERROR and 'wipe' data.
If it silently corrupts data over years and gets discovered much later. (As was the case with my previous company, an e-commerce retailer that lost large chunks of order history)
Are those problems still unresolved in MySQL today? How do you know that similar or worse problems did not exist in alternative solutions at the time it was implemented?
MySQL is making strides to fix these kinds of issues ever since the Oracle acquisition for sure.
> How do you know that similar or worse problems did not exist in alternative solutions at the time it was implemented?
Because I've been working on database solutions for over 10 years, there are problems in other software but I consider data loss to be worse than any of them. For example the autovacuum in postgresql 8.3 and before was mostly garbage which ended up bloating highly transactional databases. But deleting data when you fail a constraint is worse.
I have 15 years of experience and can build a decent, clean system using "boring" technologies. But all the decent paid work where I live is maintaining big balls of mud built with tech that was obviously at peak hype when it was chosen, and nothing done according to best practices, because that would require sticking with a tech and learning it properly. It's quite frustrating.
Then we have the interview process where people expect me to give up my weekend for their coding test and can't even be bothered to give you feedback afterwards. Or some ridiculous algorithmic nonsense that has no relevance to the job. Getting bored of it all.
If I'm ever the CEO of a company/startup, the one criterion I'd set is that either I decide on all the technologies we use, or there is no CTO, so I make those decisions.
And that criterion could be summed up in one sentence: use something boring. No hyped programming languages / DBs / tools allowed.
Of course some would argue you could still be doing it wrong even when using old tech / languages / tools. Well yes, but there you have a sea of resources and expertise to ask for help, instead of spending energy and time figuring it out yourself.
Of course if your company is all about tech innovation, AI or something cutting edge, then surely you will have to try something new. But 80% of those startups aren't.
Quoting Dan McKinley's "choose boring technology" [cbt]:
> Embrace Boredom.
> Let's say every company gets about three innovation tokens. You can spend these however you want, but the supply is fixed for a long while. You might get a few more after you achieve a certain level of stability and maturity, but the general tendency is to overestimate the contents of your wallet. Clearly this model is approximate, but I think it helps.
> If you choose to write your website in NodeJS, you just spent one of your innovation tokens. If you choose to use MongoDB, you just spent one of your innovation tokens. If you choose to use service discovery tech that's existed for a year or less, you just spent one of your innovation tokens. If you choose to write your own database, oh god, you're in trouble.
> Any of those choices might be sensible if you're a javascript consultancy, or a database company. But you're probably not. You're probably working for a company that is at least ostensibly rethinking global commerce or reinventing payments on the web or pursuing some other suitably epic mission. In that context, devoting any of your limited attention to innovating ssh is an excellent way to fail. Or at best, delay success.
I'm a fan of taking a 'one new technology' approach. When I'm building something new, I get to choose zero or one new technologies to play with, depending on whether I want to get shit done or learn something new.
By choosing at most one new thing, you can better control for how your stack should work and how you expect it to respond to certain unexpected circumstances, which means you should be able to more effectively solve issues as they crop up than you'd be able to if you were using multiple new technologies.
I agree with the main idea: working with hyped technologies is not a solution and you can build most of the things out there with boring technology.
But then ... you have to find, attract and hire good developers. That's already difficult, adding an extra layer of 'boring technology' will make this task even more challenging.
"In June 1970, E. F. Codd of IBM Research published a paper [1] defining the relational data model and
introducing the concept of data independence. Codd's thesis was that queries should be expressed in terms
of high-level, nonprocedural concepts that are independent of physical representation."
The key, the whole key, and nothing but the key so help me Codd.
Also said as... "In Codd we trust."
If none of these DB jokes mean anything to you, take a DB concepts class at a CS university. There's a lot of great research going back 50 years and you can learn a great deal about why things are the way they are (tuple algebra and calculus). And before changing anything for something you think may be better, you should fully understand what you are giving up.
This talks a lot about 5.5 and mentions that 5.6 is “due out soon”. The current release series is 5.7. How much of this is outdated and how much has stayed the same?
"Some statements cannot be rolled back. In general, these include data definition language (DDL) statements, such as those that create or drop databases, those that create, drop, or alter tables or stored routines." [0]
I like the part "We used it because it was there. Please hire some fucking software developers and go back to writing elevator pitches and flirting with Y Combinator."
Well, you should school those fools at Google, Facebook, Twitter, Pinterest, Amazon... and tell them how they are wasting time with their toy MySQL databases.
> those fools at Google, Facebook, Twitter, Pinterest, Amazon
... have dedicated hundreds of engineers and millions of dollars to nothing more than keeping MySQL up, running, and not crapping the bed every time someone looks at it funny. If you can afford that resource expenditure, by all means go nuts with MySQL. Most companies can't and would be far better served by something which doesn't need that amount of handholding to serve its basic purpose.
There is one thing that bugs me about all the talk of postgres' superiority: why haven't these companies switched to postgresql? Surely they weren't all too far invested in mysql before a "more knowledgeable" DBA came along saying postgresql is better.
Following on from that, I suspect a lot of large companies use MySQL because they always have, not because it's actually any good. For example, Basecamp used MySQL while I was there, but I never met a single Sysadmin there who would use it over Postgres if they were to start a new project.
PHP was built for the web and has been successful at that job. It is easy to use because the core developers have made some good design choices for the task at hand. For example no threads, stateless requests, core functionality focused on outputting HTML, etc.
MySQL and PHP are good. They do the job they were designed for in a cost-effective way, and of course that means there will be trade-offs.
PHP is crap. It's actively hard to write good code in it. Not good code like SOLID or pretty code that's self documenting, it's hard to write code that's not going to break in unique and interesting ways.
Sure, you can knock up a contact form in it really quickly, but that ease of use hides significant dangers.
This might have been true 10 years ago with versions like PHP 4. But remember many companies, including Facebook, have invested a lot into PHP. In the newest versions of PHP what you said no longer applies: there's a type system, OOP features like traits, class inheritance, etc.
> Surely they weren't all too far invested in mysql
I think by the time Postgres sorted itself out into a more user/admin-friendly system (which is still fairly recent, really), MySQL had pretty much conquered the "quick and easy" mindshare and was deeply embedded almost everywhere.
And if you've spent millions of dollars architecting your systems such that MySQL's flaws aren't killer issues, there's very little financial benefit to switching, I guess.
MySQL for years was far easier to install, configure and run than PostgreSQL, especially features like replication which was much better than other options. Big companies use these databases as more like simple key/value systems rather than complex relational schemas so strong replication and operational simplicity was favored over the rich featureset of Postgres.
Eventually Postgres caught up in most things, and the delay was in some part because of implementing those features "correctly" and with more thought, but it's still a delay that hurt the uptake in the early days.
You are right. It is a solid DB. I was more pointing out that if you have to move off Mysql, there are excellent options other than adopting a new columnar datastore.
Can you demonstrate that Postgres is significantly faster than MySQL on average? I highly doubt it. The problem is in the way the data model was architected and implemented.
PG isn't usually faster than MySQL on any naive database (which is 90% of databases in the world, I suppose).
Most E-Shops will be fine running MySQL or MariaDB.
The one thing PG excels at, however, is that you can tune it much more finely to your workload than MySQL/MariaDB. That, and the ability to extend PG arbitrarily (try adding native functions to MySQL without recompiling) via the C FFI it offers. You can write and define your own index methods that give you an index perfect for the workload, or you can add a new data type to support a new input with validation.
You can sink a lot of work into getting the most out of a PG database, MySQL not so much. But again, for most people MySQL will provide the same (or even better) performance than PG. (I still trust PG over MySQL after MySQL nulled out all entries of a table with only NOTNULL columns after a nasty crash)
Apples to oranges. You need to compare PostgreSQL with other databases that don't take shortcuts around ACID for performance (for example, DDL forcing an implicit commit in transactions).
Seriously? You don't need to do DDL operations in day-to-day use, but you'll need to when you're doing development work.
For example, say you had a "color" column on a table, and for a new feature you're now adding the ability to have multiple colors. You're going to create a new column, create a new table, populate that table, and drop the old column. If anything fails during that process you'd like to be able to roll back.
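In PostgreSQL that whole migration can sit inside one transaction, which is exactly the property being asked for; a rough sketch assuming a hypothetical products(id, color) table:
BEGIN;
-- 1. new table: one row per (product, colour)
CREATE TABLE product_colors (
    product_id INT  NOT NULL REFERENCES products(id),
    color      TEXT NOT NULL
);
-- 2. seed it from the old single-value column
INSERT INTO product_colors (product_id, color)
SELECT id, color FROM products WHERE color IS NOT NULL;
-- 3. drop the old column
ALTER TABLE products DROP COLUMN color;
COMMIT;  -- if any step above fails, ROLLBACK and the schema is untouched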
There is a concept of "forward-compatible change". Basically, you don't do things that will break your software.
Example, you don't add a NOT-NULL column unless you can give it a good DEFAULT value, to make it work.
Also don't drop columns until the software is ready for it, etc.
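A sketch of what that looks like in SQL, with a hypothetical users table and column:
-- Forward-compatible: old code that never mentions the column keeps working,
-- because inserts that omit it get the default.
ALTER TABLE users ADD COLUMN newsletter_opt_in BOOLEAN NOT NULL DEFAULT FALSE;
-- Not forward-compatible: NOT NULL with no default breaks every existing writer.
-- ALTER TABLE users ADD COLUMN newsletter_opt_in BOOLEAN NOT NULL;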
If you have a decent ORM, it will compare your "how it needs to be" sql schema with the "how it is" schema. Then it will generate appropriate "ALTER TABLE ...." "CREATE INDEX " etc statements. Note that this is automated and you never need to type SQL statements to achieve that.
Altogether, in the last XXX years, I did not really need to do a rollback on a DML statement.
To be fair, we used to die at 40. Now you can get a CS degree at 21 or 22, and you still "have no idea what you are doing". Maybe it's just that technology is really complicated, the world is really complicated, and everything is changing fast.
I don't mean it to be conformist, but it's easy to forget some things are actually hard when you are very clever, or old enough to have forgotten what it was like when you were still learning too.
But at the same time it's easier than ever to find information on nearly everything. Asking experts also is easier than ever before. I don't think it's feasible to dismiss GP's claim with yours.
Sorry if it came across like that. Not trying to dismiss his claim, just trying to give a wider perspective. Similarly, in the same way you are right saying it's easier to find information on nearly everything nowadays, it might also be relevant to remember that we still have limited time and attention spans.
I barely ever see an engineer knowing everything that happens from the top to the bottom on any platform. Very few companies are forced to get to know their stack in depth; usually they throw more money at the problem.
> I've come to the conclusion that the problem in tech is that all the people doing the work are in their early twenties and have no idea what they are doing.
Not that I disagree, but to be fair, I've seen plenty of tech ignorance with experienced and older engineers as well that has been pretty crippling.
Agree. I'm not sure per se this is an age thing: the field is just so big and apparently (but perhaps less so in reality) in a constant state of tech churn, that it is hard to anchor to consistent proven techniques and practices.
I don't quite get this. How fast was running this query:
Select loyaltyMemberID
from table
WHERE gender = x
AND (age = y OR censor = z)
Why the random complexity with individual unions and a group? Of course that's going to be dog slow.
Sure, the filters can be arbitrary but with an ORM it's really really simple to build them up from your app code. The Django ORM with Q objects is particularly great at this.
Obviously I'm armchairing hard here but it smells like over engineering from this post alone. Stuff like this is bread and butter SQL.
Edit: I've just read the query in the post again and I really can't understand why you would write it like that. Am I missing something here?
Seems like a fundamental misunderstanding of SQL rather than a particularly hard problem to solve.
Ten or fifteen years ago, sure - a DBA would look at a query plan and figure out how to do it properly. Worst case you'd slap a materialized view in and query that.
But this is 2018! Programmers don't want to treat the database as anything but one big key value store ;)
Yeah, sadly, this is not too much of an exaggeration. I've worked on teams that insisted they needed DynamoDB, because, well, Dynamo is for "Big Data", and they certainly wouldn't work somewhere that had "Small Data"! Replace the buzzwords/products as applicable; you could actually probably just scramble them and it'd work just as well, since someone out there thinks "RabbitMQ means Web Scale", etc.
SQL databases are amazing, robust examples of engineering. They are your friends and they're the appropriate choice for the vast majority of software. They are not outmoded or passe. Though I acknowledge there is a separate use case for K-V stores, I almost want to make policy preventing their use just because I know so many developers will abuse them badly and then stare back at you blankly during the semi-annual massive downtime event, muttering something like "Well, it's based on research at Google, so I'm sure there's a way to recover the data..."
I think this is a case for the return of the traditional "sysadmin", as "devops"/"SRE" is now the role of unblocking deploying a solution instead of questioning its complexity/fitness.
While I agree, I also think that overt gating and approval processes create a high tension dynamic that frequently breaks down, whether it's ops v. dev, security v. dev, or others. It's easy for people to get their pride wounded, and they end up encouraged to find workarounds to the process. The simple answers to this are pretty much imaginary, unfortunately.
Sure, of course. I was hoping to point out that there's an increasingly overlooked value in having someone question complexity. The "no, you don't need React" of the frontend dev or the "our data is actually relational" of the back end dev.
If your “SRE” team is only “unblocking deploying a solution” then I’m sad to say they are an operations team who has rebranded themselves to appear more relevant.
That's most "SRE" -- it's a title arms race in that field between the underqualified and those that wish to convey they know how to do more than write system scripts in DSLs
The column type would have to be the type of the encrypted value. The type of the unencrypted data could not be enforced by the DB and you would have to rely on code doing the correct thing.
I am however extremely wary of doing it that way. I don't know your requirements of course.
Here is the thing - encrypted stuff is just a weird encoded string. So I can’t really use columns normally.
What I really need is just a huge table with two fields: “token”, “content”
And the token is basically the primary key but encrypted with whatever encryption.
You could even do foreign keys this way.
Hmm I suddenly have an idea. What about a layer above the database that basically enforces foreign keys and joins in this way to support end to end encryption? The content would reference ENCRYPTED foreign keys. Only clients would decrypt stuff.
>What I really need is just a huge table with two fields: “token”, “content”
Sounds more like a key value store and less like a relational database. Although you can store key value data in a relational db of course, there may be a better tool for the job.
Do you have any recommendations for resources to learn best database practices? I'm currently designing my first database and I'm not sure what information is worth storing (like calculations) and how to choose which data to group in tables.
SQL Antipatterns is good as another commenter recommended. But my favourite book on the topic is Markus Winand's SQL Performance Explained. Most of it is online here: https://use-the-index-luke.com/ but I recommend buying it since it's tiny and worth its weight in gold.
It's short so you actually read it and possibly reread it. It's to the point. It has pretty pictures. And it has directly applicable advice.
I'm going to go out on a limb and suggest you start writing the application without a database first.
My early education on databases always seemed to follow a "how do we make a database do this?" rationale rather than "what data do we need to store to support these features?", which I think leads to a software design that is too strongly coupled with the database. Software modules end up dependent on database features, or table structure, and refactoring or switching data stores becomes more costly.
Instead, start with a simple in-memory data store - a list of objects with some interface for accessing them will probably be your starting point. Add some basic serialisation/deserialisation features (CSV, JSON, etc) when you get past initial testing and require some persistent data. Then, once you have your API in place and your software design is stabilising, you should be able to map that data to a database fairly easily:
* The primary structure maps to your main table
* Child structures become additional tables, with foreign keys
* Data used to lookup records can be indexed for better performance
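As a concrete (entirely made-up) illustration of that mapping, say the in-memory model is an order with line items:
-- primary structure -> main table
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    created_at  TIMESTAMP NOT NULL
);
-- child structure -> additional table with a foreign key
CREATE TABLE order_lines (
    id       INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    sku      TEXT    NOT NULL,
    quantity INTEGER NOT NULL
);
-- field used to look up records -> index
CREATE INDEX idx_orders_customer ON orders (customer_id);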
Beyond that, you should profile/benchmark your application to find what needs to be optimised, and then investigate whether your software design or your data store should be doing the optimisation.
Let your software's features influence the design of your database. Don't let the database's features influence the design of your software.
Ok, first you need to decide: is this going to be a purely transactional database (for business processes) or do you also plan to do data analysis straight inside this database (meaning you won’t be extracting, transforming and loading data into another database and analyzing it there).
If it’s transactional, I recommend keeping calculations only if you need to access summarized data frequently. For example, if you are tracking inventory by storing the history of transactions that occur into and out of inventory, it’s trivial to find out how much of each item you have in stock at any point in time by doing a sum of the change in quantities for each item type up to that point.
If you were usually interested in the "current" count, it would be expensive to perform this sum every time, so instead you could keep a separate table for calculating the running total of inventory per item and refer to that. Keep this table up to date through the use of triggers on insert events. (Note that your log of inventory transactions would thus be an immutable stream of events.)
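A rough Postgres-flavoured sketch of that pattern, with hypothetical table names (the transactions table is the immutable log, the levels table is the derived running total):
CREATE TABLE inventory_transactions (
    item_id     INT NOT NULL,
    change_qty  INT NOT NULL,   -- positive for receipts, negative for issues
    occurred_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE inventory_levels (
    item_id INT PRIMARY KEY,
    on_hand INT NOT NULL DEFAULT 0
);
CREATE FUNCTION apply_inventory_change() RETURNS trigger AS $$
BEGIN
    -- keep the running total in sync with every appended transaction
    INSERT INTO inventory_levels (item_id, on_hand)
    VALUES (NEW.item_id, NEW.change_qty)
    ON CONFLICT (item_id)
    DO UPDATE SET on_hand = inventory_levels.on_hand + NEW.change_qty;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_inventory_change
AFTER INSERT ON inventory_transactions
FOR EACH ROW EXECUTE PROCEDURE apply_inventory_change();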
An example of something not worth storing is derived data that is a combination of separate columns in a table. For most queries it’s probably trivial to be lazy and wait to perform such a calculation until you actually need it. If you still want to have a ready made “table” that has all the computations you want already entered as columns for easy querying, use a view. If you find yourself making liberal use of views, you’re on the right track.
If you are using a separate data warehouse for data analysis, then precompute and denormalize as much as you can.
*Disclaimer: most of what I’m talking about is from a Postgres perspective.
I would describe myself full stack under that definition. Most "full stack" people I know that sit in my Uni courses have mostly learned Java EE + Oracle DB or Javascript + /dev/null^w^w MongoDB. Most of them would probably not be able to construct a relational database or libc from scratch.
Granted, such knowledge isn't immediately useful, since it's not something I or anyone is likely to do, but it grants insight into systems. I know roughly how a query optimizer works and what it can, and more importantly, can't do.
When you know a system you can optimize for it. When you don't know a system you can only follow someone else's advice on how to optimize for it.
Isn't the current quest in quantum mechanics (to find a grand unified theory) to find an abstraction that does leak? At the moment it's too self contained and doesn't explain anything about the macroscopic world.
In any case, having a basic understanding of the next level up (the electron) has proved quite useful to my career, otherwise I wouldn't know how turning things off then on again affects the machines I'm working with.
Them, or the manager that hired them? If no one up the chain brings on a DBA what are they supposed to do? I hear ya. But this is as much a symptom of naive (and budget stretched) leadership as it is of the hands on deck.
1) It took a weekend to complete. The friction was building for far longer. There's a cost to that, esp if it affects customer satisfaction and retention. They didn't refactor for fun, did they :) How many dev teams aren't so lucky? Is this article a no-choice outlier, or a best practice?
2) My comment wasn't directed at the article but on another comment that blamed the developers. These problems should be owned by ownership / leadership / management more and engineers less.
3) That said, hire a DBA? I don't think that's necessary.
>Seems like a fundamental misunderstanding of SQL rather than a particularly hard problem to solve.
Without knowing the rest of their stack, or what their data ingestion looks like, I think your query is oversimplified. If they are doing a union, then it's likely they aren't querying one table, but they are querying multiple tables. The article mentions that individual customers had as many as 500 million rows. Likely each customer has their own set of data they also pipe into the system. Next their own custom query language may support more complex algebra than standard equality.
IMO, the article doesn't sufficiently describe the problem for us to understand why their solution works. To you and I there are 100 other solutions they could have tried that seem simpler than the one they presented.
It's less likely that they overengineered - we are probably just underinformed.
>It's less likely that they overengineered - we are probably just underinformed.
Based on 15 years in software companies in the valley it's much less likely that this isn't over-engineered. Nearly every decision I've seen chasing technology hype has been based on ignorance of existing solutions.
This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
Entity attribute value anti pattern - this has been well known for at least 20 years. It can be tempting when you want to design a "flexible" system but really needs to be used sparingly. I was BI team lead on a product where the architect insisted that it be used on every entity (>300) as you never knew when you might want to add some bit of data. It led to some interesting (multi-page) sqls and the project ultimately failed. This was one of the reasons. Slow performance and often runtime errors when expected data wasn't present and the application layer couldn't cope. It was a good learning experience.
https://mikesmithers.wordpress.com/2013/12/22/the-anti-patte...
We have this as a "meta" field (because MySQL is balls at adding new columns to big tables without multi-hour downtime) with some arcane nonsense format. Totally unqueryable with any efficiency.
The EAV pattern has trade-offs you need to compensate for (performance). Production systems that use EAV have flat tables and heavy caching to stay flexible with acceptable performance.
Oh gosh this pattern. The first time I encountered it was in my first job where we used Magento. Super flexible. Also super slow. Does anyone have any advice how to make a db design like this work faster? Generally I thought when data is arranged like this it might be a prime candidate for document based storage. But I'm no dba so I have no idea if that would be correct.
If you are using Postgres, the JSONB datatype will let you do exactly this while still using the full power of SQL. Simply create a column where you keep a JSON object full of random user properties, if flexibility is what you want. You can even index properties.
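A minimal sketch of that (column and property names invented for illustration):
-- Well-known attributes stay as ordinary columns; the long tail of
-- per-customer properties goes into one JSONB column.
ALTER TABLE users ADD COLUMN props JSONB NOT NULL DEFAULT '{}';
-- One GIN index covers containment queries on any property.
CREATE INDEX idx_users_props ON users USING GIN (props);
-- "female members in the gold tier":
SELECT loyalty_member_id
FROM users
WHERE props @> '{"gender": "female", "tier": "gold"}';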
Or just create ad hoc tables with user fields. Quite often it's not that a customer has n different fields for n entities, but a few that apply to the majority (like internal ERP ids, classification etc.). Put them in a few tables, index them, join them. If you don't want to parse internal DB descriptors, create a set of "schema" tables to build queries from.
The question is whether it can be stored like this while allowing for fast queries. For example, unless it changed recently, Postgres doesn't calculate statistics to help the query planner on jsonb fields.
IIRC JSONB still has problems with index statistics
So values in JSONB columns can be indexed nicely, but the statistics can be much worse than for non-JSONB columns, which can lead the query planner astray.
This just isn't that hard. They don't have that much data. It is really late for me, but, put it all in memory and figure it out. These just aren't hard problems. DBAs have been solving performance issues with a clever index on the right column for 30+ years. Sorry if this is get-off-my-lawn-ish, but I have been on too many projects where I made a DB index and solved a major bottleneck. Too many new developers are ignorant of the nuances of RDBMS tuning. I am not even a DBA.
If they are using some sort of middleware orm, which they may well be because of their model, they are most likely using an EAV[0] schema which, although flexible for writes, is horrendous for reads. The join plus pivot is a disaster on virtually any relational system.
Hmm, that does seem probable. In fact that could make the SQL even more efficient as you'd only need a combined index on the 'prop' and 'value' columns, rather than N arbitrary combinations of indexes that may or may not be used.
Edit: Had some bad attempt at writing this query but it's rather late and it made no sense.
You would need to have a new join for each new property
SELECT DISTINCT m.loyaltyMemberID
FROM members AS m
INNER JOIN properties AS p1 ON m.id = p1.user_id
INNER JOIN properties AS p2 ON m.id = p2.user_id
INNER JOIN properties AS p3 ON m.id = p3.user_id
WHERE (p1.prop = 'gender' AND p1.value = x)
  AND ((p2.prop = 'age' AND p2.value = y) OR (p3.prop = 'censor' AND p3.value = z))
I'm not sure what you're asking - could you give me an example of what you're envisioning that couldn't be satisfied with a combination of Boolean expressions in the WHERE clause ?
Especially with partial indexes, I still feel like this structure will be significantly faster than the original UNION ALL ... GROUP BY on calculated fields.
And they mention in the post that most queries don't use that many fields.
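For what it's worth, a sketch of the partial indexes meant here, against the hypothetical properties(user_id, prop, value) table from the join example above (PostgreSQL; MySQL doesn't have partial indexes):
-- One small index per hot property: each contains only the rows for that
-- property, so it stays compact and selective.
CREATE INDEX idx_props_gender ON properties (value, user_id) WHERE prop = 'gender';
CREATE INDEX idx_props_age    ON properties (value, user_id) WHERE prop = 'age';
-- Each join in the query above can then hit a small targeted index instead of
-- one huge composite index covering every property.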
Confession time: in my first job, I built something like this (and it worked pretty well in the sense that it was very flexible), but then I also had to do a 'select' based on iirc 14 of such properties. I don't really recall the exact problem I had at first, but my solution was to create two separate (temporary) tables, select 7 of the properties into one and 7 into the other, run a select on both of those tables, then join the results in code. This ran at an acceptable speed (I must have done something so that adding criteria made the run time increase non-linearly - doing it on 14 was orders of magnitude slower than on 7).
Then years later I ran into the guy who had to do some work on it after I left that company. I must have scarred him pretty badly, because he remembered it enough to bring it up as pretty much the first topic after the obligatory 'hey so what are you up to nowadays'. When I think back about it now, it was a cringey solution - then again, this was at a company where nobody had ever heard of a 'database index' (or if they did, never mentioned or implemented them).
This is a pretty popular pattern known as Entity-Attribute-Value [0]. It's used by many products where a) data model needs to be very flexible and allow new attributes without schema changes, or b) a typical entity has a large number of possible attributes that may or may not be set for all entities ("sparse" attributes). WordPress uses this to store post metadata, Magento uses this to store product attributes and most of other data, Drupal uses a variation of this to store all the posts and other content you create… I have too much experience with this model to be surprised.
hstore querying is quite slow (and GIN indexes on hstores are pretty massive). I'd always go jsonb over hstores these days, but jsonb has the same indexing problem. JSON has a well-optimized/spec compliant serializer/deserializer in every language you can imagine as a baseline, whereas hstore does not.
I once implemented a variation of this where there was a column called 'data_type', the valid values were the various SQL data types, and in code I would do a switch() on the (string) value of that column and then cast the contents of the 'value' column based on that... Ah the folly of youth...
> This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.
I think you're right. Oh ... my ... god ...
I wish I could say this is the worst example of a database schema I've ever seen, but it isn't.
Technology cycle:
X gets invented -> idiots abuse it -> X "is bad" -> Y (strictly worse than X) is "so much better" -> idiots abuse it -> Y "is bad" -> ...
EAV is a valid pattern if the keys are dynamic. For example, in a CRM, the user might want to store properties of their clients that you haven't thought of. In our platform, we use different schemas for each company, so we can actually do an ADD COLUMN ..., but you don't want to do that if you have a multi-tenant DB :)
Using the built-in type for that purpose is going to work way better. This depends on the DB you're using but is generally referred to as a "JSON" field (why? Because they're a response to MongoDB, which calls it that). Oracle and SQL Server have very similar things.
In Mysql, it is JSON data type [1], in Postgres JSON/JSONB [2].
Creating indexes across them is doable, through a workaround (involving what is generally referred to as "VIEWS", but can be called calculated columns or something like that).
And, frankly, in the worst case for indexing, these databases still perform comparably to key-value stores in speed (especially SQLite).
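The MySQL workaround being alluded to is usually a generated column rather than a view; a sketch assuming a hypothetical attrs JSON column (MySQL 5.7+):
-- Extract one JSON property into a virtual generated column...
ALTER TABLE users
    ADD COLUMN gender VARCHAR(16)
    GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(attrs, '$.gender'))) VIRTUAL;
-- ...and index it; the optimizer can then use the index for filters
-- on that JSON property.
CREATE INDEX idx_users_gender ON users (gender);
SELECT id FROM users WHERE gender = 'female';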
They may generally be a better option, but they have their own disadvantages. For example, JSONB fields in Postgres won't deduplicate keys, so if you have large keys, your table size will increase quite a bit (which also makes it harder to keep in memory).
Using an EAV, you can have a "keys (id, key_name)" table, and then only use the IDs in the values table, reducing that waste.
By the way, you don't need views for indexing on JSONB fields, it's supported out of the box in PG.
At least in MySQL, json field types are rather new. MySQL 5.7 is not yet an option with AWS Aurora or Google cloud SQL even.
And I don't think you will necessarily get better performance with JSON fields vs an EAV model. Yes, you can index JSON fields by creating virtual (generated) columns, but that requires that you know the field ahead of time. With an EAV model, you can have your values table indexed and then join.
But I am excited to start using the json field types. In many cases, it will really simplify things over the traditional EAV stuff.
The UI they showed in the blog post looks like it has enough data available to generate that kind of query, too. Like, the ands/ors/nots are right there on the page, the filters are already there too getting translated to SQL as well, just mash them together and you get the same "algebra of sets" stuff right in the WHERE clause.
As it stands the SQL query is quite silly. It gets a list of every user ID that is included by each filter and compares which ones are in the filters you want and not the filters you don't want. Much better is to pass the filters into SQL, let it figure out which users match the filters you want and not the filters you don't, and just use that result.
Most Enterprise CRM-like solutions store tables of customer-property-value instead of using one column per property.
This leads to lots of unions in advanced queries, and makes filtering harder. Some databases even calculate column block statistics to optimize these queries by doing less IO, even for what look like table scans.
Why not one table with all customers and one column per property?
There are a few reasons, having to do with anything from MySQL sucking at schema alters for really big tables, to expectations of Enterprise customers.
I think that JSONB may not be as performant as EAV. You don't need joins or unions, but if you are dealing with dynamic fields, you need to know the fields ahead of time and set indexes for them in JSONB. For EAV you just have to index your values table.
PostgreSQL can use multiple indexes so you don't need to worry about needing to know about the fields ahead of time.
Likewise you can get away with a full document GIN index.
I played around with some basic report stuff at work last year, the EAV data on my local machine, the report took ~7 seconds to run. I shoved the same data into PostgreSQL as JSONB, indexed it just as full doc cos I was lazy, the same report took ~80ms.
Obviously this isn't 'proof' my dataset was only 1.5m by 15m records. But with my limited knowledge i do believe it would perform better, I don't know how much better... but I think better...
I’m not the author of the post. Your comment assumes a well known schema. My understanding from the post is that this solution can join and filter on “custom” datasets of arbitrary schema that each of their customers upload.
I've never played with this, but couldn't you create a table based on the dataset that the customers upload, and let your database engine handle filtering those queries? From the looks of it, even if they were doing full table scans for each query, it'd still be faster than all those unions...
I think the point is they don't know in advance what the query is, and they didn't think they had a good solution for optimizing all user-entered variants across the range of possible groupings, so they wanted a solution that was easier to optimize globally.
The general form of this is:
Select loyaltyMemberID
from table
WHERE (V1_1 = x_1 OR ... OR V1_n = x_n)
AND (V2_1 = x_2_1 OR V2_2=x_2_2 OR ... V2_n=x_2_n)
AND ...
AND (Vn_1 = x_n_1 OR ... OR Vn_n= x_n_n)
(some of these n's should actually be m_i's but I was lazy)
There may be some ability to optimize this in a number of ways but optimizing one example is not optimizing the general form. I can easily see how technology change could be a cleaner solution.
> There may be some ability to optimize this in a number of ways but optimizing one example is not optimizing the general form.
I totally get that, but isn't that the point of the query optimizer within the database itself? Why are you trying to outwit it? It should select the right indexes, provided the columns are indexed, and "do the right thing(tm)". It might take a bit of cajoling but they seem pretty good at this. Postgres collects statistics about the distribution of values themselves within the table to guide its choice of index, so in theory it could rewrite the boolean logic to use a specific index if it's sure that it will eliminate a higher % of the rows than another plan.
In any case, it seems the SQL they posted is a bit off. Why nest each individual filter as a UNION? If you wanted to go down the UNION route couldn't you do each individual group as a UNION, with standard WHERE filters?
I blame ORMs: if you don't understand SQL and how databases work, you should not be allowed to use an ORM. If you do know how databases work, in many cases you will not use an ORM except for the most simple CRUD operations.
I use both. I know the Django ORM and I know where its limits are. For getting data in it saves a load of time. When the query gets complex it starts adding time, or making things impossible (multiple join conditions weren't possible unless it's been updated in the most recent version).
OK, the part about not using ORM if you know SQL is a bit of an exaggeration. At least when you know SQL you know when to use an ORM and when to not use it. If all you know is ORM then you will always use it, and ORM seems to lead to many developers not learning SQL
I think it depends... Who are we talking about here? Juniors, even intermediates, in my experience, haven't had enough time on the job to have learned enough to be writing raw SQL statements or query objects unless they're actively punching up on a daily basis. I am unfortunately talking from experience here.
What I am saying is, I really do not want a situation on my hands where the juniors that I work with, or most of the intermediates, and even a few of the seniors and leads, are writing raw SQL or query objects. Most of these folks have n years of experience in web and desktop application development and couldn't give you a passable answer to simple questions like, "What's a database index?" I know this isn't isolated to my current employer, or former employers, and I've seen it in other organizations where I've done some consulting on the side, and all of these folks I'm talking about here have largely worked else where in the past, too. And this in itself leads to other third-order effects, like the "SQL wizards" who get asked all of the "tough" SQL / database questions.
I want to stress that I understand the point that you're making, and I do agree with it, and of course, so do many (all?) ORM authors themselves, but I think the advice is wrong and is prone to take you to a much worse situation. I think we have an obligation as people who do grok SQL and databases to gently introduce our less experienced co-workers to the idea that ORMs are not a panacea to all database interactions, but until the companies we work have enough of an incentive to give us that sort of time and empowerment then I, for one, am going to recommend ORMs for everyone for everything unless they really, absolutely, demonstrably know what they're doing.
I do also understand your points and think we agree on most. I think that if a "developer" can't write SQL, I would not trust them to set up the ORM correctly either. For basic usage, sure, they will get it to work and all is good. But when you want to join tables or run aggregate functions, the same people who write bad SQL could also write bad ORM code with N+1 queries. ORM has its place and optimized, beautiful SQL has its place; a craftsman knows which tool to use where, and when to ask for help.
One of the problems, in my opinion, is that SQL isn't "cool" or hip and by many seen as not important to learn. While the new fancy JavaScript-based language or framework which nobody uses and that will be replaced next week is much more important to learn.
Oh, they absolutely do, and when we're lucky they actually catch them on their own before they get to code review. Some folks reach for tools like Bullet [0] and, that's great, but unfortunately, sometimes they treat that tooling like the Holy Gospel. They develop an over-reliance on them as if those tools exist to offload critical thinking. Drives me crazy... in my experience, it's been hard to combat this type of thing, too. The pace of "agile," the calculus between paying down technical debt and mentoring and progress, I don't really know why but I haven't had a lot of long-term luck.
> One of the problems, in my opinion, is that SQL isn't "cool" or hip and by many seen as not important to learn.
I think you're really right about that. I happen to like writing SQL quite a bit and I take a little bit of pride in that I kind of sort of actually understand a little about what is going on in the database and even then I neglect that skill. I picked up copies of both "SQL Anti-Patterns" and "SQL Performance Explained" based on recommendations from this thread and am eager to get in to them this weekend. Still lots to learn... And, I have some SQL problems that I can see coming up over the horizon today and I hope this gives me the edge I need to start grappling with them sooner rather than later.
> Edit: I've just read the query in the post again and I really can't understand why you would write it like that. Am I missing something here?
Oh, I've seen this happen a lot. Somewhere along the line, often from a DBA, it is decided that SQL in an app is evil and that everything must be in a stored proc. Then, instead of some simple string concatenation, you have to jump through hoops like this.
One of the things I've done to harden an app is to revoke all permissions other than EXEC on a particular schema, then make sure everything is done via parameterised stored procedures - no chance of SQL injection then.
But that creates situations like this where you have to jump through hoops to solve simple problems. You solved one potential issue at the cost of creating many more.
> no chance of SQL injection then.
You know you can have SQL injection attacks inside stored procedures? If you think stored procedures are a panacea then you don't understand the problem you're solving.
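The protection comes from parameterization, not from where the SQL lives. Here's a small, hypothetical Go sketch of the distinction (table and column names are invented); a stored procedure that concatenates its arguments into dynamic SQL has exactly the same problem as the first function.

    // Sketch only: safety comes from sending values separately from SQL text.
    package example

    import (
        "database/sql"
        "fmt"
    )

    // injectable builds SQL by string concatenation: a name like
    // "x'; DROP TABLE members; --" breaks out of the string literal.
    func injectable(db *sql.DB, name string) (*sql.Rows, error) {
        q := fmt.Sprintf("SELECT id FROM members WHERE name = '%s'", name)
        return db.Query(q)
    }

    // parameterized sends the value as a bind parameter, so user input is
    // never parsed as SQL. The same rule applies to dynamic SQL built
    // *inside* a stored procedure: concatenation there is just as injectable.
    func parameterized(db *sql.DB, name string) (*sql.Rows, error) {
        return db.Query("SELECT id FROM members WHERE name = ?", name)
    }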
I am building a rule engine quite similar to this. An AST parser runs over all the Python DSLs and generates the list of tables to INNER JOIN, then all the tables' data is SELECTed out with the filters applied in one pass, and finally all the results are run through the Python code.
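For the "one pass" part, a rough sketch of the idea in Go (the actual engine described above is Python-based, and the table/column names here are invented): collect the tables the rules reference, join them once on a shared key, and AND all the predicates into a single WHERE clause.

    // Rough sketch: compile a list of filter rules into one SELECT.
    package example

    import (
        "fmt"
        "strings"
    )

    // Rule is a filter extracted from the DSL: which table it touches and
    // the predicate it contributes.
    type Rule struct {
        Table     string
        Predicate string        // e.g. "transactions.amount > ?"
        Args      []interface{} // bind parameters for the predicate
    }

    // BuildQuery emits one SELECT that INNER JOINs every referenced table on
    // a shared member_id and ANDs all predicates, so the data comes out in a
    // single pass.
    func BuildQuery(rules []Rule) (string, []interface{}) {
        seen := map[string]bool{"members": true}
        joins := []string{"FROM members m"}
        var preds []string
        var args []interface{}

        for _, r := range rules {
            if !seen[r.Table] {
                seen[r.Table] = true
                joins = append(joins, fmt.Sprintf(
                    "INNER JOIN %s ON %s.member_id = m.id", r.Table, r.Table))
            }
            preds = append(preds, r.Predicate)
            args = append(args, r.Args...)
        }

        q := "SELECT m.id " + strings.Join(joins, " ")
        if len(preds) > 0 {
            q += " WHERE " + strings.Join(preds, " AND ")
        }
        return q, args
    }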
Age/gender are probably simple, but I'm guessing censor is probably derived from a transaction table. If they are letting users select arbitrary time ranges to filter the transactions, then you can't store a precomputed censor value for each user. But seeing that they are talking about caching, maybe a lot of stuff can be precomputed.
Damn, you're not kidding. I wonder why they needed more than one query here; plus, UNION is slowwwwwwww. They never mention how frequently this query needs to run either, only the amounts of data involved in some aspects of this table.
I was going to ask how to optimize the SQL in your post as it seems like the obvious/naive implementation of the query. If you're missing something, so am I. I can only imagine it was built up over time from googling specific terms that already missed the point, e.g. "rds mySQL union query"
Gotta agree with others and say that they’re clearly skimming over the facts that:
- they didn’t have the expertise to actually fix the SQL. That query smells bad. The data model smells bad. For some reason HN is always superstitiously afraid of letting developers touch the database, but if you don’t let devs touch the database enough you end up with this sort of thing; or that crap data model with properties in rows instead of columns, because oh god, we can’t let devs actually do DDL so we’d better make it all really flexible (and incredibly slow because it’s a misuse of the database). I mean, implementing your own result caching mechanism? I don’t know about MySQL but surely it has its own caching mechanism (Oracle does) that isn’t being used because the query is bad.
- project management probably had no interest in fixing the performance/incorrect data problems, and devs were expected to do it in their own time.
In a way though this makes me feel better, other people are dealing with these problems too and their overengineered solutions work and keep the company running, I guess mine will too :)
I'm all for companies releasing technical blog posts, but there's some really strange framing here.
This is actually a story about how decisions get made, and how better ones can be made. Reading a company's mea culpa tells you they are well-informed and well-intentioned.
This is not a story in which good decisions were made and then "the (tiny, startup) database company went bust" - but that's their framing. Yikes.
What else it shows is how expensive AWS hardware is versus hosting your own. I guess you have to consider how often you have to scale, but Hetzner offers dedicated servers with 64 GB of RAM and NVMe drives starting from 54 euros per month - https://www.hetzner.com/dedicated-rootserver?country=us - compare that to the $580 per month these guys were paying for an i3.2xlarge instance.
I’ve heard mention of Hetzner no less than a dozen times in the last couple days. What’s their deal? I’m not quite sure I grok this server auction thing they do, or how they’re so cheap.
Hetzner got in the news recently because they now offer a "cloud" product for VPS. Not in the AWS sense, where you can shut down instances and pay less, but in the sense that you can buy, provision, and delete VPSes via an API and pay per hour. They are also dirt cheap and offer 20 TB of egress traffic with even their cheapest VPS.
How do they do it? I don't know. They use Xeon processors and not i7s like some others do.
If all you need is virtual machines or dedicated machines to host your databases, services, etc., then AWS is very expensive. You could literally buy from 3-4 different vendors to maintain availability in a disaster scenario and still be cheaper. Hetzner is one of the most affordable providers, and both their dedicated and cloud offerings are fast enough.
I am not talking about the auction thing here, just their regular dedicated server offerings. They are bare metal and it is up to you to set them up, but the savings are in the 5x-10x range as long as you avoid wasting money on underutilized hardware (i.e. cases where you have to scale from 1 to 100 instances daily might not be financially advantageous).
Well... it depends on what you are doing. If you are indeed using the majority of AWS services, then yeah, getting off them will be hard to impossible.
But if all you use is RDS, EC2 and S3, it is quite easy to move to dedicated hardware once your service is up and running on AWS. Unless your load jumps around 10x all the time, of course (then you will need dynamic scaling, which is not really possible with dedicated hardware).
You can see there 64 GB boxes with NVMe drives starting around $160 USD per month. Not as much savings as with Hetzner, but it still beats the 1:1 price of AWS.
Unless you can directly see how a query can be optimized, the first thing you do is get the execution plan (e.g. EXPLAIN <query>).
The execution plan will tell you how expensive each bit of your query is and help you adjust it.
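If you want that from application code rather than a console, a tiny Go helper like the hypothetical one below is enough to dump the plan; the exact columns depend on the database and version, but for MySQL you're looking out for things like "type: ALL" (full table scan), a missing "key", huge "rows" estimates, or "Using temporary; Using filesort".

    // Minimal sketch of pulling the plan from application code (MySQL-style
    // EXPLAIN shown; output columns vary by database and version).
    package example

    import (
        "database/sql"
        "fmt"
    )

    func explain(db *sql.DB, query string, args ...interface{}) error {
        rows, err := db.Query("EXPLAIN "+query, args...)
        if err != nil {
            return err
        }
        defer rows.Close()

        cols, err := rows.Columns()
        if err != nil {
            return err
        }
        vals := make([]sql.RawBytes, len(cols))
        ptrs := make([]interface{}, len(cols))
        for i := range vals {
            ptrs[i] = &vals[i]
        }
        for rows.Next() {
            if err := rows.Scan(ptrs...); err != nil {
                return err
            }
            for i, c := range cols {
                // Print each plan column as name=value for quick inspection.
                fmt.Printf("%s=%s ", c, vals[i])
            }
            fmt.Println()
        }
        return rows.Err()
    }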
From there, if things are not getting better, you have a lot of alternatives:
- Consider creating an index
- If the value doesn't change often, consider writing it into another table or caching it.
- Replication, partitioning, sharding, changing the schema.
- Reconsider the requirement being implemented in order to have a more scoped query or to perform the query less often.
Then... OLAP is not OLTP. If you can, do reporting in another database.
Finally, creating your own project in the end may not save you $50,000. How about maintenance? tooling built around it? integration costs? documentation? usability? new hires having to learn about it? You can hire people that already know SQL without having to incur that cost yourself. All the tooling is built, battle-tested and readily available. Plus, skills related to internal tools are harder to trade in the market because they're harder to verify and less transferable.
Elasticsearch works for this use-case quite well, you'd store a fairly straightforward representation of the MySQL row as a document, query by the fields you're interested in and ask for aggregations on the matching documents. Common bitsets get cached automatically.
This is exactly how we implemented the rules engine in Kevy. We construct an Elasticsearch query based on the rules selected in the UI, then use the scroll API to retrieve the matching documents.
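For anyone who hasn't used Elasticsearch this way, a hedged sketch of the shape of such a request in Go: a bool filter over member documents plus an aggregation, sent with net/http. The index and field names are invented, and real code would more likely use an official client library than raw HTTP.

    // Sketch: filter members and aggregate counts by gender in one request.
    package example

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func countByGender(esURL string, minSpend float64) (*http.Response, error) {
        body := map[string]interface{}{
            "size": 0, // aggregations only; no document hits needed
            "query": map[string]interface{}{
                "bool": map[string]interface{}{
                    "filter": []interface{}{
                        map[string]interface{}{"range": map[string]interface{}{
                            "lifetime_spend": map[string]interface{}{"gte": minSpend},
                        }},
                    },
                },
            },
            "aggs": map[string]interface{}{
                "by_gender": map[string]interface{}{
                    "terms": map[string]interface{}{"field": "gender"},
                },
            },
        }
        buf, err := json.Marshal(body)
        if err != nil {
            return nil, err
        }
        url := fmt.Sprintf("%s/members/_search", esURL)
        return http.Post(url, "application/json", bytes.NewReader(buf))
    }

Filters like this are cacheable bitsets on the Elasticsearch side, which is where a lot of the speed for repeated segmentation queries comes from.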
This is cool but it seems strange (to me) that this was a “Hackathon” project as opposed to just a stand-alone problem to be addressed as a normal course of doing business. It doesn’t make the solution less cool. It just seems like a strange distinction on what a Hackathon is.
A hackathon is something that used to be a cool party for geeks (i.e., a Mathletics competition or an ACM programming contest) until the corporate overlords bastardized it and converted a good thing into unpaid overtime with free beer.
Well, what would be the argument against that - if in fact it does deliver code that solves a problem in a short period of time? Why can't you just do that all the time?
Having a fairly small, well defined problem, working in a small self-selected team, with no outside interference and typically no clients outside the team. No managers, no PMs, no bug reports. This is not how day-to-day projects get done.
Regardless of all of that, in my experience a typical impressive hackathon project is still just a barely working demo that benefitted from a significant amount of research and planning beforehand, and will require an even greater amount of hardening and polish afterwards.
There is no magic, it's just a vastly different kind of work environment with both inputs and outputs incomparable to day-to-day work.
The place where I'm working has done a corporate "hackathon". It took place on a normal day and I don't think there was any overtime; we got free food, but all the projects that delivered something relied on existing research and needed a good deal of polish afterward.
Same reason you can't drug up or taze the special goose to produce more golden eggs.
Long term high intensity output will lead to burnout, even if the salary is 10x people would struggle and crash. Pushing at 100% full enthusiasm is like sprinting, it is not possible to maintain that intensity for very long. It can be fun, it can be productive, but the wiser approach has the long-term and end in mind.
Sounds like a good argument for always introducing an artificial fatal flaw into your project before presenting it... so proj management can’t see it and yell “my god, it works! Let’s put this cobbled together, coffee-fuelled mess right onto prod!”
Depending on the type of company, it could have been blocked by the manager or PM. It is easy for them to reject "tech stuff", when your time could be spent adding a feature from the product owner.
A proposal like this could easily have been seen as the developers wanting to test out a technology that was not approved or with a good business case. That business case is usually something that only sales/product can sell. The barrier to listening to developers is higher because they are assumed to not know enough about business.
You may have an "agile" environment, but you often need a very good reason to not pick the next item from the backlog, which was not created and maybe not even prioritized by you.
In those companies, the hackathon may be the only time developers can present their ideas.
My guess would be a dev who knew that something was wrong with the performance of their old system, but for whatever reason it wasn't something that anyone higher up prioritized. If that's the case, I can see why one might come up with the idea of trying to solve it outside the bounds of the normal production/SQL environment. At least in my company, the only time we devs ever get to specifically look at the performance of our website is when we have "hackathon" days where we choose our own projects. I oftentimes feel like my regular time would be much better spent trying to optimize our 2-second+ initial page load times instead of all the other small tasks/tweaks/bugfixes that get sent my way. But performance is something very few people higher up seem to care about. Or maybe it's a case of users and managers becoming so accustomed to something being slow that they don't notice anymore.
Cool post - Definitely an improvement and a good fit for Go services. I'm curious - did you run performance comparisons on optimizing the SQL itself as compared to adding this additional service?
Maybe I'm crazy, but just looking at that query it seems like there's definitely room for improvement with the SQL alone. Unless the "..." is hiding something I'm missing?
It's not that a Go microservice solved their problem; it's the different algorithm they use for querying. That has nothing to do with Go or microservices.
This is a great example of how to do things wrong. I'm surprised they couldn't find any other columnstore database to take the place of InfiniDB in 2018.
The numbers they quote (5M members, 100M transactions) are tiny for any modern data warehouse. Many solutions would run these in sub-second speeds without changing the SQL at all, and it would be far better than building a quasi-SQL engine in Go.
Actually, for the occasional querying + caching that they have, something like BigQuery or Snowflake would be even cheaper, with basically zero operational effort.
It seems like it from his explanation. He says he'll explain why he thinks only Go could have done it, but then nothing specific to Go really materializes. This is how it seems to usually go with posts like these.
It's like those infamous enterprise benchmarks from yesteryear.
"NYSE moves from Solaris to RHEL and gains a 800% performance benefit".
While I don't doubt a brand-new RHEL has more performance optimizations than what is effectively SunOS 5.2, the people doing the benchmarking should have also said that the original hardware was the equivalent of a PIII and that they were moving to the latest Xeons.
I'm not kidding, I've actually seen a press release like this.
Okay, before I've even read the article, I'm going to guess that they were doing something egregiously expensive in the cloud, and the microservice helped them do it more efficiently - but still much more expensively than doing it in a simple, old-fashioned way.
Now I'll read the article ...
EDIT: I would say I'm no more than 30% right. They were doing heavyweight data crunching in the cloud, and so paying more for it than if they were doing it on rented hardware. But that's a constant-factor thing; it's not like they were downloading gigabytes of CSVs from S3 on every request or some such. Their query looks suspect to me: couldn't it be written to do one big scan, rather than unioning a load of things? Or is this the right way to write queries on column stores? Still, there is no glaringly obvious (to me) old-school fix for this.
Caches could be stored in the database in a materialized view, in an external service like memcache or redis, or even in the application itself.
Expiry can take a few different forms. Some caches have a defined space and use a replacement scheme like "fill the cache up, then remove the least recently accessed value". Some don't have defined sizes but instead remove entries based on timestamps (cache for n minutes). Some depend on invalidation messages from the application. It all depends on the applications needs.
The most important thing to remember is that caching means your system becomes inherently a distributed one. State can become split across multiple sources, the cache can return stale data, invalidation might not happen when you expect, ...
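As a toy illustration of the timestamp-based flavour (the names here are invented, and production systems more often reach for memcached or Redis as mentioned above), an in-process TTL cache in Go is only a map, a mutex and a clock - which also makes the staleness caveat easy to see:

    // Toy in-process cache with timestamp-based expiry.
    package example

    import (
        "sync"
        "time"
    )

    type entry struct {
        value   interface{}
        expires time.Time
    }

    type TTLCache struct {
        mu   sync.Mutex
        ttl  time.Duration
        data map[string]entry
    }

    func NewTTLCache(ttl time.Duration) *TTLCache {
        return &TTLCache{ttl: ttl, data: make(map[string]entry)}
    }

    func (c *TTLCache) Set(key string, v interface{}) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.data[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
    }

    // Get can hand back data that is up to one TTL out of date -- the
    // "cache can return stale data" point from the comment above.
    func (c *TTLCache) Get(key string) (interface{}, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        e, ok := c.data[key]
        if !ok || time.Now().After(e.expires) {
            delete(c.data, key)
            return nil, false
        }
        return e.value, true
    }

    // Invalidate is the hook for application-driven invalidation messages.
    func (c *TTLCache) Invalidate(key string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.data, key)
    }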
Blogpost author here. Thank you so much for all the attention, comments, upvotes, likes, retweets, etc!
I've done a pass over the comments and can't really answer them all but I'd like to clarify a few things:
There seems to be a general opinion trend that the queries generated by the group builder algorithm are very inefficient, that it'd be easy to come up with a solution with much better response times, and that that would be achievable in any reasonable programming language in roughly the same time with similar results.
The language argument will always be controversial and I won't address it here; we have a point of view that is expressed in the Conclusion and on this blogpost: https://movio.co/en/blog/migrate-Scala-to-Go/
I can imagine that seeing a query with JOINs, subqueries, GROUP BYs and UNIONs can raise some eyebrows, but there is some lacking context in that story, and that's on me. Here's some of that context:
* The schema that the group builder algorithm operates on is not uniform in nature or composed of simple yes/no fields; it's an incredibly complex legacy schema that to a large degree wasn't even up to Movio: it's been up to the film industry as a whole, and it has evolved over the years, as is the case everywhere. Note that every different kind of filter translates to a very different kind of query, and we have more than 120 different filters, sometimes with dynamic parameters, and sometimes even bespoke for a particular customer!
* The group builder algorithm predates the team that built this service (myself included), as well as predating the first commercial release of Elasticsearch, MariaDB, mainstream Go success, etc. Nevertheless, it's still very fast and is being used today by ~88% of our customers (i.e. all the non-behemoths). It's been successful for many years, and continues to be, for the most part.
* But I don't like it because it's fast: I like it because it's simple and flexible. It allows our customers to build a really complex (and arbitrary) tree of filters to segment their loyalty member base, and it compiles all of that into one big SQL query, that in most cases is quite performant. That's pretty awesome. But yes; it doesn't scale to several million members.
* Migrating the very engine of the main product of a company is not a decision that is taken lightly. As is the case with every big company I can remember (e.g. Twitter, SoundCloud), behind a big success story there's always a legacy monolith, and our case is no exception. From that standpoint, achieving such a breakthrough (i.e. cost reduction + significant response time improvement) within one hackathon day is really not all that common in my experience. Definitely something worth sharing, IMO.
Hopefully that clarifies some of the questions :) Cheers.
We had the same issue where I work and we are doing a very similar thing but on a way larger scale (adtech) for audience building and we actually resorted to compressed bitmaps since postgres was not cutting it.
It's fairly easy to just come on a forum and say hey: just use postgres/mysql/sql server without reading the full article and understanding what you guys are dealing with.
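For readers unfamiliar with the bitmap approach mentioned above, here's a deliberately plain (uncompressed) bitset sketch in Go just to show the idea; at adtech scale you would use a compressed format such as Roaring rather than this, and the helper names are invented.

    // Plain bitset sketch: one bit per member ID for a given audience.
    package example

    import "math/bits"

    // Bitset maps member IDs (0..n-1) to one bit each.
    type Bitset []uint64

    func NewBitset(n int) Bitset { return make(Bitset, (n+63)/64) }

    func (b Bitset) Set(id int) { b[id/64] |= 1 << (uint(id) % 64) }

    // And intersects two equally-sized audiences ("bought a ticket last
    // month" AND "age 18-25") 64 members at a time.
    func And(x, y Bitset) Bitset {
        out := make(Bitset, len(x))
        for i := range x {
            out[i] = x[i] & y[i]
        }
        return out
    }

    // Count returns the audience size via popcount.
    func Count(b Bitset) int {
        n := 0
        for _, w := range b {
            n += bits.OnesCount64(w)
        }
        return n
    }

Segment combinations become bitwise ANDs/ORs over these, which is why the approach scales so well once the bitmaps are built.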
I once built a micro service in Go that saved about as much. Took me a couple of days (meetings and all that) from start to deployed. It still runs on the same cheap aws instance. Go is really good at that sort of thing.
Right. The blog post is a good narrative over time but isn't super clear about how they solved the problem. If I understand correctly they broke down the big MySQL query into separate queries that the Go service processes/caches?
Blogpost author here. That is correct. Sorry if the explanation isn't super clear; for this particular question, you can consider the two diagrams as a before and after. They pretty much convey what you have explained here.
MySQL is very slow when it comes to joins, groups, etc. You are always better off with simple SELECTs. If you are using, for example, PHP, the only viable solution is to have the DB crunch it. But when using Go, Node.js, et al., you can pull the data out as a stream/array and apply filter/map/reduce, and the logic would probably be easier to manage than generating a complex SQL query. It would also allow you to stream the result to the client, instead of having the user wait for it all before they see anything. A lot of money could probably also be saved by having the data on the client side, for example in a web DB, and only using the servers for backups and syncing the data between clients.
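A hedged sketch of that "simple SELECT, then filter and stream in code" approach in Go, using database/sql and a streaming JSON encoder; the table, columns and filter predicate are invented, and this is just the shape of the idea rather than a drop-in solution.

    // Sketch: one simple SELECT, filter in Go, stream rows to the client.
    package example

    import (
        "database/sql"
        "encoding/json"
        "net/http"
    )

    type Member struct {
        ID     int64  `json:"id"`
        Age    int    `json:"age"`
        Gender string `json:"gender"`
    }

    // streamMembers applies the filter in application code and writes each
    // matching row as soon as it is scanned, so the user starts seeing
    // results before the full scan finishes.
    func streamMembers(w http.ResponseWriter, db *sql.DB, keep func(Member) bool) error {
        rows, err := db.Query(`SELECT id, age, gender FROM members`)
        if err != nil {
            return err
        }
        defer rows.Close()

        enc := json.NewEncoder(w)
        for rows.Next() {
            var m Member
            if err := rows.Scan(&m.ID, &m.Age, &m.Gender); err != nil {
                return err
            }
            if !keep(m) { // filter in Go instead of in SQL
                continue
            }
            if err := enc.Encode(m); err != nil {
                return err
            }
        }
        return rows.Err()
    }

The obvious trade-off is that you ship the whole table over the wire to the app, so this only pays off when the filtering logic is genuinely hard to express or optimize in SQL.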
I have a 10+ year old project on MySQL doing 6+ way joins, groups, etc. against 100-million-row tables in sub-second time. It all depends on the indexing, disk layout, size of intermediate products, etc. We don't know why they did "UNION ALL" with huge intermediate products (it seems like the query could be a single index/table scan on its face), but that is likely the slowdown; "EXPLAIN SELECT" would tell us.
Stuff like the group builder in the article is hard to reason about as there are so many combinations. MySQL is bad at optimizing O(n^2) work; no matter how you design the schema it will be slow. The solution, as they probably did in the article, is to break it down into separate, simpler O(n) queries.
So, instead of fixing your messed up data model, you wrote a service (sorry, microservice) which tries to keep your main DB in sync with a columnar cache, so you can keep using that awful query with multiple UNIONs. Throw in some "Go" and boy, you have your Medium post going!