A terrible schema from a clueless programmer (rachelbythebay.com)
842 points by zdw on Nov 7, 2021 | 450 comments



Related to database indexes, but not the post: a busted database index brought down ticket sales for the 2008 Olympic Games.

This was the first time regular people could go buy tickets for events & they had been lining up overnight at Bank of China locations throughout the country. We were down for over a day before we called it off. Apparently this led to minor upheaval at several locations in Beijing & riot police were called in.

We were pretty puzzled as we had an index and had load tested extensively. We had Oracle support working directly with us & couldn't figure out why queries had started to become table scans.

The culprit? A point upgrade to DBD::Oracle (something like X.Y.3 to X.Y.4) introduced a subtle bug in character set handling. So the index required using a particular Unicode character set, and we were specifying it, but when it was translated into the actual query, it wasn't exactly the right one, so the DB assumed it couldn't use the index. Then, when all the banks opened & a large portion of a very populous country tried to buy tickets at the same time, things just melted.

Not a fun day.


Can't imagine the pain to even discover this problem. It normally takes me a long time to be like, "maybe I didn't make a mistake, and I have a third party bug?". Third party bugs are always the most annoying to me, usually the longest to diagnose and then ultimately I have to just tell my boss I can't do anything to fix it, but hopefully I can find a way to avoid it.


Imagine having to diagnose a bug in the processor!

http://gallium.inria.fr/blog/intel-skylake-bug/

"More experienced programmers know very well that the bug is generally in their code: occasionally in third-party libraries; very rarely in system libraries; exceedingly rarely in the compiler; and never in the processor."


Fork and fix? I’ve personally been pleasantly surprised by how fast maintainers will merge a PR I submit.

If it causes an issue with our own systems it also gives me a very good justification for doing it.


In Oracle's proprietary system?


Oracle doesn't appear to maintain the DBD::Oracle Perl client module.


Sometimes I've had a pr merged. Sometimes I've had stubborn maintainers that think that their project should not follow common domain standards for /shrug reasons.

And then many other times I honestly don't have the knowledge to be able to contribute to large third-party systems, and it seems very rare to get an issue resolved without submitting a PR.

And lastly sometimes it's hard to judge if unexpected behavior is really a 'bug'. Similar to the last point, it can honestly be hard to tell the difference sometimes.


At my previous job, a particular field in a database was defined as an ASCII string rather than Unicode. If you then queried with a string in Unicode format, the database decided that the comparison should be done in Unicode. The only way to do that was to table scan and convert all the ASCII values to Unicode on the fly. It was found only in production.

Given that you mention China in your story, did GB 18030 have anything to do with your problems?

https://en.wikipedia.org/wiki/GB_18030


That sounds like newbie developers wrote the code and no DBAs were involved in checking it; it's not a database design problem, and you are throwing blame without knowing enough about the matter. If your string fits in a varchar, it makes absolutely no sense to change it to nvarchar just because your query has Unicode as input.


I reread the parent quote several times and cannot detect a hint of throwing blame at all.


Most, but not all, of the team were young, and they designed the DB schema themselves. I did not intend to throw blame. It is more the kind of situation where a decision that seems sensible at first glance creates a footgun down the road.


What? The database made the call to compare the strings as Unicode. Are you accusing the DB developers of not doing their job?


Yes, databases don't decide, developers do.


The moral of this story, to me, is always lock in all dependencies to the exact specific versions tested for production release.


I love DBD::Oracle so much I've always arranged to have other people test oracle related DBIx::Class features for me.

Writing the pre-ANSI-JOIN-support SQL dialect code was pretty good fun though.


The one that always got us is much more mundane: deleting a row requires an index in every table with a FK to it, to avoid table scans.

Near as I can tell, we assume there is some bit of magic built into the foreign key concept that handles this for us, but that is not the case.


Yeah I’ve run into this a few times. In Postgres, you can’t add a foreign key referencing columns that aren’t indexed. I assume this is for performance reasons so inserts/updates are efficient, but there is no such check that the source columns are also indexed, so updates/deletes on the referenced table can be terribly slow.


Isn’t the source almost always an indexed primary key?


No. The target is normally a primary key but the source is not. The source is rarely even unique.


Foreign keys are naturally referencing primary keys, which have their own (unique) indexes by default. And there's some extra magic with CASCADE.

But it seems like I am missing your point: care to expand on it so I can learn about the gotcha as well? What are you attempting to do, and what's making the performance slow?


If I've understood correctly it's the foreign key column itself that isn't indexed by default. Deleting a record then requires a table scan to ensure that the constraint holds


Example: if you delete a record from the customers table, you want an index on the foreign key in the orders table to delete the corresponding entries. This also means the performance of the delete can have the unfortunate property where the small work of deleting a single row cascades into the work of deleting many rows.
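
To make that concrete, the fix is just an index on the referencing column. A minimal sketch, with hypothetical customers/orders names (standard SQL):

    -- Hypothetical schema for the example above.
    CREATE TABLE customers (id BIGINT PRIMARY KEY);
    CREATE TABLE orders (
        id          BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL REFERENCES customers (id)
    );

    -- Without this index, DELETE FROM customers has to scan orders to check
    -- the constraint (or find the rows to cascade to); with it, it's an index probe.
    CREATE INDEX orders_customer_id_idx ON orders (customer_id);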


Which is why many elect to just ban deletes.


That’s not the reason. The reason why people don’t delete is because nobody wants to be left with inconsistent data relations. Deleting a customer is more deleting their PII (or scrambling it) and leaving everything else intact.

Or in the case of Silicon Valley, leave everything intact with a disabled flag and then spam you for the next decade or so.


I don't buy that reason, because that inconsistency can be easily prevented with good schema hygiene (either ON DELETE NO ACTION or ON DELETE CASCADE). The problem is rather that in order to maintain consistent data relations, delete operations must be carefully designed and that part of the functional design is usually skipped because it's perceived as not important.

Which is more or less what the GP says as well, deletes are not implemented because doing it properly requires proper design.


Had a similar experience (albeit with much less serious consequences): an Oracle Portal installation which worked fine in Dev/Test but with degrading performance on the actual Prod server (which was also immensely more powerful than what we used to develop).

Oracle was called in, "en masse". Keep also in mind that Oracle Portal was basically running on top of lots of dedicated tables and stored procedures and you could not access the data directly; you interacted with the contents only through views, aliases, etc.

It was indeed due to a smallish version bump. But we had a pretty good Team Leader who had made Oracle declare that the two different versions were absolutely compatible, allowing us to write code on the older version.

He made them state this in written form before starting the development, so it was later impossible to blame the developers for the problem.


I've never had to deal with a production-breaking issue resulting from this, but I have had to debug a collation-related issue in SQL Server that caused table scans to occur. Fortunately, the SQL profiler made it really obvious that a collation mismatch was the culprit.


Could you say more about why that was the culprit and not some problem with the load testing? For the Olympics, it seems like the most important test is "What happens when we launch?" That's the kind of thing I'd run on every commit.


An automated system-level upgrade in the production env, exactly 10 seconds before launch.

Load tests usually only happen in the staging env, not production; that is the policy.


In that case, it sounds like a really good reason to question why the staging environment is not enough like prod that load tests are valid. Or why a not-fully-tested system upgrade would happen 10 seconds before launch.


Why exactly would it be so bad to just put a suitable index on the table containing strings? The time complexity of the resulting search would be the same, so I assume there will be some constant factor slowdowns. Is it that indices over string fields are stored inefficiently on disk? (If so, can that not be fixed in the db engine directly?) Or is this fine today but wasn't fine 15 years ago?
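
For concreteness, the index in question would be something like the sketch below, using table/column names assumed from the article's description (old MySQL versions also limited how much of a long varchar could be indexed, which the replies get into):

    -- Hypothetical composite index over the original single-table schema.
    CREATE INDEX seen_tuple_idx ON seen (ip, helo, m_from, m_to);

    -- The greylisting lookup then becomes an index probe instead of a table scan:
    SELECT seen_at FROM seen
     WHERE ip = ? AND helo = ? AND m_from = ? AND m_to = ?;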


I think the article describes the solution that was used and not necessarily the best solution. If they made the “obvious” mistake in the first case they mightn’t arrive at the best solution in the second. Adding the single index and using the four tables have the same worst case complexity for reads, O(log n), with different constants and massively beat scanning at O(n).

It may be that the second solution wasn’t the best, or that it was better for write performance or memory usage—this was 2002 after all. It may also be the case that the cardinality of some columns was low in such a way that the four table solution was better, but maybe getting the columns in the right order for a multi column index could do that too.

Either way, I don’t think it really matters that much and doesn’t affect the point of the article.


Indexing existed 15 years ago. The article never mentions why indexing didn't solve this problem. Super weird take on the author's part...


Early InnoDB* had pretty strict limits on varchar indexes and was not the most efficient. I don't remember the details but it's entirely possible the single-table format Rachel described ran head on into those limitations.

Also remember indexes take space, and if you index all your text columns you'll balloon your DB size; and this was 2002, when that mattered a lot more even for text. Indexes also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.

Finally, normalizing your data while using the file-per-table setting (also not the default back then) can additionally provide better write & mixed throughput due to the writes being spread across multiple file descriptors and not fighting over as many shared locks. (The locking semantics for InnoDB have also improved massively in the last 19 years.)

* Assuming the author used InnoDB and not MyISAM. The latter is a garbage engine that doesn't even provide basic crash/reboot durability, and was the default for years; using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.


> indexes take space, and if you index all your text columns you'll balloon your DB … also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.

Indexes didn’t go away with these extra tables, they live in the ‘id’ field of each one. They also probably had UNIQUE constraint on the ‘value’ field, spending time on what you describe in the second half of my citation.

I mean, that should have saved some space for non-unique strings, but all other machinery is still there. And there are 4 extra unique constraints (also an index) in addition to 4 primary keys. Unless these strings are long, the space savings may turn out to be pretty marginal.


Right, indexes don't "go away" as you normalize, but the amount of data indexed drastically shifts and four unique indexes are pretty much guaranteed to be smaller in size than one massive combinatorial index of (all, four, possible, values). Not just in the case where any duplicates exist, but just alone in terms of the database's overhead in things like field delimiters and page metadata. While the specifics will vary a lot with specific DB implementations, generally with varchars involved you can usually assume a worst-case in terms of field delimiting (and from there sometimes a worst-case in average page size and how that impacts B-Tree balancing).

In general, indexes alone can't save you from normalization problems. From certain points of view an index is itself another form of denormalization, and while it is a very handy form of denormalization, heavily relying on big composite indexes (and the "natural keys" they represent) makes normalization worse and you have to keep such trade-offs in mind just as you would any other normalization planning.

It is theoretically possible a database could do something much smarter and semi-normalized with, for instance, Trie-based "search" indexes, but specifically in 2002, so far as I'm aware, in most of the databases of the time such indexes were never the default, didn't have great multi-column support, and would have been expensive to compute if you did turn them on. Even such "smart" indexes would likely still have suggested you normalize first.


To me it seems that spreading it over four tables would lead to a lot more potential read locks while the big combined table is waiting for a join on each of the others, and some process is trying to insert on the others and link them to the main. This is assuming they were using foreign keys and the main table 4-column index was unique.


Old best practice for InnoDB performance was actually to never use foreign keys because of these locking issues. Not sure if that's the case in 2021.


InnoDB uses row-level locking, and foreign keys are (usually) a great feature to ensure data integrity. But using multiple foreign keys from tables `a`,`b` as a composite index for table `x` can cause deadlock if both are being updated in rapid succession, because an update on `a` gets a lock on `x`, which needs a read lock on `b` which is waiting for the lock from `a` to be released. I try to never structure multiple foreign keys as a composite index.
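
A sketch of that shape with illustrative names, in case it helps picture it (MySQL flavour):

    CREATE TABLE a (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE b (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE x (
        a_id INT NOT NULL,
        b_id INT NOT NULL,
        PRIMARY KEY (a_id, b_id),              -- composite index over both FKs
        FOREIGN KEY (a_id) REFERENCES a (id),
        FOREIGN KEY (b_id) REFERENCES b (id)
    ) ENGINE=InnoDB;

    -- Concurrent updates to a and b can each take locks on x (via the FK checks)
    -- in opposite order and then wait on each other: the deadlock described above.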


This is fascinating, thanks for sharing.


> Also remember indexes take space

Indexes take space, except for the clustered index.

An important distinction.

If the point of this table is always selecting based on IP, m_from, m_to, then clustering on those columns in that order would make sense.

Of course, if it's expected that a lot of other query patterns will exist then it might make sense to only cluster on one of those columns and build indexes on the others.


But IP addresses are actually _numbers_, and you can also save them as bytes: 32 bits for IPv4, 128 bits for IPv6 addresses. (PostgreSQL has a native datatype 'inet' which has supported IPv6 since 7.4; if you're using Postgres and save an IP address in a character string, you're doing it wrong.) If you saved the IP as a number and put an index on it, all your queries would be blazing fast!
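
A minimal Postgres sketch of that (table and column names made up):

    CREATE TABLE seen (
        ip      inet        NOT NULL,
        m_from  text        NOT NULL,
        m_to    text        NOT NULL,
        seen_at timestamptz NOT NULL DEFAULT now()
    );
    CREATE INDEX seen_ip_idx ON seen (ip);

    -- inet compares and indexes as an address, not as text:
    SELECT * FROM seen WHERE ip = '192.0.2.15';
    -- it also understands networks, e.g. WHERE ip << '192.0.2.0/24'
    -- (containment queries are better served by a GiST index than the btree above).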


> using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.

The author mentions that these events took place in 2002. At that time the site was likely still running MySQL 3 and InnoDB was very new and considered experimental. Sure MySQL 4.0 had shipped in 2002 and InnoDB was a first class citizen, but back then upgrades from one version of MySQL to another weren't trivial tasks. You also tended to wait until the .1 release before making that leap.

So in fairness for the folks who originally set up that database MyISAM was likely their only realistic option.

I looked after and managed a fleet of MySQL servers from back then and for many years afterwards and even then it wasn't until MySQL 5 that we fully put our trust in InnoDB.


Surely more than 15 years! 50 maybe?


I had the same initial reaction in reading the post. My assumption was that indexing would have sufficiently sped up the query speed problem.

However, normalizing the fields that contain repetitive data into separate tables could create significant space savings since the full text for each column would not need to be stored for each row. Instead of 20-30 bytes per email address, a 4 byte (assuming 32-bit era) OID is stored in its place.

It's pretty easy to imagine how quickly the savings would add up, and that it would be very helpful in the era before SSDs or even 1TB HDDs existed.
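
Roughly, the layout described above looks like the sketch below (names assumed); each repeated string lives once in a lookup table and the wide table stores 4-byte ids:

    CREATE TABLE addresses (
        id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        value VARCHAR(255) NOT NULL,
        UNIQUE KEY (value)
    );

    CREATE TABLE seen (
        ip_id   INT UNSIGNED NOT NULL,   -- 4 bytes instead of a repeated string
        from_id INT UNSIGNED NOT NULL,   -- references addresses.id
        to_id   INT UNSIGNED NOT NULL,   -- references addresses.id
        seen_at DATETIME     NOT NULL
    );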


Normalization is important for deduplication, not only to index and compare a few short numbers instead of a few long strings: those host names and email addresses are long and often repeated.


That's not what normalization is. You're thinking of deduplication or compression. This table is normalized as per usual database terminology.


But if it’s guaranteed to be 1:1, why? The two implementations (normalized and denormalized) in that case should be completely isomorphic, quirks of the DB engine aside.


Why what? I don't understand what you are asking.


Yeah, while this is not optimal (ip should be converted into integer, time should be ts), the table would be small (as old entries could be safely deleted). The only real issue is the lack of indices.

Also is helo field even needed?


IP should NOT be an integer. IPv6 does not fit in a database integer. MySQL has functions to convert to bytes.

Even better, use a database with a proper inet datatype. That way you get correctness, space efficiency, and ability to intelligently index.


In 2002 it definitely should have been an integer.

In 2021 I’d recommend just turning off the IPv6 allocation or deleting it from DNS like this site does.


> In 2021 I’d recommend just turning off the IPv6 allocation or deleting it from DNS like this site does.

This is the funniest IPv6 excuse I've heard on HN.

https://ipv6excuses.com/


Thanks! I was really surprised to find it myself. I used HN in some sample code for analyzing how SSL tickets are reused and was surprised to see an IPv4 address come back.


Yes. A proper mail implementation won't randomize it, so it should be consistent. Depending on the design of the validation stages it might only need a case-normalized lookup to the string index table and that id as part of the final multi-column unique index, and in that last case only to convey to the database the expectations to optimize around (and require the data adhere to).


Let's say the article is a critique of some system by someone who is not much better at databases. As many people also pointed out, a bad solution was replaced with a bad solution that works better for someone.


At the worst you could concatenate the strings and have a single primary key.. or hash the string concatenation.


It was fine on -some- databases 20 years ago. mysql of that era, less often.


The ending is the most important part.

> Now, what do you suppose happened to that clueless programmer who didn't know anything about foreign key relationships?

> Well, that's easy. She just wrote this post for you. That's right, I was that clueless newbie who came up with a completely ridiculous abuse of a SQL database that was slow, bloated, and obviously wrong at a glance to anyone who had a clue.

> My point is: EVERYONE goes through this, particularly if operating in a vacuum with no mentorship, guidance, or reference points. Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?


(It seems a lot of folks are getting nerd-sniped by the set-up and missing the moral of the story, eh?)

I don't know the actual numbers, but it's been pointed out that at any given time something like half of all programmers have been doing it less than five years, for decades now.

That, plus the strident ignorance of past art and practice, seem to me to bring on a lot of issues.


I don't think it's purely nerd-sniping. If your story is "at first you are bad, but then you get good", but your example is of a case where you did something fine but then replaced it with something worse, that rather undermines the story.


If even someone with Rachel's level of experience still doesn't know all the minutiae of database optimization, I think that just amplifies her point about the importance of mentoring novices.


I recall reading a rant on another site about someone so upset they had to deal with clueless noobs at work.

They then went on to list the errors that this brand new entry level employee had made when writing ... an authentication system ...

I was more than a little shocked when I realized they were serious and hadn't realized the issue was sending the new entry level guy to do that job alone.


I would expect most seniors to want a set of eyes across their work doing any form of authk or z. If only to share blame :p


> authk

I've always seen authn. Where'd you pick this usage up?


good question! I meant authn. I think I just subbed in the phonic from the ..cation part of the word.


You get my upvote both for having a friendly response, and because your answer is what my brain filled in and I’m glad I don’t need to learn a new shorthand.


You need to distinguish authentication from authorization. "auth9n" and "auth8n" might be a slight bit too obscure for the purpose.


What a completely nonresponsive "response". To the best of my knowledge, no one has ever even tried to use auth9n or auth8n. But that's not where the 'n' in authn comes from. authn and authk are both equally distinct from authz.


You appear to be unfamiliar with the extremely common, and problematic, abbreviation "i18n" for "internationalization".

For my part, I have not encountered "authn", "authz", or "authk" before. But prior discussion did not even mention the word "authorization", and that seemed worth bringing out in the open.

So, not nonresponsive, just responsive to things you weren't personally interested in.


Generally authentication is authn and authorization is authz


> but your example is of a case where you did something fine but then replaced it with something worse

The way I see it, this was a case of something that wasn't working being replaced with something that was working.

I don't know what your metrics about "good" or "bad" are but eventually they got a solution that covered their use cases and the solution was good enough for stakeholders and that is "good".


Chronology alone only progresses; progress sometimes regresses.


> It seems a lot of folks are getting nerd-sniped

that is an excellent term.



Old enough some people under 35 may not have encountered it… ;)


I feel like that phrase has been there forever, and I'm exactly 35. Will this happen more often as I grow older? Ugh. Feels weird. Maybe also a bit depressing.


It was weird seeing all the spongebob squarepants memes take over the internet when I was too old to ever grow up with that. I turned 28 in 1999 when that first aired. That was my "holy shit I'm so old" moment when that finally trickled up into my awareness as someone nearly turning 40 or so.


As a 48 year old who has no idea who an awful lot of modern celebrities are, I'm constantly wondering if it is because I'm just old now or if its because there are just way more celebrities these days due to the large parasocial celebrity class that didn't really exist when I was younger.

Maybe a bit of both.


> parasocial

^H^H^H^H^Hitic

There, FTFY.


For me it is Pokemon. I was a few years older than the demographic it targeted.

I could have jumped on it and played it but I figured just another toy craze, it'll be over before long.


Same age as you. I have a lot of friends in the Philippines. I swear the Facebook employment info of half the people in the Philippines says they work at the Krusty Krab. (And for most filipinos, the internet = facebook.) For some reason I never asked what that meant, and for many years I thought that was just some odd joke, and was vaguely puzzled about how widespread it is. Eventually I happened on the Spongebob connection!


Well, then, they are part of the 10,000

https://xkcd.com/1053/


Only if they were born in the US


I'm 49, I don't know how I never saw that xkcd.


I just spent several minutes trying to solve that resistor problem (or a simplified version with adjacent nodes) before giving up and deciding I’ll look up other people’s analyses when I get the time. Definitely a good example.


The math would resemble xkcd's Lucky Ten Thousand: https://xkcd.com/1053/

With yes the added confounding factors of how often our industry seems to prefer to hire youth over experience, and subsequently how often experience "drains" to other fields/management roles/"dark matter".


> (It seems a lot of folks are getting nerd-sniped by the set-up and missing the moral of the story, eh?)

The narrative should lead to the conclusion. If I told you the story of the tortoise and the hare in which the hare gets shot by a hunter, then said the moral is "perseverance wins", you'd be rightfully confused.


Yep. Folks are getting lost in the weeds discussing indexing of database tables. That's _totally_ beside the point here.

The thing is, the first implementation was a perfectly fine "straight line" approach to solve the problem at hand. One table, a few columns, computers are pretty fast at searching for stuff... why not? In many scenarios, one would never see a problem with that schema.

Unfortunately, "operating in a vacuum with no mentorship, guidance, or reference points", is normal for many folks. She's talking about the trenches of SV, it's even worse outside of that where the only help you might get is smug smackdowns on stackoverflow (or worse, the DBA stackexchange) for daring to ask about such a "basic problem".


Yes and no and this post highlights a subtle issue with mentorship (which I think is important): Technology does not stand still. What was true in 2002, might not be true today. While adopting the naïve approach was detrimental back then, today databases recognise that this happens and provide easy workarounds to get you out of trouble that didn't exist back then.

I've experienced this just by switching languages. C# had many articles dedicated to how much better StringBuilder was compared to String.Concat and yet, other languages would do the right thing by default. I would give advice in a totally different language about a problem that the target language did not have.

As the song goes:

"Be careful whose advice you buy but be patient with those who supply it

Advice is a form of nostalgia, dispensing it is a way of fishing the past

From the disposal, wiping it off, painting over the ugly parts

And recycling it for more than it's worth"


That might be true if you just take blanket advice. The key is to find out why say StringBuilder is better than String.Concat. If you understand the implementation details and tradeoffs involved, this makes the knowledge much more applicable in the future. The core concepts in technology do not move nearly as fast as individual projects, libraries, frameworks, languages, etc...


You wrote: <<While adopting the naïve approach was detrimental back then, today databases recognise that this happens and provide easy workarounds to get you out of trouble that didn't exist back then.>>

Do you have a specific example that would help in the case described in Rachel's blog post?

I am still making (and finding) occasional indexing and 3NF mistakes. In my experience, it is always humans finding and fixing these issues.


But trust me on the sunscreen.


Totally off topic I know, but what is the concern with String.Concat ?


If you repeatedly use concat to build up a string, the amount of time grows quadratically. This is because the string is copied each time you concat. Note that the + operator on strings gets turned into a call to concat.

https://docs.microsoft.com/troubleshoot/dotnet/csharp/string...


I was wondering about it in the context of "yet, other languages would do the right thing by default". Repeatedly concatenating to the same string (as opposed to concatenating an array of strings in one go) would be slow in any language I know of, unless you allocate a larger buffer up front, which is what StringBuilder does.

Some languages have mutable strings, but you would still need to allocate a sufficiently larger buffer if you want to add strings in a loop.


In languages that have immutable string semantics and can detect that they are the exclusive owner of a string, they can modify them in place. For reference counting languages like Python, they can easily determine they are the exclusive owner by checking for a reference count of 1. See:

https://github.com/python/cpython/blob/main/Objects/unicodeo...

Rust's ownership system ensures that there cannot exist mutable and immutable references to the same object at the same time. So if you have a mutable reference to a string it is ok to modify in place.

I have not confirmed either the Python example or the Rust example too deeply, since I'm replying to a 5 day old comment. But the general principle holds. GC languages like C# and Java don't have easy ways to check ownership so they always copy strings when mutating them. Maybe I'll write a blogpost on this in the future.


I don't think StringBuilder is faster specifically because it allocates a large buffer. Pretty much every language with growable data structures can already grow any such structure pretty fast, including arrays, maps, and mutable strings. They already have a bunch of pre-set constants about stuff like how big to allocate for the initial empty one and how much more to allocate every time you overflow to balance speed and memory efficiency. It's faster because it's mutable and keeps adding new data to the existing buffer until it gets too large.


The cost is not the allocation per se, the cost is copying the bytes. When concatenating two strings, the bytes from both strings are copied into the new string. If you repeatedly concatenate, the bytes end up getting copied many times. E.g if you concatenate 100 strings by appending one by one, the bytes in the first string is copied 99 times, the bytes in the second string is copied 98 times and so on.

Using a StringBuilder, the strings are only copied once when copied into the buffer. If the buffer need to grow, the whole buffer is copied into a larger buffer, but if the buffer grows by say doubling the size, then this cost is amortized, so each string on average is still only copied twice at worst.

Faster yet is concatenating an array of strings in a single concat operation. This avoids the buffer resize issue, since the necessary buffer size is known up front. But this leads to the subtle issue where a single concat is faster than using a StringBuilder, while the cost for multiple appends grows quadratically for concat but only linearly for StringBuilder.


Repeatedly concatting a string is the fastest way I am aware of to build a longer one in Javascript (this has been deliberately optimized for). I believe this might also be the case for PHP, or at least it's very fast. Perl might be pretty fast at this too, but I could be wrong.


I know JS's case somewhat well, and the lesson there is not, as the comment above suggests, that it "just does the right thing". JS doesn't use a "String Builder" under the hood with a lot of string + calls; instead, JS does subtly the "wrong" thing as an optimization: in the .NET CLR strings are always, always immutable (or they are not strings according to guarantees that the CLR makes). JS lives in a single language VM that makes no such guarantees, and often lives in strictly single threaded worlds where the VM doesn't have to guarantee the immutability of strings to anyone at the binary memory layout level, so most JS VMs are free to do the "wrong thing" (from the .NET CLR perspective) and just mutably alter strings under the hood as a deliberate optimization.

There's a lot of interesting conflation of "right" versus "wrong" in this thread. There's the C# "best practices" "right" versus "wrong" of knowing when to choose things like StringBuilder over string.Concat or string's + operator overloading (which does just call string.Concat under the hood). There's the "does that right/wrong apply to other languages" question? (The answer is far more complex than just "right" and "wrong".) There's the VM guarantees of the .NET CLR's hard line "no string is mutable" setting a somewhat hard "right" versus "wrong", and other languages' VMs/interpreters not needing to make such guarantees to themselves or to others. None of these are really ever "right" versus "wrong"; almost all of them are trade-offs described as "best practices", and we take the word "best" too strongly, especially when we don't know/examine/explore the trade-off context it originated from (and sometimes wrongly assume that "best" means "universally the best").


I believe joining an array of strings is still faster in JavaScript, since the combined size can be known up front.


I agree, and would even go one step further and say the first implementation was a decent first pass. Sometimes the 'awful' implementation is good enough and your time is better used on something else. However, this pattern can sometimes bite you in the long term. As you can have tons of little 'hacks' all over the place and people become afraid to touch them, or worse copy from them.

This also nicely shows one of the fun things with table scans. They look decent at first then perf goes crappy. Then you get to learn something. In this case it looks like she used normalization to scrunch out the main table size (win on the scan rate, as it would not be hitting the disk as much with faster row reads). It probably also helped just on the 'to' lookups. Think about how many people are in an org, even a large company has a finite number of people but they get thousands of emails a week. That table is going to be massively smaller than keeping it in every row copied over and over. Just doing the 'to' clause alone would dump out huge amounts of that scan it was doing. That is even before considering an index.

The trick with most SQL instances is do less work. Grab less rows when you can. Use smaller tables if you can. Throw out unnecessary data from your rows if you can. Reduce round trips to the disk if you can (which usually conflicts with the previous rules). Pick your data ordering well. See if you can get your lookups to be ints instead of other datatypes. Indexes usually are a good first tool to grab when speeding up a query. But they do have a cost, on insert/update/delete and disk. Usually that cost is less than your lookup time, but not always. But you stick all of that together and you can have a DB that is really performant.

For me the fun one was one DB I worked in where they used a GUID as the primary key, and therefore the FK into other tables. It was also the default ordering key on the disk. Took me a while to make them understand why the perf on lookup was so bad. I know why they did it and it made sense at the time to do it that way. But long term it was holding things back, ballooning the data size, and crushing the lookup times.

How did I come by all of this? Lots of reading and stubbing my toe on things and watching others do the same. Most of the time you do not get a 'mentor'. :( But I sure try to teach others.


I’d be a lot more sympathetic if the major RDBMSes didn’t have outstanding and thorough reference manuals, or if there weren’t a mountain of books on the subject that cover, among other things, the topic of indexing and its importance. MySQL’s manual, for example, has covered this subject from the very beginning: https://dev.mysql.com/doc/refman/5.6/en/mysql-indexes.html (I don’t have the 3.x manuals handy but it was there back then too).

It’s not clear from the article whether the author spent any time studying the problem before implementing it. If she failed to do so, it is both problematic and way more common than it ought to be.

“Ready, fire, aim”

But you know what they say: good judgment comes from experience, and experience comes from bad judgment.

Compounding the problem here is that the author now has much more experience and in a reflective blog post, still got the wrong answer.

IMO the better lesson to take away here would have been to take the time getting to know the technology before putting it into production instead of jumping head first into it. That would be bar raising advice. The current advice doesn’t advise caution; instead it perpetuates the status quo and gives “feel good” advice.


I couldn't disagree more with this comment. No amount of reading documentation teaches you how to build production systems. You progress much faster by getting your hands dirty, making mistakes, and learning from them.

The challenge of writing good software is not about knowing and perfecting every gory detail; it's about developing the judgement to focus on the details that actually matter. Junior engineers who follow your advice will be scared to make mistakes and their development will languish compared to those who dive in and learn from their mistakes. As one gains experience it's easy to become arrogant and dismissive of mistakes that junior engineers make, but this can be extremely poisonous to their development.


Respectfully, I believe you are disagreeing with an argument that I am not making. It's not an either-or scenario. Both access to and reliance on reference materials and the ability to safely experiment are part of the professional engineer's toolbox. I can say this with confidence because I was a junior engineer once - it's not like I'm speaking without experience.

Everyone - new and experienced developers alike - should have a healthy fear of causing pain to customers. Serving customers - indirectly or otherwise - is what makes businesses run. As a principal engineer today my number one focus is on the customer experience, and I try to instill this same focus in every mentee I have.

Does that mean that junior developers should live in constant fear and decision paralysis? Of course not.

That's where the mentor - the more experienced and seasoned engineer - comes in. The mentor is roughly analogous to the teacher. And the teaching process uses a combination of a mentor, reference materials, and the lab process. The reference materials provide the details; the lab process provides a safe environment in which to learn; and the mentor provides the boundaries to prevent the experiments from causing damage, feedback to the mentee, and wisdom to fill in the gaps between them all. If any of these are absent, the new learner's development will suffer.

Outstanding educational systems were built over the last 3 centuries in the Western tradition using this technique, and other societies not steeped in this tradition have sent their children to us (especially in the last half-century) to do better than they ever could at home. It is a testament to the quality of the approach.

So no, I'm not advocating that junior developers should do nothing until they have read all the books front to back. They'd never gain any experience at all or make essential mistakes they could learn from if they did that. But junior developers should not work in a vacuum - more senior developers who both provide guidance and refer them to reference materials to study and try again, in my experience, lead to stronger, more capable engineers in the long term. "Learning to learn" and "respect what came before" are skills as important as the engineering process itself.


> That's where the mentor [...] comes in.

The thing is, specifically, that the OP did NOT have a mentor. They had a serious problem to solve, pronto, and knew enough to take a crack at it. OK, the initial implementation was suboptimal. So what? It's totally normal, learn and try again. Repeat.

It would be nice if every workplace had an orderly hierarchy of talent where everyone takes care to nurture and support everyone else (especially new folks). And where it's OK to ask questions and receive guidance even across organizational silos, where there are guardrails to mitigate accidents and terrible mistakes. If you work in such an environment, you are lucky.

It is far more common to have sharp-elbow/blamestorm workplaces which pretend to value "accountability" but then don't lift a finger to support anyone who takes initiative. I suspect that at the time Rachelbythebay worked in exactly this kind of environment, where it simply wasn't possible for an OPS person to go see the resident database expert and ask for advice.


Right. Hence my concern about the lack of helpful advice other than “accept yourself”[1]: Neophyte engineers lacking guardrails and local mentors should fall back to what they learned in school: experiment, use your books, and ask questions from the community. Mentors can be found outside the immediate workplace, after all.

And when it comes to taking production risks, measure twice and cut once, just like a good carpenter.

[1] Of course you should accept yourself. But that alone won’t advance the profession or one’s career.


> measure twice and cut once

Being able to do that is a luxury that many do not enjoy in nose-to-the-grindstone workplaces. You make an estimate (or someone makes it for you), then you gotta deliver mentor or no mentor, and whether you know the finer points of database design/care-and-feeding or not.

There's something good to be said for taking action. Rachelbythebay did just fine. No one died, the company didn't suffer, it was just a problem, it got corrected later. Big whoop.


But it's also about constantly asking yourself "how can this be done better? What do I not know that could improve it?"

I learned probably 80% of what I know about DB optimization from reading the Percona blog and S.O. The other 20% came from splashing around and making mistakes, and testing tons of different things with EXPLAIN. That's basically learning in a vacuum.


I couldn't disagree more with this comment. No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes, nor changes the impact of the mistakes. You also progress much faster by learning from other peoples' mistakes, this is why written language is so powerful.

Honestly, I think your comment is okay minus the fact that you're trying to highlight such hard disagreement with a sentiment that people should read the fucking manual. Really, you couldn't disagree MORE?

There is definitely value in RTFM. There are cases where making mistakes in production is not acceptable, and progressing by "making mistakes" is a mistake on its own. I don't think the case in this article sounds like one of those, but they do exist (e.g. financial systems (think about people losing money on crypto exchanges), healthcare, or, say, sending rockets into space). In many cases, making mistakes is fine in test systems, but completely, absolutely, catastrophically unacceptable in production. Although, I refuse to blame the junior engineer for such mistakes, I blame management that "sent a boy to do a man's job" (apologies for a dated idiom here).

(As an aside, overall, I disagree with many of the comments here nitpicking the way the author solved the problem, calling her "clueless", etc. I really don't care about that level of detail, and while I agree the solution does not seem ideal, it worked for them better than the previous solution.)


> No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes...

Categorically wrong.

Mistakes, failure, and trial and error are very much a part of developing skills. If you're not making mistakes, you're also not taking enough risks and thus missing out on opportunities for growth.


Yes, mistakes are required to learn from them, but making mistakes does not imply that a person will learn from their mistakes.


You can't really search for a problem if you don't suspect its existence. At this point you might feel you have everything you need to start the implementation; how do you guess?


… I appreciate her sentiment at the end there that we should try to change the status quo (which I think does suck), but I'm not sure how much power the average employee has over it. Most employers I've worked at seem loath to retain anyone past about 2 years. (E.g., currently I'm in the 95th percentile, after just over 2 years.) IME it takes about 6 months to really learn how a company's systems work to the point of proficiency (yes, six months), which means we're spending ~25% of the time "training", where "training" is usually someone trying something crazy and failing, and getting corrected, as opposed to some structured form of learning. Oftentimes the systems these engineers are stumbling over are, I think, the internal ones; they need more features, or refactoring to account for organic growth, but since they're not shiny new customer facing features they'll get exactly 0 priority from PMs. The engineers who build experience working with these systems and actually know what needs to change and how without breaking the existing use cases … are about to leave the company, since nobody is retained beyond ~2 years.

I also find my own team is usually strapped for resources, normally people. (Usually politely phrased as "time".) Yes, one has to be wary of mythical man-month'ing it, but at my last employer we had essentially 2 of us on a project that could have easily used at least one, if not two, more people. Repeat across every project and that employer was understaffed by 50-100%, essentially.

Some company just went for S-1, and they were bleeding cash. But they weren't bleeding it into new ventures: they were bleeding it into marketing. Sure, that might win you one or two customers, but I think you'd make much stronger gains with new products, or less buggy products that don't drive the existing customers away.

Also there's an obsession with "NIT" — not invented there — that really ties my hands as an engineer. Like, everything has to be out sourced to some cloud vendor provider whose product only fits some of the needs and barely, and whose "support" department appears to be unaware of what a computer is. I'm a SWE, let me do my thing, once in a while? (Yes, where there's a good fit for an external product, yes, by all means. But these days my job is 100% support tickets, and like 3% actual engineering.)


What's funny is there seems to be a lot of debate among grizzled veterans here about whether her old solution is actually better than the new. We're getting into index types, engine implementations, normal forms, etc. And really, this is a relatively simple development problem that I can easily imagine on an interview test.

Now, imagine a new engineer, just getting started trying to make sense of it all, yet being met with harsh criticism and impatience.

Maybe that was exactly her point--to get everyone debating over such a relatively trivial problem to communicate the messages: Good engineering is hard. Everyone is learning. Opinions can vary. Show some humility and grace.


Two engineers I interact with on a regular basis:

1. Me from six months ago. Clueless about a lot of stuff, made some odd technical choices for obscure reasons, left a lot of messes for me to deal with.

2. Me six months from now. Anything I don't have time for now, I can leave for him. If I was being considerate I'd write better docs for him to work with, but I don't have the time for that either.


> tend to chase off anyone who makes it to the age of 35

That's definitely not true anymore, if it ever was. Senior/staff level engs with 20+ years of experience are sought after and paid a ton of money.


I was reading her article thinking "yeah, well… wow, she's pretty hard on this programmer, it's not that unreasonable anyway!", until this passage.

I would not expect her to be like this from what I've read on her blog so I was surprised. Well written!


The way she kept referring to the programmer made me suspicious that that would be the ending. I personally really like that as a way of conveying the message of us all having growing pains and learning the hard way from some inefficient code.

I know I've also had these encounters (specifically in DBs), and I'm sure there are plenty more to come.


Seems like the definition of hacker to me.


> My point is: EVERYONE goes through this

A lot of the basic issues beginners go through can be mitigated by paying attention in class (and/or having formal education in the first place).

I’ll grant that there are other, similar, things that everyone will go through though.

My own was manually writing an ‘implode’ function in Javascript because I didn’t know ‘join’ existed.


While the normalized version is more compact and doesn't store redundant data, it _also_ needs an index on the four columns or it'll have to check every row in the table. A similar index added to the original denormalized table would have given comparable query performance.

The table schema isn't terrible, it's just not great. A good first-pass to be optimized when it's discovered to be overly large.


> The table schema isn't terrible, it's just not great.

It depends - if this was the full use-case then maybe a single table is actually a pretty good solution.


That is not true: the sub-tables guarantee that the value is unique in the entire column.

The main table did not have any such guarantees, so there will be a lot of indexing and duplication in at least 3 (of the 4) sub-table columns.

Moreover, indexing would be much faster and more efficient as there are only unique values in the sub-tables.


Why wouldn't it have such guarantees, if the indices in question are defined as UNIQUE?


Because they are not unique in the main table.

For example, if there are multiple emails from the same domain name, then the domain name column will have the same entry multiple times.

When the domain is moved to a sub-table, the main table still has duplicate entries, in the form of foreign keys, while the domain table has a unique entry for each domain name.


Foreign key columns (i.e. the ones in the main table) are often indexed themselves, and those indices can also be UNIQUE.


Feels like you could just concatenate and hash the 4 values with MD5 and store the hash and time.

Edit: I guess concatenate with a delimiter if you're worried about false positives with the concat. But it does read like a cache of "I've seen this before". Doing it this way would be compact and indexed well. MD5 was fast in 2002, and you could just use a CRC instead if it weren't. I suppose you lose some operational visibility to what's going on.
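
A sketch of that idea in MySQL flavour (names assumed; CONCAT_WS supplies the delimiter mentioned in the edit):

    CREATE TABLE seen (
        tuple_md5 CHAR(32) NOT NULL PRIMARY KEY,   -- hex MD5 of the 4-tuple
        seen_at   DATETIME NOT NULL
    );

    -- Record the tuple, keeping the original timestamp if it's already there:
    INSERT INTO seen (tuple_md5, seen_at)
    VALUES (MD5(CONCAT_WS('|', @ip, @helo, @m_from, @m_to)), NOW())
    ON DUPLICATE KEY UPDATE seen_at = seen_at;

    -- "Have I seen this before, and when?"
    SELECT seen_at FROM seen
     WHERE tuple_md5 = MD5(CONCAT_WS('|', @ip, @helo, @m_from, @m_to));

(BINARY(16) with UNHEX(MD5(...)) would halve the key size, but the idea is the same.)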


Yup, we do this at work for similar purposes and it works a-ok. We also have some use cases that follow the original “problem” schema and they work fine with the correct indices involved.

My guess is that in 2002 there were some issues making those options unappealing to the Engineering team.

When we do this in the realm of huge traffic then we run the data through a log stream and it ends up in either a KV store, Parquet with a SQL/Query layer on top of it, or hashed and rolled into a database (and all of the above if there are a lot of disparate consumers. Weee, Data Lakes).

This is also the sort of thing I’d imagine Elastic would love you to use their search engine for.


Isn't that exactly what adding an index would do internally?


2002 MySQL had some limitations on indexing variable length string (varchar) columns. That's the gist of the linked story.


Depending on the RDBMS system, in most cases it does not.


For the 2021 version, you'd just generate a bloom filter/cuckoo hash from all the data gathered by your spamhaus database periodically. Make a separate one for each value in your tuple and your score would be the number of the sub-hashes that matched.


“Periodically” here would have to be quite frequent, as you have to rebuild the database before the next retry.


A couple times a day at most?


From the post, you'd have to be rebuilding it every 15 minutes:

> A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.


Still better than locking the tables every time you receive a piece of mail.


I'm not super familiar with this stuff, but I believe you could then use a key-value store with automatic expiry like Redis for automatic pruning and faster lookups.


Redis and memcached didn't exist in that timeframe.


I'd say that pretty much everyone is a better DB programmer than I am, so I don't really go out of my way to design anything especially robust. I try to design stuff I'm not good at, in a way that makes it easy for someone that knows more than I do, to come in and toss out the work I did, to be replaced with good work.

There's lots of stuff that I'm bad at. I'm also good at a fair bit of stuff. I got that way, by being bad at it, making mistakes, asking "dumb" questions, and seeing how others did it.


> I try to design stuff I'm not good at, in a way that makes it easy for someone that knows more than I do, to come in and toss out the work I did, to be replaced with good work.

Sounds familiar. And sometimes that other someone is just a later version of yourself :)

I am very grateful when my past self thought to add comments for my future self.


People in the comments are getting (rightfully) outraged about the poor understanding of indexing, but I'm a little surprised that everyone here doesn't seem to understand normalization either. The original schema is perfectly normalized and is already in 3NF: none of the columns shown has a dependence on any of the other columns outside of the primary key (in other words, if you knew eg the values of the ip, helo, and from columns, you'd still have no information about the "to" column).

Normalization does not mean "the same string can only appear once". Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table. Now instead of "address1@foo.bar" in 20 different locations, you have "id1" in 20 different locations. There's been no actual "deduplication", but that's again not the point. Creating the extra tables has no impact on the normalization of the data.


Thank you. Was thinking that I was the clueless one... (probably a little) because that is not my understanding of what normalization is.

I thought perhaps swapping the strings to integers might make it easier to index, or perhaps it did indeed help with deduplication in that the implementation didn't "compress" identical strings in a column—saving space and perhaps helping performance. But both issues appeared to be implementation issues with an unsophisticated MySQL circa 2000, rather than a fundamentally wrong schema.

I agreed with her comment at the end about not valuing experience, but proper databasing should be taught to every developer at the undergraduate level, and somehow to the self-taught. Looking at comptia... they don't seem to have a db/sql test, only an "IT-Fundamentals" which touches on it.


Came here to say this. Whatever may have been true of InnoDB 20 years ago, don't follow this article's advice for any modern relational database.

To add: Not sure about MySQL, but `varchar`/`text` in PostgreSQL for short strings like those in the article is very efficient. It basically just takes up space equaling the length of the string on disk, plus one byte [1].

[1] https://www.postgresql.org/docs/current/datatype-character.h...


>>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.


>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

>How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

Ok, but that's not what normalization means.

If you have a table as described in the article that looks like:

foo | bar | baz

------------------

foo1 | bar1 | baz1

foo2 | bar2 | baz2

foo3 | bar3 | baz3

foo4 | bar1 | baz2

foo3 | bar1 | baz4

and then you say "ok, bar1 is now actually called id1", you now have a table

foo | bar | baz

------------------

foo1 | id1 | baz1

foo2 | bar2 | baz2

foo3 | bar3 | baz3

foo4 | id1 | baz2

foo3 | id1 | baz4

you haven't actually changed anything about the relationships in this data. You've just renamed one of your values. This is really a form of compression, not normalization.

Normalization is fundamentally about the constraints on data and how they are codified. If you took the author's schema and added a new column for the country where the IP address is located (so the columns are now ip, helo, from, to, and country), then the table is no longer normalized because there is an implicit relationship between ip and country--if IP 1.2.3.4 is located in the USA, every row with 1.2.3.4 as the ip must have USA as the country. If you know the IP for a row, you know its country. This is what 3NF is about. Here you'd be able to represent invalid data by inserting a row with 1.2.3.4 and a non-US country, and you normalize this schema by adding a new table mapping IP to country.
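
A minimal sketch of that normalization step (table and column names here are hypothetical, borrowing the article's field names):

    -- split the ip -> country dependency into its own table
    CREATE TABLE ip_country (
        ip      VARCHAR(45) PRIMARY KEY,   -- long enough for an IPv6 literal
        country CHAR(2) NOT NULL
    );

    -- the main table keeps only ip; country comes from a join
    SELECT g.ip, g.helo, g.m_from, g.m_to, c.country
    FROM greylist g
    JOIN ip_country c ON c.ip = g.ip;

After the split, a row pairing 1.2.3.4 with a non-US country simply can't be represented.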

But none of that is what's going on in the article. The author described several fields that have no relationship at all between them--IP is assumed to be completely independent of helo, from address, and to address. And the second schema they propose is in no way "normalizing" anything. The four new tables don't establish any relationship between any of the data.

>This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.

It's not "very narrowly" 3NF. It's pretty clear-cut! A lot of commenters here are referring to the second schema as the "normalized" one and to me that betrays a fundamental misunderstanding of what the term even means. And sure, if you had a different schema that wasn't normalized, then it wouldn't be normalized, but that's not what's in the article.


Your point is well taken. In the bare description provided for the first schema in the article, the data is already normalized, and all that can be achieved is compression.


> This is really a form of compression, not normalization.

First of all, the index hashes of the emails depend on the email strings, hence the indexed original schema is not normalised.

Secondly, it would not effectively be compression unless there were in fact dependencies in the data. But we can make many fair assumptions about the statistical dependencies. For example, certain emails/ips occur together more often than others, and so on. In so far as our assumptions of these dependencies are correct, normalisation gives us all the usual benefits.


I think I’ve determined very long ago that the only normal form worth keeping in mind is the Boyce-Codd normal form.

While you may technically be correct, I think there’s few people that think of anything else when talking database normalization.

That said, I cannot quickly figure out if it’s actually true here.


Okay, this will be my last comment in this thread because I don't have any new way of presenting this and checking this is not a good use of my time.

This schema is also in Boyce-Codd normal form. It's normalized in every usual sense of the word. Trivially so, even. It's not a question of being "technically" correct. If you think the second schema is more normalized than the first one, you need to re-evaluate your mental model of what normalization means. That's all there is to it.


Plus depending on the amount of emails they get, that optimisation could be unnecessary. That database schema was perfectly fine in some cases.


This is very slightly not quite true, because the HELO string and the remote address should go hand-in-hand.


You might be right (I don't actually know what a HELO string is, I don't know anything about SMTP :). I was just going off how the author presented the data, as a tuple of four completely independent things.

Of course the main point still stands, that the two schemas are exactly as normalized as each other.

Edit: rereading the original post, the author mentions that "they forged...the HELO"--so perhaps there was indeed no relationship between HELO and IP here. But again, I don't know anything about SMTP, so this could be wrong.


I do know about smtp and you were right regardless, because the author was talking about 4 database fields, not smtp. The details of smtp are irrelevant.


Normalisation depends on the semantics of the data, and so the details of SMTP are very much relevant.


incorrect


It's (somewhat) because the HELO is forged that there's no relationship between HELO and IP. The very first message of SMTP is "HELO <hostname>"; the hostname can either be a unique identifier (server1.company.com, etc.) or a system-level identifier (mta.company.com for all of your company's outbound mail agents, or in the case of bulk mailers they might use bulk.client1.com when sending as client1, bulk.client2.com, etc.). But there is/was no authentication on what you send as HELO (now you can verify it against the TLS certificate, though many implementations don't do that at all or well), so correlating based on the hostname in HELO was questionable at best. Thus, the combination of (HELO, IP) was best treated as a single value, as a tuple.


> But there is/was no authentication on what you send as HELO

Yep, and that explains the "foobar" rows - those should have resolved to the same IP, except because there's no authentication that blocks it you could put gibberish here and the SMTP server would accept it.

> so correlating based on the hostname in HELO was questionable at best

Eh, spambots from two different IPs could have both hardcoded "foobar" because of the lack of authentication, so I could see this working to filter legitimate/illegitimate emails from a compromised IP.


Right, useful as a signal in a bayesian filter most certainly, but there's no strong general rule.


Only if the SMTP client is following the RFC; being spammers, they probably sent anything but the actual domain they sent from.


> Normalization does not mean "the same string can only appear once".

It can, if you want to easily update an email address, or easily remove all references to an email address because it is PII.


That's not what normalization normally means, no. See eg wikipedia https://en.wikipedia.org/wiki/Database_normalization


My piano teacher once told me that it's common to make a mistake, but it's disastrous if you keep practicing without realizing that you have made a mistake. What I look for in a developer is their ability to realize a mistake and find ways to fix it.


The problem, I think is that most people, me included, don't really know what databases really do. There is a whole lot about optimizing procedural code, with a variety of tools, the dangers of premature optimization and the tradeoff with readability. Anyone with a computer science degree has heard about algorithmic complexity, caches, etc...

But databases are just magic. You try things out, usually involving a CREATE INDEX at some point, and sometimes it gets faster, so you keep it.

Rachel, in her blog post, is a good example of that thought process. She used a "best practice", added an index (there is always an index), and it made her queries faster, cool. I don't blame her, it works, and it is a good reminder of the 3NF principle. But work like that on procedural code and I'm sure we will get plenty of reactions like "why no profiler?".

Many, many programmers write SQL, but very few seem to know about query plans and the way the underlying data structures work. It almost looks like secret knowledge of the DBA caste or something. I know it is all public knowledge of course, but it is rarely taught, and the little I know about it is all personal curiosity.


> But databases are just magic.

100%!!!

I hate how developers talk about a "database" as a monolithic concept. It's an abstract concept with countless implementations built off of competing philosophies of that abstract concept. SQL is only slightly more concrete, but there are as many variants and special quirks of SQL dialects out there as there are databases.


Database indexes are definitely one of the most approachable concepts of databases for computer science majors. If you have studied data structures and Big O notation you have all the concepts needed.

While I agree that how a query planner works is one of the most ‘magic’ aspects, I think the output from the query planner in most databases is very approachable as well to regular common programmers and will get you quite far in solving performance issues. We know what full scans are (searching through every element of an array), etc.

The challenge is usually discovering that the jargon used in your database really maps to something you do already have a concept about and then reading the database documentation…


This was great. Clever twist at the end.

I cringed a little because some of those mistakes look like stuff I would do even now. I have a ton of front/back end experience in a huge variety of languages and platforms, but I am NOT a database engineer. That's a specific skillset I never picked up nor, admittedly, am I passionate about.

Sorry to go off on a tangent, but it also brings to mind that even so-called experts make mistakes. I watch a lot of live concert footage from famous bands from the 60s to the 90s, and as a musician myself, I spot a lot of mistakes even among the most renowned artists…with the sole exception of Rush. As far as I can tell, that band was flawless and miraculously sounded better live than on record!


Assuming this needs to be optimized for massive scale, just hash the values to a single indexed field.

And use something other than an RDBMS. Put the hash in Redis and expire the key; your code simply does an existence check for the hash. You could probably handle gmail with a big enough cluster.

That super-normalized schema looks terrible.


I too recommend using Redis in 2002.


Fair enough, but I think the issue here is recognizing that the underlying business operation is a hit test on an expiring hash key. You could have used MySQL as a key/value store or looked for something more specialized.


“Hash it and store in BerkeleyDB” would have been a ridiculously easy solution at the time.


++ correct answer here


I think the 'terrible schema' thing is a secondary issue. The important take away for me was this:

> Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?


Maybe I’ll change my mind in 5 years, but I have a hard time believing engineers over the age of 35 get chased off. What actually seems to be the case is that the field skews young because it’s rapidly growing, and older engineers choose to retire early because they can afford to.

Which results in the same problems but the root cause is different.


Unfortunately it is the penultimate sentence in a sizable post.


<pushes glasses up nose> Actually, the Correct Answer is a bloom filter.

(And, yes, we had math in the early 2000s.)

Snark aside, I'm frustrated for the author. Her completely-reasonable schema wasn't "terrible" (even in archaic MySQL)—it just needed an index.

There's always more than one way to do something. It's a folly of the less experienced to think that there's only One Correct Way, and it discourages teammates when there's a threat of labeling a solution as "terrible."

Not to say there aren't infinite terrible approaches to any given problem: but the way you guide someone to detect why a given solution may not be optimal, and how you iterate to something better, is how you grow your team.


To be fair, the “correct way” to do databases at that time was to use third normal form. Putting indices on string columns would have been considered a hack, much like using MySQL was.


3NF is largely orthogonal to indices on string columns, though. If you need to look something up by string fast, there's no avoiding an index of strings somewhere.


Plenty of databases at the time had pretty lame string indexing because it simply wasn't something people relied on if they wanted performance, and memory sayeth mysql's were both not particularly efficient and had some nasty technical limitations.

On the mysql she was using, breaking things out so it only needed to index ints was almost certainly a much better idea. On anything I'd be deploying to today, I'd start by throwing a compound index at the thing with the expectation that'd probably get me to Good Enough.


When has ever been putting indices on string columns "a hack"? If you're looking for that string, which seems to have been the case here in particular, you need that index.


We had a test database that contained 1 record. Nobody paid much attention since the focus was on the problem, not the database.

The database included several newly developed "stored procedures".

Time elapsed... and it was nearing the time to ship the code. So we tried to populate the database. But we could not. It turned out that the stored procedures would only allow a single record in the database.

Since a portion of the "business logic" depended on the stored procedures... well, things got "delayed" for quite a while... and we ended up having a major re-design of the back end.

Fun times.


Why do so many of us “forget” to populate the database with representative quantities of data until after we’ve baked in all of our poor design decisions?



Oops - missed that one. I guess there was more appetite for discussion.


> The rub is that instead of just being slow, it also cost a fair amount of money because this crazy vendor system charged by the row or somesuch. So, by scanning the whole table, they touched all of those rows, and oh hey, massive amounts of money just set ablaze!

Why _the hell_ is nobody mentioning that using a database that charges per row touched is absolute insanity? When has it become so normal that nobody mentions it?


You'll probably enjoy reading this article that was on the front page yesterday: https://briananglin.me/posts/spending-5k-to-learn-how-databa...


Thank you. I cannot say I enjoyed it, I am distressed now. Using a “planet scale” “serverless” database to store just 750k download records per day, and paying per DB row touched, and “thank you for letting us off the hook this time”. This reads like satire.


Nice. I was thinking of an earlier post involving cloud storage buckets with no index.


If the amount charged isn't proportional to the work done, the cloud provider would quickly go out of business.


My VPS provider gives me unlimited traffic, unlimited disk reads, and unlimited DB rows touched for fixed 30 dollars a month, regardless of how much CPU and RAM I keep loaded.

This makes me think that these things are at least 100x cheaper than AWS might want to make me believe.


On a single instance. The more instances you use, the more you will be charged. This service handles automatic scalability and resilience with multiple instances for you, allowing it to charge less than $30.


Sorry, no. The original schema was correct, and the new one is a mistake.

The reason is that the new schema adds a great deal of needless complexity, requires the overhead of foreign keys, and makes it a hassle to change things later.

It's better to stick to the original design and add a unique index with key prefix compression, which all major databases do these days. This means that the leading values get compressed out and the resulting index will be no larger and no slower than the one with foreign keys.

If you include all of the keys in the index, then it will be a covering index and all queries will hit the index only, and not the heap table.
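
A rough sketch of that suggestion, assuming a hypothetical greylist table with the article's column names:

    -- single composite index covering the whole quad plus the timestamp
    CREATE INDEX idx_quad ON greylist (ip, helo, m_from, m_to, time);

    -- the retry check reads only indexed columns, so it never touches the heap
    SELECT time
    FROM greylist
    WHERE ip = ? AND helo = ? AND m_from = ? AND m_to = ?;

(A UNIQUE index on just the quad would additionally enforce one row per tuple, at the cost of no longer covering the time column.)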


One thing that is worth taking into consideration is that this happened in 2002, when the databases were not in the cloud, the ops was done by DBAs, and the key prefix compression that's omnipresent today was likely not that common, or potentially not even implemented/available.

But I don't think the point of the post is what's the right/wrong way of doing it. The point, as mentioned by a few here, is that programmers make mistakes. They are costly and will be costly if, in the tech industry, we continue to boot experienced engineers… the tacit knowledge those engineers have gained will not be passed on, and this means more people have to figure things out by themselves.


>One thing that worth taking into consideration is that this happened in 2002. When the databases were not in cloud, the ops was done by dba’s

Are you implying that isn't the case today? Thousands of (big) companies are still like that and will continue to be like that.

I write my own SQL, design tables and stuff, submit it for a review by someone 10x more qualified than myself, and at the end of the day I'll get a message back from a DBA saying "do this, it's better".


If you have one, more power to you. I'm currently making a good living as a freelance DBA/SRE for startups. With a team of 3-5 devs, some of which frontend and/or mobile devs, proper database knowledge is definitely thin on the ground in some places.


The question of it being in the cloud or not is highly orthogonal to schema design.


In 2002, you started seeing major increases in query run time with as little as 5 joins. Once you hit six you had to start thinking about rearchitecting your data or living with slow response times.

There was a lot of pressure to relax 3NF as being too academic and not practical.

Around then, I had a friend who was using a pattern of varchar primary keys so that queries that just needed the (unique) name and not the metadata could skip the join. We all acted like he was engaging in the Dark Arts.


> When the databases were not in cloud, the ops was done by dba’s and key prefix compression thats omnipresent today was likely not that common or potentially not even implemented/available.

To my understanding, for example Firebird/Interbase had automatic key prefix compression as far back as early 2000s at the very least. I don't believe you could even turn it off.


You don't really need compression. Rows only need to persist for about an hour. The table can't be more than a few MiB.

We can debate the Correct Implementation all day long. The fact of the matter is that adding any index to the original table, even the wrong index, would lead to a massive speedup. We can debate 2x or 5x speedups from compression or from choosing a different schema or a different index, but we get 10,000x from adding any index at all.


Just to make this fun.

Adding an index now increases the insert operation cost/time and adds additional storage.

If insert speed/volume is more important than reads keep the indexes away. Replicate and create an index on that copy.


> Adding an index now increases the insert operation cost/time

that insert is happening after you've checked the table to see if the record is present. so two operations whose times we care about are "select and accept email" and "select, tell the sender to come back in 20 minutes, and then insert".

the insert time effectively doesn't matter, unless you've decided to abandon discussion of the original table entirely without mentioning it.


I agree. In the "new" FK-based approach, you'll still need to scan indexes to match the IP and email addresses (now in their own tables) to find the FKs, then do one more scan to match the FK values. I would think this would be significantly slower than a single compound index scan, assuming index scans are O(log(n))

Key thing is to use EXPLAIN and benchmark whatever you do. Then the right path will reveal itself...
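
For example, in MySQL (hypothetical table and values; the output format varies by version):

    EXPLAIN SELECT time
    FROM greylist
    WHERE ip = '1.2.3.4' AND helo = 'foobar'
      AND m_from = 'a@example.com' AND m_to = 'b@example.com';

A plan showing type ALL with no key means a full table scan; ref or range against a named index is what you want to see.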


And even if this is somehow not an ideal use of the database, it's certainly far from "terrible", and to conclude that the programmer who designed it was "clueless" is insulting to that programmer. Even if that programmer was you, years ago (as in the blog post).

Moreover, unless you can prove with experimental data that the 3rd-normal-form version of the database performs significantly better or solves some other business problem, then I would argue that refactoring it is strictly worse.

There are good reasons not to use email addresses as primary or foreign keys, but those reasons are conceptual ("business logic") and not technical.


She doesn’t mention the write characteristics of the system but she implies that it was pretty write heavy.

In that case it's not obvious to me that putting a key prefix index on every column is the correct thing to do, because that will get toilsome very quickly under high write loads.

Given that she herself wrote the before and after systems 20 years ago and that the story was more about everyone having dumb mistakes when they are inexperienced perhaps we should assume the best about her second design?


It's not an index on every column. It's a single index across all columns.


Which will be updated every time there is a new combination.


Which is OK. It's not like the whole index has to be rebuilt; inserting a new combination is an O(log N) update at worst.


I think it depends a lot on the data. If those values like IP, From, To, etc keep repeating, you save a lot of space by normalizing it as she did.

But strictly from a performance aspect, I agree it's a wash if both were done correctly.


Space is cheap now tho. Better to duplicate some data and avoid a bunch of joins than to worry about saving a few gb of space.


It’s not that easy: you need to consider the total size and cardinality of the fields potentially being denormalized, too. If, say, the JOINed values fit in memory and, especially, if the raw value is much larger than the key it might be the case that you’re incurring a table scan to avoid something which stays in memory or allows the query to be satisfied from a modest sized index. I/O isn’t as cheap if you’re using a SAN or if you have many concurrent queries.


This system was written in 2002.


Using space to avoid joins will not necessarily improve performance in an RDBMS -- it might even make it worse.


The inverse of your statement is also true. Normalizing the database to avoid duplicating data will not necessarily improve performance in an RDBMS - it might even make it worse.


> all major databases do these days

Did everyone on HN miss that the database in question was whichever version of MySQL existed in 2002?


Which I was using in production at the time, and, yeah, what she ended up doing was a much better plan for that thing.


It actually doesn't matter. Even simple indexes would have made that schema work just fine, and MySQL could certainly index strings in 2002.


> Sorry, no. The original schema was correct, and the new one is a mistake.

Well, you also save some space by doing this (though presumably you only need to index the emails, as IPs are already at most 128 bits).

But other than that, I'm also not sure why the original schema was bad.

If you were to build individual indexes on all four columns, you would essentially build the four id tables implicitly.

You can calculate the intersection of hits on all four indexes that match your query to get your result. This is linear in the number of hits across all four indexes in the worst case, but if you are careful about which index you look at first, you will probably be a lot more efficient, e.g. (from, ip, to, helo).

Even with a multi index on the new schema, how do you search faster than this?
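
Roughly, with hypothetical table and index names:

    CREATE INDEX idx_ip   ON greylist (ip);
    CREATE INDEX idx_helo ON greylist (helo);
    CREATE INDEX idx_from ON greylist (m_from);
    CREATE INDEX idx_to   ON greylist (m_to);

    SELECT time
    FROM greylist
    WHERE ip = ? AND helo = ? AND m_from = ? AND m_to = ?;

Whether the planner actually intersects all four (index merge) or just picks the most selective one depends on the engine and version, which is one argument for a single composite index instead.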


Deleted


> compute is cheap

There was a whole thread yesterday about how a dude found out that it isn't: https://briananglin.me/posts/spending-5k-to-learn-how-databa... (also mentioned in the RbtB post)


Depends on the provider. I've checked the pricing of 4 top relational database cloud vendors (Google, Amazon, Azure, IBM) and they all charge for the number of CPU cores and RAM you buy, not by the number of rows scanned like PlanetScale did in the post you refer to. They'll be more expensive than buying your own server but not nearly what PlanetScale charges for full table scan queries.


When do the rest of the folks figure it out?


To a degree, this is true though. An engineer's salary is a huge expense for a startup, it's straight up cheaper to spend more on cloud and have the engineer work on the product itself. Once you're bigger, you can optimize, of course.


Deleted


If this is a case from real life that you have seen, you should definitely elaborate more on it. Otherwise, you are creating an extreme example that has no significance in the discussion, as I could create similarly ridiculous examples for the other side.

There are always costs and benefits to decisions. It seems that you are only looking at the costs and none of the benefits?


Deleted


If you're suggesting that we should double the size of the hardware rather than add a single index, then I respectfully disagree.

But your point is well-taken. Hardware is cheap.


> The observation was that we could probably store the IP address, HELO string, FROM address and TO address in a table, and send back a 4xx "temporary failure" error the first time we saw that particular tuple (or "quad"). A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.

I've run into this form of greylisting, it's quite annoying. My service sends one-time login links and authorization codes that expire in 15 minutes. If the email gets delayed, the user can just try again, right? Except I'm using AWS SES, so the next email may very well come from a different address and will get delayed again.


You could have your codes expire after an hour or day instead?


That's a solution, but with tradeoffs:

* security, an attacker has more time to intercept and use the links and codes.

* UX, making the user wait 15+ minutes to do certain actions is quite terrible.

I've had a couple support requests about this. The common theme seems to be the customer is using Mimecast, and the fix is to add my sender address in a whitelist somewhere in their Mimecast configuration.


BTW, SES now offers unique sending IPs.


I'd really love to be snarky here but I'll try to be polite: all those comments about the example situation are missing the whole point of the post. And it really worries me that there is a good chunk of the tech workers that just ignores the real meaning of something and just nitpicks about stupid implementation details. The post is about managing rookie errors, being empathetic, and also warning about the ageism that pervades the sector. TBH about this last point IDK the situation nowadays in Silicon Valley, but in Europe my limited experience is that ageism is not that bad; it's actually difficult to find seasoned developers.

Edit: typos


I don't buy it. If the only point was managing rookie errors, etc., the blog post shouldn't have been ~15 paragraphs of technical discussion and 2 paragraphs at the bottom of "my point is ..." You can't advocate that people just ignore 80% of the article because it's "not the point of the post."

I'm sure a good chunk of tech workers are worried that when an influential blogger writes something like this a couple hundred developers will Google 3NF and start opening PRs.


It's a narrative. A story. She told it in that way to lure people who get hooked on technical discussion, and then make the point that they probably would have missed or ignored. Without a narrative example, she could have just said "I made mistakes too, live and learn" but she chose to provide an example before presenting the thesis.

And hey, what's so bad about people learning about 3NF? Are you not supposed to know what that is until you're some mythical ninth level DBA?


Sure, I don't mind that being the hook, and controversial posts often create the most educational discussion. I only take exception to the idea that we shouldn't debate it because it's "not the point" (specifically for technical advice; being over-pedantic is a thing).

I certainly don't mean to criticize the author or say that they shouldn't have posted the article. If articles needed to be 100% correct then no-one has any business writing them. And it's still open to debate whether the article is even wrong or misleading.

> Are you not supposed to know what that is until you're some mythical ninth level DBA?

I think the world might be a better place if people _did_ wait until they were ninth level DBAs :p

Jokes aside, I have no problem with people learning 3NF. I'm more concerned that people can be too quick to take this sort of rule-of-thumb advice to heart (myself included). And in my opinion, database optimization is too complex to fit well into simple rules like "normalize when your query looks like this." My only 'rules' for high-frequency/critical tables and queries is "test, profile, load test, and take nothing for granted." But I know just enough to know that I don't know enough about SQL databases to intuit performance.


If she told it to lure people who get hooked on technical discussion, then it’s not fair for you to complain about said people analyzing said technical discussion.


But if she told it to show people who get hooked on technical discussion that this often makes them miss a more important point, then they certainly proved her right.


The problem with her point is that it doesn't go anywhere. Mentorship is obviously good and ageism is obviously bad, and she doesn't have specific suggestions on how to solve any of the problems, so the hook is the most substantial thing to talk about.


The reason "the details are important" here are not because of the nitty gritty around what mistakes a "novice" programmer made.

They are important because the present incarnation of the author is making all the wrong diagnoses about the problems with the original implementation, despite doing it with an air of "Yes, younger me was so naive and inexperienced, and present me is savvy and wise".


She explained the problem, the first (non-working) solution, and the second (working) solution as they really happened in 2002, as an example. The real point is the last part of the post. And it was not "how to properly implement an SQL-based filter for open proxies in your MTA".


I get all that. But the 2020 version of this person still does not understand the problem, and she is arguing that she does while oddly self-deprecating the inexperienced version of herself, who arguably had a better solution to begin with.


Yes, I think an alternate design was found that didn't hit as many MySQL limitations as the previous one. This improved performance and was a win. But the post-mortem diagnosis was lacking.


I think she inadvertently made a different point, which is that even experienced developers sometimes misunderstand the problem and make mistakes.

Or an even better argument: you don't need to actually understand the problem to fix it, often you accidentally fix the problem just by using a different approach.


I would argue instead that this comment thread is making the point that people forget that things that work now wouldn't've worked then and design decisions have to be made based on the database engine you're running in production.


MySQL could index strings in 2002. It would have worked fine.


"mysql could index strings" and "using a compound index over four varchar columns would've worked out well" are significantly different propositions.


Yes, we won't know because she didn't try it. And we know she didn't try it because the problem she described is table scans, not "indexed strings are somehow slow in MySQL circa 2002".


Or she did try it and it didn't work - or researched the question and figured, quite reasonably based on my experience of mysql in that era, it probably wouldn't work - and kept the post to only the original problem and the final approach to avoid having even more technical details in a post that wasn't really about those.

I agree that we don't know, but it seems a little unfair to her to treat every unknown as definitely being the most negative of the possibilities.


I really don't understand your general thrust here. MySQL certainly had lots of issues in 2003, but being able to support multi-column indexes was not one of them. Her analysis is simply wrong - it is wrong now and was wrong then. Here is the doc from MySQL v3.23:

7.4.2 Multiple-Column Indexes

MySQL can create composite indexes (that is, indexes on multiple columns). An index may consist of up to 16 columns. For certain data types, you can index a prefix of the column (see Section 7.4.1, “Column Indexes”). A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns. MySQL uses multiple-column indexes in such a way that queries are fast when you specify a known quantity for the first column of the index in a WHERE clause, even if you do not specify values for the other columns


As I said already: "mysql could index strings" and "using a compound index over four varchar columns would've worked out well" are significantly different propositions.

To be more verbose about it - there is an important difference between "can be created" and "will perform sufficiently well on whatever (likely scavenged) hardware was assigned to the internal IT system in question."

I wouldn't be surprised if the "server" for this system was something like a repurposed Pentium 233 desktop with a cheap spinning rust IDE drive in it, and depending on just how badly the spammers were kicking the shit out of the mail system in question that's going to be a fun time.


A composite index is literally just concatenating the fields and then indexing that value. This is not technology that was out of reach of MySQL in 2002 and there is no reason to presume it was so when TFA clearly described the problem as a complete lack of indexes.


They are not significantly different. It just means sorting using more than one column.


You know that for performance the new tables would've needed a string index as well?


A single-column string index is a different beast to implement compared to a multi-column varchar index.


I'm happy the HN comments are nitpicking. I'm not particularly well versed in database schemas, but while reading it I was going "???????", and it's good to know I'm not going crazy.


Yeah I was confused because the new design looked like an obvious anti-pattern to me, in contrast to adding an index to the original table. And then I was wondering whether I had a completely false understanding of what database normalization means (especially since this is the first time I’d heard of 3NF).


Just because the author was intending to get across a certain point, doesn't mean the implementation details aren't worth discussing too. I don't think many people here would disagree with the central point, so what's there to discuss about it? I think it's uncharitable to assume that everyone talking about the implementation details is "missing the point".


There's probably also some sense of people trying to paint the lesson of "sometimes rookie errors aren't actually errors, and your seniors will tell you you're doing something wrong, when in actuality their solution is worse than what you came up with."

Which I guess is saying the article's setup story makes the exact opposite point that it's trying to?


As a 48 year old who lives in San Diego (so not Silicon Valley but geographically much closer than Europe) and still works as a programmer and who still enjoys coding and continually fights any attempt to put me into any sort of management role...

I think ageism is a valid concern, but it's also not a black and white issue. While I know quite a few older folks like me who are still constantly keeping up to date, I've also known some people my age or just a little bit older who basically stagnated into irrelevance as experts in now-discarded technology and never really moved on, so the line between what is ageism and what is people just not bringing value to the table anymore can be a bit blurry.

I'm by no means trying to suggest ageism isn't an issue in tech. I do see echoes of it when I look around, and if I suddenly lost the network of people I work with, have worked with in the past, etc., who understand I'm still very technically 'flexible' for an older person, I'm pretty sure it would be difficult to find a job, in the practical sense of often getting pre-filtered out just based on the age thing. So my heart really goes out to people who do find themselves in situations where they are capable but can't find work due primarily to age.

But, uh, the issue is somewhat complicated and very case by case, I think.


The real meaning being that people learn and gain experience over time? Is this really something we need to read two pages of text to find out? I think people are justifiably miffed at having read two pages for such a trivial message. Not only that, but these "stupid implementation details" could seriously mislead someone who really is a clueless newbie.


I think you are leaping to conclusions here. It is possible that most people want to keep HN focused on technical conversations and use this as an opportunity to learn something (I certainly do), which is why you are seeing many more comments on the technical aspects.


In this blog post, senior engineer Rachel talks nonsense about normalization and promotes a bafflingly complicated solution as superior to a simple one, without identifying the actual fix, which appears to have happened accidentally.

In other words, with enough empathy and patience, a clueless rookie can grow into a clueless senior engineer!

Rachel usually makes more sense than that. That's why people are nitpicking implementation details.


Hah, pretty accurate and kinda funny, but could be nicer. I still make mistakes.


Meanwhile, Rachel posted an update. Apparently, the whole section about how "the system got reworked" doesn't describe the proper solution, but her own (Rookie Rachel's) attempt to fix it.

And in that context, everything makes sense.


Heh, at the end I think I may have been called out for my harsh response here on HN to the post being referenced at the start, since it seemed to raise some hackles. Borrowing from other social media terminology, I've been "subtweeted", but by a blog post; not sure how I feel about that.

Honestly it almost makes me feel compelled to start writing more for the industry but to be honest, my opinion is most ideas are not great, or not novel, up to and including my own. So writing about them seems likely not to be beneficial. I guess you could argue I could let the industry be the judge of that though.


I'm probably going to get smacked a bit for this, but a lot of the comments here focus on a better fix for the 20-year-old problem the author relates only in order to illustrate a point, and that's a very HN kind of response. And I get it.

Years ago someone asked one of my grandkids what I did for a living and he told them "He's a fix it guy". That's because when he'd come over I'd be working on something, and he'd ask what I was doing, and I'd tell him "I'm fixing" this or that.

There are a lot of "fix it" people here. It's what we do.


   id | ip_address | from | to | content | updated_at | created_at
And indexes where you need them. Normalize data like IP addresses, from and to. And you are good to go.
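
A sketch of what that might look like in MySQL-flavored DDL (types and the particular index are assumptions):

    CREATE TABLE messages (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ip_address VARBINARY(16)   NOT NULL,  -- raw IPv4/IPv6 bytes rather than text
        from_id    INT UNSIGNED    NOT NULL,  -- FK into a normalized addresses table
        to_id      INT UNSIGNED    NOT NULL,
        content    TEXT,
        updated_at DATETIME        NOT NULL,
        created_at DATETIME        NOT NULL,
        KEY idx_ip_created (ip_address, created_at)
    );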


One of the most useful lessons I ever learned about designing database schemas was the utter uselessness of indexing and searching on datetime fields. Since virtually every value is going to be different, indexing and searching on that field is (almost) no better than having no index on the datetime field at all.

It was a revelation to me when I decided to experiment with having an indexed date-ONLY field and using that to search instead. It improved performance by almost two orders of magnitude. In hindsight, this should have been totally obvious if I had stopped to think about how indexing works. But, as is stated in this article, we have to keep relearning the same lessons. Maybe I should write an article about how useless it is to index a datetime field...

EDIT: This was something I ran into about 10 years ago. It is possible there was something else going on at the time that I didn't know about that caused the issue I was seeing. This is an anecdote from my past self. I have not had to use this technique since then, and we were using an on-prem server. It's possible that the rest of the table was not designed well and index I was trying to use was already inefficient for the resources that we had at the time.


This is a very strange, and incorrect, conclusion you have come to. It doesn't matter that datetime fields are unique, as most of the time you are not searching for a particular datetime, but a range. I.e. "show me all of the rows created between date X and Y". In that case, the ordering of the index makes it efficient to do that query.

Furthermore, oftentimes your query will have an "ORDER BY dateTimeCol" clause on it (or, more commonly, ORDER BY dateTimeCol DESC). If your index is created correctly, it means the database can return rows with a quick index scan instead of needing to implement a sort as part of the execution plan.
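
For instance, a sketch against a hypothetical orders table:

    CREATE INDEX idx_created_at ON orders (created_at);

    -- range scan over the index, no full table scan needed
    SELECT id, created_at
    FROM orders
    WHERE created_at >= '2021-11-01' AND created_at < '2021-11-08'
    ORDER BY created_at DESC;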


This entirely depends on what kind of index is used. A sorted index, such as a B or B+ tree (used in many SQL databases), will allow for fast point/range lookups in a continuous value space. A typical inverted index or hash based index only allows point lookups of specific values in a discrete value space.


I was just going to make this comment.

Postgres as far as I know uses B-tree by default.

You can switch sort order I think for this as well, so "most recent" becomes more efficient.

Multi-column indexes also work; if you are just searching on the first column, Postgres can still use the multi-column index.


It was a sorted BTREE index in MySQL 5.x. I agree that it's supposed to be fast, but it just wasn't for some reason.


Are you sure it was actually using the index you expected? There are subtleties in index field order that can prevent the query optimizer from using an index that you might think it should be using.

One common misstep is having a table with columns like (id, user, date), with an index on (user, id) and on (date, id), then issuing a query like "... WHERE user = 1 and date > ...". There is no optimal index for that query, so the optimizer will have to guess which one is better, or try to intersect them. In this example, it might use only the (user, id) index, and scan the date for all records with that user. A better index for this query would be (user, date, id).
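
A sketch of that misstep and the fix, reusing the column names above (the events table name is hypothetical):

    -- existing indexes
    CREATE INDEX idx_user_id ON events (user, id);
    CREATE INDEX idx_date_id ON events (date, id);

    -- the query: equality on user, range on date
    SELECT id FROM events WHERE user = 1 AND date > '2021-01-01';

    -- better fit: equality column first, then the range column
    CREATE INDEX idx_user_date_id ON events (user, date, id);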


One of the more bizarre things we'd found in MySQL 5.something was that accidentally creating two identical indexes significantly slowed down queries that used it.

I wouldn't be surprised if you hit some sort of similar strange bug.


> Since virtually every value is going to be different, indexing and searching on that field is (almost) no better than having no index on the datetime field at all.

Do you mean exact equality is rarely what you want because most of the values are different?

Or are you talking about the negative effect on index size of having so many distinct values?

I think the latter point could be quite database-dependent, eg. BTree de-duplication support was only added in Postgres 13. However, you could shave off quite a bit just from the fact that storing a date requires less space in the index than a datetime.


B-Tree indexes perform just fine on datetime fields. Were you using hash or inverted indexes maybe?


You're maybe thinking of compression? The primary id is unique too, but an indexed search is still O(log n) compared to a full table scan (O(n)).


There are more index types than just hashes (which do indeed only support equality).

E.g. B-trees allow greater-than/less-than comparisons.

postgres supports these and many more: https://www.postgresql.org/docs/9.5/indexes-types.html


I was using a BTREE index and we were doing greater/less than queries. It was not a hash index.


Did you also learn that certain indexes allow for range queries, and modern databases are quite efficient at those?

And yes, please write an article, it would be quite interesting. With data and scripts to reproduce, of course.


Your result is surprising: I suspect your wide table with an index wasn't in cache, and your timestamp-only table was.

The whole point is that indexes (can) prevent full table scans, which are expensive. This is true even if the column(s) you're indexing have high cardinality: but it relies on performant table or index joins (which, if your rdbms is configured with sufficient memory, should be the case).


I have no idea how you came to this conclusion, but indices on datetime fields are completely required to bring down the seek time from O(n) to O(log(N)), plus the length of the range. Massive parts of the businesses I've worked at would be simply impossible without this.

The cardinality of the index has other ramifications that aren't generally important here.


If you were doing something like this then your story makes sense

`WHERE start_date > x AND end_date < x`

You can't index two ranges at once in a B-tree!


people in IT are loners, lone wolves, lone sharks, sometimes alone and also self-centered. for the reasons in the article, and then for so many other reasons. for the reason that online education favors the 'I'm a top learner on my own' principle. etc.

the under-35-thing is just another reason. another is the ever-so-new technologies that obliterate all previous knowledge of obvious stuff. another is the fact that 'history of computing' is not something that you need to learn to start doing shiny webpages.

then there's the reason that so many ppl bash at universities, at in-person schooling, at work-together places (a.k.a offices).

and because someone is going to ask me what I did to change it - here's what: for almost 20 years now i've been teaching introductory or intermediate classes in-person to hundreds of students, and observing what helps them learn as fast as possible. trying to speed it up, to assist the process. i can tell you one thing - working together and understanding that you can always learn from someone that is next to your shoulder is of paramount importance.

after all - what good is gender or race diversity in the workplace, if you effectively do not know how to work alongside these people...? but this is too long for just a post here.

the ability, the skill, to be able to work with someone, with everyone, is what the vast majority of IT top-coders lack. and lack badly and painfully (for those around mostly). look around yourself for examples...

...

the day before, I was thinking that Larry Wall's TIMTOWTDI principle is really about celebrating diversity in IT, about different approaches to the same problem, and not about celebrating diversity in Perl or any other language.


Oh well played!! I was setting up to give the author a hard time about being judgemental, particularly because storing IPs or UUIDs as strings is a mistake I've seen some pretty darn good devs make. Some folks just aren't super strong on schema design and performance but are great at other things.

Plus MySQL kind of rocks. Fight me. There are some interesting optimizations that Percona has written about that may improve the performance of OPs schema. The learning never ends.

TBH I haven't seen a shortage of mentors. In fact most of our dev team is 35+ and are very approachable and collegial and our internal mentoring programs have been extremely successful - literally creating world experts in their specific field. I don't think we're unique in that respect.

I think it's rather unfortunate that the top comment here is a spoiler. Actually reading the post and going on the emotional rollercoaster that the author intended to take you on is the point. Not the punch line.


> Plus MySQL kind of rocks.

Agreed. I prefer Postgres for personal projects, but MySQL is a fine database. Honestly, even a relational DB I wouldn't want to use again (DB2...) is still pretty solid to me. The relational model is pretty damn neat and SQL is a pretty solid query language.

I wonder how many people disagree with that last part in particular...


It’s actually the core of all dev that involves any data storage, today. HN is both aspirational in tech and has a younger demographic so you won’t get any points here for mentioning anything SQL. Hence the downvotes I’m getting. But hey, it’s all good!!


There is a time and a place for a denormalized schema. If you are trying to search a large data-set a denormalized schema in a single table with some good indexing is much preferable to joining across 30 tables.

Whether you choose normalized or denormalized depends very much on a) what you need to do now b) what you may need to do in the future. Both are considerations.


> The first time you encounter something, you're probably going to make some mistakes. There's a post going around tonight about how someone forgot to put an index on some database thing and wound up doing full table scans (or something like that).

Which post is referenced here?

I also believe this to be a tooling issue. It's often opaque what ends up being run after I've done something in some framework (Java JPA, Django queries, whatever): how many queries (are N+1 issues at bay?), how the queries will behave, etc. Locally with little data everything is fine, until it blows up in production. But you may not even notice it blowing up in production, because that relies on someone having instrumented the DB and pushing logs+alarms somewhere. So it's easy to remain clueless.



I think there's an easy "best of both worlds" take here:

1. The majority of the performance problem could've and probably should've been summarized as "you need to use an index". (Maybe there were MySQL limitations that got in the way of indexing back then? But these days there aren't.)

2. Everyone makes mistakes! New programmers make mistakes like not knowing about indexes. Experienced programmers make mistakes like knowing a ton about everything and then teaching things in a weird order. All of these things are ok and normal and part of growing, and it's important that we treat each other kindly in the meantime.


Based on my own war stories, if it was an early '00s mysql then I expect "just throw an index at the varchar columns" would've helped but not nearly as much as what she actually did.


You don't need to store an IP in a different table as ID/IP; an IP can be represented as a 128-bit integer. A small "mistake" I've made before that makes some things a little more annoying - not that it's particularly important.
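
One way to do that in MySQL (5.6+ for the INET6_* helpers), sketched with hypothetical names:

    CREATE TABLE seen_ips (
        ip VARBINARY(16) NOT NULL PRIMARY KEY   -- raw bytes, covers both IPv4 and IPv6
    );

    -- convert between text and binary form at the edges
    INSERT INTO seen_ips (ip) VALUES (INET6_ATON('2001:db8::1'));
    SELECT INET6_NTOA(ip) FROM seen_ips;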


A story of how a junior dev becomes a little less junior.

The original implementation was the 'obvious' one, and not terrible.

Then experience of the live system + MySQL demonstrated problems with that.

So a non-obvious but better-performing solution was implemented instead.

Great! Junior dev walks away with a mistrust of both long repeated strings appearing in a DB table, AND a mistrust of database internals. A fine lesson indeed.

(And remember, this was the era of the all-powerful DBA who could ruthlessly de-normalise your schema at a moments notice to improve performance).


The initial design was fine, and btw it did not violate 3NF, actually (if you add an auto-increment PK). A single composite index would've most likely solved all the problems.


The popular advice in these comments to use indexes sounds really good. I haven't thought it out fully, but my first thought was "this sounds like a DIY columnar store". A columnar database does a lot of this work for you. I'm not 100% it's a good option in this case, because things like email subjects are not good columnar data. I would like to know more about how problematic that is.


While the article is not really about the database schema, could you do a table with just two fields?

1) hash of concat(ip & helo & m_from & m_to)

2) time
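
Roughly, as a sketch (names and hash choice are assumptions; the hash could just as well be computed in the MTA):

    CREATE TABLE greylist_hash (
        quad_hash  BINARY(20) NOT NULL PRIMARY KEY,  -- SHA-1 of the concatenated tuple
        first_seen DATETIME   NOT NULL
    );

    -- insert on first sight
    INSERT INTO greylist_hash (quad_hash, first_seen)
    VALUES (UNHEX(SHA1(CONCAT_WS('|', ?, ?, ?, ?))), NOW());

    -- on retry, allow the mail through if enough time has passed
    SELECT first_seen FROM greylist_hash
    WHERE quad_hash = UNHEX(SHA1(CONCAT_WS('|', ?, ?, ?, ?)));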


Guess this was one of the first implementations of greylisting :)

This was not an issue of being in a vacuum; it was an issue of not reading documentation beforehand, which for some reason is acceptable in the development world.

I don't think there are so many people attempting to perform surgery or to build a bridge without reading a number of textbooks on the topic.

And this is why back in the day everyone said RTFM


I've more or less thrown in the towel on "not being clueless anymore".

Sure, I'll try to study to keep myself on my feet, but there is soo much to learn that I'd rather focus on the technologies needed right _now_ to get things done.

Way more often than not, "thing A has to be done with tech B", and the expectation of getting it done without multiple footguns is gone.


What I got from this post is a headache after reviewing the Wikipedia entries on 3NF (and all of its dependencies: 1NF, 2NF, also 3.5NF). Man, the language is dense.

j/k, always nice to review some of this "basic" stuff (that you barely use in everyday work but it's good to have it in the back of your head when changing something in the db schema)


I would add a counterpoint to foreign key constraints: they make deletions and updates a lot more costly [1].

[1] https://shusson.info/post/postgres-experiments-deleting-tabl...


I was preparing to rant about the tone that the post was taking towards the "clueless programmer" until I got to the end. Well done...

Still not sure why adding an index didn't work instead of breaking out into multiple tables indexed by IDs.

Besides, this use case screams for "just store the hashes and call it a day".


Not terribly convinced by arguments about normalization for normalization's sake. As with most things, the definitive answer is "it depends". It's not the 1990s anymore. https://youtu.be/8Fb5Qgpr03g


> id | ip | helo | m_from | m_to | time | ...

I'm not a database person, but this doesn't seem too bad if there was at least an index on IP. The performance there would probably be good enough for a long time depending on the usage.

But is 6 tables (indexed on the value) really better than 1 table with 6 indexes?


It's quite hard to create something that works optimally on a first try. I think I never managed to do so! The first schema was slow, but hey, it worked, good job! That is a huge first step. Modifying it so it runs much faster can definitely come next.


> this crazy vendor system charged by the row or somesuch

I don't really understand the text, but from what I could gather this is far worse than a forgotten index.

You should set them in most cases, but it doesn't really reduce lookup speed if you don't also add foreign keys.


Err. Wat. The original schema was (mostly) fine. It had no indexes. The second schema looks like it tried to work around lack of database optimization features. It's in no way "better" from a data design standpoint.

A database with good string index support isn't doing string comparisons to find selection candidates - at least, not initially.

What a bizarrely confident article.


> lack of database optimization features

That's a perfect description of MySQL as of 2002.


MySQL, at least as far back as 2000, had indexes.

http://web.archive.org/web/20020610031610/http://www.mysql.c...


I wouldn't bet on those indexes optimizing anything back then. MySQL was legendary for implementing the necessary standards (to some degree) while being neither reliable nor efficient.


Indexes in MySQL worked fine in 2002, at least for the use case described in this article.


That's pretty beside the point though, the concept of indexing had existed for decades even prior to that.


The quote was "database optimization features" and the scope was "MySQL as of 2002".

Of course, even my old DBase II handbook talks about indexes - all that's old hat. MySQL had them, too.

MySQL also used to have a long-earned reputation as a toy database though, and MySQL in 2002 was right within the timeframe where it established that reputation. So yeah, you could add indexes, but did they speed things up (as in them being an "optimization feature")? Public opinion was rather torn on that.


I was there and I don't remember it like you describe. MySQL was certainly not a toy database in 2002. I learned PHP from a book called "PHP and MySQL". They went together like white on rice. Every forum software used it. Hell, even today Uber uses MySQL at a VERY large scale to power their schemaless database.


I was there, too, and I wasn't alone in my assessment:

https://www.databasejournal.com/features/mysql/article.php/2... (to pick a contemporary example using that exact word) suggests that the 2003 release of mysql4 ends the "toy status". Others might have had different thresholds, but that reputation was hard earned and back then still pretty apropos (to a declining degree over time because MySQL caught up). I lost data to MySQL more than once (if memory serves but it's been a while: MyISAM was a mess, the InnoDB integration not ready yet)

To handwave two extreme positions of those days (because I'm too lazy to look up ancient forums where such stuff has been discussed at length), it's very possible that different groups approached things differently:

Members of a PHP User Group probably didn't consider MySQL a toy but the best thing since sliced bread, given how closely aligned PHP was to MySQL back then (it got more varied over time, but that preference can still be felt in PHP projects)

OTOH when you had a significant amount of perl coders in your peer group, it was rather likely that some of them could go on for days about how _both_ PHP and MySQL are toys when anybody dared to talk about PHP (which ties in neatly with the original post and its reference about "The One")

[bonus: more recent example: http://www.backwardcompatible.net/145-Why-is-MySQL-still-a-t...]


> The whole system was reworked to have a table for each set of values. There was now a table of IP addresses, a table of HELO strings, a table of FROM addresses, and a table of TO addresses.

Lisp interning, in database tables. A.k.a. the Flyweight Pattern.


to me the first implementation seems the best, it just needed a multi-column index

the 3NF "optimized" form seems worse and unreadable. I would advise normalizing only when it's needed (complex/evolving model); in this case it was not needed


The "NoSQL" movement -- which I wisely rejected -- is also responsible for this.


All "movements" have some merit, or they wouldn't exist in the first place. Nobody starts a "movement" just because. It exists because it solves something. "NoSQL" was a solution to something. Today you can benefit from best of all worlds, and still NoSQL has a place and a use case, just like anything else that exists.


I will play devil's advocate

>and still NoSQL has a place and a use case, just like anything else that exists.

What actually makes you believe that it's NoSQL that has "some use cases" and relational databases that are the "default", rather than NoSQL/non-relational being the default?


I don't think there is (or should be) a "default". Default selections suggest they are picked "just because". There is a problem/use-case and a tool to solve it. Sometimes it's relational, sometimes it's nosql, sometimes it's both or even none...


There are problems that nosql databases solve easily that require a lot of fancy tricks in a RDBMS.


Huh? I strongly prefer relational databases for most things but this particular example is actually begging for a simple key-value store.


The real takeaway here is that DB performance optimization (imo performance in general) is freaking hard, and even people who'd successfully optimized something like OP often totally misunderstand why it actually worked. Pay for expert advice.


I once ended up doing 'rm -rf *' by mistake - https://ashishb.net/tech/my-rm-rf-moment/


I wonder how many people here are throwing stones while living in glass houses.


I’m glad people are critiquing the technical parts of the post because it had me questioning everything I know about databases, which really isn’t a whole lot.


I've got a few war stories myself, I've seen things you people wouldn't believe.

In this case it's important to recognize it was 2002, early days. This story just doesn't read the same in today's context.


If there are millions of entries coming in per hour, I can see the value.

Otherwise, how long is their data retention? Wouldn’t there be more value in cleaning out entries older than, say, two hours?


The normalized solution sounds like a performance nightmare?! Stick the original data into something like Clickhouse? (After checking whether indexing would fix it for a few years more)


What is worse is Prisma obscuring actual SQL queries with a query language that puts usability first and allows you to quickly build prototypes that just don't scale for n=1000.


The one thing I would add here is that the dev lead (or equivalent - there always is someone) should have added at least one test. Not a unit test or a 99% coverage test but a "does it do what we want" test - a golden path test.

I came across something like this with an apprentice recently. I let the lack of a repeatable test pass because of speed / time / crunch etc., and the obvious "how can that possibly fail" code of course failed. The most basic test would have caught it, in the same way some basic performance testing would have helped Rachel however many years back.

It's hard this stuff :-)


I did this (bad schema, too much scanning) too.

I also ran newfs on a running Oracle database. (That did vastly less damage than you would have thought, actually.)


Why not put a hash index on the four columns? By normalising, you get an index, but maybe a B-tree, which isn't best here for exact matching?


The responses in this thread are classic pedantic HN with the actual point whoooshing over so many people's head.


I think it actually makes quite a bit of a difference if 20 years later the author still does not understand the actual problem and solution she is using as an example. It actually does undermine her argument that people make mistakes, learn from them and grow etc if the lack of growth is still in evidence 20 years later in the singular example under consideration.


No I'd say the responses in this thread are classic pedantic HN with users like you taking the high moral ground and looking down on people who actually are trying to address the technical issues with a post in a technical forum.


We've banned this account for breaking the site guidelines. Creating accounts to do that with will eventually get your main account banned as well, so please don't!

https://news.ycombinator.com/newsguidelines.html


Proud of myself for recognizing the conceit (pun intended) here pretty quickly. Good point in the end too!


I’m so grateful for rachelByTheBay.


> You know what it should have been? It should have been normalized.

No, it should have had a unique index on the first five columns.

(Oooh, I just hate it so much when someone gets up on their high horse and opens a blog post with a long condescending preamble about how some blogger got it wrong, and then they get it even wronger.)


Or store the ip address as int64 with index, or better both solutions together


orr.. could have switched to https://man.openbsd.org/spamd greylisting a few months later in 2003


I think this post showed us what an engineer on Mt. Stupid looks like. Apparently this design is wrong. We should rely on the indexing ability of the database itself, instead of inventing our own indexing system.


Sorry Rachel but the best solution was a simple fix as mentioned by most people here - columns having the right types and indexes.

When I was still wet behind the ears I was doing what you are now proposing - but it's overkill.


I’m pretty sure all the people who are criticizing the database design haven’t read the ending. The article isn’t about the schema, it’s about helping those entry-level programmers.


It would be more effective at making that point if it didn’t have the confused and arguably incorrect section in the middle. If as a writer, the audience misses your point, maybe it’s a problem with your writing rather than the audience.


The problem is that the technical narrative in the post contributes to imposter syndrome and gaslights entry level programmers into thinking they’re doing database design wrong.


You don’t think it’s helpful for people to steer entry level programmers away from bad ideas like this one?


> what about you?

Will try to keep making money coding


short answer: composite indexes


They could have just made a hash of the 4 values and indexed on it...
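
A rough sketch of that idea in modern MySQL (generated columns didn't exist back then, and the `checks` table and column names are assumptions):

    -- store a fixed-width hash of the quad and index that instead of four
    -- long strings; the 0x1F separator avoids ambiguous concatenations
    ALTER TABLE checks
      ADD COLUMN quad_hash BINARY(20)
        AS (UNHEX(SHA1(CONCAT_WS(0x1F, ip, helo, m_from, m_to)))) STORED,
      ADD INDEX idx_quad_hash (quad_hash);

    SELECT id FROM checks
    WHERE quad_hash = UNHEX(SHA1(CONCAT_WS(0x1F, ?, ?, ?, ?)));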


Wow what a twist. Somebody call M Night Shyamalan!


This post is bizarre, precisely because there is nothing particularly wrong about the original schema, and the author seems to believe that the problem is that the column values were stored as strings, or that the schema wasn't in "third normal form".

Which is nonsense. The problem with the original DB design is that the appropriate columns weren't indexed. I don't know enough about the problem space to really know if a more normalized DB structure would be warranted, but from her description I would say it wasn't.


Please make your substantive points without calling names. This is in the site guidelines:

https://news.ycombinator.com/newsguidelines.html


Based on this piece of old documentation I found [0] for MySQL 3.23 (the most recent version in 2002 as far as I can tell), certain types of indices were only available on certain types of engines. Furthermore, columns were restricted to 255 characters, which may be too short for some of the fields saved in the database.

Modern databases abstract away a lot of database complexity for things like indices. It's true that these days you'd just add an index on the text column and go with it. Depending on your index type and data, the end result might be that the database turns the table into third normal form by creating separate lookup tables for strings, but hides it from the user. It could also create a smarter index that's less wasteful, but the end result is not so dissimilar. Manually doing these kinds of optimisations these days is usually a waste of effort or can even cause performance issues (e.g. that post on the front page yesterday about someone forgetting to add an index because mysql added them automatically).

All that doesn't mean it was probably a terrible design back when it was written. We're talking database tech of two decades ago, when XP had just come out and was considered a memory hog because it required 128MB of RAM to work well.

[0]: http://download.nust.na/pub6/mysql/doc/refman/4.1/en/create-...


The fact remains that whether the text column existed on her original table, or whether it was pulled out to a normalized table, literally all of the same constraints would apply (e.g. max char length, any other underlying limitations of indexing).

The issue is that her analysis of what was wrong with her original table is completely wrong, and it's very weird given that the tone of her "present" self is that it's so much more experienced and wise than her "inexperienced, naive" self.

My point is that she should give her inexperienced self a break; all that was missing from her original implementation were some indexes.


Multiple-long-column compound indices sucked in old mysqls if you could even convince it to use them in the first place.

Being able to look up each id via a single-string unique index would've almost certainly worked much better in those days.


I used MySql a lot back then, had many multi-column indexes, and never had an issue.

More importantly, given the degree of uniqueness likely to be present in many of those columns (like the email addresses), she could have gotten away with not indexing on every column.


Your first sentence was, I'm afraid, very much not a universal experience.

Your second is quite possibly true, but would have required experimentation to be sure of, and at some point "doing the thing that might be overkill but will definitely work" becomes a more effective use of developer time, especially when you have a groaning production system with live users involved.


Based on: https://web.mit.edu/databases/mysql/mysql-3.23.6-alpha/Docs/...

I'd say there's very little performance gain in normalizing (it usually goes the other way anyway: normalize for good design, avoiding storing multiple copies of individual columns; de-normalize for performance).

I'm a little surprised by the tone of the article - sure, there were universities that taught computer science without a database course - but it's not like there weren't practical books on DB design in the 90s and onward?

I guess it's meant as a critique of the mentioned, but not linked other article "being discussed in the usual places".


Knowing only what we can guess about the actual problem, I'd say indexing ip ought to be enough to make everything super fast.


> All that doesn't mean it was probably a terrible design back when it was written. We're talking database tech of two decades ago, when XP had just come out and was considered a memory hog because it required 128MB of RAM to work well.

A lot of this stuff was invented in the 70s, and was quite sophisticated by 2000. It just wasn't free, rather quite expensive. MySQL was pretty braindead at the time, and my recollection is that even postgres was not that hot either. We're very lucky they've come so far.


I think if anything, all of it could've been put into a single indexed column since the query was AND ... AND ... not OR.

So you could've had a indexed column of "fingerprint", like

ip1_blahblah_evil@spammer.somewhere_victim1@our.domain

And indexed this, with only a single WHERE in the query.

I don't understand at all how multiple tables thing would help compared to indices, and the whole post seemed kind of crazy to me for that reason. In fact if I had to guess multiple tables would've performed worse.

That is if I'm understanding the problem correctly at all.


This has been the advice given to me by Postgres experts in a similar scenario:

  "If you want to efficiently fuzzy-search through a combination of firstname + lastname + (etc), it's faster to make a generated column which concatenates them and index the generated column and do a text search on that."
(Doesn't have to be fuzzy-searching, but just a search in general, as there's a single column to scan per row rather than multiple)

But also yeah I think just a compound UNIQUE constraint on the original columns would have worked

I'm pretty sure that the degree of normalization given in the end also goes well beyond 3rd Normal Form.

I think that is 5th Normal Form/6th Normal Form or so, almost as extreme as you can get:

https://en.wikipedia.org/wiki/Database_normalization#Satisfy...

The way I remember 3rd Normal Form is "Every table can stand alone as its own coherent entity/has no cross-cutting concerns".

So if you have a "product" record, then your table might have "product.name", "product.price", "product.description", etc.

The end schema shown could be described as a "Star Schema" too, I believe (see image on right):

https://en.wikipedia.org/wiki/Star_schema#Example

This Rachel person is also much smarter than I am, and you can make just about anything work, so we're all bikeshedding anyways!


>I'm pretty sure that the degree of normalization given in the end goes also well beyond 3rd-Normal-Form.

> I think that is 5th Normal Form/6th Normal Form or so, almost as extreme as you can get

The original schema is also in 5NF, unless there are constraints we aren't privy to.

5NF means you can't decompose a table into smaller tables without loss of information, unless each smaller table has a unique key in common with the original table (in formal terms, no non-trivial join dependencies except for those implied by the candidate key(s)).

The original table appears to meet this constraint: you could break it down into, e.g., four tables, one mapping the quad id to the IP address, one mapping it to the HELO string, etc. However, each of these would share a unique constraint with the original table, the quad id; hence the original table is in 5NF.

As for 6NF, I don't think the revised schema meets that: 6NF means you can't losslessly decompose the table at all. In the revised schema, the four tables mapping IP id to IP address etc. are in 6NF, but the table mapping quad id to a unique combination of IP id, HELO id, etc. is not: it could be decomposed into four tables similarly to how the original table could be.

(Interestingly, if the original table dropped the quad ID column and just relied on a composite primary key of IP address, HELO string, FROM address and TO address, it would be in 6NF.)


Would it not prevent some optimizations based on statistics regarding the data distribution, like using the most selective attribute to narrow down the rows that need to be scanned? I'm assuming there are 2 indexes, 1 for each column that gets combined.

Let's say you know the lastname (Smith) but only the first letter of the firstname (A) - in the proposed scenario only the first letter of the firstname helps you narrow down the search to records starting with the letter (WHERE combined LIKE "A%SMITH"), you will have to check all the rows where firstname starts with "A" even if their lastname is not Smith. If there are two separate indexed columns the WHERE clause will look like this:

WHERE firstname like "A%" AND lastname = "Smith"

so the search can be restricted to smaller number of rows.

Of course having 2 indexes will have its costs like increased storage and slower writes.

Overall the blog post conveys a useful message but the table example is a bit confusing, it doesn't look like there are any functional dependencies in the original table so pointing to normalization as the solution sounds a bit odd.

Given that the query from the post is always about concrete values (as opposed to ranges) it sounds like the right way to index the table is to use hash index (which might not have been available back in 2002 in MySql).
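
In Postgres terms, a sketch of that (hash indexes there cover a single column and only equality lookups, so you would hash the most selective column or an expression over the quad; names are assumed):

    -- hash index on one equality-matched column:
    CREATE INDEX idx_checks_ip_hash ON checks USING hash (ip);

    -- or a hash expression index over a canonical concatenation of the quad:
    CREATE INDEX idx_checks_quad_hash ON checks
      USING hash ((ip || chr(31) || helo || chr(31) || m_from || chr(31) || m_to));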


I won't pretend to be an expert in this realm, but see here:

https://www.postgresql.org/docs/current/textsearch-tables.ht...

Specifically, the part starting at the below paragraph, the explanation for which continues to the bottom of the page:

  "Another approach is to create a separate tsvector column to hold the output of to_tsvector. To keep this column automatically up to date with its source data, use a stored generated column. This example is a concatenation of title and body, using coalesce to ensure that one field will still be indexed when the other is NULL:"
I have been told by PG wizards this same generated, concatenated single-column approach with an index on each individual column, plus the concatenated column, is the most effective way for things like ILIKE search as well.

But I couldn't explain to you why
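
Condensed from the linked docs, the generated tsvector column plus GIN index looks roughly like this (pgweb, title and body are the names used in the docs' own example):

    ALTER TABLE pgweb
      ADD COLUMN textsearchable_index_col tsvector
        GENERATED ALWAYS AS (to_tsvector('english',
          coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;

    CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);

    SELECT title FROM pgweb
    WHERE textsearchable_index_col @@ to_tsquery('create & table');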


Oh, so by fuzzy search you mean full text search, not just a simple 'LIKE A%' pattern - it's an entirely different kind of flying altogether (https://www.youtube.com/watch?v=3qNtyfZP8bE).

I don't know what kind of internal representation is used to store tsvector column type in PG, but I think GIN index lookups should be very simple to parallelize and combine (I believe a GIN index would basically be a map [word -> set_of_row_ids_that_contain_the_word] so you could perform them on all indexed columns at the same time and then compute intersection of the results?). But maybe two lookups in the same index could be somehow more efficient than two lookups in different indexes, I don't know.

I'm still sceptical about the LIKE/ILIKE scenario though. "WHERE firstname LIKE 'A%' and lastname LIKE 'Smith' " can easily discard all the Joneses and whatnot before even looking at the firstname column, whereas "WHERE combined_firstname_and_lastname LIKE 'A%SMITH' " will only be able to reject "Andrew Jones" after reading the entire string.


Also, the "better" solution assumes IPv4 addresses and is not robust to a sudden requirements change to support IPv6. Best to keep IP address as a string unless you really, REALLY need to do something numerical with one or more of the octets.


Based on my experience I disagree. If there is a native data type that represents your data, you should really use it.

It will make sure that incorrect data is detected at the time of storage rather than at some indeterminate time in the future and it will likely be more efficient than arbitrary strings.

And in case of IP addresses, when you are using Postgres, it comes with an inet type that covers both IPv4 and IPv6, so you will be safe from requirement changes.
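
A small sketch of what that looks like in practice (table name made up):

    CREATE TABLE client_addrs (
      id   bigserial PRIMARY KEY,
      addr inet NOT NULL          -- accepts both 192.0.2.10 and 2001:db8::1
    );

    -- malformed input is rejected at insert time:
    INSERT INTO client_addrs (addr) VALUES ('192.0.2.10');

    -- and subnet containment comes for free:
    SELECT * FROM client_addrs WHERE addr << '192.0.2.0/24';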


Keeping IP addresses as strings isn't trivial. You can write them in many ways, even with IPv4. IPv6 just brings new variations. And when talking about migration to IPv6 you have to decide if you keep IPv4 addresses as such or prefix them with ::ffff:.

In the end you have to normalize anyways. Some databases have specific types. If not you have to pick a scheme and then ideally verify.


That way you can get false positives, unless you concatenate using a character that is not valid in the protocol:

foo@example.com + other@example.net vs. foo@example.co + mother@example.net (both concatenate to the same string)

Btw, storing just the domains, or inverting the email strings, would have sped up the comparison


Right, and when you do that, you don't even need an RDBMS. A key-value store would suffice. This essentially just becomes a set! Redis or memcached are battle-tested workhorses that would work even better than a relational DB here.

But this was also back in the early '00s, when "data store" meant "relational DB", and anything that wasn't a RDBMS was probably either a research project or a toy that most people wouldn't be comfortable using in production.


> this was also back in the early '00s, when "data store" meant "relational DB", and anything that wasn't a RDBMS was probably either a research project or a toy that most people wouldn't be comfortable using in production.

Indeed; her problem looks to have been pre-memcached.


Yeah, I'm not clear on how multiple tables fixed it other than allowing you to scan through the main table faster.

Multiple tables could be a big win if long email addresses are causing you a data size problem, but for this use case, I think a hash of the email would suffice.


Well she did say she had indexes on the new table. It could have been fixed with an index on the previous table, but a new table with indexes also fixed it.


As a 25 year “veteran” programmer, I’m glad I’m not the only one that likes the original schema (with indexes added). A fully normalized table is a lot more difficult to read, troubleshoot, and reason about. You would need a pretty good query to poke around the data.


When I first started using postgres I was amazed at how much of what I thought was "database knowledge" was actually "working around limitations of the versions of mysql I'd been using" knowledge.

These days, sure, the original schema plus indices would work fine on pretty much anything I can think of. Olde mysqls were a bit special though.


The point is she was being hyper critical of her past schema to make a larger point.


And IIUC, the larger point, made by the twist ending, is that we shouldn't be so critical, but try to actually help newbies learn what we insist that they should know.


It seemed to me the point was that if the industry remains designed to push out experienced engineers, then there won't be anyone to mentor the engineers that are just beginning their careers. Further, we will bend the productivity of the field downward and have less effective systems.


In what way is the industry designed that way now? Maybe some get sucked into management roles, but I don’t see any evidence of a broad design to push them out.


Look at the way we screen resumes, or valorize work conditions incompatible with having family obligations. Ageism in the tech industry is rife.


Or how leetcode-heavy hiring favors folks that tend to be fresher out of college & folks who have time to prepare for them.


Role count pyramids, volume and inflexibility of working hours, abusive management techniques, a constant treadmill of the new hotness, more...

Some offices avoid many of the challenging patterns but not the vast majority. A lot of it is natural result of the newness of the industry and the high stakes of investment. Many of the work and organizational model assumptions mean that offering flexibility to employees complicates the already challenging management role. Given the power of those roles, this means workers largely must fit into the mold rather than designing a working relationship that meets all needs. As the needs change predictably over time a mounting pressure develops.

I'm only 17 years in but it has gotten harder and harder to find someone willing to leave me to simply code in peace and quiet and dig my brain into deep and interesting problems.


I have mixed feelings about this. On one hand, I appreciate the larger point, and pointing out mistakes again goes against that point. On the other, the post had so many technical details it overshadowed the larger point, especially for something that wasn't obviously broken.


Phew.. I was reading this post thinking “is basically every table I’ve ever made wrong?!”


I definitely got a spike of imposter syndrome thinking, oh fuck, I've been in software this long and haven't learned this yet?!


For modern databases, the extra normalisation is an incremental optimisation that you usually won't need.

But I think she's if anything been in the industry longer than I have and the versions of mysql I first ran in production were a different story.


Yeah… technically that’s (the solution) normalization, but really not a bad design originally. If you never want a particular atom of data repeated, yes you have to normalize each column, and that is efficient in the long run for storage size, but in more normal data storage that still means a lot of joins and you still want indices. On a modern SSD you could use that one table, throw on some indices as you suggest and not notice it again until you hit millions of rows.

Anything you are doing conditional logic or joins on in SQL usually needs some sort of index.


I especially feel the proposed solution could be problematic given that the data is supposed to be transient. How are keys deleted from the "id | value" tables when rows are deleted from the main table? Does it just keep accumulating rows or is there something like a GC query running occasionally that deletes all unreferenced values?


Clearly it worked.

However...

Consider the rows `a b c d e` and `f b h i j`. How many times will `b` be stored in the two formulations? 2 and 1, right? The data volume has a cost.

Consider the number of cycles to compare a string versus an integer. A string is, of course, a sequence of numbers. Given the alphabet has a small set of symbols, you will necessarily have leading characters repeated as soon as you store more values than there are characters, so the database has to perform at least a second comparison. I imagine you're correctly objecting that in this case the string comparison happens either way, but consider how this combines with the first point. Is the unique table on which the string comparisons are made smaller? Do you expect that the table with combinations of values will be larger than the table storing comparable values? Through this, does the combinations table using integer comparisons represent a larger proportion of the comparison workload? Clearly she found that to be the case. I would expect it to be the case as well.


> Clearly it worked.

No it didn't. Someone else put an index on the main table and that worked.


The article declared that it was running in production and satisfying the requirement. Do you have a different definition of working? I say that to clarify what I was attempting to communicate by "it worked".

I think you're remarking about the relationship between normalized tables and indexes. That relationship does exist in some databases.

Please say more if I'm missing your point.


GP's point is that the final schema has a primary key index on the main table that solved the problem, and then it has a bunch of useless additional tables. The solution was just to add a primary key index on the main table. Adding the additional tables just slows down the query because it now has to do five index lookups instead of one.


Using IDs may lead to a performance gain if it leads to a smaller database: fewer bytes to read/write/cache => less I/O volume, and better cache hits compensating for the potentially higher number of seeks on mass storage. This is especially true when seeks are quick (SSD...).

Therefore, as long as the size of the data type (C language's "sizeof") used for an ID is smaller than the average size of the column contents, using an ID will very probably lead to a performance gain.

On some DB states and usage patterns (where commonly used data+index cannot fit in RAM: the cache (DB+OS) hit ratios are < 0.99) this gain will be somewhat proportional (beware of diminishing returns) to the ratio (total data+index size BEFORE using IDs)/(total data+index size AFTER using IDs).

Creating queries then becomes more difficult (one has to use 'JOIN'), however there are ways to alleviate this: using views, "natural join"...

Some modern DB engines let you put data into an index, in order to spare an access to the data when the index is used (PostgreSQL: see the "INCLUDE" parameter of "CREATE INDEX"). As far as I understand, using a proper ID (<=> on average smaller than the data it represents) will also lead to a gain(?)
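
A sketch of that last point in PostgreSQL 11+ (table and column names are hypothetical):

    -- the lookup can be answered from the index alone, because the extra
    -- column is carried in the index as payload:
    CREATE INDEX idx_quads_ip ON quads (ip_id) INCLUDE (helo_id);

    -- candidate for an index-only scan:
    SELECT helo_id FROM quads WHERE ip_id = 42;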


Also you could just store IPv4 as an unsigned INT directly, rather than strings in a different table and an unsigned INT as a foreign key.


I have regretted every single time I've gotten cute about storing IP addresses as scalars in SQL.


What issues did you run into? Having admittedly never done this before, it feels like whenever you need the dotted-quad IP you could just query:

    SELECT INET_NTOA(ip) FROM ips WHERE...
...in MySQL at least. (Which might be why MySQL has this function built in?)

I guess if you ever need to match on a specific prefix storing the numeric IP might be an issue?


> I guess if you ever need to match on a specific prefix storing the numeric IP might be an issue?

If you want to search for range, you can do:

   SELECT INET_NTOA(ip) FROM ips WHERE ip > INET_ATON(:startOfRange) AND ip < INET_ATON(:endOfRange)


Because, as shocking as this might sound, unlike the characters in the movie The Matrix I don't have a built-in INET_NTOA function in my retina to see the string form of the IP addresses when glancing at rows of a large table looking for patterns or a certain IP.


How often are you looking at the full contents of a table vs. looking at the results of a query that you could throw an INET_NTOA onto?


Many. You also expect me to remember how to spell INET_NTOA and which fields to use it on. What if I wanted to do a quick "SELECT * FROM"? I barely know how to spell English words; what makes you think I'm going to remember how to spell INET_NTOA?


I realize you’re describing a general problem and not just this particular example, but fwiw NTOA and friends got a lot easier for me once I learned enough that I stopped trying to “remember how to spell” them and started thinking about them in terms of the underlying concepts, like “Network (as in big-endianness) to ASCII”. Learning the significance of the term “address family” was a big clue for me for self-directed learning in this space :)


dbeaver CE is free.


Idunno the real villain is that bullshit, costly "per-row" pricing


I usually store IPv4 as an unsigned int, and IPv6 as two 64-bit unsigned ints. But supporting both variants requires that I have a table for each type and a glue table to reference the record in the correct table. Storing simple stuff quickly turns complicated…
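
Roughly like this, as a sketch (all names made up for illustration):

    CREATE TABLE ipv4_addr (
      id   BIGINT UNSIGNED PRIMARY KEY,
      addr INT UNSIGNED NOT NULL,
      UNIQUE (addr)
    );

    CREATE TABLE ipv6_addr (
      id      BIGINT UNSIGNED PRIMARY KEY,
      addr_hi BIGINT UNSIGNED NOT NULL,
      addr_lo BIGINT UNSIGNED NOT NULL,
      UNIQUE (addr_hi, addr_lo)
    );

    -- the "glue" table points at whichever variant applies
    CREATE TABLE host_addr (
      id    BIGINT UNSIGNED PRIMARY KEY,
      v4_id BIGINT UNSIGNED NULL,
      v6_id BIGINT UNSIGNED NULL,
      FOREIGN KEY (v4_id) REFERENCES ipv4_addr (id),
      FOREIGN KEY (v6_id) REFERENCES ipv6_addr (id)
    );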


Is there much point in optimising this as opposed to keeping it in strings? At what QPS do you think it would matter, and by how much?


The main reason to store it as numbers is because I select records based on what subnet an IP address is in. I can only do this when I use integers.
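
A hedged sketch of what that looks like with integer storage in MySQL (table and column names assumed):

    -- everything in 10.1.2.0/24, via a mask on the stored integer:
    SELECT * FROM hosts
    WHERE ip_num & 0xFFFFFF00 = INET_ATON('10.1.2.0');

    -- or as a range, which an ordinary B-tree index on ip_num can serve:
    SELECT * FROM hosts
    WHERE ip_num BETWEEN INET_ATON('10.1.2.0') AND INET_ATON('10.1.2.255');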


> This post is bizarre ... Which is nonsense ...but from her description I would say it wasn't.

This is what the post is about.

And btw, if they hadn't normalized those tables, it'd instead have been pointed out to them that their "post is bizarre..."


OK so change the article's technical solution from "normalize" to "add index". That doesn't change the point of the article.


Is this correct? Would indexing the columns instead of moving the values to another table lead to the same increase in performance?


Moving the data into another table would still require indexes: one on the original table's column (which now stores the new id) and one on the new table's primary key.

In most cases I'd expect just adding an index to the original table to be more efficient, but it depends on the type of the original column and if some data could be de-duplicated by the normalization.


The same O(N) -> O(log N) improvement for queries, yes. The constant factor on the separate tables might be better. It’s also a more complicated design.


Given this is a "from the start of her career" story, I'm guessing she was running similar versions of mysql to the versions I started with, and if my guess is correct then probably not.

On anything you're likely to be deploying today, just throwing a compound index at it is likely quite sufficient though.


An index is often easily viewed as another table. So, yes, it should.

(Some subtleties on projected values and such, but the point stands.)


Yes, the proposed ‘better’ structure basically amounts to building your own indexes. Normally, I’d assume it is better to use the RDBMS’s own engine to do that, don’t roll your own.

There may well be some subtlety to the indexing capabilities of MySQL I’m unaware of though - I could easily imagine myself making rookie mistakes like assuming that it has the same indexing capabilities. So, to the post’s point - if I were working on a MySQL db I would probably benefit from an old hand’s advice to warn me away from dangerous assumptions.

On the other hand I also remember an extremely experienced MS SQL Server DBA giving me some terrible advice because what he had learned as a best practice on SQL Server 7 turned out to be a great way to not get any of the benefits of a new feature in SQL Server 2005.

Basically, we’re all beginners half the time.


As a former SQL Server DBA, most SQL Server DBAs are absolute gurus on 1 or 2 specific versions of SQL server, and massively ignorant about anything newer.

It's a strange role where, in order to do your job well, you kind of have to hyper-specialize in the specific version of the tech that your MegaCorp employer uses, which is usually a bit older, because upgrading databases can be extremely costly and difficult.


Then this post is very misleading.


I wouldn't get too vehement on it. Knowing how to normalize isn't a bad thing. And, if you intricately know your write and read needs, you may be able to assist the work more appropriately.


It would be faster, I believe. As we’re only looking for existence, a single index traversal is all the I/O we would need to do.


Sure would! I think it would be marginally better, in fact, because you would just need to look at the index rather than five tables. Access would be more local.


The moral of this post really falls flat coming from this author, most of whose posts are snarky, passive-aggressive rants where she rants about someone's alleged incompetence or something similar.


Personal attacks are not allowed on HN.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.

Edit: you've been breaking the site guidelines repeatedly. We ban accounts that do that. I don't want to ban you, so please stop doing this!


The ending:

> It's a massive problem, and we're all partly responsible. I'm trying to take a bite out of it now by writing stuff like this.

suggests to me that maybe she has recently had a change of heart about such rants. I'm willing to give her the benefit of the doubt, because we all need to give each other the benefit of the doubt these days.


This is a really weird ad hominem that is unrelated to the post itself.


Ad hominem?

It seems pretty relevant to me that the author mostly writes snarky articles which directly contradict the moral of this story.

For the record, I totally agree with the moral of the story. But we can't just blindly ignore context, and in this case the context is that the author regularly writes articles where someone else is ridiculed because the author has deemed them incompetent.

In a similar vein, no one would just ignore it and praise Zuckerberg if he started discussing the importance of personal privacy on his personal blog.


I have to agree with the OC. I read this post, the reply post (https://rachelbythebay.com/w/2021/11/07/select/) and "The ONE" (https://rachelbythebay.com/w/2018/04/28/meta/) post and I too was struck by the passive aggressive/defensive tone.

To me it really clashed with the original post, which concludes we need guidance/mentorship/reference points. Most of the comments on Hacker News are constructive or genuinely curious; they don't just dismiss the author, and instead provide things to learn from.

For the author to then write a reply and dismiss all these comments as comments that are really saying "I am THE ONE. You only need to know me. I am better than all of you." is just plain rude and completely counter to their own point.


[flagged]


The post has a link to an update at the bottom for you if you missed it


[flagged]


Why? She's clearly not a clueless newbie now.


I think the OP was sarcastically implying that she still is.


Such remarks generally backfire onto the author, as indeed in this case.


Right, not a newbie anymore


[flagged]


Personal attacks will get you banned here. Please review https://news.ycombinator.com/newsguidelines.html and don't do any more of this on HN.

We detached this subthread from https://news.ycombinator.com/item?id=29140272.


She has never been "clueless" and has written insightfully for years.


[flagged]


Without knowing which database engine (MySQL comes with more than one) she was using, nor the testing that she actually performed, what makes you say that?


I remember versions of mysql whose string indexing was sufficiently limited on every available backend that this would've been the correct change.

Hell, back in ... 2001? I think? I ended up making exactly the same set of changes to an early database design of mine, for pretty much the same stated reasons and with very similarly pleasant effects on its performance.


imo “clueful”/clueless is more about being able to sense situations where there is something to be known rather than knowing everything upfront all the time (which is obviously impossible)


The point isn't to determine who is or isn't clueless. The point is how we deal with each other's cluelessness.


[flagged]


Personal attacks will get you banned here. Please review the site guidelines and make your substantive points without swipes in the future.

https://news.ycombinator.com/newsguidelines.html


Rachel is a very, very smart lady with a hell of a track record, who also has strong ethics, something the world needs more of.

I think you missed the point of the article. Your reply must have felt great for you to write, though I'm not sure anyone got anything out of that but you.


This is exactly the attitude warned against in TFA.


I think this blog fails to take into account that 2021 is not 2002. Computer Science is a much more formal/mainstream field of study now, and people don't operate "in a vacuum with no mentorship, guidance, or reference points.”

Some comments on the previous blog post raised important questions regarding minimum understanding/knowledge of technology one utilizes as part of their day job. And I would agree that indexing is a fundamental aspect while using relational databases.

But unfortunately this isn't very uncommon these days; way too often I have heard people at $bigco say "let's use X, everyone uses X" without completely understanding the implications/drawbacks/benefits of that choice.


I don't see how 2002 was ever different in this from 2021 (except maybe for a stronger international dimension today).


Why not store the IP Address as a 32-bit number (IPv4 addresses)? Why store it as a string in the first place? This is something I did not quite get. Also, wouldn't it be better to split out the domain from the email address for ease of filtering?

Also, how does performing joins give a performance advantage here? I'm assuming there would be queries against at least one, but up to 4, of the value tables to get the IDs of the items in the quad. Then there would be a lookup in the mapping table.

I have worked for some time in this industry, but I have never had to deal with relational databases (directly; query tuning and such were typically the domain of expert db people). It would be interesting to see an explanation of this aspect.

EDIT: To people who may want to comment, "You missed the point of the article!": no, I did not, but I want to focus on the technical things I can learn from this. I agree that ageism is a serious problem in this industry.


You could store the IP as a 32 bit unsigned int. There's no issue with that, but I would probably recommend nowadays to use an internal cidr or inet type if your database supports that. It probably wouldn't be better to split the email address from the domain for filtering since email address formats tend to be a bit tricky.

Joins give a performance advantage due to the fact that you aren't duplicating data unnecessarily. The slow query in question becomes five queries (4 for the values, one for the final lookup) which can each be done quickly, and if any one of them returns nil, you can return a timeout.
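
Roughly, as a sketch (table and column names are assumed, not the post's actual schema):

    -- four cheap unique-index lookups; a miss on any of them means the quad
    -- has never been seen, so you can answer immediately
    SELECT id FROM ips   WHERE ip     = ?;
    SELECT id FROM helos WHERE helo   = ?;
    SELECT id FROM froms WHERE m_from = ?;
    SELECT id FROM tos   WHERE m_to   = ?;

    -- then one lookup against the small quad table using integer ids
    SELECT time FROM quads
    WHERE ip_id = ? AND helo_id = ? AND from_id = ? AND to_id = ?;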


Yes, some of the databases support native ip address types that can be manipulated efficiently. Better to use that (like inbuilt json querying capabilities in postgres) than come up with your own.

It is still not clear how it is better to do up to 5 separate queries, or maybe a few joins, than to store the strings and construct indices and such on them. Is the idea that the cost of possibly establishing a new socket connection for some concurrent query execution, or of sequentially executing 5 queries, is still < possible scans of the columns (even with indexing in effect)? Also, even if you had integers, don't you need some sort of an index on integer fields to avoid full table scans anyway?


I guess I've come away with a totally different takeaway than most. This post is rather strong on the blame game, which could be fixed by one thing...

RTFD! (Read The F**in Docs!) - I only skimmed the post, but it's definitely something that would have been avoided had some SQL documentation or introduction been read.

I'd argue one of the huge things that differentiates "senior" developers from all the "other" levels - we're not smarter or more clever than anyone else - is that we read up on the tools we use, see how they work, and read how others have used them before... I understand this was from 2002, but MySQL came out in 1995 - there was certainly at least a handful of books on the topic.

Perhaps when just starting off as an intern, you may be able to argue that you are 'operating in a vacuum' but any number of introductory SQL books or documentation could quickly reveal solutions to the problem encountered in the post. (Some of which are suggested in these comments).

Of course we all make mistakes in software - I definitely could see myself creating such a schema and forgetting to add any sort of indexing - but when running into performance issues later, the only way you'll be able to know what to do next to fix it is by having read literature about details of the tools you are using.

Operating in a vacuum? Then break out of it and inform yourself. Many people have spent many hours creating good documentation and tutorials on many, many software tools - use them.


The problem here is "unknown unknowns," coupled with the real business pressure to get solutions out the door. The MySQL manual is (currently) over 6,000 pages. A developer new to SQL or relational DBs doesn't know how many of those are necessary to read and comprehend to deploy a working solution.

And in this case, she designed and deployed the system, and it worked and met the business needs for several months. When performance became an issue, she optimized. I'm sure she had plenty of other unrelated work to do in the meantime, especially as a lead/solo dev in the early 2000's.

Sounds like a productive developer to me.



