A terrible schema from a clueless programmer (rachelbythebay.com)
842 points by zdw on Nov 7, 2021 | 450 comments



Related to database indexes, but not the post: a busted database index brought down ticket sales for the 2008 Olympic Games.

This was the first time regular people could go buy tickets for events & they had been lining up overnight at Bank of China locations throughout the country. We were down for over a day before we called it off. Apparently this led to minor upheaval at several locations in Beijing & riot police were called in.

We were pretty puzzled as we had an index and had load tested extensively. We had Oracle support working directly with us & couldn't figure out why queries had started to become table scans.

The culprit? A point upgrade to DBD::Oracle (something like X.Y.3 to X.Y.4) introduced a subtle bug in character set handling. So the index required using a particular Unicode character set, and we were specifying it, but when it was translated into the actual query, it wasn't exactly the right one, so the DB assumed it couldn't use the index. Then, when all the banks opened & a large portion of a very populous country tried to buy tickets at the same time, things just melted.

Not a fun day.


Can't imagine the pain to even discover this problem. It normally takes me a long time to be like, "maybe I didn't make a mistake, and I have a third party bug?". Third party bugs are always the most annoying to me, usually the longest to diagnose and then ultimately I have to just tell my boss I can't do anything to fix it, but hopefully I can find a way to avoid it.


Imagine having to diagnose a bug in the processor!

http://gallium.inria.fr/blog/intel-skylake-bug/

"More experienced programmers know very well that the bug is generally in their code: occasionally in third-party libraries; very rarely in system libraries; exceedingly rarely in the compiler; and never in the processor."


Fork and fix? I’ve personally been pleasantly surprised by how fast maintainers will merge a PR I submit.

If it causes an issue with our own systems it also gives me a very good justification for doing it.


In Oracle's proprietary system?


Oracle doesn't appear to maintain the DBD::Oracle Perl client module.


Sometimes I've had a pr merged. Sometimes I've had stubborn maintainers that think that their project should not follow common domain standards for /shrug reasons.

And then many other times I honestly don't have the knowledge to be able to contribute to large third-party systems, and it seems very rare to get an issue resolved without submitting a PR.

And lastly sometimes it's hard to judge if unexpected behavior is really a 'bug'. Similar to the last point, it can honestly be hard to tell the difference sometimes.


At my previous job, a particular field in a database was defined as an ASCII string rather than Unicode. If you then queried with a string in Unicode format, the database decided that the comparison should be done in Unicode. The only way to do that was to table scan and convert all the ASCII values to Unicode on the fly. It was found only in production.

Given that you mention China in your story, did GB 18030 have anything to do with your problems?

https://en.wikipedia.org/wiki/GB_18030


That sounds like newbie developers wrote the code and no DBAs were involved in checking it; it's not a database design problem, and you are throwing blame without knowing enough about the matter. If your string fits in a varchar, it makes absolutely no sense to change it to nvarchar just because your query has Unicode as input.


I reread the parent quote several times and cannot detect a hint of throwing blame at all.


Most, but not all, of the team were young, and they designed the DB schema themselves. I did not intend to throw blame. It is more the kind of situation where a decision that seems sensible at first glance creates a footgun down the road.


What? The database made the call to compare the strings as Unicode. Are you accusing the DB developers of not doing their job?


Yes, databases don't decide, developers do.


The moral of this story, to me, is always lock in all dependencies to the exact specific versions tested for production release.


I love DBD::Oracle so much I've always arranged to have other people test oracle related DBIx::Class features for me.

Writing the pre-ANSI-JOIN-support SQL dialect code was pretty good fun though.


The one that always got us is much more mundane: deleting a row requires an index in every table with a FK to it, to avoid table scans.

Near as I can tell, we assume there is some bit of magic built into the foreign key concept that handles this for us, but that is not the case.


Yeah I’ve run into this a few times. In Postgres, you can’t add a foreign key referencing columns that aren’t indexed. I assume this is for performance reasons so inserts/updates are efficient, but there is no such check that the source columns are also indexed, so updates/deletes on the referenced table can be terribly slow.


Isn’t the source almost always an indexed primary key?


No. The target is normally a primary key but the source is not. The source is rarely even unique.


Foreign keys are naturally referencing primary keys, which have their own (unique) indexes by default. And there's some extra magic with CASCADE.

But it seems like I am missing your point: care to expand on it so I can learn about the gotcha as well? What are you attempting to do, and what's making the performance slow?


If I've understood correctly it's the foreign key column itself that isn't indexed by default. Deleting a record then requires a table scan to ensure that the constraint holds


Example: if you delete a record from the customers table, you want an index on the foreign key in the orders table to delete the corresponding entries. This also means the performance of the delete can have the unfortunate property where the small work of deleting a single row cascades into the work of deleting many rows.
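
To make that concrete, the fix is just an index on the referencing column. A minimal sketch, with hypothetical customers/orders names (standard SQL):

    -- Hypothetical schema for the example above.
    CREATE TABLE customers (id BIGINT PRIMARY KEY);
    CREATE TABLE orders (
        id          BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL REFERENCES customers (id)
    );

    -- Without this index, DELETE FROM customers has to scan orders to check
    -- the constraint (or find the rows to cascade to); with it, it's an index probe.
    CREATE INDEX orders_customer_id_idx ON orders (customer_id);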


Which is why many elect to just ban deletes.


That’s not the reason. The reason why people don’t delete is because nobody wants to be left with inconsistent data relations. Deleting a customer is more deleting their PII (or scrambling it) and leaving everything else intact.

Or in the case of Silicon Valley, leave everything intact with a disabled flag and then spam you for the next decade or so.


I don't buy that reason, because that inconsistency can be easily prevented with good schema hygiene (either ON DELETE NO ACTION or ON DELETE CASCADE). The problem is rather that in order to maintain consistent data relations, delete operations must be carefully designed and that part of the functional design is usually skipped because it's perceived as not important.

Which is more or less what the GP says as well, deletes are not implemented because doing it properly requires proper design.


Had a similar experience (albeit with much less serious consequences): an Oracle Portal installation which worked fine in Dev/Test but with degrading performance on the actual Prod server (which was also immensely more powerful than what we used to develop).

Oracle was called in, "en masse". Keep also in mind that Oracle Portal was basically running on top of lots of dedicated tables and stored procedures and you could not access the data directly; you interacted with the contents only through views, aliases, etc.

It was indeed due to a smallish version bump. But we had a pretty good Team Leader who had made Oracle declare that the two different versions were absolutely compatible, allowing us to write code on the older version.

He made them state this in written form before starting the development, so it was later impossible to blame the developers for the problem.


I've never had to deal with a production-breaking issue resulting from this, but I have had to debug a collation-related issue in SQL Server that caused table scans to occur. Fortunately, the SQL profiler made it really obvious that a collation mismatch was the culprit.


Could you say more about why that was the culprit and not some problem with the load testing? For the Olympics, it seems like the most important test is "What happens when we launch?" That's the kind of thing I'd run on every commit.


An automated system-level upgrade in the production env, exactly 10 seconds before launch.

Load tests usually only happen in the staging env, not production; that is the policy.


In that case, it sounds like a really good reason to question why the staging environment is not enough like prod that load tests are valid. Or why a not-fully-tested system upgrade would happen 10 seconds before launch.


Why exactly would it be so bad to just put a suitable index on the table containing strings? The time complexity of the resulting search would be the same, so I assume there will be some constant factor slowdowns. Is it that indices over string fields are stored inefficiently on disk? (If so, can that not be fixed in the db engine directly?) Or is this fine today but wasn't fine 15 years ago?
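
For concreteness, the index in question would be something like the sketch below, using table/column names assumed from the article's description (old MySQL versions also limited how much of a long varchar could be indexed, which the replies get into):

    -- Hypothetical composite index over the original single-table schema.
    CREATE INDEX seen_tuple_idx ON seen (ip, helo, m_from, m_to);

    -- The greylisting lookup then becomes an index probe instead of a table scan:
    SELECT seen_at FROM seen
     WHERE ip = ? AND helo = ? AND m_from = ? AND m_to = ?;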


I think the article describes the solution that was used and not necessarily the best solution. If they made the “obvious” mistake in the first case they mightn’t arrive at the best solution in the second. Adding the single index and using the four tables have the same worst case complexity for reads, O(log n), with different constants and massively beat scanning at O(n).

It may be that the second solution wasn’t the best, or that it was better for write performance or memory usage—this was 2002 after all. It may also be the case that the cardinality of some columns was low in such a way that the four table solution was better, but maybe getting the columns in the right order for a multi column index could do that too.

Either way, I don’t think it really matters that much and doesn’t affect the point of the article.


Indexing existed 15 years ago. The article never mentions why indexing didn't solve this problem. Super weird take on the author's part...


Early InnoDB* had pretty strict limits on varchar indexes and was not the most efficient. I don't remember the details but it's entirely possible the single-table format Rachel described ran head on into those limitations.

Also remember indexes take space, and if you index all your text columns you'll balloon your DB size; and this was 2002, when that mattered a lot more even for text. Indexes also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.

Finally, normalizing your data while using the file-per-table setting (also not the default back then) can additionally provide better write & mixed throughput due to the writes being spread across multiple file descriptors and not fighting over as many shared locks. (The locking semantics for InnoDB have also improved massively in the last 19 years.)

* Assuming the author used InnoDB and not MyISAM. The latter is a garbage engine that doesn't even provide basic crash/reboot durability, and was the default for years; using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.


> indexes take space, and if you index all your text columns you'll balloon your DB … also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.

Indexes didn’t go away with these extra tables, they live in the ‘id’ field of each one. They also probably had UNIQUE constraint on the ‘value’ field, spending time on what you describe in the second half of my citation.

I mean, that should have saved some space for non-unique strings, but all other machinery is still there. And there are 4 extra unique constraints (also an index) in addition to 4 primary keys. Unless these strings are long, the space savings may turn out to be pretty marginal.


Right, indexes don't "go away" as you normalize, but the amount of data indexed drastically shifts and four unique indexes are pretty much guaranteed to be smaller in size than one massive combinatorial index of (all, four, possible, values). Not just in the case where any duplicates exist, but just alone in terms of the database's overhead in things like field delimiters and page metadata. While the specifics will vary a lot with specific DB implementations, generally with varchars involved you can usually assume a worst-case in terms of field delimiting (and from there sometimes a worst-case in average page size and how that impacts B-Tree balancing).

In general, indexes alone can't save you from normalization problems. From certain points of view an index is itself another form of denormalization, and while it is a very handy form of denormalization, heavily relying on big composite indexes (and the "natural keys" they represent) makes normalization worse and you have to keep such trade-offs in mind just as you would any other normalization planning.

It is theoretically possible a database could do something much smarter and semi-normalized with, for instance, Trie-based "search" indexes, but specifically in 2002, so far as I'm aware, in most of the databases of the time such indexes were never the default, didn't have great multi-column support, and would have been expensive to compute if you did turn them on. Even such "smart" indexes would likely still have suggested you normalize first.


To me it seems that spreading it over four tables would lead to a lot more potential read locks while the big combined table is waiting for a join on each of the others, and some process is trying to insert on the others and link them to the main. This is assuming they were using foreign keys and the main table 4-column index was unique.


Old best practice for InnoDB performance was actually to never use foreign keys because of these locking issues. Not sure if that's the case in 2021.


InnoDB uses row-level locking, and foreign keys are (usually) a great feature to ensure data integrity. But using multiple foreign keys from tables `a`,`b` as a composite index for table `x` can cause deadlock if both are being updated in rapid succession, because an update on `a` gets a lock on `x`, which needs a read lock on `b` which is waiting for the lock from `a` to be released. I try to never structure multiple foreign keys as a composite index.
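
A sketch of that shape with illustrative names, in case it helps picture it (MySQL flavour):

    CREATE TABLE a (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE b (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE x (
        a_id INT NOT NULL,
        b_id INT NOT NULL,
        PRIMARY KEY (a_id, b_id),              -- composite index over both FKs
        FOREIGN KEY (a_id) REFERENCES a (id),
        FOREIGN KEY (b_id) REFERENCES b (id)
    ) ENGINE=InnoDB;

    -- Concurrent updates to a and b can each take locks on x (via the FK checks)
    -- in opposite order and then wait on each other: the deadlock described above.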


This is fascinating, thanks for sharing.


> Also remember indexes take space

Indexes take space, except for the clustered index.

An important distinction.

If the point of this table is always selecting based on IP, m_from, m_to, then clustering on those columns in that order would make sense.

Of course, if it's expected that a lot of other query patterns will exist then it might make sense to only cluster on one of those columns and build indexes on the others.


But IP addresses are actually _numbers_, and you can also save them as bytes: 32 bits for IPv4, 128 bits for IPv6 addresses. (PostgreSQL has a native datatype 'inet' which has supported IPv6 since 7.4; if you're using Postgres and save an IP address in a character string, you're doing it wrong.) If you saved the IP as a number and put an index on it, all your queries would be blazing fast!
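
A minimal Postgres sketch of that (table and column names made up):

    CREATE TABLE seen (
        ip      inet        NOT NULL,
        m_from  text        NOT NULL,
        m_to    text        NOT NULL,
        seen_at timestamptz NOT NULL DEFAULT now()
    );
    CREATE INDEX seen_ip_idx ON seen (ip);

    -- inet compares and indexes as an address, not as text:
    SELECT * FROM seen WHERE ip = '192.0.2.15';
    -- it also understands networks, e.g. WHERE ip << '192.0.2.0/24'
    -- (containment queries are better served by a GiST index than the btree above).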


> using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.

The author mentions that these events took place in 2002. At that time the site was likely still running MySQL 3 and InnoDB was very new and considered experimental. Sure MySQL 4.0 had shipped in 2002 and InnoDB was a first class citizen, but back then upgrades from one version of MySQL to another weren't trivial tasks. You also tended to wait until the .1 release before making that leap.

So in fairness for the folks who originally set up that database MyISAM was likely their only realistic option.

I looked after and managed a fleet of MySQL servers from back then and for many years afterwards and even then it wasn't until MySQL 5 that we fully put our trust in InnoDB.


Surely more than 15 years! 50 maybe?


I had the same initial reaction in reading the post. My assumption was that indexing would have sufficiently sped up the query speed problem.

However, normalizing the fields that contain repetitive data into separate tables could create significant space savings since the full text for each column would not need to be stored for each row. Instead of 20-30 bytes per email address, a 4 byte (assuming 32-bit era) OID is stored in its place.

It's pretty easy to imagine how quickly the savings would add up, and that it would be very helpful in the era before SSDs or even 1TB HDDs existed.
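
Roughly, the layout described above looks like the sketch below (names assumed); each repeated string lives once in a lookup table and the wide table stores 4-byte ids:

    CREATE TABLE addresses (
        id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        value VARCHAR(255) NOT NULL,
        UNIQUE KEY (value)
    );

    CREATE TABLE seen (
        ip_id   INT UNSIGNED NOT NULL,   -- 4 bytes instead of a repeated string
        from_id INT UNSIGNED NOT NULL,   -- references addresses.id
        to_id   INT UNSIGNED NOT NULL,   -- references addresses.id
        seen_at DATETIME     NOT NULL
    );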


Normalization is important for deduplication, not only to index and compare a few short numbers instead of a few long strings: those host names and email addresses are long and often repeated.


That's not what normalization is. You're thinking of deduplication or compression. This table is normalized as per usual database terminology.


But if it’s guaranteed to be 1:1, why? The two implementations (normalized and denormalized) in that case should be completely isomorphic, quirks of the DB engine aside.


Why what? I don't understand what you are asking.


Yeah, while this is not optimal (ip should be converted into integer, time should be ts), the table would be small (as old entries could be safely deleted). The only real issue is the lack of indices.

Also is helo field even needed?


IP should NOT be an integer. IPv6 does not fit in a database integer. MySQL has functions to convert to bytes.

Even better, use a database with a proper inet datatype. That way you get correctness, space efficiency, and ability to intelligently index.


In 2002 it definitely should have been an integer.

In 2021 I’d recommend just turning off the IPv6 allocation or deleting it from DNS like this site does.


> In 2021 I’d recommend just turning off the IPv6 allocation or deleting it from DNS like this site does.

This is the funniest IPv6 excuse I've heard on HN.

https://ipv6excuses.com/


Thanks! I was really surprised to find it myself. I used HN in some sample code for analyzing how SSL tickets are reused and was surprised to see an IPv4 address come back.


Yes. A proper mail implementation won't randomize it, so it should be consistent. Depending on the design of the validation stages it might only need a case-normalized lookup to the string index table and that id as part of the final multi-column unique index, and in that last case only to convey to the database the expectations to optimize around (and require the data adhere to).


Let's say the article is a critique of some system by someone who is not much better at databases. As many people also pointed out, a bad solution was replaced with a bad solution that works better for someone.


At the worst you could concatenate the strings and have a single primary key.. or hash the string concatenation.


It was fine on -some- databases 20 years ago. mysql of that era, less often.


The ending is the most important part.

> Now, what do you suppose happened to that clueless programmer who didn't know anything about foreign key relationships?

> Well, that's easy. She just wrote this post for you. That's right, I was that clueless newbie who came up with a completely ridiculous abuse of a SQL database that was slow, bloated, and obviously wrong at a glance to anyone who had a clue.

> My point is: EVERYONE goes through this, particularly if operating in a vacuum with no mentorship, guidance, or reference points. Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?


(It seems a lot of folks are getting nerd-sniped by the set-up and missing the moral of the story, eh?)

I don't know the actual numbers, but it's been pointed out that at any given time something like half of all programmers have been doing it less than five years, for decades now.

That, plus the strident ignorance of past art and practice, seem to me to bring on a lot of issues.


I don't think it's purely nerd-sniping. If your story is "at first you are bad, but then you get good", but your example is of a case where you did something fine but then replaced it with something worse, that rather undermines the story.


If even someone with Rachel's level of experience still doesn't know all the minutiae of database optimization, I think that just amplifies her point about the importance of mentoring novices.


I recall reading a rant on another site about someone so upset they had to deal with clueless noobs at work.

They then went on to list the errors that this brand new entry level employee had made when writing ... an authentication system ...

I was more than a little shocked when I realized they were serious and hadn't realized the issue was sending the new entry level guy to do that job alone.


I would expect most seniors to want a set of eyes across their work doing any form of authk or z. If only to share blame :p


> authk

I've always seen authn. Where'd you pick this usage up?


good question! I meant authn. I think I just subbed in the phonic from the ..cation part of the word.


You get my upvote both for having a friendly response, and because your answer is what my brain filled in and I’m glad I don’t need to learn a new shorthand.


You need to distinguish authentication from authorization. "auth9n" and "auth8n" might be a slight bit too obscure for the purpose.


What a completely nonresponsive "response". To the best of my knowledge, no one has ever even tried to use auth9n or auth8n. But that's not where the 'n' in authn comes from. authn and authk are both equally distinct from authz.


You appear to be unfamiliar with the extremely common, and problematic, abbreviation "i18n" for "internationalization".

For my part, I have not encountered "authn", "authz", or "authk" before. But prior discussion did not even mention the word "authorization", and that seemed worth bringing out in the open.

So, not nonresponsive, just responsive to things you weren't personally interested in.


Generally authentication is authn and authorization is authz


> but your example is of a case where you did something fine but then replaced it with something worse

The way I see it, this was a case of something that wasn't working being replaced with something that was working.

I don't know what your metrics about "good" or "bad" are but eventually they got a solution that covered their use cases and the solution was good enough for stakeholders and that is "good".


Chronology alone only progresses; progress sometimes regresses.


> It seems a lot of folks are getting nerd-sniped

that is an excellent term.



Old enough some people under 35 may not have encountered it… ;)


I feel like that phrase has been there forever, and I'm exactly 35. Will this happen more often as I grow older? Ugh. Feels weird. Maybe also a bit depressing.


It was weird seeing all the spongebob squarepants memes take over the internet when I was too old to ever grow up with that. I turned 28 in 1999 when that first aired. That was my "holy shit I'm so old" moment when that finally trickled up into my awareness as someone nearly turning 40 or so.


As a 48 year old who has no idea who an awful lot of modern celebrities are, I'm constantly wondering if it is because I'm just old now or if its because there are just way more celebrities these days due to the large parasocial celebrity class that didn't really exist when I was younger.

Maybe a bit of both.


> parasocial

^H^H^H^H^Hitic

There, FTFY.


For me it is Pokemon. I was a few years older than the demographic it targeted.

I could have jumped on it and played it but I figured just another toy craze, it'll be over before long.


Same age as you. I have a lot of friends in the Philippines. I swear the Facebook employment info of half the people in the Philippines says they work at the Krusty Krab. (And for most filipinos, the internet = facebook.) For some reason I never asked what that meant, and for many years I thought that was just some odd joke, and was vaguely puzzled about how widespread it is. Eventually I happened on the Spongebob connection!


Well, then, they are part of the 10,000

https://xkcd.com/1053/


Only if they were born in the US


I'm 49, I don't know how I never saw that xkcd.


I just spent several minutes trying to solve that resistor problem (or a simplified version with adjacent nodes) before giving up and deciding I’ll look up other people’s analyses when I get the time. Definitely a good example.


The math would resemble xkcd's Lucky Ten Thousand: https://xkcd.com/1053/

With yes the added confounding factors of how often our industry seems to prefer to hire youth over experience, and subsequently how often experience "drains" to other fields/management roles/"dark matter".


> (It seems a lot of folks are getting nerd-sniped by the set-up and missing the moral of the story, eh?)

The narrative should lead to the conclusion. If I told you the story of the tortoise and the hare in which the hare gets shot by a hunter, then said the moral is "perseverance wins", you'd be rightfully confused.


Yep. Folks are getting lost in the weeds discussing indexing of database tables. That's _totally_ beside the point here.

The thing is, the first implementation was a perfectly fine "straight line" approach to solve the problem at hand. One table, a few columns, computers are pretty fast at searching for stuff... why not? In many scenarios, one would never see a problem with that schema.

Unfortunately, "operating in a vacuum with no mentorship, guidance, or reference points", is normal for many folks. She's talking about the trenches of SV, it's even worse outside of that where the only help you might get is smug smackdowns on stackoverflow (or worse, the DBA stackexchange) for daring to ask about such a "basic problem".


Yes and no and this post highlights a subtle issue with mentorship (which I think is important): Technology does not stand still. What was true in 2002, might not be true today. While adopting the naïve approach was detrimental back then, today databases recognise that this happens and provide easy workarounds to get you out of trouble that didn't exist back then.

I've experienced this just by switching languages. C# had many articles dedicated to how much better StringBuilder was compared to String.Concat and yet, other languages would do the right thing by default. I would give advice in a totally different language about a problem that the target language did not have.

As the song goes:

"Be careful whose advice you buy but be patient with those who supply it

Advice is a form of nostalgia, dispensing it is a way of fishing the past

From the disposal, wiping it off, painting over the ugly parts

And recycling it for more than it's worth"


That might be true if you just take blanket advice. The key is to find out why say StringBuilder is better than String.Concat. If you understand the implementation details and tradeoffs involved, this makes the knowledge much more applicable in the future. The core concepts in technology do not move nearly as fast as individual projects, libraries, frameworks, languages, etc...


You wrote: <<While adopting the naïve approach was detrimental back then, today databases recognise that this happens and provide easy workarounds to get you out of trouble that didn't exist back then.>>

Do you have a specific example that would help in the case described in Rachel's blog post?

I am still making (and finding) occasional indexing and 3NF mistakes. In my experience, it is always humans finding and fixing these issues.


But trust me on the sunscreen.


Totally off topic I know, but what is the concern with String.Concat ?


If you repeatedly use concat to build up a string, the amount of time grows quadratically. This is because the string is copied each time you concat. Note that the + operator on strings gets turned into a call to concat.

https://docs.microsoft.com/troubleshoot/dotnet/csharp/string...


I was wondering about it in the context of "yet, other languages would do the right thing by default". Repeatedly concatenating to the same string (as opposed to concatenating an array of strings in one go) would be slow in any language I know of, unless you allocate a larger buffer up front, which is what StringBuilder does.

Some languages have mutable strings, but you would still need to allocate a sufficiently larger buffer if you want to add strings in a loop.


In languages that have immutable string semantics and can detect that they are the exclusive owner of a string, they can modify them in place. For reference counting languages like Python, they can easily determine they are the exclusive owner by checking for a reference count of 1. See:

https://github.com/python/cpython/blob/main/Objects/unicodeo...

Rust's ownership system ensures that there cannot exist mutable and immutable references to the same object at the same time. So if you have a mutable reference to a string it is ok to modify in place.

I have not confirmed either the Python example or the Rust example too deeply, since I'm replying to a 5 day old comment. But the general principle holds. GC languages like C# and Java don't have easy ways to check ownership so they always copy strings when mutating them. Maybe I'll write a blogpost on this in the future.


I don't think StringBuilder is faster specifically because it allocates a large buffer. Pretty much every language with growable data structures can already grow any such structure pretty fast, including arrays, maps, and mutable strings. They already have a bunch of pre-set constants about stuff like how big to allocate for the initial empty one and how much more to allocate every time you overflow to balance speed and memory efficiency. It's faster because it's mutable and keeps adding new data to the existing buffer until it gets too large.


The cost is not the allocation per se, the cost is copying the bytes. When concatenating two strings, the bytes from both strings are copied into the new string. If you repeatedly concatenate, the bytes end up getting copied many times. E.g if you concatenate 100 strings by appending one by one, the bytes in the first string is copied 99 times, the bytes in the second string is copied 98 times and so on.

Using a StringBuilder, the strings are only copied once when copied into the buffer. If the buffer need to grow, the whole buffer is copied into a larger buffer, but if the buffer grows by say doubling the size, then this cost is amortized, so each string on average is still only copied twice at worst.

Faster yet is concatenating an array of strings in a single concat operation. This avoids the buffer resize issue, since the necessary buffer size is known up front. But this leads to the subtle issue where a single concat is faster than using a StringBuilder, while the cost for multiple appends grows quadratically for concat but only linearly for StringBuilder.


Repeatedly concatting a string is the fastest way I am aware of to build a longer one in Javascript (this has been deliberately optimized for). I believe this might also be the case for PHP, or at least it's very fast. Perl might be pretty fast at this too, but I could be wrong.


I know JS's case somewhat well, and the lesson there is not, as the comment above suggests, that it "just does the right thing". JS doesn't use a "String Builder" under the hood with a lot of string + calls; instead, JS does subtly the "wrong" thing as an optimization: in the .NET CLR strings are always, always immutable (or they are not strings according to guarantees that the CLR makes). JS lives in a single language VM that makes no such guarantees, and often lives in strictly single threaded worlds where the VM doesn't have to guarantee the immutability of strings to anyone at the binary memory layout level, so most JS VMs are free to do the "wrong thing" (from the .NET CLR perspective) and just mutably alter strings under the hood as a deliberate optimization.

There's a lot of interesting conflation of "right" versus "wrong" in this thread. There's the C# "best practices" "right" versus "wrong" of knowing when to choose things like StringBuilder over string.Concat or string's + operator overloading (which does just call string.Concat under the hood). There's the "does that right/wrong apply to other languages" question? (The answer is far more complex than just "right" and "wrong".) There's the VM guarantees of the .NET CLR's hard line "no string is mutable" setting a somewhat hard "right" versus "wrong", and other languages' VMs/interpreters not needing to make such guarantees to themselves or to others. None of these are really ever "right" versus "wrong"; almost all of them are trade-offs described as "best practices", and we take the word "best" too strongly, especially when we don't know/examine/explore the trade-off context it originated from (and sometimes wrongly assume that "best" means "universally the best").


I believe joining an array of strings is still faster in JavaScript, since the combined size can be known up front.


I agree, and would even go one step further and say the first implementation was a decent first pass. Sometimes the 'awful' implementation is good enough and your time is better used on something else. However, this pattern can sometimes bite you in the long term. As you can have tons of little 'hacks' all over the place and people become afraid to touch them, or worse copy from them.

This also nicely shows one of the fun things with table scans. They look decent at first then perf goes crappy. Then you get to learn something. In this case it looks like she used normalization to scrunch out the main table size (win on the scan rate, as it would not be hitting the disk as much with faster row reads). It probably also helped just on the 'to' lookups. Think about how many people are in an org, even a large company has a finite number of people but they get thousands of emails a week. That table is going to be massively smaller than keeping it in every row copied over and over. Just doing the 'to' clause alone would dump out huge amounts of that scan it was doing. That is even before considering an index.

The trick with most SQL instances is do less work. Grab less rows when you can. Use smaller tables if you can. Throw out unnecessary data from your rows if you can. Reduce round trips to the disk if you can (which usually conflicts with the previous rules). Pick your data ordering well. See if you can get your lookups to be ints instead of other datatypes. Indexes usually are a good first tool to grab when speeding up a query. But they do have a cost, on insert/update/delete and disk. Usually that cost is less than your lookup time, but not always. But you stick all of that together and you can have a DB that is really performant.

For me the fun one was one DB I worked in where they used a GUID as the primary key, and therefore the FK into other tables. It was also the default ordering key on the disk. Took me a while to make them understand why the perf on lookup was so bad. I know why they did it and it made sense at the time to do it that way. But long term it was holding things back, ballooning the data size, and crushing the lookup times.

How did I come by all of this? Lots of reading and stubbing my toe on things and watching others do the same. Most of the time you do not get a 'mentor'. :( But I sure try to teach others.


I’d be a lot more sympathetic if the major RDBMSes didn’t have outstanding and thorough reference manuals, or if there weren’t a mountain of books on the subject that cover, among other things, the topic of indexing and its importance. MySQL’s manual, for example, has covered this subject from the very beginning: https://dev.mysql.com/doc/refman/5.6/en/mysql-indexes.html (I don’t have the 3.x manuals handy but it was there back then too).

It’s not clear from the article whether the author spent any time studying the problem before implementing it. If she failed to do so, it is both problematic and way more common than it ought to be.

“Ready, fire, aim”

But you know what they say: good judgment comes from experience, and experience comes from bad judgment.

Compounding the problem here is that the author now has much more experience and in a reflective blog post, still got the wrong answer.

IMO the better lesson to take away here would have been to take the time getting to know the technology before putting it into production instead of jumping head first into it. That would be bar raising advice. The current advice doesn’t advise caution; instead it perpetuates the status quo and gives “feel good” advice.


I couldn't disagree more with this comment. No amount of reading documentation teaches you how to build production systems. You progress much faster by getting your hands dirty, making mistakes, and learning from them.

The challenge of writing good software is not about knowing and perfecting every gory detail; it's about developing the judgement to focus on the details that actually matter. Junior engineers who follow your advice will be scared to make mistakes and their development will languish compared to those who dive in and learn from their mistakes. As one gains experience it's easy to become arrogant and dismissive of mistakes that junior engineers make, but this can be extremely poisonous to their development.


Respectfully, I believe you are disagreeing with an argument that I am not making. It's not an either-or scenario. Both access to and reliance on reference materials and the ability to safely experiment are part of the professional engineer's toolbox. I can say this with confidence because I was a junior engineer once - it's not like I'm speaking without experience.

Everyone - new and experienced developers alike - should have a healthy fear of causing pain to customers. Serving customers - indirectly or otherwise - is what makes businesses run. As a principal engineer today my number one focus is on the customer experience, and I try to instill this same focus in every mentee I have.

Does that mean that junior developers should live in constant fear and decision paralysis? Of course not.

That's where the mentor - the more experienced and seasoned engineer - comes in. The mentor is roughly analogous to the teacher. And the teaching process uses a combination of a mentor, reference materials, and the lab process. The reference materials provide the details; the lab process provides a safe environment in which to learn; and the mentor provides the boundaries to prevent the experiments from causing damage, feedback to the mentee, and wisdom to fill in the gaps between them all. If any of these are absent, the new learner's development will suffer.

Outstanding educational systems were built over the last 3 centuries in the Western tradition using this technique, and other societies not steeped in this tradition have sent their children to us (especially in the last half-century) to do better than they ever could at home. It is a testament to the quality of the approach.

So no, I'm not advocating that junior developers should do nothing until they have read all the books front to back. They'd never gain any experience at all or make essential mistakes they could learn from if they did that. But junior developers should not work in a vacuum - more senior developers who both provide guidance and refer them to reference materials to study and try again, in my experience, lead to stronger, more capable engineers in the long term. "Learning to learn" and "respect what came before" are skills as important as the engineering process itself.


> That's where the mentor [...] comes in.

The thing is, specifically, that the OP did NOT have a mentor. They had a serious problem to solve, pronto, and knew enough to take a crack at it. OK, the initial implementation was suboptimal. So what? It's totally normal, learn and try again. Repeat.

It would be nice if every workplace had an orderly hierarchy of talent where everyone takes care to nurture and support everyone else (especially new folks). And where it's OK to ask questions and receive guidance even across organizational silos, where there are guardrails to mitigate accidents and terrible mistakes. If you work in such an environment, you are lucky.

It is far more common to have sharp-elbow/blamestorm workplaces which pretend to value "accountability" but then don't lift a finger to support anyone who takes initiative. I suspect that at the time Rachelbythebay worked in exactly this kind of environment, where it simply wasn't possible for an OPS person to go see the resident database expert and ask for advice.


Right. Hence my concern about the lack of helpful advice other than “accept yourself”[1]: Neophyte engineers lacking guardrails and local mentors should fall back to what they learned in school: experiment, use your books, and ask questions from the community. Mentors can be found outside the immediate workplace, after all.

And when it comes to taking production risks, measure twice and cut once, just like a good carpenter.

[1] Of course you should accept yourself. But that alone won’t advance the profession or one’s career.


> measure twice and cut once

Being able to do that is a luxury that many do not enjoy in nose-to-the-grindstone workplaces. You make an estimate (or someone makes it for you), then you gotta deliver mentor or no mentor, and whether you know the finer points of database design/care-and-feeding or not.

There's something good to be said for taking action. Rachelbythebay did just fine. No one died, the company didn't suffer, it was just a problem, it got corrected later. Big whoop.


But it's also about constantly asking yourself "how can this be done better? What do I not know that could improve it?"

I learned probably 80% of what I know about DB optimization from reading the Percona blog and S.O. The other 20% came from splashing around and making mistakes, and testing tons of different things with EXPLAIN. That's basically learning in a vacuum.


I couldn't disagree more with this comment. No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes, nor changes the impact of the mistakes. You also progress much faster by learning from other peoples' mistakes, this is why written language is so powerful.

Honestly, I think your comment is okay minus the fact that you're trying to highlight such hard disagreement with a sentiment that people should read the fucking manual. Really, you couldn't disagree MORE?

There is definitely value in RTFM. There are cases where making mistakes in production is not acceptable, and progressing by "making mistakes" is a mistake on its own. I don't think the case in this article sounds like one of those, but they do exist (e.g. financial systems (think about people losing money on crypto exchanges), healthcare, or, say, sending rockets into space). In many cases, making mistakes is fine in test systems, but completely, absolutely, catastrophically unacceptable in production. Although, I refuse to blame the junior engineer for such mistakes, I blame management that "sent a boy to do a man's job" (apologies for a dated idiom here).

(As an aside, overall, I disagree with many of the comments here nitpicking the way the author solved the problem, calling her "clueless", etc. I really don't care about that level of detail, and while I agree the solution does not seem ideal, it worked for them better than the previous solution.)


> No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes...

Categorically wrong.

Mistakes, failure, and trial and error are very much a part of developing skills. If you're not making mistakes, you're also not taking enough risks and thus missing out on opportunities for growth.


Yes, mistakes are required to learn from them, but making mistakes does not imply that a person will learn from their mistakes.


You can't really search for a problem if you don't suspect its existence. At this point you might feel you have everything you need to start the implementation; how do you guess?


… I appreciate her sentiment at the end there that we should try to change the status quo (which I think does suck), but I'm not sure how much power the average employee has over it. Most employers I've worked at seem loath to retain anyone past about 2 years. (E.g., currently I'm in the 95th percentile, after just over 2 years.) IME it takes about 6 months to really learn how a company's systems work to the point of proficiency (yes, six months), which means we're spending ~25% of the time "training", where "training" is usually someone trying something crazy and failing, and getting corrected, as opposed to some structured form of learning. Oftentimes the systems these engineers are stumbling over are, I think, the internal ones; they need more features, or refactoring to account for organic growth, but since they're not shiny new customer facing features they'll get exactly 0 priority from PMs. The engineers who build experience working with these systems and actually know what needs to change and how without breaking the existing use cases … are about to leave the company, since nobody is retained beyond ~2 years.

I also find my own team is usually strapped for resources, normally people. (Usually politely phrased as "time".) Yes, one has to be wary of mythical man-month'ing it, but at my last employer we had essentially 2 of us on a project that could have easily used at least one, if not two, more people. Repeat across every project and that employer was understaffed by 50-100%, essentially.

Some company just went for S-1, and they were bleeding cash. But they weren't bleeding it into new ventures: they were bleeding it into marketing. Sure, that might win you one or two customers, but I think you'd make much stronger gains with new products, or less buggy products that don't drive the existing customers away.

Also there's an obsession with "NIT" — not invented there — that really ties my hands as an engineer. Like, everything has to be out sourced to some cloud vendor provider whose product only fits some of the needs and barely, and whose "support" department appears to be unaware of what a computer is. I'm a SWE, let me do my thing, once in a while? (Yes, where there's a good fit for an external product, yes, by all means. But these days my job is 100% support tickets, and like 3% actual engineering.)


What's funny is there seems to be a lot of debate among grizzled veterans here about whether her old solution is actually better than the new. We're getting into index types, engine implementations, normal forms, etc. And really, this is a relatively simple development problem that I can easily imagine on an interview test.

Now, imagine a new engineer, just getting started trying to make sense of it all, yet being met with harsh criticism and impatience.

Maybe that was exactly her point--to get everyone debating over such a relatively trivial problem to communicate the messages: Good engineering is hard. Everyone is learning. Opinions can vary. Show some humility and grace.


Two engineers I interact with on a regular basis:

1. Me from six months ago. Clueless about a lot of stuff, made some odd technical choices for obscure reasons, left a lot of messes for me to deal with.

2. Me six months from now. Anything I don't have time for now, I can leave for him. If I was being considerate I'd write better docs for him to work with, but I don't have the time for that either.


> tend to chase off anyone who makes it to the age of 35

That's definitely not true anymore, if it ever was. Senior/staff level engs with 20+ years of experience are sought after and paid a ton of money.


I was reading her article thinking "yeah, well… wow, she's pretty hard on this programmer, it's not that unreasonable anyway!", until this passage.

I would not expect her to be like this from what I've read on her blog so I was surprised. Well written!


The way she kept referring to the programmer made me suspicious that that would be the ending. I personally really like that as a way of conveying the message of us all having growing pains and learning the hard way from some inefficient code.

I know I've also had these encounters (specifically in DBs), and I'm sure there are plenty more to come.


Seems like the definition of hacker to me.


> My point is: EVERYONE goes through this

A lot of the basic issues beginners go through can be mitigated by paying attention in class (and/or having formal education in the first place).

I’ll grant that there are other, similar, things that everyone will go through though.

My own was manually writing an ‘implode’ function in Javascript because I didn’t know ‘join’ existed.


While the normalized version is more compact and doesn't store redundant data, it _also_ needs an index on the four columns or it'll have to check every row in the table. A similar index added to the original denormalized table would have given comparable query performance.

The table schema isn't terrible, it's just not great. A good first-pass to be optimized when it's discovered to be overly large.


> The table schema isn't terrible, it's just not great.

It depends - if this was the full use-case then maybe a single table is actually a pretty good solution.


That is not true: the sub-tables guarantee that the value is unique in the entire column.

The main table did not have any such guarantees, so there will be a lot of indexing and duplication in at least 3 (of the 4) sub-table columns.

Moreover, indexing would be much faster and more efficient as there are only unique values in the sub-tables.


Why wouldn't it have such guarantees, if the indices in question are defined as UNIQUE?


Because they are not unique in the main table.

For example, if there are multiple emails from the same domain name, then the domain name column will have the same entry multiple times.

When the domain is moved to a sub-table, the main table still has duplicate entries, in the form of foreign keys, while the domain table has a unique entry for each domain name.


Foreign key columns (i.e. the ones in the main table) are often indexed themselves, and those indices can also be UNIQUE.


Feels like you could just concatenate and hash the 4 values with MD5 and store the hash and time.

Edit: I guess concatenate with a delimiter if you're worried about false positives with the concat. But it does read like a cache of "I've seen this before". Doing it this way would be compact and indexed well. MD5 was fast in 2002, and you could just use a CRC instead if it weren't. I suppose you lose some operational visibility to what's going on.
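
A sketch of that idea in MySQL flavour (names assumed; CONCAT_WS supplies the delimiter mentioned in the edit):

    CREATE TABLE seen (
        tuple_md5 CHAR(32) NOT NULL PRIMARY KEY,   -- hex MD5 of the 4-tuple
        seen_at   DATETIME NOT NULL
    );

    -- Record the tuple, keeping the original timestamp if it's already there:
    INSERT INTO seen (tuple_md5, seen_at)
    VALUES (MD5(CONCAT_WS('|', @ip, @helo, @m_from, @m_to)), NOW())
    ON DUPLICATE KEY UPDATE seen_at = seen_at;

    -- "Have I seen this before, and when?"
    SELECT seen_at FROM seen
     WHERE tuple_md5 = MD5(CONCAT_WS('|', @ip, @helo, @m_from, @m_to));

(BINARY(16) with UNHEX(MD5(...)) would halve the key size, but the idea is the same.)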


Yup, we do this at work for similar purposes and it works a-ok. We also have some use cases that follow the original “problem” schema and they work fine with the correct indices involved.

My guess is that in 2002 there were some issues making those options unappealing to the Engineering team.

When we do this in the realm of huge traffic then we run the data through a log stream and it ends up in either a KV store, Parquet with a SQL/Query layer on top of it, or hashed and rolled into a database (and all of the above if there are a lot of disparate consumers. Weee, Data Lakes).

This is also the sort of thing I’d imagine Elastic would love you to use their search engine for.


Isn't that exactly what adding an index would do internally?


2002 MySQL had some limitations on indexing variable length string (varchar) columns. That's the gist of the linked story.


Depending on the RDBMS system, in most cases it does not.


For the 2021 version, you'd just generate a bloom filter/cuckoo hash from all the data gathered by your spamhaus database periodically. Make a separate one for each value in your tuple and your score would be the number of the sub-hashes that matched.


“Periodically” here would have to be quite frequent, as you have to rebuild the database before the next retry.


A couple times a day at most?


From the post, you'd have to be rebuilding it every 15 minutes:

> A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.


Still better than locking the tables every time you receive a piece of mail.


I'm not super familiar with this stuff, but I believe you could then use a key-value store with automatic expiry like Redis for automatic pruning and faster lookups.


Redis and memcached didn't exist in that timeframe.


I'd say that pretty much everyone is a better DB programmer than I am, so I don't really go out of my way to design anything especially robust. I try to design stuff I'm not good at, in a way that makes it easy for someone that knows more than I do, to come in and toss out the work I did, to be replaced with good work.

There's lots of stuff that I'm bad at. I'm also good at a fair bit of stuff. I got that way, by being bad at it, making mistakes, asking "dumb" questions, and seeing how others did it.


> I try to design stuff I'm not good at, in a way that makes it easy for someone that knows more than I do, to come in and toss out the work I did, to be replaced with good work.

Sounds familiar. And sometimes that other someone is just a later version of yourself :)

I am very grateful when my past self thought to add comments for my future self.


People in the comments are getting (rightfully) outraged about the poor understanding of indexing, but I'm a little surprised that everyone here doesn't seem to understand normalization either. The original schema is perfectly normalized and is already in 3NF: none of the columns shown has a dependence on any of the other columns outside of the primary key (in other words, if you knew eg the values of the ip, helo, and from columns, you'd still have no information about the "to" column).

Normalization does not mean "the same string can only appear once". Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table. Now instead of "address1@foo.bar" in 20 different locations, you have "id1" in 20 different locations. There's been no actual "deduplication", but that's again not the point. Creating the extra tables has no impact on the normalization of the data.


Thank you. Was thinking that I was the clueless one... (probably a little) because that is not my understanding of what normalization is.

I thought perhaps swapping the strings to integers might make it easier to index, or perhaps it did indeed help with deduplication in that the implementation didn't "compress" identical strings in a column—saving space and perhaps helping performance. But both issues appeared to be implementation issues with an unsophisticated MySQL circa 2000, rather than a fundamentally wrong schema.

I agreed with her comment at the end about not valuing experience, but proper databasing should be taught to every developer at the undergraduate level, and somehow to the self-taught. Looking at comptia... they don't seem to have a db/sql test, only an "IT-Fundamentals" which touches on it.


Came here to say this. Whatever may have been true of InnoDB 20 years ago, don't follow this article's advice for any modern relational database.

To add: Not sure about MySQL, but `varchar`/`text` in PostgreSQL for short strings like those in the article is very efficient. It basically just takes up space equaling the length of the string on disk, plus one byte [1].

[1] https://www.postgresql.org/docs/current/datatype-character.h...


>>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.


>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

>How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

Ok, but that's not what normalization means.

If you have a table as described in the article that looks like:

foo | bar | baz

------------------

foo1 | bar1 | baz1

foo2 | bar2 | baz2

foo3 | bar3 | baz3

foo4 | bar1 | baz2

foo3 | bar1 | baz4

and then you say "ok, bar1 is now actually called id1", you now have a table

foo | bar | baz

------------------

foo1 | id1 | baz1

foo2 | bar2 | baz2

foo3 | bar3 | baz3

foo4 | id1 | baz2

foo3 | id1 | baz4

you haven't actually changed anything about the relationships in this data. You've just renamed one of your values. This is really a form of compression, not normalization.

Normalization is fundamentally about the constraints on data and how they are codified. If you took the author's schema and added a new column for the country where the IP address is located (so the columns are now ip, helo, from, to, and country), then the table is no longer normalized because there is an implicit relationship between ip and country--if IP 1.2.3.4 is located in the USA, every row with 1.2.3.4 as the ip must have USA as the country. If you know the IP for a row, you know its country. This is what 3NF is about. Here you'd be able to represent invalid data by inserting a row with 1.2.3.4 and a non-US country, and you normalize this schema by adding a new table mapping IP to country.
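
A minimal sketch of that normalization step (table and column names here are hypothetical, borrowing the article's field names):

    -- split the ip -> country dependency into its own table
    CREATE TABLE ip_country (
        ip      VARCHAR(45) PRIMARY KEY,   -- long enough for an IPv6 literal
        country CHAR(2) NOT NULL
    );

    -- the main table keeps only ip; country comes from a join
    SELECT g.ip, g.helo, g.m_from, g.m_to, c.country
    FROM greylist g
    JOIN ip_country c ON c.ip = g.ip;

After the split, a row pairing 1.2.3.4 with a non-US country simply can't be represented.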

But none of that is what's going on in the article. The author described several fields that have no relationship at all between them--IP is assumed to be completely independent of helo, from address, and to address. And the second schema they propose is in no way "normalizing" anything. The four new tables don't establish any relationship between any of the data.

>This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.

It's not "very narrowly" 3NF. It's pretty clear-cut! A lot of commenters here are referring to the second schema as the "normalized" one and to me that betrays a fundamental misunderstanding of what the term even means. And sure, if you had a different schema that wasn't normalized, then it wouldn't be normalized, but that's not what's in the article.


Your point is well taken. In the bare description provided for the first schema in the article, the data is already normalized, and all that can be achieved is compression.


> This is really a form of compression, not normalization.

First of all, the index hashes of the emails depend on the email strings, hence the indexed original schema is not normalised.

Secondly, it would not effectively be compression unless there were in fact dependencies in the data. But we can make many fair assumptions about the statistical dependencies. For example, certain emails/ips occur together more often than others, and so on. In so far as our assumptions of these dependencies are correct, normalisation gives us all the usual benefits.


I think I’ve determined very long ago that the only normal form worth keeping in mind is the Boyce-Codd normal form.

While you may technically be correct, I think there’s few people that think of anything else when talking database normalization.

That said, I cannot quickly figure out if it’s actually true here.


Okay, this will be my last comment in this thread because I don't have any new way of presenting this and checking this is not a good use of my time.

This schema is also in Boyce-Codd normal form. It's normalized in every usual sense of the word. Trivially so, even. It's not a question of being "technically" correct. If you think the second schema is more normalized than the first one, you need to re-evaluate your mental model of what normalization means. That's all there is to it.


Plus depending on the amount of emails they get, that optimisation could be unnecessary. That database schema was perfectly fine in some cases.


This is very slightly not quite true, because the HELO string and the remote address should go hand-in-hand.


You might be right (I don't actually know what a HELO string is, I don't know anything about SMTP :). I was just going off how the author presented the data, as a tuple of four completely independent things.

Of course the main point still stands, that the two schemas are exactly as normalized as each other.

Edit: rereading the original post, the author mentions that "they forged...the HELO"--so perhaps there was indeed no relationship between HELO and IP here. But again, I don't know anything about SMTP, so this could be wrong.


I do know about smtp and you were right regardless, because the author was talking about 4 database fields, not smtp. The details of smtp are irrelevant.


Normalisation depends on the semantics of the data, and so the details of SMTP are very much relevant.


incorrect


It's (somewhat) because the HELO is forged that there's no relationship between HELO and IP. The very first message of SMTP is "HELO <hostname>"; the hostname can either be a unique identifier (server1.company.com, etc.) or a system-level identifier (mta.company.com for all of your company's outbound mail agents, or in the case of bulk mailers they might use bulk.client1.com when sending as client1, bulk.client2.com, etc.). But there is/was no authentication on what you send as HELO (now you can verify it against the TLS certificate, though many implementations don't do that at all or well), so correlating based on the hostname in HELO was questionable at best. Thus, the combination of (HELO, IP) was best treated as a single value, as a tuple.


> But there is/was no authentication on what you send as HELO

Yep, and that explains the "foobar" rows - those should have resolved to the same IP, except because there's no authentication that blocks it you could put gibberish here and the SMTP server would accept it.

> so correlating based on the hostname in HELO was questionable at best

Eh, spambots from two different IPs could have both hardcoded "foobar" because of the lack of authentication, so I could see this working to filter legitimate/illegitimate emails from a compromised IP.


Right, useful as a signal in a bayesian filter most certainly, but there's no strong general rule.


Only if the SMTP client is following the RFC; being spammers, they probably sent anything but the actual domain they sent from.


> Normalization does not mean "the same string can only appear once".

It can, if you want to easily update an email address, or easily remove all references to an email address because it is PII.


That's not what normalization normally means, no. See eg wikipedia https://en.wikipedia.org/wiki/Database_normalization


My piano teacher once told me that it's common to make a mistake, but it's disastrous if you keep practicing without realizing that you have made a mistake. What I look for in a developer is their ability to realize a mistake and find ways to fix it.


The problem, I think is that most people, me included, don't really know what databases really do. There is a whole lot about optimizing procedural code, with a variety of tools, the dangers of premature optimization and the tradeoff with readability. Anyone with a computer science degree has heard about algorithmic complexity, caches, etc...

But databases are just magic. You try things out, usually involving a CREATE INDEX at some point, and sometimes it gets faster, so you keep it.

Rachel, in her blog post, is a good example of that thought process. She used a "best practice", added an index (there is always an index), and it made her queries faster, cool. I don't blame her, it works, and it is a good reminder of the 3NF principle. But work like that on procedural code and I'm sure we will get plenty of reactions like "why no profiler?".

Many, many programmers write SQL, but very few seem to know about query plans and the way the underlying data structures work. It almost looks like secret knowledge of the DBA caste or something. I know it is all public knowledge of course, but it is rarely taught, and the little I know about it is all personal curiosity.


> But databases are just magic.

100%!!!

I hate how developers talk about a "database" as a monolithic concept. It's an abstract concept with countless implementations built off of competing philosophies of that abstract concept. SQL is only slightly more concrete, but there are as many variants and special quirks of SQL dialects out there as there are databases.


Database indexes are definitely one of the most approachable concepts of databases for computer science majors. If you have studied data structures and Big O notation you have all the concepts needed.

While I agree that how a query planner works is one of the most ‘magic’ aspects, I think the output from the query planner in most databases is very approachable as well to regular common programmers and will get you quite far in solving performance issues. We know what full scans are (searching through every element of an array), etc.

The challenge is usually discovering that the jargon used in your database really maps to something you do already have a concept about and then reading the database documentation…


This was great. Clever twist at the end.

I cringed a little because some of those mistakes look like stuff I would do even now. I have a ton of front/back end experience in a huge variety of languages and platforms, but I am NOT a database engineer. That's a specific skillset I never picked up nor, admittedly, am I passionate about.

Sorry to go off on a tangent, but it also brings to mind that even so-called experts make mistakes. I watch a lot of live concert footage from famous bands from the 60s to the 90s, and as a musician myself, I spot a lot of mistakes even among the most renowned artists…with the sole exception of Rush. As far as I can tell, that band was flawless and miraculously sounded better live than on record!


Assuming this needs to be optimized for massive scale, just hash the values to a single indexed field.

And use something other than an RDBMS. Put the hash in Redis and expire the key; your code simply does an existence check for the hash. You could probably handle gmail with a big enough cluster.

That super-normalized schema looks terrible.


I too recommend using Redis in 2002.


Fair enough, but I think the issue here is recognizing that the underlying business operation is a hit test on an expiring hash key. You could have used MySQL as a key/value store or looked for something more specialized.


“Hash it and store in BerkeleyDB” would have been a ridiculously easy solution at the time.


++ correct answer here


I think the 'terrible schema' thing is a secondary issue. The important take away for me was this:

> Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?


Maybe I’ll change my mind in 5 years, but I have a hard time believing engineers over the age of 35 get chased off. What actually seems to be the case is that the field skews young because it’s rapidly growing, and older engineers choose to retire early because they can afford to.

Which results in the same problems but the root cause is different.


Unfortunately it is the penultimate sentence in a sizable post.


<pushes glasses up nose> Actually, the Correct Answer is a bloom filter.

(And, yes, we had math in the early 2000s.)

Snark aside, I'm frustrated for the author. Her completely-reasonable schema wasn't "terrible" (even in archaic MySQL)—it just needed an index.

There's always more than one way to do something. It's a folly of the less experienced to think that there's only One Correct Way, and it discourages teammates when there's a threat of labeling a solution as "terrible."

Not to say there aren't infinite terrible approaches to any given problem: but the way you guide someone to detect why a given solution may not be optimal, and how you iterate to something better, is how you grow your team.


To be fair, the “correct way” to do databases at that time was to use third normal form. Putting indices on string columns would have been considered a hack, much like using MySQL was.


3NF is largely orthogonal to indices on string columns, though. If you need to look something up by string fast, there's no avoiding an index of strings somewhere.


Plenty of databases at the time had pretty lame string indexing because it simply wasn't something people relied on if they wanted performance, and memory sayeth mysql's were both not particularly efficient and had some nasty technical limitations.

On the mysql she was using, breaking things out so it only needed to index ints was almost certainly a much better idea. On anything I'd be deploying to today, I'd start by throwing a compound index at the thing with the expectation that'd probably get me to Good Enough.


When has ever been putting indices on string columns "a hack"? If you're looking for that string, which seems to have been the case here in particular, you need that index.


We had a test database that contained 1 record. Nobody paid much attention since the focus was on the problem, not the database.

The database included several newly developed "stored procedures".

Time elapsed... and it was nearing the time to ship the code. So we tried to populate the database. But we could not. It turned out that the stored procedures would only allow a single record in the database.

Since a portion of the "business logic" depended on the stored procedures... well, things got "delayed" for quite a while... and we ended up having a major re-design of the back end.

Fun times.


Why do so many of us “forget” to populate the database with representative quantities of data until after we’ve baked in all of our poor design decisions?



Oops - missed that one. I guess there was more appetite for discussion.


> The rub is that instead of just being slow, it also cost a fair amount of money because this crazy vendor system charged by the row or somesuch. So, by scanning the whole table, they touched all of those rows, and oh hey, massive amounts of money just set ablaze!

Why _the hell_ is nobody mentioning that using a database that charges per row touched is absolute insanity? When has it become so normal that nobody mentions it?


You'll probably enjoy reading this article that was on the front page yesterday: https://briananglin.me/posts/spending-5k-to-learn-how-databa...


Thank you. I cannot say I enjoyed it, I am distressed now. Using a “planet scale” “serverless” database to store just 750k download records per day, and paying per DB row touched, and “thank you for letting us off the hook this time”. This reads like satire.


Nice. I was thinking of an earlier post involving cloud storage buckets with no index.


If the amount charged isn't proportional to the work done, the cloud provider would quickly go out of business.


My VPS provider gives me unlimited traffic, unlimited disk reads, and unlimited DB rows touched for fixed 30 dollars a month, regardless of how much CPU and RAM I keep loaded.

This makes me think that these things are at least 100x cheaper than AWS might want to make me believe.


On a single instance. The more instances you use, the more you will be charged. This service handles automatic scalability and resilience with multiple instances for you, allowing it to charge less than $30.


Sorry, no. The original schema was correct, and the new one is a mistake.

The reason is that the new schema adds a great deal of needless complexity, requires the overhead of foreign keys, and makes it a hassle to change things later.

It's better to stick to the original design and add a unique index with key prefix compression, which all major databases do these days. This means that the leading values get compressed out and the resulting index will be no larger and no slower than the one with foreign keys.

If you include all of the keys in the index, then it will be a covering index and all queries will hit the index only, and not the heap table.
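
A rough sketch of that suggestion, assuming a hypothetical greylist table with the article's column names:

    -- single composite index covering the whole quad plus the timestamp
    CREATE INDEX idx_quad ON greylist (ip, helo, m_from, m_to, time);

    -- the retry check reads only indexed columns, so it never touches the heap
    SELECT time
    FROM greylist
    WHERE ip = ? AND helo = ? AND m_from = ? AND m_to = ?;

(A UNIQUE index on just the quad would additionally enforce one row per tuple, at the cost of no longer covering the time column.)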


One thing that is worth taking into consideration is that this happened in 2002, when the databases were not in the cloud, the ops was done by DBAs, and the key prefix compression that's omnipresent today was likely not that common, or potentially not even implemented/available.

But I don't think the point of the post is what's the right/wrong way of doing it. The point, as mentioned by a few here, is that programmers make mistakes. They are costly and will be costly if, in the tech industry, we continue to boot experienced engineers… the tacit knowledge those engineers have gained will not be passed on, and this means more people have to figure things out by themselves.


>One thing that worth taking into consideration is that this happened in 2002. When the databases were not in cloud, the ops was done by dba’s

Are you implying that isn't the case today? Thousands of (big) companies are still like that and will continue to be like that.

I write my own SQL, design tables and stuff, submit it for a review by someone 10x more qualified than myself, and at the end of the day I'll get a message back from a DBA saying "do this, it's better".


If you have one, more power to you. I'm currently making a good living as a freelance DBA/SRE for startups. With a team of 3-5 devs, some of which frontend and/or mobile devs, proper database knowledge is definitely thin on the ground in some places.


The question of it being in the cloud or not is highly orthogonal to schema design.


In 2002, you started seeing major increases in query run time with as little as 5 joins. Once you hit six you had to start thinking about rearchitecting your data or living with slow response times.

There was a lot of pressure to relax 3NF as being too academic and not practical.

Around then, I had a friend who was using a pattern of varchar primary keys so that queries that just needed the (unique) name and not the metadata could skip the join. We all acted like he was engaging in the Dark Arts.


> When the databases were not in cloud, the ops was done by dba’s and key prefix compression thats omnipresent today was likely not that common or potentially not even implemented/available.

To my understanding, for example Firebird/Interbase had automatic key prefix compression as far back as early 2000s at the very least. I don't believe you could even turn it off.


You don't really need compression. Rows only need to persist for about an hour. The table can't be more than a few MiB.

We can debate the Correct Implementation all day long. The fact of the matter is that adding any index to the original table, even the wrong index, would lead to a massive speedup. We can debate 2x or 5x speedups from compression or from choosing a different schema or a different index, but we get 10,000x from adding any index at all.


Just to make this fun.

Adding an index now increases the insert operation cost/time and adds additional storage.

If insert speed/volume is more important than reads keep the indexes away. Replicate and create an index on that copy.


> Adding an index now increases the insert operation cost/time

that insert is happening after you've checked the table to see if the record is present. so two operations whose times we care about are "select and accept email" and "select, tell the sender to come back in 20 minutes, and then insert".

the insert time effectively doesn't matter, unless you've decided to abandon discussion of the original table entirely without mentioning it.


I agree. In the "new" FK-based approach, you'll still need to scan indexes to match the IP and email addresses (now in their own tables) to find the FKs, then do one more scan to match the FK values. I would think this would be significantly slower than a single compound index scan, assuming index scans are O(log(n))

Key thing is to use EXPLAIN and benchmark whatever you do. Then the right path will reveal itself...
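
For example, in MySQL (hypothetical table and values; the output format varies by version):

    EXPLAIN SELECT time
    FROM greylist
    WHERE ip = '1.2.3.4' AND helo = 'foobar'
      AND m_from = 'a@example.com' AND m_to = 'b@example.com';

A plan showing type ALL with no key means a full table scan; ref or range against a named index is what you want to see.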


And even if this is somehow not an ideal use of the database, it's certainly far from "terrible", and to conclude that the programmer who designed it was "clueless" is insulting to that programmer. Even if that programmer was you, years ago (as in the blog post).

Moreover, unless you can prove with experimental data that the 3rd-normal-form version of the database performs significantly better or solves some other business problem, then I would argue that refactoring it is strictly worse.

There are good reasons not to use email addresses as primary or foreign keys, but those reasons are conceptual ("business logic") and not technical.


She doesn’t mention the write characteristics of the system but she implies that it was pretty write heavy.

In that case it's not obvious to me that putting a key prefix index on every column is the correct thing to do, because that will get toilsome very quickly under high write loads.

Given that she herself wrote the before and after systems 20 years ago and that the story was more about everyone having dumb mistakes when they are inexperienced perhaps we should assume the best about her second design?


It's not an index on every column. It's a single index across all columns.


Which will be updated every time there is a new combination.


Which is OK. It's not like the whole index has to be rebuilt; inserting a new combination is an O(log N) update at worst.


I think it depends a lot on the data. If those values like IP, From, To, etc keep repeating, you save a lot of space by normalizing it as she did.

But strictly from a performance aspect, I agree it's a wash if both were done correctly.


Space is cheap now tho. Better to duplicate some data and avoid a bunch of joins than to worry about saving a few gb of space.


It’s not that easy: you need to consider the total size and cardinality of the fields potentially being denormalized, too. If, say, the JOINed values fit in memory and, especially, if the raw value is much larger than the key it might be the case that you’re incurring a table scan to avoid something which stays in memory or allows the query to be satisfied from a modest sized index. I/O isn’t as cheap if you’re using a SAN or if you have many concurrent queries.


This system was written in 2002.


Using space to avoid joins will not necessarily improve performance in an RDBMS -- it might even make it worse.


The inverse of your statement is also true. Normalizing the database to avoid duplicating data will not necessarily improve performance in an RDBMS - it might even make it worse.


> all major databases do these days

Did everyone on HN miss that the database in question was whichever version of MySQL existed in 2002?


Which I was using in production at the time, and, yeah, what she ended up doing was a much better plan for that thing.


It actually doesn't matter. Even simple indexes would have made that schema work just fine, and MySQL could certainly index strings in 2002.


> Sorry, no. The original schema was correct, and the new one is a mistake.

Well, you also save some space by doing this (though presumably you only need to index the emails, as IPs are already at most 128 bits).

But other than that, I'm also not sure why the original schema was bad.

If you were to build individual indexes on all four columns, you would essentially build the four id tables implicitly.

You can calculate the intersection of hits on all four indexes that match your query to get your result. This is linear in the number of hits across all four indexes in the worst case, but if you are careful about which index you look at first, you will probably be a lot more efficient, e.g. (from, ip, to, helo).

Even with a multi index on the new schema, how do you search faster than this?
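
Roughly, with hypothetical table and index names:

    CREATE INDEX idx_ip   ON greylist (ip);
    CREATE INDEX idx_helo ON greylist (helo);
    CREATE INDEX idx_from ON greylist (m_from);
    CREATE INDEX idx_to   ON greylist (m_to);

    SELECT time
    FROM greylist
    WHERE ip = ? AND helo = ? AND m_from = ? AND m_to = ?;

Whether the planner actually intersects all four (index merge) or just picks the most selective one depends on the engine and version, which is one argument for a single composite index instead.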


Deleted


> compute is cheap

There was a whole thread yesterday about how a dude found out that it isn't: https://briananglin.me/posts/spending-5k-to-learn-how-databa... (also mentioned in the RbtB post)


Depends on the provider. I've checked the pricing of 4 top relational database cloud vendors (Google, Amazon, Azure, IBM) and they all charge for the number of CPU cores and RAM you buy, not by the number of rows scanned like PlanetScale did in the post you refer to. They'll be more expensive than buying your own server but not nearly what PlanetScale charges for full table scan queries.


When do the rest of the folks figure it out?


To a degree, this is true though. An engineer's salary is a huge expense for a startup, it's straight up cheaper to spend more on cloud and have the engineer work on the product itself. Once you're bigger, you can optimize, of course.


Deleted


If this is a case from real life that you have seen, you should definitely elaborate more on it. Otherwise, you are creating an extreme example that has no significance in the discussion, as I could create similarly ridiculous examples for the other side.

There are always costs and benefits to decisions. It seems that you are only looking at the costs and none of the benefits?


Deleted


If you're suggesting that we should double the size of the hardware rather than add a single index, then I respectfully disagree.

But your point is well-taken. Hardware is cheap.


> The observation was that we could probably store the IP address, HELO string, FROM address and TO address in a table, and send back a 4xx "temporary failure" error the first time we saw that particular tuple (or "quad"). A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.

I've run into this form of greylisting, it's quite annoying. My service sends one-time login links and authorization codes that expire in 15 minutes. If the email gets delayed, the user can just try again, right? Except I'm using AWS SES, so the next email may very well come from a different address and will get delayed again.


You could have your codes expire after an hour or day instead?


That's a solution, but with tradeoffs:

* security, an attacker has more time to intercept and use the links and codes.

* UX, making the user wait 15+ minutes to do certain actions is quite terrible.

I've had a couple support requests about this. The common theme seems to be the customer is using Mimecast, and the fix is to add my sender address in a whitelist somewhere in their Mimecast configuration.


BTW, SES now offers unique sending IPs.


I'd really love to be snarky here but I'll try to be polite: all those comments about the example situation are missing the whole point of the post. And it really worries me that there is a good chunk of the tech workers that just ignores the real meaning of something and just nitpicks about stupid implementation details. The post is about managing rookie errors, being empathetic, and also warning about the ageism that pervades the sector. TBH about this last point IDK the situation nowadays in Silicon Valley, but in Europe my limited experience is that ageism is not that bad; it's actually difficult to find seasoned developers.

Edit: typos


I don't buy it. If the only point was managing rookie errors, etc., the blog post shouldn't have been ~15 paragraphs of technical discussion and 2 paragraphs at the bottom of "my point is ..." You can't advocate that people just ignore 80% of the article because it's "not the point of the post."

I'm sure a good chunk of tech workers are worried that when an influential blogger writes something like this a couple hundred developers will Google 3NF and start opening PRs.


It's a narrative. A story. She told it in that way to lure people who get hooked on technical discussion, and then make the point that they probably would have missed or ignored. Without a narrative example, she could have just said "I made mistakes too, live and learn" but she chose to provide an example before presenting the thesis.

And hey, what's so bad about people learning about 3NF? Are you not supposed to know what that is until you're some mythical ninth level DBA?


Sure, I don't mind that being the hook, and controversial posts often create the most educational discussion. I only take exception to the idea that we shouldn't debate it because it's "not the point" (specifically for technical advice; being over-pedantic is a thing).

I certainly don't mean to criticize the author or say that they shouldn't have posted the article. If articles needed to be 100% correct then no-one has any business writing them. And it's still open to debate whether the article is even wrong or misleading.

> Are you not supposed to know what that is until you're some mythical ninth level DBA?

I think the world might be a better place if people _did_ wait until they were ninth level DBAs :p

Jokes aside, I have no problem with people learning 3NF. I'm more concerned that people can be too quick to take this sort of rule-of-thumb advice to heart (myself included). And in my opinion, database optimization is too complex to fit well into simple rules like "normalize when your query looks like this." My only 'rules' for high-frequency/critical tables and queries is "test, profile, load test, and take nothing for granted." But I know just enough to know that I don't know enough about SQL databases to intuit performance.


If she told it to lure people who get hooked on technical discussion, then it’s not fair for you to complain about said people analyzing said technical discussion.


But if she told it to show people who get hooked on technical discussion that this often makes them miss a more important point, then they certainly proved her right.


The problem with her point is that it doesn't go anywhere. Mentorship is obviously good and ageism is obviously bad, and she doesn't have specific suggestions on how to solve any of the problems, so the hook is the most substantial thing to talk about.


The reason "the details are important" here are not because of the nitty gritty around what mistakes a "novice" programmer made.

They are important because the present incarnation of the author is making all the wrong diagnoses about the problems with the original implementation, despite doing it with an air of "Yes, younger me was so naive and inexperienced, and present me is savvy and wise".


She explained the problem, the first (non-working) solution, and the second (working) solution as they really happened in 2002, as an example. The real point is the last part of the post. And it was not "how to properly implement an SQL-based filter for open proxies in your MTA".


I get all that. But the 2020 version of this person still does not understand the problem, and she is arguing that she does while oddly self-deprecating the inexperienced version of herself, who arguably had a better solution to begin with.


Yes, I think an alternate design was found that didn't hit as many MySQL limitations as the previous one. This improved performance and was a win. But the post-mortem diagnosis was lacking.


I think she inadvertently made a different point, which is that even experienced developers sometimes misunderstand the problem and make mistakes.

Or an even better argument: you don't need to actually understand the problem to fix it, often you accidentally fix the problem just by using a different approach.


I would argue instead that this comment thread is making the point that people forget that things that work now wouldn't've worked then and design decisions have to be made based on the database engine you're running in production.


MySQL could index strings in 2002. It would have worked fine.


"mysql could index strings" and "using a compound index over four varchar columns would've worked out well" are significantly different propositions.


Yes, we won't know because she didn't try it. And we know she didn't try it because the problem she described is table scans, not "indexed strings are somehow slow in MySQL circa 2002".


Or she did try it and it didn't work - or researched the question and figured, quite reasonably based on my experience of mysql in that era, it probably wouldn't work - and kept the post to only the original problem and the final approach to avoid having even more technical details in a post that wasn't really about those.

I agree that we don't know, but it seems a little unfair to her to treat every unknown as definitely being the most negative of the possibilities.


I really don't understand your general thrust here. MySQL certainly had lots of issues in 2003, but being able to support multi-column indexes was not one of them. Her analysis is simply wrong - it is wrong now and was wrong then. Here is the doc from MySQL v3.23:

7.4.2 Multiple-Column Indexes

MySQL can create composite indexes (that is, indexes on multiple columns). An index may consist of up to 16 columns. For certain data types, you can index a prefix of the column (see Section 7.4.1, “Column Indexes”). A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns. MySQL uses multiple-column indexes in such a way that queries are fast when you specify a known quantity for the first column of the index in a WHERE clause, even if you do not specify values for the other columns


As I said already: "mysql could index strings" and "using a compound index over four varchar columns would've worked out well" are significantly different propositions.

To be more verbose about it - there is an important difference between "can be created" and "will perform sufficiently well on whatever (likely scavenged) hardware was assigned to the internal IT system in question."

I wouldn't be surprised if the "server" for this system was something like a repurposed Pentium 233 desktop with a cheap spinning rust IDE drive in it, and depending on just how badly the spammers were kicking the shit out of the mail system in question that's going to be a fun time.


A composite index is literally just concatenating the fields and then indexing that value. This is not technology that was out of reach of MySQL in 2002 and there is no reason to presume it was so when TFA clearly described the problem as a complete lack of indexes.


They are not significantly different. It just means sorting using more than one column.


You know that for performance the new tables would've needed a string index as well?


A single-column string index is a different beast to implement compared to a multi-column varchar index.


I'm happy the HN comments are nitpicking. I'm not particularly well versed in database schemas, but while reading it I was going "???????", and it's good to know I'm not going crazy.


Yeah I was confused because the new design looked like an obvious anti-pattern to me, in contrast to adding an index to the original table. And then I was wondering whether I had a completely false understanding of what database normalization means (especially since this is the first time I’d heard of 3NF).


Just because the author was intending to get across a certain point, doesn't mean the implementation details aren't worth discussing too. I don't think many people here would disagree with the central point, so what's there to discuss about it? I think it's uncharitable to assume that everyone talking about the implementation details is "missing the point".


There's probably also some sense of people trying to paint the lesson of "sometimes rookie errors aren't actually errors, and your seniors will tell you you're doing something wrong, when in actuality their solution is worse than what you came up with."

Which I guess is saying the article's setup story makes the exact opposite point that it's trying to?


As a 48 year old who lives in San Diego (so not Silicon Valley but geographically much closer than Europe) and still works as a programmer and who still enjoys coding and continually fights any attempt to put me into any sort of management role...

I think ageism is a valid concern, but it's also not a black and white issue. While I know quite a few older folks like me who are still constantly keeping up to date, I've also known some people my age or just a little bit older who basically stagnated into irrelevance as experts in now-discarded technology and never really moved on, so the line between what is ageism and what is people just not bringing value to the table anymore can be a bit blurry.

I'm by no means trying to suggest ageism isn't an issue in tech. I do see echoes of it when I look around, and if I suddenly lost the network of people I work with, have worked with in the past, etc., who understand I'm still very technically 'flexible' for an older person, I'm pretty sure it would be difficult to find a job, in the practical sense of often getting pre-filtered out just based on the age thing. So my heart really goes out to people who do find themselves in situations where they are capable but can't find work due primarily to age.

But, uh, the issue is somewhat complicated and very case by case, I think.


The real meaning being that people learn and gain experience over time? Is this really something we need to read two pages of text to find out? I think people are justifiably miffed at having read two pages for such a trivial message. Not only that, but these "stupid implementation details" could seriously mislead someone who really is a clueless newbie.


I think you are leaping to conclusions here. It is possible that most people want to keep HN focused on technical conversations and use this as an opportunity to learn something (I certainly do), which is why you are seeing many more comments on the technical aspects.


In this blog post, senior engineer Rachel talks nonsense about normalization and promotes a bafflingly complicated solution as superior to a simple one, without identifying the actual fix, which appears to have happened accidentally.

In other words, with enough empathy and patience, a clueless rookie can grow into a clueless senior engineer!

Rachel usually makes more sense than that. That's why people are nitpicking implementation details.


Hah, pretty accurate and kinda funny, but could be nicer. I still make mistakes.


Meanwhile, Rachel posted an update. Apparently, the whole section about how "the system got reworked" doesn't describe the proper solution, but her own (Rookie Rachel's) attempt to fix it.

And in that context, everything makes sense.


Heh, at the end I think I may have been called out for my harsh response here on HN to the post being referenced at the start, since it seemed to raise some hackles. Borrowing from other social media terminology, I've been "subtweeted", but by a blog post; not sure how I feel about that.

Honestly it almost makes me feel compelled to start writing more for the industry but to be honest, my opinion is most ideas are not great, or not novel, up to and including my own. So writing about them seems likely not to be beneficial. I guess you could argue I could let the industry be the judge of that though.


I'm probably going to get smacked a bit for this, but a lot of the comments here focus on a better fix for the 20-year-old problem the author relates only in order to illustrate a point, and that's a very HN kind of response. And I get it.

Years ago someone asked one of my grandkids what I did for a living and he told them "He's a fix it guy". That's because when he'd come over I'd be working on something, and he'd ask what I was doing, and I'd tell him "I'm fixing" this or that.

There are a lot of "fix it" people here. It's what we do.


   id | ip_address | from | to | content | updated_at | created_at
And indexes where you need them. Normalize data like IP addresses, from and to. And you are good to go.
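
A sketch of what that might look like in MySQL-flavored DDL (types and the particular index are assumptions):

    CREATE TABLE messages (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ip_address VARBINARY(16)   NOT NULL,  -- raw IPv4/IPv6 bytes rather than text
        from_id    INT UNSIGNED    NOT NULL,  -- FK into a normalized addresses table
        to_id      INT UNSIGNED    NOT NULL,
        content    TEXT,
        updated_at DATETIME        NOT NULL,
        created_at DATETIME        NOT NULL,
        KEY idx_ip_created (ip_address, created_at)
    );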


One of the most useful lessons I ever learned about designing database schemas was the utter uselessness of indexing and searching on datetime fields. Since virtually every value is going to be different, indexing and searching on that field is (almost) no better than having no index on the datetime field at all.

It was a revelation to me when I decided to experiment with having an indexed date-ONLY field and using that to search instead. It improved performance by almost two orders of magnitude. In hindsight, this should have been totally obvious if I had stopped to think about how indexing works. But, as is stated in this article, we have to keep relearning the same lessons. Maybe I should write an article about how useless it is to index a datetime field...

EDIT: This was something I ran into about 10 years ago. It is possible there was something else going on at the time that I didn't know about that caused the issue I was seeing. This is an anecdote from my past self. I have not had to use this technique since then, and we were using an on-prem server. It's possible that the rest of the table was not designed well and index I was trying to use was already inefficient for the resources that we had at the time.


This is a very strange, and incorrect, conclusion you have come to. It doesn't matter that datetime fields are unique, as most of the time you are not searching for a particular datetime, but a range. I.e. "show me all of the rows created between date X and Y". In that case, the ordering of the index makes it efficient to do that query.

Furthermore, oftentimes your query will have an "ORDER BY dateTimeCol" clause on it (or, more commonly, ORDER BY dateTimeCol DESC). If your index is created correctly, it means the database can return rows with a quick index scan instead of needing to implement a sort as part of the execution plan.
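
For instance, a sketch against a hypothetical orders table:

    CREATE INDEX idx_created_at ON orders (created_at);

    -- range scan over the index, no full table scan needed
    SELECT id, created_at
    FROM orders
    WHERE created_at >= '2021-11-01' AND created_at < '2021-11-08'
    ORDER BY created_at DESC;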


This entirely depends on what kind of index is used. A sorted index, such as a B or B+ tree (used in many SQL databases), will allow for fast point/range lookups in a continuous value space. A typical inverted index or hash based index only allows point lookups of specific values in a discrete value space.


I was just going to make this comment.

Postgres as far as I know uses B-tree by default.

You can switch sort order I think for this as well, so "most recent" becomes more efficient.

Multi-column indexes also work; if you are just searching on the first column, Postgres can still use the multi-column index.


It was a sorted BTREE index in MySQL 5.x. I agree that it's supposed to be fast, but it just wasn't for some reason.


Are you sure it was actually using the index you expected? There are subtleties in index field order that can prevent the query optimizer from using an index that you might think it should be using.

One common misstep is having a table with columns like (id, user, date), with an index on (user, id) and on (date, id), then issuing a query like "... WHERE user = 1 and date > ...". There is no optimal index for that query, so the optimizer will have to guess which one is better, or try to intersect them. In this example, it might use only the (user, id) index, and scan the date for all records with that user. A better index for this query would be (user, date, id).
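
A sketch of that misstep and the fix, reusing the column names above (the events table name is hypothetical):

    -- existing indexes
    CREATE INDEX idx_user_id ON events (user, id);
    CREATE INDEX idx_date_id ON events (date, id);

    -- the query: equality on user, range on date
    SELECT id FROM events WHERE user = 1 AND date > '2021-01-01';

    -- better fit: equality column first, then the range column
    CREATE INDEX idx_user_date_id ON events (user, date, id);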


One of the more bizarre things we'd found in MySQL 5.something was that accidentally creating two identical indexes significantly slowed down queries that used it.

I wouldn't be surprised if you hit some sort of similar strange bug.


> Since virtually every value is going to be different, indexing and searching on that field is (almost) no better than having no index on the datetime field at all.

Do you mean exact equality is rarely what you want because most of the values are different?

Or are you talking about the negative effect on index size of having so many distinct values?

I think the latter point could be quite database-dependent, eg. BTree de-duplication support was only added in Postgres 13. However, you could shave off quite a bit just from the fact that storing a date requires less space in the index than a datetime.


B-Tree indexes perform just fine on datetime fields. Were you using hash or inverted indexes maybe?


You're maybe thinking of compression? The primary id is unique too, but an indexed search is still O(log n) compared to a full table scan (O(n)).


There are more index types than just hashes (which do indeed only support equality).

E.g. B-trees allow greater-than/less-than comparisons.

postgres supports these and many more: https://www.postgresql.org/docs/9.5/indexes-types.html


I was using a BTREE index and we were doing greater/less than queries. It was not a hash index.


Did you also learn that certain indexes allow for range queries, and modern databases are quite efficient at those?

And yes, please write an article, it would be quite interesting. With data and scripts to reproduce, of course.


Your result is surprising: I suspect your wide table with an index wasn't in cache, and your timestamp-only table was.

The whole point is that indexes (can) prevent full table scans, which are expensive. This is true even if the column(s) you're indexing have high cardinality: but it relies on performant table or index joins (which, if your rdbms is configured with sufficient memory, should be the case).


I have no idea how you came to this conclusion, but indices on datetime fields are completely required to bring down the seek time from O(n) to O(log(N)), plus the length of the range. Massive parts of the businesses I've worked at would be simply impossible without this.

The cardinality of the index has other ramifications that aren't generally important here.


If you were doing something like this then your story makes sense

`WHERE start_date > x AND end_date < x`

You can't index two ranges at once in a B-tree!


people in IT are loners, lone wolves, lone sharks, sometimes alone and also self-centered. for the reasons in the article, and then for so many other reasons. for the reason that online education favors the 'I'm a top learner on my own' principle. etc.

the under-35-thing is just another reason. another is the ever-so-new technologies that obliterate all previous knowledge of obvious stuff. another is the fact that 'history of computing' is not something that you need to learn to start doing shiny webpages.

then there's the reason that so many ppl bash at universities, at in-person schooling, at work-together places (a.k.a offices).

and because someone is going to ask me what I did to change it - here's what: for almost 20 years now i've been teaching introductory or intermediate classes in-person to hundreds of students, and observing what helps them learn as fast as possible. trying to speed it up, to assist the process. i can tell you one thing - working together and understanding that you can always learn from someone that is next to your shoulder is of paramount importance.

after all - what good is gender or race diversity in the workplace, if you effectively do not know how to work alongside these people...? but this is too long for just a post here.

the ability, the skill, to be able to work with someone, with everyone, is what the vast majority of IT top-coders lack. and lack badly and painfully (for those around mostly). look around yourself for examples...

...

the day before, I was thinking that Larry Wall's TIMTOWTDI principle is really about celebrating diversity in IT, about different approaches to the same problem, and not about celebrating diversity in Perl or any other language.


Oh well played!! I was setting up to give the author a hard time about being judgemental, particularly because storing IPs or UUIDs as strings is a mistake I've seen some pretty darn good devs make. Some folks just aren't super strong on schema design and performance but are great at other things.

Plus MySQL kind of rocks. Fight me. There are some interesting optimizations that Percona has written about that may improve the performance of OPs schema. The learning never ends.

TBH I haven't seen a shortage of mentors. In fact most of our dev team is 35+ and are very approachable and collegial and our internal mentoring programs have been extremely successful - literally creating world experts in their specific field. I don't think we're unique in that respect.

I think it's rather unfortunate that the top comment here is a spoiler. Actually reading the post and going on the emotional rollercoaster that the author intended to take you on is the point. Not the punch line.


> Plus MySQL kind of rocks.

Agreed. I prefer Postgres for personal projects, but MySQL is a fine database. Honestly, even a relational DB I wouldn't want to use again (DB2...) is still pretty solid to me. The relational model is pretty damn neat and SQL is a pretty solid query language.

I wonder how many people disagree with that last part in particular...


It’s actually the core of all dev that involves any data storage, today. HN is both aspirational in tech and has a younger demographic so you won’t get any points here for mentioning anything SQL. Hence the downvotes I’m getting. But hey, it’s all good!!


There is a time and a place for a denormalized schema. If you are trying to search a large data-set a denormalized schema in a single table with some good indexing is much preferable to joining across 30 tables.

Whether you choose normalized or denormalized depends very much on a) what you need to do now b) what you may need to do in the future. Both are considerations.


> The first time you encounter something, you're probably going to make some mistakes. There's a post going around tonight about how someone forgot to put an index on some database thing and wound up doing full table scans (or something like that).

Which post is referenced here?

I also believe this to be a tooling issue. It's often opaque what ends up being run after I've done something in some framework (Java JPA, Django queries, whatever): how many queries (are N+1 issues at bay?), how the queries will behave, etc. Locally with little data everything is fine, until it blows up in production. But you may not even notice it blowing up in production, because that relies on someone having instrumented the DB and pushing logs+alarms somewhere. So it's easy to remain clueless.



I think there's an easy "best of both worlds" take here:

1. The majority of the performance problem could've and probably should've been summarized as "you need to use an index". (Maybe there were MySQL limitations that got in the way of indexing back then? But these days there aren't.)

2. Everyone makes mistakes! New programmers make mistakes like not knowing about indexes. Experienced programmers make mistakes like knowing a ton about everything and then teaching things in a weird order. All of these things are ok and normal and part of growing, and it's important that we treat each other kindly in the meantime.


Based on my own war stories, if it was an early '00s mysql then I expect "just throw an index at the varchar columns" would've helped but not nearly as much as what she actually did.


You don't need to store an IP in a different table as ID/IP; an IP can be represented as a 128-bit integer. A small "mistake" I've made before that makes some things a little more annoying - not that it's particularly important.
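
One way to do that in MySQL (5.6+ for the INET6_* helpers), sketched with hypothetical names:

    CREATE TABLE seen_ips (
        ip VARBINARY(16) NOT NULL PRIMARY KEY   -- raw bytes, covers both IPv4 and IPv6
    );

    -- convert between text and binary form at the edges
    INSERT INTO seen_ips (ip) VALUES (INET6_ATON('2001:db8::1'));
    SELECT INET6_NTOA(ip) FROM seen_ips;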


A story of how a junior dev becomes a little less junior.

The original implementation was the 'obvious' one, and not terrible.

Then experience of the live system + MySQL demonstrated problems with that.

So a non-obvious but better-performing solution was implemented instead.

Great! Junior dev walks away with a mistrust of both long repeated strings appearing in a DB table, AND a mistrust of database internals. A fine lesson indeed.

(And remember, this was the era of the all-powerful DBA who could ruthlessly de-normalise your schema at a moments notice to improve performance).


The initial design was fine, and btw it did not violate 3NF, actually (if you add an auto-increment PK). A single composite index would've most likely solved all the problems.


The popular advice in these comments to use indexes sounds really good. I haven't thought it out fully, but my first thought was "this sounds like a DIY columnar store". A columnar database does a lot of this work for you. I'm not 100% it's a good option in this case, because things like email subjects are not good columnar data. I would like to know more about how problematic that is.


While the article is not really about the database schema, could you do a table with just two fields?

1) hash of concat(ip & helo & m_from & m_to)

2) time
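
Roughly, as a sketch (names and hash choice are assumptions; the hash could just as well be computed in the MTA):

    CREATE TABLE greylist_hash (
        quad_hash  BINARY(20) NOT NULL PRIMARY KEY,  -- SHA-1 of the concatenated tuple
        first_seen DATETIME   NOT NULL
    );

    -- insert on first sight
    INSERT INTO greylist_hash (quad_hash, first_seen)
    VALUES (UNHEX(SHA1(CONCAT_WS('|', ?, ?, ?, ?))), NOW());

    -- on retry, allow the mail through if enough time has passed
    SELECT first_seen FROM greylist_hash
    WHERE quad_hash = UNHEX(SHA1(CONCAT_WS('|', ?, ?, ?, ?)));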


Guess this was one of the first implementations of greylisting :)

This was not an issue of being in a vacuum; it was an issue of not reading documentation beforehand, which for some reason is acceptable in the development world.

I don't think there are so many people attempting to perform surgery or to build a bridge without reading a number of textbooks on the topic.

And this is why back in the day everyone said RTFM


I've more or less thrown in the towel on "not being clueless anymore".

Sure, I'll try to study to keep myself on my feet, but there is soo much to learn that I'd rather focus on the technologies needed right _now_ to get things done.

Way more often than not, "thing A has to be done with tech B", and the expectation of getting it done without multiple footguns is gone.


What I got from this post is a headache after reviewing the Wikipedia entries on 3NF (and all of its dependencies: 1NF, 2NF, also 3.5NF). Man, the language is dense.

j/k, always nice to review some of this "basic" stuff (that you barely use in everyday work but it's good to have it in the back of your head when changing something in the db schema)


I would add a counterpoint to foreign key constraints: they make deletions and updates a lot more costly [1].

[1] https://shusson.info/post/postgres-experiments-deleting-tabl...


I was preparing to rant about the tone that the post was taking towards the "clueless programmer" until I got to the end. Well done...

Still not sure why adding an index didn't work instead of breaking out into multiple tables indexed by IDs.

Besides, this use case screams for "just store the hashes and call it a day".


Not terribly convinced by arguments about normalization for normalization's sake. As with most things, the definitive answer is "it depends". It's not the 1990s anymore. https://youtu.be/8Fb5Qgpr03g


> id | ip | helo | m_from | m_to | time | ...

I'm not a database person, but this doesn't seem too bad if there was at least an index on IP. The performance there would probably be good enough for a long time depending on the usage.

But is 6 tables (indexed on the value) really better than 1 table with 6 indexes?


It's quite hard to create something that works optimally on a first try. I think I never managed to do so! The first schema was slow, but hey, it worked, good job! That is a huge first step. Modifying it so it runs much faster can definitely come next.


> this crazy vendor system charged by the row or somesuch

I don't really understand the text, but from what I could gather this is far worse than a forgotten index.

You should set them in most cases, but it doesn't really reduce lookup speed if you don't also add foreign keys.


Err. Wat. The original schema was (mostly) fine. It had no indexes. The second schema looks like it tried to work around lack of database optimization features. It's in no way "better" from a data design standpoint.

A database with good string index support isn't doing string comparisons to find selection candidates - at least, not initially.

What a bizarrely confident article.


> lack of database optimization features

That's a perfect description of MySQL as of 2002.


MySQL, at least as far back as 2000, had indexes.

http://web.archive.org/web/20020610031610/http://www.mysql.c...


I wouldn't bet on those indexes optimizing anything back then. MySQL was legendary for implementing the necessary standards (to some degree) while being neither reliable nor efficient.


Indexes in MySQL worked fine in 2002, at least for the use case described in this article.


That's pretty beside the point though, the concept of indexing had existed for decades even prior to that.


The quote was "database optimization features" and the scope was "MySQL as of 2002".

Of course, even my old DBase II handbook talks about indexes - all that's old hat. MySQL had them, too.

MySQL also used to have a long-earned reputation as a toy database though, and MySQL in 2002 was right within the timeframe where it established that reputation. So yeah, you could add indexes, but did they speed things up (as in them being an "optimization feature")? Public opinion was rather torn on that.


I was there and I don't remember it like you describe. MySQL was certainly not a toy database in 2002. I learned PHP from a book called "PHP and MySQL". They went together like white on rice. Every forum software used it. Hell, even today Uber uses MySQL at a VERY large scale to power their schemaless database.


I was there, too, and I wasn't alone in my assessment:

https://www.databasejournal.com/features/mysql/article.php/2... (to pick a contemporary example using that exact word) suggests that the 2003 release of mysql4 ends the "toy status". Others might have had different thresholds, but that reputation was hard earned and back then still pretty apropos (to a declining degree over time because MySQL caught up). I lost data to MySQL more than once (if memory serves but it's been a while: MyISAM was a mess, the InnoDB integration not ready yet)

To handwave two extreme positions of those days (because I'm too lazy to look up ancient forums where such stuff has been discussed at length), it's very possible that different groups approached things differently:

Members of a PHP User Group probably didn't consider MySQL a toy but the best thing since sliced bread, given how closely aligned PHP was to MySQL back then (it got more varied over time, but that preference can still be felt in PHP projects)

OTOH when you had a significant amount of perl coders in your peer group, it was rather likely that some of them could go on for days about how _both_ PHP and MySQL are toys when anybody dared to talk about PHP (which ties in neatly with the original post and its reference about "The One")

[bonus: more recent example: http://www.backwardcompatible.net/145-Why-is-MySQL-still-a-t...]


> The whole system was reworked to have a table for each set of values. There was now a table of IP addresses, a table of HELO strings, a table of FROM addresses, and a table of TO addresses.

Lisp interning, in database tables. A.k.a. the Flyweight Pattern.


to me the first implementation seems the best, it just needed a multi-column index

the 3NF "optimized" form seems worse and unreadable. I would advise normalizing only when it's needed (complex/evolving model); in this case it was not needed


The "NoSQL" movement -- which I wisely rejected -- is also responsible for this.


All "movements" have some merit, or they wouldn't exist in the first place. Nobody starts a "movement" just because. It exists because it solves something. "NoSQL" was a solution to something. Today you can benefit from best of all worlds, and still NoSQL has a place and a use case, just like anything else that exists.


I will play devil's advocate

>and still NoSQL has a place and a use case, just like anything else that exists.

What actually makes you believe that it's NoSQL that has "some use cases" and relational databases that are the "default", rather than NoSQL/non-relational being the default?


I don't think there is (or should be) a "default". Default selections suggest they are picked "just because". There is a problem/use-case and a tool to solve it. Sometimes it's relational, sometimes it's nosql, sometimes it's both or even none...


There are problems that nosql databases solve easily that require a lot of fancy tricks in a RDBMS.


Huh? I strongly prefer relational databases for most things but this particular example is actually begging for a simple key-value store.


The real takeaway here is that DB performance optimization (imo performance in general) is freaking hard, and even people who'd successfully optimized something like OP often totally misunderstand why it actually worked. Pay for expert advice.


I once ended up doing 'rm -rf *' by mistake - https://ashishb.net/tech/my-rm-rf-moment/


I wonder how many people here are throwing stones while living in glass houses.


I’m glad people are critiquing the technical parts of the post because it had me questioning everything I know about databases, which really isn’t a whole lot.


I've got a few war stories myself, I've seen things you people wouldn't believe.

In this case it's important to recognize it was 2002, early days. This story just doesn't read the same in today's context.


If there are millions of entries coming in per hour, I can see the value.

Otherwise, how long is their data retention? Wouldn’t there be more value in cleaning out entries older than, say, two hours?


The normalized solution sounds like a performance nightmare?! Stick the original data into something like Clickhouse? (After checking whether indexing would fix it for a few years more)


What is worse is Prisma obscuring actual SQL queries with a query language that puts usability first and allows you to quickly build prototypes that just don't scale for n=1000.


The one thing I would add here is that the dev lead (or equivalent - there always is someone) should have added at least one test. Not a unit test or a 99% coverage test but a "does it do what we want" test - a golden path test.

I came across something like this with an apprentice recently. I let the lack of a repeatable test pass because of speed / time / crunch etc., and the obvious "how can that possibly fail" code of course failed. The most basic test would have caught it, in the same way some basic performance testing would have helped Rachel however many years back.

It's hard this stuff :-)


I did this (bad schema, too much scanning) too.

I also ran newfs on a running Oracle database. (That did vastly less damage than you would have thought, actually.)


Why not put a hash index on the four columns? By normalising, you get an index, but maybe a B-tree, which isn't best here for exact matching?


The responses in this thread are classic pedantic HN with the actual point whoooshing over so many people's head.


I think it actually makes quite a bit of a difference if 20 years later the author still does not understand the actual problem and solution she is using as an example. It actually does undermine her argument that people make mistakes, learn from them and grow etc if the lack of growth is still in evidence 20 years later in the singular example under consideration.


No I'd say the responses in this thread are classic pedantic HN with users like you taking the high moral ground and looking down on people who actually are trying to address the technical issues with a post in a technical forum.


We've banned this account for breaking the site guidelines. Creating accounts to do that with will eventually get your main account banned as well, so please don't!

https://news.ycombinator.com/newsguidelines.html


Proud of myself for recognizing the conceit (pun intended) here pretty quickly. Good point in the end too!


I’m so grateful for rachelByTheBay.


> You know what it should have been? It should have been normalized.

No, it should have had a unique index on the first five columns.

(Oooh, I just hate it so much when someone gets up on their high horse and opens a blog post with a long condescending preamble about how some blogger got it wrong, and then they get it even wronger.)


Or store the ip address as int64 with index, or better both solutions together


orr.. could have switched to https://man.openbsd.org/spamd greylisting a few months later in 2003


I think this post showed us what an engineer on Mt. Stupid looks like. Apparently this design is wrong. We should rely on the indexing ability of the database itself, instead of inventing our own indexing system.


Sorry Rachel but the best solution was a simple fix as mentioned by most people here - columns having the right types and indexes.

When I was still wet behind the ears I was doing what you are now proposing - but it's overkill.


I’m pretty sure all the people who are criticizing the database design haven’t read the ending. The article isn’t about the schema, it’s about helping those entry-level programmers.


It would be more effective at making that point if it didn’t have the confused and arguably incorrect section in the middle. If as a writer, the audience misses your point, maybe it’s a problem with your writing rather than the audience.


The problem is that the technical narrative in the post contributes to imposter syndrome and gaslights entry level programmers into thinking they’re doing database design wrong.


You don’t think it’s helpful for people to steer entry level programmers away from bad ideas like this one?


> what about you?

Will try to keep making money coding


short answer: composite indexes


They could have just made a hash of the 4 values and indexed on it...
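
A rough sketch of that idea in modern MySQL (generated columns didn't exist back then, and the `checks` table and column names are assumptions):

    -- store a fixed-width hash of the quad and index that instead of four
    -- long strings; the 0x1F separator avoids ambiguous concatenations
    ALTER TABLE checks
      ADD COLUMN quad_hash BINARY(20)
        AS (UNHEX(SHA1(CONCAT_WS(0x1F, ip, helo, m_from, m_to)))) STORED,
      ADD INDEX idx_quad_hash (quad_hash);

    SELECT id FROM checks
    WHERE quad_hash = UNHEX(SHA1(CONCAT_WS(0x1F, ?, ?, ?, ?)));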


Wow what a twist. Somebody call M Night Shyamalan!


This post is bizarre, precisely because there is nothing particularly wrong about the original schema, and the author seems to believe that the problem is that the column values were stored as strings, or that the schema wasn't in "third normal form".

Which is nonsense. The problem with the original DB design is that the appropriate columns weren't indexed. I don't know enough about the problem space to really know if a more normalized DB structure would be warranted, but from her description I would say it wasn't.


Please make your substantive points without calling names. This is in the site guidelines:

https://news.ycombinator.com/newsguidelines.html


Based on this piece of old documentation I found [0] for MySQL 3.23 (the most recent version in 2002 as far as I can tell), certain types of indices were only available on certain types of engines. Furthermore, columns were restricted to 255 characters, which may be too short for some of the fields saved in the database.

Modern databases abstract away a lot of database complexity for things like indices. It's true that these days you'd just add an index on the text column and go with it. Depending on your index type and data, the end result might be that the database turns the table into third normal form by creating separate lookup tables for strings, but hides it from the user. It could also create a smarter index that's less wasteful, but the end result is not so dissimilar. Manually doing these kinds of optimisations these days is usually a waste of effort or can even cause performance issues (e.g. that post on the front page yesterday about someone forgetting to add an index because mysql added them automatically).

All that doesn't mean it was probably a terrible design back when it was written. We're talking database tech of two decades ago, when XP had just come out and was considered a memory hog because it required 128MB of RAM to work well.

[0]: http://download.nust.na/pub6/mysql/doc/refman/4.1/en/create-...


The fact remains that whether the text column existed on her original table, or whether it was pulled out to a normalized table, literally all of the same constraints would apply (e.g. max char length, any other underlying limitations of indexing).

The issue is that her analysis of what was wrong with her original table is completely wrong, and it's very weird given that the tone of her "present" self is that it's so much more experienced and wise than her "inexperienced, naive" self.

My point is that she should give her inexperienced self a break; all that was missing from her original implementation were some indexes.


Multiple-long-column compound indices sucked in old mysqls if you could even convince it to use them in the first place.

Being able to look up each id via a single-string unique index would've almost certainly worked much better in those days.


I used MySql a lot back then, had many multi-column indexes, and never had an issue.

More importantly, given the degree of uniqueness likely to be present in many of those columns (like the email addresses), she could have gotten away with not indexing on every column.


Your first sentence was, I'm afraid, very much not a universal experience.

Your second is quite possibly true, but would have required experimentation to be sure of, and at some point "doing the thing that might be overkill but will definitely work" becomes a more effective use of developer time, especially when you have a groaning production system with live users involved.


Based on: https://web.mit.edu/databases/mysql/mysql-3.23.6-alpha/Docs/...

I'd say there's very little performance gain in normalizing (it usually goes the other way anyway: normalize for good design, avoiding storing multiple copies of individual columns; de-normalize for performance).

I'm a little surprised by the tone of the article - sure, there were universities that taught computer science without a database course - but it's not like there weren't practical books on DB design in the 90s and onward?

I guess it's meant as a critique of the mentioned, but not linked other article "being discussed in the usual places".


Knowing only what we can guess about the actual problem, I'd say indexing ip ought to be enough to make everything super fast.


> All that doesn't mean it was probably a terrible design back when it was written. We're talking database tech of two decades ago, when XP had just come out and was considered a memory hog because it required 128MB of RAM to work well.

A lot of this stuff was invented in the 70s, and was quite sophisticated by 2000. It just wasn't free, rather quite expensive. MySQL was pretty braindead at the time, and my recollection is that even postgres was not that hot either. We're very lucky they've come so far.


I think if anything, all of it could've been put into a single indexed column since the query was AND ... AND ... not OR.

So you could've had a indexed column of "fingerprint", like

ip1_blahblah_evil@spammer.somewhere_victim1@our.domain

And indexed this, with only a single WHERE in the query.

I don't understand at all how multiple tables thing would help compared to indices, and the whole post seemed kind of crazy to me for that reason. In fact if I had to guess multiple tables would've performed worse.

That is if I'm understanding the problem correctly at all.


This has been the advice given to me by Postgres experts in a similar scenario:

  "If you want to efficiently fuzzy-search through a combination of firstname + lastname + (etc), it's faster to make a generated column which concatenates them and index the generated column and do a text search on that."
(Doesn't have to be fuzzy-searching, but just a search in general, as there's a single column to scan per row rather than multiple)

But also yeah I think just a compound UNIQUE constraint on the original columns would have worked

I'm pretty sure that the degree of normalization given in the end also goes well beyond 3rd Normal Form.

I think that is 5th Normal Form/6th Normal Form or so, almost as extreme as you can get:

https://en.wikipedia.org/wiki/Database_normalization#Satisfy...

The way I remember 3rd Normal Form is "Every table can stand alone as its own coherent entity/has no cross-cutting concerns".

So if you have a "product" record, then your table might have "product.name", "product.price", "product.description", etc.

The end schema shown could be described as a "Star Schema" too, I believe (see image on right):

https://en.wikipedia.org/wiki/Star_schema#Example

This Rachel person is also much smarter than I am, and you can make just about anything work, so we're all bikeshedding anyways!


>I'm pretty sure that the degree of normalization given in the end goes also well beyond 3rd-Normal-Form.

> I think that is 5th Normal Form/6th Normal Form or so, almost as extreme as you can get

The original schema is also in 5NF, unless there are constraints we aren't privy to.

5NF means you can't decompose a table into smaller tables without loss of information, unless each smaller table has a unique key in common with the original table (in formal terms, no non-trivial join dependencies except for those implied by the candidate key(s)).

The original table appears to meet this constraint: you could break it down into, e.g., four tables, one mapping the quad id to the IP address, one mapping it to the HELO string, etc. However, each of these would share a unique constraint with the original table, the quad id; hence the original table is in 5NF.

As for 6NF, I don't think the revised schema meets that: 6NF means you can't losslessly decompose the table at all. In the revised schema, the four tables mapping IP id to IP address etc. are in 6NF, but the table mapping quad id to a unique combination of IP id, HELO id, etc. is not: it could be decomposed into four tables similarly to how the original table could be.

(Interestingly, if the original table dropped the quad ID column and just relied on a composite primary key of IP address, HELO string, FROM address and TO address, it would be in 6NF.)


Would it not prevent some optimizations based on statistics regarding the data distribution, like using the most selective attribute to narrow down the rows that need to be scanned? I'm assuming there are 2 indexes, 1 for each column that gets combined.

Let's say you know the lastname (Smith) but only the first letter of the firstname (A) - in the proposed scenario only the first letter of the firstname helps you narrow down the search to records starting with the letter (WHERE combined LIKE "A%SMITH"), you will have to check all the rows where firstname starts with "A" even if their lastname is not Smith. If there are two separate indexed columns the WHERE clause will look like this:

WHERE firstname like "A%" AND lastname = "Smith"

so the search can be restricted to smaller number of rows.

Of course having 2 indexes will have its costs like increased storage and slower writes.

Overall the blog post conveys a useful message but the table example is a bit confusing, it doesn't look like there are any functional dependencies in the original table so pointing to normalization as the solution sounds a bit odd.

Given that the query from the post is always about concrete values (as opposed to ranges) it sounds like the right way to index the table is to use hash index (which might not have been available back in 2002 in MySql).
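
In Postgres terms, a sketch of that (hash indexes there cover a single column and only equality lookups, so you would hash the most selective column or an expression over the quad; names are assumed):

    -- hash index on one equality-matched column:
    CREATE INDEX idx_checks_ip_hash ON checks USING hash (ip);

    -- or a hash expression index over a canonical concatenation of the quad:
    CREATE INDEX idx_checks_quad_hash ON checks
      USING hash ((ip || chr(31) || helo || chr(31) || m_from || chr(31) || m_to));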


I won't pretend to be an expert in this realm, but see here:

https://www.postgresql.org/docs/current/textsearch-tables.ht...

Specifically, the part starting at the below paragraph, the explanation for which continues to the bottom of the page:

  "Another approach is to create a separate tsvector column to hold the output of to_tsvector. To keep this column automatically up to date with its source data, use a stored generated column. This example is a concatenation of title and body, using coalesce to ensure that one field will still be indexed when the other is NULL:"
I have been told by PG wizards this same generated, concatenated single-column approach with an index on each individual column, plus the concatenated column, is the most effective way for things like ILIKE search as well.

But I couldn't explain to you why
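
Condensed from the linked docs, the generated tsvector column plus GIN index looks roughly like this (pgweb, title and body are the names used in the docs' own example):

    ALTER TABLE pgweb
      ADD COLUMN textsearchable_index_col tsvector
        GENERATED ALWAYS AS (to_tsvector('english',
          coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;

    CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);

    SELECT title FROM pgweb
    WHERE textsearchable_index_col @@ to_tsquery('create & table');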


Oh, so by fuzzy search you mean full text search, not just a simple 'LIKE A%' pattern - it's an entirely different kind of flying altogether (https://www.youtube.com/watch?v=3qNtyfZP8bE).

I don't know what kind of internal representation is used to store tsvector column type in PG, but I think GIN index lookups should be very simple to parallelize and combine (I believe a GIN index would basically be a map [word -> set_of_row_ids_that_contain_the_word] so you could perform them on all indexed columns at the same time and then compute intersection of the results?). But maybe two lookups in the same index could be somehow more efficient than two lookups in different indexes, I don't know.

I'm still sceptical about the LIKE/ILIKE scenario though. "WHERE firstname LIKE 'A%' and lastname LIKE 'Smith' " can easily discard all the Joneses and whatnot before even looking at the firstname column, whereas "WHERE combined_firstname_and_lastname LIKE 'A%SMITH' " will only be able to reject "Andrew Jones" after reading the entire string.


Also, the "better" solution assumes IPv4 addresses and is not robust to a sudden requirements change to support IPv6. Best to keep IP address as a string unless you really, REALLY need to do something numerical with one or more of the octets.


Based on my experience I disagree. If there is a native data type that represents your data, you should really use it.

It will make sure that incorrect data is detected at the time of storage rather than at some indeterminate time in the future and it will likely be more efficient than arbitrary strings.

And in case of IP addresses, when you are using Postgres, it comes with an inet type that covers both IPv4 and IPv6, so you will be safe from requirement changes.
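
A small sketch of what that looks like in practice (table name made up):

    CREATE TABLE client_addrs (
      id   bigserial PRIMARY KEY,
      addr inet NOT NULL          -- accepts both 192.0.2.10 and 2001:db8::1
    );

    -- malformed input is rejected at insert time:
    INSERT INTO client_addrs (addr) VALUES ('192.0.2.10');

    -- and subnet containment comes for free:
    SELECT * FROM client_addrs WHERE addr << '192.0.2.0/24';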


Keeping IP addresses as strings isn't trivial. You can write them in many ways, even with IPv4. IPv6 just brings new variations. And when talking about migration to IPv6 you have to decide if you keep IPv4 addresses as such or prefix them with ::ffff:.

In the end you have to normalize anyways. Some databases have specific types. If not you have to pick a scheme and then ideally verify.


That way you can get false positives, unless you concatenate using a character that is not valid in the protocol:

foo@example.com + other@example.net vs. foo@example.co + mother@example.net (both concatenate to the same string)

Btw, storing just the domains, or inverting the email strings, would have sped up the comparison


Right, and when you do that, you don't even need an RDBMS. A key-value store would suffice. This essentially just becomes a set! Redis or memcached are battle-tested workhorses that would work even better than a relational DB here.

But this was also back in the early '00s, when "data store" meant "relational DB", and anything that wasn't a RDBMS was probably either a research project or a toy that most people wouldn't be comfortable using in production.


> this was also back in the early '00s, when "data store" meant "relational DB", and anything that wasn't a RDBMS was probably either a research project or a toy that most people wouldn't be comfortable using in production.

Indeed; her problem looks to have been pre-memcached.


Yeah, I'm not clear on how multiple tables fixed it other than allowing you to scan through the main table faster.

Multiple tables could be a big win if long email addresses are causing you a data size problem, but for this use case, I think a hash of the email would suffice.


Well she did say she had indexes on the new table. It could have been fixed with an index on the previous table, but a new table with indexes also fixed it.


As a 25 year “veteran” programmer, I’m glad I’m not the only one that likes the original schema (with indexes added). A fully normalized table is a lot more difficult to read, troubleshoot, and reason about. You would need a pretty good query to poke around the data.


When I first started using postgres I was amazed at how much of what I thought was "database knowledge" was actually "working around limitations of the versions of mysql I'd been using" knowledge.

These days, sure, the original schema plus indices would work fine on pretty much anything I can think of. Olde mysqls were a bit special though.


The point is she was being hyper critical of her past schema to make a larger point.


And IIUC, the larger point, made by the twist ending, is that we shouldn't be so critical, but try to actually help newbies learn what we insist that they should know.


It seemed to me the point was that if the industry remains designed to push out experienced engineers, then there won't be anyone to mentor the engineers that are just beginning their careers. Further, we will bend the productivity of the field downward and have less effective systems.


In what way is the industry designed that way now? Maybe some get sucked into management roles, but I don’t see any evidence of a broad design to push them out.


Look at the way we screen resumes, or valorize work conditions incompatible with having family obligations. Ageism in the tech industry is rife.


Or how leetcode-heavy hiring favors folks that tend to be fresher out of college & folks who have time to prepare for them.


Role count pyramids, volume and inflexibility of working hours, abusive management techniques, a constant treadmill of the new hotness, more...

Some offices avoid many of the challenging patterns but not the vast majority. A lot of it is natural result of the newness of the industry and the high stakes of investment. Many of the work and organizational model assumptions mean that offering flexibility to employees complicates the already challenging management role. Given the power of those roles, this means workers largely must fit into the mold rather than designing a working relationship that meets all needs. As the needs change predictably over time a mounting pressure develops.

I'm only 17 years in but it has gotten harder and harder to find someone willing to leave me to simply code in peace and quiet and dig my brain into deep and interesting problems.


I have mixed feelings about this. On one hand, I appreciate the larger point, and pointing out mistakes again goes against that point. On the other, the post had so many technical details it overshadowed the larger point, especially for something that wasn't obviously broken.


Phew.. I was reading this post thinking “is basically every table I’ve ever made wrong?!”


I definitely got a spike of imposter syndrome thinking, oh fuck, I've been in software this long and haven't learned this yet?!


For modern databases, the extra normalisation is an incremental optimisation that you usually won't need.

But I think she's if anything been in the industry longer than I have and the versions of mysql I first ran in production were a different story.


Yeah… technically that’s (the solution) normalization, but really not a bad design originally. If you never want a particular atom of data repeated, yes you have to normalize each column, and that is efficient in the long run for storage size, but in more normal data storage that still means a lot of joins and you still want indices. On a modern SSD you could use that one table, throw on some indices as you suggest and not notice it again until you hit millions of rows.

Anything you are doing conditional logic or joins on in SQL usually needs some sort of index.


I especially feel the proposed solution could be problematic given that the data is supposed to be transient. How are keys deleted from the "id | value" tables when rows are deleted from the main table? Does it just keep accumulating rows or is there something like a GC query running occasionally that deletes all unreferenced values?


Clearly it worked.

However...

Consider the rows `a b c d e` and `f b h i j`. How many times will `b` be stored in the two formulations? 2 and 1, right? The data volume has a cost.

Consider the number of cycles to compare a string versus an integer. A string is, of course, a sequence of numbers. Given the alphabet has a small set of symbols, you will necessarily have leading characters repeated as soon as you store more values than there are characters, so the database has to perform at least a second comparison. I imagine you're correctly objecting that in this case the string comparison happens either way, but consider how this combines with the first point. Is the unique table on which the string comparisons are made smaller? Do you expect that the table with combinations of values will be larger than the table storing comparable values? Through this, does the combinations table using integer comparisons represent a larger proportion of the comparison workload? Clearly she found that to be the case. I would expect it to be the case as well.


> Clearly it worked.

No it didn't. Someone else put an index on the main table and that worked.


The article declared that it was running in production and satisfying the requirement. Do you have a different definition of working? I say that to clarify what I was attempting to communicate by "it worked".

I think you're remarking about the relationship between normalized tables and indexes. That relationship does exist in some databases.

Please say more if I'm missing your point.


GP's point is that the final schema has a primary key index on the main table that solved the problem, and then it has a bunch of useless additional tables. The solution was just to add a primary key index on the main table. Adding the additional tables just slows down the query because it now has to do five index lookups instead of one.


Using IDs may lead to a performance gain if it leads to a smaller database: fewer bytes to read/write/cache => less I/O volume, and better cache hits compensating for the potentially higher number of seeks on mass storage. This is especially true when seeks are quick (SSD...).

Therefore, as long as the size of the data type (C language's "sizeof") used for an ID is smaller than the average size of the column contents, using an ID will very probably lead to a performance gain.

On some DB states and usage patterns (where commonly used data+index cannot fit in RAM: the cache (DB+OS) hit ratios are < 0.99) this gain will be somewhat proportional (beware of diminishing returns) to the ratio (total data+index size BEFORE using IDs)/(total data+index size AFTER using IDs).

Creating queries then becomes more difficult (one has to use 'JOIN'), however there are ways to alleviate this: using views, "natural join"...

Some modern DB engines let you put data into an index, in order to spare an access to the data when the index is used (PostgreSQL: see the "INCLUDE" parameter of "CREATE INDEX"). As far as I understand, using a proper ID (<=> on average smaller than the data it represents) will also lead to a gain(?)
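
A sketch of that last point in PostgreSQL 11+ (table and column names are hypothetical):

    -- the lookup can be answered from the index alone, because the extra
    -- column is carried in the index as payload:
    CREATE INDEX idx_quads_ip ON quads (ip_id) INCLUDE (helo_id);

    -- candidate for an index-only scan:
    SELECT helo_id FROM quads WHERE ip_id = 42;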


Also you could just store IPv4 as an unsigned INT directly, rather than strings in a different table and an unsigned INT as a foreign key.


I have regretted every single time I've gotten cute about storing IP addresses as scalars in SQL.


What issues did you run into? Having admittedly never done this before, it feels like whenever you need the dotted-quad IP you could just query:

    SELECT INET_NTOA(ip) FROM ips WHERE...
...in MySQL at least. (Which might be why MySQL has this function built in?)

I guess if you ever need to match on a specific prefix storing the numeric IP might be an issue?


> I guess if you ever need to match on a specific prefix storing the numeric IP might be an issue?

If you want to search for range, you can do:

   SELECT INET_NTOA(ip) FROM ips WHERE ip > INET_ATON(:startOfRange) AND ip < INET_ATON(:endOfRange)


Because, as shocking as this might sound, unlike the characters in the movie The Matrix I don't have a built-in INET_NTOA function in my retina to see the string form of the IP addresses when glancing at rows of a large table looking for patterns or a certain IP.


How often are you looking at the full contents of a table vs. looking at the results of a query that you could throw an INET_NTOA onto?


Many. You also expect me to remember how to spell INET_NTOA and which fields to use it on. What if I wanted to do a quick "SELECT * FROM"? I barely know how to spell English words; what makes you think I'm going to remember how to spell INET_NTOA?


I realize you’re describing a general problem and not just this particular example, but fwiw NTOA and friends got a lot easier for me once I learned enough that I stopped trying to “remember how to spell” them and started thinking about them in terms of the underlying concepts, like “Network (as in big-endianness) to ASCII”. Learning the significance of the term “address family” was a big clue for me for self-directed learning in this space :)


dbeaver CE is free.


Idunno the real villain is that bullshit, costly "per-row" pricing


I usually store IPv4 as an unsigned int, and IPv6 as two 64-bit unsigned ints. But supporting both variants requires that I have a table for each type and a glue table to reference the record in the correct table. Storing simple stuff quickly turns complicated…
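
Roughly like this, as a sketch (all names made up for illustration):

    CREATE TABLE ipv4_addr (
      id   BIGINT UNSIGNED PRIMARY KEY,
      addr INT UNSIGNED NOT NULL,
      UNIQUE (addr)
    );

    CREATE TABLE ipv6_addr (
      id      BIGINT UNSIGNED PRIMARY KEY,
      addr_hi BIGINT UNSIGNED NOT NULL,
      addr_lo BIGINT UNSIGNED NOT NULL,
      UNIQUE (addr_hi, addr_lo)
    );

    -- the "glue" table points at whichever variant applies
    CREATE TABLE host_addr (
      id    BIGINT UNSIGNED PRIMARY KEY,
      v4_id BIGINT UNSIGNED NULL,
      v6_id BIGINT UNSIGNED NULL,
      FOREIGN KEY (v4_id) REFERENCES ipv4_addr (id),
      FOREIGN KEY (v6_id) REFERENCES ipv6_addr (id)
    );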


Is there much point in optimising this as opposed to keeping it in strings? At what QPS do you think it would matter, and by how much?


The main reason to store it as numbers is because I select records based on what subnet an IP address is in. I can only do this when I use integers.
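
A hedged sketch of what that looks like with integer storage in MySQL (table and column names assumed):

    -- everything in 10.1.2.0/24, via a mask on the stored integer:
    SELECT * FROM hosts
    WHERE ip_num & 0xFFFFFF00 = INET_ATON('10.1.2.0');

    -- or as a range, which an ordinary B-tree index on ip_num can serve:
    SELECT * FROM hosts
    WHERE ip_num BETWEEN INET_ATON('10.1.2.0') AND INET_ATON('10.1.2.255');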


> This post is bizarre ... Which is nonsense ...but from her description I would say it wasn't.

This is what the post is about.

And btw, if they hadn't normalized those tables, it'd instead have been pointed out to them that their "post is bizarre..."


OK so change the article's technical solution from "normalize" to "add index". That doesn't change the point of the article.


Is this correct? Would indexing the columns instead of moving the values to another table lead to the same increase in performance?


Moving the data into another table would still require indexes: one on the original table's column (which now stores the new id) and one on the new table's primary key.

In most cases I'd expect just adding an index to the original table to be more efficient, but it depends on the type of the original column and if some data could be de-duplicated by the normalization.


The same O(N) -> O(log N) improvement for queries, yes. The constant factor on the separate tables might be better. It’s also a more complicated design.


Given this is a "from the start of her career" story, I'm guessing she was running similar versions of mysql to the versions I started with, and if my guess is correct then probably not.

On anything you're likely to be deploying today, just throwing a compound index at it is likely quite sufficient though.


An index is often easily viewed as another table. So, yes, it should.

(Some subtleties on projected values and such, but the point stands.)


Yes, the proposed ‘better’ structure basically amounts to building your own indexes. Normally, I’d assume it is better to use the RDBMS’s own engine to do that, don’t roll your own.

There may well be some subtlety to the indexing capabilities of MySQL I’m unaware of though - I could easily imagine myself making rookie mistakes like assuming that it has the same indexing capabilities. So, to the post’s point - if I were working on a MySQL db I would probably benefit from an old hand’s advice to warn me away from dangerous assumptions.

On the other hand I also remember an extremely experienced MS SQL Server DBA giving me some terrible advice because what he had learned as a best practice on SQL Server 7 turned out to be a great way to not get any of the benefits of a new feature in SQL Server 2005.

Basically, we’re all beginners half the time.


As a former SQL Server DBA, most SQL Server DBAs are absolute gurus on 1 or 2 specific versions of SQL server, and massively ignorant about anything newer.

It's a strange role where, in order to do your job well, you kind of have to hyper-specialize in the specific version of the tech that your MegaCorp employer uses, which is usually a bit older, because upgrading databases can be extremely costly and difficult.


Then this post is very misleading.


I wouldn't get too vehement on it. Knowing how to normalize isn't a bad thing. And, if you intricately know your write and read needs, you may be able to assist the work more appropriately.


It would be faster, I believe. As we’re only looking for existence, a single index traversal is all the I/O we would need to do.


Sure would! I think it would be marginally better, in fact, because you would just need to look at the index rather than five tables. Access would be more local.


The moral of this post really falls flat coming from this author, most of whose posts are snarky, passive-aggressive rants where she rants about someone's alleged incompetence or something similar.


Personal attacks are not allowed on HN.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.

Edit: you've been breaking the site guidelines repeatedly. We ban accounts that do that. I don't want to ban you, so please stop doing this!


The ending:

> It's a massive problem, and we're all partly responsible. I'm trying to take a bite out of it now by writing stuff like this.

suggests to me that maybe she has recently had a change of heart about such rants. I'm willing to give her the benefit of the doubt, because we all need to give each other the benefit of the doubt these days.


This is a really weird ad hominem that is unrelated to the post itself.


Ad hominem?

It seems pretty relevant to me that the author mostly writes snarky articles which directly contradict the moral of this story.

For the record, I totally agree with the moral of the story. But we can't just blindly ignore context, and in this case the context is that the author regularly writes articles where someone else is ridiculed because the author has deemed them incompetent.

In a similar vein, no one would just ignore it and praise Zuckerberg if he started discussing the importance of personal privacy on his personal blog.


I have to agree with the OC. I read this post, the reply post (https://rachelbythebay.com/w/2021/11/07/select/) and "The ONE" (https://rachelbythebay.com/w/2018/04/28/meta/) post and I too was struck by the passive aggressive/defensive tone.

To me it really clashed with the original post, which concludes we need guidance/mentorship/reference points. Most of the comments on Hacker News are constructive or genuinely curious; they don't just dismiss the author, and instead provide things to learn from.

For the author to then write a reply and dismiss all these comments as comments that are really saying "I am THE ONE. You only need to know me. I am better than all of you." is just plain rude and completely counter to their own point.


[flagged]


The post has a link to an update at the bottom for you if you missed it


[flagged]


Why? She's clearly not a clueless newbie now.


I think the OP was sarcastically implying that she still is.


Such remarks generally backfire onto the author, as indeed in this case.


Right, not a newbie anymore


[flagged]


Personal attacks will get you banned here. Please review https://news.ycombinator.com/newsguidelines.html and don't do any more of this on HN.

We detached this subthread from https://news.ycombinator.com/item?id=29140272.


She has never been "clueless" and has written insightfully for years.


[flagged]


Without knowing which database engine (MySQL comes with more than one) she was using, nor the testing that she actually performed, what makes you say that?


I remember versions of mysql whose string indexing was sufficiently limited on every available backend that this would've been the correct change.

Hell, back in ... 2001? I think? I ended up making exactly the same set of changes to an early database design of mine, for pretty much the same stated reasons and with very similarly pleasant effects on its performance.


imo “clueful”/clueless is more about being able to sense situations where there is something to be known rather than knowing everything upfront all the time (which is obviously impossible)


The point isn't to determine who is or isn't clueless. The point is how we deal with each other's cluelessness.


[flagged]


Personal attacks will get you banned here. Please review the site guidelines and make your substantive points without swipes in the future.

https://news.ycombinator.com/newsguidelines.html


Rachel is a very, very smart lady with a hell of a track record, who also has strong ethics, something the world needs more of.

I think you missed the point of the article. Your reply must have felt great for you to write, though I'm not sure anyone got anything out of that but you.


This is exactly the attitude warned against in TFA.


I think this blog fails to take into account that 2021 is not 2002. Computer Science is a much more formal/mainstream field of study now, and people don't operate "in a vacuum with no mentorship, guidance, or reference points.”

Some comments on the previous blog post raised important questions regarding minimum understanding/knowledge of technology one utilizes as part of their day job. And I would agree that indexing is a fundamental aspect while using relational databases.

But unfortunately this isn't very uncommon these days; way too often I have heard people at $bigco say "let's use X, everyone uses X" without completely understanding the implications/drawbacks/benefits of that choice.


I don't see how 2002 was ever different in this from 2021 (except maybe for a stronger international dimension today).


Why not store the IP Address as a 32-bit number (IPv4 addresses)? Why store it as a string in the first place? This is something I did not quite get. Also, wouldn't it be better to split out the domain from the email address for ease of filtering?

Also, how does performing joins give a performance advantage here? I'm assuming there would be queries against at least one, but up to 4, of the value tables to get the IDs of the items in the quad. Then there would be a lookup in the mapping table.

I have worked for some time in this industry, but I have never had to deal with relational databases (directly; query tuning and such were typically the domain of expert db people). It would be interesting to see an explanation of this aspect.

EDIT: To people who may want to comment, "You missed the point of the article!": no, I did not, but I want to focus on the technical things I can learn from this. I agree that ageism is a serious problem in this industry.


You could store the IP as a 32 bit unsigned int. There's no issue with that, but I would probably recommend nowadays to use an internal cidr or inet type if your database supports that. It probably wouldn't be better to split the email address from the domain for filtering since email address formats tend to be a bit tricky.

Joins give a performance advantage due to the fact that you aren't duplicating data unnecessarily. The slow query in question becomes five queries (4 for the values, one for the final lookup) which can each be done quickly, and if any one of them returns nil, you can return a timeout.
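
Roughly, as a sketch (table and column names are assumed, not the post's actual schema):

    -- four cheap unique-index lookups; a miss on any of them means the quad
    -- has never been seen, so you can answer immediately
    SELECT id FROM ips   WHERE ip     = ?;
    SELECT id FROM helos WHERE helo   = ?;
    SELECT id FROM froms WHERE m_from = ?;
    SELECT id FROM tos   WHERE m_to   = ?;

    -- then one lookup against the small quad table using integer ids
    SELECT time FROM quads
    WHERE ip_id = ? AND helo_id = ? AND from_id = ? AND to_id = ?;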


Yes, some of the databases support native ip address types that can be manipulated efficiently. Better to use that (like inbuilt json querying capabilities in postgres) than come up with your own.

It is still not clear how it is better to do up to 5 separate queries, or maybe a few joins, than to store the strings and construct indices and such on them. Is the idea that the cost of possibly establishing a new socket connection for some concurrent query execution, or of sequentially executing 5 queries, is still < possible scans of the columns (even with indexing in effect)? Also, even if you had integers, don't you need some sort of an index on integer fields to avoid full table scans anyway?


I guess I've come away with a totally different takeaway than most. This post is rather strong on the blame game, which could be fixed by one thing...

RTFD! (Read The F**in Docs!) - I only skimmed the post, but it's definitely something that would have been avoided had some SQL documentation or introduction been read.

I'd argue one of the huge things that differentiates "senior" developers from all the "other" levels - we're not smarter or more clever than anyone else - is that we read up on the tools we use, see how they work, and read how others have used them before... I understand this was from 2002, but MySQL came out in 1995 - there was certainly at least a handful of books on the topic.

Perhaps when just starting off as an intern, you may be able to argue that you are 'operating in a vacuum' but any number of introductory SQL books or documentation could quickly reveal solutions to the problem encountered in the post. (Some of which are suggested in these comments).

Of course we all make mistakes in software - I definitely could see myself creating such a schema and forgetting to add any sort of indexing - but when running into performance issues later, the only way you'll be able to know what to do next to fix it is by having read literature about details of the tools you are using.

Operating in a vacuum? Then break out of it and inform yourself. Many people have spent many hours creating good documentation and tutorials on many, many software tools - use them.


The problem here is "unknown unknowns," coupled with the real business pressure to get solutions out the door. The MySQL manual is (currently) over 6,000 pages. A developer new to SQL or relational DBs doesn't know how many of those are necessary to read and comprehend to deploy a working solution.

And in this case, she designed and deployed the system, and it worked and met the business needs for several months. When performance became an issue, she optimized. I'm sure she had plenty of other unrelated work to do in the meantime, especially as a lead/solo dev in the early 2000's.

Sounds like a productive developer to me.



