Hacker News

Yep. Folks are getting lost in the weeds discussing indexing of database tables. That's _totally_ beside the point here.

The thing is, the first implementation was a perfectly fine "straight line" approach to solve the problem at hand. One table, a few columns, computers are pretty fast at searching for stuff... why not? In many scenarios, one would never see a problem with that schema.

Unfortunately, "operating in a vacuum with no mentorship, guidance, or reference points" is normal for many folks. She's talking about the trenches of SV; it's even worse outside of that, where the only help you might get is smug smackdowns on Stack Overflow (or worse, the DBA Stack Exchange) for daring to ask about such a "basic problem".




Yes and no. This post highlights a subtle issue with mentorship (which I think is important): technology does not stand still. What was true in 2002 might not be true today. While adopting the naïve approach was detrimental back then, today's databases recognise that this happens and provide easy workarounds, which didn't exist back then, to get you out of trouble.

I've experienced this just by switching languages. C# had many articles dedicated to how much better StringBuilder was compared to String.Concat, and yet other languages would do the right thing by default. I would give advice in a totally different language about a problem that the target language did not have.

As the song goes:

"Be careful whose advice you buy but be patient with those who supply it

Advice is a form of nostalgia, dispensing it is a way of fishing the past

From the disposal, wiping it off, painting over the ugly parts

And recycling it for more than it's worth"


That might be true if you just take blanket advice. The key is to find out why say StringBuilder is better than String.Concat. If you understand the implementation details and tradeoffs involved, this makes the knowledge much more applicable in the future. The core concepts in technology do not move nearly as fast as individual projects, libraries, frameworks, languages, etc...


You wrote: <<While adopting the naïve approach was detrimental back then, today databases recognise that this happens and provide easy workarounds to get you out of trouble that didn't exist back then.>>

Do you have a specific example that would help in the case described in Rachel's blog post?

I am still making (and finding) occasional indexing and 3NF mistakes. In my experience, it is always humans finding and fixing these issues.


But trust me on the sunscreen.


Totally off topic I know, but what is the concern with String.Concat ?


If you repeatedly use Concat to build up a string, the amount of time grows quadratically. This is because the string is copied each time you concat. Note that the + operator on strings gets turned into a call to Concat.

https://docs.microsoft.com/troubleshoot/dotnet/csharp/string...
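The growth is easy to see if you just count bytes copied. Here's a small sketch (in Python, purely for illustration; the cost model is the same for C#'s Concat, since every immutable concat copies both operands into a fresh string):

```python
def concat_cost(chunk_len, n_chunks):
    """Total bytes copied when appending n_chunks chunks one at a time
    via immutable concatenation (each concat copies the whole result)."""
    total_copied = 0
    result_len = 0
    for _ in range(n_chunks):
        # An immutable concat copies both operands into a new string.
        total_copied += result_len + chunk_len
        result_len += chunk_len
    return total_copied

# Doubling the number of chunks roughly quadruples the work: O(n^2).
print(concat_cost(10, 100))  # 50500
print(concat_cost(10, 200))  # 201000
```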


I was wondering about it in the context of "yet, other languages would do the right thing by default". Repeatedly concatenating to the same string (as opposed to concatenating an array of strings in one go) would be slow in any language I know of, unless you allocate a larger buffer up front, which is what StringBuilder does.

Some languages have mutable strings, but you would still need to allocate a sufficiently larger buffer if you want to add strings in a loop.


In languages that have immutable string semantics and can detect that they are the exclusive owner of a string, they can modify it in place. Reference-counting languages like Python can easily determine exclusive ownership by checking for a reference count of 1. See:

https://github.com/python/cpython/blob/main/Objects/unicodeo...

Rust's ownership system ensures that there cannot exist mutable and immutable references to the same object at the same time. So if you have a mutable reference to a string it is ok to modify in place.

I have not confirmed either the Python example or the Rust example too deeply, since I'm replying to a 5-day-old comment. But the general principle holds. GC languages like C# and Java don't have easy ways to check ownership, so they always copy strings when mutating them. Maybe I'll write a blog post on this in the future.
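You can poke at the CPython side from user code. Hedged caveat: the in-place resize on += is a CPython implementation detail, not a language guarantee, and other Python implementations may always copy:

```python
import sys

# CPython can resize a str in place on += when it can prove it owns
# the only reference. sys.getrefcount() counts its own argument too,
# so an exclusively-owned string usually reports 2 on standard CPython.
s = "x" * 1000
print(sys.getrefcount(s))

# Once the string is shared, += must copy instead of mutating,
# because the other reference must keep seeing the old contents:
alias = s
s += "y"
print(s is alias)   # False: s now points at a new object
print(len(alias))   # 1000 -- the original is untouched
```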


I don't think StringBuilder is faster specifically because it allocates a large buffer. Pretty much every language with growable data structures can already grow any such structure pretty fast, including arrays, maps, and mutable strings. They already have a bunch of pre-set constants about stuff like how big to allocate for the initial empty one and how much more to allocate every time you overflow to balance speed and memory efficiency. It's faster because it's mutable and keeps adding new data to the existing buffer until it gets too large.


The cost is not the allocation per se; the cost is copying the bytes. When concatenating two strings, the bytes from both strings are copied into the new string. If you repeatedly concatenate, the bytes end up getting copied many times. E.g. if you concatenate 100 strings by appending them one by one, the bytes in the first string are copied 99 times, the bytes in the second string are copied 98 times, and so on.

Using a StringBuilder, each string is only copied once, when copied into the buffer. If the buffer needs to grow, the whole buffer is copied into a larger buffer, but if the buffer grows by, say, doubling in size, then this cost is amortized, so each string on average is still only copied about twice at worst.

Faster yet is concatenating an array of strings in a single concat operation. This avoids the buffer-resize issue, since the necessary buffer size is known up front. But this leads to the subtle situation where a single concat is faster than using a StringBuilder, while the cost of multiple appends grows quadratically for concat but only linearly for StringBuilder.
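The amortized argument can be sketched with a toy builder that doubles its buffer and counts every byte copied, both for appends and for regrowth (Python here just as a neutral notation; the arithmetic is the point):

```python
class CountingBuilder:
    """Toy StringBuilder-style buffer that tracks total bytes copied."""

    def __init__(self):
        self.capacity = 16
        self.length = 0
        self.bytes_copied = 0

    def append(self, chunk_len):
        while self.length + chunk_len > self.capacity:
            self.capacity *= 2
            self.bytes_copied += self.length  # regrow: copy existing bytes
        self.bytes_copied += chunk_len        # copy the new chunk in
        self.length += chunk_len

b = CountingBuilder()
for _ in range(1000):
    b.append(10)

print(b.length)       # 10000
# Total bytes copied stays within a small constant factor of the final
# length (under 3x here), versus the quadratic blow-up of repeated concat.
print(b.bytes_copied < 3 * b.length)  # True
```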


Repeatedly concatting a string is the fastest way I am aware of to build a longer one in JavaScript (this has been deliberately optimized for). I believe this might also be the case for PHP, or at least very fast there. Perl might be pretty fast at this too, but I could be wrong.


I know JS's case somewhat well, and the lesson there is not, as the comment above suggests, that "it just does the right thing". JS doesn't use a StringBuilder under the hood for lots of string + calls; instead, JS does subtly the "wrong" thing as an optimization. In the .NET CLR, strings are always, always immutable (or they are not strings, according to the guarantees the CLR makes). JS lives in a single-language VM that makes no such guarantees, and often lives in strictly single-threaded worlds where the VM doesn't have to guarantee the immutability of strings to anyone at the binary memory-layout level, so most JS VMs are free to do the "wrong thing" (from the .NET CLR perspective) and just mutably alter strings under the hood as a deliberate optimization.

There are a lot of interesting conflations of "right" versus "wrong" in this thread. There's the C# "best practices" right-versus-wrong of knowing when to choose things like StringBuilder over string.Concat or string's + operator overloading (which does just call string.Concat under the hood). There's the question of whether that right/wrong applies to other languages (the answer is far more complex than just "right" and "wrong"). There's the .NET CLR's hard-line "no string is mutable" guarantee setting a somewhat hard right versus wrong, while other languages' VMs/interpreters don't need to make such guarantees to themselves or to others. None of these are really ever "right" versus "wrong"; almost all of them are trade-offs described as "best practices", and we take the word "best" too strongly, especially when we don't know/examine/explore the trade-off context it originated from (and sometimes wrongly assume that "best" means "universally the best").


I believe joining an array of strings is still faster in JavaScript, since the combined size can be known up front.


I agree, and would even go one step further and say the first implementation was a decent first pass. Sometimes the 'awful' implementation is good enough and your time is better used on something else. However, this pattern can sometimes bite you in the long term, as you can end up with tons of little 'hacks' all over the place that people become afraid to touch, or worse, copy from.

This also nicely shows one of the fun things with table scans: they look decent at first, then perf goes crappy. Then you get to learn something. In this case it looks like she used normalization to shrink the main table size (a win on the scan rate, as it would not be hitting the disk as much, with faster row reads). It probably also helped just on the 'to' lookups. Think about how many people are in an org: even a large company has a finite number of people, but they get thousands of emails a week. That table is going to be massively smaller than keeping the same data copied over and over in every row. Just normalizing the 'to' clause alone would eliminate a huge amount of that scan, and that is even before considering an index.

The trick with most SQL instances is to do less work. Grab fewer rows when you can. Use smaller tables if you can. Throw out unnecessary data from your rows if you can. Reduce round trips to the disk if you can (which usually conflicts with the previous rules). Pick your data ordering well. See if you can get your lookups to be ints instead of other datatypes. Indexes are usually a good first tool to grab when speeding up a query. But they do have a cost, on insert/update/delete and on disk. Usually that cost is less than your lookup time, but not always. Stick all of that together and you can have a DB that is really performant.
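To make the "do less work" point concrete, here's a tiny SQLite sketch (table and column names are made up for illustration). The planner switches from a full table scan to an index search once an index on the filtered column exists:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mail (id INTEGER PRIMARY KEY, recipient TEXT, body TEXT)")
con.executemany(
    "INSERT INTO mail (recipient, body) VALUES (?, ?)",
    [("user%d@example.com" % (i % 100), "hello") for i in range(1000)],
)

query = "SELECT body FROM mail WHERE recipient = 'user7@example.com'"

# Without an index, the planner has to visit every row.
plan_before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][-1])   # e.g. "SCAN mail"

con.execute("CREATE INDEX idx_mail_recipient ON mail (recipient)")

# With the index, only the matching rows are touched.
plan_after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][-1])    # e.g. "SEARCH mail USING INDEX idx_mail_recipient ..."
```

The same EXPLAIN-before-you-ship habit works on any RDBMS, and it's the cheapest way to catch an accidental table scan before the table grows.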

For me the fun one was a DB I worked on where they used a GUID as the primary key, and therefore as the FK into other tables, which was also the default ordering key on disk. It took me a while to make them understand why lookup perf was so bad. I know why they did it, and it made sense at the time to do it that way. But long term it was holding things back, ballooning the data size, and crushing the lookup times.

How did I come by all of this? Lots of reading, stubbing my toe on things, and watching others do the same. Most of the time you do not get a 'mentor'. :( But I sure try to teach others.


I’d be a lot more sympathetic if the major RDBMSes didn’t have outstanding and thorough reference manuals, or if there weren’t a mountain of books on the subject that cover, among other things, the topic of indexing and its importance. MySQL’s manual, for example, has covered this subject from the very beginning: https://dev.mysql.com/doc/refman/5.6/en/mysql-indexes.html (I don’t have the 3.x manuals handy, but it was there back then too).

It’s not clear from the article whether the author spent any time studying the problem before implementing it. If she failed to do so, it is both problematic and way more common than it ought to be.

“Ready, fire, aim”

But you know what they say: good judgment comes from experience, and experience comes from bad judgment.

Compounding the problem here is that the author now has much more experience and in a reflective blog post, still got the wrong answer.

IMO the better lesson to take away here would have been to take the time getting to know the technology before putting it into production instead of jumping head first into it. That would be bar raising advice. The current advice doesn’t advise caution; instead it perpetuates the status quo and gives “feel good” advice.


I couldn't disagree more with this comment. No amount of reading documentation teaches you how to build production systems. You progress much faster by getting your hands dirty, making mistakes, and learning from them.

The challenge of writing good software is not about knowing and perfecting every gory detail; it's about developing the judgement to focus on the details that actually matter. Junior engineers who follow your advice will be scared to make mistakes, and their development will languish compared to those who dive in and learn from their mistakes. As one gains experience it's easy to become arrogant and dismissive of the mistakes that junior engineers make, but this can be extremely poisonous to their development.


Respectfully, I believe you are disagreeing with an argument that I am not making. It's not an either-or scenario. Both access to and reliance on reference materials and the ability to safely experiment are part of the professional engineer's toolbox. I can say this with confidence because I was a junior engineer once - it's not like I'm speaking without experience.

Everyone - new and experienced developers alike - should have a healthy fear of causing pain to customers. Serving customers - indirectly or otherwise - is what makes businesses run. As a principal engineer today my number one focus is on the customer experience, and I try to instill this same focus in every mentee I have.

Does that mean that junior developers should live in constant fear and decision paralysis? Of course not.

That's where the mentor - the more experienced and seasoned engineer - comes in. The mentor is roughly analogous to the teacher. And the teaching process uses a combination of a mentor, reference materials, and the lab process. The reference materials provide the details; the lab process provides a safe environment in which to learn; and the mentor provides the boundaries to prevent the experiments from causing damage, feedback to the mentee, and wisdom to fill in the gaps between them all. If any of these are absent, the new learner's development will suffer.

Outstanding educational systems were built over the last 3 centuries in the Western tradition using this technique, and other societies not steeped in this tradition have sent their children to us (especially in the last half-century) to do better than they ever could at home. It is a testament to the quality of the approach.

So no, I'm not advocating that junior developers should do nothing until they have read all the books front to back. They'd never gain any experience at all or make essential mistakes they could learn from if they did that. But junior developers should not work in a vacuum - more senior developers who both provide guidance and refer them to reference materials to study and try again, in my experience, lead to stronger, more capable engineers in the long term. "Learning to learn" and "respect what came before" are skills as important as the engineering process itself.


> That's where the mentor [...] comes in.

The thing is, specifically, that the OP did NOT have a mentor. They had a serious problem to solve, pronto, and knew enough to take a crack at it. OK, the initial implementation was suboptimal. So what? It's totally normal, learn and try again. Repeat.

It would be nice if every workplace had an orderly hierarchy of talent where everyone takes care to nurture and support everyone else (especially new folks), and where it's OK to ask questions and receive guidance even across organizational silos, with guardrails to mitigate accidents and terrible mistakes. If you work in such an environment, you are lucky.

It is far more common to have sharp-elbow/blamestorm workplaces which pretend to value "accountability" but then don't lift a finger to support anyone who takes initiative. I suspect that at the time Rachelbythebay worked in exactly this kind of environment, where it simply wasn't possible for an OPS person to go see the resident database expert and ask for advice.


Right. Hence my concern about the lack of helpful advice other than “accept yourself”[1]: Neophyte engineers lacking guardrails and local mentors should fall back to what they learned in school: experiment, use your books, and ask questions from the community. Mentors can be found outside the immediate workplace, after all.

And when it comes to taking production risks, measure twice and cut once, just like a good carpenter.

[1] Of course you should accept yourself. But that alone won’t advance the profession or one’s career.


> measure twice and cut once

Being able to do that is a luxury that many do not enjoy in nose-to-the-grindstone workplaces. You make an estimate (or someone makes it for you), then you gotta deliver, mentor or no mentor, and whether you know the finer points of database design/care-and-feeding or not.

There's something good to be said for taking action. Rachelbythebay did just fine. No one died, the company didn't suffer, it was just a problem, it got corrected later. Big whoop.


But it's also about constantly asking yourself "how can this be done better? What do I not know that could improve it?"

I learned probably 80% of what I know about DB optimization from reading the Percona blog and S.O. The other 20% came from splashing around and making mistakes, and testing tons of different things with EXPLAIN. That's basically learning in a vacuum.


I couldn't disagree more with this comment. No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes, nor changes the impact of the mistakes. You also progress much faster by learning from other peoples' mistakes, this is why written language is so powerful.

Honestly, I think your comment is okay minus the fact that you're trying to highlight such hard disagreement with a sentiment that people should read the fucking manual. Really, you couldn't disagree MORE?

There is definitely value in RTFM. There are cases where making mistakes in production is not acceptable, and progressing by "making mistakes" is a mistake on its own. I don't think the case in this article sounds like one of those, but they do exist (e.g. financial systems (think about people losing money on crypto exchanges), healthcare, or, say, sending rockets into space). In many cases, making mistakes is fine in test systems, but completely, absolutely, catastrophically unacceptable in production. That said, I refuse to blame the junior engineer for such mistakes; I blame management that "sent a boy to do a man's job" (apologies for the dated idiom).

(As an aside, overall, I disagree with many of the comments here nitpicking the way the author solved the problem, calling her "clueless", etc. I really don't care about that level of detail, and while I agree the solution does not seem ideal, it worked for them better than the previous solution.)


> No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes...

Categorically wrong.

Mistakes, failure, and trial and error are very much a part of developing skills. If you're not making mistakes, you're also not taking enough risks and thus missing out on opportunities for growth.


Yes, mistakes are required to learn from them, but making mistakes does not imply that a person will learn from their mistakes.


You can't really search for a problem if you don't suspect its existence. At that point you might feel you have everything you need to start the implementation; how do you guess?





