While I was at Google, someone asked one of the very early Googlers (I think it was Craig Silverstein, but it may've been Jeff Dean) what was the biggest mistake in their Google career, and they said "Not using ECC memory on early servers." If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.
It saved a few bucks in a period when Google's hardware costs were rising rapidly, but the knock-on effects on system design cost much more than that in lost engineer time. Data integrity is one engineering constraint that should be pushed as low down in the stack as is reasonably possible, because as you get higher up the stack, the potential causes of corrupted data multiply.
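For context, the "nasty hacks" in question are mostly software-level integrity checks. Here is a minimal, purely illustrative sketch (not Google's actual code) of what it looks like when application code has to checksum its own data because the layer below won't guarantee it:

```c
/* Hypothetical sketch: checksum a record on write and verify it on read,
 * the kind of belt-and-braces logic teams add when RAM can't be trusted. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef struct {
    char     payload[64];
    uint32_t checksum;   /* covers the payload buffer only */
} record_t;

/* Simple FNV-1a hash; a real system might use hardware CRC32C instead. */
static uint32_t fnv1a(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t h = 2166136261u;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h;
}

static void record_store(record_t *r, const char *text) {
    strncpy(r->payload, text, sizeof r->payload - 1);
    r->payload[sizeof r->payload - 1] = '\0';
    r->checksum = fnv1a(r->payload, sizeof r->payload);
}

static int record_load(const record_t *r, char *out, size_t outlen) {
    if (fnv1a(r->payload, sizeof r->payload) != r->checksum)
        return -1;   /* bit flip detected: refuse to use the data */
    strncpy(out, r->payload, outlen - 1);
    out[outlen - 1] = '\0';
    return 0;
}

int main(void) {
    record_t r;
    char buf[64];
    record_store(&r, "hello");
    if (record_load(&r, buf, sizeof buf) == 0)
        printf("ok: %s\n", buf);
    else
        printf("corruption detected\n");
    return 0;
}
```

The point of the comment is that every component sitting above unreliable RAM ends up carrying this kind of defensive logic, which is why it's cheaper to solve the problem once at the hardware layer.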
I'm increasingly intrigued by Rust. Can anyone recommend a good analysis / benchmarking of Rust compared to C/C++ code? Curious how it fares on a performance basis.
What an unimaginable horror! You can't change a single line of code in the product without breaking 1000s of existing tests. Generations of programmers have worked on that code under difficult deadlines and filled the code with all kinds of crap.
Very complex pieces of logic, memory management, context switching, etc. are all held together with thousands of flags. The whole code is riddled with mysterious macros that one cannot decipher without picking up a notebook and expanding the relevant parts of the macros by hand. It can take a day or two to really understand what a macro does.
Sometimes one needs to understand the values and the effects of 20 different flags to predict how the code would behave in different situations. Sometimes hundreds! I am not exaggerating.
The only reason why this product is still surviving and still works is due to literally millions of tests!
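To give a flavour of what "held together with thousands of flags" means in practice, here is a purely hypothetical snippet (not actual Oracle source) in the style being described: behaviour that depends on how many boolean flags and opaque macros interact, not on any one of them.

```c
/* Hypothetical illustration only -- not real Oracle code. */
#include <stdio.h>

#define KS_CHECK(ctx, f)  (((ctx)->flags & (f)) != 0)   /* imagine dozens of macros like this */

#define FLG_LEGACY_SORT   0x0001
#define FLG_PARALLEL_SCAN 0x0002
#define FLG_BUG_4711_WAR  0x0004   /* workaround flag added for one old bug */
#define FLG_NEW_AUTH_PATH 0x0008

typedef struct { unsigned flags; int session_mode; } ctx_t;

static int pick_code_path(const ctx_t *ctx) {
    /* The real behaviour is determined by how the flags combine. */
    if (KS_CHECK(ctx, FLG_LEGACY_SORT) && !KS_CHECK(ctx, FLG_PARALLEL_SCAN))
        return 1;                                    /* old single-threaded path */
    if (KS_CHECK(ctx, FLG_BUG_4711_WAR) &&
        (KS_CHECK(ctx, FLG_NEW_AUTH_PATH) || ctx->session_mode == 3))
        return 2;                                    /* special-case workaround */
    return 0;                                        /* default path */
}

int main(void) {
    ctx_t c = { FLG_BUG_4711_WAR | FLG_NEW_AUTH_PATH, 0 };
    printf("path = %d\n", pick_code_path(&c));       /* now imagine 20+ flags interacting */
    return 0;
}
```

Multiply that by hundreds of flags and layers of macros, and you get code where only the test suite really knows what the intended behaviour is.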
Here is what the life of an Oracle Database developer looks like:
- Start working on a new bug.
- Spend two weeks trying to understand the 20 different flags that interact in mysterious ways to cause this bug.
- Add one more flag to handle the new special scenario. Add a few more lines of code that check this flag, work around the problematic situation, and avoid the bug.
- Submit the changes to a test farm consisting of about 100 to 200 servers that would compile the code, build a new Oracle DB, and run the millions of tests in a distributed fashion.
- Go home. Come the next day and work on something else. The tests can take 20 hours to 30 hours to complete.
- Go home. Come the next day and check your farm test results. On a good day, there would be about 100 failing tests. On a bad day, there would be about 1000 failing tests. Pick some of these tests randomly and try to understand what went wrong with your assumptions. Maybe there are some 10 more flags to consider to truly understand the nature of the bug.
- Add a few more flags in an attempt to fix the issue. Submit the changes again for testing. Wait another 20 to 30 hours.
- Rinse and repeat for another two weeks until you get the mysterious incantation of flags right.
- Finally one fine day you would succeed with 0 tests failing.
- Add a hundred more tests for your new change to ensure that the next developer who has the misfortune of touching this new piece of code never ends up breaking your fix.
- Submit the work for one final round of testing. Then submit it for review. The review itself may take another 2 weeks to 2 months. So now move on to the next bug to work on.
- After 2 weeks to 2 months, when everything is complete, the code would be finally merged into the main branch.
The above is a non-exaggerated description of the life of a programmer at Oracle fixing a bug. Now imagine what a horror it is to develop a new feature. It takes 6 months to a year (sometimes two years!) to develop a single small feature (say, adding support for a new authentication mode such as AD authentication).
The fact that this product even works is nothing short of a miracle!
I don't work for Oracle anymore. Will never work for Oracle again!
This could just as easily be about the perils of knowing how to look up symptoms on the web.
If you're reading Hacker News you're probably used to reading fast, integrating new knowledge into a mental model and applying it all to solve the problem at hand. You often have to be an "instant expert" in everything. So the temptation is to be an instant expert in your own (self-diagnosed) condition.
Here's my advice after spending way too much time with doctors over the last year (since my wife was diagnosed with cancer).
1. Knowledge: A doctor will have far more general medical knowledge and experience than you. Your GP or specialist should be the bedrock of any diagnosis or treatment. However, it is absolutely possible to read the literature and get a more current or targeted understanding of some of the nuances of your particular disease. Don't be afraid to discuss them with your doctor, even seek second and third opinions, but don't just rely on your own judgement instead of theirs.
2. Motivation: Doctors are conservative: they need to avoid liability and maintain good relationships with their peers. As a patient you will have a different risk/benefit calculation. In some cases you will need to push your doctor to perform that specialised test instead of just offering reassurance, or to try that experimental treatment instead of the "reasonable futility" of standard care.
3. Symptoms: You know best your own symptoms, but you may not have the objectivity or experience to understand them and compare them to other people. You should document them as much as possible, maybe even graph them or make charts, to make this information accessible to your doctor. I've found that a simple timeline containing symptoms, measurements and interventions can be incredibly useful in tracking a condition.