I Accidentally Deleted 7TB of Videos Before Going to Production (thevinter.com)
516 points by thevinter on May 5, 2022 | 345 comments



> but at the time the code seemed completely correct to me

It always does.

> Well, it teaches me to do more diverse tests when doing destructive operations.

Or add some logging and do a dry run and check the results, literally simple print statements:

    print("-----")
    print(f"Downloading video ids from url: {url}")
    print(video_ids)
    ...
    ...
    ...
    # delete()  dangerous action commented out until I'm sure it's right
    print(f"I'm about to delete video {id}")

    print(f"Deleted {count} videos")  # maybe even assert
    ...
Then dump out to a file and spot check it five times before running for real.


I was involved with archiving of data that was legally required to be retained for PSD2 compliance. So it was pretty important that the data was correctly archived, but it was just as important that it was properly removed from other places due to data protection.

This is basically the approach that was taken: log before and after every action exactly what data or files are being acted on and how. Don't actually do it. Then have multiple people inspect the logs. Once ok'd, run again, with manual prompts after each log item asking to continue, for the first few files/bits of data. Only after that was ok'd too did it run the remainder.

In other things I've worked on, I've taken the terraform-style plan first, then apply the plan approach, with manual inspection of the plan in between.
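
A minimal sketch of that split in Python (plan.json and the delete_video callback are invented for illustration):

    import json, sys

    def build_plan(candidate_ids):
        # Plan step: decide, but don't act. Write the decisions somewhere inspectable.
        plan = [{"action": "delete", "video_id": vid} for vid in candidate_ids]
        with open("plan.json", "w") as f:
            json.dump(plan, f, indent=2)
        print(f"Wrote {len(plan)} planned actions to plan.json; review before applying.")

    def apply_plan(delete_video):
        # Apply step: read the reviewed plan back and execute it, nothing else.
        with open("plan.json") as f:
            plan = json.load(f)
        for item in plan:
            assert item["action"] == "delete"
            delete_video(item["video_id"])

    if __name__ == "__main__":
        if sys.argv[1:] == ["apply"]:
            apply_plan(lambda vid: print(f"would delete {vid}"))  # swap in the real delete
        else:
            build_plan(["vid_1", "vid_2"])  # placeholder ids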


Once we get used to doing the same thing multiple times a day, it doesn't matter if the log shows that we're about to take a destructive action, we'll still do it. The only thing that is foolproof is to not take the destructive action, because people make mistakes; it's human nature. I don't know how this can be implemented, maybe encrypt the files, take a backup in some other location (which may not be allowed).

Multiple reviewers here didn't catch the mistake

https://www.bloombergquint.com/markets/citi-s-900-million-mi...


While this is a huge issue, a solution (well, a partial mitigation) I've seen and used is the "Pointing and Calling" technique. The basic idea is that you incorporate more actions beyond reading and typing or pressing a button—generally by having people point at something and say aloud what it is they're doing and what they expect to happen.

It's used rather extensively in safety-critical public transportation in Japan [1] and to a lesser extent in New York (along with many other countries) [2]. This can easily extend to software without overcomplicating things, by just setting the expectation that engineers, QA, etc. do this even when alone.

[1] https://www.atlasobscura.com/articles/pointing-and-calling-j...

[2] https://en.wikipedia.org/wiki/Pointing_and_calling


Hell, GitHub does that to an extent, with the "type the name of this repository to delete it" prompts. Typing the name of the repository isn't exactly perfect, but it's an interesting direction.


There was a thread recently about a repo that accidentally went private and lost all of its stars because of confusion with GH teams vs GH profile readme repo naming. I think this type of prompt is very useful for explicitly preventing the rare worst case scenarios, but the problem is that any type of prompt becomes "routine", so that our brains fail to process it.


The suggestion in that post about how to fix it is good, and mirrors one I read in the Rachel by the Bay blog - type the number of machines to continue:

https://rachelbythebay.com/w/2020/10/26/num/

The takeaway from both is that there is actually something you can do to wake people up when the stakes are high and they might not be doing what they expect.
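
A tiny sketch of that kind of prompt in Python (wording and variable names are made up):

    def confirm_count(items, noun="machines"):
        # Force the operator to type the exact count before anything destructive runs.
        expected = str(len(items))
        typed = input(f"This will affect {expected} {noun}. Type that number to continue: ")
        if typed.strip() != expected:
            raise SystemExit("Count mismatch, aborting.")

    # confirm_count(machines_to_reboot)  # proceeds only if the real number is typed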


And most importantly, don't let yourself get into the habit of copy pasting the value


I wonder if you could print some non-visible characters in there to taint the copied value in some detectable way.


Prompt in words, but expect the value in numbers, e.g. "Twenty-five" and the box requires you to type "25"? At least in this specific case, it would require you to type it.


yeah, that would possibly stop the copy and paste problem. to make it robust they would need to use a string of a few non-visible characters but that would fail if the browser's clipboard system doesn't copy them over for some kind of privacy initiative. might be another way it fails that I can't think of right now.



I always copy-paste into that box as well, they should probably make at least an attempt at disabling pasting into it


Azure has the same thing when deleting a database: just verify this is the correct one by typing the db name.


I heard of this technique, but unfortunately I don't see how it can be easily applied in software engineering/devops.

Also, I now realize that aviation checklists seem to be done similarly with gestures - at least from what I saw on YouTube, not sure if that's representative or only used during education (?)


Spelling out loudly the command you are about to execute and explaining the reasoning behind it can help a lot too.


Ok, but am I to do it on every single command I do on my terminal? Or on which ones specifically? If the problem we're trying to solve is that I can sometimes overlook the "dangerous commands" among "safe ones", by definition of overlooking it won't work if I tell myself to "spell out the command only in case of the dangerous ones", no?

I'm honestly trying to think of a way I could approach this for myself, but I don't see a clear solution yet that wouldn't require me to spell out everything I type in my terminal window.


“I’m removing that semicolon!” (Pointing)


Parent meant this sort of pointing.

https://t.co/TjfX5K54H7


Because everyone assumes that everyone else is looking at it more closely than they are. "I'll just do a cursory look since I'm sure everyone else is doing an in-depth look." Narrator: nobody did an in-depth look.


I'm a fan of doing things temporally so data is very rarely actually deleted from the database. Most of the time, you just update the "valid_to" field to the current time. Sometimes real deletes are required, such as with privacy requests, but I think that sort of thing is pretty rare.

If your application has space concerns, you can modify this approach to be like a recycle bin where you delete records which are no longer valid and have been invalid for over a month (or whatever time frame is appropriate for your application). However, I think this is unnecessary in most cases except for blob/file storage.
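
A rough sketch of that soft-delete plus recycle-bin idea with SQLite (table and column names are invented):

    import sqlite3, time

    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS videos (id TEXT PRIMARY KEY, valid_to REAL)")

    def soft_delete(video_id):
        # Instead of DELETE, close the validity window; readers filter on valid_to.
        conn.execute("UPDATE videos SET valid_to = ? WHERE id = ?", (time.time(), video_id))
        conn.commit()

    def purge_invalid_older_than(days=30):
        # Optional recycle-bin step: hard-delete rows that have been invalid long enough.
        cutoff = time.time() - days * 86400
        conn.execute("DELETE FROM videos WHERE valid_to IS NOT NULL AND valid_to < ?", (cutoff,))
        conn.commit()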


That form had a couple weird checkboxes with odd wording. It is a famous mistake, but also rather understandable just because the form was cryptic.


> Multiple reviewers here didn't catch the mistake

Sure, but we can only do so much. I find it's good bang for the buck, and alternatives that might prevent that are not always available, so we do the best we can. You gotta make a call on whether it's enough or not.


mv then rm is another idiom. So long as you have the space.

For database entries, flag for deletion, then delete.

In the files case, the move or rename also accomplishes the result of breaking any functionality which still relies on those files ... whilst you can still recover.

Way back in the day I was doing filesystem surgery on a Linux system, shuffling partitions around. I meant to issue the 'rm -rf .' in a specific directory; I happened to be in root.

However ...

- I'd booted a live-Linux version. (This was back when those still ran from floppy).

- I'd mounted all partitions other than the one I was performing surgery on '-ro' (read-only).

So all it cost me was a reboot, and an opportunity to see what a Linux system with an active shell, but no executables, looks like.

Plan ahead. Make big changes in stages. Measure twice (or 3, or 10, or 20 times), cut once. Sit on your hands for a minute before running as root. Paste into an editor session (C-x C-e Readline command, as noted elsewhere in this thread).

Have backups.


You mean cp then rm?

And yes: copy, verify, delete. And make sure, by the code structure, that you either do all three on the same files, or they fail.

Also, do it slowly, with just a bit of data on each iteration. That will make the verification step more reliable.

Anyway, for a huge majority of cases, only having backups is enough already. Just make sure to test them.


No, mv.

Example:

  cd datadir
  mkdir delete
  mv <list of files to be deleted> ./delete
  # test to see if anything looks broken.  
  # This might take a few seconds, or months, though it's usually reasonably brief.
  rm -rf ./delete
The reasons for mv:

- It's atomic (on a single filesystem). There's no risk of ending up with a partial operation or an incomplete operation.

- It doesn't copy the data, it renames the file. (mv and rename are largely synonyms.)

- There's no duplication of space usage. Where you're dealing with large files, this is helpful.

The process is similar to the staged deletion most desktop OS users are familiar with, of "drag to trash, then empty trash". Used in the manner I'm deploying it, it's a bit more like a staged warehouse purge or ordering a dumpster bin --- more structured / controlled staged deletion than a household or small office might use.


I think mv then rm is probably meant as 'windows trash bin' style.


  > ... Then have multiple people inspect the logs. Once ok'd, run again, with manual prompts after each log item asking to continue...
This sort-of reminds me of some "critical" work I had to do a couple of decades ago. I was in a shop that used this horrifically tedious tool for designing masks for special kinds of photonic devices-- basically it was tracing out optical waveguides that would be placed on a crystal that was processed much like a silicon IC.

The process was for TWO of us to sit in front of a computer and review the curves in this crazy old EDA layout tool called "L-edit" before it got sent to have the actual masks made (which were very expensive). It took HOURS to check everything.

The first hour was tolerable but then boredom started to creep in and we got sloppy. The whole reason TWO people got tasked with this was because it was thought that we would keep each other focused -- 2 pairs of eyes are better than one, right? Instead, it just underscored the tedium of it all. One day someone walked in and found us BOTH in DEEP SLEEP in front of the monitor. Having two people didn't decrease the waste caused by mistakes, it just bored the hell out of more people.


How many mistakes did you catch?


ONE real one and some occasional nitpicks to show that we were busy (after being caught asleep).

Was it worth it? No, I don't think so from an opportunity cost perspective-- even though we were the most junior folks there. A mind is a terrible thing to waste!


From his story I can tell he found one big mistake. The tedious work itself.


Another good approach is to do deletions slowly. Put sleeps between each operation, and log everything. That way if you realize something is broken, you have a chance of catching it before it's too late.


> Then have multiple people inspect the logs.

I think that this is the most important part of any check. Your parent refers to checking the log five times, but, at least in my experience, I won't catch any more errors on the fifth time than the first—if I once saw what I expected rather than what was there, I'll keep doing so. Of course everyone has their blind spots, but, as in the famous Swiss-cheese approach, we just hope that they don't line up!


Yes, I love the idea of the Plan Apply.


It never hurts to ask for another set of eyes to review. At the least if something goes awry, the blame isn't solely on you.


Make a plan, check the plan, [fix the plan, check the plan (loop)], do the plan

See PDCA for a more time-critical decision loop. https://en.wikipedia.org/wiki/PDCA


Another technique that I've used with good success is to write a script that dumps out bash commands to delete files individually. I can visually inspect the file, analyze it with other tools, etc and then when I'm happy it's correct just "bash file_full_of_rms.sh" and be confident that it did the right thing.
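
A minimal Python sketch of such a generator (the list of paths is a stand-in; the point is emitting a reviewable script):

    import shlex

    def write_deletion_script(paths, out="file_full_of_rms.sh"):
        # One rm per file so the result can be read, grepped, and diffed before running.
        with open(out, "w") as f:
            f.write("#!/bin/sh\nset -eu\n")
            for p in paths:
                f.write(f"rm -- {shlex.quote(p)}\n")
        print(f"Wrote {len(paths)} rm commands to {out}; inspect it, then run `bash {out}`.")

    # write_deletion_script(["old/video 1.mp4", "old/video 2.mp4"])  # example paths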


This was taught to me in my first linux admin job.

I was running commands manually to interact with files and databases, but was quickly shown that even just writing all the commands out, one by one, gives room to personally review and get a peer review, and also helps with typos. I could ask a colleague "I'm about to run all these commands on the DB, do you see any problem with this?". It also reduces the blame if things go wrong, since it managed to pass approval by two engineers.

While I'm thinking back, another little tip I was told was to always put a "#" in front of any command I paste into a terminal. This stops accidentally copying a carriage return and executing the command.


> This stops accidentally copying a carriage return and executing the command.

For a one-liner sure, but a multi line command can still be catastrophic.

Showing the contents of the clipboard in the terminal itself (eg via xclip) or opening an editor and saving the contents to a file are usually better approaches. The latter lets you craft the entire command in the editor and then run it as a script.


From [0]:

[For Bash] Ctrl + x + Ctrl + e : launch editor defined by $EDITOR to input your command. Useful for multi-line commands.

I have tested this on Windows with a MINGW64 bash; it works similarly to how `git commit` works: by creating a new temporary file and detecting* when you close the editor.

[0] https://github.com/onceupon/Bash-Oneliner

* Actually I have no idea how this works; does bash wait for the child process to stop? does it do some posix filesystem magic to detect when the file is "free"? I can't really see other ways


It does create and give a temporary file path to the editor, but then simply waits for the process to exit with a healthy status.

Once that happens, it reads from the temporary file that it created.


The 'enable-bracketed-paste' setting is an easier and more reliable way to deal with that: https://unix.stackexchange.com/a/600641/81005

It will prevent any number of newlines from running the commands if they're pasted instead of typed.

You can enable it either in .inputrc or .bashrc (with `bind 'set enable-bracketed-paste on'`)


That was our SOP for running DELETE SQL commands on production too: a script that generates a .sql that's run manually. It saved our asses a fair amount of times.


Yeah, wish I'd learned that the easy way. Fresh into one of my first jobs I was working with a vendor's custom interface to merge/purge duplicate records. It didn't have a good method of record matching on inserts from the customer web interface so a large % of records had duplicates.

Anyway, I selected what I thought was a "merge all duplicates" option without previewing results. What I had actually done was "merge all selected". So, the system proceeded to merge a very large % of the database... Into One. Single. Record.

Luckily the vendor kept very good backups, and so I kept my job. Because I also luckily had a very good boss and I had already demonstrated my value in other ways, he just asked me "Well, are you going to make that mistake again?". I wisely said no, and he just smiled and said "Then I think we're done here."

I have been particularly fortunate throughout my career to have very good managers. As much as managers get a lot of flack here on HN, done well they are empowering, not a hindrance, and I attribute a lot of success in my career to them.


> Yeah, wish I'd learned that the easy way.

I think that, if you've only learned something like that the easy way, then you haven't learned it yet. As long as everything's only ever gone right, it's easy to think, I'm in a rush this one time, and I've never really needed those safety procedures before, ….


At a previous job the DB admin mandated that everyone had to write queries that would create a temporary table containing a copy of all the rows that needed to be deleted. This data would be inspected to make sure that it was truly the correct data. Then the data would be deleted from the actual table by doing a delete that joined against the copied table. If for some reason it needed to be restored, the data could be restored from the copy.
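
A loose sketch of that flow in Python with SQLite (table and column names are invented):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, status TEXT)")

    # 1. Copy the rows you intend to delete into a side table.
    conn.execute("DROP TABLE IF EXISTS to_delete")
    conn.execute("CREATE TABLE to_delete AS SELECT * FROM orders WHERE status = 'expired'")

    # 2. Inspect the copy (row counts, spot checks) before touching the real table.
    count = conn.execute("SELECT COUNT(*) FROM to_delete").fetchone()[0]
    print(f"{count} rows staged in to_delete; review before continuing.")

    # 3. Delete from the real table only what is in the copy; the copy doubles as a restore source.
    # conn.execute("DELETE FROM orders WHERE id IN (SELECT id FROM to_delete)")
    # conn.commit()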


I tend to write one script that emits a list of files, and another that takes a list of files as arguments.

It's simple to manually test corner cases, and then when everything is smooth I can just

    script1 | xargs script2
It's also handy if the process gets interrupted in the middle, because running script1 again generates a shorter list the second time, without having to generate the file again.

When I'm trying to get script1 right I can pipe it to a file, and cat the file to work out what the next sed or awk script needs to be.


Ah, I’m glad I’m not the only one who did this. It also means that you can fix things when they break halfway. Say you get an error when the script is processing entry 101 (perhaps it’s running files through ffmpeg). Just fix the error and delete the first 100 lines.


The only issue with that is if subsequent lines implicitly assume that earlier ones executed as expected, e.g. without error.

Over-simplified example:

1. Copy stuff from A to B

2. Delete stuff from A

(Obviously you wouldn't do it like that, but just for illustration purposes.) It's all fine, but (2) assumes that (1) succeeded. If it didn't, maybe no space left, maybe missing permissions on B, whatnot, then (2) should not be executed. In this simple example you could tie them with `&&` or so (or just use an atomic move), but let's say these are many many commands and things are more complex.


At the point you're doing this, you should be using a proper programming language with better defined string handling semantics though. In every place it comes up you'll have access to Python and can call the unlink command directly and much more safely - plus a debugging environment which you can actually step through if you're unsure.


Eh, I think that misses the point a bit. Use whatever you want to generate the output, but make the intermediary structure trivial to inspect and execute. If you're actually taking the destructive actions within your complicated* logic then there's less room to stop, think, and test.

You could always generate an intermediary set, inspect/test/etc, and then apply it with Python. I've done that too, works just as well. The important thing is to separate the planning step from the apply step.

* where "complicated" means more complicated than, for ex, `rm some_path.txt` or `DELETE FROM table WHERE id = 123`.


Yes. Also, maybe not have a delete action in the middle of a script. It's usually better to build a list of items to be deleted. In that case, two lists: items to be deleted, items to be kept. Then compare the lists:

- make sure the sum of their lengths == number of total current items

- make sure items_to_be_kept.length != 0

- make sure no two items appear in both lists

- check some items chosen at random to see if they were sorted in the correct list

At this point the only possible mistake left is to confuse the lists and send the "to_be_kept" one to the delete script; a dry run of the delete list can be in order.
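
A sketch of those checks as plain assertions (variable names are illustrative):

    def check_partition(all_items, to_delete, to_keep):
        delete_set, keep_set = set(to_delete), set(to_keep)
        # The two lists must exactly partition the current items.
        assert len(to_delete) + len(to_keep) == len(all_items)
        assert len(keep_set) != 0, "refusing to delete everything"
        assert not (delete_set & keep_set), "an item appears in both lists"
        assert delete_set | keep_set == set(all_items), "an item is missing from both lists"
        # Then spot-check a few random items by hand before trusting the split.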


This. The original approach can fail horribly if there's a problem on the server when you run the script for real. Your code can be perfect but that's no guarantee the server will always return what it ought to.


I've had good success with this approach, have two distinct scripts generate the two lists, then in addition to your items here also checking that every item appears in one of the lists.


What do you recommend, to not get into trouble if there are spaces or newlines in the file names?


Try not to delete stuff with Bash.

This is the most reliable way. Bash has a few niceties for error handling, but if you are using them, you would probably fare better in another language.

If you do insist on Bash, quote everything, and use the "${var}" syntax instead of "$var". Also, make sure you handle every single possible error.


`set -e` will abort when any command fails; add `set -o pipefail` so that failures earlier in a pipeline count too. It's a must for any critical script.


Don't use a shell script.


Do you mean, always pass the list directly to the next script via function calls, without writing it to an intermediate file / pipeline?


I'm being flippant, because shell scripts are so inherently error prone they're to be avoided for critical stuff like this.

If you _absolutely_ must use a shell script:

0. Use shellcheck, which will warn you about many of the below issues: https://www.shellcheck.net/

1. understand how quoting and word splitting work: https://mywiki.wooledge.org/Quotes

2. if piping files to other programs, use `-print0` or equivalent (or even better, if using something like find, its built-in execution options): https://mywiki.wooledge.org/UsingFind

3. Beware the pitfalls (especially something like parsing `ls`): https://mywiki.wooledge.org/BashPitfalls

(warning: the community around that wiki can be pretty toxic, just keep that in mind when foraying into it.)


Yes, use the list argument to Python’s subprocess.run for example. It’s much easier to not mess up if your arguments don’t get parsed by a shell before getting passed.
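
For example, a minimal sketch (the paths are placeholders and assumed to exist):

    import subprocess

    files_to_remove = ["videos/old clip.mp4", "videos/$weird;name.mp4"]

    # No shell is involved, so spaces, $ and ; in the names reach rm verbatim
    # instead of being re-parsed, expanded, or split.
    subprocess.run(["rm", "--"] + files_to_remove, check=True)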


Yes, I find command line tools that have a "--dry-run" flag to be very helpful. If the tool (or script or whatever) is performing some destructive or expensive change, then having the ability to ask "what do you think I want to do?" is great.

It's like the difference between "do what I say" and "do what I mean"...


That's what I like about powershell. Every script can include a "SupportsShouldProcess" [1] attribute. What this means is that you can pass two new arguments to your script, which have standardized names across the whole platform:

- -WhatIf to see what would happen if you run the script;

- -Confirm, which asks for confirmation before any potentially destructive action.

Moreover these arguments get passed down to any command you write in your script that supports them. So you can write something like:

    [CmdletBinding(SupportsShouldProcess)]
    param ([Parameter()] [string] $FolderToBeDeleted)
    
    # I'm using bash-like aliases but these are really powershell cmdlets!
    echo "Deleting files in $FolderToBeDeleted"
    $files = @(ls $FolderToBeDeleted -rec -file)
    echo "Found $($files.Length) files"
    rm $files
If I call this script with -WhatIf, it will only display the list of files to be deleted without doing anything. If I call it with -Confirm, it will ask for confirmation before each file, with an option to abort, debug the script, or process the rest without confirming again.

I can also declare that my script is "High" impact with the "ConfirmImpact = High" switch. This will make it so that the user gets asked for confirmation without explicitly passing -Confirm. A user can set their $ConfirmPreference to High, Medium, Low, or None, to make sure they get asked for confirmation for any script that declares an impact at least as high as their preference.

[1]: https://docs.microsoft.com/en-us/powershell/scripting/learn/...


I'm a bit confused (because I didn't read the docs)... does calling it with "-WhatIf" exercise the same code path as calling without, only the "do destructive stuff" automagically doesn't do anything? Or is it a separate routine that you have to write?

Cause if it is an entirely separate code path, doesn't that introduce a case where what you say you'll do isn't exactly what actually happens?


Well, just read the...

> because I didnt read the docs

Ouch.

> Or is it a separate routine that you have to write?

If you are writing a function or a module that would do something (e.g. an API wrapper) then of course you need to write it yourself.

But if you are writing just a script for your mundane one-time/everyday tasks and call cmdlets that support ShouldProcess, then it works automagically. Issuing '-WhatIf' for the script passes '-WhatIf' to any cmdlet that has 'ShouldProcess' in its definition. Of course, if someone made a cmdlet with a declared ShouldProcess but didn't write the logic to process it, you are out of luck.

But if you have a spare couple of minutes, check the docs in the link; it was originally a blog post by kevmarq, not a boring autodoc.


It's the first option. And yes, sometimes you have to be careful if you want to implement SupportsShouldProcess correctly, it's not something you can add willy-nilly. For example, if you create a folder, you can't `cd` there in -WhatIf mode.


The rule we have is that anything that is not idempotent and not run as a matter of daily routine must dry-run by default, and not take action unless you pass --really. This has saved my bacon many times!
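
A bare-bones sketch of that rule with argparse (the --really name comes from the comment above, the rest is made up):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--really", action="store_true",
                        help="actually perform the deletions (default is a dry run)")
    args = parser.parse_args()

    for video_id in ["vid_1", "vid_2"]:           # placeholder ids
        if args.really:
            print(f"deleting {video_id}")         # call the real delete here
        else:
            print(f"[dry run] would delete {video_id}")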


Deleting actually is idempotent. Doing it twice won't be different from doing it once.


Deleting * may not be though. Your selection needs to be idempotent.


idempotency means that f(X) = f(f(X)). Modifying the X inbetween is not allowed. Is there really an initial environment where rm * ; rm * ; does something different than rm * once?


In the case of any live system, i would say yes. Additional, and different, files could have appeared on the file system in between the times of each rm *.


* is just short hand for a list of files. Calling rm with the same list of files will have the same results if you call it multiple times. That’s idempotent.

Your example is changing the list of files, or arguments to rm between runs. Same as pc85’s example where the timestamp argument changes.


In addition to what einsty said (which is 100% accurate), if you're deleting aged records, on any system of sufficient size objects will become aged beyond your threshold between executions.


Right. You can kind of consider the state of a filesystem on which you occasionally run rm * purges to be a system whose state is made up of ‘stuff in the filesystem’ and ‘timestamp the last purge was run’.

If you run rm * multiple times, the state of the system changes each time because that ‘timestamp’ ends up being different each time.

But if instead you run an rm on files older than a fixed timestamp, multiple times, the resulting filesystem is idempotent with respect to that operation, because the timestamp ends up set to the same value, and the filesystem in every case contains all the files added later than that timestamp.
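
In code the difference is just whether the cutoff is fixed or recomputed on every run; a small sketch with invented values:

    import time
    from pathlib import Path

    FIXED_CUTOFF = 1651000000                  # chosen once and written down; re-runs select the same set
    rolling_cutoff = time.time() - 30 * 86400  # recomputed each run, so re-runs can select new files

    def files_older_than(directory, cutoff):
        # Selection only; deleting the result is a separate step.
        return [p for p in Path(directory).iterdir() if p.stat().st_mtime < cutoff]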


> Is there really an initial environment where rm * ; rm * ; does something different than rm * once?

if * expands to the rm binary itself, maybe.


How is the system different after the first and after the second call?


If there is an rm executable in the current directory, and also one later in your PATH, the second run might use a different rm that could do whatever it wants to


This is actually a likely scenario, as it is common to alias rm to rm -i. Though your alias will still be active in the current shell after .bashrc is nuked, some might wrap rm with a script instead of aliasing (e.g., to send items to Trash).


# rm rm

# rm rm

rm: command not found


Early in my career I used --yes-i-really-mean-it and then a coworker removed it with the commit message "remove whimsy".

T'was a sad day.


Going further, make it dry run by default and have an --execute flag to actually run the commands: this encourages the user to check the dryrun output first.


All my tools that have a possible destructive outcome use either an interactive stdin prompt or a --live option. I like the idea of dry running by default.


This is why I like to always write any sort of user-script batch-job tools (backfills, purges, scrapers) with a "porcelain and plumbing" approach: The first step generates a fully declarative manifest of files/uris/commands (usually just json) and the second step actually executes them. I've used a --dry-run flag to just output the manifest, but I just read some folks use a --live-run flag to enable, with dry-run being the default, and I like that much better so I'll be using that going forward.

This pattern has the added benefit that it makes it really easy to write unit tests, which is something often sorely lacking in these sorts of batch scripts. It also makes full automation down the line a breeze, since you have nice shearing layers between your components.

http://www.laputan.org/mud/mud.html#ShearingLayers


I tend towards a --dry-run flag for creative actions and --confirm for destructive actions. Probably slightly annoying that the commands end up seemingly different, but it sure beats accidentally nuking something important.


This sounds like a "do nothing script."

https://news.ycombinator.com/item?id=29083367

It defaults to not doing anything so you can gradually and selectively have it do something.

Learned about it when I posted my command line checklist tool on HN: https://github.com/givemefoxes/sneklist

(https://news.ycombinator.com/item?id=25811276)

You could use it to summon up a checklist of to-dos like "make sure the collection in the dictionary has the expected number of values" before a "do you want to proceed? Y/n"


I do this too, but I also take a count of the expected number of items to be deleted. If the collection I'm iterating over doesn't have exactly the number of objects I expect, I don't proceed.


Human-in-the-loop is such an important concept in ops, and yet everyone (that's including me) seems to learn it the hard way.


I just want to say as someone currently working on a script to delete approximately 3.2TB of a ~4TB production database, this subthread is pure gold.


To ensure that the files are actually downloaded (step 1) before deleting the originals (step 2), I would make step 1 an input to step 2. That is, step 2 cannot work without step 1. Something like:

    (step1) Download video from URL.  Include the Id in the filename.
    (step2) Grab the list of files that have been downloaded and parse to get the Id.  Using the Id, delete the original file.
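
A loose sketch of that coupling (the downloads directory, the <id>.mp4 naming convention, and the final delete call are all hypothetical):

    from pathlib import Path

    # Step 1 saved each video as "<id>.mp4", so step 2 can only ever see ids
    # that were actually downloaded.
    downloaded_ids = [p.stem for p in Path("downloads").glob("*.mp4")]

    for video_id in downloaded_ids:
        print(f"would delete original {video_id}")  # replace with the real delete call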


Yep, even writing a simple wildcard at command-line I will 'echo' before I 'rm'.


On computers I own, I always install "trash-cli" and I even created an alias from rm to trash. It's like rm, but it goes to the good old trash. It will not save your prod but it's pretty useful on your own computer at least.


That's a good tip, thanks!


Agreed, I've also been burned doing stupid things like this and always print out the commands and check them before actually doing the commit.

As they say, measure twice, cut once.

Don't feel bad, I think every professional in IT goes through something similar at one time or another.


This was my first thought too. Another thing I like to do is to limit the loop to, say, one page or 10 entries and check after each run that it was correctly executed. It makes it a half-automated task, but saves time in the long run.


Condensed to aphorism form:

    Decide, then act.  
There's a whole menagerie of failure modes that come from trying to make decisions and actions at the same time. This is but one of them.

Another of my favorites is egregious use of caching, because traversing a DAG can result in the same decision being made four or five times, and the 'obvious' solution is to just add caches and/or promises to fix the problem.

As near as I can tell, this dates back to a time when accumulating two copies of data into memory was considered a faux pas, and so we try to stream the data and work with it at the same time. We don't live there anymore, and because we don't live there anymore we are expected to handle bigger problems, like DAGs instead of lists or trees. These incremental solutions only work with streams and sometimes trees. They don't work with graphs.

Critically, if the reason you're creating duplicate work is because you're subconsciously trying to conserve memory by acting while traversing, then adding caches completely sabotages that goal (and a number of others). If you build the plan first, then executing it is effectively dynamic programming. Or as you've pointed out, you can just not execute it at all.

Plus the testing burden is so drastically reduced that I get super-frustrated having to have this conversation with people over and over again.


It's amazing the number of times I look at some simple code and think "nah, this is so simple it doesn't need a test!", add tests anyway (because I know I should)... and immediately find the test fails because of an issue that would have been difficult to diagnose in production.

Automated tests are awesome :)


A few assertions would have also stopped this.

    During buildup of the our_ids list: assert vimeoId not in our_ids
    After creating the list: assert len(our_ids) > 10000; assert len(set(our_ids)) == len(our_ids)
    Before each final deletion: assert id not in hardcoded_list_of_golden_samples
    Depending on the speed required you could hit the api again here as an extra check.
But as always everything is obvious in hindsight. Even with the checks above, Plan+Apply is the safest approach.


>literally simple print statements

Yes, that can be a simple but powerful live on screen log. I developed a library to use an API from a SaaS vendor, in much the same way as the author. It was my first such project & I learned the hard way (wasted time, luckily no data loss or corruption) that print() was an excellent way to keep tabs on progress. On more than one occasion it saved me when the results started scrolling by and I did an oh sh*t! as I rushed to kill the job.


Rather than commenting it out, I suggest adding a --live-run flag to scripts and checking the output of --live-run=false (or omitted) before you run it "live."


But then you have double the chances of introducing a bug for the specific scenario we are talking about:

Before: there is chance there is a bug in my "delete" use case

Now: what we have before plus the chance that there is a bug in my "--live-run" flag


You can make automated tests for your flag. You can’t make automated tests for your code comments.


Beside doing this, I like to first just move files to another dir (keeping the relative path) instead of deleting them. It's basically like a DIY recycle bin.

If both paths are on the same disk moving files is a fast operation - and if you discover a screw up, you can easily undo it. On the other hand if everything still looks fine after a few days, you just `rm -rf` that folder and purge the files.


Yeah, that is what I recommend too.

Instead of performing the dangerous action outright, just log a message to screen (or elsewhere) and watch what is happening.

Alternatively, or subsequently, chroot and try that stuff on some dummy data to see if it actually works.


Indeed. I would say that framework or even language-level support for putting things in "dry-run" mode is something old C libraries used to offer that is sorely missing from many modern frameworks and languages.


This is how I do it in compiled code. In shell, I print the destructive command for dry runs - no conditions around whether to print or not, I go back to remove echo and printf to actually run the commands.


I'd make sure those include WARN or ERROR (I'd use logging to do that), that way you can grep for those. Spot checking might be difficult if the logs get long.


The No. 2 philosophy!

Make sure you got everything out and off before you pull up your pants, or else you better be prepared to deal with all the shit that might follow!


   SELECT COUNT(1) FROM table 
   -- UPDATE table SET col='val'
   WHERE 1=1


    BEGIN TRANSACTION 
    UPDATE table SET col='val' WHERE 1=1
    ROLLBACK


Definitely better, when you can afford the overhead!


Exactly!


100% on the logging and dry run.


That is called experience.

Good decisions come from experience. Experience comes from making bad decisions.


Dry run really is key here. Most automated tests wouldn't find this bug.


Experience is the best teacher™


Aaaahhh, the feeling you get when you notice that you fucked up. Everything gets quiet, body motion stops, cheeks get hot, heart starts to beat and sinks really low, "fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck, fucking shit". Pause. Wait. Think. "Backups, what do I have, how hard will it be to recover? What is lost?". Later you get up and walk in circles, fingers rolling the beard, building the plan in the head. Coffee gets made.


Pffft, it's not a real panic until you weigh the pros and cons of leaving the country with nothing but the clothes on your back and becoming an illegal immigrant shepherd in a nation with too many consonants in its name.

(Your description is so, so, spot on.)


The worst panic I've felt actually took me over the precipice into peaceful oblivion. I started simply saying to myself "oh well... It's just a job".


I don't think there's any public technical mistake that'll prevent you from ever getting a job in tech. Demand is just too high. Peaceful oblivion still isn't my default even though it should be.


Ah, the goat farmer fantasy that always seems to come _at the cusp_ of the solution.


I had this experience when, years ago on my first day as group lead at $JOB, I was being shown a RAID 5 production server that held years of valuable, irreplaceable data (because there were no backups. Let me repeat that there were no backups). For some bizarre reason, I thought "oh cool, hot-swappable drives" and pulled one out of the rack. This naturally resulted in loud, persistent beeping from the machine, which everyone ignored on the assumption that the fellow who was just hired as the group lead knew what the f he was doing.

While I didn't know what I was doing, I did manage to get the beeping to stop, and had to come in at 5 a.m. the next day to restripe the drive I'd yanked out.

Did I mention there were no backups? When I was a little bit more seasoned on the job, I raised a polite but persistent issue with management of the need for durable backups. Although I kept at it for months, they thought about it, talked about it, and ultimately did nothing. A few months after I left, the entire array failed. Since the group's work relied on the irreplaceable data, all work ground to a halt for the several months it took for an off-site company to recover the data.


My previous boss stores company data this same way. I begged him to approve the $5 per month cost for Backblaze on the computers I used. He approved it for some, but not all (about half of the ten computers). He completely rejected the idea for the company's data. After all, it was already protected by RAID.


Isn’t RAID 5 supposed to survive a single disk being taken out?


Theoretically, but there are often other things at play. I know the story is older, but since about 2015 raid5 has been dead to me, mostly because at current drive sizes a raid5 rebuild takes so long that the chance of a cascading failure, losing a second drive and turning it into a "send to a recovery lab" situation, is real. Anywhere you would use raid5, just do raid6.


To add to the comments of cascading failure: if a drive goes bad, another drive from the same manufacturing batch is disproportionately likely to go bad. RAID arrays are often built with drives from the same batch, since they were bought at the same time from the same vendor. This means array failures include multiple drives more often than you'd expect.


Yes, the array itself was fine; was just a dumb action on my part given how brittle the system was.


If a second drive fails after the first while rebuilding (which happens more often with larger and slower drives), the data is lost.


lol, it's amazing how fast the blood leaves your face when your mind transitions from "cool that worked well" to "Oh no, what have I done?"

That backups comment sounds very familiar.

I accidentally deleted a client's products table from the production database in my early years as a solo dev. There was only a production database. Luckily I had written a feature to export the products to an excel sheet a while before and happened to have an excel copy from the prior day. I managed to build an importer to ingest the excel and repopulate the table in record speed while waiting for my phone to ring and the client to be furious. Luckily they never found out.


God the feeling of having your body temp rise based purely on realizing you fucked up is so relatable.


damn, your description is spot on and reading this triggered PTSD in me... Last time I had this feeling was two years ago when I destroyed one of our development servers because of a failed application update. I know exactly how it feels to wish Ctrl + Z existed in real life... We had backups of the machine, but it was still kind of a humiliating feeling to tell everybody and ask for a restore from backup (everybody was cool though in the end)


I lost 1 hour and 30 minutes of data from a Slack-like app (chat messages). Luckily at the time we were pretty small so not much data was lost, but holy shit did that make me almost throw up.

Thank God my automatic backups were so close to the mistake I made and I didn't lose 24 hours.

Haven't made a mistake like that since and I don't destroy DB records like that anymore.


Don't forget that out-of-body experience where you just kinda float outside yourself.


If it is for real, body motion does not exactly stop, it manifests itself in other ways.


Poetic! Love it


I like these stories. I think they resonate well for 'the rest of us'. I've made plenty of mistakes like this - you learn and grow, right?

One of the best things about HN is that so many incredible, talented people post. It's incredibly inspiring to raise your own game, to see what the best are doing. But sometimes it's equally important to realise we all fuck up, and for every unicorn dev there's another thousand of us grinding away.

OP - well done for sorting the problem and telling us all about it!


Amen


The root of this particular issue was Vimeo's failure to do this migration for their customers.

Vimeo OTT's codebase is written in Rails, whereas the main Vimeo application is written in PHP. At the time Vimeo acquired Vimeo OTT, that codebase was small — around 10,000 lines of Ruby. Rewriting it inside the Vimeo PHP application would have been a tough technical challenge for the all-Ruby team, and they'd likely have lost some people along the way and missed out on some content deals, so they decided instead to maintain two separate codebases and two separate login systems.

The video-playback and video-storage infra has since been unified, but all the business logic is still siloed.


He wasn’t asking them to refactor their internal code bases. But they should be able to whip up the 20 lines of code needed to do this between APIs (or just directly on their servers). Essentially what author was trying to do when he screwed up. For the author this was disposable code, for Vimeo this would have been a reusable utility.

I know how these things happen. Support ticket queues and all. And while I don’t fully know the difference in cost, I would assume a customer upgrading to an Enterprise plan would get a better support experience.

Whoever within the author's company negotiated the upgrade to Enterprise (or didn't) and failed to embed some agreement around OTT-to-Enterprise transition assistance was the one who made the first mistake.


Per the post, Vimeo DID do it -- without telling the customer! And then wouldn't help uncluster the situation.


>The root of this particular issue was Vimeo's failure to do this migration for their customers.

Yes and No. At the end of the day, you as a business have to insulate yourself from your infrastructure provider.


Vimeo is the only infrastructure provider providing that service. It is impossible to insulate a business from it.


You're saying it's impossible to not accidentally delete 7TB of videos, and when you do, to blame it on Vimeo?


First, I want to say that this is a great post. You always grow stronger when you make mistakes. Writing it up solidifies understanding in the learning process.

This story resonates with many people here because many experienced engineers had done something similar before. For me, destructive batch operations like this would be two distinct steps:

1. Identify files that need to be deleted; 2. Loop through the list and delete them one by one.

These steps are decoupled so that the list can be validated. Each step can be tested independently. And the scripts are idempotent and can be reused.

Production operations are always risky. A good practice is to always prepare an execution plan with detailed steps, a validation plan, and a rollback plan. And, review the plan with peers before the operation.


> 1. Identify files that need to be deleted; 2. Loop through the list and delete them one by one.

> These steps are decoupled so that the list can be validated. Each step can be tested independently. And the scripts are idempotent and can be reused.

This is the most underrated comment.

I'm saying it as someone who had the ultimate oversight of deleting hundreds of TBs per day spread over billions of files on different clouds and local storage.


I've never regretted treating tasks like this as a pipeline of discrete steps with explicit outputs and inputs. Sending output to a file, viewing it, then having something process the file is such a great safety net.


I'm impressed you went with an automated solution (Playwright) for 500 videos after all that, considering they could be cross-loaded from Google Drive almost instantaneously. I'm glad it worked, but coding around a screw-up under the gun seems like a high-risk operation compared to spending 4 hours doing the task manually (albeit being super bored the whole time) with the benefit of knowing it's being done correctly, rather than hurriedly writing a script to potentially do something else wrong very efficiently and dig your hole deeper.


+1 to this. After the few major screw-ups I've caused at work, my self-confidence in my coding ability was rocked, and I tended to react by erring towards manual cleanup rather than coding some scalable solution for fixing the issues.


Actually I was surprised reading that the person wrote a script to delete 900 videos.

If you need to do it once, it’s probably 2-3 hours of work? That is identifying a duplicate video and then clicking the button(s) to delete it once every 20 seconds.

Reminds me of https://xkcd.com/1205/


A big part of the reason for the problem in this post is because Vimeo made it impossible to move videos from one Vimeo product to another Vimeo product: "There were roughly 500 videos on VimeoOTT that had to be transferred to Enterprise and Vimeo doesn't provide an easy way of doing it."

I have found working with Vimeo to be very frustrating, especially recently. They have a great video solution, especially for streaming, but they seem to put up these unnecessary and frustrating roadblocks that make me constantly question my decision to use Vimeo. From the inability to move videos from one place to another, requiring complete uploads (resulting in problems like the one in this post), to nonsensical limits and pricing, especially on their new webinar offering, which has a limit of 100 registered attendees. For anyone who has run webinars before, this makes no sense since 100 registered attendees usually means 20-30% of those people actually attend, so you're capped at 20-30 live attendees. They should price it like most event sites and charge per live attendee rather than per registration.

Regardless, I've been very frustrated with Vimeo since it could be so much better if they didn't have these roadblocks in place. If they could have easily enabled moving videos from one product to another, the post (and 7TB of lost videos) would never have happened. It wasn't always this way with Vimeo, but they went IPO in May 2021 and it's no surprise they're turning the screws on their product offering and pricing now.


Honestly, this is positively representative of any junior developer with comparable experience. Depending on their background and how much production work they had, there's an overwhelming sense of eagerness and enthusiasm. Quick to script and perhaps a bit too quick to execute.

A friendly team will harness that enthusiasm and tame the quickness / encourage respect for production. We've all made a massive doo doo, and it's how you proceed that'll define your career.


We can all poke at this person for doing things incorrectly, but one has to wonder what mindset could lead to any programmer ever thinking that:

  1) parsing a web page shouldn't be considered incredibly fraught with problems
  2) that reloading web pages should be part of (1)
  3) that this should ever possibly be run without validating the list of files that would be deleted
So forget the specifics. Where are people learning these things, and what do we do to teach them better things?


Some mistakes can only be learned by making them. Sometimes you can tell someone a hundred times something, they won't learn until they experience it.

The point is not to prevent these mistakes, but to keep the consequences low.

Have backups, have version control, etc.


True, and worth remembering why. Most of us are constantly getting warned about the dire potential consequences of huge numbers of things, most of which are either massively unlikely to ever happen or not actually that bad, or both. It's very difficult to tell which of the things we get warned about are actually high risk until something bites us.


College? Parents? In my experience it runs pretty deep so not sure it can be easily trained out. This mindset is probably quite useful in evolutionary terms: rush at the attacking bear without thinking, for example.


> rush at the attacking bear without thinking, for example

Would that work? I don’t see a bear backing down and I don’t see the human winning either.


> Where are people learning these things, and what do we do to teach them better things?

Learn to learn and learn to work carefully. It starts in school and should be part of a proper college/university education or vocational training.

There's several ways of learning the specifics: by experience on-the-job, which can be hard if mistakes can get you fired; or by putting in the work in your free time.

If your job is to work with certain web frameworks and you're not very experienced, either ask senior devs to assist/review before going live with critical changes. Alternatively, practice at home. Unpopular, but you need to get experience from somewhere. OSS projects are a great way to do that - be that by creating your own or by contributing to an existing one.


"rm -rf" blowing you foot off is a Unix Right of Passage(tm).

You will do it at least once in your career. If you're old enough you will do it twice. If you're really old, you get the joy of doing it a third time.

The subtlety increases each time because you do learn.


Seriously.. also, looking at these code snippets...

If someone delivers code that looks like that, especially if intended for a production system, I'm firing immediately.

It's a miracle nothing has happened sooner.


From the article:

>I'm a Junior Developer with less than one year of actual experience.

>The bad news is that this was on Friday, and we needed to have the videos back up at most for Tuesday morning.

You say:

>If someone delivers code that looks like that, especially if intended for a production system, I'm firing immediately

Fire immediately? What a miserable sounding place to work.


In this case - seeing how they let them have direct access to production - I agree on the miserable sounding place to work and repeat myself -

It’s a miracle nothing happened sooner


I was referring to your workplace.


At least we don’t let junior developers with close to zero experience anywhere near production..

I didn’t quite read the part about his experience in the article, I agree firing over that wouldn’t be fair, but that just raises other questions.


The more I read about vimeo the more I wonder what's up with these guys.

Only recently they made some god-awful policy changes for content creators(1), but it looks like they treat their enterprise customers just the same.

Surely, there must be better alternatives for hosting videos than being at the mercy of a company who couldn't care less about big paying customers.

(1) https://www.theverge.com/2022/3/18/22985820/vimeo-bandwidth-...


mux.com seems like a great alternative and is super developer focused.


This is one of those times that even if you don’t use a fully functional language, trying to make as much of your program logic pure functions would be helpful.

It also makes it more testable. Instead of putting the delete call right in the loop, split it into four functions.

    function getAllVimeoVideos()

    function getAllDbVideos()

    function getVideosToDelete(vimeo_videos, db_videos)

    function deleteVideos(videos_to_delete)

Your core logic lives in getVideosToDelete which is simply a set difference.

Given that there are only a few hundred videos, it is easy to run the getter functions above and quickly verify they are returning what you expect.
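
In Python terms, that core function might be nothing more than the following (field names are placeholders):

    def get_videos_to_delete(vimeo_videos, db_videos):
        # Pure function: the only decision logic, trivially testable on its own.
        db_ids = {v["id"] for v in db_videos}
        return [v for v in vimeo_videos if v["id"] not in db_ids]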


This was going to be my exact recommendation. By "separating the concerns", you make it easier on pretty much every dimension: testing in unit tests, doing a dry run in production, the ability to read the code (for you and for code reviews), and in some cases your code will be written in a more functional way, reducing variable scoping issues.


Yes, that's fun. A

    List<Foo> getFoosToUpdate(List<Foo> foos, List<Bar> bars) 
function is the first time I thought about time complexity in my job.

Say Foo and Bar have fields in common, such that you can say a Foo object "equals" or "matches to" a Bar object, like if they have name and dateOfBirth fields or something else that are the same (nothing like a common ID between the two). Now say there are some other fields too, like amountSpentThisYearOnDogFood that you know is always accurate for Bars, but might be out of date for Foos. How do you get the list of all the Foos to update?

Initially I did the nested for loop solution that's like

   List<Foo> getFoosToUpdate(List<Foo> foos, List<Bar> bars)
   {
    List<Foo> returnList = new List<Foo>();
    foreach (var foo in foos)
    {
     foreach (var bar in bars)
     {
      // check if "equal" or "matching" based on some criteria
      // if equal, update foo dog food expenditure with bar dog food expenditure, add to returnList, and break
     }
    }
    return returnList;
   }
but that's O(n^2) right.

The solution with a Dictionary is obviously better. All you need to ensure is that you have a method for both the Foo and Bar classes that will produce the equivalent hash for both, if they would be considered equal or matching by whatever criteria you are using.

So you could have something like

    int GetHashOfFoo(Foo foo)
    {
     string firstName = foo.FirstName;
     string lastName = foo.LastName;
     DateTime dob = foo.Dob;

     return (firstName, lastName, dob).GetHashCode(); // convenient c# method
    }

    int GetHashOfBar(Bar bar)
    {
     string firstName = bar.FirstName;
     string lastName = bar.LastName;
     DateTime dob = bar.Dob;

     return (firstName, lastName, dob).GetHashCode();
    }
These two functions will return the same value if those fields are the same. So then you can do something like

   List<Foo> getFoosToUpdate(List<Foo> foos, List<Bar> bars)
   {
    List<Foo> returnList = new List<Foo>();
    Dictionary<int, Bar> barsByHash = new Dictionary<int, Bar>(bars.Count);

    foreach (var bar in bars)
    {
     int barHash = GetHashOfBar(bar);
     barsByHash[barHash] = bar;
    }

    foreach (var foo in foos)
    {
     int fooHash = GetHashOfFoo(foo);
     if (barsByHash.ContainsKey(fooHash))
     {
      returnList.Add(foo.CopyWith(dogFoodExpenditure: barsByHash[fooHash].DogFoodExpenditure));
     }
    }
    
    return returnList;
   }
Which is faster cause you only have to go through the bars list once.

I actually messed up something like OP with this, but with doing undesired additions instead of undesired deletions.

You can think of it as having two endpoints, both expecting a .csv with rows being the things you were updating/changing/deleting.

The problem was, there was a column to indicate (with a character) whether the row was for an edit, or addition, or deletion, but this was only with one of these endpoints. For the other, there was only addition functionality, but I thought changes and deletions were also options for the other kind of .csv due to some unwise assumptions on my part (thinking that the other .csv would have the same options as the other). That's how we accidentally put in over 100 additions that should have been changes that had to be manually deleted. Luckily I had a list of all the mistaken additions.


"I'm under an NDA"

Don't write a blog post.


Oh dude, we've all been there.

9 years ago I was working for a major broadcasting company in the arse end of London as a junior dev, building one of their Android apps.

We'd roll features out months before & enable them with feature flags via a json file we'd manually push to a prod server at a later date.

We'd just built a huge new feature letting you request content to be downloaded to your set top box remotely & it had a 250k marketing campaign to go along with the launch.

Senior dev trusted me with prod deployment rights.

I pushed the wrong json config to prod, launching the feature weeks before the marketing campaign.

Thank god I was a junior perm, that was definitely a firing offence.


> Senior dev trusted me with prod deployment rights.

That part's crazy! If you think it was a firing offence, wouldn't they have been fired too? (I don't think it is, but it obviously requires system changes/an explanation.)


> It involves bad practices and errors from multiple parties in a world that might seem

> foreign to the "Silicon Valley" world but paints an accurate picture of what

> development is for small IT companies around the world

Everybody makes mistakes, even in the "Silicon Valley" world, but such problems could be easily caught by testing (which he did, but it was restricted to the first page) and by performing a simple dry run.


Exactly, everyone makes mistakes. Sometimes huge ones. In hindsight or on the sidelines it's always easy to point out a few technical things that WOULD HAVE avoided catastrophe, but does that help? I think not (aside from a cautionary parable for interns).

Things are complicated, people are human and forget things, there are pressures to "get it done" and override the guardrails. Everybody has horror stories. Some worse than others. Welcome to the OP's day of horror. I would think "Silicon Valley" dev-ops horror stories make this one seem like a triviality.


Apart from all the advice on how to do such destructive operations more safely, I think there's also a lesson to be learned about communicating more actively:

1. Vimeo responds to the original request with "will look into it", then... nothing happens? This may depend on culture, but at least from my experience in the UK, this is a very non-committal response, and if you really want them to do something, you'll need to chase them. Wait a few days and inquire if they have any estimate for when it might get done, or if they need more information. I find that the "looking into it" response is sometimes used to gauge how important the request is to you.

2. Once you go with your own solution, just drop a quick message to Vimeo: "Hey, just wanted to let you know we've found our own solution for this, and won't require your help any more. Sorry if you've already committed any resources for this task. Have a nice day, yada yada." This not only avoids what happened here, but is also a courtesy to them.


Hey, everyone, ease up. I have: 1) dropped a production database because I thought it was the test database. 2) screwed up a print job costing $100,000 in today’s money and had to do it again 3) crashed all of Facebook with a C++ bug. 4) crashed Facebook photo uploads, with a JavaScript bug, in my first month. 5) literally killed a startup’s cash flow and caused them to lose their merchant account because I over focused on the wrong bugs.


At my first development job (paid internship at a moderately-sized, though fast-growing business - maybe 300 people at the time?) I introduced a bug that didn't appear until a certain microservice stopped working (my code defaulted in the wrong direction when the ms failed) and as far as I can tell they may have lost or almost lost a pretty big account from it. In an after-hours meeting regarding the issue, one of the higher ups ended up storming out and never showing up again.

In my defence, we had to get 2 PR approvals before anything was merged! But I definitely learned a thing or two from that experience


You worked at Facebook, we get it


Code without constant logging of “utc [who] does what exactly” has been a no-go for me for a long time. Also, if you have to be destructive, replace the <rm/sell/halt> with log() at least once (aka --verbose --dry-run) and check your expectations. One-shot scripts like this are a screaming disaster.

(The problematic line lacks the closing ", probably a typo? I thought it closed in an unexpected location.)


This is more common than you think. Not just losing data, but not having a good handle on where the important parts of the system are, and how close you are to catastrophe. I find diagrams really help. I can recall a visual map of the system when I work on some component, and think, "OH, I remember seeing this component connected to a really critical thing, I need to check something first."

Start by creating one empty page for every component of your system. You won't remember them all, but over time you can add missing ones. Each page is the authoritative source of info on that component. If you need more pages for one component, put them in a directory of the same name as the page and add ".d" to the directory name, and link to them from the first page.

Finally, create a diagram (however you want) that includes every component you have a page for. Add the count of components to the top of the diagram. If the count on the diagram doesn't match the number of documents, it's time to update the diagram. If you ever add, remove or rename a page, it's time to update the diagram.

If you do this the same way for every different system you have, you can link them all together and get both small and large scale diagrams. (p.s. don't waste time automating this unless you find the system changing constantly or you have a very big system)


I believe if we're honest, we've all done stupid things we should have avoided. I remember a group of about 3000 emails that went out to insurance agents saying that policy #123456789 for Someone Funky was going to be cancelled by underwriting. I also remember very quickly figuring out how to automate Outlook's email recall feature.

We've all made big dumb mistakes. Recover and learn.


It's like the first time you run

  rm -rf /path/to/delete/ * 
And realize it is taking too long...


Can you explain? I feel like it removes / but not sure why.


   rm -rf /path/to/delete/ *
Note the space between the last / and the *

This will recursively remove the directory /path/to/delete and remove every file/directory that matches * in the current directory where 'rm' is being run.

When what was most likely meant was:

   rm -rf /path/to/delete/*
Note the lack of a space between the last / and the *. This will remove all files matching * that reside in the /path/to/delete/ directory.


Besides recursively deleting /path/to/delete/ the command also deletes all (non hidden) content of the current directory (note the * at the end of the line). I assume the correct command would be /path/to/delete/*.


The error is the space before the asterisk. The original intention was to delete the contents of the folder /path/to/delete/. Instead, the asterisk enumerates files in the current directory and they get deleted


It removes everything in the current directory


> Vimeo doesn't provide an easy way of doing it. I wrote to the support team around October asking them if it was possible to do a migration, and they told us that they "will look into it" without letting us know anything ever since. [...] At one point, without letting us know anything, Vimeo decided it was a great idea to comply with our request and dumped all the videos present on OTT onto the new platform. No questions were asked [...] they were duplicating videos that were already uploaded.

Oh yes Vimeo, the crappy company that won't let you play videos unless you enable autoplay in your browser[1].

Selecting them as a provider was the actual mistake.

[1] https://askubuntu.com/questions/777489/vimeo-video-not-playi...


This is why you have backups. Good on you to have them!

When I just started as a junior dev at a small company I made the classic mistake of emptying the prod db instead of my local dev db. This was a small and in hindsight insignificant project. But Google was our customer, so it didn't feel insignificant at the time.

In this case my inexperience was partly my savior. All the data was inputted by people via a web form. Normally you're supposed to use POST to submit a form. But I was quite clueless at the time, so I had used GET. This meant all requests were still in the Apache logs. I could simply replay all requests.

I still feel my heart pounding when I think about the moment I realized what had happened. I was really relieved when everything was back!

What I learned from this incident:

- make automated backups

- no access to prod db from anywhere but prod


Yea, I’ve wiped out an entire government’s form library once. Backups are a career saver.


For larger 'live' production changes I've now started to rely on generative programming. I've got one script in some 'normal' programming language like JavaScript or Python, which in turn generates a script containing a list of curl or other CLI commands that do the actual deletion, modification, addition, etc.

This allows me to run a small sub-set of commands and test those under a live-environment before running all commands at once. In addition, this also functions as a complete log of what has been changed manually in production.
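
For illustration, a minimal sketch of the idea in Python (the endpoint, token variable and id list are hypothetical placeholders, not anything from the post):

    # generate_plan.py: writes a reviewable shell script instead of deleting anything directly
    ids_to_delete = [101, 102, 103]            # hypothetical ids, e.g. loaded from a vetted file
    api = "https://api.example.com/media"      # placeholder endpoint

    with open("delete_plan.sh", "w") as f:
        f.write("#!/bin/sh\nset -e\n")
        for media_id in ids_to_delete:
            f.write(f'curl -X DELETE -H "Authorization: Bearer $TOKEN" "{api}/{media_id}"\n')

You can then review delete_plan.sh, run only its first few lines by hand against the live environment, and only afterwards execute the whole file.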


Kudos to you for "learning in public" by showcasing part of your learnings online!!! I think this is extremely important to do!

Not everyone is an innate rockstar developer who provisions k8s clusters for breakfast and delivers features for lunch!

Being a developer is a really hard job and there are endless complexities and difficulties along the way and when we are more seasoned already.

Don't let any negative feedback deter you from keeping doing what you're doing: learning from your mistakes and improving along the way!


Shouldn't that be `page={page}` rather than `page{page}`? Or better yet, use the requests `params` argument.
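
For example (the endpoint is the one quoted from the post; page is whatever counter the loop maintains):

    import requests

    page = 1  # the loop's page counter
    # requests builds the query string itself, so there's no f-string to get wrong
    response = requests.get(
        "https://api.ourservice.com/media",
        params={"page": page, "step": 100},
    )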


Any process that makes a junior directly access prod codebase/database is flawed. No matter how small of a company you are, you can set up a proper CI/CD pipeline.


90% of IT companies in Italy don't even know what a CI/CD pipeline is. That said I don't think it's something we could've integrated in our pipeline as it's an error that originated from an external service!


The only thing I can remember helping against such actions is requiring confirmation of intent that scales with the size of the operation.

That means if you delete one small file you need one confirmation, but if you delete thousands, you need to state an intent: "I expect a thousand files to be deleted." The same goes for size. So not an OK button, but a form that lets you enter the dimensions of the intended outcome: 100 files max, 1 GB max deleted.

If the request goes over the stated intent, the system aborts.
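
A rough sketch of what that could look like (all names and numbers are made up):

    def confirm_intent(planned_count, planned_bytes):
        # ask the operator to state an upper bound before anything is deleted
        max_count = int(input("Max number of files you expect to delete: "))
        max_bytes = int(input("Max total bytes you expect to delete: "))
        if planned_count > max_count or planned_bytes > max_bytes:
            raise SystemExit(
                f"Aborting: plan ({planned_count} files, {planned_bytes} bytes) exceeds the stated intent"
            )

    # e.g. a 7 TB deletion aborts unless the operator explicitly expected that scale
    confirm_intent(planned_count=1200, planned_bytes=7 * 10**12)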


It should really be something like: "a flaw in our system allowed me to delete 7 TB of videos". Not entirely your fault.


System and/or development processes


This is a great technical write-up, and I'd love to hear the human side of this story as well! When did you tell the higher-ups that you deleted production? Was no one more senior on call to try to fix it? Did they want you to learn how to fix it? Or were you the most senior person responsible for this whole area? Or did they not know?


The first part of my write up slightly explains it but the point is that HN is the top 1%. In my current company we have 10 developers, most of them without a technical degree. They know how to do what they've been doing for the past 10 years but (as with most small companies here in Italy) people don't know what best practices are used in the industry, what a pipeline is or what a dry-run is (I learned about it today myself!).

What happened is that no one knew how to react and I was probably the best suited for it, we don't really have seniority in office.

That said when I deleted the videos I immediately told my boss. He was kind of scared but his reaction was mostly "Well, now we have to re-upload them immediately, find a way. The people that uploaded them once won't be doing it twice". I was basically left on my own to find a solution (which I luckily did).

Please note that I'm in no way blaming my company or accusing it of something, this is the standard knowledge base and way of dealing with things in many places, contrary to what working in big tech or reading HN might make you believe!


Thanks for the explanation, that makes a lot of sense!

> "HN is the top 1%" + "this is the standard knowledge base and way of dealing with things in many places, contrary to what working in big tech or reading HN might make you believe!"

I'm in fact from Spain and now live in Japan, and I believe the practices in Spain would be as bad as Italy, and in Japan they are def worse (great at hardware, horrible at software), so I do understand a lot of what you are saying. FWIW, in Spain I've seen whole dev teams composed only of interns!

> "we landed a big contract for one of the biggest gym companies in Italy, the UK and South Africa" + "we don't really have seniority in office"

Maybe now that it seems like you have the budget, it's a good time to go to management and suggest hiring some senior devs who can mentor the rest into learning best practices? You can sell it to management as a reinvestment in the company if they'd otherwise want to take it as pure profit. If Italy is like Spain, many devs won't really even want to learn these things, but some will, and those will become seniors at some point.


> "What does this teach us? Well, it teaches me to do more diverse tests when doing destructive operations."

I think it also teaches us that adversity sometimes leads to better solutions. I love that the OP made a hacky script that did in 4 hours what a guy was paid to do manually over several months!


>... the "Silicon Valley" world ...

To rebillionizing!

https://www.youtube.com/watch?v=wGy5SGTuAGI&t=369s

...yeah, the Tres Commas bottle was on the DELETE key. The corner of it was just, it juuuust got on there...


> but at the time the code seemed completely correct to me

I venture this kind of (misplaced) over-confidence is not atypical of many junior developers. As someone with a few years under my belt, I don't care how sure I was of the code I wrote that deletes important data, I would have gone through the code over and over again, and at least ran a simulation (by maybe logging the generated delete urls for manual verification).

It's a rite of passage and we all went through something like this. It's how you learn and grow.

>It also should probably teach something to Vimeo

No. Even if Vimeo could have made things better, it's still your fault. You have to take responsibility for your business. At the end of the day, if this causes the closure of your company, Vimeo is still fine.


After having read about plenty of such cases over the years, I have a persistent dread of pulling something like that myself, to the point of being nervous with ‘*’ in the terminal, and generally checking everything twice. (And also have some kind of mild horror-high from corporate snafu stories, weirdly reminiscent of Ballard's ‘Crash’).

So: I never feed the data straight from the gathering script into the modifying script, at least not in the first runs. Instead, I dump the whole list of items into a file, count them in there, gawk at them to see that they're right, and compare with the source data by hand until I begin to annoy myself. Then I feed that file to the second script.
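
A rough sketch of that split, with the gathering function left as a hypothetical placeholder:

    # gathering script: only writes the candidate ids to a file, nothing destructive
    candidate_ids = gather_candidate_ids()        # hypothetical function that queries the API
    with open("ids_to_delete.txt", "w") as out:
        out.writelines(f"{media_id}\n" for media_id in candidate_ids)

    # modifying script (separate run, after counting and eyeballing the file by hand):
    with open("ids_to_delete.txt") as inp:
        ids = [line.strip() for line in inp if line.strip()]
    print(f"About to act on {len(ids)} items")    # last sanity check before doing anything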


Great post and great attitude.

I think I would reflect on why this is a script to begin with. It's run once and with only 500 items could be done manually, though 500 is certainly a bit much.

But it's not a massive time saver; the point of the script should be almost entirely to increase accuracy. I think I would write one script to generate the list of videos to delete; that's the part that's actually difficult, and a human can then verify the list. I would probably just delete them by hand after that, but if I really wanted a script for that part too, it would be a separate script that uses a list that has been vetted by a human even if initially created by the first script.


Does anyone else get that deep, dark, disturbing feeling in their gut when they know they have done something bad like this?

This is why I use so many print statements and comment out destructive actions! Lots of experience with these feelings!


You are fine, dude: you didn't delete any videos, only a high-availability cache of videos on a streaming site. If that had been your master copy, you probably would have taken greater care; since it wasn't, well :-). Anyway, when working with caches that can be recreated in reasonable time, it's normal to take less care than with originals.

The only concern is Google Drive as the only backup, please make sure you have a local copy on a local RAID drive and another one regularly archived and stored in a bank locker.


As everyone else has already pointed out, better testing would have been very useful here. For instance, print(len(our_ids)) would have been a dead giveaway that something was up.

I am also a junior dev and completely empathize with being given a lot of responsibility and potentially messing up. I think for someone with < 1 year of experience, to solve the problems you created as fast as you did is really impressive. Thankfully your story ends well :)


On the product I work on, I can watch events after the fact (videos of people using it), and it's so embarrassing watching it fail. The wasted time. Ahh... I've gotten better at checking deps and running a full automated E2E test every time new code is deployed (diffing environments before/after).

Still things happen. Hopefully you have a large enough client base where some bad experience doesn't define the whole thing.


For many years I have had a private blog. I like to write but realised 99% of us are not interesting to read. This is a young guy processing his thoughts. Not "teaching" the rest of us as he frames it. This should have stayed in-house and personal. The company can then decide which clients, authorities to contact if necessary. There is a book in all of us as they say. For most of us it should stay there.


Experience is directly proportional to the amount of equipment ruined or data lost.

Even though you were fortunate not to lose any data, you gained a lot of experience!


A great success story as far as I'm concerned, even if it doesn't reflect well on Vimeo support. But a good reminder to have someone doublecheck your logic if you aim to delete massive amounts of data from production. And to check if the backups are working (producing restorable data) on a regular basis. Sometimes they just seem to be working, as I have learned the hard way...


I'm currently working with FOIA software, and a regular user can only delete one document at a time from the information that they verify/redact before sending out. They can't even multi select! Only an admin can delete multiple documents at one time.

I'm guessing users accidentally deleted multiple documents one too many times, and now it's baked in.


Not completely off topic (one of my scripts recently deleted files whose dates were off by one):

> Fri May 06 2022

> I'm currently working [...] in Italy


Mistakes happen. Kudos to the author on taking it as a learning opportunity. I am friends with a lot of smart devs, and many of them have dropped a production db at least once, and if not then, then accidentally emailed 10k people …etc. It happens. Work to avoid it, but plan for what to do when it inevitably happens. ¯\_(ツ)_/¯


Related: is there any HTTP API model that supports transactions with commit and rollback? Also isolation levels? Usually one wants set_stock(get_stock() + 10), but there may be competing writes from various clients between the two calls, resulting in races. Typical web APIs seem vulnerable to this.


Wouldn't the model be to expose an increment_stock(10) type HTTP endpoint instead, and the backend can ensure it's atomic?
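
A minimal sketch of that kind of endpoint, assuming Flask and SQLite purely for illustration (the routes, table and database are placeholders):

    from flask import Flask, request, jsonify
    import sqlite3

    app = Flask(__name__)

    @app.route("/stock/<item_id>/increment", methods=["POST"])
    def increment_stock(item_id):
        delta = int(request.json["delta"])
        conn = sqlite3.connect("shop.db")   # placeholder database
        # a single UPDATE is atomic, so concurrent clients can't race a read-modify-write
        conn.execute(
            "UPDATE stock SET quantity = quantity + ? WHERE item_id = ?",
            (delta, item_id),
        )
        conn.commit()
        qty = conn.execute(
            "SELECT quantity FROM stock WHERE item_id = ?", (item_id,)
        ).fetchone()[0]
        conn.close()
        return jsonify({"quantity": qty})

Clients then POST {"delta": 10} instead of doing get_stock() followed by set_stock(), and the race disappears on the server side.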



His solution reminds me of how I used Cypress to generate test accounts on our local admin dashboard for Cypress tests, since our api was inadequate (it didn't do the billing signoff required to create accounts that last longer than a month... don't ask...).


I accidentally deleted a printer from the print server using a Python script. The docs weren't exactly clear, so I thought it would only remove the local printer connection. After reading this post I feel better now. My fuckup wasn't that bad in comparison. :)


In my opinion, any process that isn't preceded by an identical, automated process that varies only in the data involved is very risky to run in production. Did your management get a big reality check? Or maybe not, because of the backups?


> Some of the things that might seem obvious to some might not be so for me, thanks!

> my mind thought that url would refresh itself as soon as the page variable changed

This is what I thought too when I read the code. I don't think it's obvious at all!


That's actually surprising to me. In most languages that I've worked with, strings are immutable so the fact that url doesn't update is more obvious to me and I'd be surprised if it did update.


> .. physically backed up in a Google Drive folder ...

That's not what a physical backup means.


I am also a junior with 1 year’s experience, just in Python but none with the requests module or web development. If the ‘page’ variable is being changed, was the error something specific to this module, not refreshing the page?


This wouldn't be an issue if providers like Vimeo would soft delete and only hard delete the items after a period of time, allowing recovery in between.

Everywhere I have to implement a delete operation, I never hard delete data on first call.



f for format ("formatted string").

It does the same thing as `https://api.ourservice.com/media?page${page}&step=100` [sic] in JavaScript, or "https://api.ourservice.com/media?page$page&step=100" in Bash, PHP, Perl or Groovy (and other languages). It opts you into variable substitution / interpolation in the string literal.

In Python these string literals are called f-strings if you want to look it up. They are defined in PEP 498 - Literal String Interpolation [1] and available since Python 3.6.

[1] https://peps.python.org/pep-0498/

[sic] there probably would be a missing '=' in this url after "?page"
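
For example, the f-string from the post (with the missing '=' added):

    page = 2
    url = f"https://api.ourservice.com/media?page={page}&step=100"
    print(url)  # https://api.ourservice.com/media?page=2&step=100

Note that the f-string is evaluated once, on that line; re-assigning page afterwards does not update url, which is exactly the bug in the post.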


It's a Python f-string [0]. A way of formatting a string by directly including a Python expression between curly braces.

[0] https://docs.python.org/3/tutorial/inputoutput.html#tut-f-st...


"f-strings", a (new) way to format strings.


if it's python, it's the formatting/interpolation string marker.


Nice work :D I tend to always add a `--dryrun` flag to any scripts like this these days so that when we move it to production we can run an extra test there just to be sure.
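
Something along these lines (the id list and delete call are hypothetical placeholders):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true",
                        help="print what would be deleted without deleting anything")
    args = parser.parse_args()

    for video_id in ids_to_delete:                # hypothetical list built earlier in the script
        if args.dry_run:
            print(f"[dry-run] would delete {video_id}")
        else:
            delete_video(video_id)                # hypothetical destructive call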


The company was lucky to have someone like you that could actually sort out real problems efficiently. I would bring up this story when negotiating for a raise.


Always do a dry run when deleting many things with code.

- Captain Obvious


fwiw I would probably have turned to rclone.org for this. It doesn't have support for vimeo out of the box but the Vimeo API seems sane enough that it would be trivial to implement uploads quickly.

Previously used rclone for doing massive transfers between cloud providers using "cheap" on-demand servers which provide unlimited data transfer (the public clouds make this very expensive).


Everyone makes mistakes, juniors and seniors alike, but I think you have the right mindset and problem-solving skills to thrive :)


So much wisdom in these comments, people have different styles of being careful, and each makes sense in a nuclear "go" situation


A computer lets you make more mistakes faster than any invention in human history, with the possible exceptions of handguns and tequila.


Imagine coding while drinking tequila...


But are you a junior dev with less than one year of experience working by yourself alone at a company? No tech lead/help?


when doing migrations/conversions I always write a script in dry-run mode first. I exhaustively check the results to make sure they are expected. Then try to do a real conversion/transfer of only the 1st file and make sure that worked. Then do a couple more. Etc. Only then do I feel confident to do the whole thing.


So, apparently, vimeo has better support than youtube (not informative, but at least they DO something). Duly noted.


You can automate using puppeteer or selenium


The author used Playwright in the end to automate uploads. Using e2e tools for automating tasks is clever, I'm not sure I would've thought of it.
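
The general shape of it with Playwright's Python API looks something like this (the URL, selectors and file name are made-up placeholders, not the author's actual script):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://example.com/upload")                  # placeholder upload page
        page.set_input_files("input[type=file]", "video.mp4")    # placeholder selector and file
        page.click("text=Publish")                                # placeholder button
        browser.close()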


It's clever, but also brittle. And might have disastrous error conditions (like hitting "Delete" instead of "Continue" if the wrong UI part has focus).


> I Accidentally Deleted 7TB of Videos ...

Spoiler:

But there was a backup that could be reuploaded in time and everything was fine in the end.


The conclusion should include that backup at separate locations is key. Also, that the backups are tested and work. I worked with clients that had everything from lightning strikes destroying servers to ransomware to people making mistakes. No problem with solid backups. There is a difference between a good process and skill.


Would you have had the courage to post this here if you hadn’t been able to fix it?


Junior Dev: "I'm under an NDA"

Also Junior Dev: "Here's my source code"


Unless you're Oracle that code is hardly critical to the business.

Even as a Sr Dev I'd share stuff like that, it's code that'd appear on a stack overflow post anyway.


Is 7TB a lot? (Peers at personal arrays orders of magnitude larger.)


Related: The change is fine, it's only one line.


Now you learned what a backup is.


How can any enterprise rely only on such online services and not keep copies of their work on their own storage?

At least store the data on large multi-TB hard disks connected via a SATA adapter when needed, and put them in a case in a safe place (better: two copies, stored in two places). What is the price of the drives plus the copy time relative to the production work?


Scary. Might as well just pay Vimeo to restore the data.


[deleted]


I'm baffled by this too. Unnecessary bridge burning I'd call it.

It's not even necessary to the story.


It's explained in the first line: "I'm a Junior Developer with less than one year of actual experience. Some of the things that might seem obvious to some might not be so for me". I guess it applies to this, too, not just the technical aspects.


I might've missed it, but I don't think that line existed when this was first posted.


You're right and I edited the company's name (might be too late but better this way). That said I'm not very happy with the experience of working for TheCompanyTM anyways so I'm in the process of switching jobs.

Thanks for the comment :)


Talking bad about your employer is great for finding a new job. Companies are eager to hire people who bad-talk them.


He doesn't talk bad about his employer. He talks bad about his employers client.


Tech is like any other human endeavor. People talk. People change jobs and still like the people in the place they left.


Yes exactly. Which is why I wouldn't touch anyone who has no criticism for the systems OR culture of a place they've been before.

Nowhere is perfect. If people can't be honest about the flaws then they're useless.


Of course you can hire whomever you want. I would hire someone who has criticism about what he had done in the past and what they have learned. Nobody is perfect. But people with no self reflection blaming others and their employer? No thank you.


As sibling comments indicate, I would advise emailing HN mods to take this post down and remove it from your blog and post it on an anonymous one. Here are the problems you will face:

1) Your current blog has your current employer + client linked to it. 2) Your github has your real name. 3) All of these have been crawled/archived.

None of this bodes well for your career in the future. While I think your blog post is a great war story, it's really not a good idea to post it on your main account which can be traced back to your real name and CV because it will come up the next time you apply for a job.

Unfortunately, even if it illustrates a great deal of ingenuity and creativity on your part in fixing a mess you made, many folks will take one look at it and be judgmental. You have to manage your reputation online and be careful.


I would take down the post entirely.

Your current job is linked in your CV.


And try emailing the hackernews mods asking them to take this post down.


You're welcome and good luck!


What negativity and arrogance in the comments here. Jeez, it's like no one on HN ever made a mistake; a bunch of 10x ninja programmers here. Please read this:

>I also want to preface this whole post by saying that I'm a Junior Developer with less than one year of actual experience. Some of the things that might seem obvious to some might not be so for me, thanks!

It's just some kid sharing a mistake they made and owning up. Ease up on the "LOL what an idiot" attitude


I was actually really impressed with this individual! For someone who has less than a year of experience, they're showing quite a bit of initiative, drive, and curiosity - which really are what make or break engineers as they develop. Taking the time to do a blog post (effectively a post-mortem) and share it is even better!

And yes - I've literally done this exact same error (with TB of video data!). Spending the following week remediating all of that data loss was a great lesson in patience and attention to detail. :-)

OP: If you're ever looking for a job be sure to send me a message. Contact info in profile.


Wrt "less than one year of experience", looking at Nikita's CV and GitHub, despite the title, they aren't really a junior developer :)


True, he's been teaching programming since at least 2018. I was in a similar boat: I'd been programming for 5-7 years for fun and profit before my first official full-time job.


My mistake was on a floppy disc with source code, other text files and images. I was hand-editing the floppy (in a hex disc editor) to get the data back, sector by sector. Fun times. Not going back there though :-)


Mine was a DELETE FROM Users; WHERE... Fun was had.


Usually the recommendation is to not start writing the DELETE query first. Write the SELECT query first and see the results. If you miss the WHERE clause, you will see that immediately. Then change SELECT * to DELETE. But I assume you have learned that lesson already :)
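
In script form, the same habit might look like this (the table, column and cutoff are made up):

    import sqlite3

    conn = sqlite3.connect("app.db")                      # placeholder database
    where = "WHERE last_login < '2020-01-01'"             # the clause you intend to delete by

    # 1. run the SELECT first and eyeball the result
    rows = conn.execute(f"SELECT * FROM Users {where}").fetchall()
    print(f"{len(rows)} rows match")

    # 2. reuse the exact same WHERE string for the DELETE, only once the SELECT looks right
    # conn.execute(f"DELETE FROM Users {where}")
    # conn.commit()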


Yes, but it can't be stressed enough; there's always a first time for someone.


I think it was a great post. Reveals a knack for clarity in explanations. The mistake is simple enough and natural for a junior. If it were just one video or something, it would probably not even be noteworthy. I think the developer learned from the incident too. So all good.

I do think Vimeo was irresponsible in the whole affair though.


More importantly, this person is helping us learn from their mistake. This is something that should be encouraged, not mocked.


there's an argument that the best people around are the people who have already (or almost) made some big mistakes.

I have made a couple of huge ones - luckily I kept my job


When interviewing candidates I always enquire about their professional mistakes. Their reply often is the decider between hiring/rejecting.

I want to have colleagues who admit fault, be truthful about actions which lead to the issue, and learn from it. The learning includes organisations perhaps putting additional measures in place to prevent future issues.

One candidate told a story about being on-call early in his career; he was told incidents happened so rarely that he should just continue living life as normal.

Unfortunately for him, his pager went off at 02:00am while he was high as a kite on drugs - but felt he had to take action (mostly due to arrogance!).

He promptly deleted production data and things only got worse when he tried to rectify the situation.

Of course he was fired for his actions but ever since he's been stone cold sober when on-call.... just in case.

He learned a valuable lesson about professional responsibilities.


>When interviewing candidates I always enquire about their professional mistakes.

"You see, my biggest mistake was programming in the first place! Since then, it's just been an apology tour"


Don't fire for the mistake. Fire for someone's inability to own it, for covering it up, or for pointing fingers at others.


His honesty of admitting to being off his nut while on-call led to his firing, not the action of deleting things.


>His honesty of admitting to being off his nut

This now my favorite euphemism for being high


It’s funny how so many managers on this board are like, yeah I focus disproportionately much on this one factor. Why? Because my intuition and experience says so.


I currently have about 12 years of experience, and a few years back I accidentally cleaned up GitLab's database a bit too well. I wouldn't be surprised if the people being dismissive simply never worked on a moderately complex and large system, and thus don't understand how easy it is to make these kinds of mistakes.


I’m impressed by their commitment to automation. If that was me, once I realized that manually uploading from Gdrive to Vimeo would fix the problem, I probably would have just committed myself to manually doing that all weekend. It would feel safer and serve as a sort of penance for screwing up the automation the first time.

But nope, they went right back to scripting and got it done.


LOL!

I have many more years of experience than this man, and I could still *all* *too* *easily* make a 7TB mistake (or likely a bigger one :P)


This sort of mistake happens all the time when you write in multiple languages. A key solution is code review, a standard practice which doesn't seem to have happened here (and certainly isn't the fault of a junior).


Just to be fair to some commenters: from what I remember, the post was edited after it was submitted... so maybe the older comments are not very relevant.


To clarify, I only removed the company name and added the top disclaimer


I have made a lot of such blunders myself. I once accidentally deleted code I hadn't checked in and had to re-write everything from memory.

I envy those who claim to make no mistakes at all.


Don't envy them - they are deluding themselves.


I’ve been there! At least when you write it the second time it goes more quickly.


"What does this teach us? Well, it teaches me to do more diverse tests when doing destructive operations. It also should probably teach something to Vimeo and to my contractor but I doubt it will (and yes, the upload for some reason is still manual to this day. Go figure!)"

So you wrote bad code, didn't test it properly, ran it on production on the Friday before a release and are blaming Vimeo and [name redacted]?

And your resolution was yet another cobbled together script that you probably didn't test?

This isn't a great article to have attached your name to


I'd hire this guy if only for being this frank about his mistake. He owned it, and that is what I would look for.

After the deletion, what should he have done? Postpone the go-live? That's often not a cost-effective option. As for a risk analysis: the worst that could happen was deletion of the remaining videos, and I don't think that makes a big difference in this situation. And to do the right thing while in a hurry, you have to have the infrastructure already in place. I doubt that's the case for a 10-person shop.


Agree 100%. Acknowledged mistake, moved forward to find a solution. Reflected on lessons learned. Shared valuable lesson.

To me this indicates intelligence, competence, integrity, grit and generosity. Technical proficiency is much easier to come by than integrity, grit and generosity. I would trust the author to deliver on commitments.


Aye, this is how you learn and make sure it doesn't happen again.

I did a similar thing ~20 years ago when I first started my career, accidentally deleting a production database because I thought I was working on the test database.

I owned it, learned lessons from it, and it's never happened again.


Owning the mistake would be fine if he did that - he didn't. He blamed the company he was contracting for. That's a big no from me.


It's as if we read different articles. He literally writes that he made "A series of mistakes that could've probably been easily prevented."


I'm sorry if it came off like that. The mistake in this case was completely mine (bad code and bad testing). The detour on the other two companies was mostly because this way of deleting/recovering stuff should've probably been avoided in the first place, other than that I'm absolutely not blaming anyone else!


Don't worry about all that - there isn't a developer worth their salt that hasn't made a mistake. But I'd consider having this blog post and HN post retracted purely for future internet checks. It isn't a reflection on you, and your honesty is fantastic. But there is a lot to be said about using a pseudonym when it comes this close to your employers


I'd probably make your github profile private for a while as well. Or at least removing your real name from it.


Agreed. But I’d also fire him from this job.


Doesn't make sense. Their employer literally paid them to learn from their mistake.

Now, you think they should be fired? So that another employer reaps the benefits of that learning experience?


"Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?"

-- Thomas J. Watson


For having got into a sticky situation and out of it?


Will every developer who has never checked in bad code on Friday, or accidentally deleted the wrong data, please raise their hand?

‘Judgment comes from experience, and experience comes from poor judgment.’

:-)


Not to mention that he _deleted_ the videos but didn't _lose_ them. Nothing to see here.


Vimeo completed a major migration of videos between accounts with no confirmation or communication before committing it, then refused to reverse the change. Hardly the best service.

The article hardly comes across as 'blaming' them for the core issue but they were definitely not helpful.


> This isn't a great article to have attached your name to

A million times better than your comment.


All I did was give advice. If you don't like it it's fine.


Earlier in the article, the author does call out that it's bad code, so he's not entirely blaming these companies. Anyway: You should not be afraid of thinking about what each party could have done better. Not just yourself, but other people too. When I look back on times where I only blamed myself for prod issues, it was less of a learning experience, and more focused on beating myself up for no good reason. That approach shows that I'm afraid of the consequences, and it's an effective way to feel isolated from the team instead of improving.


Better to do it before the release than afterwards. I'm assuming this way nobody noticed the issue.

Also, would you rather everyone only ever posted about all the times they were successful?


(Since the OP redacted the company name from the post, I've done the same in your comment here. I hope that's ok.)

(We do this sort of thing to protect users, usually as the result of an emailed request, and you can tell when we've done it because of the word 'redacted' in square brackets.)


Oof, we wouldn't work well together. Very rarely is someone good enough to be this obnoxious.


I very much doubt you would ever work with or for me.


So, “I am under an NDA”, but I reveal my client’s name and a lot of sensitive details about what we are doing. LOL.


Where do you see the client's name? I only see Vimeo being mentioned.



Got it. To be honest I'd be hesitant to publish a blog post like that with your name + current company name attached to it.

It's a bit different to share a fun story a few years later about that time you almost wiped production.


It still breaks the NDA:

* Firstly, you don't have to name the company to break the NDA anyway (you are still disclosing information you aren't supposed to disclose, regardless of whether it can be linked back to the company).

* Secondly, the client is still named on the front page of the website.

* Thirdly, OP posted this with his real name that trivially links back to the dev shop he is working for. The site also has his CV which lists the client again, with a description of the project to link it to the post.

* Finally, The client can trivially be identified by googling the description in the second paragraph (i.e. just search the named countries in operation plus the word Gym).


Not all NDAs have the same terms. I could write up and serve an NDA right now that still counts as an NDA yet permits everything in your list.


All contracts vary in terms, but I've never seen an NDA that says "you can talk about the content under NDA as long as you don't mention the business's name, and just identify who they are in a roundabout way instead".

"Well i'm under an NDA, so I can tell you all the specifics of the project, but I can't tell you the companies name. I can say they own the largest search engine though, and have a market cap of 1.5 trillion, and rhyme with "Roogle", but I really can't say who they are. Anyway, here is some code I wrote for them and a description of how we nearly ruined their project along with me calling them incompetent..."


Well at least deleting the secret is a step back toward the NDA he left behind.


Under NDA but I'll give rough details of what's occurring while also naming my client and disparaging them to the public.

Well that's a brave move...


They said they are a junior developer with not much experience. I'm afraid they may not know what is and isn't covered under NDA.


My tip would be: read what you sign.


Just to clarify, my company is under an NDA and not personally me. It also encompasses only the actual project details so a post like this is legally compliant. (Not a lawyer, might be wrong)


In every contract I've ever signed, part of the NDA clause with my employer is that I'm also bound by NDA's my employer is bound by, so if the employer signs an NDA with a customer, I would also be bound by that. It might be worth checking your contract, otherwise having a company sign an NDA doesn't hold much weight if their staff are free to go around sharing the information themselves.


So you're not under an NDA as you wrote.

I don't know your position, but I would assume an NDA is part of your freelancer or employee contract.


OP might at least want to consult with a contract lawyer in Italy to make sure.


You likely have a confidentiality clause in your contract.

If your company is under an NDA, your company will have an obligation to ensure that you also do not disclose information.

Companies are mostly just collections of people, and an NDA is mostly meant to stop people working on the project from talking about the project.


There's a thing called unit tests.


Just a note: being able to click yourself a server at Google, AWS, etc. might be cheap enough, even paying for 15 TB of traffic.


ZFS -> Snapshot....always!! Before touching writable-data (my personal mantra) ;)


I love ZFS too but that's not really relevant to this discussion because the deleted items were on a video hosting platform and the company did already have local copies.


Yes and? Make a snapshot on live. Again, never touch data before snapshot.


At risk of sounding snarky, you do understand how video hosting platforms work? Customers, even enterprise ones, don’t have shell access let alone control over what file system is used.

There are a hundred ways this problem could have been prevented but ZFS isn’t one of them.


>you do understand how video hosting platforms work?

No, no i don't.


This reminds of some IRC threads. You post a question and someone's answer assumes you are going to rip out and replace your existing prod setup just so you can use their pet tool.


Pff, there are thousands of systems and filesystems capable of making snapshots; even a shadow disk on VM/370 (1982) could be seen as one.


Controversial opinion: and this is why whitespace-delimited block syntax is not for production.


This is hardly a whitespace issue


Ah, yes, I just noticed the difference in indentation. In actuality, the error was about the mental model of variable state.



