As someone who is impacted, this is obviously immensely frustrating.
Worse, outside of "we have rebuilt functionality for over 35% of the users", I haven't seen any reports from the people who have ostensibly been recovered.
Next, their published RTO is 6 hours, so obviously they must have done something that completely demolished their ability to use their standard recovery methods: https://www.atlassian.com/trust/security/data-management
Finally, there have been some hints that this is related to the decommissioning of a plugin product Atlassian recently acquired (Insight asset management) which is only really useful to large organizations. I suspect that the "0.18% impacted" number is relative to ALL users of Atlassian, including free/limited accounts, and that the percentage of large/serious organizations who are impacted (and who would have a use for an asset management product) is much higher.
RTO is so hard to state properly, even with regular testing. If someone blows away a critical database, sure, you can meet your published RTO. But what if we lose 300 of our databases and need to copy snapshots from another region? AWS limits you to 20 concurrent snapshot copies cross-region. Which of those databases should you do first? Do you know the entire dependency graph for all 1000 of your services to make the right call? Meanwhile, tick-tock, your "6 hours" is slipping away. And what if someone nukes our entire AWS account with all of our prod resources? Databases, load balancers, S3 (no such thing as snapshotting there), EC2 instances, etc, etc.
Those last two examples are very unlikely, but no company is going to say RTO = "probably 6 hours, but it could be three weeks if we get ransomwared".
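A minimal sketch of what that cross-region constraint looks like in practice, assuming boto3 credentials, hypothetical region names and snapshot IDs, and the 20-copy limit quoted above; the prioritization of which databases go first is the genuinely hard part and is just a placeholder list here:

```python
# Sketch only: throttled cross-region EBS snapshot copies with boto3.
# Assumptions: credentials are configured, regions and snapshot IDs are
# hypothetical, and the cross-region copy limit is the 20 mentioned above.
import boto3
from concurrent.futures import ThreadPoolExecutor

SOURCE_REGION = "us-east-1"       # where the snapshots live (assumption)
DEST_REGION = "us-west-2"         # DR region (assumption)
MAX_CONCURRENT_COPIES = 20        # per-region concurrency limit quoted above

ec2 = boto3.client("ec2", region_name=DEST_REGION)

def copy_one(snapshot_id: str) -> str:
    """Start one cross-region copy and block until it completes."""
    resp = ec2.copy_snapshot(
        SourceRegion=SOURCE_REGION,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id}",
    )
    new_id = resp["SnapshotId"]
    # Large snapshots can outlive the default waiter settings, hence WaiterConfig.
    ec2.get_waiter("snapshot_completed").wait(
        SnapshotIds=[new_id], WaiterConfig={"Delay": 60, "MaxAttempts": 240}
    )
    return new_id

# The ordering of this list IS the dependency-graph problem described above;
# these IDs are placeholders.
prioritized_snapshots = ["snap-0123456789abcdef0", "snap-0fedcba9876543210"]

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_COPIES) as pool:
    copied = list(pool.map(copy_one, prioritized_snapshots))
    print(copied)
```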
For what you suggest, some combination of these things would have had to happen:
- Some employee has root access to the AWS account and uses it operationally
- Wildcard S3 permissions granted to an IAM user, including delete bucket
- Object versioning not enabled
- Cross-region replication not enabled
- No large-bucket protection
- No basic security monitoring or CloudTrail alerts set up
- No investment in full-fledged tooling for IDS and so on
If a vendor had any of these issues I don't think any customer would approve the software for use; these are not normal practices, let alone best practices.
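As a rough illustration, here is a minimal sketch of auditing two of the items in that list (object versioning and cross-region replication) with boto3; the bucket names are hypothetical, and CloudTrail alerts, IAM hygiene and IDS coverage would each need their own checks:

```python
# Sketch only: audit two of the items above (object versioning, cross-region
# replication) with boto3. Bucket names are hypothetical.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def audit_bucket(bucket: str) -> None:
    # Object versioning: GetBucketVersioning returns no Status if never enabled.
    if s3.get_bucket_versioning(Bucket=bucket).get("Status") != "Enabled":
        print(f"{bucket}: object versioning NOT enabled")
    # Cross-region replication: GetBucketReplication errors if none is configured.
    try:
        s3.get_bucket_replication(Bucket=bucket)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ReplicationConfigurationNotFoundError":
            print(f"{bucket}: no replication configured")
        else:
            raise

for name in ["prod-data-bucket", "prod-attachments-bucket"]:  # hypothetical
    audit_bucket(name)
```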
Large apps have detailed playbooks on what gets turned on, how, and in what order, and most do DR drills and time those runs periodically. These are well-established workflows in any large org.
Yes, in a real-world downtime you can't have planned for every scenario; maybe you miss the target by 25%, like 2 hours more, or maybe in a very bad situation you double or even triple it, say 12-18 hours. You don't go from 6 to 600+.
The way RTO is calculated starts by looking at limits on cloud / hardware / bandwidth / machine sizes; if basic limits like cross-region concurrency are not factored in, there is no point in computing an RTO at all. And even if something like that was missed, when you spend tens of millions of dollars on AWS, AWS will work with you and relax those limits.
Missing the plan by 100x means either extremely poor planning or that they screwed something up very, very badly.
I haven't ever seen it be as perfectly done as you've described. It is always shades of gray across teams and companies. They have most of what you described, but not uniformly across the company.
e.g. versioning may be enabled, but not cross region replication because it is cost prohibitive. Someone runs a job to clean up a bucket that includes deleting old versions. They point it at the wrong bucket or wrong path in the bucket. Or a malicious user does it on purpose. Monitors and alerts really tell you after the fact that you now have a major problem.
Also limits (like cross region concurrency) may not be known about until it is time to actually do a mass scale restore. DR tests might have been done but only in isolation of one app at a time. By the time you realize your mistake you're dealing with physics. Maybe AWS can bump it a bit to help you in that particular circumstance though.
No idea what happened at Atlassian. My only point is it is very hard to get it right without a huge amount of effort.
It is hard to get it perfectly right, yes; missing by small margins, or even doubling or tripling the declared time, would be reasonable if it were just a prediction problem.
However, overshooting by something like 100x is probably not because this is hard to get 100% accurate. It looks more like the rumoured data deletion and, more importantly, not actually having functioning backups that were ever tested, plus manually reconstructing from logs and other sources.
Beyond RTO, they are not going to be able to meet RPO objectives for affected customers either; depending on how much data is lost, that is going to be pretty bad.
> Missing the plan by 100x means either extremely poor planning or that they screwed something up very, very badly.
Like most airliner accidents, this is probably an unfortunate combination of both of those things happening at the same time. My guess would be they have fairly decent planning overall but there's one (or more) small-ish areas where their planning is extremely poor - which crossed over with a screwup in a very specific fashion that laser focussed on that particular piece of poor planning. The "this can never happen" immovable object and the "You can't do that" irresistible force.
Aviation has paid with so much blood, resulting in tight regulations and internal controls, to get to the point where any current accidents are unlikely edge-case scenarios. [1]
SaaS has a lot more tolerance for failure, so my money is on something simpler, but difficult to get implemented in a large org.
---
In an ideal world this incident should impact their revenue, growth and stock price substantially.
It is unlikely to do so, because of the stickiness of enterprise customers and the lack of better alternatives. Compare that to, say, Google, Facebook or Amazon, where a minute of downtime is an immediate, quantifiable revenue loss, which is why FAANG obsess so much over how many 9s of uptime they have.
The typical management of enterprise app companies like Atlassian has no incentive to do anything beyond cursory lip service, and gets away with underinvesting in tech.
---
[1] Three years back I would have stood by that, but after the Boeing 737 MAX twin disasters and the systemic problems leading to them, I am not so sure those lessons have not been forgotten.
We are one of the customers that has been restored; access to our systems came back about 24 hours ago. We have some lingering but relatively minor issues with some plugins and the like, but there doesn't appear to be any data loss and performance is healthy.
I bet they screwed up royally, deleted some data and are down to either rebuilding it from logs, caches or other side-effects, or using data recovery software on the storage drives (which might involve third-party companies). I can't see many other reasons why this should take 2 weeks.
A few years ago I depended on a Vertica (Snowflake/ClickHouse-type) database.
When a node went down there was no hope of it ever coming back up unless you shut the other nodes down as well. While this was going on, of course, none of our ingress data was being inserted, so it built up a queue. When we turned things back on, the queue would overload Vertica again and we had to repeat the whole thing.
Fortunately for us we only stored analytics-type data on Vertica, where customers usually were only interested in the last few hours anyway. So we ended up deleting all historical data and just reprocessing it over months, occasionally prioritizing customers that complained.
I've actually done exactly that many years ago for a self-hosted Jira installation that didn't have any backups. You can bet we had backups with regular testing after that.
Honestly it doesn't sound too difficult, and it sounds like a fun scripting challenge. To me, anyway. If you ever find yourself in that situation, my email is in my profile and we'll work something out :)
In my limited experience these diffs can be missing information. I recently had to reconstruct an issue description from these email diffs after two people were editing the description at the same time, and it was not 100% accurate; several lines were missing. Going to the 'history' tab on the issue I was able to get the missing lines, however. If all you have are the emails, though, you might be out of luck.
The idea is to not have the backups stored on the same hardware or even the same type of hardware. Same hardware is obvious, but same type of hardware is listed because a manufacturing defect or a known vulnerability would put all of your backups at risk. So you want to have backups stored on 2 disparate types of storage media: HDD and tape, or cloud, etc...
Exactly. If tape was easier for me to implement in my setup, I would do it, but I'd rather just use cloud with a fast fiber connection for now. Offsite I can send data on an external drive to another physical location if needed.
I made efforts to buy HDDs from different sellers even, to avoid sequential failures from singular bad batches. That's something else I'd want to add to a "3-2-1", with regards to HDD as a form of backup or storage media.
For me it was always 2 different formats (DB dump, VM dump) because I trust (good) storage media more than backup software. For example, old Veeam backups cannot be restored with a new version, old Veeam software doesn't run on new ESXi, etc...
Yes seems I used the wrong word here. Indeed I meant media, rather than format, as personally for my backup setups it's different media I want to trust, rather than the actual backup file formats (which are easily interchangeable depending on what's used; read and write data to and from formats if needed).
My 3-2-1 comes from a personal non-professional standpoint, thus not having the extra 1-0. However I have been considering immutable offline backups, using burned DVDs or Blu-Ray discs. That's another project for another time though, for now I'm trusting paid cloud providers.
As for verification tests, hashsums are a simple solution in my opinion, but I've moved to ZFS and BTRFS to avoid having blips.
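For anyone without ZFS/BTRFS scrubbing, a minimal sketch of the hashsum approach, assuming a hypothetical backup directory layout:

```python
# Sketch only: record a SHA-256 per backup file at write time, re-verify later.
# The directory layout and file pattern are assumptions; ZFS/BTRFS scrubs do
# the equivalent at the block level.
import hashlib
import json
from pathlib import Path

BACKUP_DIR = Path("/mnt/backups")           # hypothetical
MANIFEST = BACKUP_DIR / "manifest.json"

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest() -> None:
    MANIFEST.write_text(json.dumps(
        {p.name: sha256(p) for p in BACKUP_DIR.glob("*.tar.gz")}, indent=2))

def verify() -> None:
    for name, expected in json.loads(MANIFEST.read_text()).items():
        if sha256(BACKUP_DIR / name) != expected:
            print(f"CORRUPT: {name}")
```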
As the person responsible for running Jira and Confluence on premises at my employer, I'm looking forward to the next time one of their sales droids contacts me to make us move to their cloud services (despite me stating multiple times that we are not interested)…
It's insane. Since they've all but killed off on-premise Jira and Confluence, they've been spamming me regularly trying to convince me to "upgrade" to the cloud. Eventually I gave in and replied to them asking two simple questions.
One is that we've had many bad reports from partners that Jira Cloud is incredibly slow, even when compared to the already underperforming Jira Server and I wonder what their performance guarantees are. The other one is that it's so, so pricey.
It helped, but not in the way I thought it would! There's been no reply, but also no more spam emails now!
I haven’t used Jira Cloud in any great depth, but I did play around with it as part of trialling the free plan and was amused that you could quite easily come across warnings to backup your installation and consult your system administrator before proceeding…how exactly do I do that for a cloud service? Doesn’t exactly inspire confidence.
At my last job we used Bitbucket Cloud and that was awful. Dog slow, ridiculously low threshold for being unable to render diffs, and constant incidents. We used to joke that they could make the “Bitbucket is experiencing an incident” banner a permanent fixture on the page and it would be right more often than it was wrong.
We’re still using on-prem Jira at my current job, but we just migrated away from on-prem Bitbucket to GitHub, as Bitbucket was becoming infeasible and the cloud offering is a bad joke.
But slowness isn't really the problem, the problem is that it's unpredictable.
I wait for the interface to be fully loaded, so I click on a text box and I start typing. Then FU--ING something takes the focus to some other element in the web page, and now I'm typing random shortcuts (like reassigning tickets, changing status or whatever).
It's painfully slow but the real problem is that it's unpredictable in its behaviour.
>now i'm typing random shortcuts (like reassigning tickets, changing status or whatever).
This happens to me often and it's absolutely infuriating. I'd prefer a blocking spinning wheel of death over that nonsense. It's all but ensured that I'll be looking elsewhere when choosing project tracking software in the future.
Well you’re just a user, what do you know anyhow? It’s not like paying money for a service entitles you to be able to have something which works and delivers what you paid for. In fact, if you look at the EULA I’m positive that it states you’re paying to access the almighty godlike code of Jira.
Also if you don’t like it, simply construct your own industry standard and train your users on it, maintain it, and keep the costs down!
I think software companies need to have a serious “Come to Jesus” talk with their users about who needs to control what.
My employer recently switched from on-prem to cloud. The cloud service is insanely slow, or maybe it's my aging Macbook, but every single component on the page seems to have to load separately. (It's a newer UI versus what we had on-prem).
Thankfully we haven't been impacted by this outage.
I've also heard that the cloud service is slow; you can easily check if it's your machine or the server by watching devtools -> network tab and seeing how many requests are `(waiting)`, because chances are it's Atlassian's server speed.
There’s nothing that makes me happier than the fearsome squealing noises that enterprise sales drones make when you drop the sales equivalent of a Paveway IV on their pitch.
My favourite one was running some software supply chain compliance software on itself and explaining how it was constructed on top of a CVE riddled garbage dump.
My favorite experience with a sales rep (not Atlassian related at all) was when I went to the vendor's booth at NAB (huge industry convention for those not familiar). I saw our sales rep who quickly looked down to catch her breath before greeting me. I smiled and let her know that today was her lucky day if she could just introduce me to the tech team she promised I could meet. At one point in my conversation with the tech team, I noticed that a small crowd had gathered around my conversation. I was not intending to hold public court, but I was not going to miss a chance to talk directly in person to these team members. To a non-tech person, it might have been viewed as confrontational. To a tech person to tech person, it was just direct questions being held to the fire for an actual answer vs CSR/Sales rep platitudes.
We will probably bite the sour apple (not sure if this is correct English but you’ll get the meaning even if it’s not) and switch to the data center edition which is still on prem but costs approximately twice as much.
A bit from the Boom Chicago show I saw in Amsterdam around 2004: “The Dutch expression ‘bite the sour apple’ means the same thing as ‘bite the bullet’ - Americans are obsessed with guns, and the Netherlands is full of shitty food.”
> Americans are obsessed with guns, and the Netherlands is full of shitty food.
"Bite the Bullet" is a phrase from an Englishman, Rudyard Kipling, in his first novel, "The Light That Failed". It's believed to have come from the other English idioms, "to bite the cartridge" and "chew a bullet", which date back to 1891 and at least 1796. [1]
Weird... As another commenter pointed out, the origin of 'bite the bullet' is not American at all. And 'biting the sour apple' doesn't imply an affinity for bad food (Apples are shitty? Really?). It's a metaphor about context necessitating action, not a celebration of consuming unpleasant foods.
You can probably use this 1 week (3 scheduled) outage to ask for a discount, "your cloud offering is a bucket of shit, your data center edition is too expensive, my higher-ups told me to find something else...".
You can still buy the Data Center edition, and some of us are forced to do so if we wish to continue to use Jira and Confluence. For us it's not a problem; we're heavily invested in the Atlassian suite of products, and sell consulting, so we get a significant discount. For some of our clients it's a massive problem, as Atlassian cannot promise them that data won't leave the country. In fact, it's certain that it will, because there's no AWS datacenter within our borders.
Atlassian completely ignored the large number of smaller customers who are legally forced to use an on-premise solution. With the software industry so hell-bent on SaaS, there would be a great business opportunity in creating an on-premise Jira competitor.
It is not a "contact us" pricing plan. The Data Center prices are publicly available on their website. It is more than twice as expensive as the Server offering, but still substantially cheaper than Cloud for the same number of users.
And it is the exact same software as Server with some extras enabled like support for multiple nodes, so upgrading to it is as simple as pasting in a new product key.
You never would have converted anyway though, right? Black swan events are pretty weak justifications for any decision.
In other words, this outage is not -because- it's cloud software. It's because someone, somewhere, broke something fundamental. That can (and does) happen on-prem at a much higher rate.
* It's been deleted for a week already, they estimate they might need two more weeks. Three in total.
* They claim to have "extensive backups", and hundreds of engineers working on it.
What? How? This simply doesn't go together. Why would restoring from backup take three weeks?
Either their backups aren't complete, or they need new software written for the restore, or something else doesn't add up.
I haven't administered their software yet, but from what I've learned from the sidelines, at least Jira doesn't seem to be rocket science. A database, an application server (maybe a few instances for larger sites), a bit of config, some caches. This really shouldn't take three weeks to restore.
If you permanently deleted data for selected customers from a large multitenant system, it could actually take some time to restore it - even with proper backups.
You can't just do a full recovery, as that would mess things up for the customers who were not affected (it likely takes time to notice the mistake - others have continued to use the system). You might need to write some tools to migrate the data from backups. You also really need to test everything very carefully - otherwise you might be in even deeper trouble (looking at corrupted instead of lost data).
In a large organization this kind of "manual" recovery might require people from multiple teams, as no single person knows all the areas. This adds overhead. Throwing too many people in does not help either. When you start thinking about it, a few weeks is not that long.
And JIRA is definitely not simple. It's a complicated beast, and the SaaS features combined with all the legacy likely make it even more complicated.
What a nightmare situation. Makes you wonder if some kind of 1 database per customer setup would be preferable here since you could restore only affected customers.
I was in a similar situation many years ago (different ticket software though). What we did was spin up a spare server with the backup data and script an extraction/injection tool to populate the production multitenant SaaS.
How would that require hundreds of engineers though? Would one not build a script and then just run it for each customer, or have each department build a script? I honestly have no idea; I've never been in that sort of recovery situation before.
Restore from off-site tape backup. The kind of service where you ship them a ~dozen new tapes in a lockbox each week and they ship you the oldest dozen back. It's supposed to be the "if all of your data centers happen to burn to ashes simultaneously" option. If you say "give us all of our tapes, asap" and then have some poor souls swapping them out as fast as the data can be read... it would probably take a few weeks.
In support of your point, 360MB/s is an extremely conservative estimate. I'd expect that from LTO-6, which is around ten years old, and I would certainly hope their backups are on more modern gear than that.
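For a sense of scale, a back-of-envelope calculation using that 360 MB/s figure; the data volume and drive count are pure assumptions, not Atlassian numbers:

```python
# Back-of-envelope only: how long pure tape reading takes at the 360 MB/s
# figure above. Data volume and drive count are assumptions.
TAPE_READ_MBPS = 360      # per-drive throughput quoted above
DATA_TB = 500             # hypothetical amount to restore
PARALLEL_DRIVES = 4       # hypothetical number of drives reading at once

seconds = (DATA_TB * 1_000_000) / (TAPE_READ_MBPS * PARALLEL_DRIVES)
print(f"{seconds / 86_400:.1f} days of streaming")
# ~4 days at these numbers, before tape shipping, mounting/swapping,
# and any per-tenant re-import work on top.
```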
I work for a pretty huge well-known fortune 500 company with a global presence and personally handle the tape rotations for several of our data centers, and they're all still on LTO-6 tapes
I'm not sure that's relevant. I would expect Home Depot, Berkshire Hathaway, Alphabet, and Atlassian to have wildly different priorities. Two of them are Fortune 500 companies that I think would do fine with LTO-6. Atlassian is supposedly a cloud-first organization which arguably should be pretty aggressively tiered anyway, much less tiered onto outdated tech.
My theory in my other comment is that they've deleted some data and are waiting on third-party data recovery specialists. That would explain the timescale.
Jira cloud became much more complex when they went all in on AWS. The reason for the cloud/server fork 4 or 5 years ago was so that cloud engineers could couple to a zillion AWS services without having to build back any backwards compatibility. So data stores are very much more disparate than just a PgSQL DB and redis (which is how it used to be).
Something that can make restore-from-backups harder, and that I've seen happen, is when the backup/restore systems themselves get destroyed by the same black swan event. Then you have to first recover those by doing fresh installs, and you have to have all the people on hand who know what the configurations would have been to be able to then use the backup library. Then you have to begin restoring a few target systems to check that everything is OK with the restore process, then you have to restore everything though you'll be limited by the restore system's bandwidth.
How could this happen? Well, a disgruntled employee could make it happen. It happened at Paine Webber in 2002 [0]. In that case the attacker left a time bomb in the boot process on all systems they could reach, and that included the backup/restore servers. Worse, the time bomb was in the backups themselves, so restored systems ate themselves as soon as they were booted, which slowed down the recovery process.
I remember a situation where we had a near miss with data loss (replica failed and master had a bad disk). We didn't want to put the production database under extra load by taking a live backup while it was handling all production traffic, so we restored a backup. But it was "bad". Tried the one before it, and the one before that. Apparently they were busted for over a month due to a config change. We restored a month-old backup and started applying binlogs (which thankfully we had been backing up). But that meant replaying a month of transactions into the restored database. I can't remember the details but I think we ended up replacing the bad disk, resilvering the array and live-cloning the primary before the binlogs got fully applied to the one we restored from the old backup.
Went through that once in the mid- late 90’s. Each restore and test took hours, so 3-4 attempts took 2 days - with me sleeping, crying and praying in a conference room.
My guess is they failed halfway through a major schema or api migration. If some of the services have already progressed too far, then rolling back another service to previous backup snapshot will make the two incompatible. Especially if one of the services is global and the other is per customer.
The only way out is to figure out the bugs and continue migrating forward, fixing issues as they appear one by one.
Reminder: never delete data for real as your first step. Always mark it deleted, along with a timestamp saying when. Then you can hide deleted items from everything. When a maintenance script goes haywire, you can fix the problem quickly. Have a daily job that really deletes records marked deleted after 30 days.
If that is too complicated to retrofit then have any mass cleanup script move the records to a CSV file or temporary table.
Never ever ever be in a situation where a rogue script or bad SQL WHERE clause means restoring from backups.
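A minimal sketch of that pattern (soft delete now, hard delete after a grace period), using sqlite3 so it runs standalone; table and column names are hypothetical:

```python
# Sketch only: soft delete now, hard delete after a grace period.
import sqlite3

db = sqlite3.connect("app.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id INTEGER PRIMARY KEY,
        payload TEXT,
        deleted_at TEXT              -- NULL means the row is live
    )
""")

# "Delete" = stamp the row; every read filters on deleted_at IS NULL.
db.execute("UPDATE items SET deleted_at = datetime('now') WHERE id = ?", (42,))

# Daily job: only now is anything really gone, after the 30-day grace period.
db.execute("DELETE FROM items WHERE deleted_at < datetime('now', '-30 days')")
db.commit()
```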
I agree that data must never be "deleted and forever gone" unless you've already been very sure about it a few times.
But I would like to warn people about certain implementations of database "soft deletes" that I'm not a fan of. To be clear, I'm talking about the idea of having a "deleted" and/or a "date_deleted" column and using those columns in the WHERE clause to filter out rows that shouldn't be visible.
That pattern complicates the table structure, queries, and indexes. It increases table and index size, thus more data has to be sifted through (either table data or index data) to ensure only non-deleted entries are returned. More data to go through means slower queries. It's also really easy for people to write SQL that accidentally leaves the "deleted" column out of the WHERE clause. Then old, irrelevant data is being returned.
Accidentally deleting data that needs to be undeleted is usually rare so I don't think people should optimize for it. We should optimize for things that happen frequently.
I have dealt with the rare "Oops! I deleted important data!" by restoring from backups and it has worked fine. I think it may be too strong to say you should never be in a position to restore data from a backup. In fact, I think it's important to streamline the restore process.
For cases where we know ahead of time that we want to query deleted data I'll move deleted data to another database table that exists solely for maintaining history. For example, an ORDER table will have a DELETED_ORDER table, or an ORDER_HISTORY table. The HISTORY tables can also record data overwritten from updates.
These tables take up disk space, but never affect the structure or size of the original table and its indexes. Queries to the original table don't need to be modified to account for soft deletes.
To guarantee that things go to the delete/history tables, I'll usually put a trigger on the original table to move data over to the history tables. This way no application-specific code is needed.
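A minimal sketch of that trigger-plus-history-table approach, again using sqlite3 so it runs standalone; table names are hypothetical:

```python
# Sketch only: history table populated by a trigger, so the live table stays
# lean and no application code is involved.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);

    CREATE TABLE deleted_orders (
        id INTEGER, customer TEXT, total REAL,
        deleted_at TEXT DEFAULT (datetime('now'))
    );

    -- The database moves the row aside on delete; queries against "orders"
    -- never need a deleted flag.
    CREATE TRIGGER orders_on_delete AFTER DELETE ON orders
    BEGIN
        INSERT INTO deleted_orders (id, customer, total)
        VALUES (old.id, old.customer, old.total);
    END;
""")

db.execute("INSERT INTO orders VALUES (1, 'acme', 99.0)")
db.execute("DELETE FROM orders WHERE id = 1")
print(db.execute("SELECT * FROM deleted_orders").fetchall())
```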
> Accidentally deleting data that needs to be undeleted is usually rare so I don't think people should optimize for it.
That's very use case dependent.
We've made it easy for people to undelete data they've accidentally deleted simply because they used to do it so often and the only people who could get it back were our tech team. We're a devops org so part of our job is of course to support the systems we build, but our time is better spent on building solutions to business problems than to repeatedly providing support for issues that come up all the time. Part of building those systems is of course engineering in solutions that make it hard to screw up, and easy to unscrew when things inevitably do go wrong. No mean feat given our platform dates back over 15 years and still includes a lot of legacy from the time when tech was just a couple of people.
I suppose the object lesson here is that edge cases in one system or company can be part of core business in another so it's best not to make too many assumptions.
Agreed, soft deleting adds so much complexity to everything. And even has the potential for privacy related bugs. Like, say, accidentally forgetting to respect the deleted column in a query on a joining table that determines user permissions for some resource. Now people have access to something they had permissions revoked for. Whoops.
>I have dealt with the rare "Oops! I deleted important data!" by restoring from backups and it has worked fine.
Usually this goes along with "Oh and the other team did some important work at the same time" so you can't just restore a backup. You either tell them to deal with it or start writing custom scripts to copy out only the data you want to restore.
A more sane solution would be soft delete for x days and after that it becomes a real delete.
I guess one could make a view for each table that always includes the where deleted = false, to not bother about it in application code. Still yes, it adds complexity.
A "deleted" field type deletion is also how you get a massive fine from a GDPR agency when they find out that you're not actually deleting PII properly.
One smaller social media app I used to sysad over actually overwrote data to be purged as xX0-Deleted-oXx and similar (there were a few variants depending on data constraints). There was no "show_deleted_when" garbage.
Then weekly, a task went in and then purged rows with those non-content placeholders to completely purge that user, if a user-purge was requested.
If you're using MS SQL Server it natively supports temporal tables, which take care of this problem for you. There's also an extension for Postgres, and of course triggers.
You don't even need a bug. Just a wrong system clock.
We had a few windows laptops where something caused them to time travel to 8000 years in the future. Then, they'd slowly spend a few hours deleting every local profile, as nobody had logged in to them for 8000 years. Then, they'd do something to their time zone database and travel back 8000 years.
When they started the process, it was unstoppable. Trying to modify the system clock to something sane just caused them to depart to the future again, even if disconnected from the network. None of our users was very amused by this behavior, even if everything important was backed up.
It happened specifically to 1 type of laptop and we only had about 30 of them. So we pulled all of them out of rotation. Then covid struck, so I reformatted most of them with Debian and we gave them away for home schooling. I wonder if I managed to linuxify some kid in the process.
Yeah, the idea is that by expecting the deletion logic you can make it simpler and more rigorously tested than regularly changing business logic or application code.
If you organizationally cannot prioritize quality then nothing can help you.
I fucked up once and lost 2 hours of customer data. I was so lucky we were a small startup and had daily backups AND the backup was only 2 hours old. I would have been royally fucked. Never making that mistake again.
Always use a copy of prod on a staging server and run your queries there for testing.
It's incredibly depressing how common this is in the real world, but I can tell you from experience that this NEVER happens in Atlassian production systems.
Another tip: never enter queries directly in a production database connection with write access in the first place. (Ideally very few people even have that level of access.) Write it in your codebase, write tests for it, get it code reviewed, and run it in a dry run first and get a list of affected records before running it for real.
For anyone doing this, just keep in mind that chucking on a LIMIT 1 can give you a false sense of security. For example, say you want to drop a single row but forget the WHERE. A LIMIT 1 will return "yep, deleted one row" but it's not the one you wanted (instead, it's whatever row came up first). Better to do the operation in a transaction that you can rollback - that way, you can better validate the results of your operation before committing.
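A minimal sketch of that habit (run the destructive statement, check the affected row count, and only then commit), using sqlite3 with a hypothetical table:

```python
# Sketch only: run the destructive statement, check how many rows it touched,
# and only commit when that matches the intent.
import sqlite3

db = sqlite3.connect("app.db")
db.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, payload TEXT)")

EXPECTED_ROWS = 1
cur = db.execute("DELETE FROM items WHERE id = ?", (42,))

if cur.rowcount == EXPECTED_ROWS:
    db.commit()
else:
    print(f"touched {cur.rowcount} rows, expected {EXPECTED_ROWS}; rolling back")
    db.rollback()
```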
This is also true for people cleaning house or going minimal. Put it in a well labeled box and tuck it away somewhere. If it matters then you'll fetch it, otherwise just chuck it when the box gets in the way.
What's wrong with restoring from backups? This is one reason they exist after all. I don't think that making a mistake in delete statement is something you would do every week.
Because you lose all the work done since the issue happened. It's very rarely acceptable to just give up and do a full backup restore unless literally everything is gone. If there was just some bug that caused partial issues, you have to find some other way to fix it.
Could be that they are restoring from backups - just that restoring from those backups is very, very slow. Atlassian would not be the first where resourcing and testing a speedy disaster recovery strategy wasn't given the highest engineering priority.
> Never ever ever be in a situation where a rogue script or bad SQL WHERE clause means restoring from backups.
As a second step, restore from the backup at a set frequency. This would force orgs to automate and optimize not just the backup flow but also the restore flow. Tear down and restore entire systems from backups. Of course, doing so adds enormously to the cost, but when there's an outage, it will pay for itself.
You have a certain period of time, not ‘instantly’ (depending on the exact situation you are referring to). The script could take that into account (using a shorter period of time or the like).
What is hard deletion? You can restore rows from database files before vacuum runs. You can often restore data from disk sectors. Some people say SSD can remap sectors under your chair and you won't even know that your deleted data is there.
The law isn't a technical specification. You have to follow the spirit of the law. A soft deleted_at timestamp wouldn't be following the law in good faith. Having some data stuck in an unmapped section of an ssd would be within the spirit.
IANAL, but IMHO a soft 'deleted_at' timestamp along with a daily cron job that hard deletes everything with a deleted_at older than 24 hours would fall within the spirit of the law.
I agree that just having a deleted_at timestamp where old entries are never pruned would not be a good-faith interpretation of the law.
From what I have seen, there is no requirement for instant deletes. Even emailing a support address and having them manually delete the data is acceptable. Most places using deleted_at never clean up the data from what I have seen though.
> This data was from a deprecated service that had been moved into the core datastore of our products.
That is very interesting. This implies they are backing off, at least somewhat, from their very aggressive microservice strategy. Perhaps they feel like they have gone too far in decomposing their products.
I think sites are their account management stuff. Sounds like they deleted user accounts and not actual data. Notice that only native products are down and not acquisitions. They probably just haven’t migrated those yet.
We're mere weeks away from migrating to their cloud platform after the self-hosted rug pull. This really doesn't give me confidence in their ability to not break my stuff.
1) This outage will get their organization to prioritize work such that it never happens again.
2) This outage is representative of a dysfunctional organization that can't prioritize work correctly.
If you've been using Atlassian software for a while and are used to how they prioritize tickets then one of those options seems far more likely than the other.
I can tell you #1 never happens. It will be a temporary effect of the management green lighting the years of neglected maintenance work until everybody forgets about it and it will go back to business as usual until the next incident happens and the cycle repeats.
I will say that back when Bugzilla was it, JIRA rocked. It was amazing the new power you had and the functionality it provided.
We self-hosted JIRA from 2008ish to 2014ish (memory is fading on exact dates). By the time we decided to stop using JIRA, we fracking hated JIRA and would never return.
Since then, GitHub Issues, Trello and Clubhouse (now Shortcut) all provide less friction in day-to-day use. As an enterprise, I do believe Shortcut is your best bet.
To be quite honest, the problem is historical. We have over a decade of project plans, support tickets, change control logs, etc in our Jira instance. There's simply no painless way to export that into another product that will have approximately the same functionality and features. There's a few that come close, but all fall short of a drop-in replacement.
The only options now are the $$$$$ "datacenter" license, migrating to the dangerously unstable cloud, or not doing anything and running unsupported EOL software.
I totally understand that. But at the same time, a multi-week outage is really a sign of an org that simply does not have their shit together at all.
But the lack of transparency is the worst. Another post speculated that Atlassian has lost data, doesn't even have backups, and is re-creating it by munging their emails and diffing them to re-create history. I can't really imagine that's true - but what if it is, and Atlassian is concealing things?
> the $$$$$ "datacenter" license, migrating to the dangerously unstable cloud, or not doing anything and running unsupported EOL software
Data Center is pricier than Server, but isn't it still cheaper than Cloud? And you control your own back-ups as with Server, so Atlassian cannot lose your data.
My company is in the middle of a multi year transition from selfhosting atlassian products to using their cloud offerings, and I am sure the infrastructure team/management is very thrilled to see this news.
While our tenant was unaffected, I told my management of this issue. They just shrugged and said we could watch and eat popcorn. I was halfway expecting them to raise eyebrows.
I was kinda like....."not really the point of bringing it up."
Its worth noting we have had them just delete things within our account before. In fact one of our Senior VP's had their account just....disappear one day. We couldn't @ them in chats, tickets etc. Atlassian just shrugged and "restored" the account and said it was some issue with a stored proc on their backend or something.
I have always felt uneasy about how flippant they are in their processes. But it seems that is not shared.
I use JIRA and confluence every single day, I have to, it is everywhere - but imo it is such a horrific toolset in every way (even before this outage), I can't for the life of me figure out how it got so much market-share.
Does it have a good API for extending it with what's missing compared to the Atlassian/Microsoft/JetBrains offerings, i.e. the tight connection between issue tracking and other aspects such as builds, deployments, etc.? E.g. how pull requests are related to work items, or which work items end up in a specific build/deployment?
It launched in 2002 - well before my time but I'm guessing that for the first few years of its life there weren't many competitors.
Now it has endless competitors and I'm led to believe that it has accumulated lots of features that businesses can't live without but which the average end user never touches.
Reminds me of the time a group I worked for at a National Lab decided to call the root folder for their project “core”. I bet you can’t guess what filename the backup scripts were configured to ignore…
This is an expensive lesson I think everyone gets to learn at some point. There's no such thing as a file worth excluding from a backup, it always fucks you, some (like me) more than others. Have to buy twice as much disk? Fine, at least you know you actually have a backup
The "Edifice Complex" has been taken to be a bad portent for a company.
Obviously not universal, but if upper management has decided to spend a lot of time focusing on a marquee building, they're not focusing on the business itself.
I wish companies would stop with this "small number of customers" messaging. It always seems disingenuous and, besides, that matters for your internal estimation of business impact but means absolutely nothing to the customers affected.
I think they add that because otherwise you read about the problem, panic, and then spend hours digging through your own data to make sure it's all there. Unaffected customers like being told they're not affected.
Wow! I didn’t realize the scope and duration of this outage. This must be doing some serious damage to some of their clients (catastrophic if this does impact JIRA, Confluence, and OpsGenie broadly on a company level). Is there any report of approximately how many (or specific) companies have been affected as a result of this?
Atlassian does not care about individual customers. They are purely driven by numbers. I listened to a presentation by one of their founders a long time ago where he admitted that statistics and number management was part of their DNA. They don’t think about the customer’s name, maybe this is right, maybe not.
Meanwhile, people have been waiting since at least 2013 for Atlassian to deliver a way to automate backups for their Cloud offerings: https://jira.atlassian.com/browse/CLOUD-6498
What are the legal implications of such downtime for Atlassian? I could imagine thousands of companies being unable to manage employees, product releases, bug fixes, rollbacks and more because of this.
Self-hosted full Atlassian stack (jira, confluence, bamboo, bitbucket) for 6+ years. ~200k tickets. To be honest, it just runs with minimal issues.
Mostly downtime is just upgrades. I can remember a few times we've had to add (JVM) memory as our usage increased. Not sure what we're going to do with the discontinuation of server product line. We self-host to keep source code, etc. more than one configuration mistake (or zero-day) away from exposing it to the world.
We are in the same boat. We are looking at the cloud, but the migration tools are just a pain. I can migrate a project to the cloud from Jira, and it even gives me a report of the workflow transitions I have to manually update/change to fix, which is great.
But then, there is no way to keep it in sync. I have to blow that project away in Jira Cloud and migrate it again.
So I have to hard-cut over projects, on a system that has dozens and dozens of projects, and somehow have people figure out which ones are where. Or one really, really ugly night to cut it all over, and hope it goes well.
I'm looking for alternatives, but our team is so invested in some very, very customized workflows, it's going to be a pain.
Is there any way to migrate away from Jira, or is it full lock-in? I mean, is there any migration tool or service to Redmine, Gitlab, MantisBT, Trac or whatever?
I think one reason Atlassian was successful is that they always invested a lot of effort in building tools to migrate to their products from any of their competitors (obviously not the other way around).
Self-hosting Jira is running a big Java application that has its own directory structure and talks to a Postgres database. It really isn't hard. Even upgrades are automated using shell scripts, although you do need to manually replace some configuration files afterwards for some damn reason. Configuring Jira is complicated, but that's the same regardless of whether you self-host.
Self-hosted Jira sucked because it was shitty software, not because self-hosting has to suck. Mattermost and Jitsi are easy to self-host, to give two examples that are not complete shit.
Maybe. But you're counting on your sysadmin(s), who are also managing dozens of other things, to keep up to speed on Jira and its quirks, and apply patches and new versions as they become available without missing any steps or screwing something up.
On average, you're still probably better off having a company that knows the product also host it for you, but obviously they can make mistakes too, and the downside is that when they do it might affect all clients, not just one.
> and the downside is that when they do it might affect all clients, not just one.
This is also potentially an upside. For example when us-east-1 went down recently, customers were somewhat understanding because it was "amazon's fault" and everyone was down - it was in the news, etc. If we ran our own data center and that went down, our customers would've just said "why did you morons roll your own data center instead of just using aws?"
I worked at a company that self hosted Jira, and it was miserable then, I can’t imagine depending on the cloud. I’ll never approve Atlassian products after that experience.
We're using the cloud version and we're fine too (no outage). What's your point? Are you claiming that self-hosted is never down? Or that self-hosted is more reliable? Because I doubt that. Difference is just that when self-hosted goes down, it doesn't end up in the news.
I think there is a strange psychological trait that I have, and others may as well, where I am much more forgiving breaking my own stuff than having someone else do it.
When you host Jira yourself, the monthly subscription pays for Software + Software updates. When Atlassian hosts it for you, you're paying for Software + Software Updates + Service (hosting). When you hosted yourself, a team gets blasted for not monitoring it or updating it correctly. When Atlassian fails to do it (and charges for it) then they get the heat.
All that to say, I don't think it's a weird phenomenon, it's just you're realizing that you're paying someone else for something that's not delivered on.
People are more forgiving towards themselves and their own folks; I can understand that. I'm just thinking it's important to make decisions based on facts. Some self-hosters walk around with this "my own basement is safer than Amazon datacenters" attitude, and that's just not true (in most cases, I guess :D).
No. The basic assumption is only, that if I pay somebody money for the service and to keep things running, then sudden data loss is unacceptable (and a multi week downtime even more so).
Some companies cannot operate effectively without atlassian products, so a fuckup of that scale might just have legal consequences depending on whom it hits.
One difference might be that when you self-host, you are more sensitive of some of the risks, whereas a hosting service might be balancing that risk with a need to scale, and their appetite for risk and tradeoff considerations might be different from yours. It might be that these companies know something that their customers don't, and thus are more willing to take on risks.
In this case, it seems like the company took a risk and it did not go well. The possibility of being able to restore from backups might have been factored into this risk, but the latency of doing so might not have been.
It sounds like Atlassian is doing individual restores, each restore takes a fair amount of time, and they don't have the capacity to do all 400 simultaneously (because why would they). So you just have to wait.
If you're self-hosted, you dedicate as many people as possible/necessary to restoring your service, and it becomes their top priority.
You also have a lot more insight into the detailed inner workings of the restore, making it easier to plan against, instead of just vague "we're working on it" messages for days at a time with no clear end in sight.
If my company were part of the outage, the restoration part would only take 2 days (if we're counting all the data they have, not just Jira/Confluence), not 3 weeks.
So self-hosting may still have certain upsides, even with such an outage.
I'd disagree with this lesson. Saying "do not utilize cloud solutions" period is nuts. Google and Microsoft are way better at email hosting and delivery than your on-prem server is unless you spend a ton more money on hardware and engineers to keep it up, which is simply not worth it for many companies. Dropbox is going to have better uptime and lower TCO than your self-hosted owncloud instance.
What I will say is it's important for the customer to HAVE THEIR OWN BACKUPS. Don't rely on the vendor - that's the lesson here. If you have all your stuff in AWS back that data up someplace that's not AWS, etc.
On prem outages and data loss happens constantly. Much more than cloud hosted issues. They just affect a smaller group each time. It's like how people view the countless car crash deaths as non issue but freak out over a rare train crash.
Having your own backup works only if there are common open standards for export/import. Email or storage may have those; project management tools don't. I can't simply back up from Jira and start using Pivotal.
Even for Email or storage or any other open system, UX changes and feature differences can take a lot of time to train properly, you don't migrate from one vendor to another vendor just like that.
Edit: to clarify I would say if the data is important to you, then “the ability to back up the data” should be a requirement when selecting saas. See my other comment in this thread on ms planner.
I am very surprised that there aren't more people/companies announcing "offsite backup capabilities" for JIRA or Confluence, etc.. I did just search for scripts that can do this, I'd probably pay a few bucks for something like that at this point.
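In the meantime, a crude do-it-yourself export is possible against the Jira Cloud REST search API; this is only a sketch with a hypothetical site URL and credentials, and a real backup would also need attachments, Confluence spaces, and project/workflow configuration:

```python
# Sketch only: page issues out of a Jira Cloud site via the REST search API
# and dump them to JSON as a crude offsite export. Site URL, credentials and
# JQL are assumptions.
import json
import requests

SITE = "https://your-site.atlassian.net"      # hypothetical
AUTH = ("you@example.com", "api-token")       # Atlassian account + API token

def export_issues(jql: str = "order by created asc") -> list:
    issues, start = [], 0
    while True:
        resp = requests.get(
            f"{SITE}/rest/api/2/search",
            params={"jql": jql, "startAt": start, "maxResults": 100},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json()
        issues.extend(page["issues"])
        start += len(page["issues"])
        if start >= page["total"] or not page["issues"]:
            return issues

with open("jira-export.json", "w") as f:
    json.dump(export_issues(), f)
```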
If technical competency had any bearing on stock prices they should've been at 0 since long ago. Their stock price is tied to the amount of clueless/shitty companies that will still cling onto their products regardless of what happens, and I don't think this incident is going to change much.
Since Trello is part of Atlassian as well - what are good, reliable and above all lightweight alternatives for managing projects, without the "pseudo-agile" rabbit holes of functionality?
Linear seems optimized for teams that just build one product at a time. We tried it and while the app was elegant, the concepts just didn't map to our mode of multiple repos and across many different clients (some product, some consulting, etc). I don't know if that changed, but that stopped us from adopting Linear.
You can use any of the good old [issue tracking systems](https://en.wikipedia.org/wiki/Comparison_of_issue-tracking_s...). AFAIK, Trac and Redmine are quite easy to run self-hosted, as long as you don't need to handle public projects (and therefore spam management).
Realistically, though, you're more likely to be able to convince other project members to use the issue tracker of whatever forge they're comfortable with, for instance Gitlab, Gitea, Pagure or Sourcehut.
Microsoft Planner is included in most Microsoft 365 plans. Pretty much if you've got Teams, you've got Planner (and you can just add Planner as Tab in a Teams channel). At this point it has surprising feature parity with Trello.
Last time I checked (a few months ago), Planner still could not be backed up. Like, at all. If someone went in and deleted a whole bucket you can't recover it, not natively, and not with third party. So that's a big fat no from me.
I seem to recall dumping Planner to JSON easily enough. I don't know if you can easily restore directly from its JSON, but the JSON was nice enough to work on for what little I needed to do with it.
My current employer shut off Trello and forced us over to Jira and is threatening to disable Planner, so I'm "not allowed" to rely on Planner enough day-to-day so it's possible it is either better or worse than I remember it being in that department. But this Jira outage has me reevaluating, and they haven't turned off Planner yet.
Not sure why an issue affecting a tiny number of clients would crater the stock.
I've been pushing for an exit from Jira for a little while now, but this doesn't really add much ammo to that argument for me. It's like pointing at a plane crash and using it to justify the company no longer flying people anywhere.
Yeah, it's not really a good piece of advice. If anyone actually reads that bit of information, they'll just shrug and toss the resume in the bin and move to one of the other hundreds of applicants.
Worse, outside of "we have rebuilt functionality for over 35% of the users", I haven't seen any reports from the people who have ostensibly been recovered.
Next, their published RTO is 6 hours, so obviously they must have done something that completely demolished their ability to use their standard recovery methods: https://www.atlassian.com/trust/security/data-management
Finally, there have been some hints that this is related to the decommissioning of a plugin product Atlassian recently acquired (Insight asset management) which is only really useful to large organizations. I suspect that the "0.18% impacted" number is relative to ALL users of Atlassian, including free/limited accounts, and that the percentage of large/serious organizations who are impacted (and who would have a use for an asset management product), is much higher.