It really does not make any sense to me, given that @replies were not even a feature of twitter originally. Isn't the folk story that they only made @replies into a link once people had started marking replies that way on their own?
The way it is now (from my understanding) is that for every reply they have to check if somebody is following the person that is being replied to. How can this be more efficient than simply sending the reply-tweets to all followers (which they have to do anyway with all the other tweets that are not replies)?
They don't have to check 1.7 million users to see if they have @replies on. If they maintain a list of "@reply subscribers", there's little overhead as far as I can see. If something rarely changes but is checked a lot, it's a pretty good idea to have it pre-calculated.
Edited: "No overhead" was a bit too absolute, thanks jpwagner.
Re-Edited: Oh look there was a commenting feature on the blog, was pretty hard to find.
That solution doesn't address my question. Caching @reply subscribers only addresses the 3% of users who have opted-in to receiving all @replies. For the remaining 97% whether they receive @replies from someone they are following or not is a function of which user the reply is directed to and whether that user is also their friend. You can't cache that, at best you can optimize how you calculate who should receive replies.
In fact, what you've pointed out is that building an implementation to address the 3% case is straightforward which was the point of my post.
PS: My blog does have a commenting feature. In fact there are 3 comments in response to the post.
If you're interested in wild speculation (I have no idea how Twitter's database works), here's how I interpreted Biz's explanation:
With the new system, say user Foo writes "@Bar lol me too!". Then Twitter can take Foo's follower list, join it with Bar's follower list, and send the message to everyone in the resulting list. Relational databases are very good at joins.
On the other hand, with the old system, they'd have to do a deep inspection of the record for each of Foo's followers to know if they should send the message to that follower. Relational databases are much less good at this.
But, as you and others have pointed out, if the number of users that use the "all @-replies" feature is really so small, it would be fairly inexpensive to cache the list of all of Foo's followers who use that feature, and join them in as well. I don't know why they don't do that -- maybe it adds up (like, if only 3% of users use the feature, but those users follow a lot of other users, they'll each end up in a lot of other users' caches).
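To make the comparison above concrete, here is a minimal Python sketch of the three strategies: the join, the per-follower inspection, and the join plus a cached opt-in list. Everything in it is an assumption for illustration (the in-memory `followers` and `wants_all_replies` structures, the user names, the function names); it says nothing about how Twitter's database actually works.

```python
# Hypothetical in-memory model; not Twitter's actual data layout.
followers = {
    "foo": {"alice", "bob", "carol", "dave"},
    "bar": {"bob", "carol", "erin"},
}
# Users who opted in to seeing every @reply from people they follow.
wants_all_replies = {"dave"}


def recipients_new(author, target):
    """New system: a single set intersection (the 'join')."""
    return followers[author] & followers[target]


def recipients_old(author, target):
    """Old system: inspect each of the author's followers one by one."""
    result = set()
    for user in followers[author]:
        if user in wants_all_replies or user in followers[target]:
            result.add(user)
    return result


def build_optin_cache():
    """Precompute, per author, the followers who opted in to all replies."""
    return {author: fans & wants_all_replies for author, fans in followers.items()}


def recipients_cached(author, target, optin_cache):
    """The join, plus a union with the small precomputed opt-in list."""
    return (followers[author] & followers[target]) | optin_cache[author]
```

In this toy model `recipients_cached("foo", "bar", build_optin_cache())` returns the same set as `recipients_old("foo", "bar")`, without touching each follower's settings at send time. The cost that moves elsewhere is keeping the cache current whenever someone follows, unfollows, or flips the setting.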
The 3% (of users affected) number has been trotted out a few times. Has twitter ever backed that up? It feels like a way of marginalizing the people who're complaining about the change, by painting them (us) as a vocal minority.
Taking into consideration the fact that the percentage of people likely to put in the time to explore documentation and settings and discover the original configurability is probably quite small, and a reasonable assumption that not everyone who discovers the functionality will use it, yeah, the number sounds all right.
Note, btw, that the number they seem to be going for is the above -- e.g., "according to the accounts database, only 3% of users ever changed this setting", _not_ "only 3% of users are claiming to care about this now that it's been removed".
Give me a break, scaling Twitter is not easy. CS principles do not magically allow you to construct an optimal solution for real world engineering problems. At Twitter's volume, the tiniest bit of extra latency in some middleware or network hardware can be the bottleneck.
It's so easy to sit here and play armchair engineer and diagnose Twitter's (mostly past, btw) problems with a nauseating mix of hubris and ignorance.
What makes people laugh at twitter is that the problem is not that hard, but they make all the classic mistakes.
Every time someone wrote a reply Twitter... this was the most expensive work the database
Sorry, I am going to laugh at people having scaling issues when they use a database like that on a messaging system. There was an essay by PG where as soon as he heard they were using Oracle (or C?) he instantly knew to ignore them for much the same reason.
I'll try, but let me ask you this, which post was more personally offensive? What I said in response to one flip comment? Or what the comment said about a whole group of successful people without any reasoning or substantive argument?
Yours was worse. His was a type of inane throwaway remark that's at least marginally acceptable to make about famous groups or individuals. E.g. "the 49ers suck this year." Yours was a bitter, personal attack on an individual.
> ... with a nauseating mix of hubris and ignorance.
That, and the pointing out of quite natural questions which Twitter has failed to address in its explanation of how it is acting. I mean, once you know what you are doing, being able to satisfy people asking ignorant and hubristic questions is a pretty sweet bonus.
Scaling twitter would have been relatively easy using Erlang, because it has all the hard parts solved as basic features of the language. They decided to go fishing instead.
It's not a silver bullet, but it is specifically designed for problems like Twitter's back-end issues. If it can handle running telephone systems for years at a time with a large string of 9s of uptime, I think it can handle broadcasting SMS. Well. It's the most proven tool for this type of job and the only one I'm aware of designed from the ground up for building applications like twitter that are utterly reliable.
Not a case of magic, just the right tool for the right job. You could code a web app in C, but why would you? Making twitter in Ruby (and to a lesser extent Scala) is the same thing.
I clicked the profile link to your blog, and it looks like you have a lot of Ruby experience. Compare the syntax for message passing between Erlang and Ruby.
Pid ! {my_fun, Args},
That's it. Passing messages between pids on nodes across a network isn't much harder.
Compare the difficulty of parallel computation and storing data in parallel across many systems reliably between Erlang and Ruby. Doing that stuff in Erlang is just as easy as doing CRUD in RAILS, because the tools are designed to make it easy.
The downside is that Erlang is pretty bizarre and hard to learn if you've never coded in a functional language - you only get to set values ONCE, etc. That's a critical feature for debugging large pools of processes working together, but it sure is strange. Which is why you wouldn't use Erlang to do a simple CRUD web app, you'd use RAILS.
Seriously... it's an entire language based around passing messages between lightweight pids working in concert, across many systems, with built-in fail-safes for any failures, and a rock-solid distributed/redundant data store baked in.
In this situation it's just about as close to a silver bullet as you will ever see. Erlang would make writing twitter's back-end FUN.
A dozen crazy Swedish mad scientists slaved over a language for years to make creating scalable applications like twitter trivial, only to be totally ignored by most everyone until quite recently. And still twitter picks something else for the rebuild.
'Heavy metals poisoning' would seem to be a more apt metaphor than 'silver bullet.'
That wasn't a bait and switch. He suggested the engineering problem isn't as hard as is being made out, and then when you claimed otherwise, he explained that he was in a strong position to know otherwise. In the real world, this is known as a "good argument".
He didn't claim his experience proved HE could do it. He didn't imply that you NEEDED his experience to do it. He was calling you out on your claim by showing that he had done it, and so knew from experience that it wasn't hard. You're the one moving the goal posts.
Note: I actually don't have an opinion on whether that experience is relevant to twitter's situation, but it certainly wasn't a bait and switch.
Evidently the concept of studying theory in school then applying that theory in the practical world of industry is too complicated to fit into 140 chars.
How much did Twitter cost to develop vs TIBCO Rendezvous? How much of the "easy" part did you come up with yourself vs working on a polished product that already existed? What are the actual volumes that Rendezvous handles and how do the details of the network topology compare? How do the details of the functionality compare? How much scalability analysis do they teach in CS (answer: only theory)? Is anyone who doesn't have your particular experience an idiot? Is working on a product for a small collection of the filthy rich better work than creating something that millions of people directly identify with?
Finally, how much do you know about Twitter's architecture and do you know the definition of hubris?
What if they have a fast way to get the intersection of the followers of 2 different users? In this case, if a message is a @reply they could ignore the follower lists of each user and use the intersection list only (which can be cached). Supporting the 'get all replies' option means that they also need to consider the full follower lists and then deduplicate. Just guessing here.
First, I sympathize with the Twitter engineers who have had to build a scaling system improvisationally under incredible growth and time constraints. I doubt they have time to reevaluate early limiting decisions except when things are on fire, and even then they probably only have enough time to go for the quick fix.
Ultimately, though, I'm sure they'll figure out a way for this to work -- especially as it only requires "3%" of users to get @unacquainted-replies the same way they get broadcast messages.
One approach: split accounts into two inner accounts, for example, aplusk and aplusk(all). Only the original name is ever publicly shown -- though the account's public page shows all the tweets, as now. However, which one people actually 'follow' depends on their 'show all @unacquainted replies' setting. Thus every aplusk(all) tweet is handled as a broadcast, because all those followers -- only 3% of the total -- prefer it that way.
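A rough Python sketch of that split-account idea, to show where the work would land. The `SplitAccount` class, its two inner sets, and the `deliver` method are all hypothetical names made up for this illustration; it assumes each follower sits in exactly one inner list, chosen by their reply setting.

```python
class SplitAccount:
    """One public name backed by two hypothetical inner follower lists."""

    def __init__(self, name):
        self.name = name
        self.plain = set()        # followers with the default (filtered) setting
        self.all_replies = set()  # followers who want every @reply

    def follow(self, user, show_all_replies=False):
        # Route the follower to exactly one inner account; calling this
        # again with a different setting moves them to the other list.
        self.unfollow(user)
        (self.all_replies if show_all_replies else self.plain).add(user)

    def unfollow(self, user):
        self.plain.discard(user)
        self.all_replies.discard(user)

    def followers(self):
        return self.plain | self.all_replies

    def deliver(self, reply_to=None):
        """Return recipients; reply_to is the SplitAccount being replied to, if any."""
        if reply_to is None:
            return self.followers()  # ordinary tweet: broadcast to everyone
        # Reply: broadcast to the (all) list, filter only the default list.
        return self.all_replies | (self.plain & reply_to.followers())
```

With this layout the setting is only consulted when someone follows or changes it, not on every reply. The filtered path for the default list is still the same intersection as before.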
"In that case I’d expect Twitter to argue that the feature they want to remove for engineering reasons is filtering out some of the tweets you see based on whether you are a follower of the person the message is directed to not the other way around. "
That would be easier for the twitter engineers, but that's not what 97% of twitter users want. The author doesn't acknowledge this.
That 97% only indicates that most people don't change the defaults; it says nothing about what Twitter users want.
In this case it's particularly misleading, since we're talking about an option to turn off silent data loss. What would make someone track down a setting related to a problem they don't know they have? A more honest representation would be: "A small number of our users were aware that we filter @replies by default, some were happy about it and a few (3% of total users) changed their settings to remove the filtering."