Twitter IDs to roll past 53 bits in a couple days (may break javascript apps) (groups.google.com)
86 points by there on Nov 23, 2010 | 46 comments



For those wondering where 53 bits comes from: it's the size of the mantissa (significand, fraction) in a 64-bit IEEE float.

You can use 64-bit floats (doubles) to store exact integers up to 53 bits.

This works in a lot of languages even on non-64-bit hardware; PHP, for example.
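
To make the cutoff concrete, here is a quick sketch you can paste into any JavaScript console (the 2^53 boundary is a property of IEEE 754 doubles, not of any particular engine):

    // Integers are exact up to 2^53; past that, doubles can no longer
    // represent every consecutive integer.
    var limit = Math.pow(2, 53);            // 9007199254740992
    console.log(limit - 1 === limit - 2);   // false -- still exact below 2^53
    console.log(limit === limit + 1);       // true  -- 2^53 + 1 rounds back down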


My rule of thumb is to use strings for all IDs that belong to external systems. A bunch of people got burned in the same way a few years ago when Flickr photo IDs rolled past 32 bits.


I'm still surprised people use incrementing ints for IDs instead of UUIDs for web APIs that may become immense.

http://www.ietf.org/rfc/rfc4122.txt

UUIDs are made for exactly this purpose: string-based and never longer than 40 characters (with dashes and curly braces).

Most products use UUIDs, or GUIDs as Microsoft calls them.


I prefer my IDs universally unique. I worry that globally unique IDs won't scale as we begin to colonize other planets. (Yes, it's a joke.)


Stating it is a joke kind of ruined the joke.


we don't use incrementing ids, but we needed ids that increase over time (so that you can sort tweets by them). hence, snowflake: http://github.com/twitter/snowflake


You don't even need to go that far. For non-critical applications (i.e., your web app), you can randomly generate a small string, say 12 bytes, using base-62 characters (A-Za-z0-9) to serve as a probably unique user ID (with VERY high probability).
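
A minimal sketch of that idea in JavaScript (the alphabet and the 12-character length are just what the comment above suggests, not anything a particular service uses):

    // Build a random ID from a base-62 alphabet. With 12 characters there are
    // 62^12 (about 3 * 10^21) possibilities, so collisions are extremely
    // unlikely. Math.random() is not cryptographically secure, which is fine
    // for the non-critical case described above.
    function randomId(length) {
      var alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
      var id = '';
      for (var i = 0; i < length; i++) {
        id += alphabet.charAt(Math.floor(Math.random() * alphabet.length));
      }
      return id;
    }

    console.log(randomId(12)); // e.g. "kX3b9ZqPw0aA"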


Heard of GUIDs? Make part of it a timestamp, and the chance of collisions goes down astronomically further.


In some cases, including a timestamp in an ID can be giving away information considered private. Sure, you could just hash the resultant ID, but then you're getting back to random digits anyway.


My rule of thumb is that if you apply math operators to it, it should be a number; if not, make it a string. There are still rare cases where you use a number as an ID because you're trying to save space, but typically not for a web app. I've also found that there are even cases where you don't need record IDs at all. Statistical data, for example.


The twitter api includes string representations of the ids already and we encourage everyone to use those rather than the numeric ones.
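
For anyone wondering why the string field matters: once IDs pass 2^53, the numeric field silently loses precision when the JSON is parsed in JavaScript, while id_str survives intact. A small illustration (the ID value is made up for the example):

    // The numeric id is rounded to the nearest representable double;
    // the string form round-trips exactly.
    var json  = '{"id": 9007199254740993, "id_str": "9007199254740993"}';
    var tweet = JSON.parse(json);
    console.log(tweet.id);      // 9007199254740992 -- off by one, silently
    console.log(tweet.id_str);  // "9007199254740993" (a string, exact)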


yeah. don't get why everyone didn't just convert to string after the last twitpocalypse


Yeah, the only problem is that it's very inefficient from a storage perspective. It depends on the situation, of course. If it's only a few thousand a day, for example, it's not a big deal, but otherwise it could end up biting you in the ass hard after a while.


When the first 32-bit troubles occurred, they should have just switched to a string-based ID, IMO.


To be fair, 2^64 is a whole lot bigger than 2^32, and they might not have been thinking about whether JavaScript could handle the identifiers (one normally assumes these days that programming languages have native support for 64-bit ints).


That was my first thought... who on earth thought that a number (albeit a 64-bit one) would be enough for Twitter? Who even thought that a 32-bit int would have been enough?

I didn't know that Javascript couldn't handle numbers bigger than 53-bits, but honestly, these should have been strings from the beginning.


I didn't know that Javascript couldn't handle numbers bigger than 53-bits

The JavaScript Number type can't handle more precision than 53 bits. Magnitude is orthogonal due to floating-point representation. Precision is governed by the size of the mantissa, which is 52+1 bits long in the 64-bit IEEE 754 representation used by JavaScript.
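
A quick console demo of that distinction (a sketch; nothing Twitter-specific here):

    // Magnitude is not the problem -- doubles go up to about 1.8e308 --
    // but above 2^53 integer precision is gone: neighbouring integers
    // collapse onto the same double.
    var big = Math.pow(2, 64);              // stored exactly, well within range
    console.log(big === big + 1);           // true -- the +1 is absorbed
    console.log(big < Number.MAX_VALUE);    // true -- nowhere near the magnitude limit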


The problem isn't that their ids have already gone past the 53 bit (much less 64 bit) marker in sequential order. The problem is that they are going to start generating ids in a different fashion which is causing the issue.

This seems to be the relevant id generating code: https://github.com/twitter/snowflake/blob/master/src/main/sc...


64 bits is enough to have everyone on Earth send over a billion Tweets and still have enough room to find a new solution. That sounds like more than enough to me.


The problem is they added a timestamp and people assumed they would not need the full 64 bit ID.

IMO, it's not a bad idea on their part. A 32-bit UNIX timestamp * 2^32 + a 32-bit sequential ID lets them track up to 4.2 billion tweets a second and should work just fine up to the year 2106.

Edit: As to why it's a good idea: you can have different systems handing out IDs without stepping on each other's toes or even talking to each other. The full ID is composed of a timestamp, a worker number, and a sequence number. Granted, I would probably put the sequence number ahead of the worker number so that sub-second tweets are better ordered, vs. being ordered strictly based on the system that generated them.
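
Roughly, that composition looks like this. This is a sketch using BigInt (which today's JavaScript has; plain Numbers can't hold 64 bits), and the 41/10/12 bit split shown is snowflake's published layout rather than the 32+32 split proposed in this comment:

    // Compose an ID from a millisecond timestamp, a worker number and a
    // per-worker sequence counter.
    function makeId(timestampMs, workerId, sequence) {
      return (BigInt(timestampMs) << 22n)   // ~41 bits of time (ms since some epoch)
           | (BigInt(workerId)    << 12n)   // 10 bits of worker id (0-1023)
           |  BigInt(sequence);             // 12 bits of sequence (0-4095) per ms per worker
    }

    // IDs from different workers can't collide, and sorting them sorts by time first.
    var id = makeId(Date.now(), 7, 42);
    console.log(id.toString());             // hand this around as a string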


Yeah, I guess this is why I didn't understand how they could use 64-bit numbers to begin with... I couldn't see how they were going to be able to generate them all without leaving huge gaps in the number space. With 64-bit ints you'd be unable to have one machine do all the generating, so you'd have to have some sort of offset for the worker that generated the ID, at the very least.

And once you've done that, why not just go all out and use UUIDs?


this is exactly what they have done in their new ids and the snowflake system. They have a timestamp, a sequence number and a system identifier, plus a few neat properties.

I don't think they can be blamed for using a trivial incremental key when they had 10 users, I am sure they were not expecting to have 200M :)


Agreed. And also, there's not a lot of manipulation/math to be done on these IDs once they come back to the client. They're only used again when an operation happens on a tweet (favoriting it, replying to it) and in that case you're just taking the ID and sending it back to Twitter.


not really. You (can) use since_id and max_id in many places, which means you need them to be ordered (and trivially so).

This is of course doable with strings too, once you decide what the ordering is.
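
For example, plain lexicographic order isn't numeric order once the IDs have different lengths, but it's easy to get numeric ordering without ever parsing the strings into (lossy) Numbers. A small sketch:

    // Default sort is lexicographic, which misorders decimal ID strings...
    console.log(['9', '10', '2'].sort());            // ["10", "2", "9"]

    // ...but comparing by length first, then lexically, matches numeric order
    // (assuming non-negative IDs with no leading zeros).
    function compareIds(a, b) {
      if (a.length !== b.length) return a.length - b.length;
      return a < b ? -1 : a > b ? 1 : 0;
    }
    console.log(['9', '10', '2'].sort(compareIds));  // ["2", "9", "10"]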


Tweets come back ordered newest to oldest so there's nothing to order if you don't want to. And accessing since_id or max_id just to send it back to Twitter is exactly what I was referring to: no math to be done, just storing & accessing a value so it doesn't have to be in number format.

In a more practical application, a timeline comes back and the top-most and bottom-most IDs are stored. When a user gets to the bottom an API call is made to load more so you look at the bottom-most ID. If they want to load new tweets you look at the top-most ID. No math needed, just looking up values. They could've been strings all along and it wouldn't have mattered much.
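
Concretely, the client-side bookkeeping looks something like this (a sketch; loadTimeline stands in for whatever HTTP call the app actually makes, and the IDs stay opaque strings throughout):

    // Remember the newest and oldest IDs from the last page, as strings.
    var newestId = null, oldestId = null;

    function remember(tweets) {
      if (tweets.length === 0) return;
      newestId = tweets[0].id_str;                   // timeline arrives newest-first
      oldestId = tweets[tweets.length - 1].id_str;
    }

    // Scrolling down asks for older tweets; refreshing asks for newer ones.
    function loadOlder() { return loadTimeline({ max_id: oldestId }); }
    function loadNewer() { return loadTimeline({ since_id: newestId }); }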


I said that there are more operations than favouriting and replying by id.

I don't understand how saying you're just going to look at the values when accessing the timeline contradicts that.

Moreover, I did not object to using strings, I said you only want them to be orderable.


I imagine that they did think of this, but a good string hash takes time. Switching to 64 bits took much less time, and it would work until they finished snowflake.


The 64-bit id is snowflake.


Technically, snowflake ids are 63 bits. Java doesn't have unsigned longs, so they don't touch the sign bit.


we've been running snowflake for weeks already


JavaScript apps have been broken for a couple weeks now with links to things like:

http://twitter.com/mattyglesias/statuses/7.1157195777E+15


Being in that majority of people who haven't built Twitter-scale systems in the past, could someone explain to me why Twitter is moving to a new form of twitter ID in the first place? What is wrong with their current system?

Also, it seems strange that they would include the new ID in both string AND integer form in their JSON. I realize that they don't want to break existing JavaScript apps, but isn't there a significant bandwidth cost in adding that sort of kludge to the API when you're serving a quadrillion of these API requests every day?


How do you generate sequential IDs a quadrillion times a day if they're coming from 100 different computers? Lots of very fast coordination.

In the new system, they don't have to coordinate every server just to make sure IDs are sequential. They just use the time stamp and some machine-specific information.


I'm going to take the opposite position of most commenters here and decry the fact that almost all programming languages get arithmetic wrong.

Adding two positive numbers should never result in a negative number. Adding one to an integer should result in the next largest integer.

There are new languages created all the time that don't have built-in support for arbitrary precision integers or rational numbers. They might have some neat ideas, but if your language can't even get arithmetic right, it's garbage.


Arithmetic is a relatively self-contained part of a language. The problem isn't that JavaScript's arithmetic is lame, it's that they didn't leave any way for the programmer to fix it—even if you make rational bignums out of arrays of doubles (or whatever), you can't make them drop-in replacements by overloading operators. It's a strange oversight for a language which lets instances override methods.


>It's a strange oversight for a language which lets instances override methods.

That's only because everything is an instance in Javascript. There are no classes, it's prototype OO.


A lot of languages also get arithmetic right, for example Python. Most languages that get it 'wrong' are low-level languages (such as C), which have a good reason, as CPUs have limited width registers. However, I agree that there is no excuse for Javascript to not have arbitrary length integers. Then again, that language is broken and ugly in a lot of ways.


That's truly, truly dumb on Twitter's part. They should have started out using the whole numeric range and/or switched to string IDs long ago.

As I suggested in a post 9 months ago, and as I would have designed this system at any date since 2002 or so, the tweet ID would be composed of userid bits + time bits. In the post below I suggested 32+32, but other divisions are acceptable depending on your "bot user" policy. Such an ID would be sufficient for up to 4G users at up to 5 tweets/sec/user AVERAGE, up until 2030. You can have twice as many users for just "2 tweets/sec average". Facebook only has 500M users, so 4G should be sufficient.

Such a construct makes the entire system significantly simpler and more robust to "meaningful" failure. I haven't seen a single thing Twitter has done right in the technical sense.

They do deserve marketing and bizdev credit.

http://www.reddit.com/r/programming/comments/b2u6t/twitter_o...


-1. With hindsight it's easy to make a comment like yours, but many clients rely on status_ids being monotonic.

You would be changing behavior in a way that breaks apps. Twitter doesn't want to do that.


-1 all you like. It's engineering, not hindsight -- when building a system, I always ask myself "how does this scale" which usually translates to "on what attribute does this shard".

Look at e.g. YouTube and many other sites around the same time. They knew what they were doing; Twitter didn't.

I _have_ actually designed such a system, in 1999; it used 48 bits and worked perfectly well. (I only had 28 bits for the user id, which would have broken at around 250M users; alas, the system never had more than 5M. This was in the years 1999-2003.)

The only way you can shard absolutely monotonic IDs is (effectively) randomly, which is an option however you assign IDs; but other assignment schemes let you build a much cheaper, much more robust system.


I don't see how your comparison is remotely fair to Twitter. YouTube is just a website; if they wanted to change how they allocate video numbers, nobody except rogue bots would notice.

Twitter is an API. Even if they have the knowledge to fix past mistakes, they need to ask for feedback, give plenty of notice, and set a deadline for when the old version is cut off.

Maybe they didn't do everything perfectly on day one, but I don't think there are any APIs the size of Twitter's. Cut them some slack; they're not morons.


The point was exactly that YouTube, which had a simpler problem to solve at around the same time (or earlier), did it properly.

There do exist independent YouTube clients (though not as many as Twitter's), but with the encoding they chose, YouTube has made sure it will never be an issue, whereas for Twitter it has already been a significant issue twice (that I'm aware of).

It's very easy to dismiss sound engineering in retrospect as luck or as "how could anyone have known".


I am wiser than any god or scientist, for I have squared the circle and cubed Earth's sphere, thus I have created 4 simultaneous separate 24 hour days within a 4-corner (as in a 4-corner classroom) rotation of Earth. See for yourself the absolute proof!


This wouldn't work for us. We need the ids to sort by time across users, which this wouldn't do.



They have been doing the same thing with the next-page string that you have to pass back to get the next page of an API request. I assume this has been the case since the new API came about, at least.



