
When the first 32-bit troubles occurred, they should have just switched to a string-based ID, IMO.



To be fair, 2^64 is vastly larger than 2^32, and they might not have been thinking about whether JavaScript could handle the identifiers (one normally assumes these days that programming languages have native support for 64-bit ints).


That was my first thought... who on earth thought that a number (albeit 64-bit) would be enough for Twitter? Who even thought that a 32-bit int would have been enough?

I didn't know that JavaScript couldn't handle numbers bigger than 53 bits, but honestly, these should have been strings from the beginning.


I didn't know that JavaScript couldn't handle numbers bigger than 53 bits

The JavaScript Number type can't handle more precision than 53 bits. Magnitude is orthogonal due to floating-point representation. Precision is governed by the size of the mantissa, which is 53 bits (52 stored plus one implicit leading bit) in the 64-bit IEEE 754 representation used by JavaScript.
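To make the precision-versus-magnitude point concrete, here is a minimal sketch. It uses Python rather than JavaScript, on the assumption that a Python float is the same 64-bit IEEE 754 double as a JavaScript Number; the helper name `survives_double` is made up for illustration.

```python
# A Python float is a 64-bit IEEE 754 double, the same format as a JS Number.
LIMIT = 2 ** 53  # every integer up to here is exactly representable


def survives_double(n: int) -> bool:
    """Return True if n is unchanged after a round trip through a double."""
    return int(float(n)) == n


print(survives_double(LIMIT))       # True
print(survives_double(LIMIT + 1))   # False: needs 54 bits of precision
print(survives_double(2 ** 100))    # True: huge magnitude, but only 1 significant bit
```

So an ID just above 2^53 can silently change when parsed as a number, while a far larger power of two passes through untouched.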


The problem isn't that their ids have already gone past the 53-bit (much less 64-bit) mark in sequential order. The problem is that they are going to start generating ids in a different fashion, and that is what causes the issue.

This seems to be the relevant id generating code: https://github.com/twitter/snowflake/blob/master/src/main/sc...


64 bits is enough to have everyone on Earth send over a billion Tweets and still have enough room to find a new solution. That sounds like more than enough to me.
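A quick back-of-the-envelope check of that claim. The population figure is an assumption (a rough 7 billion), not something stated in the thread.

```python
ID_SPACE = 2 ** 64                  # number of distinct 64-bit values
WORLD_POPULATION = 7_000_000_000    # assumption: rough estimate, for illustration only

ids_per_person = ID_SPACE // WORLD_POPULATION
print(f"{ids_per_person:,} ids per person")
```

That works out to a bit over 2.6 billion ids per person, assuming the ids are handed out densely with no gaps.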


The problem is they added a timestamp and people assumed they would not need the full 64-bit ID.

IMO, it's not a bad idea on their part. A 32-bit UNIX timestamp * 2^32 + a 32-bit sequential id lets them track up to roughly 4.3 billion tweets a second and should work just fine up to the year 2106.
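A minimal sketch of that "timestamp * 2^32 + sequence" scheme, just to check the arithmetic. This is the layout described in the comment above, not necessarily what Twitter ships; `make_id` and `split_id` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

FIELD_BITS = 32
FIELD_MAX = 2 ** FIELD_BITS  # 4,294,967,296 values per field


def make_id(unix_seconds: int, sequence: int) -> int:
    """Pack a 32-bit timestamp (high half) and a 32-bit sequence (low half)."""
    if not 0 <= unix_seconds < FIELD_MAX:
        raise ValueError("timestamp does not fit in 32 bits")
    if not 0 <= sequence < FIELD_MAX:
        raise ValueError("sequence does not fit in 32 bits")
    return (unix_seconds << FIELD_BITS) | sequence


def split_id(tweet_id: int) -> tuple[int, int]:
    """Recover (unix_seconds, sequence) from a packed id."""
    return tweet_id >> FIELD_BITS, tweet_id & (FIELD_MAX - 1)


# The first moment an unsigned 32-bit timestamp can no longer represent.
rollover = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=FIELD_MAX)
print(rollover.year)  # 2106
```

Because the timestamp sits in the high bits, any id from a later second compares greater than every id from an earlier second.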

Edit: As to why it's a good idea, you can have different systems handing out IDs without stepping on each other's toes or even talking to each other. The full ID is composed of a timestamp, a worker number, and a sequence number. Granted, I would probably put the sequence number ahead of the worker number so sub-second tweets are better ordered vs. being ordered strictly based on the system that generated them.
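Here is a rough sketch of the two field orders being compared. The bit widths below are assumptions chosen for illustration (they add up to 63 bits), and both function names are hypothetical; check the Snowflake source for the real layout.

```python
# Assumed field widths, for illustration only.
TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12


def pack_worker_first(timestamp: int, worker: int, sequence: int) -> int:
    """Layout: timestamp | worker | sequence (worker outranks sequence)."""
    return (timestamp << (WORKER_BITS + SEQUENCE_BITS)) | (worker << SEQUENCE_BITS) | sequence


def pack_sequence_first(timestamp: int, worker: int, sequence: int) -> int:
    """Layout: timestamp | sequence | worker (sequence outranks worker)."""
    return (timestamp << (WORKER_BITS + SEQUENCE_BITS)) | (sequence << WORKER_BITS) | worker


# Two ids minted in the same time slot by different workers.
a = dict(timestamp=1000, worker=5, sequence=0)
b = dict(timestamp=1000, worker=1, sequence=3)

print(pack_worker_first(**a) > pack_worker_first(**b))      # True: worker 5 sorts after worker 1
print(pack_sequence_first(**a) < pack_sequence_first(**b))  # True: sequence 0 sorts before sequence 3
```

Whether sequence-first really gives a better ordering depends on how each worker assigns its sequence numbers; if every worker counts independently, the tie-break within one time slot is somewhat arbitrary either way.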


Yeah, I guess this is why I didn't understand how they could use 64-bit numbers to begin with... I couldn't see how they were going to be able to generate them all without leaving huge gaps in the number-space. If you used 64-bit ints you'd be unable to have one machine do the generating, so you'd have to have some sort of offset for the worker who generated the ID at the very least.

And once you've done that, why not just go all out and use UUIDs?


This is exactly what they have done in their new ids and the snowflake system. They have a timestamp, a sequence number and a system identifier, plus a few neat properties.

I don't think they can be blamed for using a trivial incremental key when they had 10 users, I am sure they were not expecting to have 200M :)


Agreed. And also, there's not a lot of manipulation/math to be done on these IDs once they come back to the client. They're only used again when an operation happens on a tweet (favoriting it, replying to it) and in that case you're just taking the ID and sending it back to Twitter.


Not really. You (can) use since_id and max_id in many places, which means you need them to be ordered (and trivially so).

This is of course doable with strings too, once you decide what the ordering is.
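One way to "decide what the ordering is" for string ids, as a sketch: plain lexicographic comparison gets decimal strings of different lengths wrong, so compare by length first and then by characters. This assumes the ids are non-negative decimal strings with no leading zeros; `id_key` is a hypothetical helper.

```python
def id_key(id_str: str) -> tuple[int, str]:
    """Sort key giving numeric order for decimal strings without leading zeros."""
    return (len(id_str), id_str)


ids = ["99", "100", "12345678901234567890", "9007199254740993"]

print(sorted(ids))              # lexicographic: "100" lands before "99"
print(sorted(ids, key=id_key))  # numeric order, without ever parsing to a number
```

Zero-padding every id to a fixed width would work as well, and would let plain string comparison do the job.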


Tweets come back ordered newest to oldest so there's nothing to order if you don't want to. And accessing since_id or max_id just to send it back to Twitter is exactly what I was referring to: no math to be done, just storing & accessing a value so it doesn't have to be in number format.

In a more practical application, a timeline comes back and the top-most and bottom-most IDs are stored. When a user gets to the bottom an API call is made to load more so you look at the bottom-most ID. If they want to load new tweets you look at the top-most ID. No math needed, just looking up values. They could've been strings all along and it wouldn't have mattered much.


I said that there are more operations than favoriting and replying by id.

I don't understand how saying you're gonna access the timeline without looking at the values denies that.

Moreover, I did not object to using strings, I said you only want them to be orderable.


I imagine that they did think of this, but a good string hash takes time. So switching to 64 bits takes much less time, and it would work until they finished snowflake.


The 64-bit id is snowflake.


Technically, snowflake ids are 63 bits. Java doesn't have unsigned longs, so they don't touch the sign bit.


We've been running snowflake for weeks already.



