Don't build services based on other people's APIs.
It's sad but true. For all the talk about mashups, it's rare to find a demo that's actually cool and much rarer to find a real application because of these problems with API limits. Sure, you might be able to build something that can handle 99% of twitter users, but the interesting and profitable 1% will blow out the API limit and then you're hosed.
So, your solution to the problem (carefully stated and with recompense offered) is: "Have a time machine"?
I guess it is a good solution, but it would require a lot of fundamental physics work so it might not match the time-frame he needs.
My point, and I apologise for the snark, is that when people have a problem and ask for help, they are not asking for judgement on things they should have done in the past; they are looking for ideas on how to move forward. If you feel strongly that people who build on APIs do not deserve this help, then perhaps consider making that point on any of the frequent "X API sucks" threads that pop up from time to time.
he is both right and wrong. relying on APIs can suddenly make the world very complex and high-pressure, but on the other hand, building/owning your complete ecosystem is sometimes not doable when you're starting out (wish I'd started Twitter, but I haven't...). so his advice, while not helpful, does have some truth between the lines.
you do have a point. it's incredibly fascinating to use an API since it instantly connects you with a large ecosystem - in this case, a large twitter universe - but once you start charging and making it a business, it becomes very much a middle-man-ish, high-pressure type of situation.
Maybe I'm stating the obvious, but you write that:
So I scale back up to 100 follower requests for the next call. It goes through. Next 100, fails. I scale down again …
So, that sounds like a cache is being primed on the first request. If this is a consistent pattern, you should be able to issue a request for one record, followed by a request for 100, and get it through most of the time (e.g. unless you run into a garbage collection cycle). If you code against this assumption, you should be able to utilise 50% of the theoretical limit of 100/hour.
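In code, the idea looks roughly like this - a minimal sketch, where fetch is a hypothetical wrapper around the actual Twitter user-details call and the two-second pause is just a guess:

    import time

    def fetch_batch_with_priming(ids, fetch, pause=2.0):
        """fetch(ids) -> list of user detail dicts; assumed to raise
        TimeoutError when Twitter times out. Hypothetical wrapper,
        not a real client method."""
        try:
            # Spend one call on a single record to warm Twitter's cache.
            fetch(ids[:1])
        except TimeoutError:
            pass  # even a failed priming call may have started the cache load
        time.sleep(pause)  # give their backend a moment to finish loading
        # The full batch of 100 should now mostly be served from that cache.
        return fetch(ids)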
sometimes, it follows a clear pattern. request 100, fail. request the same 100, success. repeat. --- that's likely because they load it into the cache once they have the results. it halves my success rate, basically.
but sometimes it's just random (depending on twitter overall traffic I'd assume).
plus, some records are faulty and cannot be fetched. that causes other issues as well and wastes the api call.
1. Can you modify your system to return results based on a subset of their followers? If this is still of value to the user, it looks like the way to go, providing results based on a subset followed by another set of "final" results based on the full follower graph.
2. If you have enough celebrity-scale accounts using your service, are you able to share their followers' details and cut down on the time used to pull them?
3. On the business side, the service sounds dangerously like something Twitter would kill if it becomes successful (by some definition of successful), because you are pulling out their follower graphs. Look at what happened to Tumblr and Instagram. While it is good to milk it while you can and build it with ambition, it is also wise to look further ahead and prepare for the day it gets shut down and revenue goes to 0. If you have nothing to lose, go ahead, but be aware.
Great, great points. Here are my responses:
1) Yes, that seems to be the only way out right now. But my core features (most popular followers / most valuable followers) depend on a full set and will not be accurate otherwise. Growth data works fine though. The main reason my service got traction was the most popular followers / most valuable followers metrics, though. So that sucks. But it seems to be my only resort right now.
2) I can get IDs faster, and then theoretically use my own DB to see whether I need details for that ID (i.e. follower details) from Twitter or already have them stored. Problem is: I am duplicating Twitter's database. Also, I had this before and the database grew to an immense size and kept crashing all the time despite efforts to avoid it. And celebrities have a wide distribution of followers, so I'm unlikely to save much time by checking whose details I already got elsewhere.
3) Yes, although Twitter recently announced their quadrants of services that will be supported by them, and one of them is social analytics, which is what I do. So that should be good. But you have a point: I can't touch any money for up to 1 year in case Twitter shuts it down, because I want to pay back the remaining unused time to my users. So it's frozen money.
Focusing on 1): if "most popular follower" means "follower with the most followers", you could try one of three strategies.
First, just check which of the top 100 (1000) most-followed users follow the celebrity - this will likely eliminate the heaviest hitters immediately.
For accounts with a normal amount of followers, do as you already do.
For the in-betweeners:

    Pick your celebrity.
    For a random 1% (.1%) of his/her followers:
        Retrieve the list of people followed by this follower.
    For each user followed by a sampled follower of the celebrity:
        Increase that user's tally by 1.
    For each user in the top 1% (10%, .1%) of the resulting tally:
        Retrieve their number of followers.
    Report the user with the highest follower count from the above.
This is, of course, based on the idea that often-followed users who follow a celebrity will also have many followers among the followers of the celebrity. Results will become more accurate as you poll more followers, of course.
(I don't use Twitter or their API, and the above may be completely wrong.)
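A rough Python sketch of that sampling idea, assuming hypothetical wrappers around the relevant Twitter calls (roughly followers/ids, friends/ids and a user lookup), and restricting candidates to actual followers of the celebrity:

    import random
    from collections import Counter

    def estimate_most_popular_follower(celebrity_id, get_follower_ids,
                                       get_friend_ids, get_follower_count,
                                       sample_frac=0.01, top_frac=0.01):
        # The three get_* arguments are hypothetical API wrappers.
        followers = get_follower_ids(celebrity_id)
        follower_set = set(followers)
        sample = random.sample(followers, max(1, int(len(followers) * sample_frac)))

        # Tally how often each follower-of-the-celebrity is itself followed
        # by the sampled followers.
        tally = Counter()
        for fid in sample:
            for followed in get_friend_ids(fid):
                if followed in follower_set:  # only actual followers qualify
                    tally[followed] += 1

        # Only spend exact follower-count lookups on the top slice.
        top_n = max(1, int(len(tally) * top_frac))
        candidates = [uid for uid, _ in tally.most_common(top_n)]
        return max(candidates, key=get_follower_count)

The accuracy/cost trade-off lives in sample_frac and top_frac: a bigger sample costs more friends/ids calls but makes the tally a better proxy.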
see, this is what I came for. I believe there's room for optimization within the limits I have to adhere to right now, and your approach seems like a stab at it - I'll think it through again in a bit to see if it makes sense. you're hitting the nail on the head in terms of where my problem is: I can't change what twitter does, and I can display partial results but that's suboptimal in many ways, so how can I make these partial results the best possible quality? this is where your answer seems to make sense - thanks. I'll think about this one for sure.
Digging further into 3): if your app is proven to be in the 'good' quadrant, have you ever considered contacting Twitter directly and requesting a higher API rate limit?
I can imagine they are ok with that if your app provides value to their ecosystem.
well, i constantly mention the issue through their bug reports and discussions; they acknowledge it and promised to talk to the team about re-evaluating the way things are handled now, but I assume that means it's not a priority.
i tried signing up through one of their partner-program forms but, as expected, no response.
With the right infrastructure and tech, the "mirror" DB shouldn't crash. And also keep in mind how many people follow a lot of celebrities… their follower lists have a lot in common!
First, you could try Datasift or Gnip, who both sell twitter data, and thus have no API limit. Not sure if you can afford it, as it does have a cost.
Second, maybe you could use the streaming API: you could get part of the data that way, and have more credits.
If users follow back, you could use the sitestream, although it's quite different to work with than the REST API.
Thirdly, if I read correctly, if Alice and Bob are both your clients, and Fred follows both, you currently collect Fred's data twice, right? I would put a cache in between: Riak, a cluster of Redis, or even S3 or DynamoDB (a quick sketch follows below).
If I can help more, send me an email (in profile)
Lastly, if you have twitter investors in your userbase, ask for an intro to talk to twitter. They see the value of your service.
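To make the caching idea concrete, a minimal sketch with Redis (redis-py), assuming a hypothetical fetch_from_twitter() wrapper and an arbitrary one-day expiry:

    import json
    import redis

    r = redis.Redis()

    def get_user_details(user_id, fetch_from_twitter):
        # fetch_from_twitter(user_id) is a hypothetical wrapper around the
        # Twitter user-details call.
        key = "user:%d" % user_id
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)  # Fred was already fetched for Alice
        details = fetch_from_twitter(user_id)
        # Cache for a day so the data doesn't go too stale, then refetch.
        r.setex(key, 24 * 3600, json.dumps(details))
        return details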
Hey! Datasift and Gnip both seem to supply conversational data, which I currently don't track. My focus is on user data, which neither seems to provide. And yes, they're expensive, wow.
The streaming API idea is a good one to get the most out of all of Twitter's data sources... someone else suggested this as well. Seems like it's worth a try.
Your Alice/Bob/Fred assumption is correct. I had a cache of sorts, through a large MySQL table, but it grew way past a couple of gigabytes and kept crashing all the time, and restoring it took half a day.
The twitter investors haven't had a chance to see any of the service since they are still waiting for results :) tough one to ask for intros ;)
The first thing I would try is rebuilding the cache. Since you only need to do key-value lookups, there are many (easy) approaches that can scale better than MySQL. You could do JSON files on a filesystem (with some nesting, as you don't want to put 20 million files in one dir). Or Redis: 20 million x 1 KB is 20 GB. You will need more with overhead, but a few machines would work. Or Riak (we went past 1 billion items with Riak).
But even MySQL should be able to handle this: we had over 100 million records in MySQL (on SSDs) before switching to something else.
That seems to be the quickest win to reduce the number of requests, but how much it helps will depend on the overlap between your user sets.
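For the JSON-files-on-a-filesystem variant, the nesting could look something like this - the path scheme below is just one possible layout:

    import json
    import os

    CACHE_ROOT = "/var/cache/twitter-users"  # assumed location

    def path_for(user_id):
        # Nest by the last digits of the id, e.g. 123456789 -> .../89/67/123456789.json,
        # so no single directory ends up with 20 million files.
        s = "%09d" % user_id
        return os.path.join(CACHE_ROOT, s[-2:], s[-4:-2], "%d.json" % user_id)

    def store(user_id, details):
        p = path_for(user_id)
        os.makedirs(os.path.dirname(p), exist_ok=True)
        with open(p, "w") as f:
            json.dump(details, f)

    def load(user_id):
        try:
            with open(path_for(user_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None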
ha, thanks! that wasn't my intention, but well, it helps to check whether it can withstand the load.
i feel sort of bad asking for advice here, since many have better things to do and i am the one making money with this, but i've reached a point where I don't know how to continue. and this is very odd.
I'd build it with two stages. First, a cache that holds user information (sort of replicating profiles), that is, follower count etc. This would be shared among all users, and could go into a dedicated Redis instance (and, why not, also be replicated to MySQL/InnoDB for convenience).
Then the "graph" DB (follower lists), which I'd also put into Redis. With some scripting and Redis magic, you can keep users automatically sorted (server-side) by their follower count. You'll just need a lot of RAM (get a dedicated server, look at OVH or others; cloud is usually more expensive and less reliable when it comes to RAM).
You can collect profile information before they move to 1.1 (which forces auth), to populate the global DB. Then you'd only have to fetch users' follower IDs (using the 1.1 followers/ids call), which I believe is way more reliable, and progressively, pooling queries, populate the profile database in batches of 500 or 250 users, using follower lists with user details.
This means that data can be queried dynamically without killing the server (or the servers, there should be more than one), therefore allowing for "partial results" (1M followers -> info about the first 10,000 just after signing up, for example).
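A minimal sketch of that two-stage layout with redis-py - the key names and fields are assumptions, but it shows a shared profile cache plus a per-account sorted set kept ordered by follower count on the server side:

    import redis

    r = redis.Redis()

    def cache_profile(profile):
        # Stage 1: global profile cache, shared across all your users.
        r.hset("profile:%d" % profile["id"], mapping={
            "screen_name": profile["screen_name"],
            "followers_count": profile["followers_count"],
        })

    def add_follower(tracked_account_id, follower_profile):
        cache_profile(follower_profile)
        # Stage 2: per-tracked-account "graph", a sorted set scored by the
        # follower's own follower count.
        r.zadd("followers:%d" % tracked_account_id,
               {str(follower_profile["id"]): follower_profile["followers_count"]})

    def most_popular_followers(tracked_account_id, n=10):
        return r.zrevrange("followers:%d" % tracked_account_id, 0, n - 1,
                           withscores=True)

With that layout, "most popular followers" is a single ZREVRANGE and never touches the API.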
the approach you suggest here is sensible. originally, without using redis, i wanted to use mysql to cache all user data and insert/refresh details in it over time. the table quickly grew to 20M records (with meta-data taking at least 1K of data per record, if not more), and the database grew to multiple gigabytes. twitter has 140M accounts or more now, so i'd need headroom here, although i'd likely never touch a large share of twitter users.
also, the system started making sense after a while, once I had user ids that were already cached (you are correct, the IDs I get through followers/ids, which is a much more well-thought-out function in terms of limits).
but then mysql constantly crashed, and repairing/backing up a multi-gigabyte table exceeded my technical abilities, so i gave up. so I split everything up into per-user sqlite databases that I back up to S3. i lose the ability to access a shared cache of users though, since I can't query other users' sqlite databases in a sane way to see if they have meta data for a given user id.
major problem is that I believe twitter will eventually shut me down if I duplicate/replicate their user database (and I constantly need to refresh since user data will eventually be outdated).
Well, it's intended to work with followers/ids and other calls. Twitter might go after you, but that would mean disregarding how you actually use their calls… If they pull the plug, it will be because of features, not because of how you use the API :(
you could easily apply sharding to that DB table and then scale horizontally as much as you like (just think big enough, so that you don't have to re-shard too soon..)
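To make the sharding suggestion concrete, a tiny sketch - hosts and shard count are placeholders, and the only real requirement is a stable mapping from Twitter user id to shard:

    SHARDS = [
        {"host": "db1.example.com", "db": "twitter_cache"},
        {"host": "db2.example.com", "db": "twitter_cache"},
        {"host": "db3.example.com", "db": "twitter_cache"},
        {"host": "db4.example.com", "db": "twitter_cache"},
    ]

    def shard_for(user_id):
        # Twitter ids are effectively random, so modulo spreads rows evenly.
        return SHARDS[user_id % len(SHARDS)]

Defining more shards than you have physical servers (several shards per box) is the usual way to "think big enough" and postpone re-sharding.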
I am working on a project where I also have to scrape a lot of information about users (posts/tweets/statuses, photos, friends, etc.) from social networks (G+, Twitter, Facebook, YouTube, etc.). All these limitations are really annoying. Errors are expected on almost every API call (timeouts, 5xx errors, host not found, new fields in the data structures, etc.). The scraper has to be smart enough to handle all these caveats.
What I don't understand is why these major players don't want to introduce a paid API. I am ready to pay for it, and I know a lot of people who are ready to pay for it too. But please, remove these limitations and make your APIs more stable.
I have quite a few celebrities signed up, without advertising to them, including certain inventors of Twitter itself, TV personalities, major investors and others.
And all of them (1M+ followers) have to wait up to 60 days or more to get past the login page due to a bug and limits of Twitter.
I feel there's a smart way to work around this and I have always managed to do so in the past, but now, I've hit my technical limits and need help.
I am willing to split upcoming PRO account payments 50/50 with anyone able to help me code moving forward / solve this issue.
i know, but as said, it started as a funny experiment, became a toy, and then I felt bad charging more for things i cannot influence. it'll definitely be possible to charge 19.99 a month, up to 99.00 a month for corporations - there's a lot of features I can add - but right now, it'd put me under even more pressure to charge that much. it's a messed up, weird situation.
Consider increasing the highest price level. There are people who build businesses around Twitter. They will pay more than $1200/yr for something that helps them make more money than that.
yeah well, i instantly would and I have the features/service to back it up with quality data, but as long as I can't deliver any sort of service to larger accounts (1M+ followers), it doesn't make sense to charge like that. long term, no problem. right now, twitter limits me too much.
yeah, good point for sure. problem is that I am living in Europe right now (I miss the US), and it used to be a lot easier to just meet up with someone knowledgeable from any industry back in the US. Here, it's hard to do so. And cold-emailing companies in that field might come off as me trying to drain their competitive advantage, no? I'd be curious who might be able to help out. But calling up Klout would likely not get me a call back at this stage.
and just to add: I think my particular problem is - and that's sort of the selling point - that my analysis and reports are not just growth data (which is easy); my calculations require all of your follower details to provide correct results. Twitter is sending me random follower details, so having a partial set means little reliability (for up to 60 days).
I don't know where in Europe you're currently based, but Datasift are in the UK and seem to be one of the hottest "social media analytics" companies out there. Their staff seem fairly active on twitter itself, maybe give them a try [1]
Yeah, I was about to suggest Datasift as well. IIRC I had a chat with some of them at the last SiliconMilkRoundabout & they seemed like they knew their stuff.
It does occur to me that the OP is effectively trying to replicate large chunks of the twitter datastore & that's going to be very difficult to manage! It's not like twitter themselves were particularly reliable to start with, after all.
yeah, their data records are sometimes faulty, and I need to scale down (even if it's not a time out) to 1 record per request to find the faulty one among a hundred accounts. so that sucks.
and regarding replicating, originally that's what I did. I had almost 20M of Twitter's 140M records cached - but that probably wasn't cool with them in the long run, and i was unable to maintain a database with one table holding multiple gigabytes of data.
I'm happy to answer questions in regards to Klout as far as I can. We utilize GNIP and Datasift both for different situations, but we're working on a different side of data than you are, it sounds. Feel free to email me at api@klout.com.
Also: I've helped Twitter fix two other bugs that I reported, but this one is a grey area (they don't consider it a bug but a safety measure against traffic time-outs, yet they still charge me API call credits). so yeah.
the verified account will be correct tomorrow. this is a timing bug that some users encounter where one procedure finishes faster than another (it's rendered before all the data is there, somehow). the retweets are funky because Twitter is migrating to a new API version, which I've mostly adopted. but the retweets feature will be gone in the new api version - they don't offer that data anymore. so it might not be reliable right now.
Could you not think of this as a networking problem? Say you have 10 users with more than 1 million followers and there are 100 million twitter users - what is the probability that some of your other users are following them?
Perhaps for some of the accounts that don't use much credit to fetch followers, you could also fetch the accounts they follow. Then, on your backend, you can check whether this subset shows up within a previously fetched follower list of a big user.
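Put another way, as a sketch: a cheap friends-list call on a small account can fill in edges of a big account's follower graph for free (the names below are assumptions):

    def credit_free_follower_edges(small_account_id, friend_ids, tracked_big_ids):
        """friend_ids: the accounts a small (cheap-to-crawl) user follows.
        tracked_big_ids: the 1M+ accounts you track.
        Returns (small, big) pairs you can add to the big accounts' follower
        data without spending extra API calls against the big accounts."""
        friends = set(friend_ids)
        return [(small_account_id, big_id)
                for big_id in tracked_big_ids
                if big_id in friends]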
don't know the twitter api very well, so... maybe instead of scaling back and doggedly trying to get the current record set, you simply mark the first 1000 as having an error somewhere, proceed to the second 1000, and come back to the ones giving you problems later. I know eventually you have to come back and get the broken ones, but if you manage to process 80% of someone's millions of followers, you can start digging into the other 20% a bit at a time and at least provide some value for your customers in the meantime.
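Something like this, as a rough sketch (fetch_details is a hypothetical wrapper that raises TimeoutError when Twitter times out on a batch):

    from collections import deque

    def crawl_in_batches(id_batches, fetch_details):
        pending = deque(id_batches)    # e.g. lists of 100 ids each
        parked = []
        results = []
        while pending:
            batch = pending.popleft()
            try:
                results.extend(fetch_details(batch))
            except TimeoutError:
                parked.append(batch)   # come back to the broken ones later
        return results, parked         # retry `parked` in a later pass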
I haven't actually checked out your service because i'm not really a twitter person, so maybe you do this already, but could you provide statistics based on the amount of data you've got so far? I know they would be inaccurate, but you could present them as a "moving target", based on however many followers have been scanned so far. That way the user gets a little bit of value right away.
i also tried this: for every 100-user details request I pull a random set of 100 ids from different places in a user's follower list to minimize getting stuck. it helped a bit. but the main problem is still the time-outs.
I had a similar problem when using SSL connections. I'm not sure about your data, but in my case the data was pretty much public and there was no harm in using a plain HTTP connection. This significantly improved the speed and got rid of the very frequent timeouts.
Also try enabling/disabling gzip compression for API calls.
the bottleneck is twitter's api limits; data-wise and http-connection-wise I have lots of headroom.
plain http connections to parse/spider the follower records from public pages are a no-go, since twitter then blocks the IP, and scaling that out across many IPs will eventually not end well.
This may be against twitter's ToS, but can you create a bunch of accounts with different API keys to access the api concurrently and get around the rate limits?
I'm sure twitter must account for this, but how do they? You don't need to provide much information to get an API key.
yeah no, they're explicitly not allowing this and as soon as this thing grew out of weekend-project scale, i have to adhere to any rules they're pushing out. too high a risk to be shut down and locked out completely.
@mittermayr I responded to your comment too :) There isn't much you can do about the latency at their end except try and get the most out of your API quota by making sure you don't issue a call when you know you will time out (discussed in the reply).
I would be interested in knowing other solutions that work for you.
When I worked on SocialGrapple (similar featureset to Fruji), here's what I did:
* Technically: I had a well-optimized PostgreSQL database which had a few parts: Follower graph schema with a revision id (which node is following which), it got cleaned out every N revisions; a delta schema which took the last two revision ids and diff'd them; an aggregate schema which did a bunch of queries and summarized the results every T interval; metadata schema which stored cached information about each node (updated every time that user object was fetched).
I'm pretty obsessive about query and schema optimization, and I had a comprehensive benchmarking suite which helped me consistently improve performance on bulk insertion, aggregate queries, and user-facing queries. Each job was broken down into small, efficient pieces that were executed in dependency order by my custom task scheduler, Turnip (open source at https://github.com/shazow/turnip).
I don't remember the exact numbers but I was approaching 100M rows on a single 512mb Linode.
Redis would have worked too but I would have needed much more RAM or more moving pieces to move things in and out of RAM for processing. None of my queries were slow enough to worry about this.
* Pricing: As others mentioned, higher prices make it easier to scale. I charged based on the size of the account and how many accounts you wanted to monitor (basically proxies to how many API calls you'll cost me). A small account cost something like $6/mo, 5 medium accounts $14/mo, 25 bigger accounts at $50/mo, 100 large accounts (1M+ followers) at $125/mo. I had modest revenue but I can't say my pricing scheme was perfect. I was actively messing with it towards the end.
* I had a legacy Twitter whitelisted account which gave me 10K api hits per hour. This helped me a lot. At the same time, I was careful to not become too dependent on that account in case I lost it. I was well within the boundary of normal user limits the entire time and only really used my whitelisted account to experiment or backfill new data. I made sure to always make the most efficient API calls to avoid wasting them. I too had issues with timeouts but it more came in waves when Twitter was having infrastructure issues rather than consistently. It wouldn't surprise me if this has gotten worse.
Also, I used and stitched all three Twitter APIs: REST, Search, and Streaming. It was painful.
* Diversify. Twitter is becoming an increasingly developer-unfriendly platform to build on, and your business should not be dependent on it. I added Facebook support to SocialGrapple, and I was going to add Google+ support too. Today, I'd also add app.net support. That said, the majority of my business was still Twitter, and that sucked. This was a big factor in my decision to sell out and shut it down—I didn't see the developer ecosystem as a place where you can have a sustainable business, let alone a thriving one.
I actually had several conversations/negotiations with Twitter about how they'd interpret their terms of service wrt my product. It helped to know people at the company to get a favourable ruling, but I still felt like it could be reversed—err, "provided with guidance" at any moment.
For what it's worth, I found it more rewarding to build an analytics product that was super useful for a smaller group of people than a little helpful for a lot of people (I'd say tweepsect.com is the latter). Think about where on the spectrum you want to be as this makes decisions, like pricing, easier.
Best of luck! Shoot me an email (in my profile) if you'd like more details.
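To illustrate the revision/delta idea from the "Technically" point above (not SocialGrapple's actual schema, just a toy sketch): keep the follower set per crawl revision and diff consecutive revisions to get new followers and unfollowers.

    def follower_delta(previous_snapshot, current_snapshot):
        """previous_snapshot / current_snapshot: sets of follower ids taken
        at two consecutive crawl revisions."""
        new_followers = current_snapshot - previous_snapshot
        unfollowers = previous_snapshot - current_snapshot
        return new_followers, unfollowers

    # follower_delta({1, 2, 3}, {2, 3, 4}) -> ({4}, {1})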
andrey, just wanted to say thanks for this comprehensive answer. it seems like you speak from a lot of experience, and while it's scary to read through what you had to do to survive, it all makes a lot of sense and helps me pick my battles a bit.
again, thanks for this, i am going through it again later - right now i'm responding to over 60 other e-mails with help and support, just fascinating to see this. if nothing else, we can hopefully make it clear that betting on someone else's platform provides tremendous opportunity but also introduces considerable uncertainty if it takes off.
Not sure if this is against Twitter's ToS, but since follower data is public, can you make use of unused API requests from accounts with fewer followers?
Nah, unfortunately not. All calls originate from my company account or on behalf of a user. Using other people's account tokens to scan wouldn't work either, as they are only authorized to scan their own followers.
just wanted to say thanks real quick for all this help. i've received over 30 e-mails already from random people helping me out with advice. i know it's often looked down upon in comments sections to thank the community since it provides no value, but i don't give a shit right now, i just need to say thanks, so much everyone :)
If one of your requests for followers fails at 100, does it fail for a different user as well? Instead of backing down on a user, you could try delaying it for a time period and switching to another user.
also, keeping track of when the failures occur is important, and potentially valuable information in itself.
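A minimal sketch of that park-and-switch scheduling, with a per-account cooldown and a failure log (the 10-minute cooldown is an arbitrary assumption):

    import time

    COOLDOWN = 10 * 60    # park a failing account for 10 minutes (assumed)
    next_allowed = {}     # account_id -> earliest time we may retry it
    failure_log = []      # (timestamp, account_id) pairs, worth keeping around

    def record_failure(account_id):
        failure_log.append((time.time(), account_id))
        next_allowed[account_id] = time.time() + COOLDOWN

    def pick_next_account(account_ids):
        now = time.time()
        for aid in account_ids:
            if next_allowed.get(aid, 0) <= now:
                return aid
        return None       # everything is cooling down right now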
i run up to 16 crawlers at the same time. twitter rate-limits me only per user, for the calls i issue through that user's credentials, so i can go parallel easily. not much of a bottleneck on my side.
Does it fail for an individual user, or for all users in that time period? Instead of reducing the number of followers, you may just need to wait 5-10 minutes and ask for that user again with the full 100.
pretty much what I do. my local sqlite storage per user slows things down a bit (but that's good, since Twitter is even slower). so between requests, I often give Twitter enough time to finish the request for the previous 100 and store them in its cache, so that when I re-request the same 100 (i always try twice), they are often there. but not always. it's a mix of overall twitter load, plus where/how deep down these 100 followers are stored, plus whether a follower record is damaged (happens frequently), plus other time-out factors. that's what I am complaining about - it's so hard to work around this.
doesn't work. tried this briefly at the beginning but they return error pages after a certain amount of calls per IP. and no, splitting up to many IPs isn't feasible. unfortunately.
yeah, but i don't even hit the rate limit (i check before every call); i stop 2 or 3 calls shy of it, so that doesn't limit me. but if I send an API request and that request times out and does not return data, it still costs me a precious call (and puts me one closer to the rate limit).
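For reference, that check-before-every-call bookkeeping might look roughly like this - the limit, window, and margin are assumptions, and timed-out calls are deliberately counted because they still burn a credit:

    import time

    LIMIT_PER_WINDOW = 100   # assumed per-user limit
    WINDOW = 60 * 60         # one hour
    SAFETY_MARGIN = 3        # stop a few calls shy, as described above

    calls = {}               # user_id -> timestamps of recent calls

    def can_call(user_id):
        now = time.time()
        recent = [t for t in calls.get(user_id, []) if now - t < WINDOW]
        calls[user_id] = recent
        return len(recent) < LIMIT_PER_WINDOW - SAFETY_MARGIN

    def record_call(user_id):
        # Record the call even if it times out: it still burns a credit.
        calls.setdefault(user_id, []).append(time.time())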