Inspired by Twint, I created an alternative web client that relies almost entirely on web scraping. The parser is more complete than Twint's due to the different goals, properly supporting things like polls and videos. It's still a work in progress but very usable, with features like RSS and convenient profile search.
https://github.com/zedeus/nitter
I don't use the Twitter Android application, and the browser version on mobile is extremely slow. This made Twitter usable in my browser again. Thanks!
PS: It would be even more useful if someone hosted it on a domain like <someshortcode>twitter.com; then I could just add the shortcode to the URL and escape the slow, official one.
You can use it to download every tweet from a user, not just the last 3000 that their API supports. It uses the same query syntax that the web search uses.
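Since the tool accepts the same query syntax as web search, one common way to page past a per-user cap is to slice the search into date windows and fetch each window separately. A hedged Python sketch of the windowing part only (the `from:someuser` query and the 30-day window size are illustrative, not the tool's actual internals):

```python
from datetime import date, timedelta

def date_windows(start, end, days=30):
    """Yield search queries covering [start, end) in fixed-size
    date windows, so each query stays under any per-query cap."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield f"from:someuser since:{cur} until:{nxt}"
        cur = nxt
```

Each yielded query can then be fed to whatever search endpoint or tool you are using; the windows tile the full range with no gaps or overlaps.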
Does it work on suspended accounts? Mine got reported by a bunch of Trump supporters so I lost 17 years of tweets and Twitter won't listen. I can still see my own tweets, but only a page at a time. I'm thinking I might need to use a scraper like Twint instead of the API.
Nice! I followed the instructions but after I paste the curl command I get this error:
Unhandled exception. System.ArgumentOutOfRangeException: Length cannot be less than zero. (Parameter 'length')
   at System.String.Substring(Int32 startIndex, Int32 length)
   at TwitterDump.Program.ParseCurlCommand(String curlCommand, Dictionary`2& headers) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 150
   at TwitterDump.Program.Auth(AuthOptions options) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 113
   at TwitterDump.Program.<>c.<Main>b__0_1(AuthOptions opts) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 33
   at CommandLine.ParserResultExtensions.MapResult[T1,T2,TResult](ParserResult`1 result, Func`2 parsedFunc1, Func`2 parsedFunc2, Func`2 notParsedFunc)
   at TwitterDump.Program.Main(String[] args) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 30
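For what it's worth, the trace points at brittle string slicing while parsing the pasted curl command. A more defensive extraction of the `-H` headers might look like this Python sketch (purely illustrative; the actual tool is C#, and this is not its code):

```python
import re

def parse_curl_headers(curl_command):
    """Pull -H 'Name: value' pairs out of a copied-as-cURL command.
    A malformed paste raises a clear error instead of an index
    underflow like the Substring exception above."""
    headers = {}
    for m in re.finditer(r"-H\s+(['\"])(.*?)\1", curl_command):
        raw = m.group(2)
        if ":" not in raw:
            raise ValueError(f"malformed header: {raw!r}")
        name, _, value = raw.partition(":")
        headers[name.strip()] = value.strip()
    return headers
```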
If you can see your own (suspended account) tweets when logged in, you could use your auth cookie to scrape them.
Not sure how to do that with Twint off the top of my head, though.
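As a rough illustration of the cookie approach, you can attach the session cookies to each request by hand. A minimal Python sketch; the `auth_token`/`ct0` cookie names are an assumption about Twitter's session cookies, the URL is a placeholder, and a real scraper would need the site's actual endpoints:

```python
import urllib.request

def make_authed_request(url, auth_token, ct0):
    """Build a request carrying the logged-in session cookies
    (cookie names assumed; copy the real values from your browser)."""
    req = urllib.request.Request(url)
    req.add_header("Cookie", f"auth_token={auth_token}; ct0={ct0}")
    req.add_header("x-csrf-token", ct0)  # CSRF token usually mirrors ct0
    return req
```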
We tried to use Twint in production but it didn’t work for us. I ended up writing one that works very well. Let me know if I can help you get your tweets.
"popular" is a stretch, it's what a few banks are trying to get people to use because they couldn't get NFC working on apple devices with their apps. Until recently they have been giving money away like crazy to get people to use it.
It requires extra hardware at the cash register (if using Bluetooth) or a modification to the vendor's terminal so it can display a QR code.
Sadly the app is extremely slow and cumbersome to use at a cash register compared to other options such as NFC payments on android or just straight tap and pay credit cards.
It would be a great system for online payments, since scanning a code is much easier than entering all your CC information, but because the fees are so high (compared to credit cards), the largest online electronics retailer in Switzerland (Digitec) dropped them after the intro rates expired.
It's probably an attempt to implement what WeChat/Alipay has. In China the majority of POS terminals have this device https://www.alibaba.com/product-detail/Specialized-Alipay-an... , which is a fancy housing for a camera that scans the wallet barcode on your phone (optimized for the perfect angle and to prevent glare, I'm guessing). It seems preferable and faster than Bluetooth...
A strong mark of quality: Twint's Twitter account got suspended. Evidently Twitter doesn't like a third-party tool scraping tweets better than they allow.
Thanks for your hard work! The tool has been extremely helpful in my political science research. I've noticed that an error occurs when I try to scrape too many tweets from too many accounts at once (I do this by referencing a txt file with the account names I want to scrape from). I've worked around it by making shorter lists of account names. Is there another workaround I'm unaware of?
Looks like a decent tool, but I personally would not use it in production. Took a few minutes to look through the code: they basically use the HTML pages instead of the APIs. What puts me off is that the code is missing tests altogether and has quite a few separation-of-concerns issues. HTML extraction is everywhere, storage is embedded into the package. Cool proof of concept, not ready for production, I think.
Minor stuff: printing instead of logging. I'd prefer a package that only does the retrieval and nothing else. Hardcoded SQL(ite?) storage.
> They basically use the HTML pages instead of the APIs.
For Twitter specifically, isn't HTML scraping vastly preferable to using their official APIs? Otherwise you run into pretty arbitrary usage limits and missing features.
There's a small list of services where I think I prefer HTML scraping and browser piloting for a 3rd-party client: Twitter, Patreon, Facebook, LinkedIn, a few others. Services where the official APIs are underdeveloped or crippled to the point of almost uselessness.
I was teaching a hands-on workshop (meetup) on how to use the Twitter API a few months ago, and the hardest part for everybody was getting the API keys: clicking through several pages, checking boxes, having it emailed to you.
Then it turned out Twitter refused half of the attendees an API key (maybe they thought it was spam coming from the same wifi at the same time).
So then I just gave my API key to the rest of the class, and within a few minutes it was blocked.
For a service that has a history of empowering users to protest and to spread news in crisis situations, it's a shame their API is so locked down.
Took me 20 seconds to install twint, and start scraping.
Took me 20 minutes to install twitter-dump, and despite a successful twitter-dump auth, I end up with 'Request exception Forbidden {"errors":[{"code":200,"message":"Forbidden."}]}'.
Not going to spend time to fix that, I'll use the dirty solution that works.
I'd be interested to hear whether this tool could be used to scrape malware hashes (or links containing malware hashes) from Twitter. This or any other tool, really... it appears my brain has turned to mush today and I cannot find or get anything to work :(
What’s your use case for this? I scrape a ton of pastebin links and other sources of hashes posted by folks on Twitter about Emotet, trickbot etc. and I’d be happy to point them to a webhook or get them to you another way? Happy to talk through how we do it too.
It kind of comes down to how well you can defend yourself from it being called a DOS attack (follow politeness standards and robots.txt), from violating their copyright (generally not problematic if you don't distribute the data), and from violating their terms of service (this is key in the case of twitter and reddit, carefully read their TOS).
However, the scraping of public information like in the case of tweets or reddit posts is the less problematic part. It's when you distribute the data or aggregations of the data that it could be problematic to use scraped public information.
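The politeness part is cheap to get right: Python's standard library can check a robots.txt policy before each fetch. A minimal sketch (the rules here are sample data, and a real scraper would fetch the site's actual robots.txt):

```python
from urllib import robotparser

def allowed(robots_lines, agent, url):
    """Parse a robots.txt body (list of lines) and check whether
    the given user agent may fetch the given URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)
```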
I've used a fair share of (painful) APIs over the years and I have one simple plea: can we please stop being a-holes? Can we stop scraping websites that have APIs? They are already offering machine-readable data, and maybe they have a reason for not providing everything. Scraping their sites circumvents this API and not only abuses their systems, but also makes your code super brittle: any change to their site breaks your code, if your code wasn't already broken because the provider banned you.
If people are resorting to scraping then clearly the official API isn't fit for their purposes.
As an example, here are two key features that are missing from the Twitter API at the moment:
- Bookmarks. You can privately bookmark tweets on the Twitter website and apps. There is no way to access the list of tweets you have bookmarked in the API.
- Threads. The concept of threads, where replies a tweet receives from its own author get special display treatment, is key to how Twitter is used today. The official API doesn't support them, in that there is no way to look at a tweet and see that a threaded reply exists.
There is no good commercial reason for excluding either of these features from the public API, other than that Twitter have made a strategic decision not to invest resources in expanding the API to keep up with new features they are adding to the platform.
Given that, is it any surprise that people are resorting to scraping?
You say "maybe they have a reason for not providing everything" - I cannot think of a reason not to provide me with API access to my own private bookmarks other than "we decided to invest our engineering resources elsewhere".
Which isn't a bad reason! But it's not a good argument for people not to scrape their own data.
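For what it's worth, the thread structure the API omits can be approximated client-side by following same-author reply chains. A simplified Python sketch over plain dicts (the `in_reply_to_status_id` field name loosely mirrors the v1.1 tweet object; the flat `user` string is a simplification):

```python
def collect_thread(tweets, root_id):
    """Reconstruct a thread: the chain of replies a tweet receives
    from its own author, following the first same-author reply at
    each step."""
    by_parent = {}
    for t in tweets:
        by_parent.setdefault(t.get("in_reply_to_status_id"), []).append(t)
    root = next(t for t in tweets if t["id"] == root_id)
    thread, cur = [root], root
    while True:
        nxt = [t for t in by_parent.get(cur["id"], [])
               if t["user"] == cur["user"]]
        if not nxt:
            break
        cur = nxt[0]
        thread.append(cur)
    return thread
```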
Everything Twitter used to do, but no longer does, enabled third parties to profit off "their valuable" data. It's economically/strategically unsustainable, or so they think.
I doubt it. People who write scrapers are talented engineers. Talented engineers understand the limitations of scraping compared to using an official API, and know to try to get things done with the official API first.
The Twint README calls out reasons for going beyond the API at the start - things like Twitter's increasingly strict rate limits and the limit of only 3,200 historic tweets for a user account.
I've been maintaining a browser extension[1] that has used various versions of the unofficial API over a period of six years -- from literally navigating the site as the user and grabbing HTML, to the various incarnations of JSON and HTML hybrid (SSR) APIs that their web and mobile clients have used internally over that time.
Believe me, I would LOVE if the official API supported threads. I have tried several times to make it work, but the official APIs are stuck in a circa-2012 idea of how Twitter works. Replies just aren't a thing to it.
I totally agree; writing scrapers doesn't necessarily make one a talented engineer. You may often find it easier to write a scraper than to go through the API documentation or a series of authentication procedures.
I think it's the social media companies that are being assholes by privatizing conversations and selling access to the graph. The idea of a platform is to set up a protocol and some hardware, get people to talk to each other on it, and then gradually construct a mall around the outer edges of the platform. It's a shitty business model because it relies on knowing more about the users than they know about themselves and then selling that information to advertisers.