Twint – Twitter scraping tool written in Python (github.com/twintproject)
204 points by usernameno on April 13, 2020 | 63 comments



Inspired by Twint, I created an alternative web client that relies almost entirely on web scraping. The parser is more complete than Twint's due to the projects' different goals, properly supporting things like polls and videos. It's still a work in progress but very usable, with features like RSS feeds and convenient profile search. https://github.com/zedeus/nitter


I don't use the Twitter Android application, and the browser version on mobile is extremely slow. This made Twitter usable in my browser again. Thanks! PS: It would be even more useful if someone hosted it on a domain like <someshortcode>twitter.com; then I could just add the shortcode to the URL and escape from the slow, official one.


Use a PWA instead. Go to Twitter on Chrome and do "Add to Home Screen". Really fast.


Thank you! I love it! It's so much better than the official front-end and works fine without JS.


I created a tool that uses Twitter's public-facing (but private) API.

https://github.com/pauldotknopf/twitter-dump/

You can use it to download every tweet from a user, not just the most recent 3,200 that their API supports. It uses the same query syntax that the web search uses.
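
Rough usage from the README - I'm quoting the command shape from memory, so double-check `twitter-dump --help` for the exact flags:

```
# Capture your auth headers from a pasted browser curl command:
twitter-dump auth

# Then dump a user's entire history using web-search syntax:
twitter-dump search -q "(from:pauldotknopf)" -o tweets.json
```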


Does it work on suspended accounts? Mine got reported by a bunch of Trump supporters so I lost 17 years of tweets and Twitter won't listen. I can still see my own tweets, but only a page at a time. I'm thinking I might need to use a scraper like Twint instead of the API.


If you can see it while you are logged in, then yes.

Check ```twitter-dump auth``` for instructions on how to use your web cookies with the command.

When the command queries Twitter, it will look as if it is coming from you.


Nice! I followed the instructions but after I paste the curl command I get this error:

Unhandled exception. System.ArgumentOutOfRangeException: Length cannot be less than zero. (Parameter 'length')
   at System.String.Substring(Int32 startIndex, Int32 length)
   at TwitterDump.Program.ParseCurlCommand(String curlCommand, Dictionary`2& headers) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 150
   at TwitterDump.Program.Auth(AuthOptions options) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 113
   at TwitterDump.Program.<>c.<Main>b__0_1(AuthOptions opts) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 33
   at CommandLine.ParserResultExtensions.MapResult[T1,T2,TResult](ParserResult`1 result, Func`2 parsedFunc1, Func`2 parsedFunc2, Func`2 notParsedFunc)
   at TwitterDump.Program.Main(String[] args) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 30

What am I doing wrong?


NM, I figured it out. Chrome has several "copy as cURL..." commands now. I was using "Copy as cURL (cmd)", but "Copy as cURL (bash)" is what worked.
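
For anyone else who hits this, the difference is the quoting style (illustrative snippets, not the full commands Chrome emits):

```
# "Copy as cURL (bash)" - POSIX single-quoting, which the parser expects:
curl 'https://twitter.com/...' -H 'cookie: ...'

# "Copy as cURL (cmd)" - Windows cmd double-quoting with caret escapes,
# which trips up the header parsing:
curl "https://twitter.com/..." -H ^"cookie: ...^"
```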


Dude! Thanks so much!!! If you're ever in Portland, I owe you a drink!


::reads old tweets::

Oh god... no wonder they suspended me


If you can see your own (suspended account) tweets while logged in, you could use your auth cookie to scrape them.

Not sure how to do that with Twint off the top of my head, though.

We tried to use Twint in production but it didn’t work for us. I ended up writing one that works very well. Let me know if I can help you get your tweets.


> you could use your auth cookie to scrape them

Seems that Twint does all its work unauthenticated, so it may not be possible.

I suppose I could try using the API but I don't think I care that much.


Looks like you could just paste your auth header right here: https://github.com/twintproject/twint/blob/a5df26d988109838d...


Good looking out, but I don't think that's working.

> PS C:\Users\Jesse Hattabaugh\Documents\GitHub\twint> twint -u jessehattabaugh
> CRITICAL:root:twint.get:User:'NoneType' object is not subscriptable


snap


lol @ the Trump supporters downvoting this. Y'all tired of all the winning?


JFYI: Twint is already the name of a popular payment app in Switzerland

https://www.twint.ch/en/


I don't think there will be any confusion.


"popular" is a stretch, it's what a few banks are trying to get people to use because they couldn't get NFC working on apple devices with their apps. Until recently they have been giving money away like crazy to get people to use it.

It requires extra hardware at the cash register (if using Bluetooth) or a modification of the vendor's terminal so it can display a QR code.

Sadly, the app is extremely slow and cumbersome to use at a cash register compared to other options such as NFC payments on Android or plain tap-and-pay credit cards.

It would be a great system for online payments, since scanning a code is very easy compared to entering all your CC information, but because the fees are so high (compared to credit cards), the largest online electronics retailer in Switzerland (Digitec) dropped them after the intro rates expired.


It's probably an attempt to implement what WeChat/Alipay have. In China the majority of POS terminals have this device https://www.alibaba.com/product-detail/Specialized-Alipay-an... , which is a fancy housing for a camera that scans the wallet barcode on your phone (optimized for the perfect angle and to prevent glare, I'm guessing). It seems preferable to and faster than Bluetooth...


A strong mark of quality is that Twint's Twitter account got suspended - evidently Twitter doesn't like a third-party tool scraping tweets better than they themselves allow.


Their API is absolute garbage. Search is incomplete and streaming consumes GBs/hour even when you're getting no results.


Oh hey, glad to see a project I contributed to get a mention on HN.

I wrote the Splunk integration with Twint for crawling Twitter timelines: https://github.com/twintproject/twint-splunk

Feel free to hit me up if there are any questions about that part of Twint.


Thanks for your hard work! The tool has been extremely helpful in my political science research. I have noticed that an error occurs when I try to scrape too many tweets from too many accounts at once (which I do by referencing a txt file with the account names I want to scrape tweets from). I've worked around this by just making shorter lists of account names. Is there another workaround I'm unaware of?


Looks like a decent tool, but I personally would not use it in production. I took a few minutes to look through the code: they basically use the HTML pages instead of the APIs. What puts me off is that the code is missing tests altogether and has quite a few separation-of-concerns issues: HTML extraction is everywhere, and storage is embedded into the package. Cool proof of concept, but not ready for production, I think.

Minor stuff: printing instead of logging; hardcoded SQL(ite?) storage. I would prefer a package that only does the retrieval and nothing else.
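
For illustration, roughly the shape I'd prefer - a retrieval layer that yields plain records and leaves storage to the caller. This is a sketch with made-up names, not Twint's actual code:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Tweet:
    id: str
    author: str
    text: str


class TweetSource:
    """Retrieval only: fetch pages, parse the HTML, yield plain records."""

    def fetch(self, username: str) -> Iterator[Tweet]:
        raise NotImplementedError


def store_sqlite(tweets: Iterable[Tweet], path: str) -> None:
    """Storage is a separate concern that the caller wires in."""
    import sqlite3

    with sqlite3.connect(path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tweets "
            "(id TEXT PRIMARY KEY, author TEXT, text TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO tweets VALUES (?, ?, ?)",
            ((t.id, t.author, t.text) for t in tweets),
        )
```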


> They basically use the HTML pages instead of the APIs.

For Twitter specifically, isn't HTML scraping vastly preferable to using their official APIs? Otherwise you run into pretty arbitrary usage limits and missing features.

There's a small list of services where I prefer HTML scraping and browser piloting for a third-party client: Twitter, Patreon, Facebook, LinkedIn, a few others - services where the official APIs are underdeveloped or crippled to the point of near-uselessness.


That quote is supposed to be a neutral summary, so yes, that might be the case :)


Fair objections, perhaps, but regarding using the HTML - scraping is the whole point, because getting API tokens these days is hard.


I was teaching a hands-on workshop (meetup) on how to use the Twitter API a few months ago, and the hardest part for everybody was getting the API keys: clicking through several pages, checking boxes, having them emailed to you.

Then it turned out Twitter refused half of the attendees an API key... (maybe they thought it was spam, coming from the same wifi at the same time).

So I just gave out my own API key to the rest of the class, and within a few minutes it was blocked...

For a service that has a history of empowering users to protest and to spread news in crisis situations, it's a shame their API is so locked down.


Yep. At least it was a realistic experience :)

The hardest part when working with data is often not manipulating data per se, but spending time on crap like this.


Sorry, looks like I did not make that clear enough. It was not meant as an objection; I totally understand why they do it.


What do you use instead?


There's a pretty actively-maintained Python wrapper for the internal API: https://github.com/bisguzar/twitter-scraper
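
Basic usage, per my reading of its README (each tweet comes back as a plain dict):

```python
from twitter_scraper import get_tweets

# No API keys needed; this scrapes the public profile pages.
for tweet in get_tweets("jack", pages=1):
    print(tweet["text"])
```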


Here's a comment from someone who built a tool that uses Twitter's public-facing but private API: https://news.ycombinator.com/item?id=22855148


It took me 20 seconds to install Twint and start scraping.

It took me 20 minutes to install twitter-dump, and despite a successful `twitter-dump auth`, I end up with 'Request exception Forbidden {"errors":[{"code":200,"message":"Forbidden."}]}'.

Not going to spend time fixing that; I'll use the dirty solution that works.


That's fine. I'm not saying you have to switch, but it looks like there's a way.

Maybe they could combine efforts? They could look at each other's code and, if the licenses allow, port things over.


I'd be interested to hear whether this tool could be used to scrape malware hashes (or links containing malware hashes) from Twitter - this or any other tool, really... it appears my brain has turned to mush today and I cannot find or get anything to work :(


What's your use case for this? I scrape a ton of pastebin links and other sources of hashes posted by folks on Twitter about Emotet, TrickBot, etc., and I'd be happy to point them to a webhook or get them to you another way. Happy to talk through how we do it too.
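
If you want to roll your own with Twint, its keyword search is one way in. A minimal sketch - the search terms are illustrative, not our actual setup:

```python
import twint

# Collect tweets mentioning a malware family plus a paste site,
# then post-process the results for hashes/links.
c = twint.Config()
c.Search = "emotet pastebin.com"
c.Store_json = True
c.Output = "hash_tweets.json"
twint.run.Search(c)
```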


Interesting! I’ll reach out to you via LinkedIn and/or Keybase to explain my use case.


nice job, twitter api is terrible


I noticed that you don't use tweepy (https://github.com/twintproject/twint#requirements). Can you highlight the difference?


Tweepy is for API access, this is a scraper.


And the main advantage is you don’t need to authenticate and you aren’t rate limited.
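
To illustrate the difference (a sketch; the keys are placeholders you'd get from a registered Twitter app):

```python
# Tweepy: official API, registered credentials, rate-limited.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
for status in api.user_timeline(screen_name="jack", count=10):
    print(status.text)

# Twint: scrapes the public pages, no credentials at all.
import twint

c = twint.Config()
c.Username = "jack"
c.Limit = 10
twint.run.Search(c)
```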


A question I've asked before, but I get different answers: what are the -legal- limitations of scraping data when there is a limited-access API?


The problem is that there isn't a straight answer to this; see this recent thread: https://news.ycombinator.com/item?id=22180559

It mostly comes down to how well you can defend yourself against the claim that it's a DoS attack (follow politeness standards and robots.txt), against copyright infringement (generally not problematic if you don't distribute the data), and against violating their terms of service (this is key in the case of Twitter and Reddit - read their TOS carefully).

That said, scraping public information like tweets or Reddit posts is the less problematic part. It's when you distribute the data, or aggregations of it, that using scraped public information can become a problem.
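
On the politeness point, checking robots.txt is a few lines with Python's stdlib (the user-agent string here is made up):

```python
from urllib import robotparser

# Honor robots.txt before crawling; note the TOS can still forbid
# scraping even where robots.txt allows a path.
rp = robotparser.RobotFileParser()
rp.set_url("https://twitter.com/robots.txt")
rp.read()
print(rp.can_fetch("my-research-bot", "https://twitter.com/jack"))
```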


TWINT is also the name of the product for mobile payments in Switzerland (twint.ch). I hope the reject won’t run into a fight over the name.


s/reject/project


I've used my fair share of (painful) APIs over the years, and I have one simple plea: can we please stop being a-holes? Can we stop scraping websites that have APIs? They already offer machine-readable data, and maybe they have a reason for not providing everything. Scraping their sites circumvents the API and not only abuses their systems but also makes your code super brittle - any change to their site breaks your code, if your code wasn't already broken because the provider banned you.


If people are resorting to scraping then clearly the official API isn't fit for their purposes.

As an example, here are two key features that are missing from the Twitter API at the moment:

- Bookmarks. You can privately bookmark tweets on the Twitter website and apps. There is no way to access the list of tweets you have bookmarked in the API.

- Threads. The concept of threads - where replies to a tweet from the same author get special display treatment - is key to how Twitter is used today. The official API doesn't support them: there is no way to look at a tweet and see that a threaded reply exists.

There is no good commercial reason for excluding either of these features from the public API, other than that Twitter has made a strategic decision not to invest resources in keeping the API up to date with new features they add to the platform.

Given that, is it any surprise that people are resorting to scraping?


You say "maybe they have a reason for not providing everything" - I cannot think of a reason not to provide me with API access to my own private bookmarks other than "we decided to invest our engineering resources elsewhere".

Which isn't a bad reason! But it's not a good argument for people not to scrape their own data.


> But it's not a good argument for people not to scrape their own data.

Why would you want to scrape your own data if you can already request all your data and get a whole archive? https://help.twitter.com/en/managing-your-account/how-to-dow...


Because then you have to trigger and download a GB+ file every time you want to programmatically access your latest bookmarks.


Everything Twitter used to offer, and no longer does, enabled third parties to profit off "their valuable" data. It's economically/strategically unsustainable - or so they think.


The API only returning a user's most recent 3,200 tweets is also a dealbreaker.


Or they don't know how to effectively use the API, maybe.


I doubt it. People who write scrapers are talented engineers. Talented engineers understand the limitations of scraping compared to using an official API, and they know to try the official API first.

The Twint README calls out its reasons for going beyond the API right at the start - things like Twitter's increasingly strict rate limits and the cap of only 3,200 historic tweets per user account.
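
For reference, the search-based approach from the README looks roughly like this (a sketch; the date is arbitrary):

```python
import twint

# Scrape a user's full history past the official API's 3,200-tweet cap
# by going through search rather than the timeline endpoints.
c = twint.Config()
c.Username = "jack"
c.Since = "2006-01-01"
c.Store_csv = True
c.Output = "full_history.csv"
twint.run.Search(c)
```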


I've been maintaining a browser extension[1] that has used various versions of the unofficial API over a period of six years - from literally navigating the site as the user and grabbing HTML, to the various incarnations of the JSON/HTML hybrid (SSR) APIs that their web and mobile clients have used internally over that time.

Believe me, I would LOVE it if the official API supported threads. I have tried several times to make it work, but the official APIs are stuck in a circa-2012 idea of how Twitter works. Replies just aren't a thing to them.

[1] https://github.com/paulgb/Treeverse


I don't think this program actually solves the 3,200 limit you mentioned - not sure, though.


People who write robust, performant, maintainable systems that don't collapse under their own weight are talented engineers.

Most scrapers, including this one, lack these characteristics.


I totally agree: writing scrapers doesn't necessarily make one a talented engineer. It's often easier to write a scraper than to work through API documentation or a series of authentication procedures.


I think it's the social media companies that are being assholes, by privatizing conversations and selling access to the graph. The idea of a platform is to set up a protocol and some hardware, get people to talk to each other on it, and then gradually construct a mall around the outer edges of the platform. It's a shitty business model because it relies on knowing more about the users than they know about themselves and then selling that information to advertisers.


>maybe they have a reason for not providing everything

Not my problem. You make your HTTP resources accessible, I'll download them however I want.



