Twint – Twitter scraping tool written in Python (github.com/twintproject)
204 points by usernameno on April 13, 2020 | 63 comments



Inspired by Twint, I created an alternative web client that relies almost entirely on web scraping. The parser is more complete than Twint's due to the projects' different goals, properly supporting things like polls and videos. It's still a work in progress but very usable, with features like RSS feeds and convenient profile search. https://github.com/zedeus/nitter


I don't use the Twitter Android application, and the browser version on mobile is extremely slow. This made Twitter usable in my browser again. Thanks! PS: It would be even more useful if someone hosted it on a domain like <someshortcode>twitter.com; then I could just add the shortcode to the URL and escape from the slow, official one.


Use a PWA instead. Go to Twitter on Chrome and do "Add to Home Screen". Really fast.


Thank you! I love it! It's so much better than the official front-end and works fine without JS.


I created a tool that uses Twitter's public-facing (but private) API.

https://github.com/pauldotknopf/twitter-dump/

You can use it to download every tweet from a user, not just the most recent 3,200 that their API supports. It uses the same query syntax that the web search uses.
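
Rough usage from the README - I'm quoting the command shape from memory, so double-check `twitter-dump --help` for the exact flags:

```
# Capture your auth headers from a pasted browser curl command:
twitter-dump auth

# Then dump a user's entire history using web-search syntax:
twitter-dump search -q "(from:pauldotknopf)" -o tweets.json
```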


Does it work on suspended accounts? Mine got reported by a bunch of Trump supporters so I lost 17 years of tweets and Twitter won't listen. I can still see my own tweets, but only a page at a time. I'm thinking I might need to use a scraper like Twint instead of the API.


If you can see it while you are logged in, then yes.

Check ```twitter-dump auth``` for instructions on how to use your web cookies with the command.

When the command queries Twitter, it will look as if it is coming from you.


Nice! I followed the instructions but after I paste the curl command I get this error:

Unhandled exception. System.ArgumentOutOfRangeException: Length cannot be less than zero. (Parameter 'length')
   at System.String.Substring(Int32 startIndex, Int32 length)
   at TwitterDump.Program.ParseCurlCommand(String curlCommand, Dictionary`2& headers) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 150
   at TwitterDump.Program.Auth(AuthOptions options) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 113
   at TwitterDump.Program.<>c.<Main>b__0_1(AuthOptions opts) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 33
   at CommandLine.ParserResultExtensions.MapResult[T1,T2,TResult](ParserResult`1 result, Func`2 parsedFunc1, Func`2 parsedFunc2, Func`2 notParsedFunc)
   at TwitterDump.Program.Main(String[] args) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 30

What am I doing wrong?


NM, I figured it out. Chrome has several "copy as cURL..." commands now. I was using "Copy as cURL (cmd)", but "Copy as cURL (bash)" is what worked.
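
For anyone else who hits this, the difference is the quoting style (illustrative snippets, not the full commands Chrome emits):

```
# "Copy as cURL (bash)" - POSIX single-quoting, which the parser expects:
curl 'https://twitter.com/...' -H 'cookie: ...'

# "Copy as cURL (cmd)" - Windows cmd double-quoting with caret escapes,
# which trips up the header parsing:
curl "https://twitter.com/..." -H ^"cookie: ...^"
```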


Dude! Thanks so much!!! If you're ever in Portland, I owe you a drink!


::reads old tweets::

Oh god... no wonder they suspended me


If you can see your own (suspended account) tweets while logged in, you could use your auth cookie to scrape them.

Not sure how to do that with Twint off the top of my head, though.

We tried to use Twint in production but it didn’t work for us. I ended up writing one that works very well. Let me know if I can help you get your tweets.


> you could use your auth cookie to scrape them

Seems that Twint does all its work unauthenticated, so it may not be possible.

I suppose I could try using the API but I don't think I care that much.


Looks like you could just paste your auth header right here: https://github.com/twintproject/twint/blob/a5df26d988109838d...


Good looking out, but I don't think that's working.

> PS C:\Users\Jesse Hattabaugh\Documents\GitHub\twint> twint -u jessehattabaugh
> CRITICAL:root:twint.get:User:'NoneType' object is not subscriptable


snap


lol @ the Trump supporters downvoting this. Y'all tired of all the winning?


JFYI: Twint is already the name of a popular payment app in Switzerland

https://www.twint.ch/en/


I don't think there will be any confusion.


"popular" is a stretch, it's what a few banks are trying to get people to use because they couldn't get NFC working on apple devices with their apps. Until recently they have been giving money away like crazy to get people to use it.

It requires extra hardware at the cash register (if using Bluetooth) or a modification of the vendor's terminal so it can display a QR code.

Sadly, the app is extremely slow and cumbersome to use at a cash register compared to other options such as NFC payments on Android or plain tap-and-pay credit cards.

It would be a great system for online payments, since scanning a code is very easy compared to entering all your CC information, but because the fees are so high (compared to credit cards), the largest online electronics retailer in Switzerland (Digitec) dropped them after the intro rates expired.


It's probably an attempt to implement what WeChat/Alipay have. In China the majority of POS terminals have this device https://www.alibaba.com/product-detail/Specialized-Alipay-an... , which is a fancy housing for a camera that scans the wallet barcode on your phone (optimized for the perfect angle and to prevent glare, I'm guessing). It seems preferable to and faster than Bluetooth...


A strong mark of quality is that Twint's Twitter account got suspended - evidently Twitter doesn't like a third-party tool scraping tweets better than they themselves allow.


Their API is absolute garbage. Search is incomplete and streaming consumes GBs/hour even when you're getting no results.


Oh hey, glad to see a project I contributed to get a mention on HN.

I wrote the Splunk integration with Twint for crawling Twitter timelines: https://github.com/twintproject/twint-splunk

Feel free to hit me up if there are any questions about that part of Twint.


Thanks for your hard work! The tool has been extremely helpful in my political science research. I have noticed that an error occurs when I try to scrape too many tweets from too many accounts at once (which I do by referencing a txt file with the account names I want to scrape tweets from). I've worked around this by just making shorter lists of account names. Is there another workaround I'm unaware of?


Looks like a decent tool, but I personally would not use it in production. I took a few minutes to look through the code: they basically use the HTML pages instead of the APIs. What puts me off is that the code is missing tests altogether and has quite a few separation-of-concerns issues: HTML extraction is everywhere, and storage is embedded into the package. Cool proof of concept, but not ready for production, I think.

Minor stuff: printing instead of logging; hardcoded SQL(ite?) storage. I would prefer a package that only does the retrieval and nothing else.
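
For illustration, roughly the shape I'd prefer - a retrieval layer that yields plain records and leaves storage to the caller. This is a sketch with made-up names, not Twint's actual code:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Tweet:
    id: str
    author: str
    text: str


class TweetSource:
    """Retrieval only: fetch pages, parse the HTML, yield plain records."""

    def fetch(self, username: str) -> Iterator[Tweet]:
        raise NotImplementedError


def store_sqlite(tweets: Iterable[Tweet], path: str) -> None:
    """Storage is a separate concern that the caller wires in."""
    import sqlite3

    with sqlite3.connect(path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tweets "
            "(id TEXT PRIMARY KEY, author TEXT, text TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO tweets VALUES (?, ?, ?)",
            ((t.id, t.author, t.text) for t in tweets),
        )
```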


> They basically use the HTML pages instead of the APIs.

For Twitter specifically, isn't HTML scraping vastly preferable to using their official APIs? Otherwise you run into pretty arbitrary usage limits and missing features.

There's a small list of services where I prefer HTML scraping and browser piloting for a third-party client: Twitter, Patreon, Facebook, LinkedIn, a few others - services where the official APIs are underdeveloped or crippled to the point of near-uselessness.


That quote is supposed to be a neutral summary, so yes, that might be the case :)


Fair objections, perhaps, but regarding using the HTML - scraping is the whole point, because getting API tokens these days is hard.


I was teaching a hands-on workshop (meetup) on how to use the Twitter API a few months ago, and the hardest part for everybody was getting the API keys: clicking through several pages, checking boxes, having them emailed to you.

Then it turned out Twitter refused half of the attendees an API key... (maybe they thought it was spam, coming from the same wifi at the same time).

So I just gave out my own API key to the rest of the class, and within a few minutes it was blocked...

For a service that has a history of empowering users to protest and to spread news in crisis situations, it's a shame their API is so locked down.


Yep. At least it was a realistic experience :)

The hardest part when working with data is often not manipulating data per se, but spending time on crap like this.


Sorry, looks like I did not make that clear enough. It was not meant as an objection; I totally understand why they do it.


What do you use instead?


There's a pretty actively-maintained Python wrapper for the internal API: https://github.com/bisguzar/twitter-scraper
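
Basic usage, per my reading of its README (each tweet comes back as a plain dict):

```python
from twitter_scraper import get_tweets

# No API keys needed; this scrapes the public profile pages.
for tweet in get_tweets("jack", pages=1):
    print(tweet["text"])
```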


Here's a comment from someone who built a tool that uses Twitter's public-facing but private API: https://news.ycombinator.com/item?id=22855148


It took me 20 seconds to install Twint and start scraping.

It took me 20 minutes to install twitter-dump, and despite a successful `twitter-dump auth`, I end up with 'Request exception Forbidden {"errors":[{"code":200,"message":"Forbidden."}]}'.

Not going to spend time fixing that; I'll use the dirty solution that works.


That's fine. I'm not saying you have to switch, but it looks like there's a way.

Maybe they could combine efforts? They could look at each other's code and, if the licenses allow, port things over.


I'd be interested to hear whether this tool could be used to scrape malware hashes (or links containing malware hashes) from Twitter - this or any other tool, really... it appears my brain has turned to mush today and I cannot find or get anything to work :(


What's your use case for this? I scrape a ton of pastebin links and other sources of hashes posted by folks on Twitter about Emotet, TrickBot, etc., and I'd be happy to point them to a webhook or get them to you another way. Happy to talk through how we do it too.
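
If you want to roll your own with Twint, its keyword search is one way in. A minimal sketch - the search terms are illustrative, not our actual setup:

```python
import twint

# Collect tweets mentioning a malware family plus a paste site,
# then post-process the results for hashes/links.
c = twint.Config()
c.Search = "emotet pastebin.com"
c.Store_json = True
c.Output = "hash_tweets.json"
twint.run.Search(c)
```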


Interesting! I’ll reach out to you via LinkedIn and/or Keybase to explain my use case.


nice job, twitter api is terrible


I noticed that you don't use tweepy (https://github.com/twintproject/twint#requirements). Can you highlight the difference?


Tweepy is for API access, this is a scraper.


And the main advantage is you don’t need to authenticate and you aren’t rate limited.
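
To illustrate the difference (a sketch; the keys are placeholders you'd get from a registered Twitter app):

```python
# Tweepy: official API, registered credentials, rate-limited.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
for status in api.user_timeline(screen_name="jack", count=10):
    print(status.text)

# Twint: scrapes the public pages, no credentials at all.
import twint

c = twint.Config()
c.Username = "jack"
c.Limit = 10
twint.run.Search(c)
```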


A question I've asked before, but I get different answers: what are the -legal- limitations of scraping data when there is a limited-access API?


The problem is that there isn't a straight answer to this; see this recent thread: https://news.ycombinator.com/item?id=22180559

It mostly comes down to how well you can defend yourself against the claim that it's a DoS attack (follow politeness standards and robots.txt), against copyright infringement (generally not problematic if you don't distribute the data), and against violating their terms of service (this is key in the case of Twitter and Reddit - read their TOS carefully).

That said, scraping public information like tweets or Reddit posts is the less problematic part. It's when you distribute the data, or aggregations of it, that using scraped public information can become a problem.
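
On the politeness point, checking robots.txt is a few lines with Python's stdlib (the user-agent string here is made up):

```python
from urllib import robotparser

# Honor robots.txt before crawling; note the TOS can still forbid
# scraping even where robots.txt allows a path.
rp = robotparser.RobotFileParser()
rp.set_url("https://twitter.com/robots.txt")
rp.read()
print(rp.can_fetch("my-research-bot", "https://twitter.com/jack"))
```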


TWINT is also the name of the product for mobile payments in Switzerland (twint.ch). I hope the reject won’t run into a fight over the name.


s/reject/project


I've used my fair share of (painful) APIs over the years, and I have one simple plea: can we please stop being a-holes? Can we stop scraping websites that have APIs? They already offer machine-readable data, and maybe they have a reason for not providing everything. Scraping their sites circumvents the API and not only abuses their systems but also makes your code super brittle - any change to their site breaks your code, if your code wasn't already broken because the provider banned you.


If people are resorting to scraping then clearly the official API isn't fit for their purposes.

As an example, here are two key features that are missing from the Twitter API at the moment:

- Bookmarks. You can privately bookmark tweets on the Twitter website and apps. There is no way to access the list of tweets you have bookmarked in the API.

- Threads. The concept of threads - where replies to a tweet from the same author get special display treatment - is key to how Twitter is used today. The official API doesn't support them: there is no way to look at a tweet and see that a threaded reply exists.

There is no good commercial reason for excluding either of these features from the public API, other than that Twitter has made a strategic decision not to invest resources in keeping the API up to date with new features they add to the platform.

Given that, is it any surprise that people are resorting to scraping?


You say "maybe they have a reason for not providing everything" - I cannot think of a reason not to provide me with API access to my own private bookmarks other than "we decided to invest our engineering resources elsewhere".

Which isn't a bad reason! But it's not a good argument for people not to scrape their own data.


> But it's not a good argument for people not to scrape their own data.

Why would you want to scrape your own data if you can already request all your data and get a whole archive? https://help.twitter.com/en/managing-your-account/how-to-dow...


Because then you have to trigger and download a GB+ file every time you want to programmatically access your latest bookmarks.


Everything Twitter used to offer, and no longer does, enabled third parties to profit off "their valuable" data. It's economically/strategically unsustainable - or so they think.


The API only returning a user's most recent 3,200 tweets is also a dealbreaker.


Or they don't know how to effectively use the API, maybe.


I doubt it. People who write scrapers are talented engineers. Talented engineers understand the limitations of scraping compared to using an official API, and they know to try the official API first.

The Twint README calls out its reasons for going beyond the API right at the start - things like Twitter's increasingly strict rate limits and the cap of only 3,200 historic tweets per user account.
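
For reference, the search-based approach from the README looks roughly like this (a sketch; the date is arbitrary):

```python
import twint

# Scrape a user's full history past the official API's 3,200-tweet cap
# by going through search rather than the timeline endpoints.
c = twint.Config()
c.Username = "jack"
c.Since = "2006-01-01"
c.Store_csv = True
c.Output = "full_history.csv"
twint.run.Search(c)
```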


I've been maintaining a browser extension[1] that has used various versions of the unofficial API over a period of six years - from literally navigating the site as the user and grabbing HTML, to the various incarnations of the JSON/HTML hybrid (SSR) APIs that their web and mobile clients have used internally over that time.

Believe me, I would LOVE it if the official API supported threads. I have tried several times to make it work, but the official APIs are stuck in a circa-2012 idea of how Twitter works. Replies just aren't a thing to them.

[1] https://github.com/paulgb/Treeverse


I don't think this program actually solves the 3,200 limit you mentioned - not sure, though.


People who write robust, performant, maintainable systems that don't collapse under their own weight are talented engineers.

Most scrapers, including this one, lack these characteristics.


I totally agree: writing scrapers doesn't necessarily make one a talented engineer. It's often easier to write a scraper than to work through API documentation or a series of authentication procedures.


I think it's the social media companies that are being assholes, by privatizing conversations and selling access to the graph. The idea of a platform is to set up a protocol and some hardware, get people to talk to each other on it, and then gradually construct a mall around the outer edges of the platform. It's a shitty business model because it relies on knowing more about the users than they know about themselves and then selling that information to advertisers.


>maybe they have a reason for not providing everything

Not my problem. You make your HTTP resources accessible, I'll download them however I want.



