
It's simple: Use the website. It will never stop working because people use it, and you can automate and scrape it just fine. Kind of sucks that you have to go through the effort - but then there is no social media corporation that isn't toxic in some way to its userbase.



I actually don't mind scraping it that much, and even enjoy the adversarial aspects of writing scrapers. But it makes a lot of high-level functionality like filtered streams or historical search much less accessible; I'd probably never have learned a lot of network analysis stuff over the last decade or so if I'd had to pay to access streaming data first. Also, I think it's going to be harder for academic researchers to get institutional approval to scrape adversarially, so it could put a dent in a lot of social science research by forcing people to chase grants instead of focusing on their code.


It's ironic: for a couple of (non-Twitter) projects I wound up scraping because either a) they didn't have an API yet (e.g. early crypto pricing sites) or b) I wasn't confident the API would remain intact over the long term. Kind of depressing.


> I think it's going to be harder for academic researchers to get institutional approval to scrape adversarially

This is a good point about what might happen, but it seems worthwhile to address and fix directly. Personally I don't see why adversarial scraping of a publicly published website should require any more ethical consideration/review than using the suggested API would. Ethical concerns should revolve around humans, not the business desires of non-human entities.


Also, quality residential proxies are pricey. You need to rate-limit and rotate both the IP and the puppeteer fingerprint when scraping adversarially; a sketch of the IP half is below.
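A rough sketch of the IP-rotation half in shell (the fingerprint half needs puppeteer itself), assuming a proxies.txt of residential endpoints and a ua.txt of user-agent strings, both hypothetical:

  #!/bin/sh
  # fetch each URL through a freshly picked proxy and user-agent, with jittered delays
  while read -r url; do
    PROXY=$(shuf -n1 proxies.txt)
    UA=$(shuf -n1 ua.txt)
    curl -s -x "$PROXY" -A "$UA" "$url"
    sleep "$(shuf -i 2-7 -n1)"   # crude rate limit
  done < urls.txt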


The website is awful. It's horrific. It was the worst site I visited regularly until I started using Nitter instances. I'd rather not know what is going on than go to it. I wish they'd wind the UI back 10 years, to when it was pleasant to browse.


The point is that the website is an API; whether the UI is good or bad is irrelevant.


Depending on Twitter's pricing, this might not be worthwhile.

Scraping web pages is a last resort for when proper APIs are not available.


I look forward to someone teaching Elon Musk what version control is, resulting in him wholesale rolling back Twitter's entire stack to 2008. At last, Twitter is written in Ruby again!


Do you remember, I think it was 2020, when there was a big "hack" and tons of blue check accounts started posting fraud links?

Two important take-aways from that:

1) The attackers just pulled the old "Hey, I'm from IT service desk and need your password."

2) Apparently every engineer had god mode in production

I can't imagine what scary monsters are in that codebase.


I too miss the fail whale.


IIRC the Twitter website goes to great lengths to mangle itself, presumably to prevent ad blockers.

Of course this used to have the side effect of breaking significant swathes of basic browser UX, especially in areas of accessibility. I assume they’re better now than they once were, but given Musk’s historical behavior I assume he won’t consider breaking something like accessibility to be bad or problematic.


> Musk’s historical behavior I assume he won’t consider breaking something like accessibility to be bad or problematic.

I don't see why he would spend more to make the website less accessible. Not fixing bugs - sure, just about everyone does that. But breaking it intentionally costs money.

Teslas aren't THE most accessible (by cost) EV out there, but SpaceX def is.


Supposedly the entire accessibility team got laid off [1], so there might not be anyone left to ensure that changes to existing features do not reduce accessibility, or that new features are accessible in the first place. Not that I'd expect Musk to care a whole lot about concerns raised in that regard, considering that he supposedly went ahead with Twitter Blue despite significant (and well-warranted) concerns from the Trust and Safety team [2]. So Twitter will probably become less and less accessible over time.

[1] https://www.wired.com/story/twitter-layoffs-accessibility/

[2] https://www.businessinsider.com/twitter-sent-musk-risks-paid...


> Teslas aren't THE most accessible (by cost) EV out there, but SpaceX def is.

SpaceX is... an accessible EV? What?


Most accessible satellite launcher...


Ah yes, because "what's the relative cost and complexity to procure a launch for my satellite?" is famously a real example of an accessibility problem faced by people with disabilities...


I'm pretty sure you know exactly what the poster meant ...


> I'm pretty sure you know exactly what the poster meant ...

No, I don't. It's obviously trying to say something about accessibility and SpaceX, but I don't see how the sense of accessibility being discussed in the thread even applies to SpaceX, much less what claim is being made.


Agree. It's quite the jump going from a user-centric website to a space launch company that is used by corporations and governments. Like, okay, I guess SpaceX is extremely accessible compared to other options in the space industry, but why would that have any bearing on how accessible the Twitter website is for you and me day-to-day?


Starlink has created more information accessibility than Twitter ever will.


But that’s not in any way shape or form the “accessibility” being talked about here.

Just because it’s the same word does not mean it’s the same meaning.


I wasn't sure how to put it into words and this was exactly what I had in mind.


Yes, but since we are judging a person for his merit…


Do you understand where you are?

This is a comment thread discussing accessibility at Twitter, in a post about Twitter discontinuing part of its platform. It’s not a place to circlejerk about Musk’s achievements, regardless of how much you would want that to be so.

Explain, in exact words, what about SpaceX in any way shape or form would indicate that they know how to handle user a11y, and how that would transfer over to Twitter - you know, the thing we’re actually discussing here.


Most accessible (by cost) launch provider.


Accessibility is not a by-cost thing. And you aren't paying to break accessibility; you're paying to break scraping, and maintaining accessibility while doing that costs money. It also requires having engineers working to keep the site accessible, but Musk fired them.


Scraping is much easier when sites are more accessible.


Correct, which is why making something accessible and not scrapeable is harder/more expensive than breaking accessibility.


I think it’s a terrible move to kill stuff like PostyBirb that a lot of people use to make posts on multiple websites at once. They’ll have to either change their workflow to make a tweet (annoying) or just abandon Twitter altogether. It just makes the website worse for practically no benefit.


It does not take much effort. Below is an example using curl. For reading Twitter feeds I just get the JSON and read the "full_text" objects. I have a simple custom program I wrote that turns JSON of unlimited size into something like line-delimited JSON so I can use sed, grep and awk on it, but HN readers probably prefer jq. For checking out t.co URLs I use HTTP/1.1 pipelining.

Usage for reading is something like (but not identical to)

   1.sh screen_name > 1.json
   yy059 < 1.json|grep full_text > 1.txt
   less 1.txt
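For the jq crowd, something like this should pull the same full_text fields (a sketch using jq's recursive descent in place of my custom tool):

   jq -r '.. | .full_text? // empty' 1.json > 1.txt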
Usage for checking out URLs is something like (but not identical to)

   unset connection
   export Connection=keep-alive
   yy059 < 1.json|grep full_text \
   |yy030 \
   |grep -E "https://t.co/.{10}$" \
   |uniq \
   |yy025 \
   |nc -vv h1b 80 \
   |sed -n '/location: /s///p' \
   |ahref > 1.htm
   links -no-connect 1.htm
"ahref" is just a script that turns URLs on stdin into simple HTML on stdout.

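Something in the shape of ahref (a sketch, not the actual script):

  #!/bin/sh
  # URLs on stdin, one anchor per line of simple HTML on stdout
  echo "<html><body>"
  while read -r u; do
    printf '<a href="%s">%s</a><br>\n' "$u" "$u"
  done
  echo "</body></html>"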
Alternatively, if I do not trust the URLs, I might use a script called "www" instead of ahref. It takes URLs on stdin and fetches archive.org URLs wrapped in simple HTML to stdout, using the IA's cdx API.
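A sketch of the "www" idea, assuming the cdx endpoint's fl and limit parameters (limit=-1 should return the newest capture):

  #!/bin/sh
  # URLs on stdin; emit archive.org links to the latest capture of each
  echo "<html><body>"
  while read -r u; do
    TS=$(curl -s "https://web.archive.org/cdx/search/cdx?url=$u&fl=timestamp&limit=-1")
    [ -n "$TS" ] && printf '<a href="https://web.archive.org/web/%s/%s">%s</a><br>\n' "$TS" "$u" "$u"
  done
  echo "</body></html>"

The 1.sh used in the examples above is: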

  #!/bin/sh
  SCREEN_NAME=$1
  COUNT=500
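  # public bearer token baked into the twitter.com Javascript; every visitor shares it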
  PUBLIC_TOKEN="Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA"
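  # scrape a guest token (gt=...) from the twitter.com front page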
  GT=$(exec curl -A "" -s https://twitter.com/$SCREEN_NAME|sed -n '/gt=/{s/.*gt=//;s/;.*//p;}');
  echo "x-guest-token: $GT" >&2;
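  # resolve the screen name to its numeric rest_id via the UserByScreenName GraphQL query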
  REST_ID=$(exec curl -A "" -H "authorization: $PUBLIC_TOKEN" -H "content-type: application/json" -H "x-guest-token: $GT" -s "https://twitter.com/i/api/graphql/mCbpQvZAw6zu_4PvuAUVVQ/UserByScreenName?variables=%7B%22screen_name%22%3A%22$SCREEN_NAME%22%2C%22withSafetyModeUserFields%22%3Atrue%2C%22withSuperFollowsUserFields%22%3Atrue%7D"|sed 's/\(rest_id\":\"[0-9]*\)\(.*\)/\1/;s/.*\"//'); 
  echo "rest_id: $REST_ID" >&2;
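  # fetch the user's tweets as JSON via the UserTweets GraphQL query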
  curl -A "" -H "authorization: $PUBLIC_TOKEN" -H "content-type: application/json" -H "x-guest-token: $GT" -s "https://twitter.com/i/api/graphql/3ywp9kIIW-VQOssauKmLiQ/UserTweets?variables=%7B%22userId%22%3A%22${REST_ID}%22%2C%22count%22%3A$COUNT%2C%22includePromotedContent%22%3Atrue%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue%2C%22withSuperFollowsUserFields%22%3Atrue%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%2C%22withSuperFollowsTweetFields%22%3Atrue%2C%22withVoice%22%3Atrue%2C%22withV2Timeline%22%3Atrue%7D&features=%7B%22dont_mention_me_view_api_enabled%22%3Atrue%2C%22interactive_text_enabled%22%3Atrue%2C%22responsive_web_uc_gql_enabled%22%3Afalse%2C%22vibe_tweet_context_enabled%22%3Afalse%2C%22responsive_web_edit_tweet_api_enabled%22%3Afalse%2C%22standardized_nudges_misinfo%22%3Afalse%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%2C%22include_rts%22%3Atrue%7D"
There's no way I would use the Twitter website itself, as it requires enabling Javascript, and not for the user's benefit.

This solution isn't pretty but I can easily keep tabs on Twitter feeds without any need for a Twitter account, a Twitter "API key" or a so-called "modern" browser.


Can you scrape search feeds too? I.e., tweets that match a certain string?


You can scrape anything you see in the UI (and sometimes stuff you cannot see). Twitter makes almost no effort to stop people from using their internal APIs, which is why their claim that discontinuing the free public API will stop malicious bots is pretty laughable. Unless they seriously improve their ability to detect non-approved clients on the internal API, it would take any malicious actor all of a few hours to switch to the internal API for whatever they want. Honestly, I assumed most bad actors were already doing this, since things like spamming were already against the ToS of the public API.


What happens if/when they block that Bearer token?


The token has been the same since at least 2020 when Twitter started using GraphQL instead of REST.

Every person visiting twitter.com is using this same token. The token is neither personal nor private.

What would be the point of changing or blocking it?
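If they did block it, recovering the replacement would be trivial: the token ships in the site's public Javascript. A sketch, assuming the bundle is still served from abs.twimg.com and the token keeps its long "AAAA..." prefix:

  #!/bin/sh
  # find the main JS bundle on the twitter.com front page,
  # then grep the embedded bearer token out of it
  BUNDLE=$(curl -A "" -s https://twitter.com/ \
    |grep -o 'https://abs\.twimg\.com/[^"]*main[^"]*\.js'|head -n1)
  curl -A "" -s "$BUNDLE"|grep -o 'AAAAAAAAA[A-Za-z0-9%]\{50,\}'|head -n1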


> Every person visiting twitter.com is using this same token.

It'll be interesting to see if that stays the same when they're charging for the API but leaving a huge loophole open with this token.


Twitter is not alone in having all website visitors use the same token or key; other websites do it, too, as shown below.

Using GraphQL like this can be an effective dark pattern, because to anyone using a "modern" browser that the "tech" companies control, it makes it seem like the text of the website cannot be retrieved without Javascript enabled. That's false, but it nonetheless gets people to enable Javascript, because the website explicitly asks them to. Then the website, i.e. the "tech" company, can perform telemetry, data collection, surveillance, and other shenanigans.

Sometimes this practice might not be a deliberate dark pattern; it might just be developers using Javascript gratuitously. For example, HN's search, provided by Algolia, uses the same shared-key pattern. HN puts URLs with pre-selected query terms and a public token ("API key") on the HN website, and everyone who uses those URLs uses the same key.

Unlike Twitter, HN itself does not ask anyone to enable Javascript. The website works fine without it, including the Algolia search, as shown below.

Usage is

   1.sh query > 1.json

   #!/bin/sh

   # splice the query into the JSON body; inside single quotes $@ would never expand
   curl -A "" -d '{"query":"'"$*"'","analyticsTags":["web"],"page":0,"hitsPerPage":30,"minWordSizefor1Typo":4,"minWordSizefor2Typos":8,"advancedSyntax":true,"ignorePlurals":false,"clickAnalytics":true,"minProximity":7,"numericFilters":[],"tagFilters":["story",[]],"typoTolerance":"min","queryType":"prefixNone","restrictSearchableAttributes":["title","comment_text","url","story_text","author"],"getRankingInfo":true}' "https://uj5wyc0l7x-3.algolianet.com/1/indexes/Item_production_sort_date/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.0.2)%3B%20Browser%20(lite)&x-algolia-api-key=8ece23f8eb07cd25d40262a1764599b1&x-algolia-application-id=UJ5WYC0L7X"
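To skim the hits without a JSON viewer, jq can pull fields out of the standard Algolia response, e.g.

   1.sh "twitter api" | jq -r '.hits[] | .title'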


Here is a non-curl version of HN search using yy025, a custom HTTP request generator, and h1b, an alias for the localhost address of a TLS forward proxy:

    #!/bin/sh

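    # request headers and method consumed by the yy025 HTTP generator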
    export Connection=close;
    export Content_Type=x-www-form-urlencoded;
    export httpMethod=POST;
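    # JSON search body; the query is spliced in from the command line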
    x=$(echo '{"query":"'$@'","analyticsTags":["web"],"page":0,"hitsPerPage":30,"minWordSizefor1Typo":4,"minWordSizefor2Typos":8,"advancedSyntax":true,"ignorePlurals":false,"clickAnalytics":true,"minProximity":7,"numericFilters":[],"tagFilters":["story",[]],"typoTolerance":"min","queryType":"prefixNone","restrictSearchableAttributes":["title","comment_text","url","story_text","author"],"getRankingInfo":true}');
    export Content_Length=${#x};
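    # yy025 turns the URL on stdin into a raw HTTP request; the body is appended and nc sends it through the h1b proxy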
    echo "https://uj5wyc0l7x-3.algolianet.com/1/indexes/Item_production_sort_date/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.0.2)%3B%20Browser%20(lite)&x-algolia-api-key=8ece23f8eb07cd25d40262a1764599b1&x-algolia-application-id=UJ5WYC0L7X"|(yy025;echo "$x") \
    |nc -vv h1b 80



