
Only Big Tech (Microsoft, Google, Facebook) can crawl the web at scale because they own the major content companies and they severely throttle the competition's crawlers, and sometimes outright block them. I'm not saying it's impossible to get around, but it is certainly very difficult, and you could be thrown in prison for violating the CFAA.


I'm not sure training on a vast amount of content is really necessary, in the sense that linguistic competence and knowledge can probably be separated to some extent. That is, the "ChatGPT" paradigm leads to systems that just confabulate and "make shit up", and making something radically more accurate means going to something retrieval-based or knowledge graph-based.

In that case you might be able to get linguistic competence from a much smaller model, trained on a smaller, cleaner, and probably partially synthetic data set.
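
To make that concrete, here's a toy sketch of the kind of split I mean (all names and facts in it are placeholders): the knowledge sits in a store that can be curated and cleaned separately, and the language side only verbalizes what comes back, or declines.

  # Hypothetical sketch: knowledge kept as curated (subject, relation) -> fact
  # entries outside the model weights, so it can stay small and clean.
  FACTS = {
      ("common crawl", "license"): "openly downloadable by anyone",
      ("gpt-3", "primary training source"): "Common Crawl",
  }

  def lookup(subject, relation):
      return FACTS.get((subject.lower(), relation.lower()))

  def answer(subject, relation):
      fact = lookup(subject, relation)
      if fact is None:
          # A retrieval- or graph-backed system can decline instead of confabulating.
          return f"I don't have a record of the {relation} of {subject}."
      # In a real system a small fluent model would phrase this; a template will do here.
      return f"The {relation} of {subject} is {fact}."

  print(answer("GPT-3", "primary training source"))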


Common Crawl claims to have 82% of the tokens used to train GPT-3, and it's available to anyone.

Add all the downloadable material at archive.org and you've got a formidable corpus.

https://commoncrawl.org/
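
For anyone who wants to poke at it, here's a minimal Python sketch of streaming one Common Crawl WARC file (the WARC path below is a placeholder; real paths are listed per crawl on commoncrawl.org, and this assumes the warcio package is installed):

  # Stream one Common Crawl WARC file and print the URL and size of each captured page.
  # The WARC path is a placeholder, not a real file.
  import requests
  from warcio.archiveiterator import ArchiveIterator  # pip install warcio

  WARC_URL = "https://data.commoncrawl.org/crawl-data/.../example.warc.gz"  # placeholder

  with requests.get(WARC_URL, stream=True) as resp:
      for record in ArchiveIterator(resp.raw):
          if record.rec_type == "response":
              url = record.rec_headers.get_header("WARC-Target-URI")
              body = record.content_stream().read()
              print(url, len(body))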


We are just robots in a human simulator, reliving our creation.


Q: how might an AI algorithm be modified in order to return citations with its response?

A: There are several ways in which an AI algorithm could be modified to return citations with its responses. Here are a few possibilities:

    One approach would be to add a citation generation module to the AI algorithm. This module could be trained on a dataset of citations and could be used to generate appropriate citations for the information provided by the AI algorithm.

    Another option would be to modify the AI algorithm to include a database of references, with each reference being associated with a specific piece of information. When the AI algorithm provides a response, it could also return the appropriate reference(s) from the database.

    A third possibility would be to modify the AI algorithm to search external sources of information (such as online databases or libraries) for relevant citations whenever it provides a response. The AI could then return the relevant citations along with its response.

It is worth noting that adding a citation generation module or incorporating external sources of information into an AI algorithm can be a complex task, as it requires the development of additional algorithms and data structures to support these features. It may also require the acquisition of additional data or resources to support the citation generation process.
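
To make the second option above concrete, here is a toy sketch in which every stored statement carries the reference(s) it came from, so a response can be returned together with its citations. Everything in it is made up for illustration.

  # Toy sketch: statements keyed to their references, returned alongside the answer.
  KNOWLEDGE = [
      {"text": "Common Crawl provides openly downloadable web crawl data.",
       "refs": ["https://commoncrawl.org/"]},
      {"text": "Most of GPT-3's training tokens came from filtered Common Crawl text.",
       "refs": ["Brown et al. 2020, 'Language Models are Few-Shot Learners'"]},
  ]

  def respond_with_citations(query):
      words = set(query.lower().split())
      # Toy retrieval: keep statements sharing at least one word with the query.
      hits = [k for k in KNOWLEDGE if words & set(k["text"].lower().split())]
      response = " ".join(k["text"] for k in hits)
      citations = sorted({ref for k in hits for ref in k["refs"]})
      return response, citations

  resp, cites = respond_with_citations("where does common crawl data come from?")
  print(resp)
  for i, ref in enumerate(cites, 1):
      print(f"[{i}] {ref}")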


I just posted this same comment on the ddg story, but I'm going to post it here as well.

Google forced my search engine (gigablast) basically out of business. I had ixquick.com as a big client at one time; I was providing them with search results from my custom web search engine. Then their CEO called me one day and told me he was cancelling, even though he'd been a client for over 10 years. He said it was because of some change Google had made to their agreement. Ixquick needed Google's results and ads for their startpage.com website, and, even though my results were shown on their ixquick.com and later ixquick.eu site, apparently Google wasn't good with that.


everyone needs equal access to public data. right now only big tech can download the many web pages (without throttling or being ip banned) on linkedin (microsoft), youtube (google), facebook, github (microsoft) and billions more pages. this also leads to a gap in AI training sets that gives big tech even more entrenchment. for instance, only microsoft can build that ai coding application they did because other companies can't access all of github without being throttled or ip banned (last time i checked, but i could be wrong now; microsoft owns github). regardless, we need some sort of bot 'bill of rights' to ensure equal access going forward. perhaps the answer is legislation or perhaps it is some massive p2p proxy net. i think it is legislation, because the p2p proxy net is too hard to implement, and it would have to solve turing tests.

but perhaps web 3.0 (dweb) can just bypass all this nonsense and make its own versions of these popular services with baked-in accessibility for all.


Access to public data doesn't really mean much. A lot of training data is already public. Increasing computational resources and reducing the cost of communication always have a centralizing effect, for basic economic/energy/efficiency reasons: they increase the benefits of division of labor and returns to scale.

The history of technological progress is a history of agglomeration, and if web3 reduces barriers, all it does is create more, not less, leverage for centralization, the same way the internet did after a short phase of disruption, or even book printing for that matter.


this is all true, but consider a computing model that almost was: https://en.wikipedia.org/wiki/Telescript_(programming_langua...

telescript was kind of weird, but wildly ahead of its time. the idea was decentralized services and commercial activity, but centralizing computation due to power/energy efficiency.

it was like an inverse jvm born at the transition from arpanet->nsfnet->internet, which greatly deregulated commercial activity.

it was also born when mobile devices were almost tractable, though 14 years before the iphone.

it almost birthed the idea of an app store, except "apps" would be decentralized services people send/receive.


Except web3 won’t actually fix those problems. Switching to blockchain won’t decentralize the web because centralization is an unavoidable emergent property of large complex systems.


thanks ben, you are too kind.


I second this - thanks for building this. It's an unbelievably inspiring achievement. It's my default search engine, and I'm really glad it exists.


Wow even knows me by my first name too. Very humbled. Once again thanks for being so open with what you have done.


the javascript is run by your browser, so you can fully audit it.


It's still served by the site, and I doubt most people are interested in or capable of auditing software just to perform routine online tasks.


I am not sure there are good solutions besides going off browser.

P.S. I was involved in user authorization, attestation and privacy flows for a particular product recently and the browser was always where shit hit the fan. The web features are just not made with simplicity and privacy in mind. Then again we had more complex constraints.


There's an extension as well [1]. This means that the code is not being served by the server in this use case.

[1] https://private.sh/extension.html


hey thanks for the recognition, people. :) finally, all my problems are solved. this comment is here for hacker news karma points.


Hey Matt, would you consider making the search box (the input with id="q") a bit wider? I can only type around 15 characters before the beginning of the search query gets cut off.


Seconded.

I went to check out a few example searches and the too-narrow search bar is the first annoyance I found.

The next annoyance was that the crawled index seems much smaller than google's or bing's. I looked for things I know exist on twitter, on an old wordpress blog, and on obscure websites I frequent: forcing terms not to be skipped using +, I could see that none of my test cases were in the index.


The too-small text box is also a pain when deleting the query in order to type a new one on mobile. A clear button would mitigate some of this pain, although making the field larger would probably be sufficient.


I noticed there were IPs in the source code that seemed to reference your, and maybe others', home IP addresses. I'm curious whether you run any part of the crawling, indexing, or searching from home networks?

I'm asking since I'm working on similar (though different) crawling problems where it would be easier to just handle some things from the hardware I have at home, and I've always assumed the provider would shut it down. Have you had any issues with that?


*throws karma at the screen*


both ddg and brave are bing (microsoft) in disguise.


This is not correct. Brave Search owns its own (growing) index and relies on third parties like Bing for some fraction of requests, which is not the same thing as relying fully on Bing or third parties for results the way so many meta-search engines do. More detailed answer here: https://search.brave.com/help/independence

Edit: Forgot to say that I work on Brave Search.


brave 'falls back' to bing, which in my experience is most of the time. in fact, out of all the queries i did a while back, they all seemed to come directly from bing. is there a way to disable the reliance on bing and get pure 'brave only' results? and can you be more specific about what this fraction is? do you blend at all?


You can check exactly what fraction of the results were fetched from Brave's index vs. third parties using the "independence score" found in the settings drawer (it can be opened with the cog icon at the top right of any page on search.brave.com). There is both a global and a personalized independence score (aggregated over all users and over your queries only, respectively).

An explanation with screenshots is also available here: https://search.brave.com/help/independence


What independence percentage do you see when you click on the gear in the upper right of the Brave Search results page?

I get 84% personal (browser-based), 87% global (which means we hit Bing only 13% of the time from our server side).
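
The score is essentially the share of results served from our own index rather than a third party; here's a simplified sketch of that arithmetic (illustrative only, not the actual implementation):

  # Simplified sketch: independence = fraction of results from the own index.
  def independence_score(result_sources):
      own = sum(1 for s in result_sources if s == "own-index")
      return own / len(result_sources) if result_sources else 0.0

  # e.g. 87 of 100 results from the own index -> 0.87, i.e. ~13% fell back to Bing
  print(independence_score(["own-index"] * 87 + ["bing"] * 13))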

