The idea maze for AI startups (2015) (cdixon.org)
108 points by gmays 12 months ago | 41 comments



Chris Dixon just spent 2 years investing in and hyping NFTs. AI is a legitimate innovation that can do without all of the “Web3” con artists driving any aspect of it.

Source: https://cdixon.org



But what about "The next big thing will start out looking like a scam"?


I agree. Hype is a double-edged sword. We've already been cut by it more than a few times in ML, and honestly, it doesn't seem unlikely that we'll do far more damage. I wish people would tone it down a bit so we could oust the con artists before moving on.


> just spent 2 years

So ... after he wrote this blog post?


In my opinion, it's legitimate to question the discretion of anyone who put their entire weight behind cryptocrappency, especially NFTs.


Maybe. What about questioning the discretion of anyone who uses childish terminology like "cryptocrappency" and ad hominems instead of making an actual point?


I won't totally advocate ignoring what someone says because of who they are, but I'm not going to waste precious time reading their thoughts when I have the choice to read someone else's.


I think there's a new approach for “How do you get the data?” that wasn't available when this article was written in 2015. The new text and image generative models can now be used to synthesize training datasets.

I was working on a typing autocorrect project and needed a corpus of "text messages". Most of the traditional NLP corpora, like those available through NLTK [0], aren't suitable. But it was easy to script ChatGPT to generate thousands of believable text messages by throwing random topics at it.

Similarly, you can synthesize a training dataset by giving GPT the outputs/labels and asking it to generate a variety of inputs. For sentiment analysis... "Give me 1000 negative movie reviews" and "Now give me 1000 positive movie reviews".
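
A minimal sketch of that sentiment-analysis recipe, assuming the pre-1.0 openai Python package (the model name, prompt wording, and JSON-output assumption are all illustrative):

    # Sketch: fix the label, then ask the model to generate matching inputs.
    import json
    import openai  # pre-1.0 openai package; reads OPENAI_API_KEY from the env

    def generate_reviews(sentiment: str, n: int = 50) -> list[str]:
        """Ask the chat model for n short movie reviews with the given sentiment."""
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Give me {n} short {sentiment} movie reviews "
                           "as a JSON array of strings.",
            }],
        )
        # Assumes the model obeys the JSON instruction; real code should validate.
        return json.loads(response.choices[0].message.content)

    dataset = [(r, "positive") for r in generate_reviews("positive")]
    dataset += [(r, "negative") for r in generate_reviews("negative")]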

The Alpaca folks used GPT-3 to generate high-quality instruction-following datasets [1] based on a small set of human samples.

Etc.

[0] https://www.nltk.org/nltk_data/

[1] https://crfm.stanford.edu/2023/03/13/alpaca.html


An interesting question is: if you can get ChatGPT to generate high quality data for you, should you just cut out the middle-model and use ChatGPT as your classifier?

The answer probably depends a lot on your specific problem domain and constraints, but a non-trivial amount of the time the answer will be that your task could be solved by a wrapper around the ChatGPT API.
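
To make the wrapper case concrete, a zero-shot classifier sketch under the same pre-1.0 openai package assumption (prompt and model name are illustrative):

    # Sketch: skip the middle-model and use the chat model as the classifier.
    import openai  # pre-1.0 openai package; reads OPENAI_API_KEY from the env

    def classify(review: str) -> str:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0,  # keep labels as deterministic as possible
            messages=[{
                "role": "user",
                "content": "Answer with exactly one word, positive or negative. "
                           f"What is the sentiment of this movie review? {review}",
            }],
        )
        return response.choices[0].message.content.strip().lower()

The tradeoff is per-call cost and latency, which the sibling comments get into.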


You definitely can use LLMs to do your modeling. But sometimes you need very fast, cheap, smaller models instead. There's also research showing that using an LLM to generate training data for targeted, task-specific models may result in better performance.


>should you just cut out the middle-model and use ChatGPT as your classifier?

Oh you certainly could.

See here: GPT-3.5 outperforming elite crowdworkers on MTurk for text annotation: https://arxiv.org/abs/2303.15056

GPT-4 going toe to toe with experts (and significantly outperforming crowdworkers) on NLP tasks:

https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...

I guess it will take some time before the reality really sinks in, but the days of the artificial SOTA being obviously behind human efforts on NLP have come and gone.


> should you just cut out the middle-model and use ChatGPT as your classifier?

And hope OpenAI forever provides the service, and at a reasonable price, latency, and volume?


> And hope OpenAI forever provides the service, and at a reasonable price, latency, and volume?

They are enjoying being the market leader for now, but OpenAI will soon face real competition, and LLM services will become a commodity product. That must be partly why they sought Microsoft's backing: to be part of "big tech".


Besides that, there's the issue of efficiency.

Better-quality training data might let you build a leaner, more efficient model that is far cheaper to implement and run than the expensive model used to generate its training data.

See for example: https://twitter.com/SebastienBubeck/status/16713263696268533...
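
A minimal sketch of that distillation step, assuming a synthetic dataset of (text, label) pairs like the one generated upthread, with scikit-learn standing in for the leaner model (my choice, purely illustrative):

    # Sketch: distill LLM-generated labels into a small, cheap local model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts, labels = map(list, zip(*dataset))  # dataset: [(review_text, label), ...]
    small_model = make_pipeline(TfidfVectorizer(),
                                LogisticRegression(max_iter=1000))
    small_model.fit(texts, labels)

    # Inference is now local, fast, and effectively free per call.
    print(small_model.predict(["An instant classic, I loved every minute."]))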


This is a very bad idea for image models. They pick up and amplify imperceptible distortions in images that no human reviewer would catch... not to mention the big ones, when the output is straight-up erroneous.

This may apply to text too.

Partially or fully synthetic data is OK when finetuning existing LLMs. I personally discovered it's not OK for finetuning ESRGAN. Not sure about diffusion models.


> Not sure about diffusion models.

Diffusion models are still approximate density estimators, not explicit ones. They lose information because you don't have a unique mapping to the subsequent step. You've got to think about the relationship between your image and its preimage.

So while they match the distribution better than GANs, they still aren't reliable for dataset synthesis. But they are better than GANs for it (GANs are very mean-focused, which is why we got such high-quality images from them, but also why we see huge diversity issues and amplification of biases).


> Not sure about diffusion models.

Human-curated synthetic data is commonly used in finetuning (or LoRA training) for SD. I doubt that uncurated synthetic data would be very usable. There might be use cases where curating synthetic data with some kind of vision model would be valuable, but my intuition is that it would be largely hit-or-miss and hard to predict.


> The new text and image generative models can now be used to synthesize training datasets.

No. Just no. Dear god, no.

This isn't too different from GPT-4 grading itself (looking at you, MIT math problems)!

Current models don't accurately estimate the probability distribution of the data, so they can't be relied on for dataset synthesis. Yes, synthesis can help, but you have to remember that models typically generate the highest-likelihood data, which is already abundant. Getting non-mean data is the difficult part, and without good density estimation you can't do that reliably. Explicit density estimation networks are rather unpopular and haven't received nearly as much funding or research. I highly recommend the area, though I'm biased, because it's what I work on (explicit density estimation and generative modeling).
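
For readers outside the subfield, "explicit density estimation" refers to models such as normalizing flows, where an invertible map f gives you an exact log-likelihood via the change-of-variables formula (my gloss, not part of the original comment):

    % Exact log-likelihood under a normalizing flow with invertible map f;
    % GANs and diffusion models expose no comparable explicit density.
    \log p_X(x) = \log p_Z\bigl(f(x)\bigr)
                + \log \left| \det \frac{\partial f(x)}{\partial x} \right|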


Sampling an AI output when the distribution you want is human data is incredibly stupid.


I don't think it is. The distribution of an AI model that was trained on such a huge amount of movie reviews is very close to the human distribution.

At least that's true around the mean. If your application needs to handle long-tail cases, an LLM won't easily give you that. But depending on the application, that may not be necessary. So yeah, sometimes this is a bad idea, but for many applications it may be just fine.


It's funny, for my lil startup, "How do you get the data" is now _less_ tech than ever. I pay an hourly wage to a human to generate/transcribe it. This method is both much more cost-effective and more scalable than the tech-enabled alternatives.


Is synthesized data high quality, or does it just seem high quality?


“Quantity has a quality all its own.”


Appears to be susceptible to model collapse [1], depending on how you do it.

[1] https://arxiv.org/abs/2305.17493v2
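
As a toy numerical illustration (mine, not taken from the paper) of how refitting on your own samples loses the tails:

    # Sketch: repeatedly refit a Gaussian to samples drawn from the previous
    # generation's fit. Small-sample bias in np.std plus finite-sample noise
    # drive the fitted spread toward zero - a miniature model collapse.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=10)  # scarce "human" data
    for gen in range(1, 51):
        mu, sigma = data.mean(), data.std()
        data = rng.normal(mu, sigma, size=10)  # next gen trains on model output
        if gen % 10 == 0:
            print(f"gen {gen:2d}: sigma ~ {sigma:.3f}")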


> The new text and image generative models can now be used to synthesize training datasets.

Only with heavy curation [0]; otherwise, your new models will be trained on progressively worse data than earlier models.


This is an interesting post (and an interesting reminder that even Bitcoin maximalists had other things on their minds in 2015).

I would argue that the first step of the maze made a ton of sense for the voice-recognition/image-classification/driving use cases of 2015, which had binary outcomes. But nowadays, what would it even mean for an LLM to be right 80% of the time? 8/10 words predicted correctly? Speaking correctly on 80% of topics?

The reason people are so jazzed about generative AI is that it's not autonomously doing a task - it's helping a human operator by making (sometimes very useful) guesses on their behalf. It's much more of a tool than a solution (even if a lot of people want it to be a solution).


8 out of 10 is pretty darn good though, right? Then again, a 9-year-old is probably right 80% of the time, and a calculator 99%, so we need to compare it to other similar tech. I don't know what to compare it to, though. Maybe current product-suggestion engines, since that's basically what this "AI" is.


The linked document from Balaji's startup engineering course is extremely useful:

https://spark-public.s3.amazonaws.com/startup/lecture_slides...



The Coursera link is broken, unfortunately.


Chris is an absolute grifter - I read his book on web3 and was so underwhelmed by his depth of thought.

Better to ignore folks who have no experience building cutting-edge products - he's just an average philosopher turned VC because it pays more.


He's launching a new book on web3 soon, isn't he?

Wondering if you're talking about that new one or a previous one.


How much has changed since 2015. With NLP and ML it used to be so hard to create high-quality datasets. It was a case of rubbish in, rubbish out. Now LLMs have solved that problem. It seems that if you put a huge amount of data into a big enough model, one emergent ability is discerning the wheat from the chaff. Certainly in the NLP space, the days of crowd-sourced datasets seem to be over, replaced by few-shot learning. So much value has been unlocked.
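
For concreteness, a sketch of the few-shot replacement; the prompt and examples are mine and purely illustrative:

    # Sketch: a few-shot prompt carries the "dataset" inline, so no
    # crowdsourced labels are needed before calling an LLM.
    FEW_SHOT = """Label each movie review as positive or negative.

    Review: Two hours of my life I'll never get back. -> negative
    Review: A charming, funny, heartfelt film. -> positive
    Review: {review} ->"""

    def build_prompt(review: str) -> str:
        return FEW_SHOT.format(review=review)

    print(build_prompt("The pacing drags, but the ending lands."))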


There's an interesting dark side to this as well: in 2023, when you think you are crowdsourcing data, you may actually just be farming it out to ChatGPT. A lot of turkers just turn around and use an LLM!


Which is, of course, absolutely terrible for the quality of the dataset you're trying to produce.


I think the author's point broadly still holds -- you can get further with more engineering resources and data, whether you're using 2015-era models or 2023 retrieval-augmented LLMs and fine-tuning. It's just that now you can accomplish a lot more, quickly, with a ChatGPT prompt.


Interesting read. I'd argue the most successful AI-based products are the ones that settle for 80-90% accuracy and “Create a fault-tolerant UX.”

Then, the question becomes: how to create a great fault-tolerant UX?

There are some nice recent cases... GitHub Copilot is one...


It's amazing how "AI" now exclusively means LLMs.


Not super on point, but wow, that boilerplate at the end really goes above and beyond in saying "this is my personal blog, and just, like, my opinion man."


I suspect it's due to his writing about crypto, where the legal/regulatory risks around securities and financial advice can be high.



