Introducing deep research

gorgoiler · 2025-02-03T08:29:48 1738571388

For “deep research” I’m also reading “getting the answers right”.

Most people I talk to are at the point now where getting completely incorrect answers 10% of the time — either obviously wrong from common sense, or because the answers are self contradictory — undermines a lot of trust in any kind of interaction. Other than double checking something you already know, language models aren’t large enough to actually know everything. They can only sound like they do.

What I’m looking for is therefore not just the correct answer, but the correct answer in an amount of time that’s faster than it would take me to research the answer myself, and also faster than it takes me to verify the answer given by the machine.

It’s one thing to ask a pupil to answer an exam paper to which you know the answers. It’s a whole next level to have it answer questions to which you don’t know the answers, and on whose answers you are relying to be correct.

michaelgiba · 2025-02-03T03:03:47 1738551827

Gemini has had this for a month or two, also named "Deep Research" https://blog.google/products/gemini/google-gemini-deep-resea...

Meta question: what's with all of the naming overlap in the AI world? Triton (Nvidia, OpenAI) and Gro{k,q} (X.ai, groq, OpenAI) all come to mind

shihab · 2025-02-03T05:15:55 1738559755

From the creator of Triton (OpenAI)-

"PS: The name Triton was coined in mid-2019 when I released my PhD paper on the subject. I chose not to rename the project when the "TensorRT Inference Server" was rebranded as "Triton Inference Server" a year later since it's the only thing that ties my helpful PhD advisors to the project."

samplatt · 2025-02-03T04:11:55 1738555915

>Meta question

I think you have to prefix the query with "@Meta AI", hope this helps

toomim · 2025-02-03T04:58:02 1738558682

John Stewart had something to say about this: https://youtu.be/Byg8VZdKK88?si=pX1WbtRwZCBGpwHS&t=141

justaj · 2025-02-03T08:30:27 1738571427

Without the tracking bits: https://youtu.be/Byg8VZdKK88#t=141

james_promoted · 2025-02-03T03:21:28 1738552888

I've always thought the Triton situation was intentional since the name isn't generic and because the companies are stepping on each others toes here (Nvidia's Triton simplifying owning your inference; OpenAI's Triton eroding the need for familiarity with CUDA). I couldn't figure out who publicly used the name first though.

stonogo · 2025-02-03T03:35:51 1738553751

It's a sort of unofficial trade association where they coalesce on specific redefinitions of terms to meet their sales and PR efforts. First they came for "intelligence," then "open source," then "reason," and it will continue. Any word which the PR wants but they can't achieve gets redefined -- "grok" is a perfect example, since in the original sci-fi book it meant "total understanding." The mythological Triton ruled the deeps, so the "deep learning" sales copy immediately co-opted it.

albert_e · 2025-02-03T05:47:46 1738561666

Also "accuracy" as a measure of model's performance used to mean something objective in the traditional ML world.

Now with LLMs it is what human evaluators feel about the LLM output?

yorwba · 2025-02-03T07:15:17 1738566917

Traditional ML is no stranger to measuring accuracy in terms of agreement with human evaluators.

albert_e · 2025-02-03T08:00:12 1738569612

A customer churn model or revenue forecast did have hard objective data (ground truth) to compare against - isn't it?

DigitalSea · 2025-02-03T00:27:54 1738542474

Not sure if people picked up on it, but this is being powered by the unreleased o3 model. Which might explain why it leaps ahead in benchmarks considerably and aligns with the claims o3 is too expensive to release publicly. Seems to be quite an impressive model and the leading out of Google, DeepSeek and Perplexity.

lordofgibbons · 2025-02-03T02:30:58 1738549858

> Which might explain why it leaps ahead in benchmarks considerably and aligns with the claims o3 is too expensive to release publicly

It's the only tool/system (I won't call it an LLM) in their released benchmarks that has access to tools and the web. So, I'd wager the performance gains are strictly due to that.

If an LLM (o3) is too expensive to be released to the public, why would you use it in a tool that has to make hundreds of inference calls to it to answer a single question? You'd use a much cheaper model. Most likely o3-mini or o1-mini combined with o4-mini for some tasks.

xbmcuser · 2025-02-03T00:31:33 1738542693

It was expensive as they wanted to charge more for it but deepseek has forced their hand

willy_k · 2025-02-03T01:29:53 1738546193

They’ve only released o3-mini, which is a powerful model but not the full o3 that is being claimed as too expensive to release. That being said, DeepSeek for sure forced their hand to release o3-mini to the public.

shawabawa3 · 2025-02-03T05:44:08 1738561448

o3 mini was previewed in December. Deepseek maybe made them release it a few weeks early but it was already on its way

sdesol · 2025-02-03T08:10:00 1738570200

I guess the question is, did DeepSeek force them to rethink pricing? It's crazy how much cheaper it (v3 and R1) is, but considering they (Deepseek) can't keep up with demand, the price is kind of moot right now. I really do hope they get the hardware to support the API again. The v3 and R1 models that are hosted by others are still cheap compared to the incumbents, but nothing can compete with DeepSeek on price and performance.

kandesbunzler · 2025-02-03T08:10:04 1738570204

no they didn't, this was literally all announced in December with a release date for January

Sparkyte · 2025-02-03T00:45:27 1738543527

Rightfully so, some models are getting super efficient.

bitshiftfaced · 2025-02-03T04:05:27 1738555527

> but this is being powered by the unreleased o3 model

What makes you believe that?

_bin_ · 2025-02-03T04:15:09 1738556109

they explicitly stated it in the launch

bitshiftfaced · 2025-02-03T04:34:45 1738557285

The linked article says,

> Powered by a version of the upcoming OpenAI o3 model that’s optimized for web browsing and data analysis, it leverages reasoning to search, interpret, and analyze massive amounts of text, images, and PDFs on the internet, pivoting as needed in reaction to information it encounters.

If that's what you're referring to, then it doesn't seem that "explicit" to me. For example, how do we know that it doesn't use less thinking than o3-mini? Google's version of deep research uses their "not cutting edge version" 1.5 model, after all. Are you referring to something else?

golol · 2025-02-03T08:34:49 1738571689

o3-mini is not really "a version of the o3 model", it is a different model (less parameters). So their language strongly suggests, imo, that Deep Research is powered by a model with the same number of parameters as o3.

ai-christianson · 2025-02-03T00:49:32 1738543772

Has anyone here tried it out yet?

nycdatasci · 2025-02-03T04:08:39 1738555719

Pro user. No access like everyone else.

OpenAI is very much in an existential crisis and their poor execution is not helping their cause. Operator or “deep research” should be able to assume the role of a Pro user, run a quick test, and reliably report on whether this is working before the press release right?

maroonblazer · 2025-02-03T01:54:39 1738547679

Per the below, seems it's not available to many yet.

https://news.ycombinator.com/item?id=42913575

mistercheph · 2025-02-03T01:34:41 1738546481

I'm sure o3 will be a generation ahead of whatever deepseek, google and meta are doing today when it launches in 10 months, super impressive stuff.

petesergeant · 2025-02-03T02:53:59 1738551239

I’m not sure if you’re implying this subtly in your comment or not, as it’s early here, but it does of course need to be a generation ahead of what 10 months of their competitors moving forward have done too. Nobody is standing still

bruce511 · 2025-02-03T04:34:48 1738557288

I read a fair amount of sarcasm in the parent comment ;)

bbor · 2025-02-03T00:43:31 1738543411

Interesting, thanks for highlighting! Did not pick up on that. Re:"leading", tho:

Effectiveness in this task environment is well beyond the specific model involved, no? Plus they'd be fools (IMHO) to only use one size of model for each step in a research task -- sure, o3 might be an advantage when synthesizing a final answer or choosing between conflicting sources, but there are many, many steps required to get to that point.

xendipity · 2025-02-03T04:47:23 1738558043

I don't believe we have any indication that the big offerings (claude.ai, Gemini, operator, tasks, canvas, chatgpt) use multiple models in one call (other than for different modalities like having Gemini create an image). It seems to actually be very difficult technically and I'm curious as to why.

I wonder how much of an impact our being still so early in the productization phase of this all is. Like it takes a ton of work and training and coordination to get multiple models synced up into an offering and I think the companies are still optimizing for getting new ideas out there rather truly optimizing them.

someothherguyy · 2025-02-03T07:59:57 1738569597

...or its all a farce, for now.

hi_hi · 2025-02-03T00:47:09 1738543629

This is terrifying. Even though they acknowledge the issues with hallucinations/errors, that is going to be completely overlooked by everyone using this, and then injecting the outputs into their own powerpoints.

Management Consulting was bad enough before the ability to mass produce these graphs and stats on a whim. At least there was some understanding behind the scenes of where the numbers came from, and sources would/could be provided.

The more powerful these tools become, the more prevelant this effect of seepage will become.

autoconfig · 2025-02-03T01:01:49 1738544509

Either you care about being correct or you don't. If you don't care then it doesn't matter whether you made it up or the AI did. If you care then you'll fact check before publishing. I don't see why this changes.

azinman2 · 2025-02-03T02:00:40 1738548040

When things are easy, you’re going to take the easy path even if it means quality goes down. It’s about trade offs. If you had to do it yourself, perhaps quality would have been higher because you had no other choice.

Lots of kids don’t want to do homework. That said, previously many would because there wasn’t another choice. But now they can just ask ChatGPT for the answers they’ll write that down verbatim with zero learning taking place.

Caring isn’t a binary thing or works in isolation.

simonw · 2025-02-03T03:46:03 1738554363

"Lots of kids don’t want to do homework"

Sure, but if you're a professional you have to care about your reputation. Presenting hallucinated cases from ChatGPT didn't go very well for that lawyer: https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-...

PeterStuer · 2025-02-03T06:36:58 1738564618

That's a lawyer in an adverserial situation. Business consultants tell their clients what they want to believe, the facts be dammed.

rsanek · 2025-02-03T07:13:09 1738566789

it sounds like ai doesn't really change that situation

asimpletune · 2025-02-03T07:57:07 1738569427

But the point is it does if you count making it worse changing the situation.

financypants · 2025-02-03T05:16:12 1738559772

what about tests?

hi_hi · 2025-02-03T01:19:43 1738545583

Because maybe you want to, but you have a boss breathing down your neck and KPIs to meet and you haven't slept properly in days and just need a win, so you get the AI to put together some impressive looking graphs and stats that will look impressive in that client showcase thats due in a few hours.

Things aren't quite so black and white in reality.

dauhak · 2025-02-03T02:07:38 1738548458

I mean those same conditions already just lead the human to cutting corners and making stuff up themselves. You're describing the problem where bad incentives/conditions lead to sloppy work, that happens with or without AI

Catching errors/validating work is obviously a different process when they're coming from an AI vs a human, but I don't see how it's fundamentally that different here. If the outputs are heavily cited then that might go someway into being able to more easily catch and correct slip-ups

tikhonj · 2025-02-03T05:30:54 1738560654

Making it easier and cheaper to cut corners and make stuff up will result in more cut corners and more made up stuff. That's not good.

Same problem I have with code models, honestly. We already have way too much boilerplate and bad code; machines to generate more boilerplate and bad code aren't going to help.

mquander · 2025-02-03T05:36:35 1738560995

The technology also makes it easier and cheaper to make good things, so the direction of the outcome isn't guaranteed.

hi_hi · 2025-02-03T02:33:57 1738550037

Yep, I agree with this to some extent, but I think the difference in the future is all that stress will be bypassed and people will reach for the AI from the start.

Previously there was alot of stress/pressure which might or might not have led to sloppy work (some consultants are of a high quality). With this, there will be no stress which will (always?) lead to sloppy work. Perhaps there's an argument for the high quality consultants using the tools to produce accurate and high quality work. There will obviously be a sliding scale here. Time will tell.

I'd wager the end result will be sloppy work, at scale :-)

spaceywilly · 2025-02-03T01:10:38 1738545038

I think a lot about how differentiating facts and quality content is like differentiating signal from noise in electronics. The signal to noise ratio on many online platforms was already quite low. Tools like this will absolutely add more noise, and arguably the nature of the tools themselves make it harder to separate the noise.

I think this is a real problem for these AI tools. If you can’t separate the signal from the noise, it doesn’t provide any real value, like an out of range FM radio station.

WOTERMEON · 2025-02-03T01:37:21 1738546641

Not only that: by publishing noise, you’re lowering the signal/noise ratio.

n4r9 · 2025-02-03T07:30:27 1738567827

It's a bit like saying "my kids are going to hit themselves anyway, so it doesn't matter if I give them foam rods or metal rods".

layer8 · 2025-02-03T01:19:42 1738545582

People are much less scrupulous using LLM output than making up stuff themselves, because then they can blame the LLM.

RainyDayTmrw · 2025-02-03T04:53:59 1738558439

It's possible that you care, but the person next to you doesn't, and external pressures force you to keep up with the person who's willing to shovel AI slop. Most of us don't have a complete luxury of the moral high ground at our jobs.

doomroot · 2025-02-03T05:00:07 1738558807

It looks like the moral high just came more in demand.

ADeerAppeared · 2025-02-03T02:22:44 1738549364

> If you care then you'll fact check before publishing.

Doing a proper fact check is as much work as doing the entire research by hand, and therefore, this system is useless to anyone who cares about the result being correct.

> I don't see why this changes.

And because of the above this system should not exist.

sbarre · 2025-02-03T01:55:20 1738547720

How hard it is to produce credible-looking bullshit makes a really big difference in these scenarios.

Consultants aren't the ones doing the fact-checking, that falls to the client, who ironically tend to assume the consultants did it.

anthonyshort · 2025-02-03T06:34:21 1738564461

Then the hallucinated research is published in an article which is then cited by other AI research, continuing the push the false information until it’s hard to know where the lie started.

_bin_ · 2025-02-03T04:16:45 1738556205

let's be real for a sec, i've done consulting and have a lot of friends who still do. three times in four, your mckinsey report isn't super well-founded in reality and involves a lot of guesstimation.

scarab92 · 2025-02-03T01:02:53 1738544573

Think of it like a vaccine.

The majority of human written consultant reports are already complete rubbish. Low accuracy, low signal-to-noise, generic platitudes in a quantity-over-quality format.

LLMs are innoculating people to this kind of low information value content.

People who produce LLM quality output, are now being accused of using LLMs, and can no longer pretend to be adding value.

The result of this is going to be higher quality expectations from consultants and a shaking out of people who produce word vommit rather than accurate, insightful, contextually relevent information.

DrSiemer · 2025-02-03T07:47:56 1738568876

Exactly what will happen with art. The tolerance for low quality output will decrease.

layer8 · 2025-02-03T01:24:41 1738545881

This has been downvoted, but I think there’s actually a chance it might become true (until AGI comes along at least).

n144q · 2025-02-03T05:23:25 1738560205

I think that ship has sailed many years ago since Facebook allowed false information to spread freely on their site (if not earlier).

tmnvdb · 2025-02-03T00:54:05 1738544045

> At least there was some understanding behind the scenes of where the numbers came from, and sources would/could be provided.

Oh Sweet summer child.

elashri · 2025-02-03T04:45:17 1738557917

It is actually interesting for people working in academia. I would like to test it but no way I can afford $200/m right now.

Can someone test it with this prompt.

"As a research assistant with comprehensive knowledge of particle physics, please provide a detailed analysis of next-generation particle collider projects currently under consideration by the international physics community.

The analysis should encompass the major proposed projects, including the Future Circular Collider (FCC) at CERN, International Linear Collider (ILC), Compact Linear Collider (CLIC), various Muon Collider proposals, and any other significant projects as of 2024.

For each proposal, examine the planned energy ranges and collision types, estimated timeline for construction and operation, technical advantages and challenges, approximate costs, and key physics goals. Include information about current technical design reports, feasibility studies, and the level of international support and collaboration.

Present a thorough comparative analysis that addresses technical feasibility, cost-benefit considerations, scientific potential for new physics discoveries, timeline to first data collection, infrastructure requirements, and environmental impact. The projects should be compared in terms of their relative strengths, weaknesses, and potential contributions to advancing our understanding of fundamental physics.

Please format the response as a structured technical summary suitable for presentation at a topical meeting of particle physicists. Where appropriate, incorporate relevant figures and tables to facilitate clear comparisons between proposals. Base your analysis on information from peer-reviewed sources and official design reports, focusing on the most current available data and design specifications.

Consider the long-term implications of each proposal, including potential upgrade paths, flexibility for future modifications, and integration with existing research infrastructure."

sagarpatil · 2025-02-03T04:47:21 1738558041

I’ll do it. AFL right now.

spyckie2 · 2025-02-03T01:30:41 1738546241

Is this ability really a prerequisite to AGI and ASI?

Reasoning, problem solving, research validation - at the fundamental outset it is all refinement thinking.

Research is one of those areas where I remain skeptical it is that important because the only valid proof is in the execution outcome, not the compiled answer.

For instance you can research all you want about the best vacuum on the internet but until you try it out yourself you are going to be caught in between marketing, fake reviews, influencers, etc. maybe the science fields are shielded from this (by being boring) but imagine medical pharmas realizing that they can get whatever paper to say whatever by flooding the internet with their curated blog articles containing advanced medical “research findings”. At some point you cannot trust the internet at all and I imagine that might be soon.

I worry especially with the rapidly changing landscape of the amount of generated text in the internet that research will lose a lot of value due to massive amounts of information garbage.

It will be a thing we used to do when the internet was still “real”.

BeetleB · 2025-02-03T04:26:51 1738556811

> For instance you can research all you want about the best vacuum on the internet but until you try it out yourself you are going to be caught in between marketing, fake reviews, influencers, etc.

So you wouldn't use this tool for those types of use cases.

But still, a valid point. I recall I once wanted to compare Hydroflask, Klean Kanteen and Thermos to see how they perform for hot/cold drinks. I was looking specifically for articles/posts where people had performed actual measurements. But those were very hard to find, with almost all Google hits being generic comparisons with no hard data. That didn't stop them from ranking ("Hydroflask is better for warm drinks!")

Would I be able to get this to ignore all of those and use only ones where actual experiments were performed. And moreover, filter out duplicates (e.g. one guy does an experiment, and several other bloggers link to his post and repeat his findings in their own posts - it's one experiment but with many search results).

simonw · 2025-02-03T03:49:02 1738554542

> Is this ability really a prerequisite to AGI and ASI?

That depends entirely on how you choose to define "AGI".

observationist · 2025-02-03T02:03:14 1738548194

It's a direction in a vast landscape, not a feature of itself - being better at different tasks, like search generally, and research in conjunction with reasoning, gets the model closer to AGI. An AGI will be able to do these tasks - so the point of the research is to have more Venn diagrams of capabilities like these to help narrow down the view on things that might actually be fundamental mechanisms involved in AGI.

Moravec detailed the idea of a landscape of human capabilities slowly being submerged by AI capabilities, and the point at which AI can do anything a human can, in practice or in principle, we'll know for certain we've reached truly general AI. This idea includes things like feeling pain and pleasure, planning, complex social, oral, and ethical dynamics, and anything else you can possibly think of as relevant to human intelligence. Deep Research is just another island being slowly submerged by the relentless and relentlessly accelerating flood.

numba888 · 2025-02-03T02:14:34 1738548874

> hings like feeling pain and pleasure

can machine feel? without that there is no AGI according to definition above.

and the second question: are animals "GI"? they don't have language and don't solve math problems, never heard of np-complete.

xwolfi · 2025-02-03T03:28:13 1738553293

Are we not machines anyway ? Ofc a machine can feel, just need to have priorities that are aligned to itself, and use strong feedback when that self is either in danger or on the right path to preservation...

Feelings are nothing very special you know...

YmiYugy · 2025-02-03T00:33:04 1738542784

If I understood the graphs correctly, it only achieves 20% pass rate on their internal tests. So I have to wait 30min and pay a lot of money just to sift through walls of most likely incorrect text? Unless the possibility of hallucinations is negligible, this is just way too much content to review at once. The process probably needs to be a lot more iterative.

itkovian_ · 2025-02-03T01:22:29 1738545749

Here's an example of the type of question it is acheiving 20% on;

The set of natural transformations between two functors F,G ⁣:C→DF,G:C→D can be expressed as the end Nat(F,G)≅∫AHomD(F(A),G(A)). Nat(F,G)≅∫A HomD (F(A),G(A)).

Define set of natural cotransformations from FF to GG to be the coend CoNat(F,G)≅∫AHomD(F(A),G(A)). CoNat(F,G)≅∫AHomD (F(A),G(A)).

Let: - F=B∙(Σ4)∗/F=B∙ (Σ4 )∗/ be the under ∞∞-category of the nerve of the delooping of the symmetric group Σ4Σ4 on 4 letters under the unique 00-simplex ∗∗ of B∙Σ4B∙ Σ4 . - G=B∙(Σ7)∗/G=B∙ (Σ7 )∗/ be the under ∞∞-category nerve of the delooping of the symmetric group Σ7Σ7 on 7 letters under the unique 00-simplex ∗∗ of B∙Σ7B∙ Σ7 .

How many natural cotransformations are there between FF and GG?

slaterbug · 2025-02-03T06:17:33 1738563453

As someone who doesn't understand anything beyond the word 'set' in that question, can anyone give an indication of how hard of a problem that actually is (within that domain)?

Also I'm curious as to what percentage of the questions in this benchmark are of this type / difficulty, vs the seemingly much easier example of "In Greek mythology, who was Jason's maternal great-grandfather?".

I'd imagine the latter is much easier for an LLM, and almost trivial for any LLM with access to external sources (such as deep research).

baal80spam · 2025-02-03T07:51:35 1738569095

That's easy Dave: 42.

Davidzheng · 2025-02-03T03:17:37 1738552657

btw isn't this question at least really badly worded (and maybe incorrect?) the definitions they give for F and G are categories not functors... (and both categories are in fact one object with contractible space of morphisms...)

perching_aix · 2025-02-03T04:27:04 1738556824

It's very interesting to think about what kind of "mental model" might it have, if it's capable of "understanding" all this (to me) gibberish, but is then unable to actually work the problem.

brokensegue · 2025-02-03T00:46:49 1738543609

26.6% on humanity's last exam is actually impressive.

pass rate really only matters in context of the difficulty of the tasks

tmnvdb · 2025-02-03T00:36:43 1738543003

Only if you are asking questions at the level of a cutting edge benchmark

rvnx · 2025-02-03T00:50:58 1738543858

This is one of the actual questions:

> In Greek mythology, who was Jason's maternal great-grandfather?

https://www.google.com/search?q=In+Greek+mythology%2C+who+wa...

johnfn · 2025-02-03T04:35:22 1738557322

Did you intentionally flip through all the questions to find the one that seemed the easiest? If so, why? That's question #7, and all other 7 questions in the sample set seem ridiculously difficult to me.

elicksaur · 2025-02-03T01:08:03 1738544883

In Greek mythology, Jason's maternal great-grandfather was Einstein.

pama · 2025-02-03T02:10:18 1738548618

No it is not an actual question on this exam. From the paper: “To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable, and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, though contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (e.g., precise historical details, trivia, local customs) and have specific, unambiguous answers…”. (Emphasis mine)

Neynt · 2025-02-03T03:03:30 1738551810

It's example #7 on https://lastexam.ai/

tmnvdb · 2025-02-03T01:07:43 1738544863

This is a hard question for language models since it targets one of their known weaknesses.

andyg_blog · 2025-02-03T01:21:24 1738545684

Greek mythology? But seriously please elaborate for my less educated self.

_bin_ · 2025-02-03T04:21:54 1738556514

it tests syllogistic reasoning: Jason's mother was Tyro, whose father was Poesidon, whose father was Kronos. it also tests whether it "eagerly" rather than comprehensively considers something: a maternal great-grandfather could be the father of either one's maternal grandmother or maternal grandfather. so the answer could also be king Aeolus of the Etruscans.

ideally a model would be able to answer this accurately and completely.

nimithryn · 2025-02-03T04:46:30 1738557990

I think there are more possible answers? Jason's mother differs depending on the author...

For example, Jason's mother was Philonis, daughter of Mestra, daughter of Daedalion, son of Hesporos. So Jason's maternal great-grandfather was Hesporos.

tmnvdb · 2025-02-03T01:51:27 1738547487

LLMs often don't do well on tasks that require composition into smaller subtasks. In this case there is a chain of relations that depend on the previous result.

layer8 · 2025-02-03T01:15:45 1738545345

Users don’t care about how hard something is for LLMs if they receive incorrect output.

11101010001100 · 2025-02-03T01:23:51 1738545831

It's categorically more than a weakness.

roenxi · 2025-02-03T01:00:24 1738544424

Maybe. Not enough data to say. Say it does a days worth of work in a query. It is sensible to use if it takes less than a day to review ~5 days worth of work. I don't know if we're near that threshold yet but conceptually this would work well for actual research where the amount of preparation is large compared to the amount of output written.

And eyeballing the benchmarks, it'll probably reach a >50% rate per query by the end of the year. Seems to double every model or two.

dyauspitr · 2025-02-03T03:28:34 1738553314

On questions even specialists in that field can’t answer correctly.

spyckie2 · 2025-02-03T00:59:07 1738544347

I mean you want it to grill your steak and eat it for you too?

I mean I too can complain that my iPhone doesn’t automatically screen out spammers and send my mom flowers on Mother’s Day.

scarab92 · 2025-02-03T01:13:25 1738545205

Why doesn't the iPhone screen spammers yet? Pixel has had this feature for a decade.

senordevnyc · 2025-02-03T08:07:54 1738570074

Pixel hasn’t even been around for a decade.

anon373839 · 2025-02-03T08:21:36 1738570896

Setting aside how well it works, I think this is a pretty nice demonstration of how to do UX for an agentic RAG app. I like that the intermediate steps have been pushed out to a sidebar, with updates that both provide some transparency about the process and make the high latency more palatable.

6gvONxR4sf7o · 2025-02-03T02:05:31 1738548331

There are some people in the blogosphere who are known experts in their niche or even niche-famous because they write popular useful stuff. And there are a ton more people who write useful stuff because they want that 'exposure.' At least, they do in the very broadest sense of writing it for another human to read it. I wonder if these people will keep writing when their readership is all bots. Dead internet here we come.

seanmcdirmid · 2025-02-03T02:11:29 1738548689

I'm all for writing just for the bots, if I can figure it out. A lot of academic papers aren't really read anyways, just briefly glanced at so they can be cited together, large publications like journal pubs or dissertations even less so. But the ability to add to a world of knowledge that is very easy to access by people who want to use it...that is very appealing to me as an author. No more trudging through a bunch of papers with titles that might be relevant to what I want to know about...and no more trudging through my papers, I'm OK with that.

lmm · 2025-02-03T06:04:54 1738562694

Of course they will. Loads of people go around taking hundreds of photos with the biggest camera they can afford even though no-one else will ever willingly look at them.

adriand · 2025-02-03T00:37:24 1738543044

Feels like only a matter of time before these crawlers are blocked from large swathes of the internet. I understand that they’re already prohibited from Reddit and YouTube. If that spreads, this approach might be in trouble.

felindev · 2025-02-03T08:27:13 1738571233

While people might attempt that, it's going to be an arms race, just like ads vs adblocks. There's already multiple crawlers that present fake user-agent when their original one is blocked. Temptation of more data is just to irresistible to them

scarab92 · 2025-02-03T00:52:42 1738543962

I doubt those crawler rules will be honoured for long.

I wouldn’t even be surprised if a law is passed requiring sites to provide equal access to humans whether accessed directly or via these models.

It’s too important an innovation to stall, especially considering the US’s competitors (China) won’t respect robots.txt either.

crazylogger · 2025-02-03T00:55:05 1738544105

This is trivially bypassed by OpenAI asking the user to take control of their computer (or a sandboxed browser within it,) then for all intents and purposes it’s the user themselves accessing your site (with some productivity/accessibility aid from OAI.)

cj · 2025-02-03T00:53:23 1738544003

Anyone selling anything would want to remain crawlable if people use this to research something that could lead to a purchase.

reaperman · 2025-02-03T01:19:14 1738545554

Not necessarily. Southwest airlines doesnt allow itself on price comparison sites or Google Flights.

Amazon listings are blocked from google shopping and other price comparison sites.

rsanek · 2025-02-03T07:19:57 1738567197

they finally bowed mid last year https://www.nerdwallet.com/article/travel/southwest-google-f...

shlomo_z · 2025-02-03T03:20:04 1738552804

Your point is completely valid, but... Southwest now has an arrangement with Google Flights to allow their listings there.

drcode · 2025-02-03T00:43:11 1738543391

I suppose there is an equilibrium, where sites that penalize these types of crawlers will also get less traffic from people reading ai citations, so for many sites the upsides of allowing it will be greater than the downsides.

bbor · 2025-02-03T00:48:09 1738543689

TBF OpenAI in particular bought access to Reddit. Otherwise yeah this is my main confusion with all of these products, Perplexity being the biggest -- how do you get around the status-quo of refusing access to bots? Just to start off with, there is no Google Search API, and they work hard to make sure headless browsers can't access the normal service.

They do say "Currently, deep research can access the open web...", so maybe "open" there implies something significant. Like, "websites that have agreements with OpenAI and/or do not enforce norobot policies".

wahnfrieden · 2025-02-03T01:54:48 1738547688

Client-side browsers that crawl for users (and prompt for logins or captcha as needed) won't be as easily blockable

optimalsolver · 2025-02-03T02:11:55 1738548715

Big Tech Podcast listener?

cye131 · 2025-02-03T00:46:54 1738543614

Does anyone actually have access to this? It says available for pro users on the website today - I have pro via my employer but see no "deep research" option in the message composer.

fosterfriends · 2025-02-03T02:20:47 1738549247

I have pro, in US, not seeing yet

_bin_ · 2025-02-03T04:24:02 1738556642

what about a full refresh of the page or perhaps jump into the dev tools and check "disable cache"

could also be aggressive caching from cloudflare. could be they're just trying to announce more stuff to maintain cachet and can't yet support all users forking over 200/month.

energy123 · 2025-02-03T06:29:44 1738564184

I relogged, disabled cache and reloaded the page with Ctrl+Shift+R but it doesn't show up.

nijaar · 2025-02-03T06:51:28 1738565488

same here. pro in the US and still no access. i even logged in using my phone and a different browser

nycdatasci · 2025-02-03T04:07:18 1738555638

Pro user. No access like everyone else.

OpenAI is very much in an existential crisis and their poor execution is not helping their cause. Operator or “deep research” should be able to assume the role of a Pro user, run a quick test, and reliably report on whether this is working before the press release right?

kandesbunzler · 2025-02-03T08:18:07 1738570687

How many times are you going to post this exact same comment here? Are you a Chinese bot or something?

snewman · 2025-02-03T00:48:29 1738543709

Two different people I know with pro subscriptions report not having access yet.

greatpostman · 2025-02-03T00:52:05 1738543925

Have pro, can’t see it yet

labanimalster · 2025-02-03T04:02:08 1738555328

same here

chachamatcha · 2025-02-03T02:46:19 1738550779

also US based, have pro and still no access.

fizx · 2025-02-03T02:23:29 1738549409

same same

picografix · 2025-02-03T07:05:34 1738566334

I think deep research as a service could be a really strong use case for enterprises, as long as they have access to non-public data. I assume that most of this guarded data is high quality, and seeing progress in these areas might end up being even more impressive than it is now.

auggierose · 2025-02-03T08:28:05 1738571285

The flow reminds me a bit of undermind.ai.

Havoc · 2025-02-03T01:05:50 1738544750

The descriptions of the product sounded substantially more impressive than the actual samples tbh.

Still I think there is a big market for this sort of „go away for 30 mins and figure this out“ style agent

TechDebtDevin · 2025-02-03T03:58:22 1738555102

This is 5-10 years out. What OpenAI is displaying here I've been able to do with relatively little code, a bit of scraping and far less capable models for a year. I really don't see what is novel or useful here.

airstrike · 2025-02-03T03:22:36 1738552956

"Deep research" is now somehow synonymous to searching online for stats and pulling stuff from Statista? And when I want to make changes to that report, do I have to tweak my prompt and get an entirely different document?

Not sure if I'm too tired and can't see it but the lack of images/examples of the resulting report in this announcement doesn't inspire a lot of confidence just yet.

jmount · 2025-02-03T03:06:24 1738551984

I had no idea there was a market for "Compile a research report on how the retail industry has changed in the last 3 years. Use bullets and tables where necessary for clarity." I imagine reading such a result is pure torture.

ejang0 · 2025-02-03T01:49:34 1738547374

Can anyone confirm if this is available in Canada and other countries? This site says "We are still working on bringing access to users in the United Kingdom, Switzerland, and the European Economic Area." But I'm not sure about other countries. I don't have Pro currently, only Plus.

carbocation · 2025-02-03T01:50:09 1738547409

I don't even see it in the US right now.

highfrequency · 2025-02-03T03:26:17 1738553177

Can it compile and run (non-Python) code as part of its tool use? Compile-run steps always seemed like they would be a huge value add during reasoning loops - it feels very silly to get output from ChatGPT, try to run it in terminal, get an error and paste the error to have ChatGPT immediately fix it. Surely it should be able to run code during the reasoning loop itself?

simonw · 2025-02-03T03:47:55 1738554475

It sounds like it can run Python, which means it has access to Code Interpreter, which means it can run various other languages as well if you can convince it to do so.

I've used Code Interpreter to compile and run C code - https://simonwillison.net/2024/Mar/23/building-c-extensions-... - and I've managed to get it to run JavaScript (by uploading a Deno binary) and even Lua and PHP in the past as well: https://til.simonwillison.net/llms/code-interpreter-expansio...

thefourthchime · 2025-02-03T00:37:23 1738543043

OpenAI has a deep bench. I bet they pushed this out to change the narrative about deepseek

btown · 2025-02-03T00:42:07 1738543327

Also named specifically to muddle the SEO for the term "deep." Nothing that OpenAI does is unintentional.

kevlened · 2025-02-03T00:49:52 1738543792

It's more likely this is a response to Gemini Deep Research released in December

https://blog.google/products/gemini/google-gemini-deep-resea...

petra · 2025-02-03T00:58:29 1738544309

That Google product isnt that good, it can't really replace research done by a person.

nicce · 2025-02-03T01:19:27 1738545567

Just one tool in the toolbox. It helps to see if some sources have been missed.

alvah · 2025-02-03T04:40:18 1738557618

It absolutely can replace the research done by one person, for my use case at least. It’s also available on their $20/month subscription, unlike OpenAI’s $200/month.

sadeshmukh · 2025-02-03T05:09:18 1738559358

Nobody was going to hire a researcher for a quick question.

nicce · 2025-02-03T00:55:02 1738544102

Two birds with one stone: timing for Deepseek and feature for Gemini

xnx · 2025-02-03T01:12:44 1738545164

Google publicly announced a model named "Deep Research" on December 11th: https://blog.google/products/gemini/google-gemini-deep-resea...

dougb5 · 2025-02-03T00:52:01 1738543921

Does the naming scheme they've used for models so far suggest that they care about SEO?

leonheld · 2025-02-03T00:48:47 1738543727

Oh God, this is such an astute observation. I think it worked so well on me that I didn't even think about the "deep" portion initially. Goes to show how effective these things are psychologically.

bbor · 2025-02-03T00:44:36 1738543476

I have never believed a conspiracy theory more instantly. Deep Search vs. DeepSeek is way more than enough to confuse the average layman! Especially when you're googling something you heard about at work a few hours ago, or on Bloomberg TV

bonoboTP · 2025-02-03T00:48:00 1738543680

You might as well say that DeepSeek wanted to cause confusion with DeepMind. Deep isn't such a distinguishing name, deep learning has been a buzzword since 2012.

viraptor · 2025-02-03T01:08:41 1738544921

Deepmind is not a consumer product. Gemini is part of it but nobody calls it deepmind.

bonoboTP · 2025-02-03T01:16:35 1738545395

The point is, "deep" is an extremely generic word in the AI space.

rajnathani · 2025-02-03T06:21:11 1738563671

I remember about 10-15 years ago that Ray Kurzweil (who still works at Google) or someone at Google had this idea for what Google should be able to do: About doing deep research by itself with a simple search query. I can't find the source. Obviously it didn't pan out without transformers.

RandomWorker · 2025-02-03T04:26:33 1738556793

I’m a researcher and honestly not worried. 1. Developing the right question has always been the largest barrier to great research. Not sure OpenAI can develop the right question without the Human experience. The second biggest part of my role is influencing people that my questions are the right questions. Which is made easier when you have a thorough understanding of the first. That being said, I’m sure there will be many people here that will tell me that algorithms already influence people, and ai can think through much of any issues there are.

I do use these systems from time to time, but it just never renders any specific information that would make it great research.

RayVR · 2025-02-03T04:45:10 1738557910

100% agree.

These systems serve best at augmenting information discovery. When I'm tackling a new area or looking for the right terminology, these models provide a quick shortcut because they have good probabilistic "understanding" of my naive, jargon-free description. This allows me to pull in all of the jargon for the area of research I'm interested in, and move on to actually useful resources, whether that be journal articles, textbooks, or - rarely - online posts/blogs/videos.

the current "meta" is probably something like Elicit + notebookLM + Claude for accelerating understanding of complex topics and extracting useful parts. But, again, each step requires that I am closely involved, from selecting the "correct" papers, to carefully aggregating and grooming the information pulled in from notebookLM, to judging the the usefulness of Claude's attempts to extract what I have asked for

GeoAtreides · 2025-02-03T07:38:36 1738568316

> Developing the right question has always been the largest barrier to great research.

I thought funding was the biggest barrier to great research

rob_c · 2025-02-03T08:28:21 1738571301

Feels more and more like openAI doesn't have "that next big thing".

To be clear I'm constantly impressed with what they have and what I get as a customer, but the delivery since 4 hasn't exactly been in line with Altman's Musk-tier vapoware promises...

pazimzadeh · 2025-02-03T03:03:53 1738551833

> In Nature journal's Scientific Reports conference proceedings from 2012, in the article that did not mention plasmons or plasmonics, what nano-compound is studied?

Aren't there more than one articles that did not mention plasmons or plasmonics in Scientific Reports in 2012?

Also, did they pay for access to all journal contents? that would be useful

nicce · 2025-02-03T03:06:31 1738551991

Maybe that is the only one with open access

usaar333 · 2025-02-03T00:30:37 1738542637

Overall impressive.

Though, the jump for Gaia relative to SOTA is relatively not that high. Especially given that this is o3

VerdisQuo5678 · 2025-02-03T00:54:59 1738544099

The accuracy of this tool does not matter. This is exclusively designed for box ticking "reports" that nobody reads and a produced for the sake of itself.

tomrod · 2025-02-03T01:24:33 1738545873

The new term for this is "AI Loopidity", highlighting the unintelligent ouroboros nature of one side using AI to generate content and then another side to consume content.

sockaddr · 2025-02-03T03:46:29 1738554389

Similar to “Bullshit jobs”

All the AI commercials are designed to appeal to people that don’t produce any actual value but haven’t been detected by the system yet.

Need to send email to boss? Press magic button! Job well done, idiot.

Someone send you big scary email? Press magic button! Good job dummy!

Someone wants to go eat some Italian with you, push magic button for totally not-ad result. Enjoy your Olive Garden, moron.

rsanek · 2025-02-03T07:24:30 1738567470

i think the apple ads are the poster child here. hopefully we can see more inventive ones than just serving lazy people.

arbywhy · 2025-02-03T00:58:49 1738544329

99% of corpo upper management slide deck work. ai only makes more of this useless pencil-neck board of directors slop.

reaperman · 2025-02-03T01:25:14 1738545914

“Pencil-neck” is a strange insult to use here. How are software developers, or hardware design engineers, or finance workers any less “pencil-neck” than “board of directors”?

pjs_ · 2025-02-03T01:12:49 1738545169

McKinsey mode

rsanek · 2025-02-03T07:27:21 1738567641

don't be hyperbolic. deep research would need to help cause an opioid crisis for to get to that level.

TechDebtDevin · 2025-02-03T04:02:49 1738555369

More like high school intern mode.

sockaddr · 2025-02-03T03:48:15 1738554495

wilg · 2025-02-03T00:57:05 1738544225

I think this looks cool. Apparently unlike everyone else on this website?

tmnvdb · 2025-02-03T01:11:47 1738545107

HN is full of people who want to feel smart by complaining.

kenjackson · 2025-02-03T00:23:50 1738542230

If it has access to play by play data for all sports this could be an absolute playground for amateur sports statisticians. The possibilities…

RayVR · 2025-02-03T03:28:14 1738553294

Each release from openAI gives me less hope for them and this whole AI boom. They should be leading the charge of highlighting how the current generation of LLMs fail, not churning out half-baked overhyped products.

Yes, they can do some cool tricks, and tool calling is fun. No one should trust the output of these models, though. The hallucinations are bad, and my experience with the "reasoning" models is that as soon as they fuck up (they always do) they go off the rails worse than the base LLMs.

taran_narat · 2025-02-03T08:32:03 1738571523

isn't this just perplexity?

Bjorkbat · 2025-02-03T02:38:48 1738550328

Actually sounds pretty cool, but the graph on expert level tasks is confusing my expectations. Saying it has a pass rate of less than 20% sounds a lot like saying this thing is wrong most of the time.

Granted, these strike me as difficult tasks and I’d likely ask it to do far simpler things, but I’m not really sure what to expect from looking at these graphs.

Ah, but the fact that it bothers to cite its sources is a huge plus. Between that and its search abilities it sounds valuable to me

random_cynic · 2025-02-03T03:08:09 1738552089

I think that's mostly because of the access to information it has. Much of the highly useful information is not on the public internet or shows up on search engines, only domain experts know about them. Also, the websites may be paywalled or gated by login. So a better comparison would be if the models had the same level of access as an expert.

getnormality · 2025-02-03T03:04:28 1738551868

The demo on global e-commerce trends seems less useful than a Google search, where the AI answer will at least give you links to the claimed information.

xt00 · 2025-02-03T00:42:22 1738543342

"will find, analyze, and synthesize hundreds of online sources"

Synthesize? Seems like the wrong word -- I think they would want to say something like, "analyze, and synthesize useful outputs from hundreds of online sources"..

pjot · 2025-02-03T01:16:29 1738545389

From New Oxford dictionary:

  > combine (a number of things) into a coherent whole: pupils should synthesize the data they have gathered | Darwinian theory has been synthesized with modern genetics.

nicce · 2025-02-03T00:47:53 1738543673

On the other hand, accurate if it is prone for hallucination…

tmnvdb · 2025-02-03T00:52:57 1738543977

You can synthesize the parts to get the whole. Both uses are correct AFAIK

lolpanda · 2025-02-03T04:14:31 1738556071

"synthesize large amounts of online information" does it heavily depend on the search engine performance and relevance of the search results? I don't see any mention of Google or Bing. Is this using their internal search engine then?

jasonjmcghee · 2025-02-03T01:12:11 1738545131

Surprised more comments aren't mentioning deepseek has this feature (for free) already. Assuming this is why OpenAI scrambled to release it.

The examples they have on the page work well on chat.deepseek.com with r1 and search options both enabled.

Do I blindly trust the accuracy of either though? Absolutely not. I'm pretty concerned about these models falling into gaming SEO and finding inaccurate facts and presenting them as fact. (How easy is it to fool / prompt inject these models?)

But has utility if held right.

starchild3001 · 2025-02-03T01:24:43 1738545883

Not really accurate. The "Search" functionality you're describing in DeepSeek is comparable to OpenAI's existing "Search GPT." OpenAI's recent announcement refers to a more advanced capability, similar to Gemini's existing "deep research" feature. DeepSeek's current offerings are significantly more limited in scope.

jasonjmcghee · 2025-02-03T01:36:57 1738546617

Doesn't seem like access is available to try "deep research" yet on OpenAI, so I can only speak to what I tried, which was their examples on the blog post (using DeepSeek w/ R1 + Search) and results were pretty similar.

AFAIK OpenAI's current offering uses 4o, and it does a web search and then pipes it into 4o. I'm guessing adding CoT + other R1/o3 like stuff is one of the key effective differences. But time will tell how different it is. Maybe it's a dramatic improvement.

WiSaGaN · 2025-02-03T01:46:46 1738547206

SearchGPT is bad because its underlying model is not a reasoning one. Deepseek one mentioned above is closer to deep research than searchgpt.

TechDebtDevin · 2025-02-03T03:59:58 1738555198

Are you unaware that there is a "Deepthink (R1)" button right next to the "Search" button on DeepSeek's Chat app. Its been there for some time, even before all the hype regarding R1.

starchild3001 · 2025-02-03T04:13:26 1738556006

I'm well aware of that. That is not what openai calls "deep research".

nicce · 2025-02-03T01:16:54 1738545414

I wish Kagi would work with similar performance. Their lenses feature is perfect for this and they already filter out most of the SEO spam based on trackers and other typical red flags.

gwerbret · 2025-02-03T00:42:03 1738543323

To anyone who's tried it: how does it handle captchas? I can't imagine that OpenAI's IP addresses are anyone's favorites for unfettered access to web properties these days.

layer8 · 2025-02-03T01:29:10 1738546150

And is it smart enough to use archive.today for paywalled articles. ;)

corentin88 · 2025-02-03T06:36:56 1738564616

Curious about the use cases here. Building AI Agents? But which one?

layer8 · 2025-02-03T01:09:31 1738544971

From the demo: “Use bullets and tables where necessary for clarity.” It’s weird that it would be necessary to specify that. I suppose they want to showcase that you can influence the output style, but it’s strange that you’d have to explicitly specify the use of something that is “necessary for clarity”. It comes across as either a flaw in the default execution, or as a merely performative incantation.

esafak · 2025-02-03T01:20:02 1738545602

Is there a benchmark we can compare this against You.com's research mode? It looks like R1 forced them to release o3 prematurely and give it Internet access. And they didn't want to say they released o3 so they called it 'Deep Research'.

resters · 2025-02-03T04:11:13 1738555873

Still not seeing access on my account.

_bin_ · 2025-02-03T04:15:39 1738556139

they're not giving it to us lowly $20/month users yet :( gotta take out a second mortgage and throw them 200/month if you want it now

reader9274 · 2025-02-03T00:39:58 1738543198

I think we're all reaching AI fatigue. Fewer and fewer people care anymore

bonoboTP · 2025-02-03T00:55:35 1738544135

Sure if you're viewing this as some kind of spectator thing, or entertainment, maybe it's less interesting. But it doesn't really matter whether "people care". What matters is whether it's useful and has impact. It's enough if the small number of people use it for whom it is useful. It doesn't matter if the average Joe on the street is excited by it.

Few people care or even know about various advances in various specialized fields. It's enough if AI simply seeps into various applications in boring and non-flashy ways for it to have significant effects that will affect a wider range of people, whether they get hyped by the news announcements or not. Jobs etc.

An analogy: the Internet as such is not very exciting nowadays, certainly not in the way it was exciting in the 90s with all the news segments about surfing the information superhighway or whatever. There was a lot of buzz around the web, but then it got normalized. It didn't disappear, it just got taken for granted. No average person got excited around HTML5 or IPv6. It just chugs along in the background. AI will similarly simply build into the fabric of how things get done. Sometimes visibly to the average person, sometimes just behind the scenes.

InkCanon · 2025-02-03T02:27:49 1738549669

Not sure if it's just me, but it looks like all SOTA companies are doubling down to chase the new benchmark, which beyond hype, doesn't seem to translate into real world uses. Why don't these companies just plug it into a popular git repo and say, hey our AI fixed these 100 issues! Or something real? The only people who seem to be doing something real is DeepMind.

khazhoux · 2025-02-03T05:27:24 1738560444

Incorrect. We are not all reaching AI fatigue.

rvnx · 2025-02-03T00:43:52 1738543432

Especially this is not a breakthrough justifying a 340B USD valuation, but rather the work that junior developers can do; implement a loop of Bing Searches connected to an LLM.

tmnvdb · 2025-02-03T00:47:53 1738543673

Peak HN comment

rvnx · 2025-02-03T00:57:50 1738544270

Doesn't make it untrue.

Agents that can search the internet exist for a while now and have been essentially solved and happily used in platforms like Perplexity.

It's really "meh", very far from revolutionary.

Keep in mind this company is trying to convince everybody they need 500B USD now (through the Stargate project).

alvah · 2025-02-03T04:45:53 1738557953

I haven’t tried the OpenAI version yet, as I’m on their peasant-level $20 plan, but the Google equivalent is way superior to Perplexity (I use both extensively). The web search Perplexity carries out is superficial compared to the Google product; it misses a large percentage of what Gemini Deep Research finds, and for a particular task in my business this makes a huge difference.

spyckie2 · 2025-02-03T01:03:10 1738544590

To go from partially automated to fully automated is thousands of non trivial edge cases and unforeseen decision points that must be tamed.

To say this is trivial is like saying the one shot ai prompted twitter clone is the same thing as twitter.

Peak HN indeed.