Hacker News new | past | comments | ask | show | jobs | submit | juxtaposicion's comments login

Like other comments, I was also initially surprised. But I think the gains are both real and easy to understand where the improvements are coming from.

Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.

The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.

Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.


https://huggingface.co/mattshumer/Reflection-70B says system prompt used is:

   You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.
Also, only "smarter" models can use this flow, according to https://x.com/mattshumer_/status/1831775436420083753

> Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.

They may already implement this technique, we can't know.


Claude 3.5 does have some "thinking" ability - I've seen it pause and even say it was thinking before. Presumably this is just some output it decides not to show you.

THIS!!!!!!! People act like Claude and 4o are base models with no funny business behind the scenes, we don't know just how much additional prompt steps are going on for each queue, all we know is what the API or Chat interface dump out, what is happening behind that is anyones guess.. The thinking step and refinement steps likely do exist on all the major commercial models. It's such a big gain for a minimal expenditure of backend tokens, WTF wouldn't they be doing it to improve the outputs?

Well they can't do a /lot/ of hidden stuff because they have APIs, so you can see the raw output and compare it to the web interface.

But they can do a little.


As if they couldn’t postprocess the api output before they send it to the client…

No, I mean they sell API access and you can query it.

That's only in the web version, it's just that they prompt it to do some CoT in the antThinking XML tag, and hide the output from inside that tag in the UI.

The API does it too for some of their models in some situations.

Interesting, is there any documentation on this or a way to view the thinking?

I am not sure, but you seem to be implying that the Reflection model is running through multiple rounds? If so, that is not what is happening here. The token generation is still linear next token prediction. It does not require multiple rounds to generate the chain of thought response. It does that in one query pass.

I have been testing the model for the last few hours and it does seem to be an improvement on LLAMA 3.1 upon which it is based. I have not tried to compare it to Claude or GPT4o because I don't expect a 70b model to outperform models of that class no matter how good it is. I would happy to be wrong though...


I suspect GPT4o already has training for CoT. I've noticed it often responds by saying something like "let's break it down step by step". Or maybe it's the system prompt.

I had a similar idea[0], interesting to see that it actually works. The faster LLM workloads can be accelerated, the more ‘thinking’ the LLM can do before it emits a final answer.

[0]: https://news.ycombinator.com/item?id=41377042


Further than that, it feels like we could use constrained generation of outputs [0] to force the model to do X amount of output inside of a <thinking> BEFORE writing an <answer> tag. It might not always produce good results, but I'm curious what sort of effect it might have to convince models that they really should stop and think first.

[0]: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...


Can we replicate this in other models without finetuning them ?


Apple infamously adds "DO NOT HALLUCINATE" to its prompts.

Huh ? Source please (this is fascinating)


what's our estimate of the cost to finetune this?

I don't know the cost, but they supposedly did all their work in 3 weeks based on something they said in this video: https://www.youtube.com/watch?v=5_m-kN64Exc

We're building that spreadsheet as a product. I'd love to show you. I'd message you a private link to a prototype but you have no contact info on your profile. If you are interested, can you email or DM me using my profile info?


You also don't have an email on your profile, for what it's worth. I don't use Twitter. I'd be interested in something like this as well.


Oh, thanks! I added my email to my profile. Look forward to replying to your note!


Oh wow, I’ve had that idea for 14yrs but never wanted to start coding it


We’re also building billion-scale pipeline for indexing embeddings. Like the author, most of our pain has been scaling. If you only had to do millions, this whole pipeline would be a 100 LoC. but billions? Our system is at 20k LoC and growing.

The biggest surprise to me here is using Weavite at the scale of billions — my understanding was that this would require tremendous memory requirements (of order a TB in RAM) which are prohibitively expensive (10-50k/m for that much memory).

Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.


Co-author of article here.

Yeah a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, a failure half way through it is imperative to be able to recover.

RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate which has been a drawback from a cost perspective, but that from a performance perspective delivers on the requirements of our customers. (on Weaviate we explored using product quantization. )

What type of performance have you gotten with Lance both on ingestion and retieval? Is disk retrieval fast enough?


Disk retrieval is definitely slower. In-memory retrieval typically can be ~1ms or less, whereas disk retrieval on a fast network drive is 50-100ms. But frankly, for any use case I can think of 50ms of latency is good enough. The best part is that the cost is driven by disk not ram, which means instead of $50k/month for ~TB of RAM you're talking about $1k/mo for a fast NVMe on a fast link. That's 50x cheaper, because disks are 50x cheaper. $50k/mo for an extra 50ms latency is a pretty clear easy tradeoff.


we've been using pgvector at the 100M scale without any major problems so far, but I guess it depends on your specific use case. we've also been using elastic search dense vector fields which also seems to scale well, but of course its pricey but we already have it in our infra so works well.


What type of latency requirements are you dealing with? (i.e. look up time, ingestion time)

Were you using postgres already or migrated data into it?


I'd love to know the answer here too!

I've ran a few tests on pg and retrieving 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances. And that was without a vector index.

Entirely possibly my take was too cursory, would love to know what latencies you're getting bryan0!


> 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances.

Is there a write up of the analysis? Something seems very wrong with that taking 700ms


we have look up latency requirements on the elastic side. on pgvector it is currently a staging and aggregation database so lookup latency not so important. Our requirement right now is that we need to be able to embed and ingest ~100M vectors / day. This we can achieve without any problems now.

For future lookup queries on pgvector, we can almost always pre-filter on an index before the vector search.

yes, we use postgres pretty extensively already.


What size are your embeddings?



What kind of retrieval performance are you observing with Lance?


For a "small" dataset of 50M and 0.5TB in size with 20 results get around 50-100ms.


We use Lance extensively at my startup. This blog post (previously on HN) details nicely why: https://thedataquarry.com/posts/vector-db-4/ but essentially it’s because Lance is a “just a file” in the same way SQLite is a “just a file” which makes it embedded and serverless and straightforward to use locally or in a deployment.


What’re the techniques that’ll get this to run on a single GPU?


Most of the parameters are in the language model (LLaMa-7B). So, they'd pretty much be the same techniques that would let LLaMa run on a single GPU -- especially lower precision tricks. If you only want to run inference/forward (no training), it should be pretty doable.

You can almost definitely run it on consumer GPU if you swap out the language model for something smaller as well (although the performance would definitely not be as good on the language side).


It is exciting that you could train a CLIP-style model from scratch with only 4M datapoints. But if you’ve got that data, why not fine tune a pretrained model with your 4M points? It seems likely to outperform the from-scratch method.


There is not only a difference in the data source but pre-trained tasks as well. But you are right, a fine-tuned models on human-annotated data are way better than zero-shot (just pre-trained) on Image retrieval. And it is correct for CLIP, ALBEF, VICHA, and UFORM.


Any plans to document how to fine tune your models then?


It will take some time, but yes, we have this in our plans.


perhaps this approach can lead to better training of foundational models?..


More efficient - for sure!


How does this compare to Quickwit or other Tantivy-powered engines?


I think quickwit is more tailored towards querying large indexes sitting on S3 than fast queries from a local or in-memory index.

lnx might be similar, I'm not sure. It's very new and I had a bad experience trying it out.


How are these interactive visualizations made? As a senior machine learning engineer (with only rudimentary JS skills) it would be fantastically fun to make something like these.


It's WebGL in a <canvas>, written by hand by the looks of the source - https://ciechanow.ski/js/gps.js



Beautiful. I also checked the archives on that blog and every article is a work of art. I would love to work with this dev!


The author wrote his own WebGL library. If you don't have much knowledge about 3D, then https://threejs.org/ is a fantastic library to learn. It abstracts away much of the tedious part.

Not sure what's the best starting point to learn, but there's lots of videos on YT to help you get started.


Of all the reasons, I was stunned with how much detail went into the work at seeing the little globe/Earth in the satellite orbits section -- the Earth has the weather patterns and clouds running in animation as you spin the globe around!


not what he/she uses, but if you are interested in these kind of things, check out https://cables.gl/ .

it provides you with an in-browser, graphical, node based interface where you can just connect boxes together and it will output js-code ready to implement in your website.

(disclosure: i know the dev plus am a huge fan!)


Cable.gl is excellent.


Someone told me that apparently they aren't made, they are discovered.


Stitch Fix. Data Science Manager.

Come lead a team to build real-time state-of-the-art recommender systems.

See more here: https://docs.google.com/presentation/d/1xCUatmnolfzUzfHR6Wmq...

Apply here: https://www.stitchfix.com/careers/jobs?gh_jid=2211077&gh_jid...


How does this compare to Edward, PyMC, Stan, et al? Is the primary distinction due to PyTorch’s imperative, dynamic programming?


Edward: like Edward, Pyro is a deep probabilistic programming language that focuses on variational inference but supports composable inference algorithms. Pyro aims to be more dynamic (by using PyTorch) and universal (allowing recursion).

PyMC, Stan: Pyro embraces deep neural nets and currently focuses on variational inference. Pyro doesn't do MCMC yet. Whereas Stan models are written in the Stan language, Pyro models are just python programs with pyro.sample() statements.

One unique feature of Pyro is the probabilistic effect library that we use to build inference algorithms: http://docs.pyro.ai/advanced.html


This still doesn't really explain the dfifference to PyMC. What is the advantage of using Pyro over PyMC which supports a multitude of inference algorithms as well as mini-batch advi.


0.1.0 definitely not feature full, but Pyro seems promising.

PyMC3 is fine, but it uses Theano on the backend. Theano will stop being actively maintained in 1 year, and no future features in the mean time. That was announced about a month ago, it seems like a good opportunity to get out something that filled a niche: Probablistic Programming language in python backed by PyTorch. They are taking cues from edward and webppl, which from a casual glance seem to be the best libraries for python and javascript respectively http://edwardlib.org/ http://webppl.org/

But Edward is backed by TensorFlow

That announcement by Theano’s main developer Pascal Lamblin and Yoshua Bengio: https://syncedreview.com/2017/09/29/rip-theano/ https://groups.google.com/forum/#!topic/theano-users/7Poq8BZ...

"Dear users and developers,

After almost ten years of development, we have the regret to announce that we will put an end to our Theano development after the 1.0 release, which is due in the next few weeks. We will continue minimal maintenance to keep it working for one year, but we will stop actively implementing new features. Theano will continue to be available afterwards, as per our engagement towards open source software, but MILA does not commit to spend time on maintenance or support after that time frame. "

https://www.wired.com/2016/12/uber-buys-mysterious-startup-m...

Uber acquired Geometric Intelligence and renamed it Uber AI. From this article:

"But it hasn't published research or offered a product. What is has done is assemble a team of fifteen researchers who can be very useful to Uber, including Stanford professor Noah Goodman, who specializes in cognitive science and a field called probabilistic programming, and University of Wyoming's Jeff Clune, an expert in deep neural networks who has also explored robots that can "heal" themselves."


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: