The most interesting thing about this is the way it was trained using synthetic ...

tkellogg · 2024-12-16T09:32:07 1734341527

I added Phi-4 to my reasoning model collection because it seems to exhibit reasoning behavior, it stopped to consider alternatives before concluding. I assume this is related to their choice in training data:

> Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner.

https://github.com/tkellogg/lrm-reasoning/blob/main/phi4.md

mirekrusin · 2024-12-16T08:32:49 1734337969

This "draw pelican riding on bicycle" is quite deep if you think about it.

Phi is all about synthetic training and prompt -> svg -> render -> evaluate image -> feedback loop feels like ideal fit for synthetic learning.

You can push it quite far with stuff like basic 2d physics etc with plotting scene after N seconds or optics/rays, magnetic force etc.

SVG as LLM window to physical world.

dartos · 2024-12-16T11:56:05 1734350165

> SVG as LLM window to physical world.

What? let’s try not to go full forehead into hype.

SVGs would be an awfully poor analogy for the physical world…

ben_w · 2024-12-16T14:03:28 1734357808

SVGs themselves are just an image format; but because of their vector nature, they could easily be mapped onto values from a simulation in a physics engine — at least, in the game physics sense of the word, rods and springs etc., as a fluid simulation is clearly a better map to raster formats.

If that physics engine were itself a good model for the real world, then you could do simulated evolution to get an end result that is at least as functional as a bike (though perhaps it wouldn't look like a traditional bike) even if the only values available to the LLM were the gross characteristics like overall dimensions and mass.

But I'd say the chance of getting a pelican SVG out of a model like this is mostly related to lots of text describing the anatomy of pelicans, and it would not gain anything from synthetic data.

dartos · 2024-12-16T14:08:19 1734358099

> but because of their vector nature, they could easily be mapped onto values from a simulation in a physics engine.

I don’t think the fact that the images are described with vectors magically makes it better for representing physics than any other image representation. Maybe less so, since there will be so much textual information not related to the physical properties of the object.

What about them makes it easier to map to physics than an AABB?

For soft body physics, im pretty sure a simpler sort of distance field representation would even be better. (I’m not as familiar with soft body as rigid body)

ben_w · 2024-12-16T15:05:01 1734361501

For rendering them, more than for anything else. There's a convenient 1-to-1 mapping in both directions.

You can of course just rasterise the vector for output, it's not like people view these things on oscilloscopes.

perrygeo · 2024-12-17T00:33:22 1734395602

SVGs are just 2D geometries, and I can assure you that almost every GIS project in the world uses vector data to model the physical world.

Whether it's a good model or a model that the LLMs can manipulate, I have no idea. But it's entirely plausible that I could feed an SVG map and ask it geospatial reasoning questions like how far between A and B, what's the nearest grocery store, etc.

syntaxing · 2024-12-15T23:49:16 1734306556

Wow, those responses are better than I expected. Part of me was expecting terrible responses since Phi-3 was amazing on paper too but terrible in practice.

refulgentis · 2024-12-16T00:23:05 1734308585

One of the funniest tech subplots in recent memory.

TL;DR it was nigh-impossible to get it to emit the proper "end of message" token. (IMHO the chat training was too rushed). So all the local LLM apps tried silently hacking around it. The funny thing to me was no one would say it out loud. Field isn't very consumer friendly, yet.

TeMPOraL · 2024-12-16T08:59:32 1734339572

Speaking of, I wonder if and how many of the existing frontends, interfaces and support packages that generalize over multiple LLMs, and include Anthropic, actually know how to prompt it correctly. Seems like most developers missed the memo on https://docs.anthropic.com/en/docs/build-with-claude/prompt-..., and I regularly end up in situation in which I wish they gave more minute control on how the request is assembled (proprietary), and/or am considering gutting the app/library myself (OSS; looking at you, Aider), just to have file uploads, or tools, or whatever other smarts the app/library does, encoded in a way that uses Claude to its full potential.

I sometimes wonder how many other model or vendor-specific improvements there are, that are missed by third-party tools despite being well-documented by the vendors.

refulgentis · 2024-12-16T09:12:23 1734340343

Hah, good call out: there was such a backlash and quick turnaround on Claude requiring XML tool calls, I think people just sort of forgot about it altogether.

You might be interested in Telosnex, been working on it for ~year and it's in good shape and is more or less designed for this sort of flexibility / allowing user input into requests. Pick any* provider, write up your own canned scripts, with incremental complexity: ex. your average user would just perceive it as "that AI app with the little picker for search vs. chat vs. art"

* OpenAI, Claude, Mistral, Groq Llama 3.x, and one I'm forgetting....Google! And .gguf

regularfry · 2024-12-16T15:59:45 1734364785

In a field like this the self-doubt of "surely it wouldn't be this broken, I must just be holding it wrong" is strong.

algo_trader · 2024-12-16T06:15:10 1734329710

> More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...

Nice. Thanks.

Do you think sampling the stack traces of millions of machines is a good dataset for improving code performance? Maybe sample android/jvm bytecode.

Maybe a sort of novelty sampling to avoid re-sampling hot-path?

lifeisgood99 · 2024-12-16T00:02:10 1734307330

The SVG created for the first prompt is valid but is a garbage image.

bentcorner · 2024-12-16T00:44:54 1734309894

In general I've had poor results with LLMs generating pictures using text instructions (in my case I've tried to get them to generate pictures using plots in KQL). They work but the pictures are very very basic.

I'd be interested for any LLM emitting any kind of text-to-picture instructions to get results that are beyond a kindergartner-cardboard-cutout levels of art.

pizza · 2024-12-16T02:10:28 1734315028

I do with Claude: https://news.ycombinator.com/item?id=42351796#42355665

mycall · 2024-12-18T02:18:00 1734488280

I've had success with LLMs producing mermaid.js or three.js output but that is a different use case.

simonw · 2024-12-16T00:49:20 1734310160

That's why I use the SVG pelican riding a bicycle thing as a benchmark: it's a deliberately absurd and extremely difficult task.

accrual · 2024-12-16T01:57:10 1734314230

Appreciate your rapid analysis of new models, Simon. Have any models you've tested performed well on the pelican SVG task?

simonw · 2024-12-16T02:26:29 1734315989

gemini-exp-1206 is my new favorite: https://simonwillison.net/2024/Dec/6/gemini-exp-1206/

Claude 3.5 Sonnet is in second place: https://github.com/simonw/pelican-bicycle?tab=readme-ov-file...

carbocation · 2024-12-22T16:29:21 1734884961

The gemini result is great. I modified your prompt to encourage more detail ("Generate an SVG of a pelican riding a bicycle. The degree of detail should be surprisingly high and should spark delight for the viewer.")

This is what o1-pro yielded: https://gist.github.com/carbocation/8d780ad4c3312693ca9a43c6...

accrual · 2024-12-16T03:13:48 1734318828

The Gemini result is quite impressive, thanks for sharing these!

codedokode · 2024-12-16T08:22:51 1734337371

They probably trained it for this specific task (generating SVG images), right?

simonw · 2024-12-16T10:42:11 1734345731

I'm hoping that nobody has deliberately trained on SVG images of pelicans riding bicycles yet.

Teever · 2024-12-16T01:25:03 1734312303

I'm really glad that I see someone else doing something similar. I had the epiphany a while ago that if LLMs can interpret textual instructions to draw a picture and output the design in another textual format that this a strong indicator that they're more than just stochastic parrots.

My personal test has been "A horse eating apples next to a tree" but the deliberate absurdity of your example is a much more useful test.

Do you know if this is a recognized technique that people use to study LLMs?

simonw · 2024-12-16T01:34:48 1734312888

I've seen people using "draw a unicorn using tikz" https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-exploratio...

int_19h · 2024-12-16T19:18:04 1734376684

I did some experiments of my own after this paper, but letting GPT-4 run wild, picking its own scene. It wanted to draw a boat on a lake, and I also asked it to throw in some JS animations, so it made the sun set:

https://int19h.org/chatgpt/lakeside/index.html

One interesting thing that I found out while doing this is that if you ask GPT-4 to produce SVG suitable for use in HTML, it will often just generate base64-encoded data: URIs directly. Which do contain valid SVG inside as requested.

girvo · 2024-12-16T03:16:42 1734319002

That came, IIRC, from one of the OpenAI or Microsoft people (Sebastian Bubeck); it was recounted in an NPR podcast "Greetings from Earth"

https://www.thisamericanlife.org/803/transcript

krackers · 2024-12-16T05:56:19 1734328579

It's in this presentation https://www.youtube.com/watch?v=qbIk7-JPB2c

The most significant part I took away is that when safety "alignment" was done the ability plummeted. So that really makes me wonder how much better these models would be if they weren't lobotomized to prevent them from saying bad words.

MyFirstSass · 2024-12-16T09:51:10 1734342670

But how will that prove that it's more than a stochastic parrot, honestly curious?

Isn't it just like any kind of conversion or translation? Ie. a relationship mapping between diffrent domains and just as much parroting "known" paths between parts of different domains?

If "sun" is associated with "round", "up high", "yellow","heat" in english that will map to those things in SVG or in whatever bizarre format you throw at with relatively isomorphic paths existing there just knitted together as a different metamorphosis or cluster of nodes.

On a tangent it's interesting what constitutes the heaviest nodes in the data, how shared is "yellow" or "up high" between different domains, and what is above and below them hierarchically weight-wise. Is there a heaviest "thing in the entire dataset"?

If you dump a heatmap of a description of the sun and an SVG of a sun - of the neuron / axon like cloud of data in some model - would it look similar in some way?

sabbaticaldev · 2024-12-16T10:01:43 1734343303

that’s a huge stretch for parroting

memhole · 2024-12-16T04:15:24 1734322524

Not sure if this counts. I recently went from description of a screenshot of graph to generate pandas code and plot from description. Conceptually it was accurate.

I don’t think it reflects any understanding. But to go from screenshot to conceptually accurate and working code was impressive.

chen_dev · 2024-12-16T03:17:57 1734319077

Amazon Nova models:

https://gist.github.com/uschen/38fc65fa7e43f5765a584c6cd24e1...

simonw · 2024-12-16T00:04:53 1734307493

Yeah, it didn't do very well on that one. The best I've had from a local model there was from QwQ: https://simonwillison.net/2024/Nov/27/qwq/

refulgentis · 2024-12-16T00:43:45 1734309825

For context, pelican riding a bicycle: https://imgur.com/a/2nhm0XM

Copied SVG from gist into figma, added dark gray #444444 background, exported as PNG 1x.

fisherjeff · 2024-12-16T05:09:10 1734325750

Looks like someone’s finally caught up with The Hallmark Channel’s LLM performance

selcuka · 2024-12-17T03:24:52 1734405892

I tried the same GGUF on LMStudio and it seems to sometimes leak its training data.

    Prompt:
    Are you based on OpenAI?

    Reply:
    No, I am not based on OpenAI's models; I am developed by Microsoft. While both systems are designed for natural language processing and understanding, we may have different architectures, training methodologies, or datasets. However, our shared goal is to provide helpful and informative responses while ensuring user safety and privacy.<|user|>
    Are you similar to ChatGPT?<|end|>

vergessenmir · 2024-12-16T13:33:54 1734356034

When working with GGUF what chat templates do you use? Pretty much every gguf I've imported into ollama has given me garbage response. Converting the tokenizer json has yielded mixed results.

For example how do you handle the phi-4 models gguf chat template?

simonw · 2024-12-16T13:41:49 1734356509

I use whatever what template is baked into the GGUF file.

You can click on the little info icon on Hugging Face to see that directly.

For https://huggingface.co/matteogeniaccio/phi-4/tree/main?show_... that's this:

  {% for message in messages %}{% if
  (message['role'] == 'system')
  %}{{'<|im_start|>system<|im_sep|>' +
  message['content'] + '<|im_end|>'}}{%
  elif (message['role'] == 'user')
  %}{{'<|im_start|>user<|im_sep|>' +
  message['content'] +
  '<|im_end|><|im_start|>assistant<|im_sep|>'}}{%
  elif (message['role'] == 'assistant')
  %}{{message['content'] + '<|im_end|>'}}{%
  endif %}{% endfor %}

belter · 2024-12-16T09:43:38 1734342218

> it was trained using synthetic data

Is this not supposed to cause Model collapse?

rhdunn · 2024-12-16T10:37:26 1734345446

It depends on how you construct the synthetic data and how the model is trained on that data.

For diffusion-based image generators training only on synthetic data over repeated model training can cause model collapse as errors in the output can amplify in the trained model. It's usually the 2nd or 3rd model created this way (with output of the previous used as input for the first) for it to collapse.

It was found that using primary data along side synthetic data avoided the model collapse. Likewise, if you also have some sort of human scoring/evaluation you can help avoid artefacts.

simonw · 2024-12-16T10:39:35 1734345575

This is why I don't think model collapse actually matters: people have been deliberately training LLMs on synthetic data for over a year at this point.

As far as I can tell model collapse happens when you deliberately train LLMs on low quality LLM-generated data so that you can write a paper about it.

ziofill · 2024-12-16T16:30:24 1734366624

I may have misunderstood, but I think that it depends a lot on the existence of a validation mechanism. Programming languages have interpreters and compilers that can provide a useful signal, while for images and natural language there isn’t such an automated mechanism, or at least its not that straightforward.

nxobject · 2024-12-16T13:52:38 1734357158

As someone who's a completely layman: I wonder if the results of model collapse are no worse than, say, sufficiently complex symbolic AI (modulo consistency and fidelity?)

fulafel · 2024-12-16T09:50:50 1734342650

belter · 2024-12-16T13:49:29 1734356969

Is this paper wrong? - https://arxiv.org/abs/2311.09807

simonw · 2024-12-16T16:45:25 1734367525

It shows that if you deliberately train LLMs against their own output in a loop you get problems. That's not what synthetic data training does.

belter · 2024-12-16T19:50:36 1734378636

I understand and appreciate your clarification. However would it not be the case some synthetic data strategies, if misapplied, can resemble the feedback loop scenario and thus risk model collapse?

mhh__ · 2024-12-16T19:46:14 1734378374

Along those lines (synthetic data) I would keep an eye on the chinese labs given that they are probably quite data and compute constrained, in English at least.

Havoc · 2024-12-16T00:57:50 1734310670

>Microsoft haven't officially released the weights

Thought it was official just not on huggingface but rather whatever azure competitor thing they're pushing?

simonw · 2024-12-16T01:22:55 1734312175

I found their AI Foundry thing so hard to figure out I couldn't tell if they had released weights (as opposed to a way of running it via an API).

Since there are GGUFs now so someone must have released some weights somewhere.

lhl · 2024-12-16T05:44:57 1734327897

The safetensors are in the phi-4 folder of the very repo you linked in your OP.

Havoc · 2024-12-16T01:58:33 1734314313

Yeah the weights were on there apparently.

Planned week delay between release on their own platform and hf

But much like you I decided I can be patient / use the ggufs

patrick0d · 2024-12-16T11:49:45 1734349785

this vibe check is more insightful to me than the popular evals. nice job!