Kudos on your release! I know this was just made available, but here's some early feedback:
- Somewhere in the README, consider adding a note that the `-DWEIGHT_TYPE=hwy::bfloat16_t` flag is needed for non-sfp weights. Maybe around step 3.
- The README should explicitly say somewhere that there's no GPU support (at the moment).
- "Failed to read cache gating_ein_0 (error 294)" is pretty obscure. I think even "(error at line number 294)" would be a big improvement when it fails to FindKey.
- There's something odd about the 2b vs. 7b models. The 2b will claim it's trained by Google, but the 7b won't. Were they trained on the same data?
- Are the .sbs weights the same weights as the GGUF? I'm getting different answers compared to llama.cpp. Do you know of a good way to compare the two? Any way to make both deterministic? Or even dump probability distributions on the first (or any) token to compare? (Rough sketch of what I mean below.)
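On that last point, here's roughly the comparison I have in mind — a minimal sketch assuming each runtime can be patched to dump the logits for the first generated token, one float per line in the same vocab order (the filenames and dump format are made up):

```python
# compare_first_token.py - compare two dumped next-token distributions.
import numpy as np

def load_probs(path):
    # One logit per line, indexed by vocab id (hypothetical dump format).
    logits = np.loadtxt(path, dtype=np.float64)
    z = logits - logits.max()  # numerically stabilized softmax
    p = np.exp(z)
    return p / p.sum()

p = load_probs("gemma_cpp_logits.txt")   # hypothetical dump from gemma.cpp
q = load_probs("llama_cpp_logits.txt")   # hypothetical dump from llama.cpp
assert p.shape == q.shape, "vocab sizes differ - not comparable"

# KL divergence and top-10 overlap give a quick sense of how far apart the
# two implementations are on the very first sampling decision.
kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
print(f"KL(p || q) = {kl:.6f}")

topk = 10
overlap = set(np.argsort(p)[-topk:]) & set(np.argsort(q)[-topk:])
print(f"top-{topk} overlap: {len(overlap)}/{topk}")
```

As for determinism, greedy decoding (temperature 0 / top-k 1) should at least take sampling randomness out of the picture, though different thread counts can still reorder floating-point reductions.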
The weights should be the same across formats, but it's easy for divergence to creep in from quantization and/or subtle implementation differences. Minor implementation differences have been a pain point in the ML ecosystem for a while (w/ IRs, onnx, python vs. runtime, etc.), but hopefully the differences aren't too significant (if they are, it's a bug in one of the implementations).
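As a toy illustration of the quantization side (plain numpy; this uses truncation rather than the round-to-nearest a real converter would use):

```python
import numpy as np

def to_bfloat16(x):
    # bfloat16 keeps float32's sign and 8-bit exponent but only 7 mantissa
    # bits; dropping the low 16 bits of the float32 encoding approximates
    # the conversion (real converters round to nearest even).
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

w = np.float32(0.1234567)
print(float(w), float(to_bfloat16(w)))  # ~0.1234567 vs 0.123046875
```

A per-weight error like that is tiny on its own, but it compounds across layers and can nudge sampled tokens in visibly different directions.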
Thanks, I'm glad to see your time machine caught my comment.
I'm using the 32-bit GGUF model from the Google repo, not a different quantized model, so I could have one less source of error. It's hard to tell with LLMs whether it's a bug: it just gives slightly stranger answers sometimes, but they're not complete gibberish, incoherent sentences, or full of extra punctuation like some other LLM bugs I've seen.
Still, I'll wait a few days and rebuild llama.cpp to see if anything changes.