Agreed; I'd be interested if someone with more knowledge could comment.

My layman's understanding of LLMs is that they are essentially "fancy autocomplete". That is, you take a whole corpus of text and train the model on the statistical relationships between those words (more accurately, tokens), so that given a sequence of N tokens, the LLM predicts the most likely token at position N + 1. To generate whole sentences and paragraphs, you just repeat this process, feeding each predicted token back in.
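
Roughly, the loop looks like the toy sketch below. This is just a count-based bigram table over a made-up corpus, nothing like a real transformer, but the shape of the procedure is the same: "train" by recording which token tends to follow which, then repeatedly predict a next token and feed it back in.

    import random
    from collections import Counter, defaultdict

    # "Training": count which token follows which in a tiny corpus.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    # "Generation": repeatedly sample a likely next token and append it.
    def generate(start, max_new_tokens=8):
        tokens = [start]
        for _ in range(max_new_tokens):
            options = follows[tokens[-1]]
            if not options:
                break
            words, counts = zip(*options.items())
            tokens.append(random.choices(words, counts)[0])
        return " ".join(tokens)

    print(generate("the"))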

I can certainly see how you'd encode a protein as a linear sequence of tokens representing its amino acids, but how does that then map to a human-language description of the protein's function?

Most protein language models aren't able to understand human-language descriptions of proteins at all. Mostly they just predict the next amino acid in a sequence, though some can also handle certain structured metadata tags.
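
If you want to poke at one, checkpoints like ProtGPT2 on the Hugging Face hub are ordinary causal language models whose "text" is the one-letter amino-acid alphabet, so the standard text-generation pipeline extends a sequence residue by residue. A minimal sketch, assuming the nferruz/ProtGPT2 checkpoint is available and with purely illustrative sampling settings:

    from transformers import pipeline

    # A causal protein LM: same next-token objective as a text LLM,
    # but the vocabulary is amino-acid chunks rather than English words.
    generator = pipeline("text-generation", model="nferruz/ProtGPT2")

    # Extend a sequence starting from methionine ("M"). The sampling
    # settings here are illustrative, not tuned recommendations.
    out = generator("M", max_length=60, do_sample=True, top_k=50,
                    num_return_sequences=1)
    print(out[0]["generated_text"])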

Can they understand the functional impact of different protein chains, or are they just predicting what amino acid would come next based on the training set with no concern for how the protein would function?