Both; the API response includes a breakdown. In the best case 1 token = 1 word (for example "and", "the", etc.). It depends on the input, but for English it seems reasonable to multiply the word count by about 1.3 to get a rough token count.
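That rule of thumb is easy to script. A minimal sketch (the 1.3 multiplier is just the heuristic above, not an exact figure; the API's usage breakdown or a real tokenizer gives the true count):

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate for English text: word count * ~1.3.

    Only a heuristic -- punctuation, rare words, and non-English text
    all tokenize differently.
    """
    return round(len(text.split()) * tokens_per_word)

print(estimate_tokens("the quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```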
This pricing model seems fair, since you can pass in a huge prompt and request a single-word reply, or send a few words that expect a large reply.
Is there anything akin to creating Stable Diffusion embeddings, where you can train a very specific concept that takes up only a few kilobytes and then use it with the base model?
In theory, such an approach would let you spend a little upfront to train a complex concept (read: one that costs many tokens to describe) and then reuse it cheaply, because the embedding vector standing in for that concept might take up only a single token.