Yeah, I wasn't sure how I wanted to deal with duplicates so I mostly ignored them. I track letter positions directly (just a bunch of tuples), but don't actually do anything with this other than restricting candidate words.
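For concreteness, the candidate-restriction step can be sketched roughly like this (function and parameter names are my own, not from the actual solver): a positional regexp plus sets of required and excluded letters.

```python
import re

def filter_candidates(words, pattern, required, excluded):
    """Keep words that match the positional pattern (e.g. '..ve.'),
    contain every required letter, and contain no excluded letter."""
    rx = re.compile(pattern)
    return [w for w in words
            if rx.fullmatch(w)
            and all(c in w for c in required)
            and not any(c in w for c in excluded)]
```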
I think if I work on this some more I'd try to factor in letter positioning when deciding what to guess. My hunch is that it won't make too much of a difference though.
So I tried an experiment using 15,918 five-letter English words. I used a basic scoring strategy: score a word by summing up the frequency of the candidate letters in the candidate words, as determined by a regexp of included and excluded letters. (e.g. `.aves` would score `waves` 1, but `saves` 0, since `s` is already included)
Variations included adding in the frequency of the letter at a particular position, and adding in the frequency of two letter combinations.
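A minimal sketch of the base scoring plus the positional-frequency variation (names and structure are my reconstruction, not the actual code): letters already known contribute nothing, matching the `.aves` example above.

```python
from collections import Counter

def best_guess(candidates, known_letters, posfreq=True):
    """Pick the candidate whose *new* letters are most frequent across
    the candidate pool, optionally adding per-position frequencies."""
    letter_freq = Counter()
    pos_freq = [Counter() for _ in range(5)]
    for w in candidates:
        for i, c in enumerate(w):
            pos_freq[i][c] += 1
        for c in set(w):              # count each distinct letter once per word
            letter_freq[c] += 1

    def score(w):
        s = sum(letter_freq[c] for c in set(w) - known_letters)
        if posfreq:
            s += sum(pos_freq[i][c] for i, c in enumerate(w))
        return s

    return max(candidates, key=score)
```

With candidates `["waves", "saves"]` and `a`, `v`, `e`, `s` already included, only the `w` in `waves` scores, so `waves` wins, as in the example.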
Interestingly enough, the winning strategy used single letters while factoring in position. Second best was two-letter combinations with position.
```
ngram=1 posfreq=True  mean attempts: 4.34 WinPct 91.280%
ngram=2 posfreq=True  mean attempts: 4.35 WinPct 91.186%
ngram=2 posfreq=False mean attempts: 4.37 WinPct 90.074%
ngram=1 posfreq=False mean attempts: 4.38 WinPct 90.445%
```
Since my base dictionary is way bigger than the Wordle one, I also mixed in a smaller 1,382-word dictionary (google-10000-english.txt) and combined them either by just sorting by score, or by normalizing the scores first and then sorting. Normalizing the scores was strictly worse.
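The merge step, roughly (again a hypothetical sketch; I'm assuming words appearing in both dictionaries take the smaller dictionary's score):

```python
def combined_ranking(big_scores, small_scores, normalize=False):
    """Merge two {word: score} dicts and rank words by score, descending.
    With normalize=True, each dict's scores are scaled to [0, 1] before
    merging (the variant that performed strictly worse)."""
    def norm(scores):
        hi = max(scores.values()) or 1
        return {w: s / hi for w, s in scores.items()}
    if normalize:
        big_scores, small_scores = norm(big_scores), norm(small_scores)
    merged = {**big_scores, **small_scores}   # small dict overrides overlaps
    return sorted(merged, key=merged.get, reverse=True)
```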
```
normalize=False ngram=1 posfreq=True mean attempts: 4.34 WinPct 91.280%
normalize=True  ngram=1 posfreq=True mean attempts: 4.43 WinPct 90.281%
```
FWIW, the absolute worst one was:
```
normalize=True ngram=1 posfreq=False mean attempts: 4.43 WinPct 89.835%
```