What the SATs Taught Us about Finding the Perfect Fit

lsiebert · on Dec 14, 2017

One of the great joys of my bachelor's degree in psychology was being invited to take a graduate level course on Item Response Theory (with Professor Jack Vevea, now at UC Merced). I wouldn't have fallen in love with programming and become a software developer if I hadn't taken it. 1

The Rasch Model is a specifically simplified case of item response theory, but I'd argue that it may not be the best one for stitch fix. That's not to say that it can't be useful, but rather that the simplifications and assumptions of the Rasch model may lead to information that does not reflect the customer's measurements as well as a more sophisticated model could. Of course it very well may be good enough, but it serves as a somewhat useful exploration of the

The Rasch model is an attempt to differentiate two associated sets of information, the latent trait of the test taker/question answerer (in this case, their measurements) and the difficulty of the question (in this case, whether the item is too big, too small or just right). Basically the Rasch model treats the level of a latent trait of an individual as a function of the difficulty of a test question and what they answered.
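As a rough illustration of the paragraph above, here is a minimal sketch of the binary Rasch response function. The framing is an assumption made for illustration: `theta` stands for the client's latent size, `difficulty` for the garment's size on the same scale, and the response is collapsed to a single yes/no outcome (for example, "the client says the item is too small"). The function name and the numbers are hypothetical and not taken from the Stitch Fix post.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a positive response under the Rasch model.

    The only item parameter is its difficulty; the curve is a logistic
    function of the gap between the person and the item.
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Hypothetical client with latent size 1.0 facing three garment sizes.
for item_size in (0.0, 1.0, 2.0):
    print(item_size, round(rasch_probability(1.0, item_size), 3))
```

When the person and the item sit at the same point on the scale, the probability is exactly one half, which is what ties the two sets of parameters to a common scale.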

But the model purposely ignores the discrimination of the question, that is, how good the question is at differentiating between those whose latent traits differ, and just assumes that the discrimination (the slope of the line reflecting the model of the question's difficulty) is not relevant. Other models see this as relevant.

For example, if StitchFix offers a belt with a number of different holes, some people may feel the belt is too small if they are forced to use the last hole, others if they have to use the second-to-last hole. A question about such a belt that just asked if it was too large, too small, or just right might have low discrimination in terms of identifying an individual's underlying size. Likewise, someone who has bigger thighs but a relatively slim torso might give different answers about a pair of slim fit pants of size x, which are too small for their thighs, and a belt of size x. Thus questions about pants may have a higher discrimination than questions about a belt.
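The belt versus pants contrast can be sketched with a two-parameter logistic (2PL) curve, which adds a per-item slope to the Rasch model. The discrimination values below (0.5 for the belt, 2.5 for the slim fit pants) are invented purely for illustration, as is the binary "too small" framing.

```python
import math


def two_pl_probability(theta: float, difficulty: float, discrimination: float) -> float:
    """2PL response probability: a logistic curve with an item-specific slope."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Two hypothetical clients, slightly below and slightly above the item's size.
smaller_client, larger_client = -0.5, 0.5

# Assumed discriminations: the forgiving belt separates them poorly,
# the slim fit pants separate them sharply.
belt_gap = (two_pl_probability(larger_client, 0.0, 0.5)
            - two_pl_probability(smaller_client, 0.0, 0.5))
pants_gap = (two_pl_probability(larger_client, 0.0, 2.5)
             - two_pl_probability(smaller_client, 0.0, 2.5))

print(f"belt gap:  {belt_gap:.3f}")
print(f"pants gap: {pants_gap:.3f}")
```

The larger the gap in response probability between two nearby clients, the more a single answer about that item says about which of them you are dealing with.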

Item Response Theory outside of the Rasch model also has a third factor to consider on a per-question and per-individual basis, which is basically the propensity to guess: how likely is someone to think carefully about the question as opposed to just putting down a random answer, and likewise, are some questions more likely to have people answer blithely instead of earnestly?

The other thing to consider is that in most IRT tests, the latent trait is assessed at a single time for multiple questions. But weight/fit/measurements are here being assessed item by item, as they are tried, and the underlying fit may be changing if a person is gaining weight or bulk, retaining water, or recovering from Thanksgiving dinner. While it's unlikely that someone's weight or size would change radically in a brief period, a model that weighted more recently tested items more heavily might better reflect the individual's current measurements.
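One simple way to express that kind of recency weighting is an exponential decay on the age of each observation. This is only a sketch under stated assumptions: the half-life of 90 days, the function name, and the sample data are all hypothetical, and a real IRT fit would more likely apply such weights to each item's likelihood contribution than to a plain average.

```python
def recency_weighted_mean(observations, half_life_days=90.0):
    """Average of (age_days, value) pairs, halving the weight every half-life."""
    weights = [0.5 ** (age_days / half_life_days) for age_days, _ in observations]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, observations))
    return weighted_sum / sum(weights)


# Hypothetical size signals: (days since the item was tried, implied waist size).
history = [(0, 34.0), (90, 33.0), (180, 32.0)]
estimate = recency_weighted_mean(history)
print(round(estimate, 2))
```

With these made-up numbers the estimate lands closer to the most recent observation than the unweighted mean of 33.0 does.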

Of course it's been years and years since I took the class, so any screwup in this comment should reflect on me, and not my professor.

1 I was writing a function in R to speed up an IRT model fitting a curve, in a way that let me do it in seconds instead of hours. (It's been a while, but I think it was identifying the point of the curve where the slope is maximized.) In any case, it was a time-consuming computation if you check every possibility linearly to 6 decimals for hundreds of test takers, but I figured that there weren't local maxima and optimized with something like a binary search (but by decimal place) before I had ever heard of binary searches. Getting that sort of efficiency jump was deeply satisfying.
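The decimal-place search described in the footnote might look something like the sketch below. It is a reconstruction in Python rather than the original R function, which is not shown in the comment; the name `digitwise_argmax` and the example curve are hypothetical. It assumes the target function has a single peak on the interval, as the footnote says.

```python
import math


def digitwise_argmax(f, lo, hi, decimals=6):
    """Find the maximizer of a single-peaked f on [lo, hi], one digit at a time.

    Each pass scans a coarse grid, keeps the best point, then shrinks the
    bracket to one grid step on either side and refines the step tenfold.
    """
    step = (hi - lo) / 10.0
    tolerance = 10.0 ** (-decimals)
    while True:
        count = int(round((hi - lo) / step))
        grid = [lo + i * step for i in range(count + 1)]
        best = max(grid, key=f)
        if step <= tolerance:
            return best
        # For a single-peaked f, the true peak is within one step of `best`.
        lo, hi = max(lo, best - step), min(hi, best + step)
        step /= 10.0


def logistic_slope(theta, discrimination=1.7, difficulty=0.3):
    """Slope of a logistic item curve; it peaks where theta equals difficulty."""
    p = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return discrimination * p * (1.0 - p)


peak = digitwise_argmax(logistic_slope, -4.0, 4.0)
print(round(peak, 5))
```

Scanning the interval from -4 to 4 at 6 decimals would take millions of evaluations per test taker, while this refinement needs on the order of a hundred.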

ouid · on Dec 14, 2017

I didn't actually read the article, so maybe you're making a point specific to the article, but in response to

>For example, if StitchFix offers a belt with a number of different holes...

surely the solution is just to model belts, shirts, and pants separately. Just because there's a correlation between those things doesn't mean you have to use it, and it fucks with your convergence.

avs733 · on Dec 14, 2017

>The Rasch Model is a specifically simplified case of item response theory, but I'd argue that it may not be the best one for stitch fix.

Hello fellow IRT nerd. I was going to come post something similar! I love yours but I hope you don't mind if I flesh out a few details.

I think the biggest takeaway is that one of the underlying assumptions of Rasch is a set of observations of the phenomenon that collectively work to make an estimate of a unidimensional trait. In this case both of those assumptions are germane. For starters, I would be really concerned about the assumption of unidimensionality. I definitely find variances in fit in more than one dimension for clothes.

I totally agree that ignoring discrimination (Rasch sets the discrimination of each item to an arbitrary value, 1) is likely problematic here. Rasch constrains discrimination because of the idea that the test (i.e., all the questions) discriminates (to be clear, discriminate == sort). If I was developing this model I would want to have that discrimination parameter. It would be informative as to what clothes are more effective at bridging different sized people (i.e., how flattering they are to differing body types). It would also get me away from the internal assumption that the model basically necessitates multiple 'observations' (i.e., pieces of clothing) to make an effective estimate of fit.

Similarly, they might also make use of the 'guessing' parameter. That parameter is effectively one to track randomness in a model. Maybe you have a piece of clothing with a weird shape or a different relationship between sizes in the cut than others.

All of this modeling also ignores other variances between people and between brands or items of clothing. You can't really introduce these with strict IRT, but there are more generalized forms of IRT that can deal with multiple dimensions. If you want to get into predictor variables you can do things like multilevel logistic modeling (also known as multilevel measurement modeling), which is becoming more widely used in education. Those models tolerate nesting of data, which can be very useful. Maybe Banana Republic clothes fit me better than they do lsiebert because of where the arm holes are, even though we both wear a L from every brand. That is a nested effect/interaction...it is not specific to Banana Republic, but specific to me and that brand. You can't get that out of IRT or Rasch.

General description: 1 parameter IRT/Rasch: only 'difficulty' varies, akin to general size assuming size is a single linear dimension

2 parameter IRT estimates difficulty and discrimination (i.e., how 'sharp' the logistic curve is)...maybe a piece of clothing that fits slightly different shapes and sizes better or worse.

3 parameter IRT estimates difficulty, discrimination, and guessing. It helps identify a garment with a random influence...maybe people tolerate the item less well because it's a known brand
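The three models in the summary above are nested, which a single function can make concrete. This is a minimal sketch of the standard three-parameter logistic form; the function name and default values are chosen for illustration. Leaving both optional parameters at their defaults gives the 1 parameter/Rasch case, setting only `discrimination` gives the 2 parameter case, and setting `guessing` as well gives the 3 parameter case.

```python
import math


def irt_probability(theta, difficulty, discrimination=1.0, guessing=0.0):
    """3PL response probability; the 1PL and 2PL models are special cases.

    `guessing` is the floor the curve approaches for very low theta, and
    `discrimination` controls how sharply the curve rises around `difficulty`.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


print(irt_probability(0.0, 0.0))                          # 1PL / Rasch
print(irt_probability(0.5, 0.0, discrimination=2.0))      # 2PL
print(irt_probability(-50.0, 0.0, 2.0, guessing=0.2))     # 3PL, near the floor
```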

wenc · on Dec 14, 2017

Truly good sizing is a much more complex problem than can be solved through recommendation systems. You can get incrementally better at it, but nothing beats actually trying something on at the store. Problems that cannot be solved through recommendations alone include:

1) Body types that aren't average (all of our bodies are irregular, but some of us are more irregular than others). Clothing sizes are based on an average model of the human body.

2) Fabric interaction with body. Softer fabrics drape in different ways from more rigid fabrics. Also how loose or muscular your flesh is can affect fit if you're in the market for really slim fits (European style).

3) Non-representative snapshots: your measurements in the morning will differ from other times of day, and they will differ over the course of weeks and months, even if your diet is stable.

4) Shrinkage/expansion: depending on material, there is some shrinkage or expansion after the first wash/wear. Although this is mostly a known quantity and good clothiers account for this.

Really good bespoke tailors understand these principles, and make allowances for them as they build your clothing. Also they know to put elastic fabrics in the right places so the suit will still fit even after a large meal.

I think the one way we can get closer to a good fit while being remote is to have pop-up stations/kiosks where we can get multiple 3D scans of our body on separate occasions (kinda like how most bespoke tailors require at least 3 fittings). That still doesn't account for fabric-body interactions, but it gets us a lot closer than recommendation systems.

p.s. the other problem is cultural. Most Americans don't know or care quite as much about fit (because it's not as prized in the culture) as their European counterparts, so their data is going to be skewed slightly towards the left end of the competence curve.

asddkk · on Dec 14, 2017

I live and breathe IRT models of the sort they discuss for my work and it's fascinating to me to see it applied to clothing sizing. It makes sense because it's a measurement problem.

One thing they don't get into really is that the IRT model they are using is pretty simple, which is typical of the random/mixed effects model formulation as a way of keeping things mathematically tractable to permit parameter explanation/prediction.

They could add other components to the model, though, such as different dimensions of fit (different aspects of body shape and fit), or how closely different items track those dimensions (as opposed to just how large or how small; e.g., maybe some types of fabric provide more information about fit than others). The models they're fitting are a sort of entry point in that regard.

It wouldn't solve the problem of course, and I agree with you 100% about in-person fitting being the final word, but they also have a lot of ways they could improve these models.

Jun8 · on Dec 14, 2017

I'd say the situation is even more complex than that, i.e. even trying it on at the store may be insufficient: I hypothesize that a non-small percentage of people, like me, have no good subjective function to assess apparel fit when they put it on. That's why (I'm a bit embarrassed to admit) I have to bring in my wife with me for all non-trivial shopping to get her expert opinion.

wenc · on Dec 14, 2017

> I have to bring in in my wife

Excellent point. :) What is the point of finding the perfect fit, after all?

It's so that the person (or persons) whose opinions you care about think you dress well and look well put together. That is an important piece of data.

snegu · on Dec 14, 2017

I find it surprising they do all this math since the model they seem to use is "Send everybody tent-like shirts that would fit a horse." After repeated requests for more fitted styles, I kept getting tents (although eventually smaller size tents).

I understand why they do this, because this style is likely to fit more people. But if the technique they describe in this post actually works, perhaps they could be more adventurous?

pjc50 · on Dec 14, 2017

Related: http://sizes.darkgreener.com/ recording the discrepancy between label size and actual fit for a lot of UK high street shopping.

The variance in label sizes is bad enough - and worse for women's clothing than men's - but when you get into "large/medium/small" it's just a lottery. Especially if you're a westerner ordering direct from China.

Bartweiss · on Dec 14, 2017

The variance in label sizes is definitely more extreme in women's sizes, but I'm particularly struck by how it happens with supposedly-objective men's clothing. As men's pants get larger, the waistband size becomes increasingly dishonest - despite supposedly being measured in "inches". I'll look for the study, but it can get up to an impressive 10%+ discrepancy.

lanius · on Dec 15, 2017

Not to mention variance within a given labeled size itself. Last time I went shopping for jeans, I tried on two pairs from Levi's with the same cut and labeled size. One was too tight and the other was too loose.

jatsign · on Dec 14, 2017

Stitchfix has done a good job learning my size, but I wish I actually could "shop" at Stitchfix. When you get a box, there's a foldout paper that shows you examples of what you could wear these new clothes with...but I don't own any of those things.

I wish they had some sort of follow up experience that would let me "complete" the outfit.

They talk a bit about why they don't have any sort of online shopping here:

http://multithreaded.stitchfix.com/blog/2015/07/07/personali...

robterrin · on Dec 14, 2017

They used Stan! http://mc-stan.org/

Gelman and the rest of the crew at Columbia are doing great work. Check out https://www.generable.com/ too.

not_that_noob · on Dec 15, 2017

Nice! Using IRT for clothes sizing is indeed innovative.

One factor that is difficult to account for is users not telling you the truth. Users may feel embarrassed to say something is too small, as that may have negative connotations for some. It’s difficult of course to control for that, but I wonder how honest people might be in their feedback.

perseusprime11 · on Dec 14, 2017

The simple model didn't work for me. They sent me shirts and pants that did not fit.

dgritsko · on Dec 14, 2017

This is the "cold start" problem, which they mention in the post ("What do we do about clients who recently signed up and have no past history on the service?") along with how they attempt to mitigate it. It's not surprising that their approach doesn't work for all users (as it would appear that it didn't in your case). I'm curious though, how much feedback does their system need in order to converge? E.g., how many ratings would you need to provide before receiving well-fitting clothes?
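One rough way to think about the convergence question is through the standard asymptotic result that, under a Rasch model, each item contributes Fisher information p(1 - p) about the latent trait, and the standard error of the estimate is about one over the square root of the total. The sketch below assumes a plain maximum likelihood Rasch setup with known item difficulties, which is a simplification and not the Bayesian model from the post; the function name and numbers are hypothetical.

```python
import math


def rasch_standard_error(theta, difficulties):
    """Approximate standard error of a Rasch trait estimate.

    Sums the Fisher information p * (1 - p) over the items rated so far
    and returns 1 / sqrt(total information).
    """
    information = 0.0
    for difficulty in difficulties:
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)


# Best case: every rated item sits exactly at the client's size.
for item_count in (4, 16, 64):
    print(item_count, rasch_standard_error(0.0, [0.0] * item_count))
```

In this idealized case the uncertainty only halves each time the number of rated items quadruples, and items far from the client's true size contribute less information still.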

perseusprime11 · on Dec 14, 2017

Interesting. Thanks for pointing that out. Most of the clothes they sent are from their own brands, which makes me wonder what is new about this model. Any retailer can ship their clothes if they are willing to process the returns for free.

gumby · on Dec 14, 2017

I'm not in B2C or really any consumer play, and haven't looked at SF closely. But: I assume it's an execution play.

Anybody can sell their books and electronics online (well, they could before Amazon started, yet Amazon excelled). Anybody can make clothes, yet Inditex is a monster. And why is there more than one restaurant? SF's claim is that their tech is a differentiator, and they are a tech company (unlike, say, Target or Blue Nile).

There are a couple of thoughts about "in store brand": their business plan may have required tighter control and visibility into the dimensions and materials of their product (dimension for the reasons discussed in the article and its footnotes; material because stuff that is tried on not in store and returned may need to be more durable, I don't know). It makes the plan more complex, but perhaps improves execution enough that it's worth the expense.

In the long run, if they really have a tech advantage they have the option of expanding beyond just in-house product or simply splitting into sub brands, or both.

Klathmon · on Dec 14, 2017

Then return the item for one that will fit for free, which also gives them the information to improve it for you and others in the future.

jdpedrie · on Dec 14, 2017

I've found too often that by the time I figure out it doesn't fit and needs to be exchanged, other sizes are sold out. It's doubly annoying because I either have to eat the cost of an item which doesn't fit or eat the cost of sending it back and losing the discount they give you for keeping everything.

I like stitchfix quite a bit, but I'm starting to become a bit disillusioned after seeing the small bag of clothes dropped at the Salvation Army which didn't fit or I didn't like and which it was cheaper to keep than return. At least I have a mortgage and thus itemize deductions, so I can write those clothes off my taxes!