Keep in mind ML models are really great at cheating to get answers: Maybe it's detecting that women have to tilt their head up more to reach the retinal photography machine, because they're shorter on average. Maybe some of the images come from an optometrist who specializes in women's glasses frames, and his retinal photography machine has a slightly dimmer bulb. Maybe men are more likely to get a retinograph only when they already have more severe disease, so the retinas look different for that reason.
Their use of an external validation dataset eliminates many, if not all, of those concerns.
Regarding external validation set:
> This dataset differed from the UK Biobank development set with respect to both fundus camera used, and in sourcing from a pathology-rich population at a tertiary ophthalmic referral center.
Regarding UK Biobank set (training set):
> UK Biobank dataset, which is an observational study in the United Kingdom that began in 2006 and has recruited over 500,000 participants—85,262 of which received eye imaging [38]. Eye imaging was obtained at 6 centers in the UK and comprises over 10 terabytes of data [39]. Participants volunteered to provide data including other medical imaging, laboratory results, and detailed subjective questionnaires.
That same spirit is also what's behind shorts in the market, which keep overly optimistic players in check, or crypto-ransom operations against corporations, which keep their system security on alert. Adversarial action (within reasonable boundaries) is the most important balancing mechanism.
The problem is keeping the adversaries balanced. Otherwise you have one side overpowering everything (and typically not through the "correctness" of their argument, but scale and resources) and the adversarial system becomes detrimental to progress.
Yup. I learned this the hard way on the last model I trained at scale. It was evaluating fit between two heterogeneous classes. I sampled the training & test split from a large time window and got to work. It performed extremely well on both training and test. Too good.
I pulled a third sample from a completely different time window and it performed terribly.
It turned out that in both datasets, most class A instances were always a great fit or always a poor fit, so the ML model simply learned to memorize the class A instances.
This problem went away when I subselected down to only instances of class A that had examples of both good fit and poor fit.
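To make the fix concrete, here's a rough sketch of the difference between the in-window split that fooled me and the out-of-window check that exposed it. This is not my actual pipeline; the file and column names ("pairs.csv", "timestamp", "class_a_id", "label") are made up.

    # In-window vs. out-of-window evaluation, as described above.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("pairs.csv", parse_dates=["timestamp"])  # hypothetical export
    features = [c for c in df.columns if c not in ("timestamp", "class_a_id", "label")]

    # Easy-to-fool evaluation: train and test drawn from the same time window.
    in_window = df[df.timestamp < "2021-01-01"]
    train = in_window.sample(frac=0.8, random_state=0)
    test = in_window.drop(train.index)

    # Honest evaluation: a later window, restricted to class A entities the model never saw.
    later = df[df.timestamp >= "2021-01-01"]
    later = later[~later.class_a_id.isin(train.class_a_id)]

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(train[features], train.label)

    print("in-window AUC:", roc_auc_score(test.label, model.predict_proba(test[features])[:, 1]))
    print("out-of-window AUC:", roc_auc_score(later.label, model.predict_proba(later[features])[:, 1]))

If the second number collapses toward 0.5 while the first looks great, the model has memorized the entities rather than learned the fit you care about.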
My problem with modern scientific publications is that they focus more on the "discovery" than on the logical rigor of why their "discovery" could be true or false.
What kind of publications are you referring to? Yeah, many pop-sci articles often overstate the implications of a given discovery, but actual research papers (such as the one linked) typically have a discussion section in which they describe the limitations of the discovery and what kind of future research could help expound on the pitfalls of the current research.
The paper in this post itself has its own Limitations section.
> The paper in this post itself has its own Limitations section.
I haven't actually read the paper. I was reacting to the title.
> what kind of future research could help expound on the pitfalls of the current research.
I think a better way of documenting research to people is by describing what scientific boxes were checked.
In the given case, for example, one of the boxes may be:
"Is there an anatomical distinction between retinas of sexes?" — If that is true then we can see if machine learning can detect such differences.
Take computer science for example.
A publication with the title "Professors create a machine that can think" would maybe instead be published as "Professors create machine that passes Turing test [0]"
Another example can be in medicine.
The research in question may be whether a microbe A causes a flu.
Instead of a publication being "Microbe A causes disease B".
A better publication IMO should revolve around "Microbe A has passed Koch's postulates [1] for disease B"
Once upon a time—I’ve seen this story in several versions and several places, sometimes cited as fact, but I’ve never tracked down an original source—once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.
The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest.
Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that wouldn’t generalize to new problems. Not, “camouflaged tanks versus forest”, but just, “photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive…” But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed!
The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos. It turned out that in the researchers’ data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.
I certainly did not expect the downvotes. I've designed a number of psychology experiments, and my colleagues and I were always careful to balance experimental stimuli and conditions on a number of factors, ranging from luminosity to semantic association, and especially order.
When I was tasked with creating stimuli or taking measurements on a series of items, it was always important to try to eliminate systematic differences in factors of non-interest, most typically through randomization of the order of creation or measurement.
I would point out that the study which appears to be the closest thing to a tiny seed of truth to the tank story (Kanal & Randall 1964) actually did balance its photographs by construction, by subcropping large aerial photographs to have tank/no-tank sections and training on that. That necessarily controls for time-of-day, weather, luminosity etc.
> For instance, in our work, we noted that the algorithm appeared more likely to interpret images with rulers as malignant. Why? In our dataset, images with rulers were more likely to be malignant; thus the algorithm inadvertently “learned” that rulers are malignant.
Yeah, when I read the headline I thought Cornea/Iris, not Retina, and there are a couple of ways an ML could "cheat" there.
But still, there are ways of taking the resulting NN and applying some techniques to figure out which variables/patterns are responsible for most of the output.
Reproducibility is one thing I like about AI research. If they provide the model, I can run it on my own computer, test it against whatever I want, and judge it.
Most things in science are almost impossible to reproduce because of cost or specialized equipment.
I think the problem with that line of thinking is that even after fully transitioning, trans people still tend to have strong biological markers of the sex they've transitioned from.
For a ML model to correctly guess gender identity, it'd need other cues that indicate a person would prefer to be referred to by a certain pronoun, such as clothing or facial hair.
It really depends on the bio marker. Most bio markers are affected: fat distribution, hair texture (curly vs. straight), and even immune system function. I'd think that learning whether this model works on people who've taken hormones for gender transition could give some insight into what it's detecting. Perhaps there are retinal features whose function or form are impacted by the presence of testosterone, for instance, or maybe it's a difference that forms before birth.
>Clinicians are currently unaware of distinct retinal feature variations between males and females, highlighting the importance of model explainability for this task.
If I'm reading this correctly what they're saying is that since we don't currently know the difference between male and female retinas, being able to explain what the ML black box is doing is important. But from what I can see in the paper they basically don't know what the black box is doing, they really don't understand what features their tool has isolated. I might be misunderstanding though?
There is a cursory discussion about how this is "inconceivable to those who spent their careers looking at retinas". However, if it's not clinically useful (as the next sentence says), those experts probably haven't spent much time--if any--training themselves to try.
Humans can learn to detect surprisingly subtle features. For example, the right training regime can make you much better at reporting the tilt of a line, but it requires practice and feedback, just like the network got.
The umbrella term for this is "perceptual learning" and it turns out that training can tweak the visual system: baseball players learn to read where an incoming pitch will go, radiologists can find subtle clues that indicate a tumor, and--as someone pointed out above--chicken sexers can tell the sex of a baby chicken somehow. These are fairly "high-level" phenomena. Surprisingly, training also works on low-level visual phenomena, which you might think are limited by the eye itself or 'hard-wired' neural circuits. It's mostly just practice.
One of the classic experiments looks at vernier (hyper-)acuity. You're shown pairs of lines that are slightly offset, like:
      |   or    |
     |           |
You report whether the top is shifted rightwards or leftwards. The computer gives you feedback, and shows you another pair (varying the distance to make it easier or harder). If you keep doing this, your threshold (i.e., the smallest offset you can reliably report) will decrease about 5-fold. The same thing happens for many other phenomena too--reporting the direction that some dots are moving, the tilt of a line, etc. In some cases, the improvements continue for days or even weeks of training.
The wild thing is that they're often very specific to the training. For example, if you trained with the vertical stimuli above, the improvement does not transfer over to stimuli like:
       _   or      _
      _           _
There are "tricks" to designing a curriculum that generalizes and is relatively efficient, but there's still a lot to learn.
They don't give an explanation because they don't know how to give an explanation - many if not most ML models lack an easy explanation at present, they just spit out answers.
They are saying "someone should do this because it's important even though we don't (presently) know how to do this".
Have you ever used them? They're pretty, but not things I put much trust in. The other awkward issue is that there have been a variety of papers analyzing many interpretability algorithms and finding some very weak properties. One example: train a model M, run something like SHAP, delete the features marked important from the original dataset, train a new model M2, and see minimal performance difference. At best that tells us many models have a large number of ways to get their answer, and the interpretability algorithm is giving you one of them. If your goal was understanding what distinguishes the data, and not just a specific model, you've failed. At worst it tells you that the local way most interpretability algorithms work just isn't revealing. The local nature (most are heavily gradient based), when working with data like images where individual pixels mean little, also makes me skeptical. A decent number of interpretability algorithms look like they're just running edge detection, which makes sense given the gradient part but is poor for interpretability.
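For concreteness, this is roughly the kind of check those papers run, sketched here with an off-the-shelf scikit-learn dataset standing in for the real data (this is not the retinal study's data or model):

    # Train M, mark "important" features with SHAP, delete them, retrain M2, compare.
    import numpy as np
    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    m1 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    sv = shap.TreeExplainer(m1).shap_values(X_tr)
    sv = sv[1] if isinstance(sv, list) else sv[..., 1]   # handle both shap return formats
    importance = np.abs(sv).mean(axis=0)
    keep = np.argsort(importance)[:-5]                   # drop the 5 "most important" features

    m2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr[:, keep], y_tr)

    print("M1 accuracy:", m1.score(X_te, y_te))
    print("M2 accuracy (top SHAP features removed):", m2.score(X_te[:, keep], y_te))

If M2 barely drops, the "important" features clearly weren't the whole story.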
ML explainability is a wide field with a lot of success. For example you can discover what features are activating which detections. Your comment, taken literally, suggests that research in this field is impossible. That is not the case.
when the activated feature in an image recognition net looks like a lovecraftian horror, that doesn’t explain how the net came up with “turtle”.
explainability is going to have a rough time for the same reason ai alignment is going to have a rough time. people think they can explain decisions (technical and moral) far more effectively than they actually can.
I judge by our general inability to explain most of our intuitive decisions (intuition is basically trained pattern recognition neural network without reflection, much like artificial ML).
A great sports player often makes a lousy coach. They often can't articulate how they play, how they move, and how they think, except in the broadest strokes.
All our formal/reflective models are evolved entirely independently of our intuition. They're supported by our intuition (release ball, ball drops, gravity) as a heuristic, but not explained by it. They're independent in terms of logical frame.
I suspect a similar thing will occur in ML. We'll have our black box ML that produces great results which we can't explain. And we'll have other NN models that arrive at reflective models, which however are far less insightful on their own (at least per watt consumed, let's say).
The trouble is that we expect to sit down and debug a series of equations in a NN and come up with a "factual nugget" or a few that explain what happened. Correlation often doesn't actually work this way. You just correlate dozens of mundane factors that vary by one degree between outcomes, and you happen to be able to produce a solid result from that.
Expecting specific models in a NN is a bit like inspecting a dog photo on your phone under a magnifying glass, expecting to learn more about the dog. Instead you cease seeing a dog, and start seeing arbitrary colored pixels.
The problem is with your assumption that the decision function can be reduced to a nice, clean, closed form that a human's brain can consciously conceptualize. It might as well be a tangled mess of a thousand parameters you can't reduce any further without losing predictive power. There is no particular reason why anything should be simple.
We're not at the point of explaining complicated models with a straight face. Based on the saliency maps, it looks like the model has learned something around the bright circles (or is it the blind spot? Not an ophthalmologist). Makes me think the network can reverse engineer distortions in the light to get curvature of the lens which might be indicative of gender differences.
Your question is confusing, it might be that you are using “feature” in an ML sense and the quote refers to human describable distinctions we know about? But I still don’t know how to parse your question.
The model can predict male vs female retinas but they don’t understand why. What exactly are you asking?
ML models have layers, and neurons in a given layer detect “features” in the image or in the previous layer. So yes I believe the person meant which structures in the image are activating the network. Which is a well studied area so it is surprising the authors didn’t explore that.
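As a sketch of what that exploration could look like, here's a minimal gradient-saliency example in PyTorch. The model and image are stand-ins (a randomly initialized ResNet and a random tensor), since the paper's AutoML model isn't something you can load and probe like this.

    # Which pixels most influence the class-1 ("female") logit of a stand-in classifier.
    import torch
    import torchvision.models as models

    model = models.resnet18(num_classes=2)            # placeholder for the real model
    model.eval()

    image = torch.rand(1, 3, 224, 224, requires_grad=True)   # placeholder fundus photo

    logits = model(image)
    logits[0, 1].backward()                           # gradient of the class-1 score w.r.t. pixels

    saliency = image.grad.abs().max(dim=1).values.squeeze()  # 224x224 per-pixel importance map
    print(saliency.shape)                             # overlay this on the image to see hot spots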
Thanks to hormones, women may experience vision changes throughout their adult lives. The hormones estrogen and progesterone have a lot to do with this. Their changing levels can affect the eye’s oil glands, which can lead to dryness. Estrogen can also make the cornea less stiff with more elasticity, which can affect how light travels into the eye. The dryness and the change in refraction can cause blurry vision and can also make wearing contact lenses difficult.
I remember a researcher doing some early research on compressing diagnostic imaging who was happy about all the hard disk space saved. They did some research to find out what level of compression they could go with that wouldn't result in different clinicians reaching different conclusions from the same images.
It really upset me. We probably threw away decades of training data that a computer could have used for early detection.
Fine for broken arms or whatever, but for cancer diagnostics, ugh. The computer might have been able to see the tumour before a clinician.
That data compression meant getting imagery to experts in less than an hour digitally instead of via sneakernet physically in a multi-day timeframe, thus massively speeding up and improving access to high quality diagnosis.
Could do both, but still depends on storage costs. But I guess re-scanning wouldn't get justified once there's already a "good enough" compressed version.
Real question is if they dialed down the compression as storage got cheaper. Doubtful.
I would be curious whether this is actually classifying something that typically corresponds with sex, like hormone levels. In that case people with hormonal disorders would potentially be mis-classified, and someone photographed pre/during puberty might also be mis-classified. Since the paper mentions both neural and vascular tissue being represented in retinal photos, it seems like the levels of various hormones in the individual's blood could potentially also generate a mis-classification if they (for example) cause blood vessels in the eye to expand or contract. The mention that foveal pathology causes the model to mispredict suggests it would probably have issues in these cases too, I think.
I wonder what actual values they were trying to predict with this analysis? Based on the paper, I get the impression they were trying to do something more interesting and they got the best data for sex.
They were specifically trying to classify sex because it is something that experts cannot already do:
> While our deep learning model was specifically designed for the task of sex prediction, we emphasize that this task has no inherent clinical utility. Instead, we aimed to demonstrate that AutoML could classify these images independent of salient retinal features being known to domain experts, that is, retina specialists cannot readily perform this task.
I wonder this too, and they could get a pretty solid answer if they incorporated transgender folks’ data into the set since they’re actively keeping their hormones in the desired ranges of their gender identity.
Generally you'd start with the common, easier problem before you delve into the abnormal cases.
There are probably better ways to do that, like looking at bloodwork.
That's fine for training, but for testing the model delving into abnormal cases seems required. If there's noticeable difference on abnormal cases that gives a lot of insight into what your model is actually testing.
All observations about the use of the term “abnormal” aside, I agree the idea of specifically isolating the hormone variable by using groups of people who naturally fit that range and groups of people that use medication to mirror that range would at least indicate whether this classifier is picking up on sexual dimorphism or vascular effects from hormonal differences (which also would potentially impact not only transgender people, but intersex people — who make up around 3% of the population)
The lab I was affiliated with published a similar paper on gender dimorphism for pediatric hand and wrist radiographs in 2018. The experience of working on that paper made me realize saliency/class-activation maps are only really useful if you want to know where an object is in the image, but not that helpful if you are legitimately trying to figure out what the AI is using to classify.
I spent days looking at male and female hand radiographs + class activation maps trying to figure out what the system was using to tell the genders apart. I never figured it out.
Just throwing out a guess here about one factor that might partition photos by sex: height (I'm assuming males are generally taller).
Looking at what a retinal photography machine looks like, I'd guess the height at which the photo is taken might slightly affect the POV angle, which in turn might be just enough to get caught by the ML model.
They sit down, as is my understanding, yet from photos I found on google search, it seems like people bend their neck forward to adjust their head's height depending on the height of the camera. Your head height from the ground, even when seated, depends on your torso's length (and the chair's height, which I assume people don't adjust to their preferences as patients in a clinic).
People with Ph.D.'s often make up facts and believe in nonsense just like everybody else.
Edit: This includes those with STEM degrees as well. You really shouldn't trust someone more just because they have a research degree. I knew a professor who claimed to have solved some famous problems but that the peer reviewers just didn't want to accept that he solved it and therefore rejected his papers.
What do you think they made up? You need to be more specific than "something", because I don't see anything in that post that looks like they just made it up.
In particular, "People with Ph.D's often makes up facts and believes in nonsense just like everybody else." is definitely not something they just made up.
And "just like everybody else" is not putting a particular group into a bag.
To clarify, I trust research papers since they are a network of citations each reviewed by peers, not because they were written by researchers.
So when someone cites papers, they tend to be trustworthy and you can have a discussion. When they don't, you can't trust them; you can still have casual conversations, but you can't trust any facts. Whether they are a researcher or not doesn't matter here.
That's an unexpected (to me at least) claim to make - https://onlinelibrary.wiley.com/doi/abs/10.1002/ajpa.1330530... - I always thought of sexual dimorphism (men taller than women on average) as a given, and this paper backs up my claim specifically in the context of human societies (where it finds an average height advantage of about 10cm for men over women, comparing 216 different societies). There might be some way of counting in which the converse is true, but not in any way I know. I understand you don't want to dig deeper, but thought I'd flag it in case anyone else unwittingly digests this (possibly wrong) knowledge.
Do we even need clinicians be able to distinguish male and female retinas?
The task is meaningless. Yes, there might be some interesting facts in discovery how male and female retinas are different. It could even lead to differentiated treatments. But ML hasn't provided any clues regarding this and therefore it is not that deep.
It's not about differentiating male and female retinas (that's just a PoC to publish the paper), it's about using ML to find something useful in data (e.g. signs of a disease) which might be hard to see for humans otherwise.
>Clinicians are currently unaware of distinct retinal feature variations between males and females, highlighting the importance of model explainability for this task.
I remember a debate a couple of years ago where some woke scientists were arguing men's and women's brains are indistinguishable. Others were saying there are distinguishing features. This kind of analysis may put an end to that debate.
That's a bit of an exaggeration. Mascara in the eye happens very often, to me anyway. What's really painful is the hairs off the mascara brush detaching and insinuating themselves under the eyelid. That. hurts.
> Mascara on your retina would most likely end up in a hospital visit.
... And likely blindness. The retina is the inside of the back of your eye, where the optic nerve attaches. If you have mascara on your retina, it means you have punctured your eye.
Having something between your cornea and eyelid, while extremely irritating and painful - especially if a scratch results - is far, far less painful than having something puncture your cornea. Trust me, I've had surgery behind the cornea and had the anesthesia wear off.
In figure 1 it looks like we have 50% precision at 100% recall. I don't do a lot of binary classification, but should it be concerning for the model if it is only right half the time (random chance) when we demand it predicts for all data?
For a binary classifier, recall is defined as the true positive rate: Of all the samples that are actually positive, how often did our model output "positive".
To get this to 100%, the model would simply say "positive" all the time. This means that the precision (the fraction of samples where the classifier said "positive" that were actually positive) goes to 50% (for a balanced dataset).
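A quick toy check of that arithmetic, using scikit-learn's definitions:

    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 0, 1, 0, 1, 0]    # balanced toy labels
    y_pred = [1, 1, 1, 1, 1, 1]    # a classifier that always says "positive"

    print(recall_score(y_true, y_pred))     # 1.0  -> 100% recall
    print(precision_score(y_true, y_pred))  # 0.5  -> 50% precision on balanced data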
No. Only if you had a perfect classifier. The only way to have 100% recall is to have absolutely no false negatives. In practice, the only way to have no false negatives is to have no negatives at all.
I've been hearing this since at least 2016. Saliency maps tend to put the interesting areas around the fovea or optic nerve; I think it must've been the optic nerve, because that's really where the vessels emerge. My best guess is that it has something to do with blood pressure. Women have slightly higher blood pressure than men, and the largest-caliber vessels would be the best place to observe that.
"EK is a consultant for Google Health. PAK has received speaker fees from Heidelberg Engineering, Topcon, Haag-Streit, Allergan, Novartis and Bayer. PAK has served on advisory boards for Novartis and Bayer, and is a consultant for DeepMind, Roche, Novartis and Apellis. KB has received research grants from Novartis, Bayer. Heidelberg and Roche. KB has received speaker fees from Novartis, Bayer, TopCon, Heidelberg, Allergan, Alimera. KB is a consultant for Novartis, Bayer and Roche. AK is a consultant to Aerie, Allergan, Novartis, Google Health, Reichert and Santen. All other co-authors have no competing interests to declare."
This isn't necessarily surprising. Maybe clinicians can be trained to get good at it, but, maybe they can't.
Back before computers beat Go, the conventional wisdom was that it was a hard target because humans can leverage their pattern-finding systems to prune the state space much more efficiently than computers can.
Which got me thinking: what about a game that's the opposite? Like the game's state space is embedded in the timbre of a complex tone, and you make moves by twiddling three or four knobs. It doesn't have to be a very complex game for computers to do a lot better at this, extracting meaning out of white-ish noise is not something we're good at.
Assuming this isn't a data artifact, it might be one of those cases. Some pattern in the branching of the veins or the placement of the rod and cone cells, which just looks like noise to us, but which an ML algorithm can find and use to separate male and female retinas.
There's no reason to think that humans would ever get very good at that, although I never count humans out given examples like chicken sexing[0], a notoriously opaque skill which people can nonetheless acquire, despite having little facility in explaining to others how they do it.
I am going to drop a thought here to see what happens.
If there is a difference between male/female retinas. Could this affect our perception of reality?
A woman complained that her very tall boyfriend had hung a mirror in the bathroom. She took a picture of it. It was her reflection holding the camera level with the top of her head, and little else.
One that happened to me personally. I asked my then gf (5'1") what it was like being a small person, do you feel like a normal sized person in a land of giants? "Yes" was the instant response.
One very tall guy once remarked: "The tops of your fridges are fucking disgusting."
> A woman complained that her very tall boyfriend had hung a mirror in the bathroom. She took a picture of it. It was her reflection holding the camera level with the top of her head, and little else.
As a tall guy, there is a surprising number of bathroom mirrors where my reflection doesn't include my head.
You people need some taller mirrors. The bottom of my bathroom mirror is several inches below my waist. The top is about a foot and a half above the top of my head.
> If there is a difference between male/female retinas
Is this even an "if"? It's well established that men are more likely to be colorblind, and it's likely many women are tetrachromats (most people are mere trichromats). The genes for the extra cone pigments are in the X chromosome, and are seemingly expressed more often when somebody has two X chromosomes. Similarly, people with two X chromosomes are less likely to be colorblind because most forms of colorblindness are caused by defects in genes in X chromosomes.
Perhaps most men and women, men and women with normal trichromatic vision, have identical retinas. But with genes so important to eyeballs residing in the X chromosome, who knows. But I'm left wondering why experts are particularly surprised by this result.
I don't like philosophical questions like this. Let's say male blue is female red and vice versa. Our perception of the world is different yet it doesn't change anything as to how we understand and interact with "reality".
It is well backed up by science that women can see, or at least perceive/brainlog, a wider variety of colors than men, and this is expressed in women having a stronger, beefier vocabulary when it comes to naming and identifying colors.
So yeah, that's a thing
Also, what you are describing is called qualia, that is, the intangible qualities of how the brain processes data, such as the "yellowness of a lemon", or the "foot pain of stepping on an unexpected rock shoeless".
Qualia can't be verbalized or compared between people because they are an inherent "brainfeel"; you just need to expect others to have "at least similar-ish" qualia.
Right, and children hear much higher frequencies than the rest of us. Just because you see more doesn't fundamentally change how we perceive reality. Like if someone says there is a color between eggshell white and snow white, I believe them because there is obviously a gradient there. I don't need to see their reality to agree on the state of it.
If someone is colorblind, and another isn't, does that entail a change in perception? Sure. It means the colorblind guy can't discern things that people with normal vision can.
A person born blind can't see anything and never has. They don't even imagine visual images (only images informed by the remaining senses). Their perception is unimaginable to me and mine to them.
So if women can discern more colors than men, it follows that they experience more colors which seems like a matter of perception. Have you never argued with a woman about the color of a sweater?
Red, Brown, Blue, Gray, Blue, Green, Yellow, green, Gray, Blue. Those linear color spectrums never made sense to me as a color blind person.
Technically I only see blue and green, but since people call different shades of green so many different things I start to call them stuff like red and brown. So dark green is brown, then as you go brighter it becomes red, orange, green and brightest green is yellow. White is blue + green, and since red is green pink is just white with less intensity, so gray.
Edit: Anyway, most men are one chromosome color better than me. Most women are two chromosomes better at distinguishing color than me. Makes sense then that women are way better at telling them apart.
Yeah - you are right. I thought about it and I guess since we can interact, our perception can't be too far off - otherwise we wouldn't be able to procreate.
It would be something out of The Hitchhiker's Guide to the Galaxy: a species that, because of retinal differences between the sexes, is unable to mate.
I would argue that it does, and that we conform behaviors to a standard, but there are a lot of assumptions we make that lead us to not understand each other at all.
There is a shared experience isolated to one sex that the other cannot perceive
Fascinating. I had no idea retinas were sexually dimorphic. I wonder if the difference serves a purpose or is just a consequence of some other adaptation.
Neither did doctors apparently. If everything is kosher, they've proved some level of sexual dimorphism and now they can investigate and perhaps find out what it is.
This is an interesting use of machine learning. We (or at least I) normally think of these models as replacing or complementing humans. But using them as a driver for research is cool.
There's also the risk of severe overfitting to some latent variable. I haven't quite dug into the work itself yet, but it does bring back memories of some case of perfect diagnosis due to hospital documentation process though.
“Our deep learning model was trained using code-free deep learning (CFDL) with the Google Cloud AutoML platform ... the CFDL platform provides the option of image upload via shell-scripting utilizing a .csv spreadsheet containing labels ... Automated machine learning was then employed, which entails neural architecture search and hyperparameter tuning.”
Earlier, in the Limitations section:
“The design of the CFDL model was inherently opaque due to the framework’s automated nature with respect to model architecture and hyperparameters. While this opacity is not unique to CFDL, there is potential to further reduce ML explainability due to lack of insight of model architectures and parameters employed.”
Maybe there’s a whitepaper somewhere on how Google’s AutoML works?
My understanding is that it uses heuristics to try a range of different base models and techniques based on the input data, along with grid searches to find hyperparameters. It is fairly pricey but works.
It is probably not super exotic, but if they spent enough money optimizing it, it probably has good hyperparameters.
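As a very rough (and much simpler) analogue of what that search looks like with plain scikit-learn; Google's AutoML additionally does neural architecture search, which this sketch doesn't capture:

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_digits(return_X_y=True)

    # Exhaustive grid search over a small hyperparameter space, scored by cross-validation.
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
        cv=3,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)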
Clinicians are not aware of the distinct retina features because they don't need to be. Everybody can tell the sex at first sight, and when in doubt, they can simply ask.
I'm certain ML will reveal other hidden, obscure features like this in the future. But that does not mean machines can do something people cannot, as the title might suggest. If people set their minds to it, they will do it too, maybe at a much slower pace, but they will.
As I look at Figure 2 (region-based saliency maps), I notice the high-salience regions are the brightest areas on the retinal image. I am not sure whether those are just bright regions of the retina or the reflection of the light illuminating the retina from the scanner. In any case, it is interesting to me that these seem to be the regions that help discriminability the most (if I am understanding correctly), which is surprising to me.
that's going to be one of the biggest shocks to society going forward when it comes to the changes that AI brings. there are mountains of data everywhere that are completely overlooked simply because the cost of processing the data is too high. too high to discover patterns/correlations and too high to process in any case.
human beings filter out most of what goes on around them. they don't see the world as it is and their minds don't keep track of physical primitives. their minds abstract the world into larger conceptual parts and track those parts. it's not just a question of processing power, it's a question of intuitive access. and nobody realizes this yet because the only sentient beings who are around to demonstrate any of this have those filters in place. when the AI comes with all that horsepower and with no filters, it will see things all around us that we are blind to. it will seem as though it can make impossible predictions. it will seem god-like, even before it graduates to doing something other than simply observing the world.
Sexy title, but it is unclear that clinicians can't classify sex from the retina; it's just that they haven't bothered to. And the classification is not that great (<80% PPV on independent data). Clinicians will certainly get much higher sensitivity, specificity, and PPV just by looking at the subject ;)
Don't all women have 6 types but usually they all have very similar or even identical frequency response? Only when they have a colorblind gene are they noticeably different.
"Tetrachromacy is the condition of possessing four independent channels for conveying color information, or possessing four types of cone cell in the eye."
Yes, it says it is from carrying a colorblind gene, usually red-green. But they have six copies of rhodopsin-encoding DNA; usually 3 of them are duplicates. In red-green colorblindness only two are duplicates, but there are other types of colorblindness. Theoretically they could be hexachromatic.
One study suggested that 15% of the world's women might have the type of fourth cone whose sensitivity peak is between the standard red and green cones, giving, theoretically, a significant increase in color differentiation.[23] Another study suggests that as many as 50% of women and 8% of men may have four photopigments and corresponding increased chromatic discrimination compared to trichromats.[24]
Nor is color blindness exclusively a male phenomenon; in Northern Europeans, 8% of males are colorblind while 0.5% of females are.
Suppose a few more sexual dimorphic traits like this exist in the eyes; perhaps differences that have no practical effect on human vision and have consequently gone unnoticed by clinicians. If the ML model is picking up a few of these dimorphic traits, it could perhaps classify sex with more accuracy than anybody looking at a single trait could. This is pretty standard Bayesian stuff; it's the way basic "Plan for Spam" style Bayesian spam filters work.
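Here's a toy illustration of that "many weak traits add up" idea with a naive Bayes classifier; the three traits are synthetic stand-ins, each only weakly predictive on its own, not anything from the paper.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    n = 5000
    sex = rng.integers(0, 2, n)            # 0 or 1, labels only
    # Each trait shifts slightly with sex but overlaps heavily, so it is weak alone.
    traits = np.column_stack([rng.normal(loc=0.3 * sex, scale=1.0, size=n) for _ in range(3)])

    print(cross_val_score(GaussianNB(), traits, sex, cv=5).mean())              # all traits combined
    print(cross_val_score(GaussianNB(), traits[:, [0]], sex, cv=5).mean())      # a single trait, noticeably worse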
This is a really impressive result and an interesting result to apply ML to. Thank you for sharing, OP. I'm just wondering if there are any real world applications of why you'd want to tell the sex of a person by a retinal photograph? It seems like a bit of a useless skill to have?
I think this is more an example of how black-box models are basically useless for clinical research.
The authors aren't aware of any distinguishing retinal features between male and female eyes and the model itself has no explanatory power.
Could be a Clever Hans situation where the model exploits meta information of some kind in the absence of actual features. It could just as well mean that there are indeed distinguishing features that are compromised in the presence of foveal pathology.
The authors note that another study using manually selected features identified three features that are indicative of genetic sex. These features yielded an AUROC of about 0.78, compared to the presented model's AUROC of 0.93, which is about 19% higher. That additional accuracy may point to a combination of the already identified features or to one or more additional features.
I personally find this paper rather pointless. It stops at the point where actual progress could be made and things would get interesting - why didn't the authors evaluate the previously known features on the model's matches to measure their significance?
This could have told them whether their black-box was relying on the same set of features as the ones identified by previous work, for example.
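One hedged way to do that last check, assuming you could export the model's per-image predictions alongside the previously identified features (the file and column names below are hypothetical), is to regress the black-box's predictions on the known features and see how much of its behaviour they account for:

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("predictions_with_known_features.csv")   # hypothetical export
    # known_feature_1..3 = the manually identified retinal features from the earlier study
    X = sm.add_constant(df[["known_feature_1", "known_feature_2", "known_feature_3"]])
    fit = sm.Logit(df["model_predicted_female"], X).fit()      # binary black-box prediction as outcome

    print(fit.summary())   # the pseudo R^2 hints at how much the known features explain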
> I think this is more an example of how black-box models are basically useless for clinical research.
A result different from the null hypothesis is useless.
Let us say the machine could not succeed at greater than chance; it would be a case of cosmic bad luck, in that case, that all social and biological factors cancel each other out.
Thus, in the case of the null hypothesis being confirmed by this, one may conclude that in all likelihood retinal patterns have no sexual divergence.
But, machines very rarely find such a null hypothesis in such cases, and that might be in no small part because of all the extra factors that such models latch on to.
I for one would be more interested in an obvious nonce test to see if the machine can find something: see if the a.i. can find a retinal difference between, say, the poor and the rich. If it can tell with high accuracy which retinae are poor and which are rich, we might have a somewhat interesting situation.
> External validation was performed on the Moorfields dataset. This dataset differed from the UK Biobank development set with respect to both fundus camera used, and in sourcing from a pathology-rich population at a tertiary ophthalmic referral center. The resulting sensitivity, specificity, PPV and ACC were 83.9%, 72.2%, 78.2%, and 78.6% respectively
The paper is about 4 pages long - it takes about as long for you to write that comment as it does to skim through it and learn that what you mentioned is exactly why they did the study:
> While our deep learning model was specifically designed for the task of sex prediction, we emphasize that this task has no inherent clinical utility. Instead, we aimed to demonstrate that AutoML could classify these images independent of salient retinal features being known to domain experts, that is, retina specialists cannot readily perform this task.
It always amazes me how people spend 5 seconds reading a headline but think they know more than someone who has spent days and months on the same topic.
Sorry I misinterpreted then. I thought you were dismissing it out of negativity but actually it's worse - you actually made a judgement that you knew more than the authors of the study.
The only judgement I made was to not read the whole paper. I read up until the paper stated that classifying sex based on retinal pictures was unlikely to be clinically useful. At which point I lost interest.
Why weren't the ML model and clinicians classifying something that actually is clinically useful?
If it has no clinical significance, what's the relevance of the classification of the clinicians?
How is it any more spectacular than beating a random classifier?
Had these points been addressed at this point I might have continued reading
Because I had already spent time reading and maybe someone could enlighten me as to why it in fact is interesting. That and I was also hoping to get insulted
I'd agree that if clinicians haven't been trained on this for their line work, then the comparison is not fair, but I wouldn't go so far as to say it's "useless".
No, you're right. But since there's a whole field on the subject I figured they could have chosen something with clinical utility and I don't really understand why they didn't
Also, before they tested on the other smaller dataset from a different source, aiui, they also trained only on the earlier subset of the first source, and used the later portion from the first source (with no overlap in patients) for the testing.
(also, I'm not sure that 252 is really all that small?)
Why do you say that the model is overfitted? You have no way of knowing that. Plus, 84,743 is a very reasonable size for a vision dataset with a binary prediction.
> Gender refers to the socially constructed roles, behaviours, expressions and identities of girls, women, boys, men, and gender diverse people
First it's not some hamfisted mixup of sex and gender:
> Therefore, this field may contain a mixture of NHS recorded gender and self-reported gender. Genetic sex in the UK Biobank was determined
And yet:
> Predicting gender from fundus photos, previously inconceivable to those who spent their careers looking at retinas, also withstood external validation on an independent dataset of patients with different baseline demographics. Although not likely to be clinically useful, this finding hints at the future potential of deep learning for the discovery of novel associations through unbiased modelling of high-dimensional data.
If we had a way to detect trans children, for sure that would be clinically useful!
Edit: as always, thanks for the downvotes, but please also educate me where I am wrong.
They're claiming that they can predict sex, not gender.
The study does comment on a trans case in their validation-set, which of 1,287 images, had 1 image for someone whose genetic-sex and reported-sex didn't match. For that 1 image, the algorithm's prediction corresponded to the genetic-sex rather than the reported-sex.
> Genetic sex was discordant from reported sex in one validation set image, and this image was incorrectly predicted by the model; that is the model predicted sex consistent with genetic sex in this case (Table S1).
Language belongs to everyone. Some specialists in their field make a distinction between "sex" and "gender"; others, for instance, do between "speed" and "velocity", or "weight" and "mass". In the vernacular, and in many fields, they are respectively synonyms.
Most technical terms start as lay terms in a language that are then given a more technical meaning, often pulling two synonyms apart in the process.
On the note of “children”; I tried scanning the result for whether the machine can distinguish before puberty, which would be even more spectacular, but I couldn't find it in the article. — there is significant debate as to what extent non-genital sex characteristics exist before puberty, as the difference is often so small that they could easily be attributed purely to environmental or social factors.
The vernacular, in this case, is obsolete. It needs to catch up. Sex and gender, while they often happen to be the same, are not at all the same thing, and we need to speak up against confusing the two for the sake of our trans friends.
The same can be said for “speed” and “velocity”, “mass” and “weight”, “g.p.u.”, and “graphics card”, “working memory” and “r.a.m.”, and so forth.
But most people will live and die without such a distinction ever being relevant in their lives.
Methinks that this distinction plays an important role in your life, but you must realize that it does not in that of most.
The other difference is that of all the other things I mentioned, the distinction is of a very technical and exact nature, whereas “sex” as is common in biology is bereft of a technical definition, and “gender” as is common in psychology even more so. — I initially used the phrase “technical term”, but I am honestly loathe to do so for concepts so poorly defined as either “sex” or “gender” whereof specialists very frequently disagree on wherein to place objects discussed.
> But most people will live and die without such a distinction ever being relevant in their lives.
Lives is the keyword. speed vs velocity is of concern to physicists but mixing up gender and sex has been weaponized as a tool against trans people and because of that, we need to push back.
> Methinks that this distinction plays an important role in your life, but you must realize that it does not in that of most.
Yes, most people don't give a damn about other people, I know; after all, Atlas Shrugged is popular in the United States. This is why the United States is heading toward being a failed state, if it isn't already there. But those who actually care about others recognize how much of a difference they can make by deliberately using two such simple words in their right meaning, and so they do.
> Lives is the keyword. speed vs velocity is of concern to physicists but mixing up gender and sex has been weaponized as a tool against trans people and because of that, we need to push back.
Such can be said about many such words. — various terms that enjoy more præcise nuance in linguistics have very much been used to weaponize against allowing people to speak in their native registers, similar things can be said about religious nuances. So do you also go about correcting people who use the word “Flemish” as most use it, and rather insist that in technical terms, it only refers to a specific group of Dutch dialects rather than standard Dutch as spoken in Belgium?
> Yes, most people don't give a damn about other people, I know, after all, Atlas Shrugged is popular in the United States. This is why the United States is heading to, if not already there, to be a failed state. But those who actually care about others, recognize how much a difference they can make by deliberately using two such simple words in their right meaning and so they do.
Yes, and I would submit that neither do you, but that you simply insist that a special exception be made for the one thing that seems to be important to you, for if you evenly applied this standard throughout your life, and demanded the same corrections elsewhere, you could not generally allow someone to finish a single sentence without demanding several alterations to the words.
In about every sentence vernacularly spoken, words are used that have a more præcise nuance in technical vocabulary, and in many such cases the lack of distinction has been weaponized, for vagueness is indeed the ultimate tool in politics.
No, that is past, we need to do better. Tolerance has its places, but sex and gender not being the same is not controversial any more. It's not like mass vs weight, which is mostly of interest to physicists. We are talking about real people. Truly, there's no place for this any more. We need to do better. Tolerance is necessary to accept people not like us, but erasing their very existence by using certain words is not tolerance.
> the gender-reveal phenomenon pulls off a rousing counter-progressive two-for-one: weapons-grade reinforcement of oppressive gender norms (sorry, feminists!) and blunt-force refusal of the idea that sex assigned at birth does not necessarily equate with gender identity (sorry, trans-rights movement!).
I disagree. We need to do better with tolerance and acceptance.
Preaching these things when it benefits a favored point of view and then rejecting them when it comes to others is actually the definition of weaponizing, and it does more harm and causes everyone to be intolerant.
What exactly do you want me to accept? Transphobia? Make no mistake: when you confound sex and gender, that's what you are fueling. One half of the USA is busy enacting transphobic laws; we must not fuel that agenda. We can not just accept this behavior while singing "tolerance". That's not how this works. Tolerance means we respect every human being equally, and it does not mean that any opinion denying that needs to be tolerated.
You can not just go around and redefine words especially when this tolerance hides transphobia. Again, it just doesn't work this way. These words have a meaning, a very specific meaning at that and I don't think transphobic dogwhistling is something worthy of tolerance. You know, sex, gender, same thing, we can mix it up, wink, wink? Even if you didn't intend it that way, even if you think it's still just ignorance/lack of knowledge, time is up for that. This is why I said the vernacular is now obsolete.
What made it obsolete? Obviously, the GOP did by declaring open season on trans people, especially by denying treatment to trans children. This is sheer evil that will lead to suicidal children -- they might as well just shoot them, same result, just less acceptable (although how much is an open question given how American society has normalized school shootings as well, but now I really digress). These laws fit the very definition of genocide as defined by the convention.
We can not stand for this. We need to fight back. Every day. Every sentence. Every word if need be. We need to show that aside from a far-right fringe, society stands with trans people. It's civil solidarity.
And yes, I asked on /r/trumpsupporters what they think is the difference between society and just living next to each other, and of course the reply is that there's no difference. But there is. We do live in a society. We do care for the next person, and if politicians want to erase trans people, we will not let that happen. We can not let that happen.
Do you have a workplace policy where people put in pronouns right in their chat name? We do that because we want to normalize that when you introduce yourself, your pronouns are a normal, everyday part of that.
Do I need to quote Niemöller ?
First they came for the socialists, and I did not speak out—
Because I was not a socialist.
Then they came for the trade unionists, and I did not speak out—
Because I was not a trade unionist.
Then they came for the Jews, and I did not speak out—
Because I was not a Jew.
Then they came for me—and there was no one left to speak for me.
So we do speak out, because we learned from history what happens when we do not. Especially when all it takes is such a tiny thing as getting the words sex and gender right and using the pronouns a person prefers.
If someone uses the n-slur on a black person, are you just going to preach tolerance of their words or are you going to step up, use your white privilege (most of HN readers are white cis het males) and say "that's not okay"?
They elected an openly nazi person to the highest office of the United States and when he rightly got the boot, they attempted a coup to keep him in power. And one of their chief weapons indeed were -- as they called it -- alternative facts. Yeah, there is a certain part of society which does not want to be part of society and does use words and facts rather arbitrarily. Is this what you want me to accept? Because I will not.
I have not started this, I have not wanted this but if history calls us to task, answer we will. And again, right now, in here, we just need to make sure our speech is precise. Not more. It's a small act but it has serious weight.
One certainly can be tolerant of different people with different ideas, and one can practice what they preach, too.
Having a different personal definition of a word, or a different opinion or thought, does not automatically make someone -phobic, or mean they are "coming for" others. Try to be inclusive and tolerant rather than hostile and divisive and uninclusive.
Please don't perpetuate flamewars on HN, and especially not on classic flamewar topics. It does no good for anyone, and it destroys the curious conversation that this site is supposed to exist for.