Automated image-based diagnosis
Nowhere is it as difficult to get a fully automatic image analysis system accepted and used in practice as in the clinic. Not only are physicians sceptical of technology that threatens to make them irrelevant, but an automated system has to produce a perfect result, a correct diagnosis in 100% of cases, to be trusted without supervision. And of course this is impossible to achieve. In fact, even if the system has a better record than an average (or even a good) physician, the system and the physician are unlikely to be wrong on the same cases. The combination of machine + physician is therefore better than the machine alone, and thus the machine should not be used without the physician.
What often happens, then, is that the system is tuned to yield near-100% sensitivity (to miss only very few positives), and consequently has a very low specificity (that is, it marks a lot of negative samples as positive). The system is heavily biased towards positives. The samples marked by the system as negative are almost surely negative, whereas the samples marked as positive (or, rather, suspect) are reviewed by the physician. This is supposed to lighten the physician's workload. Sounds nice and useful, no? So what is the problem?
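This tuning can be sketched in a few lines. The scores and labels below are entirely made up for illustration: we push the decision threshold down until sensitivity reaches the target, and watch specificity collapse as a side effect.

```python
# Toy sketch of tuning a classifier for near-100% sensitivity.
# All scores and labels here are hypothetical, chosen only to illustrate
# how chasing sensitivity drags specificity down.

def confusion(scores, labels, threshold):
    """Count (TP, FN, TN, FP) when a score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return tp, fn, tn, fp

def tune_for_sensitivity(scores, labels, target=1.0):
    """Return the highest threshold that still reaches the target sensitivity."""
    for t in sorted(set(scores), reverse=True):
        tp, fn, _, _ = confusion(scores, labels, t)
        if tp / (tp + fn) >= target:
            return t

# Hypothetical classifier scores: True = truly positive, False = truly negative.
scores = [0.9, 0.7, 0.35, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, False, False, False, False, False, False]

t = tune_for_sensitivity(scores, labels, target=1.0)
tp, fn, tn, fp = confusion(scores, labels, t)
print(t, tp / (tp + fn), tn / (tn + fp))  # threshold 0.35: sensitivity 1.0, specificity 3/7
```

Because one true positive scores only 0.35, catching it forces the threshold below four of the seven negatives: 100% sensitivity costs us a specificity of 3/7.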
One example where automated systems are routinely used in the clinic (at least in the rich, western world) is screening for cervical cancer. This has been done with the so-called Pap smear since the 1940s. It takes about 10 minutes to manually examine one smear, which is made on a microscope slide, stained, and examined through the microscope. Even before digital computers became common, there were attempts to automate the analysis of the smear. My colleague Ewert Bengtsson wrote his PhD thesis on the subject in 1977, and is still publishing in the field today. That gives an idea of how hard it is to replicate something that is quite easy, though time consuming, for a trained person. The solution, as is often the case, was to change how the sample is prepared. Instead of smearing the sample on a slide, liquid-based cytology systems were invented that clean the sample (removing slime, blood cells, etc.) and produce a neat deposition of cells on the glass, nicely separated and unlikely to overlap each other. Such a preparation makes the automated image analysis much easier. However, these automated systems still do not produce a perfect result, and therefore are only approved for use together with a trained cytologist. That is, the cytologist still needs to review all the tests. This means that the Pap smear test has become more expensive, rather than cheaper (the liquid-based sample preparation uses expensive consumables).
But the cost is not the real problem, which is why these new automated Pap test systems are nevertheless used in practice. The problem is that the probability that a sample is positive is quite low. That means that even a small bias towards positives will produce a huge number of false positives. For example, assume one in 1000 tested women has cervical cancer, and that the system has a specificity of 90% (totally made-up figures; I have no idea what these are in practice). This would mean that about 100 perfectly healthy women would be flagged as having cancer by the system for every one woman whose cancer is detected!
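The back-of-the-envelope arithmetic is worth checking. Using the post's made-up figures (not real clinical data), and assuming for simplicity that the system misses no cancers:

```python
# Checking the base-rate arithmetic above. The prevalence and specificity
# are the post's made-up figures, not real clinical data.

prevalence = 1 / 1000   # one in a thousand tested women has cervical cancer
specificity = 0.90      # the system clears 90% of healthy women
sensitivity = 1.0       # assume, for simplicity, no cancers are missed

n = 1000
sick = n * prevalence                    # 1 woman
healthy = n - sick                       # 999 women
true_pos = sick * sensitivity            # 1 cancer detected
false_pos = healthy * (1 - specificity)  # ~100 healthy women flagged

print(false_pos / true_pos)  # ~100 false alarms for every detected cancer
```

Put differently, a positive result from such a system would mean only about a 1% chance of actually having cancer; the other 99% of flagged women would face needless worry and follow-up.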
I recently came across an article in The New York Times reporting on a device, MelaFind, that is designed to recognize melanoma. To me it seems they did a lot of things right: the device takes images at various wavelengths, including some near-infrared, and at various depths under the skin surface (i.e. it obtains information that the dermatologist does not have); they trained the device on a database with 10,000 lesions (with biopsy information, from 40 different sites); etc. Yet look at these three comments from the FDA panel that are reported in the article:
- “[I am] concerned that a doctor could inadvertently use MelaFind on a non-melanoma skin cancer, receive a score indicating that the spot was not irregular, and erroneously decide not to biopsy it.”
- “My concern with MelaFind is that it just says everything is positive.”
- “There is inadequate data to determine any true value added for MelaFind for use by a dermatologist or other provider.”
Why are these FDA panellists so negative? The first comment is about wrong use of the device. Of course, MelaFind is trained on a dataset of “clinically atypical cutaneous pigmented lesions with one or more clinical or historical characteristics of melanoma,” and will therefore produce random output if the image is of anything else. If the input is outside the domain spanned by the training set, all bets are off. This is a valid concern, but not difficult to address by limiting the use of MelaFind to dermatologists, as the FDA did. The second and third comments, on the other hand, are what this blog post is about. As the article puts it: “[...] a biostatistician [...] compared the accuracy of MelaFind in distinguishing non-melanomas to a hypothetical pregnancy test which, used on 100 nonpregnant women, would mistakenly conclude that 90 of them were pregnant.” The problem is low specificity!
Reading further on MelaFind’s dermatologist-only web pages (yes, I had to actually claim I’m a dermatologist to get to those pages!), you can see the details of the study on which the FDA approval is based. Apparently they collected a large group of patients, who together had 1632 lesions suitable for the device to diagnose. Of these, 175 were melanomas or high-grade lesions (I’ll call them all melanomas for simplicity). That is, a little over 10% of the test set is positive; most of it is negative. MelaFind detected 173 of the 175 melanomas (98.9% sensitivity). This is a very good sensitivity (though of course no mention is made of the dermatologists’ sensitivity in this study; more on that later). However, the specificity was only 10.8%. That is, it recognized only 157 of the 1457 negatives as negative! MelaFind diagnosed as melanoma 1473 of the 1632 lesions shown to it (90.3%), pretty close to simply calling everything positive. And thus, as the third comment above asks: is there any added value in using the device?
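The full confusion matrix can be reconstructed from the figures quoted above (study numbers as reported on MelaFind's pages; the derived cells follow by simple subtraction):

```python
# Reconstructing MelaFind's confusion matrix from the study figures quoted above.

total = 1632                    # lesions in the study
positives = 175                 # melanomas and high-grade lesions
negatives = total - positives   # 1457 benign lesions
tp = 173                        # melanomas the device flagged
fn = positives - tp             # 2 melanomas missed
tn = 157                        # benign lesions correctly cleared
fp = negatives - tn             # 1300 benign lesions flagged as suspect

sensitivity = tp / positives    # ~0.989
specificity = tn / negatives    # ~0.108
flagged = tp + fp               # 1473 lesions called positive, ~90.3% of all

print(round(100 * sensitivity, 1), round(100 * specificity, 1), flagged)
```

Laid out like this, the imbalance is stark: for every melanoma the device caught, it sent roughly seven and a half benign lesions for review.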
We have to compare MelaFind’s performance with that of dermatologists. In the study, the dermatologists had a specificity of 5.6%. That is, they marked even more of the lesions as melanoma (though only some 65 lesions more). Surprisingly, there is no mention of these dermatologists’ sensitivity. I would guess it was 100%; otherwise they would have reported it. Instead, they cite some statistics from another study showing that dermatologists detect only 70–80% of melanomas. There is no way to compare that study to this one, as the input (the test set) was most likely chosen very differently. And, given the clinical process, dermatologists are, like the device, strongly biased towards positives: everything that might be a melanoma is excised and sent to a pathologist; they don’t want to run any risk of missing one. The guess of 100% sensitivity is also supported by the FDA approval specifically stating that the device is not intended to be used on lesions with a clinical diagnosis of melanoma or likely melanoma, and that the biopsy decision for a MelaFind-negative lesion is up to the dermatologist.
So, assuming the dermatologists really had 100% sensitivity, the study says that, had the device’s recommendation always been followed, 65 fewer samples would have been biopsied, but two melanomas would have been missed. Is that a valid trade-off? Surely not. I wonder what the dermatologists’ specificity would be if they were a little less cautious (i.e. if they lowered their sensitivity to that of MelaFind). Or, conversely, what MelaFind’s specificity would be if it were tuned to yield 100% sensitivity on this test set. In either of those cases, the specificities could be compared directly. I would bet that the dermatologists would then do much better than MelaFind. And thus the comment regarding its added value is legitimate. The article says that “patients are paying $25 to $175 for the first mole evaluation and around $25 for subsequent moles.” Is this a waste of money? I don’t think the device gives patients a better diagnosis. If it just helps the dermatologist do his or her job faster, why is the patient paying more for the diagnosis, rather than less?
I hope this does not sound too negative. I know many things can be automated and improved with image analysis, even in clinical settings. I truly believe we will have fully automatic diagnosis at some point in the not too distant future (hey, Dr. McCoy didn’t need to go through med school just to hold that device that diagnosed and cured all his patients, did he?). But we’re not nearly there yet: we need to work on our ability to recognize wrong input (in this case non-cutaneous or non-pigmented lesions, non-lesions, the patient’s shirt, etc.), and we need to work on our ability to generalize from examples. This is what sets the human eye–brain combo apart from anything else, natural or engineered: we can take context into account, we can adapt to environmental changes, and we generalize like there is no tomorrow.