a companion discussion area for blog.codinghorror.com

An Initiate of the Bayesian Conspiracy


#23

I guess I’m dense. The first sentence says 1% of women who participate in screening HAVE BREAST CANCER. It goes on to blah blah blah about false positives, then asks the probability that the woman who participates has breast cancer. It doesn’t ask what is the probability that her positive mammogram is really cancer, vs a false positive. It asks the probability that she has cancer, which is stated in the first sentence as being 1%.
Perhaps, this is why I did not do as well in some classes as others.
Do I win the doofus prize?
–dang


#24

Whoa, it’s FizzBuzz all over again.

I’m surprised there was no mention of the Monty Hall Problem. I think it’s a good example of Bayes’ theorem.


#25

Isn’t the answer: .01 probability (1%)?

Seems to me the first sentence gives the answer, and all the stuff about positive mammographies is irrelevant.

(But, the Bayesian discussion after the question probably indicates I’m wrong.)


#26

I thought I was going mad thinking I was the only person to get to the answer of 1% until I got to the bottom of the comments. The answer is in the question:

“1% of women at age forty who participate in routine screening have breast cancer…snip…What is the probability that she actually has breast cancer?”

Heck, it’s the first word! Is the trick misdirection, to make you think the numbers in the middle actually mean something?


#27

You have to pay attention to the entire question:

“A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?”

It is not asking for the probability that a woman (who is routinely tested) has breast cancer.

It asks for the probability that a woman who tests positive (in the routine testing) has breast cancer.


#28

Wow! It’s amazing the number of people who comment on here with wild numbers and theories, without actually taking ten seconds to look at the linked article.

Hint: if an article states that most people get this wrong, then please check you’re not one of them before posting a reply! It’s like “FizzBuzz” all over again!


#29

10% of people who read this blog post reply. Of those 10%, 100% have to enter a captcha word. Of these, 100% have to enter the word “orange”. What is the chance that you typed “orange” in order to reply to this post?


#31

@Will

I didn’t have to type the word you describe. You did it for me.


#32

Wooo! I got it right… but only because I took a probability theory course a year ago, and AI this semester.


#33

I wrote a statistical spam filter two years before Paul Graham’s. Worked pretty well, and some people converted it into an open source project.

http://www-cse.ucsd.edu/~wkerney/spamfilter.README
http://www-cse.ucsd.edu/~wkerney/spamfilter.tar.gz

“Bayesian Conspiracy”? Please. Conditional probability is covered in every lower division probability class. It’s probably the first actual interesting thing you learn in probability… but it’s not hard to understand.


#34

“A woman in this age group had a positive mammography in a routine screening.”

It can’t be 1% because you know the test result.


#35

Really, the only thing you need to remember is this:

P(A | B) = P(A and B) / P(B)

In words, that’s: the probability of A given B is equal to the probability of A and B divided by the probability of B.

Given:

P(p | c) = 0.80 (probability of positive result given they have cancer)
P(p | ~c) = 0.096 (probability of positive result given they DON’T have cancer)

P(c) = 0.01 (probability they have cancer)

Goal:

P(c | p) (probability they have cancer given a positive result)

Work:

  1. Need P(c | p):
    Same as P(c and p) / P(p)

  2. Need P(c and p):
    Know: P(p | c) = 0.80 = P(p and c) / P(c) and that P(c) = 0.01
    P(p and c) = 0.80 * P(c) = 0.80 * 0.01 = 0.008

  3. Need P(p):
    Same as P(p and c) + P(p and ~c)

  4. Have P(p and c), need P(p and ~c)
    Know: P(p and ~c) / P(~c) = P(p | ~c) = 0.096 and that P(~c) = 1 - P(c) = 0.99
    P(p and ~c) = P(p | ~c) * P(~c) = 0.096 * 0.99 = 0.09504

Back to #3: P(p) = P(p and c) + P(p and ~c) = 0.008 + 0.09504 = 0.10304
Back to #1: P(c | p) = P(c and p) / P(p) = 0.008 / 0.10304 = 0.07764

There’s your answer: 7.76%
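
For anyone who would rather check the arithmetic by running it, here is a minimal Python sketch of the same steps. The function name `posterior` and its parameter names are made up for illustration; the three input numbers are the ones given in the question.

```python
def posterior(prior, p_pos_given_c, p_pos_given_not_c):
    """Return P(c | p) from P(c), P(p | c) and P(p | ~c)."""
    # Step 2: P(p and c) = P(p | c) * P(c)
    p_pos_and_c = p_pos_given_c * prior
    # Step 4: P(p and ~c) = P(p | ~c) * P(~c)
    p_pos_and_not_c = p_pos_given_not_c * (1.0 - prior)
    # Step 3: P(p) = P(p and c) + P(p and ~c)
    p_pos = p_pos_and_c + p_pos_and_not_c
    # Step 1: P(c | p) = P(c and p) / P(p)
    return p_pos_and_c / p_pos


answer = posterior(prior=0.01, p_pos_given_c=0.80, p_pos_given_not_c=0.096)
print(f"P(c | p) = {answer:.4f}")  # prints "P(c | p) = 0.0776"
```

Changing `prior` is a quick way to see how strongly the result depends on the 1% base rate.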


#36

I keep getting spam that defeats Bayesian filtering. However, I don’t understand how the spammers think anyone is actually going to read the spam messages buried in the middle of hundreds of random words and phrases.

Obviously I can’t train my filter on these messages as they are primarily filled with non-spam content. Doing so would just train my filter to mark all content as spam.

Sending every message with a $ to the spam account helps but sometimes some e-mail from grandma ends up there too. I can live with it, but it is extremely annoying to have to delete the messages which are so obviously spam.


#37

@ wkerney: clearly it is hard to understand, or so many people wouldn’t get it wrong. I think it’s mainly that people get all confused thinking about this nebulous concept of probability, rather than assuming some whole number of events and working from there.
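
The whole-number approach can be sketched in a few lines of Python. The cohort size of 100,000 is an assumption, picked only so that every count comes out as an integer; the 1%, 80% and 9.6% rates are the ones from the question.

```python
# Hypothetical cohort; the size is an assumption chosen to give whole numbers.
women = 100_000

with_cancer = women * 1 // 100                  # 1% have cancer
without_cancer = women - with_cancer            # everyone else

true_positives = with_cancer * 80 // 100        # 80% of those with cancer test positive
false_positives = without_cancer * 96 // 1000   # 9.6% of those without cancer test positive

all_positives = true_positives + false_positives
share = true_positives / all_positives

print(f"{true_positives} of {all_positives} positive tests are real cancers ({share:.2%})")
```

Counted this way, the positive tests are mostly false alarms simply because the group without cancer is so much larger than the group with it.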

@ half the others in this thread: the whole point of this type of problem is conditional probability - the way in which additional information alters what you know about something. Yes, the probability of any arbitrary woman having cancer is 1%, independent of whether she’s been tested. The question is asking you about this particular woman, who you know more about - you know the result of her test, and that affects what you know about her probability.

I figure most of the people that read this blog are software developers, which makes it doubly surprising that some don’t get this. I don’t mean to sound pompous - I needed a little help on my way to the answer - but the fact that some folks can’t follow the working at all troubles me. FizzBuzz indeed.


#38

1%? Oh dear, people. Please apply logic. If all the stuff about positive and negative mammograms was irrelevant, then why would anybody bother HAVING one done? The fact that the result was positive must have some implication for the chance of having breast cancer, otherwise the test would be useless.


#39

Well, given the first statement, you might conclude that if she’s tested, the chance of her having breast cancer is 1%. And that IS true, IF we don’t know the results from the test. That’s actually not the same as saying that 1% of 40-year-old women have breast cancer. It’s saying 1% of those who are screened routinely have breast cancer.


#40

Hmmm. I got around halfway down that huge page and got confused. There was a problem on eggs and pearls that it does not give the answer to and I don’t know how to answer. Unhelpful. The reuse of almost exactly the same example problems doesn’t help either.


#41

Remember: per the article, only 15% of doctors get this question right. If the question is rephrased in different ways, their accuracy goes up.

http://www.yudkowsky.net/bayes/bayes.html

Let me be the first to say that I’m one of the 85%. I don’t find this intuitive, and I absolutely would have gotten the question wrong. It’s difficult for me to see past the accuracy of the individual breast cancer test, which is 80%.

I would also expect, knowing what I know about human nature, that most of the commenters will get the question right: people who aren’t sure of their answer aren’t likely to make it permanent in a comment box, either. So allow me to compliment those of you who got it wrong in a comment: at least you’re honest. :)


#42

Here’s my solution, and a comment on the implications in this application.

In the following, let C denote cancer, M+ denote a positive mammogram, and C^ and M+^ denote not having cancer and not having a positive mammogram, respectively.

The probability that a woman in the age range has breast cancer is P(C) = 0.01.

The probability that a woman’s mammogram is positive given that she has breast cancer is P(M+ | C) = 0.8.

The probability that a woman’s mammogram is positive given that she doesn’t have breast cancer is P(M+ | C^) = 0.096.

We are required to find the probability that a woman has breast cancer given her mammogram is positive, P(C | M+).

From conditional probability:

P(M+ | C)P(C) + P(M+ | C^)P(C^) = P(M+)

so P(M+) = (.8 * .01)+(.096*(1-.01)) = 0.1030

Bayes’ theorem states that:

P(M+ | C) = P(C | M+)P(M+) / P(C)

We re-arrange to find:

P(C | M+) = P(M+ | C)P(C) / P(M+) = 0.008 / 0.1030 = .0776.

So, the probability of a woman having breast cancer given that she has a positive mammogram is just under 8%.
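
As a sanity check on the Bayes' theorem result, here is a rough Monte Carlo sketch in Python. The function name `simulate`, the trial count and the seed are arbitrary choices for illustration; only the three probabilities come from the problem statement. The estimate is random, so it should land near the calculated value of about 0.078 rather than match it exactly.

```python
import random


def simulate(trials, seed=0):
    """Estimate P(C | M+) by simulating screened women one at a time."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    positives = 0
    positives_with_cancer = 0
    for _ in range(trials):
        has_cancer = rng.random() < 0.01            # P(C) = 0.01
        p_positive = 0.8 if has_cancer else 0.096   # P(M+ | C) or P(M+ | C^)
        if rng.random() < p_positive:
            positives += 1
            if has_cancer:
                positives_with_cancer += 1
    # Fraction of positive mammograms that belong to women with cancer
    return positives_with_cancer / positives


estimate = simulate(1_000_000)
print(f"Simulated P(C | M+) is roughly {estimate:.3f}")
```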

Assuming these probabilities are correct, one might be tempted to say that this is unacceptable. However, it is worth thinking whether it would be preferable for fewer positive mammograms to be reported (given it’s hard to know which ones are true positives and which are false). Doing so would result in fewer (false) scares—but would also result in more cancers being missed. It is also worth asking whether a better method for detecting breast cancer even exists (hint: not yet).