Probability Part 2: Updating Your Beliefs with Bayes: Crash Course Statistics #14


Hi, I’m Adriene Hill, and welcome back to Crash Course Statistics. We ended the last episode by talking about conditional probabilities, which helped us find the probability of one event, given that a second event had already happened. But now I want to give you a better idea of why this is true, and how this formula–with a few small tweaks–has revolutionized the field of statistics.

In general terms, conditional probability
says that the probability of an event, B, given that event A has already happened, is
the probability of A and B happening together, divided by the probability of A happening
– that’s the general formula, but let’s give you a concrete example so we can visualize
it. Here’s a Venn diagram of two events: an email containing the words “Nigerian Prince” and an email being spam. So I get an email that has the words “Nigerian
Prince” in it, and I want to know what the probability is that this email is Spam, given
that I already know the email contains the words “Nigerian Prince.” This is the equation. Alright, let’s take this apart a little.
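Before the Venn-diagram walkthrough, here’s the formula as a quick numeric sketch. The probabilities below are invented for illustration–the episode doesn’t give actual numbers:

```python
# Conditional probability: P(spam | "Nigerian Prince") = P(spam AND NP) / P(NP).
# These probabilities are made up for illustration -- the episode gives no numbers.
p_np = 0.040           # P(A): an email contains "Nigerian Prince"
p_spam_and_np = 0.038  # P(A and B): it contains the phrase AND is spam

p_spam_given_np = p_spam_and_np / p_np
print(f"P(spam | 'Nigerian Prince') = {p_spam_given_np:.2f}")
# prints: P(spam | 'Nigerian Prince') = 0.95
```

Almost all of the “Nigerian Prince” circle would be covered by the overlap with spam, which is exactly what the division computes.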
On the Venn diagram, I can represent the fact that I know the words “Nigerian Prince” already appeared by only looking at the events where “Nigerian Prince” occurs, so just this
circle. Now inside this circle I have two areas, areas where the email is spam, and areas where it’s not. According to our formula,
the probability of spam given Nigerian Prince is the probability of spam AND Nigerian Prince
which is this region… where they overlap…divided by Probability of Nigerian Prince which is
the whole circle that we’re looking at. Now…if we want to know the proportion of
times when an email is Spam given that we already know it has the words “Nigerian
Prince”, we need to look at how much of the whole Nigerian Prince circle that the
region with both Spam and Nigerian Prince covers. And actually, some email servers use a slightly
more complex version of this example to filter spam. These filters are called Naive Bayes
filters, and thanks to them, you don’t have to worry about seeing the desperate pleas
of a surprisingly large number of Nigerian Princes. The Bayes in Naive Bayes comes from the Reverend
Thomas Bayes, a Presbyterian minister who broke up his days of prayer with math. His
largest contribution to the field of math and statistics is a slightly expanded version
of our conditional probability formula. Bayes’ Theorem states that the probability of B given A is equal to the probability of A given B, times the probability of B, all divided by the probability of A:

P(B|A) = P(A|B)P(B) / P(A)

You can see that this is just one step away
from our conditional probability formula. The only change is in the numerator where
P(A and B) is replaced with P(A|B)P(B). While the math of this equality is more than we’ll
go into here, you can see with some Venn-diagram algebra why this is the case. In this form, the equation is known as Bayes’
Theorem, and it has inspired a strong movement in both the statistics and science worlds. Just like with your emails, Bayes Theorem
allows us to figure out the probability that you have a piece of spam on your hands using
information that we already have, the presence of the words “Nigerian Prince”. We can also compare that probability to the
probability that you just got a perfectly valid email about Nigerian Princes. If you
just tried to guess your odds of an email being spam based on the rate of spam to non-spam
email, you’d be missing some pretty useful information–the actual words in the email! Bayesian statistics is all about UPDATING
your beliefs based on new information. When you receive an email, you don’t necessarily
think it’s spam, but once you see the word Nigerian you’re suspicious. It may just
be your Aunt Judy telling you what she saw on the news, but as soon as you see “Nigerian”
and “Prince” together, you’re pretty convinced that this is junk mail. Remember our Lady Tasting Tea example… where
a woman claimed to have superior taste buds …that allowed her to know–with one sip–whether
tea or milk was poured into a cup first? When you’re watching this lady predict whether
the tea or milk was poured first, each correct guess makes you believe her just a little
bit more. A few correct guesses may not convince you,
but each correct prediction is a little more evidence she has some weird super-tasting
tea powers. Reverend Bayes described this idea of “updating”
in a thought experiment. Say that you’re standing next to a pool
table but you’re facing away from it, so you can’t see anything on it. You then have
your friend randomly drop a ball onto the table, and this is a special, very even table,
so the ball has an equal chance of landing anywhere on it. Your mission is to guess
how far to the right or left this ball is. You have your friend drop another ball onto
the table and report whether it’s to the left or to the right of the original ball.
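The kind of updating this produces can be sketched as a small grid approximation. The 0-to-1 position scale, the grid size, and the particular left/right reports are all assumptions for illustration:

```python
# Bayes's billiard-table updating, as a grid approximation.
# Assume positions run from 0 (left edge) to 1 (right edge). If the original
# ball sits at position x, a new ball lands to its right with probability 1 - x.
grid = [i / 1000 for i in range(1001)]  # candidate positions for the original ball
belief = [1.0] * len(grid)              # flat prior: the drop was uniform

reports = ["R", "R", "R", "L"]          # hypothetical left/right reports from your friend

for r in reports:
    for i, x in enumerate(grid):
        belief[i] *= (1 - x) if r == "R" else x

total = sum(belief)
belief = [b / total for b in belief]                     # normalize the posterior
best_guess = sum(x * b for x, b in zip(grid, belief))    # posterior mean position
print(f"best guess for the ball's position: {best_guess:.2f}")
```

With three “right” reports and one “left,” the best guess lands around a third of the way from the left edge–each “R” nudges the belief leftward, which is the intuition the next few sentences walk through.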
The new ball is to the right of the original, so we can update our belief about where the
ball is. If the original is more towards the left,
then most of the new balls will fall to the right of our original, just because there’s
more area there. And the further to the left it is, the higher the ratio of new rights
to lefts. Since this new ball is to the right, that
means there’s a better chance that our original is more toward the left side of the table
than the right, since there would be more “room” for the new ball to land. Each ball that lands to the right of the original
is more evidence that our original is towards the left of the table. But, if we get a ball
landing on the left of our original, then we know the original is not at the very left
edge. Again, Each new piece of information allows us to change our beliefs about the
location of the ball, and changing beliefs is what Bayesian statistics is all about. Outside thought experiments, Bayesian Statistics
is being used in many different ways, from comparing treatments in medical trials, to
helping robots learn language. It’s being used by cancer researchers, ecologists, and
physicists. And this method of thinking about statistics…updating prior beliefs with new information…may be different from the logic of some of the
statistical tests that you’ve heard of–like the t-test. Those Frequentist statistics can
sometimes be more like probability done in a vacuum–less reliant on prior knowledge. When the math of probability gets hard to
wrap your head around, we can use simulations to help see these rules in action. Simulations
take rules and create a pretend universe that follows those rules. Let’s say you’re the boss of a company,
and you receive news that one of your employees, Joe, has failed a drug test. It’s hard to
believe. You remember seeing this thing on YouTube that told you how to figure out the
probability that Joe really is on drugs given that he got a positive test. You can’t remember exactly what the formula
is…but you could always run a simulation. Simulations are nice, because we can just
tell our computer some rules, and it will randomly generate data based on those rules. For example, we can tell it the base rate
of people in our state that are on drugs, the sensitivity (how many true positives we
get) of the drug test… and specificity (how many true negatives we get). Then we ask our
computer to generate 10,000 simulated people and tell us what percent of the time people
with positive drug tests were actually on drugs. If the drug Joe tested positive for–in this
case Glitterstim–is only used by about 5% of the population, and the test for Glitterstim
has a 90% sensitivity and 95% specificity, I can plug that in and ask the computer to
simulate 10,000 people according to these rules. And when we ran this simulation, only 49.2%
of the people who tested positive were actually using Glitterstim. So I should probably give
Joe another chance…or another test. And if I did the math, I’d see that 49.2%
is pretty close since the theoretical answer is around 48.6%. Simulations can help reveal
truths about probability, even without formulas. They’re a great way to demonstrate probability
and create intuition that can stand alone or build on top of more mathematical approaches
to probability. Let’s use one to demonstrate an important
concept in probability that makes it possible to use samples of data to make inferences
about a population: the Law of Large Numbers. In fact we were secretly relying on it when
we used empirical probabilities–like how many times I got tails when flipping a coin
10 times–to estimate theoretical probabilities–like the true probability of getting tails. In its weak form, the Law of Large Numbers tells
us that as our samples of data get bigger and bigger, our sample mean will be ‘arbitrarily’ close to the true population mean. Before we go into more detail, let’s see
a simulation. And if you want to follow along or run it on your own, instructions are in
the description below. In this simulation we’re picking values
from a new intelligence test–from a normal distribution that has a mean of 50 and a
standard deviation of 20. When you have a very small sample size, say 2, your sample
means are all over the place. You can see that pretty much anything goes,
we see means between 5 and 95. And this makes sense, when we only have two data points in
our sample, it’s not that unlikely that we get two really small numbers, or two pretty
big numbers, which is why we see both low and high sample means.
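A version of this simulation can be sketched in a few lines. This uses Python’s standard library rather than the episode’s linked code, and the seed and number of repetitions are arbitrary choices:

```python
import random

random.seed(1)

# Draw repeated samples from Normal(mean=50, sd=20) and watch the sample
# means concentrate around 50 as the sample size grows.
def sample_mean(n):
    return sum(random.gauss(50, 20) for _ in range(n)) / n

spreads = {}
for n in (2, 100, 1000):
    means = [sample_mean(n) for _ in range(500)]
    spreads[n] = max(means) - min(means)  # how widely the 500 sample means vary
    print(f"n = {n:4d}: sample means span about {spreads[n]:.1f} points")
```

With n = 2 the means roam over tens of points, while by n = 1000 they huddle within a few points of 50–the Law of Large Numbers in action.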
Though we can tell that a lot of the means are around the true mean of 50 because the
histogram is the tallest at values around 50. But once we increase the sample size, even
to just 100 values, you can see that the sample means are mostly around the real mean of 50.
In fact all of the sample means are within 10 units of the true population mean. And when we go up to 1000, just about every
sample mean is very very close to the true mean. And when you run this simulation over
and over, you’ll see pretty similar results. The neat thing is that the Law of Large Numbers
applies to almost any distribution as long as the distribution doesn’t have an infinite
variance. Take the uniform distribution which looks
like a rectangle. Imagine a 100-sided die: every single value is equally probable. Even the sample means that are selected from a uniform distribution get closer and closer to the true mean of 50. The law of large numbers is the evidence we
need to feel confident that the mean of the samples we analyze is a pretty good guess
for the true population mean. And the bigger our samples are, the better we think the guess
is! This property allows us to make guesses about populations, based on samples. It also explains why casinos make money in
the long run over hundreds of thousands of payouts and losses, even if the experience
of each person varies a lot. The casino looks at a huge sample–every single bet and payout–whereas
your sample as an individual is smaller, and therefore less likely to be representative. Each of these concepts gives us another way to look at the data around us. The Bayesian framework shows us that every
event or data point can and should “update” your beliefs, but that doesn’t mean you need
to completely change your mind. And simulations allow us to build upon these
observations when the underlying mechanics aren’t so clear. We are continuously accumulating evidence
and modifying our beliefs every day, adding today’s events to our conception of how the
world works. And hey, maybe one day we’ll all start sincerely emailing each other about
Nigerian Princes. Then we’re gonna have to do some belief-updating. Thanks for watching. I’ll see you next time.
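For anyone who wants to reproduce the Glitterstim example from the episode, here’s a minimal sketch of the drug-test simulation in plain Python (rather than the episode’s linked code); the seed is arbitrary, so the simulated percentage will wobble around the theoretical 48.6%:

```python
import random

random.seed(14)

# The episode's setup: 5% base rate, 90% sensitivity, 95% specificity.
base_rate, sensitivity, specificity = 0.05, 0.90, 0.95

positives = true_positives = 0
for _ in range(10_000):
    on_drugs = random.random() < base_rate
    # A user tests positive 90% of the time; a non-user 5% (1 - specificity).
    positive_test = random.random() < (sensitivity if on_drugs else 1 - specificity)
    if positive_test:
        positives += 1
        true_positives += on_drugs

simulated = true_positives / positives
# Exact answer from Bayes' theorem, for comparison (about 48.6%):
exact = (sensitivity * base_rate) / (
    sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
)
print(f"simulated: {simulated:.1%}   theoretical: {exact:.1%}")
```

Because so many more people are clean than on Glitterstim, the 5% of false positives from the large clean group nearly outnumbers the true positives–which is why a positive test alone means only about a coin flip’s worth of evidence.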


82 Responses

  1. Alan Telemishev

    May 2, 2018 9:06 pm

    Will you be covering reification and simpson's paradox and other such topics in the future?

  2. Riya

    May 2, 2018 9:09 pm

    Hey! We need a crashcourse on lord of the rings !
    It is such an intricately designed plot executed with amazing imagery. But it cannot be denied that it’s a difficult one to understand. Crashcourse literature would help!

  3. Karl Young

    May 2, 2018 10:33 pm

    Yay, the Reverend gets his day ! Was worried you were going to skip the topic. Thanks, nice job.

  4. Jorge David Ramos Mercado

    May 2, 2018 10:33 pm

    I’m very impressed by stating the finite second moment condition that you need for a random variable to be in L2 and you need for many—if not all—of the laws of large numbers and the Central Limit Results.

  5. Yasemin BAHAR

    May 2, 2018 10:57 pm

    ong i just finished a course on cog sci and another on cog psyc, i don’t wanna hear about bayes no more

  6. Arthur de Melo Sá

    May 3, 2018 1:00 am

    Nice video. I'm a linguist and currently trying to learn Bayesian statistics. Our aim is to model how people choose what to say given the situation of the utterance. It is pretty hard, and I wish I have studied Bayes in school. It would make our lives so much easier.

  7. Art Curious

    May 3, 2018 1:27 am

    So we can use statistics to debunk conspiracy theories, very interesting. We can look at all the mass shootings in America and ask was the FBI involved and then compare that to all the shootings where the FBI was actually involved to those that occurred apparently randomly. We can then predict what the likely hood is that a mass shooting was actually not a random event. We can then use that data to sue the government.

  8. Dalton Growley

    May 3, 2018 1:35 am

    this episode makes me doubt the efficacy of drug testing for employment purposes…

  9. Andrés González Rangel

    May 3, 2018 2:11 am

    Did you see "Nigerian Prince: the movie"?
    Yes, it was awesome! Oscar material

  10. Huawei Android

    May 3, 2018 4:40 am

    If you like probability, don't read Nassim Taleb's "the black swan", just don't…

  11. Jasmine Ramirez

    May 3, 2018 3:27 pm

    Hello Mr. John Green,
    My group members and I have currently finished reading Looking for Alaska. We’re working on a English project regarding your book. We would like to interview you. Can you tell us your thoughts about underage drinking and drunk driving?

  12. Drew Trox

    May 3, 2018 5:52 pm

    I've always thought that the game Battleship is a great example of Bayesian logic.

  13. K B.

    May 4, 2018 7:58 pm

    I love how I always come across these video's within a few days that I learned the same topics in school

  14. Abrar Faiyaz

    May 6, 2018 6:19 am

    I love how she doesn't talk super fast like they do at some of the other courses.

  15. victor noagbodji

    May 7, 2018 10:08 am

    you have got here the best explanation i have seen so far in my life for P(A|B). that "venn diagram arithmetic" totally makes sense.

  16. J H

    May 8, 2018 7:22 am

    I'm not a Nigerian prince, and if you give me nothing, I will give you nothing in return… Other than perhaps treatment like you're a human… most of the time.

  17. Moritz Schubert

    May 11, 2018 4:39 pm

    Yes! Bayesian statistics!
    I was fearing that this series (like – sadly – so much of modern science) would be all about frequentist statistics. I'm so glad to be proven wrong!

  18. Department of Analytics

    July 7, 2018 7:11 am

    This is absolutely professional and well-made. A confusing concept explained intuitively. Well done CC!

  19. Kit Coffey

    July 21, 2018 10:23 pm

    really need a series that explains the computations as well as she does the concepts. any suggestions?

  20. Raymond K Petry

    August 22, 2018 11:27 pm

    …this, illustrates, even paradigms, the trouble with statistics—by assuming, statistics, is a valid metric or basis and then doing more-statistics on top of that assumption meanwhile mathematics itself consists of finding equivalences of logical and arithmetical—statistics drags a tangent like a ball-and-chain on an average without respect to cause-and-effect… (We still don't have proof that statistics spans the data completely losslessly invertibly)…

  21. Raymond K Petry

    August 22, 2018 11:45 pm

    …sidebar—in universe A piranhas eat guppies, P(guppies&piranhas) = P(guppies|piranhas) * P(piranhas) = 0 , but that doesn't tell us P(piranhas), and worse, it'd be true if guppies eat piranhas—so we can't tell anything about universe A 'til the doctors say "look at the teeth", proving only that we learn nothing by the doctors, and everything by Red Riding Hood…

  22. John Tsamouras

    December 9, 2018 12:34 am

    Just a brief note: If you run the simulation, in the last line of the code "hist(simulated_samples, xlim = c(0,100),breaks=seq(0,100,1))" add a space before "breaks" and after the comma which preceeds "breaks". Otherwise, the code will give you an error.

  23. Karl's Quest

    December 10, 2018 4:40 pm

    I reaaaally dislike the fact that people look at "failing" a drug test as some kind of proof that the person is "on drugs" in the sense of being an addict or corrupt in some way.

    If I am fulfilling my duties as an employee it's of no business to the company what I do to my body at home.

    Fyi, I'm not a cannabis user or anything but I very rarely take psychadelic or MDMA. If a company were to take me on a surprise drug test at an unlucky time for me and I would be fired, I would consider it a grave injustice.

  24. agatakicia

    January 1, 2019 9:34 pm

    Great to see that at least one thinking mind has come to a conclusion that spreading the knowledge about the Bayesian statistics is worth to try. Many thanks for that!

    I had finished my master thesis in psychology at University of Warsaw and I am really shocked how little knowledge of statistics psychologists/ psychiatrists have (yes, non-psychologists, this topic is consisted of 70% of statistics, methodology, testing, experiments, designs etc. and only in 30% of ideas of such individuals as Freud, Horney, Maslow or Pavlov). 5 years of studying this field have given me two sad conclusions about methodology of verifying new thesis of psychology:

    1. Even psychology PhDs/ professors (globally, not only in Poland) have just a tiny tiny (if any!) knowledge of basics of statistics (no one, except statistic tutors, cares about normal distribution, skewness of the frequencies, size of the sample and type of the sample when it comes to analyzing the data). As I was helping other students to deal with statistics, I was facing a really shocking attitudes towards stats from PhDs/ professors like "hey, I don't have a clue on what to do with gathered material, so let's do anything – correlation (r Pearsons test) or causation (t Student's test), whatever". That freaked me out a bit as I started to think that really so little psychology PhDs/ professors really know what they are talking about in their papers…

    Moreover, as I was looking for some literature for my masters, about 50-70% of ALREADY PRINTED papers was a garbage data IMO. Like "hey, I just did a research on 50 Iranian/Polish/American (any nation is applicable) students from 1st year – 35 female and 15 male. I pushed it through the SPSS machine (pushed some buttons and some numbers occur, yay!) and the conclusion is that generally speaking FEMALE are more open to xxx than man (place whatever you want instead of "xxx" (and yet – please don't make me write every mistake made in this description because it will take me another 15 minutes to summarize it 🙂 ).

    2. (Which is the result of point 1) If such honored people around the world rarely cares about the key assumptions to be fulfilled, how come a. other students would be able to learn stuff? b. how come these students will be able to verify the meaning of their data? c. therefore, how future thesis would be verified if we both get rid of any theoretical assumptions and forget about statistical knowledge? d. what comes next?

    Another topic is also the machine of printing scientific papers and the silence of the experiments that did not fit to the already assumed thesis (or those which just crushed the thesis), but it is a whole new area of discussion… 🙂

  25. BlueDragonFireGirl

    January 13, 2019 4:24 pm

    I think you made a mistake while calculating the probability of true positives at 7:42 .
    The true probability is 48.6%.
    (450 divided by 925 = 0486486… ~48.6%)

    To get all the numbers you should start from the bottom of the problem and work your way up:
    1) you get the drug users by multiplying the baserate (5%) with all users (=simulations =10.000) –> 500 people are drug users in real life
    2) you get the non-drug users by subtracting the drug users (500) from all the users (=10.000) –> 9500 people are non-drug users in real life (=they are clean)
    3) you get the true negatives by multiplying the specificity (95%) with the non-drug users (9500) –> 9025 people are non-drug users according to the test and they are non-drug users in real life (=they are clean)
    4) you get the false positives by subtracting the true negatives (9025) from the non-drug users (9500) –> 475 people are drug users according to the test, even though they are not in real life
    4) you get the true positives by multiplying the sensitifity (90%) with the drug users (500) –> 450 people are drug users according to the test and they are in real life
    5) you get the number of positive tests (true positive (450) and false positive (475)) by adding those numbers together –> 925 people are drug users according to the test, independent if they are or aren't in real life.
    6) At last you have to divide the true positives (450) by the number of positive tests (925) to get the probability of a person to be indeed a drug user if the test says so.

  26. Paul Chandler

    January 17, 2019 9:58 pm

    Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Prince…Nigerian Princes…Nigerian Prince…Nigerian Princes…Nigerian.

  27. thecaveofthedead

    January 21, 2019 5:59 pm

    Good vid. But the drug testing example isn't going to age well when even the US looks back and realises what a grotesque violation of human rights it was. Creepy AF for those of us who live in countries with rights.

  28. Jousboxx

    February 27, 2019 5:39 pm

    Hi! I'm the prince of Nigeria. As my long lost half-cousin, I have chosen you to inherit my 1 ton block of gold. Just pay $1000 shipping and I'll have it delivered by magic carpet.

  29. Roberto Fontiglia

    March 5, 2019 6:49 pm

    "As long as the distribution doesn't have infinite variance". And even then it still works… As long as the distribution doesn't have infinite expectation is the real condition.

  30. nothinmulch

    May 12, 2019 1:59 am

    This makes me think about the false dichotomy fallacy a lot, since we tend to reduce probabilities into the simplest terms, like 50%, instead of more complex and harder to visualize mathematics.

  31. Tom McMorrow

    May 28, 2019 10:44 am

    "The failure of drug test…by employee…administered on April 20th"

    Crash Course knows what's up. 😉

