Distributions 2

Author

Josef Fruehwald

Published

February 9, 2023

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
emo <- read_csv("https://raw.githubusercontent.com/bodowinter/applied_statistics_book_data/master/warriner_2013_emotional_valence.csv")
Rows: 13915 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Word
dbl (1): Val

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
um <- read_tsv("https://bit.ly/3JdeSbx")
Rows: 26060 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (3): word, next_seg, idstring
dbl (11): start_time, end_time, vowel_start, vowel_end, nasal_start, nasal_e...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Discrete Probability distributions

Our dice rolling experiment is an example of a discrete probability distribution. This code chunk plots the predicted probability of observing \(n\) ⚀ or ⚅ on 10 rolls.

## Define how many rolls
nrolls <- 10

## get the probability in a dataframe
tibble(
  ## possible observations
  observed = 0:nrolls,
  ## dbinom for the probability distribution
  prob = dbinom(
    ## possible observations
    x = observed,
    ## number of rolls
    size = nrolls,
    ## probability of success
    prob = 2/6
  )
) |> 
  ## the plotting code
  ggplot(aes(x = observed, y = prob))+
    geom_col()

If we have more rolls, the range of probability density, relative to the whole range of possible observations, decreases.

## Define how many rolls
nrolls <- 1000

## get the probability in a dataframe
tibble(
  ## possible observations
  observed = 0:nrolls,
  ## dbinom for the probability distribution
  prob = dbinom(
    ## possible observations
    x = observed,
    ## number of rolls
    size = nrolls,
    ## probability of success
    prob = 2/6
  )
) |> 
  ## the plotting code
  ggplot(aes(x = observed, y = prob))+
    geom_col()

We can simulate the experiment, varying the number of groups who rolled different number of n times. Increasing the number of groups rolling a die fills in the density distribution more, but only increasing the number of rolls per group narrows the density range.

complicated code just to make the plot
set.seed(611)
expand_grid(
  nrolls = c(10, 100, 1000),
  groups = c(5, 50, 500),
  min = 0
) |> 
  mutate(
    expected = nrolls * (2/6)
  )-> blank

## Create all combinations of rolls
## and number of groups
expand_grid(
  nrolls = c(10, 100, 1000),
  groups = c(5, 50, 500)
) |> 
  ## We didn't learn about code like this,
  ## using it just to make the graph
  mutate(
    roll_df = map2(
      nrolls, 
      groups, 
      \(roll, group) tibble(obs = rbinom(n = group, size = roll, prob = 2/6)))
  ) |> 
  unnest(roll_df) |> 
  ggplot(aes(obs))+
    stat_bin(binwidth = 1,
             aes(y = after_stat(ndensity))) +
    geom_blank(data = blank, aes(x = nrolls, y = 0))+
    geom_blank(data = blank, aes(x = min, y = 0))+
    geom_vline(data = blank, aes(xintercept = expected), color = "steelblue")+
    facet_grid(groups~nrolls, 
               label = label_both,
               scales = "free")

complicated code just to make the plot
# tibble(
#   rand = rbinom(n = 500, size = nrolls, prob = 2/6)
# ) |> 
#   ggplot(aes(x = rand)) +
#     stat_bin(binwidth = 1)+
#     geom_vline(xintercept = (2/6) * nrolls, color = "red")+
#     expand_limits(x = c(0, nrolls))

Normal Distribution

The “Normal Distribution” or “Gaussian Distribution” is a continuous probability density function. Rather than the height of the curve representing the “point probability” of a value, it represents an area that sums to 1. To get the probability of a value falling within a range of the normal distribution, you sum the area under the curve.

tibble(
  x = seq(-4, 4, length = 100),
  dens = dnorm(x, mean = 0, sd = 1)
) |>
  ggplot(aes(x = x, y = dens)) +
    geom_area()

complex code just to make the plot
one_sd_prob = round(pnorm(1) - pnorm(-1), digits = 2)
two_four_prob = round(pnorm(4) - pnorm(2), digits = 2)

tibble(
  x = seq(-4, 4, length = 100),
  dens = dnorm(x, mean = 0, sd = 1)
) |>
  ggplot(aes(x = x, y = dens)) +
    geom_area() +
    annotate(
      ymin = -Inf,
      ymax = Inf,
      xmin = -1,
      xmax = 1,
      geom = "rect",
      alpha = 0.3,
      fill = "steelblue"
    )+
    annotate(
      x = 0,
      y = 0.2,
      label = 
        str_wrap(
          glue::glue("Area under the curve between -1 and 1 = {one_sd_prob}"),
          15
        ),
      geom = "label",
      color = "steelblue"
    )+
    annotate(
      ymin = -Inf,
      ymax = Inf,
      xmin = 2,
      xmax = 4,
      geom = "rect",
      alpha = 0.3,
      fill = "#BE3455"
    )+
    annotate(
      x = 3,
      y = 0.2,
      label = 
        str_wrap(
          glue::glue("Area under the curve between 2 and 4 = {two_four_prob}"),
          15
        ),
      geom = "label",
      color = "#BE3455"
    )

One way to view how much area is under the curve, as you move from left to right, is to look at the “cumulative probability function”

tibble(
  prob = pnorm(
    q = seq(-4, 4, length = 100),
    mean = 0,
    sd = 1
  ),
  x = seq(-4, 4, length = 100)
) |> 
  ggplot(aes(x = x, y = prob))+
    geom_area()

Real Data

Real data is often a bit off from being normal, but might be normal-ish

emo_xbar <- mean(emo$Val)
emo_sd <- sd(emo$Val)

emo |> 
  ggplot(aes(x = Val))+
    stat_density()+
    stat_function(
      fun = dnorm,
      args = list(mean = emo_xbar, sd = emo_sd),
      color = "steelblue",
      linewidth = 1) +
    annotate(x = 7, y = 0.3,
             label = "empirical density",
             geom = "label") +
    annotate(x = 2.5, y = 0.2,
             label = "normal density",
             geom = "label",
             color = "steelblue")

Some definitions

mean

The sum of all values, divided by the number of observations

median

The value at which 50% of the data is less than it, and 50% is more.

The mean and median are the same for symmetrical distributions, like the normal distribution.

mean(emo$Val)
[1] 5.063847
median(emo$Val)
[1] 5.2
model <- lm(Val ~ 1, data = emo)
summary(model)

Call:
lm(formula = Val ~ 1, data = emo)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8038 -0.8138  0.1362  0.8862  3.4662 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.06385    0.01081   468.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.275 on 13914 degrees of freedom

Reuse

CC-BY-SA 4.0