A handy dplyr function for linguistics

Author

Josef Fruehwald

Published

February 5, 2023

One of the new functions in dplyr v1.1.0 is dplyr::consecutive_id(), which strikes me as having a few good use cases for linguistic data. The one I’ll illustrate here is for processing transcriptions.

library(tidyverse)
library(gt)

source(here::here("_defaults.R"))

# make sure it's >= v1.1.0
packageVersion("dplyr")
[1] '1.1.2'

I’ll use a sample transcription extract from LANCS, where the audio has been chunked into “breath groups” and transcribed, along with an identifier of who was speaking, and beginning and end times.

transcription <- 
  read_csv("data/KY25A_1.csv")
speaker start end transcript
IVR 192.110 194.710 well uh, I have a number of uh
IVR 195.530 198.620 things I'd like to ask you about. I wonder if you'd just mind uh.
IVR 199.110 200.900 answering questions uh
IVR 202.130 203.610 one after another if you
KY25A 203.295 204.405 yeah
KY25A 204.745 205.225 well
IVR 204.740 205.805 if I remind you of a
KY25A 205.510 207.930 now you might start that
KY25A 208.440 209.570 I was born in
KY25A 210.420 212.120 eighteen sixty seven
IVR 213.350 215.450 mhm and that makes you how old?
KY25A 215.780 216.600 ninety three
IVR 216.665 217.455 ninety three

One thing we might want to do is indicate which sequences of transcription chunks belong to one speaker, corresponding roughly to their speaking turns. I’ve hacked my way through this kind of coding before, but now we can easily add turn numbers with dplyr::consecutive_id(), which will add a column of numbers that increment every time the value in the indicated column changes.

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  )
speaker start end transcript turn
IVR 192.110 194.710 well uh, I have a number of uh 1
IVR 195.530 198.620 things I'd like to ask you about. I wonder if you'd just mind uh. 1
IVR 199.110 200.900 answering questions uh 1
IVR 202.130 203.610 one after another if you 1
KY25A 203.295 204.405 yeah 2
KY25A 204.745 205.225 well 2
IVR 204.740 205.805 if I remind you of a 3
KY25A 205.510 207.930 now you might start that 4
KY25A 208.440 209.570 I was born in 4
KY25A 210.420 212.120 eighteen sixty seven 4
IVR 213.350 215.450 mhm and that makes you how old? 5
KY25A 215.780 216.600 ninety three 6
IVR 216.665 217.455 ninety three 7
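
To see the incrementing behaviour in isolation, here is a minimal sketch on a toy vector. The speaker labels are made up for illustration and are not from the transcript, and the base R version built on rle() is only a rough equivalent for vectors without missing values.

```r
library(dplyr)

# Toy speaker labels (hypothetical, not taken from the transcript)
speaker <- c("A", "A", "B", "A", "A", "B")

# The id goes up by one whenever the value differs from the previous element,
# so a speaker who returns later gets a new id rather than their old one
turn <- consecutive_id(speaker)
turn
#> [1] 1 1 2 3 3 4

# A rough base R equivalent using run-length encoding
runs <- rle(speaker)
turn_base <- rep(seq_along(runs$lengths), times = runs$lengths)
turn_base
#> [1] 1 1 2 3 3 4
```

Note that the ids track runs, not unique values, which is what makes this useful for turns.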

Now we can do things like group the data by turn, and get a new dataframe summarized by turn.

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  ) |> 
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " "),
  )
turn speaker start end transcript
1 IVR 192.110 203.610 well uh, I have a number of uh things I'd like to ask you about. I wonder if you'd just mind uh. answering questions uh one after another if you
2 KY25A 203.295 205.225 yeah well
3 IVR 204.740 205.805 if I remind you of a
4 KY25A 205.510 212.120 now you might start that I was born in eighteen sixty seven
5 IVR 213.350 215.450 mhm and that makes you how old?
6 KY25A 215.780 216.600 ninety three
7 IVR 216.665 217.455 ninety three

And then you can start moving on to other analyses, like the lag between the end of one speaker’s turn and the start of the next. A negative lag means the next speaker started before the previous one had finished.

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  ) |> 
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " "),
  ) |> 
  mutate(lag = start - lag(end))
turn speaker start end transcript lag
1 IVR 192.110 203.610 well uh, I have a number of uh things I'd like to ask you about. I wonder if you'd just mind uh. answering questions uh one after another if you NA
2 KY25A 203.295 205.225 yeah well -0.315
3 IVR 204.740 205.805 if I remind you of a -0.485
4 KY25A 205.510 212.120 now you might start that I was born in eighteen sixty seven -0.295
5 IVR 213.350 215.450 mhm and that makes you how old? 1.230
6 KY25A 215.780 216.600 ninety three 0.330
7 IVR 216.665 217.455 ninety three 0.065
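
Since a negative lag means two turns overlap, one possible next step is to flag those turns explicitly. This is a sketch rather than part of the original analysis: the `turns` tibble retypes the start and end times of the first five turns from the table above so the example runs without the CSV, and the `overlapping` column name is my own choice.

```r
library(dplyr)

# Start and end times of the first five turns, retyped from the table above
turns <- tibble(
  turn    = 1:5,
  speaker = c("IVR", "KY25A", "IVR", "KY25A", "IVR"),
  start   = c(192.110, 203.295, 204.740, 205.510, 213.350),
  end     = c(203.610, 205.225, 205.805, 212.120, 215.450)
)

turn_lags <- turns |>
  mutate(
    # time between the previous turn's end and this turn's start
    lag = start - lag(end),
    # TRUE when this turn began before the previous one ended
    overlapping = lag < 0
  )

turn_lags
```

The first turn has nothing before it, so both `lag` and `overlapping` are NA there.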

This was just the first example that came to mind, but there are probably a lot of data processing tasks that can be made a lot less annoying with dplyr::consecutive_id().

Extra

I’ll throw the duration of within-turn pauses in there.

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  ) |> 
  mutate(
    .by = turn,
    pause_dur = start - lag(end),
    transcript = case_when(
      .default = transcript,
      # glue is not attached by library(tidyverse), so call it with its namespace
      is.finite(pause_dur) ~ glue::glue(
        "<{round(pause_dur, digits = 2)} second pause> {transcript}"
      )
    )
  ) |> 
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " "),
  ) |> 
  mutate(lag = start - lag(end)) |> 
  relocate(lag,  .before = start)
turn speaker lag start end transcript
1 IVR NA 192.110 203.610 well uh, I have a number of uh <0.82 second pause> things I'd like to ask you about. I wonder if you'd just mind uh. <0.49 second pause> answering questions uh <1.23 second pause> one after another if you
2 KY25A -0.315 203.295 205.225 yeah <0.34 second pause> well
3 IVR -0.485 204.740 205.805 if I remind you of a
4 KY25A -0.295 205.510 212.120 now you might start that <0.51 second pause> I was born in <0.85 second pause> eighteen sixty seven
5 IVR 1.230 213.350 215.450 mhm and that makes you how old?
6 KY25A 0.330 215.780 216.600 ninety three
7 IVR 0.065 216.665 217.455 ninety three
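
The within-turn pauses rely on `.by = turn` restarting lag() inside each turn, so the first chunk of a turn has no previous chunk to compare against and gets NA. A minimal sketch of just that step, with four chunks retyped from the transcript above so it runs without the CSV:

```r
library(dplyr)

# Four breath groups retyped from the transcript above
chunks <- tibble(
  speaker = c("IVR", "IVR", "KY25A", "KY25A"),
  start   = c(199.110, 202.130, 203.295, 204.745),
  end     = c(200.900, 203.610, 204.405, 205.225)
)

pauses <- chunks |>
  mutate(turn = consecutive_id(speaker)) |>
  mutate(
    # lag() is computed separately within each turn
    .by = turn,
    pause_dur = start - lag(end)
  )

pauses
```

That NA is why the code above checks is.finite(pause_dur) before adding the pause label to a chunk.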

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2023,
  author = {Fruehwald, Josef},
  title = {A Handy Dplyr Function for Linguistics},
  series = {Væl Space},
  date = {2023-02-05},
  url = {https://jofrhwld.github.io/blog/posts/2023/02/2023-02-05/},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2023. “A Handy Dplyr Function for Linguistics.” Væl Space. February 5, 2023. https://jofrhwld.github.io/blog/posts/2023/02/2023-02-05/.