A handy dplyr function for linguistics

Josef Fruehwald

doi:10.59350/bpbje-afg38

One of the new functions in dplyr v1.1.0 is dplyr::consecutive_id(), which strikes me as having a few good use cases for linguistic data. The one I’ll illustrate here is for processing transcriptions.

library(tidyverse)
library(gt)

source(here::here("_defaults.R"))

# make sure it's >= v1.1.0
packageVersion("dplyr")

[1] '1.1.2'

I’ll use a sample transcription extract from LANCS, where the audio has been chunked into “breath groups” and transcribed, along with an identifier of who was speaking, and beginning and end times.

transcription <- 
  read_csv("data/KY25A_1.csv")

speaker	start	end	transcript
IVR	192.110	194.710	well uh, I have a number of uh
IVR	195.530	198.620	things I'd like to ask you about. I wonder if you'd just mind uh.
IVR	199.110	200.900	answering questions uh
IVR	202.130	203.610	one after another if you
KY25A	203.295	204.405	yeah
KY25A	204.745	205.225	well
IVR	204.740	205.805	if I remind you of a
KY25A	205.510	207.930	now you might start that
KY25A	208.440	209.570	I was born in
KY25A	210.420	212.120	eighteen sixty seven
IVR	213.350	215.450	mhm and that makes you how old?
KY25A	215.780	216.600	ninety three
IVR	216.665	217.455	ninety three

One thing we might want to do is indicate which sequences of transcription chunks belong to one speaker, corresponding roughly to their speaking turns. I’ve hacked my way through this kind of coding before, but now we can easily add turn numbers with dplyr::consecutive_id(), which will add a column of numbers that increment every time the value in the indicated column changes.

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  )

speaker	start	end	transcript	turn
IVR	192.110	194.710	well uh, I have a number of uh	1
IVR	195.530	198.620	things I'd like to ask you about. I wonder if you'd just mind uh.	1
IVR	199.110	200.900	answering questions uh	1
IVR	202.130	203.610	one after another if you	1
KY25A	203.295	204.405	yeah	2
KY25A	204.745	205.225	well	2
IVR	204.740	205.805	if I remind you of a	3
KY25A	205.510	207.930	now you might start that	4
KY25A	208.440	209.570	I was born in	4
KY25A	210.420	212.120	eighteen sixty seven	4
IVR	213.350	215.450	mhm and that makes you how old?	5
KY25A	215.780	216.600	ninety three	6
IVR	216.665	217.455	ninety three	7
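To see the incrementing behaviour in isolation, here is a minimal sketch on a made-up vector of speaker labels (not taken from the transcript). For comparison it also includes two older idioms that give the same numbering, one using cumsum() and one using base R's rle(). These are just common ways of doing this, not necessarily the best ones.

```r
library(dplyr)

# Toy speaker labels, invented for illustration
speaker <- c("IVR", "IVR", "KY25A", "IVR", "IVR", "KY25A")

# dplyr >= 1.1.0: the id increments every time the value changes
turn_new <- consecutive_id(speaker)

# An older idiom: count the positions where the value differs
# from the one before it
turn_old <- cumsum(speaker != lag(speaker, default = first(speaker))) + 1

# A base R alternative built on run-length encoding
runs <- rle(speaker)
turn_rle <- rep(seq_along(runs$lengths), times = runs$lengths)

turn_new
# [1] 1 1 2 3 3 4
```

All three give the same turn numbers here, but consecutive_id() says what it does in its name.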

Now we can do things like group the data by turn, and get a new dataframe summarized by turn.

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  ) |> 
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " "),
  )

turn	speaker	start	end	transcript
1	IVR	192.110	203.610	well uh, I have a number of uh things I'd like to ask you about. I wonder if you'd just mind uh. answering questions uh one after another if you
2	KY25A	203.295	205.225	yeah well
3	IVR	204.740	205.805	if I remind you of a
4	KY25A	205.510	212.120	now you might start that I was born in eighteen sixty seven
5	IVR	213.350	215.450	mhm and that makes you how old?
6	KY25A	215.780	216.600	ninety three
7	IVR	216.665	217.455	ninety three
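The .by argument is also new in dplyr v1.1.0. As a rough sketch of what it is doing, the same summary can be written with group_by(), as long as the grouping is dropped afterwards. The small chunks data frame below is just three rows copied from the transcript, with the turn column typed in by hand.

```r
library(dplyr)
library(stringr)

# Three chunks from the transcript, with turn numbers already added
chunks <- tibble(
  speaker    = c("KY25A", "KY25A", "IVR"),
  start      = c(203.295, 204.745, 204.740),
  end        = c(204.405, 205.225, 205.805),
  transcript = c("yeah", "well", "if I remind you of a"),
  turn       = c(2L, 2L, 3L)
)

# Per-operation grouping with .by
by_turn <- chunks |>
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " ")
  )

# The same summary with explicit grouping, dropped at the end
grouped <- chunks |>
  group_by(turn, speaker) |>
  summarise(
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " "),
    .groups = "drop"
  )
```

One difference to keep in mind is that .by keeps groups in order of first appearance, while group_by() sorts by the grouping keys. Because turn only ever increases, the two agree here.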

And then you can start moving on to other analyses, like what the lag was between one speaker’s end and the next speaker’s beginning. Negative lags mean the next speaker started before the previous one had finished.

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  ) |> 
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " "),
  ) |> 
  mutate(lag = start - lag(end))

turn	speaker	start	end	transcript	lag
1	IVR	192.110	203.610	well uh, I have a number of uh things I'd like to ask you about. I wonder if you'd just mind uh. answering questions uh one after another if you	NA
2	KY25A	203.295	205.225	yeah well	-0.315
3	IVR	204.740	205.805	if I remind you of a	-0.485
4	KY25A	205.510	212.120	now you might start that I was born in eighteen sixty seven	-0.295
5	IVR	213.350	215.450	mhm and that makes you how old?	1.230
6	KY25A	215.780	216.600	ninety three	0.330
7	IVR	216.665	217.455	ninety three	0.065
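Since a negative lag means the next turn started before the previous one ended, the lag column can be turned into an overlap flag. Here is a minimal sketch using the start and end times of the first five turns from the table above, typed in by hand. The overlapping column name is just my choice for illustration.

```r
library(dplyr)

# Start and end times of the first five turns, copied from the table above
turns <- tibble(
  turn    = 1:5,
  speaker = c("IVR", "KY25A", "IVR", "KY25A", "IVR"),
  start   = c(192.110, 203.295, 204.740, 205.510, 213.350),
  end     = c(203.610, 205.225, 205.805, 212.120, 215.450)
)

turns_lag <- turns |>
  mutate(
    # time between the previous turn's end and this turn's start
    lag = start - lag(end),
    # a negative lag means this turn began during the previous one
    overlapping = lag < 0
  )

turns_lag$overlapping
# [1]    NA  TRUE  TRUE  TRUE FALSE
```

The first turn has nothing before it, so both its lag and its overlap flag are NA.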

This was just the first example that came to mind, but there are probably a lot of data processing tasks that can be made a lot less annoying with dplyr::consecutive_id().

Extra

I’ll also throw the duration of within-turn pauses into the transcript text.

library(glue)

transcription |> 
  mutate(
    turn = consecutive_id(speaker)
  ) |> 
  mutate(
    .by = turn,
    pause_dur = start - lag(end),
    transcript = case_when(
      .default = transcript,
      is.finite(pause_dur) ~ glue(
        "<{round(pause_dur, digits = 2)} second pause> {transcript}"
      )
    )
  ) |> 
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " "),
  ) |> 
  mutate(lag = start - lag(end)) |> 
  relocate(lag,  .before = start)

turn	speaker	lag	start	end	transcript
1	IVR	NA	192.110	203.610	well uh, I have a number of uh <0.82 second pause> things I'd like to ask you about. I wonder if you'd just mind uh. <0.49 second pause> answering questions uh <1.23 second pause> one after another if you
2	KY25A	-0.315	203.295	205.225	yeah <0.34 second pause> well
3	IVR	-0.485	204.740	205.805	if I remind you of a
4	KY25A	-0.295	205.510	212.120	now you might start that <0.51 second pause> I was born in <0.85 second pause> eighteen sixty seven
5	IVR	1.230	213.350	215.450	mhm and that makes you how old?
6	KY25A	0.330	215.780	216.600	ninety three
7	IVR	0.065	216.665	217.455	ninety three

Reuse

CC-BY 4.0

Citation

BibTeX citation:

@online{fruehwald2023,
  author = {Fruehwald, Josef},
  title = {A Handy Dplyr Function for Linguistics},
  series = {Væl Space},
  date = {2023-02-05},
  url = {https://jofrhwld.github.io/blog/posts/2023/02/2023-02-05/},
  doi = {10.59350/bpbje-afg38},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2023. “A Handy Dplyr Function for Linguistics.” Væl Space. February 5, 2023. https://doi.org/10.59350/bpbje-afg38.