library(tidyverse)
library(gt)
source(here::here("_defaults.R"))
# make sure it's >= v1.1.0
packageVersion("dplyr")
[1] '1.1.2'
Josef Fruehwald
February 5, 2023
One of the new functions in dplyr v1.1.0 is dplyr::consecutive_id(), which strikes me as having a few good use cases for linguistic data. The one I’ll illustrate here is for processing transcriptions.
I’ll use a sample transcription extract from LANCS, where the audio has been chunked into “breath groups” and transcribed, along with an identifier of who was speaking, and beginning and end times.
transcription <-
  read_csv("data/KY25A_1.csv")
speaker | start | end | transcript |
---|---|---|---|
IVR | 192.110 | 194.710 | well uh, I have a number of uh |
IVR | 195.530 | 198.620 | things I'd like to ask you about. I wonder if you'd just mind uh. |
IVR | 199.110 | 200.900 | answering questions uh |
IVR | 202.130 | 203.610 | one after another if you |
KY25A | 203.295 | 204.405 | yeah |
KY25A | 204.745 | 205.225 | well |
IVR | 204.740 | 205.805 | if I remind you of a |
KY25A | 205.510 | 207.930 | now you might start that |
KY25A | 208.440 | 209.570 | I was born in |
KY25A | 210.420 | 212.120 | eighteen sixty seven |
IVR | 213.350 | 215.450 | mhm and that makes you how old? |
KY25A | 215.780 | 216.600 | ninety three |
IVR | 216.665 | 217.455 | ninety three |
One thing we might want to do is indicate which sequences of transcription chunks belong to one speaker, corresponding roughly to their speaking turns. I’ve hacked my way through this kind of coding before, but now we can easily add turn numbers with dplyr::consecutive_id(), which will add a column of numbers that increment every time the value in the indicated column changes.
transcription |>
  mutate(
    turn = consecutive_id(speaker)
  )
speaker | start | end | transcript | turn |
---|---|---|---|---|
IVR | 192.110 | 194.710 | well uh, I have a number of uh | 1 |
IVR | 195.530 | 198.620 | things I'd like to ask you about. I wonder if you'd just mind uh. | 1 |
IVR | 199.110 | 200.900 | answering questions uh | 1 |
IVR | 202.130 | 203.610 | one after another if you | 1 |
KY25A | 203.295 | 204.405 | yeah | 2 |
KY25A | 204.745 | 205.225 | well | 2 |
IVR | 204.740 | 205.805 | if I remind you of a | 3 |
KY25A | 205.510 | 207.930 | now you might start that | 4 |
KY25A | 208.440 | 209.570 | I was born in | 4 |
KY25A | 210.420 | 212.120 | eighteen sixty seven | 4 |
IVR | 213.350 | 215.450 | mhm and that makes you how old? | 5 |
KY25A | 215.780 | 216.600 | ninety three | 6 |
IVR | 216.665 | 217.455 | ninety three | 7 |
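For a sense of what consecutive_id() is doing on its own, here is a minimal sketch on a toy vector. The speaker values are made up for illustration (they are not the LANCS data), and the cumsum() line is only one possible pre-v1.1.0 workaround, included as an assumption about what such a hack might look like.

```r
library(dplyr)

# Toy speaker sequence (hypothetical; it only mimics the pattern in the table above)
speaker <- c("IVR", "IVR", "KY25A", "KY25A", "IVR", "KY25A")

# consecutive_id() starts at 1 and increments whenever the value changes
turn <- consecutive_id(speaker)
turn
#> [1] 1 1 2 2 3 4

# One manual equivalent: flag each change point, then take a running sum
turn_manual <- cumsum(speaker != lag(speaker, default = first(speaker))) + 1L
all(turn == turn_manual)
#> [1] TRUE
```

Note that the second IVR stretch gets a new id (3) rather than reusing 1, which is what makes the result usable as a turn number.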
Now we can do things like group the data by turn, and get a new dataframe summarized by turn.
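The summarising step might look something like the sketch below. It reuses the summarise() call that appears in the longer pipeline at the end of this post, but runs it on a handful of rows typed in by hand from the table above, so the object name transcription_sample is an assumption and the result covers only part of the data.

```r
library(dplyr)
library(tibble)
library(stringr)

# A few rows copied by hand from the transcription table (not the full file)
transcription_sample <- tribble(
  ~speaker, ~start,  ~end,    ~transcript,
  "IVR",    199.110, 200.900, "answering questions uh",
  "IVR",    202.130, 203.610, "one after another if you",
  "KY25A",  203.295, 204.405, "yeah",
  "KY25A",  204.745, 205.225, "well",
  "IVR",    204.740, 205.805, "if I remind you of a"
)

turns <- transcription_sample |>
  mutate(turn = consecutive_id(speaker)) |>
  summarise(
    .by = c(turn, speaker),
    start = min(start),                             # first chunk's start time
    end = max(end),                                 # last chunk's end time
    transcript = str_c(transcript, collapse = " ")  # paste the chunks together
  )
```

Grouping by both turn and speaker keeps the speaker column in the output; since speaker is constant within a turn, it does not change the grouping itself.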
turn | speaker | start | end | transcript |
---|---|---|---|---|
1 | IVR | 192.110 | 203.610 | well uh, I have a number of uh things I'd like to ask you about. I wonder if you'd just mind uh. answering questions uh one after another if you |
2 | KY25A | 203.295 | 205.225 | yeah well |
3 | IVR | 204.740 | 205.805 | if I remind you of a |
4 | KY25A | 205.510 | 212.120 | now you might start that I was born in eighteen sixty seven |
5 | IVR | 213.350 | 215.450 | mhm and that makes you how old? |
6 | KY25A | 215.780 | 216.600 | ninety three |
7 | IVR | 216.665 | 217.455 | ninety three |
And then you can start moving on to other analyses, like what the lag was between one speaker’s end and the next’s beginning.
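A minimal sketch of that lag calculation, using the same start - lag(end) expression as the longer pipeline at the end of this post. The first five turn boundaries are typed in by hand from the summarised table above, and the transcript column is left out for brevity.

```r
library(dplyr)
library(tibble)

# Turn-level start and end times copied from the summarised table
turns <- tribble(
  ~turn, ~speaker, ~start,  ~end,
  1,     "IVR",    192.110, 203.610,
  2,     "KY25A",  203.295, 205.225,
  3,     "IVR",    204.740, 205.805,
  4,     "KY25A",  205.510, 212.120,
  5,     "IVR",    213.350, 215.450
) |>
  # this turn's start minus the previous turn's end; NA for the first turn
  mutate(lag = start - lag(end))
```

A negative lag means the turn started before the previous speaker had finished, i.e. the two speakers overlapped.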
turn | speaker | start | end | transcript | lag |
---|---|---|---|---|---|
1 | IVR | 192.110 | 203.610 | well uh, I have a number of uh things I'd like to ask you about. I wonder if you'd just mind uh. answering questions uh one after another if you | NA |
2 | KY25A | 203.295 | 205.225 | yeah well | -0.315 |
3 | IVR | 204.740 | 205.805 | if I remind you of a | -0.485 |
4 | KY25A | 205.510 | 212.120 | now you might start that I was born in eighteen sixty seven | -0.295 |
5 | IVR | 213.350 | 215.450 | mhm and that makes you how old? | 1.230 |
6 | KY25A | 215.780 | 216.600 | ninety three | 0.330 |
7 | IVR | 216.665 | 217.455 | ninety three | 0.065 |
This was just the first example that came to mind, but there are probably a lot of data processing tasks that can be made much less annoying with dplyr::consecutive_id().
I’ll throw the duration of within-turn pauses in there.
transcription |>
  mutate(
    turn = consecutive_id(speaker)
  ) |>
  mutate(
    .by = turn,
    # gap between this chunk's start and the previous chunk's end, within a turn
    pause_dur = start - lag(end),
    transcript = case_when(
      .default = transcript,
      # glue is namespaced because library(tidyverse) does not attach it
      is.finite(pause_dur) ~ glue::glue(
        "<{round(pause_dur, digits = 2)} second pause> {transcript}"
      )
    )
  ) |>
  summarise(
    .by = c(turn, speaker),
    start = min(start),
    end = max(end),
    transcript = str_c(transcript, collapse = " ")
  ) |>
  # gap between this turn's start and the previous turn's end
  mutate(lag = start - lag(end)) |>
  relocate(lag, .before = start)
turn | speaker | lag | start | end | transcript |
---|---|---|---|---|---|
1 | IVR | NA | 192.110 | 203.610 | well uh, I have a number of uh <0.82 second pause> things I'd like to ask you about. I wonder if you'd just mind uh. <0.49 second pause> answering questions uh <1.23 second pause> one after another if you |
2 | KY25A | -0.315 | 203.295 | 205.225 | yeah <0.34 second pause> well |
3 | IVR | -0.485 | 204.740 | 205.805 | if I remind you of a |
4 | KY25A | -0.295 | 205.510 | 212.120 | now you might start that <0.51 second pause> I was born in <0.85 second pause> eighteen sixty seven |
5 | IVR | 1.230 | 213.350 | 215.450 | mhm and that makes you how old? |
6 | KY25A | 0.330 | 215.780 | 216.600 | ninety three |
7 | IVR | 0.065 | 216.665 | 217.455 | ninety three |
@online{fruehwald2023,
author = {Fruehwald, Josef},
title = {A Handy Dplyr Function for Linguistics},
series = {Væl Space},
date = {2023-02-05},
url = {https://jofrhwld.github.io/blog/posts/2023/02/2023-02-05/},
langid = {en}
}