Avatar: The Last Airbender transcript data

Tidy Tuesday 11th August 2020

Aug 13, 2020 7 min read R, Tidy Tuesday

Tidy Tuesday

Tidy Tuesday is a fun weekly event where people exercise their data wrangling and visualization skills and share it on Twitter. Every week a new dataset is explored. It is hosted by the R for Data Science (R4DS) community.

Last August 4th, the beautiful and informative visualizations I saw on Twitter for the Palmer Penguins Dataset really peaked my interest for Tidy Tuesday. So when they released the dataset for this week and it was about one of my favorite shows growing up, I knew I had to participate!

This week data are the full transcript of the amazing Avatar: The Last Airbender TV show, alongside the ratings for each episode. I’m going to make a simple graph to see the connection among characters and some important concepts in the series.

Load data

avatar <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-11/avatar.csv')
scene_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-11/scene_description.csv')

The data has the following columns:

str(avatar,give.attr = F)

## tibble [13,385 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id             : num [1:13385] 1 2 3 4 5 6 7 8 9 10 ...
##  $ book           : chr [1:13385] "Water" "Water" "Water" "Water" ...
##  $ book_num       : num [1:13385] 1 1 1 1 1 1 1 1 1 1 ...
##  $ chapter        : chr [1:13385] "The Boy in the Iceberg" "The Boy in the Iceberg" "The Boy in the Iceberg" "The Boy in the Iceberg" ...
##  $ chapter_num    : num [1:13385] 1 1 1 1 1 1 1 1 1 1 ...
##  $ character      : chr [1:13385] "Katara" "Scene Description" "Sokka" "Scene Description" ...
##  $ full_text      : chr [1:13385] "Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Av"| __truncated__ "As the title card fades, the scene opens onto a shot of an icy sea before panning slowly to the left, revealing"| __truncated__ "It's not getting away from me this time. [Close-up of the boy as he grins confidently over his shoulder in the "| __truncated__ "The shot pans quickly from the boy to Katara, who seems indifferent to his claim and turns back to her side of "| __truncated__ ...
##  $ character_words: chr [1:13385] "Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Av"| __truncated__ NA "It's not getting away from me this time.  Watch and learn, Katara. This is how you catch a fish." NA ...
##  $ writer         : chr [1:13385] "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" ...
##  $ director       : chr [1:13385] "Dave Filoni" "Dave Filoni" "Dave Filoni" "Dave Filoni" ...
##  $ imdb_rating    : num [1:13385] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ...

We are not going to use scenes where nobody said anything. Let’s get rid of them.

library(dplyr)
avatar <- avatar %>% 
          filter(character != "Scene Description")

Characters and words of interest

Every node in the graph will represent either a character of the series or a concept. The main idea is see how much characters reference each of the other characters and concepts. We will have a kind of web of references in the show. Or maybe, sort of a fancy word cloud.

Firs, let’s make one array for characters and words of interest.

characters <- c("Aang", "Sokka", "Katara", "Toph", "Zuko",
                 "Iroh", "Ozai","Azula")
words_of_interest <- c("Avatar", "Aang", "Sokka", "Katara", 
                       "Toph", "Zuko", "Iroh", "Fire Lord",
                       "Azula", "kill", "Sozin's Comet", 
                       "save", "Earth Kingdom", "spirit",
                       "Appa","bison" ,"Ozai", "moon",
                       "Agni Kai", "bending", "water tribe",
                       "balance", "honor", "Dai Li")

There are tribes for each element (water, earth, air and fire) in the show. To illustrate how the tribes are represented in the characters and concepts we will define each word of interest as relating to a tribe or to a general concept. We expect to see people from a given tribe referencing more concepts and people from that tribe. Conversely, some general concepts, such as “Avatar” might be referenced many times by all tribes.

tribes <- c("concept", "air", "water","water",
            "earth", "fire", "fire", "fire",
            "fire", "concept","fire",
            "concept", "earth", "concept",
            "air", "air", "fire", "water",
            "fire", "concept", "water",
            "concept", "concept", "earth")

levels_tribes <- sort(unique(tribes))
tribes <- data.frame(words = words_of_interest, 
                     tribes = factor(tribes,
                                     levels = levels_tribes,
                                     ordered = TRUE
                                     )
                     )

tribes

words	tribes
Avatar	concept
Aang	air
Sokka	water
Katara	water
Toph	earth
Zuko	fire
Iroh	fire
Fire Lord	fire
Azula	fire
kill	concept
Sozin’s Comet	fire
save	concept
Earth Kingdom	earth
spirit	concept
Appa	air
bison	air
Ozai	fire
moon	water
Agni Kai	fire
bending	concept
water tribe	water
balance	concept
honor	concept
Dai Li	earth

Prepare graph

We are going to prepare the graph with the packages igraph and tidygraph. First we will make a adjacency table where the first column is a chacarater, the second is a word she/he referenced and the third is how many times this reference occured. The most times a character said something, stronger will be the connection between them. This will show in proximity and size of the connection in the network. In graphs, this is usually called weight. We are going to keep only references made at least 5 times in the series.

Once we have the adjacency table, it is easy to convert it to a graph.

This is the “data wrangling” part of my Tidy Tuesday.

library(tibble)
library(tidyr)
library(stringr)
library(purrr)
library(igraph)
library(tidygraph)

adjacencies <- rep(list(words_of_interest),length(characters))
names(adjacencies) <- characters

adjacencies <- adjacencies %>% 
  enframe("character", "word") %>% 
  unnest(word) %>% 
   nest_join(select(avatar,character, character_words),
             "character", name = "lines") %>% 
  group_by(character, word) %>% 
  mutate(weight = map_dbl(lines, ~sum(str_count(.$character_words, word)))) %>% 
  filter(weight > 4, character != word) %>% 
  select(-lines)

graph <- graph_from_data_frame(adjacencies)
graph <- as_tbl_graph(graph)
graph <- left_join(graph, tribes, by = c(name = "words"))

adjacencies

character	word	weight
Aang	Avatar	56
Aang	Sokka	87
Aang	Katara	101
Aang	Toph	22
Aang	Zuko	21
Aang	Fire Lord	33
Aang	kill	7
Aang	save	12
Aang	spirit	22
Aang	Appa	106
Aang	bison	12
Aang	Ozai	7
Aang	bending	49
Aang	honor	5
Sokka	Avatar	26
Sokka	Aang	98
Sokka	Katara	69
Sokka	Toph	25
Sokka	Zuko	22
Sokka	Fire Lord	25
Sokka	kill	8
Sokka	save	5
Sokka	Earth Kingdom	6
Sokka	spirit	9
Sokka	Appa	35
Sokka	bison	9
Sokka	moon	5
Sokka	bending	25
Katara	Avatar	36
Katara	Aang	220
Katara	Sokka	108
Katara	Toph	26
Katara	Zuko	19
Katara	kill	5
Katara	save	9
Katara	Earth Kingdom	6
Katara	Appa	28
Katara	bending	47
Toph	Aang	11
Toph	Sokka	15
Toph	Katara	20
Toph	Zuko	5
Toph	Fire Lord	7
Toph	bending	15
Zuko	Avatar	57
Zuko	Aang	9
Zuko	Sokka	8
Zuko	Katara	5
Zuko	Fire Lord	7
Zuko	Azula	14
Zuko	kill	5
Zuko	save	8
Zuko	Earth Kingdom	5
Zuko	bending	19
Zuko	honor	16
Iroh	Avatar	18
Iroh	Zuko	54
Iroh	Fire Lord	10
Iroh	Azula	5
Iroh	moon	5
Iroh	bending	5
Iroh	balance	5
Iroh	honor	10
Ozai	Zuko	6
Ozai	Fire Lord	7
Ozai	Azula	5
Azula	Avatar	13
Azula	Zuko	18
Azula	Fire Lord	8
Azula	honor	6
Azula	Dai Li	6

Plot

The handy package ggraph allow us to plot graphs in a similiar way to ggplot plotting. We also make a custom fill and color scale to be true to Avatar’s palette. I found the colors in the nice palletes by emilytjames. We also embbed the iconic font used in Avatar.

library(ggplot2)
library(ggraph)
theme_set(theme_bw(14))

# Avatar fill color scale
avatar_fill <- function (...) { scale_fill_manual(values =  c(
                               "#F1B84A", "grey20", 
                               "#4FA24C", "#900B0B",
                               "#4A82CB")
) }
avatar_color <- function (...) { scale_colour_manual(values =  c(
                               "#F1B84A", "grey20", 
                               "#4FA24C", "#900B0B",
                               "#4A82CB")
) }
# Avatar font
windowsFonts(avatar_font = windowsFont("Herculanum"))

ggraph(graph, layout = 'fr')+
    geom_edge_link(aes(width = weight), color = 'grey70')+
    geom_node_point(aes(color = tribes))+
    geom_node_label(aes(label = name, fill = tribes),
                    color = "white", fontface = "bold",
                    family = "avatar_font", size = 5,
                    repel = TRUE)+
  avatar_fill()+
  avatar_color()+
  theme(legend.position = "top")+
  labs(title = "Avatar referencing graph", color = "Tribes",
       fill = "Tribes", width = "Weight")

Nice! That’s our graph. We can see how the central characters in the plot are also in the center of the graph. They reference among themselves a lot. The most central concept in our graph is, of course, “Avatar”. We can also get a glimpse of how the idea of Honor is important for the Fire Tribe. And we get the interesting fact that the character that cites the most the Earth Kingdom concept of the royal guard Dai Li is actually Azula, from the Fire Tribe. Well, if you watched the show you know why that is, if you haven’t, what are you waiting for?

R Tidy Tuesday

Mateus Silvestrin

PhD

PhD in Neuroscience and Cognition. My thesis was about the neural correlates of time perception, developed in the Timing Lab at the Federal University of ABC. I am currently a post-doc (FAPESP grant) at the Social and Cognitive Neuroscience Lab at the Mackenzie Presbyterian University, where I study associations between affect and moral judgement. I also colaborate with the Developmental, Affective and Social Neuroscience Lab.