Avatar: The Last Airbender transcript data
Tidy Tuesday 11th August 2020
Tidy Tuesday
Tidy Tuesday is a fun weekly event where people exercise their data wrangling and visualization skills and share it on Twitter. Every week a new dataset is explored. It is hosted by the R for Data Science (R4DS) community.
Last August 4th, the beautiful and informative visualizations I saw on Twitter for the Palmer Penguins Dataset really peaked my interest for Tidy Tuesday. So when they released the dataset for this week and it was about one of my favorite shows growing up, I knew I had to participate!
This week data are the full transcript of the amazing Avatar: The Last Airbender TV show, alongside the ratings for each episode. I’m going to make a simple graph to see the connection among characters and some important concepts in the series.
Load data
avatar <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-11/avatar.csv')
scene_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-11/scene_description.csv')
The data has the following columns:
str(avatar,give.attr = F)
## tibble [13,385 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:13385] 1 2 3 4 5 6 7 8 9 10 ...
## $ book : chr [1:13385] "Water" "Water" "Water" "Water" ...
## $ book_num : num [1:13385] 1 1 1 1 1 1 1 1 1 1 ...
## $ chapter : chr [1:13385] "The Boy in the Iceberg" "The Boy in the Iceberg" "The Boy in the Iceberg" "The Boy in the Iceberg" ...
## $ chapter_num : num [1:13385] 1 1 1 1 1 1 1 1 1 1 ...
## $ character : chr [1:13385] "Katara" "Scene Description" "Sokka" "Scene Description" ...
## $ full_text : chr [1:13385] "Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Av"| __truncated__ "As the title card fades, the scene opens onto a shot of an icy sea before panning slowly to the left, revealing"| __truncated__ "It's not getting away from me this time. [Close-up of the boy as he grins confidently over his shoulder in the "| __truncated__ "The shot pans quickly from the boy to Katara, who seems indifferent to his claim and turns back to her side of "| __truncated__ ...
## $ character_words: chr [1:13385] "Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Av"| __truncated__ NA "It's not getting away from me this time. Watch and learn, Katara. This is how you catch a fish." NA ...
## $ writer : chr [1:13385] "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" ...
## $ director : chr [1:13385] "Dave Filoni" "Dave Filoni" "Dave Filoni" "Dave Filoni" ...
## $ imdb_rating : num [1:13385] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ...
We are not going to use scenes where nobody said anything. Let’s get rid of them.
library(dplyr)
avatar <- avatar %>%
filter(character != "Scene Description")
Characters and words of interest
Every node in the graph will represent either a character of the series or a concept. The main idea is see how much characters reference each of the other characters and concepts. We will have a kind of web of references in the show. Or maybe, sort of a fancy word cloud.
Firs, let’s make one array for characters and words of interest.
characters <- c("Aang", "Sokka", "Katara", "Toph", "Zuko",
"Iroh", "Ozai","Azula")
words_of_interest <- c("Avatar", "Aang", "Sokka", "Katara",
"Toph", "Zuko", "Iroh", "Fire Lord",
"Azula", "kill", "Sozin's Comet",
"save", "Earth Kingdom", "spirit",
"Appa","bison" ,"Ozai", "moon",
"Agni Kai", "bending", "water tribe",
"balance", "honor", "Dai Li")
There are tribes for each element (water, earth, air and fire) in the show. To illustrate how the tribes are represented in the characters and concepts we will define each word of interest as relating to a tribe or to a general concept. We expect to see people from a given tribe referencing more concepts and people from that tribe. Conversely, some general concepts, such as “Avatar” might be referenced many times by all tribes.
tribes <- c("concept", "air", "water","water",
"earth", "fire", "fire", "fire",
"fire", "concept","fire",
"concept", "earth", "concept",
"air", "air", "fire", "water",
"fire", "concept", "water",
"concept", "concept", "earth")
levels_tribes <- sort(unique(tribes))
tribes <- data.frame(words = words_of_interest,
tribes = factor(tribes,
levels = levels_tribes,
ordered = TRUE
)
)
tribes
words | tribes |
---|---|
Avatar | concept |
Aang | air |
Sokka | water |
Katara | water |
Toph | earth |
Zuko | fire |
Iroh | fire |
Fire Lord | fire |
Azula | fire |
kill | concept |
Sozin’s Comet | fire |
save | concept |
Earth Kingdom | earth |
spirit | concept |
Appa | air |
bison | air |
Ozai | fire |
moon | water |
Agni Kai | fire |
bending | concept |
water tribe | water |
balance | concept |
honor | concept |
Dai Li | earth |
Prepare graph
We are going to prepare the graph with the packages igraph and tidygraph. First we will make a adjacency table where the first column is a chacarater, the second is a word she/he referenced and the third is how many times this reference occured. The most times a character said something, stronger will be the connection between them. This will show in proximity and size of the connection in the network. In graphs, this is usually called weight. We are going to keep only references made at least 5 times in the series.
Once we have the adjacency table, it is easy to convert it to a graph.
This is the “data wrangling” part of my Tidy Tuesday.
library(tibble)
library(tidyr)
library(stringr)
library(purrr)
library(igraph)
library(tidygraph)
adjacencies <- rep(list(words_of_interest),length(characters))
names(adjacencies) <- characters
adjacencies <- adjacencies %>%
enframe("character", "word") %>%
unnest(word) %>%
nest_join(select(avatar,character, character_words),
"character", name = "lines") %>%
group_by(character, word) %>%
mutate(weight = map_dbl(lines, ~sum(str_count(.$character_words, word)))) %>%
filter(weight > 4, character != word) %>%
select(-lines)
graph <- graph_from_data_frame(adjacencies)
graph <- as_tbl_graph(graph)
graph <- left_join(graph, tribes, by = c(name = "words"))
adjacencies
character | word | weight |
---|---|---|
Aang | Avatar | 56 |
Aang | Sokka | 87 |
Aang | Katara | 101 |
Aang | Toph | 22 |
Aang | Zuko | 21 |
Aang | Fire Lord | 33 |
Aang | kill | 7 |
Aang | save | 12 |
Aang | spirit | 22 |
Aang | Appa | 106 |
Aang | bison | 12 |
Aang | Ozai | 7 |
Aang | bending | 49 |
Aang | honor | 5 |
Sokka | Avatar | 26 |
Sokka | Aang | 98 |
Sokka | Katara | 69 |
Sokka | Toph | 25 |
Sokka | Zuko | 22 |
Sokka | Fire Lord | 25 |
Sokka | kill | 8 |
Sokka | save | 5 |
Sokka | Earth Kingdom | 6 |
Sokka | spirit | 9 |
Sokka | Appa | 35 |
Sokka | bison | 9 |
Sokka | moon | 5 |
Sokka | bending | 25 |
Katara | Avatar | 36 |
Katara | Aang | 220 |
Katara | Sokka | 108 |
Katara | Toph | 26 |
Katara | Zuko | 19 |
Katara | kill | 5 |
Katara | save | 9 |
Katara | Earth Kingdom | 6 |
Katara | Appa | 28 |
Katara | bending | 47 |
Toph | Aang | 11 |
Toph | Sokka | 15 |
Toph | Katara | 20 |
Toph | Zuko | 5 |
Toph | Fire Lord | 7 |
Toph | bending | 15 |
Zuko | Avatar | 57 |
Zuko | Aang | 9 |
Zuko | Sokka | 8 |
Zuko | Katara | 5 |
Zuko | Fire Lord | 7 |
Zuko | Azula | 14 |
Zuko | kill | 5 |
Zuko | save | 8 |
Zuko | Earth Kingdom | 5 |
Zuko | bending | 19 |
Zuko | honor | 16 |
Iroh | Avatar | 18 |
Iroh | Zuko | 54 |
Iroh | Fire Lord | 10 |
Iroh | Azula | 5 |
Iroh | moon | 5 |
Iroh | bending | 5 |
Iroh | balance | 5 |
Iroh | honor | 10 |
Ozai | Zuko | 6 |
Ozai | Fire Lord | 7 |
Ozai | Azula | 5 |
Azula | Avatar | 13 |
Azula | Zuko | 18 |
Azula | Fire Lord | 8 |
Azula | honor | 6 |
Azula | Dai Li | 6 |
Plot
The handy package ggraph allow us to plot graphs in a similiar way to ggplot plotting. We also make a custom fill and color scale to be true to Avatar’s palette. I found the colors in the nice palletes by emilytjames. We also embbed the iconic font used in Avatar.
library(ggplot2)
library(ggraph)
theme_set(theme_bw(14))
# Avatar fill color scale
avatar_fill <- function (...) { scale_fill_manual(values = c(
"#F1B84A", "grey20",
"#4FA24C", "#900B0B",
"#4A82CB")
) }
avatar_color <- function (...) { scale_colour_manual(values = c(
"#F1B84A", "grey20",
"#4FA24C", "#900B0B",
"#4A82CB")
) }
# Avatar font
windowsFonts(avatar_font = windowsFont("Herculanum"))
ggraph(graph, layout = 'fr')+
geom_edge_link(aes(width = weight), color = 'grey70')+
geom_node_point(aes(color = tribes))+
geom_node_label(aes(label = name, fill = tribes),
color = "white", fontface = "bold",
family = "avatar_font", size = 5,
repel = TRUE)+
avatar_fill()+
avatar_color()+
theme(legend.position = "top")+
labs(title = "Avatar referencing graph", color = "Tribes",
fill = "Tribes", width = "Weight")
Nice! That’s our graph. We can see how the central characters in the plot are also in the center of the graph. They reference among themselves a lot. The most central concept in our graph is, of course, “Avatar”. We can also get a glimpse of how the idea of Honor is important for the Fire Tribe. And we get the interesting fact that the character that cites the most the Earth Kingdom concept of the royal guard Dai Li is actually Azula, from the Fire Tribe. Well, if you watched the show you know why that is, if you haven’t, what are you waiting for?