Avatar: The Last Airbender transcript data

Tidy Tuesday 11th August 2020

Tidy Tuesday

Tidy Tuesday is a fun weekly event where people exercise their data wrangling and visualization skills and share it on Twitter. Every week a new dataset is explored. It is hosted by the R for Data Science (R4DS) community.

Last August 4th, the beautiful and informative visualizations I saw on Twitter for the Palmer Penguins Dataset really peaked my interest for Tidy Tuesday. So when they released the dataset for this week and it was about one of my favorite shows growing up, I knew I had to participate!

This week data are the full transcript of the amazing Avatar: The Last Airbender TV show, alongside the ratings for each episode. I’m going to make a simple graph to see the connection among characters and some important concepts in the series.

Load data

avatar <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-11/avatar.csv')
scene_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-11/scene_description.csv')

The data has the following columns:

str(avatar,give.attr = F)
## tibble [13,385 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id             : num [1:13385] 1 2 3 4 5 6 7 8 9 10 ...
##  $ book           : chr [1:13385] "Water" "Water" "Water" "Water" ...
##  $ book_num       : num [1:13385] 1 1 1 1 1 1 1 1 1 1 ...
##  $ chapter        : chr [1:13385] "The Boy in the Iceberg" "The Boy in the Iceberg" "The Boy in the Iceberg" "The Boy in the Iceberg" ...
##  $ chapter_num    : num [1:13385] 1 1 1 1 1 1 1 1 1 1 ...
##  $ character      : chr [1:13385] "Katara" "Scene Description" "Sokka" "Scene Description" ...
##  $ full_text      : chr [1:13385] "Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Av"| __truncated__ "As the title card fades, the scene opens onto a shot of an icy sea before panning slowly to the left, revealing"| __truncated__ "It's not getting away from me this time. [Close-up of the boy as he grins confidently over his shoulder in the "| __truncated__ "The shot pans quickly from the boy to Katara, who seems indifferent to his claim and turns back to her side of "| __truncated__ ...
##  $ character_words: chr [1:13385] "Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Av"| __truncated__ NA "It's not getting away from me this time.  Watch and learn, Katara. This is how you catch a fish." NA ...
##  $ writer         : chr [1:13385] "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" "<U+200E>Michael Dante DiMartino, Bryan Konietzko, Aaron Ehasz, Peter Goldfinger, Josh Stolberg" ...
##  $ director       : chr [1:13385] "Dave Filoni" "Dave Filoni" "Dave Filoni" "Dave Filoni" ...
##  $ imdb_rating    : num [1:13385] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ...

We are not going to use scenes where nobody said anything. Let’s get rid of them.

avatar <- avatar %>% 
          filter(character != "Scene Description")

Characters and words of interest

Every node in the graph will represent either a character of the series or a concept. The main idea is see how much characters reference each of the other characters and concepts. We will have a kind of web of references in the show. Or maybe, sort of a fancy word cloud.

Firs, let’s make one array for characters and words of interest.

characters <- c("Aang", "Sokka", "Katara", "Toph", "Zuko",
                 "Iroh", "Ozai","Azula")
words_of_interest <- c("Avatar", "Aang", "Sokka", "Katara", 
                       "Toph", "Zuko", "Iroh", "Fire Lord",
                       "Azula", "kill", "Sozin's Comet", 
                       "save", "Earth Kingdom", "spirit",
                       "Appa","bison" ,"Ozai", "moon",
                       "Agni Kai", "bending", "water tribe",
                       "balance", "honor", "Dai Li")

There are tribes for each element (water, earth, air and fire) in the show. To illustrate how the tribes are represented in the characters and concepts we will define each word of interest as relating to a tribe or to a general concept. We expect to see people from a given tribe referencing more concepts and people from that tribe. Conversely, some general concepts, such as “Avatar” might be referenced many times by all tribes.

tribes <- c("concept", "air", "water","water",
            "earth", "fire", "fire", "fire",
            "fire", "concept","fire",
            "concept", "earth", "concept",
            "air", "air", "fire", "water",
            "fire", "concept", "water",
            "concept", "concept", "earth")

levels_tribes <- sort(unique(tribes))
tribes <- data.frame(words = words_of_interest, 
                     tribes = factor(tribes,
                                     levels = levels_tribes,
                                     ordered = TRUE
words tribes
Avatar concept
Aang air
Sokka water
Katara water
Toph earth
Zuko fire
Iroh fire
Fire Lord fire
Azula fire
kill concept
Sozin’s Comet fire
save concept
Earth Kingdom earth
spirit concept
Appa air
bison air
Ozai fire
moon water
Agni Kai fire
bending concept
water tribe water
balance concept
honor concept
Dai Li earth

Prepare graph

We are going to prepare the graph with the packages igraph and tidygraph. First we will make a adjacency table where the first column is a chacarater, the second is a word she/he referenced and the third is how many times this reference occured. The most times a character said something, stronger will be the connection between them. This will show in proximity and size of the connection in the network. In graphs, this is usually called weight. We are going to keep only references made at least 5 times in the series.

Once we have the adjacency table, it is easy to convert it to a graph.

This is the “data wrangling” part of my Tidy Tuesday.


adjacencies <- rep(list(words_of_interest),length(characters))
names(adjacencies) <- characters

adjacencies <- adjacencies %>% 
  enframe("character", "word") %>% 
  unnest(word) %>% 
   nest_join(select(avatar,character, character_words),
             "character", name = "lines") %>% 
  group_by(character, word) %>% 
  mutate(weight = map_dbl(lines, ~sum(str_count(.$character_words, word)))) %>% 
  filter(weight > 4, character != word) %>% 

graph <- graph_from_data_frame(adjacencies)
graph <- as_tbl_graph(graph)
graph <- left_join(graph, tribes, by = c(name = "words"))
character word weight
Aang Avatar 56
Aang Sokka 87
Aang Katara 101
Aang Toph 22
Aang Zuko 21
Aang Fire Lord 33
Aang kill 7
Aang save 12
Aang spirit 22
Aang Appa 106
Aang bison 12
Aang Ozai 7
Aang bending 49
Aang honor 5
Sokka Avatar 26
Sokka Aang 98
Sokka Katara 69
Sokka Toph 25
Sokka Zuko 22
Sokka Fire Lord 25
Sokka kill 8
Sokka save 5
Sokka Earth Kingdom 6
Sokka spirit 9
Sokka Appa 35
Sokka bison 9
Sokka moon 5
Sokka bending 25
Katara Avatar 36
Katara Aang 220
Katara Sokka 108
Katara Toph 26
Katara Zuko 19
Katara kill 5
Katara save 9
Katara Earth Kingdom 6
Katara Appa 28
Katara bending 47
Toph Aang 11
Toph Sokka 15
Toph Katara 20
Toph Zuko 5
Toph Fire Lord 7
Toph bending 15
Zuko Avatar 57
Zuko Aang 9
Zuko Sokka 8
Zuko Katara 5
Zuko Fire Lord 7
Zuko Azula 14
Zuko kill 5
Zuko save 8
Zuko Earth Kingdom 5
Zuko bending 19
Zuko honor 16
Iroh Avatar 18
Iroh Zuko 54
Iroh Fire Lord 10
Iroh Azula 5
Iroh moon 5
Iroh bending 5
Iroh balance 5
Iroh honor 10
Ozai Zuko 6
Ozai Fire Lord 7
Ozai Azula 5
Azula Avatar 13
Azula Zuko 18
Azula Fire Lord 8
Azula honor 6
Azula Dai Li 6


The handy package ggraph allow us to plot graphs in a similiar way to ggplot plotting. We also make a custom fill and color scale to be true to Avatar’s palette. I found the colors in the nice palletes by emilytjames. We also embbed the iconic font used in Avatar.


# Avatar fill color scale
avatar_fill <- function (...) { scale_fill_manual(values =  c(
                               "#F1B84A", "grey20", 
                               "#4FA24C", "#900B0B",
) }
avatar_color <- function (...) { scale_colour_manual(values =  c(
                               "#F1B84A", "grey20", 
                               "#4FA24C", "#900B0B",
) }
# Avatar font
windowsFonts(avatar_font = windowsFont("Herculanum"))

ggraph(graph, layout = 'fr')+
    geom_edge_link(aes(width = weight), color = 'grey70')+
    geom_node_point(aes(color = tribes))+
    geom_node_label(aes(label = name, fill = tribes),
                    color = "white", fontface = "bold",
                    family = "avatar_font", size = 5,
                    repel = TRUE)+
  theme(legend.position = "top")+
  labs(title = "Avatar referencing graph", color = "Tribes",
       fill = "Tribes", width = "Weight")

Nice! That’s our graph. We can see how the central characters in the plot are also in the center of the graph. They reference among themselves a lot. The most central concept in our graph is, of course, “Avatar”. We can also get a glimpse of how the idea of Honor is important for the Fire Tribe. And we get the interesting fact that the character that cites the most the Earth Kingdom concept of the royal guard Dai Li is actually Azula, from the Fire Tribe. Well, if you watched the show you know why that is, if you haven’t, what are you waiting for?

Mateus Silvestrin
Mateus Silvestrin

PhD in Neuroscience and Cognition. My thesis was about the neural correlates of time perception, developed in the Timing Lab at the Federal University of ABC. I am currently a post-doc (FAPESP grant) at the Social and Cognitive Neuroscience Lab at the Mackenzie Presbyterian University, where I study associations between affect and moral judgement. I also colaborate with the Developmental, Affective and Social Neuroscience Lab.