Star Wars and text mining: This is the post you are looking for… – Data Column

10 years. It has been 10 years since the last Star Wars film, and the world is in the full grip of Star Wars fever. From Star Wars toys and games to Star Wars household items, Star Wars is everywhere you turn, and it could not be more exciting.
With Star Wars Episode VII: The Force Awakens fast approaching, it is important to remember where the series began. The year was 1977, and a young director coming off the success of American Graffiti had a vision for a space opera in the vein of the old westerns. With a ragtag, largely unknown cast and a relatively small budget, George Lucas would create a mythos unlike any other and one of the largest cult sensations of all time. Winning multiple Academy Awards and earning a Best Picture nomination, Star Wars launched a revolution not only in film technique and technical ability but also in the way we think about film.

Help me data science, you’re my only hope!

Given free rein over topics for a text mining project, and with the new movie on the horizon, our team wanted to take a look back at this iconic film. Using text mining approaches, we analyzed the script of Star Wars Episode IV: A New Hope. We looked at eight of the primary characters and gathered all spoken lines and context involving each of them into eight separate documents as a base for analysis.

Number of lines per character. Blue represents a Rebellion character, red represents an Empire character
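For readers curious how such per-character documents might be assembled, here is a minimal sketch. It assumes the script has already been reduced to lines of the form "SPEAKER: text", which may not match the format our team worked from; the function name and the sample lines are invented for illustration only.

```python
from collections import defaultdict

def build_character_documents(script_lines, characters):
    """Group text by character.

    A line is added to its speaker's document and to the document of any
    other tracked character it mentions by name (a rough proxy for "context").
    Assumes each usable line looks like "SPEAKER: text".
    """
    lookup = {name.upper(): name for name in characters}
    documents = defaultdict(list)
    line_counts = defaultdict(int)
    for raw in script_lines:
        speaker, sep, text = raw.partition(":")
        if not sep:
            continue  # not in the assumed "SPEAKER: text" form
        speaker = lookup.get(speaker.strip().upper())
        text = text.strip()
        if speaker is not None:
            documents[speaker].append(text)
            line_counts[speaker] += 1
        for name in characters:
            # Naive substring match; short names could match inside other words.
            if name != speaker and name.lower() in text.lower():
                documents[name].append(text)
    merged = {name: " ".join(documents[name]) for name in characters}
    return merged, dict(line_counts)

# Invented sample lines, not quotes from the actual script.
script = [
    "LUKE: Leia needs our help.",
    "LEIA: Somebody has to save our skins.",
    "VADER: Luke will not escape.",
    "NARRATION: The ship drifts on.",
]
docs, counts = build_character_documents(script, ["Luke", "Leia", "Vader"])
```

The `counts` dictionary is the kind of tally a lines-per-character chart could be drawn from, while `docs` holds one combined document per character.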

Our primary focus was to look at how similar the characters were to one another, and then to examine each character’s sentiment across time. We were particularly interested in trying to capture the “essence” of the film without actually having to rewatch it. Right before the new episode, someone would theoretically be able to assess each character’s similarity and feelings and hopefully ascertain the gist of A New Hope.
First, we examined the similarity between characters. To do this, we had to build a term vector for each character. By eliminating stop words (such as “is,” “the,” and “and”) and then stemming the remaining words (i.e., reducing each word to its root form, so that “go,” “goes,” and “going” would all be shortened to “go”), we generated a frequency count of every term for every character. This output fed into a clustering algorithm that grouped the characters’ documents and produced a similarity/dissimilarity matrix, which we used to understand the relationships between characters. In the chart below, the characters’ relative positions represent how similar they are in terms of their lines and context. Note that a character’s exact x/y position carries no meaning on its own; the relationship that matters is the distance between characters.
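The term-vector step can be sketched in a few lines of Python. This is a simplified illustration, not our team's actual pipeline: the stop-word list is tiny, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's, cosine distance is one assumed choice of dissimilarity measure, and the sample documents are invented.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "am", "an", "and", "are", "for", "his", "i", "in",
              "is", "it", "of", "the", "to", "what"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def term_vector(text):
    """Lowercase, tokenize, drop stop words, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def dissimilarity_matrix(documents):
    """Return sorted character names and the pairwise distance matrix."""
    names = sorted(documents)
    vectors = {n: term_vector(documents[n]) for n in names}
    matrix = [[cosine_distance(vectors[r], vectors[c]) for c in names]
              for r in names]
    return names, matrix

# Invented toy documents, not text from the actual script.
documents = {
    "Luke": "I am going to learn the ways of the Force and destroy the Death Star",
    "Obi-Wan": "The Force is what gives a Jedi his power, Luke. Learn the ways of the Force",
    "Vader": "The plans for the Death Star are hidden. Destroy the Rebel base",
}
names, matrix = dissimilarity_matrix(documents)
```

A matrix like this could then be handed to a clustering routine or a multidimensional scaling routine to place characters on a 2-D plot in which only the distances are meaningful.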

Character similarity plot. Greater distance represents greater dissimilarity.

We found the results to be generally what we expected. Rebellion characters clustered together, as did Empire characters, and characters who refer to each other often in the script, such as Obi-Wan and Princess Leia, also landed close together. However, it was interesting that Leia ended up closer to the Empire characters than the other Rebellion characters did, and that the Stormtrooper ended up very far from the other villains. Knowing the contents of the script, our team thought this may have been because Leia spends a good portion of the movie on the Death Star and thus often interacts with the Empire (i.e., they discuss similar things). The Stormtroopers, meanwhile, are really many different characters and sometimes serve as “comic relief” like C-3PO, which our team believes might explain their position.
Next, our team analyzed each main character’s sentiment across time. Using Russell’s model of emotional affect, we gathered a valence score and an arousal score for every sentence related to a character. We then took the sentences with the maximum scores within a scene to represent that character’s emotion for that scene. Finally, we plotted the result on an emotional scale based on Russell’s model.
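To make the scoring step concrete, here is a rough sketch under several assumptions. The word scores in `LEXICON` are invented for illustration rather than taken from a published lexicon, a sentence is scored by averaging the words it contains, and "maximum" is interpreted as the sentence farthest from the neutral center of the valence/arousal plane. Our actual scoring may have differed on each of these points.

```python
import math
import re

# Hypothetical mini-lexicon: word -> (valence, arousal), each in [-1, 1].
# The numbers are made up for this example.
LEXICON = {
    "happy": (0.8, 0.5),
    "calm": (0.6, -0.6),
    "afraid": (-0.7, 0.7),
    "angry": (-0.8, 0.8),
    "tired": (-0.3, -0.7),
}

def score_sentence(sentence):
    """Average valence/arousal of lexicon words found, or None if there are none."""
    words = re.findall(r"[a-z]+", sentence.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return None
    valence = sum(v for v, _ in hits) / len(hits)
    arousal = sum(a for _, a in hits) / len(hits)
    return valence, arousal

def scene_emotion(sentences):
    """Pick the scored sentence farthest from neutral (assumed meaning of 'maximum')."""
    scored = [s for s in map(score_sentence, sentences) if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda va: math.hypot(va[0], va[1]))

def quadrant(valence, arousal):
    """Map a point to a coarse quadrant label on the valence/arousal plane."""
    if arousal >= 0:
        return "excited" if valence >= 0 else "distressed"
    return "relaxed" if valence >= 0 else "depressed"

# Invented sentences for one character in one scene.
scene = ["I feel calm.", "Now I am afraid and angry!", "Open the door."]
emotion = scene_emotion(scene)
label = quadrant(*emotion)
```

Here the second sentence dominates the scene because it lies farthest from the neutral point, so the character would be plotted in the negative-valence, high-arousal quadrant.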

Throughout the movie, we paused at certain scenes to check the model’s accuracy against our own assessment. The first scene we looked at was titled “The Walls Close In.” In this scene, our heroes fall into the trash compactor to escape the Imperials, and R2-D2 and C-3PO work frantically to stop the walls from closing in before it’s too late. The accuracy estimate we used was based on the average proximity between our understanding of the scene and the algorithm’s results, with proximity measured by location on the Russell scale. For instance, in the plot below the algorithm places Luke as relaxed, but the team determined from watching the film that Luke was distressed. Because those two states sit as far apart as possible on the Russell model, the accuracy estimate for Luke was 0%. Averaging across all characters, this scene had an overall accuracy of only 55%. Compared with the other scenes, this one did not perform as well as we had hoped.
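One way to turn "proximity on the Russell model" into a percentage is sketched below. It is an assumed formulation, not necessarily the exact one we used: each emotion label is placed at an angle around a circle, and accuracy falls linearly from 100% for matching labels to 0% for labels on opposite sides. The angle values and the extra character names are hypothetical; only the Luke pairing comes from the scene described above.

```python
# Assumed placement of four coarse labels around the valence/arousal circle.
QUADRANT_ANGLES = {
    "excited": 45.0,      # positive valence, high arousal
    "distressed": 135.0,  # negative valence, high arousal
    "depressed": 225.0,   # negative valence, low arousal
    "relaxed": 315.0,     # positive valence, low arousal
}

def label_accuracy(predicted, observed):
    """1.0 when labels match, 0.0 when they are diametrically opposite."""
    diff = abs(QUADRANT_ANGLES[predicted] - QUADRANT_ANGLES[observed]) % 360.0
    angular_distance = min(diff, 360.0 - diff)
    return 1.0 - angular_distance / 180.0

def scene_accuracy(pairs):
    """Average per-character accuracy for one scene."""
    scores = [label_accuracy(pred, obs) for pred, obs in pairs.values()]
    return sum(scores) / len(scores)

# Luke's pairing follows the example in the text; the others are placeholders.
pairs = {
    "Luke": ("relaxed", "distressed"),
    "character_b": ("excited", "excited"),
    "character_c": ("excited", "distressed"),
}
overall = scene_accuracy(pairs)
```

With this scheme, Luke's relaxed-versus-distressed mismatch scores 0%, an exact match scores 100%, and neighboring quadrants score 50%, so the scene average reflects how far off each character's estimate was rather than just whether it was right.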

Characters’ sentiments within the “The Walls Close In” scene plotted on the Russell scale

Screen shot from scene 34: “The Walls Close In”

Algorithm performance vs. team assessment

Another scene we analyzed was “Obi-Wan vs. Vader.” In this scene, we see the climactic final battle between the wise Obi-Wan Kenobi and his old apprentice, Darth Vader. We highlighted this scene specifically because our model performed well: it reached an accuracy of 82%.

Characters’ sentiments within the “Obi-Wan vs. Vader” scene plotted on the Russell scale

Screen shot from scene 37: “Obi-Wan vs. Vader”

We analyzed every scene in the film in this same manner. While there are too many scenes to show in this post, it is noteworthy that, overall, the algorithm performed extremely well at evaluating character sentiment.

So what does this mean?

Well, if you do not have time before Episode VII to marathon all the Star Wars movies (a 13-hour, 17-minute endeavor), you could get a good idea of the gist of a film from the character similarity and character sentiment we have shown. Although we only evaluated Episode IV, you could apply the same methodology to all the other films.
Episode VII is right on the horizon, and we as a team could not be more excited. With current projections suggesting this film will become the highest-grossing of all time (a record once held by Episode IV), we have a strong indication that the general public shares our excitement. Whether you have presale tickets to the first screenings or are just a casual viewer, Star Wars continues to engross audiences of all kinds. With the methodology we have shown here, you can get caught up on the old films, whether you have seen them before or not, just in time for the new movie. While you may not get exact plot points, you can gain a good understanding of which characters are related and how the characters feel throughout the film.

And finally, may the Force be with you!

Columnists: Mirna Domancic and Steven Falgout

Team members: Cory Delaney, Mirna Domancic, Steven Falgout, Noah Linger, Binit Malla