In this project, I scraped 2809 declassified CIA briefings to the president from the 1950s and examined the data using a variety of Natural Language Processing (NLP) techniques.
I chose this dataset because it seemed perfect for experimenting with NLP -- reading all of these briefs and identifying patterns by hand would be immensely time-consuming, and it would be easy to miss larger themes. This is where NLP can be useful.
First, I used the Python module BeautifulSoup (though in hindsight, I really regret not looking into Scrapy) to scrape the CIA website, which does not prohibit web-scraping.
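Stripped down, the scraping step looked something like the sketch below. The listing URL and the assumption that briefing PDFs are linked directly from the page are placeholders, not the actual layout of the CIA reading room.

```python
# A minimal sketch of the scraping step; URL and page structure are assumptions.
import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://www.cia.gov/readingroom/"  # hypothetical listing page

def get_pdf_links(listing_url):
    """Collect links to PDF documents from a listing page."""
    response = requests.get(listing_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Keep any anchor whose href points at a PDF file.
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith(".pdf")]
```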
Once I had the data, which consisted of scanned PDFs, I converted it to plaintext and did a bit of pre-processing before exploring it.
Pre-processing included removing random noise from the original documents (smudges from photocopies and stamps showed up as gibberish and stray punctuation when converted to plaintext), removing common words with little meaning (stop words), and spellchecking and lemmatizing the remaining words (reducing them to their root forms).
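A rough sketch of that cleaning pass is below; it assumes NLTK's stop-word list and WordNet lemmatizer, though the original pipeline may have used different libraries.

```python
# Sketch of the cleaning pass: strip OCR noise, drop stop words, lemmatize the rest.
# NLTK is an assumption here; other libraries could fill the same role.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(raw):
    # Keep only alphabetic tokens; OCR noise tends to surface as digits and punctuation.
    tokens = re.findall(r"[a-z]+", raw.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens
            if tok not in STOP_WORDS and len(tok) > 2]
```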
Spellchecking was the most computationally expensive step in this process. Because the documents were photoscans from the 1950s, many words came through the text conversion incorrectly. For example, the word 'example' might have been rendered as '3x ample' or 'exmq1e'.
To get around this, I created a function that checked the similarity of every word against known words using the Python modules Gensim and TextBlob. If there was a match of over 90%, the correctly spelled word was substituted. Otherwise, the word was abandoned.
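A simplified stand-in for that function is sketched below; it relies on TextBlob's spellcheck confidence alone rather than the combined Gensim/TextBlob similarity check described above.

```python
# Simplified stand-in for the spell-correction step. The original combined Gensim
# and TextBlob; this version uses only TextBlob's spellcheck confidence.
from textblob import Word

THRESHOLD = 0.9  # the 90% cutoff described above

def correct_or_drop(token):
    """Return the best spelling suggestion if it clears the threshold, else None."""
    suggestion, confidence = Word(token).spellcheck()[0]  # best (word, score) pair
    return suggestion if confidence >= THRESHOLD else None  # None = word abandoned
```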
This graph shows the connections between terms like names, organizations, events, and locations. It was created using the Python modules Pyvis and Spacy.
First, I used Spacy to create a list of named entities within the text (such as names and locations). Anytime more than one named entity appeared within a section of a document, a relationship was recorded. This data was then used in Pyvis to create this graph.
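Roughly, that pipeline looks like the sketch below; the entity filter, section handling, and pyvis settings are my assumptions rather than the original code.

```python
# Sketch of the entity co-occurrence graph: spaCy finds named entities in each
# section, co-occurring pairs are counted, and frequent pairs become pyvis edges.
from collections import Counter
from itertools import combinations

import spacy
from pyvis.network import Network

nlp = spacy.load("en_core_web_sm")
SKIP_LABELS = {"ORDINAL", "CARDINAL", "MONEY"}  # weed out numbers and money, as above

def cooccurrence_counts(sections):
    pair_counts = Counter()
    for section in sections:
        ents = {ent.text for ent in nlp(section).ents if ent.label_ not in SKIP_LABELS}
        for a, b in combinations(sorted(ents), 2):
            pair_counts[(a, b)] += 1  # record a relationship for each co-occurring pair
    return pair_counts

def build_graph(pair_counts, min_count=250):
    net = Network()
    for (a, b), count in pair_counts.items():
        if count >= min_count:  # only the most frequent connections are kept
            net.add_node(a, label=a)
            net.add_node(b, label=b)
            net.add_edge(a, b, value=count)
    net.save_graph("entity_graph.html")
```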
Obviously, nuance and meaning are lost in this graph. Additionally, other than weeding out entities like ordinal numbers and money, I used the default named entity recognition settings in Spacy, which can make mistakes.
If all of the connections were shown, the graph would be too tangled to read. Therefore, only the most frequent connections are shown; in this case, each connection appeared at least 250 times.
Here is a graph that includes connections that appeared at least 150 times.
Despite these limitations, this graph does give a sense of a few major topics within the documents.
These visualizations use the Latent Dirichlet Allocation (LDA) algorithm to show some of the major themes in the documents.
This algorithm is used to break down a collection of documents into topics/categories. It was first used in genetics and only later applied to NLP.
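A rough sketch of that step using gensim's LDA implementation is below; the topic count and other parameters are illustrative rather than the settings actually used.

```python
# Sketch of topic modelling with gensim's LDA. Parameter values are illustrative.
from gensim import corpora
from gensim.models import LdaModel

def fit_lda(tokenized_docs, num_topics=10):
    """tokenized_docs: one list of cleaned tokens per briefing."""
    dictionary = corpora.Dictionary(tokenized_docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare/common terms
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=42)
    for topic_id, top_words in lda.print_topics(num_words=8):
        print(topic_id, top_words)
    return lda
```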
To me, the Word Clouds communicated the findings most effectively. However, I thought pyLDAvis was an interesting enough visualization of this data to include as well.
At first, the Word Clouds were overcrowded with the same words, such as 'intelligence' and 'central'.
However, after a bit of cleaning, more specific words began to emerge, which gave more detailed hints about the general contents of the documents.
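That cleaning amounted to treating the boilerplate terms as extra stop words when building each cloud; a sketch is below (the extra terms beyond 'intelligence' and 'central' are assumptions).

```python
# Sketch of the word-cloud step, with the generic terms that crowded the first
# attempt added to the stop-word list so more specific vocabulary shows through.
from wordcloud import WordCloud, STOPWORDS

EXTRA_STOPWORDS = {"intelligence", "central", "agency"}  # partly assumed

def make_wordcloud(text, outfile="cloud.png"):
    wc = WordCloud(width=1200, height=800,
                   stopwords=STOPWORDS | EXTRA_STOPWORDS,
                   background_color="white").generate(text)
    wc.to_file(outfile)
```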
This map was created to show which nations appeared in the documents and how frequently they appeared.
This map was created using the Python modules Folium and Spacy.
First, I used Spacy to track mentions of geopolitical entities. Then, I used Folium to indicate these locations as well as their frequency on a choropleth map.
One additional step that I took in generating this map was to map geopolitical entities that aren't modern-day nations to an appropriate modern label.
For example, the USSR is marked as Russia.
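Put together, that step looks roughly like the sketch below; the alias table, the GeoJSON file, and its property names are placeholders rather than the actual ones used.

```python
# Sketch of the country-frequency map: spaCy GPE mentions are counted, historical
# names are mapped to modern ones, and folium draws the choropleth.
from collections import Counter

import folium
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")
ALIASES = {"USSR": "Russia", "Soviet Union": "Russia"}  # e.g. USSR -> Russia, as above

def country_counts(documents):
    counts = Counter()
    for text in documents:
        for ent in nlp(text).ents:
            if ent.label_ == "GPE":
                counts[ALIASES.get(ent.text, ent.text)] += 1
    return counts

def build_map(counts, geojson_path="countries.geojson"):
    df = pd.DataFrame(counts.items(), columns=["country", "mentions"])
    world_map = folium.Map(location=[20, 0], zoom_start=2)
    folium.Choropleth(
        geo_data=geojson_path,             # hypothetical world-boundaries file
        data=df,
        columns=["country", "mentions"],
        key_on="feature.properties.name",  # depends on the GeoJSON schema
        fill_color="YlOrRd",
    ).add_to(world_map)
    world_map.save("cia_mentions_map.html")
```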
One significant limitation is that the names of provinces and natural geographic areas are ignored.
Another limitation is that the map ignores when within the decade each nation was mentioned.
Sometimes primary source documents, like the declassified documents examined here, contain details that secondary sources like textbooks miss. But, as stated before, 2809 documents is a lot of dry reading to get through!
After looking at these visualizations, a few details stuck out.
First, Egypt was really on the minds of the CIA, which is unsurprising given Nasser's leadership and the Suez Crisis. I had also forgotten how pivotal Indonesia was to the US during the Cold War in the 1950s.
Second, Bourguiba and Rapacki were mentioned frequently. Many secondary sources I'd read in the past didn't focus on Polish or Tunisian history, so I didn't recognize these names, but this analysis showed me how influential the CIA believed them to be at the time.
Finally, I hadn't thought much about the role of submarines in the 1950s, but they seemed to play a major role in these documents.
In a sense, what this analysis did best of all was to point out my blind-spots to me. We all have models of the world and stories about the history of the world in our imaginations that sometimes go unchallenged as we read narratives by similar authors.
This experiment with NLP certainly showed me a few of my blind-spots that I can pursue in my future studies of history: the weight that Indonesia and Tunisia had on the minds of the CIA.