{"id":1146,"date":"2016-05-09T03:17:30","date_gmt":"2016-05-09T07:17:30","guid":{"rendered":"http:\/\/dh.prattsils.org\/?p=1146"},"modified":"2016-05-09T03:17:30","modified_gmt":"2016-05-09T07:17:30","slug":"topic-modeling-cryptomes-archive-over-time","status":"publish","type":"post","link":"https:\/\/studentwork.prattsi.org\/dh\/2016\/05\/09\/topic-modeling-cryptomes-archive-over-time\/","title":{"rendered":"Topic Modeling Cryptome&#8217;s Archive Over Time"},"content":{"rendered":"<p class=\"p1\"><b><\/b><strong><span class=\"s2\">Introduction<\/span><\/strong><\/p>\n<p class=\"p1\"><span class=\"s2\">For 19 years, the nonprofit website Cryptome has collected and published a wide range of materials primarily related to domestic and international governmental affairs which have otherwise faced obstacles to traditional publication. Founded and solely maintained by the architects John Young and Deborah Natsios, Cryptome openly \u201cwelcomes documents for publication that are prohibited by governments worldwide, in particular material on freedom of expression, privacy, cryptology, dual-use technologies, national security, intelligence, and secret governance &#8211; open, secret, and classified documents.\u201d(1)<\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Within this same epigraphical mission statement, the Cryptome home page purports that documents are only removed in cases of direct US court order, and there are instances within their archives where Young and Natsios have posted their legal discourse over the inclusion of a document rather than remove its presence completely.\u00a0<\/span><span style=\"font-size: 16px\">These kinds of fascinating issues of representation and information freedom abound in the 100,000+ files that Cryptome has amassed since June 1996, though its size and site presentation give the impression of impenetrable opacity. 
Young and Natsios remain open with the contents of their archives, offering flash drives with rolling updates of their holdings for $100 donations on their website, in addition to hosting every document freely online. Even so, my studies in the Digital Humanities I course led me to believe there were more thorough tools available for evaluating what the Cryptome archive might offer a casual user.<\/span><\/p>\n<p class=\"p1\"><span class=\"s1\"><b>Developing a Research Question<\/b><\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> After spending some time both browsing Cryptome\u2019s online catalog and reading through interviews they\u2019ve given, I developed a central research question. Given the breadth of the collection, not to mention the ambition of Cryptome\u2019s welcoming any documents \u201cthat are prohibited by governments worldwide\u201d, I was curious whether examining the collection might produce a set of topics that could be considered \u201cprohibited.\u201d In other words, I wanted to analyze the breadth of content within Cryptome\u2019s massive archive in relation to Young and Natsios\u2019s own informal epigraph. If readily penetrable, I imagined the data set might give linguistic and thematic definition to the genre of \u201copen, secret, and classified\u201d security and intelligence documents. <\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> In light of our in-class digital humanities discussions and an exploratory methodology meeting with Professor Sula, I consulted literature on topic modeling with the intention of solidifying my research approach. In his article \u201cTopic Modeling and Digital Humanities\u201d, Blei defines a \u201ctopic\u201d for probabilistic text models as \u201ca probability distribution over terms.\u201d(2) I concluded that this approach would be well suited to analyzing the Cryptome archive, primarily due to the exceptional number of terms to be considered. 
Because of the nature of the archive, though, I wanted to go deeper than mere word frequency or occurrences within the whole of the corpus. Young and Natsios had already outlined, albeit in broad strokes, the kinds of words and ideas they were interested in: freedom of expression, cryptology, classified documents, and so on. Their mission statement, though, does little to define the kinds of lexical or thematic patterns that make such documents the kind of work Cryptome sets out to publish. Neuhaus and Zimmermann write that in topic modeling, \u201cdocuments are considered to be bags of words, and words are considered to be independent given the topics (i.e., word order is irrelevant).\u201d(3) The topic modeling approach seemed more likely to answer my research question of what this genre of document tends to look like and sound like, and even to address questions about the geography of these kinds of security issues. <\/span><\/p>\n<p class=\"p1\"><span class=\"s1\"><b>Methods<\/b><\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> I began my experiment with the question: what types of documents are in the Cryptome corpus, and what topic patterns are identifiable? I would be using a Mac for my experiment, and my primary software for this task was to be the open-source <a href=\"https:\/\/code.google.com\/archive\/p\/topic-modeling-tool\/\">Topic Modeling Tool<\/a> hosted on Google Code, a Java application which would require that all documents analyzed first be converted into .txt files. A cursory look through the unsorted files on Cryptome\u2019s flash drive revealed that the vast majority of Cryptome files fit into three categories of file type: .txt, .htm \/ .html, and .pdf. Rather than simply pull from the Cryptome flash drive, an endeavor which would have necessitated a mass clean-up of working folders, I opted to utilize Cryptome\u2019s own indices. 
These indices were collected in a flash drive folder as .htm files that were seemingly identical to the pages hosted on Cryptome\u2019s public website. The primary advantage of using these indices was that they were sorted into 40 separate pages, by and large splitting each year since the site\u2019s 1996 launch into halves: \u201cJanuary-June 1998, July-December 1998\u201d etc. Although my research question did not explicitly involve time, I concluded that downloading and analyzing 20,000+ documents could only benefit from such implicit organization. <\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Having established my approach to the corpus, I set about downloading the entirety of the files to an external hard drive. To do this, I used the Firefox add-on DownThemAll. Although the limits of my software and expertise ultimately precluded the analysis of images, I chose to download the whole of the files within each individual section. The only exception I made to this was the two 2016 sections, titled \u201cCurrent Listings\u201d and \u201cRecent Listings\u201d, which accounted for various documents from 2016 and were both being actively updated over the course of my data collection. Throughout the process there was a fluctuating number of broken links and files that were not properly downloaded by the add-on, with the majority of these occurring in the 1990s listings and for files filed as \u201cOffsite.\u201d Adhering to Cryptome\u2019s own half-year structure, I sorted what files I had by my three core types, leaving me with a corpus still totaling close to 20,000 .txt, .htm \/ .html, and .pdf files. <\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Because the Topic Modeling Tool only analyzes .txt files, the many native .txt documents within each of my 38 sections needed no additional conversion. 
I used Terminal\u2019s \u201ctextutil -convert txt\u201d command to batch convert .htm and .html files into readable .txt files and copied these converted .txts into a separate folder alongside the native .txt files for each section. The final component of my data collection used the open-source Java application Drop 2 Read in order to output Unicode plain text files via optical character recognition. In the hopes of accelerating this automated process, I copied the .pdfs from every section into one folder and set it as Drop 2 Read\u2019s input. Similarly, I directed the .txt output to a single folder. This proved a problem when, either due to the demand of converting 6,000 .pdfs or my computer\u2019s subpar processing power, the conversion had scarcely reached 20% completion after three days. At this point I realized I had been badly mistaken in treating the three file groups (.txt, .htm \/ .html, .pdf) as rough thirds of the Cryptome corpus. Sorting my .pdf folder by file size, I realized that this file type included several very large documents that outweighed most of the .txt and .htm files: full-book scans, more than a few collected document series, and other such sizable concerns. In the interest of moving forward towards the actual topic modeling, I ultimately sorted the remaining unconverted .pdfs by file size and used Drop 2 Read to process every file sized up to 1 MB. I was disappointed to have unintentionally compromised the dataset, but this decision would ideally ensure a dynamic if disproportionate range of topics. <\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Now with 38 nearly complete sections of .txt files of Cryptome\u2019s documents, I copied all 20,000 .txt files into one folder for the Topic Modeling Tool. To determine my number of topics, I used the formula \u221a(n\/2), where n is the total number of documents. Upon initializing the topic modeling, however, my computer returned an error reporting insufficient Java memory. 
I first tried adjusting the number of iterations, then manually increased the Java memory allotment within my preferences, and, when neither worked, attempted to run the model on various Pratt computers. Unfortunately, I was unable to properly download the tool on any public terminal due to restrictions on unauthorized Java applications. <\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Assuming my computer\u2019s failure was a result of the corpus size (20,000+ documents, approximately 12 GB), I concluded that I would need to reframe my research question. I decided that randomly limiting the corpus simply so that I could use the Topic Modeling Tool would too drastically compromise the question of what the Cryptome corpus looked like. Fortunately, throughout this process I had preserved each section\u2019s chronologically filed .txt files and converted .htm and .html files. To sort the .txt files for the successfully converted .pdfs by year, I used Terminal\u2019s copy command to copy each .txt file whose name matched one of a section\u2019s .pdfs into a composite folder that also included the previously sorted .txt and .htm \/ .html files. I could now run topic models for each section of the Cryptome archive and address questions about the makeup of their holdings over segments of time. To more concisely answer this question, I decided to combine the 38 half-year sections into 19 distinct corpora according to year. The exception to this would be a two-year hybrid culled from the first section, technically a collection of 1996 documents as well as the first half of 1997, and the second section, which solely comprised latter-1997 documents. Finally, I ran each year of Cryptome\u2019s holdings through the Topic Modeling Tool, again using the \u221a(n\/2) formula in addition to standardizing each model at 200 iterations and 10 topic words. 
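<\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> As a rough worked illustration of the \u221a(n\/2) rule, and not a record of the values actually used: for the full corpus of about 20,000 documents, \u221a(20,000\/2) = \u221a10,000 = 100 topics. A hypothetical yearly corpus of n = 800 documents (a figure assumed here purely for illustration) would instead give \u221a(800\/2) = \u221a400 = 20 topics, so under this rule the number of topics grows only with the square root of corpus size. 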
<\/span><\/p>\n<p class=\"p1\"><span class=\"s1\"><b>Results<\/b><\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> My change in scope for the topic modeling ultimately yielded a set of results that seems better suited to identifying broad shifts than to interrogating topic occurrence within specific documents or vice versa. Across 19 years\u2019 worth of topics, certain trends emerge. One such example is the gradual globalization of Cryptome\u2019s archive. For the first few years, topics with reference to international affairs either tend toward the general (\u201cinternational criminal crime countries states united groups drug world organized\u201d in 2000) or seem filtered through an American perspective (\u201cnuclear china weapons united states satellite satellites . . .\u201d in 1998). Perhaps unsurprisingly given the focus of American media at the time, the topics for 2001 and 2002 display a newfound preponderance of international geography. Countries such as Afghanistan and Israel are present alongside non-U.S.-centric topic words like \u201cfood\u201d, and these years return non-English topic words throughout. Not every one of these patterns holds across Cryptome\u2019s lifetime, but it was enlightening to identify how these kinds of pivot points arise from the topic modeling method. <\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Another approach I took to analyzing the archive over time was tracking particular topic words across various annual topics. Certain topic words, such as \u201ckey\u201d, are consistently paired with the same topic words (\u201cencryption\u201d and \u201ccrypto\u201d for \u201ckey\u201d) no matter the year. The \u201ckey\u201d topics in particular occur in both 1997 and 2015, not to mention several years in between. 
Similarly, over half of all the topics including the word \u201cmilitary\u201d include \u201cweapons\u201d, \u201cwar\u201d, or more specific words like \u201cdrone\u201d that can be placed in a similar context. These examples demonstrate the extent to which the vocabulary of Cryptome\u2019s content seems to consistently add up to a bigger picture of how certain words are used in \u201cclassified documents.\u201d The redefinition of diction, then, speaks to the possible understanding of \u201cclassified\u201d as a genre with a distinct linguistic character. Still, this method supports that comparison only indirectly, and it seems that modeling across time like this essentially answers a different question from the one I initially posed of the full Cryptome corpus. <\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> More than the topic modeling\u2019s pivot points or varying word occurrence, though, carrying out the analysis by year offers a kind of survey of Cryptome as a public information resource. Topics in the first few years of the archive tend to revolve around matters of encryption, both as it pertains to laws and government and with regard to software licenses, copyright, and public users. Other words in these early topics tend towards what might now be thought of as tech jargon, such as \u201cdata algorithm packet\u201d and so on. Even beyond the specific \u201cencryption\u201d topics, Cryptome\u2019s early years exhibit a kind of niche interest that includes little reference to the events or even politics that are now historically known to have been occurring simultaneously, such as issues regarding presidents or, as previously stated, foreign affairs. It isn\u2019t until 2006 that topic words which now seem commonplace in the discussion on information security, such as \u201cnsa\u201d, begin occurring. Curiously, \u201cnsa\u201d becomes a fixture in the annual topics around the same time that \u201cencryption\u201d disappears for several years. 
<\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Furthermore, 2009 features a topic which, as a reflection of the public record, seems quite out of place with the way early Cryptome documents seemed to operate: \u201cpresident obama war barack house america washington http secret national\u201d. There are through lines in the way that politics are still correlated with \u201cwar\u201d and possibly subjective terms such as \u201csecret\u201d. Still, it marks a shift in Cryptome\u2019s potential role as a decidedly non-niche public source of reference. This trend continues: from 2010 to 2013, at least one topic per year includes the topic word \u201cwikileaks\u201d or \u201cassange\u201d, sometimes both, and both share topics with words like \u201csecurity\u201d, \u201cnsa\u201d, and \u201cfile\u201d. This stretch, even more than Cryptome\u2019s acknowledgement of 2008-09\u2019s public discussion of Obama, reflects how Cryptome\u2019s canon has now become aligned with the historical record\u2019s own burgeoning interest in information security rights. Because Cryptome had been publishing and possibly soliciting or searching for documents of encryption and the like as far back as 1997, it is difficult to say that Cryptome\u2019s 2010-13 topics exhibit a concerted effort at following the same trends and sensational figures as the mainstream American media. In order to answer a question like that, it would be necessary to gather data on Young\u2019s and Natsios\u2019s acquisitions process. Nonetheless, by the time \u201cassange\u201d shares a 2013 topic with \u201csnowden\u201d and then disappears while the latter continues through 2015, it is clear that the topic modeling of Cryptome\u2019s chronology has resulted in possible evidence that their archive has in some sense transcended the fringe pursuits of its origins. Another potential interpretation, of course, is that the fringe\u2019s concern with information leaks has instead moved towards the mainstream. 
It is worth noting that 2015\u2019s topics include both \u201cpdf nsa snowden update . . .\u201d and \u201cencryption security crypto software . . .\u201d By examining the corpus in this piecemeal chronological manner, despite my best efforts not to, I stumbled upon one possible manifestation of Cryptome\u2019s otherwise opaque identity complex. <\/span><\/p>\n<p class=\"p1\"><span class=\"s1\"><b>Future Directions<\/b><\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> Given more time and resources, the clearest direction forward would be to find a way to topic model my original corpus of 20,000+ documents. It is possible that processing through MALLET directly could compensate for the ostensible performance issues with the Topic Modeling Tool. As indicated above, I believe that the analysis I was able to carry out with topic modeling annually across Cryptome\u2019s archive was a promising place to start in terms of considering how co-occurring topic words reveal the semblance of genre in consistency of language and theme. Ultimately, though, my necessary shift in scope ended up answering my initial question on this matter only indirectly, instead more directly answering the separate question of Cryptome\u2019s relationship to the historical record.<\/span><\/p>\n<p class=\"p1\"><span class=\"s2\"> I am also hopeful that further work with this data might be able to include the image files that gradually carry more weight within Cryptome\u2019s holdings as the years progress. Although it was beyond my scope for this project, I would think that data on the points in time when these variable file types begin to appear would also be worth including.<\/span><\/p>\n<hr \/>\n<p class=\"p1\">References<\/p>\n<ol>\n<li>http:\/\/cryptome.org\/<\/li>\n<li>Blei, David (2012). 
Topic Modeling and Digital Humanities.\u00a0<em>Journal of Digital Humanities<\/em>\u00a0<em>2(<\/em>1).<\/li>\n<li>Neuhaus, Stephan and Zimmermann, Thomas.\u00a0http:\/\/research.microsoft.com\/pubs\/136976\/neuhaus-issre-2010.pdf<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p class=\"lead\">Introduction For 19 years, the nonprofit website Cryptome has collected and published a wide range of materials primarily related to domestic and international governmental affairs which have otherwise faced obstacles to traditional publication. Founded and solely maintained by the architects John Young and Deborah Natsios, Cryptome openly \u201cwelcomes documents for publication that are prohibited by governments worldwide, in particular material&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"btn btn-danger\" href=\"https:\/\/studentwork.prattsi.org\/dh\/2016\/05\/09\/topic-modeling-cryptomes-archive-over-time\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":47,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,7],"tags":[],"class_list":["post-1146","post","type-post","status-publish","format-standard","hentry","category-projects","category-student"],"_links":{"self":[{"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/posts\/1146","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/users\/47"}],"replies":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/comments?post=1146"}],"version-history":[{"count":0,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/posts\/1146\/revisions"}],"wp:attachment":[{"href":"https:\/\/studentwork.prattsi.org\
/dh\/wp-json\/wp\/v2\/media?parent=1146"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/categories?post=1146"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/tags?post=1146"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}