Visualizing News Coverage of Puerto Rico From 2006-2016 in The New York Times

Introduction

This analysis focuses on The New York Times’ coverage of Puerto Rico from 2006 to 2016. The idea came to me while reading The New York Times article “Zika Cases in Puerto Rico are Skyrocketing.” In the article, the author described PR as an island “in chaos” where the “war against the Aedes aegypti mosquito… is sputtering out in failure.” My impression was that coverage of PR tended to skew negative. However, that impression is subjective, and I wanted to test the hypothesis with data.

To achieve this objective, I needed to create a dataset from scratch. This process involved aggregating the top ten articles about Puerto Rico for each year from 2006 to 2016 from the NYT’s online archives. I excluded any articles from AP/Reuters and articles that only mentioned the island in passing. I chose the NYT because it is a well-respected outlet that holds an esteemed position among American newspapers.

I compiled the following information on the 110 selected articles: dates, headlines, URLs, the section where each article was published, and authors. I wanted to see whether I could detect any patterns or trends in these variables.

Designing the Visualizations

The purpose of my analysis was twofold: to analyze the selected articles’ metadata and to detect patterns in the words used to describe the island, its people, and its issues. I used Google Sheets to create the dataset. After I had finished compiling the data, I uploaded the sheet to Tableau Public.

For my first visualization, I copied and pasted the text of the dataset’s articles from 2006, 2011, and 2016 into a Word document. I successfully extracted the text from 30 different articles. I then uploaded the text into Voyant Tools, one year at a time. I set rules to define the words that the tool should ignore when performing the analysis. The “stopwords” were: Puerto, Rico, Rico’s, Rican, Mr, Said, Like, and It’s. In the Word document, I replaced all instances of “San Juan” and “United States” with “SanJuan” and “UnitedStates,” so Voyant would recognize each name as one term. Once Voyant analyzed the text, I exported the results for each year as a text file. I then uploaded these three files into Tableau. For each of the three bar graphs, I selected “Terms” as the dimension and the sum of the records as the measure, and sorted the terms in descending order. I filtered the data so the visualization would only include the top ten words in each year. I wanted users to view these three graphs side by side, but each graph had a different y-axis range. To solve this issue, I adjusted the 2011 and 2016 graphs so their y-axes would match the one from 2006. I then created a Dashboard using the three bar graphs and titled it “Top Ten Words Used in Articles About Puerto Rico by The New York Times.” Each individual graph was labeled with its corresponding year in the sub-header.
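The word-count steps described above were carried out in Voyant Tools and Tableau, but they can be approximated in a few lines of Python. What follows is only a rough sketch under stated assumptions: the function names `top_terms` and `shared_axis_max`, the simple regular-expression tokenizer, and the sample sentence are all hypothetical, and Voyant’s own tokenization and built-in stopword list will behave differently.

```python
import re
from collections import Counter

# Custom stopwords listed in the write-up, lowercased for matching.
STOPWORDS = {"puerto", "rico", "rico's", "rican", "mr", "said", "like", "it's"}

# Multi-word names joined into single tokens, as was done in the Word document.
PHRASES = {"San Juan": "SanJuan", "United States": "UnitedStates"}


def top_terms(text, n=10):
    """Return the n most frequent terms after phrase joining and stopword removal."""
    for phrase, joined in PHRASES.items():
        text = text.replace(phrase, joined)
    # Normalize curly apostrophes so they match the straight ones in STOPWORDS.
    text = text.replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


def shared_axis_max(*rankings):
    """Largest count across several rankings, usable as one common y-axis limit."""
    return max(count for ranking in rankings for _, count in ranking)


# Hypothetical sample text, not taken from the dataset.
sample = "Puerto Rico's debt grew. Debt talks in San Juan stalled, Mr. Smith said."
print(top_terms(sample, n=3))
```

Running `top_terms` once per year and passing the three results to `shared_axis_max` would give a single upper limit for all three charts, which is an alternative to matching the 2011 and 2016 axes to the 2006 graph by hand.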

For my second visualization, I used Voyant Tools to analyze the text of the 110 headlines selected from 2006 through 2016. I set rules to define the words that the tool should ignore when performing the analysis. The “stopwords” were: Puerto, Rico, Rico’s, and Rican. Once again, I replaced all instances of “San Juan” and “United States” with “SanJuan” and “UnitedStates.” Once Voyant analyzed the text, I exported the results as a text file. I then uploaded this file into Tableau. I selected “Terms” as the dimension and the sum of the records as the measure. I chose to use a bar graph and sorted the terms in descending order. I filtered the data so the bar graph would only show words that had been used three or more times during the ten-year period.
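The “three times or more” filter was applied in Tableau; the same idea can be sketched in Python as below. The function name `frequent_terms` and the counts are hypothetical placeholders, not the real headline totals.

```python
from collections import Counter


def frequent_terms(counts, min_count=3):
    """Keep terms seen at least min_count times, sorted by descending count."""
    kept = [(term, c) for term, c in counts.items() if c >= min_count]
    # Sort by count (highest first), then alphabetically to break ties.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical counts, invented for illustration only.
headline_counts = Counter({"debt": 9, "governor": 7, "salsa": 3, "storm": 2})
print(frequent_terms(headline_counts))
```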

My third visualization involved analyzing the metadata of the 110 selected articles from 2006 through 2016. I chose the “Sections” and “Authors” variables as the dimensions and the sum of the records as the measure. I settled on a bar graph with a dual x-axis. I clustered the bars by the section where the articles were published. Then, I sorted the sections from most articles published to fewest. I also filtered the data so the graph would only show the ten sections with the most articles. I realized that time would be a useful variable for context, so I added the year as a discrete color mark. Each bar now had a color that corresponded to the year when the article was published. Since the bars were colored in a gradient that was a little hard to see, I ordered the bars within each section by publication year, in descending order.
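The grouping behind this chart can be illustrated with a short Python sketch. It assumes each article is a record with `year`, `section`, and `author` fields (the variables named in the write-up); the rows, author names, and function names are invented for illustration and do not come from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows with the fields the write-up names; values are invented.
articles = [
    {"year": 2016, "section": "Business", "author": "Author A"},
    {"year": 2015, "section": "Business", "author": "Author A"},
    {"year": 2016, "section": "Health", "author": "Author B"},
    {"year": 2006, "section": "Travel", "author": "Author C"},
    {"year": 2014, "section": "Business", "author": "Author D"},
]


def top_sections(rows, n=10):
    """Section names ordered from most to fewest articles, limited to n."""
    counts = Counter(row["section"] for row in rows)
    return [section for section, _ in counts.most_common(n)]


def bars_by_section(rows, n=10):
    """Map each top section to (author, year, count) bars, newest year first."""
    keep = top_sections(rows, n)
    bars = defaultdict(Counter)
    for row in rows:
        if row["section"] in keep:
            bars[row["section"]][(row["author"], row["year"])] += 1
    return {
        section: sorted(
            ((author, year, c) for (author, year), c in bars[section].items()),
            key=lambda bar: (-bar[1], bar[0]),
        )
        for section in keep
    }


for section, bars in bars_by_section(articles).items():
    print(section, bars)
```

Each tuple corresponds to one bar: the author sits on the lower x-axis, the section on the upper one, the count is the bar height, and the year would drive the color.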

U/X Testing

I set up my laptop with the three visualizations at a coffee shop near the Dekalb stop in Bushwick. I politely asked three people who were sitting at the coffee shop if they would mind performing user testing of these three visualizations for a class project. They were all generally friendly and consented. User 1 and User 3 were a little more guarded than User 2 and were self-conscious about saying anything “stupid.” I tried to be even more cheerful when they supplied me with answers and emphasized that it was my project being tested, not them.

Users first performed a think-aloud of the visualizations, followed by task completions. During the testing, I showed users Figure 3 first, then Figure 2, and then Figure 1. Figure 3 proved to be the hardest to understand. My title was not clear enough and I had not labeled the axes correctly. Users missed the section headers in the top x-axis and did not understand what the names in the bottom x-axis meant. I had chosen a green gradient to mark the time span, but it was not easy for users to distinguish between the shades. User 1 commented that Figure 3 was “a little overwhelming.” User 2 commented that the section titled “United States” was confusing, since she did not know that was the name of a section in the newspaper. She suggested labeling this visualization better. User 3 seemed upset by Figure 3. He strongly suggested labeling the visualization better, providing users with more context, and making the “Section” column more noticeable. He hated the original title of Figure 3 and stressed that it should be more specific.

After the think-aloud, the participants used the visualizations to answer questions such as “Who wrote an article in the Travel section during 2006?”, “What section did Ben Ratliff write for?”, and “What was the most popular term used in articles during 2011?” I was surprised to learn that users liked Figure 3’s color gradient, which I thought might be confusing. They also understood that under each section the bars were organized by year in descending order. However, they used Tableau’s hover feature to help them figure out the dates, which would not be possible on WordPress. User 2 recommended that I use something more dramatic than a one-color gradient to mark time for Figure 3. User testing made me realize Figure 3 required a lot more work.

Users easily understood Figure 2 and the testing went smoothly. The users’ favorite visualization was Figure 1. They all commented that they enjoyed this one the most. User 2 commented that it was a good idea to place the three graphs side by side. The three users laughed at the graph for 2006, which has “Yankee” and “Daddy” among the top terms. What I did not expect was that users would create their own stories with the data after seeing Figure 1. User 2 remarked, “2006 was a good year for reggaetón. In 2011 there was a mild interesting in Puerto Rico. And 2016 was a [expletive].”

U/X Amends

Using the information I gathered from the testing, I decided it would be best to organize the visualizations the following way: “Top Ten Words Used in Articles About Puerto Rico by The New York Times” first, followed by “Most Popular Words Used In Headlines About Puerto Rico By The New York Times From 2006-2016,” and leave the densest visualization “Top Ten New York Times Sections Containing Articles About Puerto Rico and Their Corresponding Authors From 2006-2016” last.

With regard to color, I decided to keep the default blue for all the visualizations related to words (Figures 1 and 2). Users liked the color and it is similar to the blue used by the Pew Research Center in their visualizations. Figure 3, however, was trickier. Because color is used to encode 11 different years, the initial green gradient was hard to read. However, using a palette of multiple colors made the visualization even more overwhelming. I settled on a “Red to Gold” gradient and decided to link Figure 3 to Tableau’s interactive graph, so users could use the hover tool outside of WordPress. I chose the “Red to Gold” colors based on User 2’s recommendation and because red is the first color in the hierarchy of color coding.

I decided to improve the titles and axis labels on all the visualizations, especially Figure 3. In Figure 3, I changed “Number of Records” on the y-axis to “Number of Articles.” I increased the size of the text from 9 to 11 on the x- and y-axes to improve readability. I wish Tableau Public allowed me to type “Authors” below the x-axis, but I was unable to do this. I also made the line along the “Sections” heavier, so users would notice that each section was self-contained.

Analysis

Figure 1: Top Ten Words Used in Articles About Puerto Rico by The New York Times

As observed by the three users, Figure 1 does a good job of documenting major events in PR and how they are perceived in the US at different points in time. In 2006, prior to the crisis, it seems that music and other cultural topics were the island’s main draw. During 2006, the NYT’s magazine ran a profile of reggaetón artist Daddy Yankee, which is why both words figure so prominently. The author referred to the artist solely as Yankee at times throughout the piece, which is why I did not join the words in Voyant. Interestingly enough, the pending economic crisis was already brewing at the time. Section 936, which had allowed American companies to operate on the island without paying taxes, expired that year. Multinational corporations left PR in droves, ushering in a wave of unemployment on the island. However, this news did not factor into the top ten retrieved articles. During 2006, reggaetón was approaching its zenith and was probably blasting in every corner of New York City. The top ten words from the selected articles of 2006 reflect a tropical place with a music-centered vibe.

Meanwhile, the 2011 graph displays more politically charged terms, and personal names figure among the top ten words used in the selected articles. That year President Obama made a rare presidential visit to Puerto Rico, becoming the first president to visit the island officially since John F. Kennedy. The amount of attention paid to this visit can be seen in the 2011 graph; the first three words allude to or directly reflect his visit. The NYT also ran articles on the New York Giants’ Victor Cruz, who is half Puerto Rican, and on writer Esmeralda Santiago, who moved to New York during her teenage years and had just released the novel “Conquistadora.” Ana, the name of the novel’s main character, figured among the ten most popular words. Six of the ten terms have direct ties to New York or to the US’s relationship with PR. Only two words, “Students” and “Pipeline,” allude to situations with no direct connection to the mainland: the student protests of 2011 in anticipation of a university tuition hike and the protests against the proposed gas pipeline during the Fortuño administration.

Unsurprisingly, the bar graph from 2016 contains the bleakest terms. They all allude to the financial crisis or the Zika outbreak. “Power” refers to Puerto Rico’s electric power authority, which is in debt and caused an island-wide blackout this year. The NYT also covered the Zika situation in PR incessantly; the term appeared 30 times in the ten selected articles.

There is a tonal shift in the words used in the articles to describe PR over the decade described in the visualization. It is interesting that the term “island” becomes more popular as the economic situation worsens. However, there is not enough data to know whether there is a correlation between the timing and the term. Based on Figure 1, while the pending economic crisis was already brewing in 2006, it went unperceived by the newspaper until very recently.

Interestingly, the terms from Figure 1 allude to topics that mostly have direct relevance to New York or the US. PR’s situation, people, and culture seem to be worth reporting only when they have a direct impact on the US. While this is not surprising, since the NYT writes for its English-speaking American audience, it underscores the importance of having a strong Puerto Rican press. Only we can cover our own issues, culture, and people in a way that directly relates to our reality. We cannot rely on others to tell our stories and write our history.

Figure 2: Most Popular Words Used in Headlines About Puerto Rico by The New York Times from 2006-2016

Figure 2’s findings are similar to those from Figure 1. “Debt” and “Governor” top the list as the two most popular words used in the 110 selected headlines. “Governor,” “Obama,” “Police,” “Inquiry,” and “Visit” were the political terms used the most in headlines. “Debt,” “Utility,” “Fiscal,” and “Power” were the most popular economic terms used in the headlines. “US” was the most popular geographic term, followed by “San Juan.” “Salsa” and “Baseball” were also on the list, which User 3 commented was almost stereotypical. Nevertheless, our sports figures and music still figure as a draw for international coverage. “Death” also figures among the most prominent terms, which made me reflect on the old adage, “good news doesn’t sell.”

Figure 3: Top Ten New York Times Sections Containing Articles About Puerto Rico and Their Corresponding Authors from 2006-2016

Meanwhile, Figure 3 provides a decent overview of the 110 selected articles’ metadata. One can see, for example, how Travel section articles have waned since 2006. There is an uptick in Health articles written in 2016 after the Zika outbreak, an uptick in Business section articles after the global financial crisis of 2008, and an uptick in the Real Estate section in 2014 after the PR government announced tax cuts for individuals willing to relocate businesses there. Users can also see that the most frequent contributor, especially during the last two years, is Mary Williams Walsh, who covers business-related topics. Coming in second is the Miami-based Lizette Alvarez. More research is needed on each individual author, but as a preliminary hypothesis based on last names alone, most of the authors reporting on the island do not seem to be Hispanic. When I read sentences where the forced sterilization of unsuspecting Puerto Rican women is dismissed or a pregnant woman in PR’s hot and tropical climate is described as wearing “the skimpiest of maternity dresses,” I wonder if better and more equitable reporting could be achieved by having a more diverse staff writing on these issues.

Further Research and Conclusion

The three visualizations provide users with information on how PR is perceived by US media at different points in time, along with who writes this news and how it is framed. By selecting a sample of 110 articles, I was able to extract popular terms in these articles and headlines, along with their metadata, to provide a snapshot of how PR was covered by the NYT from 2006 through 2016.

I was pleased that the users were able to use the data to come to their own conclusions, especially for Figure 1. I would like to keep refining Figure 3, since I believe it is not as user-friendly as it could be. Supplying people with too much information leads to frustration, but even though Figure 3 is dense and requires more patience than the other two visualizations, I think it delivers important information. Even User 3, who disliked Figure 3, was able to make interesting comments, like noticing that none of the prominent NYT writers appeared. Listening to users and having patience even when they express negative feelings is important for creating a better product.

Further research could focus on collecting information on the number of articles written about PR from 2006 to 2016 and analyzing any patterns. Aggregating articles written by Puerto Rican newspapers, along with other stateside newspapers, and analyzing the terms used to describe the same news would also be an interesting project. This model could also be used to research how other subjects are covered, from Cuba to rural states. As news outlets fall under scrutiny for failing to provide thorough coverage beyond the East and West coasts and fake news websites proliferate, it is now more important than ever to be judicious about the content of what we read, who writes it, and how it is meant to be interpreted. These types of analysis provide us with a useful, quantitative look at our news.

Link to dataset

Information Visualization

Student work at the School of Information, Pratt Institute
