LexisNexis

LexisNexis is a database of traditional news articles. It aggregates the full text and metadata for each article it collects.

For our project, we downloaded both the full text of each article from November 7, 2016, to March 5, 2017 (for a total of 11,834 articles) and their metadata. The full text was downloaded as a plain text file, which was then split up by day using Google Sheets and OpenRefine. Once the text was separated by day, it was processed in the programming language R and analyzed in Tableau. Term frequency data per day was calculated from the text snippets using R (stopwords were removed and terms were stemmed).
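
A minimal sketch of this per-day term frequency step in R, assuming the day-separated text has been read into a data frame named articles with day and text columns; the data frame name, column names, and the use of the tidytext and SnowballC packages are illustrative assumptions, and the original scripts may have differed:

    # Term frequency per day: tokenize, drop stopwords, stem, then count.
    # Assumes a data frame `articles` with columns `day` and `text`, where
    # `text` holds the article snippets for that day.
    library(dplyr)
    library(tidytext)   # unnest_tokens(), stop_words
    library(SnowballC)  # wordStem()

    term_freq_by_day <- articles %>%
      unnest_tokens(word, text) %>%            # one token per row
      anti_join(stop_words, by = "word") %>%   # remove English stopwords
      mutate(stem = wordStem(word)) %>%        # stem each remaining term
      count(day, stem, sort = TRUE)            # term frequency per day

    # Export for visualization in Tableau.
    write.csv(term_freq_by_day, "term_frequency_by_day.csv", row.names = FALSE)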

We originally downloaded all of the available metadata on each article. This included:

  • Byline – author or journalist
  • Date – date of publication
  • Headline – article title
  • Length – number of words
  • Publication – the outlet in which the article appeared
  • Company – company names mentioned in the article
  • Geographic – combines country, state, and city
  • Organization – non-company organizations
  • Person – people mentioned
  • Subject – subject terms indexed by LexisNexis


After initially downloading all of this metadata, we decided to pare it down according to our research needs using OpenRefine. For our purposes, we kept the following fields (an illustrative R equivalent of this step follows the list):

  • Byline
  • Date
  • Headline
  • Length
  • Publication
  • Type
  • Topic
  • Weight

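The paring itself was done in OpenRefine; as a rough illustration, the equivalent step in R would be a simple column selection. This is only a sketch, assuming the export has been read into a data frame named metadata and that the derived Type, Topic, and Weight columns described below have already been added:

    library(dplyr)

    # Keep only the fields used in the analysis. Type, Topic, and Weight are
    # derived columns (see below), not raw LexisNexis fields.
    metadata_kept <- metadata %>%
      select(Byline, Date, Headline, Length, Publication, Type, Topic, Weight)
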

The Type, Topic, and Weight fields were derived from the “Company,” “Geographic,” “Person,” “Subject,” and “Organization” categories and their accompanying percentages.
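
A sketch of how that reshaping could look in R, assuming each category column holds semicolon-separated indexed terms of the form “ELECTIONS (90%)”; the exact export format, the data frame and column names, and the use of tidyr and stringr here are illustrative assumptions, and the actual transformation may have been carried out differently:

    library(dplyr)
    library(tidyr)
    library(stringr)

    # Reshape the indexed categories into long form: each indexed term becomes
    # one row, with Type = the source category, Topic = the term itself, and
    # Weight = the percentage attached to the term.
    type_topic_weight <- metadata %>%
      select(Headline, Company, Geographic, Person, Subject, Organization) %>%
      pivot_longer(-Headline, names_to = "Type", values_to = "raw") %>%
      separate_rows(raw, sep = ";\\s*") %>%
      filter(!is.na(raw), raw != "") %>%
      mutate(
        Topic  = str_trim(str_remove(raw, "\\(\\d+%\\)\\s*$")),
        Weight = as.numeric(str_match(raw, "\\((\\d+)%\\)")[, 2])
      ) %>%
      select(Headline, Type, Topic, Weight)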

See “Developing a Search” from LexisNexis for more information on the definitions of these metadata fields, which LexisNexis refers to as “document sections.”