When I began this research, I planned to study children’s internet use and the data collection methods and concerns surrounding internet-connected, or “smart,” toys. A suitable dataset proved either hard to find or legally unavailable. Meanwhile, as I conducted a literature review on data privacy legislation for another analysis paper in my Foundations of Information class, I quickly realized that the topic was part of a bigger conversation about data privacy in the United States.
Through that research, I discovered a data visualization from Information is Beautiful entitled “World’s Biggest Data Breaches and Hacks” (2022), which appeared to show incidents of hacking, data breaches and leaks, and accidental security lapses growing in size and scope since the timeline began in 2004. As I looked further into the visualization and the stories it linked to, I found some dramatic recent examples of data loss, including the loss of 900,000 records from a police database in China, a recent leak of Dubai property data illuminating illicit money and criminal investments, and a hack of streaming platform Twitch in 2021 that exposed salary and payouts alongside technical details of new products and platforms.
Research Question
For this research and data visualization project, I decided to dive deeper into the original “World’s Biggest Data Breaches and Hacks” dataset, which authors David McCandless, Tom Evans, and Paul Barton make available online. The dataset’s sources include information compiled from IdTheftCentre and DataBreaches.net, as well as news reports from The New York Times, Forbes, The Guardian, TechRadar, BBC, PCMag, TechCrunch, and others.
This research project focuses on answering a broad investigative question through data visualization: What lessons can we learn from the past 15 years of global data breaches and hacks?
My main hypothesis was that data breaches and hacks would be shown to have grown in almost every category, but especially in size (the number of data records lost) and severity (the level of data sensitivity, the frequency of events, and the number of “interesting story” events). As part of this investigation, I hoped to highlight and discuss specific, concrete examples of data loss. Through telling that story, I also wanted to begin to illuminate the human toll of, and the greater meaning behind, what I suspected was a trend toward a general mass loss of data privacy over time.
This project will complement other research I’ve done that has focused on data privacy issues from both a historical and policymaking or legislative standpoint. It will serve as a support to that research by illustrating both the present dangers and lessons to learn for the global population as we continue to constantly release, collect, and store more and more personal, financial, and biometric data online.
Background
Defining the multifaceted nature of data privacy
To trace the idea of data privacy is to trace the growth of the internet and the growth of its use over time. Internet access has grown exponentially with the proliferation of smartphones and personal devices (Pew Research Center, 2019), which in tandem has increased the amount of data being collected, stored, and analyzed by essentially all companies, institutions, and governments with a web presence. At the same time, the price of surveillance technology has decreased, allowing businesses both large and small to engage in data collection (Heavin et al., 2020). The global size of the “datasphere,” too, is growing exponentially. It is projected to surpass 175 zettabytes by 2025 (Kushmaro, 2021). This cascade of effects, coupled with the lack of regulation in the data privacy space, has created a veritable nightmare for individuals wishing to keep their data private.
A fundamental tension in the world of data collection and analysis contributes to this conflict. The more detailed the data provided to the researcher for analysis, the more useful it is in drawing conclusions. On one side of the spectrum lies privacy, and on the other, utility (Stewart, 2020). The medical research field offers some clear examples. In studying a rare disease’s progression through a randomized controlled trial, the more information researchers have about a population that has the disease in relation to one that does not, the more variables they can study and test for potential significance. Significance, in this case, might lead to better treatment and understanding. However, for a for-profit company looking to generate advertising revenue through data collection, the greater the volume and detail of the data collected, the more it is generally worth, and for less humanitarian purposes.
Data privacy violations on the rise
Publicly available data raises privacy concerns as well. The collection and release of mass amounts of personal data can still have far-reaching implications, even if the data was not necessarily kept private. For instance, in 2017, Strava, the running app, accidentally identified a secret military base by publishing its worldwide running routes, including regular laps taken by these particular armed forces (Stewart, 2020). There are also emerging concerns about the possibility of reidentifying what was thought to be anonymized data, such as when the Netflix prize competition ended up outing an LGBTQ individual (Singel, 2010).
Underscoring this conversation is the recognition that internet use is becoming ubiquitous among an increasingly younger population. Children’s internet use is higher than ever. One highly popular virtual world-building game, Roblox, “rose by over 20% in popularity [in 2021 alone], with 56% of kids playing the game worldwide” (Qustodio, 2021). In the same study, children’s time spent on IXL, a subscription learning service, rose by 46%. YouTube remained children’s top video streaming app despite recently having settled a major COPPA lawsuit about its illegal data collection practices (Federal Trade Commission, 2019). All of these tools have faced scrutiny due to a lack of transparency regarding sharing or selling data, a lack of attention to children’s safety on their platforms, and/or concerns over the security of the data they store (Common Sense, 2021). Roblox, in particular, was the subject of a major hack in 2020 (Cox, 2020).
Returning to the research question: What lessons can we learn from the past 15 years of global data breaches and hacks?
Methodology & Materials
The Information is Beautiful dataset included 16 columns of information on each event, including the company name, the year and date of the breach or hack, the sources that reported on it, and one column containing a variable called “interesting story,” corresponding to examples such as the Twitch or Dubai real estate reporting. Individual events were often listed with a main sector and a corresponding subsector, such as “government, health.”
After locating and downloading the dataset, I checked and cleaned it in Excel, then manually added a column for country data after cross-referencing the source materials for each hack or breach incident. I also transformed the “interesting story” variable into a boolean (true/false) variable. Then, for ease of reading, I grouped the sectors by main sector only in Tableau.
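The cleaning described above was done in Excel and Tableau. Purely as an illustration, the same two transformations could be sketched in Python. The column names (`sector`, `interesting_story`), the placeholder rows, and the rule that any non-empty cell counts as true are assumptions made for this sketch, not details taken from the dataset.

```python
def to_bool(cell: str) -> bool:
    """Assumed rule: any non-empty 'interesting story' cell becomes True."""
    return bool(cell and cell.strip())


def main_sector(cell: str) -> str:
    """Keep only the first sector in a value such as 'government, health'."""
    return cell.split(",")[0].strip().lower()


def clean_row(row: dict) -> dict:
    """Return a copy of the row with a boolean flag and a main-sector field."""
    cleaned = dict(row)
    cleaned["interesting_story"] = to_bool(row.get("interesting_story", ""))
    cleaned["main_sector"] = main_sector(row.get("sector", ""))
    return cleaned


# Placeholder rows for illustration only; these are not real incidents.
raw_rows = [
    {"organisation": "Example Co", "sector": "government, health", "interesting_story": "y"},
    {"organisation": "Sample Ltd", "sector": "web", "interesting_story": ""},
]
cleaned_rows = [clean_row(r) for r in raw_rows]
```

Keeping the original `sector` value alongside the derived `main_sector` means the subsector information is not lost if it is needed later.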
I uploaded the cleaned spreadsheet to Google Drive and connected Tableau Desktop to the Drive to query it, then created a Tableau Story made up of a series of five standalone visualizations and dashboards that seek to answer my research question. I published the final product through Tableau Public.
Full interactive Tableau visualization can be viewed here.
The first part of the Tableau story includes an introductory line chart that summarizes the scope of the issue over time, highlighting the aggregated number of data records lost by year since the timeline begins in 2004.
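For readers who want to reproduce the yearly aggregation behind a chart like this outside Tableau, one possible sketch in Python follows. The field names `year` and `records_lost` and the sample values are hypothetical stand-ins, not the dataset’s actual headers or figures.

```python
from collections import defaultdict


def records_lost_by_year(rows):
    """Sum the records lost per year and return the totals in year order."""
    totals = defaultdict(int)
    for row in rows:
        totals[int(row["year"])] += int(row["records_lost"])
    return dict(sorted(totals.items()))


# Made-up values for illustration only.
yearly_sample = [
    {"year": "2004", "records_lost": "100"},
    {"year": "2005", "records_lost": "250"},
    {"year": "2004", "records_lost": "50"},
]
yearly_totals = records_lost_by_year(yearly_sample)
```

Each key in `yearly_totals` is a year and each value is the summed record count, which is the shape a line chart of data loss over time needs.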
The story tabs include:
First, a timeline of data loss by size (the number of data records lost) from 2004 to 2022.
Next, a dashboard with a series of charts showing how the data was lost across sectors (web, healthcare, app, retail, gaming, transport, financial, tech, government, telecoms, legal, media, academic, energy, military) and by method (hacking, inside job, mistake – “oops!”, poor security, or lost device).
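The breakdown by sector and method amounts to counting incidents for each pairing of the two. A minimal sketch, assuming hypothetical `main_sector` and `method` fields and invented sample rows, might look like this:

```python
from collections import Counter


def incidents_by_sector_and_method(rows):
    """Count incidents for every (main sector, method) pair."""
    return Counter((row["main_sector"], row["method"]) for row in rows)


# Invented rows for illustration only.
breakdown_sample = [
    {"main_sector": "web", "method": "hacked"},
    {"main_sector": "web", "method": "hacked"},
    {"main_sector": "healthcare", "method": "lost device"},
]
pair_counts = incidents_by_sector_and_method(breakdown_sample)
```

A `Counter` returns zero for pairs that never occur, so sparse sector and method combinations can be looked up without extra checks.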
The next tab displays another dashboard with a global heat map showing where in the world the companies with breaches and hacks are active. The map serves as a jumping-off point for potential discussion and investigation into how a country’s geographic location may affect data privacy and protections with regard to oversight and, in particular, regulation of companies’ data management. In the map, I edited the color, the number ranges displayed in the legend, and the cluster sizes of the circles representing the number of data records lost in order to make the results more legible and apparent.
This tab also includes the “top ten” list of the biggest hacks and breaches within the entire dataset, by organization or entity.
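A “top ten” ranking of this kind can be approximated outside Tableau by selecting the rows with the largest record counts. The sketch below assumes hypothetical `organisation` and `records_lost` fields and uses invented values.

```python
import heapq


def largest_incidents(rows, n=10):
    """Return the n rows with the most records lost, largest first."""
    return heapq.nlargest(n, rows, key=lambda row: int(row["records_lost"]))


# Invented rows for illustration only.
ranking_sample = [
    {"organisation": "Alpha", "records_lost": "500"},
    {"organisation": "Beta", "records_lost": "9000"},
    {"organisation": "Gamma", "records_lost": "1200"},
]
top_two = largest_incidents(ranking_sample, n=2)
```

This ranks individual incidents; if an organization appears more than once in the data, its incidents would need to be summed first to rank by organization.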
The next tab shows a selection of the most concerning data hacks and breaches by the level of data sensitivity (1 = Just email address/Online information; 2 = SSN/Personal details; 3 = Credit card information; 4 = Health & other personal records; 5 = Full details).
The final tab is a reimagined version of the Information is Beautiful original visualization, which encourages the user to explore the entire dataset. Of course, this does not show all the variables, but highlights both the company involved and the size of the incident, via number of records lost, alongside any “interesting story” information associated with it. I configured pop-ups to link to interesting stories with titles and/or topics, so that the reader can read more about where and at what time major breaches and hacks were reported on.
In each of the standalone visualizations, I adjusted the cut points and colors to make each as readable as possible and to reflect “warnings” in culturally recognized “dangerous” colors like reds and oranges, and more educational information in neutrals, while also adding highlights to interesting data points. I grouped related visualizations together in a dashboard to tell a more complete story about one subtopic, for instance, the nature and scope of leaks within the government and military sector.
Conclusion & Further Study
As I investigated this dataset, I found that data privacy violations have indeed increased and escalated in severity alongside the growth of the global datasphere; however, the growth has been irregular. In drilling down into the data, I found that data was lost across sectors, but in none so heavily as the overarching “web” sector. This makes sense: the dataset encompasses incidents from 2004 on, and the biggest breaches in the early years often involved physical theft (e.g., stealing hard drives or physical documents), whereas in later years breaches moved entirely into the digital, online space.
This is an area for further analysis, but in drilling down into certain methods and sectors, I also found that the increase in data loss seemed to be affected by certain gigantic breaches and hacks, such as the loss of police data in Shanghai or the breach at J.P. Morgan (both highlighted in the story data). Many of these were accompanied by a slew of media coverage. I tried to pull out some of the human toll of this data loss in the tab highlighting specific breaches alongside some of the most “concerning” data breaches and hacks across the health and government sectors.
Time was a major limitation of this study. Given more of it, I would have liked to keep growing and investigating this dataset. However, this is a project that I plan to keep building on in the future by adding further contextual information to the dataset and the visualization. For instance, I would like to create more calculated fields and parameters to highlight each country’s top ten data breaches and hacks by size and by sector. Alongside this, I plan to do further research on how the legal landscape and cultural contexts of different affected companies’ geographic locations may further affect their vulnerability to hacking or breaches. Featuring text, links to outside images, and a more illustrative timeline with interesting stories and images would all help underscore the complexity and urgency of addressing this issue.
As a Data Analytics and Visualization graduate student, I find issues surrounding data privacy to be of particular importance. Three months into this investigation, I have become acutely aware of the vastness of the topic of data privacy and protection and of the corresponding urgent debate, both within the United States and globally, about how to address mass data loss. Researchers and leaders across the fields of information sciences, law, health, education, politics, and more have spent and continue to spend their entire lives studying the concept of data privacy and protection. This visualization begins to tell but a small contextual part of that story.
Sources
Cox, J. (2020, May 4). Hacker bribed ‘Roblox’ insider to access user data. VICE. Retrieved from https://www.vice.com/en/article/qj4ddw/hacker-bribed-roblox-insider-accessed-user-dat-reset-passwords
Feldman, A. (2022, December 16). Whither Data Privacy? INFO 601-03: Foundations of Information.
Kushmaro, P. (2021, June 7). Why Data Privacy Is A Human Right (And What Businesses Should Do About It). Forbes. Retrieved from https://www.forbes.com/sites/forbescommunicationscouncil/2021/06/07/why-data-privacy-is-a-human-right-and-what-businesses-should-do-about-it/?sh=6fe75a4ec3ca
McCandless, D. (2022, June 1). World’s biggest data breaches & hacks. Information is Beautiful. Retrieved from https://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks
Privacy program. The Common Sense Privacy Program. (n.d.). Retrieved from https://privacy.commonsense.org
Singel, R. (2010, March 12). Netflix cancels recommendation contest after privacy lawsuit. Wired. Retrieved from https://www.wired.com/2010/03/netflix-cancels-contest/
Singer, N., & Krolik, A. (2020, January 13). The New York Times. Retrieved from https://www.nytimes.com/2020/01/13/technology/grindr-apps-dating-data-tracking.html
Stock photos: Pexels, https://www.pexels.com/search/security/
Warzel, C., & Ngu, A. (2019, July 10). Google’s 4,000-word privacy policy is a secret history of the internet. The New York Times. Retrieved December 14, 2022, from https://www.nytimes.com/interactive/2019/07/10/opinion/google-privacy-policy.html