vietraveselur

AOL Search Data Leak Download: The Biggest Mistake in Internet History



AOL did not identify users in the report; however, personally identifiable information was present in many of the queries. Because AOL attributed the queries to particular numerically identified user accounts, an individual could be identified and matched to their account and search history.[1] The New York Times was able to locate an individual from the released, anonymized search records by cross-referencing them with phonebook listings.[2] Consequently, the ethical implications of using this data for research are under debate.[3][4]


In September 2006, a class action lawsuit was filed against AOL in the U.S. District Court for the Northern District of California. The lawsuit accused AOL of violating the Electronic Communications Privacy Act and of fraudulent and deceptive business practices, among other claims, and sought at least $5,000 for every person whose search data was exposed.[8] The case was settled in 2013.[9]







Through clues revealed in the search queries, The New York Times successfully uncovered the identities of several searchers. With her permission, they exposed user #4417749 as Thelma Arnold, a 62-year-old widow from Lilburn, Georgia.[10] This privacy breach was widely reported, and led to the resignation of AOL's CTO, Maureen Govern, on August 21, 2006. The media quoted an insider as saying that two employees had been fired: the researcher who released the data, and his immediate supervisor, who reported to Govern.[11][12]


The data download is a 439 MB TGZ file in which the AOL screen names of the users have been obfuscated by name randomization. Please note that the file may (and does) contain sexually explicit data. The collection is distributed for non-commercial research use only (commercial usage is prohibited). Interested researchers can add their comments to the U500k community.
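For readers who want to query such a log, here is a minimal sketch of loading it into a database. The text above does not describe the file layout, so the column names (AnonID, Query, QueryTime, ItemRank, ClickURL), the tab separator, and the header row are assumptions, and the sample rows are invented. SQLite is used only to keep the sketch self-contained.

```python
import csv
import io
import sqlite3

# ASSUMPTION: tab-separated columns with a header row. The layout is not
# given in the text above, and these sample rows are invented.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1001\texample query one\t2006-03-01 10:00:00\t\t\n"
    "1001\texample query two\t2006-03-02 11:30:00\t1\thttp://example.com\n"
    "1002\tanother query\t2006-03-05 09:15:00\t\t\n"
)


def load_queries(lines, conn):
    """Insert every row of a tab-separated query log into a `queries` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queries ("
        "anon_id INTEGER, query TEXT, query_time TEXT, "
        "item_rank INTEGER, click_url TEXT)"
    )
    # QUOTE_NONE: search queries may contain stray quote characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    rows = []
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed lines
        row = row + [""] * (5 - len(row))  # pad missing trailing fields
        anon_id, query, query_time, rank, url = row[:5]
        rows.append(
            (int(anon_id), query, query_time, int(rank) if rank else None, url or None)
        )
    conn.executemany("INSERT INTO queries VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_queries(io.StringIO(SAMPLE), conn)
per_user = conn.execute(
    "SELECT anon_id, COUNT(*) FROM queries GROUP BY anon_id ORDER BY anon_id"
).fetchall()
print(n_loaded, per_user)
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open text file handle; the rest of the sketch stays the same.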


Query logs from real search engines are hard to find. Here are some that I've downloaded before without too much difficulty. Keep in mind that nearly all come with a license you need to agree to before downloading, and are for non-commercial use only.


SB: Right. Although AOL immediately deleted the files from their website, they were mirrored and distributed hundreds of times by several people. Now, ten years later, the hardest part was not finding download links in general, but finding a mirror with all the files still online. After downloading the whole package (which took ages!) I created a MySQL database containing every single query, to get full flexibility in using the data.


Users were exposed in terrible ways. But I had the feeling that there's nothing I can change about that anymore, and neither did I want to. Rather than dragging the data into another environment, by keeping it inside the (rebuilt) AOL search engine, I constructed a memorial to this case and everyone involved.


SB: It was definitely fascinating to see how the reactions changed during the presentation. In my observation it began with amusement, followed by amazement at how deep the insight into the life and mind of user 711391 really is. Interestingly, I've found that the discussions I've had following the presentation differed: while one group of people wanted to know a lot about the circumstances of the search data release and the legal consequences for AOL, the other group really did question their own search behavior and tried to think of search queries that would make them identifiable. In the end, everyone felt caught, in a way, but dealt quite differently with this displeasing feeling.


OL: Your work reminds me of a project by Tobias Leingruber, a student of mine a few years back. In 2008 he released Pirates of the Amazon; there was much ado for a week or two, but it had to be shut down quickly. It was a browser add-on for Amazon: when you searched for something on Amazon, the work provided a link to the same product on Pirate Bay. Leingruber didn't provide pirated material, he didn't pirate anything. He provided a one-click interface. You didn't collect or leak the data, but you made an interface that provided access. This brings me again (and again) to the idea that the role of the interface designer in today's world is enormous. Do you feel your power?


SB: Definitely! I'm kind of grateful I worked with the released AOL search data for my final project, because to me, it clearly underlines the point you've just made: in most cases, it's not about providing a pleasing visual or "entertaining" interface; it's about the power of the interface to enable users to do the things they need to do to gain knowledge, whatever that might be. In my case, I could have transferred the data into a completely different context with some fancy data visualization, but that wasn't the best way to prove the point.


Observational studies of public behavior (including television and public Internet chat rooms) do not involve human subjects as defined when there is no intervention or interaction with the subjects and the behavior is not private. Also, studies based on data collected for non-research purposes may not constitute human subjects research if individual identities are not available (for example, programmatic data such as service statistics, school attendance data, crime statistics, or election returns).


Exempt research with children: The exemption categories that may be used with children include:

Research conducted in established or commonly accepted educational settings, involving normal educational practices.
Research about educational tests.
Observations of children in public settings, providing the researcher does not participate in the activities being observed.
Studies using existing data about children, (a) if the data are publicly available, or (b) if they are recorded in such a way by the investigator that the identity of the children cannot be determined either directly or indirectly.
Studies conducted by federal departments or agencies about government programs, such as welfare programs.
Taste and food quality evaluations and consumer acceptance studies, under some circumstances.

According to Subpart D, exemptions may not be used for any of the following:

Research involving interviews.
Research involving surveys.
Observation in which the researcher participates in the activities observed.

Internet and re-identification of individual data

Re-identification of Data


In the America Online (AOL) search data leak of 2006, the Internet service provider AOL released a dataset that included the search records of 500,000 of its users. AOL, in good faith, had intended to make the data available to benefit academic researchers. AOL had stripped the names from the released data and provided only what were supposed to be unidentifiable user numbers. However, within days, journalists from the New York Times were able to discover the identity of user number 4417749 simply by examining that user's distinctive search queries. The company eventually removed the data (Hafner 2006, Jones 2006, Zeller 2006).
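The mechanism is easy to sketch: because every query carries the same user number, grouping by that number rebuilds one person's search history even though no name is attached. The records below are invented for illustration and are not taken from the released data; the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical (user_number, query) records, invented for illustration.
records = [
    (101, "landscapers in springfield"),
    (202, "weather tomorrow"),
    (101, "homes sold in maple street subdivision"),
    (101, "people with the last name doe"),
    (202, "pasta recipes"),
]


def profile_by_user(records):
    """Group queries by user number, keeping their original order."""
    profiles = defaultdict(list)
    for user_number, query in records:
        profiles[user_number].append(query)
    return dict(profiles)


profiles = profile_by_user(records)
for user_number, queries in sorted(profiles.items()):
    print(user_number, len(queries), queries)
```

Each query on its own says little. Taken together, the three queries of hypothetical user 101 suggest a town, a neighborhood, and a surname, which is the kind of combination that can be checked against a public listing.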


According to 2018 research, the average cost per record lost in a data breach is $148. The average total cost of a breach is $3.86 million. Even a small business with 1,000 lost records could see costs in the tens of thousands.


DataLossDB is an open source, community-maintained research project that covers publicly-disclosed data breaches across the globe. The site provides details around data loss incidents as well as analysis of historical data breach trends.


We at Digital Guardian regularly cover the topic of data breaches and provide insights on both preventing and responding to a data breach. In this expert roundup, we ask 30 data security experts to share the most important next step you should take following a data breach. In another roundup, we asked 27 data security experts for their insights on the most cost-effective ways startups can protect themselves from data breaches. To help your employees with cyber awareness, we also created a Cybersecurity Awareness Kit. Be sure to follow our blog for updates on the latest data security information, research, and discussions, and visit our resources section for analyst reports, case studies, data sheets, and other resources on data breach prevention and data security.


Besides, these AOL users shouldn't get too worked up. They couldn't possibly be too concerned about what anyone thinks about them or they wouldn't be using AOL in the first place. The rest of the Internet wasn't particularly surprised at the contents of that search data -- we were all working under the assumption that everyone on AOL was searching for pictures of poo and instructions on how to murder people anyway. The data in question simply confirmed that suspicion.


At a minimum, there are several thousand present/past Congressmen/women, their spouses, and their immediate relatives. It's probable that the database contains the search records of at least one curren


This will be really interesting to watch. I mean, AOL has dirt on everyone - I can imagine it will be hard to have a court case against them when AOL can come back and say "Oh here you are searching for child porn, illegal song downloads, etc." Unless they don't have anything to be ashamed of I can see it being a very difficult case for the plaintiffs.


It is surprising that so much can be identified by deduction from data. You may assume that you can safely distribute partially masked data for reporting, development or testing when the original data contains personal information. Without this sort of information, much medical or scientific research would be vastly more difficult. However, the more useful the data is, the easier it is to mount an inference attack on it to identify personal information. Phil Factor explains.
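One rough way to gauge that risk is to count how many rows share each combination of the attributes left unmasked. The sketch below uses invented rows and assumes that the ZIP prefix, birth year, and sex are the quasi-identifiers; it is a simple uniqueness check, not the specific method the article goes on to describe.

```python
from collections import Counter

# Hypothetical "partially masked" rows: names removed, other fields kept.
rows = [
    {"zip3": "300", "birth_year": 1944, "sex": "F", "diagnosis": "A"},
    {"zip3": "300", "birth_year": 1944, "sex": "F", "diagnosis": "B"},
    {"zip3": "300", "birth_year": 1971, "sex": "M", "diagnosis": "C"},
    {"zip3": "941", "birth_year": 1985, "sex": "F", "diagnosis": "A"},
    {"zip3": "941", "birth_year": 1985, "sex": "F", "diagnosis": "A"},
    {"zip3": "941", "birth_year": 1985, "sex": "M", "diagnosis": "D"},
]

# ASSUMPTION: these columns are the ones an attacker could match against
# outside sources such as a voter roll or phonebook.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def group_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifiers."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys):
    """Return rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


sizes = group_sizes(rows, QUASI_IDENTIFIERS)
k = min(sizes.values())  # smallest group size (the "k" of k-anonymity)
exposed = unique_rows(rows, QUASI_IDENTIFIERS)
print("k =", k)
print("rows that stand alone:", exposed)
```

Any row that stands alone in its group can be tied to one person by anyone who knows those three attributes, and the sensitive field comes along with it. Keeping more columns makes the data more useful and shrinks the groups at the same time.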


Data about people and their activities is passed around for research purposes, and it is important to be able to mine information from this data, in a way that is appropriate and within the legal constraints of the custodianship of personal data. Many advances in medicine, for example, are made purely by finding patterns in existing patient and biomedical data. This research saves lives.

