Arguing that you don't care about the right to privacy because you have nothing to hide is no different than saying you don't care about free speech because you have nothing to say.
Edward Snowden*
In 2014, Tim received a request on his Facebook app to take a personality quiz called "This Is Your Digital Life." He was offered a small amount of money and had to answer just a few questions about his personality. Tim was very excited to get money for this seemingly easy and harmless task, so he quickly accepted the invitation. Within five minutes of receiving the request on his phone, Tim logged in to the app, giving the company in charge of the quiz access to his public profile and all his friends' public profiles. He completed the quiz within 10 minutes. A UK research facility collected the data, and Tim continued with his mundane day as a law clerk in one of the biggest law firms in Pennsylvania.
What Tim did not know was that he had just shared his and all of his friends' data with Cambridge Analytica. This company used Tim's data and data from 50 million other people to target political ads based on their psychographic profiles. Unlike demographic information such as age, income, and gender, psychographic profiles explain why people make purchases. The use of personal data on such a scale made this scheme, which Tim passively participated in, one of the biggest political scandals to date.
Data has become an essential part of deep learning algorithms.* Large corporations now store a lot of data from their users because that has become such a central part of building better models for their algorithms and, in turn, improving their products. For Google, it is essential to have users' data in order to develop the best search algorithms. But as companies gather and keep all this data, it becomes a liability for them. If a person has pictures on their phone that they do not want anyone else to see, and if Apple or Google collects those pictures, their employees could have access to them and abuse the data. Even if these companies protect against their own employees having access to the data, a privacy breach could occur, allowing hackers access to people's private data.
Hacks resulting in users' data being released are very common. Every year, it seems the number of people affected by a given hack increases. One Yahoo hack compromised 3 billion people's accounts.* So, all the data these companies hold about their users becomes a burden. At other times, data is given to researchers on the assumption that they have the best intentions. But researchers are not always careful when handling data. That was the case with the Cambridge Analytica scandal. In that instance, Facebook provided researchers access to information about users and their friends, mainly from their public profiles, including people's names, birthdays, and interests.* This private company then used and sold the data to political campaigns to target people with personalized ads based on their information.
Differential privacy is a way of obtaining statistics from a pool of data from many people without revealing what data each person provided.
Keeping sensitive data or giving it directly to researchers to create better algorithms is dangerous. Personal data should be private and stay that way. As far back as 2006, researchers at Microsoft were concerned about users' data privacy and created a breakthrough technique called differential privacy, but the company never used it in its products. Ten years later, Apple released products on the iPhone that use this same method.
Figure: How differential privacy works.
Apple implements one of the most private versions of differential privacy, called the local model. It adds noise to the data directly on the user's device before sending it to Apple's servers. In that way, Apple never touches the user's true data, preventing anyone other than the user from having access to it. Researchers can analyze trends in people's data but are never able to access the details.*
Differential privacy does not merely try to make users' data anonymous. It lets companies learn statistics from large datasets with a mathematical guarantee that no one can learn much about any single individual.*
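For readers who want the formal statement behind that guarantee (background the text above does not spell out), the standard definition of ε-differential privacy can be sketched as follows, where M is the randomized algorithm that answers queries and ε is the "privacy budget":

```latex
% Standard \varepsilon-differential privacy guarantee (general background, not this book's own notation):
% for any two datasets D and D' that differ in a single person's record,
% and for every set S of possible outputs,
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

In words: whether or not any one person's data is included barely changes the probability of any result the algorithm can produce, which is exactly the sense in which no one can learn much about a single individual.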
Imagine that a company wanted to collect the average height of their users. Anna is 5 feet 6 inches, Bob is 5 feet 8 inches, and Clark is 5 feet 5 inches. Instead of collecting the height individually from each user, Apple collects the height plus or minus a random number. So, it would collect 5 feet 6 inches plus 1 inch for Anna, 5 feet 8 inches plus 2 inches for Bob, and 5 feet 5 inches minus 3 inches for Clark, which equals 5 feet 7 inches, 5 feet 10 inches, and 5 feet 2 inches, respectively. Apple averages these heights without the names of the users.
The average height of its users would be the same before and after adding the noise, about 5 feet 6 inches, because the random offsets (+1, +2, and −3 inches) sum to zero; with many users, zero-mean random noise averages out in the same way. But Apple would not be collecting anyone's actual height, and each individual's information remains secret. That allows Apple and other companies to create smart models without collecting personal information from their users, thus protecting their privacy. The same technique could produce models about images on people's phones and any other information.
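Below is a minimal Python sketch of this local-noise idea, assuming zero-mean Laplace noise rather than the fixed +1, +2, and −3 inch offsets of the example; the function names and the noise scale are illustrative choices, not Apple's actual mechanism:

```python
import numpy as np

# True heights in inches (Anna 5'6", Bob 5'8", Clark 5'5"), as in the example above.
true_heights = {"Anna": 66, "Bob": 68, "Clark": 65}

def randomize_on_device(height_inches, scale=2.0):
    """Runs on the user's phone: add zero-mean Laplace noise so the server
    never receives the true height. A larger scale means more privacy but
    noisier estimates."""
    return height_inches + np.random.laplace(loc=0.0, scale=scale)

# The server only ever sees the noisy values, with no names attached.
noisy_reports = [randomize_on_device(h) for h in true_heights.values()]

true_avg = sum(true_heights.values()) / len(true_heights)
noisy_avg = sum(noisy_reports) / len(noisy_reports)
print(f"true average:  {true_avg:.1f} inches")
print(f"noisy average: {noisy_avg:.1f} inches (approaches the true average as more users report)")
```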
Differential privacy, or keeping users' data private, is very different from anonymization. Anonymization does not guarantee that information the user has, like a picture, is not leaked, or that the individual cannot be traced back from the data. One example is sending a pseudonym of a person's name but still transmitting their height. Anonymization tends to fail. In 2007, Netflix released 10 million movie ratings from its viewers so that researchers could create a better recommendation algorithm. It only published the ratings, removing all identifying details.* Researchers, however, matched this data with public data on the Internet Movie Database (IMDb).* After matching patterns of recommendations, they added the names back to the original anonymized data. That is why differential privacy is essential: it prevents users' data from being leaked in any possible way.
Figure: Emoji usage across different languages.
This figure shows the usage of each emoji as a percentage of total emoji usage in English-speaking and French-speaking countries. The data was collected using differential privacy. The distribution of emoji usage in English-speaking countries differs from that of French-speaking countries, which may reveal underlying cultural differences in how each group uses language. Here, the interesting signal is how frequently each emoji is used.
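One common way to collect category frequencies like these under local differential privacy is generalized randomized response: each device usually reports the true emoji but sometimes reports a random other one, and the server corrects for the known lying rate. The sketch below is only illustrative; the emoji list, the privacy budget, and the function names are assumptions, not Apple's production system:

```python
import math
import random
from collections import Counter

EMOJIS = ["😂", "❤️", "😍", "😭", "🙏"]  # illustrative categories
EPSILON = 2.0                            # assumed privacy budget for this sketch

k = len(EMOJIS)
p = math.exp(EPSILON) / (math.exp(EPSILON) + k - 1)  # probability of reporting the true emoji
q = 1.0 / (math.exp(EPSILON) + k - 1)                # probability of reporting any other specific emoji

def randomize_on_device(true_emoji):
    """Runs locally: usually tell the truth, otherwise report a random different emoji."""
    if random.random() < p:
        return true_emoji
    return random.choice([e for e in EMOJIS if e != true_emoji])

def estimate_counts(reports):
    """Server side: unbias the noisy counts to estimate the true frequencies."""
    n = len(reports)
    observed = Counter(reports)
    return {e: (observed[e] - n * q) / (p - q) for e in EMOJIS}
```

No single report can be trusted, yet the corrected totals approach the true distribution as more users contribute, which is how a chart like the one above can be built without seeing anyone's real typing history.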
Apple started using differential privacy to improve its predictive keyboard,* the Spotlight search, and the Photos app. It was able to advance these products without obtaining any specific user's data. For Apple, privacy is a core principle. Tim Cook, Apple's CEO, has time and time again called for better data privacy regulation.* The same data and algorithms that can be used to enhance people's lives can be used as a weapon by bad actors.
Apple's predictive keyboard, built with data collected through differential privacy, helps users by suggesting the next word in the text based on its models. Apple has also been able to create models of what is inside people's pictures on their iPhones without having actual users' data. Users can search for specific items like "mountains," "chairs," and "cars" in their pictures, and all of that is served by models developed using differential privacy. Apple is not the only one using differential privacy in its products. In 2014, Google released a system for its Chrome web browser to figure out users' preferences without invading their privacy.
But Google has also been working with other technologies to produce better models while continuing to keep users' data private.
Google developed another technique called federated learning.* Instead of collecting statistics about users, Google trains a model in-house and then deploys it to each user's computer, phone, or application. There, the model is trained on data that the user generates or that is already on the device.
For example, if Google wants to create a neural network to identify objects in pictures and has a model of how "cats" look but not how "dogs" look, then the neural network is sent to a user's phone that contains many pictures of dogs. From that, it learns what dogs look like, updating its weights. Then, it summarizes all of the changes in the model as a small, focused update. The update is sent to the cloud, where it is averaged with other users' updates to improve the shared model. Everyone's data advances the model.
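Here is a toy sketch of that federated-averaging loop, using a tiny linear model so it stays self-contained; the function names and the model are illustrative, not Google's production code:

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """Runs on the device: start from the shared model, train on data that never
    leaves the phone, and return only the small weight delta."""
    w = global_weights.copy()
    for x, y in local_data:
        pred = w @ x                # toy linear model for illustration
        grad = (pred - y) * x       # squared-error gradient
        w -= lr * grad
    return w - global_weights       # the small, focused update

def federated_round(global_weights, devices):
    """Server side: average the updates from many devices and apply them.
    Raw data never reaches the server; only the averaged update does."""
    updates = [local_update(global_weights, data) for data in devices]
    return global_weights + np.mean(updates, axis=0)

# Toy usage: three devices, each holding a handful of (features, label) pairs.
rng = np.random.default_rng(0)
devices = [[(rng.normal(size=3), rng.normal()) for _ in range(5)] for _ in range(3)]
weights = np.zeros(3)
for _ in range(10):                 # ten communication rounds
    weights = federated_round(weights, devices)
```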
Federated learning* works without the need to store user data in the cloud, but Google is not stopping there. It has developed a secure aggregation protocol that uses cryptographic techniques so that the average update can only be decrypted if hundreds or thousands of users have participated.* That guarantees that no individual phone's update can be inspected before it is averaged with other users' data, thus guarding people's privacy. Google already uses this technique in some of its products, including the Google keyboard that predicts what users will type. The product is well-suited for this method since users type a lot of sensitive information into their phones. The technique keeps that data private.
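The core idea behind secure aggregation is that pairs of users add matching random masks with opposite signs, so every mask cancels when the server sums the reports and only the aggregate survives. The sketch below shows just that cancellation trick; in the real protocol the pairwise secrets come from a cryptographic key exchange and users who drop out are handled explicitly, neither of which is modeled here:

```python
import random

def seed_for(a, b):
    """Shared pairwise secret; in the real protocol this comes from a key exchange."""
    return hash(frozenset((a, b)))

def masked_update(user_id, update, all_users):
    """Each user adds a pseudorandom mask per peer: added toward higher ids,
    subtracted toward lower ids, so all masks cancel in the server's sum."""
    masked = list(update)
    for peer in all_users:
        if peer == user_id:
            continue
        rng = random.Random(seed_for(user_id, peer))
        mask = [rng.uniform(-1, 1) for _ in update]
        sign = 1 if user_id < peer else -1
        masked = [m + sign * v for m, v in zip(masked, mask)]
    return masked

# Illustrative model updates from three users.
updates = {0: [1.0, 2.0], 1: [0.5, -1.0], 2: [2.0, 0.0]}
reports = [masked_update(uid, upd, updates.keys()) for uid, upd in updates.items()]

# The server only sees masked vectors, yet their sum equals the sum of the true updates.
print([sum(vals) for vals in zip(*reports)])  # approximately [3.5, 1.0]
```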
This field is relatively new, but it is clear that these companies do not need to keep users' data in order to create better and more refined deep learning algorithms. In the years to come, more hacks will happen, and users' data that has been stored to improve these models will be exposed to hackers and other parties. But that does not need to be the norm. Privacy does not have to be traded away to get better machine learning models. Both can coexist.
What's not fully realized is that Moore's Law was not the first but the fifth paradigm to bring exponential growth to computers. We had electromechanical calculators, relay-based computers, vacuum tubes, and transistors. Every time one paradigm ran out of steam, another took over.
Ray Kurzweil*
The power of deep learning depends on the design as well as the training of the underlying neural networks. In recent years, neural networks have become complicated, often containing hundreds of layers. This imposes higher computational requirements, causing an investment boom in new microprocessors specialized for this field. The industry leader, Nvidia, earns at least $600M per quarter from selling its processors to data centers and companies like Amazon, Facebook, and Microsoft.
Facebook alone runs convolutional neural networks at least 2 billion times each day. That is just one example of how intensive the computing needs are for these processors. Tesla cars with Autopilot enabled also need enough computational power to run their software. To do so, they rely on a super processor: a graphics processing unit (GPU).