Probability is orderly opinion and inference from data is nothing other than the revision of such opinion in the light of relevant new information.*
Probabilistic reasoning was a fundamental shift from the way that problems were addressed previously. Instead of adding facts, researchers started using probabilities for the occurrence of facts and events, building networks of how the probability of each event occurring affects the probability of others. Each event has a probability associated with it, as does each sequence of events. These probabilities, combined with observations of the world, are used to determine, for example, the state of the world and what actions are appropriate to take.
Probabilistic reasoning involves techniques that leverage the probability that events will occur. Judea Pearl's influential work, in particular with Bayesian networks, gave new life to AI research and was central to this period. Maximum likelihood estimation was another important technique used in probabilistic reasoning. IBM Watson, one of the last successful systems to rely on probabilistic reasoning, built on these foundations to beat the best humans at Jeopardy!
The work pioneered by Judea Pearl marked the end of the Second AI Winter.* His efforts ushered in a new era, arguably creating a fundamental shift in how AI was applied to everyday situations. One could even go so far as to say that his work laid much of the groundwork for artificial intelligence systems up to the end of the 1990s and the rise of deep learning. In 1985, Pearl, a professor at the University of California, Los Angeles, introduced the concept of Bayesian networks.* His new approach made it possible for computers to calculate probable outcomes based on the information they had. He had not only a conceptual insight but also a technical framework to make it practical. Pearl's 1988 book, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, became the bible of AI at the time.
The techniques developed in this area were much more useful than just logic alone because probabilities represent more information than a conclusion. For example, stating "there is a 75% chance of rain in the next hour" conveys more information than "it is going to rain in the next hour," especially because events in the future are not certain.
Named after the 18th-century mathematician Thomas Bayes, Dr. Pearl's work on Bayesian networks provided a basic calculus for reasoning with uncertain information, which is everywhere in the real world. In particular, Pearl's technique was the foundational framework for reasoning with imperfect data that changed how people approached real-world problem solving. Pearl's research was instrumental in moving machine-based reasoning from the rules-bound expert systems of the 1980s to a calculus that incorporated uncertainty and probabilistic models. In other words, he figured out methods for trying to draw the best conclusion even when there is a degree of unpredictability.
Bayesian networks were applied when trying to answer questions from a vast amount of unstructured information or when trying to figure out what someone said in languages, like Chinese, that have many similar-sounding words. His work applied to an extensive range of applications, from medicine and genetics to information retrieval and spam filtering, in which only partial information is available.*
Bayesian networks provided a compact way of representing probability distributions. The Bayesian network formalism was invented to allow efficient representation of, and rigorous reasoning with, uncertain knowledge. This approach largely overcame many problems of the systems of the 1960s and 1970s and, simply put, dominated AI research on uncertain reasoning. In the 1960s, simplifying assumptions had to be made to cope with time and space complexity, and the systems were small in scale and computationally expensive.* In the 1970s, researchers shifted to using probability theory, but unfortunately, it could not be applied straightforwardly. Even with modifications, it could not solve the problems of uncertainty.*
Bayesian networks are a way of representing the dependencies between events and how the occurrence or probability of one event affects the probabilities of others. They are based on Bayes's theorem, which states that the probability of an event happening depends on whether other relevant events have happened, or on the probability that they will happen.
For example, the likelihood of a person having cancer increases as the age of that person goes up. Therefore, a person's age can be used to more accurately assess the probability that they have cancer. Bayes's rule also applies in the other direction: if you find out that someone has cancer, then the probability that the person is older is higher.
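To make both directions of Bayes's theorem concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration: assume 20% of a population is over 65, the overall cancer rate is 5%, and the rate among people over 65 is 15%.

```python
# Hypothetical numbers, chosen only to illustrate Bayes's theorem.
p_older = 0.20               # P(person is over 65)
p_cancer = 0.05              # P(person has cancer)
p_cancer_given_older = 0.15  # P(cancer | over 65)

# Bayes's theorem: P(over 65 | cancer) = P(cancer | over 65) * P(over 65) / P(cancer)
p_older_given_cancer = p_cancer_given_older * p_older / p_cancer

print(f"P(cancer | over 65) = {p_cancer_given_older:.2f}")  # 0.15, above the 0.05 base rate
print(f"P(over 65 | cancer) = {p_older_given_cancer:.2f}")  # 0.60, above the 0.20 base rate
```

Under these made-up numbers, knowing that someone is older raises the probability of cancer from 5% to 15%, and knowing that someone has cancer raises the probability that they are older from 20% to 60%, exactly the two directions described above.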
Figure: A Bayesian Network.
This figure shows an example of a simple Bayesian network. Each node has a table with its probabilities conditioned on its parent nodes. You can calculate the probability that the grass is wet given that it is raining and the sprinkler is not on. Or you could, for example, determine the chances that the sprinkler is running if it is not raining and the grass is wet. The diagram simply shows the probability of one outcome based on the states that lead to it. Bayesian networks help determine the probability of something happening given observations of other states.
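As a rough sketch of this kind of inference, the Python snippet below enumerates the joint distribution of a toy rain / sprinkler / wet-grass network and answers the second question above. The probability tables are illustrative values I chose for the example, not the ones in the figure.

```python
# Toy Bayesian network: Rain -> Sprinkler, Rain -> WetGrass, Sprinkler -> WetGrass.
# All probability tables below are made-up illustrative values.

def p_rain(r):
    return 0.2 if r else 0.8

def p_sprinkler(s, r):
    p = 0.01 if r else 0.4  # the sprinkler rarely runs when it is raining
    return p if s else 1 - p

def p_wet(w, s, r):
    p = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.80, (False, False): 0.05}[(s, r)]
    return p if w else 1 - p

def joint(r, s, w):
    # The joint probability factorizes along the edges of the network.
    return p_rain(r) * p_sprinkler(s, r) * p_wet(w, s, r)

# P(Sprinkler = on | WetGrass = wet, Rain = no), by enumerating the joint distribution.
numerator = joint(False, True, True)
denominator = sum(joint(False, s, True) for s in (True, False))
print(f"P(sprinkler | wet grass, no rain) = {numerator / denominator:.2f}")  # ~0.92
```

For larger networks, enumerating the full joint distribution quickly becomes too expensive, which is why practical systems rely on more efficient inference algorithms, but the underlying idea is the same.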
In this case, the dependencies are simple, but when you have many dependent events, Bayesian networks are a way of representing them and their relationships. For example, stock prices may depend on many factors, including public sentiment, central bank interest rates and bond prices, and the trading volume at the moment. A Bayesian network represents all these dependencies.
Bayesian networks addressed these problems by giving researchers a principled framework for reasoning with uncertain information. Even though they were useful in the 1990s, probabilistic reasoning does not address all possible cases due to the qualification problem described in the previous chapter. That is why, I believe, probabilistic reasoning fell out of favor in the early 2000s and deep learning has taken over the field since then. A few probabilities cannot describe how complex the world is.
Another technique used frequently during these years was maximum likelihood estimation (MLE). Based on a model of how the world should work and observations of what is actually in the world, maximum likelihood estimation tries to determine the values of the model's parameters that would maximize the probability of those observations happening.
The idea behind it is that if you observe enough events occurring in the world, you have enough samples to estimate the real probability distribution of those events. MLE finds the parameters of the model that best fit the observed data.
Figure: A normal distribution of heights of a given population.
For example, let's say that you know that a normal distribution best describes the height of individuals in a certain country, like in the figure above. The y-axis represents the number of people with a certain height, and the x-axis represents the heights of the individuals. So in the center of this curve, we know that there are many people of average height, and as we move farther from the center on either side, there are fewer taller or shorter people.
With this technique, you can poll a lot of people to find out their height, and based on the data, you can estimate the distribution across the entire population. As shown in the figure below, after receiving the responses, you can then assume that the distribution of the heights of the entire population is the one that maximizes the likelihood of those responses, the curved line from the first figure. The information on the distribution of heights of people inside a country can be useful for many applications. And with MLE, you can determine the most likely scenario for the heights of the population by surveying only a portion of it.
Figure: This image shows the corresponding responses from a set of people. The curves on top of it are the assumed models based on the data given by the answers and the assumed normal distribution, using MLE. The blue line represents the height of men in the study's population and the pink, the height of women.
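As a rough sketch of how this works in practice, the Python snippet below simulates a height survey and fits a normal distribution to it by maximum likelihood. For a normal distribution, the maximum likelihood estimates have a simple closed form: the sample mean and the average squared deviation from it. The population parameters used to generate the simulated survey are made up.

```python
import math
import random

# Simulate a height survey (in cm); the "true" mean and spread here are hypothetical.
random.seed(0)
sample = [random.gauss(170, 8) for _ in range(1000)]

# MLE for a normal distribution: the estimated mean is the sample mean, and the
# estimated variance is the average squared deviation from that mean.
mu_hat = sum(sample) / len(sample)
var_hat = sum((x - mu_hat) ** 2 for x in sample) / len(sample)
sigma_hat = math.sqrt(var_hat)

print(f"estimated mean height: {mu_hat:.1f} cm")
print(f"estimated standard deviation: {sigma_hat:.1f} cm")
```

The estimates land close to the values used to generate the data, which is the point of the technique: a survey of only part of the population is enough to recover a good approximation of the whole distribution.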
In many ways, the use of probability for inference marked this period. It preceded the revolution that multilayer neural networks, also known as deep learning, would cause in the field. Probabilistic reasoning was successful in many applications and reached its peak with the development of IBM Watson. While Watson did not use Bayesian networks or maximum likelihood estimation for its calculations, it used probabilistic reasoning to determine the most likely answer.
Watson was a project developed from 2004 to 2011 by IBM to beat the best humans at the television game show Jeopardy! The project was one of the last successful systems to use probabilistic reasoning before deep learning became the go-to solution for most machine learning problems.
Since Deep Blue's victory over Garry Kasparov in 1997, IBM had been searching for a new challenge. In 2004, Charles Lickel, an IBM research manager at the time, identified the project during a dinner with co-workers. Lickel noticed that most people in the restaurant were staring at the bar's television. Jeopardy! was airing. As it turned out, Ken Jennings was playing his 74th match, the last game he won.