Understanding Polling and Its Failures
"Fighting with a large army under your command is nowise different from fighting with a small one: it is merely a question of instituting signs and signals."
-Sun Tzu
Overview and Common Assumptions
The tactical lesson of the 2016 presidential election is that only a fool would blindly trust the polls. Listening to both sides' campaign managers the night before the election, it was clear that, paradoxically, both Trump and Clinton were going to win. Somehow the same polls in the same states told the universe of strategists, spectators, wonks, and traders that election result margins were both "stable" and "erroneous", both "decisive" and "within the margin of error". Not to mention the classic campaign strategist spin that "internal polls have us ahead". I contend that media organizations and private companies have spent an inordinate amount of time, money, and effort trying to estimate the result of an election given a collection of polls. These individual polls are known to be flawed, but the reasoning goes that we can use statistics to adjust for all of the errors and estimate the right result. This reasoning is fast becoming an academic relic. Traditional land lines are disappearing from American households, and response rates are falling across all mediums of communication, both traditional and non-traditional [3]. Those organizations that fail to adjust to this new reality will make incorrect predictions and, more importantly, will provide incorrect analytics. Our group advocates a new estimation methodology and model, discussed in Appendix A, that centers on the assertion that we should estimate election results using larger and therefore more accurate polls from fewer counties.
I am not writing to provide the academic evidence nor the quantitative rationale for every one of my statements. I will paint in broad strokes where I find it necessary, dig into the numbers when it is useful, and offload nuanced, in-depth explanations to cited works.
Section 3 dives into the mathematical basis of polling. In section 7 I will describe the new polling model our team built and tested over the last two years.
Disclaimers
A key assumption of the model we now advocate is that American elections are binary races where voters have a choice between two candidates. Yes, the United States does indeed have more than two political parties. There are the Democrats, Republicans, Green Party, Libertarians, Socialists, etc. But more than 90% of Americans who go to the ballot box vote for either a Democrat or a Republican. Anyone else is an independent candidate - and independents do not win elections in the winner-take-all American system. The highest vote percentage achieved by an independent candidate in recent decades was roughly 19% by Ross Perot in 1992. If an independent candidate doesn't declare early, have significant momentum going into the election, or have a large campaign war chest, there are only two choices for president. Either a Democrat or a Republican will win. This is the reason, of course, that Bernie Sanders joined the Democratic Party. The independent candidate will play spoiler to the party from which he/she steals more votes. It is a grim truth that we as Americans are reduced to picking between two parties that are meant to capture every aspect of our beliefs with patchwork platforms appealing to broad coalitions of voters, but it is the truth, and until this system changes we will continue to operate our model assuming two parties.
To my own chagrin, the conservative attack that the Media is biased contains shades of the truth. What the public seems to forget is that the television Media (CNN, MSNBC, FOX) is in the business of selling advertisement slots. While it may have been true that the first television news programs were run as a public service, this is no longer the case. There are very few news programs that offer unbiased news, if such a thing can even exist. CSPAN, ProPublica, and PBS are arguably excellent examples of nonpartisan news. CSPAN is also dry as dirt, boring and substantive like plain oatmeal. Americans want to see action! We like our politics like we like our sports: fast, exciting, and close until the end. No one likes the game that ends 10-0. We want the shoot-out. We want the nail-biter. It hurts our democracy when the Media chooses to treat politics like sports, but they do it because that's what the market demands.
Polling Math
This section should be seen as medicine. I wrote it to be accessible to the average reader, and I hope you take the time to skim it to at least understand that polling isn't all hand-waving and voodoo.
Expectation of Random Variable
We have a random variable $X$. The random variable has expectation $E[X]$. We use random variables to model random phenomena, like rolling dice, or flipping coins. From now on we will use the abbreviation "R.V." interchangeably with random variable. We compute the expectation in the finite case as follows:
$$E[X] = \sum_{x} x \cdot P(X = x)$$
Don't stop reading, this is actually an easy concept to grasp. The expectation or "expected value" of the random variable $X$ is calculated by adding up every possible value $X$ can take on, multiplied by the probability of $X$ taking on that value. Let's say the random variable $X$ is modeling a dice roll. You and I both know when we roll a six-sided dice it will land on one of the values: $\{1, 2, 3, 4, 5, 6\}$. Assuming no one is cheating, we intuit that there is an equal chance of the dice showing any one of the numbers. Using the formula above we can calculate the expectation of the random variable $X$.
$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5$$
I might not get around to showing the proof, but something called The Law of Large Numbers says that if we look at the value the random variable spits out many, many times, the difference between the average of those spit out values and the expectation we computed above will fall to zero. This needs to be very clear. I roll the dice once and it turns up a 3. The second time I roll it turns up a 2. The next time a 5. The next time 5. If I roll the dice a billion times, add up all the numbers I roll, and divide by a billion, the answer is all but guaranteed to be incredibly close to 3.5.
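To make the dice example concrete, here is a minimal sketch in Python. It computes the expectation exactly from the definition and then simulates rolls to watch the average settle near 3.5. The function names (`expectation`, `running_average`) and the fixed random seed are my own choices for illustration.

```python
import random
from fractions import Fraction


def expectation(pmf):
    """E[X]: sum of each value times its probability (finite case)."""
    return sum(x * p for x, p in pmf.items())


# A fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
exact = expectation(die)  # Fraction(7, 2), i.e. 3.5


def running_average(n_rolls, seed=0):
    """Average of n_rolls simulated die rolls (seeded so runs repeat)."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls


print("exact expectation:", float(exact))
for n in (10, 1_000, 100_000):
    print(f"average of {n} rolls:", running_average(n))
```

With only ten rolls the average can land well away from 3.5; with a hundred thousand it should sit very close to it, which is the Law of Large Numbers at work.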
Markov's Inequality
So what was the point of this? We're getting to the good part. Here's the question you should be asking (which I borrowed from [1]):
"What is the probability that the value of the random variable X, is not close to its expectation?"
I just told you that we can take a random variable and compute the value we expect as the average if we can 'roll the dice' an infinite number of times. But in the real world, you don't get to roll the dice billions of times, you often just get one roll of the dice. So we want to know how likely it is that the random variable spits out a value that isn't close at all to the expectation. First I'll give the formula, then we'll unpack it together, finally I'll give the proof.
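Markov's Inequality: For a nonnegative random variable, $X: \Omega \to \mathbb{R}$, where $X(\omega) \ge 0$ for all $\omega \in \Omega$, for any positive real number $a > 0$:
$$P(X \ge a) \le \frac{E[X]}{a}$$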
Unpacking the formula
As promised, let's unpack the formula step-by-step. We already know that $X$ is a random variable, but what does $\Omega$ mean? $\Omega$ is a symbol that means "the probability space". So whenever you see the symbol $\Omega$, substitute the words "the probability space" in. Every time you see $P$ and then parentheses, substitute the words "the probability of". $P(\text{I will eat dessert for dinner})$ is the same as "the probability I will eat dessert for dinner".
The probability space is the space of everything that can happen in our probability world. In the world where $X$ represents a dice roll, $\Omega$ means every integer between $-\infty$ and $\infty$. This means that $\Omega$ includes the numbers 1, -57, 263, -1454, and so on. "But if I roll a dice, the number -57 will never show up?!". That's right! The symbols $X: \Omega \to \mathbb{R}$ mean that $X$ is actually a function. It takes in some event from the probability space and it spits out a real number that is greater than or equal to zero. So in the case where $X$ is a dice roll, here's a few examples of what $P$ spits out.
The probability that $X$ takes on the value '1' is $\frac{1}{6}$: $P(X = 1) = \frac{1}{6}$
The probability that $X$ takes on the value '-57' is zero. It will never happen: $P(X = -57) = 0$
The probability that $X$ takes on the value '4' is $\frac{1}{6}$: $P(X = 4) = \frac{1}{6}$
The probability that $X$ takes on the value '7' is zero. It will never happen: $P(X = 7) = 0$
So that's pretty reasonable. $P$ takes in an event that belongs to the complete probability space and spits out a number that is either zero, meaning the event will never happen, one, meaning the event will literally always occur, or something between zero and one, meaning the event has some probability of occurring. Got it? Onwards then!
$a$ is a constant that is greater than 0. We get to decide what the constant is. You want $a$ to be 0.00001. Okay. You want $a$ to be 100000. Okay. As long as $a$ isn't less than or equal to zero, you're all good. We'll see later that we can pick specific values of $a$ that let us do interesting things, but for now just remember that we decide what value $a$ takes on.
That's all there is to it! We know what everything means. The following section is the derivation of Chebyshev's Inequality from Markov's Inequality.
First the proof of Markov's Inequality (taken from [1]):
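Proof. Let the event $A$ be defined by: $A = \{\omega \in \Omega \mid X(\omega) \ge a\}$. We want to prove $P(A) \le \frac{E[X]}{a}$. Because $X$ is nonnegative:
$$E[X] = \sum_{\omega \in \Omega} P(\omega) X(\omega) = \sum_{\omega \in A} P(\omega) X(\omega) + \sum_{\omega \notin A} P(\omega) X(\omega) \ge \sum_{\omega \in A} P(\omega) X(\omega) \ge \sum_{\omega \in A} P(\omega) \cdot a = a \cdot P(A)$$
Thus, $P(A) \le \frac{E[X]}{a}$.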
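If the proof feels abstract, the inequality can be checked by hand on the dice roll. The sketch below compares the exact probability that a fair die shows at least some threshold against the Markov bound (the expectation divided by that threshold). The helper name `tail_probability` and the choice of thresholds 1 through 6 are mine, for illustration only.

```python
from fractions import Fraction

# A fair six-sided die and its expectation, E[X] = 7/2.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mean = sum(x * p for x, p in die.items())


def tail_probability(pmf, threshold):
    """Exact probability that the outcome is at least `threshold`."""
    return sum(p for x, p in pmf.items() if x >= threshold)


for threshold in range(1, 7):
    exact = tail_probability(die, threshold)
    bound = mean / threshold  # Markov: P(X >= a) <= E[X] / a
    print(f"a={threshold}: exact={float(exact):.3f}  bound={float(bound):.3f}")
    assert exact <= bound
```

Notice how loose the bound is: for small thresholds it is larger than 1 and tells us nothing. That is the price of an inequality that holds for every nonnegative random variable.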
Markov's Inequality is a general statement that holds for any non-negative random variable. It is useful for defining a rough upper bound on the probability of an event. For the case of polling, remember that we wanted to answer the question of how likely it is that our estimation is far away from the truth. For that we need to use Chebyshev's inequality, which can be defined as a special case of Markov's Inequality.
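Chebyshev's Inequality: For any random variable $X$, let $a > 0$ be any positive real number. Then:
$$P(|X - E[X]| \ge a) \le \frac{\mathrm{Var}(X)}{a^2}$$
where $\mathrm{Var}(X) = E[(X - E[X])^2]$ is the variance of $X$.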
Again, let's spend some time unpacking this formula. With Markov's Inequality we could compute an upper bound on the probability that our R.V. spits out a value greater than a constant $a$. This information is useful, but what we really want is the probability that our estimation is between an upper and lower bound. Let's say we have a simple election between candidate A and candidate B. We say that the R.V. $X$ outputs the fraction of the vote candidate A receives. That is, the random variable outputs a number between 0 and 1. If the actual result of the election is 0.5 (a tie), Chebyshev's Inequality can tell us how likely it is that our poll reports a value close to 0.5.
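Here is a worked instance of Chebyshev's Inequality for a poll. It is a sketch under a simplifying assumption of mine: we poll $n$ voters chosen independently and uniformly at random, and $X_i = 1$ if respondent $i$ supports candidate A and $0$ otherwise. The symbols $n$, $X_i$, $p$ (the true fraction supporting A), $\hat{p}$ (the poll's estimate), and $\varepsilon$ (the error we tolerate) are introduced here for illustration.

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad E[\hat{p}] = p, \qquad \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n} \le \frac{1}{4n}$$

$$P(|\hat{p} - p| \ge \varepsilon) \le \frac{\mathrm{Var}(\hat{p})}{\varepsilon^2} \le \frac{1}{4 n \varepsilon^2}$$

The last step uses the fact that $p(1-p)$ is never larger than $\frac{1}{4}$. Two things stand out: the bound shrinks as $n$ grows, and the size of the underlying population appears nowhere in it.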
Closing Remarks on the Math
Markov's and Chebyshev's inequalities are truths. You can of course disagree with the conclusions I draw in this article, and I will even point out places where you should disagree, but the math is unimpeachable. If you have any faith in mathematics, then you must at least accept the definitions above because they all come from the same mathematical foundation. If you have no faith in mathematics, god help you.
Here we begin the process of translating from the impenetrable mathematics to a model we can use to predict and estimate events in the real world. Polling is a complicated process expressly because we cannot apply Chebyshev's Inequality perfectly to our world of flesh and blood. The core assumption of Chebyshev's Inequality applied to polling is that we are able to collect a perfectly representative sample of the place we are trying to predict. If we want to predict the U.S. Presidential Election with a single poll, the slice of people we poll needs to perfectly represent the whole country. We need to represent the opinions of an impossibly diverse set of three hundred million people using the voices of only a few. Chebyshev's Inequality, while brilliant, doesn't account for ethno-racial and socioeconomic voting strata, income inequality, aging populations, or the rural/urban divide.
The Importance of Counties
Counties are the smallest voting increment when we talk about the U.S. Presidential election. Each county is responsible for administering the vote within its boundaries. If you live in, say, Cuyahoga County, you may only vote in Cuyahoga County, and your vote for the presidential election is counted by the county government of Cuyahoga County. The county reports the vote total to the state government, and the state government reports the state vote total to the Federal Election Commission. This is true even in the case of absentee ballot voting. If you choose to vote absentee, you send in your ballot to the county where you permanently reside.
Our model relies heavily on data from the U.S. Census, which the Constitution mandates every ten years. The U.S. Census data is provided with the finest resolution at the county level, so this is a natural limit on the resolution of our model. Even if we wanted to build a model with finer resolution, what standardized unit could we possibly use? The concept of neighborhoods or locales is poorly defined and subject to rapid change. Therefore, we use counties as the finest resolution of our model.
Case Study: The 2016 U.S. Presidential Election
We're going to perform a thought experiment with the fictitious Middletown County. Middletown County is (self-identified) 70% white voters, 20% Latino voters, and 10% Black voters. Middletown County is home to 1 million people. We are going to make the assumptions that all citizens of Middletown County are eligible to vote and that all citizens of Middletown exercise their right to vote. These assumptions will be addressed later, but for now all they add up to is that all 1 million citizens of Middletown vote in the 2016 election. Continuing the thought experiment, let's say we know the end result - Clinton wins the vote in Middletown County with 51.0% of the votes to Trump's 49.0%, a margin of 2.0%. We said before that all of Middletown's citizens vote, therefore Clinton received 510,000 votes, Trump received 490,000 votes, and Clinton won by a margin of 20,000 votes.
Chebyshev's Inequality guarantees that as we poll more people, the probability that our estimation of the final vote percentage is inaccurate decreases. Let's say it again. The more people we poll, the more likely it is we are going to estimate correctly. The beautiful thing is that we now know how the mathematical tools work, so we can come up with hard numbers and not just say words that sound correct. Let's calculate the exact number of voters we need to poll to all but guarantee that we are below a specific error bound. We decide months before the election that we want to poll the population in a way that guarantees we are accurate to within 2% (total spread), 99% of the time. That means we are within +/- 1% of the actual result with 99% confidence. We both know the actual results of the election, but here are a few more results that would satisfy the above condition of Chebyshev's Inequality.
- Clinton 50.5% to Trump 49.5% -- Clinton wins -- 1.0% total absolute error
- Clinton 50.9% to Trump 49.1% -- Clinton wins -- 0.2% total absolute error
- Clinton 50.001% to Trump 49.999% -- Clinton wins -- 1.998% total absolute error
- Clinton 51.0% to Trump 49.0% -- Clinton wins -- 0% total absolute error
We clearly did an excellent job of predicting the election if we made any of the claims above! If our poll came up with any one of the results above, we not only predicted the winner of the election in Middletown County, but we did it with less than 2% absolute error in the spread. Keep in mind the power of the previous statement. If we could actually predict the outcome of elections with this accuracy, I wouldn't be writing this post. I would keep this a secret, start a polling company, and make money predicting the future with a high degree of accuracy. And here's the best part - the number of people we need to poll is a set number. It doesn't matter if the underlying population of Middletown is 1 million, 1 billion, or 1 trillion. There is a specific number of people that all but guarantees (99% is essentially a certainty) that we will predict the result to within +/- 1% accuracy. So what is the magic number?
Solving Chebyshev's Inequality for an absolute result error of +/-1% with 99% confidence gives that we need to poll 250,000 people. If we do that, we will estimate the election result to within +/-1%, 99% of the time (equivalent to 2% total spread).
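The 250,000 figure can be reproduced in a few lines. This sketch assumes the standard worst-case Chebyshev bound for a polled proportion: the chance of missing by more than a half-width eps is at most 1 / (4 * n * eps^2), where n is the number of respondents. Setting that bound equal to the failure probability we tolerate and solving for n gives the sample size. The function name and its string arguments are my own choices; exact fractions are used to avoid rounding surprises.

```python
import math
from fractions import Fraction


def chebyshev_sample_size(half_width, failure_prob):
    """Smallest n such that 1 / (4 * n * eps^2) <= failure_prob.

    half_width   -- tolerated error eps on the vote share, e.g. "0.01" for +/-1%
    failure_prob -- tolerated chance of exceeding it, e.g. "0.01" for 99% confidence
    """
    eps = Fraction(half_width)
    delta = Fraction(failure_prob)
    return math.ceil(1 / (4 * eps**2 * delta))


# +/-1% with 99% confidence, as in the Middletown thought experiment.
print(chebyshev_sample_size("0.01", "0.01"))

# A looser target for comparison: +/-3% with 95% confidence.
print(chebyshev_sample_size("0.03", "0.05"))
```

The population of the county never enters the calculation, which is why the number stays the same for 1 million or 1 billion residents. Chebyshev's Inequality is a conservative bound, so this is a worst-case sample size rather than the smallest one a pollster could justify with sharper tools.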
You might be thinking that our work is done. We calculated the number of people we need to poll, and the math checks out. But of course it isn't this easy. The math oversimplifies the numerous and aggravating complexities of the situation. Remember that there are 1 million people in Middletown County. According to the numbers at the beginning, 200,000 of the citizens in Middletown are Latino, and 100,000 are Black. Let's go through a quick sanity check - do we think the Black or Latino communities are going to vote for Donald Trump? Certainly not. There may be a universe where Donald Trump will win the majority of Latino or Black votes, but this is not it. Donald Trump may "love Hispanics!", but they aren't going to turn out to vote for him. Hillary Clinton is essentially guaranteed as a Democrat to win more than 90% of the Black vote. She is also going to win at least 55% of the Latino vote. So if we went to predominately Black and Latino neighborhoods to conduct our poll, and managed to get 250,000 people to respond, we would deduce that Hillary Clinton is going to win Middletown County in a landslide.
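The size of that landslide is easy to put a number on. The sketch below uses the Middletown figures from the thought experiment and, as a labeled assumption, treats the floor values in the text (90% of the Black vote, 55% of the Latino vote) as exact. The White vote share is not given anywhere; it is backed out here from the 51% county-wide result. The function name `poll_estimate` is mine.

```python
from fractions import Fraction

# Middletown County population by self-identified group.
population = {"White": 700_000, "Latino": 200_000, "Black": 100_000}

# Clinton's share within each group. The Latino and Black values are the
# lower bounds quoted in the text, treated as exact for illustration.
clinton_share = {"Latino": Fraction(55, 100), "Black": Fraction(90, 100)}

# Back out the White share so the county total matches 510,000 Clinton votes.
known_votes = sum(population[g] * clinton_share[g] for g in clinton_share)
clinton_share["White"] = Fraction(510_000 - known_votes, population["White"])


def poll_estimate(groups_polled):
    """Clinton's share in a poll that only reaches the listed groups,
    sampling each reached group in proportion to its size."""
    reached = sum(population[g] for g in groups_polled)
    votes = sum(population[g] * clinton_share[g] for g in groups_polled)
    return votes / reached


print("whole county:       ", float(poll_estimate(["White", "Latino", "Black"])))
print("Black + Latino only:", float(poll_estimate(["Latino", "Black"])))
```

Under these assumptions the poll reports roughly 67% for Clinton in a county she actually wins with 51%. No amount of additional respondents drawn from the same neighborhoods fixes this; the error comes from who was asked, not how many.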
It doesn't require a deep background to see that modern polling rests on assumptions that are nearly impossible to enforce. Below I detail several of the dubious assumptions pollsters make when using traditional polling methodologies. I highly recommend reading the cited literature; each of these assumptions has been comprehensively assessed in other venues.
Classic Polling Assumptions
Voter Turnout
We decided at the beginning of the case study that every eligible voter in Middletown votes on election day. This is not the case in any real election. Voter turnout has rarely exceeded 60% of eligible voters. For instance, about 55% of eligible voters vote in the U.S. Presidential Election on average. As we move to down-ballot races voter turnout numbers drop dramatically. For local elections voter turnout hovers around single digits. That we naively expected every single person in Middletown County to vote now seems laughable.
Minority Voter Suppression
Voter turnout will also be lower than 100% because a raft of deliberate policies and systemic racial issues suppress the votes of minorities. Following the passage of the 15th Amendment, a host of Southern states employed literacy tests, poll taxes, and blatant voter intimidation to stifle the African-American vote. Tactics of voter suppression continued through the Jim Crow era until then-President Lyndon B. Johnson signed the Voting Rights Act of 1965. Among other things, this legislation eliminated all literacy tests and poll taxes, and gave the Justice Department final say on any voting legislation produced in a select list of Southern states with a history of voter suppression tied to systemic racism. In the last thirty years we have seen a concerted behind-the-scenes effort by conservative lawyers to challenge the legality of the Voting Rights Act of 1965, paving the way for the current raft of voter I.D. laws [4]. The tactic of requiring increased identification to vote is a thinly veiled move to suppress minority voters, who are less likely to have the proper identification to satisfy the new criteria. Not coincidentally, voter I.D. legislation is cropping up in states with burgeoning minority populations that tend to vote for Democrats. A curious thought is that minority voter suppression engenders a vicious cycle. Republicans at the state and county level push for voter suppression because minorities vote for Democrats by a wide margin, and minorities often vote for Democrats because they are pushing to end minority voter suppression.
Likely Voters
Knowing what candidate a citizen would vote for is unhelpful if the citizen doesn't exercise their right to vote. Traditional polls focus heavily on assigning a likelihood to each person surveyed, often by simply asking: "How likely are you to vote on or before Election Day?". As an example, we conduct a poll for an upcoming county election with a voting-eligible population (VEP) of 50,000. Our 1000 citizen poll results in 100 responses from Black voters, with a majority saying they would vote for the Democratic candidate. However, none of these poll respondents give unequivocal answers that they will vote on election day. We therefore assign each citizen a low probability of turning out to vote. Viewed in the aggregate, the poll tells us that the majority of Black citizens who vote will vote Democrat, but more importantly that many Black citizens will choose not to vote in our fictitious county election. Turnout among that segment of the population is predicted to be low. We can quickly identify issues with scaling this methodology. In a state with millions of voters, we extrapolate voter turnout based on limited data built on a racial classification system that is anything but nuanced. It would be difficult, say, to defend how the responses of 100 citizens who identify as 'Black' can accurately predict voter turnout of any state's Black population that is both ethnically diverse and geographically spread out.
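To show what assigning a turnout likelihood does to a poll, here is a minimal sketch. Each response is a pair: the candidate named and a turnout probability the pollster assigned. Every number below is hypothetical and chosen only to make the effect visible; the function names are mine and do not describe any particular pollster's likely-voter model.

```python
from fractions import Fraction


def turnout_weighted_share(responses, candidate):
    """Share for `candidate` when each response counts in proportion
    to the respondent's assigned probability of voting."""
    total = sum(prob for _, prob in responses)
    if total == 0:
        raise ValueError("no respondent is expected to vote")
    return sum(prob for choice, prob in responses if choice == candidate) / total


def expected_turnout(responses):
    """Average assigned probability of voting across respondents."""
    return sum(prob for _, prob in responses) / len(responses)


# Hypothetical 100 responses: (candidate named, assigned turnout probability).
responses = (
    [("D", Fraction(1, 5))] * 60    # Democratic supporters, unlikely to vote
    + [("D", Fraction(1, 2))] * 20  # Democratic supporters, might vote
    + [("R", Fraction(4, 5))] * 20  # Republican supporters, likely to vote
)

raw_share = Fraction(sum(1 for choice, _ in responses if choice == "D"), len(responses))
print("raw Democratic share:     ", float(raw_share))
print("turnout-weighted share:   ", float(turnout_weighted_share(responses, "D")))
print("expected turnout of group:", float(expected_turnout(responses)))
```

The same 100 answers read as an 80% Democratic group before weighting and about 58% after it. The estimate is now only as good as the turnout probabilities, and those came from a single survey question.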
Respondent Honesty
Every poll assumes that the majority of respondents are honest or, put another way, that the number of dishonest respondents is negligible. Given the choice between Candidate A, Candidate B, and Undecided, poll respondents will provide an honest answer. There is a distinct lack of empirical evidence for this assumption because it is difficult to determine if a respondent's answer to a poll matched his/her actual voting behavior. This assumption deserves a lot of our attention. If we estimate that as few as 0.5% of our respondents will lie, we are already faced with awkward and perplexing choices in our polling methodology. Empirically we have that no racial or ethnic group has any predisposition towards lying that makes that group more likely to lie than any other racial or ethnic group. Therefore, each of our respondents is equally likely to be a liar. The making of a polling catastrophe is quite simple. Say we have a high non-response rate among uneducated White voters in a district where they are a marginal group, yet still represent enough votes to swing an election. We take whatever responses we get from uneducated White voters and extrapolate them to predict the behavior of the voting bloc. But say also that there is a liar among our uneducated White respondents, entirely possible given our uniform distribution of liars among racial and ethnic groups. We have now amplified a skewed result, thereby corrupting the accuracy of our poll. If we are unfortunate enough to have lying respondents in our sample of minority groups, we can end up inordinately increasing our overall polling error.
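The amplification can be quantified with one line of arithmetic. Suppose a group makes up some share of the electorate and we scale its handful of responses up to that share. Flipping a single answer changes the group's estimated support by one over the number of respondents in the group, so the overall estimate moves by the group's share divided by that number. The sketch below works this through; the group share and respondent counts are hypothetical, and the function name is mine.

```python
from fractions import Fraction


def shift_from_one_liar(group_share, group_respondents):
    """Change in the overall vote-share estimate when one respondent in a
    group misreports, with the group scaled to its share of the electorate."""
    if group_respondents <= 0:
        raise ValueError("need at least one respondent in the group")
    return Fraction(group_share) / group_respondents


# Hypothetical group that is 10% of the electorate.
well_sampled = shift_from_one_liar("0.10", 100)  # 100 of them answered
under_sampled = shift_from_one_liar("0.10", 5)   # only 5 of them answered

print(f"100 respondents: one liar moves the estimate by {float(well_sampled):.2%}")
print(f"  5 respondents: one liar moves the estimate by {float(under_sampled):.2%}")
```

With only five respondents standing in for a tenth of the electorate, one dishonest answer moves the topline by two full points, which is the entire error budget from the Middletown example.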
A Polling Strategy
The polling establishment is now encountering the perfect storm of all of the issues described in the assumptions above. Response rates are falling, making it more expensive to obtain large numbers of valid poll respondents; ethnic and racial groups are increasingly blurred and fragmented, making prediction in these categories prone to error; and voter turnout is difficult to assess and impossible to quantify. Traditional polling may still prove effective in small district races, but it is ever more ineffective at the state and national levels. It is both comical and sad to remember that a key argument in favor of modern polling is that all the issues I described above are managed if the probability of respondents lying is low and we have a sizable population of respondents. The results of the most recent election make it safe to say that there were serious issues with the polling in this election cycle.
Hillygus explains succinctly in Public Opinion Quarterly [2]:
The political parties have built enormous databases that contain information about every registered voter in the United States. Statewide, electronic voter registration files - mandated by the 2002 Help America Vote Act - are the cornerstone of these databases. These files typically include a person's name, home address, turnout history, party registration, phone number, and other information, and are available to parties and candidates (and, in most states, anyone else who wants it). Consumer, census, political, and polling data are then merged into these files to better predict who is going to turn out, what their beliefs and attitudes are and, ultimately, how they are going to vote.
The old methods of polling are antiquated in the age of big-data and data mining. Rather, we should focus our efforts on building sophisticated statistical tools that have greater predictive power. This is already under way with several major statistical tools being applied to predictive polling. Researchers typically break polls down based on ethnicity, age, and gender depending on the complexity and targeting of the poll. We assume that there is nothing that differentiates the kind of person who will respond to a poll from someone who won't respond to a poll within ethnic or socioeconomic strata. That is, one White female between the ages of 35-40 must be interchangeable with another woman in the same category. For polling to work, knowing how one woman votes must provide information on how women composing the same group will vote. While it sounds dehumanizing, statistics and history validate that this class of techniques has predictive power. It turns out that we can use Bayesian modeling to figure out how much a specific attribute of an individual contributes to their vote, and then use that to predict the vote of an entire population. We can actually figure out to a precise degree how much being a man/woman, white/non-white, old/young, and urban/rural affect your vote likelihood and chosen candidate. By polling thousands of people, I can figure out that in a specific county your likelihood of voting is "20% based on race, 20% based on age, 40% based on gender, and 20% based on a record of government or military service". Our unique characteristics as humans are being used as data-points in a machine that seeks to rob us of the free will we have the right to exercise in an election!
The key idea behind the model we built is that it is incredibly easy to screw up or bias a poll. Polls are very, very difficult to execute without introducing error. Our group's model rests on the assumption that polls are highly effective in small geographic areas, and ineffective at predicting larger geographic areas. We contend that wide-ranging polls contain too many places to introduce error, and that we can limit our prediction error by executing fewer targeted polls with high accuracy, and then reconstruct the entire voting map with signal estimation techniques. Recognizing that polling requires financial resources, we argue that those resources should be directed at getting highly accurate targeted polls in smaller geographic areas. The question then becomes what counties to select as a representative sample of all American counties. The elegant answer that downplays the difficulty is to find the counties that are the most predictive with respect to historical elections and current economic and demographic data. We built our mathematical model using a network theory approach. Our group has demonstrated that we can more accurately predict the outcome of historical US Presidential elections using highly accurate polls from approximately 50 counties rather than using traditional national polling techniques.
Notes on Theory
Refer to Appendix A for the mathematical description of the theory we utilize.
Our model combines multiple data sources to determine how counties' voting patterns are correlated. Sources for our data include the CQ Elections database, official FEC Election Reports, and the US Census data. We attempt to rely only on official data to produce our county correlation matrix.
References
[1] Kousha Etessami. Markov and Chebyshev's Inequalities.
[2] D. Sunshine Hillygus. "The Evolution of Election Polling in the United States". In: Public Opinion Quarterly 75.5 (2011), pp. 962-981.
[3] Michael Mokrzycki, Scott Keeter, and Courtney Kennedy. "Cell-Phone-Only Voters in the 2008 Exit Poll and Implications for Future Non-coverage Bias". In: Public Opinion Quarterly 73.5 (2009), pp. 845-865.
[4] Jeffrey Toobin. "Holder v. Roberts. The Attorney General Makes Voting Rights the Test Case of his Tenure." In: The New Yorker February (2014).
Appendix A: Theory
A brief overview of the core theory and requisite terminology used for this research is presented. Credit to Marques, Segarra, et al. for the presentation of this theory.
A graph is defined as $\mathcal{G} = (\mathcal{N}, \mathcal{E}, \mathcal{W})$. The set of nodes $\mathcal{N}$ has size $N$, the set of edges $\mathcal{E}$ is such that edge $(i, j) \in \mathcal{E}$ if node $i$ is connected to node $j$, and every edge in the set $\mathcal{E}$ has a corresponding weight in the set $\mathcal{W}$. A signal $x$ is defined on the graph where $x_i$ is the value of the signal corresponding to node $i$. Formally, $x = [x_1, \ldots, x_N]^T \in \mathbb{R}^N$. We have defined the graph structure, as well as an arbitrary signal defined on the graph.
The graph has a graph-shift-operator $S$ defined as an $N \times N$ matrix satisfying $S_{ij} = 0$ for $i \neq j$ and $(i, j) \notin \mathcal{E}$. We assume that $S$ is diagonalizable, so there exists an eigenvector matrix $V$ and an eigenvalue matrix $\Lambda$ that can be used to decompose $S$ into $S = V \Lambda V^{-1}$. If $S$ is a normal matrix, $S S^H = S^H S$. This means that $V$ is unitary and gives $V^{-1} = V^H$. This yields the decomposition of the graph shift operator $S = V \Lambda V^H$.
The graph shift operator allows for the representation of the signal in the frequency domain, understood here as the use of a basis that is invariant to linear filtering. The Graph Fourier Transform (GFT) is defined as $\tilde{x} = V^{-1} x$ where $\tilde{x}$ are the frequency components of the signal. The inverse Graph Fourier Transform (iGFT) is therefore $x = V \tilde{x}$. We say that the signal $x$ is $K$-bandlimited on the graph shift operator $S$ if its GFT has at most $K$ nonzero components. That is, $\tilde{x} = [\tilde{x}_K^T, \mathbf{0}_{N-K}^T]^T$ where $\mathbf{0}_{N-K}$ is a vector of zeros. If this is the case then the original signal can be reconstructed from a sampled version of the complete signal. Define a selection matrix $C \in \{0, 1\}^{K \times N}$ that samples $K$ of the $N$ graph nodes. Observe that $C$ is a selection matrix if it is binary, has exactly one nonzero entry per row and at most one nonzero entry per column. This sampled signal is defined as $\bar{x} = C x$. The reconstruction can be carried out by
$$x = V_K (C V_K)^{-1} \bar{x}$$
whenever a selection matrix that satisfies $\mathrm{rank}(C V_K) = K$ is used, where $V_K$ denotes the first $K$ columns of $V$.
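The reconstruction step can be sketched numerically. The example below is not our county model; it is a toy illustration on a six-node path graph, with several assumptions of my own: the shift operator is the symmetric adjacency matrix (so it is normal and its eigenvectors are real), the "first K columns" are taken in the order NumPy's `eigh` returns them, and exactly K nodes are sampled. The function name `reconstruct_bandlimited` and the sampled nodes are hypothetical.

```python
import numpy as np


def reconstruct_bandlimited(S, sample_nodes, samples, K):
    """Recover a K-bandlimited graph signal from its values at K nodes.

    Assumes S is symmetric and the signal lives on the first K eigenvectors
    of S, in the order returned by numpy.linalg.eigh.
    """
    S = np.asarray(S, dtype=float)
    N = S.shape[0]
    if len(sample_nodes) != K:
        raise ValueError("this sketch samples exactly K nodes")
    _, V = np.linalg.eigh(S)            # S = V diag(eigenvalues) V^T
    V_K = V[:, :K]                      # first K eigenvectors
    C = np.zeros((K, N))                # selection matrix: one 1 per row
    C[np.arange(K), sample_nodes] = 1.0
    CV_K = C @ V_K
    if np.linalg.matrix_rank(CV_K) < K:
        raise ValueError("rank(C V_K) < K: pick different sample nodes")
    # Solve (C V_K) z = samples, then map the K coefficients back to all nodes.
    return V_K @ np.linalg.solve(CV_K, samples)


# Toy graph: an unweighted path on 6 nodes, adjacency matrix as shift operator.
N, K = 6, 2
S = np.zeros((N, N))
for i in range(N - 1):
    S[i, i + 1] = S[i + 1, i] = 1.0

# Build a signal that is K-bandlimited by construction.
_, V = np.linalg.eigh(S)
x = V[:, :K] @ np.array([1.0, -0.5])

sample_nodes = [0, 3]
x_hat = reconstruct_bandlimited(S, sample_nodes, x[sample_nodes], K)
print("recovered all", N, "values from", K, "samples:", np.allclose(x_hat, x))
```

In this toy case two observed nodes are enough to recover all six values, because the signal was built to be exactly bandlimited. Real county data will not be, which is where the difficulties described next come in.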
The difficulties specific to this theory include generating a selection sampling matrix that satisfies the above equation, working with a graph signal that is not cleanly bandlimited, and working with a graph shift operator that does not decompose to a full rank eigenvalue matrix.