Is polling data big data? A back-of-the-envelope calculation
For the past year Americans have read about polling data almost daily, so the casual observer may be forgiven for thinking there are mountains and mountains of data being collected to help make sense of this year’s unusual election. The truth is, though, that polling data is not big data in volume, velocity, or variety – the three ways in which data can be “big.” A few back-of-the-envelope calculations show just how tiny this data is.
The polls listed in HuffPost Pollster (the database used by many election prognosticators) amount to only about 2.5 gigabytes of data, by our estimates. (The Pollster database lists 372 polls with an average of 40 questions posed to an average of 3,000 respondents per poll, and the data from each respondent averages a little over 2KB.) Additional polls outside the Pollster database and exit polling data on election day add no more than a few gigabytes, so the total amount of unique polling data gathered for this election cycle is likely under 5 gigabytes, well under the conventional “big data” threshold of 100 terabytes.
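The estimate above can be reproduced in a few lines. The sketch below is only illustrative: it assumes decimal units (1 GB = 10^9 bytes) and reads “a little over 2KB” as roughly 2,240 bytes per respondent, a value picked here so the result lands near the quoted 2.5 gigabytes. Neither assumption comes from the Pollster database itself.

```python
# Back-of-the-envelope size of the polling data in the Pollster database.
POLLS = 372                    # polls listed, per the text
RESPONDENTS_PER_POLL = 3_000   # average respondents per poll, per the text
BYTES_PER_RESPONDENT = 2_240   # ASSUMPTION: "a little over 2KB"
BIG_DATA_THRESHOLD_TB = 100    # conventional "big data" threshold, per the text


def polling_data_gb(polls=POLLS,
                    respondents=RESPONDENTS_PER_POLL,
                    bytes_each=BYTES_PER_RESPONDENT):
    """Return the estimated raw polling data volume in decimal gigabytes."""
    return polls * respondents * bytes_each / 1e9


if __name__ == "__main__":
    gb = polling_data_gb()
    threshold_gb = BIG_DATA_THRESHOLD_TB * 1_000
    print(f"Pollster data: about {gb:.1f} GB")
    print(f"Fraction of the 100 TB threshold: {gb / threshold_gb:.4%}")
```

Changing the per-respondent figure to exactly 2,000 bytes gives about 2.2 gigabytes, so the conclusion does not hinge on the precise value.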
An impressive figure, this isn’t. Most of us have more than 5 gigabytes of data stored on our phones. But polling data is of higher value to the world than the hundreds of family photos and videos most of us are carrying around with us. Polling data can move markets and affect international relations. Its value and importance are high enough that it is likely to have been downloaded and replicated many thousands of times.
Replicated how many times? Perhaps as many as 10,000. We can assume that most of the 2,500 colleges and universities in the US are tracking polling data during this election, and we can assume a few thousand international observers as well. Add to that the media organizations, large financial organizations, and other interested groups (about 500 of them) and we’re probably looking at the 5 gigabytes having been stored 10,000 times, amounting to 50 terabytes. In addition, some groups run election simulations based on models built with polling data, and these may take up significant space. Simulated election results would likely amount to 8 gigabytes per simulation, and with several hundred simulations per study, each study comes to a few terabytes. Twenty-five such studies would add 50 terabytes to our 50 terabytes from polling data, for a grand total of 100 terabytes. Even at 100 terabytes, this does not make polling data big data, because the 100-terabyte threshold applies to a single instance of the data, not to all copies stored in many different locations.
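The replication and simulation arithmetic can be sketched the same way. The figures below are taken from the paragraph above, except for the number of simulations per study: “several hundred” is assumed to be 250 here, which is the value that makes 25 studies come to 50 terabytes. Decimal units (1 TB = 1,000 GB) are also an assumption.

```python
# Rough total of replicated polling data plus simulation output.
GB_PER_TB = 1_000              # ASSUMPTION: decimal units

POLLING_DATA_GB = 5            # unique polling data, per the text
COPIES = 10_000                # estimated number of stored copies, per the text
GB_PER_SIMULATION = 8          # per the text
SIMULATIONS_PER_STUDY = 250    # ASSUMPTION: "several hundred"
STUDIES = 25                   # per the text


def replicated_tb():
    """Terabytes occupied by all stored copies of the polling data."""
    return POLLING_DATA_GB * COPIES / GB_PER_TB


def simulation_tb():
    """Terabytes occupied by simulated election results across all studies."""
    return GB_PER_SIMULATION * SIMULATIONS_PER_STUDY * STUDIES / GB_PER_TB


if __name__ == "__main__":
    total = replicated_tb() + simulation_tb()
    print(f"Replicated polling data: {replicated_tb():.0f} TB")
    print(f"Simulation output:       {simulation_tb():.0f} TB")
    print(f"Grand total:             {total:.0f} TB")
```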
If the total of 100 terabytes were all stored in a data center (and it’s not – much is simply stored on laptops), how would that compare to the total data volumes found in data centers? Our forthcoming Global Cloud Index, to be released November 10, will contain data on the storage capacity and data stored in data centers. Data stored in data centers is currently 171 exabytes globally, or 171,000,000 terabytes, which means that data associated with US election polling and forecasting represents a measly 0.00006% or so of total data stored in data centers around the world. The smallness of polling data is a nice illustration that data need not be “big” to be important.
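The share is easy to check directly. This short sketch assumes decimal units (1 exabyte = 1,000,000 terabytes, as the paragraph itself uses) and simply divides the 100-terabyte election total by the 171 exabytes stored in data centers; the result is a fraction on the order of six parts in ten million, well under one ten-thousandth of a percent.

```python
# Election-related data as a share of all data stored in data centers.
TB_PER_EB = 1_000_000          # decimal units, as used in the text

ELECTION_DATA_TB = 100         # polling plus simulation data, per the text
DATACENTER_DATA_EB = 171       # global data stored in data centers, per the text


def election_share():
    """Return election data as a fraction (not a percentage) of stored data."""
    return ELECTION_DATA_TB / (DATACENTER_DATA_EB * TB_PER_EB)


if __name__ == "__main__":
    share = election_share()
    print(f"Fraction of stored data: {share:.2e}")
    print(f"As a percentage:         {share * 100:.6f}%")
```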
What makes up the bulk of the data stored? Of overall data stored in data centers, 32 percent is associated with web and cloud services (AWS, Google Cloud, Dropbox, YouTube, Google Search), 15 percent is data stored by government, and the manufacturing, healthcare, and transportation verticals account for 9, 8, and 7 percent, respectively. Basic science logs an impressive 5 percent (or over 8 exabytes) thanks to the large amounts of data created by bioinformatics and experiments such as CERN’s Large Hadron Collider.
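For a sense of scale, the percentages above can be converted to exabytes against the 171-exabyte total. The sketch below uses only the shares named in the paragraph; they sum to 76 percent, and the remaining 24 percent is not broken out in the text, so it is reported here simply as “not itemized” (a label introduced for this example).

```python
# Convert the sector shares quoted above into exabytes.
TOTAL_EB = 171  # global data stored in data centers, per the text

# Percent shares as listed in the text.
SHARES_PERCENT = {
    "web and cloud services": 32,
    "government": 15,
    "manufacturing": 9,
    "healthcare": 8,
    "transportation": 7,
    "basic science": 5,
}


def sector_exabytes(total_eb=TOTAL_EB, shares=SHARES_PERCENT):
    """Return a dict mapping each sector to its stored data in exabytes."""
    return {sector: total_eb * pct / 100 for sector, pct in shares.items()}


if __name__ == "__main__":
    for sector, eb in sector_exabytes().items():
        print(f"{sector:<24} {eb:6.2f} EB")
    remainder = 100 - sum(SHARES_PERCENT.values())
    print(f"{'not itemized':<24} {TOTAL_EB * remainder / 100:6.2f} EB")
```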
Want to learn more about cloud, data, and the resulting traffic? Stay tuned on SP360 for far more detail in our upcoming Global Cloud Index, to be released tomorrow. You may also register for the Global Cloud Index forecast update presentation on November 15, 2016 (Americas and EMEAR) or November 29, 2016 (APJ).
Join our conversation on Twitter through #CiscoGCI.