From Big Data to Smart Data:
critical characteristics required for ensuring massive data volumes deliver tangible results
For those of us who remember the internet in its infancy, the contrast is striking: in the early 1990s predictions of an “information explosion” abounded. Everywhere we looked we found references to the ability to tap into information stored pretty much anywhere, and not just trapped on our own 20-pound desktop computers. Kevin Ashton's brilliant observation in 1999¹ that the profusion of RFID technology would create an “Internet of Things” was made before IBM's Watson, before the cloud, before the sharing economy, before “Big Data.” As it turned out, prescient and directionally correct predictions like Mr. Ashton's were only off by several orders of magnitude.
Compared to the yesteryear of 128K memories, we are in the midst of an unprecedented, exponential growth of information in many fields of human endeavor. New Greek prefixes seem to be popping up daily. From highly structured databases to randomly unstructured people-thoughts from the billion-plus people online, Big Data is both everywhere and nowhere in particular. We are limited in our pursuit of that Big Data only by access rights, regulatory hurdles, and money. There's a slight problem though: the human mind is simply not capable of processing zettabytes and yottabytes of information. With the exception perhaps of a few mathematics PhDs, we mere mortals must rely on sophisticated software algorithms to process Big Data sources and make sense out of them.
As powerful as these tools are, however, they are constrained by the quality of the underlying information. The thrust of our efforts, therefore, should be to think not about “Big Data,” but about Smart Data. We need to ensure that the characteristics of the underlying data justify the conclusions we can draw out of analytics and visualization. The crux of this problem lies in a set of primary attributes that, as a whole, determine the usability of the available information.
Some data characteristics are out of scope for this paper: we will not address here the etiology of the information, that is, the study of the origins and causes of the state of the data. While the reader will surely determine that these attributes are in fact inextricably linked to the data etiology, it's simply beyond a reasonable scope to attempt a discussion of the wider “why” questions here. Additionally out of scope are the questions of technical architectures, data structures, etc. For the purposes of this paper we will assume that these aspects pose no threat to the availability or the delivery of the information being analyzed.
To wit, in order to morph our Big Data collections into highly useful Smart Data compilations, it will behoove anyone involved with the science of data to consider the following seven distinct, fundamental characteristics or attributes against which the source data must be considered, regardless of the underlying machine or human source. Each is illustrated by an example of both structured and unstructured data.
Accuracy: Is the information we're dealing with really telling us what we think it is? Does the data exhibit a level of precision sufficiently high to justify decisions made upon it?
We have seen far too many people working with corporate data—whether analysts, staff, or executive management—having no choice but to “whistle past the graveyard” when it comes to relying on that data. Accuracy is a critically important aspect of information science, but far too often managers must make uncomfortable assumptions that the data is minimally accurate enough to move forward. This accuracy problem is most acute in situations where humans are involved in manually copying, editing, or otherwise manipulating the information. (Bypassing systems of record will be discussed below, but for this attribute we will focus on actions that are conducted directly in systems of record.) Equally important is to reach consensus concerning the level of accuracy required on which to base subsequent decisions.
One of our clients, a multi-billion dollar US-based medical provider, was managing all of its IT programs internally and had a problem with capacity management: massive backlogs of projects and frustrated internal customers. Upon examination it became clear that there were multiple instances of systemic data inaccuracy. Incorrect project task timekeeping data (both human- and machine-driven) led planners to under- or over-forecast the personnel and time resources required for projects. More damaging to customers, however, were the queue of projects and the backlog. Based on a project priority set by the customer (itself frequently questionable), IT managers used a homebuilt spreadsheet tool (the then-agreed system of record) to sequence projects based on various features of the projects. All well and good until we discovered a fatal mathematical flaw in the spreadsheet's scoring algorithm. The entire body of downstream data was thus made suspect and in many cases useless; as a result the IT staff was working on the wrong projects...and the customers had to settle for suboptimal project delivery.
A live example of the second question above involves personal fitness monitoring devices. These ubiquitous devices have been shown in many cases to inaccurately record the data which they're designed to monitor (steps taken, heart rates, etc.).² On the one hand, this doesn't matter for many users, because the more important aspect of these devices is that they encourage users to exercise and engage in healthy competition with other users. This in turn may lead product designers (correctly in this case) to downplay the importance of accuracy in future product releases. On the other hand, as the article points out, this raw data could be used (particularly in the aggregate) by healthcare providers to monitor patient health trends. Therein lies the double-edged sword: a very real danger exists if the providers underappreciate or ignore the accuracy problem and fail to take steps to compensate for inaccurate data. Worse, this medical scenario could easily lead to potentially damaging prescriptive action.
We will never achieve perfect accuracy in our quest for smart data. The threat of faulty conclusions can be reduced, however, to a range that is within an acceptable level of confidence. The trick of course is to get all interested parties to agree on that range. And agreement likely will be most difficult with respect to unstructured data. What do we mean by “accurate” social media blogs, or website content? Before diving into a sea of big data, we need to ensure we're looking for the smart bits.
Relevancy: Is the data pertinent to the decision at hand? Could the correct decisions or results be obtained if we discarded this information?
The attribute of relevancy is singularly critical to the process of smart data analytics. When introducing the classic “correlation v. causation” theme to our effort, we are simplifying a difficult process by eliminating burdensome information that has no bearing on the situation at hand. Massive datasets, structured or unstructured, can be made less unwieldy by determining upfront what types of data are important and disregarding that which is not.
One of our clients, a midsize industrial process manufacturer, had a unique method of forecasting customer demand for the output of multiple production lines. Rather than using tested practices such as historical forecasting, the client used an arcane type of data retrieved from their customers as “forecasts.” Extensive analysis was performed on the data showing very clearly that...it should never have been analyzed. Not only was there little correlation between this irrelevant dataset and the target in question (it didn't forecast anything), but more importantly no causation was found. Thousands of FTE hours were spent each year in relation to this data, and when the process was finally changed those hours were spent instead adding value to the company. The correct result was obtained by eliminating irrelevant data.
In the realm of unstructured data we find many examples of information that may be irrelevant to a particular analysis effort—so long as it can be feasibly parsed out or separated. Some situations require the questionable data to be analyzed in parallel with other source data to verify its reliability and usefulness. In trolling through consumer preference data on social media sites, for example, comments about brand preferences may be irrelevant unless it can be shown that comments are strongly correlated with actual purchase transactions.
We want relevant data, not Big Data. Ideally this process starts before the information is even collected; teams must decide in advance what types of information might be relevant. There is a danger, however, of running into confirmation bias—the tendency to ascribe relevance only to that information which confirms beliefs we already have. Similarly, cognitive dissonance can skew results by forcing teams to choose between two or more mutually conflicting results that were both considered relevant. It's up to team leads to address these problems and minimize their detrimental effects.
Integrity: Does the data exist in an unimpaired and trustworthy condition, unaltered from its original state? Do multiple sources of the same data agree with one another?
An astounding fact of the digital information age: the vast majority, 80% at least, of corporate information is still stored in mainframes. Harvey Tessler presents an astute argument in his 2015 article³ that mainframes are here to stay, providing as yet the optimal solutions for the requirements of SaaS/cloud computing and IoT analytics. Less prodigious but equally important are ERP, CRM, supply chain, and similar apps. But mainframes and other secure storage systems are not the main problem for data analytics efforts. It's when people (often for very good reasons) extract data from these systems that the problems start.
We're going to pick on spreadsheets not because they're the only culprit, but because they're the most widespread. It is highly likely that spreadsheet-based information is the most common secondary source of “front-end” analytics—that is, analytics performed on a day-to-day basis by the average professional. Forced in this direction by limited analytics functionality in primary apps, these professionals do their best to maintain high integrity. The only way to prove that, ironically, is by comparison to the primary source, which of course defeats the whole purpose of extraction. And once outside the primary source the data is subject to human manipulation, copying and recopying, and outright changing from its original state. The situation is especially exacerbated when spreadsheets are passed around the office, copied to laptops, etc. Besides giving IT security leaders headaches, this uncontrolled proliferation can never match the integrity of primary sources.
On the unstructured side, a prime example is individuals' profile information on LinkedIn. There are several structured elements in a public profile such as company names and itemized skill sets. One can probably assume that the person's physical location is valid as recorded. However, analytics efforts that have secured, at considerable cost, access to this information must be aware of the integrity complications inherent in the unstructured meat of the data, such as an individual's self-description. How confident can we be that Joe Blow's sales quota record is real? Similarly, Applicant Tracking Systems (ATS) also suffer from an inability to decipher unstructured data: semantics, voice, and attitude signals are not recognizable by ATS, but these signals can be easily interpreted by HR personnel.
We are not pretending to have a solution to address rampant spreadsheets or to ensure data found on social media has a high degree of integrity. Primarily because there is limited analytics capability in the apps from which the data is extracted, spreadsheets are here to stay for the foreseeable future. So are cryptic blogs and tweets. But data analytics efforts must take into account the integrity of information in unsecured or secondary sources, and take corrective action as necessary.
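One way to make the comparison against the primary source concrete is sketched below. It is only an illustration under stated assumptions: the function name `compare_to_source`, the `project_id` key, and the sample rows are all hypothetical, and a real effort would read both sides from the actual system of record and the actual spreadsheet.

```python
def compare_to_source(source_rows, extract_rows, key):
    """Report how an extract differs from the primary source, matching rows on `key`.

    Assumes `key` is unique within each list; if it is not, later rows
    silently overwrite earlier ones in the lookup tables below.
    """
    source = {row[key]: row for row in source_rows}
    extract = {row[key]: row for row in extract_rows}
    shared = source.keys() & extract.keys()
    return {
        # In the system of record but dropped from the extract
        "missing": sorted(source.keys() - extract.keys()),
        # In the extract but unknown to the system of record
        "unexpected": sorted(extract.keys() - source.keys()),
        # Present in both, but at least one field no longer matches
        "changed": sorted(k for k in shared if source[k] != extract[k]),
    }


# Hypothetical sample data, for illustration only
system_of_record = [
    {"project_id": "P-101", "hours": 120},
    {"project_id": "P-102", "hours": 80},
    {"project_id": "P-103", "hours": 45},
]
spreadsheet_copy = [
    {"project_id": "P-101", "hours": 120},
    {"project_id": "P-102", "hours": 95},   # edited after extraction
    {"project_id": "P-104", "hours": 10},   # added by hand
]

report = compare_to_source(system_of_record, spreadsheet_copy, key="project_id")
print(report)
```

A check like this does not restore integrity by itself; it only shows where a secondary copy has drifted, so the team can decide what corrective action is warranted.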
Reliability: Does the data exhibit dependable and steadfast behavior over time? When analyzed or measured, will the data produce the same expected results over repeated trials?
Data reliability is absolutely critical to effective decision making, but far too often we see otherwise high-performing teams misjudge this attribute. Data can be accurate but not reliable, for reasons explained below. Note also that reliability is not synonymous with unchanging. Data obviously can and will change over time, but what we're addressing here is the characteristic of data to exhibit reasonably dependable change or, at minimum, change within an acceptable confidence range.
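The idea of change within an acceptable range can be illustrated with a simple control band. The sketch below is one possible approach, not a prescribed method: it assumes a trusted baseline of past readings, treats anything farther than `k` sample standard deviations from the baseline mean as suspect, and uses made-up numbers. The name `out_of_band` and the default `k=3.0` are illustrative assumptions.

```python
from statistics import mean, stdev


def out_of_band(baseline, new_readings, k=3.0):
    """Return the new readings that fall outside mean +/- k sample standard deviations
    of the baseline readings."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation; needs at least 2 baseline values
    lower, upper = center - k * spread, center + k * spread
    return [x for x in new_readings if not (lower <= x <= upper)]


# Hypothetical readings, for illustration only
baseline = [32.1, 31.8, 32.0, 32.3, 31.9, 32.2, 32.0, 31.7]
latest = [32.1, 31.6, 27.5, 32.4]

print(out_of_band(baseline, latest))
```

The width of the band (`k`) is exactly the kind of threshold the interested parties need to agree on in advance.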
A pervasive element in the Internet of Things is tire pressure monitoring systems, or TPMS, installed in the majority of vehicles on the road today. Although various studies have shown that these sensors are accurate to within ± 1 PSI, their reliability (the tendency of the device to perform as expected) has been questioned by industry experts and professionals. Broken valve stems and closed-system battery failures can easily render the sensor unreliable. The fact that the pressure sensors themselves may be functioning properly is moot: the average consumer doesn't care which part is damaged, only that she can rely on the dashboard warning light when it's activated. Given a rough estimate of the number of vehicles on the road in the US with installed TPMS, we have something on the order of half a billion IoT devices on whose reliability millions of drivers must depend, yet which show serious reliability flaws.
Net Promoter Scoring, a methodology designed to attach a numeric KPI to intangible customer sentiments, generates lots of analytical attention but exhibits serious reliability problems.⁴ First, while the metric is not designed to measure loyalty (the tendency of a company's customers to remain customers of that same company and not a competitor), many businesspeople understandably confuse the two, which can generate unnecessary downstream problems. Second, astute business executives couldn't care less if a customer says “I'm very likely to recommend (x) to (y)”; they care far more whether or not the customer actually does make that recommendation, and if that leads to increased sales, which are not measured in most NPS surveys. Finally, NPS also obscures the critical fact that respondents to surveys like NPS, in general, tend to polarize: they hate it or love it, resulting in an “average sentiment” of... “Meh.” This is not particularly useful in marketing campaigns. Analyzing a database of NPS scores requires careful attention to the reliability issue, particularly if decisions affecting the entire customer base are concerned.
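The polarization point can be shown with a little arithmetic. The sketch below uses the standard NPS definition (percentage of promoters scoring 9 or 10, minus percentage of detractors scoring 0 through 6) on two invented sets of responses. The response data is hypothetical and chosen only to make the effect visible.

```python
from statistics import mean, pstdev


def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6), on a 0-10 scale."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical survey responses, for illustration only
polarized = [1] * 50 + [10] * 50   # half hate it, half love it
lukewarm = [7] * 50 + [8] * 50     # everyone is mildly satisfied

for name, scores in (("polarized", polarized), ("lukewarm", lukewarm)):
    print(name, "NPS:", nps(scores), "mean:", mean(scores), "spread:", pstdev(scores))
```

Both groups produce the same headline score even though their spreads are very different, which is why a single NPS figure deserves a look at the underlying distribution before it drives decisions about the whole customer base.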
Glossing over or ignoring potentially unreliable data usually backfires: someone will eventually uncover the flaws. Data analytics efforts should be designed from the start (not at the end) to account for problems with data reliability, and statistical models exist which can expedite the collection and interpretation of the data. More fundamentally however, managers must assess the sources objectively—querying with questions, not SQL—in order to properly compensate for reliability issues.
Timeliness: Was the raw data originally created in, or does it directly apply to, the time frame germane to the analytics we're conducting? How can we ensure that the time period we have defined is correct with respect to the answers we are attempting to generate?
This is a wider and more encompassing topic than obsolescence. Obsolescence reflects a total lack of applicability, a completely expired relevance and usefulness to the current situation. Timeliness, conversely, may include data that might otherwise be considered aged for other reasons yet is still germane and valuable to the analysis. The relevant questions are “How far back do we go,” and “Which data types require different timeframes, and why?”
The sales forecasting process in the CPG / FMCG space represents the most obvious structured-data example of this attribute. Historical sales data (e.g., point-of-sale data) are valuable inputs in sales and operations planning as well as advanced supply chain planning algorithms. As with any other historical data, however, trying to cover a needlessly long time period can be just as invalid or misleading as using too short a timeframe. A balance must be struck between using time periods long enough to comprise both periodic trends and seasonal/promotional changes on the one hand, while not overburdening the analysis with redundant or ambiguous information on the other hand.
Many data analytics efforts tap into online news articles, academic journals, white papers, and similar mountains of unstructured data. The valuable information found in these sources is challenging enough to analyze; being able to ensure the data is timely further complicates matters. Extracting the date stamp on the item may not be important if the content addresses a different time period. Similarly, in an effort to discern a competitor's strategic direction from employees' public blogs, for example, the analysis team may need to set content expiration limits in order to uncover trends that may happen rather than trends that have already begun.
Time to execute and money to fund are the usual constraints on data analytics efforts just as they are on any other project. In order to invest these assets most efficiently, analytics teams should estimate in advance, as thoroughly as practicable, what timeframe is most pertinent to the question. This estimate can be amended after the first round or two of analytics indicates that different periods are more valid. But attempting to “cover all the bases” by extending as far back in the past as possible can easily backfire.
Sufficiency: How much data is enough in order to make our decisions? How much is too much? What threshold do we need to set as the ideal balance?
This is one of the more intractable problems confronting data analysts. More data is not always better, yet we can't make informed decisions on data that is insufficient to give us a reliably clear picture of reality. Applying rigorous statistical models to the analytics process is ideal but not always feasible nor even, for some businesspeople, within the realm of their skill sets. As in many other domains, we must make educated estimates.
Working with a global nonprofit agency on a worldwide open innovation program, we were faced with a “good to have” problem: the number of ideas and insights obtained quickly rose into the high six digits. With limited staff and funding, however, it would have been impossible for the team to explore all of them. We determined a sufficient level of data to consider using both statistical tools and prudent consolidation of data elements. In the converse situation—how to judge if we have too little information—we turn to a social media example. Trolling sites such as Pinterest for consumer insights requires careful deliberation: discovering 5,000 shares of a cereal brand ad has a completely different meaning from adding up 5,000 point-of-sale purchases of that same brand.
Interpretation of unstructured visual information such as satellite imagery, film records, and other volume-intensive data types needs especial care in deciphering, as we all are limited by time and funding. Gargantuan and complex datasets can very rapidly overwhelm analysis teams. In addition to the advanced analysis software required, we are faced with the demoralizing effect of inundation, which can lead to unwise compromises and shortcuts. In order to minimize the effects of these dangers, careful consideration must be made of the logic and rules configured into the analysis applications being used. Teams should determine as carefully as possible what exactly is being sought in the data.
We want smart data, not terabytes that we can't intelligently use. Particularly in cases where the team knows volumes will be high, it may be possible to conduct pilot tests of whatever structured or unstructured sources are targeted. Upon conclusion of the pilots, adequate thresholds can then be determined. We also can't work with a paltry set of data, and statistical models can help determine minimum amounts necessary for reliable analysis. In either direction, it is necessary to set thresholds to ensure we have at least enough data to make decisions but not so much that it undermines the effort.
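As one example of a statistical model for a minimum threshold, the sketch below applies Cochran's sample-size formula for estimating a proportion, with an optional finite-population correction. This is a swapped-in textbook technique, not the method used in the engagements described above, and it assumes simple random sampling. The function name `minimum_sample_size` and its parameters are illustrative.

```python
import math
from statistics import NormalDist


def minimum_sample_size(margin_of_error, confidence=0.95, proportion=0.5, population=None):
    """Cochran's formula: n0 = z^2 * p * (1 - p) / e^2, rounded up.

    `proportion=0.5` is the most conservative guess (it maximizes p * (1 - p)).
    If `population` is given, the finite-population correction is applied.
    """
    # Two-sided z value for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# Illustrative calls: +/-5% margin at 95% confidence
print(minimum_sample_size(0.05))
print(minimum_sample_size(0.05, population=1_000))
```

The margin of error and confidence level are the thresholds to be agreed on; the formula only translates that agreement into a record count.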
Legitimacy: Is the data being analyzed considered to be a primary source of record? Is management willing to defend the source, and the data in it, as being a source of legitimate corporate (or public) information?
The advent of powerful visualization and cognitive analytics software tools has brought an astounding capacity to transform mountains of text into coherent hypotheses. Without due consideration of the data's true nature, however, these same tools can create stunning visuals of worthless information. Establishing information legitimacy should be a paramount concern of project teams tasked with convincing senior management to make significant decisions.
We're going to pick on structured data in spreadsheets again simply because of their pervasiveness. While working with one of our clients, a well-known manufacturer and distributor of consumer electronics products, it was discovered that the demand planners were changing, for certain past periods, their original sales forecast data to match the actual sales results. Under the pretext of “Well, now it's perfectly accurate,” the client persistently clung to this belief with the full knowledge that the next period's forecasting process would be tapping into data which had no legitimacy whatsoever and which would counteract the efficacy of the forecasting algorithms. Certainly not every company will encounter problems this serious. But the more information is extracted from legitimate sources of record and manipulated elsewhere, the higher the danger of that data being transformed into questionable hordes of numbers.
A recent development in the realm of unstructured data is the onset of “fake news.” As this trend is in its early stages (or so we assume), we must plan on it spreading and permeating any potential type of online information. Scanning sources such as external blogs and news outlets for insights requires careful and judicious consideration. Similarly, corporate emails are another common target for unstructured data analysis. We are in no way insinuating that corporate emails contain “fake” anything, but they can still generate misconceptions. The disclaimer at the end of many emails is itself a testament that “what you read here may be neither true nor a legitimate depiction of events/positions/etc.” Academic journals, usually subject to rigorous peer review, are also potential sources of illegitimate information.⁵
As many (legitimate) news articles have shown, the sources of this false information are quickly and resoundingly discredited in the academic and business community. While corporate analytics teams may be completely innocent of any wrongdoing, the mere citation of a questionable source immediately damages the team's credibility. Far better to carefully consider sources than to publish results that end up stigmatizing the team.
There are a few final considerations that data analytics and visualization practitioners, along with the managers and executives charged with drawing conclusions out of the information, should ponder. First, we have the question of reasonableness: it is unrealistic to believe we can manage all of these characteristics down to a level of zero error. The question then becomes one of the level of confidence we are comfortable tolerating. A reasonable level of consensus needs to be reached among the data analysts and those responsible for making decisions based on any conclusions. It is critically important to address this concept because the absence of consensus can lead to conflicting assumptions about the data's usability and value.
Some may be comfortable with the “it's 80 percent correct, let's move” maxim, while others won't be satisfied without rooting out the source of errors to raise that confidence level to, say, 95%. Still others may be perfectly ready to launch on a shoestring of data. But while it is important to set levels and expectations for these attributes in advance of the data collection and analysis, this doesn't mean they are invariable. In some cases the results of data analysis lead us to change the original thresholds and expectations we set for these attributes. Any such changes must of course be clearly communicated throughout the process.
Secondly, the seven key data attributes discussed here are all important, but perhaps not equally important in different situations. How should we weight the relative dominance of these attributes against one another? Are there any we can confidently ignore because they're not applicable to our situation? Because myriad scenarios can be envisioned, we recommend following procedures for prioritization similar to those already in practice at the organization. Analytics team members will then be familiar with the logic and process of determining factor weightings and can explain their decisions with confidence.
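For teams without an established prioritization procedure, a simple weighted score is one possible starting point. The sketch below is purely illustrative: the 1-to-5 rating scale, the weights, and the function name `weighted_score` are assumptions, and a weight of zero stands in for an attribute judged not applicable to the situation.

```python
ATTRIBUTES = (
    "accuracy", "relevancy", "integrity", "reliability",
    "timeliness", "sufficiency", "legitimacy",
)


def weighted_score(ratings, weights):
    """Weighted average of per-attribute ratings for one data source."""
    missing = [a for a in ATTRIBUTES if a not in ratings or a not in weights]
    if missing:
        raise ValueError(f"missing rating or weight for: {missing}")
    total_weight = sum(weights[a] for a in ATTRIBUTES)
    if total_weight <= 0:
        raise ValueError("at least one attribute needs a positive weight")
    return sum(ratings[a] * weights[a] for a in ATTRIBUTES) / total_weight


# Hypothetical ratings (1 = poor, 5 = excellent) and weights, for illustration only
ratings = {"accuracy": 4, "relevancy": 5, "integrity": 3, "reliability": 4,
           "timeliness": 2, "sufficiency": 3, "legitimacy": 5}
weights = {"accuracy": 3, "relevancy": 2, "integrity": 3, "reliability": 2,
           "timeliness": 1, "sufficiency": 0, "legitimacy": 1}

score = weighted_score(ratings, weights)
print(round(score, 2))
```

Writing the weights down explicitly makes it easier for the team to explain, and later revise, why one attribute was allowed to dominate another.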
Finally, we need to take into account the primary nature of the data: was it created in a machine-to-machine transaction, or was it conceived by a human being and then made part of the IoT? It makes sense to parse the information we're concerned with into two basic flavors: The Internet of Things That Think, and the Internet of Things That Don't. Each flavor requires a different approach to understand, interpret, and base actions upon. With the former in particular, we are often faced with the challenge of applying mathematical and objective rigor to subjective data. This topic will be fully explored in a subsequent paper by the authors.
In a convenient twist on Moore's Law we live in an age in which information is doubling approximately every 18 months. The amount of data readily available (costs aside) to even small companies is literally beyond comprehension. The tools available to crunch, synthesize, and visualize Big Data into manageable form are improving by the minute, which makes data analytics no longer an option for most organizations. As intimidating as it might be, following the guidelines discussed here will help convert that Big Data into Smart Data.
1 Ashton, Kevin. “That Internet of Things Thing,” The RFID Journal, June 22, 2009.
2 Pogue, David. “Fitness Trackers Are Everywhere, But Do They Work?” Scientific American, January 2015.
3 Tessler, Harvey. 10/29/15 “Mainframes are Still at the Heart of the Modern Tech World,” Enterprise Systems Media, Sept. 29, 2015.
4 Pigato, Joseph. “What Net Promoter Gets Wrong...” MarketingProfs.com, June 4, 2014.
5 Sample, Ian. “How computer-generated fake papers are flooding academia,” The Guardian, Feb 26, 2014.