As the cost per byte of storage has declined, it has become a habit to simply store data “just in case.” At a time when the overwhelming majority of data was generated by human beings, nobody thought much of it. Data was summarized, information extracted from it, and the raw data points were still kept should they be needed later. Later seldom came.
Cisco tells us that as of 2008, there were more things connected to the internet than people, so we can use that as the point in time when the amount of data being generated and stored had its hockey stick moment. Now we have more sensors in more places monitoring more and more activity and generating more and more data points. In 2010, then-Google CEO Eric Schmidt observed that we were generating and storing as much data every two days as we had from the dawn of civilization up to 2003.
That’s a lot of data.
Running Out Of Room, Or …
The instinctive reaction is to assume that, at some point, we're going to run out of storage capacity. If Moore's Law holds, that won't happen. We'll just keep inventing new, denser storage technologies.
But what we are running out of is time.
Long ago, the last thing anyone in the data center did at the end of the day was make sure the daily backups were running. They would run into the night all by themselves. Then they would run through the night. Then they were still running when everyone came into the office in the morning.
Fortunately, we're clever and adaptable, so we came up with incremental backups. Instead of copying the same data over and over, we copied only the data that had changed since the last backup. Then we moved to faster backup media. Now we're backing up data as we save it to primary storage. Ultimately, even the recovery time objective becomes impossible to meet in the time available to us.
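To make the incremental approach concrete, here is a minimal sketch in Python, assuming a simple file tree and a recorded timestamp for the last successful run; the paths and helper names are illustrative, not any particular backup product's API.

```python
import shutil
from pathlib import Path

SOURCE = Path("/data/primary")          # hypothetical primary storage mount
TARGET = Path("/backups/incremental")   # hypothetical backup destination
STAMP = Path("/backups/.last_backup")   # records when the last backup finished

def incremental_backup() -> None:
    # Copy only files modified since the last successful backup,
    # rather than recopying everything on every run.
    last_run = STAMP.stat().st_mtime if STAMP.exists() else 0.0
    for src in SOURCE.rglob("*"):
        if src.is_file() and src.stat().st_mtime > last_run:
            dest = TARGET / src.relative_to(SOURCE)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)     # copy2 preserves timestamps and metadata
    STAMP.touch()                        # mark this run as the new baseline

if __name__ == "__main__":
    incremental_backup()
```

The same idea, tracking what has changed instead of recopying everything, is what kept backup windows manageable even as the total volume of data grew.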
Making Tough Choices
Now we have to make a difficult choice. Once we've processed the data and extracted valuable information, do we keep the original raw data as it was collected, or do we discard it?
Or do we choose to save some of the raw data and discard the rest? What criteria should guide that choice? How do we anticipate, in our planning, which data points need to be stored and which can be discarded?
Now Add Machine Learning
This problem is exacerbated by the introduction of machine learning and artificial intelligence technologies into data analytics. When a machine is performing much of the data collation, selection, and processing, how are we to know which data points the machine will want to retrieve to complete its analysis? What if we choose incorrectly?
Other Possible Strategies
To be more pragmatic about this challenge, we need to think about data reduction. First of all, when and where should it occur?
Many of us treat a physical move from one place to another as an opportunity to discard belongings we no longer need. Some do this discarding as they pack. Others, often in a rush to make the move, simply pack everything and promise to do the discarding when they arrive at the new location. Many of us still have boxes upon boxes that have sat unpacked since we moved in years ago.
In the classic framework, we can choose to perform data reduction at the core of the network, in the server processors that will perform all the analytics. Or we can choose to perform data reduction at the edge, where the data is being collected, so that the load on the servers and storage is reduced.
The ultimate solution will likely be a combination of both, depending on the workload and the processing required.
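As a rough illustration of the edge side of that trade-off, the sketch below collapses raw sensor readings into one summary record per sensor per minute before anything is sent upstream; the field names and the one-minute window are assumptions for the example, not a prescribed design.

```python
from collections import defaultdict
from statistics import mean

def reduce_at_edge(readings):
    """Collapse raw (timestamp, sensor_id, value) readings into
    per-sensor, per-minute summaries before transmission to the core."""
    buckets = defaultdict(list)
    for ts, sensor_id, value in readings:
        window = int(ts // 60)            # group readings into one-minute windows
        buckets[(sensor_id, window)].append(value)

    summaries = []
    for (sensor_id, window), values in sorted(buckets.items()):
        summaries.append({
            "sensor_id": sensor_id,
            "window_start": window * 60,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        })
    return summaries                      # a fraction of the original volume leaves the edge

# Example: 600 one-second readings become 10 summary records.
raw = [(t, "temp-01", 20.0 + (t % 7) * 0.1) for t in range(600)]
print(len(reduce_at_edge(raw)))           # -> 10
```

Only the summaries cross the network; whether the raw points are also kept locally, and for how long, is exactly the choice discussed above.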
Begin With The End In Mind
There has been much discussion about data science: the art of extracting useful information from data and turning it into knowledge that facilitates superior decision-making.
As the internet of things continues to produce Schmidt's estimated five exabytes every two days, data science must expand its scope to include the development of an end-to-end data strategy. That strategy must begin with careful planning around the collection of data, the layers of summarization and reduction, preprocessing, and, finally, deciding which data points get stored and which are discarded.
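One way to make such a strategy explicit, sketched here with assumed data classes and retention periods, is a declarative policy that states, per class of data, whether it is summarized, how long raw points are kept, and whether the reduced form is archived.

```python
from dataclasses import dataclass

@dataclass
class RetentionPolicy:
    data_class: str        # e.g. raw waveforms, periodic readings, event logs
    summarize: bool        # reduce at the edge before storage?
    keep_raw_days: int     # 0 means raw points are discarded once processed
    archive_summary: bool  # keep the reduced form indefinitely?

# Illustrative values only; the real numbers depend on the business use case
# and on the point at which each class of data gains value.
POLICIES = [
    RetentionPolicy("vibration_waveforms",    summarize=True,  keep_raw_days=7,   archive_summary=True),
    RetentionPolicy("temperature_readings",   summarize=True,  keep_raw_days=30,  archive_summary=True),
    RetentionPolicy("security_camera_events", summarize=False, keep_raw_days=365, archive_summary=False),
]

def should_keep_raw(policy: RetentionPolicy, age_days: int) -> bool:
    # Storage tiers consult the policy instead of deciding ad hoc.
    return age_days <= policy.keep_raw_days
```

Writing the policy down, even in a form this simple, forces the questions raised earlier to be answered before the data arrives rather than after.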
As is always the case with data storage issues, this will be a volume-velocity-value process based on the business use case involved and the point at which data gains value. The science is nascent, but the opportunity is immense.