Approaching Unstructured Data Using Structured Strategies

We are awash in data and, like a town hit by a tornado, we don't know where to begin.

In the past few years, the volume of unstructured data — video clips, blog posts, forum threads, social media posts, online chats, and email — has grown exponentially. How can we capture, interpret and utilize this flood of information?

Some may ask: why bother? If all we’re talking about is a digital warehouse of unrelated data, then yes, perhaps you can push the delete button or back it up on a remote server for a future graduate thesis. But while interpreting this data may seem like a daunting task at first, we might learn something from the science of meteorology.

Fifteen years ago, weather forecasting was imprecise. But computers became more powerful, satellites got upgraded, and the software got better. Data analytics is on a similar path: while the tools we have now may seem blunt, great strides are being made that will help draw a coherent picture of the thoughts, opinions, and reactions of crucial demographics.

The effort has drawn investment and manpower into developing technical solutions that impose order and structure on this digital chaos. The biggest challenge in interpreting unstructured data is its sheer volume: petabytes flood in as soon as new sources come online. To accommodate such monumental amounts of data, tools such as the open-source framework Hadoop provide scalable storage and distributed processing across thousands of computers.
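To make "distributed processing" a little more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce pattern that Hadoop is commonly used with. It is plain Python, not Hadoop code, and the function names and sample documents are assumptions made up for illustration. On a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample input; each string stands in for one stored document.
documents = ["big data big plans", "data everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

The point of the pattern is that each document can be mapped independently, which is what lets the work spread across many computers.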

It would be impractical, for example, to store and analyze every single post to Twitter every day. What you can do is narrow your data source to posts that mention a particular word, those that are generated from a specific location, or even those posted in a specific language. By adding these filters, you would get an idea of how your organization is perceived in, for example, the Southwest United States.
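As a rough illustration of narrowing a data source this way, the sketch below filters a small in-memory list of posts by keyword, location, and language. The Post record, its field names, and the sample posts are all hypothetical; a real feed would supply its own fields, and nothing here calls an actual Twitter API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Post:
    # Hypothetical record; these field names are assumptions for illustration.
    text: str
    region: str
    language: str


def filter_posts(
    posts: Iterable[Post],
    keyword: Optional[str] = None,
    region: Optional[str] = None,
    language: Optional[str] = None,
) -> Iterator[Post]:
    """Yield only the posts that satisfy every filter that was supplied."""
    for post in posts:
        if keyword is not None and keyword.lower() not in post.text.lower():
            continue
        if region is not None and post.region != region:
            continue
        if language is not None and post.language != language:
            continue
        yield post


# Made-up sample data; "Acme" is a placeholder organization name.
sample = [
    Post("Love the new Acme store in Phoenix", "Southwest US", "en"),
    Post("Acme shipping was slow again", "Northeast US", "en"),
    Post("Me encanta Acme", "Southwest US", "es"),
    Post("Nice weather today", "Southwest US", "en"),
]

matches = list(
    filter_posts(sample, keyword="acme", region="Southwest US", language="en")
)
for post in matches:
    print(post.text)
```

Each filter is optional, so you can start broad (keyword only) and tighten the net as the volume demands.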

Once you have chosen your data source, you are going to need a system in place to capture specific information and weigh its significance. An algorithm may increase the volume of data you can analyze, but it may also decrease the accuracy. A manual review may be more precise, but it will be time-consuming and you will not be able to handle higher volumes of information. It might be possible to use a crowd-sourcing tool such as Amazon’s Mechanical Turk, where you can chop up the data pile into “bite-size” bits and then reassemble the reports once completed. It’s tricky, but it can be done.
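The chop-and-reassemble idea can be sketched in a few lines. In the example below, the batch size, the sample posts, and the toy_reviewer function are assumptions for illustration; toy_reviewer is only a placeholder for a human worker, and no crowd-sourcing service is actually called.

```python
from typing import List, Sequence


def split_into_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Cut the data pile into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def reassemble(batch_results: List[List[str]]) -> List[str]:
    """Flatten the per-batch reports back into one list, preserving order."""
    return [label for batch in batch_results for label in batch]


def toy_reviewer(batch: List[str]) -> List[str]:
    # Placeholder for a human reviewer: one label per item in the batch.
    return ["positive" if "love" in text.lower() else "other" for text in batch]


# Made-up sample posts.
posts = ["Love it", "Too slow", "I love the service", "Meh", "Okay"]
batches = split_into_batches(posts, batch_size=2)
labels = reassemble([toy_reviewer(batch) for batch in batches])
print(list(zip(posts, labels)))
```

Because the batches keep their original order, the reassembled labels line up one-to-one with the posts they describe.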

No discussion of unstructured data solutions would be complete without mentioning a few of the large players in this burgeoning field. The three companies below all have a global footprint, with local consultants operating in many countries.

Although it tends to focus on medium and large corporations, SAS, based in Cary, North Carolina, has emerged as one of the market leaders in analyzing unstructured data and providing useful insights. It has an analytics solution that runs on Hadoop's data storage and processing framework and deserves a look. SAS has published white papers on the topic, but you need to register on the site to read them because they are not open to the public.

About a year ago, Oracle Corporation purchased Endeca Technologies, a company based in Cambridge, Massachusetts. Oracle is a software company that specializes in database management, although it has since expanded into other areas. It has posted an 11-page white paper on the subject (PDF), aimed at IT professionals, that makes a case for the advantages of going with an Oracle solution. Oracle also owns the rights to MySQL, the database system used by WordPress and other popular open-source solutions.

Not to be outdone, IBM has embraced a term I rather like: Big Data. It's essentially a different label for the problem of getting actionable intelligence from unstructured data. IBM has a blog devoted to the subject. While you might expect otherwise, David Corrigan makes the point there that even 5-person shops can benefit from having the right tools: "Whether you’re a 5 person shop or part of the Fortune 500, you can have big data." Corrigan says that any company can find an opportunity to analyze new sources of data and use the insights lurking within.

Don’t think of unstructured data as a tsunami that might engulf your organization. Think of it instead as a large, dynamic puzzle that can offer clues about important trends that affect your business. Rather than relying on customer surveys with unreliable results, you can observe what your target demographics think and feel at the moment of impulse. We are still early in the process, and the tools and techniques needed to harness this unstructured data still need to be perfected. Because unstructured data has such high potential to benefit businesses, those who know how to turn it into usable structured data are going to be in very high demand over the coming years.

