These days, you can find many references to unstructured data in the marketing sections of analytics vendor websites. So what does it really mean, anyway? Is the term useful or just a marketing buzzword?

Wikipedia defines unstructured data as information that either “does not have a pre-defined data model or is not organized in a predefined manner.” Pretty clearly, data in a relational database is structured data. What about JSON in a NoSQL database like MongoDB? Is a Microsoft Word .docx file structured? It is obviously organized in a predefined manner (although perhaps best described as a Rube Goldberg contraption).

Semi-structured and unstructured data

I will group data into three categories: structured, semi-structured, and unstructured. Semi-structured data includes any data that can be viewed as a set, sequence, or hierarchy of records, where the set of fields in each record has certain minimal members, but is extensible. For example, log files are semi-structured data, as they can be interpreted as a time-ordered sequence of records. Website clickstream data also fits into this category.
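
To make the "minimal members, but extensible" idea concrete, here is a small Python sketch. The key=value log format, the field names (ts, level), and the sample lines are all assumptions I made up for illustration; real log formats vary widely.

```python
# Assumed minimal members that every record must carry (hypothetical names).
REQUIRED_FIELDS = {"ts", "level"}


def parse_line(line):
    """Parse one 'key=value key=value ...' log line into a dict record."""
    record = dict(pair.split("=", 1) for pair in line.split())
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return record


def parse_log(text):
    """Return the records in time order.

    Assumes every timestamp uses the same ISO-8601 UTC format, so that
    plain string comparison matches chronological order.
    """
    records = [parse_line(line) for line in text.splitlines() if line.strip()]
    return sorted(records, key=lambda r: r["ts"])


# Two made-up lines: both have ts and level, but the extra fields differ.
SAMPLE_LOG = """\
ts=2024-01-01T10:00:05Z level=INFO event=login user=alice
ts=2024-01-01T10:00:01Z level=WARN event=disk_low free_mb=512
"""

for record in parse_log(SAMPLE_LOG):
    print(record)
```

Every record shares the required fields, while each one is free to carry extra fields of its own. That is the property that separates this kind of data from a fixed relational schema.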

Now, we can define unstructured data as any data that fits into neither of the first two categories. This includes documents, media files (photo, video, and audio), executables, genomics data, etc. Clearly, none of these can be interpreted as record sets. The vast majority of the data on my laptop, at least by size, falls into this category.

Some data formats are less clear. A spreadsheet might fall into semi-structured data, but only if it had a true table structure. An XML document might be semi-structured if it represents the records returned from a REST API call, but unstructured if it represents a web page. What about an RSS feed? It could be viewed as either unstructured data to be rendered as a webpage or semi-structured data — a time series sequence of data to be processed by an analytics tool. Thus, the boundary between semi-structured and unstructured is somewhat fluid for some formats and may depend on a given object’s intended use.

Analytics for unstructured data

Modern analytics engines like Hadoop were an advancement over traditional data warehouses because they extended the domain of analytics to include semi-structured data. The purpose of the map step in a MapReduce process is to convert semi-structured data into structured data for aggregation. Some unstructured data can be converted very easily into a semi-structured format for analysis by algorithms like MapReduce. For example, web pages can be converted almost effortlessly into lists of outgoing links for Google’s PageRank algorithm.
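
As a rough illustration of the web page example, the sketch below uses Python's standard html.parser module to turn each page into (source, target) link records, then counts inbound links per target. The page contents and the names map_page and reduce_inbound are hypothetical, and the inbound count is only a stand-in for aggregation: this is not PageRank itself, just the kind of link extraction that could feed it.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_page(url, html):
    """Map step: one unstructured page -> list of (source, target) records."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return [(url, target) for target in parser.links]


def reduce_inbound(records):
    """Reduce step: aggregate the records into inbound-link counts."""
    return Counter(target for _source, target in records)


# Three made-up pages standing in for a crawl.
PAGES = {
    "a.example": '<p><a href="b.example">B</a> <a href="c.example">C</a></p>',
    "b.example": '<a href="c.example">C</a>',
    "c.example": "<p>No links here.</p>",
}

records = [rec for url, html in PAGES.items() for rec in map_page(url, html)]
inbound = reduce_inbound(records)
print(inbound)  # c.example has 2 inbound links, b.example has 1
```

In a real cluster the map calls would run in parallel across many machines, but the shape of the computation is the same: pages in, link records out, then an aggregation over those records.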

For other unstructured data formats, it is harder to see how they can be converted to a form amenable to analytics. How would you preprocess a video file to extract semi-structured data? Perhaps one might extract metadata properties and header data. A fingerprint can be generated to uniquely identify the content for de-duplication or detecting piracy. Other properties of the video could be extracted by some kind of domain-specific analysis. For a PDF file, one might look for keywords, extract external references (e.g., URLs or a bibliography), or try to auto-summarize the document. Clearly, analytics for this kind of unstructured data is not as well explored as it is for semi-structured data.
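
Two of these ideas are simple enough to sketch in Python. The function names and sample data are hypothetical. Note the assumptions: a SHA-256 hash only matches byte-identical copies, so it covers exact de-duplication but not piracy detection on re-encoded video, which would need a perceptual fingerprint. The URL extractor assumes the document text has already been pulled out of the PDF by some other tool.

```python
import hashlib
import re
from collections import defaultdict

# Deliberately simple pattern; an assumption, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>()]+")


def fingerprint(data: bytes) -> str:
    """Return a hex SHA-256 digest identifying this exact byte content."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a name to its raw bytes; only groups with more than
    one member are returned.
    """
    groups = defaultdict(list)
    for name, data in files.items():
        groups[fingerprint(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def extract_urls(text):
    """Return the URLs found in already-extracted document text."""
    return [url.rstrip(".,;:") for url in URL_PATTERN.findall(text)]


# Made-up stand-ins for real media files.
FILES = {
    "a.mp4": b"\x00\x01video-bytes",
    "copy-of-a.mp4": b"\x00\x01video-bytes",
    "b.mp4": b"different-bytes",
}

print(find_duplicates(FILES))
print(extract_urls("See https://example.com/a, and http://example.org/b."))
```

Either function turns an opaque file into something record-like (a digest, a list of references), which is the first step toward treating it as semi-structured data.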

Summary

We’ve grouped data into three categories: structured, semi-structured, and unstructured. If data cannot be broken down into records, it can be considered unstructured. The boundaries between the categories are fluid and depend somewhat on the intended use of the data. Traditional data warehouses are focused on structured data, while tools like Hadoop are more focused on semi-structured data. Standardized analytics against truly unstructured data is still a largely open problem.