MIS 587: Big Data, Unstructured Data, Structured Data, What?

Big Data is a term for data sets so large that traditional database systems are inadequate for storing and processing them. Big Data comes in two forms: structured data and unstructured data. So what are these two classifications of data?

Structured Data

To put it simply, structured data is data that can be stored in a relational database. This form of data can be easily identified and processed by data mining tools. Meaningful data organized into labeled rows and columns usually falls under the structured data category.

Unstructured Data

On the other hand, unstructured data is data that does not have a predefined data model and cannot be stored directly in a relational database. Unstructured data may be textual (Word documents, emails, instant messages, tweets, etc.) or non-textual (audio, video, infographics, etc.). This sort of data may contain numbers, facts, and figures, but it requires specialized applications to sort through it and extract meaningful information.
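To make the contrast concrete, here is a minimal Python sketch. It assumes a hypothetical orders table and a made-up email; neither comes from a real system. The structured record maps straight onto table columns, while the same kind of facts in free text have to be dug out with a hand-written pattern first.

```python
import re
import sqlite3

# Structured: the fields are already known, so the record maps directly onto a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (1001, "Acme Corp", 250.0))

# Unstructured: similar facts are buried in free text (hypothetical email body).
email_body = "Hi team, Globex just confirmed order 1002 for $480.50. Please ship this week."

# A hand-written pattern pulls out the fields. It only works for this exact phrasing;
# real text needs far more robust tooling.
pattern = r"(\w+) just confirmed order (\d+) for \$(\d+(?:\.\d+)?)"
match = re.search(pattern, email_body)
if match:
    customer = match.group(1)
    order_id = int(match.group(2))
    amount = float(match.group(3))
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer, amount))

rows = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(rows)
```

The pattern breaks as soon as the wording of the email changes, which is the practical reason unstructured data needs specialized applications.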

According to a commonly cited estimate, about 80% of the data available to an organization is unstructured.

To glean meaningful information from this large volume of data, organizations have to process it. With unstructured data, however, this is easier said than done using traditional tools.

The data sources available to organizations can be classified into two types:

Traditional Data Sources

Data from these sources is mostly transactional data and is usually structured and stored in databases.

OLTP Databases: Oracle, Sybase, DB2, SQL Server, MySQL, Postgres, etc.

Enterprise Applications: ERP, CRM, HRM, etc.

Third Party Data: Consumer databases, Stock Trade data, etc.

Non-Traditional Data Sources

Web Applications: Website data, Mobile Applications data, etc.

Social Media Data: Twitter, Facebook, LinkedIn, etc.

Newer Third Party Data: APIs, Public Data, etc.

Others: Sensors, Device logs, etc.

With the explosion of the Connected Web and the Internet of Things, organizations today have access to more data than they can consume and analyze. They are being inundated with unstructured data from non-traditional sources.

Role of Data Warehouse

A data warehouse integrates data from all the different sources available to an organization. The result is a single source of data that the organization can use to perform analysis and gather business intelligence.

A key step in moving data from different sources into the data warehouse is the ETL process: Extract, Transform, Load. Because a data warehouse collects data from many different sources, ETL is what brings all of that data together in a standard, homogeneous environment.
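The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources (a CRM export as CSV text and a list of ERP records), their field names, and the sales table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different layouts and units.
crm_csv = "CustomerName,Total\nAcme Corp,250.00\nGlobex,480.50\n"
erp_records = [{"cust": "initech", "amount_cents": 12000}]

def extract():
    """Extract: read each source in its native layout."""
    crm_rows = list(csv.DictReader(io.StringIO(crm_csv)))
    return crm_rows, erp_records

def transform(crm_rows, erp_rows):
    """Transform: map both layouts onto one schema (customer, amount_usd, source)."""
    unified = []
    for row in crm_rows:
        unified.append((row["CustomerName"].strip().title(), float(row["Total"]), "crm"))
    for row in erp_rows:
        unified.append((row["cust"].strip().title(), row["amount_cents"] / 100, "erp"))
    return unified

def load(conn, rows):
    """Load: write the unified rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount_usd REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))

total = warehouse.execute("SELECT SUM(amount_usd) FROM sales").fetchone()[0]
row_count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count, total)
```

Once both sources share one schema, a single query can answer a question that neither source could answer on its own.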

Various Data Sources - Limitations of Data Warehouse

Traditional ETL systems were not designed to deal with unstructured data. Because most of the data available to organizations is unstructured, ETL processes have evolved to handle it. A major share of unstructured data is textual, so ETL processes now incorporate Natural Language Processing and text analytics to transform unstructured data into a form that can be stored in a data warehouse.
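As a rough idea of what such a transformation step does, the Python sketch below turns free-text messages into rows with fixed columns. Simple keyword matching stands in for real Natural Language Processing here; the messages and the two word lists are invented for illustration and are not a real sentiment model.

```python
import re
from collections import Counter

# Hypothetical customer messages; in practice these might come from social media or email.
messages = [
    "Love the new app, checkout is so fast!",
    "Delivery was late again and support never replied.",
    "Great prices, but the app keeps crashing.",
]

# Tiny hand-made word lists (assumptions for illustration only).
POSITIVE = {"love", "fast", "great"}
NEGATIVE = {"late", "never", "crashing"}

def to_row(message_id, text):
    """Turn one free-text message into a fixed set of columns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "message_id": message_id,
        "token_count": len(tokens),
        "positive_hits": sum(counts[word] for word in POSITIVE),
        "negative_hits": sum(counts[word] for word in NEGATIVE),
        "mentions_app": "app" in counts,
    }

rows = [to_row(i, text) for i, text in enumerate(messages, start=1)]
for row in rows:
    print(row)
```

Each resulting row has the same columns, so it could be loaded into a warehouse table like any other structured record.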

Collecting data from different sources comes with the challenge of quality. Because the data comes from many disparate sources, there are many potential sources of error. Inconsistent data, missing data, and conflicting data are all data quality challenges in a data warehouse. Transforming data from various sources also has costs, whether in processing power, resource usage, or the time the transformation takes. When these costs lead to shortcuts or delays in transformation, the warehouse may end up holding inconsistent information. Reporting and analytics based on inconsistent data may lead to incorrect decisions, which negates one of the core purposes of a data warehouse.
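Two of these quality problems, missing data and conflicting data, can be checked for with very little code. The Python sketch below assumes hypothetical customer records from two sources; the field names and the helper functions find_missing and find_conflicts are made up for this example.

```python
from collections import defaultdict

# Hypothetical customer records gathered from two sources.
records = [
    {"source": "crm", "customer_id": 1, "email": "ann@example.com", "country": "US"},
    {"source": "erp", "customer_id": 1, "email": "ann@example.com", "country": "USA"},
    {"source": "crm", "customer_id": 2, "email": None, "country": "DE"},
    {"source": "erp", "customer_id": 3, "email": "raj@example.com", "country": "IN"},
]

def find_missing(rows, required=("email", "country")):
    """Return (customer_id, field) pairs where a required field is empty."""
    return [(r["customer_id"], f) for r in rows for f in required if not r.get(f)]

def find_conflicts(rows, fields=("email", "country")):
    """Return {(customer_id, field): sorted values} where sources disagree."""
    seen = defaultdict(set)
    for r in rows:
        for f in fields:
            if r.get(f):
                seen[(r["customer_id"], f)].add(r[f])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}

missing = find_missing(records)
conflicts = find_conflicts(records)
print(missing)
print(conflicts)
```

Here customer 2 has no email, and the two sources disagree on how to write the country for customer 1. Checks like these are cheap to run before loading, but deciding which value is correct still needs a rule or a person.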

Future of Data Warehouses

Traditional data warehouses were designed to hold mostly structured data. However, with so much information now available as unstructured data, ETL processes are being upgraded to incorporate this new form of data. The evolution of the traditional data warehouse begins with the ETL process. The architecture of the data warehouse stays the same, but its back room processes need to change. The back room ETL process will move away from focusing on structured data alone and will handle both structured and unstructured data.

Another factor contributing to the evolution of data warehouses is analytics. Analytics will no longer use the data warehouse as just a repository, and data warehouses will grow into "analytics" warehouses. The architecture of the analytics warehouse builds on the traditional data warehouse architecture in three primary ways:

1. A distributed file system (such as Hadoop's HDFS) sits between the source data systems and the data warehouse. It collects, aggregates, and processes huge volumes of unstructured data, and stages it for loading into the data warehouse.

2. Structured and unstructured data from back-end systems can be brought into the data warehouse in real time and near-real time.

3. Engines that use statistical and predictive modeling techniques to perform data discovery, visualization, inductive and deductive reasoning, and real-time decision-making reside between the data warehouse and end users. These engines identify patterns in big data. They can also complement and feed traditional ad hoc querying tools and business intelligence applications.
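The first and third layers can be imitated at toy scale in Python. In this sketch a plain list of log lines stands in for the distributed file system, an in-memory SQLite database stands in for the warehouse, and a one-line statistical rule (flag anything above the mean plus one standard deviation) stands in for a predictive engine. The log format, the sensor names, and the threshold rule are all assumptions, and real-time loading is not shown.

```python
import sqlite3
from collections import Counter
from statistics import mean, pstdev

# Staging layer: raw device log lines (hypothetical "date sensor status" format).
raw_logs = [
    "2016-03-01 sensor-a ERROR",
    "2016-03-01 sensor-a OK",
    "2016-03-01 sensor-b OK",
    "2016-03-01 sensor-c OK",
    "2016-03-02 sensor-a ERROR",
    "2016-03-02 sensor-a ERROR",
    "2016-03-02 sensor-a ERROR",
    "2016-03-02 sensor-b OK",
    "2016-03-02 sensor-c ERROR",
]

def stage(lines):
    """Aggregate raw lines into (sensor, error_count) rows before loading."""
    errors = Counter()
    for line in lines:
        _date, sensor, status = line.split()
        errors[sensor] += 1 if status == "ERROR" else 0
    return sorted(errors.items())

# Warehouse layer: only the aggregated rows are loaded.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sensor_errors (sensor TEXT, error_count INTEGER)")
warehouse.executemany("INSERT INTO sensor_errors VALUES (?, ?)", stage(raw_logs))

# Analytics layer: a simple statistical rule flags unusually error-prone sensors.
counts = dict(warehouse.execute("SELECT sensor, error_count FROM sensor_errors").fetchall())
values = list(counts.values())
threshold = mean(values) + pstdev(values)
flagged = [sensor for sensor, count in counts.items() if count > threshold]
print(counts, flagged)
```

The point of the layering is that the warehouse never sees the raw lines, only the staged aggregates, while the analytics step reads from the warehouse instead of from the sources.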

The ability to integrate analytics, real-time reporting, business intelligence, and traditional warehousing will change the economics of data warehouses for the better.

Wednesday, March 2, 2016
