Wednesday, March 30, 2016

Presentation and Visualization Methods


“A GOOD SKETCH IS BETTER THAN A LONG SPEECH” — NAPOLEON BONAPARTE


The notion that visual communication is much more effective than textual communication has been around for ages. A few years back, a prevalent way to show a trend was to arrange data in rows and columns. However, with the advent of big data, the number of rows in a single file has grown into the millions, and unstructured data cannot be arranged in rows and columns that easily.

This is the reason for the existence of data visualization. Management teams have now realized that without effective methods of communicating data visually, their organizations would waste an inordinate amount of time and resources sifting through rivers of bits to find the insights they need. The primary goal of data visualization is to communicate information clearly and visually in the form of statistical graphics, plots, information graphics, etc. The use of these methods varies with the type of industry and the type of data to be communicated. Here are a few examples of industries and the types of data they would be interested in.


Healthcare

Healthcare is the maintenance or improvement of health via the diagnosis, treatment, and prevention of disease, illness, injury, and other physical and mental impairments in human beings. In the United States, healthcare is of primary importance both socially and politically. Healthcare data has the potential to reduce costs, enhance quality, and improve the patient experience. The challenge is how to get from information to insight to action. An example of a visualization for the healthcare industry would be to view the volume of patients at different times of the day across the days of the week, or to compare patients' wait times against their care scores.
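As a rough illustration of the patient-volume view described above, here is a minimal Python sketch of the aggregation that such a chart would sit on top of. The check-in timestamps and the function name volume_by_day_and_hour are made up for this example; a real dashboard would read from the hospital's own systems and use a proper charting tool.

```python
from collections import Counter
from datetime import datetime

# Hypothetical check-in timestamps (illustration data, not real patients).
checkins = [
    "2016-03-28 09:15",
    "2016-03-28 09:40",
    "2016-03-28 14:05",
    "2016-03-29 09:30",
    "2016-03-30 17:20",
    "2016-03-30 17:55",
]

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def volume_by_day_and_hour(timestamps):
    """Count visits per (weekday name, hour of day)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        counts[(DAYS[dt.weekday()], dt.hour)] += 1
    return counts


volume = volume_by_day_and_hour(checkins)

# Print a crude text "heatmap": one bar of '#' per day/hour bucket.
for (day, hour), n in sorted(volume.items(),
                             key=lambda kv: (DAYS.index(kv[0][0]), kv[0][1])):
    print(f"{day:<10} {hour:02d}:00  {'#' * n}")
```

The same counts could be fed into a heatmap with days on one axis and hours on the other.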



Education


Data visualization can be a powerful tool for gaining insights into the education system. At a high level, visualization can be used to identify which regions or schools are performing well and which are lagging behind. It can also be used to understand the student population and the demographics of school enrollment. At the school level, it can be used to compare performance within a school, for example through a dashboard that lets teachers track the performance of students in their courses.




Transportation

The transportation industry has many subcategories and sectors within it, so let's focus on the airline industry. Within the airline industry there are multiple factors to consider, but I'd like to focus on the misery map, that is, the delays and cancellations of flights. Using data visualizations, we can plot the data of different airlines over a period of time to determine which airlines have the best on-time performance or how the trends have changed over the years. We can also look at data about which airports are the busiest and usually have delays. Another visualization which I find intriguing is the airport traffic visualization. It shows the incoming and outgoing flight traffic of a specific airport and can be used to optimize traffic at the airport.
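To make the on-time comparison concrete, here is a small Python sketch that computes an on-time rate per airline per year, which is the kind of number a trend chart would plot. The flight records, the airline names, and the 15-minute cutoff are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical flight records: (airline, year, arrival delay in minutes).
flights = [
    ("Airline A", 2014, 5),
    ("Airline A", 2014, 42),
    ("Airline A", 2015, 0),
    ("Airline B", 2014, 12),
    ("Airline B", 2015, 3),
    ("Airline B", 2015, 20),
]

# Assumption: a flight arriving less than 15 minutes late counts as on time.
ON_TIME_THRESHOLD_MIN = 15


def on_time_rate(records, threshold=ON_TIME_THRESHOLD_MIN):
    """Return {(airline, year): share of flights that arrived on time}."""
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for airline, year, delay in records:
        totals[(airline, year)] += 1
        if delay < threshold:
            on_time[(airline, year)] += 1
    return {key: on_time[key] / totals[key] for key in totals}


rates = on_time_rate(flights)
for (airline, year), rate in sorted(rates.items()):
    print(f"{airline} {year}: {rate:.0%} on time")
```

Plotting these rates with the year on the x-axis and one line per airline would show how performance has changed over time.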






https://www.tableau.com/stories/workbook/improve-patient-satisfaction-improving-cycle-time
https://performanceanalyticsaus.wordpress.com/2012/11/19/a-good-sketch-is-better-than-a-long-speech-napoleon-bonaparte/
https://www.tableau.com/stories/workbook/identify-and-monitor-underperforming-students
http://flightaware.com/miserymap/
http://www.gioviz.com/2015/08/visualization-air-traffic-airports.html




Wednesday, March 2, 2016

Big Data, Unstructured Data, Structured Data, What?



Big Data is a term used to describe data sets so large that traditional database systems are inadequate to handle them. Big Data deals with two forms of data: Structured Data and Unstructured Data. So what are these two classifications of data?




Structured Data 


To put it simply, structured data is data which can be stored in relational databases. This form of data can be easily identified and processed by data mining tools. Any meaningful data organized into labeled rows and columns would usually fall under the structured data category.


Unstructured Data


On the other hand, unstructured data is data which does not have a predefined data model and cannot be directly stored in a relational database. Unstructured data may be textual (Word documents, emails, instant messages, tweets, etc.) or non-textual (audio, video, infographics, etc.). This sort of data may contain numbers, facts, and figures, but it requires specific applications to sort through it and extract meaningful data.
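The difference between the two forms can be sketched in a few lines of Python. In this illustration, the orders table, the sample email, and the regular expression are all invented for the example: the structured record can be queried directly with SQL, while the same kind of fact buried in free text has to be dug out by purpose-built code first.

```python
import re
import sqlite3

# Structured data: labeled columns in a relational table, queryable as-is.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
)
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (1001, "Acme", 250.0))
structured_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("Acme",)
).fetchone()[0]

# Unstructured data: the facts are in there, but there is no data model.
email = (
    "Hi team, Acme just confirmed order #1002 for $120.50. "
    "Please ship by Friday."
)

# A hand-written pattern for this one phrasing; real text needs far more care.
match = re.search(r"order #(\d+) for \$(\d+(?:\.\d+)?)", email)
extracted = (int(match.group(1)), float(match.group(2))) if match else None

print("From the table:", structured_total)
print("From the email:", extracted)
```

The pattern only works for this exact wording, which is precisely why unstructured data is harder to process with traditional tools.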

According to experts, about 80% of data available to an organization is usually unstructured.



To glean any sort of meaningful information from this large volume of data, organizations will have to process this data. However, with unstructured data this is usually easier said than done with traditional tools. 

The data sources available to organizations can be classified into two types:

Traditional Data Sources


Data from these sources is mostly transactional data and is usually structured and stored in databases. 

OLTP Databases: Oracle, Sybase, DB2, SQL Server, MySQL, Postgres, etc.

Enterprise Applications: ERP, CRM, HRM, etc.

Third Party Data: Consumer databases, Stock Trade data, etc.

Non-Traditional Data Sources


Web Applications: Website data, Mobile Applications data, etc.

Social Media Data: Twitter, Facebook, LinkedIn, etc.

Newer Third Party Data: APIs, Public Data, etc.

Others: Sensors, Device logs, etc.

With the explosion of the Connected Web and the Internet of Things, organizations today have access to more data than they can consume and analyze. They are being inundated with unstructured data from non-traditional sources. 

Role of Data Warehouse


A data warehouse can be used to integrate the data from all the different sources available to an organization. This can be used to create a single source of data for an organization which can then be used to perform analysis and gather business intelligence.



A major part of getting data from different sources into the data warehouse is the ETL process: Extract, Transform, Load. Since a data warehouse collects data from many different sources, ETL is the key process that brings all the data together in a standard, homogeneous environment.
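Here is a minimal sketch of the three ETL steps in Python, using only the standard library. Everything in it is an assumption for illustration: the two source formats (a CSV export and a list of CRM records), their field names, and the dim_customer target table. Real ETL tools do far more, but the shape is the same.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different field names and date formats.
CSV_EXPORT = """customer_id,name,signup_date
1, alice smith ,03/01/2016
2,BOB JONES,03/02/2016
"""
CRM_RECORDS = [{"id": "3", "full_name": "Carol Lee", "joined": "2016-03-05"}]


def extract():
    """Extract: read raw rows from each source as-is."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_EXPORT)))
    return csv_rows, CRM_RECORDS


def transform(csv_rows, crm_rows):
    """Transform: map both sources onto one schema with ISO dates."""
    out = []
    for r in csv_rows:
        month, day, year = r["signup_date"].split("/")
        out.append((int(r["customer_id"]),
                    r["name"].strip().title(),
                    f"{year}-{month}-{day}"))
    for r in crm_rows:
        out.append((int(r["id"]), r["full_name"].strip().title(), r["joined"]))
    return out


def load(rows, conn):
    """Load: write the standardized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(*extract()), conn)
warehouse_rows = conn.execute(
    "SELECT customer_id, name, signup_date FROM dim_customer "
    "ORDER BY customer_id"
).fetchall()
print(warehouse_rows)
```

Keeping the three steps as separate functions mirrors the way each stage can be changed independently when a new source is added.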

Various Data Sources - Limitations of Data Warehouse


Traditional ETL systems were not built to handle unstructured data. With most of the data available to organizations being unstructured, ETL processes have evolved to incorporate it. A major chunk of unstructured data is textual, so ETL processes now incorporate Natural Language Processing and Text Analytics to transform unstructured text into a form that can be stored in a data warehouse.
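To give a feel for what transforming text into warehouse-ready rows can look like, here is a deliberately crude Python sketch. It uses simple keyword counting as a stand-in for real Natural Language Processing; the stopword list, the sample review, and the text_to_rows function are assumptions for this example only.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "is", "and", "was", "to", "but", "my", "it", "very"}


def text_to_rows(doc_id, text, top_n=3):
    """Turn free text into (doc_id, term, count) rows for a table."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [(doc_id, term, n) for term, n in counts.most_common(top_n)]


review = (
    "The delivery was late and the delivery box was damaged, "
    "but support was helpful."
)
rows = text_to_rows("review-1", review)
for row in rows:
    print(row)
```

Each output row has a fixed set of columns, so it can be loaded into a relational table even though the input was free text.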

Collecting data from different sources comes with a challenge of quality. Since data is being collected from various disparate sources, there may be many sources of error. Inconsistent data, missing data, conflicting data, etc. are all challenges of data quality in a data warehouse. Transforming data from various sources also has costs associated with it, in terms of processing power, resource usage, or the time it takes to transform the data. When these costs cause transformations to be cut short or delayed, the result may be inconsistent information. Reporting and analytics based on inconsistent data may lead to incorrect decision making, which negates one of the core purposes of a data warehouse.
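The missing-data and conflicting-data problems mentioned above can be caught with fairly simple checks. The sketch below is one possible approach, not a standard tool: the record layout, the source field, and the find_quality_issues function are all hypothetical.

```python
def find_quality_issues(records, key="customer_id", required=("email",)):
    """Flag missing required fields and values that conflict across sources."""
    issues = []
    first_seen = {}
    for rec in records:
        # Missing data: a required field is absent or empty.
        for field in required:
            if not rec.get(field):
                issues.append(("missing", rec[key], field))
        previous = first_seen.get(rec[key])
        if previous is None:
            first_seen[rec[key]] = rec
            continue
        # Conflicting data: same key, different non-empty value.
        for field, value in rec.items():
            if field == "source":
                continue  # the source label is expected to differ
            if previous.get(field) and value and previous[field] != value:
                issues.append(("conflict", rec[key], field))
    return issues


# Hypothetical records for the same customers arriving from two systems.
records = [
    {"customer_id": 1, "email": "a@example.com", "city": "Boston", "source": "crm"},
    {"customer_id": 1, "email": "a@example.com", "city": "Austin", "source": "erp"},
    {"customer_id": 2, "email": "", "city": "Denver", "source": "crm"},
]

issues = find_quality_issues(records)
for issue in issues:
    print(issue)
```

Running checks like these before loading keeps known-bad rows out of reports, at the price of extra processing time.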

Future of Data Warehouses


Traditional data warehouses were designed to incorporate mostly structured data. However, with a plethora of information available in the form of unstructured data, ETL processes are being upgraded to incorporate this new form of data. The revolution of the traditional data warehouse begins with the ETL process. The architecture of the data warehouse stays the same, but its back room processes need to change. The back room process for ETL will move away from focusing on just structured data and will cover both structured and unstructured data.

Another factor contributing to the evolution of data warehouses is analytics. Analytics will no longer use the data warehouse as just a repository. Data warehouses will grow into "analytics" warehouses. The architecture for the analytics warehouse builds on the traditional data warehouse architecture in three primary ways:
1. A distributed file system (like Hadoop's HDFS) sits between source data systems and the data warehouse. It collects, aggregates, and processes huge volumes of unstructured data, and stages it for loading into the data warehouse.
2. Structured and unstructured data from back-end systems can be brought into the data warehouse in real time and near-real time.
3. Engines that use statistical and predictive modeling techniques to perform data discovery, visualization, inductive and deductive reasoning, and real-time decision-making reside between the data warehouse and end users. These engines identify patterns in big data. They can also complement and feed traditional ad hoc querying tools and business intelligence applications.
The ability to integrate analytics, real-time reporting, business intelligence, and traditional warehouses will change the economics of data warehouses for the better.




References:

http://www.digitaldoughnut.com/blog/blog/exclusive-unveiling-the-types-of-data-in-an-organisation-

http://www.slideshare.net/kgraziano/data-warehousing-2016

http://deloitte.wsj.com/cio/2013/07/17/the-future-of-data-warehouses-in-the-age-of-big-data/

http://blog.aylien.com

http://www.research.ibm.com/articles/doctors_at_research.shtml