Wednesday, March 30, 2016

Presentation and Visualization Methods


“A GOOD SKETCH IS BETTER THAN A LONG SPEECH” — NAPOLEON BONAPARTE


The notion that visual communication is much more effective than textual communication has been around for ages. A few years ago, the prevalent way to show a trend was to arrange data in rows and columns. With the advent of big data, however, the number of rows in a single file has grown into the millions, and unstructured data cannot easily be arranged in rows and columns at all.

This is the reason data visualization exists. Management teams have realized that without effective methods of communicating data visually, their organizations will waste an inordinate amount of time and resources sifting through rivers of bits to find the insights they need. The primary goal of data visualization is to communicate information clearly and visually, in the form of statistical graphics, plots, information graphics, and so on. The choice of method varies with the type of industry and the type of data to be communicated. Here are a few examples of industries and the kind of data each would be interested in.


Healthcare

Healthcare is the maintenance or improvement of health via the diagnosis, treatment, and prevention of disease, illness, injury, and other physical and mental impairments in human beings. In the United States, health care is of primary importance both socially and politically. Healthcare data has the potential to reduce costs, enhance quality, and improve the patient experience; the challenge is how to get from information to insight to action. An example of a visualization for the healthcare industry would be a view of patient volume at different times of day across the days of the week, or a plot of patient wait time against care scores.
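
As a rough illustration of the first example, here is a minimal Python sketch of a heatmap of patient volume by hour of day and day of week. The patient counts are randomly generated stand-ins, not real hospital data.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: patient arrivals per hour for each day of the week.
rng = np.random.default_rng(0)
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
hours = np.arange(24)
volume = rng.poisson(lam=20, size=(7, 24))  # rows = days, columns = hours

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(volume, aspect="auto", cmap="Reds")
ax.set_xticks(hours)
ax.set_yticks(range(7))
ax.set_yticklabels(days)
ax.set_xlabel("Hour of day")
ax.set_title("Patient volume by hour and day of week (simulated data)")
fig.colorbar(im, ax=ax, label="Patients")
plt.show()

Dark cells in such a view immediately show the hours and days when the facility is busiest, which is exactly the kind of insight that is hard to spot in a raw table.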



Education


Data visualization can be a powerful tool for gaining insight into the education system. At a high level, visualization can be used to identify which regions or schools are performing well and which are lagging behind. It can also be used to understand the student population and the demographics of school enrollment. At the school level, it can be used to compare performance within a school, for example through a dashboard that lets teachers track the performance of students in their courses.
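
The data behind such a teacher dashboard can be very simple. The sketch below, using a made-up gradebook with illustrative column names, flags students whose average score falls below a chosen threshold:

import pandas as pd

# Hypothetical gradebook data; student names, courses, and scores are made up.
grades = pd.DataFrame({
    "student": ["Ana", "Ben", "Ana", "Ben", "Cal", "Cal"],
    "course":  ["Math", "Math", "Science", "Science", "Math", "Science"],
    "score":   [92, 58, 85, 61, 74, 49],
})

# Average score per student, then flag anyone below a chosen threshold.
summary = grades.groupby("student")["score"].mean().reset_index()
summary["underperforming"] = summary["score"] < 65
print(summary)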




Transportation

The transportation industry has many subcategories and sectors within it, so let's focus on the airline industry. Within the airline industry there are many factors to look at, but I'd like to focus on the misery map, i.e. the delays and cancellations of flights. Using data visualization we can plot the data of different airlines over a period of time to determine which airlines have the best on-time performance, or how the trends have changed over the years. We can also look at which airports are the busiest and most prone to delays. Another visualization I find intriguing is the airport traffic visualization: it shows the incoming and outgoing flight traffic of a specific airport and can be used to optimize traffic at the airport.
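
A minimal sketch of the on-time trend idea in Python might look like the following; the airline names and percentages are made up purely for illustration, not actual performance figures.

import matplotlib.pyplot as plt

# Illustrative (made-up) on-time arrival percentages per airline and year.
years = [2011, 2012, 2013, 2014, 2015]
on_time = {
    "Airline A": [78, 80, 82, 81, 84],
    "Airline B": [74, 73, 77, 79, 80],
    "Airline C": [81, 79, 78, 82, 85],
}

for airline, pct in on_time.items():
    plt.plot(years, pct, marker="o", label=airline)

plt.xlabel("Year")
plt.ylabel("On-time arrivals (%)")
plt.title("On-time performance trend (simulated data)")
plt.legend()
plt.show()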






References:

https://www.tableau.com/stories/workbook/improve-patient-satisfaction-improving-cycle-time
https://performanceanalyticsaus.wordpress.com/2012/11/19/a-good-sketch-is-better-than-a-long-speech-napoleon-bonaparte/
https://www.tableau.com/stories/workbook/identify-and-monitor-underperforming-students
http://flightaware.com/miserymap/
http://www.gioviz.com/2015/08/visualization-air-traffic-airports.html




Wednesday, March 2, 2016

Big Data, Unstructured Data, Structured Data, What?



Big Data is a term used to describe sets of data so large that traditional database systems are inadequate to handle them. Big Data deals with two forms of data: Structured Data and Unstructured Data. So what are these two classifications of data?




Structured Data 


To put it simply, structured data is data which can be stored in relational databases. This form of data can be easily identified and processed by data mining tools. Any meaningful data arranged in labeled rows and columns usually falls under the structured data category.


Unstructured Data


On the other hand, unstructured data is data which does not have a predefined data model and cannot be directly stored in a relational database. Unstructured data may be textual (Word documents, emails, instant messages, tweets, etc.) or non-textual (audio, video, infographics, etc.). This sort of data may contain numbers, facts, and figures, but it requires specialized applications to sort through it and extract meaningful information.

According to experts, about 80% of the data available to an organization is unstructured.



To glean any meaningful information from this large volume of data, organizations have to process it. With unstructured data, however, this is easier said than done using traditional tools.

The data sources available to organizations can be classified into two types:

Traditional Data Sources


Data from these sources is mostly transactional data and is usually structured and stored in databases. 

OLTP Databases: Oracle, Sybase, DB2, SQL Server, MySQL, Postgres, etc.

Enterprise Applications: ERP, CRM, HRM, etc.

Third Party Data: Consumer databases, Stock Trade data, etc.

Non-Traditional Data Sources


Web Applications: Website data, Mobile Applications data, etc.

Social Media Data: Twitter, Facebook, LinkedIn, etc.

Newer Third Party Data: APIs, Public Data, etc.

Others: Sensors, Device logs, etc.

With the explosion of the Connected Web and the Internet of Things, organizations today have access to more data than they can consume and analyze. They are being inundated with unstructured data from non-traditional sources. 

Role of Data Warehouse


A data warehouse can be used to integrate the data from all the different sources available to an organization. This can be used to create a single source of data for an organization which can then be used to perform analysis and gather business intelligence.



A major function in extracting data from different sources and storing it in the data warehouse is the ETL process: Extract, Transform, Load. Since a data warehouse collects data from many different sources, ETL is the key process that brings all the data together into a standard, homogeneous environment.
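
As a minimal sketch of that Extract-Transform-Load flow, assuming a hypothetical orders.csv source file and a local SQLite database standing in for the warehouse:

import csv
import sqlite3

# Extract: read raw records from a hypothetical source file.
with open("orders.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: standardize types and formats into a homogeneous shape.
clean_rows = [
    (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
    for row in raw_rows
    if row.get("amount")  # drop records missing a key field
]

# Load: write the transformed rows into the warehouse (SQLite stands in here).
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (order_id TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean_rows)
conn.commit()
conn.close()

Real ETL platforms add scheduling, error handling, and lineage tracking on top of this, but the extract, transform, and load stages are the same.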

Various Data Sources - Limitations of Data Warehouse


Traditional ETL systems are incapable of dealing with unstructured data. With most of the data available to organizations being unstructured, ETL processes have evolved to incorporate it. A major chunk of unstructured data is textual, so ETL processes now incorporate Natural Language Processing and text analytics to transform unstructured data into a form that can be stored in a data warehouse.
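
The core idea can be illustrated far short of full NLP: turning free text into rows with predictable columns. The sketch below uses made-up tweet text and simple regular expressions to produce structured records a warehouse could store.

import re

# Hypothetical unstructured input: raw tweet text.
tweets = [
    "Loving the new #dashboard from @acme_bi, setup took 5 minutes!",
    "Flight delayed again at ORD #travel @some_airline",
]

structured = []
for text in tweets:
    structured.append({
        "text": text,
        "hashtags": re.findall(r"#(\w+)", text),
        "mentions": re.findall(r"@(\w+)", text),
        "word_count": len(text.split()),
    })

# Each dictionary is now a row with predictable columns, ready for loading.
for row in structured:
    print(row)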

Collecting data from different sources comes with a data quality challenge. Since data is being collected from various disparate sources, there are many potential sources of error: inconsistent data, missing data, conflicting data, and so on. Transforming data from various sources also has costs associated with it, whether in processing power, resource usage, or the time it takes to transform the data. When organizations cut corners on these transformation costs, the result is often inconsistent information, and reporting and analytics based on inconsistent data can lead to incorrect decision making, which negates one of the core purposes of a data warehouse.
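
Basic data quality checks can be automated as part of the load. A small sketch, using made-up customer records merged from two hypothetical source systems:

import pandas as pd

# Hypothetical customer records merged from two source systems.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "email": ["a@x.com", "b@x.com", "b@x.com", None, "d@x.com"],
    "country": ["US", "US", "UK", "US", None],
})

# Missing data: count nulls per column.
print(customers.isna().sum())

# Conflicting or duplicate data: the same customer_id appearing more than once.
print(customers[customers.duplicated(subset="customer_id", keep=False)])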

Future of Data Warehouses


Traditional data warehouses were designed to hold mostly structured data. However, with a plethora of information now available in the form of unstructured data, ETL processes are being upgraded to incorporate this new form of data. The evolution of the traditional data warehouse begins with the ETL process: the architecture of the data warehouse stays the same, but the back room processes need to change. The back room ETL process will move away from focusing only on structured data toward handling both structured and unstructured data.

Another factor contributing to the evolution of data warehouses is analytics. Analytics will no longer use the data warehouse as just a repository; data warehouses will grow into "analytics" warehouses. The architecture of the analytics warehouse builds on the traditional data warehouse architecture in three primary ways:
1. A distributed file system (like Hadoop) sits between source data systems and the data warehouse. It collects, aggregates, and processes huge volumes of unstructured data, and stages it for loading into the data warehouse.
2. Structured and unstructured data from back-end systems can be brought into the data warehouse in real- and near-real time.
3. Engines that use statistical and predictive modeling techniques to perform data discovery, visualization, inductive and deductive reasoning, and real-time decision-making reside between the data warehouse and end users. These engines identify patterns in big data. They can also complement and feed traditional ad hoc querying tools and business intelligence applications.
The ability to integrate analytics, real-time reporting, business intelligence and traditional warehouses will change the economics of the data warehouses for the better.




References:

http://www.digitaldoughnut.com/blog/blog/exclusive-unveiling-the-types-of-data-in-an-organisation-

http://www.slideshare.net/kgraziano/data-warehousing-2016

http://deloitte.wsj.com/cio/2013/07/17/the-future-of-data-warehouses-in-the-age-of-big-data/

http://blog.aylien.com

http://www.research.ibm.com/articles/doctors_at_research.shtml




Wednesday, February 17, 2016

Dimension Modeling for the Hotel Industry

About the Business:


To illustrate the concept of dimensional modeling, I would like to take the example of the hospitality industry, and to make it more tangible, I will focus on Hilton Worldwide Holdings (HLT from now on). HLT encompasses a large portfolio of hotels and resorts worldwide. It was founded by Conrad Hilton in 1919 in Cisco, Texas, and was later acquired by the Blackstone Group for about $26 billion in 2007. It is the sixth largest hotel chain globally by number of hotels.



Performance Metrics:


Now, if I were the CEO of Hilton Worldwide (Christopher J. Nassetta) and wanted to see some performance metrics for HLT, I would probably want to see some of the points below:

* Which line of hotels is the most profitable? - To find out which hotel chain is raking in the most revenue.

* Which line of hotels is the most popular? - To find out which hotel chain is attracting the most customers.

* Which location-segment is the most popular? - To find out which locations customers frequently visit.

* Which market-price segment is the most popular? - To find out which segment is the most popular amongst Luxury, Upscale, Midscale, and Economy.
  
* What is the occupancy of a specific hotel for a particular time period? (Paid Rooms Occupied / Rooms Available) - To find out how occupied a hotel is during a time period.

* What is the Revenue per Available Room for a specific hotel chain? (Total Room Revenue / Total Rooms Available) - To find out the revenue earned per room in a given period.

* What is the Average Daily Rate for a particular hotel? (Rooms Revenue / Paid Rooms Occupied) - To find out the average rental income per paid occupied room in a given time period.

There might be many more metrics, but let's focus on these for now. To get all the above information we will have to capture a lot of data. In particular, we will need data on the bookings made at hotels, the hierarchy of hotels (which chain each belongs to), the revenue paid by customers for the bookings (and any discount), the number of rooms in each hotel categorized by class, and so on. The definition of "revenue" might differ across business segments, but for this article let's assume revenue is the amount a customer pays for a hotel room.
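
To show how the headline metrics fall out of the captured data, here is a tiny Python sketch using the formulas above; the figures are made up for illustration, not HLT's actual numbers.

# Hypothetical totals for one hotel over one 30-day period.
rooms_available = 200 * 30          # 200 rooms over a 30-day period
paid_rooms_occupied = 4_500         # total paid room-nights in the period
room_revenue = 675_000.0            # total room revenue in the period

occupancy = paid_rooms_occupied / rooms_available   # Paid Rooms Occupied / Rooms Available
revpar = room_revenue / rooms_available             # Total Room Revenue / Total Rooms Available
adr = room_revenue / paid_rooms_occupied            # Rooms Revenue / Paid Rooms Occupied

print(f"Occupancy: {occupancy:.1%}")   # 75.0%
print(f"RevPAR:    ${revpar:.2f}")     # $112.50
print(f"ADR:       ${adr:.2f}")        # $150.00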

Dimensional Modeling:


To capture facts at the lowest level of detail, we will record the booking data, i.e. the transaction data of each customer. Measures like Revenue per Available Room and Average Daily Rate can be derived by the BI tool from the captured data. Revenue over a time period can be computed by summing the revenue obtained from each room of a hotel over the period, grouped by date (since every dimensional model should have a date dimension). We can also have a periodic snapshot table, at a monthly or quarterly grain, identifying the revenue and the average occupancy over the period.

The dimensions for such a model would include the following:
  • Hotel Chain Dimension
  • Hotel Dimension
  • Customer Dimension
  • Date Dimension
  • Room Type Dimension
  • Deal Dimension
A high-level dimensional model for the hotel industry would therefore be a star schema: a booking fact table at the center, joined to each of the dimensions above.
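
One concrete way to express such a star schema is as DDL run from Python against SQLite; the table and column names below are illustrative assumptions, not HLT's actual model.

import sqlite3

conn = sqlite3.connect("hotel_dw.db")

# Dimension and fact tables for the booking star schema (illustrative names).
conn.executescript("""
CREATE TABLE IF NOT EXISTS dim_hotel_chain (chain_key INTEGER PRIMARY KEY, chain_name TEXT, market_segment TEXT);
CREATE TABLE IF NOT EXISTS dim_hotel      (hotel_key INTEGER PRIMARY KEY, chain_key INTEGER, hotel_name TEXT, city TEXT, country TEXT);
CREATE TABLE IF NOT EXISTS dim_customer   (customer_key INTEGER PRIMARY KEY, name TEXT, loyalty_tier TEXT);
CREATE TABLE IF NOT EXISTS dim_date       (date_key INTEGER PRIMARY KEY, full_date TEXT, month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE IF NOT EXISTS dim_room_type  (room_type_key INTEGER PRIMARY KEY, room_class TEXT);
CREATE TABLE IF NOT EXISTS dim_deal       (deal_key INTEGER PRIMARY KEY, deal_name TEXT, discount_pct REAL);

CREATE TABLE IF NOT EXISTS fact_booking (
    booking_id     INTEGER PRIMARY KEY,
    hotel_key      INTEGER REFERENCES dim_hotel(hotel_key),
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    date_key       INTEGER REFERENCES dim_date(date_key),
    room_type_key  INTEGER REFERENCES dim_room_type(room_type_key),
    deal_key       INTEGER REFERENCES dim_deal(deal_key),
    nights         INTEGER,
    revenue        REAL,
    discount       REAL
);
""")
conn.commit()
conn.close()

With the grain of fact_booking set at one row per booking, the metrics above become simple aggregations over the fact table grouped by the relevant dimension keys.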







Wednesday, February 3, 2016

Comparison of Business Intelligence and Analysis Tools


Organizations today suffer from the DRIP syndrome: Data Rich, Information Poor. There is too much data coming in from various sources, but no central system to extract useful information from it. This is where Business Intelligence comes in, to make sense of all this data.

There are quite a few commercial vendors offering Business Intelligence and analysis products. This article compares a few of these vendors using a weighted set of criteria. First, let's talk about the criteria.

For the weighted decision matrix, I will be using the following criteria:

Data Sourcing: The first step of analysis using a BI tool is getting the data. Data can come from text files, flat files, CSVs, databases, servers, etc. The more sources a tool supports, the more opportunities there are for getting data in. It should also be easy for an end user to connect data sources to the BI tool.

Usability: No matter how good a tool is at its core functions, if it is not easy to use it will most probably be rejected by the end user. A tool should be user-friendly: the UI should not be too complex, and the end user should not get lost trying to figure out how to use the tool.

Filtering and Visualization: The end user should be able to drill down to the data they are seeking. This can be done using filters, selection criteria, the intelligence of the software (maybe?), etc. Another function associated with selecting data is visualization. A picture is worth a thousand words, and certainly more than 10,000 rows on a spreadsheet! A tool that lets the user see data as graphs or charts and then drill down is an intuitive tool.

Reporting: The tool should allow the creation of reports. A BI tool would be of marginal use if it could not be used to share information. Reports should be customizable based on user-selected criteria, and the tool should allow exporting reports in different formats. Integration with the most popular formats, e.g. Word, Excel, and PDF, would be a plus.

Cost: Cost is one of the first factors looked at while selecting a business tool, especially by the finance division. Cost, of course, is relative: a big-brand license would be too expensive for a small-scale customer and vice versa. But a cost-to-features analysis can be done when determining the selection of a BI tool.

Added Features - These are not the core features of a BI tool, but they have gone from could-have to should-have or even must-have. The following are the add-on features:

Admin & Management: BI tools are not stand-alone tools anymore. They are full-fledged applications. Thus, these applications should allow administration of users, access rights, security, etc.

Deployment: BI tools should make it easy for users to implement them. It should be easy for end users to integrate the tool with their existing systems.

Mobility: A cell phone today is more powerful than the computers on the Apollo 11 mission. And with users being more mobile than ever, BI tools are expected to seamlessly support users on mobile devices and offer the same, or almost the same, experience as on a desktop.

This article will be comparing the following commercial vendors:
  • IBM Cognos
  • Microstrategy
  • Tableau
  • SAS
  • Microsoft
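
To make the mechanics of the weighted scoring concrete, here is a small Python sketch. The weights and the 1-10 scores below are purely illustrative placeholders, not the actual ratings behind the matrix that follows.

# Criteria weights (illustrative; chosen to sum to 1.0).
weights = {
    "Data Sourcing": 0.15, "Usability": 0.20, "Filtering & Visualization": 0.20,
    "Reporting": 0.15, "Cost": 0.10, "Admin & Management": 0.05,
    "Deployment": 0.05, "Mobility": 0.10,
}

# Placeholder scores on a 1-10 scale, one row per vendor (not real ratings).
scores = {
    "IBM Cognos":    {"Data Sourcing": 7, "Usability": 6, "Filtering & Visualization": 7, "Reporting": 8, "Cost": 5, "Admin & Management": 8, "Deployment": 6, "Mobility": 6},
    "Microstrategy": {"Data Sourcing": 8, "Usability": 7, "Filtering & Visualization": 8, "Reporting": 9, "Cost": 6, "Admin & Management": 7, "Deployment": 9, "Mobility": 9},
    "Tableau":       {"Data Sourcing": 9, "Usability": 9, "Filtering & Visualization": 8, "Reporting": 7, "Cost": 6, "Admin & Management": 6, "Deployment": 7, "Mobility": 7},
}

# Weighted total per vendor: sum of weight * score over all criteria.
for vendor, s in scores.items():
    total = sum(weights[c] * s[c] for c in weights)
    print(f"{vendor}: {total:.2f}")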


A weighted scoring matrix criterion of features of all the popular BI tools yields the following result:




From the weighted matrix, we can see that Tableau scores high on Usability, while SAS scores high on Filtering & Visualization. Microstrategy scores high on Mobility, Deployment, and Reporting. Microsoft scores highest on Administration, while SAS and Tableau tie on Data Sourcing. Overall, Microstrategy ranks ahead because of its higher weighted score across features.

Sources:

https://www.gartner.com/doc/2989518/magic-quadrant-business-intelligence-analytics

http://www.pcmag.com/article2/0,2817,2491954,00.asp

http://www.sas.com/content/dam/SAS/en_us/doc/analystreport/ovum-decision-matrix-bi-105875.pdf

http://searchbusinessanalytics.techtarget.com/feature/How-to-evaluate-and-select-the-right-BI-analytics-tool