Murphy Choy

Archive for May, 2011

Sensitivity analysis and SAS

In Uncategorized on May 31, 2011 at 6:23 am

Many people are confused about the meaning of sensitivity analysis and typically find it hard to understand how the values are calculated. This is further compounded by the same term meaning different things in different industries. For many people, sensitivity was never even part of their modeling analysis.

Sensitivity analysis means different things in different contexts. In generic modeling, it refers to how much the model output changes given a change in a particular factor. The term is most commonly used in linear programming and operations research; in fields such as data mining, sensitivity analysis is less commonly encountered.

The most common way to carry out a sensitivity analysis is to run a linear regression with one independent variable and one dependent variable. Doing this allows one to measure the change in one variable given a change in the other. However, there are a few problems to note. Such models are typically less than useful when the R-squared is below 0.8, because the explanatory power is very weak and the model may not capture the process properly. Another problem is manipulation of the variables to inflate the R-squared value; such manipulation is rare, but it usually pushes the R-squared to extremely high levels.
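As a minimal sketch of this approach (the dataset and variable names here are hypothetical), a one-variable regression in SAS looks like the following; the slope estimate on the independent variable is the sensitivity of the dependent variable to it:

```sas
/* Hypothetical data: how sales respond to price changes.            */
/* The parameter estimate for price is the sensitivity of sales      */
/* to a one-unit change in price.                                    */
proc reg data=work.sales_data;
   model sales = price;   /* one dependent, one independent variable */
run;
quit;
```

Per the caution above, check the R-squared in the output before trusting the slope; below roughly 0.8 the sensitivity estimate should be treated with suspicion.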

Sensitivity analysis should always be done in a cautious manner to ensure its usefulness.

Understanding the business, not just random modeling

In Uncategorized on May 29, 2011 at 10:46 am

I was involved in a small discussion with my friends about the recent data mining competitions, which involve fairly out-of-this-world kinds of modeling. While it is very interesting for researchers to try out different techniques, much of it is a waste of effort and something that might be detrimental to the reputation of data miners and analysts.

One of the main issues in data mining is ensuring that the final outcome can be understood in the context of the business. The power of neural networks in non-linear modeling has been widely recognized, yet the scarcity of neural network modeling in many areas is commonly attributed to the difficulty of understanding the model, which makes it hard to relate the different layers of ideas together. The inability to relate business and models causes a lot of trouble for BI wannabes.

The other end of the problem is modeling that attempts to explain something which obviously cannot be explained by the chosen factors. Such modeling is basically nonsense and extremely tenuous in establishing the truth. This is further compounded by massive datasets making every model statistically significant but not necessarily useful.

Hopefully, when we do modeling, more people will apply Occam's razor before dredging the data.

Programming VS point and click

In Uncategorized on May 27, 2011 at 4:23 am

Many analysts debate whether they need to learn programming or whether that skill has been superseded by the advance of point-and-click statistical packages. In fact, this is an extremely hot topic now given the proliferation of statistical packages.

Point and click has the inherent advantage of making analysis, and even data manipulation, simpler for beginners. This is extremely attractive for companies that are starting out on analytics and lack the necessary talent for this work. At the same time, point and click offers advanced users an easy way to carry out simple tasks without resorting to heavy programming.

With ease and simplicity comes a major downside: any point-and-click software will struggle to complete a huge number of repetitive tasks, since even a medium amount of work requires many clicks. In this respect, programming is inevitably more efficient for repetitive work, and good macros can be developed to automate these tasks in subsequent projects.
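As an illustration of that last point (the dataset and variable names below are invented for the example), a small SAS macro can loop over a repetitive task that would take many clicks in a point-and-click tool:

```sas
/* Hypothetical task: produce the same frequency table            */
/* for every variable in a space-separated list.                  */
%macro freq_all(ds, vars);
   %local i var;
   %let i = 1;
   %do %while(%scan(&vars, &i) ne );
      %let var = %scan(&vars, &i);
      proc freq data=&ds;
         tables &var;
      run;
      %let i = %eval(&i + 1);
   %end;
%mend freq_all;

/* One call replaces many rounds of pointing and clicking. */
%freq_all(work.customers, gender region segment)
```

Once written, the macro handles three variables or three hundred with the same one-line call.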

A balance between the two is always desired and my opinion is that they complement one another.

Visualization: The double edged sword

In Uncategorized on May 26, 2011 at 2:09 am

Many recent developments in Business Intelligence have been focused on the field of Visual Analytics. There are now many tools that one can use to visualize data in charts and advanced visual plots. The tools may even have an interactive component that allows one to select subgroups.

I do not deny the benefits of a good visual display. A good visual display reduces the need for complicated statistics and cumbersome testing. However, within any simple tool lies danger.

Any visual display can be manipulated in ways that distort information, and with it the truth. This is particularly easy given the ease of use of the charting packages available now. I have seen several such examples recently, both on the internet and in newspapers.

A word of advice when analyzing charts: take them with a pinch of salt and verify any beliefs they suggest with proper tests. A far safer proposition.

Singapore SAS Analytics Forum 2011

In Uncategorized on May 25, 2011 at 1:43 am

Today is the SAS Forum for Singapore SAS users. Hope to catch everyone there later.

Server Installations VS Virtualization

In Uncategorized on May 24, 2011 at 2:41 am

It is an interesting start to the day. Someone commented about the problems of virtualization and the slowness of the system. I have to say that such problems have existed since the beginning of virtualization, but things have improved considerably in the past few years.

One key problem of virtualization is the requirement of having one OS run another OS on top of it. This becomes extremely serious when there is insufficient RAM or too few CPUs to run the virtualization. However, mid-range servers now come with several CPUs and huge amounts of RAM. With a good number of CPUs to run the image and a respectable amount of RAM, virtualization has far fewer performance issues.

Another important aspect of virtualization is the portability of the image, which can be transferred easily and deployed to another system should a catastrophic disaster hit the servers. However, many people tend not to take this into consideration when running their servers. The other important point to note is that should any accident destroy the virtualized image, we can always restore it from the original image.

All in all, given a good system, virtualization should not be that much of an issue.

Cluster Analysis: CCC, PFS and PTS.

In Uncategorized on May 23, 2011 at 2:30 am

Most SAS programmers who are familiar with PROC CLUSTER will have some degree of familiarity with the CCC, which is one of SAS’s greatest contributions to cluster analysis. However, few people are aware of the other options for determining the number of clusters.

The two other major options are the pseudo F statistic and the pseudo t-squared statistic. These two statistics are almost as powerful as the CCC for estimating the natural number of clusters. One advantage of these statistics is that they reveal a multitude of possible solutions for the number of clusters.
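In PROC CLUSTER these statistics can be requested directly; a sketch with a hypothetical segmentation dataset and variables:

```sas
/* Print the CCC, pseudo F and pseudo t-squared for the last     */
/* 15 generations of the cluster history.                        */
proc cluster data=work.custs method=ward ccc pseudo print=15;
   var recency frequency monetary;   /* hypothetical variables */
run;
```

Candidate cluster counts show up as local peaks in the pseudo F, and as a small pseudo t-squared followed by a much larger value at the next join.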

However, much of their use has been superseded by the CCC. Many other software packages still retain these two statistics, but in SAS Enterprise Miner and SAS/STAT, the default is the CCC.

Understanding Cluster Analysis

In Uncategorized on May 20, 2011 at 2:30 am

Many analysts I have come across who deal with customer segmentation have a great deal of experience with cluster analysis. However, few of them have any clue about the inner workings of cluster analysis.

Cluster analysis is a rather indistinct technique of data exploration. One key issue is that the number of clusters that exist in a dataset is unknown. This is compounded by the measurement statistics being fuzzy and relatively hard to interpret. While this problem has been mitigated by the CCC (Sarle, 1990), it remains hard to measure precisely the number of clusters needed.

Another major issue is the presence of outliers, which severely distort the cluster analysis. This is particularly obvious for users of the CCC, which becomes severely negative in some situations.
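One sketch of a defense against this (dataset and variable names hypothetical): PROC CLUSTER's TRIM= option drops a percentage of low-density points as probable outliers before clustering, so they cannot drag the CCC severely negative. My understanding is that TRIM= needs a density-estimation option such as K= alongside it:

```sas
/* Trim roughly 5 percent of the points as probable outliers     */
/* before clustering; K= sets the nearest-neighbor count used    */
/* for the density estimates that TRIM= relies on.               */
proc cluster data=work.custs method=ward ccc trim=5 k=10;
   var recency frequency monetary;
run;
```

The trimmed points can then be inspected separately rather than being allowed to distort every cluster they touch.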

In the next few posts, I will detail some of the solutions that we can use to solve these problems.

Unusual comparison standards for software

In Uncategorized on May 19, 2011 at 2:31 am

I was recently involved in an online discussion/debate about the different statistical software packages on the market. The discussion inevitably turned to R versus SAS, along with the strengths and weaknesses of each.

I have often been bombarded by people claiming the superiority of one software package over another. What is interesting is that many of them compare the performance of packages without any sense of the underlying hardware specifications. This argument seems most pervasive in discussions involving R.

I am all for R if its users love it, having been an R programmer myself before switching to SAS. However, I find arguments that gloss over the huge amount of RAM needed for data merging and sorting to be misleading. Personally, I have rarely come across PCs or laptops with more than 8 GB of RAM, with the majority having 2 GB to 4 GB. This makes it difficult to use R to manage data sets of 1 GB and beyond.

SAS, however, has none of these problems; I am currently processing a 6 GB dataset on a Pentium 4 with 512 MB of RAM. While it takes a long time, it has not crashed on me yet.

This post is here to highlight some facts and not to flame people.

Variable creation and how they relate to the problem

In SAS on May 18, 2011 at 5:46 am

I was meditating on some analytic problems that some of my contemporaries are facing, and I realized that the main problem lies in variable creation. Many of their problems are modeled with variables that are too naive by any standard. This is compounded by extensive abuse of mathematical transformations, which creates all sorts of funny variables that are hard to interpret.

One important principle of variable creation is relevance to the question. For example, anyone trying to answer a question about the life span of an individual will seek information about the person’s lifestyle and family history, as well as general population statistics on life span. However, what I have noticed is that many people just throw in a whole bunch of weird information that might not make sense, such as the number of children or whether the parents are alive.

Another important point to note is transformation and the skewness of the data. While transformation can make data look more normal, it does not work all the time. This is compounded by the massive effect of skewness, which needs to be addressed. Sometimes the problem with skewed data is that we are cutting the data too fine, causing unnecessary problems under the curse of dimensionality.
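A minimal sketch of the transformation point (the variable names here are hypothetical): a log transform in a data step, with PROC UNIVARIATE to compare the skewness before and after:

```sas
/* Hypothetical income variable, which is usually right-skewed. */
data work.trans;
   set work.raw;
   log_income = log(income + 1);   /* +1 guards against log(0) */
run;

/* Compare the skewness statistics of the raw and transformed   */
/* variables to see whether the transform actually helped.      */
proc univariate data=work.trans;
   var income log_income;
run;
```

If the transformed variable is still badly skewed, the transform has not helped, and the interpretability cost of the new variable may not be worth paying.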

Hope this helps.