Murphy Choy

Data preparation, not just another data exercise

In Uncategorized on June 1, 2011 at 10:24 am

Preparing data appropriately for modeling work is usually interesting. Raw data contains so many oddities that it is almost a miracle for any analyst to get hold of clean data. These data quality issues also translate into additional work when the data has to be made ready for modeling. One of the most commonly encountered problems is the format of the data passed between analysts.

Very often, one encounters data in the form of an Excel table. In theory this is not a difficult format to work with, but in practice it is often hard to reuse elsewhere. This is especially the case when the sheet contains many merged cells and empty spaces. It is also problematic when titles are repeated, with large multi-row headers indicating the range of the data.
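As a rough illustration of the merged-cell problem, here is a minimal Python sketch. It assumes the sheet has already been read into a list of rows, and that a merged cell shows up as a value on its first row followed by blanks (empty string or None) on the rows beneath it. The function name `fill_merged_cells` and the sample figures are made up for illustration only.

```python
def fill_merged_cells(rows, columns):
    """Copy the last seen value down into blanks left by merged cells.

    rows    -- list of lists, one inner list per spreadsheet row
    columns -- indexes of the columns that contained merged cells
    """
    last_seen = {}
    filled = []
    for row in rows:
        new_row = list(row)  # copy so the raw rows are left untouched
        for col in columns:
            if new_row[col] in ("", None):
                # Blank cell: reuse the value from the row above, if any.
                new_row[col] = last_seen.get(col)
            else:
                last_seen[col] = new_row[col]
        filled.append(new_row)
    return filled


# Hypothetical sample: the region label was merged across two rows.
raw_rows = [
    ["North", "Jan", 120],
    ["", "Feb", 135],
    ["South", "Jan", 98],
    [None, "Feb", 101],
]

clean_rows = fill_merged_cells(raw_rows, columns=[0])
for row in clean_rows:
    print(row)
```

Only the columns named in `columns` are filled, so a genuinely missing measurement in another column stays blank instead of silently inheriting the value above it.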

Data to be used in modeling is better laid out as a simple table that captures observations at a given time frame. This tends to lead to better results and easier manipulation.
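One possible reading of "a simple table" is one row per observation per time period. The Python sketch below reshapes a report-style layout, with one column per period, into that form. The function name `wide_to_long`, the field names `entity`, `period` and `value`, and the sample numbers are all assumptions chosen for illustration.

```python
def wide_to_long(header, rows, id_column=0):
    """Turn one-column-per-period rows into one record per observation."""
    records = []
    for row in rows:
        for idx, value in enumerate(row):
            if idx == id_column:
                continue  # the identifier is not itself an observation
            records.append(
                {
                    "entity": row[id_column],
                    "period": header[idx],
                    "value": value,
                }
            )
    return records


# Hypothetical report layout: one row per region, one column per quarter.
header = ["region", "2010-Q1", "2010-Q2"]
rows = [
    ["North", 120, 135],
    ["South", 98, 101],
]

long_table = wide_to_long(header, rows)
for record in long_table:
    print(record)
```

In the long layout each record carries its own period label, so filtering by time, joining with other tables, or adding new periods does not require changing the column structure.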


