Friday, June 21, 2013

Easy Deciles with PROC RANK

One of the tasks I'm working on right now requires creating deciles for a large data set. I'm generally more comfortable using sql queries than the built in SAS functions, but it turns out SAS has something really good for this - the RANK procedure.

Apparently, most programmers use data steps and macros to create deciles, but I honestly can't see why.

PROC RANK data=<dataset> out=<output dataset> groups=<num_quantiles>;
 var = <list of variables to rank>;
 rank = <list of column names for ranks>;
run;

That's it. So basically, you can use this to create deciles (or other quantiles) on an arbitrarily sized data set for an arbitrary number of variables to quantize.

Reference:
Jonas V. Bilenas, JP Morgan Chase Bank, Wilmington, DE, Using PROC RANK and PROC UNIVARIATE to Rank or Decile Variables. NESUG 2009

2 comments:

  1. I have a SAS Programming Problem that you may have already solved:

    My Data Set contains three sets of continuous variables:

    DQ01 - DQ59 DE01 - DE59 & DL01 - DL59.

    ( 177 variables ) Each standardised with Mean = 50 and Variance = 100

    1. I want to bin each continuous variable using deciles or semi-deciles
    that have been computed using PROC Univariate / Summary.

    2. Compute and output the Percentiles for each Variable.

    3. For each variable compare the observed values with the Percentile
    Cut-Points and then allocate that observation to a Decile Bin.

    4. Optimise the Bin Allocation based on a metric such as the GINI.

    5. Apply a Robust WOE Transformation to each Binned Variable.
    subject to the following constraints:
    a. The % frequency within each bin > 5%
    b. The WOE transformation is Monotonic

    6. Fit a Binary Logistic Regression Model to the WOE-Transformed Variables.

    If you have any advice or suggestions w.r.t. the above please let me know.

    Regards

    ReplyDelete
  2. I have a SAS Programming Problem that you may have already solved:

    My Data Set contains three sets of continuous variables:

    DQ01 - DQ59 DE01 - DE59 & DL01 - DL59.

    ( 177 variables ) Each standardised with Mean = 50 and Variance = 100

    1. I want to bin each continuous variable using deciles or semi-deciles
    that have been computed using PROC Univariate / Summary.

    2. Compute and output the Percentiles for each Variable.

    3. For each variable compare the observed values with the Percentile
    Cut-Points and then allocate that observation to a Decile Bin.

    4. Optimise the Bin Allocation based on a metric such as the GINI.

    5. Apply a Robust WOE Transformation to each Binned Variable.
    subject to the following constraints:
    a. The % frequency within each bin > 5%
    b. The WOE transformation is Monotonic

    6. Fit a Binary Logistic Regression Model to the WOE-Transformed Variables.

    If you have any advice or suggestions w.r.t. the above please let me know.

    Regards

    ReplyDelete