Data Analysis using the SAS Language/Examples

Sample code highlights features and demonstrates how to accomplish a task. Understanding the syntax of individual statements and procedures doesn't provide the high-level view needed for understanding SAS programs.

SAS is a kit full of tools and parts that are made to work well together. As you review these examples you will see how the data step and proc steps dovetail to provide a powerful analytical platform.

Analyzing data using by groups
This technique can be used with most SAS procedures. First, sort the data by a group identifier, then use the by statement in a procedure to do an independent analysis for each group.
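As a minimal sketch of the pattern, the following uses sashelp.class, a sample data set shipped with SAS, with proc means standing in for "most SAS procedures". The output data set name class_sorted is chosen here for illustration only.

```sas
/* Sort a copy of the sample data by the grouping variable. */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

/* The BY statement makes the procedure run once per value of sex. */
proc means data=class_sorted n mean std;
  by sex;
  var height weight;
run;
```

Writing the sorted rows to a new data set with out= leaves the original untouched, which is useful when the source library is read-only.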

Problem Description
Statistics were collected on different characteristics for a variety of wines from different vineyards and vintages. The goal, for this problem, is to build a regression model of wine sales using these characteristics. A separate model is developed for each vineyard using the same variables.

Explanation
See: proc sort for more information on sorting.

The data set must be sorted by the variable that we want to group each analysis by, in this case the vineyard. Sorting puts a note in the data set so other procedures know that the data set is sorted and by which variables.
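The sort and the by-group regression might look like the sketch below. It assumes a data set named wine already exists; that name, the output name wine_sorted, and the variables vineyard, sales, acidity, sugar and alcohol are all hypothetical stand-ins for whatever the real data contains.

```sas
/* Assumed input: work.wine with vineyard, sales, acidity, sugar, alcohol */
/* (hypothetical names, not taken from the original data).                */
proc sort data=wine out=wine_sorted;
  by vineyard;
run;

/* One regression model per vineyard, same variables in each. */
proc reg data=wine_sorted;
  by vineyard;
  model sales = acidity sugar alcohol;
run;
quit;
```

Proc reg is an interactive procedure, so the quit statement is what ends it.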

Conducting a T-test to determine change
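T-tests, or their equivalent, are available in several SAS procedures. Proc TTest compares one set of observations with another set using a class variable to distinguish each group. Used this way, it treats the two groups as independent samples, so it does not test the pre and post difference within the same set of subjects. (Current releases of the procedure also offer a paired statement for that case; this example instead computes the difference in a data step and tests it with proc means.)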

Problem Description
A sample of students was selected and measured using a test. Then an effect, or intervention, was applied and the students were tested again. Now we have two data sets, one for the pretest period and the other for the posttest period.

We need to calculate the difference between pre and post and test whether the difference is significant. This is a one-tailed test; however, SAS reports a two-tailed result, so the p-value needs to be adjusted to get the one-tailed result.
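The relationship between the two-tailed and one-tailed p-values can be sketched with the probt function, which returns the cumulative probability of the Student's t distribution. The t statistic and degrees of freedom below are made-up illustrative values, and the alternative hypothesis is assumed to be that scores increased.

```sas
data _null_;
  t_stat = 2.10;   /* hypothetical t statistic            */
  df     = 24;     /* hypothetical degrees of freedom, n-1 */

  /* Two-tailed p-value, as SAS reports it. */
  p_two = 2 * (1 - probt(abs(t_stat), df));

  /* One-tailed p-value for the alternative "mean difference > 0". */
  p_one = 1 - probt(t_stat, df);

  put p_two= p_one=;
run;
```

When the observed t statistic lies in the direction of the alternative, the one-tailed value is simply half the two-tailed value; when it lies in the opposite direction, it is one minus that half.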

Explanation
The two data sets, pretest and posttest, have the same variables, student_ID and score. In order to merge them, they must be sorted by student_ID and score must be renamed. Failure to rename score will cause the variable to have only the value from the posttest data set. There will be four variables in the final data set: student_id, pre, post and difference. Difference is obtained by subtracting pre from post; it is the change in score for each student from before until after the intervention. Proc means produces the Student's t statistic and the probability of getting a value at least this extreme given that the null hypothesis is true (i.e. mean difference = 0).
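The steps described above could be written roughly as follows. The data set and variable names pretest, posttest, student_id, score, pre, post and difference come from the explanation; the name of the merged data set, change, is an assumption.

```sas
/* Both inputs must be in student_id order before merging. */
proc sort data=pretest;
  by student_id;
run;

proc sort data=posttest;
  by student_id;
run;

/* Rename score on the way in so neither value overwrites the other. */
data change;   /* "change" is an assumed name */
  merge pretest  (rename=(score=pre))
        posttest (rename=(score=post));
  by student_id;
  difference = post - pre;
run;

/* T is the Student's t statistic for H0: mean = 0; PRT is its two-tailed p-value. */
proc means data=change n mean stderr t prt;
  var difference;
run;
```

A student who appears in only one of the two data sets will have a missing difference, and proc means leaves missing values out of its calculations.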

Purpose
Many times data for different parts of an organisation are delivered as separate files. These files may be organised the same way, with the same variables and perhaps the same format. Manually concatenating these files is not always convenient and adds to system overhead. Here is some code for automating the process. It also demonstrates some new statement options.

Explanation
An assumption has been made that all the reports are in a directory called reports on the C: drive and that these files have the extension txt. The pipe option lets SAS read from the output of the dir command (the list of files with the txt extension). A complete filename is created by prepending the path to the filename. This list of files is stored in a SAS data set, dir_rpts, which is the input for the next step. The filevar option on the infile statement gives the name of the variable that contains the filename stored in the dir_rpts data set. The variable eof is false until the end of the current file is reached. Once it is reached, the next filename is obtained from dir_rpts, that file is read, and the process repeats. The data step ends when all the filenames in dir_rpts have been processed.
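A sketch of the two steps, under the assumptions stated in the explanation (Windows, files in C:\reports, extension txt), is shown below. The names dir_rpts and eof come from the text; the fileref dirlist, the placeholder fileref dummy, the variables name, fname and line, and the output data set all_reports are assumptions. The layout of the reports is not described, so each record is read as one character string; a real program would replace that input statement with one matching the files.

```sas
/* Read the output of the dir command; /b lists bare filenames only. */
filename dirlist pipe 'dir /b C:\reports\*.txt';

data dir_rpts;
  infile dirlist truncover;
  length fname $ 260;
  input name $char200.;
  fname = cats('C:\reports\', name);   /* path + filename */
  keep fname;
run;

data all_reports;
  set dir_rpts;                        /* one filename per iteration */
  /* "dummy" is a placeholder; filevar= supplies the actual file.    */
  infile dummy filevar=fname end=eof truncover;
  do while (not eof);
    input line $char200.;              /* assumed layout: whole record */
    output;
  end;
run;
```

The pipe device requires that the SAS session is allowed to run operating system commands (the xcmd system option), which some server installations disable.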