Reference no: EM133147833 , Length: word count:800
ITECH3101 Business Analytics and Decision Support - Federation University
Exercise - SAS programming -1 Working with data
Project 1: Knowing INPUT statement
The INPUT is part of the DATA step and is used to read raw data. When reading data, we list the variable names after INPUT keyword, in the order variable names appearing in the data file.
1. Reading data values separated by spaces
This form of input is referred to as list input. In this case, data values must be separated by one or more space, and missing values are represented by periods. So we must read very variable on a line.
After clicking Run tap, we get:
2. Reading data values separated by commas
Data values are separated by commas. In this case, we have two choices to input data.
(1) DLM =option The comma and tab characters are common delimiters found in data files and you can enclose the delimiter character in quotation marks after the DLM=option( for example, DLM=',').
Download Dataset: Books.dat from Week 7 in Moodle and save it under myfolders on your computer.
In above code, infile statement directs the input statement to read an external file called Books.dat and to use commas as data delimiter. All the data values in Books.dat are separated by commas.
Grace,3,1,5,2,6
Martin, 1,2, 4,1,3
Scott,9,10,4,8,6
After clicking Run tap, we get:
(2) DSD option This option does three things. First, it ignores delimiters in data values enclosed in quotation marks and will strip off the quotes in the process. This option is ideal for reading comma-separated variables(CSV) files. Second, it does not read quotation marks as part of data value. Third, it treats two delimiters in a row as a missing value.
Download Dataset: Books-2.csv from Week 7 in Moodle and save it under myfolders on your computer.
In above code, infile statement directs the input statement to read an external file called Books-2.csv and to use commas as data delimiter. All the data values in Books-2.csv are separated by commas. When using DSD option, we add missover option if there is any missing data at the end of your data lines. The missover tells SAS that if it runs out of data, don't go to the next data line to continue reading. Notice that for Martin, value 1 has a quotes and the second value is missed.
Grace,3,1,5,2,6
Martin, "1", , 4,1,3
Scott,9,10,4,8,6
After clicking Run tap, we get:
3. Reading data values using informats with list input
In some cases, we want to read dates that are non-standard data. Using date informats, SAS will convert conventional forms of dates like 10-31-2013 or 31OCT13 into a number, the number of days since January 1, 1960. This number is referred to as a SAS date value. The three types of informats have the following general forms:
Character Numeric Date
$informatw. informatw.d informatw.
The $ indicates character informats; informat is the name of the informat; w is the total width, and d is the number of decimal places (numeric informats only). The period is a very important part of the informat name.
One of methods is to precede the INPUT statement with an INFORMAT statement where we assign an informat to each variable.
In above code, variable names such as dob and visit have an informat MMDDYY10. and each variable has 10 columns wide.
After clicking Run tap, we get:
Another method is to supply the informats directly in the INPUT statement.
In above code, variable name dob uses colon modified informats. The colon modifier tells SAS to read for the length of the informats ( 10 for dob), or until it encounters a delimiter, whichever comes first.
After clicking Run tap, we get:
4. Reading data values using column input
We can specify which columns to read for each data values. The syntax is to list the variable names, followed by the column or columns to read.
In above code, data values for variable name id are in columns 1 through 3, values for variable name gender are in columns 4, values for variable name height are in columns 5 through 6, values for variable name weight are in columns 7 through 11. Note that value of height for the third observation in data is missing.
After clicking Run tap, we get:
5. Reading data values using formatted input
Formatted input indicates that we specify a starting column and an INFORMAT(which specifies how many columns to read). This is an alternative approach to specify the starting and ending columns. This approach is specially useful when we are given a coding layout like the following:
Variable Starting column Length Type Description
id 1 3 num subject id
gender 5 1 char gender M=male
F=female
age 7 2 num age in years
height 10 2 num height in inches
v_date 13 8 date visit date in
MMDDYYYY
In above code, the @n symbol, called column pointer, indicates the staring column for each variable, where n is the number of the column SAS should move to.
After clicking Run tap, we get:
Example:
Download Dataset: NatPark.dat from Week 7 in Moodle and save it under myfolders on your computer.
The dataset NatPark includes information about U.S. national parks: name, state, year established and size in acres:
Yellowstone ID/MT/WY 1872 4,065,493
Everglades FL 1934 1,398,800
Yosemite CA 1864 760,917
Great Smoky Mountains NC/TN 1926 520,269
Wolf Trap Farm VA 1966 130
In above code, the column pointer @40 tells SAS to move to column 40 before reading the value for Acreage.
After clicking Run tap, we get:
Project 2: Reading data from external files
Task: This project will primarily focus on reading data from external files where data in text files are separated by a delimiter such as spaces, comma, and other delimiters.
1. Reading CSV files
CSV files(comma-separated values) use commas to separate data values and interpret two commas in a row to mean that there is a missing value for that data field.
Download Dataset: Comma_Delimiter.txt from Week 7 in Moodle and save it under myfolders on your computer.
The dataset Comma_Delimiter.txt includes information; 103-34-7654,Daniel,Boone,M,56,PA
676-10-1020,Fred,Flintstone,M
454-30-9999,Tracie,Wortenberg,F,34,NC 102-87-8374,Jason,Kid,M,23,NJ
888-21-1234,Patrice,Marcella,F,,TX 788-39-1222,Margaret,Mead,F,77,PA
Note that there are missing value in observations 2 and 5.
In above code, dsd (Delimiter-Sensitive Data) option is added on the inflile statement to read data from a CSV file. This option does several things: first, it understands that the data values in the file are separated by commas. Next, it understands that two commas
indicate a missing value. In addition, if file contains a data value in quotation marks(e.g., a state name like' New York' that consists of two words separated by a blank space), it will ignore any delimiters inside the quotes and strip the quotes.
The keyword missover option comes into play when there are missing values at the end of the data line and you are using list input. This option tells the program to set each of these variables to a missing value. This is not necessary if file has data values for every variable for every line of data.
After clicking Run tap, we get:
2. Reading data from Excel files
Download Dataset: Grades.xls from Week 7 in Moodle and save it under myfolders on your computer.
We can use built-in Utility option on the Tasks tab of SAS studio to read Excel data.
Step 1. Click on the Snippets tab in SAS Studio and select Data from the drop-down list:
Step 2: select Import XLSX file( suitable for XLS or XLSX files). This brings up the following code snippet
Step 3: Enter the name of your Excel file to replace <Your XLSX File>. Your import program should look like this:
Please note that if your file is an older XLS file instead of the newer XLSX file, just replace the DBMS=XLSX with DBMS=XLS.
After clicking Run tap, we get:
Project 3: Data sets processing
In this project, we work on processing data sets. Basic operations include selecting a subset from one data set to form another, and combing data from several data sets into a single a data set.
1. Create subset
We use SET statement to read observations from a old SAS data set to create a new SAS data set.
If we only want to look at a small portion of a large data set, we can create a subset of the large data set. Starting with DATA statement, we specify name of new data set, and then follow with SET statement specifying name of old data set that we want to read.
DATA new-data-set;
SET old-data-set;
(a) use IF statement
We can use if statement to create a subset. a simple example:
data women;
set ALL;
if gender='F';
run;
Data set ALL contains a variable, gender, which has values of 'M' and 'F'. Any observation with gender equals to 'F' will be written to the new data set women.
Example
Download Dataset: Shakespeare.dat from Week 7 in Moodle and save it under myfolders on your computer.
In above code, if statement selects only comedies. After clicking Run tap, we get:
(b) Use where statement
Where = data set option is the most flexible of all ways to subset data. We can use it in DATA septs or PROC steps. The basic form of a WHERE = data set option is
Where = (condition).
Only observations satisfying the condition will be used by SAS. The previous program using a where statement would be:
data women;
set ALL;
where gender='F';
run;
Example
Download Dataset: Artists.dat from Week 7 in Moodle and save it under myfolders on your computer.
In above code, where statement selects only impressionist painters. After clicking Run tap, we get:
Summary
1. If your data are in a raw data file, then you can read part of the file using multiple INPUT statements.
2. Subsetting if statement works only in DATA steps, and WHERE statement can work in both DATA steps and PROC steps.
3 Using a WHERE statement in a procedure does not create a new data set.
2. Combining different data from multiple SAS data sets
(a) Using SET statement to stack datasets
With two or more data sets, SET statement can concatenate or stack the data set one on top of the other. In DATA step, first specify name of new SAS data set, and then list names of old data sets you want to combine in the SET statement.
DATA new-data-set
SET old-data-set-1 old-data-set-2
Example
Download Dataset: South.dat and North.dat from Week 7 in Moodle and save it under myfolders on your computer.
In above code, the first two parts of the code read raw data from the south and north data sets into SAS data set. The third part combines the two SAS data sets using a SET statement.
After clicking Run tap, we get:
Project 4. Answering questions (Please do this at your home by using your own computer)
1. Briefly describe five techniques (or algorithms) that are used for classification modeling
2. Given the three lines of data:
1,3,5,7
2,4,6,8
9,8,7,6
write a SAS DATA step to read these data, assigning the four data values to the variables X1 to X4. (hint: you may use infile datalines statement).
Project 5. Creating a Professional Report
Summarize the above experiments procedure, results, answering questions and screenshots (project 1, 2, 3, 4) into one report. Your report is the assignment that is required to be submitted for evaluation on week 11. Create a report by following below steps.
You can add a chapter called Chapter 7 in your previous report.
1. Open your last week's report and find the end of last week's report.
2. Copy this week's related experimental results, your findings and Screenshots, and paste them at the end of last week's report.
3. Delete original Table of Content you created.
4. Select all content , align all text to both left and right margin
5. Use shortcut key approach, generate Chapter 7: SAS programming -1 Working with data
6. Then use shortcut key approach to generate proper sub-chapters for this week's lab work.
7. Insert Table of Contents to your report.
Attachment:- SAS programming.rar