# Tutor profile: Kevin G.

## Questions

### Subject: SAS

For this problem, use the SAS Studio dataset SASHELP.DEMOGRAPHICS. Test the hypothesis (at alpha = 0.05) that at least one of the regions has a significantly different population count than the others. Assume that the population counts are normally distributed.

We begin by copying the dataset into a working dataset with this code:

    data dm;                      /* Names the new dataset */
      set sashelp.demographics;   /* Reads from library.dataset */
    run;

Next, we run a one-way ANOVA using PROC ANOVA:

    proc anova data=dm;    /* Runs an ANOVA on the dm dataset */
      class region;        /* Sets the grouping variable */
      model pop=region;    /* pop is the dependent variable, grouped by region */
    run;
    quit;                  /* Ends the interactive procedure */

With a p-value of 0.0248, which is below alpha = 0.05, we reject the null hypothesis that all regions have the same mean population count. There is sufficient evidence to support the claim that at least one of the regions has a significantly different population count than the others.
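For reference, the hypotheses behind this test can be written out explicitly. This is a sketch in standard one-way ANOVA notation. The symbols k (number of regions), μ_i (mean population count of region i), and N (total number of observations) are introduced here for illustration and do not appear in the SAS code.

```latex
\begin{aligned}
H_0 &: \mu_1 = \mu_2 = \dots = \mu_k \\
H_a &: \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j \\
F &= \frac{MS_{\text{between}}}{MS_{\text{within}}} \sim F_{k-1,\,N-k} \ \text{under } H_0
\end{aligned}
```

The null hypothesis is rejected when the p-value attached to F falls below the chosen alpha, here 0.05.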

### Subject: R Programming

(Install and load the alr4 package in R and import the data file "mile" stored in that library; show your code.) Fit a separate regression model for males and females with "Year" as a predictor of "Time". Sketch the appropriate visualization in a single plot.

Importing the data requires this code:

    # install.packages("alr4")  # run once if the package is not yet installed
    library(ggplot2)
    library(alr4)
    data("mile")

I also load the ggplot2 library, since the question asks for a visualization. To create separate datasets, we can either use dplyr's filter() function or stick with base R's subset(). Using subset(), we proceed directly to the regressions:

    # Linear regression of Time on Year for the male subset
    model.male <- lm(Time ~ Year, data = subset(mile, Gender == "Male"))
    # Linear regression of Time on Year for the female subset
    model.female <- lm(Time ~ Year, data = subset(mile, Gender == "Female"))

To print the results, we run:

    summary(model.male)
    summary(model.female)

To visualize both fits in a single plot, we use ggplot(). Because color and shape are mapped to Gender, geom_smooth() draws a separate least-squares line for each group:

    ggplot(mile, aes(x = Year, y = Time, shape = Gender, color = Gender)) +
      geom_point(size = 3) +
      geom_smooth(method = "lm")
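The two subset fits can also be produced in one pass. The following is a minimal sketch, assuming the alr4 package is installed and that Gender takes the values "Male" and "Female" as in the code above. The names fit_by_gender and coefs are illustrative only.

```r
library(alr4)
data("mile")

# Fit Time ~ Year separately within each Gender group
fit_by_gender <- lapply(split(mile, mile$Gender),
                        function(d) lm(Time ~ Year, data = d))

# One column of coefficients (intercept and Year slope) per gender
coefs <- sapply(fit_by_gender, coef)
coefs
```

Each column should match the coefficients from the corresponding subset() fit, which makes it easy to compare the two slopes side by side.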

### Subject: Statistics

The manufacturer of an industrial printer reports that the mean area of tarpaulin the machine will print before it needs to undergo recalibration is 4,321 square feet, with a standard deviation of 120 square feet. The distribution of area printed per machine is not skewed and is approximately normal. The manufacturer wants to provide guidelines to potential customers as to how much area they can expect a machine to print without recalibration 95% of the time. What minimum area should they advertise?

Working this out requires mastery of areas under the normal curve. If you are still having trouble with this topic, you may want to watch this video first: https://www.youtube.com/watch?v=g7FVtTvFUgM

The manufacturer wants an area that a machine will reach 95% of the time, so we need the value x that 95% of the distribution lies above, which is the 5th percentile. We begin by finding the z-score with an area of 0.05 below it. This is valid because the distribution is given as approximately normal. Using R, we run:

    qnorm(0.05)

This gives the z-score for a lower-tail probability of 0.05. The result is:

    -1.644854

We now substitute these values (z = -1.644854, μ = 4,321 and σ = 120) into the z-score formula and solve for x, the minimum area that 95% of the distribution exceeds:

    z = (x - μ) / σ
    -1.644854 = (x - 4321) / 120
    x - 4321 = 120 * (-1.644854) = -197.3825
    x = 4321 - 197.3825
    x = 4123.618

This implies that the manufacturer should advertise that the machine will print at least 4,123.62 square feet before needing recalibration 95% of the time.
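As a quick check, the quantile can be computed directly by passing the mean and standard deviation to qnorm(). This is a minimal sketch, assuming "95% of the time" means that 95% of machines print at least the advertised area, so the target is the 5th percentile. The variable names are illustrative.

```r
mu <- 4321     # mean area printed before recalibration (sq ft), from the problem
sigma <- 120   # standard deviation (sq ft), from the problem

# Area exceeded by 95% of machines: the 5th percentile of N(mu, sigma)
min_area <- qnorm(0.05, mean = mu, sd = sigma)

# Check: probability that a machine prints at least min_area
coverage <- pnorm(min_area, mean = mu, sd = sigma, lower.tail = FALSE)

round(min_area, 2)
coverage
```

Under this reading, min_area comes to about 4123.62 square feet and coverage to 0.95. Using qnorm(0.95) instead would return the area that only 5% of machines reach.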