Tutor profile: Amanda K.
Subject: R Programming
Help! How do I use ggplot?
First, make sure you've installed the ggplot2 package using install.packages("ggplot2") and loaded it using library(ggplot2). Next, make sure your data are in a data frame (if you run class(yourdata), it should return "data.frame") with meaningful column names (check by running colnames(yourdata)) and in long format. Long format means that each row of your data frame is a single observation under a specific condition, and each column holds one type of variable. For example, if you have data measuring fish activity under three different water pH levels, your columns would be "subjectID", "activity", and "condition", and the first rows would be (1, 10, "lowpH"), (1, 20, "neutralpH"), (1, 30, "highpH"), (2, 15, "lowpH"), (2, 12, "neutralpH"), (2, 20, "highpH"). Note that each individual is represented in three separate rows, once for each condition.
If your data are in short (wide) format, you can convert them to long format using the melt function in the reshape2 package. Short data generally have more columns and fewer rows. For example, the previous data in short format would have columns "subjectID", "lowpH", "neutralpH", and "highpH", with rows (1, 10, 20, 30) and (2, 15, 12, 20), so each row holds the values across all conditions for one individual.
Once your data are in long format, it's time to start plotting! All types of ggplot plots have the same general format: ggplot()+plottype(data=yourdata, aes(x=col1, y=col2, group=col3, fill=col4, color=col5, size=col6, alpha=col7...)). In this format, "yourdata" is the name of your data frame and aes() specifies which columns of "yourdata" are mapped to which features of the plot. So for example, if my data frame is called "df" with column names "treatment", "gender", and "value" and I want to make a boxplot of value grouped by treatment, I would do ggplot()+geom_boxplot(data=df, aes(x=treatment, y=value)).
This will plot treatment as a categorical variable on the x-axis and value as a continuous variable on the y-axis. By mapping more columns to "fill" or "color", I can make grouped plots. For example, ggplot()+geom_boxplot(data=df, aes(x=treatment, y=value, fill=gender)) will create one boxplot for each combination of treatment and gender (4 boxplots if there are two treatments and two genders), where x-axis location indicates treatment type and fill color indicates gender. Change the type of plot by changing geom_boxplot to a different function, such as geom_bar for bar plots, geom_point for scatterplots, geom_smooth for fitted lines, and more! Don't forget you can type ? followed by a function name (for example, ?geom_boxplot) in the RStudio console to pull up its documentation, or search RDocumentation online, to find all the types of plots available and their specific input requirements.
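To tie the steps above together, here is a minimal sketch that starts from the fish data in short format, converts it to long format with melt, and then draws the two kinds of boxplot described. The fish values are the ones from the example above. The object names (fish_wide, fish_long, p_fish, p_grouped) and the simulated contents of "df" are my own assumptions for illustration, not real data.

```r
# Illustrative sketch: object names are made up, and "df" holds simulated values.
library(ggplot2)
library(reshape2)

# Short (wide) format: one row per fish, one column per pH condition
fish_wide <- data.frame(
  subjectID = c(1, 2),
  lowpH     = c(10, 15),
  neutralpH = c(20, 12),
  highpH    = c(30, 20)
)

# Convert to long format: one row per fish per condition
fish_long <- melt(
  fish_wide,
  id.vars       = "subjectID",   # column(s) identifying the individual
  variable.name = "condition",   # new column holding the old column names
  value.name    = "activity"     # new column holding the measurements
)

# Boxplot of activity (y) for each condition (x)
p_fish <- ggplot() +
  geom_boxplot(data = fish_long, aes(x = condition, y = activity))

# Hypothetical data frame shaped like the "df" example (simulated values)
set.seed(1)
df <- data.frame(
  treatment = rep(c("control", "drug"), each = 20),
  gender    = rep(c("female", "male"), times = 20),
  value     = rnorm(40, mean = 50, sd = 10)
)

# Grouped boxplot: x position shows treatment, fill color shows gender
p_grouped <- ggplot() +
  geom_boxplot(data = df, aes(x = treatment, y = value, fill = gender))

print(p_fish)
print(p_grouped)
```

Running head(fish_long) after the melt step is a quick way to confirm that each fish now appears in three rows, once per condition. With only two fish the first plot is not very informative, but the same code should work unchanged on a full data set.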
The following paragraph was written by a non-native English speaker. Rewrite the paragraph to effectively convey the material in proper English. I collected data from 27 individuals. Sick patients in hospital for bacterial or viral infection, in normal room for comfort. I asked the patients list of questions (Supplement B) with Likert scale. I and patients did not know patient diagnosis before questioning to remove bias. After completed data collection, R was used for analysis. T-test was used to compare scores in viral infection patients vs. not.
I used a Likert scale questionnaire (see Supplement B) to collect data from 27 sick individuals in the hospital diagnosed with either viral or bacterial infections. To help reduce bias, the study was performed double-blind - neither I nor the patients knew their diagnoses prior to the survey. All patients were surveyed in their own normal hospital rooms for comfort. After collecting data, I used R statistical analysis software to analyze the data using a t-test to compare survey responses in patients with viral infections versus bacterial infections.
Every month for two years, birds will be caught, measured, tagged, and released to detect changes in bird health. A new power plant will be constructed in the birds' habitat during the two-year span, and researchers are specifically interested in how the plant will affect the birds. Birds will be uniquely tagged so individuals caught once can be identified during subsequent catches. The field scientists are capable of taking a wide variety of measurements of the birds, and they have come to you, a statistician, to help them develop an experimental design and statistical analysis plan. What are your recommendations and analysis plan? Be as specific as possible.
I would recommend that the field researchers collect a variety of data each time the birds are caught, including quantitative variables such as weight, length, and estimated age; categorical variables such as color, energy level, feather appearance, and whether the bird appears to have a disease; and any other variables the researchers believe would be indicative of overall health. If possible, researchers should develop a rating system to measure a variety of variables for each bird (for example, bird feather quality could be rated on a scale from 1 to 5, with 5 being perfect feather quality and 1 being extremely poor feather quality). Ideally, researchers should capture as many birds as possible and, more importantly, recapture as many of the same individuals as possible at each catch, to maximize sample size.
My concerns about capture are that the sample may not be a truly random sample from the population, and in particular that captured birds may become trap-averse over time. I would talk to the researchers about helping to ensure the sample is random, possibly by sampling at multiple random locations, and about carefully disguising traps to reduce trap-aversion.
After data collection, statistical analyses could include regression analyses to track health indicators over time across the samples. Key assumptions of parametric linear regression are normality of the residuals and no relationship between the residuals and the fitted values. If these assumptions are violated, I would first attempt to transform the data and then, if that fails, move on to non-parametric regression methods. Further analyses could include before/after tests on data collected before and after plant construction. If possible, the ideal test would be a paired t-test in which each bird contributes a pair of values: its mean measurement across all time points before construction and its mean measurement across all time points after construction. If there are insufficient matched pairs (i.e. too few individual birds were captured both before and after construction), I would instead pool the data and perform a two-sample t-test to compare, for example, average bird weight before and after construction. In either case, a key assumption of t-tests is normality: of the paired differences for the paired test, and of the data in each sample for the two-sample test. If a histogram and Q-Q plot indicate severe deviations from normality, I would instead use a non-parametric alternative: the Wilcoxon signed-rank test for paired data, or the Wilcoxon rank-sum test for two independent samples.
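As a rough sketch of how this analysis plan might look in R, the code below runs the normality checks, the paired and two-sample comparisons with their non-parametric alternatives, and a simple regression with residual diagnostics. All of the data are simulated purely for illustration, and the variable names (before, after, month, weight) are my own assumptions. The regression also treats every measurement as independent, which is a simplification when the same birds are recaptured.

```r
# Simulated data for illustration only; none of these numbers come from a real study.
set.seed(42)
n_birds <- 30

# Hypothetical mean weight (g) per recaptured bird before and after construction
before <- rnorm(n_birds, mean = 100, sd = 8)
after  <- before + rnorm(n_birds, mean = -3, sd = 5)

# Normality check on the paired differences: histogram and Q-Q plot
diffs <- after - before
hist(diffs, main = "Paired differences (after - before)")
qqnorm(diffs)
qqline(diffs)

# Paired comparison and its non-parametric alternative (Wilcoxon signed-rank)
paired_t  <- t.test(after, before, paired = TRUE)
paired_np <- wilcox.test(after, before, paired = TRUE)

# Two-sample comparison for pooled data (Welch's t-test is R's default)
# and its non-parametric alternative (Wilcoxon rank-sum)
pooled_t  <- t.test(after, before)
pooled_np <- wilcox.test(after, before)

# Simple linear regression of a health indicator over the 24 months
month  <- rep(1:24, each = 5)
weight <- 100 - 0.2 * month + rnorm(length(month), sd = 5)
fit    <- lm(weight ~ month)

# Residual diagnostics: residuals vs. fitted values, and a Q-Q plot of residuals
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit))
qqline(resid(fit))

print(paired_t)
print(summary(fit))
```

In a real analysis, only one of the two comparison approaches would be used, depending on how many birds were recaptured both before and after construction.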