A data set testing has 2 columns, OrderValue and TimeBlock. Calculate the mean of OrderValue for each TimeBlock. Assume that TimeBlock is a categorical attribute of the data.
Data manipulation using plyr. This is how plyr names the functions that work on your data: the first letter is the type of the input, the second letter is the type of the output, and "ply" stands for "apply". Here "d" stands for a data frame, "l" for a list and "a" for an array.

So, you have a data frame "testing" and want to split it into pieces by the value of TimeBlock, then apply a specific operation to each of the pieces or chunks, and finally return the results of each chunk operation as a list of results. Note that the list will have as many results as there are chunks in our split data. To do that we will use a function called dlply(), i.e. take a data frame, apply an operation on each of the chunks and return a list of results.

If our data was not in a data frame, or we wanted a different kind of output, we could pick the matching function on the fly depending on what the input is and what you want as output.

Table 1: A cheatsheet to generate functions (rows: input type, columns: output type)

input \ output   array   dataframe   list
array            aaply   adply       alply
dataframe        daply   ddply       dlply
list             laply   ldply       llply

The arguments we need are:
• .data - data (a data frame in our case) to be processed
• .variables - splitting variables (TimeBlock in our case)
• .fun - function called on each piece (mean in our case)

Splitting by TimeBlock and calculating a simple mean of OrderValue for each chunk, producing the output as a list of means:

    library(plyr)
    df <- testing
    # one mean of OrderValue per TimeBlock, returned as a list
    dlply(df, "TimeBlock", function(x) mean(x$OrderValue))
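To make the split, apply and combine steps concrete, here is a minimal sketch of the same idea in plain Python (standard library only, not plyr). The sample rows and the helper name mean_by_group are made up for illustration; they stand in for the "testing" data frame.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the `testing` data frame.
rows = [
    {"TimeBlock": "morning", "OrderValue": 10.0},
    {"TimeBlock": "morning", "OrderValue": 20.0},
    {"TimeBlock": "evening", "OrderValue": 30.0},
]

def mean_by_group(rows, key, value):
    """Split rows by `key`, apply mean to `value` in each chunk, combine into a dict."""
    # Split: collect the values of each chunk under its group label.
    chunks = defaultdict(list)
    for row in rows:
        chunks[row[key]].append(row[value])
    # Apply and combine: one result per chunk, like the list dlply returns.
    return {group: mean(values) for group, values in chunks.items()}

means = mean_by_group(rows, "TimeBlock", "OrderValue")
print(means)
```

As with dlply, the result has exactly one entry per distinct TimeBlock value.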
A csv file has a list of salary brackets for employees. How would you handle missing values and garbage values to get clean salaries from that column?
Say our file is Test.csv and has a column 'Salary'. Our Python script looks like:

    import pandas as pd

    df = pd.read_csv('Test.csv')
    # fill missing values with an empty string so the string operations below work
    df['Salary'] = df['Salary'].fillna('').astype(str)
    # remove letters, commas, $ signs and spaces from the salary range
    df['Salary'] = df['Salary'].str.replace(r'[a-zA-Z,$\s]', '', regex=True)
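After this cleaning the column still holds strings, so a possible next step is to turn each bracket into numbers. The sketch below uses only the standard library and assumes brackets are written as "low-high"; the function names clean_salary and bracket_bounds and the sample values are hypothetical, not part of the original script.

```python
import re

# Same characters the script above strips: letters, commas, "$" and whitespace.
_JUNK = re.compile(r"[a-zA-Z,$\s]")

def clean_salary(raw):
    """Strip junk characters from one raw salary value; None becomes ''."""
    if raw is None:
        return ""
    return _JUNK.sub("", str(raw))

def bracket_bounds(cleaned):
    """Split a cleaned 'low-high' bracket into two ints; None if it does not parse."""
    parts = cleaned.split("-")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return int(parts[0]), int(parts[1])

# Hypothetical raw values: a noisy bracket, a missing value, a clean one, garbage.
samples = ["$50,000 - $60,000 per year", None, "70000-80000", "N/A"]
cleaned = [clean_salary(s) for s in samples]
bounds = [bracket_bounds(c) for c in cleaned]
print(cleaned)
print(bounds)
```

Values that do not parse come back as None, so garbage rows can be counted or dropped explicitly instead of silently becoming zero.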
A man has 15 Husky dogs for his dog cart to get to town from his village. On each trip to the town, any dog can get tired and faint. The probability of any dog failing is 10%. The dogs are independent, and the failure of one dog does not impact the probability of failure of other dogs. The dogs rest and are fed after each trip, so they recover from any fatigue. (a) If 6 or more dogs fail, the entire cart will shut down. What is the probability of the cart shutting down on any one trip? (b) What is the probability that the cart runs successfully at least ten times before shutting down?
Solution Part (a)
Any one trip is a binomial experiment over the 15 dogs, where each dog either fails or does not. Hence
p: the probability that an individual dog fails = 0.1
n: the number of trials (dogs) in the binomial experiment = 15
x: the number of failures that result from the binomial experiment; the cart shuts down when x ≥ 6
Probability of 5 failures or fewer = $P(X \le 5;\ n = 15,\ p = 0.1) = 0.9977503 = 99.77503\%$
Probability of a cart shutdown = $1 - 0.9977503 = 0.0022497 = 0.22497\%$

Solution Part (b)
Any one trip has a shutdown probability of 0.22497%, and because the dogs are refreshed after each trip, the trips are independent. Hence we can treat a series of runs as a binomial experiment itself:
p: the probability that a trip succeeds = 0.9977503
n: the number of trials (trips) in the binomial experiment = 10
x: the number of successful trips = 10
Probability of 10 successes in the first 10 runs = $b(10;\ 10,\ 0.9977503) = \binom{10}{10} (0.9977503)^{10} (0.0022497)^{0} = 0.97772939079 \approx 97.77\%$
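The figures in this solution can be checked numerically. The following is a small sketch using only Python's standard library; the helper name binom_cdf and the variable names are chosen here for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n_dogs = 15    # dogs pulling the cart
p_fail = 0.1   # chance that any single dog fails on a trip

# Part (a): the trip succeeds when at most 5 dogs fail.
p_trip_ok = binom_cdf(5, n_dogs, p_fail)
p_shutdown = 1 - p_trip_ok

# Part (b): the first ten trips must all succeed, and trips are independent.
p_ten_ok = p_trip_ok ** 10

print(f"P(trip succeeds)      = {p_trip_ok:.7f}")
print(f"P(shutdown on a trip) = {p_shutdown:.7f}")
print(f"P(10 trips succeed)   = {p_ten_ok:.7f}")
```

Part (b) reduces to raising the single-trip success probability to the tenth power because the dogs recover fully between trips.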