Friday, March 29, 2013

Session # 10 - March 26, 2013 -- Plotting in R

IT BAL Assignment #10

Question 1: 

Create 3 vectors x, y, z, choosing random values for them and ensuring they are of equal length, and bind them together. Create 3-dimensional plots of the same.

Solution:

First, creating a random data set of 50 items with mean = 30 and standard deviation = 10

> data <- rnorm(50,mean=30,sd=10)
> data

Taking samples of length 10 from the created data set into three different vectors x, y, z
> x <- sample(data,10)
> x

> y <- sample(data,10)
> y

> z <- sample(data,10)
> z

Binding the three vectors x, y, z into a matrix T using cbind (note: T is also R's built-in shorthand for TRUE, so a more distinctive name would be safer in real code)
> T <- cbind(x,y,z)
> T

[Figure: Data Set]
Plotting the 3D graph

Command:

> library(rgl)   # plot3d() comes from the rgl package
> plot3d(T[,1:3])
[Figure: 3D Plot]
Plotting the graph with axis labels and color

Command 
> plot3d(T[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(500))

[Figure: 3D Plot with color]
Plotting the graph with axis labels, color, and type = spheres

Command
> plot3d(T[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(5000), type='s')
[Figure: 3D Plot with spheres]
Plotting the graph with axis labels, color, and type = points

Command

> plot3d(T[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(5000), type='p')
[Figure: 3D Plot with points]
Plotting the graph with axis labels, color, and type = lines

Command

> plot3d(T[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(5000), type='l')
[Figure: 3D Plot with lines]
Question 2:


Choose 2 random variables and create the following plots:
1. X-Y
2. X-Y|Z (introducing a variable z with 5 different categories and binding it to x and y)
3. Colour-coded graph
4. Smooth and best-fit line for the curve

Solution

Creating a data set for two random variables and then introducing a third variable z

Command:

> x <- rnorm(5000, mean= 20 , sd=10)
> y <- rnorm(5000, mean= 10, sd=10)
> z1 <- sample(letters, 5)              # pick 5 random category labels
> z2 <- sample(z1, 5000, replace=TRUE)  # assign one of the 5 categories to each observation
> z <- as.factor(z2)
> z
[Figure: Data Set]
Creating Quick Plots

Command:

> library(ggplot2)   # qplot() comes from the ggplot2 package
> qplot(x,y)
[Figure: x and y qplot]
> qplot(x,z)

[Figure: x and z qplot]
For a semi-transparent plot

> qplot(x,z, alpha=I(2/10))
[Figure: Semi-transparent Plot]
For a coloured plot

> qplot(x,y, color=z)
[Figure: Coloured plot]
For a logarithmic coloured plot (note: rnorm can produce values ≤ 0, for which log() returns NaN with a warning)

> qplot(log(x),log(y), color=z)
[Figure: Logarithmic Plot]
Best-fit and smooth curves using "geom"

Command:

> qplot(x,y,geom=c("path","smooth"))
[Figure: geom='path']

> qplot(x,y,geom=c("point","smooth"))
 
geom='point'
> qplot(x,y,geom=c("boxplot","jitter"))
[Figure: geom='boxplot' and 'jitter']

Friday, March 15, 2013

IT BAL LAB Session #8

Session # 8 :


In this session we learnt about panel data generation and its various models.

Panel data refers to a combination of multiple time series (one per cross-sectional unit) cascaded together.
The basic function used for panel data generation and estimation is plm().

The data set we have used in this session is "Produc".

The description for the same is as under.

- state: the state
- year: the year
- pcap: public capital stock
- hwy: highways and streets
- pc: private capital stock
- gsp: gross state product
- emp: labor input, measured by employment in non-agricultural payrolls
- unemp: state unemployment rate

Use the data set "Produc", a panel data set within the plm package, for panel estimations.

Assignment :
Estimate all 3 models and decide which model best fits the data set for panel estimation.

Solution :
Step 1: calculating the value for the pooling model
Step 2: calculating the value for the fixed-effects model
Step 3: calculating the value for the random-effects model
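The commands and outputs for these steps were shown as screenshots; as a minimal sketch, the three models could be estimated as below, assuming the log specification used in the plm package's own Produc examples (the exact formula used in the session may differ):

> library(plm)
> data("Produc", package = "plm")
> # pooling model: one common intercept for all states
> pooled <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc, index = c("state","year"), model = "pooling")
> # fixed-effects (within) model: a separate intercept per state
> fixed1 <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc, index = c("state","year"), model = "within")
> # random-effects model: state effects treated as random draws
> random1 <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc, index = c("state","year"), model = "random")
> summary(pooled); summary(fixed1); summary(random1)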




To choose the model that best fits the "Produc" data set, we need to run pairwise hypothesis tests among the 3 models and select the best fit at the end.

Test1 :
Between pooling and fixed model

Command :
pFtest(fixed1, pooled)



Test details:
H0 (null): the individual and time-based effects are all zero
Alternative hypothesis: at least one of the individual or time-based effects is non-zero

Since the p-value is very low, the null hypothesis is rejected: the test finds significant individual effects.

Hence the fixed-effects model is better than the pooling model.


Test2:
Between pooling and random model

Command :
plmtest(pooled)


Test details:
H0 (null): the individual and time-based effects are all zero (pooling model)
Alternative hypothesis: at least one of the individual or time-based effects is non-zero (random-effects model)

Since the p-value is very low, the null hypothesis is rejected.

Hence the random-effects model is better than the pooling model.


Test3:
Between fixed and random model

Command :


We use the Hausman test:
phtest(random1, fixed1)


Test details:
H0 (null): the individual effects are not correlated with any regressor (random-effects model is consistent)
Alternative hypothesis: the individual effects are correlated with the regressors (fixed-effects model)

Rejecting the null implies that the random-effects estimator is inconsistent. Since the p-value is very low, the null hypothesis is rejected.

Hence the fixed-effects model is better than the random-effects model.


Conclusion:
We can conclude that the fixed-effects model best fits the "Produc" data set for panel estimation, i.e. significant correlation is observed between the individual effects and the regressor variables, and an index impact exists.
Hence, we would choose the fixed-effects model to estimate the panel data presented by the "Produc" data set.

Wednesday, February 13, 2013

IT BAL Session #6

IT BAL :- Session 6

Assignment:
Question 1) Create log returns data (from 01.01.2012 to 01.01.2013) and calculate the historical volatility.
Question 2) Create an ACF plot for the log returns data, perform the ADF test, and interpret the results.

Commands:
> stockprice<-read.csv(file.choose(),header=T)
> head(stockprice)
> closingprice<-stockprice[,5]                       # column 5 holds the closing price
> closingprice.ts<-ts(closingprice,frequency=252)    # 252 trading days in a year
> # simple returns: (P_t - P_(t-1)) / P_(t-1)
> returns<-(closingprice.ts-lag(closingprice.ts,k=-1))/lag(closingprice.ts,k=-1)
> z<-scale(returns)+10                               # standardise, then shift so all values are positive before taking logs
> logreturns<-log(z)
> logreturns
> acf(logreturns)
From the ACF plot, we can see that the autocorrelations at non-zero lags lie within the 95% confidence bounds. Therefore, the time series appears stationary.
> T=252^0.5                  # square root of trading days per year, used to annualise
> historicalvolatility<-sd(logreturns)*T
> historicalvolatility
> library(tseries)           # adf.test() comes from the tseries package
> adf.test(logreturns)
From the test results, we can see that the p-value = 0.01 (< 0.05).
Therefore, we reject the null hypothesis of a unit root and accept the alternative hypothesis, which states that the time series is stationary.
 
 

Thursday, February 7, 2013

Session #5 - Feb 5, 2013

Session 5

Returns and Forecasting

Assignment 1: Find the returns of more than 6 months of NSE data, taking the 10th data point as the start and the 95th data point as the end. Also plot the results.

Solution:
Step 1: Read the data in the form of a CSV file for the period 1/12/2011 to 5/02/2013
Command:
 z<-read.csv(file.choose(),header=T)

Step 2: Choose the Close column.
Command:
 close<-z$Close

Step 3: Form a matrix of order 1 x 298, as 298 data points are available in close.
Command:
dim(close)<-c(1,298)

Step 4: Create a time-series object for the close data from elements (1,10) to (1,95)
Command:
close.ts<-ts(close[1,10:95],deltat=1/252)
Step 5: Calculate the difference between consecutive values
Command:
close.diff<-diff(close.ts)
Step 6: Calculate the return:
Command:
return<-close.diff/lag(close.ts,k=-1)
final<-cbind(close.ts,close.diff,return)
Step 7: Plot
Command:
plot(return,main="Return from 10th to 95th")
plot(final,main="Data from 10th to 95, Difference, Return")
Assignment 2: Data for observations 1-700 is available. Predict observations 701-850 using GLM estimation with logit analysis.

Solution :
Step 1: Read the data in the form of a CSV file

Command:
z<-read.csv(file.choose(),header=T)

Step 2: Check the dimensions of z
Command
dim(z)


Step 3: Choose rows 1-700
Command

 new<-z[1:700,1:9]

Step 4: Inspect the first few rows
Command
head(new)

Step 5: Convert ed to a factor and run the logit regression
Command

 new$ed <- factor(new$ed)   # treat education level as a categorical variable
 new.est<-glm(default ~ age + ed + employ + address + income, data=new, family ="binomial")   # binomial family uses the logit link by default
 summary(new.est)

Step 6: Predict probabilities for rows 701-850
Command
Prediction<-z[701:850,1:8]
 Prediction$ed<-factor(Prediction$ed)
 Prediction$prob<-predict(new.est, newdata =Prediction, type = "response")
 head(Prediction)
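As a quick follow-up (not part of the original assignment), the predicted probabilities could be converted into 0/1 classifications using an assumed 0.5 cut-off:

 Prediction$class<-ifelse(Prediction$prob > 0.5, 1, 0)   # hypothetical 0.5 threshold
 table(Prediction$class)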

Wednesday, January 23, 2013

Gaurav Bhattacharya :- BA Session 3

Session3 - Business Application Lab


ASSIGNMENT 1a:

Fit ‘lm’ and comment on the applicability of ‘lm’
Plot1: Residual vs Independent curve
Plot2: Standard Residual vs independent curve

> file<-read.csv(file.choose(),header=T)
> file
  mileage groove
1       0 394.33
2       4 329.50
3       8 291.00
4      12 255.17
5      16 229.33
6      20 204.83
7      24 179.00
8      28 163.83
9      32 150.33
> x<-file$groove
> x
[1] 394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33
> y<-file$mileage
> y
[1]  0  4  8 12 16 20 24 28 32
> reg1<-lm(y~x)
> res<-resid(reg1)
> res
         1          2          3          4          5          6          7          8          9
 3.6502499 -0.8322206 -1.8696280 -2.5576878 -1.9386386 -1.1442614 -0.5239038  1.4912269  3.7248633
> plot(x,res)
As the residual plot is parabolic (a systematic pattern rather than random scatter), a simple linear regression is not appropriate for this data.
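Plot2 (the standardised-residual plot) is not reproduced above; a minimal sketch using rstandard(), which returns the standardised residuals of an lm fit:

> stdres<-rstandard(reg1)
> plot(x,stdres)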


Assignment 1 (b) -Alpha-Pluto Data
Fit ‘lm’ and comment on the applicability of ‘lm’
Plot1: Residual vs Independent curve
Plot2: Standard Residual vs independent curve
Also do:
QQ plot
QQ line

> file<-read.csv(file.choose(),header=T)
> file
   alpha pluto
1  0.150    20
2  0.004     0
3  0.069    10
4  0.030     5
5  0.011     0
6  0.004     0
7  0.041     5
8  0.109    20
9  0.068    10
10 0.009     0
11 0.009     0
12 0.048    10
13 0.006     0
14 0.083    20
15 0.037     5
16 0.039     5
17 0.132    20
18 0.004     0
19 0.006     0
20 0.059    10
21 0.051    10
22 0.002     0
23 0.049     5
> x<-file$alpha
> y<-file$pluto
> x
 [1] 0.150 0.004 0.069 0.030 0.011 0.004 0.041 0.109 0.068 0.009 0.009 0.048
[13] 0.006 0.083 0.037 0.039 0.132 0.004 0.006 0.059 0.051 0.002 0.049
> y
 [1] 20  0 10  5  0  0  5 20 10  0  0 10  0 20  5  5 20  0  0 10 10  0  5
> reg1<-lm(y~x)
> res<-resid(reg1)
> res
         1          2          3          4          5          6          7
-4.2173758 -0.0643108 -0.8173877  0.6344584 -1.2223345 -0.0643108 -1.1852930
         8          9         10         11         12         13         14
 2.5653342 -0.6519557 -0.8914706 -0.8914706  2.6566833 -0.3951747  6.8665650
        15         16         17         18         19         20         21
-0.5235652 -0.8544291 -1.2396007 -0.0643108 -0.3951747  0.8369318  2.1603874
        22         23
 0.2665531 -2.5087486
> plot(x,res)
> qqnorm(res)
> qqline(res)
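As in part (a), the standardised-residual plot asked for in Plot2 could be drawn with the same sketch:

> plot(x,rstandard(reg1))

The QQ plot and QQ line check whether the residuals are approximately normal: points lying close to the line support the normality assumption behind lm.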


Assignment 2: Justify Null Hypothesis using ANOVA

> file<-read.csv(file.choose(),header=T)
> file

   Chair Comfort.Level Chair1
1      I             2      a
2      I             3      a
3      I             5      a
4      I             3      a
5      I             2      a
6      I             3      a
7     II             5      b
8     II             4      b
9     II             5      b
10    II             4      b
11    II             1      b
12    II             3      b
13   III             3      c
14   III             4      c
15   III             4      c
16   III             5      c
17   III             1      c
18   III             2      c
> file.anova<-aov(file$Comfort.Level~file$Chair1)
> summary(file.anova)

            Df Sum Sq Mean Sq F value Pr(>F)
file$Chair1  2  1.444  0.7222   0.385  0.687

Since Pr(>F) = 0.687 is well above 0.05, we fail to reject the null hypothesis: the mean comfort levels do not differ significantly across the three chair types.


Tuesday, January 15, 2013

BIS_LAB SESSION 2

Business Application IT Lab

IT BA lab Assignment #2


Session 2:

Today we learnt about the creation, inverse, transpose, and multiplication of matrices. Then we moved on to regression and residual analysis using NSE historical data for the NIFTY index over a certain period. Finally, we got an introductory idea of how to plot a normally distributed curve.

Assignment 1: 
Create two matrices, say of size 3 x 3, and select column 1 from one matrix and column 3 from the second matrix. After selecting the columns into objects, say x1 and x2, merge these two columns using cbind to create a new matrix.
Solution:
To create a matrix:
x <- c(1:9)
dim(x) <- c(3,3)
y <- c(10:18)
dim(y) <- c(3,3)
To select a column:
z1 <- x[ ,3]
z2 <- y[ ,2]
z3 <- cbind(z1,z2)
Output:
Assignment 2:
Multiply both the matrices.
Solution:
z <- x %*% y
Output:
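Note that %*% is R's matrix-multiplication operator, while a plain * would multiply the matrices element by element. A quick comparison (added here for illustration):
z <- x %*% y   # matrix product: rows of x times columns of y
w <- x * y     # element-wise product, for contrast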
Assignment 3:
Read historical data of the NIFTY index from NSE for the period 1st Dec 2012 to 31st Dec 2012. Find the regression and residuals.
Solution:
To read the csv file:
nse <- read.csv(file.choose(),header=T)
For finding the regression and residuals, the following commands are used:
reg <- lm(High ~ Open , data = nse)
residuals(reg)
Output:
Assignment 4:
Generate a normal distribution data and plot it.
Solution:
For creating the normally distributed data, the following commands are used:
x<-rnorm(40,0,1)
y<-dnorm(x)
For plotting the data
plot(x,y)
Output:
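Because the x values from rnorm are unsorted, plot(x,y) shows the bell shape as a scatter of points; a smoother alternative (a sketch, not part of the original session) is curve():
curve(dnorm(x), from = -4, to = 4)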

Tuesday, January 8, 2013

Session # 1 - 8 Jan 2013

An intro to the world of R:

R is an open-source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. R is an implementation of the S language; there are some important differences, but much code written for S runs unaltered. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.

Assignment 1:

Draw a histogram after concatenating 3 data points.

Soln : 
Commands used are as under -:

> x<-c(1,2,3)
> plot(x, type = "h")


[Figure: Histogram]
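Strictly, plot(x, type = "h") draws a vertical line at each data point rather than a binned histogram; for a true frequency histogram R also provides hist():

> hist(x)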





Assignment 2:  Drawing a line graph with points and naming the graph and the axis.
Soln : Let z be the variable that contains data from the .csv file selected.

Reading from the csv file is done as under -:  

> z<-read.csv(file.choose(), header=T)
This command prompts the user to select the data file from the saved location.

Let zcol1 be the variable that contains the contents of column 3 from the CSV data.
The following commands were used:
> zcol1<-z[,3]
> plot(zcol1 , type="b" , main="NSE Graph" , xlab="Time" , ylab="indices")
 
 


Assignment 3:
Create a scatter plot by using share HIGH and LOW values from the NSE Historical data as obtained from the .csv file.

Soln :
HIGH values are obtained as in the previous question:
> zcol1<-z[,3]
LOW values are in column 4 from the csv file
> zcol2<-z[,4]
To plot the scatter plot 
> plot(zcol1,zcol2)
 
Assignment 4 :
To find the volatility of the share values obtained from the NSE historical data and obtain the range for the same.

Soln :-
To obtain the volatility, we would require the maximum value amongst the HIGH values and the minimum value amongst the LOW values.
Merging both the columns into one vector variable 'y' to get the HIGH and LOW values together.
> y<-c(zcol1,zcol2)
> summary(y)
will give the min and the max values as under:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4888    5660    5723    5758    5884    6021

> range(y)

will give the desired range of volatility

[1] 4888.20 6020.75
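range() returns the minimum and maximum; their difference gives the width of the volatility range (a quick follow-up, added for illustration):

> diff(range(y))   # max minus min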


Assignment 5:
To create a matrix.
 
Soln:
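The solution screenshot is missing here; as a minimal sketch, a matrix can be created with the matrix() function (or with dim(), as in Session 2):

> m <- matrix(1:9, nrow = 3, ncol = 3)
> m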