College Fees Predictive Analysis using R
Introduction
The goal of this article is to build a linear regression model that predicts college tuition fees from parameters such as the percentage of students who graduate, the student-to-faculty ratio, and the percentage of faculty with a Ph.D.
Language used is R
Data :- “Tuition_Dataset_Training.csv” (1150 rows, 11 columns)
[Open source Data]
Phase 1:- Data Processing
Data processing is the first step of data analysis. It is important to prepare the data so that it does not contain missing values, outliers, or invalid entries.
Data Types :- The data type for each column in the given data set is as below:-
Missing Values :- Missing values lead to problems in our data analysis methods. We can identify the number of missing values in each column by looking at the summary.
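As a minimal sketch of this check, the snippet below counts the NA values per column. The small data frame is a made-up stand-in for the training data, and `tuition` is an assumed column name; the real file would be loaded with `read.csv()` instead.

```r
# Toy data frame standing in for the training data (values are made up).
df <- data.frame(
  tuition  = c(12000, NA, 9500, 15000),
  sf_ratio = c(14.2, 11.8, NA, NA),
  pct_phd  = c(78, 85, 64, 90)
)

# Number of NA values in each column.
na_counts <- colSums(is.na(df))
print(na_counts)

# summary() also lists an "NA's" entry for every column that has missing values.
print(summary(df))
```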
Missing Values Handling:- There are many ways to handle missing values, such as:-
- Remove the rows that contain NA values
- Replace NA values with some constant
- Replace NA values with the mean, median, etc. of the column
Here I am replacing the missing values with the column means.
After replacing the missing values, the data summary looks like this :-
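Mean replacement can be sketched as follows. The helper name `impute_mean` and the toy data are illustrative assumptions, not the article's original code.

```r
# Toy data frame with missing values (values are made up).
df <- data.frame(
  tuition  = c(12000, NA, 9500, 15000),
  sf_ratio = c(14, 12, NA, NA)
)

# Replace NA entries of a numeric column with the mean of its observed values.
# Non-numeric columns are returned unchanged.
impute_mean <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  }
  x
}

# Apply the helper to every column while keeping the data frame structure.
df[] <- lapply(df, impute_mean)
print(df)
```

Mean replacement keeps every row but shrinks the variance of the imputed columns, which is worth keeping in mind when many values are missing.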
Outliers:- Outliers are values that lie at the extreme low or high end of the data, and they may represent errors. Because they can change the result of the analysis completely, it is important to handle them. To identify outliers we may use different methods, such as scatter plots and histograms.
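As a numeric complement to the plots, here is a small sketch using base R's `boxplot.stats()`. It flags points lying more than 1.5 times the hinge spread beyond the hinges. This is the standard boxplot rule, used here for illustration; it is not necessarily the check applied in this analysis, and the data are made up.

```r
# Made-up values with one obvious extreme point.
x <- c(10, 12, 11, 13, 12, 11, 95)

# boxplot.stats() returns the points a boxplot would draw as outliers in $out.
out <- boxplot.stats(x)$out
print(out)

# hist(x) or boxplot(x) would show the same point visually.
```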
Scatter Plot:-
To handle outliers, I applied winsorization of .25% to .975%.
Descriptive Summary:- The descriptive summary of the data after handling outliers and missing values is as below:-
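Winsorization caps extreme values at chosen quantiles instead of dropping them. The sketch below uses only base R. The function name `winsorize` and the default cutoffs of 0.025 and 0.975 are illustrative assumptions; adjust them to the limits actually used.

```r
# Cap values below the lower quantile and above the upper quantile.
# The default cutoffs are an assumption for illustration.
winsorize <- function(x, lower = 0.025, upper = 0.975) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Made-up vector with one extreme value at the top.
x <- c(1:99, 1000)
w <- winsorize(x)

print(range(x))  # original range
print(range(w))  # range after capping
```

Unlike deleting rows, this keeps the number of observations unchanged, so the other columns of those rows remain available to the model.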
Phase 2:- Exploratory Data Analysis (EDA)
The frequency distribution of the data is as follows:-
Correlation:- The correlation matrix is given below:-
The variables highly correlated with tuition are as follows:-
In the df data set:-
df$graduat + df$pct_phd + df$alumni + df$public.private + df$sf_ratio + df$fulltime + df$fac_comp
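A ranking like the one above can be obtained from the correlation matrix. The sketch below runs on synthetic data: the column names `graduat` and `sf_ratio` come from the list above, `tuition` is an assumed name for the target column, and all values are simulated.

```r
set.seed(1)
n <- 50

# Simulated predictors and a target that depends on both of them.
df <- data.frame(
  graduat  = runif(n, 40, 95),
  sf_ratio = runif(n, 8, 25)
)
df$tuition <- 100 * df$graduat - 200 * df$sf_ratio + rnorm(n, sd = 500)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(df, use = "complete.obs")
print(round(cor_matrix, 2))

# Correlation of each predictor with tuition, ranked by absolute strength.
tuition_cor <- cor_matrix[, "tuition"]
tuition_cor <- tuition_cor[names(tuition_cor) != "tuition"]
ranked <- tuition_cor[order(abs(tuition_cor), decreasing = TRUE)]
print(ranked)
```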
Phase 3:- Linear Regression
Linear Regression:- Linear regression is a widely accepted approach for modelling the relationship between a dependent variable and one or more independent variables, also known as explanatory variables. I applied linear regression to model the relationship between the tuition fee and the other given variables.
This model is trained on the given data set. We can now use it to predict the tuition fee for other data sets.
Our model is significant, with an R value of 0.69.
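Fitting such a model with `lm()` might look like the sketch below. The data are simulated, only three of the predictors are used to keep it short, and `tuition` is an assumed name for the target column, so the numbers it prints are unrelated to the result reported here.

```r
set.seed(123)
n <- 200

# Simulated stand-in for the cleaned training data.
df <- data.frame(
  graduat  = runif(n, 40, 95),
  pct_phd  = runif(n, 30, 100),
  sf_ratio = runif(n, 8, 25)
)
df$tuition <- 1000 + 80 * df$graduat + 40 * df$pct_phd - 150 * df$sf_ratio +
  rnorm(n, sd = 1500)

# Ordinary least squares fit of tuition on the chosen predictors.
model <- lm(tuition ~ graduat + pct_phd + sf_ratio, data = df)

# Coefficient table with standard errors, t values and p-values.
fit_summary <- summary(model)
print(fit_summary$coefficients)

# R-squared, and multiple R (correlation between observed and fitted values).
r_squared  <- fit_summary$r.squared
multiple_r <- sqrt(r_squared)
print(c(r_squared = r_squared, multiple_r = multiple_r))
```

Passing `data = df` with bare column names in the formula, rather than `df$` terms, is what lets `predict()` later be applied to a new data frame that has the same column names.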
Using the Tuition_Dataset_Validation_Student.csv data set, after replacing the missing values with means, we predict the tuition fee for each entry and export the predictions to a CSV file, “Prediction.csv”.
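The prediction and export step can be sketched as below. Both data frames are simulated stand-ins, `tuition` and `predicted_tuition` are assumed column names, and the file is written to a temporary directory here rather than the working directory.

```r
set.seed(42)

# Simulated training data and a model fitted on it.
train <- data.frame(
  graduat  = runif(40, 40, 95),
  sf_ratio = runif(40, 8, 25)
)
train$tuition <- 2000 + 120 * train$graduat - 150 * train$sf_ratio +
  rnorm(40, sd = 300)
model <- lm(tuition ~ graduat + sf_ratio, data = train)

# Made-up validation rows with missing values.
validation <- data.frame(
  graduat  = c(60, NA, 85),
  sf_ratio = c(15, 12, NA)
)

# Replace missing values with the column means, as in the training phase.
validation[] <- lapply(validation, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

# Predict the tuition fee for each row and export the result.
validation$predicted_tuition <- predict(model, newdata = validation)
out_path <- file.path(tempdir(), "Prediction.csv")
write.csv(validation, out_path, row.names = FALSE)
```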