Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

Data Science

 

Linear regression

 


In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

 

Example: the systolic blood pressure (SBP) of newborns is
6 times the age in days plus a random error:
SBP = 6 * age(d) + e
The random error e may be due to factors other than age in days
(e.g. birthweight).
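To make the role of the random error concrete, here is a minimal simulation sketch of the example above. The ages (1 to 10 days), the normally distributed error, and its standard deviation of 2.0 are assumptions chosen for illustration; they are not part of the example.

```python
import random

def simulate_sbp(ages_in_days, noise_sd=2.0, seed=0):
    """Generate SBP values as 6 * age + e, with e drawn from Normal(0, noise_sd).

    noise_sd and seed are illustrative assumptions, not values from the notes.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    return [6 * age + rng.gauss(0, noise_sd) for age in ages_in_days]

ages = list(range(1, 11))  # hypothetical ages: 1 to 10 days
sbp = simulate_sbp(ages)

for age, value in zip(ages, sbp):
    # The gap between the simulated value and 6 * age is the random error e.
    print(f"age={age:2d}  SBP={value:6.2f}  e={value - 6 * age:+.2f}")
```

Setting noise_sd to 0 removes the error term, so every point falls exactly on the line SBP = 6 * age.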

In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models.




It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).

Let us consider a dataset where we have a value of response y for every feature x:

X   0    1    2    3    4    5    6     7     8     9
Y   2.1  4.7  4.8  6.6  8.5  9.9  10.1  10.9  11.7  13.1

 

How can we use a model to describe this data? With simple linear regression (SLR).



Consider the model function y = α + β x, which describes a line with slope β and y-intercept α.

In general such a relationship may not hold exactly for the largely unobserved population of values of the independent and dependent variables; we call the unobserved deviations from the above equation the errors.

 

Suppose we observe n data pairs and call them {(xi, yi), i = 1, ..., n}. We can describe the underlying relationship between yi and xi, including the error term εi, by yi = α + β xi + εi.
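The true errors εi cannot be observed, because α and β are unknown. What we can compute, for any candidate pair (α, β), is the deviation of each observed yi from the candidate line. The sketch below does this for the ten data pairs in the table above; the two candidate lines are hypothetical values picked by eye, not fitted parameters.

```python
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.1, 4.7, 4.8, 6.6, 8.5, 9.9, 10.1, 10.9, 11.7, 13.1]

def residuals(alpha, beta, xs, ys):
    """Deviation of each observed y from the candidate line alpha + beta * x."""
    return [yi - (alpha + beta * xi) for xi, yi in zip(xs, ys)]

def sum_squared_residuals(alpha, beta, xs, ys):
    """Single-number summary of how far the candidate line is from the data."""
    return sum(r ** 2 for r in residuals(alpha, beta, xs, ys))

# Two hypothetical candidate lines (assumed values, chosen only for comparison).
for alpha, beta in [(3.0, 1.0), (2.0, 1.5)]:
    total = sum_squared_residuals(alpha, beta, x, y)
    print(f"alpha={alpha}, beta={beta}: sum of squared deviations = {total:.3f}")
```

Different candidate lines give different totals, which is what makes it possible to compare lines and ask which one fits best.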

 

 

The goal is to find estimated values α_hat and β_hat for the parameters α and β which would provide the "best" fit in some sense for the data points.

Here, the "best" fit will be understood as in the least-squares approach: a line that minimizes the sum of squared residuals. Question: why this criterion?
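For a single explanatory variable, the least-squares line has a well-known closed-form solution: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. A minimal sketch in plain Python, applied to the data table above (the function name fit_line is an illustrative choice):

```python
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.1, 4.7, 4.8, 6.6, 8.5, 9.9, 10.1, 10.9, 11.7, 13.1]

def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations of x and y, and of squared deviations of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_mean) ** 2 for xi in xs)
    beta_hat = s_xy / s_xx                   # slope
    alpha_hat = y_mean - beta_hat * x_mean   # intercept
    return alpha_hat, beta_hat

alpha_hat, beta_hat = fit_line(x, y)
print(f"y = {alpha_hat:.4f} + {beta_hat:.4f} x")
```

On this data the fitted intercept is roughly 3.05 and the fitted slope roughly 1.15.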


 

Write code that fits a linear model to the data above, then test it on the following data.

X   9    8    7    6    5    4    3     2     1     0
Y   2.1  4.7  4.8  6.6  8.5  9.9  10.1  10.9  11.7  13.1
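One possible reading of this exercise, sketched below, is to fit the line on the first table and then score its predictions on the second table using the sum of squared errors. This is an assumed workflow rather than the required solution, and the helper names are illustrative.

```python
# First table: used to fit the model.
x_fit = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y_fit = [2.1, 4.7, 4.8, 6.6, 8.5, 9.9, 10.1, 10.9, 11.7, 13.1]

# Second table: same Y values, X in reverse order.
x_test = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
y_test = [2.1, 4.7, 4.8, 6.6, 8.5, 9.9, 10.1, 10.9, 11.7, 13.1]

def fit_line(xs, ys):
    """Closed-form least-squares estimates (alpha_hat, beta_hat)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_mean) ** 2 for xi in xs)
    beta_hat = s_xy / s_xx
    return y_mean - beta_hat * x_mean, beta_hat

def sum_squared_error(alpha, beta, xs, ys):
    """Sum over all points of (actual - predicted) squared."""
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(xs, ys))

alpha_hat, beta_hat = fit_line(x_fit, y_fit)
sse_fit = sum_squared_error(alpha_hat, beta_hat, x_fit, y_fit)
sse_test = sum_squared_error(alpha_hat, beta_hat, x_test, y_test)

print(f"SSE on the first table:  {sse_fit:.3f}")
print(f"SSE on the second table: {sse_test:.3f}")
```

Because y rises with x in the first table and falls with x in the second, a line fitted to the first table predicts the second poorly, and the error on the second table is far larger.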

 




Here is the public dataset drawn from the U.S. Army Anthropometric Survey, from the University of Michigan.

Use your program to build a linear model on the following dataset.

Training Data

and test your model on the following dataset

Testing Data

Verify your predictions against ActualData
and calculate the sum of squared errors.

 

 

Other Performance Evaluation Functions
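Commonly used alternatives to the sum of squared errors include the mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R squared). Whether these are the specific functions intended here is an assumption; the sketch below shows how each is usually computed, using three made-up values purely for illustration.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same units as the response."""
    return math.sqrt(mean_squared_error(actual, predicted))

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences; less sensitive to large errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: share of the variance in actual explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only, not taken from any dataset in these notes.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]

print("MSE :", mean_squared_error(actual, predicted))
print("RMSE:", root_mean_squared_error(actual, predicted))
print("MAE :", mean_absolute_error(actual, predicted))
print("R^2 :", r_squared(actual, predicted))
```

Unlike the sum of squared errors, MSE, RMSE, and MAE are averages, so they can be compared across datasets of different sizes.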

 





 

 

How can we solve the problem with Math?
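One standard route, sketched here with the symbols used earlier in these notes, is to write the sum of squared residuals as a function of α and β, set both partial derivatives to zero, and solve the two resulting equations. Here x̄ and ȳ (written \bar{x} and \bar{y}) denote the sample means of the xi and yi.

```latex
\begin{align*}
S(\alpha, \beta) &= \sum_{i=1}^{n} \left( y_i - \alpha - \beta x_i \right)^2 \\[4pt]
\frac{\partial S}{\partial \alpha} &= -2 \sum_{i=1}^{n} \left( y_i - \alpha - \beta x_i \right) = 0 \\[4pt]
\frac{\partial S}{\partial \beta} &= -2 \sum_{i=1}^{n} x_i \left( y_i - \alpha - \beta x_i \right) = 0 \\[4pt]
\hat{\alpha} &= \bar{y} - \hat{\beta}\,\bar{x} \\[4pt]
\hat{\beta} &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The first condition gives the intercept in terms of the slope; substituting it into the second condition and simplifying gives the slope formula.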