September 29, 2023

  1.  I ran a model on a dataset to predict diabetes based on inactivity and obesity.
  2.  While working on it, I found the model was overfitting.
  3.  After identifying the overfitting, I used scikit-learn to perform k-fold cross-validation on the data.
  4.  The objective of cross-validation is to partition the sample observations from the training dataset into several subsets (folds).
  5.  The number of partitions is chosen based on the number of observations.
  6.  After splitting the data into K = 5 folds, I trained and evaluated the model K times, using a different fold as the validation set each time.
  7.  Performance metrics from each fold are averaged to estimate the model’s generalization performance.
  8.  Cross-validation helps to detect and mitigate overfitting to some extent.
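The steps above can be sketched with scikit-learn; the data below is a synthetic stand-in, since the actual diabetes/inactivity/obesity dataset is not reproduced in this entry.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder for the project data (two predictors, one target).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

model = LogisticRegression()

# 5-fold cross-validation: each fold serves once as the validation set.
scores = cross_val_score(model, X, y, cv=5)

# The generalization estimate is the average of the fold scores.
print(scores.mean())
```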

September 27, 2023

Today’s Lecture,

A. Mean Squared Error:-

  1. It tells us about the performance of the model.
  2. It penalizes larger errors more heavily, since each error is squared.
  3. It is commonly used as a cost function.
  4. Points that lie far away from the best-fit line are outliers.
  5. It is sensitive to outliers.

B. More Complex Model:-

  6. The complexity of the model depends on the parameters used in the model.
  7. More parameters mean more complexity.
  8. We saw the example of a quadratic equation, which shows how model complexity grows.
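A minimal sketch of how MSE penalizes large errors (the numbers here are made up purely for illustration):

```python
import numpy as np

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 12.0])  # the last point misses badly

# Mean squared error: squaring penalizes large errors disproportionately.
mse = np.mean((y_true - y_pred) ** 2)

# The error of 3.0 on the last point contributes 9.0 to the sum, dwarfing
# the 0.25 contributions of the first two points -- this is why MSE is
# sensitive to outliers.
print(mse)  # 2.375
```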

September 25, 2023

Topics covered in today’s lecture:

  1. There are two methods of sampling: A. Cross-Validation, B. Bootstrap.
  2. The difference between testing error and training error: the training error is often quite a bit lower than the test error.
  3. Training error is not a good surrogate to testing error.
  4. Understand the concept of overfitting through the graph of testing and training data.
  5. Bias – how far, on average, the model’s estimate is from the true value.
  6. Variance – how much the estimate varies around its average.
  7. Bias and variance together give the prediction error.
  8. K-fold validation – this is used to estimate the test error.
  9. In this technique, the data is divided into folds of equal size.
  10. True versus estimated test mean squared error for the data.
  11. The estimate of prediction error is biased upward; this is an issue with cross-validation.
  12. Bias can be minimized, but in that situation the variance will be high.
  13. Understand when we should use cross-validation.
  14. Right and wrong ways to apply cross-validation.
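The bootstrap method mentioned above can be sketched as follows; the sample is synthetic, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=100)  # hypothetical sample

# Bootstrap: resample with replacement many times and look at how the
# statistic of interest (here, the mean) varies across the resamples.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# The spread of the bootstrap means estimates the standard error of the mean.
print(boot_means.std())
```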

September 22, 2023

Today work on some project tasks,

  1. Firstly, I compared obesity with diabetes and plotted a graph. The r-squared value was 0.36.
  2. Then, I compared inactivity with diabetes. The r-squared value was also 0.36.
  3. Then I computed the standard deviation, variance, mean, and skewness, and checked for heteroscedasticity.
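These steps can be sketched with SciPy; the arrays below are synthetic placeholders standing in for the county-level obesity and diabetes percentages, which are not reproduced in this entry.

```python
import numpy as np
from scipy import stats

# Placeholder data standing in for %obesity and %diabetes by county.
rng = np.random.default_rng(1)
obesity = rng.uniform(25, 45, size=100)
diabetes = 0.2 * obesity + rng.normal(0, 1.5, size=100)

# Simple linear regression; r-squared is the squared correlation.
res = stats.linregress(obesity, diabetes)
r_squared = res.rvalue ** 2

# Descriptive statistics of the predictor.
print(r_squared, obesity.mean(), obesity.std(ddof=1), stats.skew(obesity))
```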

September 20, 2023

About today’s lecture,

  1. The professor gave an example of a near-perfect linear regression model that fit the dataset very well.
  2. The data was about pre-molt and post-molt sizes of crabs.
  3. Variance and skewness were high for this model.
  4. All the data points were close to the best-fit line.
  5. The histogram of the post-molt data was shifted slightly to the left and had a higher peak.
  6. The error was comparatively minimal.
  7. The r2_score value was up to the mark.
  8. Then we saw the concept of the t-test.
  9. A t-test compares the means of two groups.
  10. If the software indicates that the t-test’s assumptions are violated, we use a Monte Carlo test instead.
  11. A Monte Carlo test simulates the possible outcomes of an uncertain event.
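A sketch of the two-group comparison described above, assuming synthetic stand-ins for the crab measurements; the permutation test shown is one common Monte Carlo alternative when the t-test's assumptions are in doubt.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical pre-molt and post-molt crab size samples.
pre = rng.normal(120, 10, size=50)
post = rng.normal(125, 10, size=50)

# Two-sample t-test compares the means of the two groups.
t_stat, p_t = stats.ttest_ind(pre, post)

# Monte Carlo (permutation) alternative: shuffle the pooled data and
# recompute the mean difference to build a null distribution.
observed = post.mean() - pre.mean()
pooled = np.concatenate([pre, post])
diffs = []
for _ in range(2000):
    rng.shuffle(pooled)
    diffs.append(pooled[:50].mean() - pooled[50:].mean())
p_mc = np.mean(np.abs(diffs) >= abs(observed))

print(p_t, p_mc)
```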

September 18, 2023

Today’s lecture was about some equations and few regression concepts.

  1. Correlation between predictor variables
  2. Interaction terms
  3. Generalized Linear Models
  4. Overfitting = the model fits the training data very well, but it does not predict new data well.
  5. Method to check for overfitting = cross-validation.

Work on project…

  1. Extracted the data from the given PDF.
  2. Checked the first few rows for reference using df.head().
  3. Checked the shape of the dataset (df.shape).
  4. Got a description of the dataset, which gave the count, mean, max, min, std, and percentile distribution (df.describe()).
  5. Checked the data types (df.dtypes).
  6. Checked whether there are any null values (df.isnull().sum()).
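The checks above can be sketched with pandas; the small frame below is a stand-in, since the data extracted from the project PDF is not reproduced here.

```python
import pandas as pd

# Tiny stand-in frame for the extracted project data.
df = pd.DataFrame({
    "diabetes": [8.1, 9.2, 7.5, 10.0],
    "inactivity": [20.3, 25.1, 18.9, 27.4],
})

print(df.head())           # first few rows for reference
print(df.shape)            # (rows, columns)
print(df.describe())       # count, mean, std, min, quartiles, max
print(df.dtypes)           # note: the attribute is .dtypes, not .types
print(df.isnull().sum())   # .isnull() is a method, so it needs parentheses
```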

September 15, 2023

Today’s discussion session was very helpful. The questions asked by the students cleared up some basic concepts about linear regression.

  1.  The explanation of “Why do we calculate distance parallel to the Y-axis in linear regression?” helped to clarify the concept.
  2.  Secondly, the professor suggested building a profile on LinkedIn.

Work Done Towards Project:-

I tried multiple linear regression on inactivity, diabetes, and obesity, taking 20% of the data for testing and 80% for training. The resulting r2 score was very low, so I split the data 50-50 to bring the r2 score closer to 1.
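The workflow can be sketched as follows; the features and target are synthetic placeholders, since the actual county data is not shown in this entry.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder predictors (inactivity, obesity) and target (diabetes).
rng = np.random.default_rng(0)
X = rng.uniform(15, 45, size=(200, 2))
y = X @ np.array([0.15, 0.1]) + rng.normal(0, 1, size=200)

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(score)
```

One caveat: a higher r2 score obtained by changing the split ratio may not reflect better generalization; cross-validation is a more reliable way to estimate test performance.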

September 13, 2023

Things I learned from today’s lecture.

  1. P-Value:- It is the probability of obtaining results at least as extreme as those observed in a hypothesis test, assuming that the null hypothesis is correct.

  2. If the P-value is less than 0.05, it indicates stronger evidence against the null hypothesis; if the P-value is greater than 0.05, it indicates weaker evidence against the null hypothesis.

  3. Heteroscedasticity:- In heteroscedasticity, the variance of the residual term varies widely across observations.

  4. R-square:- It measures the proportion of the variance in the data that is explained by the model. If it is high, the model’s predictions fit the data well.
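Both quantities can be read off a simple regression fit; the data below is a hypothetical predictor/response pair used only to illustrate the metrics.

```python
import numpy as np
from scipy import stats

# Hypothetical data with a genuine linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=80)
y = 2.0 * x + rng.normal(size=80)

res = stats.linregress(x, y)

# p-value: probability of a slope at least this extreme if the null
# hypothesis (true slope = 0) were correct.
# r^2: share of the variance in y explained by the fit.
print(res.pvalue, res.rvalue ** 2)
```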