December 10, 2023 - Project 3 Report
December 8, 2023 - Project 2 Corrected Report
December 6, 2023 - MTH522_Project2_Report
- Time series analysis was done on the “economic indicators” data to identify underlying trends, seasonality, and cyclical behaviors.
- The data were arranged chronologically by the time variables, the "Year" and "Month" columns.
- I later created a single datetime column by combining the 'Year' and 'Month' columns to represent the time index.
- Additionally, time series plots were used to visualize the data in order to see trends, patterns, and seasonality in each economic indicator over time.
- Then I used methods like seasonal decomposition to break down the time series into its constituent parts (trend, seasonality, and residual) in order to analyze the individual contributions.
- A time series dataset is split into three parts using the seasonal decomposition technique: the trend, seasonality, and residual or error components.
- Used a forecasting model such as ARIMA (AutoRegressive Integrated Moving Average) to project future values of the economic indicators. I divided the data into training and test sets so that I could train the model and assess its effectiveness.
- Evaluated the forecasting model’s accuracy using a metric such as Mean Absolute Error (MAE), and evaluated the model’s performance by comparing the predicted values with the test dataset’s actual values.
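Below is a minimal sketch of the datetime-combination and decomposition steps described above. The data here are synthetic stand-ins: the column name "indicator", the 2015–2022 range, and the trend/seasonal shape are assumptions, not the project's actual values.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in for the 'economic indicators' table: Year/Month columns
# plus one hypothetical indicator column.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2015, 2023), 12)
months = np.tile(np.arange(1, 13), 8)
df = pd.DataFrame({
    "Year": years,
    "Month": months,
    "indicator": 100 + 0.3 * np.arange(96)
                 + 5 * np.sin(2 * np.pi * months / 12)
                 + rng.normal(0, 1, 96),
})

# Combine 'Year' and 'Month' into a single datetime column and index by it.
df["date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))
series = df.set_index("date")["indicator"].sort_index()

# Additive decomposition into trend, seasonal, and residual components
# (period=12 assumes monthly data with yearly seasonality).
result = seasonal_decompose(series, model="additive", period=12)
result.plot()
```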
December 4, 2023
Today’s Lecture:
- When building decision trees, a technique called pre-pruning, sometimes called early stopping, involves applying constraints to halt the tree's growth before it reaches full depth.
- This means deciding whether to stop splitting nodes during tree construction based on predetermined criteria, such as imposing a maximum depth, requiring a minimum number of samples to split a node, or requiring a minimum number of samples in a leaf node.
- Pre-pruning is primarily done to keep the tree from growing too complicated and overfitting the training set, which ultimately improves the model's capacity to generalize to new, unseen data.
- In contrast, post-pruning (often simply called pruning) grows a decision tree to its full depth and then carefully removes branches or nodes to improve overall model performance.
- After allowing the original tree to grow unrestricted, branches that do not substantially improve prediction accuracy are trimmed according to predetermined criteria.
- Common techniques such as cost-complexity pruning take into account the trade-off between the tree's complexity and its fit to the training set. Post-pruning is primarily used to reduce overfitting and simplify the decision tree, especially when the tree structure retains noise or training-data-specific patterns that may not translate well to new data.
- In conclusion, pre-pruning regulates tree growth during construction, whereas post-pruning removes portions of the tree after it has been built; a short scikit-learn sketch of both is below.
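A hedged sketch of the two ideas in scikit-learn, using a toy dataset rather than any course data; the depth/sample limits and the choice of alpha are purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth / sample constraints.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                                    min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then prune with a cost-complexity parameter alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # alpha choice is illustrative only
post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
post_pruned.fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```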
November 29, 2023
In today’s lecture,
- we explore the mechanics of the housing market, paying close attention to how median home values have changed over time and how they relate to larger trends in the economy.
- Our study shows the market’s varying highs and lows, just like a roadmap might.
- While price dips or plateaus may indicate a cooling market owing to economic developments or shifting buyer attitudes, rising prices frequently imply a solid economy and significant housing demand, signifying buyer confidence.
- These patterns are linked to more general economic factors like employment and interest rates.
- For example, a robust job market might raise the ability to purchase a property, which in turn drives up prices, while changes in borrowing rates can affect the fervor of potential buyers.
- Additionally, we observed possible seasonal patterns in the market.
- Time series analysis was performed on ‘economic indicators’ data in order to understand underlying patterns such as trends, seasonality, and cyclic behaviour.
- The data was arranged chronologically by time variable (columns ‘Year’ and ‘Month’).
- I later combined the 'Year' and 'Month' columns into a single datetime column to represent the time index.
- The time series data was also visualised using time series plots to observe patterns, trends, and seasonality in each economic indicator over time.
- The time series was then decomposed into its components (trend, seasonality, residual) for analysis using techniques such as seasonal decomposition.
- Seasonal decomposition is a time series analysis method that separates a time series dataset into three parts: trend, seasonality, and residual or error components.
- To predict future values of economic indicators, a forecasting model such as ARIMA (AutoRegressive Integrated Moving Average) was used.
- The data was then divided into training and test sets in order to train the model and evaluate its performance.
- To assess the accuracy of the forecasting model, a metric such as Mean Absolute Error (MAE) was used, and the predicted values were compared to the actual values in the test dataset to evaluate the model’s performance.
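A small sketch of the forecasting and evaluation steps described above. The monthly series is synthetic, and the ARIMA order (1, 1, 1) and one-year holdout are assumptions rather than the report's tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series standing in for one economic indicator.
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
series = pd.Series(100 + 0.5 * np.arange(96) + rng.normal(0, 2, 96), index=idx)

train, test = series[:-12], series[-12:]          # hold out the last year as a test set

model = ARIMA(train, order=(1, 1, 1)).fit()       # order is an illustrative assumption
forecast = model.forecast(steps=len(test))

# Compare predicted values with the actual values in the held-out year.
print("MAE on the held-out year:", mean_absolute_error(test, forecast))
```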
November 27, 2023
- In a VAR model, each variable is modeled as a linear function of past lags of itself and past lags of other variables in the system.
- VAR models differ from univariate autoregressive models because they allow feedback to occur between the variables in the model.
- An estimated VAR model can be used for forecasting, and the quality of the forecasts can be judged, in ways that are completely analogous to the methods used in univariate autoregressive modelling.
- Using an autoregressive (AR) modeling approach, the vector autoregression (VAR) method examines the relationships between multiple time series variables at different time steps.
- The VAR model’s parameter specification involves providing the order of the AR(p) model, which represents the number of lagged values included in the analysis.
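A minimal sketch of fitting a VAR with statsmodels; the two random series and the fixed lag order p = 2 are placeholders, not the project's data or chosen order.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Two placeholder series; in practice these would be stationary economic indicators.
rng = np.random.default_rng(0)
data = pd.DataFrame({"series_a": rng.normal(size=200),
                     "series_b": rng.normal(size=200)})

model = VAR(data)
results = model.fit(2)          # AR order p = 2 lags (an illustrative choice; ic="aic" can select it)
print(results.k_ar)             # the lag order actually used

# Forecast five steps ahead from the last p observations.
forecast = results.forecast(data.values[-results.k_ar:], steps=5)
print(forecast)
```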
November 15, 2023
- After analysing the “food establishment inspections” dataset, it was determined that the decision tree technique would work best because it can be used for both regression and classification, can handle both numerical and categorical data, implicitly performs feature selection, and is robust to outliers and missing values.
- Decision trees use a tree-like structure to represent decisions and their possible outcomes.
- The nodes in the tree stand for features, the branches for decision rules, and the leaves for the result or target variable.
- Decision trees determine the best feature on which to split the dataset at each node based on criteria such as information gain, entropy, and the Gini index.
- The algorithm recursively splits the dataset based on the selected features until a stopping criterion is met; a small illustration of the Gini and entropy criteria follows below.
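A short illustration of the split criteria mentioned above, computed on a toy label vector (not the inspections data):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy of a node in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array(["pass", "pass", "fail", "pass", "fail"])   # toy class labels
print(gini(labels), entropy(labels))   # impurity of the node before any split
```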
Project 2
November 10, 2023
- After running the DBSCAN algorithm, I went on to implement OPTICS, which is an extension of it.
- Preprocessed the data by converting the object (string) values into numerical ones and normalizing/standardizing the data.
- Computed the distance between each pair of locations using the haversine distance formula.
- The OPTICS algorithm was then applied: optics = OPTICS(min_samples=5, metric='precomputed'), followed by optics.fit() on the precomputed haversine distance matrix.
- Collected cluster information, including reachability distances and core samples, and identified the noise points.
- The clusters were also visualized on a map in order to understand their spatial distribution.
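A hedged reconstruction of this step on a handful of placeholder coordinates (the real run used min_samples=5 on the full dataset); the haversine matrix is built with scikit-learn's haversine metric, which expects radians and returns unit-sphere distances scaled here by the Earth's radius.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics.pairwise import haversine_distances

# Placeholder (latitude, longitude) points in degrees.
coords_deg = np.array([
    [42.63, -71.00], [42.64, -71.01], [42.65, -70.99], [42.63, -71.02],
    [41.70, -70.30], [41.71, -70.31], [41.69, -70.29],
])
coords_rad = np.radians(coords_deg)
haversine_matrix = haversine_distances(coords_rad) * 6371.0   # kilometres

optics = OPTICS(min_samples=3, metric="precomputed")          # toy-sized min_samples
optics.fit(haversine_matrix)

print(optics.labels_)          # -1 marks noise points
print(optics.reachability_)    # reachability distance per sample
print(optics.core_distances_)  # core distance per sample
```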
November 08, 2023
- Used the latitude and longitude coordinates and relevant attributes.
- Calculated the Euclidean distance to measure dissimilarity between locations.
- Then calculated a pairwise distance matrix to represent dissimilarity between all location pairs.
- Applied an agglomerative hierarchical clustering algorithm to the distance matrix.
- Visualized the dendrogram, which displayed the hierarchical structure of clusters.
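A minimal sketch of this hierarchical-clustering step with scipy; the coordinates are toy placeholders and the "average" linkage is an assumption about the linkage used.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Placeholder (latitude, longitude) points standing in for the real locations.
coords = np.array([[42.63, -71.00], [42.64, -71.01], [41.70, -70.30], [41.71, -70.31]])

pairwise = pdist(coords, metric="euclidean")   # condensed pairwise distance matrix
Z = linkage(pairwise, method="average")        # agglomerative clustering

dendrogram(Z)                                  # hierarchical structure of the clusters
plt.title("Hierarchical clustering of locations")
plt.show()
```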
November 06, 2023
Here is an overview of the main conclusions drawn from the data analysis:
1. The “flee” variable is divided into four categories: not fleeing, car, foot, and other.
2. Following the incident, about 4,000 people remained in place.
3. About 1,300 people ran away, some on foot and some in cars.
4. Manner of death: over 7,000 individuals were shot, while the remaining individuals were shot and subjected to Tasers.
5. Gender distribution: compared to females, males were far more likely to be involved in the offenses.
6. Racial distribution: a variety of racial backgrounds were represented, with whites and blacks making up the majority.
7. Geographic distribution: the reported crime rate was greater in three states.
8. Indicators of mental illness: approximately 2,000 individuals exhibited signs of mental illness.
9. Danger level: about 4,000 individuals were recorded under this category.
November 3, 2023
- K-Means and DBSCAN are two distinct clustering algorithms.
- K-Means is a partition-based clustering method where you need to specify the number of clusters (K) beforehand, and it assigns data points to the nearest cluster centroid.
- However, it can be sensitive to the initial placement of centroids and primarily performs hard clustering, meaning each point belongs to one cluster. K-Means also assumes that clusters are spherical in shape.
- On the other hand, DBSCAN is a density-based clustering algorithm that automatically identifies clusters based on data density without requiring you to specify the number of clusters in advance.
- It excels at identifying clusters of arbitrary shapes, handles noise or outliers, and can accommodate clusters of varying sizes.
- While DBSCAN primarily performs hard clustering, there are extensions available to achieve soft clustering when necessary.
- In summary, the choice between K-Means and DBSCAN depends on factors like the knowledge of the number of clusters and the data’s characteristics, as they have different strengths and use cases in the realm of data clustering.
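A side-by-side sketch of the two algorithms on the same toy data, illustrating the points above; the K = 2, eps = 0.3, and min_samples = 5 settings are illustrative, not tuned values from the project.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two crescent-shaped clusters: a classic case where cluster shape matters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# K-Means forces two roughly spherical partitions; DBSCAN follows the dense
# crescents and labels any stragglers as noise (-1).
print(set(kmeans_labels), set(dbscan_labels))
```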
November 1, 2023
- I carried on working today, and in class we discussed the population analysis using the specific police station data.
- I also went over the race analysis I’ve done and how I want to combine it with the specific police station data. Apart from that, I was trying to create a map using the locations, longitude, and latitude information so that I could examine the clusters and the regions where there were a lot of gunshots.
- I was using the geopandas library for this, but it appears that the built-in world map dataset is no longer supported, which is why there is an issue importing the map once the individual point grouping is finished.
- I am looking for more libraries that I can utilize.
October 29, 2023
Some more geo libraries…
- GeoDjango: GeoDjango is an extension of Django, a popular web framework for Python. It adds geospatial database support and tools for building geospatial web applications.
- folium: The Python module Folium facilitates the creation of interactive Leaflet maps. It is especially helpful for viewing geospatial data and for building online maps with personalized layers, popups, and markers.
- Shapely: Shapely is a geometric operations library that may be used on geometric objects. For the creation, manipulation, and analysis of geometric shapes, it is frequently used in conjunction with other geospatial libraries.
- geopandas: An open-source Python package called GeoPandas expands Pandas’ functionality for handling geographic data. It enables us to make maps and plots, engage with geospatial datasets, and carry out geospatial activities.
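A small folium sketch of the kind of interactive map described above; the coordinates and popup text are placeholders, not points from the project data.

```python
import folium

# Base map centred on an arbitrary point (roughly southeastern Massachusetts).
m = folium.Map(location=[41.65, -70.93], zoom_start=8)

folium.Marker([41.63, -70.93], popup="Example point").add_to(m)           # a labelled marker
folium.CircleMarker([42.36, -71.06], radius=6, color="red").add_to(m)     # a circle marker

m.save("example_map.html")   # open the saved file in a browser to view the map
```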
October 27, 2023
K-Means:
- Divides the data into K clusters by minimizing the squared distances to the cluster centres.
- Assumes that the clusters are roughly spherical and similarly sized.
- Needs the number of clusters (K) to be specified in advance.
- Not well suited to unusual cluster shapes or noisy data.
- Sensitive to the initial placement of the cluster centres.
- Effective for datasets of a moderate size.
DBSCAN:
- Classifies data points according to their density, identifying clusters in high-density areas and outliers in low-density areas.
- Able to locate clusters of any size and any shape.
- Determines the number of clusters automatically.
- Robust when handling outliers and noisy data.
- Less dependent on specifying the number of clusters as a parameter.
- Less effective with high-dimensional data.
October 23, 2023
Today’s Lecture:-
K-Means Clustering:-
- K-means is a partitioning method that aims to divide a dataset into K distinct, non-overlapping clusters. It is a centroid-based approach, where the data points are assigned to the cluster with the nearest centroid.
- To better understand medoids, we can compare them to the centroids (central points) in K-Means clustering.
- The relationship between centroids and medoids is comparable to the relationship between the average and the middle value (median) of a list.
- But it's important to remember that while averages and centroids might not always be real data points, medians and medoids are.
- The primary distinction between K-Means and K-Medoids is how they arrange the data.
- While K-Means assigns points according to their distances to the centroids, K-Medoids assigns points according to their distances to the medoids.
- Since K-Medoids does not depend on averaged centre points, it is more robust and resistant to the effects of unusual data, making it an excellent choice for handling outliers; a tiny numerical illustration is below.
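A tiny numpy illustration of the centroid/medoid distinction on made-up points with one outlier: the centroid is an average (often not a real data point and pulled by the outlier), while the medoid is the actual data point with the smallest total distance to the others.

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [10.0, 10.0]])   # last point is an outlier

centroid = points.mean(axis=0)                        # dragged toward the outlier
dist_matrix = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
medoid = points[dist_matrix.sum(axis=1).argmin()]     # stays with the dense group

print("centroid:", centroid)
print("medoid:  ", medoid)
```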
October 20, 2023
- The data contains latitude and longitude variables, which are known as geoposition data, from which insights can be extracted and later analysed.
- From the longitude and latitude variables, the location where each shooting took place can be analysed.
- Geodesic distance can then be used to find the distance between the latitude and longitude coordinates of one place and another.
- I used the haversine formula to find the geodesic distance, which gives a reasonably accurate estimate of the shortest distance between two points (a sketch of this computation appears after this list).
- Then created a geolist plot using matplotlib that includes all the longitude and latitude coordinates.
- After analysing the visualisation, performed a clustering algorithm, KNN, to group points that are close to each other on the map.
- Also created a heatmap to identify the regions with low and high concentrations of data points.
- Since the geolist plot had timestamps, analysed the distribution of points over time.
- Another clustering algorithm, DBSCAN (density-based spatial clustering of applications with noise), is used for grouping spatial data points based on their density.
- It is used for discovering clusters with irregular shapes and for handling outliers. A geo histogram is used for this specific data.
- Next, I will find the outliers in the data and perform the DBSCAN algorithm.
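A sketch of the haversine computation referenced above, in pure Python with no external dependencies; the coordinates in the example call are placeholders.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))   # 6371 km: mean Earth radius

print(haversine_km(42.63, -71.00, 41.70, -70.30))   # placeholder coordinates
```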
October 18, 2023
- In logistic regression, the coefficients represent the relationship between the independent variables and the log-odds of the dependent variable's binary outcome.
- When the logistic regression model is being trained, the coefficients are estimated.
- The projected probabilities are then obtained by applying the logistic function, also known as the sigmoid function, to these log-odds.
- Logistic regression coefficients in the context of geographical data can be interpreted in a manner similar to that of ordinary logistic regression, but with a spatial context.
- The logistic regression model attempts to capture the link between the spatially distributed independent variables and the likelihood that an event will occur (the binary outcome).
- Because our data includes both longitude and latitude, I used the formula log-odds = B0 + B1 * LATITUDE + B2 * LONGITUDE + ... + Bn * Xn, where the remaining terms stand for any additional predictors.
In this case, log-odds represents the odds’ natural logarithm.
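A hedged sketch of a logistic regression on geoposition features; the binary target column "event", the simulated coordinates, and the use of raw latitude/longitude as predictors are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Placeholder data: random coordinates and a random binary outcome.
rng = np.random.default_rng(0)
df = pd.DataFrame({"LATITUDE": rng.uniform(41, 43, 200),
                   "LONGITUDE": rng.uniform(-72, -70, 200),
                   "event": rng.integers(0, 2, 200)})

model = LogisticRegression().fit(df[["LATITUDE", "LONGITUDE"]], df["event"])

# intercept_ is B0 and coef_ holds B1, B2 on the log-odds scale; the predicted
# probabilities come from the sigmoid of B0 + B1*LATITUDE + B2*LONGITUDE.
print(model.intercept_, model.coef_)
print(model.predict_proba(df[["LATITUDE", "LONGITUDE"]])[:3])
```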
October 16, 2023
In today’s lecture, most of the topics were revised by the professor.
- Geoposition data- It is data that contains the longitude and latitude of the coordinates.
- Geodesic distance between two locations is computed from their latitude/longitude pairs.
- A geolist plot is used to create a geographic plot on the map.
- DBSCAN method for clustering- It is a density-based clustering method.
- Geopy is the library used to perform geographic operations.
- We can install simply by “pip install geopy”.
- A geo histogram looks like a heat map.
October 13, 2023
- The dataset contains the variables longitude, latitude, and is_geocoding_exact.
- Using geodesic distance for data analysis allows accurate measurements of distances on a curved surface.
- To find the geodesic distance, the haversine formula can be used, taking the longitude and latitude coordinates of two points as input.
- The formula converts the coordinates from degrees to radians and returns the distance in kilometres.
- Here, an ANOVA test can be used to assess whether there are significant differences in the distances.
Project:
6. Calculated the geodesic distances; the results were 877.46 km, 956.23 km, and so on.
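A small sketch of the distance computation using geopy's geodesic helper; the coordinate pairs are placeholders, not rows from the dataset.

```python
from geopy.distance import geodesic

point_a = (42.6334, -71.3162)   # (latitude, longitude), placeholder values
point_b = (41.7003, -70.3002)

print(geodesic(point_a, point_b).km)   # geodesic (ellipsoidal) distance in kilometres
```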
October 11, 2023
Today’s Class:–
- Geodesic Distance: – It is the shortest path between two points along a curved surface.
- There are Python packages available for it. e.g. geopy, geokernels.
- We can use the imputation method to handle missing values in the dataset.
- I came to know about some imputation methods, such as i) Next or Previous Value, ii) K Nearest Neighbors, iii) Maximum or Minimum Value, iv) Missing Value Prediction, v) Most Frequent Value, and vi) Average or Linear Interpolation.
- In this dataset, if we try to compare race with age, we would need to perform a t-test more than 15 times, as there are 7-8 race categories. This is not convenient, so the professor suggested analysis of variance; for that, we can use an ANOVA test.
- ANOVA test:- A test used to determine whether there are differences between three or more unrelated samples or groups (a small sketch follows below).
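A minimal ANOVA sketch of the comparison described above (age across race groups) using scipy; the three samples here are simulated stand-ins. On the real data the groups could be built with something like df.groupby('race')['age'], assuming those column names.

```python
import numpy as np
from scipy.stats import f_oneway

# Placeholder age samples for three groups (the real data has more groups).
rng = np.random.default_rng(0)
group_a = rng.normal(35, 12, 50)
group_b = rng.normal(33, 12, 50)
group_c = rng.normal(38, 12, 50)

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")   # p < 0.05 would suggest the group means differ
```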
In Project,
Basic analysis of the dataset:-
- There are 8768 rows and 12 columns.
- Mean: 37.28, Std: 12.99, Min: 2.00, Max: 92.00.
- There are 582 null values in name, 605 in age, 49 in gender, 210 in armed, 57 in city, and 1190 in flee.
Project 1
October 06, 2023
Worked on project.
- Read CSV files in Python that include %inactivity, %obesity, and %diabetes.
- Then I calculated the mean and standard deviation for each variable.
- Then calculated the z-score for every variable.
- Converted the negative z-scores to positive (i.e., took absolute values) for each variable.
- Then assigned a threshold value for each variable to identify outliers.
- Calculated the outliers and concluded that the data has many outliers; a sketch of this check is below.
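A hedged sketch of the z-score outlier check on simulated stand-in columns; the column names and the threshold of 2 are assumptions, not the project's actual values.

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the %inactivity / %obesity / %diabetes columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "%inactivity": rng.normal(25, 5, 300),
    "%obesity": rng.normal(32, 4, 300),
    "%diabetes": rng.normal(9, 2, 300),
})

threshold = 2.0                                    # illustrative threshold
for col in df.columns:
    z = (df[col] - df[col].mean()) / df[col].std()
    n_outliers = (z.abs() > threshold).sum()       # absolute z-scores beyond the threshold
    print(col, "->", n_outliers, "outliers")
```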
October 4, 2023
Today we worked on the project.
- We are given three datasets: %diabetes, %obesity, and %inactivity. Apart from these, we will be using the region and child-poverty data.
- We will be predicting diabetes in this project.
- In the EDA part, we are going to find the correlation between i) obesity and diabetes and ii) inactivity and diabetes.
- We find the mean, median, mode, standard deviation, skewness, and outliers.
- Then we perform resampling methods on the data.
- First we perform K-fold cross-validation, then the bootstrap.
- As there are more than two variables, we have to perform multiple linear regression in this case.
- Then we check for homoscedasticity versus heteroscedasticity; for that we use the Breusch-Pagan test (a sketch is below).
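A minimal sketch of the Breusch-Pagan check on a fitted multiple linear regression; the synthetic variables mimic the project setup (variance grows with obesity so the test has something to detect), and the names are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic stand-ins for the project variables.
rng = np.random.default_rng(0)
obesity = rng.uniform(25, 45, 300)
inactivity = rng.uniform(15, 35, 300)
diabetes = 2 + 0.15 * obesity + 0.1 * inactivity + rng.normal(0, 0.05 * obesity)

X = sm.add_constant(pd.DataFrame({"obesity": obesity, "inactivity": inactivity}))
model = sm.OLS(diabetes, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)   # a small p-value suggests heteroscedasticity
```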
October 2, 2023
- Regression analysis is performed on data that may have a high variance across distinct independent variable values.
- Heteroscedasticity, which denotes variable variances around the fitted values, is associated with this type of data.
- The data points are spread around the fitted line when we do regression.
- There should be as little scattering as feasible in a good regression model.
- The model is referred to as homoscedastic when the dispersion is uniform. The model is heteroscedastic if it is not.
- The form of the typical heteroscedastic distribution resembles a cone.
- Most frequently, the source of this type of cone-shaped distribution is the data itself.
- Sometimes it is quite normal for the dependent variable’s variance to change and to not remain constant across the whole dataset.
- In other words, the newest data points may show more variety than the prior data did.
- When I fitted the linear regression model, heteroscedasticity was noticed in a similar way.
- Later, I performed cross-validation to help address this problem.
September 29, 2023
- I ran a model on a dataset to predict diabetes based on inactivity and obesity.
- The model became overfit as I worked on it.
- After overfitting was identified in the model, I used scikit-learn to evaluate it by performing k-fold cross-validation on the data.
- The objective of cross-validation is to produce a number of partitions of sample observations from the training dataset.
- The number of partitions (folds) is chosen as K, and the number of observations determines the size of each fold.
- After splitting the data into K = 5 folds, I used a different fold as the validation set each time I trained and evaluated the model.
- Performance metrics from each fold are averaged to determine the model's generalization performance.
- Cross-validation does help guard against overfitting to some extent; a sketch of the step is below.
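A hedged sketch of the 5-fold cross-validation step; the synthetic predictors/response stand in for the inactivity and obesity columns, and the plain linear model is an assumption about the setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for %inactivity, %obesity -> %diabetes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, 300)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores)          # per-fold R^2 on each held-out fold
print(scores.mean())   # averaged estimate of generalization performance
```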
September 27, 2023
Today’s Lecture,
A. Mean Squared Error:-
- It tells us about the performance of the model.
- It penalizes larger errors of the model more heavily, because the errors are squared.
- It is mostly used as a cost function.
- Points that are far away from the best-fit line are the outliers.
- It is sensitive to outliers.
B. More Complex Model:-
- The complexity of the model depends on the parameters used in the model.
- More parameters mean more complexity.
- We saw the example of a quadratic equation, which illustrates the complexity of the model.
September 25, 2023
Topics covered in today's lecture:
- There are two resampling methods: A. cross-validation and B. the bootstrap.
- The difference between testing error and training error: training error is often quite a bit lower than the test error.
- Training error is not a good surrogate for testing error.
- Understood the concept of overfitting through the graph of testing and training error.
- Bias: how far, on average, the model's estimate is from the truth.
- Variance: how much the estimate varies around its average.
- Bias and variance together give the prediction error.
- K-fold validation: this is used to estimate the test error.
- The data is divided into folds of equal size in this technique.
- We compared the true and estimated test mean squared error for the data.
- The estimate of prediction error is biased upward; this is an issue with cross-validation.
- Bias can be minimized, but in that situation the variance will be high.
- Understood when we should use cross-validation.
- Right and wrong ways to apply cross-validation.
September 22, 2023
Today I worked on some project tasks:
- First, I compared obesity with diabetes and plotted a graph. The R-squared value is 0.36.
- Then I compared inactivity with diabetes. The R-squared value was the same, 0.36.
- Then I obtained the standard deviation, variance, mean, and skewness, and checked for heteroscedasticity.
September 20, 2023
About today’s lecture,
- The professor gave an example of a nearly perfect linear regression model that fits its dataset very well.
- The data was about the pre-molt and post-molt measurements of crabs.
- The variance and skewness were high for this model.
- All the data points were close to the best-fit line.
- The histogram of the post-molt data was shifted slightly to the left and had a higher peak value.
- The error was comparatively minimal.
- The r2_score value was up to the mark.
- Then we saw the concept of t-test.
- t-test compares the means of two groups.
- If the software reports that the t-test's assumptions are violated, we use a Monte Carlo test.
- A Monte Carlo test simulates the possible outcomes of an uncertain event; a sketch of both tests is below.
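A hedged sketch of a two-sample t-test and a Monte Carlo (permutation) alternative on simulated data; the "pre-molt"/"post-molt" samples here are made up, not the crab dataset from class.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
pre_molt = rng.normal(130, 10, 100)    # placeholder samples
post_molt = rng.normal(145, 10, 100)

t_stat, p_value = ttest_ind(pre_molt, post_molt)
print("t-test p-value:", p_value)

# Monte Carlo idea: shuffle the group labels many times and count how often the
# shuffled difference in means is at least as extreme as the observed one.
observed = abs(pre_molt.mean() - post_molt.mean())
pooled = np.concatenate([pre_molt, post_molt])
count = 0
for _ in range(10_000):
    rng.shuffle(pooled)
    diff = abs(pooled[:100].mean() - pooled[100:].mean())
    count += diff >= observed
print("Monte Carlo p-value:", count / 10_000)
```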
September 18, 2023
Today’s lecture was about some equations and few regression concepts.
1. Correlation between predictors and variables
2. Interaction terms
3. Generalized linear models
4. Overfitting: the model fits the training data very well but does not predict new data well.
5. Method to check for overfitting: cross-validation.
Work on project…
- Extract the data from the given pdf.
- Checked the first few rows for reference using df.head().
- Checked the shape of the dataset (df.shape).
- Got a description of the dataset, showing the count, mean, max, min, std, and percentile distribution (df.describe()).
- Checked the data types (df.dtypes).
- Checked whether there are any null values (df.isnull().sum()).
September 15, 2023
Today’s discussion session was very helpful. The questions asked by the students cleared up some basic concepts about linear regression.
- The explanation of why we measure distances parallel to the Y-axis in linear regression helped clarify the concept.
- Secondly, the professor suggested building a profile on LinkedIn.
Work Done Towards Project:-
I tried multiple linear regression on inactivity, diabetes, and obesity. I took 20% of the data for testing and 80% for training. As a result, the R-squared value was very low, so I divided the data 50-50 to get an R-squared value closer to 1; a sketch of this split-and-fit step is below.
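A hedged sketch of the split-and-fit step described above, on simulated stand-in data rather than the project files; the 80/20 split mirrors the first attempt, and changing test_size to 0.5 gives the 50-50 split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the inactivity/obesity -> diabetes regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(350, 2))
y = 0.6 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.4, 350)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on the held-out set:", r2_score(y_test, model.predict(X_test)))
```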
September 13, 2023
Things I learned from today’s lecture.
1. P-value:- It is the probability of obtaining results at least as extreme as those observed when performing a hypothesis test, assuming that the null hypothesis is correct.
2. If the p-value is less than 0.05, it indicates stronger evidence against the null hypothesis, and if the p-value is greater than 0.05, it indicates weaker evidence against it.
3. Heteroscedasticity:- The variance of the residual term varies widely across observations.
4. R-squared:- It measures the proportion of variance in the data explained by the model. If it is high, the model's predictions are good.