October 29, 2023

Some more geospatial libraries…

  1. GeoDjango: GeoDjango is an extension of Django, a popular web framework for Python. It adds geospatial database support and tools for building geospatial web applications.
  2. folium: The Python module Folium facilitates the creation of interactive Leaflet maps. It is especially helpful for viewing geospatial data and for building online maps with personalized layers, popups, and markers.
  3. Shapely: Shapely is a library for the creation, manipulation, and analysis of planar geometric objects. It is frequently used in conjunction with other geospatial libraries.
  4. geopandas: GeoPandas is an open-source Python package that extends Pandas’ functionality for handling geographic data. It enables us to make maps and plots, work with geospatial datasets, and carry out geospatial operations.
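As a small illustration of the kind of geometric operations Shapely supports, here is a minimal sketch with made-up coordinates:

```python
from shapely.geometry import Point, Polygon

# A made-up 2x2 square and a point inside it.
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
p = Point(1, 1)

inside = p.within(square)   # True: the point lies inside the square
area = square.area          # area of the square: 4.0
buffered = p.buffer(0.5)    # a disc of radius 0.5 around the point
```

GeoPandas builds on exactly these Shapely objects: each row of a GeoDataFrame carries one such geometry.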

October 27, 2023

K-Means:

  - Divides data into K clusters by minimizing the squared distances to the cluster centres.
  - Assumes that the clusters are uniformly sized and spherical.
  - Needs the number of clusters (K) to be specified in advance.
  - Not well suited to unusual cluster shapes or noisy data.
  - Sensitive to the initial placement of the cluster centres.
  - Effective for datasets of a moderate size.
DBSCAN:

  - Classifies data points according to their density, identifying clusters in high-density areas and outliers in low-density areas.
  - Able to find clusters of any size and shape.
  - Determines the number of clusters automatically.
  - Robust when handling outliers and noisy data.
  - Less dependent on parameter selection.
  - Less effective with high-dimensional data.
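The contrast above can be seen on a small synthetic dataset (the blobs and parameter values below are invented for illustration): K-Means must be told K in advance, while DBSCAN infers the number of clusters from density.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Two well-separated synthetic blobs (hypothetical toy data).
rng = np.random.default_rng(0)
a = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
b = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
X = np.vstack([a, b])

# K-Means: K must be chosen up front.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: the number of clusters emerges from the density parameters;
# label -1 marks noise points.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

On this easy dataset both methods find the same two groups; on irregularly shaped clusters, DBSCAN is the one that keeps working.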

October 23, 2023

Today’s Lecture:-

K-Means Clustering:-

  1. K-means is a partitioning method that aims to divide a dataset into K distinct, non-overlapping clusters. It is a centroid-based approach, where the data points are assigned to the cluster with the nearest centroid.
  2. To have an improved comprehension of medoids, we may compare them to the core points in K-Means clustering.
  3. The relationship between averages and middle values in a list is comparable to the relationship between medoids and center points.
  4. But it’s important to remember that while means and centroids might not always be actual data points, medoids and middle values always are.
  5. The primary distinction between K-Means and K-Medoids is how they arrange the data.
  6. While K-Means arranges information according to the distances between data points and central points, K-Medoids arranges data according to the distances to medoids.
  7. Since K-Medoids uses actual data points rather than computed centroids, it is more robust and resistant to the effects of unusual data, making it an excellent choice for managing outliers.
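Since scikit-learn does not ship a K-Medoids estimator, here is a minimal NumPy sketch of the idea (not a full PAM implementation; the data and parameters are made up): each medoid is the actual member point that minimizes the total distance to the rest of its cluster.

```python
import numpy as np

def k_medoids(X, k, n_iter=10, seed=0):
    """Tiny K-Medoids sketch: medoids are always actual data points."""
    rng = np.random.default_rng(seed)
    # Full pairwise Euclidean distance matrix.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(d[:, medoids], axis=1)
        new = []
        for j in range(k):
            members = np.where(labels == j)[0]
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            new.append(members[np.argmin(d[np.ix_(members, members)].sum(axis=1))])
        new = np.array(new)
        if np.array_equal(new, medoids):
            break
        medoids = new
    labels = np.argmin(d[:, medoids], axis=1)
    return medoids, labels

# Two obvious groups of made-up points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0],
              [20.0, 20.0], [21.0, 20.0], [20.0, 23.0]])
medoids, labels = k_medoids(X, k=2)
```

Note that the returned medoids are indices into `X`, i.e. real observations, which is exactly what makes the method robust to outliers.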

October 20, 2023

  1. The data contains latitude and longitude variables, known as geoposition data, from which insights can be extracted and later analysed.
  2. From the longitude and latitude variables, the location where each shooting took place can be analysed.
  3. Geodesic distance can then be used to find the distance between the longitude/latitude coordinates of one place and another.
  4. I used the Haversine formula to compute the geodesic distance, which gives a reasonably accurate estimate of the shortest distance between two points.
  5. Then created a geolist plot using matplotlib, including all the longitude and latitude coordinates.
  6. After analysing the visualisation, performed a clustering step (KNN-style nearest-neighbour grouping) to group points that are close to each other on the map.
  7. Also created a heatmap to identify regions with low and high concentrations of data points.
  8. Since the geolist plot had timestamps, analysed the distribution of points over time.
  9. Another clustering algorithm, DBSCAN (density-based spatial clustering of applications with noise), is used for grouping spatial data points based on their density.
  10. It is useful for discovering clusters with irregular shapes and for handling outliers. A geo histogram is used for this specific data.
  11. Next, I will find the outliers in the data and perform the DBSCAN algorithm.
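The Haversine computation from step 4 can be sketched in a few lines (assuming a mean Earth radius of 6371 km and coordinates in decimal degrees):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres via the Haversine formula."""
    r = 6371.0  # mean Earth radius in km (assumption)
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

Because it treats the Earth as a sphere, the result is a reasonable estimate rather than an exact ellipsoidal distance.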

October 18, 2023

  1. The link between the independent factors and the dependent variable’s log-odds (binary outcome) is represented by the coefficients in logistic regression.
  2. When the logistic regression model is being trained, the coefficients are estimated.
  3. The projected probabilities are then obtained by applying the logistic function, also known as the sigmoid function, to these log-odds.
  4. Logistic regression coefficients in the context of geographical data can be interpreted in a manner similar to that of ordinary logistic regression, but with a spatial context.
  5. The logistic regression model attempts to capture the link between the spatially distributed independent variables and the likelihood that an event will occur (binary outcome).
  6. Because our data includes both longitude and latitude, I used the formula log-odds = B0 + B1 * LATITUDE + B2 * LONGITUDE + … + Bn * Xn, where Xn stands for any further predictors.
    In this case, log-odds represents the natural logarithm of the odds.
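A sketch of this with scikit-learn on synthetic data (the dataset, the latitude threshold, and all values are invented for illustration); it also verifies that applying the sigmoid to B0 + B1 * LATITUDE + B2 * LONGITUDE reproduces the model's predicted probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: the binary outcome depends on latitude only.
rng = np.random.default_rng(42)
lat = rng.uniform(30, 45, size=200)
lon = rng.uniform(-100, -70, size=200)
y = (lat > 37.5).astype(int)  # synthetic binary outcome

X = np.column_stack([lat, lon])
model = LogisticRegression(max_iter=1000).fit(X, y)

# Coefficients on the log-odds scale: intercept_ is B0, coef_ is (B1, B2).
b0 = model.intercept_[0]
b1, b2 = model.coef_[0]

# Sigmoid of the log-odds gives the predicted probabilities.
probs = model.predict_proba(X)[:, 1]
manual = 1 / (1 + np.exp(-(b0 + b1 * lat + b2 * lon)))
```

Here `b1` comes out positive, matching the synthetic rule that higher latitude raises the probability of the event.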

October 16, 2023

In today’s lecture, most of the topics were revised by the professor.

  1. Geoposition data- It is data that contains the longitude and latitude of the coordinates.
  2. The geodesic distance between two locations takes latitude/longitude pairs as input.
  3. A geolist plot is used to create a geographic plot on the map.
  4. DBSCAN method for clustering- It is a density-based clustering method.
  5. Geopy is the library used to perform geographic operations.
  6. We can install it simply with “pip install geopy”.
  7. A geo histogram looks like a heat map.

October 13, 2023

  1. The dataset contains variables longitude, latitude, is_geocoding_exact.
  2. Using geodesic distance for data analysis allows accurate measurements of distances on a curved surface.
  3. In order to find the geodesic distance, Haversine formula can be used by giving the longitude and latitude coordinates of two points as input.
  4. The formula converts the angular differences in degrees to a distance in kilometres and returns the result.
  5. Here, an ANOVA test can be used to assess whether there are significant differences in the distances.

project:

6. Calculated the geodesic distances; the results were 877.46 km, 956.23 km, and so on.
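The ANOVA step from point 5 can be sketched with SciPy on made-up distance samples (the group means, spreads, and sizes below are invented):

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical distance samples (km) for three groups, e.g. three regions.
rng = np.random.default_rng(1)
g1 = rng.normal(850, 30, 40)
g2 = rng.normal(860, 30, 40)
g3 = rng.normal(950, 30, 40)

# One-way ANOVA: small p-value -> the group means differ significantly.
f_stat, p_value = f_oneway(g1, g2, g3)
significant = p_value < 0.05
```

With one group's mean clearly shifted, the test rejects equal means.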

October 11, 2023

Today’s Class:

  1. Geodesic Distance:- It is the length of the shortest path between two points along a curved surface.
  2. There are Python packages available for it. e.g. geopy, geokernels.
  3. We can use the imputation method to handle missing values in the dataset.
  4. I came to know about some impute methods such as i) Next or Previous Value. ii) K Nearest Neighbors. iii) Maximum or Minimum Value. iv) Missing Value Prediction. v) Most Frequent Value. vi) Average or Linear Interpolation.
  5. In this dataset, if we try to compare race with age, we would need to perform a t-test more than 15 times, as there are 7-8 groups. This is not convenient, so the professor suggested analysis of variance. For that, we can use the ANOVA test.
  6. ANOVA test:- A test used to determine differences between research results from three or more unrelated samples or groups.
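Two of the imputation methods listed in point 4 (mean imputation and K Nearest Neighbors) can be sketched with scikit-learn; all the values below are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy column with missing values (hypothetical ages).
X = np.array([[25.0], [30.0], [np.nan], [35.0], [np.nan], [40.0]])

# Mean imputation: missing entries become the mean of the observed values.
mean_imp = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation needs more than one feature to be useful; shown here
# on a 2-column toy array: the missing cell in row 2 is filled with the
# average of its 2 nearest neighbours by the remaining feature.
X2 = np.array([[25.0, 1.0], [30.0, 2.0], [np.nan, 2.1], [40.0, 5.0]])
knn_imp = KNNImputer(n_neighbors=2).fit_transform(X2)
```

`SimpleImputer` also supports `strategy="most_frequent"` and `strategy="constant"`, covering two more of the methods on the list.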

In Project,

Basis analysis of the dataset:-

  1. There are 8768 rows and 12 columns.
  2. Mean: 37.28, Std: 12.99, Min: 2.00, Max: 92.00.
  3. There are 582 null values in name, 605 in age, 49 in gender, 210 in armed, 57 in city, and 1190 in flee.

October 06, 2023

Worked on project.

  1. Read CSV files in Python that include %inactivity, %obesity and %diabetes.
  2. Then I calculated the mean and standard deviation for each variable.
  3. Then calculated the z-score for every variable.
  4. Converted negative z-score values to positive (absolute values) for each variable.
  5. Then assigned a threshold value to each variable to identify outliers.
  6. Calculated the outliers and concluded that the data has many outliers.
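The z-score steps above (3-6) can be sketched as follows; the data values and the threshold of 2.0 are illustrative assumptions:

```python
import numpy as np

def zscore_outliers(values, threshold=2.0):
    """Flag points whose absolute z-score exceeds the threshold."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()      # standardize
    return np.abs(z) > threshold      # absolute value, then compare

# Made-up sample where 30.0 is an obvious outlier.
data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 30.0])
mask = zscore_outliers(data)
```

The choice of threshold (2.0 vs 3.0) directly controls how many points get flagged, which matters when concluding that a dataset "has many outliers."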

October 4, 2023

Today we worked on the project.

  1. We have three datasets: %diabetes, %obesity, and %inactivity. Apart from these, we will be using the region and %child poverty datasets.
  2. We will be predicting diabetes in this project.
  3. In the EDA part, we are going to find the correlation between i) obesity and diabetes and ii) inactivity and diabetes.
  4. We will find the mean, median, mode, standard deviation, skewness, and outliers.
  5. Then we perform resampling methods on the data.
  6. First we perform K-Fold cross-validation, then the bootstrap.
  7. As there are more than two variables, we have to perform multiple linear regression in this case.
  8. Then we check for homoscedasticity and heteroscedasticity; for that we use the Breusch-Pagan test.

October 2, 2023

  1. Regression analysis is often performed on data that has high variance across distinct values of the independent variable.
  2. This type of data is associated with heteroscedasticity, which denotes unequal variances around the fitted values.
  3. The data points are spread around the fitted line when we do regression.
  4. There should be as little scattering as possible in a good regression model.
  5. The model is referred to as homoscedastic when the dispersion is uniform. The model is heteroscedastic if it is not.
  6. The form of the typical heteroscedastic distribution resembles a cone.
  7. Most frequently, the source of this type of cone-shaped distribution is the data itself.
  8. Sometimes it is quite normal for the dependent variable’s variance to change and to not remain constant across the whole dataset.
  9. In other words, the newest data points may show more variety than the prior data did.
  10. When I performed the linear regression model, heteroscedasticity was noticed in a similar way.
  11. Later performed cross-validation to mitigate this problem.
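A minimal K-Fold cross-validation sketch with scikit-learn (the toy data, fold count, and scoring choice are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy regression data: y is roughly 3x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, 100)

# 5-fold cross-validation: average R^2 over the held-out folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()
```

Cross-validation gives an honest estimate of out-of-sample fit; note that it does not remove heteroscedasticity itself, it only guards against over-optimistic evaluation.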