November 29, 2023

In today’s lecture:

  1. We explored the mechanics of the housing market, paying close attention to how median home values have changed over time and how they relate to broader trends in the economy.
  2. The analysis traced the market’s highs and lows over time, much like a roadmap.
  3. Rising prices frequently point to a solid economy, significant housing demand, and buyer confidence, while price dips or plateaus may indicate a cooling market owing to economic developments or shifting buyer attitudes.
  4. These patterns are linked to more general economic factors like employment and interest rates.
  5. For example, a robust job market can increase people’s ability to purchase a home, which in turn drives up prices, while changes in borrowing rates can dampen or boost the enthusiasm of potential buyers.
  6. Additionally, we observed possible seasonal patterns in the market.
  7. Time series analysis was performed on ‘economic indicators’ data in order to understand underlying patterns such as trends, seasonality, and cyclic behaviour.
  8. The data was arranged chronologically by time variable (columns ‘Year’ and ‘Month’).
  9. I later combined the ‘Year’ and ‘Month’ columns into a single datetime column to represent time.
  10. The time series data was also visualised using time series plots to observe patterns, trends, and seasonality in each economic indicator over time.
  11. The time series was then decomposed into its components (trend, seasonality, residual) for analysis using techniques such as seasonal decomposition.
  12. Seasonal decomposition is a time series analysis method that separates a time series dataset into three parts: trend, seasonality, and residual or error components.
  13. To predict future values of economic indicators, a forecasting model such as ARIMA (AutoRegressive Integrated Moving Average) was used.
  14. The data was then divided into training and test sets in order to train the model and evaluate its performance.
  15. To assess the accuracy of the forecasting model, a metric such as Mean Absolute Error (MAE) was used: the predicted values were compared with the actual values in the test dataset to evaluate the model’s performance (a sketch of this workflow follows the list).
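Here is a minimal sketch of this workflow, assuming the indicators are read from a hypothetical economic_indicators.csv file with ‘Year’ and ‘Month’ columns and an illustrative median_home_value column; the ARIMA order and the 12-month test horizon are also assumptions made for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error

# Load the indicators and combine 'Year' and 'Month' into a single datetime index.
econ = pd.read_csv("economic_indicators.csv")  # hypothetical file name
econ["date"] = pd.to_datetime(
    econ["Year"].astype(str) + "-" + econ["Month"].astype(str) + "-01"
)
series = (
    econ.sort_values("date")
    .set_index("date")["median_home_value"]  # illustrative indicator column
    .asfreq("MS")
)

# Plot the series and decompose it into trend, seasonal, and residual components.
series.plot(title="Median home value over time")
decomposition = seasonal_decompose(series, model="additive", period=12)
decomposition.plot()
plt.show()

# Chronological train/test split: hold out the last 12 months.
train, test = series[:-12], series[-12:]

# Fit an ARIMA model on the training window and forecast the test horizon.
model = ARIMA(train, order=(1, 1, 1))  # (p, d, q) chosen purely for illustration
fitted = model.fit()
forecast = fitted.forecast(steps=len(test))

# Compare predictions with the held-out values using Mean Absolute Error.
mae = mean_absolute_error(test, forecast)
print(f"MAE on the test set: {mae:.2f}")
```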

November 27, 2023

  1. In a VAR model, each variable is modeled as a linear function of past lags of itself and past lags of other variables in the system.
  2. VAR models differ from univariate autoregressive models because they allow feedback to occur between the variables in the model.
  3. An estimated VAR model can be used for forecasting, and the quality of the forecasts can be judged, in ways that are completely analogous to the methods used in univariate autoregressive modelling.
  4. Using an autoregressive (AR) modeling approach, the vector autoregression (VAR) method examines the relationships between multiple time series variables at different time steps.
  5. The VAR model’s parameter specification involves providing the order of the AR(p) model, which represents the number of lagged values included in the analysis.
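To make this concrete, below is a hedged sketch of fitting a VAR(p) model with statsmodels; the file name, the two indicator columns, the differencing step, and the lag order p = 2 are all assumptions made for the example.

```python
import pandas as pd
from statsmodels.tsa.api import VAR

# Load several indicator series indexed by date (file and column names are assumed).
df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")
data = df[["median_home_value", "unemployment_rate"]].diff().dropna()  # difference towards stationarity

# Fit a VAR(2): each variable is regressed on 2 lags of itself and 2 lags of the other variable.
model = VAR(data)
results = model.fit(2)
print(results.summary())

# Forecast the next 6 steps from the last p observations.
forecast = results.forecast(data.values[-results.k_ar:], steps=6)
print(forecast)
```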

November 15, 2023

  1. After analysing the “food establishment inspections” dataset, it was determined that the decision tree technique would work best because it can be used for both regression and classification, handles both numerical and categorical data, implicitly performs feature selection, and is relatively robust to outliers and missing values.
  2. They use a structure akin to a tree to represent decisions and their possible outcomes.
  3. The nodes in the tree stand for features, the branches for decision rules, and the leaves for the result or goal variable.
  4. Decision trees determine the optimal feature to split the dataset at each node based on a variety of parameters, such as information gain, entropy, and Gini index.
  5. The algorithm recursively splits the dataset on the selected features until a stopping criterion is met, such as reaching a maximum tree depth or a minimum number of samples per node.
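A minimal sketch of this approach with scikit-learn follows; the file name, the feature columns, the target column ‘result’, and the max_depth stopping criterion are assumptions, and the categorical features are one-hot encoded because scikit-learn trees expect numeric input.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

inspections = pd.read_csv("food_establishment_inspections.csv")  # hypothetical file name

# One-hot encode categorical features; column names are illustrative assumptions.
X = pd.get_dummies(inspections[["risk_level", "facility_type", "violation_count"]])
y = inspections["result"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gini impurity selects the split at each node; max_depth acts as a stopping criterion.
tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
tree.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```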

November 10, 2023

  1. After running the DBSCAN algorithm, I went on to implement OPTICS, which is an extension of it.
  2. Preprocessed the data by converting the object (string) values into numerical ones and normalizing/standardizing the features.
  3. Computed the pairwise distances between locations using the Haversine distance formula.
  4. The OPTICS algorithm was subsequently applied to the precomputed Haversine distance matrix: optics = OPTICS(min_samples=5, metric='precomputed'); optics.fit(haversine_distance_matrix).
    Cluster information was then collected, including reachability distances and core samples, and the noise points were identified (see the sketch after this list).
  5. The clusters were also visualized on a map to understand their spatial distribution.
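Here is a sketch of the OPTICS step using scikit-learn with a precomputed Haversine distance matrix; the file name and the ‘latitude’/‘longitude’ column names are assumptions about the data layout.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances
from sklearn.cluster import OPTICS

df = pd.read_csv("shooting_locations.csv")  # hypothetical file name

# haversine_distances expects [lat, lon] in radians; multiplying by the Earth's
# radius (~6371 km) converts the result to kilometres.
coords = np.radians(df[["latitude", "longitude"]].to_numpy())
distance_matrix = haversine_distances(coords) * 6371.0

optics = OPTICS(min_samples=5, metric="precomputed")
optics.fit(distance_matrix)

labels = optics.labels_               # -1 marks noise points
reachability = optics.reachability_   # reachability distance per sample
core_distances = optics.core_distances_
print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```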

November 08, 2023

  1. Used the latitude and longitude coordinates and relevant attributes.
  2. Calculated the Euclidean distance to measure dissimilarity between locations.
  3. Then calculated a pairwise distance matrix to represent dissimilarity between all location pairs.
  4. Applied the agglomerative hierarchical clustering algorithm to the distance matrix.
  5. Visualized the dendrogram, which displayed the hierarchical structure of clusters.
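Below is a brief sketch of these steps with SciPy; the file name, the latitude/longitude column names, and the average-linkage choice are assumptions made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

df = pd.read_csv("locations.csv")  # hypothetical file name
coords = df[["latitude", "longitude"]].to_numpy()

# Pairwise Euclidean distances in condensed form, then agglomerative linkage.
distances = pdist(coords, metric="euclidean")
linkage_matrix = linkage(distances, method="average")

# Dendrogram showing the hierarchical structure of the clusters.
dendrogram(linkage_matrix)
plt.xlabel("Location index")
plt.ylabel("Distance")
plt.show()
```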

November 06, 2023

Here is an overview of the main conclusions drawn from the data analysis:
  1. The “flee” variable is divided into four categories: not fleeing, car, foot, and other.
  2. Following the incident, about 4,000 people remained in place.
  3. About 1,300 people fled, some on foot and some in cars.
  4. Method of death: over 7,000 individuals were shot, while the remainder were shot and subjected to tasers.
  5. Gender distribution: compared to females, males were more likely to be involved in the offenses.
  6. Racial distribution: a variety of racial backgrounds were represented, with whites and blacks making up the majority.
  7. Geographic distribution: the reported crime rate was greater in three states.
  8. Indicators of mental illness: approximately 2,000 individuals exhibited signs of mental illness.
  9. Danger level: about 4,000 individuals were recorded

November 3, 2023

  1. K-Means and DBSCAN are two distinct clustering algorithms.
  2. K-Means is a partition-based clustering method where you need to specify the number of clusters (K) beforehand, and it assigns data points to the nearest cluster centroid.
  3. However, it can be sensitive to the initial placement of centroids and primarily performs hard clustering, meaning each point belongs to one cluster. K-Means also assumes that clusters are spherical in shape.
  4. On the other hand, DBSCAN is a density-based clustering algorithm that automatically identifies clusters based on data density without requiring you to specify the number of clusters in advance.
  5. It excels at identifying clusters of arbitrary shapes, handles noise or outliers, and can accommodate clusters of varying sizes.
  6. While DBSCAN primarily performs hard clustering, there are extensions available to achieve soft clustering when necessary.
  7. In summary, the choice between K-Means and DBSCAN depends on factors like the knowledge of the number of clusters and the data’s characteristics, as they have different strengths and use cases in the realm of data clustering.
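To make the contrast concrete, here is a small sketch comparing the two algorithms on synthetic two-moon data with scikit-learn; the dataset and the parameter values (K = 2, eps, min_samples) are purely illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Synthetic, non-spherical clusters to highlight the difference.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# K-Means requires the number of clusters up front and favours spherical clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN infers clusters from density; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-Means cluster labels:", set(kmeans_labels))
print("DBSCAN cluster labels (incl. noise):", set(dbscan_labels))
```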

November 1, 2023

  1. I carried on working today, and in class we discussed the population analysis using the specific police station data.
  2. I also went over the race analysis I’ve done and how I want to combine it with the specific police station data. Apart from that, I was trying to create a map from the location (longitude and latitude) information so that I could examine the clusters and the regions where there were a lot of shootings.
  3. I was using the geopandas library for this, but it appears that the built-in world map dataset is no longer supported, which is why, once the individual points are grouped, there is an issue when trying to import the base map.
  4. I am looking for other libraries or data sources that I can use.
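One workaround I am considering is sketched below: build a GeoDataFrame from the longitude/latitude columns and plot it over a country boundaries file downloaded separately (for example from Natural Earth); the file names, paths, and column names here are hypothetical placeholders.

```python
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical CSV with 'longitude' and 'latitude' columns for the shooting locations.
points = pd.read_csv("shootings.csv")
gdf = gpd.GeoDataFrame(
    points,
    geometry=gpd.points_from_xy(points["longitude"], points["latitude"]),
    crs="EPSG:4326",
)

# Base map read from a locally downloaded boundaries shapefile (placeholder path).
world = gpd.read_file("ne_110m_admin_0_countries.shp")

# Plot the countries in grey and the individual points on top.
ax = world.plot(color="lightgrey", figsize=(12, 6))
gdf.plot(ax=ax, markersize=2, color="red")
plt.show()
```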