Final Project
The aim of this project is to experience data analysis where you will use the statistical methods taught in this course to come up with some objective findings. You will collect the data from your own choice of data repositories and conduct inferential analysis as discussed in the class.
The Final Project will be graded by the following criteria:
Appropriateness, thoroughness, and accuracy of analysis 60%.
Effectiveness of communication 40%. Includes writing, organization, professionalism & style.
STEP 1:
Begin searching the net and other resources for a dataset that captivates your interest. Your dataset should contain at least 200 observations and 5 to 20 variables.
I prioritized my search through data.gov, which offers a plethora of data to choose from. However, it can be overwhelming to pick the right dataset from such a large collection. For my project, I wanted to focus on an issue that can be addressed through pattern recognition and trend analysis. I ended up selecting a dataset titled "Motor Vehicle Collisions - Crashes". This dataset is from New York City and contains details of various crash events. Originally this dataset had 2,138,106 observations and 29 variables. This will be altered through data cleaning.
To the left is an image of the original 29 variables. This will be shortened down to analyze trends better.
Link to data set = Motor Vehicle Collisions - Crashes
Why was this dataset selected?
Driving is something we either do or partake in. There is a high chance that almost all of us have ridden in a car in some capacity. Driving helps us reach our destinations in a fraction of the time it would otherwise take. However, even though cars bring convenience, they also bring great safety hazards. If the proper safety recommendations are not adhered to, people can get injured or even die.
I believe that by looking at various factors of vehicular accidents, we are able to pinpoint which issues pose the greatest risks. By examining this dataset, which covers one of the largest cities in the world, we can draw inferences that can impact not only New York City but also cities around the world.
STEP 2:
Your next step is to take a sample in order to establish your hypothesis.
This project aims to explore whether there is a significant relationship between specific contributing factors and the severity of crashes. This raises the question: do specific factors lead to more severe crashes?
I will test a hypothesis using a two-sample t-test.
Null Hypothesis (H0): There is no significant relationship between specific contributing factors and the severity of crashes.
Alternative Hypothesis (H1): Certain contributing factors (like driver distraction or alcohol) are strongly correlated with more severe accidents.
To accomplish this, we will examine the recorded accidents and test whether specific factors are associated with more severe accidents or whether no such relationship exists.
As mentioned, I will first prioritize cleaning this data set to make analyzing easier.
Using this function, I was able to eliminate columns I found unnecessary such as Zip code, Latitude, Longitude, Location, On street name, and Off street name.
This resulted in:
I want to prioritize variables such as Number of persons injured, number of persons killed, number of pedestrians injured, number of pedestrians killed, number of cyclist injured, number of cyclist killed, number of motorist injured, number of motorist killed, and contributing factor vehicle 1-5.
CODE:
This section will cover the code that I used to analyze trends.
I first wanted to divide this analysis into TWO sections. The first section will determine whether identified factors are associated with more severe accidents. In essence, does an external contributor lead to worse accidents? The second section will determine which factors are involved in the most crashes, along with the severity of those crashes. With the combination of these TWO sections, we will be able to determine whether external factors matter and which factors matter most.
We first want to test whether contributing factors make a difference. In the dataset, there is a contributing factor labeled "Unspecified". For these records, we'll assume that we do not know what caused the accident. I will use this value to represent accidents that do not have a specific cause. I want to compare the number and severity of unspecified accidents to those of accidents with other contributing factors.
To address this, I created the severity_score column. This is done by using the mutate() function and combining contributing factors.
Next I created a new column titled factor_category to classify the crashes based on whether the primary_factor is specified or unspecified. If the primary_factor is "Unspecified", the crash is categorized as Unspecified. Otherwise it is categorized as Specified.
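The classification rule described above can be sketched in a few lines. This is a Python illustration with made-up records, not the code used in the project (which appears to have relied on mutate()); only the names primary_factor and factor_category come from the text.

```python
# Hypothetical sample records; only the field names come from the write-up.
crashes = [
    {"primary_factor": "Unspecified"},
    {"primary_factor": "Alcohol Involvement"},
    {"primary_factor": "Unsafe Speed"},
]


def factor_category(primary_factor):
    """Return "Unspecified" when no cause was recorded, otherwise "Specified"."""
    return "Unspecified" if primary_factor == "Unspecified" else "Specified"


# Add the new column to every record.
for crash in crashes:
    crash["factor_category"] = factor_category(crash["primary_factor"])

categories = [crash["factor_category"] for crash in crashes]
print(categories)
```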
WELCH TWO SAMPLE t-test
I will use a Welch Two Sample t-test to compare the average severity scores of two groups: Unspecified and Specified.
t-Statistic: measures the size of the difference relative to the variation in the data.
Degrees of Freedom (df): reflects the sample sizes and group variances, and is used to assess the significance of the t-statistic.
p-value: This shows if the difference between the two group means is statistically significant.
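To make the first two quantities concrete, here is a minimal Python sketch of how a Welch t-statistic and its degrees of freedom (the Welch-Satterthwaite approximation) are computed. The two lists are toy severity scores chosen for illustration, not values from the crash dataset, and the function name welch_t_test is hypothetical. The p-value is then read from a t-distribution with these degrees of freedom, which statistical software reports directly.

```python
import math
from statistics import mean, variance


def welch_t_test(group_a, group_b):
    """Return (t_statistic, degrees_of_freedom) for Welch's two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; the groups need not share a variance.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Toy severity scores (assumed for illustration only).
specified = [2, 4, 6, 8, 10]
unspecified = [1, 2, 3, 4, 5]

t_stat, df = welch_t_test(specified, unspecified)
print(round(t_stat, 3), round(df, 3))
```

Because the degrees of freedom depend on both the sample sizes and the group variances, they are usually not a whole number.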
The p-value is less than 2.2e-16. Because the p-value is less than 0.05, we reject the null hypothesis. There is a statistically significant difference in the severity scores between the two groups. Crashes with specific factors such as Alcohol Involvement, Driver Inattention/Distraction, Unsafe Speed, and Aggressive Driving/Road Rage have severity scores that are significantly different from those of crashes with no specified factor.
The mean of the specified group is 0.7029 while the mean of the unspecified group is 0.5340. Crashes with identified contributing factors tend to have higher severity scores, which suggests that crashes with a known cause tend to have more severe outcomes. Crashes without identified causes have lower mean severity scores.
Next we use the mutate() function to create new columns based on existing columns.
A severity_score is implemented to calculate the severity of each crash based on the number of injuries and fatalities, with different weights for each category. This means 1 point for injuries and 5 points for fatalities. This encompasses pedestrians, cyclists, and motorists. Using the cut() function, the severity score is divided into four categories: "No Injury", "Minor", "Moderate", and "Severe".
Next we use the filter() function to remove any missing values (NA).
CODE OUTPUT:
This is an output of the code. As mentioned, Primary_factor encompasses CONTRIBUTING FACTOR VEHICLE 1, CONTRIBUTING FACTOR VEHICLE 2, CONTRIBUTING FACTOR VEHICLE 3, CONTRIBUTING FACTOR VEHICLE 4, CONTRIBUTING FACTOR VEHICLE 5.
Total_crashes represents the total number of crashes associated with each primary factor. It counts how many rows in the dataset correspond to each contributing factor.
Avg_severity is the average of the weighted severity score, which is based on the number of injuries and fatalities. A higher average severity suggests that crashes with this contributing factor tend to be more severe.
Severe_crash_rate is the proportion of crashes that were classified as "Moderate" or "Severe" for each contributing factor. A higher severe crash rate suggests that a specific contributing factor is associated with more serious accidents.
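The three summary columns described above can be sketched as a simple group-by. This is a Python illustration under stated assumptions: the five records are invented, and summarize_by_factor is a hypothetical helper name. Only the column names total_crashes, avg_severity, and severe_crash_rate come from the text.

```python
from collections import defaultdict

# Invented records; each already carries a severity score and category.
crashes = [
    {"primary_factor": "Unsafe Speed", "severity_score": 0, "severity_category": "No Injury"},
    {"primary_factor": "Unsafe Speed", "severity_score": 3, "severity_category": "Moderate"},
    {"primary_factor": "Illness", "severity_score": 6, "severity_category": "Severe"},
    {"primary_factor": "Illness", "severity_score": 1, "severity_category": "Minor"},
    {"primary_factor": "Illness", "severity_score": 2, "severity_category": "Minor"},
]


def summarize_by_factor(rows):
    """Compute total_crashes, avg_severity and severe_crash_rate per factor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["primary_factor"]].append(row)

    summary = {}
    for factor, members in groups.items():
        total = len(members)
        # "Moderate" and "Severe" both count toward the severe crash rate.
        severe = sum(
            1 for m in members if m["severity_category"] in ("Moderate", "Severe")
        )
        summary[factor] = {
            "total_crashes": total,
            "avg_severity": sum(m["severity_score"] for m in members) / total,
            "severe_crash_rate": severe / total,
        }
    return summary


summary = summarize_by_factor(crashes)
for factor, stats in summary.items():
    print(factor, stats)
```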
Among the factors with the most crashes, Traffic Control Disregarded had 38,739 crashes, Unsafe Speed had 31,433, and Alcohol Involvement had 23,916. The average severity for Unsafe Speed was 1.50, and 16.2 percent of its crashes were classified as moderate or severe. The average severity for Traffic Control Disregarded was 1.49, with 15.2 percent classified as moderate or severe. Lastly, Alcohol Involvement had an average severity of 1.03, with 10.6 percent classified as moderate or severe. One very surprising result was Illness. Even though it had fewer reported crashes, its average severity and severe crash rate were higher than those of all other primary factors. With an average severity of 2.13 and 18 percent of crashes classified as moderate or severe, illness appears to be an especially dangerous factor when paired with driving.
This is the output of severity_distribution. It counts the number of crashes that fall within each of the four categories. As mentioned before, the score was calculated through a formula: the number of persons injured is multiplied by 1 and the number of persons killed is multiplied by 5. If the total equals zero, then there were no injuries. If the total is between 0 and 2, then there were minor injuries. If the total is between 2 and 5, then there were moderate injuries. Anything above 5 is classified as severe.
Based on the severity_distribution, No Injury had the highest count at 1,051,460, followed by Minor at 272,565, Moderate at 53,810, and Severe, the lowest, at 30,744.
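The scoring formula and the four categories can be sketched as follows. This is a Python illustration, not the project's code. One assumption is made loudly here: the text does not say which category a score of exactly 2 or exactly 5 falls into, so the sketch treats each upper bound as inclusive (2 is Minor, 5 is Moderate), which matches the default behavior of interval-binning functions such as cut().

```python
INJURY_WEIGHT = 1    # points per person injured (from the text)
FATALITY_WEIGHT = 5  # points per person killed (from the text)


def severity_score(persons_injured, persons_killed):
    """Weighted severity: 1 point per injury, 5 points per fatality."""
    return INJURY_WEIGHT * persons_injured + FATALITY_WEIGHT * persons_killed


def severity_category(score):
    """Map a score to one of the four categories.

    Assumption: upper bounds are inclusive, so 2 is "Minor" and 5 is "Moderate".
    """
    if score <= 0:
        return "No Injury"
    if score <= 2:
        return "Minor"
    if score <= 5:
        return "Moderate"
    return "Severe"


# A crash with two injuries and one fatality.
score = severity_score(persons_injured=2, persons_killed=1)
print(score, severity_category(score))
```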
PEARSON'S CHI-SQUARED TEST
A Pearson's Chi-Squared test was performed. This test was chosen because both variables of interest are categorical: the independent variable is primary_factor and the dependent variable is severity_category. Chi-squared tests are used to determine whether there is a significant association between two categorical variables.
The reported p-value is less than 2.2e-16.
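For readers unfamiliar with the test, here is a minimal Python sketch of how the chi-squared statistic is computed from a table of counts. The 2 x 3 table is invented for illustration and is far smaller than the real factor-by-severity table; chi_squared_statistic is a hypothetical helper name. Turning the statistic into a p-value requires the chi-squared distribution, which statistical software handles.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Rows are one categorical variable (e.g. contributing factor), columns the
    other (e.g. severity category).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Invented counts: 2 factors (rows) by 3 severity categories (columns).
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

statistic, df = chi_squared_statistic(observed)
print(round(statistic, 4), df)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the resulting p-value.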
If p > 0.05, we fail to reject H0. This means no significant relationship exists.
If p < 0.05, we reject H0. This concludes that certain contributing factors are associated with crash severity.
Since 2.2e-16 < 0.05, we reject H0. Significant evidence supports H1. Certain contributing factors, such as alcohol, are strongly associated with severe crashes.
Since we now know there is a relationship between certain factors and crashes, we will pinpoint what factors cause more crashes.
STEP 3:
Null Hypothesis (H0): There is no significant relationship between specific contributing factors and the severity of crashes.
Alternative Hypothesis (H1): Certain contributing factors (like driver distraction or alcohol) are strongly correlated with more severe accidents.
For this analysis, I utilized PEARSON'S CHI-SQUARED TEST and the WELCH TWO SAMPLE t-test. The WELCH TWO SAMPLE t-test was used first, to compare the average severity scores of two groups: Unspecified and Specified. The p-value from this test was less than 2.2e-16. Because the p-value is less than 0.05, we reject the null hypothesis. There is a statistically significant difference in the severity scores between the two groups. From this test, we are able to see that the Specified group had a higher mean severity score, indicating that crashes with an identified factor tend to be more severe than crashes with no specified factor.
A Pearson's Chi-Squared test was then performed to see whether there was a significant association between the two categorical variables, primary_factor and severity_category. If p > 0.05, we fail to reject H0, meaning no significant relationship is found. If p < 0.05, we reject H0 and conclude that certain contributing factors are associated with crash severity. The p-value was less than 2.2e-16, so we are again able to reject the null hypothesis. Based on this, we now also know that primary factors such as alcohol involvement are significantly associated with the severity of crashes. I further broke this down by finding the factors that lead to the highest severity. Among these factors, Traffic Control Disregarded accounts for the most crashes, followed by Unsafe Speed and Alcohol Involvement. The most severe crashes are those attributed to illness. Even though illness does not contribute to the largest number of crashes, these crashes are found to be more deadly.
STEP 4:
Generate visualization and provide short summary of your study – Abstract.
WELCH TWO SAMPLE t-test boxplot visualization
This is a bar plot that highlights the total number of crashes as well as the average severity of the crashes. From this bar plot, we are able to see that Traffic Control Disregarded remains the top contributor to crashes. Unsafe Speed and Alcohol Involvement follow. This visualization lets the audience easily discern which factors are involved in the most accidents.
Average Severity vs. Severe Crash Rate scatter plot visualization
This is a scatter plot with a trend line that demonstrates the relationship between two variables, avg_severity and severe_crash_rate. From this, we can see that Illness has the highest avg_severity along with a high severe_crash_rate. This visualization allows us to see which factors have outstanding rates. Interestingly, the graph shows one major outlier, Illness, while other factors have low severe crash rates and low average severity. For example, Alcohol Involvement has a low severe crash rate and average severity but has one of the highest crash totals.
Takeaway from visualizations and analysis:
From these visualizations, we are able to see trends that would not have been picked up by looking at the raw data alone. As mentioned, primary factors with higher total crashes tend to have lower severe crash rates and average severity. This raises even more questions as to why that happens. Why does something like alcohol involvement have so many accidents but a very low severe crash rate? This analysis brings forth more questions that could be looked into. Going into this analysis, I would not have expected results such as these. I hope to continue looking into why these patterns appear. Normally one would assume that something that causes more accidents would also cause more injuries and deaths.
This project allowed me to understand the importance of analyzing datasets. We hold certain preconceptions to be true. However, at the end of this analysis, my perspective had changed. I have more questions that need to be answered. Using what I learned this semester, I applied a variety of statistical methods to complete this analysis. The data shows real trends that we would not have thought possible.
ABSTRACT/SUMMARY
This analysis used a data set that originally contained 2,138,106 observations and 29 variables (later shortened to 21 variables). The data set was cleaned up to prioritize the number of persons injured, number of persons killed, number of pedestrians injured, number of pedestrians killed, number of cyclist injured, number of cyclist killed, number of motorist injured, number of motorist killed, and contributing factor vehicle 1-5. The first step of this analysis was to determine whether identified factors are associated with more severe accidents. By using the WELCH TWO SAMPLE t-test, I was able to compare the severity of crashes with "unspecified" and "specified" factors. Once it was determined that crashes with "specified" factors had a significantly higher mean severity, I implemented PEARSON'S CHI-SQUARED TEST to test the association between primary factors (alcohol involvement, unsafe speed, illness, etc.) and the severity_category. With this, I was able to see that contributing factors are significantly associated with the severity of crashes. To take this further, I pinpointed the factors that led to the most crashes and the highest severity. Traffic Control Disregarded resulted in the most crashes, while illness resulted in the most severe crashes. These findings were further reinforced through visualizations such as a boxplot, bar plot, and scatter plot. With a combination of all these statistical tools, this analysis demonstrates the importance of understanding accident data and its potential to help other communities improve their driving safety strategies. The results of this analysis should reinforce the importance of driving with no distractions or other influences, to protect not only yourself but also the lives of others.