Table of Content
This is a dataset about the crimes reported in the city of Seattle (USA) between 2008 and 2022. It comes directly from the City of Seattle Open Data Portal.
This project is about a:
Basic Exploratory Data Analysis (EDA)
In this project, I'm going to use Excel.
You can download the Excel Document by clicking on the button below:
Note: The document could be opened in Google Sheets and some Excel features can't be displayed and will be dropped if you make changes. Also, formatting won't be exactly the same.
So, for a better experience, download the document by clicking on File=> Download => Microsoft Excel (.xlsx).
Some information about the dataset:
The dataset represents the crimes reported in Seattle between 2008 and 2022.
There are almost 970K rows. Each row is crime data.
There are 17 Columns:
Report Number: (Text) Primary key/UID for the overall report. One report can contain multiple offenses, as denoted by the Offense ID.
Offense ID: (Text) Distinct identifier to denote when there are multiple offenses associated with a single report.
Offense Start Date Time: (Date & Time) Start date and time the offense(s) occurred.
Offense End Date Time: (Date & Time) End date and time the offense(s) occurred, when applicable.
Report Date Time: (Date & Time) Date and time the offense(s) was reported (Can differ from date of occurrence).
Group A B: (Text) Corresponding offense group.
Crime Against Category: (Text) Corresponding offense crime against category.
Offense Parent Group: (Text) Corresponding offense parent group.
Offense: (Text) Corresponding offense.
Offense Code: (Text) Corresponding offense code.
Precinct: (Text) Designated police precinct boundary where offense(s) occurred.
Sector: (Text) Designated police sector boundary where offense(s) occurred.
Beat: (Text) Designated police beat boundary where offense(s) occurred.
MCPP Neighborhood: (Text) Designated Micro-Community Policing Plans (MCPP) boundary where offense(s) occurred.
Blurred Address: (Text) Offense(s) address location blurred to the one hundred blocks.
Blurred Longitude Coordinate: (Number) Offense(s) spatial coordinate blurred to the one hundred blocks.
Blurred Latitude Coordinate: (Number) Offense(s) spatial coordinate blurred to the one hundred blocks.
You can download the Data Source from the City of Seattle Open Data Portal:
There are no duplicates in the dataset.
As the dataset is huge (Almost 1 million rows and 17 columns), while importing the CSV file, I decided to exclude the columns I'm not going to use in my analysis. The excluded columns are:
Offense ID.
Offense End Date Time.
Group A B.
Offense.
Offense Code.
Sector.
Beat.
MCPP Neighborhood.
Blurred Address
Blurred Longitude Coordinate.
Blurred Latitude Coordinate.
I only kept 6 columns and this is why:
Report Number: It is the Primary key of the table. It could be useful later on if I want to reintegrate an excluded column or if I want to use vlookup function...
Offense Start Date Time: It's necessary to know when a certain offense happened (analysis by time, analysis by days, analysis by months, analysis by years...).
Report Date Time: It's important to see when the offense got reported to maybe calculate the duration from occurrence to report, to notice if there are patterns...
Crime Against Category: Understanding against which category (Society, Person, Property...) the crime was done will help explore the data.
Offense Parent Group: It's good to group the offenses and see the trends and the patterns of the offense groups.
Precinct: It's the best spatial variable that gives a general idea of how the offenses are distributed (West, East, North...).
I used the Proper() function to transform the text in columns "Crime Against Category" & "Offense Parent Group" from all capital words to proper words.
In the column "Precinct", I changed the letters (W, S, N, SW...) with their actual meanings (West, South, North, Southwest...)
I added a new column "Duration from Occurrence to Report" which calculates the timeframe between the "Offense Start DateTime" and the "Report DateTime" in the format "dd - hh:mm". In some rows, the "Offense Start DateTime" comes after the "Report DateTime" which is not logical at all. As I don't have any detail about why would it be this way (system bug...), I used the ABS() function to calculate the absolute value of the duration.
In the column "Offense Start DateTime", I deleted the blank rows.
In the dataset, some crimes happened years and even decades before 2008. Those crimes represent less than 0.2% of all the data. So, I deleted them to only deal with the crimes that happened AND were reported after 2008 until 2021.
I added a new column "Offense Day of Week" which specifies the day of the week when the offense happened. I used the Text() function .
After cleaning and preparing my data, I put it in a Table called "Dataset".
The idea behind this sheet is to analyze how crimes change through time (years, quarters, months, days of the week, and hours of the day) and if there are seasonality or some patterns.
From 2008 to 2021, the number of crimes kept increasing and decreasing but without giving us any significant insight. The best year was 2012 when the number of crimes was only about 58,980 while the worst year was 2020 with 76,047 crimes.
There is no seasonality by quarters!
There is no seasonality by months!
February registered the lowest number of crimes but it is most likely due to how short is it (28 or 29 days) compared to the other months (30 or 31 days) and for 14 years of the dataset.
Also by days of the week, there is no seasonality!
Grouping the crimes by the 24 hours of the day, we can clearly see how a big part of them happen between noon (12 PM) and 1 AM with the exact hour after midnight representing by itself 8% of the crimes.
1 Am to 12 PM is the quiet part of the day with the hour of 5 AM representing only less than 1.5% of the crimes.
We can conclude that crimes don't happen at a certain quarter, month, or day in particular but happen differently at certain hours of the day.
In this sheet, I tried to analyze how the crimes are distributed by precincts.
Around a third (32%) of Seattle's crimes between 2008 and 2021 happened in the north precinct, followed by the west precinct (27%), while the east and the south precincts had a fairly equal number of crimes (respectively, 15% and 14%). The southwest precinct recorded the lowest number of crimes by almost 10%.
In this sheet, I tried to analyze how the crimes are distributed against categories.
Almost 3 crimes out of 4 were against property while 15% were against people and 11% were against Society.
In this sheet, I tried to analyze how the crimes are distributed by offense groups.
More than a third (37%) of crimes are "Larceny-Theft". "Assault Offenses" registered as the second most frequent crime with 14%, followed by "Burglary/Breaking & Entering" with 11%, then "Destruction/Damage/Vandalism of Property" with 9%. The other offense groups share the rest (28% of crimes).
In this sheet, I tried to analyze the relationship between variables and draw some insights.
By analyzing the frequency of crime against category per precinct, I can see that there isn't a big difference between how the categories of crimes are distributed from one precinct to the other.
I might say that the share of crimes against property in the north and the southwest precincts is slightly higher than the share of crimes against property in the other precincts.
By calculating the frequency of crimes by offense groups in each precinct, and using conditional formatting, I can easily spot how the offense group "Larceny-Theft" is largely represented in the north (12.5%) and the west (11.6%) precincts.
As we saw before, crimes happened differently following the hours of the day, that's why I tried to see their frequency in each hour of the day by precinct. I could confirm that crimes happen mostly between noon and 1 AM, and especially at midnight.
I noticed that the duration from crimes' occurrence to report is influenced by the type of offenses. For example, Drunkness, Bribery, or Gambling offenses are being reported just a few hours after happening, while Sex offenses, Pornography/Obscene material, and Human trafficking needs weeks and months until being reported.
As this is a big dataset, it's obvious that using Tableau or Python to analyze and visualize the data makes perfect sense. But I wanted to showcase that Excel as well can do the job.
For sure there are so many analyses that could be done with this dataset. It all depends on the questions we could generate during business requirements.
This Exploratory Data Analysis (EDA) would help us understand and get some basic insights into our dataset before trying deeper analysis or any kind of data modeling, forecasting...