Analyzing Global Suicide Rate from 1985 to 2016

Ziyu Wang

Outline

1. Introduction

In this project, I want to analyze the suicide rate in each country and the relationship between suicide rate and gender, age, generation, and GDP from 1985 to 2016.

1.1 Background Information

Suicide is the tenth leading cause of death in the United States (US), with nearly 100 suicides occurring each day and over 36,000 people dying by suicide each year. Most of us can hardly imagine the suffering that precedes suicide and the pain left in its wake. It makes me wonder why people become desperate enough to end their lives. Does society have some effect on the suicide rate? Does the GDP of a country have an effect on the suicide rate?

With all these questions in mind, I want to use detailed data to examine the relationship between suicide rate and the sex, age, and GDP of each country. I will also predict suicide rates using machine learning algorithms and analyze them to find correlated factors behind the increase in suicide rates globally. I hope that after reading this project, the reader will have a better understanding of the factors that affect suicide and why mental health is just as important as physical health.

1.2 Libraries Used

2. About the Data

The dataset I am using contains data from 101 countries, including the year, sex, age, number of suicides, and GDP for each year of each country from 1985 to 2016. Overall, it has 27820 samples with 12 features.

2.1 Data Source

I got this data from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. The references below are the ones listed on the Kaggle page.

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

2.2 Data Load and View

The first step is to load the data. I downloaded the data as a CSV file from the web and loaded it into a dataframe. Since we will use the column names a lot in the later analysis, I also want to make the column names simpler. Let's take a look at it.

Also, for future convenience, we should change the age label 5-14 years to the 05-14 years format, so that the age groups sort in order.
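The loading and renaming steps described above might look roughly like the sketch below. It reads a tiny inline stand-in for the CSV instead of the real file, and the shortened column names (suicides_number, suicides_per_100k) are my assumptions for illustration, not necessarily the names used in the rest of this project.

```python
import io
import pandas as pd

# Tiny stand-in for the downloaded CSV (values invented for illustration).
csv_text = """country,year,sex,age,suicides_no,population,suicides/100k pop
Albania,1987,male,15-24 years,21,312900,6.71
Albania,1987,female,5-14 years,0,311000,0.0
"""

# With the real file this would be pd.read_csv("<path to the csv>").
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical shorter column names that are easier to type later on.
df = df.rename(columns={
    "suicides_no": "suicides_number",
    "suicides/100k pop": "suicides_per_100k",
})

# Zero-pad the youngest group so the age labels sort in natural order.
df["age"] = df["age"].replace({"5-14 years": "05-14 years"})

print(df.head())
```

After the replacement, sorting the age labels alphabetically puts 05-14 years first instead of between 35-54 years and 55-74 years.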

3. Data Analysis & Visualization

Now I want to group the data by certain columns and compare the values within each column, for example to compare the suicide rate across sex, age, and generation. By using tools like line plots and bar charts, we can understand how the data is distributed and how the features are related to each other.

From the plot above, we can see that from 1985 to 2016, the suicide rate of males is always considerably higher than that of females. How about age?

We can see that the suicide rate rises with age, so we know age is a factor in suicide.

Even though the 75+ years group has the highest suicide rate, the bar plot shows that the 35-54 years group has the highest number of suicides, followed by 55-74 years. One shocking part is that there are still cases in the 5-14 years group, although they are very few.
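The gap between the highest rate and the highest number comes from the size of each age group. Here is a minimal sketch with invented numbers (the column names suicides_number and population are assumed) showing how a smaller group can have fewer suicides but a higher rate per 100k people:

```python
import pandas as pd

# Invented numbers, chosen only to show rate and count pointing in different directions.
df = pd.DataFrame({
    "age": ["35-54 years", "35-54 years", "75+ years", "75+ years"],
    "suicides_number": [300, 500, 90, 110],
    "population": [2_000_000, 3_000_000, 300_000, 200_000],
})

# Sum counts and population per age group, then derive the rate per 100k.
by_age = df.groupby("age")[["suicides_number", "population"]].sum()
by_age["rate_per_100k"] = by_age["suicides_number"] / by_age["population"] * 100_000

print(by_age)
```

In this toy data, 35-54 years has the larger count while 75+ years has the higher rate, which is the same pattern the two plots show.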

From the pie chart, we can see that the Boomers generation has the largest share of suicides over these three decades. The Silent generation and Generation X also have high shares, at 26.4% and 22.7%.

Now, let's look at the suicide rate in each country. From the information above, we know that there are 101 countries in the dataset. We want to list them and map them. First, we need to create a country list.

From the plot above, we can see that Lithuania has the highest suicide rate, followed by the Russian Federation and Sri Lanka.

Now let's take a look at the correlation between the columns. Seaborn makes it easy to plot the correlation. In the correlation matrix below, the higher the correlation coefficient between two variables, the deeper the color.

From the correlation heatmap, we see that the number of suicides is obviously related to the population, and the GDP per capita is highly related to the HDI for the year. What surprises me is that suicides_number has little correlation with the GDP per capita. Before computing this correlation, I thought the number of suicides would be strongly correlated with GDP, with higher GDP going along with fewer suicides. It turns out that's not true.
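For readers who want to reproduce this kind of matrix, here is a small sketch using pandas on invented numbers. The column names are assumed, and the values are made up so that suicides_number tracks population closely but not gdp_per_capita.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "suicides_number": [10, 25, 40, 80, 120],
    "population": [100_000, 240_000, 410_000, 790_000, 1_250_000],
    "gdp_per_capita": [9000, 3000, 15000, 5000, 12000],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()
print(corr.round(2))

# With seaborn installed, the same matrix could be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="Blues")
```

Each cell is a value between -1 and 1, and the diagonal is always 1 because every column is perfectly correlated with itself.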

4. Machine Learning Algorithm

In the last section, we visualized the data to get a better understanding of how suicide relates to age, sex, country, and generation. We also computed the correlation between the columns to understand how the attributes are related to each other. In this section, I want to build a machine learning model from these feature-label pairs, which make up our training set. My goal is to make accurate predictions for new, never-before-seen data.

4.1 Dataset Standardization

About 70% of the values in the HDI for year column are null. This may hurt the model performance, so I drop the HDI for year column from the dataset. The country-year column just repeats data from other columns, so I drop it as well. In the gdp for year column, I remove the ',' separators. Finally, I use LabelEncoder to convert the non-numeric columns to numerical labels.
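The cleaning steps above could be sketched as follows. The toy rows and the exact column names (for example gdp_for_year) are assumptions for illustration; the real dataframe has more columns and rows.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows standing in for the real dataframe.
df = pd.DataFrame({
    "country": ["Albania", "Albania", "Japan"],
    "sex": ["male", "female", "male"],
    "country-year": ["Albania1987", "Albania1987", "Japan1990"],
    "HDI for year": [None, None, 0.81],
    "gdp_for_year": ["2,156,624,900", "2,156,624,900", "3,132,817,652,848"],
})

# Drop the mostly-null HDI column and the redundant country-year column.
df = df.drop(columns=["HDI for year", "country-year"])

# Remove thousands separators so the GDP values parse as integers.
df["gdp_for_year"] = df["gdp_for_year"].str.replace(",", "", regex=False).astype("int64")

# Encode each non-numeric column as integer labels (one encoder per column).
for col in ["country", "sex"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

LabelEncoder assigns integers in sorted order of the labels, so the numbers carry no meaning beyond identifying each category.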

4.2 Splitting Data
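
Before training, the data is split into a training set and a test set so the model can be scored on data it has not seen. Here is a minimal sketch with scikit-learn's train_test_split on random stand-in data. The 80/20 ratio and the random_state value are assumptions, since the split used in this project is not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in features and target; the real X and y come from the cleaned dataframe.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```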

4.3 Model Building & Training

In this section, we finally start building the ML models. I want to use Linear Regression and Decision Tree.

4.3.1 Linear Regression

Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters w and b that minimize the mean squared error between predictions and the true regression targets, y, on the training set.
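To make the roles of w and b concrete, here is a small sketch on synthetic data, not the suicide dataset. The data is generated from known parameters (w = [3, -2], b = 1, all chosen for illustration), so we can check that the fit recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*x0 - 2*x1 + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit ordinary least squares on the training set only.
model = LinearRegression().fit(X_train, y_train)

print("w:", model.coef_, "b:", model.intercept_)
# For regressors, score() returns R squared, not classification accuracy.
print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))
```

Because this toy target really is linear in the features, the score is close to 1. On real data the score depends on how linear the relationship is.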

The R squared scores on the training data and the test data are both just 0.3, so the performance of this model is not very good. However, the scores on the training and test sets are very close, which means we are underfitting, not overfitting.

Now we can use Linear Regression to predict the suicide rate. Note that this is human behavior, so we can expect R squared to be less than 50%.

4.3.2 Decision Tree: Regression

Decision trees are widely used models for classification and regression tasks. Essentially, they learn a hierarchy of if/else questions, leading to a decision. Learning a decision tree means learning the sequence of if/else questions that gets us to the true answer most quickly.

In the machine learning setting, these questions are called tests. To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable.
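As an illustration of such tests, here is a sketch that fits a shallow regression tree to synthetic step-shaped data, not the suicide dataset, and prints the learned if/else questions. The data, max_depth=3, and the feature name x are all assumptions chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic target: about 2 when x < 5, about 8 otherwise, plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree keeps the printed rules readable.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# Each printed line is one test of the form "x <= threshold".
print(export_text(tree, feature_names=["x"]))
print("test R^2:", tree.score(X_test, y_test))
```

The first test the tree picks should land near x = 5, because that single question explains most of the variation in the target.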

This result is quite accurate! For the Decision Tree, the score is 0.966 on the training data and 0.952 on the test data. This is definitely more accurate than Linear Regression.

For more information about machine learning, I found this amazing website that gives tutorials on machine learning skills, and you can also download a free ebook about machine learning here.

5. Conclusion

5.1 Tutorial recap

With this tutorial, we set out to analyze the global suicide rate and its relationship with gender, age, generation, and country. From the data analysis and visualization part, we know that the suicide rate is related to gender, age, and generation. The suicide rate also differs hugely between countries. We can see that males are more prone to die by suicide than females. Moreover, comparing the suicide rates of different age groups, the rate among the elderly is higher. Does this mean that the world is not providing enough help for older people? From the country list, we also observed that suicide numbers are higher in Asian and European countries. In the machine learning part, I used a Linear Regression model and a Decision Tree model to predict the suicide rate. For this data, the Decision Tree predicts better than Linear Regression.

5.2 Suicide Prevention

Suicide is a tragic reaction to stressful life situations. Many people are suffering from mental health issues and may think there is no way out and that ending their own life is the only way to end the pain. But the problem is solvable; it just takes time and other people's understanding. With such a large number of people dying from suicide, it is not only a problem for individuals but also a concern for society. Judgment and misunderstanding only make those who suffer from mental health problems feel worse. Learn suicide warning signs and how to reach out for immediate help and professional treatment. You may save a life: your own or someone else's. To prevent suicide, the first step is to get the treatment you need. Establishing a support network is also important. And never forget that feelings and pain are temporary.

For immediate help

If you're feeling overwhelmed by thoughts of not wanting to live or you're having urges to attempt suicide, get help now.

Call 911 or your local emergency number immediately. Call a suicide hotline. In the U.S., call the National Suicide Prevention Lifeline at 1-800-273-8255 any time of day — press "1" to reach the Veterans Crisis Line or use Lifeline Chat.