Using Tree-Based Gradient Boosting Models to classify terrorism events as Suicide Attacks
13 June 2017
My team, Gonzo, at UTS used the Global Terrorism Database (GTD) to explore whether distinct features of terrorism events could predict the ABC’s online reaction to them. We scraped the ABC’s Twitter feed and Google Search results, then built generalized linear models and ElasticNet regularization models.
Our research illustrated the dramatic increase in terrorism events in recent years and, as shown below (Figure 1), the absolute number and proportion of suicide attacks are also on the rise. Most of these attacks were bombings or explosions (Figure 2). I wanted to explore these suicide attacks further and identify which characteristics in the GTD were most influential in determining whether an event is classified as a suicide attack. This paper represents my exploration and is definitely not perfect!
Figure 1 Terrorist Attacks during 2005-2015
Figure 2 Terrorist Attack Types during 2005-2015
My aims in the work discussed in this blog are, firstly, to deepen our team’s understanding of how we can use the database itself and, secondly, to use a new statistical method, tree-based classification with a gradient boosting model (GBM), to answer my new research question: how can Gradient Boosting Models classify terrorism events as Suicide Attacks?
For the gbm package, I had to change the data import to convert all the logical variables to factors, and make sure there were no NAs.
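A minimal sketch of that preparation step in base R follows. The `events` data frame and its values are invented for illustration (only the column names echo GTD fields), and dropping incomplete rows with `na.omit()` is an assumption: the post does not say whether NAs were dropped or filled in.

```r
# Toy stand-in for a GTD extract; all values are invented.
events <- data.frame(
  success  = c(TRUE, TRUE, NA, TRUE, FALSE),
  multiple = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  nperps   = c(1, 3, 2, NA, 1),
  city     = c("Baghdad", "Kabul", "Baghdad", "Mosul", "Kabul"),
  stringsAsFactors = FALSE
)

# Convert every logical column to a two-level factor.
is_lgl <- vapply(events, is.logical, logical(1))
events[is_lgl] <- lapply(events[is_lgl], factor, levels = c(FALSE, TRUE))

# Character columns are converted to factors as well.
is_chr <- vapply(events, is.character, logical(1))
events[is_chr] <- lapply(events[is_chr], factor)

# Assumption: rows containing any NA are dropped, then unused levels removed.
events_clean <- droplevels(na.omit(events))

str(events_clean)
```

Calling `droplevels()` after filtering matters later on, because factor levels that no longer occur in the data still count towards a factor's number of levels.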
The package also limits the number of levels a factor can have, so the research focused on the Middle East and North Africa, and South Asia regions in the GTD.
In addition, I filtered the data to only those cities that had experienced a suicide attack, which kept the number of city and group name levels below 1024.
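The city filter described above might look something like the sketch below. The data frame is a toy example with invented rows, and coding `suicide` as 0/1 here is an assumption made for brevity.

```r
# Toy data; rows are invented for illustration.
events <- data.frame(
  city    = c("Baghdad", "Baghdad", "Kabul", "Kabul", "Tunis", "Amman"),
  suicide = c(1, 0, 0, 1, 0, 0),
  stringsAsFactors = TRUE
)

# Cities with at least one recorded suicide attack.
suicide_cities <- unique(as.character(events$city[events$suicide == 1]))

# Keep only events in those cities, then drop the now-unused factor levels.
filtered <- droplevels(events[events$city %in% suicide_cities, ])

# Check every factor column against the level limit quoted in the post.
level_counts <- vapply(Filter(is.factor, filtered), nlevels, integer(1))
print(level_counts)
stopifnot(all(level_counts < 1024))
```

Without `droplevels()`, the filtered data frame would still carry the levels of every city in the original data, so the check would not reflect the reduction.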
The gbm package also takes a binary outcome variable, so I translated my target “suicide” into “outcome_binary”.
After initial data exploration, reading the GTD codebook, and finding extreme correlation between my outcome and some variables, I removed 3 variables from the data: Weapsubtype1_txt=” Suicide…”, Nkillterr and terrorist_killed.
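As a rough illustration of that translation, assuming `suicide` arrives as a factor with levels "FALSE" and "TRUE" (the post does not show how it was actually stored), the 0/1 coding that gbm's Bernoulli distribution expects can be produced like this:

```r
# Assumed representation of the target after import: a TRUE/FALSE factor.
suicide <- factor(c("TRUE", "FALSE", "FALSE", "TRUE"))

# Compare against the label rather than calling as.integer() on the factor,
# which would return the internal level codes (1 and 2) instead of 0 and 1.
outcome_binary <- as.integer(as.character(suicide) == "TRUE")

print(outcome_binary)  # 1 0 0 1
```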
The gbm model
The data was split 70/30 into training and testing sets. My best cross-validated gbm model is shown below:
library(gbm)
# Boosted classification trees for the 0/1 outcome, with 10-fold cross-validation
gbm_fit = gbm(outcome_binary ~ ., distribution = "bernoulli", data = training, cv.folds = 10,
verbose = "CV", n.trees = 100, interaction.depth = 3)
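The 70/30 split that produces the `training` set used in the call above could be sketched as follows. This assumes a simple random split with a fixed seed; the post does not say whether the split was random or stratified, and the toy `events` data is simulated.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Simulated stand-in for the prepared data (values are not from the GTD).
n <- 1000
events <- data.frame(
  outcome_binary = rbinom(n, 1, 0.1),
  nperps         = rpois(n, 2)
)

# Randomly assign 70% of the rows to training and the rest to testing.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
training  <- events[train_idx, ]
testing   <- events[-train_idx, ]

c(training = nrow(training), testing = nrow(testing))
```

With a rare outcome such as suicide attacks, a stratified split (sampling within each outcome class) is a common alternative, because it keeps the proportion of positives similar in both sets.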
As this is a classification model with a binary outcome, I evaluated the model by calculating the confusion matrix shown below.
| Prediction | Actual: Suicide = No | Actual: Suicide = Yes |
| Suicide = Yes | 1059 | 620 |
Table 1 Confusion Matrix
Because of the high number of false positives (1059 of the 1679 events predicted as suicide attacks), the precision of the model is 37%, but accuracy is high (89%) because of the large number of correctly predicted negatives. This is a common result when the response variable is sparse. It can be seen graphically in the ROC chart (Figure 3).
The Area Under the Curve (AUC) was 98.33%, which is really high (100% is perfect). The Receiver Operating Characteristic (ROC) chart in Figure 3 illustrates this, and shows a very small gap between the training and testing curves.
I did have some faith in the result, however, as I had already removed the 3 variables that were highly correlated with the suicide variable.
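The precision figure can be checked directly from the two counts in Table 1. The sketch below uses only those two numbers; accuracy, sensitivity and specificity also need the counts for events predicted as non-suicide, which are not shown in the table.

```r
# Counts taken from Table 1 (events predicted as suicide attacks).
true_positives  <- 620
false_positives <- 1059

# Precision: share of predicted suicide attacks that really were suicide attacks.
precision <- true_positives / (true_positives + false_positives)

# False discovery rate: share of predicted suicide attacks that were not.
false_discovery_rate <- 1 - precision

round(100 * precision, 1)             # 36.9
round(100 * false_discovery_rate, 1)  # 63.1
```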
Figure 3 ROC Chart
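For readers unfamiliar with AUC, here is a small base R sketch of one standard way to compute it: the rank-based (Mann-Whitney) formula. It is not the evaluation code used for this post, and `auc_rank`, the labels and the scores are made up for illustration.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count as half).
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # tied scores receive their average rank
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented example: 3 positives, 5 negatives.
labels <- c(0, 0, 0, 0, 1, 0, 1, 1)
scores <- c(0.02, 0.05, 0.10, 0.20, 0.15, 0.30, 0.60, 0.90)

auc <- auc_rank(scores, labels)
print(auc)  # 13 of the 15 positive/negative pairs are ordered correctly
```

An AUC of 1 means every positive case is scored above every negative case, while 0.5 is what random scoring would achieve on average.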
The sensitivity score is 99% and specificity is 88.6%. I used our lecturer Stephan’s model evaluation code but, I have to say, something looks odd with the charts (Figure 4).
Figure 4 Sensitivity Specificity Chart
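Sensitivity and specificity both depend on the probability cut-off used to turn predicted probabilities into classes. The sketch below shows that trade-off on invented data; `sens_spec` is a hypothetical helper and is not the lecturer's evaluation code.

```r
# Sensitivity and specificity for a given probability threshold.
sens_spec <- function(prob, actual, threshold) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  fp <- sum(predicted == 1 & actual == 0)
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# Invented outcomes and predicted probabilities (2 positives, 6 negatives).
actual <- c(0, 0, 0, 0, 0, 0, 1, 1)
prob   <- c(0.01, 0.02, 0.03, 0.08, 0.10, 0.40, 0.07, 0.70)

high_cut <- sens_spec(prob, actual, threshold = 0.5)     # sensitivity 0.5, specificity 1
low_cut  <- sens_spec(prob, actual, threshold = 0.0635)  # sensitivity 1, specificity 0.5

print(rbind(high_cut, low_cut))
```

Lowering the threshold catches more of the rare positive class at the cost of more false positives, which is the usual pattern when the outcome is sparse.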
The model calculated the probability threshold for classifying an event as a suicide attack as 6.35%. The gbm summary table showed that 3 variables accounted for 100% of the relative influence: nperps, weapsubtype1_txt and city.
## var rel.inf
## nperps nperps 47.774687
## weapsubtype1_txt weapsubtype1_txt 42.885061
## city city 9.340252
I analysed these variables further to see why they were so influential.
The majority of terrorist attacks, and suicide attacks in particular, were perpetrated by one attacker. This does not mean they were acting alone, only that in the vast majority of cases a single person carried out the attack (Figure 5).
Figure 5 Suicide attacks by number of attackers (nperps)
Figure 6 Weapon (sub) type (weapsubtype1_txt)
Vehicles were used in the majority of suicide attacks (but not attacks overall) (Figure 6). Given the finding that most suicide attacks are bomb/explosive attacks (Figure 2), this finding makes sense.
Lastly, the third most influential variable, city, is illustrated in Figure 7. Baghdad has withstood the greatest number of terrorist attacks over the ten-year period, including suicide attacks. Baghdad has suffered many devastating car bomb suicide attacks in this time, killing hundreds of people.
Figure 7 Cities that withstood terrorist attacks (city)
Figure 8 Groups perpetrating terrorist attacks (gname)
The Islamic State (ISIL) has been the perpetrator of the majority of suicide attacks. Boko Haram, an active terrorist group that also perpetrates suicide attacks, is not included as it operates in the sub-Saharan African region.
I can conclude that the 3 most important variables from my model stand up to the scrutiny of further data analysis.
Recommendations for enhancing the model
I tried to get glmnet with ridge and lasso working to deal with the sparseness of my response variable; however, the model would run overnight and then fail. Getting this working would definitely improve the model.
Building an ElasticNet regularization model and upsampling the suicide attack class would also improve it.
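As a starting point for the upsampling idea, here is a minimal base R sketch that resamples the minority class with replacement until the two classes are the same size. The `training` data frame is simulated, and balancing to a 1:1 ratio is an assumption rather than a recommendation from this analysis.

```r
set.seed(1)  # fixed seed so the resampling is reproducible

# Simulated, imbalanced training data: 95 negatives and 5 positives.
training <- data.frame(
  outcome_binary = c(rep(0, 95), rep(1, 5)),
  nperps         = rpois(100, 2)
)

majority <- training[training$outcome_binary == 0, ]
minority <- training[training$outcome_binary == 1, ]

# Draw minority rows with replacement until they match the majority count.
resampled_minority <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
upsampled <- rbind(majority, resampled_minority)

table(upsampled$outcome_binary)
```

Upsampling should be applied to the training set only; the testing set needs to keep the original class balance so that the evaluation reflects real-world proportions.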