Using Tree Based Gradient Boosting Models to classify terrorism events as Suicide Attacks

Tracy Keys

13 June 2017

Background

My team Gonzo at UTS used the Global Terrorism Database (GTD) to explore whether distinct features of terrorism events could predict the ABC's online reaction to them. We did this by web scraping the ABC's Twitter feed and Google Search results, and then building generalized linear models and ElasticNet regularization models.

Our research illustrated the dramatic increase in terrorism events in recent years, and as shown below (Figure 1), the absolute number and proportion of suicide attacks are also on the rise. Most of these attacks were bombings or explosions (Figure 2). I wanted to explore these suicide attacks further and identify which characteristics in the GTD were the most influential factors in classifying an event as a suicide attack. This paper represents my exploration and is definitely not perfect!

Figure 1 Terrorist Attacks during 2005-2015

Figure 2 Terrorist Attack Types during 2005-2015

My aims in the work discussed in this blog are, firstly, to deepen our team's understanding of how we can use the database itself, and secondly, to use a new statistical method, gradient boosting over classification trees, to answer my new research question: how can Gradient Boosting Models classify terrorism events as Suicide Attacks?

Data Preparation

I had to change the data import by converting all the logical variables to factors for the gbm package, and by making sure there were no NAs.

The gbm package also limits the number of levels a factor can have, so the research focused on the Middle East and North Africa, and South Asia regions in the GTD.

In addition, I filtered the city variable further to only those cities that had experienced a suicide attack; this way I could keep my city and group name levels below the limit of 1,024.

The gbm package also requires a numeric binary outcome for the Bernoulli distribution, so I translated my target "suicide" into a 0/1 variable called "outcome_binary".
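A minimal base-R sketch of these preparation steps, using a toy data frame with made-up values standing in for the real GTD extract (the real data has many more columns and rows):

```r
# Toy stand-in for the GTD extract
gtd <- data.frame(
  region_txt = c("Middle East & North Africa", "South Asia", "Western Europe"),
  city       = c("Baghdad", "Kabul", "Paris"),
  suicide    = c(TRUE, FALSE, FALSE),
  nperps     = c(1, 3, NA)
)

# Convert logical columns to factors (gbm handles factors, not logicals)
logi_cols <- vapply(gtd, is.logical, logical(1))
gtd[logi_cols] <- lapply(gtd[logi_cols], factor)

# Make sure there are no NAs
gtd <- na.omit(gtd)

# Keep only the two regions of interest to stay under gbm's factor-level limit
gtd <- gtd[gtd$region_txt %in%
             c("Middle East & North Africa", "South Asia"), ]

# gbm's bernoulli distribution expects a numeric 0/1 outcome
gtd$outcome_binary <- ifelse(gtd$suicide == "TRUE", 1, 0)
```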

After initial data exploration, reading up on the GTD codebook, and finding extreme correlation between my outcome and some variables, three variables were removed from the data: weapsubtype1_txt = "Suicide…", nkillterr and terrorist_killed.

The gbm model

My data sets were split 70/30 into training and testing sets. My best cross validated gbm model is shown below:

gbm_fit = gbm(outcome_binary ~ ., distribution = "bernoulli", data = training,
              cv.folds = 10, verbose = "CV", n.trees = 100, interaction.depth = 3)
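The 70/30 split mentioned above can be sketched in base R as follows (the data frame here is a toy stand-in for the prepared GTD data):

```r
set.seed(42)  # arbitrary seed, for reproducibility

# Toy stand-in for the prepared data frame
gtd <- data.frame(x = rnorm(100), outcome_binary = rbinom(100, 1, 0.1))

# Draw 70% of row indices at random for the training set
n         <- nrow(gtd)
train_idx <- sample(n, size = floor(0.7 * n))

training <- gtd[train_idx, ]
testing  <- gtd[-train_idx, ]
```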

Model Evaluation

As this is a classification model with a binary outcome, I evaluated the model by calculating the Confusion matrix shown below.

                   Reference
Prediction         Suicide = No   Suicide = Yes
Suicide = No               8252               6
Suicide = Yes              1059             620

Table 1 Confusion Matrix

Due to the high number of false positives (1,059 of the 1,679 predicted positives), precision of the model is only 37%, but accuracy is high (89%) thanks to the large number of correct negative predictions. This is a common result when the response variable is sparse, and it can be seen graphically in the ROC chart (Figure 3).
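These figures can be verified directly from the counts in Table 1:

```r
# Counts from the confusion matrix in Table 1
tn <- 8252; fn <- 6     # predicted No
fp <- 1059; tp <- 620   # predicted Yes

precision   <- tp / (tp + fp)                   #  620 / 1679 ~ 0.37
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 8872 / 9937 ~ 0.89
sensitivity <- tp / (tp + fn)                   #  620 /  626 ~ 0.99
specificity <- tn / (tn + fp)                   # 8252 / 9311 ~ 0.886
```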

The Area Under the Curve (AUC) was 98.33%, which is very high (100% is perfect). This is illustrated by the very small gap between the training and testing Receiver Operating Characteristic (ROC) curves in Figure 3.

I did have some faith in the result, however, as I had already removed the three variables that were highly correlated with the suicide variable.

Figure 3 ROC Chart

The sensitivity score is 99%, and specificity is 88.6%. I used our lecturer Stephan's model evaluation code, but I have to say, something looks odd with the charts (Figure 4).

Figure 4 Sensitivity Specificity Chart

 

Model Findings

The model calculated the probability threshold for classifying an event as a suicide attack to be 6.35%. The gbm summary table showed that three variables accounted for 100% of the relative influence: nperps, weapsubtype1_txt and city.

##                                     var   rel.inf
## nperps                           nperps 47.774687
## weapsubtype1_txt       weapsubtype1_txt 42.885061
## city                               city  9.340252

I analysed these variables further to see why they were so influential.

The majority of terrorist attacks, and suicide attacks in particular, were perpetrated by a single attacker. This does not mean the attacker was acting alone, only that in the vast majority of cases one person carried out the attack (Figure 5).

Figure 5 Suicide attacks by number of attackers (nperps)

Figure 6  Weapon (sub) type (weapsubtype1_txt)

Vehicles were used in the majority of suicide attacks, though not of attacks overall (Figure 6). Given that most suicide attacks are bombings or explosions (Figure 2), this finding makes sense.

Lastly, the third most influential variable, city, is illustrated in Figure 7. Baghdad has withstood the greatest number of terrorist attacks over the ten-year period, including suicide attacks, and has suffered many devastating car-bomb suicide attacks in this time, killing hundreds of people.

Figure 7 Cities that withstood terrorist attacks (city)

Figure 8 Groups perpetrating terrorist attacks (gname)

The Islamic State (ISIL) has been the perpetrator of the majority of suicide attacks (Figure 8). Boko Haram, an active terrorist group that also perpetrates suicide attacks, is not included as it operates in the sub-Saharan African region.

I can conclude that the three most important variables from my model stand up to the scrutiny of further data analysis.

Recommendations for enhancing the model

I tried to get glmnet working with ridge and lasso penalties to deal with the sparseness of my response variable; however, the model would run overnight and then fail. Getting this working would definitely improve the model.

Building an ElasticNet regularization model and upsampling the rare positive class would also improve it.
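Upsampling can be sketched in base R as a simple random oversampling of the rare positive class (the caret package's upSample function offers a more polished alternative; the data frame here is a toy stand-in for the real training set):

```r
set.seed(1)  # arbitrary seed

# Toy imbalanced training set: 60 positives, 940 negatives
training <- data.frame(outcome_binary = rep(c(1, 0), times = c(60, 940)),
                       x = rnorm(1000))

pos <- training[training$outcome_binary == 1, ]
neg <- training[training$outcome_binary == 0, ]

# Resample the minority class with replacement up to the majority class size
pos_up <- pos[sample(nrow(pos), nrow(neg), replace = TRUE), ]

# Balanced training set: equal numbers of positives and negatives
training_balanced <- rbind(neg, pos_up)
```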

Git it on…. maybe not!

I've been mildly anxious about my hard work living only on my C: drive, managing version control, tracking my thinking and learning, that kind of thing, and then last night I got this error trying to open my work…

load("~/R/lineartimeseries/DAM assignment 2 Pt A Q3 v1.R")
Error: bad restore file magic number (file may be corrupted) — no data loaded
In addition: Warning message:
file 'DAM assignment 2 Pt A Q3 v1.R' has magic number '#####'
Use of save versions prior to 2 is deprecated

My anxiety escalated! But then I rebooted and everything was fine. (In hindsight, the error makes sense: load() expects a saved .RData file, and the '#####' magic number shows this file is a plain R script, which should be opened or source()d instead.)

Following this brief moment of panic, Durand helpfully pointed out that this is where something like GitHub would be useful.

Ah, this course is teaching me so much 🙂

So I thought, I am going to follow this tutorial: http://product.hubspot.com/blog/git-and-github-tutorial-for-beginners

But then of course, in the usual style, we get sent off to another tutorial, http://mac.appstorm.net/how-to/utilities-how-to/how-to-use-terminal-the-basics/ which is then just for Mac.

Seriously.

So stay tuned for Part II but right now I am going back to writing my DAM assignment.

According to Mark Zuckerberg, Facebook is not a media company

According to its CEO Mark Zuckerberg, Facebook, the world's largest social media platform[i], is not a media company[ii].

Zuckerberg explained in August 2016: "No, we are a tech company, not a media company… We build the tools, we do not produce any content."[iii]

One of those tools is the Facebook News Feed, which provides every one of the almost 2bn[iv] monthly active users a hyper-personalised news stream: "…an algorithmically generated and constantly refreshing summary of updates…"[v] from friends and any other page a user follows, plus targeted ads and Page suggestions from Facebook. There is also the Trending module on the right-hand side of the Facebook user home page, which surfaces news stories and is entirely created by an algorithm[vi].

How Facebook News Feed works

The Facebook algorithm is complex, but it essentially works by identifying key features of a post (e.g. whether it is a video, who posted it, how often it was shared and by whom), and by using natural language processing to identify the topics and sentiments within the post's text.

Then, in order to present relevant content to the specific user, Facebook analyses the past behaviour of the user and of other users across hundreds of factors, and predicts the likelihood that the user will engage with this piece of content because they, or people like them, previously engaged with this content type and topic. This likelihood, combined with the age of the content and how popular it is across the network, is its News Feed rank score. Content is then selected and sorted so that the highest-ranked content appears first in the news feed, with the rest presented in descending order.
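As a toy illustration only (Facebook's actual model is proprietary and vastly more complex; all names, feature scores and the scoring formula below are made up), the ranking step described above might look something like this:

```r
# Hypothetical candidate posts with made-up feature scores
posts <- data.frame(
  id         = c("video_a", "status_b", "photo_c"),
  p_engage   = c(0.70, 0.20, 0.55),  # predicted engagement likelihood
  age_hours  = c(12, 1, 48),         # age of the content
  popularity = c(0.9, 0.4, 0.6),     # network-wide share/like signal
  stringsAsFactors = FALSE
)

# An invented rank score: engagement likelihood, decayed by age,
# scaled by network popularity
posts$rank_score <- posts$p_engage * exp(-posts$age_hours / 24) * posts$popularity

# Present the highest-ranked content first
feed <- posts[order(posts$rank_score, decreasing = TRUE), "id"]
```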

The Facebook algorithm is constantly being tweaked by Facebook through unsupervised machine learning, supplemented by the analysis of their team of data scientists, and qualitative feedback from dedicated user focus groups.[vii][viii]

Benefits of Facebook News Feed

Using unsupervised text analysis and machine learning algorithms to find and serve up content to the specific user has a lot of benefits, as such hyper-personalisation can be performed economically at scale, giving huge international reach for content creators, publishers, and interest groups.

Users are served up content that has a high probability of being from like-minded people, brands and groups, without having to search for it themselves (although that too is possible, utilising text analysis and search tools).

Brands and groups can quickly gain followers or reach a large audience if they know how to use the system, which is a great platform for brand awareness or for non-mainstream/minority causes to publish and broadcast their views.

In this regard, the Facebook News Feed provides the promise of freedom of speech and capitalist marketplace for its users, as does the internet as a whole:

“What is driving the Net is the promise of political efficacy, of the enhancement of democracy through citizens’ access and use of new communications technologies.”[ix]

Facebook as a technology company build the tools, and then content creators and publishers use the platform and the News Feed algorithm to find an audience for their content.  Facebook is the neutral, laissez faire “marketplace”, with community guidelines to prevent hate and crimes from being encouraged[x].

Downsides of Facebook News Feed

However, recent events have highlighted some of the flaws in the News Feed algorithm and in the processes for dealing with its errors. In the recent US election, it was uncovered that fake news sites were being promoted in people's feeds to gain advertising revenue[xi]. The algorithm currently cannot distinguish legitimate news sites from satirical and/or fake sites. Facebook also has not developed its automated monitoring systems and escalation workflows at the same rate as its automated products; just this week a horrific video of a man murdering another man in cold blood remained on the site for three hours after it was initially reported[xii].

It is becoming increasingly difficult for Facebook to argue that it is not a media company, or that it does not have a responsibility to its users and the community for how its tools are used.

Facebook and its News Feed algorithm are under pressure to assure the community that they are not proliferating fake news, manipulating their users' emotions[xiii], promoting hate, discouraging respect for and dialogue between both sides of a debate[xiv], or broadcasting violent and terrible video and taking too long to remove it[xv]. Even more so, they are under pressure from their advertisers to ensure brands are not placed next to such content. Some advertisers have recently pulled advertising from Google and YouTube, and Facebook is very aware it could be next[xvi].

In addition, the algorithm is not transparent to Facebook's users, and cannot be reset, customised or trained by them. Users can find this frustrating and feel stuck in an echo chamber, open to manipulation by Facebook, lobby groups or unscrupulous advertisers who know how to game the algorithm.

“What if people “like” posts that they don’t really like, or click on stories that turn out to be unsatisfying? The result could be a news feed that optimizes for virality, rather than quality—one that feeds users a steady diet of candy, leaving them dizzy and a little nauseated, liking things left and right but gradually growing to hate the whole silly game.” [xvii]

The Verdict

On balance, I think the benefits of the Facebook News Feed algorithm and its natural language processing outweigh these costs. Facebook is still very much listening to its users, is aware that there is intense competition for their attention, and is therefore constantly working to improve the algorithm and its products.

For example, in January 2017 Facebook made changes to the Trending module to only show trusted news sources[xviii], in April 2017 it implemented a button to report possible fake news stories, and it has established a user group to provide real human feedback on the algorithm.

Facebook recently announced a project with esteemed journalist Jeff Jarvis and CUNY to build relationships and support credible journalism[xix].

Even Mark Zuckerberg, CEO of Facebook, is changing his tune. In December 2016 he said:

“Facebook is a new kind of platform. It’s not a traditional technology company…It’s not a traditional media company. You know, we build technology and we feel responsible for how it’s used.”[xx]

Which is just as well, because whilst he might not want to admit he runs a media company, almost 2bn users a month use Facebook for their news, and if Facebook doesn't act responsibly, legislators will eventually catch on that Facebook and social media are very much key to the world's media ecosystem.

End notes

[i] Wikipedia.com, Facebook. [ONLINE] Available at:  https://en.wikipedia.org/wiki/Facebook [Accessed 17 April 2017].

[ii] Reuters.com, Giulia Segreti. 2016. Facebook CEO says group will not become a media company. [ONLINE] Available at: http://www.reuters.com/article/us-facebook-zuckerberg-idUSKCN1141WN. [Accessed 17 April 2017].

[iii] Reuters.com, Giulia Segreti. 2016. Facebook CEO says group will not become a media company. [ONLINE] Available at: http://www.reuters.com/article/us-facebook-zuckerberg-idUSKCN1141WN. [Accessed 17 April 2017].

[iv] Wikipedia.com, Facebook. [ONLINE] Available at:  https://en.wikipedia.org/wiki/Facebook [Accessed 17 April 2017].

[v] Wikipedia.com, Timeline of Facebook. [ONLINE] Available at: https://en.wikipedia.org/wiki/Timeline_of_Facebook [Accessed 17 April 2017].

[vi] TheGuardian.com, Facebook fires trending topics team [ONLINE] Available at: https://www.theguardian.com/technology/2016/aug/29/facebook-fires-trending-topics-team-algorithm [Accessed 17 April 2017].

[vii] Slate.com, How Facebook’s news feed algorithm works [ONLINE] Available at http://www.slate.com/articles/technology/cover_story/2016/01/how_facebook_s_news_feed_algorithm_works.html [Accessed 17 April 2017].

[viii] Techcrunch.com, Ultimate guide to the Facebook News Feed [ONLINE] Available at https://techcrunch.com/2016/09/06/ultimate-guide-to-the-news-feed/ [Accessed 17 April 2017].

[ix] Dean, Jodi (2005), “Communicative Capitalism: Circulation and the Foreclosure of Politics,” Cultural Politics 1(1): 62.

[x] Facebook, Controversial, Harmful and hateful speech on Facebook [ONLINE] Available at https://www.facebook.com/notes/facebook-safety/controversial-harmful-and-hateful-speech-on-facebook/574430655911054/ [Accessed 17 April 2017].

[xi] Forbes.com, How Facebook helped Donald Trump become president [ONLINE] Available at https://www.forbes.com/sites/parmyolson/2016/11/09/how-facebook-helped-donald-trump-become-president/#3a548ab759c5 [Accessed 17 April 2017].

[xii] Theaustralian.com.au, 2017. Murder video forces scrutiny at Facebook [ONLINE] Available at http://www.theaustralian.com.au/business/wall-street-journal/murder-video-forces-scrutiny-at-facebook/news-story/79aa1b6e6acf9dce738062f226c422a6 [Accessed 20 April 2017].

[xiii] Theguardian.com, Facebook reveals news feed experiment to control emotions [ONLINE] Available at https://www.theguardian.com/technology/2014/jun/29/facebook-users-emotions-news-feeds [Accessed 17 April 2017].

[xiv] Financial Times, Facebook and the manufacture of consent [ONLINE] Available at https://ftalphaville.ft.com/2016/11/16/2179807/facebook-and-the-manufacture-of-consent/ [Accessed 17 April 2017].

[xv] Theaustralian.com.au, 2017. Murder video forces scrutiny at Facebook [ONLINE] Available at http://www.theaustralian.com.au/business/wall-street-journal/murder-video-forces-scrutiny-at-facebook/news-story/79aa1b6e6acf9dce738062f226c422a6 [Accessed 20 April 2017].

[xvi] TheGuardian.com Google pledges more control for brands over ad placement [ONLINE] Available at https://www.theguardian.com/media/2017/mar/17/google-pledges-more-control-for-brands-over-ad-placement [Accessed 17 April 2017].

[xvii] Slate.com How Facebook’s news feed algorithm works [ONLINE] Available at http://www.slate.com/articles/technology/cover_story/2016/01/how_facebook_s_news_feed_algorithm_works.html [Accessed 17 April 2017].

[xviii] RT.com Facebook fake news trending algorithm [ONLINE] Available at https://www.rt.com/viral/375121-facebook-fake-news-trending-algorithm/ [Accessed 17 April 2017].

[xix] UsaToday.com  Facebook Friends media journalism project [ONLINE] Available at https://www.usatoday.com/story/tech/news/2017/01/11/facebook-friends-media-journalism-project/96428460/ [Accessed 17 April 2017].

[xx] Techcrunch.com, Josh Constine, Zuckerberg implies Facebook is a media company, just not a traditional media company [ONLINE] Available at https://techcrunch.com/2016/12/21/fbonc/ [Accessed 17 April 2017].

Using topicmodels package for analysis of topics in texts

My vignette is about text mining and analysis, utilising the tm and topicmodels packages in R and Latent Dirichlet Allocation (LDA) to work out what documents are written about without having to read them all!

The vignette shows you how to create a Document-Term Matrix, then uses LDA to work out what key themes are present in a body of documents (called a corpus) and assigns each document to topics, with a probability for each topic.

This tool can help a user find a relevant document without having to search for it by name, or even knowing what it was written about!
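As a taste of the workflow the vignette covers, here is a minimal sketch using tm and topicmodels on a tiny made-up corpus (the documents and the choice of k are illustrative only; see the vignette for a full analysis):

```r
library(tm)
library(topicmodels)

# A tiny illustrative corpus; a real analysis would use many longer documents
docs <- c("terror attack bombing city casualties injured",
          "model training data machine learning statistics",
          "suicide bombing attack city vehicle")

# Build the corpus and the Document-Term Matrix
corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)

# Fit LDA with k topics (k = 2 chosen arbitrarily for this toy corpus)
lda <- LDA(dtm, k = 2, control = list(seed = 123))

topics(lda)            # most likely topic for each document
terms(lda, 3)          # top 3 terms for each topic
posterior(lda)$topics  # per-document topic probabilities
```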

Anyway, here is the link to my vignette:

http://rpubs.com/benjibex/266565

I hope you find it useful.

Tracy

Presenting by colours (the MDSI way)

I was very honoured to be able to represent the 36100ers by presenting at the UTS Academic Board on Thursday, along with Pedro and Kelly and of course Theresa.

with-vice-chancellor-and-cic-director

To get my story across with authenticity, the trick that works for me is to not write much, then really know my key messages and the flow, and then let the magic happen!

The only trouble with this is when the presentation is not recorded, and then Theresa asks you to do it again and record it so she can screencast it 🙂

img_1013
img_1012

Is there a sexist data crisis? Hardly a crisis, but still important to resolve

In our session on Tuesday Simon K, as an aside, suggested we google "is there a sexist data crisis".

I did (here is a BBC article with that exact title: http://www.bbc.com/news/magazine-36314061), but it got me thinking: this is hardly a crisis, and hardly new. Women are underrepresented in many important things.

For example, did you know women (and other "minority groups" like non-Caucasians) are underrepresented in clinical trials? The article does mention this too. The FDA in the US has a program to try to increase the participation of women in these trials: http://www.fda.gov/ScienceResearch/SpecialTopics/WomensHealthResearch/ucm131731.htm

Systematic bias? Deliberate? Could be both.

Anyway, I will be sure to think more about it.

Missing data codification OR how to capture that slap across the face

Last week we read about missing data and how to plan for it. I found it super useful and applicable to our Quantified Self work: if we had codified our missing data I would have had less chasing up to do, and we would have had some insights into the boundaries of what we were willing to share with the group.

I found this YouTube video pretty informative in explaining the types of missingness, i.e. whether a value is missing for reasons unrelated to the variable in question (missing completely at random), related to the value itself (missing not at random), or related to some other variable or combination of variables (missing at random). It also covered how to capture exactly WHY something was missing in your surveys, as this is itself very useful information, e.g. how to convey the slap across the face of "How very dare you ask me that?!" in a SurveyMonkey form.

Speaking of missing, I have missed a lot of opportunities to blog, including the Unearthed Hackathon experience. I hope to get to that soon!

TK

The journey to Ithaca starts with a single step…

I felt like I attended the University of Technology Sydney's Master of Data Science and Innovation (MDSI) information session on a whim. But with hindsight the dots join together rather nicely, and the decision to enrol feels so right!

I am going to try out every facet of this experience and jump in with both feet. I can feel old disused parts of my brain shaking the dust off even as I type. Now all I need to do is set up my beachside office…

Once I get my new laptop, I am going to join my friend Nat in her office at Wylies Baths 🙂