Advanced Twitter data download and visualization


Hey folks!

In my last tutorial I introduced the twitteR package for R, which allows users to download subsets of Twitter data via the Twitter Search API. I showed you how to establish a connection and how to download information about tweets with the searchTwitter() function by defining a keyword (e.g. a hashtag), a location using lat/long coordinates, and a date. Unfortunately, this function only allows you to query one specific location at a time, within a radius specified by the user. Sometimes, however, it is necessary to have global data coverage. This tutorial therefore shows you how to download Twitter data for the whole earth at once. Enjoy!
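As a quick reminder from that tutorial: before running any of the queries below, you need an authenticated connection to the Twitter API. Here is a minimal sketch with placeholder credentials (replace them with the keys of your own Twitter app):

library(twitteR)

#placeholder credentials - replace with the keys of your own Twitter app
setup_twitter_oauth(consumer_key    = "YOUR_API_KEY",
                    consumer_secret = "YOUR_API_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")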

Create a grid

The first step is to create a data frame with a custom grid that covers the whole earth. Every grid point represents one Twitter query that will be executed later. In this case we will create a grid with a cell size of 15°. At the Equator this corresponds to a distance of approximately 1,670 km:

#create a sequence for the latitude and longitude values
latseq = seq(-90,90,15)
lonseq = seq(-180,180,15)

#transform to a dataframe: each cell holds one "lat,lon" string
latlon.df = data.frame()

for (i in 1:length(lonseq)) {
  for (ii in 1:length(latseq)) {
    latlon.df[ii,i] <- paste0(latseq[ii],",",lonseq[i])
  }
}

This is what the grid looks like:

Twitter Data Download Grid
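If you want to recreate this overview map yourself, here is a small sketch (not needed for the download itself) that splits the "lat,lon" strings back into numeric columns and plots the grid points with ggplot2:

library(ggplot2)

#split the "lat,lon" strings back into numeric columns
grid.points = do.call(rbind, strsplit(unlist(latlon.df), ","))
grid.df = data.frame(lat = as.numeric(grid.points[,1]),
                     lon = as.numeric(grid.points[,2]))

#plot every query point of the grid
ggplot(grid.df, aes(x = lon, y = lat)) +
  geom_point(colour = "steelblue3") +
  ggtitle("Twitter Data Download Grid")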

Automatic query function for searchTwitter()

Now that the grid is created, we will write a function that loops over every grid point and downloads the requested Twitter data for each point within a user-defined radius (1500 km in the example below):

downloadTweets = function(term, radius, unit) {
  tweetsTotal.df = data.frame()
  
  for (i in 1:ncol(latlon.df)) {
    for (ii in 1:nrow(latlon.df)) {
      
      #console output that shows which grid cell is being queried at the moment
      print(paste(ii, ",", i))
      print(latlon.df[ii,i])
      flush.console()
      
      #query statement
      tweets<- searchTwitter(term, n=1000, 
                             since="2015-01-01",
                             geocode=paste0(latlon.df[ii,i],",",radius,unit),
                             retryOnRateLimit=200)
      
      #create df out of search query
      tweets.df = do.call("rbind",lapply(tweets,as.data.frame))

      #merge all dataframes
      tweetsTotal.df = rbind(tweetsTotal.df, tweets.df)

      #pause 5 seconds between queries to stay below the Twitter rate limit
      Sys.sleep(5)
    }
  }
  tweetsTotal.df 
}

Here, term is the term you want to search for, radius is the query radius, and unit is the unit of the radius (either "mi" or "km"). Please note that I used the function Sys.sleep(5), which pauses the loop for 5 seconds after every iteration to make sure I am not downloading too much data at once: if you request too much data in a short period of time, you can get blocked from the Twitter API. For more information have a look at the Developer Agreement & Policy.
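To make the geocode argument more tangible, this is what the string assembled inside the function looks like for a single grid point:

#example: geocode string for the grid point "45,15" with a 1500 km radius
paste0("45,15", ",", 1500, "km")
# [1] "45,15,1500km"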

Specify your parameters and run the function:

#Set specifications
term = "politics"
radius = 1500
unit = "km"

#run function
tweets = downloadTweets(term, radius, unit)

The execution of this function might take 10 minutes or more. Just grab a cup of coffee and sit back while it does the work for you 🙂
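Since the download takes a while, it is worth saving the result to disk right away so you do not have to hit the API again (the file name below is just an example):

#save the downloaded tweets so the query does not have to be repeated
saveRDS(tweets, "tweets_politics.rds")

#reload them later with
tweets = readRDS("tweets_politics.rds")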

Examples

Now I would also like to show you some example queries and visualizations I made that demonstrate how powerful this package is.

Tweets about Ukraine in the year 2015

Here is a map that shows a subset of tweets that were published this year and contained the word “ukraine“.

Tweets about Ukraine 2015

The term "Ukraine" was quite popular in Central Europe and on the East Coast of the United States. Unexpectedly, however, there was not a single tweet from Ukraine itself, and only a few from Russia and Asia in general. Please note that I only queried the English country name "Ukraine" and not the equivalent expressions in other languages. Here is the code I used to create the map:

library(ggplot2)
library(maptools)

#create a new dataframe with coordinates only, and do type conversion (necessary for ggplot)
coords = data.frame(longit = tweets$longitude, latit = tweets$latitude)
coords$latit = as.double(as.character(coords$latit))
coords$longit = as.double(as.character(coords$longit))

#get background map (wrld_simpl ships with the maptools package)
data(wrld_simpl)
world = fortify(wrld_simpl)

#plot with ggplot2
p1 <- ggplot() +
  ggtitle(expression(atop("Worldwide Tweets",
                          atop(italic("Keyword \"Ukraine\" | since 2015"), "")))) +
  geom_map(data = world, map = world, aes(x = long, y = lat, map_id = id)) +
  geom_point(colour = "steelblue3", alpha = 0.30, data = coords,
             mapping = aes(x = longit, y = latit), size = 4)
p1
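Note that not every tweet returned by searchTwitter() carries point coordinates, so the latitude/longitude columns may contain NA values. Dropping those rows right after building coords avoids warnings from ggplot about removed points:

#keep only tweets that actually have point coordinates
coords = coords[complete.cases(coords), ]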

Solar eclipse

Last weekend there was a partial solar eclipse over Europe. The earth, the moon and the sun aligned in a way that the sun was covered by the moon by approximately 60%. This made me curious about how many tweets containing the phrase "solar eclipse" were sent this year (2015), and at what time they were sent.
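A query along these lines can be run with the downloadTweets() function from above (the parameters here are just an example, not the exact call I used):

#query the grid for tweets containing the phrase "solar eclipse"
eclipseTweets = downloadTweets('"solar eclipse"', 1500, "km")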

The result is quite interesting:

Solar eclipse Twitter

At first I was confused because I couldn't really see a pattern in the dots. I was expecting far more dots in Europe (since the eclipse was only visible in Europe) and not so many in South America. This made me even more curious… I decided to look at the date and time of day the tweets were published using CartoDB, a great web-mapping tool that allows you to make animated web maps:

If you start the animation, you can see scattered single tweets from 13/03/15 onwards in Asia, South America and the USA, while Europe stays relatively quiet. The closer we get to the solar eclipse itself, which happened on 20/03/15, the more the activity in Europe and in the whole world increases, skyrocketing on 20/03/15 when all of Europe is suddenly covered by an orange-red heat map. I am planning to write a tutorial on CartoDB soon, so pay us a visit later if this made you curious. 🙂
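In case you want to try this yourself: CartoDB accepts CSV uploads, so the tweet timestamps and coordinates can be written out with a few lines of R. A minimal sketch, assuming the eclipseTweets data frame from the query above (created is the timestamp column in twitteR's results):

#export timestamp and coordinates to CSV for upload to CartoDB
export.df = data.frame(created   = eclipseTweets$created,
                       latitude  = as.double(as.character(eclipseTweets$latitude)),
                       longitude = as.double(as.character(eclipseTweets$longitude)))
export.df = export.df[complete.cases(export.df), ]
write.csv(export.df, "eclipse_tweets.csv", row.names = FALSE)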

I hope you enjoyed the data download function and the visualizations. If you have any questions about the provided functions that enable you to download Twitter data for the whole earth, feel free to leave me a comment. Please also take a look at the previous Twitter tutorial, where I explained how to connect to the Twitter API via R.

Cheers

Martin


About This Author

Martin was born in the Czech Republic and studied at the University of Natural Resources and Life Sciences, Vienna. He is currently working at GeoVille - an Earth Observation company based in Austria, specialised in land monitoring. His main interests are open-source applications like R, (geospatial) statistics and data management, web mapping and visualization. He loves travelling, geocaching, photography and sports.
