Hey folks!
In my last tutorial I introduced the twitteR package for R, which allows users to download subsets of Twitter data via the Twitter Search API. I showed you how to establish a connection and how to download information about tweets with the searchTwitter() function by defining a keyword (e.g. a hashtag), a location using lat/long coordinates, and a date. Unfortunately, this function only lets you query one specific location at a time, within a radius specified by the user. Sometimes, however, you need global coverage. This tutorial therefore shows you how to download Twitter data for the whole earth at once. Enjoy!
Create a grid
The first step is to create a data frame with a custom grid that covers the whole earth. Every grid point represents one Twitter query that will be executed later. In this case we will create a grid with a cell size of 15°. At the equator this corresponds to a distance of approximately 1,500 km:
#create a sequence for the latitude and longitude values
latseq = seq(-90, 90, 15)
lonseq = seq(-180, 180, 15)

#transform to a dataframe
latlon.df = data.frame()
for (i in 1:length(lonseq)) {
  for (ii in 1:length(latseq)) {
    latlon.df[ii, i] <- paste0(latseq[ii], ",", lonseq[i])
  }
}
This is what the grid looks like:
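If you want to reproduce this plot yourself, here is a minimal sketch (helper code I added for illustration, not part of the original tutorial) that splits the "lat,lon" strings in latlon.df back into numbers and plots the grid points with ggplot2:

library(ggplot2)

#flatten the grid into one vector of "lat,lon" strings
gridpoints = as.character(unlist(latlon.df))

#split the strings back into numeric latitude and longitude columns
grid.df = data.frame(
  lat = as.numeric(sapply(strsplit(gridpoints, ","), "[", 1)),
  lon = as.numeric(sapply(strsplit(gridpoints, ","), "[", 2))
)

#plot every grid point
ggplot(grid.df, aes(x = lon, y = lat)) + geom_point()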
Automatic query function for searchTwitter()
Now that we have created the grid, we will write a function that loops over every grid point and downloads the requested Twitter data for each point with a radius of 750 km:
downloadTweets = function(term, radius, unit) {
  tweetsTotal.df = data.frame()
  for (i in 1:ncol(latlon.df)) {
    for (ii in 1:nrow(latlon.df)) {
      #output for the console that shows which coordinate is being queried at the moment
      flush.console()
      print(paste(ii, ",", i))
      print(latlon.df[ii, i])
      #query statement
      tweets <- searchTwitter(term, n = 1000, since = "2015-01-01",
                              geocode = paste0(latlon.df[ii, i], ",", radius, unit),
                              retryOnRateLimit = 200)
      #create a df out of the search query
      tweets.df = do.call("rbind", lapply(tweets, as.data.frame))
      #merge all dataframes
      tweetsTotal.df = rbind(tweetsTotal.df, tweets.df)
      #pause 5 sec. between queries to not exceed the Twitter download limit
      Sys.sleep(5)
    }
  }
  tweetsTotal.df
}
Here, the variable term is the term you want to search for, radius is the query radius, and unit is the unit of the radius (either "mi" or "km"). Please note that I used the function Sys.sleep(5), which pauses the loop for 5 seconds after every iteration to make sure I am not downloading too much data at once: if you request too much data in a short period of time, you can get blocked from the Twitter API. For more information, have a look at the Developer Agreement & Policy.
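One caveat: even with retryOnRateLimit and the pause, a single failed query aborts the whole loop and throws away everything downloaded so far. If that worries you, a defensive variant (just a sketch of one option, not part of the original function) wraps the query statement in tryCatch() so that a failing grid point is simply skipped:

#sketch: replace the query statement inside downloadTweets with this
#so that one failing grid point does not abort the whole loop
tweets <- tryCatch(
  searchTwitter(term, n = 1000, since = "2015-01-01",
                geocode = paste0(latlon.df[ii, i], ",", radius, unit),
                retryOnRateLimit = 200),
  error = function(e) {
    print(paste("Query failed for", latlon.df[ii, i], ":", conditionMessage(e)))
    list() #return an empty result so the loop can continue
  }
)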
Specify your parameters and run the function:
#Set specifications
term = 'politics'
radius = 1500
unit = "km"

#run function
tweets = downloadTweets(term, radius, unit)
The execution of this function might take 10 minutes or more. Just get a cup of coffee and lean back while it does the work for you 🙂
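Because the download takes so long, it is a good idea to save the result to disk right away so you do not have to query the API a second time. A minimal sketch (the file name is of course just an example):

#save the downloaded tweets so the query does not have to be repeated
write.csv(tweets, "tweets_politics_worldwide.csv", row.names = FALSE)

#reload them later with
tweets = read.csv("tweets_politics_worldwide.csv", stringsAsFactors = FALSE)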
Examples
Now I would also like to show you some example queries and visualizations I made that demonstrate how powerful this package is.
Tweets about Ukraine in the year 2015
Here is a map that shows a subset of tweets that were published this year and contained the word “ukraine“.
The term “Ukraine” was quite popular in Central Europe and on the East Coast of the United States. Unexpectedly, however, there was not a single tweet from Ukraine itself and only very few from Russia and Asia in general. Please note, though, that I only queried the English country name “Ukraine” and not the equivalent expressions in other languages. Here is the code I used to create the map:
library(ggplot2)
library(maptools)

#create a new dataframe with coordinates only, and do type conversion (necessary for ggplot)
coords = data.frame(longit = tweetsTotal.df$longitude, latit = tweetsTotal.df$latitude)
coords$latit = as.double(as.character(coords$latit))
coords$longit = as.double(as.character(coords$longit))

#get background map
data(wrld_simpl)
world = wrld_simpl
world = fortify(world)

#plot with ggplot2
p1 <- ggplot() +
  ggtitle(expression(atop("Worldwide Tweets",
                          atop(italic("Keyword \"Ukraine\" | since 2015"), "")))) +
  geom_map(data = world, map = world, aes(x = long, y = lat, map_id = id)) +
  geom_point(colour = "steelblue3", alpha = 0.30, data = coords,
             mapping = aes(x = longit, y = latit), size = 4)
p1
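One thing to keep in mind: not every tweet carries coordinates, so coords will usually contain NA rows that ggplot drops with a warning. If you prefer to remove them explicitly beforehand (just an optional one-liner I added):

#keep only tweets that actually have coordinates
coords = coords[complete.cases(coords), ]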
Solar eclipse
Last weekend there was a partial solar eclipse over Europe. The earth, the moon and the sun aligned in a way that the sun was covered by the moon by approximately 60%. This made me curious about how many tweets were sent this year (2015) containing the words “solar eclipse” and at what time they were sent. The result is quite interesting:
At first I was confused because I couldn’t really see a pattern in the dots. I was expecting far more dots in Europe (since the eclipse was only visible in Europe) and not so many dots in South America. This made me even more curious… I decided to look at the date and time of day the tweets were published using CartoDB, a great web-mapping tool that allows you to make animated web maps:
If you start the animation, you can see some single tweets from 13/03/15 onwards in Asia, South America and the USA, while Europe stays relatively quiet. The closer we get to the solar eclipse, which happened on 20/03/15, the more the activity in Europe and in the whole world increases, until it skyrockets on 20/03/15, when all of Europe is suddenly covered by an orange-red heat map. I am planning on writing a tutorial on CartoDB soon, so pay us a visit later if you got curious. 🙂
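If you want to try this yourself, all you need to upload is a CSV with coordinates and a timestamp. The data frames produced above already contain everything required, since the created column holds the publishing time of each tweet. A quick sketch (file name again just an example):

#export timestamp and coordinates for the web-map upload
eclipse.df = tweetsTotal.df[, c("created", "longitude", "latitude")]
write.csv(eclipse.df, "eclipse_tweets.csv", row.names = FALSE)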
I hope you enjoyed the data download function and the visualizations. If you have any questions about the provided functions that enable you to download Twitter data for the whole earth, feel free to leave me a comment. Please also take a look at the previous Twitter tutorial, where I explained how to connect to the Twitter API via R.
Cheers
Martin