class: center, middle, inverse, title-slide

# rtweet tutorial
## with R-Ladies Coventry
### 🌈 Zane Dax (She/They) 🌈
### @StarTrek_Lt
### 2022-01-26

---
class: inverse, center
background-image: url("slides_files/RLadies\ bg.png")

# rtweet

![rladies](slides_files/rtweet_R_logo 2.png)

---
class: inverse, center, middle

# Get Started

---

# rtweet library

Install the **rtweet** package from [GitHub](https://github.com/ropensci/rtweet) or from CRAN:

```r
# install.packages("rtweet")
library(rtweet)
```

--

You are required to have a Twitter API access key in order to use the library.

- A user *must* authenticate their interaction with Twitter's API. If you already have an access token key, a browser popup will appear where you need to click through the permission and verification steps.

--

- Once you have this set up, you are ready to go.

--

.footnote[
[1] My documentation is based on macOS version 12.1.
[2] See the Twitter API documentation for full details and explanations.
]

---
class: inverse, middle, center

# Data Mine the hashtags

---

# Needed libraries

You will need a few more libraries to follow along with the slides:

```r
library(rtweet)    # needed for the Twitter API
library(tidyverse) # for data manipulation
library(ggplot2)   # for plotting
library(tidytext)  # for text analysis
library(ggraph)    # for graphing word networks
library(showtext)  # for font styling
library(widyr)     # for data frame manipulation
library(tidyr)
library(DT)
```

If you are not interested in graphing the word associations and frequencies, you only need:

- rtweet
- tidyverse
- ggplot2

---

# Step 1 - Find a #hashtag

You can pick a trending hashtag on Twitter, or use a word or phrase to get tweets on that specific hashtag or phrase. For this example we will use #DragRace.

The rate limit is a **maximum** of 18,000 tweets per API call. We are not interested in retweets for this hashtag, so that argument is set to `FALSE`. The language setting restricts the results to English tweets only.

**Note**: You can search tweets up to about 9 days old for free; to access older tweets you need to pay Twitter.

**Search Twitter for all tweets with #DragRace:**

```
# assign your search_tweets2() result to a variable
DragRace_tweets = search_tweets2(
  "#DragRace",         # hashtag we are interested in
  n = 18000,           # max number of tweets in one API request
  lang = "en",         # English only
  include_rts = FALSE  # no retweets
)
```

It is **strongly advised** to assign your Twitter data to variables, to avoid hitting the `rate limit` (being timed out for 15 minutes, or having the Twitter Developer account that holds the API key suspended). If you hit the rate limit too often, a message appears in the console and you can lose privileges on the Twitter Developer account, which voids your API key.
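---

# aside - check what came back

This slide is my addition, not part of the original workflow: before saving, it helps to peek at what the search actually returned. A minimal sketch using the data frame from Step 1:

```r
# how many tweets did the search return? (at most 18,000)
nrow(DragRace_tweets)

# peek at the first few raw tweet texts
head(DragRace_tweets$text, 3)
```

If the numbers look reasonable, move on to saving the data in Step 2.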
---

# Step 2 - Save your searched Tweets

This is very important: saving means you do not have to execute the `search_tweets2()` function again within 15 minutes and risk a `rate_limit` warning in the console. A `rate_limit` warning is scary to see, but it really just means you have to wait at least 15 minutes before running the code again.

To get `rate limit` info for a specific token (function call):

- `token <- get_tokens()`
- `rate_limit(token)`
- `rate_limit(token, "search_tweets")`

```
#   query         limit remaining    reset            reset_at           timestamp
# 1 search/tweets   180       180 15.0119… 2021-09-02 09:42:08 2021-09-02 09:27:08
```

**SAVE YOUR TWITTER SEARCHES**

```
write_as_csv(
  DragRace_tweets,                # data frame to save
  file_name = "DragRace_tweets",  # name for the CSV file
  prepend_ids = TRUE,             # keep long Twitter IDs intact in the CSV
  na = "",
  fileEncoding = "UTF-8"          # save the file in UTF-8
)
```

---

# Step 3 - read in saved data

```
DragRace_tweets = read_twitter_csv(
  file = "../TwitterData/DragRace_tweets.csv",
  unflatten = FALSE
)
```

`DragRace_tweets` has 13,000 rows (obs.) and 91 columns (variables).

---
class: inverse, middle, center

# Text Analysis of Tweets

---

# #DragRace

**The Twitter API returns lots of metadata.** Only the first 6 rows of the `text` column of the tweets data frame are shown. This is raw Twitter text data, messy with `url`s, emojis and `\n` new lines.

This is `head(DragRace_tweets$text)`:

[1] "So to watch #DragRace #DragRaceS14 in the UK I’ve got to pay for yet another streaming service?! \n\nNope. Pity though, was looking forward to it. 😒"
[2] "if ru says willow’s name like that the whole season i think my entire skin will crawl off #DragRace"
[3] "June and Orion are going to lip-sync foooooor theeeeeir liiiiiiiiiiives #DragRace"
[4] "Watching season 14 premiere of #DragRace https://t.co/1iaHPzPj7L"
[5] "Bosco... Thanks for that ending! #DragRace"
[6] "There's more depth to Willow Pill than her looks suggest, I like it #DragRace"

---

# just the text - 1

The Twitter data we have is messy and contains URL links, which we need to remove.

```r
# ======== remove URLs
DragRace_tweets$stripped_text = gsub("http.*", "", DragRace_tweets$text)           # drop everything from "http" onward
DragRace_tweets$stripped_text = gsub("https*", "", DragRace_tweets$stripped_text)  # catch any leftover http/https fragments
# head(DragRace_tweets$stripped_text)
```

- Note: the data frame now has a `stripped_text` column (variable), which will be used again later for bigrams.

[1] "So to watch #DragRace #DragRaceS14 in the UK I’ve got to pay for yet another streaming service?! \n\nNope. Pity though, was looking forward to it. 😒"
[2] "if ru says willow’s name like that the whole season i think my entire skin will crawl off #DragRace"
[3] "June and Orion are going to lip-sync foooooor theeeeeir liiiiiiiiiiives #DragRace"
[4] "Watching season 14 premiere of #DragRace "
[5] "Bosco... Thanks for that ending! #DragRace"
[6] "There's more depth to Willow Pill than her looks suggest, I like it #DragRace"

---

# just the text - 2

This code **tokenizes** the words to allow for easy word counts and further text analysis.

```r
# ===== tidytext::unnest_tokens()
clean_DragRace_tweets = DragRace_tweets %>%
  select(stripped_text) %>%
  unnest_tokens(word, stripped_text) # one token (word) per row

head(clean_DragRace_tweets)
```

```
## # A tibble: 6 × 1
##   word
##   <chr>
## 1 so
## 2 to
## 3 watch
## 4 dragrace
## 5 dragraces14
## 6 in
```

`clean_DragRace_tweets` has 164,460 obs. and 1 variable.
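---

# aside - plotting the counts

The next few slides count these tokens and show bar charts of the top words, but only the rendered images made it onto the slides. As a hedged sketch, a chart along those lines can be produced with standard ggplot2 (the actual figures' styling, e.g. the showtext fonts, may differ):

```r
clean_DragRace_tweets %>%
  count(word, sort = TRUE) %>%        # count each token
  top_n(10) %>%                       # keep the 10 most frequent
  mutate(word = reorder(word, n)) %>% # order bars by frequency
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +                      # horizontal bars
  labs(x = NULL, y = "count",
       title = "Most common words in #DragRace tweets")
```

The same recipe works for the stopword-free counts shown two slides ahead.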
---

# Tweet Word Counts - 1

Top 10 words, counted and sorted:

```r
# == word counts of the clean text (stopwords included)
clean_DragRace_tweets %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n))
```

```
## Selecting by n
```

```
## # A tibble: 10 × 2
##    word         n
##    <fct>    <int>
##  1 dragrace 12798
##  2 the       5391
##  3 i         3544
##  4 is        2994
##  5 to        2589
##  6 a         2573
##  7 and       2397
##  8 of        2222
##  9 this      1869
## 10 for       1737
```

---

# Tweet Word Counts - 1.1

`stopwords` included

<img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="504" />

---

# Tweet Word Counts - 1.2

`stopwords` are removed from the data. `clean_DragRace_words` has 85,994 obs.

```r
# ========== tidytext stop_words + anti_join
clean_DragRace_words = clean_DragRace_tweets %>%
  anti_join(stop_words)

clean_DragRace_words %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  filter(n > 10)
```

```
## # A tibble: 10 × 2
##    word                n
##    <chr>           <int>
##  1 dragrace        12798
##  2 season           1497
##  3 willow           1386
##  4 kornbread        1327
##  5 rupaulsdragrace   931
##  6 kerri             834
##  7 love              798
##  8 drag              753
##  9 i’m               741
## 10 14                595
```

---

# Tweet Word Counts - 1.3

<img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="720" />

---

# bigrams and n-grams

```r
# ============= NETWORK OF WORDS!
library(widyr)

# bigrams (n-grams with n = 2)
DragRace_tweets_paired_words = DragRace_tweets %>%
  select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text,
                token = "ngrams", n = 2) # n = 2 for bigram pairing

DragRace_tweets_paired_words %>%
  count(paired_words, sort = TRUE) %>%
  top_n(5)
```

```
## # A tibble: 6 × 2
##   paired_words     n
##   <chr>        <int>
## 1 season 14      541
## 2 willow pill    514
## 3 i love         411
## 4 in the         408
## 5 drag race      393
## 6 of the         393
```

Note: `top_n(5)` keeps ties, which is why six rows are returned (two pairs tie at n = 393).

---

# word pair splitting

We now split each bigram into its two component words so that stopwords can be filtered out of the pairs.

```r
# ======== word pair splitting
library(tidyr)

DragRace_tweets_word_splits = DragRace_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

DragRace_tweets_filtered <- DragRace_tweets_word_splits %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
DragRace_tweets_words_counts <- DragRace_tweets_filtered %>%
  count(word1, word2, sort = TRUE)

head(DragRace_tweets_words_counts)
```

```
## # A tibble: 6 × 3
##   word1           word2               n
##   <chr>           <chr>           <int>
## 1 season          14                541
## 2 willow          pill              514
## 3 drag            race              393
## 4 dragrace        rupaulsdragrace   276
## 5 kerri           colby             270
## 6 rupaulsdragrace dragrace          209
```

---

# Graph the words

Now that we have the n-grams and their counts, we can graph them to see which words are associated with each other.

Graph of the word network:

```
DragRace_tweets_words_counts %>%
  filter(n >= 10) %>%
  igraph::graph_from_data_frame() %>% # ggraph needs a graph object, not a data frame
  ggraph(layout = "fr") +
  geom_edge_link() +
  geom_node_text(aes(label = name))
```

---

<img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="864" />
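---

# aside - saving the plots

An aside that is not in the original slides: to keep any of these ggplot-based figures (the bar charts or the word network), `ggsave()` from ggplot2 writes the most recently drawn plot to disk. The file name and dimensions below are illustrative:

```r
# save the last plot displayed; width/height are in inches by default
ggsave("DragRace_word_network.png", width = 9, height = 6)
```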
---
class: inverse, middle, center

# #rstats Tweets

---

# 1 & 2 - #rstats & save it

We can search Twitter for the #rstats hashtag. The `#` is optional. Spaces in a query are treated as AND; an OR *must* be capitalized.

```
rstats_searched_tweets = search_tweets(
  q = "rstats OR RStats",  # match either spelling
  n = 4000,
  type = "mixed",          # mix of recent and popular tweets
  include_rts = FALSE,     # no retweets
  parse = TRUE,
  verbose = TRUE,
  retryonratelimit = FALSE,
  lang = "en"
)

write_as_csv(
  rstats_searched_tweets,      # data frame to save
  file_name = "rstatsTweets",  # name for the CSV file
  prepend_ids = TRUE,          # keep long Twitter IDs intact in the CSV
  na = "",
  fileEncoding = "UTF-8"       # save the file in UTF-8
)
```

---

# 3 - read in the rstats tweets

```
rstats_tweets = read_twitter_csv(
  file = "../TwitterData/rstatsTweets.csv", # the file saved in step 2
  unflatten = FALSE
)
```

# 4 - ts_plot

```
# the time series plot function for Twitter data
ts_plot(rstats_tweets,       # data frame of searched tweets
  by = "mins",               # interval: "secs", "mins", "hours", "days", ...
  tz = "America/Edmonton",   # your timezone
  col = "white"              # line colour; default is black
)
```

---

# 5 - time series plot

<img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="720" />

---
class: inverse, middle, center

# The End

---

# Further Information

- I have a detailed rtweet guide (a learnr file) that covers pretty much all of the documentation with understandable examples.
- Twitter CSV datasets are available on GitHub for you to practice with if you do not have an API key (see the sketch on the next slide).
- **Thank you** R-Ladies Coventry, Heather Turner & Sophie Hardy, for this opportunity 🍁
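---

# aside - practicing without an API key

A closing sketch from me, not in the original deck: once you have downloaded one of the practice CSV files mentioned on the previous slide, load it with the same `read_twitter_csv()` used in Step 3. The file path below is a placeholder for wherever you saved the download:

```r
# hypothetical local path to a downloaded practice dataset
practice_tweets = read_twitter_csv(
  file = "practice_data/DragRace_tweets.csv",
  unflatten = FALSE
)
```

From there, all of the text-analysis steps in these slides work the same with no API key required.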