class: center, middle, inverse, title-slide

# rtweet tutorial
## with R-Ladies Coventry
### 🌈 Zane Dax (She/They) 🌈
### @StarTrek_Lt
### 2022-01-26

---
class: inverse, center
background-image: url("slides_files/RLadies\ bg.png")

# rtweet

![rladies](slides_files/rtweet_R_logo 2.png)

---
class: inverse, center, middle

# Get Started

---

# rtweet library

Install the **rtweet** package from [GitHub](https://github.com/ropensci/rtweet) or from CRAN:

```r
# install.packages("rtweet")
library(rtweet)
```

--

You are required to have a Twitter API access key in order to use the library.

- A user *must* authenticate their interaction with Twitter's API. If you already have an access token key, a browser popup will appear where you need to click through the permission and verification steps.

--

- Once you have this set up, you are ready to go.

--

.footnote[
[1] My documentation is based on macOS version 12.1.
[2] See the Twitter API documentation for full details and explanations.
]

---
class: inverse, middle, center

# Data Mine the hashtags

---

# Needed libraries

You will need a few more libraries to follow along with the slides:

```r
library(rtweet)    # needed for the Twitter API
library(tidyverse) # for data manipulation
library(ggplot2)   # for plotting
library(tidytext)  # for text analysis
library(ggraph)    # for graphing word networks
library(showtext)  # for font styling
library(widyr)     # for data frame manipulation
library(tidyr)
library(DT)
```

If you are not interested in graphing the word associations and frequencies, you only need:

- rtweet
- tidyverse
- ggplot2

---

# Step 1 - Find a #hashtag

You can pick a trending hashtag on Twitter, or use a word or phrase to get tweets on that specific hashtag or phrase. For this example we will use #DragRace.

The rate limit is a **maximum** of 18,000 tweets per API call. We are not interested in retweets for this hashtag, so that argument is set to `FALSE`. The language setting restricts the results to English tweets only.

**Note**: You can search tweets up to about 9 days old for free; to access older tweets you need to pay Twitter.

**Search Twitter for all tweets with #DragRace:**

```
# assign your search_tweets2() result to a variable
DragRace_tweets = search_tweets2(
  "#DragRace",         # hashtag we are interested in
  n = 18000,           # max number of tweets in one API request
  lang = "en",         # English only
  include_rts = FALSE  # no retweets
)
```

It is **strongly advised** to assign your Twitter data to variables, to avoid hitting the `rate limit` (being timed out for 15 minutes, or having the Twitter Developer account that holds the API key suspended). If you hit the rate limit too often, a message appears in the console and you can lose privileges on the Twitter Developer account, which voids your API key.
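---

# aside - check what came back

This slide is my addition, not part of the original workflow: before saving, it helps to peek at what the search actually returned. A minimal sketch using the data frame from Step 1:

```r
# how many tweets did the search return? (at most 18,000)
nrow(DragRace_tweets)

# peek at the first few raw tweet texts
head(DragRace_tweets$text, 3)
```

If the numbers look reasonable, move on to saving the data in Step 2.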
---

# Step 2 - Save your searched Tweets

This is very important: saving means you do not have to execute the `search_tweets2()` function again within 15 minutes and risk a `rate_limit` warning in the console. A `rate_limit` warning is scary to see, but it really just means you have to wait at least 15 minutes before running the code again.

To get `rate limit` info for a specific token (function call):

- `token <- get_tokens()`
- `rate_limit(token)`
- `rate_limit(token, "search_tweets")`

```
#   query         limit remaining    reset            reset_at           timestamp
# 1 search/tweets   180       180 15.0119… 2021-09-02 09:42:08 2021-09-02 09:27:08
```

**SAVE YOUR TWITTER SEARCHES**

```
write_as_csv(
  DragRace_tweets,                # data frame to save
  file_name = "DragRace_tweets",  # name for the CSV file
  prepend_ids = TRUE,             # keep long Twitter IDs intact in the CSV
  na = "",
  fileEncoding = "UTF-8"          # save the file in UTF-8
)
```

---

# Step 3 - read in saved data

```
DragRace_tweets = read_twitter_csv(
  file = "../TwitterData/DragRace_tweets.csv",
  unflatten = FALSE
)
```

`DragRace_tweets` has 13,000 rows (obs.) and 91 columns (variables).

---
class: inverse, middle, center

# Text Analysis of Tweets

---

# #DragRace

**The Twitter API returns lots of metadata.** Only the first 6 rows of the `text` column of the tweets data frame are shown. This is raw Twitter text data, messy with `url`s, emojis and `\n` new lines.

This is `head(DragRace_tweets$text)`:

[1] "So to watch #DragRace #DragRaceS14 in the UK I’ve got to pay for yet another streaming service?! \n\nNope. Pity though, was looking forward to it. 😒"
[2] "if ru says willow’s name like that the whole season i think my entire skin will crawl off #DragRace"
[3] "June and Orion are going to lip-sync foooooor theeeeeir liiiiiiiiiiives #DragRace"
[4] "Watching season 14 premiere of #DragRace https://t.co/1iaHPzPj7L"
[5] "Bosco... Thanks for that ending! #DragRace"
[6] "There's more depth to Willow Pill than her looks suggest, I like it #DragRace"

---

# just the text - 1

The Twitter data we have is messy and contains URL links, which we need to remove.

```r
# ======== remove URLs
DragRace_tweets$stripped_text = gsub("http.*", "", DragRace_tweets$text)           # drop everything from "http" onward
DragRace_tweets$stripped_text = gsub("https*", "", DragRace_tweets$stripped_text)  # catch any leftover http/https fragments
# head(DragRace_tweets$stripped_text)
```

- Note: the data frame now has a `stripped_text` column (variable), which will be used again later for bigrams.

[1] "So to watch #DragRace #DragRaceS14 in the UK I’ve got to pay for yet another streaming service?! \n\nNope. Pity though, was looking forward to it. 😒"
[2] "if ru says willow’s name like that the whole season i think my entire skin will crawl off #DragRace"
[3] "June and Orion are going to lip-sync foooooor theeeeeir liiiiiiiiiiives #DragRace"
[4] "Watching season 14 premiere of #DragRace "
[5] "Bosco... Thanks for that ending! #DragRace"
[6] "There's more depth to Willow Pill than her looks suggest, I like it #DragRace"

---

# just the text - 2

This code **tokenizes** the words to allow for easy word counts and further text analysis.

```r
# ===== tidytext::unnest_tokens()
clean_DragRace_tweets = DragRace_tweets %>%
  select(stripped_text) %>%
  unnest_tokens(word, stripped_text) # one token (word) per row

head(clean_DragRace_tweets)
```

```
## # A tibble: 6 × 1
##   word
##   <chr>
## 1 so
## 2 to
## 3 watch
## 4 dragrace
## 5 dragraces14
## 6 in
```

`clean_DragRace_tweets` has 164,460 obs. and 1 variable.
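---

# aside - plotting the counts

The next few slides count these tokens and show bar charts of the top words, but only the rendered images made it onto the slides. As a hedged sketch, a chart along those lines can be produced with standard ggplot2 (the actual figures' styling, e.g. the showtext fonts, may differ):

```r
clean_DragRace_tweets %>%
  count(word, sort = TRUE) %>%        # count each token
  top_n(10) %>%                       # keep the 10 most frequent
  mutate(word = reorder(word, n)) %>% # order bars by frequency
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +                      # horizontal bars
  labs(x = NULL, y = "count",
       title = "Most common words in #DragRace tweets")
```

The same recipe works for the stopword-free counts shown two slides ahead.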
---

# Tweet Word Counts - 1

Top 10 words, counted and sorted:

```r
# == word counts of the clean text (stopwords included)
clean_DragRace_tweets %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n))
```

```
## Selecting by n
```

```
## # A tibble: 10 × 2
##    word         n
##    <fct>    <int>
##  1 dragrace 12798
##  2 the       5391
##  3 i         3544
##  4 is        2994
##  5 to        2589
##  6 a         2573
##  7 and       2397
##  8 of        2222
##  9 this      1869
## 10 for       1737
```

---

# Tweet Word Counts - 1.1

`stopwords` included

<img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="504" />

---

# Tweet Word Counts - 1.2

`stopwords` are removed from the data. `clean_DragRace_words` has 85,994 obs.

```r
# ========== tidytext stop_words + anti_join
clean_DragRace_words = clean_DragRace_tweets %>%
  anti_join(stop_words)

clean_DragRace_words %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  filter(n > 10)
```

```
## # A tibble: 10 × 2
##    word                n
##    <chr>           <int>
##  1 dragrace        12798
##  2 season           1497
##  3 willow           1386
##  4 kornbread        1327
##  5 rupaulsdragrace   931
##  6 kerri             834
##  7 love              798
##  8 drag              753
##  9 i’m               741
## 10 14                595
```

---

# Tweet Word Counts - 1.3

<img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="720" />

---

# bigrams and n-grams

```r
# ============= NETWORK OF WORDS!
library(widyr)

# bigrams (n-grams with n = 2)
DragRace_tweets_paired_words = DragRace_tweets %>%
  select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text,
                token = "ngrams", n = 2) # n = 2 for bigram pairing

DragRace_tweets_paired_words %>%
  count(paired_words, sort = TRUE) %>%
  top_n(5)
```

```
## # A tibble: 6 × 2
##   paired_words     n
##   <chr>        <int>
## 1 season 14      541
## 2 willow pill    514
## 3 i love         411
## 4 in the         408
## 5 drag race      393
## 6 of the         393
```

Note: `top_n(5)` keeps ties, which is why six rows are returned (two pairs tie at n = 393).

---

# word pair splitting

We now split each bigram into its two component words so that stopwords can be filtered out of the pairs.

```r
# ======== word pair splitting
library(tidyr)

DragRace_tweets_word_splits = DragRace_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

DragRace_tweets_filtered <- DragRace_tweets_word_splits %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
DragRace_tweets_words_counts <- DragRace_tweets_filtered %>%
  count(word1, word2, sort = TRUE)

head(DragRace_tweets_words_counts)
```

```
## # A tibble: 6 × 3
##   word1           word2               n
##   <chr>           <chr>           <int>
## 1 season          14                541
## 2 willow          pill              514
## 3 drag            race              393
## 4 dragrace        rupaulsdragrace   276
## 5 kerri           colby             270
## 6 rupaulsdragrace dragrace          209
```

---

# Graph the words

Now that we have the n-grams and their counts, we can graph them to see which words are associated with each other.

Graph of the word network:

```
DragRace_tweets_words_counts %>%
  filter(n >= 10) %>%
  igraph::graph_from_data_frame() %>% # ggraph needs a graph object, not a data frame
  ggraph(layout = "fr") +
  geom_edge_link() +
  geom_node_text(aes(label = name))
```

---

<img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="864" />
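---

# aside - saving the plots

An aside that is not in the original slides: to keep any of these ggplot-based figures (the bar charts or the word network), `ggsave()` from ggplot2 writes the most recently drawn plot to disk. The file name and dimensions below are illustrative:

```r
# save the last plot displayed; width/height are in inches by default
ggsave("DragRace_word_network.png", width = 9, height = 6)
```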
---
class: inverse, middle, center

# #rstats Tweets

---

# 1 & 2 - #rstats & save it

We can search Twitter for the #rstats hashtag. The `#` is optional. Spaces in a query are treated as AND; an OR *must* be capitalized.

```
rstats_searched_tweets = search_tweets(
  q = "rstats OR RStats",  # match either spelling
  n = 4000,
  type = "mixed",          # mix of recent and popular tweets
  include_rts = FALSE,     # no retweets
  parse = TRUE,
  verbose = TRUE,
  retryonratelimit = FALSE,
  lang = "en"
)

write_as_csv(
  rstats_searched_tweets,      # data frame to save
  file_name = "rstatsTweets",  # name for the CSV file
  prepend_ids = TRUE,          # keep long Twitter IDs intact in the CSV
  na = "",
  fileEncoding = "UTF-8"       # save the file in UTF-8
)
```

---

# 3 - read in the rstats tweets

```
rstats_tweets = read_twitter_csv(
  file = "../TwitterData/rstatsTweets.csv", # the file saved in step 2
  unflatten = FALSE
)
```

# 4 - ts_plot

```
# the time series plot function for Twitter data
ts_plot(rstats_tweets,       # data frame of searched tweets
  by = "mins",               # interval: "secs", "mins", "hours", "days", ...
  tz = "America/Edmonton",   # your timezone
  col = "white"              # line colour; default is black
)
```

---

# 5 - time series plot

<img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="720" />

---
class: inverse, middle, center

# The End

---

# Further Information

- I have a detailed rtweet guide (a learnr file) that covers pretty much all of the documentation with understandable examples.
- Twitter CSV datasets are available on GitHub for you to practice with if you do not have an API key (see the sketch on the next slide).
- **Thank you** R-Ladies Coventry, Heather Turner & Sophie Hardy, for this opportunity 🍁
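---

# aside - practicing without an API key

A closing sketch from me, not in the original deck: once you have downloaded one of the practice CSV files mentioned on the previous slide, load it with the same `read_twitter_csv()` used in Step 3. The file path below is a placeholder for wherever you saved the download:

```r
# hypothetical local path to a downloaded practice dataset
practice_tweets = read_twitter_csv(
  file = "practice_data/DragRace_tweets.csv",
  unflatten = FALSE
)
```

From there, all of the text-analysis steps in these slides work the same with no API key required.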