Farming communities
in Dutch media

Web Scraping & Topic Modeling using R

Claudiu Forgaci
Shuyu Zhang

2024-03-20

Workshop outline

  1. Tutorial: Scraping data from social media (Shuyu) - 30 min
  2. Tutorial: Finding topics in Dutch news (Claudiu) - 1 hour
  3. Exercises: Visualising scraped data & topic modeling with your own data - 1 hour

Setup

  1. Log in to RStudio Server at edu.nl/ph3a6 with
  • the username from your handout
  • the password Rban1smGuest
  2. Open edu.nl/qh34g in a new tab
  3. When you are ready, put a green sticky note on your laptop and follow the instructions on the presenter’s screen

Part 1: Scraping data from social media

The question

What are the main farming-related topics discussed on social media?

The data

Social media data
from Facebook or X

Why social media data?

In the digital media era, social media platforms and websites are valuable sources of user-generated content, which reflects the voices and opinions of citizens. On X (formerly Twitter), the real-time stream of posts covers diverse topics and provides insight into current events, public reactions, and emerging trends. Similarly, Facebook’s extensive user base and features such as status updates and comments offer rich data on user experiences and interactions.

The approach

Web scraping using the Chrome extension
Web Data Research Assistant

What is web scraping?

Web scraping is the process of extracting data from websites. It uses automated techniques to collect information from web pages or social media sites, typically delivered in formats such as HTML, XML, or JSON. Web scraping lets you retrieve specific data elements, such as text, images, or tables, and store them for analysis or other purposes.
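
To make the idea concrete, here is a minimal sketch in base R. It is not the workshop's method (the workshop uses the Web Data Research Assistant extension). The HTML snippet, the class name "post", and the example texts are all made up for illustration, and the page is held in a string so the code runs offline. For real pages a dedicated HTML parser, such as the rvest package, is usually a safer choice than pattern matching.

```r
# A toy HTML page held in a string, so the example runs offline.
# The markup, the class name "post", and the texts are hypothetical.
page <- paste0(
  "<html><body>",
  "<p class=\"post\">Boeren protesteren in Den Haag</p>",
  "<p class=\"post\">Nieuwe stikstofregels aangekondigd</p>",
  "<p class=\"ad\">Advertentie</p>",
  "</body></html>"
)

# Step 1: locate the elements of interest (paragraphs marked as posts).
pattern <- "<p class=\"post\">[^<]*</p>"
matches <- regmatches(page, gregexpr(pattern, page))[[1]]

# Step 2: strip the tags so only the text remains.
posts <- gsub("<[^>]+>", "", matches)

# Step 3: store the result in a table for later analysis.
scraped <- data.frame(
  id = seq_along(posts),
  text = posts,
  stringsAsFactors = FALSE
)
print(scraped)
```

The three steps (locate, extract, store) are the same whether the source is a toy string or a live page. Only the locating step changes with the structure of the site.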

Application in RStudio

Open script_socialmedia.qmd
and follow the instructions there

Part 2: Finding topics in the Dutch news

The question

What are the main farming-related topics discussed in Dutch news since 2022?

The data

51 newspaper articles
in Dutch

The approach

Quantitative analysis:
Topic modeling

Quantitative analysis

Why?

  • Reproducible - re-run the analysis with the same results
  • Automated - run the analysis on other data
  • Scalable - run the analysis on (much) more data

Good to know:

  • Requires (some) knowledge of statistics
  • Results depend on the amount and quality of data available

The method

Topic modeling using Latent Dirichlet Allocation (LDA) is a method used to reveal latent topics in unstructured text data.

In an LDA model:

  • documents are a mixture of topics
  • topics are a mixture of words
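
The two mixtures can be illustrated with a small base R sketch. All numbers, topic names, and document names below are invented for illustration and are not results from the workshop data. The sketch only shows how the two tables combine. Fitting an LDA model works in the opposite direction: it estimates both tables from observed word counts.

```r
# Hypothetical topic-word probabilities: one row per topic, each row sums to 1.
topic_word <- rbind(
  nitrogen = c(stikstof = 0.6, boer = 0.3, protest = 0.1),
  protests = c(stikstof = 0.1, boer = 0.3, protest = 0.6)
)

# Hypothetical document-topic proportions: one row per document, each row sums to 1.
doc_topic <- rbind(
  article_1 = c(nitrogen = 0.8, protests = 0.2),
  article_2 = c(nitrogen = 0.5, protests = 0.5)
)

# The word probabilities of a document are its topic proportions
# applied as weights to the topic-word probabilities.
doc_word <- doc_topic %*% topic_word
print(round(doc_word, 2))
```

Each row of doc_word again sums to 1. In this toy setup, article_1 leans towards the words of the "nitrogen" topic because that topic makes up most of the document.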

Application in RStudio

Open script.qmd
and follow the instructions there

Found this workshop useful?

Reach out to us with questions:

  • Claudiu Forgaci: C.Forgaci@tudelft.nl
  • Shuyu Zhang: S-Zhang-19@tudelft.nl

Follow Rbanism:

Citation:

  • Zhang, S., & Forgaci, C. (2024). What do Facebook or X users say about farming in the Netherlands? - Word cloud analysis using R. TU Delft.

  • Forgaci, C. (2024). What do the Dutch news say about farming communities? - A Topic modeling approach using R. TU Delft.