Faculty Research Request: Data Scraping and Analysis
Introduction
This project originates with a FIMS faculty member who is interested in scraping and analyzing data from two social media sources. Specifically, they would like to collect data from the subreddit r/librarians and from Twitter posts that use the hashtag #librarytwitter. The goal is to engage in digital scholarship and analyze the data with the intent of gaining insight into trends on library-related social media channels.
Through this collaboration, we aim to create a strategy for data collection, organization, and analysis using a suite of open-source software and programming tools. The library resources and staff allocated to this project can provide expertise that enables the research goals, support access to software and technical know-how, and offer services aimed at preserving the data and making it accessible for future research.
In this proposal, we will discuss the project objectives, necessary technology and infrastructure, resourcing, project sustainability, and parameters for project completion. Additionally, following conversations with colleagues, we will address concerns about ethical data collection and use.
For digital scholarship to be successful, we need to define what it is and how it can positively affect our academic community's research. Several characteristics need to be considered:
Interdisciplinary nature: This necessitates sharing resources and expertise. The library's role includes facilitating research, training, and teaching, especially in digital methods and tools.
Collaboration: We can accomplish more by working together. Through community building, we can actively contribute to the institution's scholarship.
Critical: The library has the access and training to ensure that information and related research methods are equitable and diverse. This allows barriers within research to be critically identified.
Inventive: The library has the knowledge and tools to be imaginative when disseminating information. This allows for greater engagement from scholars, students and the public.
(Sherman Centre for Digital Scholarship, n.d.)
Objectives
The library’s role and objectives in this project are to support and advance research through a partnership between the library and faculty. We can do that by filling gaps in the faculty member and their team’s technical knowledge of digital scholarship and by empowering them through training in the use of digital tools.
We aim to:
Grow partnerships between the library and faculty
Identify where the library's and librarians' expertise is best suited to fulfilling project goals, and what is best left to others
Understand and define project goals and find creative solutions to make them attainable
Use accessible digital tools
Train research team on the use of digital tools
Data collection
Train research team in methods of formatting datasets
Train research team to analyze data using software
Propose ways to maintain data and access
Infrastructure
The technical components of this project can be split into three sections: data scraping, data analysis, and preservation. Each requires unique tools, methods, and software.
*Examples are available in the Appendix.
● Data scraping
To scrape specific data from Reddit and Twitter we will use Python. Python is a highly flexible programming language that emphasizes readability and offers a comprehensive ecosystem of packages. We will access Python via Google Colab, a virtual environment for writing and executing Python in the browser. This makes it easy to share, collaborate on, customize, and save projects for reuse.
● Reddit- data scraping
To harvest data from Reddit we will import the following Python packages:
praw: the "Python Reddit API Wrapper," the package that will allow us to connect to Reddit.
pandas: a library used widely in data science and analysis for cleaning, transforming, and analyzing data.
datetime: a module built into Python that allows the user to manipulate dates and times.
In short, we will create a Reddit account for the project, connect to it via praw, and extract specifically defined data using pandas. Using praw we can define a subreddit to scrape, change the number of posts that are scraped (for example, 50 or 500), and choose the ordering by which they are retrieved, such as "top", "controversial", "new", "hot", or "gilded". A minimal sketch of this workflow follows.
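To make this concrete, below is a minimal sketch of the Reddit scrape, assuming Google Colab as the environment. The client ID, client secret, and user agent are placeholders that would come from the Reddit application registered for the project, and the selected fields are illustrative rather than final.

import datetime

import pandas as pd
import praw

# Placeholder credentials from the Reddit app registered for the project.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="faculty-research-scraper",
)

rows = []
# Pull the top 500 posts from r/librarians; .new(), .hot(), etc. work the same way.
for post in reddit.subreddit("librarians").top(limit=500):
    rows.append({
        "id": post.id,
        "title": post.title,
        "body": post.selftext,
        "score": post.score,
        "created": datetime.datetime.fromtimestamp(post.created_utc),
    })

# Collect the posts into a pandas DataFrame and save them as a CSV file.
df = pd.DataFrame(rows)
df.to_csv("librarians_top500.csv", index=False)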
● Twitter- data scraping
Twitter data will be extracted similarly: a package connects to Twitter, and the data is then formatted using the pandas and csv packages (a sketch follows the package descriptions below).
tweepy: an open-source Python package that provides access to the Twitter API (application programming interface). To gain access to this interface, a developer account is needed, which is also free to sign up for.
csv: a module built into Python that allows the user to create, read, and write .csv files, which will let us easily save the data and export it from Colab.
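Below is a minimal sketch using Tweepy's v2 client; the bearer token is a placeholder from the project's developer account. Note that the standard recent-search endpoint only reaches back about seven days, so longer historical ranges would require elevated API access.

import csv

import tweepy

# Placeholder bearer token from the project's Twitter developer account.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Search recent tweets with the project hashtag, excluding retweets.
response = client.search_recent_tweets(
    query="#librarytwitter -is:retweet",
    tweet_fields=["created_at", "author_id"],
    max_results=100,
)

# Write the results to a CSV file for export from Colab.
with open("librarytwitter.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "author_id", "text"])
    for tweet in response.data or []:
        writer.writerow([tweet.id, tweet.created_at, tweet.author_id, tweet.text])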
● Voyant- analysis
Voyant is a web-based tool for performing text analysis. Users upload text and Voyant performs tasks such as identifying phrases, term occurrences, and key terms, and visualizing the data. This opens new avenues for interpretation and facilitates analysis and data organization more efficiently than conventional research methods. Due to their technical knowledge and training, librarians are becoming ever more engaged with scholars in the teaching of digital tools, access, and the preparation of datasets (Wallace & Feeney, 2018, p. 24). Using tools such as Voyant meets the library's objectives of inventiveness, interdisciplinary exploration, and collaboration.
For a list of tools and guides, visit the Voyant Tools Help page.
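Voyant accepts plain-text uploads, so one simple way to prepare the scraped data is to concatenate the text column of the exported CSV into a single corpus file. The sketch below assumes the filename and "body" column from the Reddit example above.

import pandas as pd

# Load the CSV produced by the Reddit scrape (filename assumed from the
# earlier sketch) and concatenate its text column into one corpus file.
df = pd.read_csv("librarians_top500.csv")

with open("librarians_corpus.txt", "w", encoding="utf-8") as f:
    for text in df["body"].dropna():
        f.write(text + "\n\n")  # blank line between posts keeps them distinct

The resulting librarians_corpus.txt file can then be uploaded directly to Voyant through its web interface.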
● Institutional repository
The university's institutional repository is available to support the preservation of any data created through this project. All data should meet the accepted ethical benchmarks of the scholarly community. The university covers all costs associated with this service. More details about the sustainability and use of this tool can be found in subsequent sections.
Resourcing
Resources for web scraping, analysis and sustainability of this project will either be open-source or supported through institutional operational costs.
Web Scraping & Analysis: Python, Google Colab, and Voyant are all free to use and web-based. This will maximize interoperability across users' computer systems and make collaboration easier should the faculty's team employ research assistants or engage other professionals in their field.
Institutional Repository: This is freely accessible to faculty and staff for preserving and providing access to their data and research. Faculty are encouraged to make use of this resource.
Hardware: Due to the nature of the digital tools we are using, the faculty’s existing devices are likely sufficient to perform this work. If not, or if any research assistants need access to computers, the library can provide the tools and space.
Staff Time: The librarian partnering in this project will be committed to educating faculty about digital tools, harvesting data from Twitter and Reddit, and liaising between the faculty research team and library services throughout the project.
Maintenance and Sustainability
The maintenance and sustainability requirements of this project are minimal. One consideration is the ability to replicate scrapes using Python. Unlike software that may become incompatible or unsupported over time, the Python code we use for scraping data should, in principle, remain runnable in the future. This provides the opportunity to produce scheduled (weekly, monthly) scrapes or follow-up datasets, as sketched below.
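As an illustration of this replicability, the scrape logic could be wrapped in a reusable function and each run stamped with its date; the function name, fields, and credentials below are placeholder assumptions, not a final design.

import datetime

import pandas as pd
import praw


def scrape_subreddit(name, limit=500):
    """Return the top posts of a subreddit as a pandas DataFrame."""
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",        # placeholder credentials
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="faculty-research-scraper",
    )
    posts = reddit.subreddit(name).top(limit=limit)
    return pd.DataFrame(
        [{"id": p.id, "title": p.title, "score": p.score} for p in posts]
    )


# Stamp each run with the scrape date so repeated datasets stay distinct.
today = datetime.date.today().isoformat()
scrape_subreddit("librarians").to_csv(f"librarians_{today}.csv", index=False)

Re-running the Colab notebook on a regular schedule would then yield comparable, dated datasets over time.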
The other consideration is how the data for this project is conserved. Since the faculty member will be encouraged to upload their project data to the institutional repository, there is no need for this individual project to be maintained separately: the data will be secured, preserved, and made accessible as part of the university library's larger holdings.
Ethical/Legal Concerns
Due to the public, online nature of the data used in this project, some ethical concerns will invariably be raised during its collection, analysis, or presentation. While precedent has been set by the many research projects built around social media posts and comments, it should be our goal to identify any possible ethical pitfalls in this project. Concerns include providing anonymity to users, the ethics of sharing the project's collected data, and the blurred lines of consent.
Below are three articles that discuss the ethical challenges of projects that use social media data. It is suggested that we review this literature to ensure the project is, at minimum, in line with current research trends involving the ethics of using social media data.
Norman Adams, N. (2022). “Scraping” Reddit posts for academic research? Addressing some blurred lines of consent in growing internet-based research trend during the time of Covid-19. International Journal of Social Research Methodology, 1–16. https://doi.org/10.1080/13645579.2022.2111816
Proferes, N., Jones, N., Gilbert, S., Fiesler, C., & Zimmer, M. (2021). Studying Reddit: A Systematic Overview of Disciplines, Approaches, Methods, and Ethics. Social Media + Society, 7(2), 1-14. https://doi.org/10.1177/20563051211019004
Reagle, J. (2022). Disguising Reddit sources and the efficacy of ethical research. Ethics and Information Technology, 24(3), 41–41. https://doi.org/10.1007/s10676-022-09663-w
Sun-setting and Project Completion
A key way to ensure that the project is sunset to the satisfaction of both parties is for each party to have defined roles and boundaries. This will help the project run more efficiently and enable each party to meet its own goals and the overall project's goals.
Librarian
Perform data scraping from Twitter and Reddit using predetermined criteria (i.e., hashtags, keywords, subreddits, dates, etc.)
Educate the faculty member and their team regarding the advantages and uses of digital research tools (e.g., through a workshop or additional documentation)
Advise research team regarding methods for organizing datasets
Communicate the limits of each digital tool being used so that project expectations remain manageable
Assist with general research
Liaise between faculty and other library staff (if additional expertise is needed)
Faculty & team
Clearly define project goals and communicate the scraping parameters that best suit their research (i.e., hashtags, keywords, subreddits, dates, etc.)
Attend digital tool workshops
Prepare scraped data for digital analysis tools (outline provided by librarian)
Perform analysis of scraped data
Upload data to institutional repository (instruction provided by librarian if needed)
Finally, once the project is completed it is vital to hold a post-mortem so that all parties involved can review the project from start to finish and identify what went right and what can be improved. This reflective stage is important and often overlooked, and the resulting discussion will benefit future projects.
Appendix
● Python Example 1: Reddit. See the code here for an example scrape. This instance scrapes data from the top 500 posts on r/librarians.
● Python Example 2: Twitter. See the code here for an example scrape. This instance scrapes data using the hashtag #librarytwitter from 1 to 7 December 2022.
● Voyant Example: This screencap shows some of the versatility of Voyant. With multiple views, the user can explore ideas and keywords through visuals and text at the same time. In this example, the focus is on Reddit users and the keyword "advice". Voyant highlights key terms in the corpus, lets the user focus on the context of particular terms, and visualizes language use.
References
Sherman Centre for Digital Scholarship. (n.d.). What is Digital Scholarship? https://scds.ca/what-is-ds/
Wallace, N., & Feeney, M. (2018). An Introduction to Text Mining: How Libraries Can Support Digital Scholars. Qualitative & Quantitative Methods in Libraries, 7(1), 23–30.