Social Media Data Analysis Tool

On social media, users generate content in many forms, including video, images, text, and geospatial data. Much of this content is freely available, which adds up to a very large pool of data. This data can be used for many purposes: by corporations to improve business processes, by policymakers to identify trends in public opinion, by public health officials to monitor infectious disease outbreaks, and by first responders to coordinate rescue efforts after a natural disaster.


Initial Questions

There is a plethora of freely available toolkits for performing natural language processing (NLP) and machine learning (ML) on large datasets, including those from social media sources. But you have to know how to write code to use them!

The problem

There is a lot of interest in leveraging social media data for research in social science, urban planning, epidemiology, and other fields, but most researchers lack the programming skills needed to use existing NLP toolkits. A common refrain is “I’ll get around to learning how to use these tools some day.” We decided to make SOMEDAEX happen.


What can this application do?

1. Sentiment analysis and other text classification tasks
2. Named entity recognition
3. Geocoding of recognized named locations
4. Topic modeling
5. Network analysis
6. Interactive visualization with linked data views

How will it work?

Existing Python libraries for data analysis and natural language processing: Transformers, spaCy, Stanza, Flair, GeoPy, Top2Vec, NetworkX, NLTK, etc.
Pretrained models from Hugging Face, spaCy, and StanfordNLP.
Existing JavaScript libraries for data visualization: D3, Vega, Vega-Lite, deck.gl, Chart.js, etc.


Creating a Proof of Concept

Goals for the Proof of Concept

1. Allow users to incrementally construct a data processing pipeline by using a simple question-based interface.
2. Promote user understanding of the data analysis tasks by showing the results of each task alongside its configuration.
3. Enable rapid prototyping and exploration by using the Streamlit framework.
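
The first two goals can be illustrated with a minimal sketch of an incrementally built pipeline that keeps each step's result next to its configuration. This is not the proof of concept's actual code: the `Step` and `Pipeline` classes and the two toy tasks (`lowercase`, `keep_containing`) are hypothetical names introduced here only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """One pipeline task: its name, its configuration, and its latest result."""
    name: str
    func: Callable[..., Any]
    config: dict = field(default_factory=dict)
    result: Any = None  # filled in when the pipeline runs


class Pipeline:
    """Hypothetical pipeline that grows one step at a time."""

    def __init__(self) -> None:
        self.steps: list[Step] = []

    def add_step(self, name: str, func: Callable[..., Any], **config: Any) -> "Pipeline":
        # Each answered question in the interface would append one step.
        self.steps.append(Step(name, func, config))
        return self

    def run(self, data: Any) -> Any:
        # Feed the output of each step into the next and remember every
        # intermediate result so it can be displayed beside the step's config.
        for step in self.steps:
            data = step.func(data, **step.config)
            step.result = data
        return data


# Two toy tasks standing in for real NLP tasks (assumptions, not from the app).
def lowercase(texts):
    return [t.lower() for t in texts]


def keep_containing(texts, keyword):
    return [t for t in texts if keyword in t]


pipeline = Pipeline()
pipeline.add_step("lowercase", lowercase)
pipeline.add_step("filter", keep_containing, keyword="flood")
final = pipeline.run(["Flood warning downtown", "Sunny day", "FLOOD on Main St"])

# Show every step's configuration alongside its result.
for step in pipeline.steps:
    print(step.name, step.config, step.result)
```

Because every step stores its own result, a user interface could render each task's output next to the settings that produced it. Re-running the whole chain on every change, as this sketch does, is also one plausible reason such a design slows down as pipelines and datasets grow.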

Lessons learned

1. Interface concepts seemed to work well, but no rigorous user testing was performed.
2. Streamlit was not designed for the level of interactivity we envisioned.
3. The app became painfully slow both as the pipeline grew in complexity and as the size of the dataset increased.


Tasks

NLP Architecture


Top Hashtags in the Tweets

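
As a rough idea of how a hashtag ranking like this could be computed, here is a minimal sketch that uses only the Python standard library. The function name `top_hashtags`, the sample tweets, and the choice to treat hashtags case-insensitively are assumptions made for this example, not details taken from the app.

```python
import re
from collections import Counter

# A hashtag is taken to be "#" followed by word characters (an assumption;
# real platforms apply more detailed rules).
HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags as (hashtag, count) pairs."""
    counts = Counter(
        tag.lower()  # treat #Flood and #flood as the same hashtag
        for text in tweets
        for tag in HASHTAG_RE.findall(text)
    )
    return counts.most_common(n)


# Made-up sample tweets for illustration only.
sample = [
    "Heavy rain tonight #Flood #weather",
    "Stay safe everyone #flood",
    "Road closed downtown #traffic #Flood",
]
print(top_hashtags(sample, n=3))
```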


Scatter Plot of Tweet Sentiment



Scatter Plot of Tweet Emotion over Time



Geo-Plotting of Tweets



Topic Modeling


Building the App

1. The NLP toolkits we want to use are written in Python.
2. The data visualization libraries are written in JavaScript.
3. Our solution is to build an Electron app that runs a Python background process for data processing.
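
The source does not say how the Electron front end and the Python background process talk to each other. One common arrangement, assumed here purely for illustration, is for the front end to spawn the Python process (for example with Node's `child_process` module) and exchange newline-delimited JSON over its standard input and output. The sketch below shows what the Python side of such a setup might look like; the `word_count` task and the message fields `id`, `task`, `text`, `result`, and `error` are hypothetical.

```python
import io
import json
import sys


def handle_request(request):
    """Dispatch one request to a task; only a toy 'word_count' task is shown."""
    if not isinstance(request, dict):
        return {"error": "request must be a JSON object"}
    task = request.get("task")
    if task == "word_count":
        text = request.get("text", "")
        return {"id": request.get("id"), "result": len(text.split())}
    return {"id": request.get("id"), "error": f"unknown task: {task!r}"}


def serve(in_stream=sys.stdin, out_stream=sys.stdout):
    """Read one JSON request per line and write one JSON response per line."""
    for line in in_stream:
        line = line.strip()
        if not line:
            continue
        try:
            response = handle_request(json.loads(line))
        except json.JSONDecodeError as exc:
            response = {"error": f"invalid JSON: {exc}"}
        out_stream.write(json.dumps(response) + "\n")
        out_stream.flush()  # flush so the parent process sees the reply promptly


# Demonstration with in-memory streams; a real worker would call serve()
# with its defaults and be driven by the parent process instead.
demo_in = io.StringIO('{"id": 1, "task": "word_count", "text": "flood on main st"}\n')
demo_out = io.StringIO()
serve(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the protocol to one JSON object per line makes it easy for the JavaScript side to split the output stream on newlines and match replies to requests by `id`.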

What is Electron?

Electron is an open-source, cross-platform software framework that allows you to build desktop applications using web technologies (HTML, CSS, JavaScript). Unlike a browser-based web application, an Electron app has access to the local file system and the ability to spawn child processes.
