Social Media Data Analysis Tool
On social media, users generate content in many forms, including video, images, text, and geospatial data. Much of this content is freely available, and together it adds up to a huge volume of data. This data can be used for many purposes: by corporations to improve business processes, by policymakers to identify trends in public opinion, by public health officials to monitor infectious disease outbreaks, and by first responders to coordinate rescue efforts after a natural disaster.
Initial Questions
There is a plethora of freely available toolkits for performing natural language processing (NLP) and machine learning (ML) on large datasets, including those from social media sources. …but you have to know how to write code to use them!
The problem
There is a lot of interest in leveraging social media data for research in social science, urban planning, epidemiology, and other fields, but most researchers lack the programming skills necessary to use existing NLP toolkits. “I’ll get around to learning how to use these tools some day.” We decided to make SOMEDAEX happen.
What can this application do?
1. Sentiment analysis and other text classification tasks
2. Named entity recognition
3. Geocoding of recognized named locations
4. Topic modeling
5. Network analysis
6. Interactive visualization with linked data views
How will it work?
Existing Python libraries for data analysis and natural language processing
Transformers, spaCy, Stanza, Flair, GeoPy, Top2Vec, NetworkX, NLTK, etc.
Pretrained models from Hugging Face, spaCy, and StanfordNLP
Existing JavaScript libraries for data visualization
D3, Vega, Vega-Lite, Deck.gl, Chart.js, etc.
Creating a Proof of Concept
Goals for the Proof of Concept
1. Allow users to incrementally construct a data processing pipeline by using a simple question-based interface.
2. Promote user understanding of the data analysis tasks by showing the results of each task alongside its configuration.
3. Enable rapid prototyping and exploration by using the Streamlit framework.
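The first two goals can be illustrated with a minimal sketch in plain Python (not Streamlit, and not the project's actual code). It assumes a pipeline is simply an ordered list of steps, each stored with the configuration the user chose, so that every result can be shown next to the settings that produced it. The names Step, Pipeline, lowercase, and keep_containing are hypothetical, and the sample texts are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """One pipeline task together with the configuration chosen for it."""
    name: str
    func: Callable[..., Any]
    config: dict = field(default_factory=dict)


class Pipeline:
    """Hypothetical container that grows one step at a time."""

    def __init__(self):
        self.steps = []

    def add_step(self, name, func, **config):
        # Each answer in a question-based interface could map to one call here.
        self.steps.append(Step(name, func, config))
        return self

    def run(self, data):
        """Run every step in order, keeping each result next to its config."""
        report = []
        for step in self.steps:
            data = step.func(data, **step.config)
            report.append({"task": step.name, "config": step.config, "result": data})
        return report


# Two toy tasks standing in for real NLP steps.
def lowercase(texts):
    return [t.lower() for t in texts]


def keep_containing(texts, keyword):
    return [t for t in texts if keyword in t]


pipeline = Pipeline()
pipeline.add_step("lowercase", lowercase)
pipeline.add_step("filter", keep_containing, keyword="flood")

# Made-up sample data.
report = pipeline.run(["Flood warning downtown", "Sunny day at the park"])
for entry in report:
    print(entry["task"], entry["config"], entry["result"])
```

Keeping the configuration and the result in the same record is one possible way to display them side by side; a real interface would render each record rather than print it.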
Lessons learned
1. Interface concepts seemed to work well, but no rigorous user testing was performed.
2. Streamlit was not designed for the level of interactivity we envisioned.
3. The app became painfully slow both as the pipeline grew in complexity and as the size of the dataset increased.
Tasks
NLP Architecture
Top Hashtags in the Tweets
Scatter Plot of Tweet Sentiment
Scatter Plot of Tweet Emotion over Time
Geo-Plotting of Tweets
Topic Modeling
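As an illustration of the simplest task in this list, counting the most frequent hashtags can be sketched with the Python standard library alone. This is an assumption about one reasonable approach, not the project's implementation: the function name top_hashtags, the regular expression, the choice to treat hashtags case-insensitively, and the sample tweets are all hypothetical.

```python
import re
from collections import Counter

# Matches a '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(tweets, n=10):
    """Return the n most common hashtags, ignoring case."""
    counts = Counter(
        tag.lower()
        for text in tweets
        for tag in HASHTAG.findall(text)
    )
    return counts.most_common(n)


# Made-up sample tweets for illustration only.
sample_tweets = [
    "Roads closed after the storm #flood #Houston",
    "Volunteers needed downtown #Flood",
    "Stay safe everyone #houston #flood",
]

top = top_hashtags(sample_tweets, n=2)
print(top)  # [('flood', 3), ('houston', 2)]
```

The resulting (hashtag, count) pairs are in a shape that a bar chart in any of the visualization libraries could consume directly.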
Building the App
1. The NLP toolkits we want to use are written in Python.
2. The data visualization libraries are written in JavaScript.
3. Our solution is to build an Electron app that runs a Python background process for data processing.
What is Electron?
Electron is an open-source, cross-platform software framework that allows you to build desktop applications using web technologies (HTML, CSS, JavaScript). Unlike a browser-based web application, an Electron app has access to the local file system and the ability to spawn child processes.
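The Python side of such a background process could look roughly like the sketch below. The source does not say how the Electron app and the Python process communicate, so the protocol here is an assumption: one JSON request per line on standard input, one JSON response per line on standard output. The function names handle_request and serve, the request fields id, task, and text, and the toy count_words task are all hypothetical.

```python
import io
import json


def handle_request(request):
    """Dispatch one request; only a toy 'count_words' task is implemented."""
    if not isinstance(request, dict):
        return {"id": None, "error": "request must be a JSON object"}
    if request.get("task") == "count_words":
        words = request.get("text", "").split()
        return {"id": request.get("id"), "result": len(words)}
    return {"id": request.get("id"), "error": "unknown task"}


def serve(in_stream, out_stream):
    """Read one JSON request per line and write one JSON response per line."""
    for line in in_stream:
        line = line.strip()
        if not line:
            continue
        try:
            response = handle_request(json.loads(line))
        except json.JSONDecodeError:
            response = {"id": None, "error": "invalid JSON"}
        out_stream.write(json.dumps(response) + "\n")
        out_stream.flush()  # flush so the parent process sees each reply promptly


# Demo with in-memory streams. A real worker would instead call
# serve(sys.stdin, sys.stdout) and be started as a child process by the app.
demo_in = io.StringIO('{"id": 1, "task": "count_words", "text": "storm hits the coast"}\n')
demo_out = io.StringIO()
serve(demo_in, demo_out)
print(demo_out.getvalue(), end="")  # {"id": 1, "result": 4}
```

Line-delimited JSON is only one option; it is shown here because it needs nothing beyond the standard streams that a spawned child process already has.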