Web scraping
Stocks on StockTwits can receive thousands of tweets per day, depending on the time. The goal is to efficiently parse and store these tweets for easy access. To extract tweets, each page must be scraped at a regular interval to keep up with the stream of tweets coming from the live site. StockTwits provides a public API for fetching tweets and other information from their database, but its restrictions and rate limits make it impossible to fetch all the data necessary.
There are two basic categories of web scraping in the current system: user scraping and stock scraping.
Webpage scraping is done using Selenium and BeautifulSoup. Selenium first uses the Chromium web driver to open a page and begin parsing. Chrome options are assigned to the driver for efficiency, such as running in headless mode and disabling images on page load.
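A minimal sketch of that driver setup, assuming standard Chromium flags for headless mode and image blocking; the real configuration may include additional options.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def build_driver():
    options = Options()
    options.add_argument("--headless")   # run without a visible browser window
    options.add_argument("--disable-gpu")
    # Skip image downloads so pages load faster
    options.add_argument("--blink-settings=imagesEnabled=false")
    return webdriver.Chrome(options=options)
```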
On page load, the web driver scrolls to the bottom of the page and keeps scrolling until a specified number of scrolls is reached. The number of scrolls is determined by how long ago that stock was last parsed.
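A sketch of that scroll loop; how `scroll_count` is derived from the last parse time is not shown here, and the pause length is an assumption.

```python
import time

def scroll_page(driver, scroll_count, pause=1.0):
    """Scroll to the bottom of the feed scroll_count times."""
    for _ in range(scroll_count):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the infinite-scroll feed time to load more tweets
```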
Then, using BeautifulSoup, a soup object is extracted from the webpage for parsing. Each user's tweets are saved using a general data structure like the example below.
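A hedged illustration of the parsing step and the stored tweet record. The CSS selector and field names are placeholders, not the project's actual schema or the live StockTwits markup.

```python
from bs4 import BeautifulSoup

def parse_tweets(page_source):
    """Turn a loaded page into a list of tweet records."""
    soup = BeautifulSoup(page_source, "html.parser")
    tweets = []
    # Placeholder selector; the real StockTwits markup differs.
    for node in soup.find_all("div", class_="message"):
        tweets.append({
            "user": node.get("data-user"),      # handle of the poster
            "text": node.get_text(strip=True),  # tweet body
            "timestamp": node.get("data-time"), # post time
            "likes": 0,                         # engagement counts, if present
        })
    return tweets
```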
For users, the scraping logic is the same, but it is done to catch tweets that aren't found through stock scraping. Additional user features are extracted along with the tweets from each user's feed, using a data structure like the example below.
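A hypothetical user record combining profile features with the user's tweets; the field names are illustrative only.

```python
# Illustrative user record; field names are assumptions, not the project's schema.
user_record = {
    "username": "example_user",
    "join_date": "2019-05-01",
    "followers": 150,
    "following": 80,
    "ideas": 1200,          # total tweet ("idea") count shown on the profile
    "last_parsed": "2021-01-01T00:00:00Z",
    "tweets": [],           # list of tweet records as in the previous sketch
}
```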
To parse a user, the minimum idea count is 200, so that users who don't have many tweets are not unnecessarily parsed. Currently, about 200,000 users have been seen through stock parsing, and of those, only 65,000 users with enough ideas are stored in the database. This threshold lowers user-parsing compute time by about one third.
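A small sketch of how that threshold might be applied; the constant matches the 200 mentioned above, but the helper name and record shape are assumptions.

```python
MIN_IDEAS = 200  # minimum idea count before a user is worth parsing

def should_parse(user_record):
    """Skip users whose profile shows fewer than MIN_IDEAS tweets."""
    return user_record.get("ideas", 0) >= MIN_IDEAS
```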
User parsing is done with one of the following three strategies (see the sketch after this list):
Update user: Re-parse users that have been parsed before, picking up tweets posted since the last parse.
New user: Parse a new user for the first time.
Error user: Re-parse a user for whom an error occurred (e.g., API down, Chrome error)
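A sketch of how the three strategies might be selected for a stored user; the status fields and returned labels are assumptions, not the project's actual logic.

```python
def choose_strategy(user_record):
    """Pick one of the three user-parsing strategies for a stored user."""
    if user_record.get("error"):
        return "error_user"   # re-parse after a failed attempt
    if user_record.get("last_parsed"):
        return "update_user"  # only fetch tweets since the last parse
    return "new_user"         # full first-time parse
```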