Web scraping

Stocks on Stocktwits can receive thousands of tweets per day, depending on market activity. The goal is to efficiently parse and store these tweets for easy access. To keep up with the stream of tweets coming from the live site, each page must be scraped at a regular interval. Stocktwits provides a public API for fetching tweets and other information from its database, but its rate limits and restrictions make it impossible to fetch all of the data needed.

There are two basic categories of web scraping in the current system: stock scraping and user scraping.

Stock scraping

Webpage scraping is done using Selenium and BeautifulSoup. Selenium first opens a page with the Chromium web driver and begins parsing. Chrome options are assigned to the driver for efficiency, such as running in headless mode and disabling images on page load.
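
As a sketch, the driver setup might look like the following; the exact option set here is an assumption, though --headless and the image-blocking preference are standard Chrome options:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # render without a visible window
# Chrome preference that blocks image loading (2 = block)
options.add_experimental_option(
    'prefs', {'profile.managed_default_content_settings.images': 2})
driver = webdriver.Chrome(options=options)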

On page load, the web driver scrolls to the bottom of the page and keeps scrolling for a specified number of scrolls, which is determined by the last time that stock was parsed:

driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
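
A minimal sketch of that loop, assuming num_scrolls has already been computed from the stock's last parse time and using an assumed fixed pause so newly loaded tweets can render:

import time

def scroll_page(driver, num_scrolls, pause=1.0):
    # 'pause' is an assumed fixed delay for new tweets to load
    for _ in range(num_scrolls):
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        time.sleep(pause)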

Then, using BeautifulSoup, a soup object is extracted from the webpage for parsing. Each tweet is saved using a general data structure like this example:

{
    'symbol': 'AAPL',
    'user': 'Jstones',
    'time': '2020-07-11 14:34:00',
    'isBull': True,
    'likeCount': 1,
    'commentCount': 0,
    'messageText': '$AAPL is expected to announce four new iPhones.'
}
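
A sketch of how such a structure might be filled in; the class names below are placeholders, since the real StockTwits markup has to be read off the live page:

from bs4 import BeautifulSoup

def parse_tweets(driver, symbol):
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    tweets = []
    # All class names here are placeholders, not StockTwits' actual markup
    for msg in soup.find_all('div', class_='message'):
        tweets.append({
            'symbol': symbol,
            'user': msg.find('a', class_='username').get_text(strip=True),
            'time': msg.find('time')['datetime'],
            'isBull': msg.find('span', class_='bullish') is not None,
            'likeCount': int(msg.find('span', class_='like-count').get_text(strip=True) or 0),
            'commentCount': int(msg.find('span', class_='comment-count').get_text(strip=True) or 0),
            'messageText': msg.find('div', class_='message-text').get_text(strip=True),
        })
    return tweets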

User scraping

For users, the scraping logic is the same but is done to catch tweets that aren't found through stock scraping. Additional user features are extracted along with the tweets from each user's feed, using a data structure like this example:

{
    'join_date': '2019-05-03',
    'followers': 243,
    'following': 53,
    'ideas': 803, # number of tweets
    'tier': 0 # 0 = default user
}
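
A corresponding sketch for the profile features; as in the tweet-parsing sketch, the class names are placeholders rather than the real StockTwits markup:

def parse_user_features(soup):
    # Class names are placeholders; inspect the live profile page for real ones
    return {
        'join_date': soup.find('span', class_='join-date').get_text(strip=True),
        'followers': int(soup.find('span', class_='followers').get_text(strip=True)),
        'following': int(soup.find('span', class_='following').get_text(strip=True)),
        'ideas': int(soup.find('span', class_='ideas').get_text(strip=True)),
        'tier': 0,  # 0 = default user; tier detection not shown here
    }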

A user is only parsed if they have at least 200 ideas, so that users without many tweets are not unnecessarily parsed. So far, about 200,000 users have been seen through stock parsing, and of those, only 65,000 have enough ideas to be stored in the database. This threshold cuts user-parsing compute time to roughly a third of what it would otherwise be.
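
As a sketch, the threshold check can be as simple as:

MIN_IDEAS = 200

def worth_parsing(user_features):
    # Skip low-activity users to save compute
    return user_features['ideas'] >= MIN_IDEAS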

User parsing is done with one of the following three strategies (a dispatch sketch follows the list):

  1. Update user: Re-parse a user who has been parsed before, picking up tweets posted since the last parse.

  2. New user: Parse a new user for the first time.

  3. Error user: Re-parse a user whose last parse failed (e.g., API down, Chrome error).
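
As a sketch, strategy selection might be a simple dispatch over a hypothetical user record from the database:

def choose_strategy(user_record):
    # 'user_record' is a hypothetical row from the user database
    if user_record.get('had_error'):
        return 'error'   # re-parse after a failure (API down, Chrome error)
    if user_record.get('last_parsed') is None:
        return 'new'     # first-time parse
    return 'update'      # re-parse, fetching tweets since last_parsed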
