Only at this stage of the project was I able to bring in the thing that it was all for: predicting stock values.
It was unexpectedly difficult to find a good source of historical stock data because the Yahoo data source had been deprecated for Python. However, the pandas datareader let me pull data from the Google source, which is known to be unstable but worked well enough for me.
The stock data was reduced to the difference between the market's open and close on that day, and the other values were used as multipliers. The formula for value, meaning the change in stock value attributed to a tweet, was developed from a couple of scikit-learn regression analyses.
The formula:
value = change*(followers/totalfollowers)*polarity*confidence/(currentprice)*10000
The value for a tweet is an estimate of the change caused by that tweet. It takes the people reached (followers) as a fraction of total followers and multiplies that by the confidence, the polarity, and the relative change in the stock on that day (the open-to-close difference divided by the opening price). The 10000 multiplier scales the result to a reasonable percentage change, using the Kylie Jenner tweet as a baseline.
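To make the formula concrete, here is a minimal sketch that applies it to a single tweet. The input numbers are made up for illustration and do not come from the dataset, and `tweet_value` is a hypothetical name that mirrors `getstockvalue` in the script.

```python
def tweet_value(change, followers, totalfollowers, polarity, confidence, currentprice):
    # Share of the total audience reached by this tweet
    reach = followers / totalfollowers
    # Open-to-close move relative to the opening price
    relative_change = change / currentprice
    # Same terms as the formula above, grouped for readability
    return relative_change * reach * polarity * confidence * 10000


# Hypothetical inputs: a $2 move on a $100 stock, and a tweet that reaches
# 5% of all followers with polarity 0.5 and confidence 0.8
value = tweet_value(
    change=2.0,
    followers=50_000,
    totalfollowers=1_000_000,
    polarity=0.5,
    confidence=0.8,
    currentprice=100.0,
)
print(value)  # about 4.0 for these made-up inputs
```

A negative polarity flips the sign of the result, so a negative tweet is credited with a downward move of the same size.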
The full code is below:
sentiment.py
import pandas as pd
from pandas_datareader import data as web
# Package and modules for importing data; this code may change depending on pandas version
import datetime
import sys
# Load the clustered tweets for the cluster named in the first command-line argument
df = pd.read_csv("clusterdata/"+sys.argv[1]+"cluster.csv",index_col=0,encoding='latin-1')
def getstockvalue(change, followers, totalfollowers, polarity, confidence, currentprice):
    # determines effect of tweet on stock price change
    '''
    date = date.split()[0]
    date = date.split("/")
    start = datetime.datetime(int(date[2]),int(date[0]),int(date[1])-1)
    print(start)
    apple = web.DataReader("AAPL", "google", start)
    type(apple)
    print(apple)
    '''
    value = change*(followers/totalfollowers)*polarity*confidence/(currentprice)*10000
    print(value)
    return value
start = datetime.datetime(2018,2,24)
# Get stock data for the ticker symbol given as the second command-line argument (e.g. AAPL for Apple)
# First argument is the series we want, second is the source ("google" for Google Finance), third is the start date
apple = web.DataReader(sys.argv[2], "google", start)
print(apple)
# Use the first returned trading day; .iloc selects by position on the date-indexed frame
currentprice = apple['Open'].iloc[0]
dif = abs(apple['Open'].iloc[0] - apple['Close'].iloc[0])
print(dif)
totalfollowers = df['followers'].sum()
for index,row in df.iterrows():
    followers = row['followers']
    polarity = row['polarity']
    confidence = row['sentiment_confidence']
    df.at[index,'difference'] = getstockvalue(dif, followers, totalfollowers, polarity, confidence, currentprice)
df.to_csv('finaldata/'+sys.argv[2]+'.csv')
print(df.head())
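
The row-by-row loop can also be written as a single column expression. This is a sketch under stated assumptions rather than the code I actually ran: the small DataFrame and the open/close prices are made-up stand-ins for the cluster CSV and the downloaded quote, and only the three columns the formula needs are included.

```python
import pandas as pd

# Hypothetical stand-in for the cluster CSV (the real file has more columns)
df = pd.DataFrame({
    "followers": [100, 300, 600],
    "polarity": [0.5, -1.0, 0.25],
    "sentiment_confidence": [0.9, 0.8, 1.0],
})

# Made-up open and close prices standing in for the downloaded quote
currentprice = 100.0
dif = abs(100.0 - 102.0)

totalfollowers = df["followers"].sum()

# Same formula as getstockvalue, applied to whole columns at once
df["difference"] = (
    dif
    * (df["followers"] / totalfollowers)
    * df["polarity"]
    * df["sentiment_confidence"]
    / currentprice
    * 10000
)
print(df)
```

For a few thousand tweets the loop is fast enough, so this is mostly a readability choice. The column version also skips the per-row `print` inside `getstockvalue`.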