Introduction

This presentation aims to equip club members with some basic tools and knowledge to succeed in the upcoming Quant Quest challenge.

Machine Learning Pipeline

  1. Obtain data
    • Either from scraping, downloading, or other means.
  2. Preprocess data
    • Remove unwanted data.
    • Filter out noise.
    • Partition the data into training, validation, and test sets.
    • Scale, shift, and normalize.
  3. Find a good representation
    • The purpose of this step is to find a more informative representation of the data.
    • In NLP, a good representation can be word counts or tf-idf.
    • Dimensionality reduction.
  4. Train the classifier/regressor
  5. Test
    • Accuracy, false-positive rate, false-negative rate, F1 score, etc. (a minimal end-to-end sketch of these steps follows this list)
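
To make the pipeline concrete, here is a minimal end-to-end sketch using scikit-learn's bundled digits dataset as a stand-in for real data. This is only an illustration, not part of the Quant Quest workflow; the dataset and the choice of LogisticRegression are arbitrary.

In [ ]:
# Minimal sketch of the five pipeline steps on a toy dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Obtain data (here a bundled toy dataset stands in for scraped data)
digits = load_digits()
X, y = digits.data, digits.target

# 2. Preprocess: partition into training and test sets, then scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# 3. Representation: the raw pixel features are used as-is in this toy example

# 4. Train a classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)

# 5. Test: accuracy, precision/recall, and F1 per class
print(classification_report(y_test, clf.predict(X_test)))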

Some Tools

This section details some Python libraries that might be helpful; a short usage sketch of the first two follows the list.

  1. Numerical analysis
    • numpy - Linear algebra, matrix and vector manipulation
    • pandas - Data analysis, data manipulation
  2. Machine learning
    • scikit-learn - General machine learning. Supports basic and advanced algorithms, but runs only on the CPU.
    • theano - Deep learning framework.
    • tensorflow - Another deep learning framework.
  3. Natural language processing
  4. Utilities
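
As a quick taste of the first two libraries, here is a short sketch; the array values and tickers below are made up purely for illustration.

In [ ]:
import numpy as np
import pandas as pd

# numpy: matrices, vectors, and linear algebra
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, -1.0])
print(A.dot(v))          # matrix-vector product
print(np.linalg.inv(A))  # matrix inverse

# pandas: tabular data manipulation
df = pd.DataFrame({'ticker': ['AAA', 'BBB', 'CCC'],
                   'price': [10.0, 25.0, 40.0]})
print(df[df['price'] > 20.0])   # filter rows by a condition
print(df['price'].mean())       # column statistics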

Download

You can get most of these libraries from the Anaconda distribution or from the links above.

Obtain data

This section introduces basic tools for downloading a text corpus from Wikipedia articles. We will download the content of the Wikipedia article of each company in the S&P 500.

In [ ]:
import urllib2
import string
import time
import os
from bs4 import BeautifulSoup, NavigableString
import wikipedia as wk
In [ ]:
def initOpener():
    # Build a urllib2 opener with a browser-like User-agent header so that
    # Wikipedia serves the page content.
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    return opener

The function below outputs a dictionary whose keys are stock tickers and whose values are article URL paths. These URLs are then used for scraping.

In [ ]:
def getSP500Dictionary():
    stockTickerUrl = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

    usableStockTickerURL = initOpener().open(stockTickerUrl).read()

    stockTickerSoup = BeautifulSoup(usableStockTickerURL, 'html.parser')

    stockTickerTable = stockTickerSoup.find('table')

    stockTickerRows = stockTickerTable.find_all('tr')

    SP500companies = {}

    stockBaseURL = 'https://en.wikipedia.org'

    for stockTickerRow in stockTickerRows:
        stockTickerColumns = stockTickerRow.find_all('td')
        counter = 1
        for element in stockTickerColumns:
            # Stock Ticker
            if (counter % 8) == 1:
                stockTicker = element.get_text().strip().encode('utf-8', 'ignore')
                counter = counter + 1
            # Corresponding link to wiki page
            elif (counter % 8 == 2):
                SP500companies[stockTicker] = element.find('a', {'href': True}).get('href')
                counter = counter + 1

    return SP500companies

The cell below uses the wikipedia package to load the summary paragraph of the Wikipedia article of each company.

In [ ]:
import codecs
import wikipedia as wk
import sys
import json

SP500dict = getSP500Dictionary()
err = []
data = []
comp_name = []
for k, v in SP500dict.iteritems():
    # k: ticker, v: URL path of the company's Wikipedia article
    v_str = str(v)
    pageId = v_str.split('/')[-1]
    pageId = pageId.replace('_',' ')
    try:
        data.append(wk.summary(pageId).encode('utf-8'))
        comp_name.append(pageId.encode('utf-8'))
    except:
        err.append((k,v))
# Dump the data into json file for later use
with open('data.json', 'w') as outfile:
    json.dump((data, comp_name), outfile)
In [1]:
import json

with open('data.json') as json_data:
    data_ = json.load(json_data)
data = data_[0]
comp_name = data_[1]
# print 2 companies
print data[10]
print '-----'
print data[11]
BorgWarner Inc. is an American worldwide automotive industry components and parts supplier. It is primarily known for its powertrain products, which include manual and automatic transmissions and transmission components, such as electro-hydraulic control components, transmission control units, friction materials, and one-way clutches, turbochargers, engine valve timing system components, along with four-wheel drive system components.
The company has 60 manufacturing facilities across 18 countries, including the U.S., Canada, Europe, and Asia. It provides drivetrain components to all three U.S. automakers, as well as a variety of European and Asian original equipment manufacturer (OEM) customers. BorgWarner has diversified into several automotive-related markets (1999), including ignition interlock technology (ACS Corporation est.1976) for preventing impaired operation of vehicles.
Historically, BorgWarner was also known for its ownership of the Norge appliance company (washers and dryers).
-----
United Continental Holdings, Inc. (formerly UAL Corporation) is a publicly traded airline holding company headquartered in the Willis Tower in Chicago. UCH owns and operates United Airlines, Inc. The company is the successor of UAL Corporation, which agreed to change its name to United Continental Holdings in May 2010, when a merger agreement was reached between United and Continental Airlines. Its stock trades under the UAL symbol. To effect the merger, Continental shareholders received 1.05 shares of UAL stock for each Continental share, effectively meaning Continental was acquired by UAL Corporation; at the time of closing, it was estimated that United shareholders owned 55% of the merged entity and Continental shareholders owned 45%. The company or its subsidiary airlines also have several other subsidiaries. Once completely combined, United became the world's largest airline, as measured by revenue passenger miles. United is a founding member of the Star Alliance.
UCH has major operations at Chicago–O'Hare, Denver, Guam, Houston–Intercontinental, Los Angeles, Newark (New Jersey), San Francisco, Tokyo–Narita and Washington–Dulles. UCH's United Air Lines, Inc. controls several key air rights, including being one of only two American carriers authorized to serve Asia from Tokyo-Narita (the other being Delta Air Lines). Additionally, UCH's United is the largest U.S. carrier to the People’s Republic of China and maintains a large operation throughout Asia.
UCH uses Continental's operating certificate and United's repair station certificate, having been approved by the FAA on November 30, 2011.

Preprocessing and Feature Representation

Vectorize the documents into a matrix of word occurrences. While counting, filter out stopwords.

In [2]:
# Import the method
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the vectorizer with the option of stopword, which will eliminate 
# common words like 'the', 'a', etc.
count_vect = CountVectorizer(stop_words='english')
# fit_transform method applies the vectorizer on the data set
X_train_counts = count_vect.fit_transform(data)
# The resulting matrix is 497 by 7901. Each row is a document (a Wikipedia article);
# each column is the number of occurrences of one word.
print X_train_counts.shape
(497, 7901)

$tf(t,d)$ is the frequency that term $t$ appears in document $d$.

$df(d,t)$ is the number of documents that contain term $t$.

$idf(t)=\log \frac{1+n_d}{1+df(d,t)} + 1$,

  • $n_d$ is number of documents

$tfidf(t,d)=tf(t,d)\times idf(t)$

In the sklearn implementation, the final tf-idf vector is normalized by the L2 norm.

Tf-idf gives a nice numerical representation of each document. From this representation, we can apply numerical analysis techniques to the data.
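
As a sanity check on the formulas above, the sketch below recomputes $idf(t)$ by hand on a toy corpus. The three sentences are made up; sklearn's defaults (smoothed idf and L2 normalization) match the definitions given above.

In [ ]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (made-up sentences)
docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'cats and dogs']

counts = CountVectorizer().fit_transform(docs)   # tf(t, d) as raw counts
tft = TfidfTransformer()
tfidf = tft.fit_transform(counts)

# Recompute idf(t) = log((1 + n_d) / (1 + df(d, t))) + 1 by hand
n_d = counts.shape[0]
df = np.asarray((counts > 0).sum(axis=0)).ravel()
idf_manual = np.log((1.0 + n_d) / (1.0 + df)) + 1.0
print(np.allclose(idf_manual, tft.idf_))         # matches sklearn's idf_

# Each resulting tf-idf row is normalized to unit L2 norm
print(np.linalg.norm(tfidf[0].toarray()))        # ~1.0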

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer()
X_train_tf = tf_transformer.fit_transform(X_train_counts)
print X_train_tf.shape
print 
(497, 7901)

Clustering

K-means clusters the dataset into $K$ groups, each represented by a centroid.

For a set of observations $(x_1, x_2, \dots, x_n)\in \mathbb{R}^d$ (in our case, $n=497$ and $d=7901$), k-means partitions these $n$ observations into $k$ groups $S=\{S_1, S_2, \dots, S_k\}$ such that:

$$\arg\min_S \sum_{i=1}^{k} \sum_{x\in S_i} \|x-\mu_i\|^2$$

Intuitively, we want to minimize the total squared distance from each point in a cluster to the center $\mu_i$ of that cluster.

We start by placing centroids in the data set (there are many schemes to initialize centroids, but we go with random placement). Then, for each data point, we determine which group it belongs to by finding the nearest centroid in Euclidean distance.

Next, we iteratively update the centers to minimize the sum of distances of all points in each group to the group center. At each iteration, the new centroid is the arithmetic mean of all points in that cluster.
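
These two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be written in a few lines of numpy. Below is a bare-bones sketch on made-up 2-D data, with random initialization, a fixed iteration count, and no convergence check; the sklearn KMeans used next also does smarter k-means++ initialization.

In [ ]:
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    rng = np.random.RandomState(seed)
    # Initialize centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Made-up data: two well-separated 2-D blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)   # should land near (0, 0) and (5, 5)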

In [4]:
from sklearn.cluster import KMeans
# n_clusters is the number of clusters. This choice strongly affects the result, so play around with it.
classifier = KMeans(n_clusters = 90, n_jobs=-1)
classifier.fit(X_train_tf)
Out[4]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=90, n_init=10,
    n_jobs=-1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
In [5]:
print (classifier.labels_)
[57 14 35 15  7 86 31 56  4 43  9 42 28 87 11 57  0 76 20 35 27 73 43 29 44
 76 50 37 88 65 37 21 12 33 11 35  7 60 23 35 44 12 46 34  6 16 71 21  4 81
 88 58 27 88 18 66 63 55  3 15 16 14 15  6 14 80 11 57 64 66 23 58  6 14 88
 85  2 74 36 48 19 18 57 31 34 69 45 16 35 11 53 70 37 48 50 37 83 23 59 80
 34 72 15 52 42 29 33 19 55 71 86 20 20 23 81 37 81 16 16 60 78 45 17 52 74
 52 10 15  3 16 81 15  2 50 78  2 23  3  7 19  6 16  5 65 86 48 52 85 30 16
  0 78 31 54 53 29 60  2 48 29  0 23 24 23 29 48 14 14 42 71 85 75 16 41 82
 13 32 11 24 29 45 52 39 23  2 27 19 12 72  9  5 56 52 44 18 43 16 42 14 19
 53 55 71  7 42 68 32 45  4 36 56 25 11 19  4 71 56  5 71 13 11 52 16 36  0
 79 54 37 48 28 30 23 27 88 87  5  2 22 34 37  9 15 85 16 12 83 72 36 89 81
 29 16 15  5 10  6 35 49 16 46  1  3 11 83 12  4 20 87 25  8 11 11 52 31  7
  2 84  9 37 23 23  2 16 11 14  9 56 65 13 18 59 60 62 89 44 54 11 60  4  4
 15 34 20 15 27 19 23 37 18  6 14 21 26 63 69 64 62 16 38 21 39 18 20 39 76
  8 40 34 13 54 40 54 67 54  2 26 80 14 16 32 19 50 61 15 85 22 40 42 52  2
 52 69  7 45 41 56 37 29 48 11 82 76 33  2 52 19 11 81 77 38 52 58 35  1 25
 43 54 15 65 67 24 73 26  4  3 29 35 24 73 55 50 51 66  3 46 77 18 32 73 29
 81  6 71 15  4 16 79 70 48 47 47 54 55 52 52 30 54  2  2 38 37  9 35 61 27
 53 21  6  4 68 30 15 81 38 38 30 59  2 27 11 21 43 44 52  2 79 28 11 53 71
 23 19 27 14 26 14 54 13 62 49 49 74  4 15 56 13 58 71 89 36 52 16  9 18 82
 53 68 42 38 69 29 14 71 11 68  1 37  6 88  6 15 30 81 41  0 30 30]
In [6]:
import numpy as np
print [comp_name[x] for x in np.where(classifier.labels_==30)[0]]
print "____"
print [comp_name[x] for x in np.where(classifier.labels_==35)[0]]
[u'Prudential Financial', u'Lincoln National', u'Northern Trust Corp.', u'Affiliated Managers Group Inc', u'Ameriprise Financial', u'T. Rowe Price Group', u'Principal Financial Group', u'S%26P Global, Inc.']
____
[u'Campbell Soup', u'Mead Johnson', u'Johnson %26 Johnson', u'Franklin Resources', u'Boston Scientific', u'Arthur J. Gallagher %26 Co.', u'PPG Industries', u'Johnson Controls', u'Lilly (Eli) %26 Co.']
In [7]:
print comp_name.index('Goldman Sachs Group')
print [comp_name[x] for x in np.where(classifier.labels_==classifier.labels_[118])[0]]
118
[u'Comerica Inc.', u'BlackRock', u'Charles Schwab Corporation', u'Bank of America Corp', u'Goldman Sachs Group', u'SunTrust Banks', u'PNC Financial Services', u'Regions Financial Corp.', u'Wells Fargo', u'State Street Corp.', u'BB%26T Corporation', u'Huntington Bancshares', u'Intuit Inc.', u'JPMorgan Chase %26 Co.', u'Fifth Third Bancorp', u'Citigroup Inc.', u'Capital One Financial', u'Navient', u'U.S. Bancorp']