Data Acquisition 11 | A Review of the Main Topics

Series: Data Acquisition

1. Basic Computer Network Concepts

2. HTTP Methods and Responses Cheatsheet by John Deringer

3. Common Ports Cheatsheet by Jeremy Stretch

4. Command Lines of Computer Network

# check DNS resolution
$ nslookup www.dgate.org
# get the laptop's host name
$ hostname
# connect via telnet and then send a request (finish it with a blank line)
$ telnet www.dgate.org 80
GET / HTTP/1.1
Host: www.dgate.org

# listen on port 11223
$ nc -l 11223
# get and save a webpage
$ curl https://www.cnn.com > cnn.html
$ wget -O cnn.html https://www.cnn.com   # note that wget does not support SOCKS5 proxies
# recursively wget a server
$ wget -r -np -R "index.html*" {{URL}}
# find the IP address assigned by NAT (internal IP, macOS)
$ ipconfig getifaddr en0
# find the external IP address
$ curl ifconfig.me   # does not work in China
$ curl mip.chinaz.com | grep IP地址:   # works in China
# SSH connection (especially for AWS)
$ ssh -i foo.pem ubuntu@xyz.com
# SSH copy (especially for AWS)
$ scp /path/to/file username@a:/path/to/destination
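
The telnet session above sends a plain-text HTTP request by hand. As a rough sketch, the same bytes can be built and sent from Python with the standard `socket` module. The helper names `build_get_request` and `fetch` are made up for this illustration, and the `Connection: close` header is an addition (not part of the telnet example) so that the server closes the connection once the response is complete.

```python
import socket


def build_get_request(host, path="/"):
    """Build the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # assumption: ask the server to close when done
        "",                   # blank line that terminates the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, port=80, path="/"):
    """Send the request over TCP and return the raw response (needs network)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


request = build_get_request("www.dgate.org")
print(request.decode("ascii"))
```

Calling `fetch("www.dgate.org")` would perform the actual network round trip; it is left out here so the sketch runs offline.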

5. Amazon Web Service

6. Get API results

First, let's put our API keys in a Python script named apikeys.py,

# youtube
youtube_key = ''              # paste your own key between the quotes
# twitter
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
# Quandl
quandl_key = ''

I put this file in the parent directory, so the parent directory has to be added to the path before the script can be imported,

import os, sys
lib_path = os.path.abspath(os.path.join('..'))
sys.path.append(lib_path)
import apikeys
import requests
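
Note that `os.path.join('..')` is resolved against the current working directory, so the import above only works when the notebook or script is started from the right folder. A small sketch of the same idea with an explicit directory follows. The helper `import_from_dir`, the module name `demo_keys`, and the key value are all hypothetical, and a temporary directory stands in for the parent directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(module_name, directory):
    """Append `directory` to sys.path (only once) and import the module."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.append(directory)
    importlib.invalidate_caches()  # make sure the new file is noticed
    return importlib.import_module(module_name)


# Demo: a throwaway directory plays the role of the parent directory.
with tempfile.TemporaryDirectory() as parent:
    key_file = Path(parent) / "demo_keys.py"
    key_file.write_text('quandl_key = "not-a-real-key"\n')
    demo_keys = import_from_dir("demo_keys", parent)
    loaded_key = demo_keys.quandl_key

print(loaded_key)
```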

We are going to use the APIs from Twitter, YouTube, and Quandl. Before we start, we have to install some packages,

! pip install bs4==0.0.1
! pip install --ignore-installed --upgrade google-api-python-client
! pip install tweepy==3.9.0

  • Build YouTube API query

from googleapiclient.discovery import build

DEVELOPER_KEY = apikeys.youtube_key   # key stored in apikeys.py
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION, developerKey=DEVELOPER_KEY)

Return a search result,

QUERY = 'cats'              # what we want to search (replace)
search_response = youtube.search().list(
    q=QUERY,                # search terms
    part="id,snippet",      # what we want back
    maxResults=20,          # how many results we want back
    type="video"            # only tell me about videos
).execute()

Return the comments on a video,

videoid = 'gU_gYzwTbYQ'         # the video we want to see (replace)
comments = youtube.commentThreads().list(
    part="snippet",             # comments are in the "snippet"
    videoId=videoid,            # the video we want to see
    textFormat="plainText",     # only return plain text
    maxResults=5                # how many results we want back
).execute()
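
Both calls return nested dictionaries, so the useful fields still have to be dug out. The sketch below shows one way to do that, assuming the usual YouTube Data API v3 layout (`items[].id.videoId`, `items[].snippet.title`, and `items[].snippet.topLevelComment.snippet.textDisplay`). The helper names and the two sample dictionaries are made up for illustration and are not real API output.

```python
def video_titles(search_response):
    """Return (videoId, title) pairs from a search().list() response."""
    return [
        (item["id"]["videoId"], item["snippet"]["title"])
        for item in search_response.get("items", [])
        if item["id"].get("kind") == "youtube#video"
    ]


def comment_texts(comments_response):
    """Return the top-level comment texts from a commentThreads().list() response."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in comments_response.get("items", [])
    ]


# Hand-written stand-ins shaped like the API responses (not real data).
sample_search = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "abc123"},
         "snippet": {"title": "A cat video"}},
        {"id": {"kind": "youtube#video", "videoId": "def456"},
         "snippet": {"title": "Another cat video"}},
    ]
}
sample_comments = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "Nice!"}}}},
    ]
}

titles = video_titles(sample_search)
texts = comment_texts(sample_comments)
print(titles)
print(texts)
```

In practice, `search_response` and `comments` from the calls above would be passed in place of the sample dictionaries.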

  • Build Twitter API query

import tweepy

auth = tweepy.OAuthHandler(apikeys.consumer_key, apikeys.consumer_secret)
auth.set_access_token(apikeys.access_token, apikeys.access_token_secret)
api = tweepy.API(auth)

Return the home timeline of the current account,

api.home_timeline()

Return user info,

user_name = 'JoeBiden'         # username we want to get
api.get_user(user_name)

Return the user's timeline,

user_name = 'JoeBiden'         # username we want to get
api.user_timeline(user_name)

Return tweets spanning more than one page (e.g. 100 records),

user_name = 'JoeBiden'         # username we want to get
n = 100                        # number of records we want to get
# .items(n) returns an iterator; wrap it in list() to fetch the tweets
tweepy.Cursor(api.user_timeline, id=user_name).items(n)
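
To make the idea of paging concrete, here is a minimal sketch of cursor-style iteration: keep requesting the next page until one comes back empty, and let the caller cap the number of items. This is only an illustration of the pattern, not tweepy's actual implementation. `iter_items` and `fake_fetch_page` are hypothetical names, and the fake fetcher returns integers instead of tweets.

```python
from itertools import islice


def iter_items(fetch_page, first_page=1):
    """Yield items page by page until a page comes back empty."""
    page = first_page
    while True:
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch
        page += 1


def fake_fetch_page(page, page_size=20, total=45):
    """Stand-in for an API call: returns integers instead of tweets."""
    start = (page - 1) * page_size
    return list(range(start, min(start + page_size, total)))


# Cap the stream at 30 items, similar in spirit to .items(n).
first_30 = list(islice(iter_items(fake_fetch_page), 30))
# Without a cap, iteration stops at the first empty page.
everything = list(iter_items(fake_fetch_page))
print(len(first_30), len(everything))
```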
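
  • Build Quandl API query and grab the CSV data

ticker = 'AAPL'                # stock we want to search
APIKEY = apikeys.quandl_key    # our API key
# HistoryURL has to be defined beforehand: a URL template with two %s
# placeholders, filled with the ticker and the API key in that order
url = HistoryURL % (ticker, APIKEY)
r = requests.get(url)
csvdata = r.text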

Save the data to a CSV file,

with open(f'{ticker}.csv', 'w') as f:
    f.write(csvdata)

To work with this CSV file from the command line, we can use the csvcut and csvgrep commands from csvkit, which has to be installed first,

! pip install csvkit

For example, to select column 12 for the first five rows (plus the CSV header),

! csvcut -c 12 AAPL.csv | head -6

To select column 12 for the first five rows (without the CSV header),

! csvcut -c 12 AAPL.csv | tail -n +2 | head -5

To select columns 12 and 9 for the first five lines of output (header included),

! csvcut -c 12,9 AAPL.csv | head -5

To select columns 11 through 13 for the first five lines of output (header included),

! csvcut -c 11-13 AAPL.csv | head -5
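
When csvkit is not available, a similar column selection can be sketched with Python's built-in `csv` module. The helper `cut_columns` is a made-up name, the column numbers are 1-based to mirror `csvcut -c`, and the three-column sample text is invented for the demo (it is not the real AAPL data).

```python
import csv
import io


def cut_columns(csv_text, columns, max_rows=None):
    """Select 1-based column numbers, in the given order, from CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = []
    for i, row in enumerate(reader):
        if max_rows is not None and i >= max_rows:
            break  # same role as piping through `head`
        rows.append([row[c - 1] for c in columns])
    return rows


# Invented sample data: a header line followed by two rows.
sample = "Date,Open,Close\n2020-01-02,1.0,2.0\n2020-01-03,3.0,4.0\n"

selected = cut_columns(sample, [3, 1], max_rows=2)
print(selected)
```

With the real file, `csvdata` from the request above could be passed as `csv_text`, for example `cut_columns(csvdata, [12, 9], max_rows=5)`.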

7. Flask

  • Import packages

from flask import Flask, render_template, redirect, url_for

  • Build an app

app = Flask(__name__)

  • A common root page

@app.route("/")
def root():
    return redirect(url_for('index'))

  • A common index page

@app.route("/index")
def index():
    return render_template('index.html')

Note that all the Jinja templates should be put in the following folder in the current directory,

templates/

  • Launch the instance

app.run('0.0.0.0')   # listen on all network interfaces

8. Socket Communication

  • Starter code for creating a server on port 8000.

import socket
import netifaces as ni

# Create a server socket
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ip = ni.ifaddresses('en0')[ni.AF_INET][0]['addr'] # might be en1
# (on linux it might be `eth0` not `en0`)
serversocket.bind((ip, 8000)) # wait at port 8000
# Start listening for connections from client
serversocket.listen(5) # 5 is number of clients that can queue up before failure
# Wait for connection
(clientsocket, address) = serversocket.accept()
clientsocket.send(f"Successfully connected to {ip}\n".encode())
# Write something here
# Close the socket
clientsocket.close()
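
To try the server without a second machine (or netifaces), the sketch below runs a server and a client in one process over the loopback address. It is an illustration under a few assumptions: it binds to `127.0.0.1` on an OS-assigned port rather than the interface address and port 8000, runs the server in a thread, and echoes one message back. The function names `serve_once` and `run_demo` are made up.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one client, greet it, echo one message back, then close."""
    client_sock, _address = server_sock.accept()
    with client_sock:
        client_sock.sendall(b"Successfully connected\n")
        data = client_sock.recv(1024)
        client_sock.sendall(b"echo: " + data)


def run_demo():
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server_sock.listen(5)
    port = server_sock.getsockname()[1]

    worker = threading.Thread(target=serve_once, args=(server_sock,))
    worker.start()

    # The client side: connect, read the greeting, send a line, read the echo.
    with socket.create_connection(("127.0.0.1", port)) as client:
        with client.makefile("rb") as stream:
            greeting = stream.readline()
            client.sendall(b"hello\n")
            reply = stream.readline()

    worker.join()
    server_sock.close()
    return greeting.decode(), reply.decode()


greeting, reply = run_demo()
print(greeting, reply, sep="")
```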

See an example of silly chatter from,

9. Selenium

Launch a Selenium Chrome driver,

from selenium import webdriver
driver = webdriver.Chrome('/usr/local/bin/chromedriver')

Open a webpage (i.e. google.com),

driver.get('http://www.google.com')

Get an element (‘q’ stands for the name, id, or class we are looking for),

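# note: these find_element_by_* helpers are the Selenium 3 API; Selenium 4.3+
# removed them in favor of driver.find_element(By.NAME, 'q') and friends
driver.find_element_by_name('q')           # via name
driver.find_element_by_id('q')             # via id
driver.find_element_by_class_name('q')     # via class
# via css format: i.e. <input type="password" />
driver.find_element_by_css_selector("input[type='password']")
# via xpath: i.e. <input name="username" type="text" />
driver.find_elements_by_xpath("//input[@name='username']")   # returns a list of all matches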

Write in a text box and submit,

query = 'something'                      # things we want to post
box = driver.find_element_by_name('q')   # the text box, located as above
box.send_keys(query)
box.submit()

Click a button,

btn.click()                   # btn is an element located with one of the calls above

Scrolling down the page,

driver.execute_script("window.scrollTo(0, 10000);")

Quit the driver,

driver.quit()