Data Acquisition 11 | A Review of the Main Topics

1. Basic Computer Network Concepts

2. HTTP Methods and Responses Cheatsheet by John Deringer

3. Common Ports Cheatsheet by Jeremy Stretch

4. Command Lines of Computer Network
// check the DNS resolution
$ nslookup www.dgate.org
// get the laptop's host name
$ hostname
// connect via telnet and then send a request
$ telnet www.dgate.org 80
GET / HTTP/1.1
Host: www.dgate.org
// listen on port 11223
$ nc -l 11223
// get and save a webpage
$ curl https://www.cnn.com > cnn.html
$ wget -O cnn.html https://www.cnn.com // note that wget does not support SOCKS5 proxies
// recursively wget a server
$ wget -r -np -R "index.html*" {{URL}}
// find IP address assigned by NAT (internal IP)
$ ipconfig getifaddr en0
// find external IP address
$ curl ifconfig.me // does not work in China
$ curl mip.chinaz.com | grep IP地址: // works in China
// SSH connection (especially for AWS)
$ ssh -i foo.pem ubuntu@xyz.com
// SSH copy (especially for AWS)
$ scp /path/to/file username@a:/path/to/destination
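The telnet exchange above can also be reproduced from Python. What follows is a minimal sketch rather than part of the original notes: `build_get_request` and `fetch` are hypothetical helper names, and the extra `Connection: close` header is an assumption added so that the server ends the reply instead of keeping the connection open.

```python
import socket

def build_get_request(host, path="/"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # assumption: ask the server to close after replying
        "",                   # the blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch(host, port=80, path="/"):
    """Send the request over a TCP socket and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Show the bytes that telnet would send for the example host.
print(build_get_request("www.dgate.org").decode("ascii"))
```

Calling `fetch("www.dgate.org")` needs network access and returns the status line, headers, and body as one bytes object.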
5. Amazon Web Service

6. Get API results
First, let's put our API keys in a Python script (imported later as apikeys),
# YouTube
youtube_key =
# Twitter
consumer_key =
consumer_secret =
access_token =
access_token_secret =
# Quandl
quandl_key =
I put this file in the parent directory, so I have to add the parent directory to the path in order to import it,
import os, sys
lib_path = os.path.abspath(os.path.join('..'))
sys.path.append(lib_path)
import apikeys
import requests
We are going to use the APIs from Twitter, YouTube, and Quandl. Before we start, we have to install some packages,
! pip install bs4==0.0.1
! pip install --ignore-installed --upgrade google-api-python-client
! pip install tweepy==3.9.0
- Build YouTube API query
from googleapiclient.discovery import build
DEVELOPER_KEY = apikeys.youtube_key # key stored in the apikeys script
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION, developerKey=DEVELOPER_KEY)
Return a search result,
QUERY = 'cats' # what we want to search (replace)
search_response = youtube.search().list(
    q=QUERY, # search terms
    part="id,snippet", # what we want back
    maxResults=20, # how many results we want back
    type="video" # only tell me about videos
).execute()
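The call returns a dictionary parsed from JSON. As a rough sketch of how to read it, the snippet below pulls the video ID and title out of each entry in `items`. The helper name `summarize_search` and the sample response are made up for illustration; only the field layout (`items`, `id.videoId`, `snippet.title`) follows the YouTube Data API v3 search response.

```python
def summarize_search(search_response):
    """Return (video_id, title) pairs from a search().list() response."""
    results = []
    for item in search_response.get("items", []):
        video_id = item["id"]["videoId"]  # present because we asked for type="video"
        title = item["snippet"]["title"]
        results.append((video_id, title))
    return results

# Hand-written stand-in for a real response (not actual API output).
sample_response = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "abc123"},
         "snippet": {"title": "Cats being cats"}},
        {"id": {"kind": "youtube#video", "videoId": "def456"},
         "snippet": {"title": "More cats"}},
    ]
}

for video_id, title in summarize_search(sample_response):
    print(video_id, title)
```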
Return the comments on a video,
videoid = 'gU_gYzwTbYQ' # the video we want to see (replace)
comments = youtube.commentThreads().list(
    part="snippet", # comments are in the "snippet"
    videoId=videoid, # the video we want to see
    textFormat="plainText", # only return plain text
    maxResults=5 # how many results we want back
).execute()
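Each comment thread nests the actual text a few levels deep. A small sketch of digging it out, assuming the usual `commentThreads` layout where the top-level comment sits under `snippet.topLevelComment.snippet`. The helper `comment_texts` and the sample data are hypothetical.

```python
def comment_texts(comments):
    """Return (author, text) pairs from a commentThreads().list() response."""
    pairs = []
    for thread in comments.get("items", []):
        top = thread["snippet"]["topLevelComment"]["snippet"]
        pairs.append((top["authorDisplayName"], top["textDisplay"]))
    return pairs

# Hand-written stand-in for a real response (not actual API output).
sample_comments = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "authorDisplayName": "viewer one",
            "textDisplay": "Great video",
        }}}},
        {"snippet": {"topLevelComment": {"snippet": {
            "authorDisplayName": "viewer two",
            "textDisplay": "Thanks for sharing",
        }}}},
    ]
}

for author, text in comment_texts(sample_comments):
    print(f"{author}: {text}")
```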
- Build Twitter API query
import tweepy
auth = tweepy.OAuthHandler(apikeys.consumer_key, apikeys.consumer_secret)
auth.set_access_token(apikeys.access_token, apikeys.access_token_secret)
api = tweepy.API(auth)
Return home timeline of our current account,
api.home_timeline()
Return user info,
user_name = 'JoeBiden' # username we want to get
api.get_user(user_name)
Return user timeline,
user_name = 'JoeBiden' # username we want to get
api.user_timeline(user_name)
Return tweets spanning more than one page (e.g. 100 records),
user_name = 'JoeBiden' # username we want to get
n = 100 # number of records we want to get
tweepy.Cursor(api.user_timeline, id=user_name).items(n)
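`tweepy.Cursor(...).items(n)` gives back an iterator rather than a list, so the tweets still have to be looped over. Below is a small sketch of doing so, assuming each status exposes `.created_at` and `.text` as tweepy 3.x status objects do. The helper name `tweets_to_rows` is hypothetical, and the stand-in objects replace a live API call.

```python
from datetime import datetime
from types import SimpleNamespace

def tweets_to_rows(statuses):
    """Turn an iterable of tweet objects into (ISO timestamp, text) rows."""
    return [(s.created_at.isoformat(), s.text) for s in statuses]

# Stand-ins for tweepy status objects (made-up data, no API call).
fake_statuses = iter([
    SimpleNamespace(created_at=datetime(2020, 11, 3, 12, 0, 0), text="first tweet"),
    SimpleNamespace(created_at=datetime(2020, 11, 4, 9, 30, 0), text="second tweet"),
])

rows = tweets_to_rows(fake_statuses)
for created, text in rows:
    print(created, text)
```

With real credentials, the same helper would be called as `tweets_to_rows(tweepy.Cursor(api.user_timeline, id=user_name).items(n))`.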
- Build Quandl API query and grab the CSV data
ticker = 'AAPL' # stock we want to search
APIKEY = apikeys.quandl_key # our API key
# HistoryURL must be a URL template with two %s slots (ticker, API key)
url = HistoryURL % (ticker, APIKEY)
r = requests.get(url)
csvdata = r.text
Save this data to a CSV file,
with open(f'{ticker}.csv', 'w') as f:
    f.write(csvdata)
To work with this CSV file from the command line, we can use the csvcut and csvgrep commands. Both come from the csvkit package, which we have to install first,
! pip install csvkit
For example, if we want to select column 12 for the first five data rows (plus the CSV header),
! csvcut -c 12 AAPL.csv | head -6
If we want to select column 12 for the first five data rows (without the CSV header),
! csvcut -c 12 AAPL.csv | tail -n +2 | head -5
If we want to select columns 12 and 9 for the first five lines,
! csvcut -c 12,9 AAPL.csv | head -5
If we want to select columns 11–13 for the first five lines,
! csvcut -c 11-13 AAPL.csv | head -5
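The same kind of column selection can be done in Python with the standard `csv` module. This is only a sketch that roughly mirrors `csvcut -c ... | head`: the helper name `cut_columns` is hypothetical, the column numbers are 1-based to match the commands above, and the sample data is invented rather than real Quandl output.

```python
import csv
import io

def cut_columns(csv_text, columns, n_rows=None):
    """Keep the given 1-based columns, in the given order, for the first n_rows lines."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = [[row[c - 1] for c in columns] for row in reader]
    return rows if n_rows is None else rows[:n_rows]

# Made-up sample data, not real Quandl output.
sample_csv = (
    "date,open,close\n"
    "2020-01-02,10,11\n"
    "2020-01-03,11,12\n"
    "2020-01-06,12,13\n"
)

# Header plus two data rows, with columns 3 and 1 (in that order).
print(cut_columns(sample_csv, [3, 1], n_rows=3))
```

Slicing off the first row of the result, as in `cut_columns(sample_csv, [2])[1:]`, plays the role of `tail -n +2`.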
7. Flask
- Import packages
from flask import Flask, render_template, redirect, url_for
- Build an app
app = Flask(__name__)
- A common root page
@app.route("/")
def root():
    return redirect(url_for('index'))
- A common index page
@app.route("/index")
def index():
    return render_template('index.html')
Note that all Jinja templates should be placed in the following folder under the current directory,
templates/
- Launch the instance
app.run('0.0.0.0')
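Put together, the pieces above form a complete app. The sketch below is one way to check the routes without starting a server, using Flask's built-in test client. As an assumption made to keep it self-contained, it renders an inline template with `render_template_string` in place of `templates/index.html`.

```python
from flask import Flask, redirect, render_template_string, url_for

app = Flask(__name__)

@app.route("/")
def root():
    # Send visitors of "/" to the index page.
    return redirect(url_for("index"))

@app.route("/index")
def index():
    # Assumption: an inline template stands in for templates/index.html.
    return render_template_string("<h1>{{ title }}</h1>", title="Index")

# Exercise both routes in-process; no port is opened.
client = app.test_client()
root_response = client.get("/")
index_response = client.get("/index")

print(root_response.status_code, root_response.headers["Location"])
print(index_response.get_data(as_text=True))
```

The root route answers with a redirect whose `Location` header points at `/index`, which is what a browser would then follow.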
8. Socket Communication
- Starter code for creating a server on port 8000.
import socket
import netifaces as ni
# Create a server socket
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ip = ni.ifaddresses('en0')[ni.AF_INET][0]['addr'] # might be en1
# (on linux it might be `eth0` not `en0`)
serversocket.bind((ip, 8000)) # wait at port 8000
# Start listening for connections from client
serversocket.listen(5) # 5 is number of clients that can queue up before failure
# Wait for connection
(clientsocket, address) = serversocket.accept()
clientsocket.send(f"Successfully connected to {ip}\n".encode())
# Write something here
# Close the socket
clientsocket.close()
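The starter code only covers the server side. Here is a hedged sketch of the matching client, run against a throwaway server on the loopback address so that it needs neither `netifaces` nor a second machine. The helper names `serve_once` and `read_greeting`, the shortened greeting text, and the use of port 0 (which lets the OS pick a free port) are choices made for this example, not part of the original starter code.

```python
import socket
import threading

def serve_once(serversocket):
    """Accept one client, send a greeting, and close the connection."""
    clientsocket, address = serversocket.accept()
    with clientsocket:
        clientsocket.sendall(b"Successfully connected\n")

def read_greeting(host, port):
    """Connect to host:port and return the first line the server sends."""
    with socket.create_connection((host, port), timeout=5) as s:
        with s.makefile("r") as f:
            return f.readline()

# Loopback demo: port 0 asks the OS for any free port.
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(("127.0.0.1", 0))
serversocket.listen(5)
port = serversocket.getsockname()[1]

# Run the server in a background thread so the client can connect to it.
worker = threading.Thread(target=serve_once, args=(serversocket,))
worker.start()
greeting = read_greeting("127.0.0.1", port)
worker.join()
serversocket.close()

print(greeting, end="")
```

To talk to the server from the starter code instead, `read_greeting(ip, 8000)` would be called from another terminal or machine while the server is waiting in `accept()`.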
See an example of silly chatter from,
9. Selenium
Launch a Selenium Chrome driver,
from selenium import webdriver
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
Open a webpage (e.g. google.com),
driver.get('http://www.google.com')
Get an element (here ‘q’ is a placeholder for the name, id, or class to look up),
driver.find_element_by_name('q') # via name
driver.find_element_by_id('q') # via id
driver.find_element_by_class_name('q') # via class
# via css format: i.e. <input type="password" />
driver.find_element_by_css_selector("input[type='password']")
# via xpath: i.e. <input name="username" type="text" /> (returns a list of matches)
driver.find_elements_by_xpath("//input[@name='username']")
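To see what the XPath expression selects without launching a browser, the standard library's `xml.etree.ElementTree` can evaluate a limited subset of XPath. This is only an illustration under that assumption: it is not Selenium, the markup is invented, and ElementTree wants `.//` where Selenium accepts `//`.

```python
import xml.etree.ElementTree as ET

# Made-up markup containing the two example inputs from above.
form = ET.fromstring(
    '<form>'
    '<input name="username" type="text" />'
    '<input type="password" />'
    '</form>'
)

# ElementTree's XPath subset needs a leading "." for a search below the root.
by_name = form.findall(".//input[@name='username']")
by_type = form.findall(".//input[@type='password']")

print(len(by_name), by_name[0].get("type"))
print(len(by_type), by_type[0].get("name"))
```

The attribute predicate `[@name='username']` matches only the first input, while `[@type='password']` matches only the second, which has no `name` attribute at all.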
Write in a text box and submit,
query = 'something' # things we want to post
box = driver.find_element_by_name('q') # the text box (Google's search box is named 'q')
box.send_keys(query)
box.submit()
Click a button,
# btn is a button element found with one of the methods above
btn.click()
Scrolling down the page,
driver.execute_script("window.scrollTo(0, 10000);")
Quit the driver,
driver.quit()