Data Acquisition 5 | A Review on the Previous Topics

Series: Data Acquisition

1. Data Encoding

(1) What is the output of the following code?

The output is,

True

(2) What is the output of the following code?

The output is,

49

Explain:

A Python string is a full object, not just a buffer of characters large enough to hold the text, so it carries a fixed overhead. On macOS with Python 3.8, an empty string object takes 49 bytes.

(3) What is the output of the following code?

The output is,

33

Explain:

Similarly, an empty bytes object takes 33 bytes.
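
The fixed overhead can be checked directly with sys.getsizeof. This is a minimal sketch: the absolute byte counts are CPython implementation details and may differ between versions and platforms, so only the growth per character is relied on here.

```python
from sys import getsizeof

# Fixed overhead of an empty str and an empty bytes object.
# The absolute numbers are CPython implementation details
# (49 and 33 are the values quoted above for macOS / Python 3.8).
empty_str_size = getsizeof('')
empty_bytes_size = getsizeof(b'')
print('empty str   ->', empty_str_size)
print('empty bytes ->', empty_bytes_size)

# Each extra ASCII character or byte adds exactly one byte.
str_growth = getsizeof('a') - empty_str_size
bytes_growth = getsizeof(b'a') - empty_bytes_size
print('growth per ASCII char ->', str_growth)
print('growth per byte       ->', bytes_growth)
```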

(4) What is the output of the following code?

The output is,

No Error

(5) What is the output of the following code?

The output is,

A

(6) What is the output of the following code?

The output is,

Error

Explain:

We cannot encode non-ASCII Unicode characters to ASCII, so the encoding fails with an error.
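
A minimal sketch of this behavior, using sample strings chosen for illustration (try_ascii is a hypothetical helper name, not part of the question):

```python
def try_ascii(text):
    """Return the ASCII bytes of text, or an error message if encoding fails."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        # Raised when a character falls outside the 0-127 ASCII range.
        return f'Error: {exc.reason}'

ascii_ok = try_ascii('A')      # plain ASCII, encodes fine
ascii_fail = try_ascii('✎')    # non-ASCII, cannot be encoded
print(ascii_ok)
print(ascii_fail)
```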

(7) What is the output of the following code?

The output is,

57

Explain:

In this case, a = 49 because decoding an empty bytes object yields an empty string (33 + 16). Then b = 8, because the string already contains two 32-bit Unicode characters, and all other characters in it are then stored as 32-bit (4 bytes) as well. So b is 2 times 4, which equals 8. Adding a and b gives 49 + 8 = 57.

This happens because a character can be 16-bit Unicode (2 bytes, e.g. ✎) or 32-bit Unicode (4 bytes, e.g. the emoji). When a string contains a 32-bit Unicode character, all of its characters take 4 bytes each. Keep in mind the following relationships,

from sys import getsizeof

# Sizes quoted for macOS / Python 3.8; other versions may differ.
print(getsizeof(''))                                          # 49
print(getsizeof(''.encode('utf-8')))                          # 33
print('empty byte type ->', getsizeof(b''))                   # 33
print('empty utf-8 type ->', getsizeof(b''.decode('utf-8')))  # 49, 16 bytes added for the decode
print('a ->', getsizeof('a'))          # 49 + 1, ASCII takes 1 byte each
print('ab ->', getsizeof('ab'))        # 49 + 1 * 2
print('✎ ->', getsizeof('✎'))          # 76
print('✎a ->', getsizeof('✎a'))        # 76 + 2
print('✎✏ ->', getsizeof('✎✏'))        # 76 + 2, 2 bytes per 2-byte Unicode character
print('🥳 ->', getsizeof('🥳'))          # 80
print('🥳✎ ->', getsizeof('🥳✎'))        # 80 + 4, 4 bytes per 4-byte Unicode character
print('🥳✎a ->', getsizeof('🥳✎a'))      # 80 + 4 * 2
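
The relationship above can also be measured instead of memorized. The sketch below assumes CPython's compact string storage; bytes_per_char is a hypothetical helper that doubles a string and divides the size increase by the number of added characters.

```python
from sys import getsizeof

def bytes_per_char(s):
    """Estimate the storage width per character by doubling the string."""
    doubled = s + s
    # The fixed header cancels out, leaving only the cost of len(s) characters.
    return (getsizeof(doubled) - getsizeof(s)) // len(s)

for sample in ['a', '✎', '✎a', '🥳', '🥳✎a']:
    print(repr(sample), '->', bytes_per_char(sample), 'byte(s) per character')
```

The widest character in the string decides the width used for every character, which is why '✎a' reports 2 and '🥳✎a' reports 4.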

(8) What is the output of the following code?

The output is,

True

(9) What is the output of the following code?

The output is,

True

2. Git and Bash

(1) What does the following code mean?

$ grep import csv2json.py

Answer:

It prints every line of csv2json.py that contains the string import. In practice, this gives us all the import statements in the file.

(2) What does the following code mean?

$ python csv2json.py my.csv 2> /dev/null

Answer:

The code above runs csv2json.py on my.csv and discards all error messages by redirecting standard error (file descriptor 2) to /dev/null.
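
The same idea can be sketched from Python with subprocess, where stderr=subprocess.DEVNULL plays the role of 2> /dev/null. The child program here is a hypothetical stand-in, since csv2json.py itself is not shown.

```python
import subprocess
import sys

# A tiny child program that writes one line to stdout and one to stderr.
# It is a stand-in for csv2json.py (an assumption for illustration).
child_code = "import sys; print('data'); print('oops', file=sys.stderr)"

# stderr=subprocess.DEVNULL discards the error stream, like `2> /dev/null`.
result = subprocess.run(
    [sys.executable, '-c', child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True,
)
print(result.stdout, end='')
```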

(3) What does the following code mean?

$ python csv2json.py my.csv 2>&1 /tmp/log

Answer:

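The intent is to record all the errors from running csv2json.py on my.csv in the log file /tmp/log. As written, though, 2>&1 only merges standard error into standard output, and /tmp/log is passed to the script as an extra argument. To write the errors to the file, use 2> /tmp/log (or > /tmp/log 2>&1 to capture both streams).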

(4) What does the following code mean?

$ cat | python json2csv.py my.json | grep tomato

Answer:

The code above uses a pipeline: json2csv.py converts my.json to CSV on standard output, and grep then keeps only the lines of that output that contain the word tomato. The leading cat has no file argument, so it only forwards standard input to the script.

(5) What does the following code mean?

$ python csv2json.py data/super.csv > /tmp/t.json

Answer: convert data/super.csv to JSON and redirect the standard output to the file /tmp/t.json.

(6) What does the following code mean?

$ tr -s '\n' ' ' < /tmp/tsla.txt | fold -s | head -10

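Answer: This feeds /tmp/tsla.txt to the translate command tr, which replaces every newline character with a space and squeezes repeated spaces into one. The result is passed to fold -s, which wraps the text into lines of at most 80 columns, breaking at spaces, and head -10 prints the first 10 of those lines.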
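
For comparison, here is a rough Python analog of the same pipeline. It is a sketch under assumptions: squeeze_and_wrap is a hypothetical helper, the sample text is made up, and textwrap.wrap only approximates fold -s (it drops the trailing spaces that fold keeps).

```python
import re
import textwrap

def squeeze_and_wrap(text, width=80, max_lines=10):
    """Rough Python analog of: tr -s '\\n' ' ' | fold -s | head -10."""
    # tr -s '\n' ' ': turn newlines into spaces and squeeze repeated spaces.
    one_line = re.sub(r'[ \n]+', ' ', text)
    # fold -s: wrap at `width` columns, breaking at spaces where possible.
    wrapped = textwrap.wrap(one_line, width=width)
    # head -10: keep only the first `max_lines` lines.
    return wrapped[:max_lines]

sample = 'Tesla reported\n\nquarterly results.\n' * 40
preview = squeeze_and_wrap(sample)
print('\n'.join(preview))
```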

(7) Do you need a github account in order to use git?

Answer: No!

(8) Is Github a homework submission system?

Answer: No! GitHub is a hosting service for Git repositories. Git itself is a version control system, and pushing to GitHub is an easy way to back up the files on our computer.

(9) Fill in the blanks.

For the wc command in the command line, fill in the following blanks with its options.

_____ : Gives the byte count
_____ : Gives the char count
_____ : Gives the line count
_____ : Gives the word count

Answer:

-c
-m
-l
-w
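
To see why the byte count and the char count can differ, the four counts can be approximated in Python. This is a sketch with a made-up sample string; the word count uses whitespace splitting, which is close to but not guaranteed identical to wc's definition.

```python
# Sample text standing in for a file's contents (chosen for illustration).
text = 'héllo wörld\nfoo\n'

byte_count = len(text.encode('utf-8'))   # like wc -c
char_count = len(text)                   # like wc -m
line_count = text.count('\n')            # like wc -l (counts newline characters)
word_count = len(text.split())           # like wc -w (whitespace-separated words)

print(byte_count, char_count, line_count, word_count)
```

The two accented letters take 2 bytes each in UTF-8, so the byte count is larger than the char count.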

(10) What does the following code mean?

$ cat my.csv | tail +2 | head -10

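Answer: print the first 10 data rows of my.csv, skipping the header line. Here tail +2 starts the output at line 2; it is the legacy spelling, and tail -n +2 is the portable form.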
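
The same selection can be sketched in Python with itertools.islice. The in-memory file below is a hypothetical stand-in for my.csv.

```python
import io
from itertools import islice

# Stand-in for my.csv: one header line followed by 15 data rows.
fake_csv = io.StringIO(
    'id,name\n' + ''.join(f'{i},item{i}\n' for i in range(1, 16))
)

# islice(lines, 1, 11) skips the first line (the header)
# and yields the next 10 lines.
first_rows = [line.rstrip('\n') for line in islice(fake_csv, 1, 11)]
print(first_rows)
```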

3. Data Pipeline

(1) What are some examples of files that are text-based?

Answer: CSV, XML, HTML, JSON, natural-language text such as an email message or a tweet, and source code in Python, JS, Java, C++, or any other programming language.

(2) What are some examples of files that are not text-based?

Answer: mp3, png, jpg, mpg.

(3) What does the following code mean?

Answer: the code above can be used to load an Excel file.

(4) What does the following code mean?

Answer: the code above reads the txt file into a list of words.

(5) What does the following code mean?

[i for i in dir(34) if not i.startswith('__')]

Answer: to list all the attributes and methods of the integer object 34 whose names do not start with a double underscore.

(6) Fill in the blank.

Suppose we have an original file and we would like to convert it to CSV, HTML, XML, and JSON. Ordered by size after compression, the resulting files are,

____ < ____ < _____ < _____

Answer:

CSV < HTML < JSON < XML

Explain:

the ratio of original to compressed for CSV: 4
the ratio of original to compressed for XML: 9.5
the ratio of original to compressed for JSON: 7.9

(7) Explain the meaning of the following methods in a class definition.

__init__
__str__
__add__

Answer:

__init__: the constructor, called to initialize a new instance
__str__: called when a conversion to string is needed, e.g. by print
__add__: overloads the + operator
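
A minimal sketch showing the three methods together. Money is a hypothetical class used only for illustration.

```python
class Money:
    """Hypothetical example class illustrating the three special methods."""

    def __init__(self, amount):
        # Runs when Money(...) is called; sets up the new instance.
        self.amount = amount

    def __str__(self):
        # Used by print() and str() to get a readable representation.
        return f'${self.amount:.2f}'

    def __add__(self, other):
        # Called for `self + other`; returns a new Money object.
        return Money(self.amount + other.amount)

total = Money(1.5) + Money(2.25)
print(total)
```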

4. Hash Table

(1) Answer the following question.

Assume a hash table with three buckets and hash function for strings: ord(key[0]) - ord('a'). To which bucket index would apple go?

Answer:

0

Explain:

Because key[0] == 'a', so ord('a') - ord('a') == 0. The number of buckets doesn't matter here.
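
The hash function from the question can be written out directly. The function name bucket_index and the extra sample words are assumptions added for illustration.

```python
def bucket_index(key):
    """Hash function from the question: offset of the first letter from 'a'."""
    return ord(key[0]) - ord('a')

for word in ['apple', 'banana', 'cherry']:
    print(word, '->', bucket_index(word))
```

Note that this function applies no modulo, so a key such as 'zebra' would produce an index far beyond three buckets; that is a limitation of the function as stated in the question.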

(2) What is the search complexity in direct addressing with a hash table?

Answer:

O(1)

(3) Fill in the blank.

Based on the definition of a binary tree,

A ________ is a node that has no children.
The ________ is the non-leaf node.
The ________ is a simple kind of tree that has at most two children.
A binary tree with ________ nodes has n-1 edges.

Answer:

leaf
internal node
binary tree
n
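
The definitions above can be checked on a small tree. This is a sketch; the Node class and the helper functions are hypothetical names introduced for illustration.

```python
class Node:
    """Minimal binary tree node with at most two children."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def is_leaf(node):
    # A leaf has no children; every other node is an internal node.
    return node.left is None and node.right is None

def count_nodes(node):
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)

def count_edges(node):
    if node is None:
        return 0
    children = [c for c in (node.left, node.right) if c is not None]
    # One edge per child, plus the edges inside each subtree.
    return len(children) + sum(count_edges(c) for c in children)

tree = Node(1, Node(2, Node(4), Node(5)), Node(3))
n_nodes = count_nodes(tree)
n_edges = count_edges(tree)
print(n_nodes, 'nodes,', n_edges, 'edges')
```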

5. TFIDF

(1) What is a good way to visualize term frequency?

Answer: Using a word cloud.

(2) Calculate the term frequency of ‘times’ in d1

d1 = "in the new york times in"
d2 = "the new york post"
d3 = "the los angeles times"

Answer: (1/6)

0.16666666666666666

(3) Calculate the document frequency of ‘times’ without smoothing.

d1 = "in the new york times in"
d2 = "the new york post"
d3 = "the los angeles times"

Answer: (2/3)

0.6666666666666666

(4) Calculate the document frequency of ‘times’ with smoothing.

d1 = "in the new york times in"
d2 = "the new york post"
d3 = "the los angeles times"

Answer: (3/4)

0.75

(5) Calculate the log inverse document frequency of ‘times’ with smoothing and a base of 10.

d1 = "in the new york times in"
d2 = "the new york post"
d3 = "the los angeles times"

Answer: log10(4/3)

0.12493873660829992

(6) Calculate the TF-IDF value for the token ‘times’ in d1.

d1 = "in the new york times in"
d2 = "the new york post"
d3 = "the los angeles times"

Answer: (1/6)*log10(4/3)

0.020823122768049984
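
The answers to questions (2) through (6) can be reproduced in a few lines. This sketch follows the formulas used above: tokens are split on whitespace, smoothing adds 1 to both the document count and the corpus size, and the variable names are my own.

```python
import math

docs = {
    'd1': 'in the new york times in',
    'd2': 'the new york post',
    'd3': 'the los angeles times',
}
term = 'times'

# Term frequency in d1: occurrences of the term over the number of tokens.
tokens_d1 = docs['d1'].split()
tf = tokens_d1.count(term) / len(tokens_d1)

# Document frequency: share of documents that contain the term.
n_docs = len(docs)
n_containing = sum(term in text.split() for text in docs.values())
df = n_containing / n_docs
df_smoothed = (n_containing + 1) / (n_docs + 1)

# Inverse document frequency with smoothing, base 10, and the final TF-IDF.
idf = math.log10(1 / df_smoothed)
tfidf = tf * idf

print(tf, df, df_smoothed, idf, tfidf)
```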

(7) What does the following code do?

Answer: the program above is used to create a bag of words model.
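
As a hedged sketch of the idea (not the program from the question), a bag-of-words model can be as simple as a mapping from each token to its count. Here bag_of_words is a hypothetical helper built on collections.Counter, with lowercase whitespace tokenization as an assumption.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase token to the number of times it occurs."""
    return Counter(text.lower().split())

bow = bag_of_words('in the new york times in')
print(bow)
```

Word order is discarded; only the counts remain, which is what makes it a "bag".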

(8) What is the output of the following code?

Answer:

Hello world
Hello,world!