Data Acquisition 1 | Data Pipeline

Series: Data Acquisition

Data Acquisition 1 | Character Encoding, Stdin and Stdout, and Data Pipeline

1. Character Encoding

(1) ASCII

ASCII uses 7 bits for each character, so it can represent at most 2⁷ = 128 characters. This is not enough to cover all the languages and symbols we use today.
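As a small illustration of the 128-character limit, the sketch below encodes a sample string to ASCII and shows that a character outside that range is rejected. The sample strings and variable names are chosen here only for illustration.

```python
# Every ASCII character maps to a value in the range 0..127 (7 bits).
ascii_bytes = 'ID 345'.encode('ascii')
values = list(ascii_bytes)          # one integer per character
print(values)
print(max(values) < 2 ** 7)         # all values fit in 7 bits

# A character outside ASCII cannot be encoded this way.
try:
    'Ω'.encode('ascii')
    omega_is_ascii = True
except UnicodeEncodeError:
    omega_is_ascii = False
print(omega_is_ascii)
```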

(2) Unicode

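Unicode assigns each character a code point, and most code points do not fit in a single 8-bit byte, so characters outside ASCII need 2 or more bytes each. Python 3 (CPython) optimizes for this: it keeps a string at 1 byte per character for as long as possible, and only switches to a wider representation once a character that needs it is introduced.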

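from sys import getsizeof
# Exact sizes depend on the CPython version and platform
print(getsizeof(''))   # e.g. 49: fixed overhead of an ASCII string
print(getsizeof('a'))  # ASCII overhead + 1 byte
print(getsizeof('ab')) # ASCII overhead + 2 bytes
print(getsizeof('Ω'))  # e.g. 74 bytes of non-ASCII overhead + 2 bytes
print(getsizeof('ΩΩ')) # non-ASCII overhead + 4 bytes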
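Because the fixed overhead differs between CPython versions, the per-character cost is easier to see from size differences than from absolute sizes. The sketch below assumes CPython's flexible string storage; the variable names are illustrative only, and other Python implementations may report different numbers.

```python
from sys import getsizeof

# Cost of one extra character = size difference between two strings
ascii_step = getsizeof('ab') - getsizeof('a')    # ASCII-only string
greek_step = getsizeof('ΩΩ') - getsizeof('Ω')    # contains U+03A9
wide_step = (getsizeof('\U0001F600\U0001F600')
             - getsizeof('\U0001F600'))          # contains U+1F600

print(ascii_step, greek_step, wide_step)
```

On CPython this is expected to show 1, 2, and 4 bytes per character respectively: the whole string is stored at the width of its widest character.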

In a Python string literal, a Unicode character can be written as an escape that starts with \u followed by four hexadecimal digits, for example

'\u00ab'

(3) Hexadecimal

A byte can be written with exactly 2 hexadecimal digits, which is why byte values are usually shown in hexadecimal. For example,

'\xFF'
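To make the byte-to-hex relationship concrete, here is a minimal sketch using only built-in Python functions. The euro sign is just a sample character chosen for illustration.

```python
# One byte (0..255) always fits in exactly two hexadecimal digits.
lowest = format(0, '02x')       # smallest byte value
highest = format(255, '02x')    # largest byte value
print(lowest, highest)

# '\xFF' is the character with code point 0xFF (decimal 255).
print(ord('\xFF'))

# bytes.hex() shows each byte of an encoded string as two hex digits.
euro_hex = '€'.encode('utf-8').hex()
print(euro_hex)
```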

(4) Show Encoding Values

The ord() function returns the code point of a character as a decimal integer.

ord('Ω')     # 937

(5) Decoding Decimal Values

Conversely, the chr() function returns the character that corresponds to a decimal code point.

chr(937)     # 'Ω'
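Since ord() and chr() are inverses, a quick round trip is a handy sanity check. The sample characters below are arbitrary choices for this sketch.

```python
# ord() maps a character to its code point; chr() maps it back.
for ch in ['A', 'Ω', '€']:
    code = ord(ch)
    print(ch, code, hex(code), chr(code) == ch)

# True when every character survives the round trip unchanged.
round_trip_ok = all(chr(ord(ch)) == ch for ch in 'AΩ€')
print(round_trip_ok)
```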

(6) Encoding ASCII with Command Line

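Let's first create a file named ascii.txt. Note that the unquoted \n below is not turned into a newline: the shell drops the backslash, so the file holds ID 345n followed by the newline that echo adds.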

$ echo ID 345\n > ascii.txt

The od command is a good way to inspect the bytes of a file. The -c option tells it to print the bytes as characters, -t dC tells it to print the decimal values of those bytes, and -t xC tells it to print those values in hexadecimal.

$ od -c -t dC ascii.txt

The output lists each character of the file with its decimal value beneath it.

We can do the same thing to show the characters as hexadecimal values,

$ od -c -t xC ascii.txt

The output lists each character with its hexadecimal value beneath it.
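The same view can be approximated in Python. The sketch below assumes ascii.txt holds the bytes ID 345n followed by a newline (what the echo command above produces in bash) and uses a hypothetical helper, dump(), on an in-memory copy rather than reading the file.

```python
def dump(data: bytes) -> list:
    """Return (character, decimal, hex) for each byte, similar to od's view."""
    rows = []
    for b in data:
        rows.append((repr(chr(b)), b, format(b, '02x')))
    return rows

# Assumed file contents: the shell turned the unquoted \n into a plain n,
# and echo appended a real newline.
sample = b'ID 345n\n'
rows = dump(sample)
for char, dec, hx in rows:
    print(f'{char:>6} {dec:>4} {hx:>3}')
```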

(7) Encoding Unicode with Command Line

Let's first create a file named utf8.txt,

$ echo 'Pencil: ✏, Euro: €\n' > utf8.txt

Then run,

$ od -c -t xC utf8.txt

In the output, multi-byte characters such as ✏ and € are followed by ** entries. The ** means "Hi, I'm a byte that is part of the preceding character shown".
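To see why some characters occupy several bytes, the sketch below encodes a few sample characters in UTF-8 and flags the continuation bytes, which are the ones od marks as belonging to the preceding character. The helper is_continuation() is a hypothetical name introduced for this example.

```python
# Each character may take one or more bytes in UTF-8.
for ch in ['P', '✏', '€']:
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), encoded.hex())

def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

pencil = '✏'.encode('utf-8')
flags = [is_continuation(b) for b in pencil]
print(flags)
```

The first byte of a multi-byte character is a lead byte, and every byte after it is a continuation byte.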

The word count command wc outputs the newline, word, and byte counts of a file. For example,

$ wc utf8.txt

gives,

      1       4      25 utf8.txt
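These three numbers can be reproduced in Python. The sketch assumes utf8.txt contains the text, a literal backslash and n (bash's echo does not interpret \n without -e), and then the newline that echo appends. That assumption is consistent with the 25 bytes reported above.

```python
# Assumed contents of utf8.txt: the text, a literal backslash and n,
# then the newline added by echo.
content = 'Pencil: ✏, Euro: €\\n\n'
data = content.encode('utf-8')

lines = data.count(b'\n')        # wc counts newline characters
words = len(data.split())        # whitespace-separated chunks
nbytes = len(data)               # bytes, not characters

print(lines, words, nbytes)
print(len(content))              # number of characters, for comparison
```

The byte count is larger than the character count because ✏ and € each take 3 bytes in UTF-8.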

2. Data Pipeline

Before we start this part, make sure you have a basic understanding of the command line. If you are not sure how to use the command line, here is a quick reference that you may find helpful.

(1) Recall: Stdin and Stdout

Recall that in Linux a program by default reads its input from standard input (stdin) and writes its output to standard output (stdout). The most common way to write something to standard output is the echo command. For example,

$ echo Hello

This prints Hello in our terminal. The text Hello was written to standard output.

We can redirect this output to a .txt file named Hello.txt.

$ echo Hello > Hello.txt

We can then use the cat command to check this,

$ cat Hello.txt
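The same redirection idea exists inside Python. As a rough sketch, contextlib.redirect_stdout sends everything print() writes to a file instead of the terminal. A temporary directory is used here (an assumption of this example) so no files are left behind.

```python
import contextlib
import os
import tempfile

# Hypothetical temporary location, used so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Hello.txt')

    # Like `echo Hello > Hello.txt`: send standard output to a file.
    with open(path, 'w') as f, contextlib.redirect_stdout(f):
        print('Hello')

    # Like `cat Hello.txt`: read the file back.
    with open(path) as f:
        saved = f.read()

print(repr(saved))
```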

To see standard input in action, we can use the read command, which reads a line from standard input and stores it in a variable, for example,

$ read Something

then, we can type in,

World

To show the variable Something, we can use,

$ echo $Something

The word World that we typed reached the read command through standard input.

Next, we combine standard input and standard output in a single command. First, we create a new file called readline.sh with the following code,

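#!/bin/bash
# Read one line from standard input into the variable Something
read Something
# Write the value back to standard output
echo "$Something"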

then, we type in,

$ source readline.sh < ascii.txt

This feeds ascii.txt to the script as standard input, and the script prints the first line of the file,

ID 345n

We can also redirect the standard output to another file,

$ source readline.sh < ascii.txt > ascii2.txt
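For comparison, here is a minimal Python sketch of what readline.sh does. The function name read_and_echo() is hypothetical, and in-memory streams stand in for the two redirected files so the example runs on its own.

```python
import io
import sys

def read_and_echo(src=sys.stdin, dst=sys.stdout):
    """Read one line from src and write it to dst, like readline.sh."""
    line = src.readline().rstrip('\n')
    print(line, file=dst)
    return line

# Stand-ins for `< ascii.txt` and `> ascii2.txt`, using in-memory streams.
fake_input = io.StringIO('ID 345n\n')
fake_output = io.StringIO()
read_and_echo(fake_input, fake_output)
captured = fake_output.getvalue()
print(repr(captured))
```

Called without arguments, the function would read from the real standard input and write to the real standard output, so the shell's < and > redirections would apply to it in the same way.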

(2) Project of the Data Pipeline

Here’s a quick reference for this project of the data pipeline.