Data Acquisition 1 | Data Pipeline
Data Acquisition 1 | Character Encoding, Stdin and Stdout, and Data Pipeline

1. Character Encoding
(1) ASCII
ASCII uses 7 bits per character, so it can represent at most 2⁷ = 128 characters. That is far too few for all the languages and symbols in use today.
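As a small illustration of the 128-character limit, the sketch below checks that every character of an ASCII string has a code below 2⁷, and shows what happens when a character outside that range is encoded as ASCII. The sample strings are chosen only for illustration.

```python
# Every ASCII character has a code below 2**7 = 128, so it fits in 7 bits.
text = "ID 345"
codes = [ord(ch) for ch in text]
print(codes)                           # [73, 68, 32, 51, 52, 53]
print(all(c < 2**7 for c in codes))    # True

# A character outside that range cannot be encoded as ASCII.
try:
    "€".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print("not ASCII:", err.reason)
```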
(2) Unicode
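Unicode assigns each character a code point from U+0000 to U+10FFFF, so 1 byte per character is not enough, and even 16 bits (2 bytes) cannot cover every code point. CPython 3 optimizes storage: a string uses 1 byte per character as long as all of its characters fit in one byte, and switches to 2 or 4 bytes per character only when a wider character appears.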
from sys import getsizeof
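print(getsizeof(''))    # 49: fixed overhead of an ASCII string (exact sizes depend on the CPython version)
print(getsizeof('a'))   # overhead + 1 byte per ASCII character
print(getsizeof('ab'))
print(getsizeof('Ω'))   # about 74 bytes of overhead for a non-ASCII string, plus 2 bytes per character
print(getsizeof('ΩΩ'))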
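The in-memory size reported by getsizeof is a CPython detail. The number of bytes a character takes in a file depends on the encoding instead. The sketch below compares a few sample characters (picked only for illustration) under UTF-8, UTF-16, and UTF-32.

```python
# Bytes needed per character in three common Unicode encodings.
# The "-le" variants are used so that no byte-order mark is added.
widths = {}
for ch in ["a", "Ω", "€", "😀"]:
    widths[ch] = (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    print(ch, hex(ord(ch)), widths[ch])

# a  0x61    (1, 2, 4)
# Ω  0x3a9   (2, 2, 4)
# €  0x20ac  (3, 2, 4)
# 😀 0x1f600 (4, 4, 4)
```

The last row shows a code point above U+FFFF, which needs 4 bytes even in UTF-16.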
In a Python string literal, a Unicode character can be written as \u followed by four hexadecimal digits, for example
'\u00ab' # '«'
(3) Hexadecimal
A byte can be written with exactly 2 hexadecimal digits, which is why byte values are usually shown in hexadecimal. For example,
'\xFF' # 'ÿ', code point 255
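A short sketch of moving between decimal and hexadecimal in Python, using only built-in functions; the values are chosen for illustration.

```python
value = 0xFF                       # hexadecimal integer literal
print(value)                       # 255
print(hex(255))                    # 0xff
print(format(255, "02x"))          # ff  (two hex digits, no prefix)
print(int("FF", 16))               # 255

# '\xFF' is the character whose code point is 255.
print(ord("\xFF"))                 # 255

# The UTF-8 bytes of a character, shown as hex digits.
omega_hex = "Ω".encode("utf-8").hex()
print(omega_hex)                   # cea9
```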
(4) Show Encoding Values
The ord() function returns the Unicode code point of a character as a decimal integer.
ord('Ω') # 937
(5) Decoding Decimal Values
Conversely, the chr() function turns a decimal code point back into its character.
chr(937) # 'Ω'
(6) Encoding ASCII with Command Line
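First, create a file named ascii.txt,
$ echo ID 345\n > ascii.txt
Because the \n is unquoted, bash drops the backslash, so the file contains ID 345n followed by the newline that echo appends.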
The od command dumps the bytes of a file. The -c option prints the bytes as characters, -t dC prints the decimal value of each byte, and -t xC prints each byte value in hexadecimal.
$ od -c -t dC ascii.txt
This prints each character together with the decimal value of its byte.

We can dump the same bytes as hexadecimal values,
$ od -c -t xC ascii.txt
This prints the same characters with their byte values in hexadecimal.
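The same byte values can be reproduced in Python, which is handy when od is not available. This is a minimal sketch that assumes ascii.txt contains ID 345n plus a trailing newline, which is what bash writes for the echo command above. The column layout is illustrative and does not imitate od's exact format.

```python
# Assumed contents of ascii.txt: bash drops the unquoted backslash in
# `echo ID 345\n`, and echo appends one newline.
data = b"ID 345n\n"

decimal_values = list(data)                          # like od -t dC
hex_values = [format(byte, "02x") for byte in data]  # like od -t xC

for byte in data:
    # repr() makes the newline visible as '\n', similar to od -c
    print(f"{chr(byte)!r:>6} {byte:>4} {byte:02x}")
```

To run this against the real file, data could be read with open("ascii.txt", "rb").read() instead of the literal.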

(7) Encoding Unicode with Command Line
First, create a file named utf8.txt,
$ echo 'Pencil: ✏, Euro: €\n' > utf8.txt
Then run,
$ od -c -t xC utf8.txt
which prints each character with its hexadecimal byte values.

Each ** in the od output means "Hi, I'm a byte that is part of the preceding character shown", that is, a continuation byte of a multi-byte UTF-8 character.
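To see why some characters occupy several bytes, the sketch below encodes the two non-ASCII characters from utf8.txt as UTF-8 and flags continuation bytes. In UTF-8, a continuation byte always has the bit pattern 10xxxxxx. The helper name is_continuation is made up for this example.

```python
def is_continuation(byte: int) -> bool:
    """Return True if the byte has the UTF-8 continuation pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

for ch in "✏€":
    encoded = ch.encode("utf-8")
    bits = [format(b, "08b") for b in encoded]
    flags = [is_continuation(b) for b in encoded]
    print(ch, encoded.hex(" "), bits, flags)

# ✏ e2 9c 8f  -> first byte is a lead byte, the next two are continuations
# € e2 82 ac  -> same structure
```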
The word count command wc prints the newline, word, and byte counts of a file. For example,
$ wc utf8.txt
gives,
1 4 25 utf8.txt
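The three counts can be checked by hand in Python. This sketch assumes the file was written by bash, so the \n inside the single quotes stays as two literal characters (a backslash and an n) and echo appends one real newline.

```python
# Assumed contents of utf8.txt as written by bash's echo:
# the quoted \n is kept literally, then one newline is appended.
data = "Pencil: ✏, Euro: €\\n\n".encode("utf-8")

line_count = data.count(b"\n")   # newline count
word_count = len(data.split())   # whitespace-separated words
byte_count = len(data)           # bytes, not characters

print(line_count, word_count, byte_count)   # 1 4 25

# The character count is smaller, because ✏ and € take 3 bytes each.
char_count = len(data.decode("utf-8"))
print(char_count)                           # 21
```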
2. Data Pipeline
Before starting this part, make sure you are comfortable with the basics of the command line. If you are not sure how to use it, here is a quick reference that you may find helpful.
(1) Recall: Stdin and Stdout
Recall that, by default, a Linux program reads its input from standard input (stdin) and writes its output to standard output (stdout). The most common way to write to standard output is the echo command. For example,
$ echo Hello
This prints Hello in the terminal. The printed Hello is the standard output.
We can redirect this output to a .txt file named Hello.txt.
$ echo Hello > Hello.txt
We can then use the cat command to check the result,
$ cat Hello.txt
To try out standard input, we can use the read command, which reads a line from standard input and stores it in a shell variable. For example,
$ read Something
then, we can type in,
World
To show the variable Something, we can use,
$ echo $Something
The word World that we typed is the standard input.
Next, we combine standard input and standard output in a single command. First, create a new file called readline.sh with the following code,
#!/bin/bash
read Something
echo $Something
Then we run,
$ source readline.sh < ascii.txt
This supplies ascii.txt as the standard input of the script, which prints the first line of the file,
ID 345n
We can also redirect the standard output to another file,
$ source readline.sh < ascii.txt > ascii2.txt
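The same read-then-echo idea can be sketched in Python, where standard input and standard output are the file-like objects sys.stdin and sys.stdout. The function name echo_first_line and the in-memory streams below are only for illustration; they stand in for the redirected files.

```python
import io
import sys

def echo_first_line(source=sys.stdin, sink=sys.stdout):
    """Read one line from source and write it to sink, like read + echo."""
    line = source.readline()
    sink.write(line.rstrip("\n") + "\n")

# In-memory streams play the role of `< ascii.txt` and `> ascii2.txt`.
fake_stdin = io.StringIO("ID 345n\n")
fake_stdout = io.StringIO()
echo_first_line(fake_stdin, fake_stdout)
print(repr(fake_stdout.getvalue()))   # 'ID 345n\n'
```

Saved in a script (say, a hypothetical readline.py) that calls echo_first_line() with no arguments, it could be run as python3 readline.py < ascii.txt > ascii2.txt, and the shell would handle the redirection exactly as it does for readline.sh.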
(2) Project of the Data Pipeline
Here’s a quick reference for the data pipeline project.