Data Acquisition 1 | Character Encoding, Stdin and Stdout, and Data Pipeline

1. Character Encoding
(1) ASCII
ASCII uses 7 bits for each character, so it can represent at most 2⁷ = 128 characters. That is not enough for all the languages and symbols we use today.
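As a quick check of the 7-bit limit, the Python sketch below encodes an ASCII letter and then tries a character outside the ASCII range. The variable names are only illustrative.

```python
# ASCII covers code points 0..127, so plain English letters fit.
ascii_bytes = 'A'.encode('ascii')
print(ascii_bytes, ord('A'))

# A character outside that range cannot be encoded as ASCII.
try:
    'Ω'.encode('ascii')
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```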
(2) Unicode
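Unicode assigns a number (a code point) to far more characters than a single byte can hold, so 8-bit (1 byte) characters are no longer enough; a character such as Ω needs at least 16 bits (2 bytes). However, Python 3 optimizes storage, keeping a string at 1 byte per character for as long as possible and switching to a wider representation only when the string contains a character that does not fit.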
from sys import getsizeof
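# Exact sizes vary across CPython versions and platforms.
print(getsizeof(''))   # 49 bytes of overhead for an empty (ASCII) string
print(getsizeof('Ω'))  # 74 bytes of overhead for a non-ASCII string, plus 2 bytes for Ω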
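One way to see the per-character cost without relying on exact overhead figures is to compare strings that differ by a single character. This is a sketch that assumes CPython's flexible string representation; other Python implementations may report different sizes.

```python
from sys import getsizeof

# Growth in bytes when one more character of the same kind is appended.
ascii_step = getsizeof('aa') - getsizeof('a')      # ASCII-only string
omega_step = getsizeof('ΩΩ') - getsizeof('Ω')      # needs 2 bytes per character
emoji_step = getsizeof('😀😀') - getsizeof('😀')    # needs 4 bytes per character
print(ascii_step, omega_step, emoji_step)
```

The differences cancel out the fixed overhead, so only the width of one stored character remains.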
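In a Python string literal, a Unicode character can be written as an escape that starts with \u followed by four hexadecimal digits; for example, '\u03A9' is the character Ω.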
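A small illustration of the \u escape, using the characters that appear elsewhere in this article (the variable names are just for the example):

```python
# '\u' takes exactly four hex digits (the code point of the character).
omega = '\u03A9'
pencil = '\u270F'
euro = '\u20AC'
print(omega, pencil, euro)

# The escape and the literal character are the same string.
print(omega == 'Ω')
```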
(3) Hexadecimal
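A byte can be written as exactly 2 hexadecimal digits, which is why we tend to show byte values in hexadecimal. For example, the largest byte value, 255, is FF in hexadecimal, and the code point of Ω, 937, is 3A9.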
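The relationship between bytes and hexadecimal digits can be checked directly in Python. This is a minimal sketch; it uses UTF-8 as the byte encoding, which is an assumption made for the example.

```python
# One byte (0..255) always fits in two hex digits.
print(format(255, '02x'))     # ff
print(format(10, '02x'))      # 0a

# The code point of Ω in hex, and its UTF-8 bytes as hex pairs.
omega_hex = hex(ord('Ω'))
omega_utf8_hex = 'Ω'.encode('utf-8').hex()
print(omega_hex, omega_utf8_hex)
```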
(4) Show Encoding Values
The ord() function returns the decimal value of a character's code point.
ord('Ω') # 937
(5) Decoding Decimal Values
Conversely, the chr() function converts a decimal code point back into its character.
chr(937) # 'Ω'
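Applied to a whole string, ord() and chr() give a round trip between text and numbers. The sample text below is an arbitrary choice for illustration.

```python
text = 'ID 345'

# Character -> code point for every character in the string.
codes = [ord(ch) for ch in text]
print(codes)

# Code point -> character, joined back into the original string.
restored = ''.join(chr(c) for c in codes)
print(restored == text)
```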
(6) Encoding ASCII with Command Line
First, let's create a file named ascii.txt:
$ echo ID 345\n > ascii.txt
The od command is a good way to inspect the bytes of a file. The -c option tells it to print the bytes as characters, -t dC tells it to print the decimal values of those characters, and -t xC tells it to print those values in hexadecimal.
$ od -c -t dC ascii.txt
This prints each byte of the file as a character, with its decimal value on the line below.

Similarly, we can print the values in hexadecimal:
$ od -c -t xC ascii.txt
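To make the idea concrete, here is a rough Python imitation of what od reports. It is not od's exact layout, and the byte string is an assumption about what the echo command above writes in bash (the unquoted backslash is dropped and echo appends a newline).

```python
# Assumed content of ascii.txt (see the note above).
data = b'ID 345n\n'

chars = [repr(chr(b))[1:-1] for b in data]   # printable form, e.g. \n for newline
decimals = [str(b) for b in data]            # like -t dC
hexes = [format(b, '02x') for b in data]     # like -t xC

# One aligned row per view of the same bytes.
for row in (chars, decimals, hexes):
    print(' '.join(cell.rjust(3) for cell in row))
```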

(7) Encoding Unicode with Command Line
First, let's create a file named utf8.txt:
$ echo 'Pencil: ✏, Euro: €\n' > utf8.txt
Then run
$ od -c -t xC utf8.txt
This prints the bytes of each character in hexadecimal. In the output, ** means "Hi, I'm a byte that is part of the preceding character shown".
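The multi-byte characters can also be found from Python. This sketch assumes the file is UTF-8 encoded (as its name suggests) and uses bytes.hex() with a separator, which needs Python 3.8 or later.

```python
text = 'Pencil: ✏, Euro: €'

# Keep only the characters that need more than one byte in UTF-8.
multi = {ch: ch.encode('utf-8').hex(' ')
         for ch in text
         if len(ch.encode('utf-8')) > 1}

for ch, hex_bytes in multi.items():
    print(ch, hex_bytes)
```

Each of the two symbols takes three bytes, which is why od shows two continuation entries after each of them.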
The word count command wc outputs the newline, word, and byte counts of a file. For example,
$ wc utf8.txt
gives
1 4 25 utf8.txt
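The three numbers can be reproduced in Python. The byte string below is an assumption about what bash's echo wrote to utf8.txt: inside single quotes the \n stays a literal backslash and n, and echo appends one real newline.

```python
# Assumed content of utf8.txt (see the note above).
data = 'Pencil: ✏, Euro: €\\n\n'.encode('utf-8')

line_count = data.count(b'\n')           # newlines
word_count = len(data.split())           # whitespace-separated words
byte_count = len(data)                   # bytes, not characters
char_count = len(data.decode('utf-8'))   # characters, for comparison
print(line_count, word_count, byte_count, char_count)
```

The byte count is larger than the character count because ✏ and € each occupy three bytes in UTF-8.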
2. Data Pipeline
(1) Recall: Stdin and Stdout
Recall that in Linux a program by default reads its input from standard input (stdin) and writes its output to standard output (stdout). The most common way to produce standard output is the echo command. For example,
$ echo Hello
This prints Hello in the terminal. The printed Hello is the command's standard output.
We can redirect this output to a .txt file named Hello.txt.
$ echo Hello > Hello.txt
We can then use the cat command to check the result,
$ cat Hello.txt
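The same redirection idea exists inside Python. The sketch below sends the output of print() to a file instead of the terminal; the temporary directory is an assumption made so the example does not touch the current folder.

```python
import os
import tempfile
from contextlib import redirect_stdout

path = os.path.join(tempfile.mkdtemp(), 'Hello.txt')

# While the block is active, standard output goes to the file (like `>`).
with open(path, 'w') as f, redirect_stdout(f):
    print('Hello')

# Read the file back, as `cat Hello.txt` would.
with open(path) as f:
    content = f.read()
print(content, end='')
```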
We can also look at standard input. The read command reads a line from standard input and stores it in a shell variable, for example,
$ read Something
Then we type in World and press Enter.
To show the value of the variable Something, we can use,
$ echo $Something
The word World that we typed is what read received from standard input.
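A Python counterpart of read can be sketched as follows. The helper read_value is hypothetical, and io.StringIO stands in for the keyboard so the example runs without waiting for input.

```python
import io
import sys

def read_value(stream=sys.stdin):
    """Read one line from the stream and drop the trailing newline."""
    return stream.readline().rstrip('\n')

# Simulate typing "World" followed by Enter.
something = read_value(io.StringIO('World\n'))
print(something)
```

Calling read_value() with no argument would read from the real standard input instead.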
Next, we combine standard input and standard output in a single command. First, we create a new script file with the following code,
read Something
echo $Something
Then we run the script with source, redirecting ascii.txt into its standard input,
$ source < ascii.txt
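This feeds ascii.txt to the script as standard input, and the script echoes the line it read. The trailing n is there because the shell reduced the unquoted \n in the earlier echo command to a plain n,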
ID 345n
We can also redirect the standard output to another file,
$ source < ascii.txt > ascii2.txt
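The same two redirections can be set up from Python with subprocess. This is only a sketch: the child program is a one-line Python stand-in for the read/echo script, and the files live in a temporary directory chosen for the example.

```python
import os
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'ascii.txt')
dst = os.path.join(workdir, 'ascii2.txt')
with open(src, 'w') as f:
    f.write('ID 345n\n')

# Child program: read one line from stdin and write it to stdout.
child_code = 'import sys; sys.stdout.write(sys.stdin.readline())'

# stdin=fin plays the role of `<`, stdout=fout the role of `>`.
with open(src) as fin, open(dst, 'w') as fout:
    subprocess.run([sys.executable, '-c', child_code],
                   stdin=fin, stdout=fout, check=True)

with open(dst) as f:
    result = f.read()
print(result, end='')
```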
(2) Project of the Data Pipeline
