資料處理概念與方法-文字壓縮


文字壓縮

提問:為何文字要壓縮?

說明

參考網站

http://www.csie.ntnu.edu.tw/~u91029/Compression.html

活動設計1-文字壓縮1

 

參考資料:http://www.zutopedia.com/compression.html

練習:

Once upon a time, long, long ago, three little pigs set out to make their fortunes. The first little pig wasn't very clever, and decided to build his house out of straw, because it was cheap. The second little pig wasn't very clever either, and decided to build his house out of sticks, for the "natural" look that was so very much in fashion, even in those days. The third little pig was much smarter than his two brothers, and bought a load of bricks in a nearby town, with which to construct a sturdy but comfortable country home.

Not long after hist housewarming party, the first little pig was curled up in a chair reading a book, when there came a knock at the door. It was the big bad wolf, naturally.

"Little pig, little pig, let me come in!" cried the wolf.

"Not by the hair on my chinny-chin-chin!" squealed the first little pig.

"Then I'll huff, and I'll puff, and I'll blow your house down!" roared the wolf, and he did huff, and he did puff, and the house soon collapsed. The first little pig ran as fast as he could to the house of sticks, and was soon safe inside. But it wasn't long before the wolf came calling again.

"Little pig, little pig, let me come in!" cried the wolf.

"Not by the hair on my chinny-chin-chin!" squealed the second little pig.

"Then I'll huff, and I'll puff, and I'll blow your house down!" roared the wolf, and he did huff, and he did puff, and the house was soon so much firewood. The two terrified little pigs ran all the way to their brother's brick house, but the wolf was hot on their heels, and soon he was on the doorstep.

"Little pig, little pig, let me come in!" cried the wolf.

"Not by the hair on my chinny-chin-chin!" squealed the third little pig.

"Then I'll huff, and I'll puff, and I'll blow your house down!" roared the wolf, and he huffed, and he puffed, and he huffed som more, but of course, the house was built of brick, and the wolf was soon out of breath. Then he had an idea. The chimney! He clambered up a handy oak tree onto the roof, only to find that there was no chimney, because the third little pig, being conscious of the environment, had installed electric heating. In his frestration, the wolf slipped and fell off the roof, breaking his left leg, and severely injuring his pride. As he limped away, the pigs laughed, and remarked how much more sensible it was to live in the city, where the only wolves were in the zoo. And so that is what they did, and of course they all lived happily ever after.

說明

RLE - Run Length Encoding (遊程長度編碼)

它是一種無損失演算法,該演算法在特定類型的數據資料壓縮上有著驚人的壓縮比。

範例

假設使用 RLE 壓縮下列字串資料(17 個位元組):

ABBBBBBBBBCDEEEEF

利用 RLE 壓縮後,被壓縮的檔案只佔用 10 位元組,如下所示:

A *8B C D *4E F

正如你所見的,重複的字串的數據被一個控制字元 (*) 加重複次數和重複字元本身所取代,控制字符是非固定不變的,不同的壓縮會有不同的選用字元。


活動設計2-文字壓縮2

操作步驟說明:

1. 找到重覆出現的單字,在其中選取一個單字(或是一整個句子),稱為保留標的。

2. 再選取其他重覆出現的單字(或是一整個句子),稱為壓縮標的。

3. 選取完成之後,按下「開始壓縮」按鈕,將所有重覆的單字(或句子)清除到只剩一個,被壓縮的標的則用數字標示其所在的位置。

4. 一直進行到無法再找到保留或壓縮標的為止,看看總共用了幾個壓縮的步驟、成功壓縮了幾個字元。
問題一、為什麼選取標的時,要被限制在至少選取兩個字元以上?

問題二、尋找壓縮標的時,如果漏掉了其中幾個,對於結果有什麼影響?

問題三、要優先標記一整個句子,還是以較短的單字為優先?

 

說明

LZW(Lempel-Ziv-Welch)

是跟隨著開發此壓縮演算法的科學家 Abraham Lempel, Jakob Ziv 與 Terry Welch 命名,這是一個『基於字典』的無失真壓縮演算法。基於字典演算法掃描文件中出現一次以上的序列資料,這些序列資料儲存於壓縮文件內的字典內,每次出現重複性資料時只要替換為參照字典的索引即可。

範例

假設你要壓縮以下字符串的文字:"the quick brown fox jumps over the lazy dog.";其中 'the' 這個詞發生兩次

所以這個資料可以壓縮成:"the quick brown fox jumps over << lazy dog.",其中 "<<" 是一個指向最前面 4 個字符的指標。

LZ78

範例

以上述的例子來說,重複的 'the' 將被編入索引假設該索引是*,則壓縮後的字串變成:"* quick brown fox jumps over * lazy dog.",這與實用上還很遙遠,但是它透過片語取代舉例說明壓縮方法。

實用上取代的不一定是一個單字,也可以是幾個字組成的片語,字典基礎的壓縮會以標記(token)來取代片語(phrase);片語的選取法則將影響字典的大小與取代的重複次數,各種變異型大都是在此問題上做文章;如果標記的位元數量是少於片語所需的位元數目,那麼壓縮就如此產生。

在大部份的應用,LZW 壓縮比當時已有且廣為人之的方法提供一個比較好的壓縮率。它變成電腦上第一個被廣泛使用在一般目的資料壓縮的方法。在大的英文文本中,一般它可以壓縮到大約原來大小的一半。其他的資料種類在很多情況下也相當的有用

參考網站

https://sites.google.com/ntjh.ntct.edu.tw/cstt/(南投縣謝宗翔老師)