Semantic Compression using Text Clustering

Introduction

By combining semantic clustering with Huffman coding, a compression algorithm can be constructed that considerably reduces the storage requirements of text documents.

In this method, each word of the document is given a unique ID of the form XY, where X is the group of binary bits representing the ID of the cluster to which the word belongs, and Y is the group of binary bits representing the word within that cluster.
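
A minimal Python sketch of how such an ID could be composed (the helper name word_id and the 3-bit widths are illustrative assumptions, not part of the original scheme):

    def word_id(cluster_id: int, word_index: int,
                cluster_bits: int = 3, word_bits: int = 3) -> str:
        # Concatenate X (cluster-ID bits) with Y (word-within-cluster bits)
        # to form one fixed-width binary code.
        return (format(cluster_id, f"0{cluster_bits}b")
                + format(word_index, f"0{word_bits}b"))

    print(word_id(0, 0))  # cluster 0, word 0 -> "000000"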

Concept

One fact about a document that describes a topic is that "all the words in the document are closely related semantically".

That is, a document on a sports topic contains words related to sports spread throughout the document, with some common words filling the gaps between them.

Now, the concept is:

  1. Identify the different global clusters of the document and give each one an ID, like { sports=000, verbs=001, comparison=010, etc. }
  2. For all the words of the document, identify which word fits in which cluster and give each word a unique ID within that cluster, like
{ for cluster sports=000:     golf=000, football=001, ... }
{ for cluster verbs=001:      is=000, was=001, ... }
{ for cluster comparison=010: more=000, ... }
Now, the actual ID of a word of the document is given by { the ClusterID to which it belongs, concatenated with the word's unique ID within the cluster }, which is unique throughout the document.

Thus the word "golf", wherever it appears in the document, gets a unique code like 000 000 { sports, golf }.

Thus the word GOLF, which normally requires 4*8 = 32 bits, can be represented in 6 bits. The word IS, which requires 2*8 = 16 bits, can likewise be represented in 6 bits.
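
To make the concept concrete, here is a small Python sketch that builds such a codebook from a given clustering (the build_codebook helper and the sample clusters are illustrative; the original prescribes no implementation):

    def build_codebook(clusters, cluster_bits=3, word_bits=3):
        # Map each word to: cluster-ID bits + word-index bits within its cluster.
        codebook = {}
        for c_idx, words in enumerate(clusters.values()):
            for w_idx, word in enumerate(words):
                codebook[word] = (format(c_idx, f"0{cluster_bits}b")
                                  + format(w_idx, f"0{word_bits}b"))
        return codebook

    codebook = build_codebook({"sports": ["golf", "football"],
                               "verbs":  ["is", "was"]})
    print(codebook["golf"])  # "000000" -> 6 bits instead of 4*8 = 32 bits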

Example


Consider the sentence: I love golf because many people love it.

One can identify the clusters as:

Emotion   = {Love, ...}
Sports    = {Golf     }
Relative  = {Many     }
Personals = {I, People, It}
Connectors= {Because      }
Actually, the name of a cluster is not important, as long as the clustering can be done effectively. Now, classify the words of the above sentence and give each a unique ID within its cluster.

Thus the words get their IDs with respect to the clusters as:

Cluster           Words within the cluster

Emotion=000       Love=000
Sports=001        Golf=000
Relative=010      Many=000
Personals=011     I=000, People=001, It=010
Connectors=100    Because=000

Now, the unique IDs of the words are:
Love=000000, Golf=001000, Many=010000, I=011000, People=011001, It=011010, Because=100000

The sentence is coded as:

I Love Golf Because Many People Love It. = 32 letters * 8 bits = 256 bits
011000 000000 001000 100000 010000 011001 000000 011010 = 48 bits in total

Thus the above string of 256 bits can be compressed to 48 bits.
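
The whole example can be reproduced with a short Python sketch (the code table is copied from the assignments above; the variable names are illustrative):

    # 6-bit codes from the example: 3 cluster-ID bits + 3 word-ID bits.
    codes = {"i": "011000", "love": "000000", "golf": "001000",
             "because": "100000", "many": "010000",
             "people": "011001", "it": "011010"}

    sentence = "I love golf because many people love it"
    bits = "".join(codes[w.lower()] for w in sentence.split())
    print(len(bits))  # 48 bits, versus 32 letters * 8 = 256 bits

    # Decompression inverts the table and reads 6 bits at a time.
    inverse = {v: k for k, v in codes.items()}
    print(" ".join(inverse[bits[i:i+6]] for i in range(0, len(bits), 6)))
    # -> "i love golf because many people love it"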

Moreover, the 48 bits can be compressed again using a run-length encoding algorithm, since the stream consists of nothing but 1s and 0s; this may reduce the size to a still smaller number of bits.
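
As a rough sketch of that idea (whether run-length encoding actually shrinks the stream depends on how long the runs of identical bits turn out to be), the runs can be computed like this:

    from itertools import groupby

    def run_lengths(bits):
        # Collapse the bit string into (bit, run length) pairs.
        return [(b, len(list(g))) for b, g in groupby(bits)]

    print(run_lengths("011000000000001000"))
    # -> [('0', 1), ('1', 2), ('0', 11), ('1', 1), ('0', 3)]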

Remarks

  1. How many bits should be used to represent the cluster ID, and how many for the ID of a word within its cluster?

    In the above example, both the cluster ID and the word ID used 3 bits each. That allows 8 different clusters and 8 different words within each cluster, i.e. a total of 64 different words representable with 6 bits. But that is far too few.

    Even so, going for too many bits of representation is not advisable either. Observe that in the example above the single-character word "I" used 6 bits, whereas the seven-letter word "Because" also used just 6 bits: a fixed-width code spends the same number of bits on every word, so widening the code hurts short, frequent words the most (with a 16-bit code, "I" would grow from 8 bits of plain text to 16 bits).

    The optimum would be to use 16 bits for the final ID (the cluster ID and the word ID put together).

    One should then maintain a sensible tradeoff in the ratio of cluster-ID bits to word-ID bits within those 16 bits.

    If the cluster ID takes 3 bits, the word ID takes the remaining 13 bits, giving 8 clusters, each with a capacity of 2^13 = 8192 words.
    If instead the cluster ID takes 8 bits, the result is 256 clusters, each with a capacity of 2^8 = 256 words.

    Increasing the number of cluster-ID bits reduces each cluster's capacity to hold words, but increases the flexibility to divide the vocabulary into more clusters.
    Reducing the number of cluster-ID bits gives more space within each room but fewer rooms, which might result in ineffective classification. (The first sketch after these remarks tabulates this tradeoff.)

  2. Classification of words: we do not actually need to classify the words of a document semantically, as long as they are classified efficiently; but for the words of a text document there probably exists no efficient classification other than a semantic one. Otherwise, one can devise one's own classification, as long as one stays consistent with the notation.
    1. One important issue is standardizing the classification: say, all the words {is, was, ...} should go into {cluster X with ClusterID=XXX}. With such a standard, the algorithm can be used as a standard compressor. The standardization also helps in storing web documents: makers of compression/decompression tools can look up the standards and compress or decompress documents accordingly. Otherwise, each manufacturer has to keep a private lookup table in the compression program and use the same table while decompressing, which may result in inconsistency.
    2. Another issue worth noting is that a document compressed in this manner in one language can be decompressed into another language, provided standard classification tables are maintained for all languages. This could be done with the help of something like UNL. Because the compressed text contains nothing but language-independent IDs, which can be defined appropriately for each language, this is almost possible (see the second sketch below).
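
The capacity tradeoff in remark 1 can be tabulated in a few lines of Python; a minimal sketch, assuming the 16-bit total suggested above (the particular splits shown are just illustrative):

    TOTAL_BITS = 16  # suggested overall width of the final ID

    for cluster_bits in (3, 8, 13):
        word_bits = TOTAL_BITS - cluster_bits
        print(f"{cluster_bits:2d} cluster bits -> "
              f"{2**cluster_bits:5d} clusters x {2**word_bits:5d} words each")

    #  3 cluster bits ->     8 clusters x  8192 words each
    #  8 cluster bits ->   256 clusters x   256 words each
    # 13 cluster bits ->  8192 clusters x     8 words each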
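
The cross-language point in remark 2 amounts to keeping one decoding table per language for the same IDs. A minimal sketch, assuming illustrative tables (the French words are placeholders, and word-for-word decoding ignores grammar, which is where something like UNL would have to help):

    # One decoding table per language, keyed by the same 6-bit codes.
    english = {"011000": "I",  "000000": "love", "001000": "golf"}
    french  = {"011000": "je", "000000": "aime", "001000": "golf"}

    bits = "011000" + "000000" + "001000"  # compressed "I love golf"
    for table in (english, french):
        print(" ".join(table[bits[i:i+6]] for i in range(0, len(bits), 6)))
    # -> "I love golf"
    # -> "je aime golf"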

Advantages

  1. It compresses to the maximum extent, reducing the size as much as possible.
  2. It can be used as a standard over the web.
  3. Translation becomes trivial.

Limitations

  1. It probably needs a lot of semantic support (perhaps from UNL, semantic chains, ...).
  2. It is limited to text documents.

By

P.GopalaKrishna
