By combining semantic clustering with Huffman coding, a compression algorithm can be built that should reduce the storage requirements of text documents. In this method, each word of the document is given a unique ID of the form XY, where X is the binary bits representing the cluster ID to which the word belongs and Y is the binary bits representing the word within that cluster.
That is, a document on a topic such as sports contains sports-related words spread throughout it, with some common words filling the gaps between them.
Now, the concept is:
{ for cluster sports = 000: golf = 000, football = 001, ... }
{ for cluster verbs = 001: is = 000, was = 001, ... }
{ for cluster comparison = 010: more = 000, ... }

Now, the actual ID of a word in the document is given by { the cluster ID to which it belongs, concatenated with the word's unique ID within the cluster }, which is unique throughout the document.
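As a rough sketch of this lookup in Python (the `clusters` table and the 3+3 bit split are just the toy values above, not a fixed design):

```python
# Hypothetical cluster table: cluster-ID bits plus word -> word-ID bits.
# The 3 + 3 bit split is just this toy example's assumption.
clusters = {
    "sports": ("000", {"golf": "000", "football": "001"}),
    "verbs": ("001", {"is": "000", "was": "001"}),
    "comparison": ("010", {"more": "000"}),
}

def word_code(word):
    """Concatenate the cluster-ID bits with the word-ID bits."""
    for cluster_bits, words in clusters.values():
        if word in words:
            return cluster_bits + words[word]
    raise KeyError(f"{word!r} is not in any cluster")

print(word_code("golf"))  # -> 000000
```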
Thus the word "golf", wherever it appears in the document, gets a unique code: 000 000 { sports, golf }.
Thus the word GOLF, which requires 4*8 = 32 bits as plain 8-bit characters, can be represented in 6 bits. The word IS, which requires 2*8 = 16 bits, can also be represented in 6 bits.
Consider the sentence: "I love golf because many people love it." One can identify the clusters as:
Emotion = { love, ... }
Sports = { golf }
Relative = { many }
Personals = { I, people, it }
Connectors = { because }

Actually, the name of a cluster is not important, as long as the clustering can be done effectively. Now, classify the words of the above sentence and give each a unique ID within its cluster. The words then get IDs with respect to their clusters as follows:
Cluster = ID: Word = ID
Emotion = 000: love = 000
Sports = 001: golf = 000
Relative = 010: many = 000
Personals = 011: I = 000, people = 001, it = 010
Connectors = 100: because = 000

The unique IDs of the words are then: love = 000000, golf = 001000, many = 010000, I = 011000, people = 011001, it = 011010, because = 100000.

The sentence is coded as:
I Love Golf Because Many People Love It. = total 32 characters (letters only) = 256 bits
011000 000000 001000 100000 010000 011001 000000 011010 = total 48 bits

Thus the above string of 256 bits can be compressed to 48 bits.
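A minimal sketch of this encoding step, assuming the code table derived above (the function names are illustrative):

```python
# Word -> 6-bit code table, as derived above (layout assumed).
codes = {
    "i": "011000", "love": "000000", "golf": "001000",
    "because": "100000", "many": "010000",
    "people": "011001", "it": "011010",
}

def encode(sentence):
    """Concatenate the 6-bit codes of the sentence's words."""
    words = sentence.lower().rstrip(".").split()
    return "".join(codes[w] for w in words)

def decode(bits, width=6):
    """Split the bit string into fixed-width codes and look each up."""
    reverse = {code: word for word, code in codes.items()}
    return " ".join(reverse[bits[i:i + width]]
                    for i in range(0, len(bits), width))

bits = encode("I love golf because many people love it.")
print(len(bits))     # 48
print(decode(bits))  # i love golf because many people love it
```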
Moreover, the 48 bits might be compressed further using a run-length encoding algorithm, since the stream consists of just 1s and 0s; depending on the run structure, this may reduce the size to still fewer bits.
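For illustration, a naive run-length encoder over such a bit string (here applied to the first 18 bits of the encoded sentence); whether it actually shrinks a given 48-bit string depends entirely on its runs:

```python
from itertools import groupby

def rle(bits):
    """Collapse a bit string into (bit, run-length) pairs."""
    return [(b, len(list(g))) for b, g in groupby(bits)]

print(rle("011000000000001000"))
# -> [('0', 1), ('1', 2), ('0', 11), ('1', 1), ('0', 3)]
```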
In the above example, both the cluster ID and the word ID used 3 bits each. That allows 8 different clusters and 8 different words within each cluster, i.e., a total of 64 different words representable with 6 bits. That is far too few.
However, using too many bits for the representation is not advisable either: as the above example shows, the one-letter word "I" used 6 bits, while the seven-letter word "because" also used 6 bits, so every word pays the same fixed cost regardless of length.
An optimum would be to use 16 bits for the final ID (cluster ID and word ID put together), and to maintain a trade-off in the ratio of cluster-ID bits to word-ID bits within those 16 bits.
If the cluster ID takes 3 bits, then the word ID takes the remaining 13 bits, resulting in 8 clusters, each with the capacity to hold 2^13 = 8192 words. If instead the cluster ID takes 8 bits, then 256 clusters, each with a capacity of 256 words, will result.
Increasing the number of cluster-ID bits reduces each cluster's capacity to hold words but increases the flexibility to divide the vocabulary into more clusters. Reducing the number of cluster-ID bits gives more space within each room but fewer rooms, which might result in ineffective classification.
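A quick check of this arithmetic for a 16-bit ID (a sketch using the 3-bit and 8-bit splits from above):

```python
TOTAL_BITS = 16  # final ID size assumed above

for cluster_bits in (3, 8):
    word_bits = TOTAL_BITS - cluster_bits
    print(f"{cluster_bits}-bit cluster ID: "
          f"{2 ** cluster_bits} clusters x {2 ** word_bits} words each")
# 3-bit cluster ID: 8 clusters x 8192 words each
# 8-bit cluster ID: 256 clusters x 256 words each
```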