View Full Version : Compression scripts
krycek
11-16-2002, 09:47 AM
OK, does anyone know where I can find the source for some good compression algorithms?
I have some scripts, but they are in C, and I do not really want to go through and translate them. I need the scripts in JavaScript or PHP really, but if you know a good script I will translate it :)
The scripts I currently have only use Huffman-based encoding, which is fair enough but there are better methods out there. One variant I have is LBE which looks useful but, hopefully a better one will come along. Oh and I have LZ based too, but as I want a string compressor for 'general purpose' use, LZ is no good to me at all.
Also note that a file-based script is no good, well I can take the core of the compression routine out but I would prefer a simple script which just processes strings.
When finished, the scripts will become OpensSource, so the code will be accessible to all.
I cannot use the built-in PHP compression as this has to work on the server and the client, hence both PHP and JavaScript.
Cheers for any suggestions, links, scripts, or other help :D
::] krycek [::
beetle
11-16-2002, 04:26 PM
PHP and javascript are both C-based...why not just port over something you've got?
krycek
11-16-2002, 06:18 PM
Well, that's what I am doing right now :)
I soon found that I have to bite the bullet and port my other-language scripts, because there is nothing available in JS or PHP to do this at the moment...
I will post the source somewhere when I finish :)
::] krycek [::
beetle
11-16-2002, 06:20 PM
Cool, I'd like to see those
krycek
11-16-2002, 09:53 PM
Ok, sure :)
There are two methods below - one is a purely linear method that looks for repeated characters, and the second is more complex, it works on a user-defined index (such as a list of html/javascript keywords) and replaces them with a token. It also builds an index so it can put itself back together :)
One slight problem... the second method can only deal with chunks that are 2 ^ bits in length... as the max seems to be 16 bit, the limit is essentially 65Kb for each index.
I am currently experimenting with a way to split the string into chunks and then recombine, each with it's own index, but it is a little tricky. I think I am nearly there, though, and I will post the enhanced code shortly.
Notice that when used together, these methods can acheive quite a respectable level of compression, however they are very use-specific. The first method especially is not all that valuable if there is not much repeating data, because it can actually increase the size instead of compress.
I have five algorithms on my list, each one I have to code essentially from scratch but thankfully I have some old C code for the more complex transformations that I will be working on later.
I hope to use the five in a general compression routine:
1. ITE (Indexed Token Encode) the string (if required) in order to remove common keywords
2. RLE (Run Length Encode) the string (removes repeating data ready for step 3)
3. Apply BWT (Burrows-Wheeler Transformation) to the string
4. MTF (Move To Front) transformation, which readies the string for another pass of...
5. RLE again, because there will be more repeating data
6. ARI (order-0 ARIthmetic encoding) complex method applied to the by now almost random data
My experiments so far suggest that a PHP/HTML/JavaScript file could be compressed by up to 60%. I think that makes it all worthwile :)
If anyone wants to help me out, please say so!
Also, everyone is of course free to reuse the code, however it is my property under GNU licensing and is to form part of my API (public launch very soon ;)) If you use my code, please just drop my an email with some feedback etc. :)
Enjoy!
//----- RLE (Run Length Encode) String (Linear Compression Algorithm) --------------------------------------------------
String.prototype.rle = function(bits) {
if (!bits) bits = 8
var maxSequence = Math.pow(2, bits) - 1
var thisCharCode = 0
var lastCharCode = 0
var output = ""
for (i = 0; i < this.length; i++) {
thisCharCode = this.charCodeAt(i)
output += String.fromCharCode(thisCharCode)
if (thisCharCode == lastCharCode) {
var count = 1
while (count < maxSequence && i < this.length) {
i++
thisCharCode = this.charCodeAt(i)
if (thisCharCode == lastCharCode) {
count++
} else {
break
}
}
output += String.fromCharCode(count)
if (count != maxSequence && thisCharCode >= 0) {
output += String.fromCharCode(thisCharCode)
}
}
lastCharCode = thisCharCode
}
return output
}
String.prototype.unrle = function() {
var thisCharCode = 0
var lastCharCode = 0
var output = ""
for (i = 0; i < this.length; i++) {
thisCharCode = this.charCodeAt(i)
output += String.fromCharCode(thisCharCode)
if (thisCharCode == lastCharCode) {
i++
var count = this.charCodeAt(i)
while (count-- > 1) {
output += String.fromCharCode(thisCharCode)
}
}
lastCharCode = thisCharCode
}
return output
}
//----- ITE (Indexed Token Encode) String (Indexed Compression Algorithm) ----------------------------------------------
String.prototype.ite = function(tokenIndex, bits) {
if (!bits) bits = 8
var maxSequence = Math.pow(2, bits) - 1
// Pretend array
tokenIndex = new Array("function", "alert")
// -------------
dataIndex = new Array() // Stores the token code and position
var parseString = this.substr(0) // Copies the string
// Loop through the tokens given by the user, one at a time
for (i = 0; i < tokenIndex.length; i++) {
var tokenNotFound = 0
// Search for the current token within the string
while (!tokenNotFound) {
// Repeat until the token is not found anymore
tokenPos = parseString.indexOf(tokenIndex[i])
if (tokenPos >= 0) {
// The token was found!
// Insert the token code and the position into the data index
dataIndex[dataIndex.length] = String.fromCharCode(i + 1, tokenPos)
// Remove the token that was found, from the string
parseString = parseString.substring(0, tokenPos)
+ parseString.substr(tokenPos + tokenIndex[i].length)
} else {
// The token was not found!
tokenNotFound = 1
}
}
}
// All the tokens have been searched for. The next step is to build the index into the string
// Add the number of tokens found
var finalString = String.fromCharCode(dataIndex.length + 1)
for (i = 0; i < dataIndex.length; i++) {
finalString += dataIndex[i]
}
finalString += parseString
return finalString
}
String.prototype.unite = function(tokenIndex, bits) {
if (!bits) bits = 8
var maxSequence = Math.pow(2, bits) - 1
// Pretend array
tokenIndex = new Array("function", "alert")
// -------------
var countIndexes = 0 // The total number of indexes
dataIndex = new Array() // Stores the token code and position
var parseString = this.substr(0) // Copies the string
// Look at the first character in the string to find out how many items are in the data index
// Loop through the string pulling out each data index in turn
for (i = 1; i < parseString.charCodeAt(0); i++) {
dataIndex.push(parseString.substr((i * 2) - 1, 2))
}
// Remove the data index from the string
parseString = parseString.substr((dataIndex.length * 2) + 1)
// Loop through the data index in reverse, re-inserting each token into the string at the designated place
for (i = dataIndex.length - 1; i >= 0; i--) {
// Look at the current token code and position stored in the data index
// Insert the token that was removed
parseString = parseString.substring(0, dataIndex[i].charCodeAt(1))
+ tokenIndex[dataIndex[i].charCodeAt(0) - 1]
+ parseString.substr(dataIndex[i].charCodeAt(1))
}
return parseString
}
By the way, apologies that the code is not perfect, because I only started these routines today and I am still developing them :)
::] krycek [::
krycek
11-16-2002, 11:37 PM
OK, finished :)
The latest version of the ITE script can now deal with any string size, nevermind what bit length is specified.
It does this by splitting a long string into chunks of the maximum allowed size, compressing them, then recombining them. At the other end, the decompressed does the same in reverse.
So... I know it's a long piece of code, but it has changed quite a bit so I will re-post it in it's entirity. It is ready to use.
By the way, some parts of the code may look odd, but I have endeavoured to avoid using char code 0 (null) as this is a bad code to use! It is not an actual character therefore would mess everything up.
I have created a 260k html/javascript file to test this with, and I compared the output with the input and they were identical, however that does not mean that this script is 100% bug-free, so try not to use it for anything mission-critical! I could not break the script no matter what I tried, also I acheived a good level of compression with just a few keywords in the dictionary.
Obviously the dictionary (by that I mean the token index) can be user-defined, and would also have to be downloaded - but only once. So, not a big problem :)
...feedback, anyone...? Beetle? Joh6nn? etc. etc.
//----- ITE (Indexed Token Encode) String (Indexed Compression Algorithm) ----------------------------------------------
String.prototype.ite = function(tokenIndex, bits) {
if (!bits) bits = 8
var maxSequence = Math.pow(2, bits) - 1
segments = new Array() // Stores chunks of data
var parseString = this.substr(0) // Copies the string
var finalString = String.fromCharCode(bits) // Stores the bit length
// Check the length of the string. If it is more than the allowable maximum, then split it up.
while (parseString) {
if (parseString.length > maxSequence) {
segments.push(parseString.substr(0, maxSequence - 1))
parseString = parseString.substr(maxSequence - 1)
} else {
segments.push(parseString.substr(0))
parseString = ""
}
}
// Loop through the segments and process each in turn
for (s = 0; s < segments.length; s++) {
dataIndex = new Array() // Stores the token code and position
// Loop through the tokens given by the user, one at a time
for (i = 0; i < tokenIndex.length; i++) {
var tokenNotFound = 0
// Search for the current token within the string
while (!tokenNotFound) {
// Repeat until the token is not found anymore
tokenPos = segments[s].indexOf(tokenIndex[i])
if (tokenPos >= 0) {
// The token was found!
// Insert the token code and the position into the data index
dataIndex[dataIndex.length] = String.fromCharCode(i + 1, tokenPos)
// Remove the token that was found, from the string
segments[s] = segments[s].substring(0, tokenPos)
+ segments[s].substr(tokenPos + tokenIndex[i].length)
} else {
// The token was not found!
tokenNotFound = 1
}
}
}
// All the tokens have been searched for. The next step is to build the index into the string
// Add the number of tokens found
parseString = String.fromCharCode(dataIndex.length + 1)
for (i = 0; i < dataIndex.length; i++) {
parseString += dataIndex[i]
}
parseString += segments[s]
// Add the length of the compressed segment, and then the segment
finalString += String.fromCharCode(parseString.length + 1)
+ parseString
}
return finalString
}
String.prototype.unite = function(tokenIndex) {
bits = this.charCodeAt(0)
segments = new Array() // Stores chunks of data
var parseString = this.substr(1) // Copies the string
var finalString = ""
// Check through the string and split it into the appropriate segments.
while (parseString) {
segments.push(parseString.substr(1, parseString.charCodeAt(0) - 1))
parseString = parseString.substr(parseString.charCodeAt(0))
}
// Loop through the segments and process each in turn
for (s = 0; s < segments.length; s++) {
dataIndex = new Array() // Stores the token code and position
// Look at the first character in the string to find out how many items are in the data index
// Loop through the string pulling out each data index in turn
for (i = 1; i < segments[s].charCodeAt(0); i++) {
dataIndex.push(segments[s].substr((i * 2) - 1, 2))
}
// Remove the data index from the string
segments[s] = segments[s].substr((dataIndex.length * 2) + 1)
// Loop through the data index in reverse, re-inserting each token into the string at the designated place
for (i = dataIndex.length - 1; i >= 0; i--) {
// Look at the current token code and position stored in the data index
// Insert the token that was removed
segments[s] = segments[s].substring(0, dataIndex[i].charCodeAt(1))
+ tokenIndex[dataIndex[i].charCodeAt(0) - 1]
+ segments[s].substr(dataIndex[i].charCodeAt(1))
}
finalString += segments[s]
}
return finalString
}
Please let me know if you use this script!
::] krycek [::
EDIT: I realised that there is no need to specify the bit length to the unite function, so I took that bit out. Instead, I stored the bit length as the first character of the file, for sake of information. The uncompression technique doesn't really need to know it.
krycek
11-17-2002, 01:16 AM
OK, I have done a couple of tests on the simple compression functions I posted.
I made a quick dictionary of 25 keywords, and bunged together a 160Kb file made of html, javascript, css, etc. I stuck it into a text area and compressed it, displayed the length, then uncompressed it again to check validity. (Assigning the innerHTML of the text area to a string.)
Using 8-bit indexing, the compression took less than 2 seconds on my machine, and reduced the string size from 159712 characters to 145452 (91% of original). 12-bit indexing took around 10 seconds, and reduced to 144058 (90%). 16-bit reduced only slightly more, to 143961 (90%) but took around 20 seconds.
Anything above 16-bit failed, so Joh6nn was right :)
Decompression in each case was very quick, around 20% of the compression time.
Seeing as I made a very quick dictionary, and the code I used was heavily commented, I was pleased to get 10% reduction. I expect that with optimised files (comments stripped etc.) and a better dictionary (say, 256 entries) the routine could possibly acheive as much as 20%.
Seeing as this is only intended to be a pre-parser before the string is hit with the more powerful algorithms, I am quite pleased with the results :)
Whatever I did, I could not break the functions. Also I left the PC running for a while on a 16Mb string (simply the 160Kb string copied and pasted 100 times) and then used a file comparison tool to compare the output - identical. It did take around two minutes to process, though!
Well if anyone can break it, please do...
::] krycek [::
krycek
11-17-2002, 07:54 PM
Has anyone had the time to test these functions at all?
::] krycek [::
vBulletin® v3.8.2, Copyright ©2000-2012, Jelsoft Enterprises Ltd.