VOGONS


OT and Obsolete: Better text compression?

Topic actions

Reply 40 of 61, by BitWrangler

User metadata
Rank l33t++
Rank
l33t++

All you have to do is maintain progress of 3% better every day, and we'll be able to get the text version of wikipedia on a floppy disk by the end of next year.

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 41 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I found the bug, and now, I'm getting 113 bytes and 54.5%. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 42 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I'm starting to seriously debug now. Some recent bug fixes cost me, and yesterday evening, I was at 142 bytes and 42.8% and just gained 5 bytes and 2%. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 43 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I got it to work so far, but I broke the compression ratio. 🙁 The problem is that I'm always using a bit to determine that a space follows. I just need the program to remember the last character of a token and print a space if the first character is also a letter.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 44 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

It works better than the latest version of PrintTok2 that honors all or nearly all possible literals but worse than my modification of Z-Machine. BTW, assuming the space only saved 2 bytes.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 45 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I was wrong again: now, it's doing using Toldo's ideas 154 bytes and 38.0%, but I'm not finished with it.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 46 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Not compressing words that occur less than 3 times or are only one character in length helps, and the latter just gave me 7 bytes and 2.8%. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 47 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

BTW, counting an apostrophe as a word character helped. Other characters either broke even or cost me.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 48 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I am unhappy to announce that there was an error in my calculations: I was referring to the wrong tokens when tallying the results. 🙁 Now, I'm doing 156 bytes and 37.1%.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 49 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I got it to work now. It's printing both tokens and strings properly, but some bug fixes cost me. I'm now at 166 and 33.1%. 🙁

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 50 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

My version of Z-Machine is doing 161 bytes and 35.4%, and another approach 177 bytes and 29.0%.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 51 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Instead of assuming a space and writing a space otherwise, a bit after each word or punctuation mark specifying that a space follows seems to work.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 52 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

My rendition of Z-Machine is working so far now, but I want to buy some more points there by the end of tomorrow. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 53 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I was totally wrong, and I'm ecstatic! It's working now, and the numbers I'm getting now are 93 bytes compressed and 62.5% compressibility. 😀 I want 65% or better by the end of tomorrow.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 54 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I'm at 84 bytes and 66.2% right now. My modifications include shortening tokens to just as many bits as are needed, a bit after each token to determine if a space follows, compression of tokens and treating certain punctuations as letters/numbers.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 55 of 61, by BitWrangler

User metadata
Rank l33t++
Rank
l33t++

I was wondering if you could do anything with binary chain codes, https://www.tinaja.com/text/chain01.html which are sequences where a string of a given length in them is a unique value at any point, such that bit patterns are not repeated. So you've effectively got the 7bit ascii values of one char in 7bits, 2 in 8 bits, 3 in 9 bits, 4 in in 10 bits etc depending on offset. Then maybe do something like a ETOAINSHRDLU... letter frequency distribution from the center outwards, so displacement from center, or jump to next letter is what you store.

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 56 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Umm...I was wrong about my text compression techniques: some bug-fixes cost me big time, and I don't know what I'm doing wrong. 🙁 I'm guessing I'm not handling tokenization well. The following are some code snippets responsible for gathering and sorting tokens, written in C:

struct tok2buf {
char *token;
//struct string_buffer* firstoccur;
unsigned len;
unsigned occur;
int saved;
struct string_buffer* cu;
} tok2buf[1024];

struct tokxlat {
char* text;
unsigned len;
} tokens2[1024];

unsigned tok2bufsize=0;

unsigned parsetokens1main (void)
{
unsigned i, j, k, l, m;
unsigned curstrlen, curtoklen;
unsigned newtok;
int bestcurtok, bestcurtoklen;
char *besttokptr;
curstr=strings;
while (curstr) {
curstrlen=curstr->inlen;
for (i=0; curstrlen>=1 && i<curstrlen;) {
if (0 && !tok2bufsize) {
} else {
bestcurtok=-1; newtok=1; bestcurtoklen=3;m=0;
for (j=k=0; j<tok2bufsize; j++) {
if (!memcmp(&curstr->in[i], tok2buf[j].token, tok2buf[j].len)) {
l=getlenmatch(tok2buf[j].token, &curstr->in[i]);
if (l>=bestcurtoklen && l==tok2buf[j].len) {
newtok=0;
bestcurtok=j;
bestcurtoklen=l;
} else if (l>bestcurtoklen) {
newtok=1; m=1;
bestcurtok=j;
bestcurtoklen=l;
}
}
} if (newtok) {
addtotok1 (&curstr->in[i], bestcurtoklen);
} else if (bestcurtok>=0) {
tok2buf[bestcurtok].occur++;
}
} //i+=bestcurtoklen;
if (bestcurtoklen>=4) i+=bestcurtoklen;
else i++;
}
curstr=curstr->next;
}
sorttokens1();
compresstoks();
compresstoks2();
curstring=strings;
while (curstring) {
complit_5a();
Show last 117 lines
		curstring=curstring->next;
}
writetokstofile();
writestrs();
return 0;
}

void sorttokens1 (void)
{
unsigned i, j, k, l;
struct tok2buf tmpswaptok;
struct tokxlat t;
char c[64];
char* c2;
for (i=0; i<tok2bufsize; i++) {
tok2buf[i].saved=((tok2buf[i].len)*(tok2buf[i].occur))-(tok2buf[i].len);
if (tok2buf[i].len<5 || tok2buf[i].occur<3) tok2buf[i].saved=0;
}
for (i=0; i<tok2bufsize-1; i++) {
k=i;
for (j=i; j<tok2bufsize; j++) if (tok2buf[j].saved>tok2buf[k].saved) k=j;
if (k!=i) {
memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
}
} for (i=0; i<tok2bufsize; i++) {
if (tok2buf[i].saved<11) {tok2bufsize=i; break;}
}
collecttokens();
curstr=strings;
while (curstr) {
curstrlen=curstr->inlen;
for (i=0; curstrlen>=1 && i<curstrlen;) {
if (0 && !tok2bufsize) {
//addtotok1(curstr->in, 3);
} else {
bestcurtok=-1; newtok=1; bestcurtoklen=3;m=0;
for (j=k=0; j<tok2bufsize; j++) {
if (!memcmp(&curstr->in[i], tok2buf[j].token, tok2buf[j].len)) {
//bestcurtok=j; bestcurtoklen=tok2buf[j].len;
l=getlenmatch(tok2buf[j].token, &curstr->in[i]);
if (tok2buf[j].cu==curstr) {
// if (tok2buf[j].len+l>i) //continue;
// l=tok2buf[j].len-i;
//if (tok2buf[j].token+l>=&curstr->in[i]) l=&curstr->in[i]-tok2buf[j].token;
}
//if ((int)(l=getlenmatch(tok2buf[j].token, &curstr->in[i]))>bestcurtoklen) {
if (l>=bestcurtoklen && l==tok2buf[j].len) {
newtok=0;
bestcurtok=j;
bestcurtoklen=l;
} else if (l>bestcurtoklen) {
newtok=1; m=1;
bestcurtok=j;
bestcurtoklen=l;
}
}
} if (newtok) {
addtotok1 (&curstr->in[i], bestcurtoklen);
} else if (bestcurtok>=0) {
tok2buf[bestcurtok].occur++;
}
} //i+=bestcurtoklen;
if (bestcurtoklen>=4) i+=bestcurtoklen;
else i++;
}
curstr=curstr->next;
}
sorttokens1();
compresstoks();
compresstoks2();
curstring=strings;
while (curstring) {
complit_5a();
curstring=curstring->next;
}
writetokstofile();
writestrs();
return 0;
}

void sorttokens1 (void)
{
unsigned i, j, k, l;
struct tok2buf tmpswaptok;
struct tokxlat t;
char c[64];
char* c2;
for (i=0; i<tok2bufsize; i++) {
tok2buf[i].saved=((tok2buf[i].len)*(tok2buf[i].occur))-(tok2buf[i].len);
if (tok2buf[i].len<5 || tok2buf[i].occur<3) tok2buf[i].saved=0;
}
// for (i=0; i<tok2bufsize; i++) {
// tok2buf[i].saved=((tok2buf[i].len)*(tok2buf[i].occur)+1);
// if (tok2buf[i].len<4 || tok2buf[i].occur<3) tok2buf[i].saved=0;
// //if (tok2buf[i].len<3) tok2buf[i].saved=0;
// }
for (i=0; i<tok2bufsize-1; i++) {
k=i;
for (j=i; j<tok2bufsize; j++) if (tok2buf[j].saved>tok2buf[k].saved) k=j;
if (k!=i) {
memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
// memcpy (&t, &tokens2[i], sizeof(tmpswaptok));
// memcpy (&tokens2[i], &tokens2[k], sizeof(tmpswaptok));
// memcpy (&tokens2[k], &t, sizeof(tmpswaptok));
// c2=tokens[i]; tokens[i]=tokens[k]; tokens[k]=c2;
}
} for (i=0; i<tok2bufsize; i++) {
if (tok2buf[i].saved<11) {tok2bufsize=i; break;}
}
collecttokens();
if (tok2bufsize>=129) tok2bufsize=128;
}

The problem seems to be too few tokens, but, when I decrease the condition to "saved," I get more tokens, but the compression ratio suffers. I don't know what I'm doing wrong. 🙁

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 57 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Good news! Over the past two hours, I gained about 8.1% compressibility with my variation of Toldo's technique on my text adventure's rooms description, but I need to debug it, but first, I want to buy some more points. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 58 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I'm sorry, but at the time, I was wrong. However, I've been debugging and optimizing. I found that one cause of the numbers I was receiving was skipping every other character, because I was advancing twice instead of once. Then, I was doing very poorly. The main reason was that I was trying to compress the EOS. Right now, the numbers are exceptional. 😀 And it works! 😀 I need to decompress compressed tokens and actually writing the compressed tokens: right now, the figures are calculated with tokens compressed, but the tokens are actually not compressed. If I can get the tokens to decompress and efficiently, I plan to let people try it out and benchmark it and tell me how it stacks up. It's currently for cc65, though, but I plan to target other compilers and text adventure creation systems and actual text files.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 59 of 61, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Maybe the numbers I was getting were due to a syntax error in the example test file causing much of one string from compiling correctly, as after the correction, I was doing horrible. 🙁 I've gotten it to work several times, but the compression ratio was horrible. 🙁 While it's doing its job, it's doing it very poorly. 🙁 I believe my problem is with tokenization, as I'm getting way too few tokens. My modifications to Toldo's design is the best, but even it is doing poorly. Following is the code I'm using to collect the tokens in a version not based on Toldo's design:

unsigned parsetokens1main (void)
{
unsigned i, j, k, l, m;
unsigned curstrlen, curtoklen;
unsigned newtok;
int bestcurtok, bestcurtoklen;
//struct tok2buf * besttok[16];
char *besttokptr;
curstr=strings;
while (curstr) {
curstrlen=curstr->inlen;
//curstr->seg=curseg;
for (i=0; curstrlen>=1 && i<curstrlen;) {
if (0 && !tok2bufsize) {
//addtotok1(curstr->in, 3);
} else {
bestcurtok=-1; newtok=1; bestcurtoklen=3;m=0;
for (j=k=0; j<tok2bufsize; j++) {
if (!memcmp(&curstr->in[i], tok2buf[j].token, tok2buf[j].len)) {
//bestcurtok=j; bestcurtoklen=tok2buf[j].len;
l=getlenmatch(tok2buf[j].token, &curstr->in[i]);
if (tok2buf[j].cu==curstr) {
// if (tok2buf[j].len+l>i) //continue;
// l=tok2buf[j].len-i;
//if (tok2buf[j].token+l>&curstr->in[i]) l=&curstr->in[i]-tok2buf[j].token;
}
//if ((int)(l=getlenmatch(tok2buf[j].token, &curstr->in[i]))>bestcurtoklen) {
if (l>=bestcurtoklen && l==tok2buf[j].len) {
newtok=0;
bestcurtok=j;
bestcurtoklen=l;
} else if (l>bestcurtoklen) {
putchar('.');
newtok=1; m=1;
bestcurtok=j;
bestcurtoklen=l;
}
}
} if (newtok) {
addtotok1 (&curstr->in[i], bestcurtoklen);
} else if (bestcurtok>=0 && bestcurtoklen>=4) {
tok2buf[bestcurtok].occur++;
}
} //i+=bestcurtoklen;
if (bestcurtoklen>=4) i+=bestcurtoklen;
else i++;
printf ("<%d>\n", bestcurtoklen);
}
curstr=curstr->next;
}
printf ("# tokens before sort: %d\n", tok2bufsize);
sorttokens1();
compresstoks();
//compresstoks2();
//compresstoksbpe();
curstring=strings;
while (curstring) {
complit_5a();
curstring=curstring->next;
}
Show last 5 lines
	writetokstofile();
writestrs();
return 0;
}

and sorting the tokens:

void sorttokens1 (void)
{
unsigned i, j, k, l;
struct tok2buf tmpswaptok;
unsigned char c[64];
for (i=0; i<tok2bufsize; i++) {
tok2buf[i].saved=((tok2buf[i].len)*(tok2buf[i].occur+1));
if (tok2buf[i].occur<6 || tok2buf[i].len<3) tok2buf[i].saved=0;
}
for (i=0; i<tok2bufsize-1; i++) {
k=i;
for (j=i+1; j<tok2bufsize; j++) if (tok2buf[j].saved>tok2buf[k].saved) k=j;
if (k!=i) {
memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
}
}
for (i=0; i<tok2bufsize; i++) {
//tok2buf[i].saved=((tok2buf[i].len-1)*(tok2buf[i].occur))-(tok2buf[i].len+1);
memcpy (c, tok2buf[i].token, tok2buf[i].len);
c[tok2buf[i].len]=0;
printf (" Token# %d: \"%s\", Occur %d\n", i, c, tok2buf[i].occur);
}
getchar();
for (i=0; i<tok2bufsize; i++) {
if (tok2buf[i].saved<1) {tok2bufsize=i; break;}
} if (tok2bufsize>128) tok2bufsize=128;
printf ("# tokens after sort: %d\n", tok2bufsize);
collecttokens();
puts ("b");
}

I'm using ANSI-compliant C.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community