Write a test file where you already know what the outputs should be, like "aaaabbbbcccc". Then keep building new, more complex tests where you know what output to expect; when you don't get it, you can work out why.
Thank you. The program right now is displaying a lot of diagnostic information on execution and shows only two tokens after sorting. That can't be right, since the text compiles to 1,370 bytes uncompressed. Earlier in the process, the raw token counts were more realistic, though.
Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community
I took a better look at PrintTok2's output earlier today and confirmed that the tokens gathered in Pass 1 are realistic, but after the sort I'm getting very few tokens. 🙁 Every tweak I applied to the sort function, even the ones that gave more tokens, cost me overall compressibility. I really don't know what's wrong with the function.
If you had already designed your text adventure, along with all the words that you used, you would be able to experiment and generate a table like this covering how often various characters or words were used.
A method of using less space under specific circumstances would be to limit your character table to 4 bits, with the 16th code being an escape to the others. This approach only saves space if you use automatic capitalization and punctuation along with a limited character set where certain characters are heavily favored and symbols and numerals aren't used.
Natural language is very favorable to this but game language may have a different distribution.
The only way to get the absolute smallest file size is to have a completed game and then to apply a “bespoke” elegant solution to your specific dataset.
Generic solutions are more of a modern solution to large datasets.
Vintage games that fit on a 5 1/4 floppy needed to be creative if they had an immense wall of text.
It is hard to beat uncompressed 5-bit text packed three to a 16-bit word with a flag if your text is varied and non-repetitive.
The more immense your game is, the more complex the compression methods can become while still saving space. The complex methods don't work on a short, simple, repetitive game.
I have several versions of PrintTok2, including one that uses a modified version of the 5-bit method you specified. I also have a couple that favor more-often-occurring letters. I have the text adventure's text mostly ready for compression but had to manually decompress it, as I was using PrintTok1, which required manual compression. The problem I'm having is that I'm not getting enough tokens, and it seems to be in the sort function. I can post the code if you want.
Good news: over the course of today, I gained a little more than 40 bytes of compressibility in one of my versions of PrintTok2: the one based on Toldo's design. However, it's still not doing all that well: on 1,370 bytes of text compiled, I'm getting 654 bytes compressed, while Deflate (zip files) gives me 646 bytes. 🙁 The main gain is from the use of a static form of BPE. 😀
I will post some of my code now, in the hopes that somebody here can tell me what I'm doing wrong. Following is the sort function from my modified version of Toldo's technique:
void sorttokens1 (void)
{
    unsigned i, j, k, l;
    struct tok2buf tmpswaptok;
    unsigned char c[64];

    for (i=0; i<tok2bufsize; i++) {
        tok2buf[i].saved=1;
        if (tok2buf[i].occur<6) tok2buf[i].saved=0;
    }
    for (i=0; i<tok2bufsize-1; i++) {
        k=i;
        for (j=i+1; j<tok2bufsize; j++)
            if (tok2buf[j].saved>tok2buf[k].saved) k=j;
        if (k!=i) {
            memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
            memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
            memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
        }
    }
    for (i=0; i<tok2bufsize; i++) {
        //tok2buf[i].saved=((tok2buf[i].len-1)*(tok2buf[i].occur))-(tok2buf[i].len+1);
        memcpy (c, tok2buf[i].token, tok2buf[i].len);
        c[tok2buf[i].len]=0;
        printf (" Token# %d: \"%s\", Occur %d\n", i, c, tok2buf[i].occur);
    }
    getchar();
    for (i=0; i<tok2bufsize; i++) {
        if (tok2buf[i].saved<1) {tok2bufsize=i; break;}
    }
    if (tok2bufsize>128) tok2bufsize=128;
    printf ("# tokens after sort: %d\n", tok2bufsize);
    collecttokens();
    puts ("b");
}
The problem is that, with every version of PrintTok2 other than the modification of Toldo's design, I'm getting far too few tokens, and every attempt I made to increase the number of tokens resulted in poorer compressibility. Also, even my modified Toldo does a little too poorly. I don't know why this is.
Good news: I finally gained some ground with text compression! 😁 It turns out that I was using the wrong technique to get the tokens. I'm currently only collecting the results after tokenization and before my other techniques. There, I gained 2.1% yesterday evening. I still have a lot of work to do.