OT and Obsolete: Better text compression?

Reply 40 of 67, by BitWrangler

Posted on 2024-11-06, 14:08

Rank: l33t++
Posts: 8225
Joined: 2017-10-11, 00:55
Location: Ontario

All you have to do is keep improving by 3% every day, and we'll be able to get the text version of Wikipedia on a floppy disk by the end of next year.

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 41 of 67, by Harry Potter

Posted on 2024-11-06, 14:59

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I found the bug, and now, I'm getting 113 bytes and 54.5%. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 42 of 67, by Harry Potter

Posted on 2024-11-07, 13:32

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I'm starting to seriously debug now. Some recent bug fixes cost me, and yesterday evening, I was at 142 bytes and 42.8% and just gained 5 bytes and 2%. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 43 of 67, by Harry Potter

Posted on 2024-11-07, 23:46

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I got it to work so far, but I broke the compression ratio. 🙁 The problem is that I'm always using a bit to determine that a space follows. I just need the program to remember the last character of a token and print a space if the first character of the next token is also a letter.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 44 of 67, by Harry Potter

Posted on 2024-11-08, 00:22

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

It works better than the latest version of PrintTok2 that honors all or nearly all possible literals, but worse than my modification of Z-Machine. BTW, assuming the space saved me only 2 bytes.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 45 of 67, by Harry Potter

Posted on 2024-11-08, 23:58

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I was wrong again: now, using Toldo's ideas, it's doing 154 bytes and 38.0%, but I'm not finished with it.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 46 of 67, by Harry Potter

Posted on 2024-11-09, 00:04

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

Not compressing words that occur fewer than 3 times or are only one character long helps, and the latter just gave me 7 bytes and 2.8%. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 47 of 67, by Harry Potter

Posted on 2024-11-09, 00:16

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

BTW, counting an apostrophe as a word character helped. Other characters either broke even or cost me.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 48 of 67, by Harry Potter

Posted on 2024-11-09, 13:25

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I am unhappy to announce that there was an error in my calculations: I was referring to the wrong tokens when tallying the results. 🙁 Now, I'm doing 156 bytes and 37.1%.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 49 of 67, by Harry Potter

Posted on 2024-11-09, 18:00

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I got it to work now. It's printing both tokens and strings properly, but some bug fixes cost me. I'm now at 166 bytes and 33.1%. 🙁

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 50 of 67, by Harry Potter

Posted on 2024-11-09, 18:17

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

My version of Z-Machine is doing 161 bytes and 35.4%, and another approach 177 bytes and 29.0%.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 51 of 67, by Harry Potter

Posted on 2024-11-09, 18:33

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

Instead of assuming a space and writing one explicitly otherwise, putting a bit after each word or punctuation mark to specify whether a space follows seems to work.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 52 of 67, by Harry Potter

Posted on 2024-11-09, 23:44

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

My rendition of Z-Machine is working so far now, but I want to buy some more points there by the end of tomorrow. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 53 of 67, by Harry Potter

Posted on 2024-11-11, 00:39

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I was totally wrong, and I'm ecstatic! It's working now, and the numbers I'm getting now are 93 bytes compressed and 62.5% compressibility. 😀 I want 65% or better by the end of tomorrow.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 54 of 67, by Harry Potter

Posted on 2024-11-11, 11:55

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I'm at 84 bytes and 66.2% right now. My modifications include shortening tokens to just as many bits as are needed, a bit after each token to determine if a space follows, compression of tokens, and treating certain punctuation marks as letters/numbers.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 55 of 67, by BitWrangler

Posted on 2024-11-11, 14:35

Rank: l33t++
Posts: 8225
Joined: 2017-10-11, 00:55
Location: Ontario

I was wondering if you could do anything with binary chain codes (https://www.tinaja.com/text/chain01.html), which are sequences in which any string of a given length is a unique value at every position, so that bit patterns are not repeated. So you've effectively got the 7-bit ASCII values of one char in 7 bits, 2 in 8 bits, 3 in 9 bits, 4 in 10 bits, etc., depending on offset. Then maybe do something like an ETOAINSHRDLU... letter frequency distribution from the center outwards, so displacement from center, or jump to next letter, is what you store.

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 56 of 67, by Harry Potter

Posted on 2024-11-14, 19:43

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

Umm... I was wrong about my text compression techniques: some bug fixes cost me big time, and I don't know what I'm doing wrong. 🙁 I'm guessing I'm not handling tokenization well. The following are the code snippets responsible for gathering and sorting tokens, written in C:

#include <string.h>	// memcmp, memcpy

struct tok2buf {
	char *token;
	//struct string_buffer* firstoccur;
	unsigned len;
	unsigned occur;
	int saved;
	struct string_buffer* cu;
} tok2buf[1024];

struct tokxlat {
	char* text;
	unsigned len;
} tokens2[1024];

unsigned tok2bufsize=0;

// curstr, curstring, strings, getlenmatch(), addtotok1(), collecttokens(),
// compresstoks(), compresstoks2(), complit_5a(), writetokstofile() and
// writestrs() come from other parts of the program.
void sorttokens1 (void);

// Walk every string, count matches against known tokens and add new
// candidates, then sort, compress and write everything out.
unsigned parsetokens1main (void)
{
	unsigned i, j, k, l, m;
	unsigned curstrlen, curtoklen;
	unsigned newtok;
	int bestcurtok, bestcurtoklen;
	char *besttokptr;
	curstr=strings;
	while (curstr) {
		curstrlen=curstr->inlen;
		for (i=0; curstrlen>=1 && i<curstrlen;) {
			if (0 && !tok2bufsize) {
				//addtotok1(curstr->in, 3);
			} else {
				bestcurtok=-1; newtok=1; bestcurtoklen=3;m=0;
				for (j=k=0; j<tok2bufsize; j++) {
					// Skip tokens longer than what is left of the string so
					// memcmp() never reads past its end.
					if (tok2buf[j].len<=curstrlen-i &&
					    !memcmp(&curstr->in[i], tok2buf[j].token, tok2buf[j].len)) {
						l=getlenmatch(tok2buf[j].token, &curstr->in[i]);
						if (l>=bestcurtoklen && l==tok2buf[j].len) {
							// Exact match of an existing token.
							newtok=0;
							bestcurtok=j;
							bestcurtoklen=l;
						} else if (l>bestcurtoklen) {
							// Longer match whose length differs from the stored
							// token: treat it as a new candidate.
							newtok=1; m=1;
							bestcurtok=j;
							bestcurtoklen=l;
						}
					}
				}
				if (newtok) {
					addtotok1 (&curstr->in[i], bestcurtoklen);
				} else if (bestcurtok>=0) {
					tok2buf[bestcurtok].occur++;
				}
			} //i+=bestcurtoklen;
			if (bestcurtoklen>=4) i+=bestcurtoklen;
			else i++;
		}
		curstr=curstr->next;
	}
	sorttokens1();
	compresstoks();
	compresstoks2();
	curstring=strings;
	while (curstring) {
		complit_5a();
		curstring=curstring->next;
	}
	writetokstofile();
	writestrs();
	return 0;
}

// Score each token, sort by descending score and drop the low scorers.
void sorttokens1 (void)
{
	unsigned i, j, k;
	struct tok2buf tmpswaptok;
	for (i=0; i<tok2bufsize; i++) {
		tok2buf[i].saved=((tok2buf[i].len)*(tok2buf[i].occur))-(tok2buf[i].len);
		if (tok2buf[i].len<5 || tok2buf[i].occur<3) tok2buf[i].saved=0;
	}
	// Selection sort; "i+1<tok2bufsize" avoids unsigned wrap-around when
	// tok2bufsize is 0.
	for (i=0; i+1<tok2bufsize; i++) {
		k=i;
		for (j=i+1; j<tok2bufsize; j++) if (tok2buf[j].saved>tok2buf[k].saved) k=j;
		if (k!=i) {
			memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
			memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
			memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
		}
	}
	// Cut the list at the first token that saves fewer than 11 bytes.
	for (i=0; i<tok2bufsize; i++) {
		if (tok2buf[i].saved<11) {tok2bufsize=i; break;}
	}
	collecttokens();
	if (tok2bufsize>=129) tok2bufsize=128;
}

The problem seems to be too few tokens, but when I lower the threshold on "saved," I get more tokens and the compression ratio suffers. I don't know what I'm doing wrong. 🙁

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 57 of 67, by Harry Potter

Posted on 2024-11-14, 20:59

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

Good news! Over the past two hours, I gained about 8.1% compressibility with my variation of Toldo's technique on my text adventure's room descriptions. I still need to debug it, but first, I want to buy some more points. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 58 of 67, by Harry Potter

Posted on 2024-11-29, 14:24

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

I'm sorry, but at the time, I was wrong. However, I've been debugging and optimizing. I found that one cause of the numbers I was getting was skipping every other character, because I was advancing twice instead of once. After fixing that, I was doing very poorly. The main reason was that I was trying to compress the EOS. Right now, the numbers are exceptional. 😀 And it works! 😀 I still need to decompress compressed tokens and actually write the compressed tokens: right now, the figures are calculated as if the tokens were compressed, but the tokens are actually not compressed. If I can get the tokens to decompress, and efficiently, I plan to let people try it out, benchmark it and tell me how it stacks up. It's currently for cc65, but I plan to target other compilers, text adventure creation systems and actual text files.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 59 of 67, by Harry Potter

Posted on 2024-12-29, 18:22

Rank: Oldbie
Posts: 986
Joined: 2011-06-05, 14:42
Location: New York, U.S.

Maybe the numbers I was getting were due to a syntax error in the example test file that kept much of one string from compiling correctly, as after the correction, I was doing horribly. 🙁 I've gotten it to work several times, but the compression ratio was horrible. 🙁 While it's doing its job, it's doing it very poorly. 🙁 I believe my problem is with tokenization, as I'm getting way too few tokens. My modifications to Toldo's design are the best, but even they are doing poorly. Following is the code I'm using to collect the tokens in a version not based on Toldo's design:

unsigned parsetokens1main (void)
{
	unsigned i, j, k, l, m;
	unsigned curstrlen, curtoklen;
	unsigned newtok;
	int bestcurtok, bestcurtoklen;
	//struct tok2buf * besttok[16];
	char *besttokptr;
	curstr=strings;
	while (curstr) {
		curstrlen=curstr->inlen;
		//curstr->seg=curseg;
		for (i=0; curstrlen>=1 && i<curstrlen;) {
			if (0 && !tok2bufsize) {
				//addtotok1(curstr->in, 3);
			} else {
				bestcurtok=-1; newtok=1; bestcurtoklen=3;m=0;
				for (j=k=0; j<tok2bufsize; j++) {
					// Skip tokens longer than what is left of the string so
					// memcmp() never reads past its end.
					if (tok2buf[j].len<=curstrlen-i &&
					    !memcmp(&curstr->in[i], tok2buf[j].token, tok2buf[j].len)) {
						//bestcurtok=j; bestcurtoklen=tok2buf[j].len;
						l=getlenmatch(tok2buf[j].token, &curstr->in[i]);
						if (tok2buf[j].cu==curstr) {
							// if (tok2buf[j].len+l>i) //continue;
							// l=tok2buf[j].len-i;
							//if (tok2buf[j].token+l>&curstr->in[i]) l=&curstr->in[i]-tok2buf[j].token;
						}
						//if ((int)(l=getlenmatch(tok2buf[j].token, &curstr->in[i]))>bestcurtoklen) {
						if (l>=bestcurtoklen && l==tok2buf[j].len) {
							// Exact match of an existing token.
							newtok=0;
							bestcurtok=j;
							bestcurtoklen=l;
						} else if (l>bestcurtoklen) {
							// Longer match whose length differs from the stored
							// token: treat it as a new candidate.
							putchar('.');	// debug
							newtok=1; m=1;
							bestcurtok=j;
							bestcurtoklen=l;
						}
					}
				}
				if (newtok) {
					addtotok1 (&curstr->in[i], bestcurtoklen);
				} else if (bestcurtok>=0 && bestcurtoklen>=4) {
					tok2buf[bestcurtok].occur++;
				}
			} //i+=bestcurtoklen;
			if (bestcurtoklen>=4) i+=bestcurtoklen;
			else i++;
			printf ("<%d>\n", bestcurtoklen);	// debug
		}
		curstr=curstr->next;
	}
	printf ("# tokens before sort: %u\n", tok2bufsize);	// debug
	sorttokens1();
	compresstoks();
	//compresstoks2();
	//compresstoksbpe();
	curstring=strings;
	while (curstring) {
		complit_5a();
		curstring=curstring->next;
	}
	writetokstofile();
	writestrs();
	return 0;
}

and sorting the tokens:

void sorttokens1 (void)
{
	unsigned i, j, k, l;
	struct tok2buf tmpswaptok;
	char c[64];
	for (i=0; i<tok2bufsize; i++) {
		tok2buf[i].saved=((tok2buf[i].len)*(tok2buf[i].occur+1));
		if (tok2buf[i].occur<6 || tok2buf[i].len<3) tok2buf[i].saved=0;
	}
	// Selection sort by descending "saved"; "i+1<tok2bufsize" avoids
	// unsigned wrap-around when tok2bufsize is 0.
	for (i=0; i+1<tok2bufsize; i++) {
		k=i;
		for (j=i+1; j<tok2bufsize; j++) if (tok2buf[j].saved>tok2buf[k].saved) k=j;
		if (k!=i) {
			memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
			memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
			memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
		}
	}
	// Debug listing of the sorted tokens.
	for (i=0; i<tok2bufsize; i++) {
		//tok2buf[i].saved=((tok2buf[i].len-1)*(tok2buf[i].occur))-(tok2buf[i].len+1);
		// Clamp the copy so a long token cannot overflow c[].
		l=tok2buf[i].len<sizeof(c)-1? tok2buf[i].len: sizeof(c)-1;
		memcpy (c, tok2buf[i].token, l);
		c[l]=0;
		printf (" Token# %u: \"%s\", Occur %u\n", i, c, tok2buf[i].occur);
	}
	getchar();	// debug: pause until a key is pressed
	// Drop every token whose score was zeroed above, then cap at 128.
	for (i=0; i<tok2bufsize; i++) {
		if (tok2buf[i].saved<1) {tok2bufsize=i; break;}
	}
	if (tok2bufsize>128) tok2bufsize=128;
	printf ("# tokens after sort: %u\n", tok2bufsize);	// debug
	collecttokens();
	puts ("b");	// debug
}

I'm using ANSI-compliant C.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community
