Write a test file where you already know what the outputs should be, like "aaaabbbbcccc". Then keep building new, more complex tests where you know what output to expect; when you don't get it, you can work out why.
Thank you. The program right now is displaying a lot of diagnostic information on execution and shows only two tokens after sorting. That can't be right, since the text compiles to 1,370 bytes uncompressed. Earlier in the process, the raw token counts were more realistic, though.
Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community
I took a better look at PrintTok2's output earlier today and confirmed that the tokens gathered in Pass 1 are realistic, but after the sort I'm getting very few tokens. 🙁 Every tweak I applied to the sort function, even the ones that gave more tokens, cost me overall compressibility. I really don't know what's wrong with the function.
If you had already designed your text adventure, along with all the words that you used, you would be able to experiment and generate a table like this covering how often various characters or words were used.
A method of using less space under specific circumstances would be to limit your character table to 4 bits, with the 16th code being an escape to the others. This approach only saves space if you use automatic capitalization and punctuation along with a limited character set where certain characters are heavily favored and symbols and numerals aren't used.
Natural language is very favorable to this but game language may have a different distribution.
The only way to get the absolute smallest file size is to have a completed game and then to apply a “bespoke” elegant solution to your specific dataset.
Generic solutions are more of a modern solution to large datasets.
Vintage games that fit on a 5 1/4 floppy needed to be creative if they had an immense wall of text.
It is hard to beat uncompressed 5-bit text packed three to a 16-bit word with a flag if your text is varied and non-repetitive.
The more immense your game is, the more complex the compression methods can become while still saving space. The complex methods don't work on a short, simple, repetitive game.
I have several versions of PrintTok2, including one that uses a modified version of the 5-bit method you specified. I also have a couple that favor more-often-occurring letters. I have the text adventure's text mostly ready for compression but had to manually decompress it, as I was using PrintTok1, which required manual compression. The problem I'm having is that I'm not getting enough tokens, and it seems to be in the sort function. I can post the code if you want.
Good news: over the course of today, I gained a little more than 40 bytes of compressibility in one of my versions of PrintTok2: the one based on Toldo's design. However, it's still not doing all that well: on 1,370 bytes of text compiled, I'm getting 654 bytes compressed, while Deflate (zip files) gives me 646 bytes. 🙁 The main gain is from the use of a static form of BPE. 😀
I will post some of my code now, in the hopes that somebody here can tell me what I'm doing wrong. Following is the sort function from my modified version of Toldo's technique:
void sorttokens1 (void)
{
    unsigned i, j, k, l;
    struct tok2buf tmpswaptok;
    unsigned char c[64];

    for (i=0; i<tok2bufsize; i++) {
        tok2buf[i].saved=1;
        if (tok2buf[i].occur<6) tok2buf[i].saved=0;
    }
    for (i=0; i<tok2bufsize-1; i++) {
        k=i;
        for (j=i+1; j<tok2bufsize; j++)
            if (tok2buf[j].saved>tok2buf[k].saved) k=j;
        if (k!=i) {
            memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
            memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
            memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
        }
    }
    for (i=0; i<tok2bufsize; i++) {
        //tok2buf[i].saved=((tok2buf[i].len-1)*(tok2buf[i].occur))-(tok2buf[i].len+1);
        memcpy (c, tok2buf[i].token, tok2buf[i].len);
        c[tok2buf[i].len]=0;
        printf (" Token# %d: \"%s\", Occur %d\n", i, c, tok2buf[i].occur);
    }
    getchar();
    for (i=0; i<tok2bufsize; i++) {
        if (tok2buf[i].saved<1) {tok2bufsize=i; break;}
    }
    if (tok2bufsize>128) tok2bufsize=128;
    printf ("# tokens after sort: %d\n", tok2bufsize);
    collecttokens();
    puts ("b");
}
The problem is that, with every version of PrintTok2 other than the modification of Toldo's design, I'm getting far too few tokens, and every attempt I made to increase the number of tokens resulted in poorer compressibility. Also, even my modified Toldo does a little too poorly. I don't know why this is.
Good news: I finally gained some ground with text compression! 😁 It turns out that I was using the wrong technique to get the tokens. I'm currently only collecting the results after tokenization and before my other techniques. There, I gained 2.1% yesterday evening. I still have a lot of work to do.