Posted: Tue Sep 12, 2006 7:43 pm
by Memblers
tepples wrote:There are also a number of commonly-used letter triples and quadruples, etc. They're called "words." Has anyone investigated separating the text into words, assigning a code number to each word, and then on decompression looking up the code number in a table to get characters?
I thought that's basically what Huffman compression does. I'm sure that'd work well, especially when the word is much longer than the code number that represents it. And give the most repeated ones the smallest code numbers.
Posted: Tue Sep 12, 2006 8:32 pm
by tokumaru
Memblers wrote:I'm sure that'd work well, especially when the word is much longer than the code number that represents it. And give the most repeated ones the smallest code numbers.
If you used 16-bit codes you could index 65536 words. That's more than enough for most languages if you ask me. And 2 bytes per word is very good, as most words are larger than 2 characters. However, defining the dictionary is what would be more space consuming, killing the efficiency of the compression.
Posted: Tue Sep 12, 2006 8:52 pm
by Memblers
Since it's just text though, you've at least got bit 7 free to do whatever with. So if you took 127 of the most repeated words (including leading and trailing spaces, punctuation) and had bit 7 represent that the byte is a code number, you'd have a decent amount of the script represented in single bytes. That would save memory even from the word "a" (remembering there's spacing). And common stuff like "the, of, is, it, to".
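Memblers' bit-7 scheme can be sketched on the PC side like this. The word table and the greedy encoder are invented for illustration; a real table would hold the 127 most frequent words from the actual script:

```python
# Sketch of the bit-7 dictionary idea: bytes $00-$7F are literal
# characters, bytes $80-$FF index a table of up to 128 common words
# (with their surrounding spaces included). Table contents invented.
WORDS = [" the ", " of ", " is ", " it ", " to ", " a "]

def decode(data):
    out = []
    for b in data:
        if b & 0x80:                 # bit 7 set: dictionary code
            out.append(WORDS[b & 0x7F])
        else:                        # bit 7 clear: literal character
            out.append(chr(b))
    return "".join(out)

# PC-side "compressor" for testing: greedily substitute known words.
def encode(text):
    out = []
    i = 0
    while i < len(text):
        for idx, w in enumerate(WORDS):
            if text.startswith(w, i):
                out.append(0x80 | idx)
                i += len(w)
                break
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```

Note that each dictionary hit replaces an entire word plus its spacing with a single byte, which is why even " a " pays for itself.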
DTE is pretty good, I don't know if this is better. Maybe if there are some big, commonly used words. Or if you reduced your charset to 64 (6-bit) and combined a dictionary with DTE. Could use DTE on the dictionary too, heheh.

Posted: Tue Sep 12, 2006 9:07 pm
by tepples
Memblers wrote:Could use DTE on the dictionary too, heheh.

Building the romhackers' nightmare, one compression layer at a time.

Posted: Wed Sep 13, 2006 4:13 am
by mbrenaman
Well, it was either DTE or Huffman. In my situation, I wanted the tiny dictionary needed for DTE (SNROM RPG game). My dictionary of 64 entries (128 bytes) was conveniently put on the fixed bank. It was very easy to implement and I saw reduction in different text areas of about 20%-40%.
If I had a whole bank to spare, I would have dedicated it to words instead of small character pairs. Most of the banks have been dedicated to other things from the original design. If I had planned for text compression in the beginning, I probably would have gone with word compression.
I'm satisfied though, and see our game's goals as being more realistic.
Thanks again for everyone's input.
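The 64-entry DTE setup mbrenaman describes works out like this in a quick PC-side sketch (the pair table, the code range, and the test encoder are invented; a real table would be built from the script's most frequent pairs, and the 64 pairs fit in 128 bytes of a fixed bank as two parallel arrays):

```python
# DTE sketch: codes below DTE_BASE are literal characters; codes at or
# above it index a 64-entry table of character pairs. All table
# contents and the base value here are invented for illustration.
DTE_BASE = 0xC0
PAIRS = ["th", "he", " t", "er", "on", " a"]   # up to 64 entries

def dte_decode(data):
    out = []
    for b in data:
        if b >= DTE_BASE:
            out.append(PAIRS[b - DTE_BASE])   # expand pair code
        else:
            out.append(chr(b))                # literal character
    return "".join(out)

# PC-side encoder for testing: greedily take a pair when available.
def dte_encode(text):
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in PAIRS:
            out.append(DTE_BASE + PAIRS.index(pair))
            i += 2
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```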
Posted: Wed Sep 13, 2006 8:26 am
by Bregalad
DTE can be more or less efficient depending on how effective the chosen letter pairs are. I think it is very easy to implement, and quite flexible.
Using a dictionary 100% of the time wouldn't be so good, because long codes would come up often even for heavily used words, and rare words would consume a lot of space. Also, it would be a pain to compile the text.
You're not stuck with ASCII (at least I think so). ASCII on the NES is a terrible waste of tiles. You can just keep your own letter mapping, and do something like this (I just made this up as an example):
$00-$0F : Special commands in text (space, return, tab, etc.)
$10-$2F : Special codes for location names, people's names, or simply common long words
$30-$7F : Single letters
$80-$FF : 128 different two-letter combinations
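Bregalad's mapping comes down to a simple range dispatch at decode time. Here's a sketch; every table below is an invented placeholder (including the names), since the post only fixes the structure, not the contents:

```python
# Range-dispatch sketch of the mapping above. Tables are partial,
# invented placeholders; only the dispatch structure matters.
COMMANDS = {0x00: " ", 0x01: "\n"}            # $00-$0F: control codes
COMMON   = ["Hyrule", "Link", "dungeon"]      # $10-$2F: common long words
ALPHABET = "abcdefghijklmnopqrstuvwxyz"       # $30-$7F: single letters
PAIRS    = ["th", "he", "in", "er"]           # $80-$FF: letter pairs

def expand(byte):
    if byte < 0x10:
        return COMMANDS.get(byte, "")
    elif byte < 0x30:
        return COMMON[byte - 0x10]
    elif byte < 0x80:
        return ALPHABET[byte - 0x30]
    else:
        return PAIRS[byte - 0x80]
```

On the NES the same dispatch would be a few compares and indexed loads per byte, so the decoder stays tiny.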
Posted: Wed Sep 13, 2006 2:41 pm
by commodorejohn
That's not a bad idea. I always keep the ASCII text characters at their default locations to save a conversion step (I just fill unused characters such as $00-1F with commonly-used tiles), but I like the idea of using unused text bytes for special purposes.
Posted: Fri Sep 29, 2006 1:25 pm
by abonetochew
Would implementing something like bzip2 be practical on the NES?
I've never implemented the
Burrows-Wheeler transform or the
move-to-front transform, but they don't look like they require kilobytes of dictionaries to work properly.
Posted: Fri Sep 29, 2006 2:15 pm
by tepples
BWT handles an entire block (several kilobytes) in RAM at a time. MTF needs a RAM table of up to 256 bytes, one entry for each possible byte value.
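To make tepples' point concrete: MTF's entire state is that 256-entry table, and the transform is just "emit the value's current position, then move it to the front." A minimal sketch:

```python
# Move-to-front transform: the whole state is a 256-entry list (256
# bytes of RAM on the NES side). Recently seen byte values migrate to
# the front, so runs of similar bytes encode as small numbers.
def mtf_encode(data):
    table = list(range(256))
    out = []
    for b in data:
        idx = table.index(b)
        out.append(idx)          # emit current position of the value
        table.pop(idx)
        table.insert(0, b)       # move the value to the front
    return out

def mtf_decode(codes):
    table = list(range(256))
    out = []
    for idx in codes:
        b = table.pop(idx)       # position -> value
        table.insert(0, b)
        out.append(b)
    return bytes(out)
```

The BWT stage is the memory hog; MTF alone fits the NES easily, but by itself it doesn't shrink anything — it just makes the data friendlier to a later entropy coder.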
Posted: Sat Sep 30, 2006 7:33 am
by tokumaru
I was just researching those 2! I haven't tried any implementations yet, but I think that traditional BWT will use too much memory to be implemented on the NES. I wonder if a version that works on smaller blocks would provide efficient compression...
Posted: Mon Oct 02, 2006 10:03 am
by CaH4e3
"Dick Tracy" and "Magician" games using some nice variations of huffman compression... First one encodes symbols as indexes in symbol array arranged by usage frequence for each symbol. Second one using some codes like huffman tree, that compared with code-table and then index in that table using as index of symbol in array similar to above...
Bot need pretty little memory (just reading bit-stream and decoding)... Compression routines pretty simple to write on PC, decoding routines you may cut from games.

Posted: Mon Oct 02, 2006 2:29 pm
by Celius
The way I'd go is to have $Ex (I'm not sure how many x is) combinations of letters that are very common. It'd be $Ex, and not $FF, because there are letters that can be defined as single letters ($40-$5A). It's obvious that you can't have DTE for every combination, because eventually you'd be using 2 bytes to define dual tiles, and we can all see that's pretty pointless. I was thinking TTE (Triple Tile Encoding) MAY be more effective. I wouldn't have a table, though; I'd have a routine that decodes every tile manually. It'd be a huge table, and I don't even know if it would make sense. It'd use 2 bytes for 3 letters. It'd be hard to work in cubes, but I've always found working in cubes on a system such as this interesting. It's just an idea that came to mind.
Now that I think about it, DTE is possible for 24 bits for 2 letters, as opposed to 32 bits. I'm sure that's what many of you were thinking. It'd still only cut the size down by 1/4, which in the grand scheme of things can really help. Instead of using 8 banks for text, you can use 6. You know? It really does help to have at least SOME sort of compression. I'm going to think a little more about TTE, or maybe a different method of compression.
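One possible reading of the TTE idea, sketched out — this is purely a guess at an encoding, since the post leaves it open; the escape value and triple table are invented:

```python
# A guess at one way TTE could work: an escape byte (invented value
# $FE) followed by an index into a table of common letter triples,
# so each listed triple costs 2 bytes instead of 3.
TTE_ESCAPE = 0xFE
TRIPLES = ["the", "and", "ing", "ere"]   # invented table

def tte_decode(data):
    out = []
    i = 0
    while i < len(data):
        if data[i] == TTE_ESCAPE:
            out.append(TRIPLES[data[i + 1]])  # expand triple code
            i += 2
        else:
            out.append(chr(data[i]))          # literal character
            i += 1
    return "".join(out)
```

With a two-byte escape the index byte can address 256 triples, at the cost of saving only one byte per hit.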
Posted: Mon Oct 02, 2006 3:01 pm
by tokumaru
You know, if 64 characters is enough for all symbols, you could even try a mix of DTE and TTE. The top 2 bits could be used to define what the other 6 mean: an index to a single character, an index to a pair of characters, or an index to a group of 3 characters. Of course, you'd also have only 64 entries in each table, so I really don't know if that'd perform well. And there is still one code left, but I don't know what that could be used for, since I don't think that groups of 4 characters would be very common.
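tokumaru's top-2-bit scheme, sketched out (table contents are invented; each table is capped at 64 entries, and the fourth code range is left unassigned as in the post):

```python
# Top 2 bits select the table, low 6 bits index into it (max 64
# entries each). The %11xxxxxx range is left unused, as discussed.
SINGLES = "abcdefghijklmnopqrstuvwxyz ,.!?"   # invented, partial
PAIRS   = ["th", "he", "in"]
TRIPLES = ["the", "and", "ing"]

def mixed_decode(data):
    out = []
    for b in data:
        table, index = b >> 6, b & 0x3F
        if table == 0:
            out.append(SINGLES[index])        # %00: single character
        elif table == 1:
            out.append(PAIRS[index])          # %01: character pair
        elif table == 2:
            out.append(TRIPLES[index])        # %10: character triple
        else:
            raise ValueError("code %11xxxxxx unassigned")
    return "".join(out)
```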
Posted: Mon Oct 02, 2006 3:51 pm
by commodorejohn
Here's a thought I had today on combining word-lookup with DTE and regular ASCII.
Suppose we have each word in our text be represented by a two-byte pointer to a zero-terminated string in the ROM. We'll assume that nobody's going to store a text lookup in zero-page, so if the high byte of the pointer is 0, that signals that a DTEed and zero-terminated string follows, and the pointer itself is discarded. From there, a DTE algorithm takes over and returns control to the word-lookup algorithm when it hits the terminator character (0, not Ahnuld.) You could even have DTE in the word-table entries.
Granted, it's a little over-wrought, but I'll bet it provides pretty decent compression.
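commodorejohn's pointer-with-zero-page-escape scheme, sketched on the PC side. The ROM layout, addresses, and sample stream below are invented, and the DTE pass on the strings is left out to keep the sketch short:

```python
# Sketch of the word-pointer scheme: the text stream is a sequence of
# 16-bit little-endian pointers into ROM. A pointer whose high byte is
# 0 is an escape (nothing lives in zero page): the pointer is discarded
# and a zero-terminated literal string follows inline in the stream.
def read_cstr(rom, addr):
    end = rom.index(0, addr)
    return rom[addr:end].decode("ascii")

def decode_stream(rom, stream):
    out = []
    i = 0
    while i < len(stream):
        lo, hi = stream[i], stream[i + 1]
        i += 2
        if hi == 0:                           # escape: inline literal
            end = stream.index(0, i)
            out.append(stream[i:end].decode("ascii"))
            i = end + 1
        else:                                 # normal case: word in ROM
            out.append(read_cstr(rom, (hi << 8) | lo))
    return "".join(out)

# Invented example ROM: two words at addresses $0100 and $0106.
ROM = b"\x00" * 0x100 + b"hello\x00world\x00"
# Invented stream: pointer to "hello", then an escape + inline " world".
STREAM = bytes([0x00, 0x01, 0x00, 0x00]) + b" world\x00"
```

Two bytes per word is the same cost as a 16-bit dictionary code, but here the "dictionary" is just the ROM strings themselves, so no separate table is needed.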
Posted: Tue Oct 03, 2006 9:16 am
by Bregalad
Tokumaru, your idea looks quite interesting. I think the fourth combination could be for variable-length things, such as location words and people's names, along with special characters such as tab, return, and end-of-message.