Twitter: https://twitter.com/Wareya

Cursed

Date: 2019-03-15

I just added something very cursed to my mecab clone.

So mecab has to store a 2D matrix of connection costs between types of tokens (connection costs and token costs are how it finds the lowest-cost path, and token types are what the connection costs are keyed on), but the connection matrix for the only truly good mecab dictionary is 800MB now, so loading the whole thing was eating way too much memory, and that's not gonna fly on a VPS with 1GB of RAM.
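To make the cost model concrete, here's a minimal sketch of a dense connection matrix and the cost of one candidate path through it. Everything named here (`ConnectionMatrix`, `Token`, `path_cost`, the `i16` cost width, the row-major layout) is an assumption for illustration, not mecab's actual format or my clone's actual code.

```rust
/// Dense connection-cost matrix stored as one flat vector.
/// Assumptions: costs fit in i16, and rows are indexed by the type of
/// the token on the left side of the connection.
struct ConnectionMatrix {
    right_count: usize,
    costs: Vec<i16>,
}

impl ConnectionMatrix {
    fn new(left_count: usize, right_count: usize) -> Self {
        ConnectionMatrix {
            right_count,
            costs: vec![0; left_count * right_count],
        }
    }

    fn set(&mut self, left: usize, right: usize, cost: i16) {
        self.costs[left * self.right_count + right] = cost;
    }

    fn get(&self, left: usize, right: usize) -> i16 {
        self.costs[left * self.right_count + right]
    }
}

/// A token as the path search sees it (hypothetical, simplified).
struct Token {
    left_type: usize,  // type used when connecting to the previous token
    right_type: usize, // type used when connecting to the next token
    cost: i64,         // cost of the token itself
}

/// Total cost of one candidate path: every token cost plus the
/// connection cost between each adjacent pair of tokens.
fn path_cost(matrix: &ConnectionMatrix, path: &[Token]) -> i64 {
    let mut total = 0;
    for (i, token) in path.iter().enumerate() {
        total += token.cost;
        if i > 0 {
            let prev = &path[i - 1];
            total += matrix.get(prev.right_type, token.left_type) as i64;
        }
    }
    total
}
```

The tokenizer wants the path with the lowest total, so `get` ends up on the hot path of the whole search, which is why its speed matters so much.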

So at first I just made a hashmap from matrix edge pairs to connection costs that falls back to reading directly from disk if a pair isn't in the hashmap. That worked fine, and it took up like 5~10% as much memory, since it only uses 1~2% of all the possible connection types (there's a lot of memory overhead for hashmaps in Rust, apparently). It was sorta fast, but it was still noticeably slower than keeping the whole matrix in memory.
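A rough sketch of that hashmap-with-disk-fallback idea, under some loud assumptions: the matrix is stored as row-major little-endian `i16` values with no header, the hashmap is filled ahead of time with the pairs that are expected to be used, and misses are not cached. The names (`CachedMatrix`, `preload`, `read_from_source`) are made up for this example.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom};

/// Connection costs served from a hashmap, with a seekable backing
/// store (for example an open file) as the fallback.
struct CachedMatrix<R: Read + Seek> {
    right_count: u64,
    cache: HashMap<(u16, u16), i16>,
    source: R,
}

impl<R: Read + Seek> CachedMatrix<R> {
    fn new(source: R, right_count: u64) -> Self {
        CachedMatrix {
            right_count,
            cache: HashMap::new(),
            source,
        }
    }

    /// Read one cost straight from the backing store.
    /// Assumed layout: row-major, 2 bytes per cost, little-endian.
    fn read_from_source(&mut self, left: u16, right: u16) -> io::Result<i16> {
        let index = left as u64 * self.right_count + right as u64;
        self.source.seek(SeekFrom::Start(index * 2))?;
        let mut bytes = [0u8; 2];
        self.source.read_exact(&mut bytes)?;
        Ok(i16::from_le_bytes(bytes))
    }

    /// Put one edge pair into the hashmap ahead of time.
    fn preload(&mut self, left: u16, right: u16) -> io::Result<()> {
        let cost = self.read_from_source(left, right)?;
        self.cache.insert((left, right), cost);
        Ok(())
    }

    /// Fast path: hashmap hit. Slow path: seek and read from the store.
    fn get(&mut self, left: u16, right: u16) -> io::Result<i16> {
        if let Some(&cost) = self.cache.get(&(left, right)) {
            return Ok(cost);
        }
        self.read_from_source(left, right)
    }
}
```

Being generic over `Read + Seek` means the same code works with a `std::fs::File` or with an in-memory `std::io::Cursor` for testing.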

So I manually took lists of the 2000 most common edge locations on both axes of the matrix, and now I load the corresponding 2000x2000 matrix into memory first, with lookup tables for which edges exist in it and where. Oh my god it's so horrible, but it's ALMOST as fast as just using an 800MB vector in memory, and it uses the least memory of anything except slowly accessing the disk every time anything reads a connection weight (which happens literally tens of billions of times).
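Here's roughly what the cursed part looks like as a sketch. The names (`HotMatrix`, `NOT_HOT`, `build`) and the `i16` cost width are assumptions for illustration. The lookup tables map a full-matrix id to a slot in the small dense block, and a `None` result means the caller has to fall back to a slower path such as the hashmap or the disk read. With 2-byte costs, a 2000x2000 block would be about 8MB.

```rust
/// Marker in the lookup tables for "this id is not in the hot set".
const NOT_HOT: u16 = u16::MAX;

/// A small dense block holding only the most common rows and columns.
struct HotMatrix {
    left_slot: Vec<u16>,  // full-matrix left id -> row in `hot`, or NOT_HOT
    right_slot: Vec<u16>, // full-matrix right id -> column in `hot`, or NOT_HOT
    hot_right_count: usize,
    hot: Vec<i16>, // row-major block of hot_left.len() * hot_right.len() costs
}

impl HotMatrix {
    /// Build the block from the hot id lists. `full` stands in for
    /// however the full matrix is read (hypothetical closure).
    fn build(
        left_count: usize,
        right_count: usize,
        hot_left: &[u16],
        hot_right: &[u16],
        full: impl Fn(u16, u16) -> i16,
    ) -> Self {
        let mut left_slot = vec![NOT_HOT; left_count];
        for (slot, &id) in hot_left.iter().enumerate() {
            left_slot[id as usize] = slot as u16;
        }
        let mut right_slot = vec![NOT_HOT; right_count];
        for (slot, &id) in hot_right.iter().enumerate() {
            right_slot[id as usize] = slot as u16;
        }
        let mut hot = Vec::with_capacity(hot_left.len() * hot_right.len());
        for &l in hot_left {
            for &r in hot_right {
                hot.push(full(l, r));
            }
        }
        HotMatrix {
            left_slot,
            right_slot,
            hot_right_count: hot_right.len(),
            hot,
        }
    }

    /// Some(cost) if both ids are in the hot set, None otherwise.
    fn get(&self, left: u16, right: u16) -> Option<i16> {
        let row = self.left_slot[left as usize];
        let col = self.right_slot[right as usize];
        if row == NOT_HOT || col == NOT_HOT {
            return None;
        }
        Some(self.hot[(row as usize) * self.hot_right_count + (col as usize)])
    }
}
```

The hit path is two table reads and one array index with no hashing at all, which is presumably why it gets so close to the plain 800MB vector.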

Incredibly cursed.