Mouse over or tap the graph to highlight specific games.
The pink dot is the median.
The x axis is an estimate of lexical complexity, where 1000 is easy and 20000 is hard.
The y axis is an estimate of structural complexity, where 100 is easy and 50 is hard.
Games in the bottom half of the graph are more likely to have simple/short sentences.
Games in the left half of the graph are more likely to stick to common words.
The x axis is the number of words you need to know, taken in order from a frequency list based on VNs, to reach 92.5% coverage of that VN. (Grammatical words like particles are completely ignored, as are the 20 most common uncovered words in the VN.)
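A rough sketch of how a coverage target like this could be computed, not the actual script behind the chart. The function name, the `ignored` set, and the reading of "uncovered" as "words in the VN that never appear on the frequency list" (character names, for example) are all assumptions made for illustration. Tokenization is assumed to have happened already.

```python
from collections import Counter

def words_needed_for_coverage(tokens, frequency_list, target=0.925,
                              ignored=frozenset(), skip_uncovered=20):
    """Return how many top entries of frequency_list are needed to cover
    `target` of the tokens, or None if the target cannot be reached."""
    # Grammatical words (particles etc.) are dropped before counting.
    counts = Counter(t for t in tokens if t not in ignored)
    known = set(frequency_list)
    # Assumption: "uncovered" means absent from the frequency list.
    # The most common such words are removed from the totals.
    uncovered = Counter({w: c for w, c in counts.items() if w not in known})
    for word, _ in uncovered.most_common(skip_uncovered):
        del counts[word]
    total = sum(counts.values())
    if total == 0:
        return None
    covered = 0
    # Walk the frequency list from most to least common word.
    for rank, word in enumerate(frequency_list, start=1):
        covered += counts.get(word, 0)
        if covered / total >= target:
            return rank
    return None
```

With a real frequency list the returned rank is the kind of number shown on the x axis. A result of None means the list cannot reach the target for that script at all.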
The y axis can be set to one of three different variables.
The "Hayashi" metric is, I think, used to estimate how hard textbooks and assigned reading are, so it is meant for school-related material. See this paper's relevant section on this metric.
"custom metric a" is based on:
- the proportion of the text made up by kanji, hiragana, and katakana
- the proportion of runs in the text that are kanji, hiragana, or katakana
- the average length of runs of kanji, hiragana, and katakana, adjusted for how much of the text they take up
- the number of kanji, hiragana, and katakana per sentence
- the number of runs of a single writing system per sentence
- the average length of each sentence
"custom metric b" is based on:
- the proportion of the text made up by kanji, hiragana, or katakana
- the proportion of runs in the text that are kanji, hiragana, or katakana
- the average length of runs of kanji, hiragana, and katakana
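To make "runs" concrete, here is a minimal sketch of how the per-script proportions and run lengths listed above could be measured. It is only an illustration. The Unicode ranges are a simplification (the katakana block also contains marks like ー and ・), the function names are made up, and taking shares relative to kana and kanji characters only, rather than all characters, is an assumption.

```python
from itertools import groupby

def script_of(ch):
    """Classify one character by writing system (simplified ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return None  # punctuation, Latin letters, digits, whitespace, ...

def script_run_stats(text):
    """Per-script share of characters, share of runs, and mean run length."""
    # A run is a maximal stretch of consecutive characters in one script.
    runs = [(script, len(list(group)))
            for script, group in groupby(text, key=script_of)
            if script is not None]
    total_chars = sum(length for _, length in runs)
    stats = {}
    for name in ("kanji", "hiragana", "katakana"):
        lengths = [length for script, length in runs if script == name]
        stats[name] = {
            "char_share": sum(lengths) / total_chars if total_chars else 0.0,
            "run_share": len(lengths) / len(runs) if runs else 0.0,
            "mean_run_length": sum(lengths) / len(lengths) if lengths else 0.0,
        }
    return stats
```

For a sentence like 私はカメラを買った。 this finds six runs: 私, は, カメラ, を, 買, and った. Per-sentence figures would need the text split into sentences first, which is left out here.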
The custom metrics are fitted to match the x axis using multiple regression.
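For readers unfamiliar with multiple regression, here is a minimal sketch of an ordinary least squares fit of several features against one target value. This is a generic textbook formulation (normal equations solved by elimination), not necessarily the fitting procedure used for the chart. The function names are hypothetical. In this context each row would hold one game's structural features and the target would be that game's x-axis value.

```python
def fit_multiple_regression(rows, targets):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k]."""
    # Prepend a constant 1.0 to each row so the first coefficient
    # acts as the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y] of the normal equations.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def predict(coefs, row):
    """Apply fitted coefficients to one row of features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))
```

Solving the normal equations directly is fine for a handful of features. With many or strongly correlated features, a library solver would be the safer choice.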
"custom metric a" is shown by default.
The x axis is not an objective measure of lexical complexity. Its quality depends entirely on the frequency list being used, but a frequency list based on visual novels makes the most sense for these stats.
The y axis is not an objective measure of structural complexity. It's a measure of the relative structural complexity of games with similar lexical complexities.
The general error range is something like +/- 25% of the size of the chart for "custom metric a", +/- 30% for "custom metric b", and way too much for Hayashi.
I picked 92.5% because 90% was low enough to be slightly unstable and 95% starts to show the weakness of "coverage target" as a metric.
Tsuushinbo is a loli game.
donate scripts pls (they need to be well formatted or be raw data)
Ambiguous abbreviations are explained here.
Compare with this old version to get a feeling for how much the stats change when issues caused by analysis are resolved.