Solutions to hitting
Sure...but it's still a lot easier than doing a statistically rigorous proof, imho. The only disadvantage is that failure to prove the RNG is bad won't actually be proof that it's good.

mikekchar wrote: Unfortunately it's more complicated than that. True, if I can show a particular defect in the distribution, it shows the distribution is off. However, there are an infinite number of ways the distribution can be off. For instance, perhaps every second generation is low.
The code is portable: it uses the standard C++ library function, rand(). Sure, the precise behavior can differ from platform to platform, but it's still portable code.

mikekchar wrote: I agree with you here. However, if there are differences, it might explain why some people experience one thing and others experience something else. My only point was that the behaviour of the RNG in Wesnoth is already unportable, although the code itself will compile on multiple platforms. If there turns out to be a problem on a specific platform, there's no real reason it couldn't be changed for that particular platform.
Having to add #ifdefs on a per-platform basis is something I'd avoid unless absolutely necessary. If rand() turns out to be bad on a platform, we'd just import our own random number algorithm.
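If we ever did need to replace rand() on a misbehaving platform, a deterministic generator is only a few lines. A minimal sketch using the Park-Miller "minimal standard" LCG (illustrative code, not anything in Wesnoth today):

```cpp
#include <cstdint>

// Hypothetical stand-in for a platform-dependent rand(): the classic
// "minimal standard" Lehmer generator (multiplier 16807, modulus 2^31 - 1).
// The same seed produces the same sequence on every platform.
class portable_rng {
public:
    explicit portable_rng(uint32_t seed) : state_(seed ? seed : 1) {}

    // Returns the next value in [1, 2147483646].
    uint32_t next() {
        state_ = static_cast<uint32_t>(
            (static_cast<uint64_t>(state_) * 16807u) % 2147483647u);
        return state_;
    }

    // Value in [0, n). Note the slight modulo bias -- negligible for
    // n = 100, but worth knowing about.
    uint32_t next_below(uint32_t n) { return next() % n; }

private:
    uint32_t state_;
};
```

The classic sequence from seed 1 starts 16807, 282475249, ..., which makes cross-platform verification trivial.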
I'm not going to say I'm sure that code doesn't have bugs...after all, it's barely been tested at all. If you find any, let us know.

mikekchar wrote: I took a look at my save file from the aforementioned 13-miss game. The data does not corroborate my observation. So either I missed some hits or some things aren't being saved. My guess is the former.
Just think of it as a linked list of random numbers. Each child is just the successor to the previous random number.

mikekchar wrote: One question... I'm looking in the replay file and I notice that the [random] tags are nested. Is there some documentation that explains what this means?
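As an illustration only (this is not the actual replay-parsing code), the nested [random] tags can be modelled as a singly linked chain, where each nested child holds the value generated after its parent:

```cpp
#include <memory>

// Illustrative model of the nested [random] tags in a replay file:
// a singly linked list, each nested child being the next value drawn.
struct random_node {
    int value = 0;                      // the random value recorded in this tag
    std::unique_ptr<random_node> next;  // the nested [random] child, if any
};

// Walk the chain and count how many random values the replay recorded.
inline int chain_length(const random_node* head) {
    int n = 0;
    for (; head != nullptr; head = head->next.get()) ++n;
    return n;
}
```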
David
“At Gambling, the deadly sin is to mistake bad play for bad luck.” -- Ian Fleming
Here is a quick Perl script to extract a summary of the statistics section described by Dave. The v1.3 output now looks like the listing below, with the first column being the percentage chance to hit, the second column the sequence of hits, and the third the number of times that %-sequence combination occurs in the savefile. The summary at the end aggregates sequences of hits, so that 0011 and 0101 are both classified as 4 swings, 2 hits, with the count of those events in the last column.
Code: Select all
70 1 2
70 01 2
70 10 3
70 11 9
70 011 2
70 100 1
70 101 2
70 111 8
70 0010 1
70 0011 1
70 0101 1
70 0111 1
70 1010 3
70 1110 1
70 expected, 74.2574257425743 observed
80 00 1
[...]
Summary:
% #a #h #
70 1 1 2
70 2 1 5
70 2 2 9
70 3 1 1
70 3 2 4
70 3 3 8
70 4 1 1
70 4 2 5
70 4 3 2
= 101 75
[...]
== 2610 1311
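For anyone reimplementing the summary step in C++ rather than Perl, the aggregation described above might be sketched like this (hypothetical names, not ott's actual script):

```cpp
#include <map>
#include <string>
#include <tuple>

// Aggregate (chance-to-hit, hit/miss sequence, count) records into the
// summary form used above: (% to hit, #swings, #hits) -> total count.
// "0011" and "0101" both land in the (4 swings, 2 hits) bucket.
using summary_key = std::tuple<int, int, int>;  // (% to hit, #swings, #hits)

inline void add_record(std::map<summary_key, int>& summary,
                       int chance, const std::string& seq, int count) {
    int hits = 0;
    for (char c : seq)
        if (c == '1') ++hits;
    summary[{chance, static_cast<int>(seq.size()), hits}] += count;
}
```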
- Attachments
-
- st-1.3.zip
- (963 Bytes) Downloaded 579 times
OK. I ran st-1.6 (attached) against my savefiles. Assuming that the code to generate attack/defend statistics is working correctly, I'm seeing an almost exact correspondence between expected values and observed values, as far as number of hits is concerned. Examples:

Code: Select all
total 6646 swings, 3576.7 expected vs. 3561 observed hits
total 3041 swings, 1696 expected vs. 1699 observed hits
total 3369 swings, 1747.8 expected vs. 1705 observed hits

In the archive I've attached a few st summaries of savegames late in some of the campaigns, as well as output against two directories of savegames. If I'm interpreting the data correctly, then the random number generation scheme is actually very well behaved for this particular metric on my platform.

Since I'm seeing large deviations from expected damage in-game, there may be something weird with the display or calculation of expected damage values, as I suggested earlier in this thread. My current hypothesis is that the EV calculations in-game assume all swings will be made, while a skirmish actually terminates if one of the parties dies. This would always show an EV higher than the observed value. I'll continue investigating.
- Attachments
-
- st-1.6.zip
- (7.88 KiB) Downloaded 569 times
Last edited by ott on December 16th, 2004, 12:30 pm, edited 1 time in total.
Thanks, ott, for the code. I have similar code as well and have found similar results so far. I *did* however get a game with 9 60%+ misses in a row, and I have the replay file this time. I died (due to really stupid strategy, not an unfair RNG), so I don't have a save file. I haven't gotten around to writing the extraction of the stats from the replay file yet, so I don't know whether anything else was off.
I've added the game as an attachment. It's my impression that there are way too many misses in this game (on both sides), but as I said, I don't have the stats yet.
One thing that might be an issue: in this game, it was reporting 60% to hit from my archer (level 1) and 70% to hit from the level 3 sharpshooter even though they were in the water. I was surprised by this. I can't find the event in the replay file but you can clearly see it if you watch the replay. One possibility is that the dialog is reporting the wrong odds.
Anyway, 4 misses at 60% followed by 5 misses at 70% has a likelihood of .006%. It's not impossible, but it seems rather improbable.
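For the record, the quoted figure checks out, assuming independent swings: 0.4^4 * 0.3^5 is about 6.2e-5, i.e. roughly 0.006%. As a one-liner:

```cpp
#include <cmath>

// Check of the figure quoted above: probability of 4 consecutive misses at
// 60% chance-to-hit followed by 5 consecutive misses at 70%, assuming
// independent swings.
inline double long_miss_probability() {
    return std::pow(0.4, 4) * std::pow(0.3, 5);  // ~6.22e-5, about 0.006%
}
```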
Thanks everyone!
- Attachments
-
- replay.tar.gz
- Replay of a battle where I got 9 60%+ misses in a row. Unfortunately I also died, so no game save file.
- (14.91 KiB) Downloaded 624 times
Looking at your savefile, I see total 1863 swings, 1020.4 expected vs. 1010 observed hits: quite reasonable.
Unfortunately there seems to be no easy way to extract sequences other than through the attacks/defends mechanism, which seems to have the granularity of one side of a skirmish at a time.

Looking at sequences of 4 hits,

Code: Select all
80 4 *: 0
70 4 *: 26
60 4 *: 54
50 4 *: 16
40 4 *: 30
30 4 *: 7

for a total of 133 events, using code like

Code: Select all
perl -n -e '/^50. 4 \d.(..\d)/ and $c+=$1; END{print "$c\n"}'

on the st output. We expect .0081 of the 70 4 * events to be 70 4 0 sequences, or 0.2106 sequences; we observe 1 of these. We expect .0256 of the 60 4 * sequences to be 60 4 0, or 1.3824 sequences; we actually observe 2 of these. For 50 4 *, we expect 1 zero-sequence and observe 1. For 40 4 *, we expect 3.888 and observe 7. For 30 4 *, we expect 1.6807 and observe 2. Not really enough to go on here.

Doing this analysis on one of my replays, with 525 sequences of 4 swings, I see

Code: Select all
30 4 *: 38
40 4 *: 101
50 4 *: 92
60 4 *: 143
70 4 *: 150
80 4 *: 1

of 4-swing events. This yields 9.1238/16, 13.0896/12, 5.75/12, 3.6608/7, 1.215/5, .0016/0 pairs of expected/observed values for each P. Four of these look fishy (far too low, we should have seen more of these events) and may merit further investigation.

This could all indicate a problem but the other way, too few runs of 4 misses. Maybe there is something about rand() and length-4 runs of small values (too few), and length-9 runs (too many)?
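The expected counts in this analysis come from the all-miss probability (1 - p)^n: for example, 0.3^4 = 0.0081 of the 70 4 * events, giving 26 * 0.0081 = 0.2106 expected all-miss sequences. As a sketch:

```cpp
#include <cmath>

// Expected number of all-miss sequences among N sequences of n swings,
// each swing hitting independently with probability p: N * (1 - p)^n.
// For p = 0.7, n = 4 the per-sequence probability is 0.3^4 = 0.0081,
// matching the fraction used in the analysis above.
inline double expected_all_miss(int n_sequences, int swings, double p_hit) {
    return n_sequences * std::pow(1.0 - p_hit, swings);
}
```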
Yes, it's all very mysterious. Unfortunately, as you said, there are far too few trials to determine whether there is any statistical issue here. However, in my game there are 13 4-miss occurrences against an expected number of 8.17. In your game there are 52 observed versus 32.84 expected.
In both cases this is a ratio of 1.6:1 (almost exactly). Unless I'm misunderstanding your numbers, this appears to be fairly strong evidence that there are more long misses than there should be.
I'm not sure why you've come to the opposite conclusion. Perhaps a mistake on one or both of our parts?
Over Christmas I intend to put together the data from a whole campaign (or at least as far as I can get with my lack of skill) and make a report. I can't quite decide what the distribution of 4 strike-misses would be. If it's normal, we already have significant indication of a problem (much better than 95% confidence). However, I suspect that it's not, so I'd like to have at least 300 observations just to make sure.
OK... I decided to look into the data of the campaign I currently have running. I don't have all my tools automated yet, so some of the calculations were done by hand. I only had to do the numbers for 30% and 40% hitting to determine that there is definitely something wrong (but it's been a long time since I took a stats course, so I might be mistaken). First, my assumptions; the data follows below.
Now for my assumptions. The random number generator is supposedly uniformly distributed. That means the results within a category (percentage and number of swings) are binomially distributed, and the observed counts in each category are approximately normally distributed (Central Limit Theorem).
This means that as long as there are 30 or more observations in a category, any observed:expected ratio outside the range 0.6667 to 1.5 (two standard deviations) is statistically significant at the 95% confidence level.
Anybody who actually has access to a stats book should check this out to make sure my memory isn't playing tricks on me.
Now, you will notice that apart from one category, all of them have many more than 30 observations. Not only that but *every* category has a statistically significant problem in it!!!!!
This should conclusively show that with a 95% confidence, the RNG is *not* uniformly distributed and is biased. How it is biased I couldn't say other than guessing (I think it favours middle numbers).
I will continue to collect and process stats, but at this point I think a bug report should be in order. Of course I am open to others pointing out the flaws in my argument
Code: Select all
30% Attacking
=============
574 Total attempts
Observed Expected Ratio
Swung 1 times with 0 hits --> 24 44.8 0.54
Swung 1 times with 1 hits --> 40 19.2 2.08
Total 64 64.0
Swung 2 times with 0 hits --> 127 135.73 0.94
Swung 2 times with 1 hits --> 136 116.34 1.17
Swung 2 times with 2 hits --> 14 24.93 0.56
Total 277 277.00
Swung 3 times with 0 hits --> 66 63.798 1.03
Swung 3 times with 1 hits --> 78 82.026 0.95
Swung 3 times with 2 hits --> 41 35.154 1.17
Swung 3 times with 3 hits --> 1 5.022 0.20
Total 186 186.000
Swung 4 times with 0 hits --> 12 11.284 1.06
Swung 4 times with 1 hits --> 24 19.345 1.24
Swung 4 times with 2 hits --> 11 12.436 0.88
Swung 4 times with 3 hits --> 0 3.553 0.00*
Swung 4 times with 4 hits --> 0 0.381 0.00*
Total 47 47.000
Grand Total 574 574
40% Attacking
=============
508 Total attempts
Observed Expected Ratio
Swung 1 times with 0 hits --> 13 21 0.62
Swung 1 times with 1 hits --> 22 14 1.57
Total 35 35
Swung 2 times with 0 hits --> 74 74.52 0.99
Swung 2 times with 1 hits --> 113 99.36 1.14
Swung 2 times with 2 hits --> 20 33.12 0.60
Total 207 207.00
Swung 3 times with 0 hits --> 29 27.648 1.05
Swung 3 times with 1 hits --> 54 55.296 0.98
Swung 3 times with 2 hits --> 29 36.864 0.79
Swung 3 times with 3 hits --> 16 8.192 1.95
Total 128 128.000
Swung 4 times with 0 hits --> 24 17.8848 1.34
Swung 4 times with 1 hits --> 57 47.6928 1.20
Swung 4 times with 2 hits --> 48 47.6928 1.01
Swung 4 times with 3 hits --> 9 21.1968 0.42
Swung 4 times with 4 hits --> 0 3.5328 0.00*
Total 138 138.0000
Grand Total 508 508
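The Expected column in these tables follows the binomial distribution: Total * C(n, k) * p^k * (1 - p)^(n - k); e.g. 64 * 0.7 = 44.8 for one swing and zero hits at 30%. A sketch of the computation (illustrative, not mikekchar's actual tooling):

```cpp
#include <cmath>

// C(n, k) computed incrementally to stay exact for small n.
inline double binomial_coefficient(int n, int k) {
    double c = 1.0;
    for (int i = 1; i <= k; ++i)
        c = c * (n - k + i) / i;
    return c;
}

// Expected count for a (swings = n, hits = k) category out of `total`
// observed sequences, assuming independent swings with hit probability p:
// total * C(n, k) * p^k * (1 - p)^(n - k).
inline double expected_count(int total, int n, int k, double p) {
    return total * binomial_coefficient(n, k)
                 * std::pow(p, k) * std::pow(1.0 - p, n - k);
}
```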
- Elvish_Pillager
I think if there's a problem, it's an inverse-law-of-averages type of thing: It tends to get the same results repeatedly.
That explanation would make sense from my experience. I've played lots of games where I was consistently lucky or unlucky in the same ways.
However, given the usage, this is just an arbitrary guess.
It's all fun and games until someone loses a lawsuit. Oh, and by the way, sending me private messages won't work. :/ If you must contact me, there's an e-mail address listed on the website in my profile.
I think you are right. Re-reading what I wrote, the expected values are lower than the observed values. My conclusion was written as if things were the other way round. The data therefore supports a hypothesis that there are more 4-miss sequences than there should be, at least for certain values of P.

mikekchar wrote: In your game there are 52 observed versus 32.84 expected.
In both cases this is a ratio of 1.6:1 (almost exactly). Unless I'm misunderstanding your numbers, this appears to be fairly strong evidence that there are more long misses than there should be.
I'm not sure why you've come to the opposite conclusion. Perhaps a mistake on one or both of our parts?
BTW, many of these calculations are already done by the st script, in the Summary section. Be sure to use the -v command line switch to get the verbose stats. I'll add the ratios in the next version.

mikekchar wrote: some of the calculations were done by hand
Looks to me like the following are unequivocal evidence of a problem:

mikekchar wrote: I only had to do the numbers for 30% and 40% hitting to determine that there is definitely something wrong
Code: Select all
30% Attacking
Swung 1 times with 0 hits --> 24 44.8 0.54
Swung 1 times with 1 hits --> 40 19.2 2.08
I'm not sure why you conclude this. Most categories look quite close to 1 (except the 30% 1-swing 0-hit and 1-swing 1-hit cases), and many of the others have counts below 30.

mikekchar wrote: Now, you will notice that apart from one category, all of them have many more than 30 observations. Not only that but *every* category has a statistically significant problem in it!!!!!
Maybe EP's hypothesis of not enough "mixing" fits the facts better? There is some evidence that there are more 4-miss sequences than there should be for each of the 30%, 50%, 60% and 70% hit probabilities. This means we are seeing more "low" random numbers in a row than expected, which seems contrary to the RNG favouring middle numbers.

mikekchar wrote: This should conclusively show that with a 95% confidence, the RNG is *not* uniformly distributed and is biased. How it is biased I couldn't say other than guessing (I think it favours middle numbers).
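One way to probe the "repeats itself" idea directly would be a serial-correlation check on successive values of rand() % 100 on the platform in question: a strongly positive correlation would mean low draws tend to follow low draws (streaky misses), while near zero is what a good generator gives. A sketch of such a diagnostic (not anything currently in the game):

```cpp
#include <cmath>
#include <cstdlib>

// Quick-and-dirty serial correlation of consecutive rand() % 100 values.
// Returns the sample correlation coefficient between each draw and the
// next one; values near 0 indicate no first-order streakiness.
inline double serial_correlation(unsigned seed, int samples) {
    std::srand(seed);
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    int prev = std::rand() % 100;
    for (int i = 0; i < samples; ++i) {
        int cur = std::rand() % 100;
        sx += prev; sy += cur;
        sxx += double(prev) * prev;
        syy += double(cur) * cur;
        sxy += double(prev) * cur;
        prev = cur;
    }
    double n = samples;
    double cov = sxy - sx * sy / n;
    return cov / std::sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}
```

On a healthy rand() the result should sit within roughly 1/sqrt(samples) of zero; a platform whose generator "tends to get the same results repeatedly" would stand out here.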
Here is version 1.10 of st.
This one has a different output format, and does a simple analysis of the distribution of %-swings-hits events as suggested by mikekchar. On an old 0.8 replay for Return to Wesnoth, this gives

Code: Select all
30 2 2 3 8.5 0.35 <--
30 3 3 1 2.9 0.35 <--
30 4 0 16 9.1 1.75 <--
40 1 0 18 33.6 0.54 <--
40 1 1 38 22.4 1.70 <--
50 1 0 5 20.5 0.24 <--
50 1 1 36 20.5 1.76 <--
50 4 0 12 5.8 2.09 <--
60 1 0 20 49.6 0.40 <--
60 4 0 7 3.7 1.91 <--
70 1 0 2 21.0 0.10 <--
70 2 0 6 9.8 0.61 <--
70 3 0 8 4.8 1.68 <--
70 4 0 5 1.2 4.12 <--
80 2 0 1 1.6 0.64 <--

as the statistically significant differences from expected, using mikekchar's criteria of at least 30 observations and a ratio outside the interval [2/3, 3/2]. Some of these look mildly problematic, e.g. 1-hit-in-1-swing with a 40% chance to hit was observed 38 times but only 22.4 were expected; 4-misses-in-4-swings with a 30% chance to hit was observed 16 times but only 9.1 were expected. In general, I think this may provide some evidence for the RNG creating too many long sequences of misses.

For a replay of Underground Passage, the significant deviations are

Code: Select all
60 1 0 5 37.2 0.13 <--
60 1 1 88 55.8 1.58 <--
60 4 0 1 2.3 0.44 <--
60 4 1 23 13.7 1.68 <--

but there isn't really enough data to go on here (many events were excluded by the 30-observation criterion).

Could someone run st against another savefile from a scenario that occurs late in a long campaign?

It would also be great if the exact meaning of the [attacks] and [defends] WML tags could be clarified. (EDIT: see the subsequent clarification.) The Statistical Scenario WML wiki page doesn't describe these. (EDIT: added descriptions.) Almost all the statistics being gathered occur in [attacks] tags, with only a few observations in [defends] tags. (EDIT: this was a game bug, fixed post 0.8.8.)
- Attachments
-
- st-1.10.zip
- (11.64 KiB) Downloaded 871 times