Solutions to hitting
Sure...but it's still a lot easier than doing a statistically rigorous proof, imho. The only disadvantage is that failure to prove the RNG is bad won't actually be proof that it's good.

mikekchar wrote: Unfortunately it's more complicated than that. True, if I can show a particular defect in the distribution, it shows the distribution is off. However, there are an infinite number of ways the distribution can be off. For instance, perhaps every second generation is low.
The code is portable: it uses the standard C++ library function, rand(). Sure, the precise behavior can differ from platform to platform, but it's still portable code.

mikekchar wrote: I agree with you here. However, if there are differences, it might explain why some people experience one thing and others experience something else. My only point was that the behaviour of the RNG in Wesnoth is already unportable, although the code itself will compile on multiple platforms. If there turns out to be a problem on a specific platform, there's no real reason it couldn't be changed for that particular platform.
Having to add #ifdefs on a per-platform basis is something I'd avoid unless absolutely necessary. If rand() turns out to be bad on a platform, we'd just import our own random number algorithm.
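If we ever did need to replace rand() on a misbehaving platform, a deterministic generator is only a few lines. A minimal sketch using the Park-Miller "minimal standard" LCG (illustrative code, not anything in Wesnoth today):

```cpp
#include <cstdint>

// Hypothetical stand-in for a platform-dependent rand(): the classic
// "minimal standard" Lehmer generator (multiplier 16807, modulus 2^31 - 1).
// The same seed produces the same sequence on every platform.
class portable_rng {
public:
    explicit portable_rng(uint32_t seed) : state_(seed ? seed : 1) {}

    // Returns the next value in [1, 2147483646].
    uint32_t next() {
        state_ = static_cast<uint32_t>(
            (static_cast<uint64_t>(state_) * 16807u) % 2147483647u);
        return state_;
    }

    // Value in [0, n). Note the slight modulo bias -- negligible for
    // n = 100, but worth knowing about.
    uint32_t next_below(uint32_t n) { return next() % n; }

private:
    uint32_t state_;
};
```

The classic sequence from seed 1 starts 16807, 282475249, ..., which makes cross-platform verification trivial.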
I'm not going to say I'm sure that code doesn't have bugs...after all, it's barely been tested at all. If you find any, let us know.

mikekchar wrote: I took a look at my save file from the aforementioned 13-miss game. The data does not corroborate my observation. So either I missed some hits or some things aren't being saved. My guess is the former.
Just think of it as a linked list of random numbers. Each child is just the successor to the previous random number.

mikekchar wrote: One question... I'm looking in the replay file and I notice that the [random] tags are nested. Is there some documentation that explains what this means?
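As an illustration only (this is not the actual replay-parsing code), the nested [random] tags can be modelled as a singly linked chain, where each nested child holds the value generated after its parent:

```cpp
#include <memory>

// Illustrative model of the nested [random] tags in a replay file:
// a singly linked list, each nested child being the next value drawn.
struct random_node {
    int value = 0;                      // the random value recorded in this tag
    std::unique_ptr<random_node> next;  // the nested [random] child, if any
};

// Walk the chain and count how many random values the replay recorded.
inline int chain_length(const random_node* head) {
    int n = 0;
    for (; head != nullptr; head = head->next.get()) ++n;
    return n;
}
```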
David
“At Gambling, the deadly sin is to mistake bad play for bad luck.” -- Ian Fleming
Here is a quick Perl script to extract a summary of the statistics section described by Dave. The v1.3 output now looks like the listing below, with the first column being the percentage chance to hit, the second column the sequence of hits, and the third the number of times that %-sequence combination occurs in the savefile. The summary at the end aggregates sequences of hits, so that 0011 and 0101 are both classified as 4 swings, 2 hits, with the count of those events in the last column.
Code: Select all
70 1 2
70 01 2
70 10 3
70 11 9
70 011 2
70 100 1
70 101 2
70 111 8
70 0010 1
70 0011 1
70 0101 1
70 0111 1
70 1010 3
70 1110 1
70 expected, 74.2574257425743 observed
80 00 1
[...]
Summary:
% #a #h #
70 1 1 2
70 2 1 5
70 2 2 9
70 3 1 1
70 3 2 4
70 3 3 8
70 4 1 1
70 4 2 5
70 4 3 2
= 101 75
[...]
== 2610 1311
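For anyone reimplementing the summary step in C++ rather than Perl, the aggregation described above might be sketched like this (hypothetical names, not ott's actual script):

```cpp
#include <map>
#include <string>
#include <tuple>

// Aggregate (chance-to-hit, hit/miss sequence, count) records into the
// summary form used above: (% to hit, #swings, #hits) -> total count.
// "0011" and "0101" both land in the (4 swings, 2 hits) bucket.
using summary_key = std::tuple<int, int, int>;  // (% to hit, #swings, #hits)

inline void add_record(std::map<summary_key, int>& summary,
                       int chance, const std::string& seq, int count) {
    int hits = 0;
    for (char c : seq)
        if (c == '1') ++hits;
    summary[{chance, static_cast<int>(seq.size()), hits}] += count;
}
```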
- Attachments
-
- st-1.3.zip
- (963 Bytes) Downloaded 579 times
OK. I ran st-1.6 (attached) against my savefiles. Assuming that the code to generate attack/defend statistics is working correctly, I'm seeing an almost exact correspondence between expected values and observed values, as far as number of hits is concerned. Examples:

Code: Select all
total 6646 swings, 3576.7 expected vs. 3561 observed hits
total 3041 swings, 1696 expected vs. 1699 observed hits
total 3369 swings, 1747.8 expected vs. 1705 observed hits

In the archive I've attached a few st summaries of savegames late in some of the campaigns, as well as output against two directories of savegames. If I'm interpreting the data correctly, then the random number generation scheme is actually very well behaved for this particular metric on my platform.

Since I'm seeing large deviations from expected damage in-game, there may be something weird with the display or calculation of expected damage values, as I suggested earlier in this thread. My current hypothesis is that the EV calculations in-game assume all swings will be made, while a skirmish actually terminates if one of the parties dies. This would always show an EV higher than the observed value. I'll continue investigating.
- Attachments
-
- st-1.6.zip
- (7.88 KiB) Downloaded 569 times
Last edited by ott on December 16th, 2004, 12:30 pm, edited 1 time in total.
Thanks, ott, for the code. I have similar code as well and have found similar results so far. I *did* however get a game with 9 60%+ misses in a row, and I have the replay file this time. I died (due to really stupid strategy, not an unfair RNG), so I don't have a save file. I haven't gotten around to writing the extraction of the stats from the replay file yet, so I don't know whether anything else was off.
I've added the game as an attachment. It's my impression that there are way too many misses in this game (on both sides), but as I said, I don't have the stats yet.
One thing that might be an issue: in this game, it was reporting 60% to hit from my archer (level 1) and 70% to hit from the level 3 sharpshooter even though they were in the water. I was surprised by this. I can't find the event in the replay file but you can clearly see it if you watch the replay. One possibility is that the dialog is reporting the wrong odds.
Anyway, 4 misses at 60% followed by 5 misses at 70% has a likelihood of .006%. It's not impossible, but it seems rather improbable.
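For the record, the quoted figure checks out, assuming independent swings: 0.4^4 * 0.3^5 is about 6.2e-5, i.e. roughly 0.006%. As a one-liner:

```cpp
#include <cmath>

// Check of the figure quoted above: probability of 4 consecutive misses at
// 60% chance-to-hit followed by 5 consecutive misses at 70%, assuming
// independent swings.
inline double long_miss_probability() {
    return std::pow(0.4, 4) * std::pow(0.3, 5);  // ~6.22e-5, about 0.006%
}
```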
Thanks everyone!
- Attachments
-
- replay.tar.gz
- Replay of a battle where I got 9 60%+ misses in a row. Unfortunately I also died, so no game save file.
- (14.91 KiB) Downloaded 624 times
Looking at your savefile, I see total 1863 swings, 1020.4 expected vs. 1010 observed hits: quite reasonable.
Unfortunately there seems to be no easy way to extract sequences other than through the attacks/defends mechanism, which seems to have the granularity of one side of a skirmish at a time.

Looking at sequences of 4 hits,

Code: Select all
80 4 *: 0
70 4 *: 26
60 4 *: 54
50 4 *: 16
40 4 *: 30
30 4 *: 7

for a total of 133 events, using code like

Code: Select all
perl -n -e '/^50. 4 \d.(..\d)/ and $c+=$1; END{print "$c\n"}'

on the st output. We expect .0081 of the 70 4 * events to be 70 4 0 sequences, or 0.2106 sequences; we observe 1 of these. We expect .0256 of the 60 4 * sequences to be 60 4 0, or 1.3824 sequences; we actually observe 2 of these. For 50 4 *, we expect 1 zero-sequence and observe 1. For 40 4 *, we expect 3.888 and observe 7. For 30 4 *, we expect 1.6807 and observe 2. Not really enough to go on here.

Doing this analysis on one of my replays, with 525 sequences of 4 swings, I see

Code: Select all
30 4 *: 38
40 4 *: 101
50 4 *: 92
60 4 *: 143
70 4 *: 150
80 4 *: 1

of 4-swing events. This yields 9.1238/16, 13.0896/12, 5.75/12, 3.6608/7, 1.215/5, .0016/0 pairs of expected/observed values for each P. Four of these look fishy (far too low, we should have seen more of these events) and may merit further investigation.

This could all indicate a problem but the other way, too few runs of 4 misses. Maybe there is something about rand() and length-4 runs of small values (too few), and length-9 runs (too many)?
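The expected counts in this analysis come from the all-miss probability (1 - p)^n: for example, 0.3^4 = 0.0081 of the 70 4 * events, giving 26 * 0.0081 = 0.2106 expected all-miss sequences. As a sketch:

```cpp
#include <cmath>

// Expected number of all-miss sequences among N sequences of n swings,
// each swing hitting independently with probability p: N * (1 - p)^n.
// For p = 0.7, n = 4 the per-sequence probability is 0.3^4 = 0.0081,
// matching the fraction used in the analysis above.
inline double expected_all_miss(int n_sequences, int swings, double p_hit) {
    return n_sequences * std::pow(1.0 - p_hit, swings);
}
```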
Yes, it's all very mysterious. Unfortunately, as you said, there are far too few trials to determine whether there is any statistical issue here. However, in my game there are 13 4-miss occurrences against an expected number of 8.17. In your game there are 52 observed versus 32.84 expected.
In both cases this is a ratio of 1.6:1 (almost exactly). Unless I'm misunderstanding your numbers, this appears to be fairly strong evidence that there are more long misses than there should be.
I'm not sure why you've come to the opposite conclusion. Perhaps a mistake on one or both of our parts?
Over Christmas I intend to put together the data from a whole campaign (or at least as far as I can get with my lack of skill) and make a report. I can't quite decide what the distribution of 4 strike-misses would be. If it's normal, we already have significant indication of a problem (much better than 95% confidence). However, I suspect that it's not, so I'd like to have at least 300 observations just to make sure.
OK... I decided to look into the data of the campaign I currently have running. I don't have all my tools automated yet, so some of the calculations were done by hand. I only had to do the numbers for 30% and 40% hitting to determine that there is definitely something wrong (but it's been a long time since I took a stats course, so I might be mistaken). First, my assumptions; the data follows below.
Now for my assumptions. The random number generator is supposedly uniformly distributed. That means the results within a category (percentage and number of swings) are binomially distributed, and the observed counts in each category are approximately normally distributed (Central Limit Theorem).
This means that as long as there are 30 or more observations in a category, any observed:expected ratio outside the range 0.6667 to 1.5 (two standard deviations) is statistically significant at the 95% confidence level.
Anybody who actually has access to a stats book should check this out to make sure my memory isn't playing tricks on me.
Now, you will notice that apart from one category, all of them have many more than 30 observations. Not only that but *every* category has a statistically significant problem in it!!!!!
This should conclusively show that with a 95% confidence, the RNG is *not* uniformly distributed and is biased. How it is biased I couldn't say other than guessing (I think it favours middle numbers).
I will continue to collect and process stats, but at this point I think a bug report should be in order. Of course I am open to others pointing out the flaws in my argument
Code: Select all
30% Attacking
=============
574 Total attempts
Observed Expected Ratio
Swung 1 times with 0 hits --> 24 44.8 0.54
Swung 1 times with 1 hits --> 40 19.2 2.08
Total 64 64.0
Swung 2 times with 0 hits --> 127 135.73 0.94
Swung 2 times with 1 hits --> 136 116.34 1.17
Swung 2 times with 2 hits --> 14 24.93 0.56
Total 277 277.00
Swung 3 times with 0 hits --> 66 63.798 1.03
Swung 3 times with 1 hits --> 78 82.026 0.95
Swung 3 times with 2 hits --> 41 35.154 1.17
Swung 3 times with 3 hits --> 1 5.022 0.20
Total 186 186.000
Swung 4 times with 0 hits --> 12 11.284 1.06
Swung 4 times with 1 hits --> 24 19.345 1.24
Swung 4 times with 2 hits --> 11 12.436 0.88
Swung 4 times with 3 hits --> 0 3.553 0.00*
Swung 4 times with 4 hits --> 0 0.381 0.00*
Total 47 47.000
Grand Total 574 574
40% Attacking
=============
508 Total attempts
Observed Expected Ratio
Swung 1 times with 0 hits --> 13 21 0.62
Swung 1 times with 1 hits --> 22 14 1.57
Total 35 35
Swung 2 times with 0 hits --> 74 74.52 0.99
Swung 2 times with 1 hits --> 113 99.36 1.14
Swung 2 times with 2 hits --> 20 33.12 0.60
Total 207 207.00
Swung 3 times with 0 hits --> 29 27.648 1.05
Swung 3 times with 1 hits --> 54 55.296 0.98
Swung 3 times with 2 hits --> 29 36.864 0.79
Swung 3 times with 3 hits --> 16 8.192 1.95
Total 128 128.000
Swung 4 times with 0 hits --> 24 17.8848 1.34
Swung 4 times with 1 hits --> 57 47.6928 1.20
Swung 4 times with 2 hits --> 48 47.6928 1.01
Swung 4 times with 3 hits --> 9 21.1968 0.42
Swung 4 times with 4 hits --> 0 3.5328 0.00*
Total 138 138.0000
Grand Total 508 508
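The Expected column in these tables follows the binomial distribution: Total * C(n, k) * p^k * (1 - p)^(n - k); e.g. 64 * 0.7 = 44.8 for one swing and zero hits at 30%. A sketch of the computation (illustrative, not mikekchar's actual tooling):

```cpp
#include <cmath>

// C(n, k) computed incrementally to stay exact for small n.
inline double binomial_coefficient(int n, int k) {
    double c = 1.0;
    for (int i = 1; i <= k; ++i)
        c = c * (n - k + i) / i;
    return c;
}

// Expected count for a (swings = n, hits = k) category out of `total`
// observed sequences, assuming independent swings with hit probability p:
// total * C(n, k) * p^k * (1 - p)^(n - k).
inline double expected_count(int total, int n, int k, double p) {
    return total * binomial_coefficient(n, k)
                 * std::pow(p, k) * std::pow(1.0 - p, n - k);
}
```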
- Elvish_Pillager
I think if there's a problem, it's an inverse-law-of-averages type of thing: It tends to get the same results repeatedly.
That explanation would make sense from my experience. I've played lots of games where I was consistently lucky or unlucky in the same ways.
However, given the usage, this is just an arbitrary guess.
It's all fun and games until someone loses a lawsuit. Oh, and by the way, sending me private messages won't work. :/ If you must contact me, there's an e-mail address listed on the website in my profile.
I think you are right. Re-reading what I wrote, the expected values are lower than the observed values. My conclusion was written as if things were the other way round. The data therefore supports a hypothesis that there are more 4-miss sequences than there should be, at least for certain values of P.

mikekchar wrote: In your game there are 52 observed versus 32.84 expected.
In both cases this is a ratio of 1.6:1 (almost exactly). Unless I'm misunderstanding your numbers, this appears to be fairly strong evidence that there are more long misses than there should be.
I'm not sure why you've come to the opposite conclusion. Perhaps a mistake on one or both of our parts?
BTW, many of these calculations are already done by the st script, in the Summary section. Be sure to use the -v command line switch to get the verbose stats. I'll add the ratios in the next version.

mikekchar wrote: some of the calculations were done by hand
Looks to me like the following are unequivocal evidence of a problem:

mikekchar wrote: I only had to do the numbers for 30% and 40% hitting to determine that there is definitely something wrong
Code: Select all
30% Attacking
Swung 1 times with 0 hits --> 24 44.8 0.54
Swung 1 times with 1 hits --> 40 19.2 2.08
I'm not sure why you conclude this. Most categories look quite close to 1 (except the 30% 1-swing 0-hit and 1-swing 1-hit cases), and many of the others have counts below 30.

mikekchar wrote: Now, you will notice that apart from one category, all of them have many more than 30 observations. Not only that but *every* category has a statistically significant problem in it!!!!!
Maybe EP's hypothesis of not enough "mixing" fits the facts better? There is some evidence that there are more 4-miss sequences than there should be for each of the 30%, 50%, 60% and 70% hit probabilities. This means we are seeing more "low" random numbers in a row than expected, which seems contrary to the RNG favouring middle numbers.

mikekchar wrote: This should conclusively show that with a 95% confidence, the RNG is *not* uniformly distributed and is biased. How it is biased I couldn't say other than guessing (I think it favours middle numbers).
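One way to probe the "repeats itself" idea directly would be a serial-correlation check on successive values of rand() % 100 on the platform in question: a strongly positive correlation would mean low draws tend to follow low draws (streaky misses), while near zero is what a good generator gives. A sketch of such a diagnostic (not anything currently in the game):

```cpp
#include <cmath>
#include <cstdlib>

// Quick-and-dirty serial correlation of consecutive rand() % 100 values.
// Returns the sample correlation coefficient between each draw and the
// next one; values near 0 indicate no first-order streakiness.
inline double serial_correlation(unsigned seed, int samples) {
    std::srand(seed);
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    int prev = std::rand() % 100;
    for (int i = 0; i < samples; ++i) {
        int cur = std::rand() % 100;
        sx += prev; sy += cur;
        sxx += double(prev) * prev;
        syy += double(cur) * cur;
        sxy += double(prev) * cur;
        prev = cur;
    }
    double n = samples;
    double cov = sxy - sx * sy / n;
    return cov / std::sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}
```

On a healthy rand() the result should sit within roughly 1/sqrt(samples) of zero; a platform whose generator "tends to get the same results repeatedly" would stand out here.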
Here is version 1.10 of st.
This one has a different output format, and does a simple analysis of the distribution of %-swings-hits events as suggested by mikekchar. On an old 0.8 replay for Return to Wesnoth, this gives

Code: Select all
30 2 2 3 8.5 0.35 <--
30 3 3 1 2.9 0.35 <--
30 4 0 16 9.1 1.75 <--
40 1 0 18 33.6 0.54 <--
40 1 1 38 22.4 1.70 <--
50 1 0 5 20.5 0.24 <--
50 1 1 36 20.5 1.76 <--
50 4 0 12 5.8 2.09 <--
60 1 0 20 49.6 0.40 <--
60 4 0 7 3.7 1.91 <--
70 1 0 2 21.0 0.10 <--
70 2 0 6 9.8 0.61 <--
70 3 0 8 4.8 1.68 <--
70 4 0 5 1.2 4.12 <--
80 2 0 1 1.6 0.64 <--

as the statistically significant differences from expected, using mikekchar's criteria of at least 30 observations and a ratio outside the interval [2/3, 3/2]. Some of these look mildly problematic, e.g. 1-hit-in-1-swing with a 40% chance to hit was observed 38 times but only 22.4 were expected; 4-misses-in-4-swings with a 30% chance to hit was observed 16 times but only 9.1 were expected. In general, I think this may provide some evidence for the RNG creating too many long sequences of misses.

For a replay of Underground Passage, the significant deviations are

Code: Select all
60 1 0 5 37.2 0.13 <--
60 1 1 88 55.8 1.58 <--
60 4 0 1 2.3 0.44 <--
60 4 1 23 13.7 1.68 <--

but there isn't really enough data to go on here (many events were excluded by the 30-observation criterion).

Could someone run st against another savefile from a scenario that occurs late in a long campaign?

It would also be great if the exact meaning of the [attacks] and [defends] WML tags could be clarified. (EDIT: see the subsequent clarification.) The Statistical Scenario WML wiki page doesn't describe these. (EDIT: added descriptions.) Almost all the statistics being gathered occur in [attacks] tags, with only a few observations in [defends] tags. (EDIT: this was a game bug, fixed post 0.8.8.)
- Attachments
-
- st-1.10.zip
- (11.64 KiB) Downloaded 871 times