"In particular, we want to reconsider whether the success measure was itself tuned. Note that there are at least three steps in the process:McKay proceeds to explain in detail:The standard history is that, apart from addition of the permutation test at the second step, all these measures are the same. However, we now know that they are all different."
- WRR used some success measure M1 for their first list.
- WRR used some success measure M2 for their second list.
- WRR distributed a program which implements a success measure M3.
(1) "We know that M1 and M2 are different because the preprints which describe them have differing mathematical descriptions." (Emphasis mine)For more details he refers the reader to an appendix titled "Early changes in WRR's success measure." At the beginning of this appendix he announces:
(2) "Here are two examples where the success measure presented by WRR with their first list of rabbis differs mathematically from the success measure presented by WRR with their second list of rabbis"

Our response:

"there is reason to believe it was not actually used to compute the distances that are presented in the preprint"

This is an absolute understatement. McKay knows very well that in (what he calls) M1, M2 and M3, there was only one method to create the set of perturbations. This can be proved easily, because the perturbation production methods leave prominent "fingerprints."

"In the second preprint of WRR, where the second list of rabbis was first presented, this part of the algorithm is described using English prose that can plausibly be read either way."

Was he hoping that no one would notice this sentence?
| In Conclusion: McKay's claims are absolutely baseless. His opening assertions (denoted by (2) above) and the examples themselves are totally contradictory. One can hardly escape the conclusion that McKay hoped the reader would swallow his "headlines" without checking the details. |

"We know that M2 and M3 are different because the program distributed by WRR does not give the same distances between word pairs as are listed in WRR's preprints. Witztum has admitted that there was an earlier program that gave different values but is unable to give it to us. Some of the changes might have been strictly error corrections, but since WRR's later programs still contain errors we don't know whether error correction was performed in a blind fashion (i.e., without regard to whether the result improved)."

Our response:
This reminds me of a similar amusing theory developed by Gil Kalai, who was McKay's co-author on several articles (including the one published in Statistical Science). His theory tried to explain WRR's success by an appeal to "perhaps": perhaps there were typing errors when WRR's data was fed into the computer, and perhaps, psychologically, WRR only caught those mistakes which led to "bad" results and not those which led to "good" results. Perhaps this was the source of their overall good result.

He entertained this theory for quite a while. I heard about it from people with whom he had discussed it. One of them asked me what I had to say about it. I asked him if he knew how many "good" results there had been in my experiment. He had no idea. When I told him that there were about sixty, he laughed. Suddenly he realized that by merely typing about sixty pairs of expressions (those which had the "good" results) correctly and working out the results with our freely available program, it could easily be verified whether or not typing mistakes had caused our "good" results.

Now McKay comes along with a new "perhaps" theory, taking advantage of the fact that the PROG1 program was lost. Here too the facts can be checked. The procedure described in the first preprint (excluding the perturbation method, which, as we proved in Sec. A, example 1, was exactly the same as in the final article) can be given to an unbiased programmer to produce a PROG'1, and one can then see how it performs in the permutation test. I expect that the results would be much the same as those of ELS1.
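The typing-error check described above amounts to re-entering the pairs, recomputing their values, and comparing them with the reported ones. A minimal sketch of that comparison follows. The function name `find_discrepancies`, the tolerance, and the sample values are all hypothetical and purely illustrative; they are not WRR's program or data.

```python
from math import isclose


def find_discrepancies(published, recomputed, rel_tol=1e-6):
    """Return the pairs whose recomputed value differs from the published one.

    Both arguments map a (name, date) pair to a numeric result.
    A pair missing from `recomputed` is also reported.
    """
    mismatches = []
    for pair, published_value in published.items():
        new_value = recomputed.get(pair)
        if new_value is None or not isclose(published_value, new_value, rel_tol=rel_tol):
            mismatches.append(pair)
    return mismatches


# Made-up placeholder values, NOT results from any experiment.
published = {("name_a", "date_a"): 0.012, ("name_b", "date_b"): 0.034}
recomputed = {("name_a", "date_a"): 0.012, ("name_b", "date_b"): 0.051}

print(find_discrepancies(published, recomputed))
```

An empty list would mean that careful re-entry reproduces every reported value, which is exactly what the typing-error theory predicts should not happen for the "good" results.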
"Nevertheless, the degree to which this explanation is significant is impossible to determine."
"A basic problem with experiments like MBBK's 'study of variations' is the interdependence of the variations: This interdependence may be between the functions chosen for this purpose, or between the chosen sampling values for a certain parameter. In fact, most of the variations chosen by MBBK have this flaw. As a direct consequence of these interdependencies, MBBK admit that their results are unquantifiable.
But even though they cannot quantify their results they still use them to create a psychological impact. See paragraph 10.
Since the name of the game becomes 'psychology', MBBK's presentation of the data plays a central role. Under these circumstances, any misleading presentation of the data has a great impact on the reader. We will give explicit examples of this in chapters II and III." (Chap. I, Sec. 6)

"The 'study of variations' lacks quantitative assessment.
MBBK write:
'For these reasons we are not going to attempt a quantitative assessment of our evidence. We merely state our case that the evidence is strong and leave it for the reader to judge.' (Pg. 159)
But how could a study lacking quantitative assessment be published in a statistics journal!?" (ibid, Sec. 10)

In particular, McKay et al. often claim that the "vast majority" of (or even "almost all") the variations make the results worse (and that this supports their hypothesis), or they stress how few results indicate any improvement. In other words, they make extensive use of "raw counts."

"What we found was that, in the great majority of cases, changing a parameter of WRR's experimental method made their result weaker."

This claim is based on "raw counts."

"The refutation of Witztum's first claim is that he did not manage to identify a single variation which tells a contrary story and should have been presented but was not".

There is not a grain of truth in this reply. Our article [1] is replete with variations which indicate the opposite of McKay's thesis. Therefore he is forced to invent excuses for not including them. In addition he:
| | P1 | P2 | P3 | P4 | Min(P1-P4) | Min(P1-P2) |
| L1 | 1.0 | 0.6 | 1.0 | 0.8 | 0.7 | 0.7 |
| L2 | 0.1 | 0.6 | 0.1 | 0.6 | 0.6 | 0.6 |

| | r1 | r2 | r3 | r4 | Min(r1-r4) | Min(r1-r2) |
| L1 | 1.1 | 1.0 | 0.9 | 1.0 | 1.0 | 1.0 |
| L2 | 0.3 | 0.9 | 0.3 | 0.9 | 0.9 | 0.9 |

"Here are two examples where the success measure presented by WRR with their first list of rabbis differs mathematically from the success measure presented by WRR with their second list of rabbis. The two scans below are from the 1986 preprint in which WRR presented their first list of rabbis."

But, since we proved (Chap. I, Sec. A) that this claim is empty, McKay replaced it with the following statement:

"Here are two examples where the success measure presented by WRR in their earliest preprints differ mathematically from the success measure published by them in Statistical Science. The 1986 preprint presented the first list of rabbis, and 1987 preprint presented the second list. Neither preprint mentioned the permutation test that appeared in Statistical Science."

1. Note that the original claim was supposed to create the (mistaken) impression that there is a difference between what McKay calls M1 and what he calls M2. But his new version doesn't mention this at all, and only speaks of the differences from (what he calls) M3!

"Neither preprint mentioned the permutation test that appeared in Statistical Science."

This statement is correct: it is well known that the permutation test was proposed a long time after the preprints were publicized; therefore, it could not have been mentioned in them. Since the permutation test is irrelevant here, and even McKay did not mention it in connection with the "two examples," we must conclude that McKay added this superfluous statement only to create the false impression that something was "not right" with the preprints.
| Note that McKay started with his "Study of Variations" [6], where he pretended to prove that WRR "cooked" their data. Following our refutation [1] of his "proofs", he was forced to make a "reconsideration of his hypothesis" and was left with speculations [2]. After we disproved even these speculations in this article, he chose to make pointless "inferences" about one preprint or another, even though he admits that this makes no difference to how the experiment was executed. |
| The text in the previous frame also deserves to be stressed here. |