The Superfinal of the 17th season of the Top Chess Engine Championship (TCEC) has just concluded, and Leela Chess Zero emerged as champion against the mighty Stockfish with a final score of 52.5-47.5. Leela won 17 games (16 as white and 1 as black), drew 71 games, and lost 12 games (11 as black and 1 as white) to become TCEC champion for the second time, after failing to qualify in season 16 (although it was undefeated in the Premier Division).
The breakdown of the results is shown in the table below.
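As a quick sanity check, the final score follows from the win/draw/loss breakdown (1 point per win, 0.5 per draw):

```r
# Leela's match score from the W/D/L counts given above
wins <- 17; draws <- 71; losses <- 12
score <- wins + 0.5 * draws          # 52.5 points
games <- wins + draws + losses       # 100 games
```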
It should be noted that the cutechess implementation used by TCEC had not been updated to properly convert Leela’s evaluation into centipawns; it was actually Leela’s centipawn-conversion code that had to be updated.1 As a result, Lc0 displayed very low centipawn scores even when its losing probability was above 90%. This may account for the 7 losses by checkmate seen during the SuFi.
Below is a plot of the results of games for each engine playing as white. Note that because of contempt2, SF evaluates the opening positions very conservatively as black compared to when it is playing as white. Be that as it may, the difference between Leela’s evaluation as white and SF’s evaluation as black is remarkable. Some chatters suggested this could have affected SF’s performance; it has been claimed in chat that a contempt of 0 performs better against Leela. Notably, many of Leela’s wins as white came when its opening-book evaluation was around 1 or when SF’s opening-book evaluation was below 0.5. Most notable is game 94 (Queen’s Pawn Game, Chigorin variation), which SF evaluated at 0.03 out of the opening book while Leela gave 1.18. In the reverse game, which SF won as white, SF gave an evaluation of 1.55, while Leela gave 1.1, not far from its own evaluation when it played the white side. While the two engines’ evaluations generally agreed when SF was playing as white, there were some notable exceptions, especially game 7, which Leela evaluated at 0.6 and Stockfish at 1.43.
There were only two games won as black: one by Stockfish in game 16, where Leela, playing white, gave an opening evaluation of 0.78 against Stockfish’s 0.16; and one by Leela in game 95, where Stockfish, playing white, gave an opening evaluation of 0.90 against Leela’s 0.93.
In game 16, Leela was optimistic about its position, giving evaluations \(>1\) up to move 115, slowly declining afterwards. But as is typical of Leela, its overzealousness to push for the win can sometimes backfire, especially in the endgame, when Leela no longer has enough time to analyze its position deeply. In this case, Leela blundered away the draw.
In game 95, a French opening, Leela showed its mastery of the French, overturning a great opening advantage for white by closing the position, scoring the only reverse win of the entire Superfinals.
One of the more memorable moments in the Superfinals for me was game 66, when Leela’s evaluation jumped from +1.3 to +1.69 after the pawn sacrifice 25. c5!!. Leela also attempted to sacrifice another pawn on c4 on move 28 (28... Qxc4 does not work because 28... Qxc4 29. Bb3 Qb4 30. Bxf7+ Kxf7 31. Rxd6 Bg4 32. f5 gxf5 33. Bd2 Qc4 34. Qg3 Kg8 35. e5 Red8 36. Bc3 h5 37. Qe3 f4 38. Qd2 Rxd6 39. exd6 is totally winning for white), and successfully sacrificed a pawn with 29. h5 (en route to a thorn pawn?). After 29... gxh5, Stockfish’s pawn structure looked very bad.
| ECO | Opening pairs | Openings |
|---|---|---|
| A | 9 | Startposition 1.d4; Dutch Leningrad; Budapest gambit; English 1… Nc6; Czech Benoni; Dutch; Snake Benoni; Trompovsky; Dutch |
| B | 15 | Sicilian Keres Attack; Modern Defence; Sicilian 4… Qb6; Owen’s Defence; Sicilian Dragon; Scandinavian; Caro Kann Advance; Sicilian Kan; The Black Lion; Sicilian Taimanov; Nimzowitsch Defence; Pirc Defence; Sicilian 4… Qb6; Modern Defence; Sicilian Najdorf 6.Be3 |
| C | 10 | Frankenstein-Dracula gambit; French Winawer; Ruy Lopez Schliemann; Startposition 1.e4; French Classical; Fried Liver attack; French 2.d3; Ruy Lopez Zaitsev; Traxler gambit; French Advance |
| D | 5 | Slav Bronstein 5… Bg4; Benko gambit; Slav Geller gambit; QGD Chigorin; Queen’s Pawn |
| E | 11 | King’s Indian Mar del Plata; Benoni 7.Nd2; King’s Indian Sämisch; Queen’s Indian Petrosian; King’s Indian Fianchetto; King’s Indian Karpov; King’s Indian; Benoni 7.f4; King’s Indian Sämisch; Nimzo Indian; King’s Indian Mar del Plata |
The table below shows the game numbers; the openings, variations, and ECO codes after transposition; the win rate (by Leela); the elo difference after each game (elodiff); the standard error of the elo differences; the likelihoods of superiority; the opening evaluations by Leela (Lc0) and by Stockfish (SF); and the result as white. Note that each opening is played as white by both engines in turn; SF plays each opening as white first.
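For reference, a likelihood of superiority (LOS) can be approximated from the raw win/loss counts with the usual normal approximation, in which draws carry no signal (a sketch of the conventional formula; TCEC’s exact computation may differ):

```r
# LOS ~ P(Leela is the stronger engine), normal approximation:
# LOS = pnorm((wins - losses) / sqrt(wins + losses))
wins <- 17; losses <- 12
los <- pnorm((wins - losses) / sqrt(wins + losses))
round(los, 3)   # roughly 0.82
```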
We see that the elo difference after 100 games is around 17, but with large error bars (SE = 70.15). I wonder how the elo difference would play out with a larger sample size.
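The headline figure can be reproduced from the final score with the standard logistic Elo model (a sketch, not necessarily the exact method used to produce the table):

```r
# Elo difference implied by a match score under the logistic model:
# expected score p = 1 / (1 + 10^(-d/400))  =>  d = 400 * log10(p / (1 - p))
score <- (17 + 0.5 * 71) / 100   # 52.5 points out of 100 games
elodiff <- 400 * log10(score / (1 - score))
round(elodiff, 1)   # about 17.4
```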
We can now see the estimated ELO differences at the last game of each ECO group of openings.
We see that Leela racked up the lead through the A and C openings in this season.
Looking at the opening evaluations by ECO family, we can see that the opening evaluations did not differ much when Stockfish played white and Leela played black, but they differed a lot when SF played black and Leela played white. Notice, though, that the D openings were evaluated almost identically by Stockfish and Leela.
Quite interesting too is the number of moves in each game (mean = 101.92, sd = 46.79). Games were considerably shorter when Stockfish was playing white (mean = 87.2, sd = 40.13), especially when it was winning (mean = 68.09, sd = 19.44). Games took longer to finish when Leela was playing white (mean = 116.64, sd = 48.69), especially when it was winning (mean = 124.12, sd = 41.55). However, SF lost as white (game 95, 93 moves) in far fewer moves than it took to win as black (game 16, 196 moves, coming via a long series of high-level shuffling from a fortress-like position, after Leela pressed for activity as discussed above).
We also see that for this SuFi, the rooks and the king moved the most, perhaps due to the many pawn and rook endings.
Finally, Leela’s evaluations seemed to agree with SF’s evaluations only within a certain centipawn range, around \((-3,3)\).
The reason was that TCEC had yet to update the system to reflect the correct centipawn evaluation. (Again, it was the Leela centipawn code that had to be updated.) This resulted in a lot of mates during the competition. Here is an attempt to model Leela’s centipawn evaluation as a function of SF’s evaluation. I have stored the matched evaluations throughout the games here. This CSV file contains all of the evaluations in all games for which Leela’s evaluation is in \((-4,4)\) and Stockfish’s evaluation is in \((-20,20)\). There is nothing special about the choice of limits; within them, one engine’s evaluation seems to be well-behaved with respect to the other’s. I have also removed the last move of the engine with the greater number of moves so that the numbers of evaluations match.
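The filtering step can be sketched as follows. The data frame `evals_all` and its values here are hypothetical stand-ins for the full per-move evaluation data; the actual preparation script is not shown in the post:

```r
# Hypothetical matched per-move evaluations (illustrative values only)
evals_all <- data.frame(SF  = c(0.10, 25.0, -21.0, 3.0),
                        Lc0 = c(0.25,  3.1,  -2.0, 5.0))
# Keep only positions where both evaluations fall inside the chosen windows
evals_comp <- subset(evals_all, Lc0 > -4 & Lc0 < 4 & SF > -20 & SF < 20)
nrow(evals_comp)   # only the first row survives the filter
```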
I fitted a logistic curve to the evaluations. To do this, I added 20 to SF’s evals and 4 to Leela’s evals and proceeded to fit the model in R.
```r
evals_comp <- read.csv("evals_tcec.csv")
x <- evals_comp$SF + 20
y <- evals_comp$Lc0 + 4
data.df <- data.frame(x = x, y = y)
max(y)
```

```
##  11.91
```

```r
logit <- qlogis
model.0 <- lm(logit(y/12) ~ x, data = data.df)
summary(model.0)
```
```
## 
## Call:
## lm(formula = logit(y/12) ~ x, data = data.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.271 -0.154 -0.028  0.173  3.747 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.64397    0.02670    -136   <2e-16 ***
## x            0.15413    0.00125     124   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.334 on 8976 degrees of freedom
## Multiple R-squared: 0.63, Adjusted R-squared: 0.63
## F-statistic: 1.53e+04 on 1 and 8976 DF, p-value: <2e-16
```
```r
# starting values: the asymptote, and the intercept and slope of the logit fit
phi1 <- 12
phi2 <- coef(model.0)[1]
phi3 <- coef(model.0)[2]
model <- nls(y ~ phi1/(1 + exp(-(phi2 + phi3*x))),
             start = list(phi1 = phi1, phi2 = phi2, phi3 = phi3),
             data = data.df, trace = TRUE)
```
```
## 6218.7 : 12.00000 -3.64397  0.15413
## 5969.8 : 10.50614 -4.03318  0.18298
## 5779.1 :  9.16663 -4.92558  0.23813
## 5325.2 :  8.8094  -6.4689   0.3196
## 5287.8 :  8.97024 -6.74805  0.33167
## 5287.8 :  8.95998 -6.77621  0.33319
## 5287.8 :  8.95897 -6.77960  0.33337
## 5287.8 :  8.95885 -6.77999  0.33339
```
```
## 
## Formula: y ~ phi1/(1 + exp(-(phi2 + phi3 * x)))
## 
## Parameters:
##      Estimate Std. Error t value Pr(>|t|)    
## phi1  8.95885    0.04817   186.0   <2e-16 ***
## phi2 -6.77999    0.10502   -64.6   <2e-16 ***
## phi3  0.33339    0.00552    60.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.768 on 8975 degrees of freedom
## 
## Number of iterations to convergence: 7 
## Achieved convergence tolerance: 4.68e-06
```
```r
# set parameters from the fitted model
phi1 <- coef(model)[1]
phi2 <- coef(model)[2]
phi3 <- coef(model)[3]
# construct a range of x values bounded by the data
x <- c(min(data.df$x):max(data.df$x))
# predicted Lc0 evals (still on the shifted scale)
y <- phi1/(1 + exp(-(phi2 + phi3*(x))))
# undo the +20/+4 shifts to create the prediction data frame
predict <- data.frame(x = x - 20, y = y - 4)
```
```r
# create a plot of actual values and the predictions from the fitted model
library(ggplot2)
ggplot(data = evals_comp, aes(x = SF, y = Lc0)) +
  geom_point(size = 1) +
  theme_bw() +
  labs(x = 'SF', y = 'Lc0') +
  geom_line(data = predict, aes(x = x, y = y), size = 1, color = "blue")
```
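With the fitted coefficients, the mapping from a Stockfish evaluation to Leela’s expected evaluation can be packaged as a small helper. The coefficient values are copied from the `nls` output above, and the function is only a rough approximation within the fitted range:

```r
# Predicted Lc0 eval from an SF eval, undoing the +20/+4 shifts used in fitting
sf_to_lc0 <- function(sf, phi1 = 8.95885, phi2 = -6.77999, phi3 = 0.33339) {
  phi1 / (1 + exp(-(phi2 + phi3 * (sf + 20)))) - 4
}
round(sf_to_lc0(c(-3, 0, 3)), 2)
```

```
## -1.78  0.23  2.35
```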
Update: 24 April 2020
All of the SuFi games together ran for a total of 12.88 days (excluding short intervals between games). The average duration of a SuFi game was 3 hours 5 minutes, with a standard deviation of 29 minutes.
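As a rough consistency check, the per-game average alone nearly accounts for the total; the intervals between games make up the small remainder:

```r
# 100 games at an average of 3 h 05 min each, expressed in days
avg_minutes <- 3 * 60 + 5
total_days <- 100 * avg_minutes / (60 * 24)
round(total_days, 2)   # about 12.85, close to the observed 12.88
```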
The following table shows some summary statistics by ECO code for the whole SuFi.
The following table shows some summary statistics by ECO code with Leela playing white.