
View Full Version : Chess ratings contest



Kevin Bonham
07-08-2010, 09:17 PM
http://kaggle.com/chess

This is a contest set up by Jeff Sonas in which people can design ratings systems that work better than ELO. Entries close mid-November.

ER
07-08-2010, 10:45 PM
http://kaggle.com/chess

This is a contest set up by Jeff Sonas in which people can design ratings systems that work better than ELO. Entries close mid-November.


1. Fritz DVD autographed by world champions Viswanathan Anand, Garry Kasparov, Anatoly Karpov and Viktor Korchnoi (see image)
Korchnoi was never a World Champion! :P

kaggle
Can I still claim copyright on a misspelled word? (Cagles?) :P OK, part of it belongs to Eclectic, who formulated the concept!
Can we just produce a Glicko Variable and win the lot? ;)

Kevin Bonham
07-08-2010, 10:49 PM
Korchnoi was never a World Champion

That is sloppy editing to let such a blooper through.


Can we just produce a Glicko Variable and win the lot?

I'm surprised Glicko hasn't been included, among other improvements. Maybe it is difficult to implement using the software they have tried. Maybe we should enter Glicko-2.

Notably, most of the entries so far are worse than ELO, and some are worse than just assuming White scores 54% in every game no matter what the past results were!
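For what it's worth, that 54% benchmark is just a constant prediction; a sketch like this (illustrative only, not the contest's actual scoring script) gives its RMSE:

import math

def constant_baseline_rmse(white_scores, p=0.54):
    # white_scores: actual results from White's side (1, 0.5 or 0 per game)
    return math.sqrt(sum((p - s) ** 2 for s in white_scores) / len(white_scores))

print(constant_baseline_rmse([1, 0.5, 0, 1, 0.5]))  # toy data, just to show the call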

Garvinator
07-08-2010, 10:54 PM
Maybe we should enter Glicko-2.

I would like to see this, in part because I remember how much some people 'cracked the s***s' when FIDE proposed increasing the K-factor. So imagine how many complaints there would be if a player's rating volatility was included ;)

Saragossa
10-08-2010, 10:45 PM
Korchnoi has been the world senior chess champion and thus on a technicality can be approved - reminds me of a similar situation involving a not-quite state rapid champion. Ahh, the dumbing-down of public media.

ER
11-08-2010, 12:37 AM
Korchnoi has been the world senior chess champion and thus on a technicality can be approved - reminds me of a similar situation involving a not-quite state rapid champion. Ahh, the dumbing-down of public media.

Then why not Bobby Cheng??? He is a world champion and he has many decades of productive chess in front of him, plus he is a Victorian! :) :owned:
How are you, Lawrence, how's everyone? :)

Saragossa
11-08-2010, 11:21 AM
I would totally include Bobby in my DVD signing. I'm well, a bit bogged down with study because now is the season of the dreaded independent project. Marcus is in a similar situation, except he has the commitment to continue playing chess, something which I cannot quite juggle. Kole is a full-on talker now after a slow start, and Dad started a new job and is a vegetarian convert, under my guidance, and is loving it. How are you?

ER
11-08-2010, 09:39 PM
I would totally include Bobby in my DVD signing. I'm well, a bit bogged down with study because now is the season of the dreaded independent project. Marcus is in a similar situation, except he has the commitment to continue playing chess, something which I cannot quite juggle. Kole is a full-on talker now after a slow start, and Dad started a new job and is a vegetarian convert, under my guidance, and is loving it. How are you?

Nice to know the whole family is doing well. Well, studying is very important at this stage in life and I am sure you will do well. Are you playing in the weekender??? I am doing fine here, just taking it easy. Say hi to all, Lawrence :)

Kevin Bonham
23-09-2010, 11:43 AM
Update on progress:

http://chessbase.com/newsdetail.asp?newsid=6687

Garvinator
23-09-2010, 12:35 PM
I think Glicko and Glicko 2 will struggle to be accepted, even if they are shown to be the best predictor of future results.

I recall how much jumping up and down there was when FIDE proposed increasing the K-factor. So imagine the reaction if Glicko was introduced with its volatility factors.

Kevin Bonham
23-09-2010, 12:42 PM
It's interesting that Sonas is claiming his own system is outperforming both Glickos. However he does write:


As the organizer of the contest, I have "benchmarked" several prominent rating systems, starting with Chessmetrics, Elo, PCA, and Glicko/Glicko-2. Other systems (including TrueSkill) will also be benchmarked in the near future. A "benchmark" consists of implementing those systems, optimizing any parameters for predictive power, submitting predictions based on their ratings, and publicly describing the details of the methodology in the discussion forum. These benchmark entries help other competitors to gauge the success of their own entries and to get some ideas of what other people have tried in the past. If you are interested in learning more about any of the benchmarked systems, you can find detailed descriptions in the discussion forum on the contest website.

It's the "optimizing any parameters for predictive power" bit I'm curious about here. Might have a look at this sometime.

Kevin Bonham
23-09-2010, 03:03 PM
Yup. Thought so. The Glicko/Glicko2 versions being used aren't really "optimised" properly.

Garvinator
23-09-2010, 03:42 PM
Yup. Thought so. The Glicko/Glicko2 versions being used aren't really "optimised" properly.

What do you mean? I suspect you mean that the !!, !, ?? factors have been removed, which then corrupts the system.
Without the reliability factors, is it really Glicko at all?

Rincewind
23-09-2010, 03:50 PM
I suspect he is just talking about optimising parameters of the system for his particular subset of data, compared to what might be best for, say, a national or international ratings system catering for players with a different make-up.

Also, as far as I can tell, for their competition you can incorporate a pair-wise component to the prediction of your system. Say if the test data includes a game between two players who have already played each other in the training data then you can use a head-to-head comparison for those specific players.

I've looked at the training data and implemented a simple Elo system and obtained an RMSE of around the benchmark number. Adding an unsophisticated head-to-head component, I was able to significantly lower the RMSE.

Of course head-to-head is completely useless for a national rating system.
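The plain Elo part was roughly along these lines (an illustrative sketch with a made-up data format, not my exact code):

from collections import defaultdict
import math

K = 24.0  # illustrative constant; the real value would be tuned on the training data

def expected(ra, rb):
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

def train_elo(games, start=1500.0):
    # games: (white_id, black_id, white_score) tuples in chronological order
    ratings = defaultdict(lambda: start)
    for w, b, s in games:
        e = expected(ratings[w], ratings[b])
        ratings[w] += K * (s - e)
        ratings[b] -= K * (s - e)
    return ratings

def rmse(ratings, test_games):
    errs = [(expected(ratings[w], ratings[b]) - s) ** 2 for w, b, s in test_games]
    return math.sqrt(sum(errs) / len(errs))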

Kevin Bonham
23-09-2010, 04:04 PM
What do you mean? I suspect you mean that the !!, !, ?? factors have been removed, which then corrupts the system.
Without the reliability factors, is it really Glicko at all?

What I'm talking about is that you can make Glicko(/-2) more or less responsive by altering particular constants in the calculations, depending on how often you do ratings runs, and those he uses seem to make it particularly unresponsive compared to something like the ACF system.
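To illustrate what I mean, here is the standard RD-inflation step from Glickman's Glicko paper; the constant c (the values below are made up) controls how quickly a player's rating deviation grows back between rating runs, and hence how responsive the whole system is:

import math

def inflate_rd(rd, c, periods=1, rd_max=350.0):
    # Glicko: RD grows with time since the player's last rated game, capped at rd_max.
    return min(math.sqrt(rd ** 2 + (c ** 2) * periods), rd_max)

for c in (10.0, 60.0):
    print(c, [round(inflate_rd(50.0, c, t)) for t in range(1, 6)])
# a small c keeps RD (and hence rating changes) small, i.e. an unresponsive system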

SHump
04-10-2010, 02:21 PM
A couple of things.
1) The 'world champion' part of the signatures on Fritz 11 has been taken off the blurb, so Viktor need not worry any longer.
2) Is the ACF going to make an entry at all for this competition? Surely that would be best regardless of the result; at least we would then not have anyone say "But they did not use Glicko-2 the way the ACF uses it". I mean, no-one has to act on the result, but it would be good to know where things stand...

Garvinator
06-10-2010, 12:31 AM
A couple of things.
1) The 'world champion' part of the signatures on Fritz 11 has been taken off the blurb, so Viktor need not worry any longer.
2) Is the ACF going to make an entry at all for this competition? Surely that would be best regardless of the result; at least we would then not have anyone say "But they did not use Glicko-2 the way the ACF uses it". I mean, no-one has to act on the result, but it would be good to know where things stand...
I would like to see this from an analysis point of view, but from the practical side I do not see the point. FIDE showed their hand big time when they caved in to protests when it was proposed to increase the K-factor by a few points.

So even if Glicko 2 was shown to be far and away the best predictor of future results, it would never be accepted by FIDE because by their standards it is too volatile.

Kevin Bonham
20-02-2011, 03:23 PM
Update posted and a further competition is being held:

http://www.chessbase.com/newsdetail.asp?newsid=7020

The new comp includes two divisions: one for identifying the best possible rating system, and one for identifying "the most promising approach, out of the ten most accurate entries that meet a restrictive definition of a 'practical chess rating system'".

The main prize is sponsored by Deloitte Australia. "Deloitte is a preeminent provider of analytics globally and helps companies capture, manage and analyze their data as part of their overall business strategy."

The second part is sponsored by FIDE:


This prize will be awarded by FIDE representatives to what they consider to be the most promising approach, out of the ten most accurate entries that meet a restrictive definition of a "practical chess rating system". This restrictive definition is specified within the rules of the contest. For the selected winner, FIDE will provide air fare for a round trip flight to Athens, Greece, and full board for three nights in Athens, and payment toward other expenses, for one person to present and discuss their system during a special FIDE meeting of chess rating experts in Athens.

ELO-worshippers take heed: even FIDE realises ELO is bunk and is seriously looking at refining or replacing it.

This sponsorship is a commendable initiative from FIDE. :clap: :clap:

The following are the rules set by FIDE for a practical system:



(1) A rating vector V(X,M) of one or more numbers (maximum of ten) is maintained and updated on a monthly basis for each player X, representing the components of player X's rating at the start of month M.

(2) The initial rating vectors V(X,1), representing the components of player X's rating at the start of month 1, can only be a function of one or more of the following:
(a) Player X's rating on the initial FIDE rating list provided as part of the contest
(b) Player X's K-factor on the initial FIDE rating list provided as part of the contest
(c) Player X's career # of games on the initial FIDE rating list provided as part of the contest
(d) System constant parameters

(3) The predicted score E(X,Y,M) for player X in a single game against player Y during a particular month M in the test period (months 133-135), can only be a function of one or more of the following:
(a) The rating vector V(X,133) for player X, representing the components of player X's rating at the start of month 133
(b) The rating vector V(Y,133) for player Y, representing the components of player Y's rating at the start of month 133
(c) System constant parameters
(d) The details of whether player X has the white pieces, or player Y has the white pieces, in the game
(e) The value of M (either 133, 134, or 135)

(4) When updating the rating vector for player X, from V(X,M) to the new values V(X,M+1), based on games G1, G2, ..., GN played by player X against opponents Y1, Y2, ..., YN during a particular month M, where N>0, those updates can only be a function of one or more of the following:
(a) The rating vector V(X,M) representing the components of player X's rating at the start of month M
(b) System constant parameters
(c) For each game Gi and opponent Yi out of games G1, G2, ..., GN played by player X against opponents Y1, Y2, ..., YN during month M:
(i) The rating vector V(Yi,M) representing the components of player Yi's rating at the start of month M
(ii) The game outcome (win/draw/loss) for player X in game Gi
(iii) The details of whether player X had the white pieces, or player Yi had the white pieces, or the color of pieces was unknown.
(iv) The details of whether game Gi came from the primary training dataset, secondary training dataset, or tertiary training dataset

(5) Rating vector updates cannot involve any iterative computation that is carried out to convergence in order to solve an optimality criterion (it can require, at most, two iterations of such an algorithm).

(6) Any player X who is not included in the initial FIDE rating list provided as part of the contest, and therefore "enters" the system due to their first month M where they played any games, can either receive an initial rating vector V(X,1) that is maintained unchanged until month M+1, or they can receive their first rating vector V(X,M+1) as part of the rating updating algorithm described above in (4); either approach is acceptable.

(7) System constant parameters may be optimized by the contestant in any fashion they desire, but such values are to remain unchanged and independent of month M when applying the rating updating algorithm described above.

I believe that Glicko-2 as implemented in Australia would not meet some of these criteria for a "practical rating system" - although it clearly is a practical rating system given that we implement it. Nonetheless the criteria above give designers a great degree of room to move in devising a more accurate system.
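To make the constraints concrete, here is a bare-bones skeleton (purely hypothetical, not anybody's entry; the parameter names are invented) of how a one-number rating vector could be maintained within those rules:

def initial_vector(fide_rating, k_factor, career_games, params):
    # Rule (2): may use only the initial-list rating, K-factor, career games
    # and system constants. Here the vector is a single number.
    return [fide_rating if fide_rating is not None else params["start_rating"]]

def expected_score(v_x, v_y, x_is_white, params):
    # Rule (3): only the two rating vectors, system constants and colour.
    colour = params["white_advantage"] if x_is_white else -params["white_advantage"]
    diff = v_x[0] - v_y[0] + colour
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def monthly_update(v_x, games, params):
    # Rule (4): one pass over the month's games, using only V(X,M), constants
    # and per-game data; rule (5): no iteration to convergence.
    # games: list of (opponent_vector, score_for_x, x_is_white) tuples.
    new_rating = v_x[0]
    for v_y, score, x_is_white in games:
        new_rating += params["k"] * (score - expected_score(v_x, v_y, x_is_white, params))
    return [new_rating]

params = {"start_rating": 1500.0, "white_advantage": 30.0, "k": 20.0}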

Rincewind
20-02-2011, 06:26 PM
This is a bit too restrictive, but the rules are good practical measures to prevent unrealistic entries winning. When looking at the Kaggle competition I found the easiest way to improve the rating system was to build in a head-to-head factor. That is, for a game between A and B, if A had played B before and won, then up the expected score in favour of A. If the colours were the same as before, up the expected score even more. Using this approach I was able to reduce the RMS value by a few percentage points with relatively little tuning. Even a system with head-to-head factors could be incorporated into a practical rating system, but it would be a little cumbersome and good to avoid if you are running a world-wide rating system, which is FIDE's goal.
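The head-to-head nudge I mean is roughly the following (an illustrative sketch only; the bonus constants are invented and would need tuning):

def elo_expectation(ra, rb):
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

def adjusted_expectation(ra, rb, past_games, a_is_white, bonus=0.05, colour_bonus=0.03):
    # past_games: (a_score, a_was_white) tuples from earlier A v B games
    e = elo_expectation(ra, rb)
    for a_score, a_was_white in past_games:
        shift = bonus * (a_score - 0.5) * 2.0             # shift towards the earlier winner
        if a_was_white == a_is_white:
            shift += colour_bonus * (a_score - 0.5) * 2.0  # same colours again: shift a bit more
        e += shift
    return min(max(e, 0.0), 1.0)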

Garvinator
21-02-2011, 11:20 AM
I found it interesting reading this chessbase article that it was Deloitte Australia that provided the prize fund for this international competition.

http://www.chessbase.com/newsdetail.asp?newsid=7020 :owned:

I am still trying to find out if Glicko or Glicko 2 is part of the competition, with no alterations.

Kevin Bonham
21-02-2011, 12:44 PM
I am still trying to find out if Glicko or Glicko 2 is part of the competition, with no alterations.

Strawman versions of both were included in the original benchmarks as noted further up this thread.

Kevin Bonham
26-02-2011, 06:33 PM
http://www.kaggle.com/blog/wp-content/uploads/2011/02/kaggle_win.pdf

^^^
Article by the winner of the first comp outlining his method. While he calls his system Elo++ it includes several Glicko-like assumptions about old data being less important than new data, and about the importance of how often the player has played. It also uses some unusual features involving the "neighbourhood" of a player (which is basically an indicator of the strength of opposition the player typically plays).

Santa
07-03-2011, 02:56 AM
http://www.kaggle.com/blog/wp-content/uploads/2011/02/kaggle_win.pdf

^^^
Article by the winner of the first comp outlining his method. While he calls his system Elo++ it includes several Glicko-like assumptions about old data being less important than new data, and about the importance of how often the player has played. It also uses some unusual features involving the "neighbourhood" of a player (which is basically an indicator of the strength of opposition the player typically plays).

Elo accepted that old measurements are less useful than new ones, and his formula ages them right out. The rate of ageing depends on the value of K, a property of a player. A higher value of K eliminates the effect of old results more rapidly than lower values.

A problem with FIDE's implementation, when I studied it, was that K for a player was only changed once, when they achieved a rating of (I think) 2400. K was never increased.

Properly used, K can and should be revised from time to time. If a player is inactive for a period, K should be increased. If a player's rating is clearly wrong, as evidenced by a performance rating over a significant number of games (let us say 9, simply because after 9 games we publish a player's rating), then K should be increased.

K should be a "reliability indicator." A player who plays at around the same standard over an extended period should have a stable rating. The rating of a player who, by some misadventure, suddenly plays less well than before should have K increased so as to adjust more quickly.

Rincewind
07-03-2011, 09:56 AM
Elo accepted that old measurements are less useful than new ones, and his formula ages them right out. The rate of ageing depends on the value of K, a property of a player. A higher value of K eliminates the effect of old results more rapidly than lower values.

A problem with FIDE's implementation, when I studied it, was that K for a player was only changed once, when they achieved a rating of (I think) 2400. K was never increased.

Properly used, K can and should be revised from time to time. If a player is inactive for a period, K should be increased. If a player's rating is clearly wrong, as evidenced by a performance rating over a significant number of games (let us say 9, simply because after 9 games we publish a player's rating), then K should be increased.

K should be a "reliability indicator." A player who plays at around the same standard over an extended period should have a stable rating. The rating of a player who, by some misadventure, suddenly plays less well than before should have K increased so as to adjust more quickly.

The difference is that if K is a reliability indicator then that should feed into the rating adjustment calculation for the opponents of that player. For example if I beat a 2000 rated player with a high K then that is not as significant as beating a 2000 rated player with a low K since the first player might actually be much weaker due to rust or other reasons.

So while K-factors do effectively age data, they do not do the whole deal. Also we would need some way to adjust K-factors both up and down in a way that is meaningful (leads to an overall more predictive system). Currently the process is arbitrary, based on the fact that the FIDE system 'wants' greater stability for rated players in the 2400+ range, but as far as I know this hasn't been benchmarked against any other scheme and probably dates back to when the FIDE rating list had a much higher floor than it now has.

Garvinator
30-03-2011, 01:03 PM
Jeff Sonas has released some new results about the fide rating system: http://www.chessbase.com/newsdetail.asp?newsid=7114

I wonder what an ACF rating expectancy table would look like, similar to the one produced by Jeff.

Vlad
30-03-2011, 03:44 PM
Very nice illustration of what I always suspected.
1) With the current FIDE rating system one has strong incentives to play only in tournaments where the average rating is higher than his/her own rating.
2) At the top of any country's rating list, players are underrated. It would be interesting to estimate by how much Zhao is underrated.:)

Probably the average rating of his opponents is about 300 points below his rating. That means he is underrated by roughly 300/6=50 points. So he would be closer to 2630 if he lived in Europe.

Correction: it is actually 300/5=60, which means he could be 2640 if he were in Europe.

Garvinator
30-03-2011, 11:48 PM
I think the main point out of that article is that the 400-point rule is garbage and should be junked immediately.

Oepty
31-03-2011, 09:39 PM
It seems to be a great article and I am hoping they do not take too long to put out the second section. Like Garvin said, it is obvious the 400 point cutoff is rubbish.
The question I have though is: how did the ratings of players get stretched too far away from the average? It does not seem to say in the article.
Scott

Garvinator
01-04-2011, 01:36 PM
It seems to be a great article and I am hoping they do not take too long to put out the second section. Like Garvin said, it is obvious the 400 point cutoff is rubbish.

Of course it is rubbish. The main issue with the 400 point rule is: why should a game with a 399-point rating differential be treated differently from one with a 401-point differential?

The question I have though is: how did the ratings of players get stretched too far away from the average? It does not seem to say in the article.

Can you expand on what you mean by this please? I do not understand your question.

ChessGuru
01-04-2011, 08:55 PM
Is the ACF / Gletsos going to enter the ACF method in the event?

It would be an interesting and impartial indicator of the success of the ACF Glicko-Gletsos method over the FIDE/ELO method.

I'm also interested to see that the FIDE prize requires a certain level of simplicity - acknowledging the fact that predictive power must be tempered with general public (players') understanding of the methodology.

Kevin Bonham
01-04-2011, 09:10 PM
I'm also interested to see that the FIDE prize requires a certain level of simplicity - acknowledging the fact that predictive power must be tempered with general public (players') understanding of the methodology.

I'm not sure that's really what they are acknowledging since the rules do leave a lot of room for a system to be complex beyond ready public understanding. For instance you can have a rating state made up of ten different numbers, which you can then manipulate using pretty much any kind of function you like, provided you obey the rules on allowable function inputs.

They may be more concerned with computing simplicity issues. Or they may be intending to pick the simplest of the systems that perform well.

Discussion of the FIDE prize on the thread at http://www.kaggle.com/forums/default.aspx?g=posts&t=339 shows that there is a lot of interest in Glicko as a possible basis for a successful entry.

Oepty
01-04-2011, 11:12 PM
Of course it is rubbish. The main issue with the 400 point rule is: why should a game with a 399-point rating differential be treated differently from one with a 401-point differential?
Can you expand on what you mean by this please? I do not understand your question.

Garvin, when Sonas discusses the area of the graph in the red box, he says at the start of the second paragraph concerning that section, "Another way to think of this is that all players' ratings have been stretched a bit too far away from the average." My question is, why has this stretching occurred? I could not see an answer in the article; perhaps he will answer it in the second part.
Scott

Vlad
02-04-2011, 09:50 AM
Stretching implies that whenever you play against a player with a smaller rating (say 100 points difference), the system treats this game as if you played against somebody 100 points below you, but his actual strength is only 83 points below. That means in expected terms you are losing some rating points. Specifically, for a win when K=10 you should get 3.9 points, but you only get 3.6 points, so you miss out on 0.3 points from a single game. Now if you are somewhere in the middle of the field and you have roughly equal numbers of games where you are higher rated than your opponent and vice versa, then this effect cancels out. However, if you are usually at the top of the field (like Zhao or Smerdon) then you mostly play against players with lower ratings. This 0.3 per game accumulates over time and in the case of Zhao (if we assume that he mostly plays in Australia) it could become something like 60 points.
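Checking the arithmetic with the usual logistic formula (FIDE's published table is based on the normal distribution, so the exact figures differ slightly):

def expected(diff):
    # logistic Elo expectancy for the higher-rated player
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

K = 10
nominal   = K * (1 - expected(100))  # what the system awards for the win
justified = K * (1 - expected(83))   # what the "real" 83-point gap would justify
print(round(nominal, 2), round(justified, 2), round(justified - nominal, 2))
# roughly 3.6 vs 3.8 with this formula, i.e. a fraction of a point lost per win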

Oepty
02-04-2011, 12:06 PM
Stretching implies that whenever you play against a player with a smaller rating (say 100 points difference), the system treats this game as if you played against somebody 100 points below you, but his actual strength is only 83 points below. That means in expected terms you are losing some rating points. Specifically, for a win when K=10 you should get 3.9 points, but you only get 3.6 points, so you miss out on 0.3 points from a single game. Now if you are somewhere in the middle of the field and you have roughly equal numbers of games where you are higher rated than your opponent and vice versa, then this effect cancels out. However, if you are usually at the top of the field (like Zhao or Smerdon) then you mostly play against players with lower ratings. This 0.3 per game accumulates over time and in the case of Zhao (if we assume that he mostly plays in Australia) it could become something like 60 points.

That does not answer my question.
Scott

Santa
03-04-2011, 05:38 AM
The difference is that if K is a reliability indicator then that should feed into the rating adjustment calculation for the opponents of that player. For example if I beat a 2000 rated player with a high K then that is not as significant as beating a 2000 rated player with a low K since the first player might actually be much weaker due to rust or other reasons.

So while K-factors do effectively age data, they do not do the whole deal. Also we would need some way to adjust K-factors both up and down in a way that is meaningful (leads to an overall more predictive system). Currently the process is arbitrary, based on the fact that the FIDE system 'wants' greater stability for rated players in the 2400+ range, but as far as I know this hasn't been benchmarked against any other scheme and probably dates back to when the FIDE rating list had a much higher floor than it now has.

How should one determine the effect on your rating calculation of my less reliable rating, should we play? I don't see an argument for increasing or decreasing your rating more than otherwise, though it might increase the uncertainty of your rating.

I agree that in Elo systems K should be revised periodically, based on such things as frequency of playing and observed reliability of recent results.

Kevin Bonham
03-04-2011, 02:13 PM
I don't see an argument for increasing or decreasing your rating more than otherwise, though it might increase the uncertainty of your rating.

Actually in Glicko and related systems a player who plays an unreliably rated opponent will have their rating increase or decrease less than for the same result against a more reliably rated opponent.

The reason for this is that if a player gets unusually good/bad results against players of a specific rating level, the probability that this is because they are a stronger/weaker player than their current rating indicates is higher if the opponents are reliably rated. If the opponents are unreliably rated it's more likely the player is still at their previous strength. (It's also possible the player is even stronger/worse than these results against unreliably rated players indicate, but that is much less likely.)
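In Glicko terms (standard formulas from Glickman's Glicko paper) the mechanism is the g(RD) weight: the larger the opponent's rating deviation, the less their result moves your rating:

import math

q = math.log(10) / 400.0

def g(rd):
    # attenuation factor: equals 1 for a perfectly reliable opponent rating
    # and shrinks as the opponent's RD (unreliability) grows
    return 1.0 / math.sqrt(1.0 + 3.0 * (q ** 2) * (rd ** 2) / math.pi ** 2)

print(round(g(30), 3), round(g(150), 3), round(g(350), 3))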

Kaitlin
03-04-2011, 04:51 PM
What about a rating system where you only get points if you play someone within 500 points of your points. If your oppers is on the same points as you, you get 1 point for winning. If they are 1 point different you get 499/500ths of a point and if they are 499 points different you get 1/500th of a point. If they are 499 above you and you win you still only get 1/500th of a point.

That seems fair :)

You could make it 1000 but 500 each side is 1000 so that seems a good number.

Santa
04-04-2011, 12:32 AM
Actually in Glicko and related systems a player who plays an unreliably rated opponent will have their rating increase or decrease less than for the same result against a more reliably rated opponent.

The reason for this is that if a player gets unusually good/bad results against players of a specific rating level, the probability that this is because they are a stronger/weaker player than their current rating indicates is higher if the opponents are reliably rated. If the opponents are unreliably rated it's more likely the player is still at their previous strength. (It's also possible the player is even stronger/worse than these results against unreliably rated players indicate, but that is much less likely.)

That has confused me no end!!

If a player's rating is unreliable, who's to say whether he's better than the rating suggests, or worse?

If he's better, his opponents should increase more for a win than otherwise. Conversely, if he's worse, then his opponents should increase less for a win.

Overall, I don't expect unreliably-rated players to have much impact.

fwiw I was reading an old OTM the other day, published after Jamo took over from me. At that time, Glicko would, I expect, have asserted I had a reliable rating - I'd played a lot of games in recent times, and my rating hadn't changed a lot; its trend was up, but nothing spectacular.

Until I played in the Dandenong Champs at the time in question. I came equal second and, according to the report, earned a 99-point rating increase.

At the time, I was playing in another tournament (Waverley) where my results were less stellar and I earned a ratings decrease of 30 or so. But then again, there was another Waverley tournament in the same period where I earned a counter-balancing 30 or so increase.

Logically, Glicko2 should have determined that my rating was less reliable and so rewarded those who beat me by a lesser amount. Or is there something I've overlooked?

Kevin Bonham
04-04-2011, 01:00 AM
If a player's rating is unreliable, who's to say whether he's better than the rating suggests, or worse?

He's more likely to be worse than better because the set of unreliably rated players includes many who are inactive and these typically play below rating strength on return.

Actually, if you beat an unreliable player rated well above you, it's more likely because he's worse than his rating, but in the less common case that you lose to one rated well below you, it's possibly because he's better than it (eg new player with relatively few games). So in both cases there is good reason for the system to treat the result with caution and scale it down.


Logically, Glicko2 should have determined that my rating was less reliable and so rewarded those who beat me by a lesser amount. Or is there something I've overlooked?

Reliability works the same way in Glicko 1 and 2. It is a function of the amount and age of data and not the performance.

Glicko 2 might have determined your rating was more volatile. But (see http://www.glicko.net/glicko/glicko2.doc/example.html) "The opponents' volatilities are not relevant in the calculations. " so the opponents will not be rewarded less if you are recorded as more volatile.

Rincewind
05-05-2011, 03:19 PM
The following is an excerpt from the Kaggle email newsletter (May 5) relating to the chess rating competition.


The Deloitte/FIDE Chess Ratings Competition attracted one of the strongest fields ever seen in a Kaggle Competition. The competition attracted 189 teams, ranging from chess ratings experts to Netflix Prize winners. As Jeff Sonas wrote on the Kaggle blog last week, the competition has far exceeded his expectations. A big congratulations to the provisional winner, Tim Salimans, an econometrician at Erasmus University in Rotterdam. We look forward to reading about the approaches used by top performers on the Kaggle blog. We also look forward to the results of the FIDE prize, which could see the introduction of a new chess ratings system.

http://blog.kaggle.com/2011/04/24/the-deloittefide-chess-competition-play-by-play
http://www.kaggle.com/c/ChessRatings2/Leaderboard

- Anthony Goldbloom

Kevin Bonham
10-06-2011, 11:18 PM
Further reporting on systems used by the winners.

http://chessbase.com/newsdetail.asp?newsid=7277

It is interesting that many of the winners came up with a legal method of "cheating" in the predictions contest. Because the accuracy test for predictions involved predicting player results from a bunch of tournaments, and because some of those tournaments were Swisses, several entrants realised that if a player played against a relatively weak field in a Swiss, then they were probably performing poorly, and this could be used to predict their results in the games from that tournament. Sonas said that when this sneakiness was removed the same systems still won.

No reporting on the FIDE prize results in this one.

Rhubarb
04-11-2011, 05:50 PM
More about Kaggle in today's SMH. (http://www.smh.com.au/it-pro/business-it/from-bondi-to-the-big-bucks-the-28yearold-whos-making-data-science-a-sport-20111104-1myq1.html)


"To date Kaggle has crunched data on dark matter, predicting which used cars are likely to be bad buys, improve the World Chess Federation's official chess rating system, and predicting the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information," Kaggle claims.

Denis_Jessop
04-11-2011, 08:12 PM
More about Kaggle in today's SMH. (http://www.smh.com.au/it-pro/business-it/from-bondi-to-the-big-bucks-the-28yearold-whos-making-data-science-a-sport-20111104-1myq1.html)


"To date Kaggle has crunched data on dark matter, predicting which used cars are likely to be bad buys, improve the World Chess Federation's official chess rating system, and predicting the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information," Kaggle claims.

WOW! Apparently it (whatever it is) also predicted the winner of the Eurovision Song Contest. :cool: :clap: :clap:

DJ

Rincewind
04-11-2011, 10:50 PM
I think the current interest is due to Kaggle getting financial backing and so being able to actually employ people.

Redmond Barry
08-11-2011, 08:09 PM
I think the current interest is due to Kaggle getting financial backing and so being able to actually employ people.

it's good to see there's finally additional geese for the kaggle ............ :eh:

Kevin Bonham
20-03-2012, 11:54 PM
Dr Alec Stephenson, Department of Statistics, Swinburne University, Australia has been announced as the winner of the FIDE section of the Kaggle competition. (In the unknown likelihood of him reading this, my congratulations to him!)

His solution is explicitly a modified Glicko system. The modifications are relatively simple. One of them is a per-game bonus reflecting the idea that more active players improve more whether they win a particular game or not. I have seen his description paper but am not sure yet if it is available online.

Interview: http://blog.kaggle.com/2012/03/20/could-world-chess-ratings-be-decided-by-the-stephenson-system/
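Purely to illustrate the flavour of a per-game bonus (this is not Dr Stephenson's actual formula, which I won't attempt to reproduce here), the idea is something like:

def update_with_activity_bonus(rating, results, k=20.0, bonus_per_game=1.0):
    # results: (expected_score, actual_score) pairs for the period's games;
    # k and bonus_per_game are invented constants for illustration only.
    for e, s in results:
        rating += k * (s - e)
    return rating + bonus_per_game * len(results)  # more games played, more bonus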

Rincewind
21-03-2012, 12:10 AM
Dr Alec Stephenson, Department of Statistics, Swinburne University, Australia

Well done! Also his second win on Kaggle so he has some form.

I note though that Swinburne doesn't have a Dept of Statistics per se; it is amalgamated with the Psychology group to form the Psychological Sciences and Statistics academic unit. Also, judging from his Kaggle profile (https://www.kaggle.com/users/2702/alec-stephenson), Alec seems to have recently moved to work at CSIRO as a research scientist. All the best with the new job too!

ER
21-03-2012, 12:37 AM
Any idea if and to what extent Dr Stephenson's application of Glicko differs to ACF's?

What now? Is FIDE going to think seriously of using Dr Stephenson's application to decide their ratings?

Will ACF's rating system adjust to Dr Stephenson's application?

FM_Bill
05-04-2012, 11:45 AM
One of the biggest weaknesses of the ELO system is that the expected results at certain rating differences bear no resemblance to reality.

ELO difference Expected score
0 0.50
20 0.53
40 0.58
60 0.62
80 0.66
100 0.69
120 0.73
140 0.76
160 0.79
180 0.82
200 0.84
300 0.93
400 0.97

It would be interesting to know what the real stats for the above categories are, say for a ratings period for the whole of Australia, both averages and for individuals.
... ...
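For anyone wanting to compare against real stats, the logistic form of the expectancy is easy to compute (FIDE's published table is based on the normal distribution, so its values differ from this formula):

def expected_score(diff):
    # standard logistic Elo expectancy for the higher-rated player
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

for diff in (0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 300, 400):
    print(diff, round(expected_score(diff), 2))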

Kevin Bonham
05-04-2012, 12:11 PM
Any idea if and to what extent Dr Stephenson's application of Glicko differs to ACF's?

I would say it is parallel but different in that both systems take Glicko as a starting point and then add various refinements that are useful, but the choice of refinements is different.


What now? Is FIDE going to think seriously of using Dr Stephenson's application to decide their ratings?

Dr Stephenson modestly states that in his view the method might be too complicated to be used in practice but it is no more complicated than ours. Whether it would be palatable to enough FIDE member nations is another question. Like our version of Glicko there is the barrier to overcome that if a system is too complex for a player to calculate their own rating, some players do not like that. It's especially an issue with FIDE because of title norms. A player likes to know if they have crossed the line for a norm or not in a particular game, rather than having to wait to see how their subsequent results, and their opponents' subsequent results, affect their rating.


Will ACF's rating system adjust to Dr Stephenson's application?

Our rating pool is different from the FIDE rating pool and as such the ideas would need to be tested to see if they improved predictiveness or not. Also we have been running a colour-blind system for a long time so incorporating colour advantage would require more effort to ensure tournaments are always submitted with correct colours (some club round robins for instance are often not.)

Kevin Bonham
05-04-2012, 12:12 PM
It would be interesting to know what the real stats for the above categories are, say for a ratings period for the whole of Australia, both averages and for individuals.
... ...

I have seen some data on this for Australia but it was a fair while ago and would probably bear no relevance to the current system. My (not necessarily reliable) memory is that Ian Rout might have compiled it, perhaps back in the Australian Chess Forum (magazine) days.

ER
05-04-2012, 02:07 PM
I would say it is parallel but different in that both systems take Glicko as a starting point and then add various refinements that are useful, but the choice of refinements is different.



Dr Stephenson modestly states that in his view the method might be too complicated to be used in practice but it is no more complicated than ours. Whether it would be palatable to enough FIDE member nations is another question. Like our version of Glicko there is the barrier to overcome that if a system is too complex for a player to calculate their own rating, some players do not like that. It's especially an issue with FIDE because of title norms. A player likes to know if they have crossed the line for a norm or not in a particular game, rather than having to wait to see how their subsequent results, and their opponents' subsequent results, affect their rating.



Our rating pool is different from the FIDE rating pool and as such the ideas would need to be tested to see if they improved predictiveness or not. Also we have been running a colour-blind system for a long time so incorporating colour advantage would require more effort to ensure tournaments are always submitted with correct colours (some club round robins for instance are often not.)

Thanks for providing all the above very important information!

Kevin Bonham
19-04-2012, 10:47 PM
Article by IA Reuben here:

http://www.englishchess.org.uk/?p=18494


It was agreed to recommend to the General Assembly in Istanbul in September that the FIDE and Sticko systems be run alongside each other as soon as the FIDE Office is able to cope. The FIDE System will continue to count for official ratings, title norms and qualification. It may take as long as four years before a final decision is made. That may seem a long time. But, in context, we agreed on my recommendation in 1999 that the FIDE Rating List go down to 1000 and monthly lists. That process will only reach fruition on 1 July 2012.

(Sticko = Stephenson's variant of Glicko)

So subject to approval from the GA the idea is that a Glicko variant will be running in parallel as a test and eventually a decision might be made to switch to another system depending on how the test goes.

Rincewind
12-11-2014, 10:59 PM
I see there is a new Chess Rating related competition on Kaggle.

https://www.kaggle.com/c/finding-elo


Predict a chess player's FIDE Elo rating from one game

This competition challenges Kagglers to determine players' FIDE Elo ratings at the time a game is played, based solely on the moves in one game. Do a player's moves reflect their absolute skill? Does the opponent matter? How closely does one game reflect intrinsic ability? How well can an algorithm do? Does computational horsepower increase accuracy?

Sounds like a fun competition with 4 months to go so still plenty of time.