Learning Tic-Tac-Toe

PSU CS 441/541 Homework 3

This assignment is an optional extra credit assignment, worth the same amount as a normal homework assignment.


Continuing our exploration of the game of Tic-Tac-Toe (TTT), in this assignment you will construct an algorithm to learn the value of TTT positions.

As in the previous assignment, the input to the program should be a TTT position on standard input. The output should be the score of each move from the initial position, as described below. The program will work as follows: play 500 games of TTT starting from the indicated position according to the algorithm described below, then emit the estimated value of each move from that position, as "learned" by the program through play.

The learning algorithm is the reinforcement learning algorithm used by Donald Michie in 1961 for his groundbreaking matchbox tic-tac-toe player MENACE. This player literally used matchboxes to represent positions. Glass beads in each matchbox represented the moves available in that position, and beads were added or removed to reflect the value of each move. Your program will train against itself.

Your program will play a series of training games, using a move selection function described below. When you encounter a new position (for example, the initial position) during a game, initialize each square that is a legal move in the position so that there are 32 counters for the move. Then select a move using the selection rule given below, and treat the resulting new position in the same fashion. When the game is over, adjust the counters of each move played according to whether it led to:

A win for the side on move: Add three counters to the move.
A win for the opponent: Subtract one counter from the move (but always leave at least 1 counter).
A draw: Do nothing.
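
The update rules above can be sketched as follows. This is only an illustration, not a required design: the function name reinforce, the dictionary layout of counters, and the (position, move, side) history tuples are all assumptions made for the example.

```python
def reinforce(counters, history, winner):
    """Adjust counters for every move played in one finished game.

    counters: dict mapping position -> dict mapping move -> counter count
              (hypothetical layout chosen for this sketch)
    history:  list of (position, move, side) tuples in the order played
    winner:   'X', 'O', or None for a draw
    """
    if winner is None:
        return  # a draw: do nothing
    for position, move, side in history:
        if side == winner:
            # The side that made this move went on to win: add three counters.
            counters[position][move] += 3
        else:
            # The opponent won: subtract one counter, but never go below 1.
            counters[position][move] = max(1, counters[position][move] - 1)
```

Keeping the floor at 1 means no move ever becomes impossible to select, so a move that lost early can still be tried again later.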

To select a move in a position, select randomly from the available moves, with each move weighted according to the number of counters it has.
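
A minimal sketch of this weighted selection, including the 32-counter initialization of a new position, might look like the following. The 9-character string board representation (with ' ' for an empty square) and the names legal_moves and select_move are assumptions made for the example. The caller is assumed to have checked that the game is not already over, so that at least one legal move exists.

```python
import random

INITIAL_COUNTERS = 32  # starting counters per legal move, from the text above


def legal_moves(board):
    """Indices of empty squares; board is a 9-character string (assumed)."""
    return [i for i, c in enumerate(board) if c == ' ']


def select_move(counters, board, rng=random):
    """Pick a move at random, weighted by the number of counters it has."""
    if board not in counters:
        # New position: give every legal move the initial number of counters.
        counters[board] = {m: INITIAL_COUNTERS for m in legal_moves(board)}
    moves = list(counters[board])
    weights = [counters[board][m] for m in moves]
    # random.choices draws with probability proportional to the weights.
    return rng.choices(moves, weights=weights, k=1)[0]
```

A move with 64 counters is thus twice as likely to be chosen as one with 32, which is the software equivalent of drawing a bead blindly from a matchbox.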

The output of the program should be the number of counters for each move from the initial position after some number of learning games. Does your program converge on something sensible from the all-blank position in a reasonable number of games? Try the following initial position:

O X
 X 
 O 
Is the convergence better here?
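
To experiment with questions like these, a self-play driver along the following lines may be a useful starting point. It is a sketch under several assumptions, none of which are prescribed by the assignment: the board is a 9-character string with ' ' for an empty square, the side to move in the starting position is passed in explicitly, and the names winner, play_game, and train are hypothetical.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that side has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None


def play_game(counters, start, side, rng):
    """Play one self-play game from start and update counters afterwards."""
    board, history = start, []
    while winner(board) is None and ' ' in board:
        # A new position gets 32 counters for every legal move.
        table = counters.setdefault(
            board, {i: 32 for i, c in enumerate(board) if c == ' '})
        moves = list(table)
        move = rng.choices(moves, weights=[table[m] for m in moves])[0]
        history.append((board, move, side))
        board = board[:move] + side + board[move + 1:]
        side = 'O' if side == 'X' else 'X'
    won = winner(board)
    if won is not None:  # a draw leaves all counters unchanged
        for position, move, mover in history:
            table = counters[position]
            if mover == won:
                table[move] += 3
            else:
                table[move] = max(1, table[move] - 1)
    return won


def train(start, side, games=500, seed=None):
    """Return the counters for each move from start after self-play."""
    rng = random.Random(seed)
    counters = {}
    for _ in range(games):
        play_game(counters, start, side, rng)
    return counters.get(start, {})
```

For example, train("O X X  O ", "X") returns a dictionary from square index to counter count for the position shown above, assuming X is to move. Passing a fixed seed makes a run repeatable, which helps when comparing convergence across different numbers of games.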

Programming thoughts:


Your program may be written in the language of your choice (let me know if you're planning on something other than C, C++, or Java) but should compile and run on the department UNIX machines. If you are developing on the department machines, please follow the departmental computing Safety Guidelines.

Homework should be submitted by e-mail to <cs541@cs.pdx.edu>. The words "CS441/541 HW3" should appear somewhere in the subject line. The homework submission should be a computer program giving the requested input-to-output mapping, and a brief document indicating experiments run, their outcomes, and the resulting implications.

The program and writeup will be graded on readability and approach. Remember, if I can't understand your submission, I can't give you credit for it.