WIP: 2048 AI

2048 was no doubt the hottest game of 2014. There was a certain joy in being one of the first people to hit 2048, and in talking with friends about what the optimal strategy might be. I always thought 2048 would be a natural fit for a genetic algorithm: come up with a strategy, play it, mutate it, see if it scores higher, keep the strongest performers, and repeat.

The quarantine has given me time to actually try this. The post is a work in progress - I have a working harness and a basic genetic algorithm running, but the results are not yet impressive enough for me to declare victory. Consider this a writeup of where I am, not where I’ve arrived.

The Harness

Before any AI, you need to be able to simulate the game in code. The rules of 2048 are simple: a 4x4 grid, tiles slide in one of four directions, matching tiles merge and double in value, and a new tile (a 2 or rarely a 4) spawns after each move. The game ends when no move is possible.

import random
import copy

class Board:
    def __init__(self):
        self.grid = [[0] * 4 for _ in range(4)]
        self.score = 0
        self.spawn_tile()
        self.spawn_tile()

    def spawn_tile(self):
        empty = [(r, c) for r in range(4) for c in range(4) if self.grid[r][c] == 0]
        if empty:
            r, c = random.choice(empty)
            self.grid[r][c] = 4 if random.random() < 0.1 else 2

    def slide_row_left(self, row):
        tiles = [x for x in row if x != 0]
        merged, skip = [], False
        for i, t in enumerate(tiles):
            if skip:
                skip = False
                continue
            if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
                merged.append(t * 2)
                self.score += t * 2
                skip = True
            else:
                merged.append(t)
        return merged + [0] * (4 - len(merged))

    def rotate_cw(self):
        self.grid = [list(row) for row in zip(*self.grid[::-1])]

    def move(self, direction):
        # 0=left, 1=up, 2=right, 3=down
        # Rotate so the chosen direction points left, slide every row
        # left, then rotate back into place. (Rotating clockwise
        # `direction` times would map 1 to down, not up, so we rotate
        # counter-clockwise into position instead.)
        for _ in range((4 - direction) % 4):
            self.rotate_cw()
        new_rows = [self.slide_row_left(row) for row in self.grid]
        changed = new_rows != self.grid
        self.grid = new_rows
        for _ in range(direction):
            self.rotate_cw()
        if changed:
            self.spawn_tile()
        return changed

    def game_over(self):
        for d in range(4):
            trial = copy.deepcopy(self)
            if trial.move(d):
                return False
        return True

    def max_tile(self):
        return max(c for row in self.grid for c in row)

The rotate_cw trick is nice: rather than implementing all four slide directions, you just implement left and rotate the board into position. One direction of code, four directions of behavior.
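As a quick sanity check of the trick, here is a standalone version (a free function rather than the Board method, so it runs on its own): four clockwise rotations must be the identity.

```python
# Standalone rotation trick: rotating a 4x4 grid clockwise
# four times must give back the original grid.
def rotate_cw(grid):
    return [list(row) for row in zip(*grid[::-1])]

grid = [[2, 0, 0, 4],
        [0, 2, 0, 0],
        [0, 0, 8, 0],
        [4, 0, 0, 2]]

rotated = grid
for _ in range(4):
    rotated = rotate_cw(rotated)

assert rotated == grid
```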

Heuristics

The “AI” needs a way to evaluate how good a board position is without playing the game to completion. These quick evaluation functions are heuristics. None of them are perfect, but together they can capture something meaningful about board quality.

I’m using four:

Empty tiles - more empty squares means more room to maneuver. A crowded board is usually close to game over.

Monotonicity - tiles arranged in a consistent increasing or decreasing order (say, largest in the corner, decreasing outward) are easier to merge without breaking the structure.

Smoothness - adjacent tiles with similar values are more likely to merge. A board where a 512 sits next to a 2 is hard to work with.

Max tile in corner - a strong meta-strategy for 2048 is keeping the largest tile anchored to a corner so it doesn’t get in the way.

def count_empty(grid):
    return sum(1 for row in grid for c in row if c == 0)

def monotonicity(grid):
    score = 0
    for row in grid:
        if all(row[i] >= row[i+1] for i in range(3)):
            score += 1
        elif all(row[i] <= row[i+1] for i in range(3)):
            score += 1
    for c in range(4):
        col = [grid[r][c] for r in range(4)]
        if all(col[i] >= col[i+1] for i in range(3)):
            score += 1
        elif all(col[i] <= col[i+1] for i in range(3)):
            score += 1
    return score

def smoothness(grid):
    score = 0
    for r in range(4):
        for c in range(4):
            if grid[r][c] == 0:
                continue
            if c + 1 < 4 and grid[r][c+1] != 0:
                score -= abs(grid[r][c] - grid[r][c+1])
            if r + 1 < 4 and grid[r+1][c] != 0:
                score -= abs(grid[r][c] - grid[r+1][c])
    return score

def corner_bonus(grid):
    best = max(c for row in grid for c in row)
    corners = [grid[0][0], grid[0][3], grid[3][0], grid[3][3]]
    return 1 if best in corners else 0

def evaluate(grid, weights):
    w_empty, w_mono, w_smooth, w_corner = weights
    return (w_empty  * count_empty(grid) +
            w_mono   * monotonicity(grid) +
            w_smooth * smoothness(grid) +
            w_corner * corner_bonus(grid))

The weights vector is what the genetic algorithm will be searching for - the right balance of these four signals to produce good play.
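To make the numbers concrete, here is a standalone spot-check on a sample early-game board, re-deriving each heuristic inline so it runs on its own (the grid and the uniform weights are illustrative):

```python
grid = [[4, 2, 0, 0],
        [2, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]

# Empty squares: 13 of the 16 cells are zero.
empty = sum(1 for row in grid for v in row if v == 0)

# Monotonicity: every row and column here is non-increasing,
# so all 8 lines count.
lines = grid + [list(col) for col in zip(*grid)]
mono = sum(1 for line in lines
           if all(line[i] >= line[i + 1] for i in range(3))
           or all(line[i] <= line[i + 1] for i in range(3)))

# Smoothness: the 4 touches a 2 to its right and a 2 below it,
# each non-zero neighbor pair contributing -2.
smooth = 0
for r in range(4):
    for c in range(4):
        if grid[r][c] == 0:
            continue
        if c + 1 < 4 and grid[r][c + 1] != 0:
            smooth -= abs(grid[r][c] - grid[r][c + 1])
        if r + 1 < 4 and grid[r + 1][c] != 0:
            smooth -= abs(grid[r][c] - grid[r + 1][c])

# Corner bonus: the max tile (4) sits at a corner.
corner = int(max(v for row in grid for v in row)
             in (grid[0][0], grid[0][3], grid[3][0], grid[3][3]))

assert (empty, mono, smooth, corner) == (13, 8, -4, 1)
# With uniform weights (1, 1, 1, 1) the evaluation would be
# 13 + 8 - 4 + 1 = 18.
```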

The Greedy AI

With an evaluation function in hand, the simplest possible AI is purely greedy: at each turn, look at all legal moves, simulate each one, evaluate the resulting board, and take the best one.

def greedy_move(board, weights):
    best_move, best_score = None, float('-inf')
    for d in range(4):
        trial = copy.deepcopy(board)
        if trial.move(d):
            s = evaluate(trial.grid, weights)
            if s > best_score:
                best_score, best_move = s, d
    return best_move

def play_game(weights):
    board = Board()
    while not board.game_over():
        move = greedy_move(board, weights)
        if move is None:
            break
        board.move(move)
    return board.score

This is one-step lookahead only: it doesn't think further ahead than the immediate next move. Better approaches (expectimax, Monte Carlo rollouts) look several moves ahead, but greedy is a useful baseline and fast enough to run the thousands of games needed for training.

The Genetic Algorithm

Now for the evolutionary part. The idea: start with a population of random weight vectors, evaluate each one by playing several games and averaging the score, keep the best performers, breed and mutate them to produce the next generation. Repeat.

def fitness(weights, num_games=10):
    return sum(play_game(weights) for _ in range(num_games)) / num_games

def mutate(weights, scale=0.3):
    return [w + random.gauss(0, scale) for w in weights]

def crossover(a, b):
    return [a[i] if random.random() < 0.5 else b[i] for i in range(len(a))]

def evolve(pop_size=30, generations=40):
    population = [[random.uniform(0, 2) for _ in range(4)]
                  for _ in range(pop_size)]
    best_score, best_weights = float('-inf'), None

    for gen in range(generations):
        scored = sorted([(fitness(w), w) for w in population],
                        reverse=True, key=lambda x: x[0])

        # Track the best weights seen across all generations, not just
        # the last one - the final bred population is never evaluated.
        if scored[0][0] > best_score:
            best_score, best_weights = scored[0]
        print(f"Gen {gen:3d} | Best avg score: {scored[0][0]:.0f} | "
              f"Weights: {[round(w, 2) for w in scored[0][1]]}")

        # Elitism: the top half survives unchanged...
        survivors = [w for _, w in scored[:pop_size // 2]]

        # ...and the bottom half is replaced by mutated children.
        next_gen = survivors[:]
        while len(next_gen) < pop_size:
            a, b = random.sample(survivors, 2)
            next_gen.append(mutate(crossover(a, b)))

        population = next_gen

    return best_weights

The selection pressure here is simple: the bottom half of the population dies each generation and is replaced by mutated children of the top half. Not sophisticated, but it works for exploration.
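The selection scheme can be smoke-tested on a toy fitness function (sum of the weights, chosen purely for speed and not part of the real setup) before spending hours on actual 2048 games. The operators below mirror mutate and crossover above:

```python
import random

# Toy check of keep-top-half selection: with fitness = sum of weights,
# the same survive/breed/mutate loop should push weights upward.
random.seed(0)

def mutate(w, scale=0.3):
    return [x + random.gauss(0, scale) for x in w]

def crossover(a, b):
    return [a[i] if random.random() < 0.5 else b[i] for i in range(len(a))]

pop = [[random.uniform(0, 2) for _ in range(4)] for _ in range(30)]
start_best = max(sum(w) for w in pop)

for _ in range(40):
    scored = sorted(pop, key=sum, reverse=True)
    survivors = scored[:15]          # top half survives unchanged
    pop = survivors[:]
    while len(pop) < 30:             # bottom half replaced by children
        a, b = random.sample(survivors, 2)
        pop.append(mutate(crossover(a, b)))

end_best = max(sum(w) for w in pop)
assert end_best > start_best         # selection pressure did its job
```

Because survivors carry over unchanged, the best individual is never lost, so fitness can only improve generation over generation.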

Where Things Stand

After 40 generations with a population of 30, the algorithm consistently discovers that w_empty should be by far the dominant weight. Keeping tiles free matters more than any of the structural heuristics, at least with one-step lookahead. The weights for smoothness and corner tend to vary a lot run-to-run, which suggests they’re not contributing much signal here.

Average scores hover around 3,000–5,000 after training, which is… okay. Nowhere near reliably hitting the 2048 tile. The ceiling of one-step greedy search is low no matter how good the evaluation function gets.

What’s Next

The obvious next step is extending the lookahead. Even a two-step lookahead would mean evaluating the opponent’s response (the random tile spawn), which starts to look like a minimax tree with chance nodes - this is called expectimax search. Implementing that is on the list.
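A minimal sketch of what that search could look like, written against an abstract game interface rather than the Board class so it runs on its own. The interface names (player_moves, chance_outcomes) and the toy game are my framing for this sketch, not settled code: player_moves(state) yields (move, next_state) pairs, chance_outcomes(state) yields (probability, next_state) pairs.

```python
# Depth-limited expectimax: max over player moves, expectation over
# chance nodes (in 2048, the random tile spawns).
def expectimax(state, depth, player_moves, chance_outcomes, evaluate):
    moves = list(player_moves(state))
    if depth == 0 or not moves:      # leaf or terminal position
        return evaluate(state)
    best = float('-inf')
    for _, after in moves:
        # Chance node: expected value over the random outcomes.
        expected = sum(p * expectimax(nxt, depth - 1, player_moves,
                                      chance_outcomes, evaluate)
                       for p, nxt in chance_outcomes(after))
        best = max(best, expected)
    return best

# Toy game to exercise the tree: states are numbers, moves add 1 or 3,
# chance keeps the value or subtracts 1 with equal probability.
moves = lambda n: [("small", n + 1), ("big", n + 3)]
chance = lambda n: [(0.5, n), (0.5, n - 1)]
value = lambda n: n

# One ply: "big" reaches 3, expectation over {3, 2} is 2.5.
assert expectimax(0, 1, moves, chance, value) == 2.5
```

Plugging the real game in would mean enumerating the four slide directions at max nodes and every empty cell times {2 with p=0.9, 4 with p=0.1} at chance nodes, with evaluate(grid, weights) from above at the leaves.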

The other thing I want to try is using the genetic algorithm to search over move sequences rather than heuristic weights. Instead of learning a policy, learn a fixed prefix of moves that tends to produce good board configurations early, and see if that bootstraps better play. Probably not the right framing but it sounds fun to try.

More updates when I have something worth reporting!
