I have made the source code for Sudoku Grader open and available out on GitHub. Feel free to fork it and add your new strategies or modify the grading scheme.
I have made the source code for Sudoku Grader open and available out on GitHub. Feel free to fork it and add your new strategies or modify the grading scheme.
Imagine a world where every last screw has to be custombuilt, and everybody drives on different sides of the road. This is a world without standards, and a realm imbued with chaos. This is Sudoku. Every publication seems to follow its own grading scale, and the rifts between different scales can be treacherous. Individuals are needlessly discouraged when their egos, inflated by relatively easy level “5 of 5” puzzles in one, are crushed to sand by “Moderate” puzzles in another. It is a crime to for such a rift to cause mass disillusionment with an otherwise enjoyable form of brainteasing entertainment. Making it even worse, creating the standards themselves does not require an earthshattering amount of work.
Creating an open standard for grading Sudoku puzzles is accomplished easily enough. The confusion and disenfranchisement can end. Just as every computer keyboard doesn’t need its own custom layout, every publication does not need its own Sudoku grading scale. With grading systems like the one proposed, the Sudokusolving masses can ascend from the bottomless, sinister, and putrid bowels of proprietary systems into the light of standardization.
With the standardized grading scale finished, it is time to begin analysis of some of the major news publications’ and Sudoku providers’ grading systems. Four possible publications to consider are USA Weekend, The Columbus Dispatch, NYTimes.com, and The Giant Gift Book of Sudoku by Will Shortz. These are two print news publications, a website, and a Sudoku book. Sampling puzzles from each source and running them through Sudoku Grader, a Sudoku grading program written by myself as prescribed in this essay, some interesting patterns emerge in the data. The data, available in Appendix A has been summarized in Table 1.
Table 1 Relative Difficulties 

Sudoku Grader Grade 
USA Weekend 
Columbus Dispatch 
NYTimes.com 
Shortz 
10 




9 




8 




7 



Beware 
6 



Beware 
5 



Beware 
4 


Difficult 
Demanding 
3 


Difficult 
Demanding 
2 

5 

Moderate 
1 
Easy, Medium, Hard 
4 
Medium 
Light, Moderate 
0 
Easy 
1,2,3,4 
Easy 
Light 
First of all, it is clear that different publications’ grading scales do not line up. Also, it is immediately apparent that puzzles written for print news publications tend to be graded on a scale far easier than those available through other means. A puzzle rated “5 of 5” in The Columbus Dispatch is significantly less difficult than a “Difficult” puzzle from NYTimes.com or a “Demanding” puzzle from Will Shortz, as confirmed by a twosample ttest (Appendix B). While other statistical conclusions are difficult to make due to limited sample sizes and unequal variances, it is worth noting some other trends in the data. First of all, every single USA Weekend or Columbus Dispatch puzzle rated had at most a strategic difficult score of one. The vast majority of these puzzles had a strategic difficulty score of 0, requiring only Slicing and Slotting and Simple Singles. This means that such print news publications do not include puzzles requiring strategies more advanced than Hidden Singles. Likely an effort to avoid alienating casual Sudoku solvers, only publishing easier puzzles contributes to confusion between the grading scales.
Ideally, the proposed grading scale would be better able to differentiate between the news publications’ difficulty levels. Were the system more precise, perhaps USA Weekend’s “Easy”, “Medium”, and “Hard” ratings would not grade to 1. But these subtle differences between difficulty levels are not an issue of the grader’s accuracy. Instead the problem is in lack of precision. If the grading scale used the set of real numbers 010 rather than the set of integers 010, finer distinctions could be made. That said, having ten levels of difficulty should offer enough distinction between difficulties. Anything more might be superfluous.
With the grading scale implemented and tested, one final question remains: why hasn’t a standardized grading system been adopted yet? As evidenced by this paper, creating the standardized and transparent grading system need not be overly complex. The issue is that different individuals perceive difficulty in different ways, and different publications do not want to alienate their target audiences. Readers of The Columbus Dispatch might be disappointed to learn that their puzzles are relatively easy compared to those in other sources by seeing a low standardized score, despite the fact that the target audience for a newspaper’s Sudoku puzzles is less sophisticated than those that search for difficult puzzles in published Sudoku collections.
A .05 alphalevel is used for all tests.
Test 1
H_{0 }: Sudoku Grader grade for “5 of 5” Dispatch puzzles = Sudoku Grader grade for
“Difficult” NYTimes.com puzzles
H_{A }: Sudoku Grader grade for “5 of 5” Dispatch puzzles < Sudoku Grader grade for
“Difficult” NYTimes.com puzzles
Conclusion
pvalue<.05, so evidence suggests that the Sudoku Grader grade for “5 of 5” Dispatch puzzles < Sudoku Grader grade for “Difficult” NYTimes.com puzzles
Test 2
H_{0 }: Sudoku Grader grade for “5 of 5” Dispatch puzzles = Sudoku Grader grade for
“Demanding” Shortz puzzles
H_{A }: Sudoku Grader grade for “5 of 5” Dispatch puzzles < Sudoku Grader grade for
“Demanding” Shortz puzzles
Conclusion
pvalue<.05, so evidence suggests that the Sudoku Grader grade for “5 of 5” Dispatch puzzles < Sudoku Grader grade for “Demanding” Shortz puzzles
The release version of Sudoku Grader is now available on the downloads page. A few tweeks have been made. Will you be able to notice them?
Check it out by clicking on the downloads tab.
In fact, analyzing and grading puzzles is a task perfectly suited to numbercrunching, always impartial, infallible geniuses—computers. All that is required is an algorithm to follow, and the computer will do all of the rest of the work. So the first step in creating a Sudoku grading scale is deciding on an algorithm. The question that must be asked is this: what makes a Sudoku puzzle difficult? Difficulty can be decomposed into two basic categories: strategic and procedural. Strategic difficulty refers to the difficulty of the techniques used in solving the puzzle, while procedural difficulty refers to the difficulty of the progression through those techniques. For instance, a puzzle might require one of the most difficult strategies, but only use it once, thus having a low procedural difficulty despite a high strategic difficulty. These two categories form the basis for the final difficulty score.
With these criteria in mind, one can begin to create the skeleton of a grading system. The computer will take a puzzle, and attack it with the solving techniques one at a time. First it will attempt to apply the easiest technique, but if it cannot do so jump up to a more difficult strategy, and continue this until a strategy is applied. The program will then start over with the easiest technique in case the application of the previous technique resulted in making an easy strategy usable. The program will continue in this manner until either the puzzle is solved, or it runs out of solving techniques to apply. After the solving is complete, by examining which strategies are used a strategic difficulty score will be calculated, and then a procedural difficulty will be calculated by analyzing the steps taken. It appears easy enough, but there are some complications.
Using this grading algorithm, one immediately runs into a significant problem. Who is to say how difficult strategies are in comparison with one another, and how can one be certain that the grading system takes into account every possible Sudoku solving strategy? These are thorny issues that cannot be rectified, so some concessions must be made. First of all, there are tens of strategies for solving Sudoku puzzles. As long as all strategies below a certain plateau of difficulty are covered, it is not a problem that not all strategies are implemented. If a necessary strategy is not used by the computer, resulting in an unsolvable puzzle, one can simply give the puzzle the highest difficulty score available. On the second front, one needs to break down the common solving strategies into comparable categories.
There are two categories of strategies: strategies that result in a cell in the Sudoku grid being assigned a value and those that result in possible candidates for a cell’s value being removed. Peter Gordon, author of The Mensa Guide to Solving Sudoku, suggests a certain progression through these categories when solving Sudoku puzzles. The first category tends to be the easiest. It includes strategies, in order from easiest to hardest, like Simple Singles, Slicing and Slotting, Singles, and Hidden Singles. The second category is more difficult, including strategies, again in order from easiest to hardest, like Locked Candidates, Naked Subsets, Hidden Subsets, and advanced strategies like XWing and Swordfish. With the strategies identified, all one has to do is understand the strategies, and then decipher how to program the computer to follow them.
A good standard grading scale would run from 0 to 10. Because mankind operates in a base 10 world, this scale just plain makes sense. The largest factor in determining whether or not an individual can solve a puzzle is whether the individual knows enough strategies to solve it. If one does not know the strategy, one cannot apply it, and the Sudoku cannot be solved. For this reason, allocate 7 of the 10 points in the grading scale to strategic difficulty. A 7 will be given when the program does not have enough strategies in its arsenal to solve the puzzle. Otherwise, the puzzle will assume the strategic difficulty score of the most difficult strategy it uses. A 6 will be granted if XWing or Swordfish are used. A 5 will be granted if Hidden Quads are used, as they can be extremely difficult to spot. A four will be given if a Naked Quad, Hidden Triple, or Hidden Pair is used. A three will be given to Naked Pairs and Naked Triples. Naked subsets are quicker to spot than hidden subsets, thus a Naked Subset the same size as a Hidden Subset will have a lower score. Finally, a two will be given if Locked Candidates are used, a one if Hidden Singles or Singles are used, and a zero if only Slicing and Slotting and Simple Singles are used. These last two strategies are quick to spot and do not require keeping candidate lists, and thus carry the lowest difficulty rank. With strategic difficulty assigned, now the program can assign procedural difficulty scores.
A procedural difficulty score is more difficult to assign because individuals solve puzzles in different ways and tend to take different paths when approaching the same Sudoku puzzle. That said, there are a couple of universal factors that will make a puzzle procedurally difficult. First is the number of candidate elimination strategies used. If the bulk of strategies used to solve a puzzle do not result in any values being placed, an individual solving the Sudoku will have a much harder time. So if more than four candidate elimination techniques are used, one can add one to the procedural difficulty score. If eight are used, the puzzle is especially difficult, and another procedural difficulty point can be tacked on. The second problem that will make a puzzle procedurally difficult is its sheer length. This can be measured by counting the number of empty squares in the puzzle. If there are more than 55 squares to solve, a final procedural difficulty point may be added. This results in a total possible procedural difficulty score of 3, which adding to the strategic difficulty score, will give a score out of 10. The grading scale is complete
If the Sudoku grading program runs through all algorithms without being able to apply any of them when the grid is not yet solved, there is a substantial issue at hand. It is important to know if the puzzle actually can be solved. Because the grading algorithm does not concern itself with strategies more difficult than Swordfish, it is paramount to know at least that the puzzle has a solution in order to differentiate between giving the highest strategic difficulty score and alerting the user that the puzzle is in fact impossible. To solve this requires a new solving technique: brute force.
Programming the brute force solving technique can be done in many ways, some of which will run quicker than others. A decent brute force technique that, while not the fastest available, is recursive backtracking. The idea behind this approach is to take the grid, guess a value for the first spot, and if the grid is still valid, guess a value for the next empty spot until the grid is filled (Jensen 1). If the grid is not valid, then another guess for the value will be inserted into the empty space. If no guesses result in a valid grid, then the guessing continues at the previous empty cell in the same manner. This process repeats itself in a recursive fashion, backtracking whenever it creates an invalid grid. If the grid is never filled, then the puzzle cannot possibly have any solutions. Alternatively, editing the logic slightly, one can also determine if a puzzle is impossible because it has multiple valid solutions. With a means of finding the puzzle’s strategic difficulty, it is time to begin constructing the grader’s scale.
The final common advanced strategy is named Swordfish. Swordfish works like XWing, except that instead of looking for a candidate occurring only twice in two rows or columns, Swordfish is looking for a candidate that occurs two or three times in three columns, in the same three rows. Again, the idea is that wherever this candidate actually occurs in each of the three columns, it will set off a chain reaction that will force the candidate into a spot in each of the shared rows (Penny Publications 6). The strategy works not only for columns against rows, but also rows against columns. Figure 10 demonstrates this perfectly. First of all, note that in the second, sixth, and fifth rows the candidate 6 occurs twice. These occurrences have been highlighted. Because between the three rows, all of the candidate occurrences are limited to exactly three columns, Swordfish can be applied. Swordfish works because it is known that one of two cells in the rows in question will contain a 6, and the cell that gets the 6 will force a cell in another row but the same column not to be a 6, dictating a cell in that other row to have the value 6, in turn setting the location of the final 6 in the last final row. Because the chain of events, no matter which cell is initially determined to be 6, will lead to a 6 being a value in each column in one of the highlighted cells, other cells in those columns but not in the three rows containing 6 as a candidate can have that candidate removed, as indicated with red xmarks in the example. This is not a strategy for the faint of heart.
Fortunately, Swordfish is simpler to implement and utilize than it is to fully understand. Looping through all possible combinations of three rows or three columns and then all candidates 19, the algorithm will check to see that all three rows’ cells containing the given candidate number exactly two or three cells and are members of the same three columns. At this point, the candidate can be removed from each column in question in cells outside of the three rows. That is all it takes to program this final Sudoku strategy. But there is still a slight problem.