Letter Recognition using kNN

We will use kNN to build a basic OCR application.

Sources:

Contents

English Alphabets Dataset

This example is similar to the digits dataset seen in another demo, but for the English alphabets, there is a slight change in data and feature set. Here, instead of images, OpenCV comes with a data file, letter-recognition.data. If you open it, you will see 20000 lines. In each row, first column is an alphabet which is our label. Next 16 numbers following it are its different features. These features are obtained from UCI Machine Learning Repository. You can find the details of these features here.

There are 20000 samples available, so we take first 10000 data as training samples and remaining 10000 as test samples. We also should change the alphabets to ASCII codes as integer class labels.

Load dataset

fname = fullfile(mexopencv.root(), 'test', 'letter-recognition.data');
if exist(fname, 'file') ~= 2
    disp('Downloading dataset...')
    url = 'https://cdn.rawgit.com/opencv/opencv/3.2.0/samples/data/letter-recognition.data';
    urlwrite(url, fname);
end
if ~mexopencv.isOctave()
    fid = fopen(fname, 'rt');
    D = textscan(fid, ['%c' repmat(' %d',1,16)], 20000, ...
        'Delimiter',',', 'CollectOutput',true);
    fclose(fid);
else
    %HACK: textscan failed to read the whole file in Octave!
    D = cell(1,1+16);
    [D{:}] = textread(fname, ['%s' repmat(' %d',1,16)], 20000, 'Delimiter',',');
    D = {char(D{1}), cat(2, D{2:end})};
end

Prepare data for classification

labels = int32(D{1}) - int32('A');
data = single(D{2});
whos data labels
if ~mexopencv.isOctave() && mexopencv.require('stats')
    %HACK: tabulate in Octave behaves differently
    tabulate(sort(D{1}))
end
N = 10000;  % fix(numel(labels)/2)
  Name            Size              Bytes  Class     Attributes

  data        20000x16            1280000  single              
  labels      20000x1               80000  int32               

  Value    Count   Percent
      A      789      3.94%
      B      766      3.83%
      C      736      3.68%
      D      805      4.03%
      E      768      3.84%
      F      775      3.88%
      G      773      3.87%
      H      734      3.67%
      I      755      3.77%
      J      747      3.73%
      K      739      3.69%
      L      761      3.81%
      M      792      3.96%
      N      783      3.92%
      O      753      3.77%
      P      803      4.01%
      Q      783      3.92%
      R      758      3.79%
      S      748      3.74%
      T      796      3.98%
      U      813      4.07%
      V      764      3.82%
      W      752      3.76%
      X      787      3.94%
      Y      786      3.93%
      Z      734      3.67%

Train

K = 5;
knn = cv.KNearest();
knn.DefaultK = K;
fprintf('Training... '); tic
knn.train(data(1:N,:), labels(1:N));
toc
Training... Elapsed time is 0.005358 seconds.

Test

fprintf('Testing... '); tic
yhat = knn.findNearest(data(N+1:end,:), K);
toc
%TODO: https://savannah.gnu.org/bugs/?50716
Testing... Elapsed time is 0.359780 seconds.

Performance on test set

confmat = accumarray([labels(N+1:end), yhat]+1, 1);
fprintf('Accuracy = %d/%d\n', trace(confmat), sum(confmat(:)));
if mexopencv.isOctave()
    display(confmat)
else
    % Display confusion matrix as a nicley formatted table
    letters = cellstr(unique(D{1}));  % num2cell('A':'Z')
    t = array2table(confmat, 'RowNames',letters, 'VariableNames',letters);
    display(t)

    % We can see how most classification errors involve similar looking letters
    [r,c,v] = find(triu(confmat,1) + tril(confmat,-1).');
    t = table(letters(r), letters(c), v, ...
        'VariableNames',{'letter1', 'letter2', 'err'});
    t = sortrows(t, [-3 1 2]);
    display(t(1:min(end,10),:))
end
Accuracy = 9306/10000
t =
  26×26 table
          A      B      C      D      E      F      G      H      I      J      K      L      M      N      O      P      Q      R      S      T      U      V      W      X      Y      Z 
         ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___    ___
    A    388      1      0      0      0      0      0      0      0      0      0      0      2      0      0      0      0      0      0      0      1      0      0      0      4      0
    B      0    351      0      1      3      1      0      0      0      0      0      0      2      0      0      0      0     10      1      0      0      2      0      0      0      1
    C      0      0    336      0      4      0      4      0      0      0      0      0      1      0      8      0      2      0      0      0      2      0      1      0      0      0
    D      0      1      0    398      1      0      0     14      0      0      0      0      0      1      1      0      0      1      1      0      0      0      0      0      0      0
    E      0      4      8      0    336      3      7      1      0      0      3      0      0      0      0      0      0      0      0      0      0      0      0      0      0      8
    F      0      1      0      1      2    356      0      3      2      0      0      0      0      2      0     16      0      0      1     10      0      2      0      0      0      0
    G      1      5      3      6     14      1    367      1      0      0      1      0      1      0      2      0      1      0      0      0      0      2      1      0      0      0
    H      0      8      1      9      4      0      5    276      0      0     10      1      1      0      2      0      1      6      0      0      0      0      0      1      1      1
    I      1      1      0      1      0      7      0      0    354     24      0      0      0      0      0      1      1      0      0      0      0      0      0      1      0      0
    J      0      0      0      1      2      0      0      1     13    336      0      0      0      0      1      0      0      0      0      0      1      0      0      0      0      0
    K      0      4      0      3      4      0      0     29      0      0    308      0      0      0      1      0      0     12      0      0      0      0      0      8      0      0
    L      0      0      1      0      2      0      2      3      0      1      2    372      0      0      0      0      1      1      0      0      0      0      0      1      0      0
    M      0      2      0      0      0      0      1      0      0      0      0      0    368      5      2      0      0      0      0      0      0      0      4      0      0      0
    N      0      7      0      5      0      0      0      8      0      0      0      1      3    370      6      0      0      5      0      0      0      4      0      0      0      0
    O      0      0      1     12      0      0      1      1      0      0      0      0      0      2    350      0      3      0      0      0      1      0      2      0      0      0
    P      0      3      0      2      1     13      0      4      0      0      0      1      0      0      1    367      1      1      0      0      0      0      0      0      0      0
    Q      1      1      1      1      2      0      4      0      0      0      0      0      0      0     15      0    386      1      0      0      0      0      0      0      1      0
    R      0     10      0      4      0      3      0      7      0      0      7      0      0      1      0      0      1    358      1      0      0      2      0      0      0      0
    S      4      6      0      4      9      1      0      0      0      1      0      1      0      0      0      0      0      1    365      1      0      0      0      0      0      0
    T      0      2      0      3      0      2      0      0      1      0      1      0      0      0      0      0      1      0      0    351      0      1      0      2      5      0
    U      1      1      0      0      0      0      0      8      0      0      0      0      1      0      1      0      0      0      0      1    394      0      0      0      0      0
    V      0      7      1      0      0      1      0      0      0      0      0      0      1      0      1      0      0      0      0      1      1    365      4      0      0      0
    W      0      0      0      0      0      0      0      1      0      0      0      0      2      0      4      0      0      0      0      0      1      5    385      0      0      0
    X      2      3      0      1     11      0      0      0      1      1     12      0      0      0      0      0      0      1      0      1      1      0      0    353      0      1
    Y      0      1      0      1      0      0      0      1      0      0      0      0      0      0      0      2      0      0      0     11      3      1      1      1    356      0
    Z      0      0      0      1      1      1      0      0      0      3      0      1      0      0      0      0      5      0      2      2      0      0      0      0      0    360
  10×3 table
    letter1    letter2    err
    _______    _______    ___
    'H'        'K'        39 
    'I'        'J'        37 
    'F'        'P'        29 
    'D'        'H'        23 
    'E'        'G'        21 
    'B'        'R'        20 
    'K'        'X'        20 
    'K'        'R'        19 
    'O'        'Q'        18 
    'T'        'Y'        16 

Citations