Tag Archives: Data Analysis

One vs All logit to recognize handwritten digits

The blog has been neglected for a long time. I graduated from my Master’s program, landed a job and worked on some really fun assignments at work. I wish I could talk about them more here, but I shouldn’t.

I have always been enthusiastic about data and about solving problems with simple data science algorithms, so I recently signed up for Andrew Ng’s machine learning course on Coursera. However, the course was in Octave, and as a die-hard R fan who doesn’t really care about certificates (I learn more when I’m relaxed and doing it just for pleasure), I decided to have a go at all the assignments in R.

I’m not presenting anything path-breaking here, just some of my thoughts as I go through these assignments as an amateur enthusiast.

First off, Andrew Ng is a wonderful teacher who is extremely knowledgeable and yet has a simple, clear way of presenting complex concepts. I recommend his course to anybody who wants to have fun with data.

In the third assignment, we were given the extremely interesting task of recognizing handwritten digits from 0 to 9 using a machine learning algorithm, specifically one vs all logistic regression. The input consisted of the pixel intensities of images of the digits, with each digit image stored as one row of the input matrix.

Displaying a random sample of rows of the input dataset as an image gave me this:


Logistic regression is a rather common statistical modelling technique and machine learning algorithm used as a binary classifier. It is basically a generalized linear model with the logit as its link function; the sigmoid function, which is the inverse of the logit, maps the linear predictor to a probability. The task here was to extend the binary classifier into a multiclass classifier that could learn and recognize each of these 10 different digits.

The idea behind the extension is pretty simple. Select each class (a digit, in this case) in turn as the ‘one’ and group all the other classes (digits) together as the ‘all’. The logit then gives you the probability that the image is of the selected digit and not any of the other digits. Building such a model for each digit gives you 10 different models, each giving the probability that the image is of the digit specified by that model. You can then pick the maximum of these probabilities and assign the image to that category: if the maximum probability is for the digit 6, then the image is most likely a 6.
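The one vs all recipe can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the R code attached to this post; the tiny 2-D dataset, the function names and the learning rate are all made up for the example, and plain gradient descent stands in for a packaged optimizer.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_prob(weights, x):
    # weights[0] is the intercept; the rest pair up with the features in x.
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)

def train_binary(X, y, lr=0.2, iterations=3000):
    # Plain batch gradient descent on the mean logistic loss.
    weights = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(iterations):
        grad = [0.0] * len(weights)
        for x, target in zip(X, y):
            err = predict_prob(weights, x) - target
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

def train_one_vs_all(X, labels):
    # One binary model per class: that class is the 'one', the rest are the 'all'.
    classes = sorted(set(labels))
    return {k: train_binary(X, [1 if lab == k else 0 for lab in labels])
            for k in classes}

def predict_one_vs_all(models, x):
    # Assign the class whose model reports the highest probability.
    return max(models, key=lambda k: predict_prob(models[k], x))

# Made-up 2-D points in three well-separated clusters (stand-ins for digits).
X = [[0, 0], [1, 0], [0, 1], [4, 0], [5, 0], [4, 1], [0, 4], [0, 5], [1, 4]]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = train_one_vs_all(X, labels)
predictions = [predict_one_vs_all(models, x) for x in X]
print(predictions)
```

With 10 digits the dictionary would simply hold 10 weight vectors instead of 3; nothing else in the sketch changes.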

At the outset, the above method seems simple and elegant. I wondered why we progressed to multinomial logistic regression in our econometrics classes as soon as we had multiple classification problems, and why we never paused to consider multiple logistic regressions. I also felt very stupid for not having thought of this. However, as I started thinking about it more, I realized that there is a key difference between modelling for machine learning and modelling to study a social science like economics, though the models are very similar. In machine learning, as far as I have seen, the focus is on the prediction, the parameters of the model being mere tools to get us to an accurate prediction. In a social science (and in business) there is equal if not more emphasis on the parameters themselves, as decision tools which tell us about the importance of each of the independent variables in the model.

In a multinomial logistic regression, the probabilities of belonging to the various classes are evaluated jointly instead of in a stratified model. (This also means you can make do with just 1 model with n-1 sets of parameters for n classes, instead of n models for n classes as in the one vs all example above.) This gives us a clean set of probabilities for the mutually exclusive classes, which add up to one. It also makes it easy to calculate, in one go and in a coherent way for the entire system, important marginal effects of the variables, such as the effect on the odds ratio for different groups vs the base group of a change in one of the independent variables. I’m not sure how I would interpret marginal effects coming out of a onevsall model in a sensible way. Maybe someone can clear that up for me.
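To make the "n-1 sets of parameters" and the odds-ratio reading concrete, here is a small Python sketch (standard library only, illustrative rather than the course's or this post's code). The coefficients are hypothetical numbers chosen for the example; the base class has its score fixed at zero.

```python
import math

def multinomial_probs(x, betas):
    # betas holds one coefficient list (intercept first) per non-base class.
    # The base class score is fixed at 0, so n classes need n-1 sets.
    scores = [0.0] + [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
                      for b in betas]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients for a 3-class problem with 2 features.
betas = [[-1.0, 0.8, 0.1],   # class 1 vs base class 0
         [0.5, -0.3, 0.6]]   # class 2 vs base class 0

x = [1.0, 2.0]
probs = multinomial_probs(x, betas)

# Odds of class 1 relative to the base class, before and after
# increasing the first feature by one unit.
odds_before = probs[1] / probs[0]
bumped = multinomial_probs([x[0] + 1.0, x[1]], betas)
odds_after = bumped[1] / bumped[0]
odds_ratio = odds_after / odds_before  # equals exp(0.8), class 1's first slope

print(sum(probs), odds_ratio)
```

The probabilities sum to one by construction, and the odds ratio depends only on the corresponding coefficient, which is what makes the parameters easy to interpret. Separately fitted one vs all models give no such guarantee.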

Another realization that hit me was that plain gradient descent is not a great way to learn the parameters of a logistic regression, because choosing the learning rate is a huge headache. In many scenarios, it is near impossible to choose a ‘goldilocks’ learning rate. In my moments of misplaced idealism I had decided not to use the packaged optimization functions Prof Ng asked us to use, trying my own hack of a gradient descent function instead. Either I would end up with a rather large learning rate, leading to an algorithm that wouldn’t converge and that, as I think was mentioned in the class, would dance around a local minimum without actually hitting it, looking something like this (lol):

local min

Or I would have a very small learning rate, which would take millions of iterations to converge. Anyway, I finally used a packaged optimizing function, though not the MATLAB function coded and included in the exercise resources specifically for this exercise (I was too lazy to translate it into R). I got an in-sample accuracy of 91%, which is less than what was expected according to the assignment handout. I attribute this to not using the specified optimizing function.
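The learning-rate trade-off described above can be reproduced on a toy problem. The sketch below is in Python with only the standard library and is not the gradient descent function I actually wrote. The eight data points, the single-weight model without an intercept, and the three learning rates are all assumptions picked to make the effect visible.

```python
import math

# Made-up 1-D data that is not perfectly separable, so the minimum is finite.
xs = [-3, -2, -1, -1, 1, 1, 2, 3]
ys = [0, 0, 0, 1, 1, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w):
    # Mean logistic loss for the model p = sigmoid(w * x), no intercept.
    total = 0.0
    for x, y in zip(xs, ys):
        s = w * x if y == 1 else -w * x
        total += math.log1p(math.exp(-s))
    return total / len(xs)

def gradient(w):
    return sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(lr, iterations=200, w=0.0):
    # Returns the final weight and the cost after every step.
    history = []
    for _ in range(iterations):
        w -= lr * gradient(w)
        history.append(cost(w))
    return w, history

w_small, hist_small = gradient_descent(lr=0.0001)  # barely moves
w_good, hist_good = gradient_descent(lr=1.0)       # settles at the minimum
w_big, hist_big = gradient_descent(lr=10.0)        # keeps overshooting

print(hist_small[-1], hist_good[-1], max(hist_big[-20:]))
```

After the same 200 steps, the tiny rate has hardly reduced the cost, the middle rate has converged, and the large rate is still jumping from one side of the minimum to the other. On a real problem with hundreds of pixel features the workable range is much harder to find by hand, which is the practical argument for a packaged optimizer.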

Overall it was a fun exercise, and I hope to learn a lot more. I also wish I had set aside a test sample to see how my onevsall classifier would work on a new sample of data. However, given the time it takes for R to run the optimization function on my laptop, I’m totally chickening out of that for now.

PS: I’ve attached the R code for onevsall I used, in case anyone wants to read it and lambaste its inefficiencies.


Posted on March 12, 2014 in Data Analysis


Comparing Classifiers – Revisiting Bayes

I have been quite interested in data and its analysis lately. One of the major tasks involved in dealing with data is classifying it. Is a customer creditworthy or not? Would a customer be interested in buying the latest flavor of ice cream? Or better still, which flavor/brand is she likely to choose? While these questions require predicting the future, more specifically they require you to classify people/objects into different bins based on what has been observed historically.

To address this issue many types of classifiers have been developed by mathematicians, statisticians and computer scientists. Most of these make some kind of assumptions about underlying data and are varied in their complexity as well as accuracy. As a rule of thumb, the more complex classifiers make less stringent assumptions about the underlying data and thereby give more accurate results for data which isn’t as well behaved as a statistician would ideally like it to be.

Since this piqued my interest I decided to test out two well known classifiers very varied in their level of complexity on the famous iris data. The classifiers I tried out are the Naïve Bayes Classifier and the Multinomial Logistic Regression Model.

I think I’ll talk a little about the data first. The Iris dataset is pretty famous and is available as one of the pre-loaded data sets in R (open source statistical software).

The dataset is a table of observations collected by the botanist Edgar Anderson. It has measurements of the Sepal Length, Sepal Width, Petal Length and Petal Width of three species of the Iris flower, namely Iris Setosa, Iris Virginica and Iris Versicolor.

Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         6.2         2.9          4.3         1.3 versicolor
         4.8         3.4          1.9         0.2     setosa
         6.4         3.2          4.5         1.5 versicolor
         6.6         2.9          4.6         1.3 versicolor
         7.9         3.8          6.4         2.0  virginica
         6.0         2.9          4.5         1.5 versicolor

Our objective is to create a classifier that can learn from this data and classify future observations of iris flowers into one of these three species based on these four measurements.

The Naïve Bayes classifier is an extremely simple classifier based on Bayes’ Theorem. It makes the strict assumption that the attributes (Petal Width, Petal Length, Sepal Width, Sepal Length) are conditionally independent given the species. That is, once you know which species a flower belongs to, the probability of it having a large petal width doesn’t depend on whether it has a large petal length. If this is true, we would expect no correlation between any pair of attributes within a flower species.

A quick look at the scatter plot would tell us that this isn’t exactly true. Looking at the third box on the last row, there is an evident correlation between Petal Width and Petal Length given that the species is Versicolor (red). There are plenty of pairs which don’t look too correlated as well. I am going to go ahead and use the Naïve Bayes Classifier anyway and see what it does. Naïve Bayes is a classifier whose algorithm breaks down into simple counting, so it’s very easy to understand and computationally simple to implement. Professor Andrew Moore’s website is an excellent source for understanding these and other algorithms in data analysis.
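To show what "simple counting" means, here is a minimal Naïve Bayes sketch in Python (standard library only, not the R code used for this post). It assumes the attributes have been binned into categories, and the six training rows are invented for illustration; real iris measurements are continuous and would need binning or a density estimate per attribute. Add-one smoothing is my own addition to avoid zero counts.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    class_counts = Counter(labels)
    # feature_counts[(feature index, class)] counts each observed value.
    feature_counts = defaultdict(Counter)
    values = defaultdict(set)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[(j, label)][value] += 1
            values[j].add(value)
    return class_counts, feature_counts, values

def predict_naive_bayes(model, row):
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for j, value in enumerate(row):
            # P(value | class) with add-one smoothing; multiplying these
            # together is exactly the conditional independence assumption.
            seen = feature_counts[(j, label)][value]
            score *= (seen + 1) / (count + len(values[j]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented, binned (petal length, petal width) observations.
rows = [("short", "narrow"), ("short", "narrow"),
        ("medium", "medium"), ("medium", "wide"),
        ("long", "wide"), ("long", "medium")]
labels = ["setosa", "setosa", "versicolor", "versicolor",
          "virginica", "virginica"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("short", "narrow")))
print(predict_naive_bayes(model, ("long", "wide")))
```

Training is a single pass that increments counters, and prediction is a product of ratios of those counts, which is why the method is so cheap.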

The Multinomial Logistic Regression Model uses a more complex estimation process. It is an extension to multiple groups of the logistic regression model, which provides estimators for data that can be classified into two groups. Here we use multinomial logit rather than the basic logit model because the data has 3 groups, namely Versicolor, Setosa and Virginica. Essentially, a regression is estimated for two of these classes while the third is regarded as the base class. Each regression has 5 parameters (4 attributes, namely Sepal Length, Sepal Width, Petal Length and Petal Width, plus 1 intercept), bringing the total to 2 × 5 = 10 parameters. The estimation is usually done by maximum likelihood (equivalent to Maximum a Posteriori estimation with a flat prior), which has to be solved iteratively and is far more complex than the simple counting of the Naïve Bayes Classifier.

So why go to all this trouble? The answer is that multinomial logistic regression models make far less strict assumptions about the nature of the underlying data. The assumption made in the case of this model is called Independence of Irrelevant Alternatives. That is, adding another category to the existing three categories of species should not change the relative odds between any two of the species already listed. This condition only applies to the choice variable (Species) and says nothing about the attributes, unlike the conditional independence assumption in the Naïve Bayes classifier.
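Independence of Irrelevant Alternatives follows directly from the form of the multinomial logit probabilities, as this small Python sketch shows (standard library only; the scores and the extra "new_species" category are hypothetical values for illustration).

```python
import math

def choice_probs(scores):
    # Multinomial logit probabilities from one linear score per alternative.
    top = max(scores.values())
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical linear scores for one flower.
scores = {"setosa": 0.2, "versicolor": 1.1, "virginica": 0.7}

three = choice_probs(scores)
four = choice_probs({**scores, "new_species": 1.5})

# Relative odds of versicolor vs virginica, with and without the new category.
ratio_before = three["versicolor"] / three["virginica"]
ratio_after = four["versicolor"] / four["virginica"]

print(ratio_before, ratio_after)
```

Every individual probability shrinks when the fourth category is added, but the ratio between any two of the original species stays the same, because the shared denominator cancels.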

So I used both classifiers. The iris dataset has 150 rows of data, that is, 150 flowers were observed and recorded in terms of the attributes mentioned and their species. In order to test these classifiers, I used only 75% of the data to train them and the other 25% to compare the predictions they made against the true categories.

A simple function to calculate their respective error rates/accuracy was written in R.
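For readers who want the shape of that evaluation step, here is a rough Python equivalent using only the standard library. It is a sketch of the general idea, not a translation of my R function; the function names, the fixed seed and the way the split point is rounded down are all assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=0):
    # Shuffle a copy so the original order is left untouched.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    # Share of predictions that match the true categories.
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def error_rate(predicted, actual):
    return 1.0 - accuracy(predicted, actual)

# 150 row indices standing in for the 150 iris observations.
train, test = train_test_split(list(range(150)))
print(len(train), len(test))
```

Repeating the split with a different seed on each run and averaging the accuracies gives the kind of multi-run comparison reported below.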

The result? Mlogit was more accurate, but only marginally so. Over 10 runs, the Naïve Bayes classifier was 95% accurate on average and the Mlogit was 97% accurate. Is that a small price to pay for getting rid of complex computation? Maybe not, if you have powerful processors and efficient algorithms but every mistake could cost you a lot. Maybe so, if you just want to make quick classifications and a little loss in accuracy won’t cost you much compared to the gains in speed. Points to ponder?

You can download the code to do these comparisons here: Classifiers. It should work with any data frame as long as the last column is the choice variable (or so I believe).

Photo Credits

Wiki Commons –

Setosa – Radomil

Versicolor – Daniel Langlois

Virginica – Frank Mayfield


Posted on March 4, 2012 in Data Analysis, Uncategorized
