This one is for the love of my life. I thank Chris Hadfield for making your eyes shine.

## One vs All logit to recognize handwritten digits

The blog has been neglected for long. I graduated from my Master’s program, landed a job and worked on some really fun assignments at work, I wish I could talk about them more here but I shouldn’t.

I have always been enthusiastic about data and about solving problems with simple algorithms in data science. So I recently signed up for Andrew Ng’s machine learning course on Coursera. However, the course is taught in Octave, and as a die-hard R fan who doesn’t really care about certificates anyway (I learn more when I’m relaxed and doing it just for pleasure), I decided to have a go at all the assignments in R.

I’m not presenting anything path breaking here, just some of my thoughts as I go through these assignments as an amateur enthusiast.

First off, Andrew Ng is a wonderful teacher who is extremely knowledgeable and yet has a simple, clear way of presenting complex concepts. I recommend his course to anybody who wants to have fun with data.

In the third assignment, we were given the extremely interesting task of recognizing handwritten digits from 0 to 9 using a machine learning algorithm, specifically one vs all logistic regression. The input consisted of the pixel intensities of images of the digits, each digit being a row in the input matrix.

Displaying a random selection of rows of the input dataset as images gave me this –
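For anyone curious how such a display can be produced in R, here is a minimal sketch. A random matrix stands in for the course data (which, in the assignment, stores each digit as a 20×20 grayscale image unrolled into a 400-value row); the variable names are my own.

```r
# Stand-in for the assignment's pixel-intensity matrix:
# each row is a 20x20 grayscale digit unrolled into 400 values.
X <- matrix(runif(100 * 400), nrow = 100)

par(mfrow = c(2, 5), mar = rep(0.2, 4))    # show 10 digits in a 2x5 grid
for (i in sample(nrow(X), 10)) {
  img <- matrix(X[i, ], nrow = 20)         # refold the row into 20x20
  # flip the rows so the image isn't drawn upside down
  image(t(img[20:1, ]), col = grey.colors(256), axes = FALSE)
}
```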

Logistic regression is a common statistical modelling technique/machine learning algorithm used as a binary classifier. It is essentially a generalized linear model that uses the sigmoid function as its link function. The task here was to extend this binary classifier into a multiclass classifier that could learn and recognize each of the 10 digits. The idea behind the extension is pretty simple: select one class (digit) as the ‘one’ and group all the other classes (digits) together as the ‘all’. The logit then gives you the probability that the image shows the selected digit rather than any of the others. Building such a model for each digit gives you 10 models, each giving the probability that the image is of the digit that model was trained on. You can then take the maximum of these probabilities and assign the image to that category. If the maximum probability was for the digit 6, the image is most likely a 6.
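A minimal sketch of the idea in R, on toy data rather than the digit set (glm() stands in for the course's hand-rolled cost function; the data and names here are mine, purely for illustration):

```r
set.seed(1)
# toy data: 3 classes, 2 features, reasonably well separated
n <- 150
df <- data.frame(label = rep(1:3, each = n / 3))
df$x1 <- rnorm(n, mean = c(0, 3, 6)[df$label])
df$x2 <- rnorm(n, mean = c(0, 3, 0)[df$label])

classes <- sort(unique(df$label))
# one binary logit per class: P(this class vs all the rest)
probs <- sapply(classes, function(k) {
  fit <- glm(I(label == k) ~ x1 + x2, data = df, family = binomial)
  predict(fit, df, type = "response")
})
# assign each row to the class with the highest probability
pred <- classes[max.col(probs)]
mean(pred == df$label)   # in-sample accuracy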

At the outset the above method seems simple and elegant. I wondered why we progressed to multinomial logistic regression in our econometrics classes as soon as we had multiclass classification problems, and why we never paused to consider multiple logistic regressions. I also felt very stupid for not having thought of this. However, as I thought about it more, I realized that there is a key difference between modelling for machine learning and modelling to study a social science like economics, even though the models are very similar. In machine learning, as far as I have seen, the focus is on the prediction; the parameters of the model are mere tools to get us to an accurate prediction. In a social science (and in business) there is equal if not more emphasis on the parameters themselves as decision tools which tell us about the importance of each of the independent variables in the model.

In a multinomial logistic regression, the probabilities of belonging to the various classes are estimated jointly rather than in stratified models (this also means you need just one model with n−1 sets of parameters for n classes, instead of n models for n classes as in the one vs all example above). This means we get a clean set of probabilities for the mutually exclusive classes which add up to one. That makes it easy to calculate, in one go and in a coherent way for the entire system, important marginal effects of the variables, such as the effect on the odds ratio of each group versus the base group when one of the independent variables changes. I’m not sure how I would interpret marginal effects coming out of a one vs all model in a sensible way. Maybe someone can clear that up for me.

Another realization that hit me was that gradient descent is not a great way to learn the parameters of a logistic regression: choosing the learning rate is a huge headache. In many scenarios it is near impossible to choose a ‘goldilocks’ learning rate. In my moments of misplaced idealism I had decided not to use the packaged optimization functions Prof Ng asked us to use, trying my own hack of a gradient descent function instead. Either I would end up with a rather large learning rate leading to an algorithm that wouldn’t converge and, as mentioned in the class, would dance around a local minimum without actually hitting it. Looking something like this (lol)

Or I would have a very small learning rate, which would take millions of iterations to converge. In the end I used a packaged optimizing function, though I could not use the Matlab function coded and included in the exercise resources specifically for this exercise (I was too lazy to translate it into R). I got an in-sample accuracy of 91%, which is less than what was expected according to the assignment handout; I attribute this to not using the specified optimizing function.
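For anyone curious, this is the kind of thing I mean by a packaged optimizer: a sketch (my own toy example, not the assignment code) that hands the logistic cost and its gradient to R's optim() with BFGS, so there is no learning rate to babysit.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# unregularized logistic cost and its gradient
cost <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  eps <- 1e-10                      # guard against log(0)
  -mean(y * log(h + eps) + (1 - y) * log(1 - h + eps))
}
grad <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  as.vector(t(X) %*% (as.vector(h) - y)) / length(y)
}

set.seed(2)
X <- cbind(1, rnorm(100))           # intercept column + one feature
y <- as.numeric(X[, 2] + rnorm(100) > 0)

# BFGS picks its own step sizes -- no 'goldilocks' learning rate needed
fit <- optim(rep(0, ncol(X)), fn = cost, gr = grad,
             X = X, y = y, method = "BFGS")
fit$par                             # fitted intercept and slope
```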

Overall it was a fun exercise. I hope to learn a lot more, and I also wish I had a test sample set aside to see how my one vs all classifier would work on a new sample of data. However, given the time it takes R to run the optimization function on my laptop, I’m totally chickening out of that for now.

PS: I’ve attached the R code for one vs all that I used, in case anyone wants to read it and lambaste its inefficiencies.

Geraldo Rivera knows exactly why Trayvon Martin, an unarmed black teenager, was shot dead. He wore a hoodie, and a hoodie on a black person is scary! We all know that, don’t we? Only some people should wear hoodies, or hoods of any form.

Geraldo also offered a wonderful textbook notpology to all those offended by his extremely valuable advice.

Thanks Geraldo, we all get your advice. Now we know whose hoodies to be scared of and whose not to be.

## Is the sky falling down on the aviation sector?

You don’t have to follow Vijay Mallya’s tweets to know that the aviation sector is in trouble. There is trouble brewing everywhere in the industry which had shown great signs of promise since the liberalization of the sector in 2003.

Let’s look at the state owned Air India first. Sometimes I watch things on the news and I hear Ron Paul’s words ringing in my head. I don’t even believe in his kind of economics, but the state of the public sector in India is such that sometimes you’d think even a homeopathic dilution of the kind of libertarian thoughts I see around me now would be a welcome change in the scenario.

I am looking at Air India’s financial statements as I write this, and really, folks, it is not all about high fuel prices. What I have with me is the Annual Report for 2009-10; I do not know why later versions are not available on the site.

This is a quote from the annual report for 09-10, which makes it clear that the problems run deeper than high fuel prices. It is mismanagement coupled with deception.

“Operating expenses declined by Rs 23,157.8 million due to a decrease in fuel prices by 29%”

They go on to say how this and other declines in operating expenses were offset by increases in expenses, largely due to interest paid on newly acquired aircraft and borrowings for working capital.

It’s true that fuel prices have since shot up, but it is also clear that while fuel prices are exacerbating Air India’s problems, they are also serving as a carpet under which more endemic defects are being swept.

Attached with the Annual Report is the Comptroller and Auditor General’s (CAG) review of it. Back in 2010, the CAG stated that Air India had understated its losses by around 54 percent. A couple of instances have been highlighted by the auditor.

1) Air India has been recording its maintenance expense as a prepaid expense. That is, the lump sum the company pays for maintenance is recognized as expenditure and deducted from revenue only when the maintenance work is actually carried out. While this may seem fair at first glance, the auditor points out that the need for maintenance accrues with the hours the aircraft have already flown, not when the work is actually carried out. This is expenditure that has already been incurred and should be recorded as such.

2) Deferred tax assets – These are assets which would be recognized as tax credits against future income, but to claim them as assets there has to be a prospect of taxation in the future; that is, the company should have some taxable profits in the future. There was no reason to believe that such profits would arise, making the entire deferred tax assets phony.

Which brings us to the question of why the government is in the business of running an airline in the first place. Many arguments are being made, including one that Air India connects remote areas of the country which private operators wouldn’t cover.

Does such coverage have to come at the expense of huge losses to the taxpayer? Does this entire badly managed business, which has never known how to survive any kind of competition, let alone the fierce competition seen in civil aviation today, have to be maintained for this purpose? If a certain route is indeed that important, wouldn’t subsidizing a private player to fly that route be a more efficient idea? As for the employment Air India generates, remember that its employees haven’t been paid for over a quarter of a year now. How long are these operations going to be maintained on the strength of a purely welfare-based claim which they fall short of achieving anyway?

Looking at the entire marketplace, we see that it isn’t just the public sector but also the private sector that is in trouble. Since liberalization in 2003, which allowed private players into civil aviation, competition in the sector has been intense, with each player looking out for itself by slashing prices despite rising fuel prices and vying for the most profitable routes (except Air India?).

Air India does not have the same incentives to maintain profitability as the private sector, so it isn’t surprising that the government has been accused of pulling Air India out of profitable routes and allowing private airlines to fly them. If there is no incentive to run a profitable business, why run it at all? Just for a few under-the-table deals and kickbacks that such collusive moves provide to someone high up in the bureaucracy? The taxpayer money being lost here in all probability belongs to honest lower-middle-class Indians who may never even see the skies from the comfort of a plane seat.

Kingfisher’s troubles are well known. A luxury airline in a price-sensitive economy where competition was driving abnormal profits to zero! I guess nobody cared about in-flight entertainment or ‘personally selected’ air hostesses on their two-to-three-hour flights; they just wanted to get going at a low price. The highly leveraged buyout of Air Deccan to create simultaneous low-cost operations did not work out too well either. It was badly timed, coinciding with competition from Jet, which acquired Air Sahara, skyrocketing fuel prices, and the 2008 recession, which did not bode well for any of the players. IndiGo has stood through this environment as a lone example of success, driven by well-thought-out strategic decisions in fleet acquisition and slow, steady expansion into the market. These, however, are examples of a competitive market doing what it does best: rewarding better decision making. What Air India, which doesn’t even pretend to be interested in a profit, is doing flying around with these players is beyond me.

Also, why are taxes on fuel as high as thirty percent, with an additional surcharge, when the sector is clearly struggling? Why is Foreign Direct Investment OK in retail but not in aviation? It is easy to see why FDI in aviation would be a good thing. Another area worth looking into is the high charges carriers pay for using airport facilities. Unless the private sector is given the leeway it requires, the liberalization of the aviation sector may come to nothing, and people will once again find air travel a luxury for the very rich.

## Dinesh Derailed

Dinesh Trivedi just presented a Railway Budget. He also got fired. What happened? Well, he wasn’t ‘populist’ enough, that’s what happened. He apparently did not keep the best interests of the ‘aam admi’ (common man) in mind when deciding to hike fares.

Mr Trivedi proposed a series of increases in rail fares cutting across classes, given out as paise per kilometre. The opposition accused him of misrepresenting in paise what would actually be sizeable increases in passenger fares. Well, here is the truth: a paise-per-kilometre increase makes a lot of sense! The increase has not been uniform across classes; for the suburban trains most indispensable to the common man, it comes to Rs 2 per 100 km. How many people use suburban trains to travel 100 km? Only they would face even a Rs 2 increase in their fares.

One thing that stands out clearly is that while many support Mr Trivedi’s budget, given its strong emphasis on safety, modernization and a reduction in the operating ratio, all dire requirements for the railways, very few oppose the fare increases as a whole. Not even Trinamool Congress leader Mamata Banerjee or her sidekick Derek O’Brien (he should have stuck with his know-it-all school kids at the Bournvita Quiz Contest) are opposed to fare hikes in the classes above sleeper class, that is, the AC 2nd and 1st classes. They oppose the increases in the lower classes, as only those would affect the aam admi. This shows that everybody, even the hardcore communists, recognizes the need to increase revenues within the railways. Before going into ways in which revenue can be increased, let’s look at why it needs to be increased.

1) The operating ratio has to be decreased: currently 95 rupees are spent for every 100 rupees earned by the railways. The railways are owned by the government, and ruining their financial health would not be in the best interests of the aam admi.

2) SAFETY! – The Indian Railways transports a large sea of humanity; around 30 million people travel by train every day! Given this figure, the number of accidents may seem like a relatively small percentage, but is being part of a small percentage any consolation to those who lose their lives? Small percentages translate to large numbers in India, making safety a major priority. The list of accidents can be seen on Wikipedia. Safety costs money; Trivedi’s plan includes modernization of tracks and signalling systems and the manning of all level crossings. The aam admi values his life.

3) Modernization, hygiene, more safety – How many of us who have travelled on our beloved Indian trains (even that really cool-looking Shatabdi between Chennai and Bangalore, often preferred over flights) can call them hygienic? The stench that accompanies the railways is so characteristic that it’s considered ‘un-Indian’ to complain about it. Some of us have noticed recently, with glee, that the open-toilet system is changing; many trains now have greener toilets (Lalu Prasad Yadav initiated this venture). Under Trivedi’s plan, 2,500 more coaches would have green toilets by 2013. The open-toilet system is dreadfully unhygienic, especially for those with homes near railway tracks. It also corrodes the tracks and costs the railways around 350 crore rupees per year.

4) Capacity! – Despite being such a large network serving countless people, there is always more demand for railway tickets. (We have all at some point woken up at 8 am and restlessly hit the refresh button on the tatkal booking site of IRCTC with our fingers crossed, hoping to get lucky.) Trivedi has set aside Rs 4,410 crore to augment capacity.

5) R&D (design) – The Indian Railways don’t really look like the ones in Japan, do they? The railways are a solid system and do their job well, but they could do with a revamp in design. Trivedi plans to put money into a dedicated railway design wing at the National Institute of Design.

6) Many more reasons can be found in the budget highlights.

To do all of the above, Dinesh Trivedi proposed borrowing Rs 15,000 crore from the market and also a nominal hike in fares (after ten long years!).

The opposition thinks that fares should be increased only in the upper classes and not in the lower. I’d like to talk about why this is not such a great idea.

The Indian Railways is a price-discriminating monopoly. Prices differ based on whether you are a student, senior citizen, physically handicapped, female senior citizen, and so on. Prices differ across trains even for the same classes. The Indian Railways also forces consumers to reveal their preferences by offering a range of products. All this is done with the differences in the price elasticity of demand across categories in mind; that is, it acknowledges that increasing prices has an impact on ticket sales, and thereby on revenue, but that this impact differs across categories.

While I do not have any data to substantiate my claims, consider a case where there is a large price increase in the upper classes and none in the lower ones. I would broadly guess that, given the high fares in the AC 1st and 2nd classes and the presence of low-cost airlines as close substitutes, these classes have relatively elastic demand; an increase in prices would cut down travel by AC 1st class a lot, making it far less competitive compared to airlines, which have the great advantage of saving time. (MakeMyTrip let me book a ticket from Delhi to Chennai on a low-cost airline at Rs 4,700 as long as I booked really early; the corresponding Rajdhani 1st class ticket was Rs 4,500.) Some current AC travellers may also switch to the already overwhelmed second-class sleepers given the large difference in prices, again bringing in no new revenue and putting the ‘aam admi’ in further trouble with regard to lack of capacity.

For short distances, buses might substitute for sleeper-class trains; however, as far as I have seen, buses are more expensive than trains and do not work very well for longer distances. Therefore the most inelastic segment of demand is that for long-distance travel by sleeper class, which is where the revenue has to be made. The price hike here is pretty nominal; Rs 5 per hundred kilometres is not going to put anybody off travelling.

We would all like safer, more hygienic travel even if it meant shedding an extra hundred rupees. The hue and cry is over nothing. Kudos to Mr Trivedi for standing upright through this whole mess and refusing to roll back his changes despite being threatened with ouster. Those who oppose his changes should take a look at this.

## Comparing Classifiers – Revisiting Bayes

I have been quite interested in data and its analysis lately. One of the major tasks involved in dealing with data is classifying it. Is a customer credit worthy or not? Would a customer be interested in buying the latest flavor of ice cream? Or better still, which flavor/brand is she likely to choose? While these questions require predicting the future, more specifically they require you to classify people/objects into different bins based on what has been observed historically.

To address this issue many types of classifiers have been developed by mathematicians, statisticians and computer scientists. Most of these make some kind of assumptions about underlying data and are varied in their complexity as well as accuracy. As a rule of thumb, the more complex classifiers make less stringent assumptions about the underlying data and thereby give more accurate results for data which isn’t as well behaved as a statistician would ideally like it to be.

Since this piqued my interest, I decided to test two well-known classifiers, very different in their level of complexity, on the famous iris data. The classifiers I tried are the Naïve Bayes classifier and the multinomial logistic regression model.

I think I’ll talk a little about the data first. The iris dataset is pretty famous and is available as one of the pre-loaded datasets in R (the open-source statistical software).

The dataset is a table of observations collected by the botanist Edgar Anderson. It has measurements of the sepal length, sepal width, petal length and petal width of three species of the iris flower, namely Iris setosa, Iris virginica and Iris versicolor.

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|--------------|-------------|--------------|-------------|------------|
| 6.2 | 2.9 | 4.3 | 1.3 | versicolor |
| 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 7.9 | 3.8 | 6.4 | 2.0 | virginica |
| 6.0 | 2.9 | 4.5 | 1.5 | versicolor |

Our objective is to create a classifier that can learn from this data and classify future observations of iris flowers into one of these three species based on these measurements.

The Naïve Bayes classifier is an extremely simple classifier based on Bayes’ theorem. It makes the strict assumption that the attributes (petal width, petal length, sepal width, sepal length) are conditionally independent given the class; that is, the probability of a flower having a larger petal width doesn’t depend on it having a large petal length, once you know which species it is. If this were true, we would expect no correlation between attributes within a flower species.

A quick look at the scatter plots below (click and expand the gallery) tells us that this isn’t exactly true; looking at the third box in the last row, there is an evident correlation between petal width and petal length given that the species is versicolor (red). There are plenty of pairs which look not too correlated as well. I am going to go ahead and use the Naïve Bayes classifier anyway and see what it does. Naïve Bayes is a classifier whose algorithm breaks down into simple counting, so it is very easy to understand and computationally simple to implement. Professor Andrew Moore’s website is an excellent source for understanding these and other algorithms in data analysis.
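That within-species correlation is easy to check directly in R; for instance, for versicolor (a quick sanity check of the assumption, not part of the classifier):

```r
# correlation of petal length and width *within* the versicolor species --
# clearly nonzero, so conditional independence doesn't strictly hold
r <- with(subset(iris, Species == "versicolor"),
          cor(Petal.Length, Petal.Width))
round(r, 2)
```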

The multinomial logistic regression model uses a more complex estimation process. It extends to multiple groups the logistic regression model, which provides estimates for data that can be classified into two groups. Here we use the multinomial logit rather than the basic logit model, as the data has three groups, namely versicolor, setosa and virginica. Essentially, a regression is estimated for two of the classes while the third is treated as the base class. Five parameters (four attributes, namely sepal length, sepal width, petal length and petal width, plus one intercept) are therefore estimated for each regression, bringing the total to 10 parameters. The estimation involves maximum likelihood (or, with a prior, maximum a posteriori) estimation, which is far more complex than the simple counting of the Naïve Bayes classifier.

So why go to all this trouble? The answer is that the multinomial logistic regression model makes far less strict assumptions about the nature of the underlying data. The assumption it makes is called independence of irrelevant alternatives: adding another category to the existing three species should not change the relative odds between any two of the species already listed. This condition applies only to the choice variable (species) and says nothing about the attributes, unlike the conditional-independence assumption of the Naïve Bayes classifier.

So I used both classifiers. The iris dataset has 150 rows; that is, 150 flowers were observed and recorded in terms of the attributes mentioned and their species. To test the classifiers, I used only 75% of the data to train them and the other 25% to compare their predictions with the true categories.

A simple function to calculate their respective error rates/accuracy was written in R.
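A sketch of that comparison, assuming the e1071 (Naïve Bayes) and nnet (multinom) packages are installed; the exact code I used differs, but the shape is this:

```r
library(e1071)   # naiveBayes()
library(nnet)    # multinom()

set.seed(3)
idx   <- sample(nrow(iris), floor(0.75 * nrow(iris)))   # 75% to train
train <- iris[idx, ]
test  <- iris[-idx, ]                                   # 25% held out

nb <- naiveBayes(Species ~ ., data = train)
ml <- multinom(Species ~ ., data = train, trace = FALSE)

accuracy <- function(pred) mean(pred == test$Species)
accuracy(predict(nb, test))   # Naive Bayes accuracy on held-out rows
accuracy(predict(ml, test))   # multinomial logit accuracy
```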

The result? Mlogit was more accurate, but only marginally so. Over 10 runs, the Naïve Bayes classifier was on average 95% accurate and the mlogit 97% accurate. A small price to pay for getting rid of complex computation? Maybe not, if you have powerful processors and efficient algorithms and every mistake could cost you a lot. Maybe so, if you just want quick classifications and a small loss in accuracy costs little compared to the gain in speed. Points to ponder.

*You can download the code for these comparisons here. It should work with any data frame as long as the last column is the choice variable (or so I believe).*

Photo Credits

Wiki Commons –

Setosa – Radomil

Versicolor – Daniel Langlois

Virginica – Frank Mayfield

## Bayesian Skepticism: Why skepticism is not denialism.

**Note:** All credit goes to Kristoffer Rypdal, as this post borrows heavily from his paper *Testing hypotheses about climate change: the Bayesian approach*, which is available online. *I am merely an information summarizer and disseminator, nothing fancy about me.*

*Also, I’m not an expert on climate science; this post is just about a scientific way of thinking, and the example is merely illustrative.*

Have you noticed that people believe things supported by no evidence and choose to disregard theories supported by mountains of evidence, with the pitiful excuse, “It’s only a theory, your evidence doesn’t prove anything”? I am a skeptic, but that doesn’t make me a denialist. I do not reject theories which are solidly grounded in evidence. So today I present a post on why skepticism and denialism are actually opposites of each other, and an approach to looking at and understanding evidence. This approach was first gifted to mankind by Thomas Bayes and has usually been passed on to students as a textbook formula that everyone learns by heart to pass the class. I really hope I can say something here about why Bayes’ theorem is more than a formula to plug values into and get an answer, and why it is a new approach to thinking about how we think, or should be thinking.

As a skeptic and a lover of statistics, I must admit that the first time I realized the true beauty of Bayes’ theorem I was blown away by it. The formula at its simplest is, frankly, very simple and derives very quickly from well-known laws of conditional probability. I will not talk about its derivation because

1) It’s fairly simple, you probably already know it.

2) It’ll make my post too long

3) If you really like my post you will go Google it anyway.

That is the formula: P(A|B) = P(B|A) P(A) / P(B), where A and B are two events. P(A) and P(B) are the probabilities of occurrence of events A and B respectively. P(A|B) is the probability that event A occurred given that you already know B has occurred, and vice versa.

So why is this theorem so important to skeptics? To interpret it, let us think of somebody making a claim you are skeptical of. I am borrowing heavily from Rypdal’s paper here and talking about climate change.

Fine, you are skeptical about global climate change; well, you should be. I always say: question everything! But how long should you stay skeptical? Only until the evidence convinces you. If you continue being skeptical despite insurmountable evidence, you are, well, going wrong.

Taking climate change as the hypothesis and using the formula above, let A be the event that climate change is occurring. You are a skeptic, so you are allowed to assign the event a 50-50 prior probability before you have any evidence (less if you want to). So P(A) = 0.5.

Now you need evidence to back up or refute the claim. Of course, in most situations that are not purely mathematical, evidence only backs up a theory; it doesn’t prove or disprove it a hundred percent. So how do you analyze the evidence?

In the case of climate change, hurricanes are an important source of evidence. We know that if climate change is true, we would have more catastrophic hurricanes; however, that does not imply that a hurricane is proof of climate change. A certain number of hurricanes occur naturally.

Scientists can however build their toy models and look at historical data to arrive at the probability of occurrence of frequent hurricanes in the presence and absence of climate change.

Let B be the event that there occurs a large hurricane more than once per century.

Now I’ll make a little change to the Bayes formula above, expanding the denominator: P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|~A) P(~A)). This derives in a simple manner from the formula above, but I don’t want to go into the details.

So what do these equations tell us?

Well, the first equation tells us a few things. The term P(A) is the probability of climate change being true, which we assigned a value of 0.5 prior to seeing evidence, as we are agnostic. We call this the prior probability of A.

The term P(A|B) tells us how we should revise our probability of A being true after observing event B (frequent hurricanes).

The ratio P(B|A)/P(B) tells us how the evidence should change our mind about the prior probability of A. If the probability of frequent hurricanes given that climate change is true is greater than the overall probability of frequent hurricanes, the ratio is greater than one and the evidence increases our prior probability. If that is the case, we cannot be agnostic anymore.

The second equation is just a better way of looking at the denominator P(B): here we account for the fact that frequent hurricanes can occur even in the absence of climate change.

P(B|~A) is the probability that frequent hurricanes occur even though climate change is not true. Note that a high value for this probability works against updating our prior probability of climate change to a higher value. This term captures the honesty of science: we account for all possibilities.

Now assume that scientific theories, models and data tell us that P(B|A), the probability of frequent hurricanes given climate change, is 0.5,

and that P(B|~A), the probability that frequent hurricanes occur despite climate change not being true, is 0.1.

Now, plugging in the values, our posterior probability P(A|B), the new probability of climate change being true in the light of this evidence, is 0.83. We are not totally agnostic anymore.
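The arithmetic, for anyone who wants to check it in R:

```r
prior     <- 0.5   # P(A): agnostic prior for climate change
p_b_given <- 0.5   # P(B|A): frequent hurricanes if climate change is true
p_b_not   <- 0.1   # P(B|~A): frequent hurricanes without climate change

# Bayes' theorem with the expanded denominator
posterior <- p_b_given * prior /
             (p_b_given * prior + p_b_not * (1 - prior))
round(posterior, 2)   # 0.83
```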

This is how science and evidence-based reasoning convince us of theories. This is why skepticism is not denialism; they are opposites of each other. Skepticism means giving unlikely events low prior probabilities, not shutting our eyes to the posterior probabilities.

I hope I have made some sense here.

Picture Courtesy: free digital photos