In hvass01, Nicholas said:

After selecting 100 images at random for each iteration and iterating 10,000 times the error was reduced to 10% for misidentifying a given image. Could the error be brought down further if the image selection size was increased from 100 to 200 random images per iteration while keeping the number of iterations the same?

Perhaps. Easy to try. Please let me know! :-)

In hvass01, Wenjie said:

How does softmax() work?
What does "bias" do?

We discussed softmax in detail prior to watching the video. The biases give $n_c$ additional degrees of freedom for the mapping. Perhaps not helpful if we've already subtracted out the mean of each weight vector, but I'm not sure.
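As a reminder of what softmax and the biases do, here is a minimal sketch in plain Python (no TensorFlow). The 3-class, 2-feature problem and all of the numbers in it are made up for illustration; the point is only that softmax turns one score per class into probabilities, and that a bias shifts a class's score independently of the input.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(x, weights, biases):
    """One score per class: dot(weights[c], x) + biases[c]."""
    return [sum(w * xi for w, xi in zip(w_c, x)) + b_c
            for w_c, b_c in zip(weights, biases)]

# Hypothetical toy problem: 3 classes, 2 input features.
x = [1.0, 2.0]
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.1]]

no_bias = softmax(logits_for(x, weights, [0.0, 0.0, 0.0]))
with_bias = softmax(logits_for(x, weights, [0.0, 0.0, 2.0]))

print(no_bias)    # class 1 has the largest probability
print(with_bias)  # the bias on class 2 makes class 2 the winner
```

With the same weights and the same input, adding a bias to class 2 changes which class wins. That is the extra freedom the biases provide.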

In hvass01, Samuel said:

What is normally the purpose of the validation set compared to the training and testing sets?

See below.

In hvass01, Michael said:

What are the advantages of a Tensor Processing Unit? What is unique about it?

Wikipedia has some info: https://en.wikipedia.org/wiki/Tensor_processing_unit

In hvass01, Sanjeevani said:

In the video, the speaker uses the word "class" in reference to various bits and
types of data. What is the character class? Is it just a grouping of the same
character?

Yes, we are trying to separate the images into a discrete set of classes, in this case one corresponding to each numeral 0,1,2,...,9.

In hvass01, Katherine said:

What is cross-entropy? What are some other cost functions that are commonly used?

The cross-entropy of two probability vectors p and q is $-\sum_k p_k \log q_k$.
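As a small worked example, here is the cross-entropy computed in plain Python with the usual sign convention (a leading minus, so the result is non-negative). The vectors `p`, `good`, and `bad` are made-up numbers: `p` is a one-hot "true label" and the other two are hypothetical predictions.

```python
import math

def cross_entropy(p, q):
    """Return -sum_k p[k] * log(q[k]); terms with p[k] == 0 contribute 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.0, 1.0, 0.0]        # the image really is class 1
good = [0.05, 0.90, 0.05]  # prediction that puts most weight on class 1
bad = [0.60, 0.20, 0.20]   # prediction that puts most weight on class 0

print(cross_entropy(p, good))  # equals -log(0.9), a small cost
print(cross_entropy(p, bad))   # equals -log(0.2), a much larger cost
```

When p is one-hot, the sum reduces to minus the log of the probability assigned to the correct class, so confident correct predictions are cheap and confident wrong ones are expensive.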

In hvass01, Mia said:

Does creating the placeholder variables have the same function as creating empty arrays in NumPy?

Very loosely, yes. We are building a computational graph at this point in the code, and we need to create objects that will represent the inputs in that graph.
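To illustrate the difference from an empty array, here is a toy sketch of the "build the graph first, feed data later" idea in plain Python. The classes `Placeholder`, `Add`, and `Multiply` are hypothetical stand-ins written for this note; they are not TensorFlow's API, only an imitation of the general pattern.

```python
class Placeholder:
    """A named input slot; it holds no data until the graph is evaluated."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]

class Add:
    """Graph node representing a + b."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def evaluate(self, feed):
        return self.a.evaluate(feed) + self.b.evaluate(feed)

class Multiply:
    """Graph node representing a * b."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def evaluate(self, feed):
        return self.a.evaluate(feed) * self.b.evaluate(feed)

# Build the graph for y = w * x + b. Nothing is computed at this point.
x = Placeholder("x")
w = Placeholder("w")
b = Placeholder("b")
y = Add(Multiply(w, x), b)

# Only now do we supply values and run the computation.
result = y.evaluate({"x": 3.0, "w": 2.0, "b": 1.0})
print(result)  # 7.0
```

An empty array is storage waiting to be filled. A placeholder is a description of an input, and the same graph can be run many times with different values fed in.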

In hvass01, Megan said:

What are some other types of problems that can be solved using TensorFlow?

Much harder recognition problems that can be solved by "deep learning", i.e. neural networks with many layers.

In hvass01, Anthony said:

What is the advantage of minimizing error vs. maximizing accuracy?

They are effectively the same thing.

In hvass01, Prashant said:

How do we determine the batch size? Does it depend just on the training data size, or
on something else as well?

I don't think we can know the best value a priori. This is a parameter that people will experiment with to get good results with reasonable computational effort.

In hvass01, . said:

I didn't quite understand what the confusion matrix is about. Why do we have a confusion matrix, and what is
it used for?

It is a way of representing how well or badly our recognition function works. The i,j element of the confusion matrix is the number of images that were actually in class i that we classified as belonging to class j. A perfect recognizer will have a diagonal confusion matrix (no misclassifications).
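The definition can be sketched in a few lines of plain Python. The labels below are made up (7 images, 3 classes), and the function name `confusion_matrix` is just my choice for this sketch.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[i][j] counts images of true class i that were classified as j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for i, j in zip(true_labels, predicted_labels):
        matrix[i][j] += 1
    return matrix

# Hypothetical results for 7 images and 3 classes.
true_labels = [0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(true_labels, predicted_labels, 3)
for row in cm:
    print(row)

# The diagonal holds the correct classifications.
correct = sum(cm[k][k] for k in range(3))
accuracy = correct / len(true_labels)
print(correct, "of", len(true_labels), "correct")
```

Here the off-diagonal entries show exactly which confusions occurred: one class-0 image was called class 1, and one class-2 image was called class 0.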

In hvass01, Adhish said:

What could we expect the initial accuracy to be if we chose different initial weights (not blank)?

Well, when we used the averages, the recognition was decent even before any gradient minimization steps, I think maybe already in the 70% area.

In hvass01, Margaret said:

What is cross-entropy and why is it necessary in computing the error?

I feel like using the average to classify was simpler and worked better than this example. Is this an inaccurate statement? Would using the average and then finding the errors give better overall results?

Cross-entropy is a measure of the discrepancy between two vectors p and q whose components add to 1. It is $-\sum_k p_k \log q_k$. I think there are information-theoretic reasons to favor this, but I do not know that theory.

In hvass01, Hedy said:

Will there be a point where TensorFlow performs too many optimization iterations and the accuracy
goes down?

No. With a small enough step, the cost will not get worse, because the directional derivative along the negative gradient is at most 0. It can certainly find a flattish place, however, and fail to make further progress.
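Here is a minimal sketch of that behavior on a toy one-variable cost. The function x^4, the starting point, and the step size are all my own choices for illustration, not the tutorial's model. With this small fixed step, plain gradient descent never increases the cost, but it slows to a crawl as the gradient flattens out.

```python
def cost(x):
    """Toy cost function with a very flat minimum at x = 0."""
    return x ** 4

def gradient(x):
    """Derivative of the toy cost."""
    return 4 * x ** 3

x = 1.0
step = 0.1  # assumed small enough for this particular cost
costs = [cost(x)]
for _ in range(100):
    x -= step * gradient(x)  # move along the negative gradient
    costs.append(cost(x))

never_worse = all(later <= earlier
                  for earlier, later in zip(costs, costs[1:]))
early_gain = costs[0] - costs[10]    # improvement over the first 10 steps
late_gain = costs[90] - costs[100]   # improvement over the last 10 steps

print(never_worse)
print(early_gain, late_gain)
```

Almost all of the improvement happens in the first few steps. Note that the guarantee relies on the step being small: with a much larger step, the same loop can overshoot the minimum and the cost can go up.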

In hvass01, Sakar said:

Where are Tensor Processing units used? Can you please explain cross entropy?

See above. Also see above for cross entropy.

In hvass01, Jonathan said:

Is it better to start with higher optimization iterations or start small and run it multiple times?

I think it is better to start small, because small may be good enough.

In hvass01, Jonathan said:

It was stated that 'logits' is typical TensorFlow terminology. Where does this name
come from?

Statistics. According to Wikipedia, Joseph Berkson coined the term in 1944.

In hvass01, Xuli said:

Is there any way to get rid of over-optimization so that the weight pictures look more recognizable?

I found this part of the video dubious. Who are we to say that a weird looking weight image is "wrong"? I don't think these strange pictures necessarily indicate "over-fitting".

In hvass01, Anna said:

Why does the prediction get worse as you do more iterations?

It doesn't. It gets better, though ever more slowly. Magnus just didn't like the later weight pictures, but I think that is unwarranted. If they do a better job of recognizing, who are we to say they're not good?

In hvass01, Aishani said:

Why did the weights deviate so much from 10 iterations seismic to 1000 iterations seismic?

In the mountains, you traveled from near the top of a mountain down to a lake. Similar thing!

In hvass01, Daniel said:

What would the results be if the tensorflow tutorial started with our average weights (how many iterations to get to 90%, 99% accuracy)? And how would our model perform with the data set used in the tutorial (with more, difficult images to classify)?

I do not know if the ultimate result would be better. I suspect not, because the algorithm came to a set of weights rather similar to our averages after a few tens of iterations.

In hvass01, Hui said:

If we want to increase the accuracy, maybe the linear model is not enough? Then in this case,
do we need to re-define the weights?

We will have to do something more radical than that, by changing the model completely. We will do this on Tuesday. (See the Deep Learning tutorial.)

In hvass01, Krithika said:

Do we test this on the validation dataset?

No. From a Google search: "The question is answered in the book The Elements of Statistical Learning, page 222. The validation set is used for model selection, the test set for estimating the prediction error of the final model (the model chosen by the selection process)."
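A toy sketch of how the three sets are used, in plain Python. Everything here is hypothetical: the data, the 60/20/20 split, and the three candidate "models" (fixed threshold rules). In a real problem each candidate would first be fitted on the training set; the candidates are fixed here only to keep the sketch short.

```python
def accuracy(threshold, examples):
    """Fraction of (x, label) pairs that the rule 'x >= threshold' gets right."""
    hits = sum((x >= threshold) == label for x, label in examples)
    return hits / len(examples)

# Hypothetical data: the true rule is label = (x >= 50).
examples = [(x, x >= 50) for x in range(100)]

# Deterministic 60/20/20 split by index.
train = [e for i, e in enumerate(examples) if i % 5 < 3]
validation = [e for i, e in enumerate(examples) if i % 5 == 3]
test = [e for i, e in enumerate(examples) if i % 5 == 4]

# Model selection: compare the candidates on the validation set only.
candidates = [30, 50, 70]
best = max(candidates, key=lambda t: accuracy(t, validation))

# Final report: evaluate the one selected model on the untouched test set.
test_accuracy = accuracy(best, test)
print(best, test_accuracy)
```

The test set plays no part in choosing among the candidates, which is what makes its error estimate for the selected model trustworthy.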

In hvass01, Edward said:

Why does the diagonal in the confusion matrix have much darker spaces?

Ideally all the weight in the confusion matrix would be on the diagonal, which would mean every image is classified correctly.

In hvass01, Wenjie said:

What do "biases" do?

See above.

In hvass01, Maggie said:

Why is it necessary to "build the computational graph" (create all these empty vectors & matrices)
before optimizing? Can't we do that in the optimization?

It is necessary to specify the structure of the computation. Otherwise there is nothing to optimize.

In hvass01, Nicholas said:

Why does the performance not increase substantially after more iterations? There is not a huge difference in performance between 1000, 10,000 and 100,000 iterations. With 100,000 iterations the accuracy levels out at 92%.

Either we are stuck in a local minimum, or this is the best that a simple linear model can do. I suspect the latter. We will develop a better model on Tuesday.