Describe back-propagation.

Describe back-propagation. We can calculate the error of the network only at the output units. The hidden units represent latent variables; we cannot observe their true values in the training data, and thus we have nothing to compute their error against. In order to update their weights, we must propagate the network's errors backwards through its layers. We begin with an output unit: its error is equal to the difference between the true and predicted outputs, multiplied by the derivative of the unit's activation function. We continue this process backwards all the way to the input layer, and then forward propagate through the network using the updated weights.
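
A minimal sketch of one back-propagation step for a tiny 2-2-1 network with sigmoid activations may make this concrete; the layer sizes, toy data, and learning rate are illustrative assumptions, not part of the original description.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0.0, 1.0]])          # one training instance
y = np.array([[1.0]])               # its true output
W1 = rng.normal(size=(2, 2))        # input -> hidden weights
W2 = rng.normal(size=(2, 1))        # hidden -> output weights
alpha = 0.5                         # learning rate

# Forward pass
h = sigmoid(X @ W1)                 # hidden activations
o = sigmoid(h @ W2)                 # predicted output

# Backward pass: the output error is (true - predicted) times the
# derivative of the sigmoid activation, o * (1 - o)
delta_o = (y - o) * o * (1 - o)
# Hidden errors are the output error propagated back through W2
delta_h = (delta_o @ W2.T) * h * (1 - h)

# Weight updates
W2 += alpha * h.T @ delta_o
W1 += alpha * X.T @ delta_h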

What are the two main types of artificial neural networks?

What are the two main types of artificial neural networks? 1) Feedforward neural networks are the most common type of neural net and are defined by their directed acyclic graphs; signals travel in only one direction, towards the output layer. 2) Feedback neural networks, or recurrent neural networks, do contain cycles. The feedback cycles can represent an internal state for the network, which can cause the network's behavior to change over time based on its input.

Describe the Multilayer perceptron (MLP)

Describe the Multilayer perceptron (MLP) The multilayer perceptron (MLP) is one of the most commonly used artificial neural networks. The name is a slight misnomer; a multilayer perceptron is not a single perceptron with multiple layers, but rather multiple layers of artificial neurons that can be perceptrons. The layers of the MLP form a directed acyclic graph. Generally, each layer is fully connected to the subsequent layer; the output of each artificial neuron in a layer is an input to every artificial neuron in the next layer towards the output.
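
A hedged sketch using scikit-learn's MLPClassifier; the XOR-style toy data, hidden layer size, and solver are illustrative assumptions (a different random_state may be needed for convergence on such a tiny data set).

from sklearn.neural_network import MLPClassifier

# One fully connected hidden layer of ten units
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
clf = MLPClassifier(hidden_layer_sizes=(10,), solver='lbfgs', random_state=0)
clf.fit(X, y)
print(clf.predict(X))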

What are the three components of an Artificial Neural Network?

What are the three components of an Artificial Neural Network? 1) The model's architecture, or topology, which describes the layers of neurons and the structure of the connections between them. 2) The activation function used by the artificial neurons. 3) The learning algorithm that finds the optimal values of the weights.

What is kernelization?

What is kernelization? Answer: Kernelization projects linearly inseparable data into a higher-dimensional space in which it is linearly separable.
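
A hedged sketch using scikit-learn's SVC: the RBF kernel implicitly maps the data to a higher-dimensional space in which the classes become separable. The toy data (an inner cluster versus surrounding points, which no straight line can separate) are an assumption.

from sklearn.svm import SVC

X = [[0, 0], [0.1, 0.1], [-0.1, 0.1],
     [1, 1], [-1, -1], [1, -1], [-1, 1], [0, 1.4]]
y = [0, 0, 0, 1, 1, 1, 1, 1]
clf = SVC(kernel='rbf', gamma='scale')   # kernelized support vector classifier
clf.fit(X, y)
print(clf.predict([[0.05, -0.05], [1.2, 0]]))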

What is an epoch?

What is an epoch? Answer: Each pass through the training instances is called an epoch.

What is the perceptron's update rule?

What is the perceptron's update rule? Answer: w_i(t + 1) = w_i(t) + alpha * (d_j - y_j(t)) * x_j,i, for all features 0 <= i <= n, where d_j is the true class of instance j, y_j(t) is the predicted class, and alpha is the learning rate.
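
A direct sketch of the rule for a single training instance; the data, learning rate, and threshold activation are illustrative.

import numpy as np

def predict(w, x):
    return 1 if np.dot(w, x) > 0 else 0

alpha = 0.1
w = np.zeros(3)                      # weights, including the bias weight w_0
x_j = np.array([1.0, 2.0, -1.0])     # instance j, with x_0 = 1 for the bias
d_j = 1                              # true label for instance j

y_j = predict(w, x_j)                # current prediction y_j(t)
w = w + alpha * (d_j - y_j) * x_j    # update every weight at once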

What is an error-driven learning algorithm?

What is an error-driven learning algorithm? An error-driven learning algorithm updates the model only when it makes a mistake: if the prediction is correct, the algorithm continues to the next instance; if the prediction is incorrect, the algorithm updates the weights.

Describe the steps of PCA

Describe the steps of PCA 1) The first step of PCA is to subtract the mean of each explanatory variable from each observation. 2) Next, we must calculate the principal components of the data. Recall that the principal components are the eigenvectors of the data's covariance matrix ordered by their eigenvalues. The principal components can be found using two different techniques: eigendecomposition of the covariance matrix or singular value decomposition (SVD). 3) Next, we will project the data onto the principal components. The eigenvector with the greatest eigenvalue is the first principal component. We will build a transformation matrix in which each column of the matrix is the eigenvector for a principal component. 4) Finally, we will find the dot product of the data matrix and transformation matrix.
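
A sketch of these four steps in numpy using eigendecomposition (one of the two techniques; the other is SVD); the toy data and the choice of k = 1 are illustrative.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# 1) Subtract the mean of each explanatory variable
X_centered = X - X.mean(axis=0)

# 2) Eigendecompose the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3) Order the eigenvectors by eigenvalue and keep the top k = 1 as the
#    transformation matrix (one column per principal component)
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:1]]

# 4) Project: the dot product of the data matrix and the transformation matrix
X_reduced = X_centered @ W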

What are the principal components of PCA?

What are the principal components of PCA? The principal components of a matrix are the eigenvectors of its covariance matrix, ordered by their corresponding eigenvalues. The eigenvector with the greatest eigenvalue is the first principal component; the second principal component is the eigenvector with the second greatest eigenvalue, and so on.

Describe which descriptive characteristic of an eigenvector changes when it is transformed by a matrix A.

Describe which descriptive characteristic of an eigenvector changes when it is transformed by a matrix A. The direction of an eigenvector remains the same after it has been transformed by A; only its magnitude changes, as indicated by the eigenvalue. That is, multiplying a matrix by one of its eigenvectors is equal to scaling the eigenvector.
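
A quick numerical check of this property; the matrix A here is an arbitrary example.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]               # an eigenvector of A
lam = eigenvalues[0]                 # its eigenvalue

# A @ v points in the same direction as v; only the magnitude is scaled by lam
print(A @ v)
print(lam * v)                       # the two printed vectors match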

What describes a vector?

What describes a vector? A vector is described by a direction and magnitude, or length.

What is covariance?

What is covariance? Covariance is a measure of how much two variables change together; it is a measure of the strength of the correlation between two sets of variables. If the covariance of two variables is zero, the variables are uncorrelated. Note that uncorrelated variables are not necessarily independent, as correlation is only a measure of linear dependence.

Describe principal component analysis.

Describe principal component analysis. PCA reduces the dimensions of a data set by projecting the data onto a lower-dimensional subspace. In general, an n-dimensional dataset can be reduced by projecting the dataset onto a k-dimensional subspace, where k is less than n. More formally, PCA can be used to find a set of vectors that span a subspace, which minimizes the sum of the squared errors of the projected data. This projection will retain the greatest proportion of the original data set's variance. Each subsequent principal component preserves the maximum amount of the remaining variance; the only constraint is that each must be orthogonal to the other principal components. PCA is most useful when the variance in a data set is distributed unevenly across the dimensions.

Motivate the need for dimensionality reduction.

Motivate the need for dimensionality reduction. Dimensionality reduction is motivated by several problems. First, it can be used to mitigate problems caused by the curse of dimensionality. Second, dimensionality reduction can be used to compress data while minimizing the amount of information that is lost. Third, understanding the structure of data with hundreds of dimensions can be difficult; data with only two or three dimensions can be visualized easily.

Describe a method to evaluate the clusters.

Describe a method to evaluate the clusters. The silhouette coefficient is a measure of the compactness and separation of the clusters. It increases as the quality of the clusters increases; it is large for compact clusters that are far from each other and small for large, overlapping clusters. The silhouette coefficient is calculated per instance; for a set of instances, it is calculated as the mean of the individual samples' scores. The silhouette coefficient for an instance is calculated with the following equation: s = (b - a) / max(a, b), where a is the mean distance between the instance and the other instances in its cluster, and b is the mean distance between the instance and the instances in the next closest cluster.
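
As a hedged sketch, scikit-learn's silhouette_score returns the mean of these per-instance coefficients; the toy data and the K-Means labels are illustrative assumptions.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = [[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # close to 1 for compact, well-separated clusters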

Describe K-Means Clustering.

Describe K-Means Clustering. The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The cost function sums the distortions of the clusters. Each cluster's distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters and large for clusters that contain scattered instances.
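
A minimal sketch of the two alternating K-Means steps (assign instances to the closest centroid, then move each centroid to the mean of its constituents); K = 2, the data, and the naive initialization are illustrative assumptions.

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
centroids = X[:2].copy()             # naive initialization

for _ in range(10):
    # Assignment step: index of the closest centroid for each instance
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean position of its points
    for k in range(2):
        if np.any(labels == k):
            centroids[k] = X[labels == k].mean(axis=0)

# Distortion: sum of squared distances between instances and their centroids
distortion = ((X - centroids[labels]) ** 2).sum()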

What's the difference between an eager learner and a lazy learner?

What's the difference between an eager learner and a lazy learner? Eager learners do most of their work up front: they take a while to build a model during training, but then classify new instances quickly. Lazy learners defer the majority of the work until classification time.

What are random forests?

What are random forests? A random forest is a collection of decision trees that have been trained on randomly selected subsets of the training instances and explanatory variables. Random forests usually make predictions by returning the mode or mean of the predictions of their constituent trees; scikit-learn's implementations return the mean of the trees' predictions. Random forests are less prone to overfitting than decision trees because no single tree can learn from all of the instances and explanatory variables; no single tree can memorize all of the noise in the representation.
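
A hedged sketch using scikit-learn's RandomForestClassifier; the synthetic data set and forest size are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# 100 trees, each trained on a random subset of instances and features
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))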

What are ensemble learning methods?

What are ensemble learning methods? Ensemble methods combine a set of models to produce an estimator that has better predictive performance than any of its individual components.

What is Gini Impurity?

What is Gini Impurity? Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability f_i of each item being chosen times the probability (1 - f_i) of a mistake in categorizing that item: I_G = SUM( f_i * (1 - f_i) ) = 1 - SUM( f_i^2 ). It reaches its minimum (zero) when all cases in the node fall into a single target category.
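
A small sketch of this computation; label_counts is a hypothetical class distribution for one node.

def gini_impurity(label_counts):
    total = sum(label_counts)
    fractions = [count / total for count in label_counts]
    # Sum of f_i * (1 - f_i), equivalently 1 - sum of f_i squared
    return sum(f * (1 - f) for f in fractions)

print(gini_impurity([10, 10]))   # maximally mixed two-class node: 0.5
print(gini_impurity([20, 0]))    # pure node: the minimum, zero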

What is information gain?

What is information gain? In general terms, the expected information gain is the change in information entropy H from a prior state to a state that takes some information as given: IG(T,a) = H(T) - H(T|a), where H(T) = -SUM( p_i * log2(p_i) ) is the entropy of T and H(T|a) is the conditional entropy of T given the value of attribute a.
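
A sketch of both quantities for a hypothetical binary split; the counts are invented for illustration.

import math

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

parent = ['yes'] * 9 + ['no'] * 5
left = ['yes'] * 6 + ['no'] * 1          # instances where attribute a is true
right = ['yes'] * 3 + ['no'] * 4         # instances where attribute a is false

# H(T|a): the entropy of the children, weighted by their proportions
weighted_child_entropy = (len(left) / len(parent)) * entropy(left) \
    + (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - weighted_child_entropy   # IG(T, a)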

What is Jaccard similarity?

What is Jaccard similarity? Jaccard similarity, or the Jaccard index, is the size of the intersection of the predicted labels and the true labels divided by the size of the union of the predicted and true labels. It ranges from zero to one, and one is the perfect score. Jaccard similarity is calculated by the following equation: J(Predicted,True) = |Predicted intersection True| / |Predicted union True|

What is one-vs.-all, or one-vs.-the-rest, in multi-class classification?

What is one-vs.-all, or one-vs.-the-rest, in multi-class classification? One-vs.-all classification uses one binary classifier for each of the possible classes. The class that is predicted with the greatest confidence is assigned to the instance.

What is AUC?

What is AUC? AUC is the area under the ROC curve; it reduces the ROC curve to a single value, which represents the expected performance of the classifier.

What is an ROC Curve?

What is an ROC Curve? ROC stands for Receiver Operating Characteristic. ROC curves plot the classifier's recall against its fall-out. Unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions; unlike precision and recall, the ROC curve illustrates the classifier's performance for all values of the discrimination threshold.

What is Fall Out?

What is Fall Out? Also known as the false positive rate, is the number of false positives divided by the total number of negatives. It is calculated using the following formula: F = FP / (TN + FP)

What is the F1 Score?

What is the F1 Score? The F1 measure is the harmonic mean of the precision and recall scores. Also called the f-measure or the f-score, the F1 score is calculated using the following formula: F1 = 2 * (P * R) / (P + R). The F1 measure penalizes classifiers with imbalanced precision and recall scores, like the trivial classifier that always predicts the positive class. A model with perfect precision and recall scores will achieve an F1 score of one.

What is Logistic Regression and describe how to think of its output.

What is Logistic Regression and describe how to think of its output. The response variable describes the probability that the outcome is the positive case. If the response variable is equal to or exceeds a discrimination threshold, the positive class is predicted; otherwise, the negative class is predicted. The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. Given by the following equation, the logistic function always returns a value between zero and one: F(t) = 1 / (1 + e^-t)
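
A small sketch of the logistic function and thresholding; the value of t and the threshold of 0.5 are illustrative (0.5 is the conventional default discrimination threshold).

import math

def logistic(t):
    # The logistic function: always returns a value between zero and one
    return 1.0 / (1.0 + math.exp(-t))

t = 2.0                                       # a linear combination of the explanatory variables
probability = logistic(t)                     # modeled probability of the positive case
prediction = 1 if probability >= 0.5 else 0   # apply the discrimination threshold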

What is the Normal Distribution?

What is the Normal Distribution? The normal distribution, also known as the Gaussian distribution or bell curve, is a function that describes the probability that an observation will have a value between any two real numbers. Normally distributed data is symmetrical. That is, half of the values are greater than the mean and the other half of the values are less than the mean. The mean, median, and mode of normally distributed data are also equal.

How to get standardized data?

How to get standardized data? The value of an explanatory variable can be standardized by subtracting the variable's mean and dividing the difference by the variable's standard deviation.
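
A one-variable sketch of this transformation; the values are illustrative, and scikit-learn's StandardScaler applies the same transformation per column.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
standardized = (x - x.mean()) / x.std()
print(standardized.mean(), standardized.std())   # zero mean, unit standard deviation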

What is TF-IDF?

What is TF-IDF? Answer: Term Frequency - Inverse Document Frequency. It is the product (TF) * (IDF).

What is Term Frequency?

What is Term Frequency? The number of times a term occurs in a document is called its term frequency. It has various forms, but the simplest is just the raw count of the term's occurrences in the document.

What is Inverse Document Frequency?

What is Inverse Document Frequency? The inverse document frequency (IDF) is a measure of how rare or common a word is in a corpus. The inverse document frequency is given by the following equation: idf(t) = log( N / df(t) ), where N is the total number of documents in the corpus and df(t) is the number of documents that contain the term t. (Some implementations add one to the denominator to avoid division by zero.)

Motivate the need for Inverse Document Frequency.

Motivate the need for Inverse Document Frequency. The feature vectors contain large weights for terms that occur frequently in a document, even if those terms occur frequently in most documents in the corpus. These terms do not help to represent the meaning of a particular document relative to the rest of the corpus. These words can be thought of as corpus-specific stop words and may not be useful to calculate the similarity of documents.
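
As a hedged sketch, scikit-learn's TfidfVectorizer implements this weighting; the toy corpus is an assumption. Because "the" appears in every document, its inverse document frequency down-weights it.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the dog ate the sandwich',
          'the wizard transfigured a sandwich',
          'the dog sat']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())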

What is a problem with high-dimensional data?

What is a problem with high-dimensional data? The curse of dimensionality, or the Hughes effect. As the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances with each combination of the features' values. If there are insufficient training instances for a feature, the algorithm may overfit noise in the training data and fail to generalize.

Give an example of one-hot encoding.

Give an example of one-hot encoding. Assume that our model has a city explanatory variable that can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents this explanatory variable using one binary feature for each of the three possible cities. With the columns ordered alphabetically (Chapel Hill, New York, San Francisco), the three instances are encoded as: [[0. 1. 0.] [0. 0. 1.] [1. 0. 0.]]
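
The matrix above can be reproduced with scikit-learn's DictVectorizer, which one-hot encodes categorical features; this sketch assumes the same three instances in the same order.

from sklearn.feature_extraction import DictVectorizer

onehot_encoder = DictVectorizer()
instances = [{'city': 'New York'},
             {'city': 'San Francisco'},
             {'city': 'Chapel Hill'}]
# One binary column per city, ordered alphabetically
print(onehot_encoder.fit_transform(instances).toarray())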

What is one-of-k or one-hot encoding?

What is one-of-k or one-hot encoding? The explanatory variable is encoded using one binary feature for each of the variable's possible values. It may seem intuitive to represent the values of a categorical explanatory variable with a single integer feature, but this would encode artificial information: an order for the values of the variable that does not exist in the real world.

What is Stochastic Gradient Descent?

What is Stochastic Gradient Descent? Stochastic Gradient Descent (SGD), in contrast to batch gradient descent, updates the parameters using only a single training instance in each iteration. The training instance is usually selected randomly. Stochastic gradient descent is often preferred to optimize cost functions when there are hundreds of thousands of training instances or more, as it will converge more quickly than batch gradient descent. Batch gradient descent is a deterministic algorithm, and will produce the same parameter values given the same training set. As a stochastic algorithm, SGD can produce different parameter estimates each time it is run. SGD may not minimize the cost function as well as gradient descent because it uses only single training instances to update the weights. Its approximation is often close enough, particularly for convex cost functions such as the residual sum of squares.
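
A minimal sketch of SGD for simple linear regression on the residual sum of squares; the synthetic data, learning rate, and epoch count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 2.0 + rng.normal(scale=0.5, size=100)   # noisy y = 3x + 2

w, b, alpha = 0.0, 0.0, 0.005
for epoch in range(50):
    for i in rng.permutation(len(X)):     # one randomly selected instance per update
        error = (w * X[i] + b) - y[i]
        w -= alpha * error * X[i]         # gradient of the squared error for this instance
        b -= alpha * error

print(w, b)                               # approaches 3 and 2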

Formally describe gradient descent.

Formally describe gradient descent. Gradient descent is an optimization algorithm that can be used to estimate the local minimum of a function. Gradient descent is only guaranteed to find a local minimum; it will find a valley, but will not necessarily find the lowest valley. Fortunately, the residual sum of squares cost function used in linear regression is convex, so any local minimum it finds is the global minimum.
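
For contrast with SGD above, a sketch of batch gradient descent on the same convex cost; each iteration computes the gradient over the entire training set, so the result is deterministic. The data and learning rate are illustrative.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])           # y = 2x + 1

w, b, alpha = 0.0, 0.0, 0.05
for _ in range(1000):
    errors = (w * X + b) - y
    w -= alpha * (errors * X).mean()         # partial derivative with respect to w
    b -= alpha * errors.mean()               # partial derivative with respect to b

print(w, b)                                  # approaches 2 and 1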

Compare LASSO and Ridge Regression.

Compare LASSO and Ridge Regression. The LASSO produces sparse parameters; most of the coefficients will become zero, and the model will depend on a small subset of the features. In contrast, ridge regression produces models in which most parameters are small but nonzero.

What is LASSO?

What is LASSO? Answer: Least Absolute Shrinkage and Selection Operator(LASSO). LASSO penalizes the coefficients by adding their L1 norm to the cost function.

What is Ridge Regression?

What is Ridge Regression? Answer: Ridge regression penalizes model parameters that become too large. Ridge regression modifies the residual sum of the squares cost function by adding the L2 norm (squared) of the coefficients.
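
A hedged sketch contrasting the two penalties: on a synthetic data set where only 5 of 20 features are informative (an assumption for illustration), the LASSO zeroes most coefficients while ridge keeps them small but nonzero.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print((lasso.coef_ == 0).sum())   # LASSO drives most coefficients to exactly zero
print((ridge.coef_ == 0).sum())   # ridge keeps them small but nonzero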

What is the formula for R squared?

What is the formula for R squared? First, we must measure the total sum of the squares, where y_i is the observed value of the response variable for the ith test instance and y_bar is the mean of the observed values of the response variable: SS_tot = SUM( (y_i - y_bar)^2 ). Next, we must find the residual sum of the squares. Recall that this is also our cost function: SS_res = SUM( (y_i - f(x_i))^2 ). Finally, we can find R-squared using the following formula: R^2 = 1 - (SS_res / SS_tot)
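
A direct translation of these formulas into Python; the observed and predicted values are illustrative.

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # observed values y_i
y_pred = np.array([2.8, 5.1, 7.3, 8.7])   # model predictions f(x_i)

ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of the squares
ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of the squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)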

What is Regularization?

What is Regularization? Regularization is a collection of techniques that can be used to prevent over-fitting. Regularization adds information, often in the form of a penalty against complexity, to a problem.

What is the R squared measure?

What is the R squared measure? R-squared measures how well the observed values of the response variables are predicted by the model. More concretely, r-squared is the proportion of the variance in the response variable that is explained by the model. An r-squared score of one indicates that the response variable can be predicted without any error using the model. An r-squared score of one half indicates that half of the variance in the response variable can be predicted using the model.

Describe precision vs recall in diagnosing malignant tumors.

Describe precision vs recall in diagnosing malignant tumors. Answer: Precision measures the fraction of tumors that were predicted to be malignant that are actually malignant. Recall measures the fraction of truly malignant tumors that were detected.

What is Recall?

What is Recall? Recall is the fraction of malignant tumors that the system identified. Recall is calculated with the following formula: R = ( TP / (TP + FN) )

What is Precision?

What is Precision? Precision is the fraction of the tumors that were predicted to be malignant that are actually malignant. Precision is calculated with the following formula: P = ( TP / (TP + FP) )
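
A small sketch computing precision, recall, fall-out, and F1 from the counts of a hypothetical confusion matrix, tying the last few formulas together.

TP, FP, TN, FN = 40, 10, 45, 5   # hypothetical confusion-matrix counts

recall = TP / (TP + FN)          # R = TP / (TP + FN)
precision = TP / (TP + FP)       # P = TP / (TP + FP)
fallout = FP / (TN + FP)         # F = FP / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, fallout, f1)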