see: 2020 Full-time Interview CCJ’s Preparation (1): Commonly Asked C++ Interview Questions, at this link.
see: 2020 Full-time Interview CCJ’s Preparation (2): Commonly Asked C++ Interview Questions in a Table, at this link.
In supervised machine learning algorithms, we have to provide labelled data, for example for the prediction of stock market prices or the classification of emails into spam and non-spam; in unsupervised learning we need no labels, for example when clustering customers into segments.
KNN is a supervised machine learning algorithm: we need to provide labelled data to the model, and it then classifies a new point based on the labels of its nearest neighbours (by distance).
K-Means clustering, on the other hand, is an unsupervised machine learning algorithm: we provide the model with unlabelled data, and it groups points into clusters based on their distance to the cluster means (centroids).
Classification is used to produce discrete results; it classifies data into specific categories, for example classifying e-mails into spam and non-spam.
Regression analysis is used when we are dealing with continuous targets, for example predicting the stock price at a certain point in time.
Keep the design of the model simple. Try to reduce the noise in the model by considering fewer variables and parameters.
Cross-validation techniques such as K-folds cross validation help us keep overfitting under control.
Regularization techniques such as LASSO (L1) and Ridge(L2) help in avoiding overfitting by penalizing certain parameters if they are likely to cause overfitting.
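As a minimal sketch of these two regularizers in practice (assuming scikit-learn is available; the data below is synthetic and alpha = 0.1 is an arbitrary choice):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# synthetic data: only the first 2 of 10 features actually matter
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives useless weights to exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)   # L2: shrinks all weights towards 0
print(lasso.coef_)  # sparse coefficients
print(ridge.coef_)  # small but non-zero coefficients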
A Naive Bayes classifier converges much more quickly than discriminative models like logistic regression. As a result, we need less training data in the case of a Naive Bayes classifier.
In ensemble learning, many base models such as classifiers and regressors are generated and combined so that together they give better results. It works best when the component classifiers are accurate and independent. There are sequential as well as parallel ensemble methods.
Ensemble learning helps improve machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model. Basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Why do they work?
Ensemble Methods: predict the class label for unseen data by aggregating a set of predictions (classifiers learned from the training data).
Types of Ensemble Methods: bagging, random forests, and boosting.
Bagging works because it reduces variance by voting/averaging
In some pathological hypothetical situations the overall error might increase
Usually, the more classifiers the better
Problem: we only have one dataset.
Solution: generate new datasets of the same size by bootstrapping, i.e., sampling with replacement.
Can help a lot if data is noisy;
Bagging is performed in parallel.
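A minimal numpy sketch of bootstrapping, the resampling step behind bagging (toy data; bootstrap_sample is our own helper name):

import numpy as np

def bootstrap_sample(X, y, rng):
    # draw n indices with replacement -> a new dataset of the same size
    n = len(X)
    idx = rng.randint(0, n, size=n)
    return X[idx], y[idx]

rng = np.random.RandomState(0)
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)
Xb, yb = bootstrap_sample(X, y, rng)
# train one base classifier per bootstrap sample, then vote/average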
Random Forest is an extension of bagging.
Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split.
During classification, each tree votes and the most popular class is returned.
A forest is an ensemble of trees. The trees are all slightly different from one another.
Important Parameters:
Input data point: the feature vector v to be classified;
Node objective function (train): The “energy” to be minimized when training the j-th split node, e.g., information gain
Stopping criteria (training): when to stop growing a tree during training, e.g., max tree depth = D;
Tree depth D: (a) a big D leads to overfitting (too much model capacity, the model overfits the training data); (b) a small D leads to underfitting (too little model capacity);
Forest size: T, Total number of trees in the forest;
How to achieve randomness?
1) Bagging (randomizing the training set);
2) Randomized node optimization (RNO).
Training and Information Gain
Ensemble model: the forest output averages the posteriors of the individual trees, p(c|v) = (1/T) Σ_t p_t(c|v).
Advantages of Random Forests:
Boosting: from weak to strong;
Learning from Weighted Data
In AdaBoost, the labels are changed to {1, -1} instead of the {0, 1} seen before, and at each iteration t we weight each training example by how incorrectly it was classified: w_i ← w_i · exp(−α_t · y_i · h_t(x_i)), where h_t is the weak learner at iteration t and α_t its weight.
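A minimal numpy sketch of one AdaBoost-style reweighting step, under the {1, -1} label convention above (the labels and predictions below are toy values):

import numpy as np

y = np.array([1, -1, 1, 1, -1])        # true labels in {1, -1}
pred = np.array([1, -1, -1, 1, -1])    # weak learner predictions
w = np.ones(5) / 5                     # initial uniform weights

err = np.sum(w * (pred != y))          # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
w = w * np.exp(-alpha * y * pred)      # upweight misclassified examples
w = w / w.sum()                        # renormalize to a distribution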
Stacking is a way to ensemble multiple classification or regression models. There are many ways to ensemble models; the widely known ones are Bagging and Boosting. Bagging averages multiple similar high-variance models to decrease variance. Boosting builds multiple incremental models to decrease bias, while keeping variance small.
Stacking (sometimes called Stacked Generalization) is a different paradigm. The point of stacking is to explore a space of different models for the same problem. The idea is that you can attack a learning problem with different types of models, each capable of learning some part of the problem but not the whole space of it. So you build multiple different learners and use them to produce intermediate predictions, one prediction per learned model. Then you add a new model which learns the same target from those intermediate predictions.
This final model is said to be stacked on the top of the others, hence the name. Thus, you might improve your overall performance, and often you end up with a model which is better than any individual intermediate model. Notice however, that it does not give you any guarantee, as is often the case with any machine learning technique.
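A minimal stacking sketch using scikit-learn's StackingClassifier (assuming sklearn is available; the choice of base learners and the logistic-regression meta-learner here is arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('svm', SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression())  # learns from intermediate predictions
stack.fit(X, y)
print(stack.score(X, y))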
The difficulty of searching through a solution space becomes much harder as you have more features (dimensions).
Consider the analogy of looking for a penny in a line vs. a field vs. a building. The more dimensions you have, the higher volume of data you’ll need.
Dimension reduction is the process of reducing the size of the feature matrix (M x D, where each row is a data sample). We try to reduce the number of columns (i.e., D) so that we get a better feature set, either by combining columns or by removing extra variables.
PCA is a method for transforming features in a dataset by combining them into uncorrelated linear combinations. These new features, or principal components, sequentially maximize the variance represented (i.e. the first principal component has the most variance, the second principal component has the second most, and so on). As a result, PCA is useful for dimensionality reduction because you can set an arbitrary variance cutoff.
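A minimal PCA sketch with scikit-learn (assuming sklearn; the 95% cutoff below is just one example of such a variance cutoff):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # (150, 4) feature matrix
pca = PCA(n_components=0.95)          # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)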
When the model's predicted values are, on average, very close to the actual values, the model has low bias. For a low-bias, high-variance model, we can use bagging algorithms like a random forest regressor, since bagging reduces variance.
Random forest uses bagging techniques whereas GBM (Gradient Boosting Machine) uses boosting techniques.
Random forests mainly try to reduce variance and GBM reduces both bias and variance of a model.
Predictive models have a tradeoff between bias (how well the model fits the data) and variance (how much the model changes based on changes in the inputs).
Simpler models are stable (low variance) but they don’t get close to the truth (high bias).
More complex models are more prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias).
The best model for a given problem usually lies somewhere in the middle.
Bagging and Boosting are two types of ensemble learning. Both decrease the variance of a single estimate, as they combine several estimates from different models, so the result may be a model with higher stability.
If the problem with the single model is over-fitting, then Bagging is the best option. If the problem is that the single model gets very low performance, Boosting could generate a combined model with lower error, as it optimizes the advantages and reduces the pitfalls of the single model.
Similarities Between Bagging and Boosting –
Both are ensemble methods to get N learners from 1 learner.
Both generate several training data sets by random sampling.
Both make the final decision by averaging the N learners (or taking the majority of them, i.e., majority voting).
Both are good at reducing variance and provide higher stability.
If training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform better because they are less likely to be overfit.
If training set is large, low bias / high variance models (e.g. Logistic Regression) tend to perform better because they can reflect more complex relationships.
see: this video at https://www.coursera.org/lecture/deep-neural-network/why-regularization-reduces-overfitting-T6OJj
see: https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
Let's define a model to see how L1 and L2 work. For simplicity, we define a simple linear regression model ŷ with one independent variable: ŷ = wx + b. Here I have used the deep-learning conventions w ('weight') and b ('bias').
To demonstrate the effect of L1 and L2 regularisation, let’s fit our linear regression model using 3 different loss functions/objectives:
L: Loss function with no regularisation
L1: Loss function with L1 regularisation
L2: Loss function with L2 regularisation
Our objective is to minimize these different losses.
We define the loss function L as the squared error, where the error is the difference between y (the true value) and ŷ (the predicted value): L = Σᵢ (yᵢ − ŷᵢ)².
Let's assume our model will be overfitted using this loss function.
Based on the above loss function, adding an L1 regularisation term to it looks like this: L1 = Σᵢ (yᵢ − ŷᵢ)² + λ|w|, where the regularisation parameter λ > 0 is manually tuned. Let's call this loss function L1. Note that |w| is differentiable everywhere except when w = 0. We will need this later.
Similarly, adding an L2 regularisation term to L looks like this: L2 = Σᵢ (yᵢ − ŷᵢ)² + λw², where again λ > 0.
Now, let's solve the linear regression model using gradient descent optimization based on the 3 loss functions defined above. Recall that updating the parameter w in gradient descent is as follows: w ← w − η ∂L/∂w, where η is the learning rate. Let's substitute the last term in the above equation with the gradient of L, L1 and L2 w.r.t. w.
L:  w ← w − η · 2x(wx + b − y)
L1: w ← w − η · (2x(wx + b − y) + λ·sgn(w)), where sgn(w) = 1 for w > 0 and −1 for w < 0
L2: w ← w − η · (2x(wx + b − y) + 2λw)
From here onwards, let's perform the following substitution on the equations above (for better readability): write H = 2x(wx + b − y), which gives us
L:  w ← w − ηH (Equation 0)
L1: w ← w − η(H + λ) for w > 0 (Equation 1.1), and w ← w − η(H − λ) for w < 0 (Equation 1.2)
L2: w ← w − η(H + 2λw) (Equation 2)
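A minimal numpy sketch of one step of each update rule above (toy values for w, b, x, y; η and λ chosen arbitrarily):

import numpy as np

w, b, x, y = 2.0, 0.5, 1.5, 1.0       # toy parameter and data values
eta, lam = 0.01, 0.1                  # learning rate and regularisation strength

H = 2 * x * (w * x + b - y)           # gradient of the squared error w.r.t. w

w_plain = w - eta * H                       # Equation 0: no regularisation
w_l1 = w - eta * (H + lam * np.sign(w))     # Equations 1.1/1.2: L1
w_l2 = w - eta * (H + 2 * lam * w)          # Equation 2: L2
print(w_plain, w_l1, w_l2)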
Observe the differences between the weight updates with the regularisation parameter λ and without it. Here are some intuitions.
Intuition A:
Let's say that with Equation 0, the calculation gives us a w value that leads to overfitting. Then, intuitively, Equations {1.1, 1.2 and 2} will reduce the chances of overfitting, because introducing ηλ makes us shift away from the very w that was going to cause us overfitting problems in the previous sentence.
Intuition B:
Let's say an overfitted model means that we have a w value that is perfect for our model. 'Perfect' means that if we substituted the data (x) back into the model, our prediction ŷ would be very, very close to the true y.
Sure, it's good, but we don't want perfect. Why? Because this means our model is only meant for the dataset on which it was trained, and will produce predictions that are far off from the true value on other datasets. So we settle for less than perfect, with the hope that our model can also get close predictions on other data. To do this, we 'taint' this perfect w in Equation 0 with a penalty term ηλ. This gives us Equations {1.1, 1.2 and 2}.
Intuition C:
Notice that H (as defined here) is dependent on the model (w and b) and the data (x and y). Updating the weights based only on the model and data, as in Equation 0, can lead to overfitting, which leads to poor generalization.
On the other hand, in Equations {1.1, 1.2 and 2}, the final value of w is not only influenced by the model and data, but also by a predefined parameter λ which is independent of the model and data. Thus, we can prevent overfitting if we set an appropriate value of λ, though too large a value will cause the model to be severely underfitted.
Intuition D:
Edden Gerber (thanks!) has provided an intuition about the direction toward which our solution is being shifted. Have a look in the comments: https://medium.com/@edden.gerber/thanks-for-the-article-1003ad7478b2
We shall now focus our attention on L1 and L2, and rewrite Equations {1.1, 1.2 and 2} by rearranging their λ and H terms as follows:
L1: w ← w − ηH − ηλ for w > 0 (Equation 3.1), and w ← w − ηH + ηλ for w < 0 (Equation 3.2)
L2: w ← (1 − 2ηλ)w − ηH (Equation 4)
Compare the second terms of the equations above. Apart from H, the change in w depends on the ±ηλ term or the 2ηλw term, which highlights the influence of the following:
(i) the sign of the current w (L1, L2)
(ii) the magnitude of the current w (L2)
(iii) the doubling of the regularisation parameter (L2)
While weight updates using L1 are influenced only by the first point, weight updates from L2 are influenced by all three points. While I have made this comparison just based on the iterative equation update, please note that this does not mean that one is 'better' than the other.
For now, let’s see below how a regularization effect from L1 can be attained just by the sign of the current w.
L1's effect of pushing w towards 0 (sparsity): take a look at L1 in Equation 3.1. If w is positive, the regularisation parameter λ will push w to be less positive, by subtracting ηλ from w. Conversely, in Equation 3.2, if w is negative, ηλ will be added to w, pushing it to be less negative. Hence, this has the effect of pushing w towards 0.
This is of course pointless in a 1-variable linear regression model, but it will prove its prowess at 'removing' useless variables in multivariate regression models. You can also think of L1 as reducing the number of features in the model altogether.
Here is an arbitrary example of L1 trying to ‘push’ some variables in a multivariate linear regression model:
So how does pushing w towards 0 help with overfitting in L1 regularisation? As mentioned above, as w goes to 0, we are reducing the number of features by reducing the variable importance. In the example above, the variables whose coefficients have been pushed very close to 0 become almost 'useless', hence we can remove them from the equation. This in turn reduces the model complexity, making our model simpler. A simpler model can reduce the chances of overfitting.
Note: while L1 has the influence of pushing weights towards 0 and L2 does not, this does not imply that weights cannot get close to 0 under L2.
see: https://medium.com/@edden.gerber/thanks-for-the-article-1003ad7478b2
I’d like to suggest that the last part — how regularization reduces overfitting — does not give a satisfying enough answer. The intuitions presented are based on the idea of taking an overfitted solution and moving “away from it” (getting a “less than perfect” solution, one that is affected by factors independent of the dataset, etc.). But that would also apply to moving away from the overfitted solution by naively adding a constant to all weights, which would of course not be helpful. In other words, regularization does make our solution less “perfect” but this in itself is not why it helps.
Instead, I suggest that we need to think about the direction toward which our solution is being shifted: specifically, not just away from the overfitted solution but also toward the axis origin. This means that:
1) Weights for different potential training sets will be more similar, which means that the model variance is reduced (in contrast, if we shifted our weights randomly each time just to move away from the overfitted solution, the variance would not change).
2) We will have a smaller weight for each feature (and/or less features if using L1 reg.). Why does this decrease overfitting?
The way I find it easy to think about is that in a typical case we will have a small number of simple features that will explain most of the variance (e.g., most of y will be explained by a few simple features); but if our model is not regularized, we can add as many more features as we want to explain the residual variance of the dataset (e.g., complex, high-order features), which would naturally overfit the training set. Introducing a penalty on the sum of the weights means that the model has to 'distribute' its weights optimally, so naturally most of this 'resource' will go to the simple features that explain most of the variance, with complex features getting small or zero weights.
What is the essential difference between KNN and linear regression? KNN is a non-parametric, instance-based method (there is no real training; prediction comes from the nearest neighbours), while linear regression is a parametric model that learns a fixed set of coefficients.
Naive: Naive Bayes assumes the features are (conditionally) independent given the class;
Classification is used when your target is categorical, while regression is used when your target variable is continuous. Both classification and regression belong to the category of supervised machine learning algorithms.
Logistic regression is a classification algorithm used to predict a binary outcome for a given set of independent variables.
The output of logistic regression is a probability between 0 and 1, which is converted to a class label of 0 or 1 using a threshold, generally 0.5: any value above 0.5 is considered a 1, and any value below 0.5 is considered a 0.
A key difference from linear regression is that the value being modeled is a binary label (0 or 1) rather than a continuous numeric value.
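A minimal numpy sketch of this thresholding (the weights below are arbitrary toy values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.2, -0.7]), 0.3       # toy weights and bias
X = np.array([[0.5, 1.0], [2.0, -1.0]])
prob = sigmoid(X @ w + b)               # probabilities in (0, 1)
label = (prob > 0.5).astype(int)        # threshold at 0.5 -> class 0 or 1
print(prob, label)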
What is the loss function of an SVM, and how should we understand it? An SVM minimizes the hinge loss, max(0, 1 − y·f(x)), plus an L2 penalty on the weights: the loss is zero for points classified correctly with a margin of at least 1, and grows linearly with the margin violation otherwise.
The ROC (receiver operating characteristic) - the performance plot for binary classifiers of True Positive Rate (y-axis) vs. False Positive Rate (x-axis).
AUC is area under the ROC curve, and it’s a common performance metric for evaluating binary classification models.
It’s equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
AUROC is robust to class imbalance, unlike raw accuracy.
For example, if you want to detect a type of cancer that's prevalent in only 1% of the population, you can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free.
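A minimal sketch of computing AUROC with scikit-learn (assuming sklearn; the labels and scores below are toy values):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]   # predicted probabilities
print(roc_auc_score(y_true, y_score))       # area under the ROC curve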
An imbalanced dataset is when you have, for example, a classification test and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be skewed if you have no predictive power on the other category of data! Here are a few tactics to get over the hump:
Collect more data to even out the imbalance in the dataset; collecting more data is often the most easily overlooked remedy.
Resample the dataset to correct for the imbalance: over-sample the minority class and/or under-sample the majority class.
Re-weight the data: address the imbalance through per-class penalty weights on the positive and negative samples (see the sketch after this list).
Try a different algorithm altogether on your dataset, e.g., ensemble learning (via bagging or boosting), which is also robust to noisy data;
Use suitable evaluation methods: evaluate the model with the confusion matrix, AUC/ROC, and similar metrics.
What’s important here is that you have a keen sense for what damage an unbalanced dataset can cause, and how to balance that.
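As one concrete, hedged instance of the re-weighting tactic above, scikit-learn's class_weight='balanced' option penalizes errors on the minority class more heavily (assuming sklearn; the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic 95% / 5% imbalanced data
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
clf = LogisticRegression(class_weight='balanced')  # re-weight classes inversely
clf.fit(X, y)                                      # to their frequencies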
Remove rows with missing values
Build another predictive model to predict the missing values – This could be a whole project in itself, so simple techniques are usually used here.
Use a model that can incorporate missing data – Like a random forest, or any tree-based method.
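A minimal pandas sketch of the first option above, plus a simple imputation technique (toy data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
dropped = df.dropna()           # option 1: remove rows with missing values
imputed = df.fillna(df.mean())  # a simple technique: fill with column means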
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model which uses L2 is called Ridge Regression.
Logistic Regression VS Linear Regression:
Random Forest: Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds. In data science speak, the reason that the random forest model works so well is: A large number of relatively uncorrelated
models (trees) operating as a committee will outperform any of the individual constituent models. The low correlation between models is the key.
Bagging (Bootstrap Aggregation): decision trees are very sensitive to the data they are trained on; small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset with replacement, resulting in different trees. This process is known as bagging.
Given the left and right images of a stereo pair and a disparity d: a pixel (x, y) in the left image corresponds to the pixel (x − d, y) in the right image;
1) The Python Code:
import numpy as np

def image_filter(video, kernel):
    # video: (D, H, W) volume; kernel: (N, N, N) with N odd
    D, H, W = video.shape
    N, _, _ = kernel.shape
    n_h = (N - 1) // 2          # half-width of the kernel (integer division)
    size = N * N * N
    y = np.zeros((D, H, W))     # output has the same layout as the input
    # skip the borders so that every kernel window fits inside the volume
    for d in range(n_h, D - n_h):
        for h in range(n_h, H - n_h):
            for w in range(n_h, W - n_h):
                x = video[d - n_h:d + n_h + 1,
                          h - n_h:h + n_h + 1,
                          w - n_h:w + n_h + 1]
                y[d, h, w] = np.sum(x * kernel) / size  # normalize by kernel volume
    return y
2) C++ code:
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;
Mat image_filter(const Mat & video, const Mat & kernel){
    int D = video.size[0];
    int H = video.size[1];
    int W = video.size[2];
    int N = kernel.size[0];
    int n_h = (N - 1) / 2;                 // half-width of the kernel
    float size_inv = 1.0f / (N * N * N);   // must be float: an int would truncate to 0
    int sizes[3] = {D, H, W};
    Mat y(3, sizes, CV_32F, Scalar(0));
    // skip the borders so that every kernel window fits inside the volume
    for (int d = n_h; d < D - n_h; ++d){
        for (int h = n_h; h < H - n_h; ++h){
            for (int w = n_h; w < W - n_h; ++w){
                float r = 0;
                // iterate over all three kernel dimensions (i, j, k)
                for (int i = -n_h; i <= n_h; ++i)
                    for (int j = -n_h; j <= n_h; ++j)
                        for (int k = -n_h; k <= n_h; ++k)
                            r += video.at<float>(d + i, h + j, w + k)
                               * kernel.at<float>(i + n_h, j + n_h, k + n_h);
                y.at<float>(d, h, w) = r * size_inv;
            }
        }
    }
    return y;
}
3) Use im2col for efficient convolution: we can use im2col to implement convolution efficiently via matrix multiplication.
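A minimal 2D im2col sketch in plain numpy, assuming stride 1 and no padding (im2col here is our own helper, not a library call; the same idea extends to the 3D case above):

import numpy as np

def im2col(img, k):
    # unroll every k x k patch of img into one column
    H, W = img.shape
    cols = [img[i:i + k, j:j + k].ravel()
            for i in range(H - k + 1)
            for j in range(W - k + 1)]
    return np.array(cols).T             # shape: (k*k, num_patches)

img = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0
out = kernel.ravel() @ im2col(img, 3)   # convolution = one matrix multiply
out = out.reshape(2, 2)                 # back to the output spatial layout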
see https://www.thebalancecareers.com/top-behavioral-interview-questions-2059618
What They Want to Know: If you’re being considered for a high-stress job, the interviewer will want to know how well you can work under pressure. Give a real example of how you’ve dealt with pressure when you respond.
I had been working on a key project that was scheduled for delivery to the client in 60 days. My supervisor came to me and said that we needed to speed it up and be ready in 45 days, while keeping our other projects on time. I made it into a challenge for my staff, and we effectively added just a few hours to each of our schedules and got the job done in 42 days by sharing the workload. Of course, I had a great group of people to work with, but I think that my effective allocation of tasks was a major component that contributed to the success of the project.
What They Want to Know: Regardless of your job, things may go wrong and it won’t always be business as usual. With this type of question, the hiring manager wants to know how you will react in a difficult situation. Focus on how you resolved a challenging situation when you respond. Consider sharing a step-by-step outline of what you did and why it worked.
One time, my supervisor needed to leave town unexpectedly, and we were in the middle of complicated negotiations with a new sponsor. I was tasked with putting together a PowerPoint presentation just from the notes he had left, and some briefing from his manager. My presentation was successful. We got the sponsorship, and the management team even recommended me for an award.
What They Want to Know: Nobody is perfect, and we all make mistakes. The interviewer is more interested in how you handled it when you made an error, rather than in the fact that it happened.
I once misquoted the fees for a particular type of membership to the club where I worked. I explained my mistake to my supervisor, who appreciated my coming to him, and my honesty. He told me to offer to waive the application fee for the new member. The member joined the club despite my mistake, my supervisor was understanding, and although I felt bad that I had made a mistake, I learned to pay close attention to the details so as to be sure to give accurate information in the future.
What They Want to Know: With this question, the interviewer wants to know how well you plan and set goals for what you want to accomplish. The easiest way to respond is to share examples of successful goal setting.
Within a few weeks of beginning my first job as a sales associate in a department store, I knew that I wanted to be in the fashion industry. I decided that I would work my way up to department manager, and at that point I would have enough money saved to be able to attend design school full-time. I did just that, and I even landed my first job through an internship I completed the summer before graduation.
What They Want to Know: The hiring manager is interested in learning what you do to achieve your goals, and the steps you take to accomplish them.
When I started working for XYZ Company, I wanted to achieve the Employee of the Month title. It was a motivational challenge, and not all the employees took it that seriously, but I really wanted that parking spot, and my picture on the wall. I went out of my way to be helpful to my colleagues, supervisors, and customers - which I would have done anyway. I liked the job and the people I worked with. The third month I was there, I got the honor. It was good to achieve my goal, and I actually ended up moving into a managerial position there pretty quickly, I think because of my positive attitude and perseverance.
What They Want to Know: Sometimes, management has to make difficult decisions, and not all employees are happy when a new policy is put in place. If you’re interviewing for a decision-making role, the interviewer will want to know your process for implementing change.
Once, I inherited a group of employees when their supervisor relocated to another city. They had been allowed to cover each other’s shifts without management approval. I didn’t like the inconsistencies, where certain people were being given more opportunities than others. I introduced a policy where I had my assistant approve all staffing changes, to make sure that everyone who wanted extra hours and was available at certain times could be utilized.
What They Want to Know: Many jobs require working as part of a team. In interviews for those roles, the hiring manager will want to know how well you work with others and cooperate with other team members.
During my last semester in college, I worked as part of a research team in the History department. The professor leading the project was writing a book on the development of language in Europe in the Middle Ages. We were each assigned different sectors to focus on, and I suggested that we meet independently before our weekly meeting with the professor to discuss our progress, and help each other out if we were having any difficulties. The professor really appreciated the way we worked together, and it helped to streamline his research as well. He was ready to start on his final copy months ahead of schedule because of the work we helped him with.
What They Want to Know: With this question, the interviewer is seeking insight into how you handle issues at work. Focus on how you’ve solved a problem or compromised when there was a workplace disagreement.
A few years ago, I had a supervisor who wanted me to find ways to outsource most of the work we were doing in my department. I felt that my department was one where having the staff on the premises had a huge impact on our effectiveness and ability to relate to our clients. I presented a strong case to her, and she came up with a compromise plan.
Tips for Responding: How to answer interview questions about problems at work.
What They Want to Know: Do you have strong motivational skills? What strategies do you use to motivate your team? The hiring manager is looking for a concrete example of your ability to motivate others.
I was in a situation once where the management of our department was taken over by employees with experience in a totally different industry, in an effort to maximize profits over service. Many of my co-workers were resistant to the sweeping changes that were being made, but I immediately recognized some of the benefits, and was able to motivate my colleagues to give the new process a chance to succeed.
More Answers: What strategies would you use to motivate your team?
What They Want to Know: Can you handle difficult situations at work, or do you not deal with them well? The employer will want to know what you do when there’s a problem.
When I worked at ABC Global, it came to my attention that one of my employees had become addicted to painkillers prescribed after she had surgery. Her performance was being negatively impacted, and she needed to get some help. I spoke with her privately, and I helped her to arrange a weekend treatment program that was covered by her insurance. Fortunately, she was able to get her life back on track, and she received a promotion about six months later.
Conflict with your boss: how do you resolve it?
Cooperation with team members: how do you boost morale?
What if your algorithm differs from other popular research directions? For all three answers I draw on my own experience: over the long years of my PhD I unfortunately (or fortunately?) went through all of these situations, so they are effortless to talk about and there is no need to make anything up.
Two additional major benefits of ReLUs are sparsity and a reduced likelihood of vanishing gradient.
But first recall the definition of a ReLU: h = max(0, a), where a = Wx + b.
One major benefit is the reduced likelihood of vanishing gradients. The gradient of the ReLU function is either 0 for a < 0 or 1 for a > 0, so the reduced likelihood of the gradient vanishing arises when a > 0: in this regime the gradient has a constant value of 1.
In contrast, the gradient of sigmoids becomes increasingly small as the absolute value of x increases. The constant gradient of ReLUs results in faster learning.
The sigmoid function is S(a) = 1 / (1 + e^(−a)), and its derivative is S′(a) = S(a)(1 − S(a)).
The other benefit of ReLUs is sparsity. Sparsity arises when a ≤ 0: the more such units exist in a layer, the more sparse the resulting representation. Sigmoids, on the other hand, are always likely to generate some non-zero value, resulting in dense representations. Sparse representations seem to be more beneficial than dense representations.
Some other advantages:
More computationally efficient to compute than sigmoid-like functions, since ReLU just needs to pick max(0, x) and does not perform the expensive exponential operations used in sigmoids;
ReLU: in practice, networks with ReLU tend to show better convergence performance than sigmoid (Krizhevsky et al.).
Disadvantages:
Sigmoid: tends to vanish the gradient (there is a built-in mechanism that shrinks the gradient as the input "a" of the sigmoid grows in magnitude: the gradient of the sigmoid is S′(a) = S(a)(1 − S(a)), so as "a" grows infinitely large, S′(a) = S(a)(1 − S(a)) = 1 × (1 − 1) = 0).
ReLU: tends to blow up activations (there is no mechanism to constrain the output of the neuron, since "a" itself is the output).
ReLU: the dying ReLU problem: if too many activations fall below zero, most of the units (i.e., neurons) in a network with ReLU will simply output zero, in other words die, thereby prohibiting learning. (This can be handled, to some extent, by using Leaky-ReLU instead.)
Just complementing the other answers:
Vanishing Gradients:
The other answers are right to point out that the bigger the input (in absolute value) the smaller the gradient of the sigmoid function. But, probably an even more important effect is that the derivative of the sigmoid function is ALWAYS smaller than one. In fact it is at most 0.25!
The down side of this is that if you have many layers, you will multiply these gradients, and the product of many smaller than 1 values goes to zero very quickly.
Since the state of the art for Deep Learning has shown that more layers help a lot, this disadvantage of the Sigmoid function is a game killer. You just can't do Deep Learning with Sigmoid.
On the other hand the gradient of the ReLu function is either 0 for a < 0 or 1 for a > 0. That means that you can put as many layers as you like, because multiplying the gradients will neither vanish nor explode.
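A tiny numpy sketch of this argument (a toy 20-layer chain with all pre-activations at 0): multiplying many sigmoid derivatives, each at most 0.25, collapses towards zero, while ReLU's unit gradients do not:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.zeros(20)                            # toy pre-activations across 20 layers
sig_grads = sigmoid(a) * (1 - sigmoid(a))   # each equals 0.25 at a = 0
relu_grads = (a >= 0).astype(float)         # 1 for active units (taking g = 1 at a = 0)

print(np.prod(sig_grads))    # ~9.1e-13: the product vanishes
print(np.prod(relu_grads))   # 1.0: neither vanishes nor explodes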