Data Science Questions

Yishai Rasowsky
6 min read · Jun 8, 2022

Prepare for your next Machine Learning job interview

  1. Please explain the difference between machine learning, deep learning, and AI.
    Artificial intelligence (AI) is when a machine can emulate the behavior of an intelligent human to accomplish some task. Any time a computer uses an algorithm to solve a problem, that can be called AI. But we usually reserve the term AI for cases where a machine can perform the task more effectively and efficiently than a human being could.
    Machine learning (ML) is a special type of AI. The idea is to avoid the need for a human to teach the computer every step of the solution. Instead, the computer should improve its own performance by learning from its experience.
    There are different subdivisions within ML. First is unsupervised learning. This is where the machine, on its own, recognizes patterns within the data. The goal may be to categorize the items into discrete clusters, or to identify the most significant features and thus reduce the number of variables to be addressed.
    Second is supervised learning. This is where you tell the machine how a set of example items should be labeled. This can be done in two ways: classification, which sorts the items into discrete categories, or regression, where the label’s value can be any number in a continuous range. After training an ML model on the known sample data, you can apply the model to new, unknown items, which it will try to identify correctly based on their features.
    Deep learning (DL) is a specific type of machine learning. The goal of DL is for the computer to learn useful representations of the data without a human having to hand-engineer them. DL does this by making use of artificial neural networks (ANNs). An ANN contains multiple “layers”, which are used to recognize different features in the data. The first layer recognizes the simplest features, and successive layers identify progressively subtler ones. Unlike most ordinary ML, in which a human must concretely select the features, in DL the model learns them on its own. A benefit of DL is that its predictions are often much more accurate than typical ML on complex data such as images and text. Unfortunately, however, a DL algorithm requires more data and time to train than ordinary ML.
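    To make the supervised/unsupervised distinction concrete, here is a minimal sketch using scikit-learn. The choice of dataset (iris) and models is purely illustrative, not something the question specifies:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised learning: we hand the model both the features X and the labels y.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print("Predicted class of first flower:", clf.predict(X[:1]))

# Unsupervised learning: the model sees only X and finds clusters on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
print("Cluster assigned to first flower:", km.labels_[0])
```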
  2. Please choose one of the following algorithms (XGBoost, SVM, Logistic Regression, Random Forest) and write the following.
    A brief description of the algorithm and how it works
    The random forest algorithm is based on decision trees. A decision tree successively splits your data into separate subgroups. This process repeats until all items are neatly sorted into final subgroups, called “leaves.” Ideally, all items in each leaf share the same label.
    How does the tree decide where to branch? Each split is chosen greedily: the feature (and threshold) used is the one that most reduces the impurity, or disorder, of the resulting subgroups. This leads to quick and neat categorization.
    Your tree can now take a new, unlabeled item and pass its features through the tree’s “branches” to predict the correct label.
    Unfortunately, a single decision tree tends to overfit, so it is not very good at identifying items outside its training set. Random forest alleviates that problem. How so? It creates a collection of trees, each built from a random bootstrap sample of your training data (and, at each split, a random subset of the features). Every tree then attempts to label your new mystery item, and the forest reports the answer most of the trees voted for. Averaging many decorrelated trees this way usually gives a noticeably more accurate answer than any single tree.
    A nice benefit of random forests is the efficient way they can be evaluated. Most algorithms must be validated and tested on a portion of items held out from the training data, which not only requires additional computation but also shrinks the data available for training. Because each tree in a random forest is built from a bootstrap sample, roughly a third of the items are left “out of bag” for that tree. The forest can score each tree on its out-of-bag items, giving an estimate of the error to expect on future examples without a separate validation set.
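    Here is a minimal sketch of that out-of-bag evaluation with scikit-learn; the dataset is just an illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each tree on the
# training items that its bootstrap sample left out.
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X, y)

print("Out-of-bag accuracy estimate:", forest.oob_score_)
```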
    An overview of the algorithm’s parameters, and a description of the top ones
    There are several parameters for Random Forest. I will list and briefly define them all, but the most important ones, which appear at the top of my list, I will describe in a little more detail. (A short tuning sketch follows the list.)
    1. n_estimators is the total number of trees. Having more trees is good, because it makes your model more consistent (i.e. decreases variance), and adding more trees will not cause overfitting. The downside is that more trees require more computation, and therefore more time and money. It is a good idea to start with several hundred, then gradually add trees until performance reaches a plateau.
    2. max_depth is how many successive splits each tree may make. A deeper tree can fit patterns that are more complex. On the other hand, there is a greater danger of overfitting your model to the noise of the training set. Tune the depth to balance between over- and under-fitting.
    3. max_features is how many features to consider when looking for the best split. Taking more features into account makes each split more likely to be a good one. On the other hand, it also makes your model more dependent on the particular training set that was used. This is bad because it gives your model an inconsistent level of accuracy (i.e. high variance). A common rule of thumb is the square root of the total number of features for classification, and one third of the total for regression. Tuning by hand can enhance performance further.
    4. min_samples_leaf is the minimum number of items that must remain in each leaf node, i.e. on each side of a candidate split.
    5. oob_score is whether to assess accuracy using the items each tree’s bootstrap sample omitted.
    6. max_leaf_nodes caps the total number of leaves in each tree.
    7. min_samples_split is the minimum number of items a node must contain before it may be split again.
    8. min_weight_fraction_leaf is the minimum weighted fraction of the total sample weight required at a leaf node. (If you do not supply sample weights, all samples weigh equally, and this acts like a fractional version of min_samples_leaf.)
    9. min_impurity_decrease is the minimum amount by which a split must lessen the disorder for it to be made.
    10. min_impurity_split is how unsorted a node must be before it will be split again. (Note that this parameter has been deprecated in scikit-learn in favor of min_impurity_decrease.)
    11. bootstrap is whether each tree is built from a random sample of the items (drawn with replacement) or from the whole sample set.
    12. n_jobs is how many tasks you want to run in parallel for fitting and prediction.
    13. random_state is a seed for the random number generator, so that the random sampling and splitting are reproducible.
    14. verbose is how much logging activity you want the computer to report during fitting and prediction.
    15. warm_start is used if you want to reuse the previously fitted forest and simply add more trees to it, rather than fitting a new forest from scratch.
    16. criterion is the function used to measure the quality of a split, for example gini impurity or entropy.
    17. class_weight lets items from one category count more heavily than items from another, which is useful when the classes are imbalanced.
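    As promised above, here is a minimal tuning sketch for the top three parameters using scikit-learn’s GridSearchCV. The grid values are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid over the three parameters discussed above.
param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```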

How to use the algorithm (find an open source library, and describe the steps to use it)
You can use random forest according to the following steps. (A minimal code sketch follows the step list.)
1. Select the column of your dataframe which holds the property you wish to predict. This is your “target”. We will call it y.
2. Select the columns of your dataframe which hold the input features, or “independent variables”. We call this X.
3. Use train_test_split to divide both X and y into training and validation sets. The split is random; pass a seed (random_state) so you get the same division each time. Now you should have train_X, val_X, train_y, val_y.
4. Now use the model (RandomForestRegressor or RandomForestClassifier, depending on your case) to fit your training data.
5. Generate predictions upon the validation set (val_X).
6. Use mean_absolute_error to show the difference between the actual values of your validation set (val_y) and the predictions you just made.
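Here is a minimal sketch of those six steps, assuming a pandas dataframe loaded from a CSV file with a numeric target column named “price”. The file name and column name are hypothetical placeholders; substitute your own:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical data source; replace with your own.
df = pd.read_csv("data.csv")

# Step 1: the target column we want to predict.
y = df["price"]

# Step 2: the feature columns (here, everything except the target).
X = df.drop(columns=["price"])

# Step 3: split into training and validation sets, seeded for reproducibility.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Step 4: fit the model on the training data.
model = RandomForestRegressor(n_estimators=300, random_state=1)
model.fit(train_X, train_y)

# Step 5: predict on the validation features.
preds = model.predict(val_X)

# Step 6: compare predictions against the true validation labels.
print("Validation MAE:", mean_absolute_error(val_y, preds))
```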
