In this article, I am going to share with you my 2022 machine learning toolkit. This toolkit contains the best set of tools to solve any machine learning problem that you decide to work on. Throughout my years of experience in the field of machine learning, I have used numerous programming languages, frameworks, and libraries, but after learning and using a certain tool I always asked myself: was it really worth the time and effort? And was there a better alternative?
Fortunately, most of the time the answer was positive, thanks to the research I do beforehand. However, in some cases I realized that learning and using a certain tool was not the right decision, and the cause differed from case to case.
Thus, I decided to write this article, to help you save your precious time and start learning the best machine learning tools from the get-go.
Overview
When we start learning Machine Learning, the first question that needs to be answered clearly is: which programming language should we learn?
And that is for 2 main reasons:
- Every language has its own set of rules, so writing the same machine learning algorithm differs from one language to another, and transitioning between languages can be a tedious task for some people.
- Some programming languages have larger Machine Learning communities, which, as you may know, is hugely important for someone new to the field.
The main candidates are the following:
- Python: It is by far the most used programming language in the Machine Learning field, thanks to its simple syntax and its unmatched ecosystem of ML libraries and community support.
- R: It has a large community of machine learning practitioners; however, R is generally used in machine learning to create proof-of-concepts, mostly when statistical analysis represents a significant part of the project.
- C++: It is indeed the basis of most ML libraries and provides speed and efficiency in code execution; however, what makes it less used in ML than Python is that developing ML solutions in it takes significantly more time.
Thus, in the present toolkit, I only selected tools that are supported by Python, and I classified them into 4 categories:
- Exploratory Data Analysis
- Multi-purpose Machine Learning
- Computer Vision
- Natural Language Processing
Exploratory Data Analysis
Tools:
- NumPy: Operations on multi-dimensional arrays and matrices.
- Pandas: Tabular data manipulation and analysis.
- Matplotlib & Seaborn: Data visualization & plotting.
- SweetViz & Dataprep: Automated graphical and numerical EDA.
NumPy
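As a quick illustration of the bullet above, here is a minimal sketch of typical NumPy array operations (the example values are arbitrary):

```python
import numpy as np

# Create a 2-D array (matrix) and inspect its shape
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a.shape)   # (2, 2)

# Element-wise and matrix operations
b = a * 2        # element-wise multiplication
c = a @ a        # matrix multiplication
print(c)         # [[ 7. 10.] [15. 22.]]

# Broadcasting: add a 1-D row vector to every row of the matrix
row = np.array([10.0, 20.0])
print(a + row)   # [[11. 22.] [13. 24.]]
```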
Pandas
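A minimal sketch of the tabular manipulation Pandas is known for, using a tiny hand-made dataset (the column names and values are illustrative only):

```python
import pandas as pd

# Build a small tabular dataset
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Lyon"],
    "sales": [120, 80, 150, 95],
})

# Filter rows with a boolean mask
paris = df[df["city"] == "Paris"]

# Compute a grouped aggregate (total sales per city)
totals = df.groupby("city")["sales"].sum()
print(totals)
```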
Matplotlib & Seaborn
Matplotlib and Seaborn are both Python packages used for data visualization. Matplotlib is a library for creating static, animated, and interactive visualizations in Python; in essence, it can be used to create any kind of plot, graphic, or chart. It is commonly used via its higher-level API called Pyplot, which presents a collection of functions that make Matplotlib work like MATLAB. Seaborn, on the other hand, is a library that uses Matplotlib underneath to plot graphs; it is more specialized in data visualization and produces more polished statistical graphs. Seaborn is a tool that fits comfortably in the toolchain of anyone interested in extracting insights from structured data.
Both Matplotlib and Seaborn are essential libraries for Data Science and Data Analysis, and multiple online resources besides their official documentation can take you to proficiency with both.
SweetViz & Dataprep
Multi-purpose Machine Learning
Tools:
- Scikit-learn: Multi-purpose ML tool (Classification, Regression, Clustering, Preprocessing, Model selection).
- TensorFlow & Keras: Multi-purpose DL tools (creation and training of DL models for numerous tasks).
- PyTorch: Multi-purpose DL tool (creation and training of DL models for numerous tasks).
- fastai: High-level API for DL tasks, built on top of PyTorch.
- JAX/FLAX: FLAX is a high-performance neural network library for JAX (a NumPy-compatible library with JIT compilation and automatic differentiation on GPUs/TPUs).
- PyCaret: Low-code machine learning tool.
Scikit-learn
- Regression, including Linear and Logistic Regression.
- Classification, including K-Nearest Neighbors, SVM, and Decision Trees.
- Clustering, including K-Means (with K-Means++ initialization).
- Model selection, including Train/Test split, and Scoring.
- Preprocessing, including Min-Max Normalization.
I would definitely recommend Scikit-learn as the first machine learning library to learn. Furthermore, I also recommend implementing the algorithms provided by Scikit-learn from scratch using only NumPy: by doing so, you will acquire the basics of developing ML algorithms and understand what is going on under the hood of the predefined Scikit-learn functions, so that you can better fine-tune the models you work on.
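The capabilities listed above compose naturally into one short pipeline. A minimal sketch on the built-in Iris toy dataset, covering model selection (train/test split and scoring), preprocessing (Min-Max normalization), and classification (K-Nearest Neighbors):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a toy dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Min-Max normalization: fit on the train split only, apply to both
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a K-Nearest Neighbors classifier and score it on held-out data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```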
TensorFlow & Keras
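To show why Keras is often called the most beginner-friendly way into deep learning, here is a minimal sketch of the standard workflow (define, compile, fit, predict); the layer sizes are arbitrary and the training data is random placeholder data:

```python
import numpy as np
from tensorflow import keras

# A small fully-connected network defined with the Keras Sequential API
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train briefly on random placeholder data, just to show the workflow
X = np.random.rand(100, 4).astype("float32")
y = np.random.randint(0, 3, size=100)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)

# One probability per class for each input row
print(model.predict(X[:1], verbose=0).shape)  # (1, 3)
```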
PyTorch
PyTorch is an open-source machine learning library that can be considered TensorFlow's main competitor with regard to popularity and usage in the worldwide ML community. Like TensorFlow, PyTorch specializes in tensor computations, automatic differentiation, and GPU acceleration, which makes it a very powerful tool for developing Deep Learning models. However, PyTorch is more popular than TensorFlow in the research field, and more researchers are switching to PyTorch each year, for multiple reasons.
Perhaps the most significant reason that started the migration from TensorFlow to PyTorch is that PyTorch is considered more Pythonic: TensorFlow 1.0 uses static graphs, while PyTorch uses dynamic graphs. With the release of TensorFlow 2.0, eager execution was introduced, which made working with the TensorFlow low-level API much easier and more Pythonic. But to this day, PyTorch is still considered more Pythonic; for more details refer to this article.
In my opinion, the difference between the two frameworks' architectures is only relevant when it comes to doing research in the machine learning field. However, I would still recommend that you eventually learn both frameworks, because working with State Of The Art (SOTA) models is crucial in machine learning projects, and certain models are sometimes only available in one framework and not the other. Furthermore, switching from one framework to the other is quite natural and simple since, nowadays, both frameworks use similar syntax and similar architecture, and the logic behind creating models is the same.
That being said, I would still suggest that beginners start with TensorFlow & Keras because, as I mentioned earlier, Keras is the most beginner-friendly API for developing and customizing deep learning models.
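The dynamic-graph behavior mentioned above can be seen in a few lines: the computation graph is built as the code runs, so plain Python control flow works inside the computation, and gradients are recorded automatically. A minimal sketch with arbitrary values:

```python
import torch

# Dynamic graphs: the graph is built as this code executes, so an
# ordinary Python `if` can sit in the middle of the computation
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum() if x.sum() > 0 else (x ** 3).sum()

# Automatic differentiation: d(x1^2 + x2^2)/dx = 2x
y.backward()
print(x.grad)  # tensor([4., 6.])
```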
fastai
I couldn't describe fastai better than its official documentation does: fastai is a deep learning library which provides practitioners with high-level components that can quickly and easily deliver state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches.
I chose fastai among the libraries in my toolkit because it provides a unique feature that I find quite powerful: fastai, as the name suggests, makes the process of training and testing SOTA models very fast. We can, in a handful of lines of code, create an image classifier with very good performance.
I recommend learning fastai as an additional library to both TensorFlow and PyTorch, to make use of its higher-level components that speed up the prototyping process; all results obtained via fastai can then be reproduced in PyTorch, since fastai is built on top of PyTorch.
JAX/FLAX
FLAX is a deep learning library built on top of JAX, which can be described as a GPU-accelerated compiler for NumPy-style code with automatic differentiation. FLAX delivers an end-to-end, flexible user experience for researchers who use JAX with neural networks, as stated in the official documentation.
I have not learned FLAX yet, but I believe it is a very promising library with great potential. I suggest keeping it on the watchlist for the moment, because it still needs development to be able to compete with PyTorch and TensorFlow.
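To give a flavor of the JAX side at least, here is a minimal sketch of its two signature features, automatic differentiation (`jax.grad`) and JIT compilation (`jax.jit`), applied to a toy loss function of my own choosing:

```python
import jax
import jax.numpy as jnp

# JAX mirrors the NumPy API; this toy loss is sum((2w - 1)^2)
def loss(w):
    return jnp.sum((w * 2.0 - 1.0) ** 2)

# Differentiate the function, then compile the gradient with XLA
grad_fn = jax.jit(jax.grad(loss))

w = jnp.array([0.0, 1.0])
print(grad_fn(w))  # analytically: 4 * (2w - 1) -> [-4., 4.]
```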
PyCaret
PyCaret is an open-source Python machine learning library created as the Python version of the Caret library in R, to make performing standard tasks in a machine learning project simpler. It is designed to automate the principal steps of evaluating and comparing machine learning algorithms for regression and classification. The main benefit of the library is that a lot can be achieved with very few lines of code and little manual configuration.
On the other hand, PyCaret doesn't allow for much model customization. Thus, I suggest using this library to find the best machine learning model architecture for the task you're working on. Afterwards, switching to other libraries like Scikit-learn which allow model customization and tuning would be the sensible and efficient way to make use of PyCaret.
Computer Vision
Tools:
- Opencv-python: Powerful tool for image processing and computer vision tasks.
- Scikit-image: Python image processing and computer vision tool.
- Pillow: Python library mainly for image handling (open/save).
- PyTesseract: Python library for the Google Tesseract OCR engine.
Opencv-python
Scikit-image
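A minimal scikit-image sketch in the same self-contained spirit: segment a synthetic image with Otsu thresholding, then label its connected regions, a very common image-analysis pattern:

```python
import numpy as np
from skimage import filters, measure

# Synthetic image: a bright disk on a dark background
yy, xx = np.mgrid[:100, :100]
img = ((yy - 50) ** 2 + (xx - 50) ** 2 < 20 ** 2).astype(float)

# Threshold with Otsu's method, then label connected regions
thresh = filters.threshold_otsu(img)
labels = measure.label(img > thresh)

print(labels.max())  # one connected region: the disk
```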
Natural Language Processing
Tools:
- NLTK: The most popular NLP library in Python; can be used for various simple NLP tasks.
- spaCy: Powerful library for simple and advanced NLP applications.
- Transformers: Python library by Hugging Face that provides state-of-the-art transformer models, mainly for PyTorch.
- Gensim: Python library for unsupervised topic modeling, retrieval by similarity, and other natural language processing functionalities.
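As a small taste of the simpler NLP tasks mentioned above, here is a minimal spaCy sketch; a blank pipeline handles tokenization (including splitting contractions) without downloading any pretrained model:

```python
import spacy

# A blank English pipeline: tokenizer only, no model download needed
nlp = spacy.blank("en")
doc = nlp("Machine learning isn't magic, it's statistics.")

# spaCy splits contractions like "isn't" into separate tokens
print([token.text for token in doc])
```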