My 2022 Machine Learning Toolkit


Many machine learning tools have been developed over the last decade, and learning them all would not be as efficient as you may think. Whether you're thinking about starting your journey in Machine Learning and aren't sure which tools to start with, or you're currently in the learning process and want to know which set of tools can help you become a great problem solver in any area of the field, you are in the right place!

In this article, I am going to share with you my 2022 machine learning toolkit. This toolkit contains the best set of tools to give you the ability to solve any machine learning problem you decide to work on. Throughout my years of experience in the field, I have used numerous programming languages, frameworks, and libraries, but after learning and using a certain tool I always asked myself: was it really worth the time and effort? And was there a better alternative?
Fortunately, most of the time the answer was yes, thanks to the research I do beforehand. However, in some cases I realized that learning and using a certain tool was not the right decision, for reasons that differed from case to case.
Thus, I decided to write this article to help you save your precious time and start learning the best machine learning tools from the get-go.

Overview

When we start learning Machine Learning, the first question that needs to be answered clearly is: which programming language should we learn?
This matters for two main reasons:

  1. Every language has its own set of rules, so writing the same machine learning algorithm differs from one language to another, and transitioning between languages can be a tedious task for some people.
  2. Some programming languages have larger Machine Learning communities, and for someone new to the field that is of huge importance, as you may know.
There is no single "best" programming language for Machine Learning; it all depends on where you're coming from, what you want to build, and why you got involved in machine learning. But let's assume that we don't know any programming language and want to learn Machine Learning to start a career. Then there are mainly three options, and I would recommend the first one:
  1. Python: It is by far the most used programming language in the Machine Learning field, and this is due to multiple reasons.
  2. R: It has a large community of machine learning practitioners; however, R is generally used in machine learning to create proofs of concept, mostly when statistical analysis represents a significant part of the project.
  3. C++: It is indeed the basis of most ML libraries and provides speed and efficiency in code execution; however, what makes it less used in ML than Python is that it takes significantly more time to develop ML solutions.

Thus, in the present toolkit, I only selected tools that are supported by Python, and I classified them into four categories:

  1. Exploratory Data Analysis
  2. Multi-purpose Machine Learning
  3. Computer Vision
  4. Natural Language Processing

Notebook

You can find examples for all the tools discussed in this article in the following Jupyter notebook.

Exploratory Data Analysis

Since any machine learning project includes working on data (gathering, assessment, cleaning, and visualization), I decided to include the tools that allow us to perform EDA, or Exploratory Data Analysis, which consists mainly of the tasks mentioned above.

Tools:

  1. NumPy: Operations on multi-dimensional arrays and matrices.
  2. Pandas: Tabular data manipulation and analysis.
  3. Matplotlib & Seaborn: Data visualization & plotting.
  4. SweetViz & Dataprep: Automated graphical and numerical EDA.

NumPy

NumPy stands for Numerical Python and is pronounced /ˈnʌmpaɪ/. NumPy is a Python library that performs numerical calculations on matrix-like objects called arrays. It is very fast because its core is written in the C programming language.
NumPy is the most popular and most commonly used Python library for any type of linear algebra calculation, which is the basis of numerous Machine Learning algorithms.

NumPy is a fundamental package that any ML practitioner should be very familiar with. It is also a beginner-friendly library with extensive documentation and many free courses online, so learning NumPy should not present much of a challenge to anyone familiar with the basics of mathematical computation.
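
To give a quick taste, here is a minimal sketch of the kind of array and linear algebra operations NumPy makes trivial (the values are made up for illustration):

```python
import numpy as np

# A 2x3 matrix and a 3-element vector
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
x = np.array([1.0, 0.5, -1.0])

y = A @ x                 # matrix-vector product
print(y)                  # [-1.   0.5]
print(A.mean(axis=0))     # column-wise mean: [2.5 3.5 4.5]
```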

Pandas

Pandas is an open-source Python library that is widely used for data science, data analysis, and machine learning tasks. It is built on top of NumPy and provides a wide collection of functions for performing EDA tasks. It handles structured data through array-like objects called DataFrames and Series.

Pandas is so powerful that sometimes we don't need any other package alongside it to perform EDA or other statistical analysis. Like NumPy, Pandas has extensive documentation online with multiple examples, so you can generally learn it without needing any other source.
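
As a sketch of a typical first pass over a dataset (the CSV file name is hypothetical):

```python
import pandas as pd

# Load a dataset and take a first look
df = pd.read_csv("dataset.csv")   # hypothetical file
print(df.head())                  # first five rows
print(df.describe())              # summary statistics for numeric columns
print(df.isna().sum())            # missing values per column
```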

Matplotlib & Seaborn

Matplotlib and Seaborn are both Python packages used for data visualization. Matplotlib is a library for creating static, animated, and interactive visualizations in Python; in essence, it can be used to create any kind of plot, graphic, or chart. It is commonly used via its higher-level API, Pyplot, which presents a collection of functions that make Matplotlib work like MATLAB. Seaborn, on the other hand, uses Matplotlib underneath to plot graphs; it is more specialized in data visualization and produces more polished statistical graphics. Seaborn is a tool that fits comfortably in the toolchain of anyone interested in extracting insights from structured data.

Both Matplotlib and Seaborn are essential libraries for data science and data analysis, and we can find multiple online resources, beyond their official documentation, to reach proficiency with both.
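
As a minimal sketch of how the two play together, here is a Seaborn statistical plot finished off with Matplotlib calls (the "tips" dataset is one of Seaborn's bundled examples):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # example dataset bundled with Seaborn

# One Seaborn call for the statistical plot, Matplotlib for the final touches
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.title("Tip vs. total bill")
plt.show()
```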


SweetViz & Dataprep

SweetViz and Dataprep are Python libraries designed to automate the EDA process. As described on its official package page, SweetViz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. The report generated by SweetViz presents numerical and statistical summaries for each feature of the input dataset. SweetViz can also be used to compare two datasets, a feature that can come in handy when creating training and testing sets for a machine learning model.
Dataprep, on the other hand, can be used for more than generating EDA reports: it provides APIs for both connecting to databases and cleaning datasets. Automatically cleaning a structured dataset is a very powerful feature that we can use to speed up the EDA process and leave more time for model development.
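
As a sketch of how little code an automated report takes (the file and DataFrame names are hypothetical):

```python
import pandas as pd
import sweetviz as sv

df = pd.read_csv("dataset.csv")   # hypothetical file

# Generate an HTML EDA report for a single dataset
report = sv.analyze(df)
report.show_html("eda_report.html")

# SweetViz can also compare two datasets, e.g. training vs. testing sets:
# sv.compare(train_df, test_df).show_html("comparison.html")
```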

Certainly, automating EDA tasks is very useful, but we need to be cautious when using EDA automation tools because they are always prone to error, and sometimes errors can propagate and cause unexpected problems in later steps of a machine learning project.

Multi-purpose Machine Learning

In this category, we will discuss the tools that I selected for developing ML models for different types of tasks (classical ML, Computer Vision, NLP, etc.). This is the most important category in this toolkit, because the tools discussed here serve what is at the heart of any ML project: developing ML models.
Becoming proficient with the suggested tools would put you in a position where you can tackle any machine learning challenge with a confident spirit and a very high probability of success.

Tools:

  1. Scikit-learn: Multi-purpose ML tool (Classification, Regression, Clustering, Preprocessing, Model selection). 
  2. TensorFlow & Keras: Multi-purpose DL tools (Creation and training of DL models for numerous tasks). 
  3. PyTorch: Multi-purpose DL tools (Creation and training of DL models for numerous tasks). 
  4. fastai: High-level API for DL tasks, built on top of PyTorch. 
  5. JAX/FLAX: FLAX is a high-performance neural network library for JAX (NumPy compiler on GPU). 
  6. PyCaret: Low-code machine learning tool.

Scikit-learn

Basically, Scikit-learn is a Python library that provides many unsupervised and supervised learning algorithms. Built upon NumPy, SciPy, and Matplotlib, Scikit-learn (or Sklearn) provides a large selection of efficient tools for machine learning and statistical modeling, including regression, classification, clustering, and dimensionality reduction, all via a consistent Python interface.

The functionality that Scikit-learn provides includes: 
  • Regression, including Linear and Logistic Regression. 
  • Classification, including K-Nearest Neighbors, SVM, and Decision Trees.
  • Clustering, including K-Means and K-Means++.
  • Model selection, including Train/Test split, and Scoring.
  • Preprocessing, including Min-Max Normalization.

I would definitely recommend Scikit-learn as the first machine learning library to learn. Furthermore, I also recommend implementing the algorithms that Scikit-learn provides from scratch using only NumPy, because by doing so you will acquire the basics of developing ML algorithms, and that will also help you understand what's going on under the hood of the predefined Scikit-learn functions so that you can better fine-tune the models you work on.
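
The snippet below sketches that consistent interface, split, fit, predict, and score, using the Iris dataset bundled with the library:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# The typical Scikit-learn workflow: split, fit, predict, score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```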


TensorFlow & Keras

As described on the official website, TensorFlow is an end-to-end machine learning platform. TensorFlow is the most popular deep learning framework in Python. It makes it easy for beginners and experts alike to create machine learning models for desktop, mobile, web, and cloud. In my opinion, what gave TensorFlow its huge reputation, other than its great features, is that it incorporates a higher-level API called Keras that is very beginner friendly and can be used to progressively advance towards more complicated tasks, such as modifying existing components (layers, training loops, etc.) or even creating new architectures. I also believe that becoming proficient in TensorFlow and Keras is crucial for building a solid career in Machine Learning.
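
To illustrate how approachable Keras is, here is a minimal sketch of a small fully connected classifier (the training data is left hypothetical):

```python
import tensorflow as tf

# A small fully connected classifier via the Keras Sequential API
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(x_train, y_train, epochs=5)   # x_train / y_train are hypothetical
```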

PyTorch

PyTorch is an open-source machine learning library that can be considered TensorFlow's main competitor with regard to popularity and usage in the worldwide ML community. Similar to TensorFlow, PyTorch specializes in tensor computations, automatic differentiation, and GPU acceleration, which makes it a very powerful tool for developing Deep Learning models. However, PyTorch is more popular in the research field than TensorFlow, and more researchers are switching to PyTorch each year, for multiple reasons.

Perhaps the most significant reason that started the migration from TensorFlow to PyTorch is that PyTorch is considered more Pythonic: TensorFlow 1.x used static computation graphs, while PyTorch uses dynamic graphs. With the release of TensorFlow 2.0, eager execution was introduced, which made working with TensorFlow's low-level API much easier and more Pythonic. But to this day, PyTorch is still considered more Pythonic; for more details, refer to this article.

In my opinion, the difference between the two frameworks' architectures is only relevant when it comes to doing research in the machine learning field. However, I would still recommend that you eventually learn both frameworks, because working with State Of The Art (SOTA) models is crucial in machine learning projects, and sometimes certain models are only available in one framework and not the other. Furthermore, switching from one framework to the other is quite natural and simple since, nowadays, both frameworks use similar syntax and architecture, and the logic behind creating models is the same.

That being said, I would still suggest that beginners start with TensorFlow & Keras because, as I mentioned earlier, Keras is the most beginner-friendly API for developing and customizing deep learning models.
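
For comparison, here is a minimal sketch of an equivalent model in PyTorch's define-by-run style (the batch is random data, just for illustration); note how similarly the two snippets read:

```python
import torch
import torch.nn as nn

# The same kind of small classifier, defined in PyTorch
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```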


fastai

I couldn't describe fastai better than the official documentation website does: fastai is a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches.

I chose fastai among the libraries in my toolkit because it provides a unique feature that I find quite powerful: fastai, as the name suggests, makes the process of training and testing SOTA models very fast. In a handful of lines of code, we can create an image classifier with VERY good performance.

I recommend learning fastai as an additional library to both TensorFlow and PyTorch, to make use of its higher-level components that speed up the prototyping process. All results obtained via fastai can then be reproduced in PyTorch, since fastai is built on top of PyTorch.
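
The quickstart pattern from the fastai docs gives a feel for how fast prototyping can be; this sketch assumes a recent fastai version (older releases call `vision_learner` `cnn_learner`):

```python
from fastai.vision.all import *

# Download the Oxford-IIIT Pets dataset and build dataloaders;
# in this dataset, cat images have file names starting with an uppercase letter
path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path / "images"),
    label_func=lambda name: name[0].isupper(),
    valid_pct=0.2, seed=42, item_tfms=Resize(224),
)

# Fine-tune a pretrained ResNet in two lines
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
```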


JAX/FLAX

FLAX is a deep learning library built on top of JAX, which can be described as a GPU compiler for NumPy with automatic differentiation. As stated in the official documentation, FLAX delivers an end-to-end, flexible user experience for researchers who use JAX with neural networks.

I have not learned FLAX yet; however, I believe this is a very promising library with great potential. For the moment, I suggest putting it on the watchlist, because it still needs development to be able to compete with PyTorch and TensorFlow.
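
Based on the official documentation, defining a model with FLAX's Linen API looks roughly like this (a sketch I have not battle-tested myself, in keeping with the caveat above):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

# A tiny MLP in FLAX's Linen API
class MLP(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Dense(128)(x))
        return nn.Dense(10)(x)

model = MLP()
# Parameters live outside the model object, in JAX's functional style
params = model.init(jax.random.PRNGKey(0), jnp.ones((1, 784)))
logits = model.apply(params, jnp.ones((1, 784)))
```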


PyCaret

PyCaret is an open-source Python machine learning library, created as the Python version of the caret library in R, to make performing standard tasks in a machine learning project simpler. It is designed to automate the principal steps of evaluating and comparing machine learning algorithms for regression and classification. The main benefit of the library is that a lot can be achieved with very few lines of code and little manual configuration.
On the other hand, PyCaret doesn't allow for much model customization. Thus, I suggest using this library to find the best model architecture for the task you're working on, then switching to other libraries like Scikit-learn that allow model customization and tuning; that is the sensible and efficient way to make use of PyCaret.
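
As a sketch of that workflow (the file and target column are hypothetical):

```python
import pandas as pd
from pycaret.classification import setup, compare_models

df = pd.read_csv("dataset.csv")   # hypothetical file with a "target" column

# Initialize the experiment, then train and rank many candidate models
s = setup(data=df, target="target", session_id=42)
best_model = compare_models()
print(best_model)
```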


Computer Vision

In this category, we'll discuss the tools that are specific to Computer Vision, meaning tools used for handling images and videos.

Tools:

  1. Opencv-python: Powerful tool for image processing and computer vision tasks.
  2. Scikit-image: Python Image processing and computer vision tool.
  3. Pillow: Python library mainly for image handling (open/save).
  4. PyTesseract: Python library for the Google Tesseract OCR engine.

Opencv-python

Opencv-python is the Python API for OpenCV, where CV stands for Computer Vision. This library can be used for a wide variety of computer vision and image processing tasks such as image transformation and filtering, color space conversion, video capture, and object detection.
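
For instance, a classic edge-detection pipeline takes only a few calls (the image file is hypothetical):

```python
import cv2

img = cv2.imread("photo.jpg")                    # hypothetical image file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # color space conversion
blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # filtering
edges = cv2.Canny(blurred, 50, 150)              # edge detection
cv2.imwrite("edges.jpg", edges)
```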

Scikit-image

Scikit-image is similar to Opencv-python; it is mostly used for image processing and manipulation. It is fairly simple to learn and doesn't require much time to master. Scikit-image can perform certain machine learning tasks; however, OpenCV is much more powerful in that area since it is much more developed. OpenCV also has a much bigger community, so we can find answers to our questions more quickly than for Scikit-image.
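
A comparable sketch in Scikit-image, using one of its bundled sample images:

```python
from skimage import data, color, filters, io

image = data.astronaut()            # sample image bundled with scikit-image
gray = color.rgb2gray(image)
edges = filters.sobel(gray)

# Rescale the float edge map to 8-bit before saving
io.imsave("edges.png", (edges / edges.max() * 255).astype("uint8"))
```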


PyTesseract

PyTesseract is an optical character recognition (OCR) tool for Python. It is built as a wrapper for Google's Tesseract-OCR Engine. It can come in handy for performing OCR in quite few lines of code to get a performance baseline.
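
A minimal sketch (it assumes the Tesseract engine itself is installed on the system, and the image file is hypothetical):

```python
from PIL import Image
import pytesseract

# Extract the text contained in an image
text = pytesseract.image_to_string(Image.open("receipt.png"))
print(text)
```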


Natural Language Processing

This category contains the python packages that are created specifically to handle text and to train models for different NLP tasks.

Tools:

  1. NLTK: The most popular NLP library in Python; can be used for various simple NLP tasks. 
  2. spaCy: Powerful library for simple and advanced NLP applications. 
  3. Transformers: Python library by Hugging Face that provides state-of-the-art transformer models, mainly for PyTorch. 
  4. Gensim: Python library for unsupervised topic modeling, retrieval by similarity, and other natural language processing functionalities.

NLTK

NLTK, or Natural Language ToolKit, is a Python package that provides us with various text processing libraries along with a lot of text datasets. A variety of tasks can be performed using NLTK, such as tokenization, lemmatization, POS tagging, etc.

For beginners in NLP, I would recommend starting with this library. It is easy to learn, and we can use it to understand the basics of text processing in Python. Moreover, NLTK can be used to create basic N-gram language models.
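
A minimal sketch of tokenization and lemmatization with NLTK (the corpora downloads are one-time):

```python
import nltk
nltk.download("punkt")      # tokenizer models (one-time download)
nltk.download("wordnet")    # lemmatizer data (one-time download)

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

tokens = word_tokenize("The cats are sitting on the mats.")
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
# ['The', 'cat', 'are', 'sitting', 'on', 'the', 'mat', '.']
```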

spaCy

spaCy is a free, open-source Python library that provides advanced capabilities for conducting natural language processing (NLP) on large volumes of text at high speed. It helps you build models and production applications that involve document analysis, chatbot capabilities, and all other forms of text analysis.
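
A minimal sketch of POS tagging and named-entity recognition with spaCy (it assumes the small English model has been installed via `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)   # part-of-speech and lemma

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, $1 billion MONEY
```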

Transformers

Transformers provides APIs to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time of training a model from scratch.
Transformers by Hugging Face is a very powerful library supported by a large community. It has a large collection of pretrained models, mainly for NLP but also for other applications such as Computer Vision. It mainly supports PyTorch, but also provides support for TensorFlow and JAX/FLAX.
This is a must-learn library for anyone willing to practice NLP, and for someone who has already learned PyTorch, it wouldn't represent much of a challenge to do so.
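
For a taste of how little code a pretrained model takes, here is a pipeline sketch (the model weights are downloaded on first run, and the exact score will vary by model version):

```python
from transformers import pipeline

# Sentiment analysis with a default pretrained model
classifier = pipeline("sentiment-analysis")
print(classifier("I love this toolkit!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```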

Gensim

Gensim, which stands for "Generate Similar", is a Python-based open-source framework for unsupervised topic modeling and natural language processing. It is a tool for extracting semantic concepts from documents, and it can handle extensive text collections by streaming data rather than loading everything into memory, which distinguishes it from other machine learning packages that target only in-memory processing. Gensim also provides efficient multicore implementations for several algorithms to improve processing speed.
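
A minimal sketch of topic modeling with Gensim on a toy corpus (the documents are made up for illustration):

```python
from gensim import corpora, models

documents = [["machine", "learning", "is", "fun"],
             ["deep", "learning", "uses", "neural", "networks"],
             ["topic", "modeling", "extracts", "themes"]]

# Map tokens to ids, then represent each document as a bag of words
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train a small LDA topic model
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```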

