A Beginner's Guide to Python Machine Learning and Data Science Frameworks

All libraries below are free, and most are open-source.

Machine Learning

General purpose Machine Learning

  • scikit-learn - machine learning in Python
  • Shogun - machine learning toolbox
  • xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package
  • Reproducible Experiment Platform (REP) - Machine Learning toolbox for Humans
  • modAL - a modular active learning framework for Python3
  • Sparkit-learn - PySpark + Scikit-learn = Sparkit-learn
  • mlpack - a scalable C++ machine learning library (Python bindings)
  • dlib - A toolkit for making real world machine learning and data analysis applications in C++ (Python bindings)
  • MLxtend - extension and helper modules for Python’s data analysis and machine learning libraries
  • tick - module for statistical learning, with a particular emphasis on time-dependent modelling
  • sklearn-extensions - a consolidated package of small extensions to scikit-learn
  • civisml-extensions - scikit-learn-compatible estimators from Civis Analytics
  • scikit-multilearn - multi-label classification for python
  • tslearn - machine learning toolkit dedicated to time-series data
  • seqlearn - a sequence classification toolkit for Python
  • pystruct - Simple structured learning framework for python
  • sklearn-expertsys - Highly interpretable classifiers for scikit-learn, producing easily understood decision rules instead of black box models
  • skutil - A set of scikit-learn and h2o extension classes (as well as caret classes for python)
  • sklearn-crfsuite - scikit-learn inspired API for CRFsuite
  • RuleFit - implementation of the rulefit
  • metric-learn - metric learning algorithms in Python
  • pyGAM - Generalized Additive Models in Python
  • luminol - Anomaly Detection and Correlation library

Automated machine learning

  • TPOT - Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming
  • auto-sklearn - an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
  • MLBox - a powerful Automated Machine Learning python library.

Ensemble methods

  • ML-Ensemble - high performance ensemble learning
  • brew - Python Ensemble Learning API
  • Stacking - Simple and useful stacking library, written in Python.
  • stacked_generalization - library for machine learning stacking generalization.
  • vecstack - Python package for stacking (machine learning technique)

Imbalanced datasets

  • imbalanced-learn - module to perform under sampling and over sampling with various techniques
  • imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data.
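
The over-sampling idea mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of random over-sampling only, not the API of imbalanced-learn or any other library listed here; the function name `random_oversample` and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    is as large as the biggest one (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(rows) for rows in by_class.values())
    out_samples, out_labels = [], []
    for y, rows in by_class.items():
        # Draw extra rows with replacement to reach the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data: five rows of class 0, one row of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Duplicating rows is the simplest form of over-sampling; dedicated libraries typically offer more techniques, such as generating synthetic minority samples.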

Random Forests

Extreme Learning Machine

  • Python-ELM - Extreme Learning Machine implementation in Python
  • Python Extreme Learning Machine (ELM) - a machine learning technique used for classification/regression tasks
  • hpelm [gpu] - High performance implementation of Extreme Learning Machines (fast randomized neural networks).

Kernel methods

  • pyFM - Factorization machines in python
  • fastFM - a library for Factorization Machines
  • tffm - TensorFlow implementation of an arbitrary order Factorization Machine
  • liquidSVM - an implementation of SVMs
  • scikit-rvm - Relevance Vector Machine implementation using the scikit-learn API
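
Several entries above implement factorization machines. As a rough sketch of what a second-order factorization machine computes, the snippet below evaluates the standard prediction formula (bias, linear terms, and pairwise interactions through latent factors) in two ways: the explicit sum over feature pairs, and the commonly used rewrite that is linear in the number of features. The function names and the toy weights are assumptions for illustration and do not correspond to pyFM, fastFM, or tffm.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction.

    x: feature vector (length n); w0: bias; w: linear weights (length n);
    V: n rows of k latent factors each.
    Uses sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2].
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same model, written as the explicit sum over feature pairs i < j."""
    n, k = len(x), len(V[0])
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            total += dot * x[i] * x[j]
    return total


# Toy parameters (made up for the example).
x = [1.0, 0.0, 2.0]
w0, w = 0.5, [0.1, -0.2, 0.3]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
print(fm_predict(x, w0, w, V), fm_predict_naive(x, w0, w, V))
```

Both functions should agree up to floating-point rounding; the first form is the one that makes factorization machines practical on wide, sparse inputs.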

Gradient boosting

  • XGBoost [gpu] - Scalable, Portable and Distributed Gradient Boosting
  • LightGBM [gpu] - a fast, distributed, high performance gradient boosting by Microsoft
  • CatBoost [gpu] - an open-source gradient boosting on decision trees library by Yandex
  • InfiniteBoost - building infinite ensembles with gradient descent
  • TGBoost - Tiny Gradient Boosting Tree
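
To give a feel for what the libraries above do, here is a deliberately tiny sketch of gradient boosting for squared-error regression on a single feature: each round fits a decision stump to the current residuals and adds a shrunken copy of it to the model. It is an illustration under simplifying assumptions (one feature, stumps only, no regularization), not how XGBoost, LightGBM, or CatBoost are implemented; the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold on one feature that best splits the
    residuals in the least-squares sense. Needs >= 2 distinct xs."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return a predict(x) function built from boosted stumps."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy step-shaped data (made up for the example).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mean_y = sum(ys) / len(ys)
model = gradient_boost(xs, ys)
print(mse(lambda x: mean_y, xs, ys), mse(model, xs, ys))
```

The training error of the boosted model should come out well below that of the constant mean predictor. Production libraries add much more on top of this loop, for example multi-feature trees, regularization, and GPU support.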

Deep Learning

Keras

  • Keras - a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano
  • keras-contrib - Keras community contributions
  • Hyperas - Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization
  • Elephas - Distributed Deep learning with Keras & Spark
  • Hera - Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.
  • dist-keras - Distributed Deep Learning, with a focus on distributed training
  • Conx - The On-Ramp to Deep Learning

PyTorch

  • PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
  • torchvision - Datasets, Transforms and Models specific to Computer Vision
  • torchtext - Data loaders and abstractions for text and NLP
  • torchaudio - an audio library for PyTorch
  • ignite - high-level library to help with training neural networks in PyTorch
  • PyToune - a Keras-like framework and utilities for PyTorch
  • skorch - a scikit-learn compatible neural network library that wraps pytorch
  • PyTorchNet - an abstraction to train neural networks
  • Aorun - intends to implement an API similar to Keras with PyTorch as backend.
  • pytorch_geometric - Geometric Deep Learning Extension Library for PyTorch

TensorFlow

  • TensorFlow - Computation using data flow graphs for scalable machine learning by Google
  • TensorLayer - Deep Learning and Reinforcement Learning Library for researchers and engineers.
  • TFLearn - Deep learning library featuring a higher-level API for TensorFlow
  • Sonnet - TensorFlow-based neural network library by DeepMind
  • TensorForce - a TensorFlow library for applied reinforcement learning
  • tensorpack - a Neural Net Training Interface on TensorFlow
  • Polyaxon - a platform that helps you build, manage and monitor deep learning models
  • Horovod - Distributed training framework for TensorFlow
  • tfdeploy - Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy
  • hiptensorflow [amd] - ROCm/HIP enabled TensorFlow
  • TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow
  • tensorlm - wrapper library for text generation / language models at char and word level with RNN
  • TensorLight - a high-level framework for TensorFlow
  • Mesh TensorFlow - Model Parallelism Made Easier

Theano

Warning: Theano development has ceased

  • Theano - a Python library that allows you to define, optimize, and evaluate mathematical expressions
  • Lasagne - Lightweight library to build and train neural networks in Theano
  • nolearn - scikit-learn compatible neural network library (mainly for Lasagne)
  • Blocks - a Theano framework for building and training neural networks
  • platoon - Multi-GPU mini-framework for Theano
  • NeuPy - a Python library for Artificial Neural Networks and Deep Learning
  • scikit-neuralnetwork - Deep neural networks without the learning cliff
  • Theano-MPI - MPI Parallel framework for training deep learning models built in Theano

MXNet

  • MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler
  • Gluon - a clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet)
  • MXbox - simple, efficient and flexible vision toolbox for mxnet framework.
  • gluon-cv - provides implementations of the state-of-the-art deep learning models in computer vision.
  • gluon-nlp - NLP made easy
  • MXNet [amd] - HIP Port of MXNet

Caffe

  • Caffe - a fast open framework for deep learning
  • Caffe2 - a lightweight, modular, and scalable deep learning framework
  • hipCaffe [amd] - the HIP port of Caffe

CNTK

  • CNTK - Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

Chainer

  • Chainer - a flexible framework for neural networks
  • ChainerRL - a deep reinforcement learning library built on top of Chainer.
  • ChainerCV - a Library for Deep Learning in Computer Vision
  • ChainerMN - scalable distributed deep learning with Chainer
  • scikit-chainer - scikit-learn like interface to chainer
  • chainer_sklearn - Sklearn (Scikit-learn) like interface for Chainer

Others

  • Neon - Intel Nervana™ reference deep learning framework committed to best performance on all hardware
  • Tangent - Source-to-Source Debuggable Derivatives in Pure Python
  • autograd - Efficiently computes derivatives of numpy code
  • Myia - deep learning framework (pre-alpha)
  • nnabla - Neural Network Libraries by Sony
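
Tangent and autograd both exist to compute derivatives of ordinary Python code. The sketch below shows the general idea of automatic differentiation using forward-mode dual numbers, which is one of the simplest variants. It is not a description of how those libraries are implemented, and the `Dual` class and `derivative` helper are hypothetical names that support only addition and multiplication.

```python
class Dual:
    """A number value + deriv*eps with eps**2 == 0.
    `deriv` carries the derivative alongside the value."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants (derivative 0).
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv


def poly(x):
    return x * x * x + 3 * x + 1  # analytically, f'(x) = 3x^2 + 3


print(derivative(poly, 2.0))  # → 15.0
```

The same function `poly` works on plain floats and on `Dual` inputs, which is the property that lets automatic differentiation tools operate on unmodified numerical code.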

Model explanation

  • Auralisation - auralisation of learned features in CNN (for audio)
  • CapsNet-Visualization - a visualization of the CapsNet layers to better understand how it works
  • lucid - a collection of infrastructure and tools for research in neural network interpretability.
  • Netron - visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks)
  • FlashLight - visualization Tool for your NeuralNetwork
  • tensorboard-pytorch - tensorboard for pytorch (and chainer, mxnet, numpy, …)
  • anchor - code for “High-Precision Model-Agnostic Explanations” paper
  • aequitas - Bias and Fairness Audit Toolkit
  • Contrastive Explanation - Contrastive Explanation (Foil Trees)
  • yellowbrick - visual analysis and diagnostic tools to facilitate machine learning model selection
  • scikit-plot - an intuitive library to add plotting functionality to scikit-learn objects
  • shap - a unified approach to explain the output of any machine learning model
  • ELI5 - a library for debugging/inspecting machine learning classifiers and explaining their predictions
  • Lime - Explaining the predictions of any machine learning classifier
  • FairML - a Python toolbox for auditing machine learning models for bias
  • L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
  • PDPbox - partial dependence plot toolbox
  • pyBreakDown - Python implementation of R package breakDown
  • PyCEbox - Python Individual Conditional Expectation Plot Toolbox
  • Skater - Python Library for Model Interpretation
  • tensorflow/model-analysis - Model analysis tools for TensorFlow
  • themis-ml - a library that implements fairness-aware machine learning algorithms
  • treeinterpreter [skl] - interpreting scikit-learn’s decision tree and random forest predictions

Reinforcement Learning

  • OpenAI Gym - a toolkit for developing and comparing reinforcement learning algorithms.

Distributed computing systems

  • PySpark - exposes the Spark programming model to Python
  • Veles - Distributed machine learning platform by Samsung
  • Jubatus - Framework and Library for Distributed Online Machine Learning
  • DMTK - Microsoft Distributed Machine Learning Toolkit
  • PaddlePaddle - PArallel Distributed Deep LEarning by Baidu
  • dask-ml - Distributed and parallel machine learning
  • Distributed - Distributed computation in Python

Probabilistic methods

  • pomegranate [cp] - probabilistic and graphical models for Python
  • pyro - a flexible, scalable deep probabilistic programming library built on PyTorch.
  • ZhuSuan - Bayesian Deep Learning
  • PyMC - Bayesian Stochastic Modelling in Python
  • PyMC3 - Python package for Bayesian statistical modeling and Probabilistic Machine Learning
  • sampled - Decorator for reusable models in PyMC3
  • Edward - A library for probabilistic modeling, inference, and criticism.
  • InferPy - Deep Probabilistic Modelling Made Easy
  • GPflow - Gaussian processes in TensorFlow
  • PyStan - Bayesian inference using the No-U-Turn sampler (Python interface)
  • gelato - Bayesian dessert for Lasagne
  • sklearn-bayes - Python package for Bayesian Machine Learning with scikit-learn API
  • bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models
  • PyFlux - Open source time series library for Python
  • skggm - estimation of general graphical models
  • pgmpy - a python library for working with Probabilistic Graphical Models.
  • skpro - supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute
  • Aboleth - a bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation
  • PtStat - Probabilistic Programming and Statistical Inference in PyTorch
  • PyVarInf - Bayesian Deep Learning methods with Variational Inference for PyTorch
  • emcee - The Python ensemble sampling toolkit for affine-invariant MCMC
  • hsmmlearn - a library for hidden semi-Markov models with explicit durations
  • pyhsmm - bayesian inference in HSMMs and HMMs
  • GPyTorch - a highly efficient and modular implementation of Gaussian Processes in PyTorch
  • Bayes - Python implementations of Naive Bayes algorithm variants

Genetic Programming

  • gplearn - Genetic Programming in Python
  • DEAP - Distributed Evolutionary Algorithms in Python
  • karoo_gp - A Genetic Programming platform for Python with GPU support
  • monkeys - A strongly-typed genetic programming framework for Python
  • sklearn-genetic - Genetic feature selection module for scikit-learn

Optimization

  • Spearmint - Bayesian optimization
  • SMAC3 - Sequential Model-based Algorithm Configuration
  • Optunity - a library containing various optimizers for hyperparameter tuning.
  • hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python
  • hyperopt-sklearn - hyper-parameter optimization for sklearn
  • sklearn-deap - use evolutionary algorithms instead of gridsearch in scikit-learn
  • sigopt_sklearn - SigOpt wrappers for scikit-learn methods
  • Bayesian Optimization - A Python implementation of global optimization with gaussian processes.
  • SafeOpt - Safe Bayesian Optimization
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface
  • Solid - A comprehensive gradient-free optimization framework written in Python
  • PySwarms - A research toolkit for particle swarm optimization in Python
  • Platypus - A Free and Open Source Python Library for Multiobjective Optimization
  • GPflowOpt - Bayesian Optimization using GPflow
  • POT - Python Optimal Transport library
  • Talos - Hyperparameter Optimization for Keras Models
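
Many of the tools above automate hyperparameter tuning. As a baseline for comparison, the simplest strategy is random search: sample configurations from a search space, score each one, and keep the best. The sketch below uses only the standard library; `random_search`, `toy_loss`, and the parameter names `lr` and `reg` are assumptions for illustration, with `toy_loss` standing in for a real validation loss.

```python
import random


def random_search(objective, space, n_trials=200, seed=0):
    """Sample each parameter uniformly from its (low, high) range
    and keep the trial with the lowest objective value."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(params):
    # Hypothetical stand-in for a validation loss,
    # with its minimum at lr=0.1, reg=1.0.
    return (params["lr"] - 0.1) ** 2 + (params["reg"] - 1.0) ** 2


space = {"lr": (0.0, 1.0), "reg": (0.0, 2.0)}
best_params, best_score = random_search(toy_loss, space)
print(best_params, best_score)
```

Bayesian and evolutionary optimizers such as those listed above aim to reach a good configuration in fewer trials by using the results of earlier trials to choose the next one.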

Natural Language Processing

  • NLTK - modules, data sets, and tutorials supporting research and development in Natural Language Processing
  • CLTK - The Classical Language Toolkit
  • gensim - Topic Modelling for Humans
  • PSI-Toolkit - a natural language processing toolkit by Adam Mickiewicz University in Poznań
  • pyMorfologik - Python binding for Morfologik (Polish morphological analyzer)
  • skift - scikit-learn wrappers for Python fastText.
  • Phonemizer - Simple text to phonemes converter for multiple languages

Computer Audition

  • librosa - Python library for audio and music analysis
  • Yaafe - Audio features extraction
  • aubio - a library for audio and music analysis
  • Essentia - library for audio and music analysis, description and synthesis
  • LibXtract - a simple, portable, lightweight library of audio feature extraction functions
  • Marsyas - Music Analysis, Retrieval and Synthesis for Audio Signals
  • muda - a library for augmenting annotated audio data
  • madmom - Python audio and music signal processing library

Computer Vision

  • OpenCV - Open Source Computer Vision Library
  • scikit-image - Image Processing SciKit (Toolbox for SciPy)
  • imgaug - image augmentation for machine learning experiments
  • imgaug_extension - additional augmentations for imgaug
  • Augmentor - Image augmentation library in Python for machine learning
  • albumentations - fast image augmentation library and easy to use wrapper around other libraries

Feature engineering

  • Featuretools - automated feature engineering
  • scikit-feature - feature selection repository in python
  • skl-groups - scikit-learn addon to operate on set/"group"-based features
  • Feature Forge - a set of tools for creating and testing machine learning features
  • boruta_py - implementations of the Boruta all-relevant feature selection method
  • BoostARoota - a fast xgboost feature selection algorithm
  • few - a feature engineering wrapper for sklearn
  • scikit-rebate - a scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning
  • scikit-mdr - a sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction.
  • tsfresh - Automatic extraction of relevant features from time series

Data manipulation & pipelines

  • pandas - powerful Python data analysis toolkit
  • sklearn-pandas - Pandas integration with sklearn
  • alexander - wrapper that aims to make scikit-learn fully compatible with pandas
  • blaze - NumPy and Pandas interface to Big Data
  • pandasql - allows you to query pandas DataFrames using SQL syntax
  • pandas-gbq - Pandas Google Big Query
  • xpandas - universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute
  • Fuel - data pipeline framework for machine learning
  • Arctic - high performance datastore for time series and tick data
  • pdpipe - easy pipelines for pandas DataFrames.
  • SSPipe - Python pipe (|) operator with support for DataFrames and Numpy and Pytorch
  • meza - a Python toolkit for processing tabular data
  • pandas-ply - functional data manipulation for pandas
  • Dplython - Dplyr for Python
  • pysparkling - a pure Python implementation of Apache Spark’s RDD and DStream interfaces
  • quinn - pyspark methods to enhance developer productivity
  • Dataset - helps you conveniently work with random or sequential batches of your data and define data processing
  • swifter - a package which efficiently applies any function to a pandas dataframe or series in the fastest available manner

Statistics

  • statsmodels - statistical modeling and econometrics in Python
  • stockstats - Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
  • simplestatistics - simple statistical functions implemented in readable Python.
  • weightedcalcs - pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more
  • scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests
  • pysie - provides python implementation of statistical inference engine

Experiments tools

  • Sacred - a tool to help you configure, organize, log and reproduce experiments by IDSIA
  • Xcessiv - a web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling
  • Persimmon - A visual dataflow programming language for sklearn

Visualization

  • Matplotlib - plotting with Python
  • seaborn - statistical data visualization using matplotlib
  • Bokeh - Interactive Web Plotting for Python
  • HoloViews - stop plotting your data - annotate your data and let it visualize itself
  • Alphalens - performance analysis of predictive (alpha) stock factors by Quantopian
  • python-ternary - ternary plotting library for python with matplotlib
  • Naarad - framework for performance analysis & rating of sharded & stateful services.

Evaluation

Computations

  • numpy - the fundamental package needed for scientific computing with Python.
  • Dask - parallel computing with task scheduling
  • bottleneck - Fast NumPy array functions written in C
  • minpy - NumPy interface with mixed backend execution
  • CuPy - NumPy-like API accelerated with CUDA
  • scikit-tensor - Python library for multilinear algebra and tensor factorizations
  • numdifftools - solve automatic numerical differentiation problems in one or more variables
  • quaternion - Add built-in support for quaternions to numpy
  • adaptive - Tools for adaptive and parallel sampling of mathematical functions

Spatial analysis

  • GeoPandas - Python tools for geographic data
  • PySal - Python Spatial Analysis Library

Quantum Computing

  • QML - a Python Toolkit for Quantum Machine Learning

Conversion

  • sklearn-porter - transpile trained scikit-learn estimators to C, Java, JavaScript and others
  • ONNX - Open Neural Network Exchange
  • MMdnn - a set of tools to help users inter-operate among different deep learning frameworks.

See Also

Chris V. Nicholson

Chris V. Nicholson is a venture partner at Page One Ventures. He previously led Pathmind and Skymind. In a prior life, Chris spent a decade reporting on tech and finance for The New York Times, Businessweek and Bloomberg, among others.