Jon Krohn

The Modern Data Stack, with Harry Glaser

Added on July 25, 2023 by Jon Krohn.

Today, the eloquent Harry Glaser details the Modern Data Stack, including cloud collaboration tools (like Deepnote), running ML models from data warehouses (like Snowflake), using dbt Labs for model orchestration, and model-deployment best practices.

Harry:
• Is Co-Founder and CEO of Modelbit, a San Francisco-based startup that has raised $5m in venture capital to make the productionization of machine learning models as fast and as simple as possible.
• Previously, was Co-Founder and CEO of Periscope Data, a code-driven analytics platform that was acquired by Sisense for $130m.
• And, prior to that, was a product manager at Google.
• Holds a degree in Computer Science from the University of Rochester.

Today’s episode is squarely targeted at practicing data scientists but could be of interest to anyone who’d like to enrich their understanding of the modern data stack and how ML models are deployed into production applications.

In the episode, Harry details:
• The major tools available for developing ML models.
• The best practices for model deployment, such as version control, CI/CD, load balancing, and logging (a generic sketch of two of these practices follows this list).
• The data warehouse options for running models.
• What model orchestration is.
• How BI tools can be leveraged to collaborate on model prototypes across your organization.
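
To make the deployment best practices above concrete, here is a generic, minimal sketch of a model server that pins a model version and logs every prediction. It is illustrative only: the web framework (FastAPI), the artifact filename, and the endpoint are my own assumptions, not Modelbit's API or the specific stack Harry describes.

```python
# Generic sketch: version pinning + prediction logging for a deployed model.
# Illustrative only -- the artifact path and endpoint are hypothetical.
import logging
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-server")

MODEL_VERSION = "2023-07-25"  # pin the exact artifact being served
with open(f"model-{MODEL_VERSION}.pkl", "rb") as f:  # hypothetical artifact
    model = pickle.load(f)

app = FastAPI()

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    # Log inputs, outputs, and version so production behavior is auditable.
    logger.info("version=%s input=%s output=%s",
                MODEL_VERSION, features.values, prediction)
    return {"model_version": MODEL_VERSION, "prediction": float(prediction)}
```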

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Podcast, SuperDataScience, YouTube. Tags: ML, CI/CD, data science, data stack

How Firms Can Actually Adopt A.I., with Rehgan Avon

Added on July 23, 2023 by Jon Krohn.

Rehgan Avon's DataConnect conference is this week and is getting rave reviews. In this SuperDataScience episode, the silver-tongued entrepreneur details how organizations can successfully adopt A.I.

In Data Science, Podcast, Professional Development, SuperDataScience, YouTube. Tags: ai, generativeai, Data Science

The (Short) Path to Artificial General Intelligence, with Dr. Ben Goertzel

Added on July 18, 2023 by Jon Krohn.

Today, the luminary Dr. Ben Goertzel details how we could realize Artificial General Intelligence (AGI) in 3-7 years, why he's optimistic about the Artificial Super Intelligence (ASI) this would trigger, and what post-Singularity society could be like.

Dr. Goertzel:
• Is CEO of SingularityNET, a decentralized open market for A.I. models that aims to bring about AGI and thus the singularity that would transform society beyond all recognition.
• Has been Chairman of The AGI Society for 14 years.
• Has been Chairman of the foundation behind OpenCog — an open-source AGI framework — for 16 years.
• Was previously Chief Scientist at Hanson Robotics Limited, the company behind Sophia, the world’s most recognizable humanoid robot.
• Holds a PhD in mathematics from Temple University and held tenure-track professorships prior to transitioning to industry.

Today’s episode has parts that are relatively technical, but much of the episode will appeal to anyone who wants to understand how AGI — a machine that has all of the cognitive capabilities of a human — could be brought about and the world-changing impact that would have.

In the episode, Ben details: 
• The specific approaches that could be integrated with deep learning to realize, in his view, AGI in as few as 3-7 years.
• Why the development of AGI would near-instantly trigger the development of ASI — a machine with intellectual capabilities far beyond humans’.
• Why, despite triggering the singularity — beyond which we cannot make confident predictions about the future — he’s optimistic that AGI will be a positive development for humankind.
• The connections between self-awareness, consciousness and the ASI of the future.
• With admittedly wide error bars, what a society that includes ASI may look like.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Podcast, SuperDataScience, YouTube. Tags: AGI, ASI, singularity, AI

Brain-Computer Interfaces and Neural Decoding, with Prof. Bob Knight

Added on July 14, 2023 by Jon Krohn.

In today's extraordinary episode, Prof. Bob Knight details how ML-powered brain-computer interfaces (BCIs) could allow real-time thought-to-speech synthesis and the reversal of cognitive decline associated with aging.

This is a rare treat as "Dr. Bob" doesn't use social media and has only made two previous podcast appearances: on Ira Flatow's "Science Friday" and a little-known program called "The Joe Rogan Experience".

Dr. Bob:
• Is Professor of Neuroscience and Psychology at University of California, Berkeley.
• Is Adjunct Professor of Neurology and Neurosurgery at UC San Francisco.
• Over his career, has amassed tens of millions of dollars in research funding, 75 patents, and countless international awards for neuroscience and cognitive computing research.
• Has published hundreds of papers that have together been cited over 70,000 times.

In this episode, Bob details:
• Why the “prefrontal cortex” region of our brains makes us uniquely intelligent relative to all the other species on this planet. 
• The invaluable data that can be gathered by putting recording electrodes through our skulls and directly into our brains.
• How "dynamic time-warping" algorithms allow him to decode imagined sounds, even musical melodies, through recording electrodes implanted into the brain.
• How BCIs are life-changing for a broad range of illnesses today.
• The extraordinary ways that advances in hardware and machine learning could revolutionize medical care with BCIs in the coming years.
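
For the curious, below is a minimal sketch of the core dynamic-time-warping idea referenced above: a dynamic-programming alignment that lets two sequences match even when one is stretched in time. This is just the textbook algorithm, not Dr. Bob's neural-decoding pipeline, and the example signals are invented.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic-time-warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch a
                                 cost[i, j - 1],      # stretch b
                                 cost[i - 1, j - 1])  # step both
    return float(cost[n, m])

# The same "melody" played at two tempos still aligns closely.
melody = np.sin(np.linspace(0, 4 * np.pi, 50))
slower = np.sin(np.linspace(0, 4 * np.pi, 80))
print(dtw_distance(melody, slower))
```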

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Five-Minute Friday, Interview, Podcast, Professional Development, SuperDataScience, YouTube. Tags: BCIs, ML, Neural decoding

NLP with Transformers, feat. Hugging Face’s Lewis Tunstall

Added on July 11, 2023 by Jon Krohn.

Lewis Tunstall — brilliant author of the bestseller "NLP with Transformers" and an ML Engineer at Hugging Face — today details how to train and deploy your own LLMs, the race for an open-source ChatGPT, and why RLHF leads to better models.

Dr. Tunstall:
• Is an ML Engineer at Hugging Face, one of the most important companies in data science today because they provide much of the most critical infrastructure for A.I. through open-source projects such as their ubiquitous Transformers library, which has a staggering 100,000 stars on GitHub.
• Is a member of Hugging Face’s prestigious research team, where he is currently focused on bringing us closer to having an open-source equivalent of ChatGPT by building tools that support RLHF (reinforcement learning from human feedback) and large-scale model evaluation.
• Authored “Natural Language Processing with Transformers”, an exceptional bestselling book that was published by O'Reilly last year and covers how to train and deploy Large Language Models (LLMs) using open-source libraries.
• Prior to Hugging Face, was an academic at the University of Bern in Switzerland and held data science roles at several Swiss firms.
• Holds a PhD in theoretical and mathematical physics from the University of Adelaide in Australia.

Today’s episode is definitely on the technical side, so it will likely appeal most to folks like data scientists and ML engineers, but as usual I made an effort to break down the technical concepts Lewis covered so that anyone who’s keen to be aware of the cutting edge in NLP can follow along.

In the episode, Lewis details:
• What transformers are.
• Why transformers have become the default model architecture in NLP in just a few years.
• How to train NLP models when you have little to no labeled data available.
• How to optimize LLMs for speed when deploying them into production.
• How you can optimally leverage the open-source Hugging Face ecosystem, including their Transformers library and their hub for ML models and data.
• How RLHF aligns LLMs with the outputs users would like.
• How open-source efforts could soon meet or surpass the capabilities of commercial LLMs like ChatGPT.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Podcast, SuperDataScience, YouTube. Tags: NLP, LLM, AI, data science, hugging face, transformers

CatBoost: Powerful, efficient ML for large tabular datasets

Added on July 7, 2023 by Jon Krohn.

CatBoost is making waves in open-source ML as it's often the top approach for tasks as diverse as classification, regression, ranking, and recommendation. This is especially so if working with tabular data that include categorical variables.

With this justifiable excitement in mind, today's "Five-Minute Friday" episode of SuperDataScience is dedicated to CatBoost (short for “category” and “boosting”).

CatBoost has been around since 2017, when it was released by Yandex, a tech giant based in Moscow. In a nutshell, CatBoost — like the more established (and regularly Kaggle-leaderboard-topping) approaches XGBoost and LightGBM — is at its heart a decision-tree algorithm that leverages gradient boosting. So that explains the “boost” part of CatBoost.

The “cat” (“category”) part comes from CatBoost’s superior handling of categorical features. If you’ve trained models with categorical data before, you’ve likely experienced the tedium of preprocessing and feature engineering with categorical data. CatBoost comes to the rescue here, efficiently dealing with categorical variables by implementing a novel algorithm that eliminates the need for extensive preprocessing or manual feature engineering. CatBoost handles categorical features automatically by employing techniques such as target encoding and one-hot encoding.

In addition to CatBoost’s superior handling of categorical features, the algorithm also makes use of:
• A specialized gradient-based optimization scheme known as Ordered Boosting, which computes each training example’s residuals using only models fit on the examples that precede it in a permutation, minimizing the loss function efficiently while avoiding target leakage.
• Symmetric decision trees, which apply the same split across an entire level of the tree; this fixed, balanced structure enables faster training than XGBoost and training times comparable to LightGBM, which is famous for its speed.
• Regularization techniques such as the well-known L2 penalty, which, together with the ordered boosting and symmetric trees just discussed, make CatBoost less likely to overfit to training data than other boosted-tree algorithms.
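
To see how little categorical preprocessing CatBoost demands, here is a minimal sketch using its Python package; the toy dataset and hyperparameter values are invented for illustration.

```python
from catboost import CatBoostClassifier, Pool

# Toy data: one categorical column, one numeric column, binary labels.
X = [["red", 1.2], ["blue", 0.7], ["green", 3.4],
     ["red", 2.1], ["blue", 0.3], ["green", 1.8]]
y = [0, 1, 0, 1, 0, 1]

# Declare column 0 as categorical; CatBoost encodes it internally,
# so no manual one-hot or target encoding is needed.
train_pool = Pool(X, y, cat_features=[0])

model = CatBoostClassifier(iterations=100, depth=4, verbose=0)
model.fit(train_pool)
print(model.predict([["red", 1.0]]))
```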

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Computer Science, Five-Minute Friday, SuperDataScience, YouTube. Tags: CatBoost, ML, data science, datasets, ai

YOLO-NAS: The State of the Art in Machine Vision, with Harpreet Sahota

Added on July 4, 2023 by Jon Krohn.

Deci's YOLO-NAS architecture provides today's state of the art in Machine Vision, specifically the key task of Object Detection. Harpreet Sahota joins us from Deci today to detail YOLO-NAS as well as where Computer Vision is going next.

Harpreet:
• Leads the deep learning developer community at Deci AI, an Israeli startup that has raised over $55m in venture capital and that recently open-sourced the YOLO-NAS deep learning model architecture.
• Has amassed a social-media following of more than 70,000 through prolific data science content creation, including The Artists of Data Science podcast and his LinkedIn live streams.
• Previously worked as a lead data scientist and as a biostatistician.
• Holds a master’s in mathematics and statistics from Illinois State University.

Today’s episode will likely appeal most to technical practitioners like data scientists, but we did our best to break down technical concepts so that anyone who’d like to understand the latest in machine vision can follow along.

In the episode, Harpreet details:
• What exactly object detection is.
• How object detection models are evaluated.
• How machine vision models have evolved to excel at object detection, with an emphasis on the modern deep learning approaches.
• How a “neural architecture search” algorithm enabled Deci to develop YOLO-NAS, an optimal object detection model architecture.
• The technical approaches that will enable large architectures like YOLO-NAS to be compute-efficient enough to run on edge devices.
• His “top-down” approach to learning deep learning, including his recommended learning path.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Podcast, SuperDataScience, YouTube. Tags: YOLO-NAS, object detection, Machine Vision, data science, ai

Lossless LLM Weight Compression: Run Huge Models on a Single GPU

Added on June 30, 2023 by Jon Krohn.

Many recent episodes have been focused on open-source Large Language Models that you can download and fine-tune to particular use cases depending on your needs or your users’ needs. I’ve particularly been highlighting LLMs with seven billion to 13 billion model parameters because models of this size can typically be run on a single consumer GPU, making them relatively manageable and affordable both to train and to have in production.
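
As a hedged illustration of why this scale is so manageable, here is a sketch of loading a roughly-7B-parameter LLM in 4-bit precision on a single GPU with Hugging Face transformers and bitsandbytes. The model name is an arbitrary choice, and note that this off-the-shelf 4-bit quantization is lossy, unlike the near-lossless compression techniques this episode covers.

```python
# Sketch: 4-bit loading of a ~7B-parameter LLM (assumes transformers +
# bitsandbytes installed and a CUDA GPU available). Model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "tiiuae/falcon-7b"  # any ~7B causal LM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run the math in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers onto the available GPU(s)
)

inputs = tokenizer("Quantization lets large models", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```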

In Five-Minute Friday, Data Science, Podcast, SuperDataScience, YouTube. Tags: LLMs, GPU, AI, Quantization, QLoRA, SpQR

A.I. Accelerators: Hardware Specialized for Deep Learning

Added on June 27, 2023 by Jon Krohn.

Today we’ve got an episode dedicated to the hardware we use to train and run A.I. models (particularly LLMs) such as GPUs, TPUs and AWS's Trainium and Inferentia chips. Ron Diamant may be the best guest on earth for this fascinating topic.

Ron:
• Works at Amazon Web Services (AWS) where he is Chief Architect for their A.I. Accelerator chips, which are designed specifically for training (and making inferences with) deep learning models.
• Holds over 200 patents across a broad range of processing hardware, including security chips, compilers and, of course, A.I. accelerators.
• Has been at AWS for nearly nine years – since the acquisition of the Israeli hardware company Annapurna Labs, where he served as an engineer and project manager.
• Holds a Master’s in Electrical Engineering from the Technion (Israel Institute of Technology).

Today’s episode is on the technical side but doesn’t assume any particular hardware expertise. It’s primarily targeted at people who train or deploy machine learning models but might be accessible to a broader range of listeners who are curious about how computer hardware works.

In the episode, Ron details: 
• CPUs versus GPUs.
• GPUs versus specialized A.I. Accelerators such as Tensor Processing Units (TPUs) and his own Trainium and Inferentia chips.
• The “AI Flywheel” effect between ML applications and hardware innovations.
• The complex tradeoffs he has to consider when embarking upon a multi-year chip-design project.
• The various ways we can split up training and inference across our available devices once we reach Large Language Model-scale models with billions of parameters.
• How to get popular ML libraries like PyTorch and TensorFlow to interact optimally with A.I. accelerator chips.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Podcast, SuperDataScience, YouTube, Interview. Tags: ai, GPU, TPU, AWS, LLMs

How to Catch and Fix Harmful Generative A.I. Output

Added on June 23, 2023 by Jon Krohn.

Today, the A.I. entrepreneur Krishna Gade joins me to detail open-source solutions for overcoming the safety and security issues associated with generative A.I. systems, such as those powered by Large Language Models (LLMs).

The remarkably well-spoken Krishna:
• Is Co-Founder and CEO of Fiddler AI, an observability platform that has raised over $45m in venture capital to build trust in A.I. systems.
• Previously worked as an engineering manager on Facebook’s Newsfeed, as Head of Data Engineering at Pinterest, and as a software engineer at both Twitter and Microsoft.
• Holds a Master’s in Computer Science from the University of Minnesota.

In this episode, Krishna details:
• How the LLMs that enable Generative A.I. are prone to making inaccurate statements, can be biased against protected groups, and are susceptible to exposing private data.
• How these undesirable and even harmful LLM outputs can be identified and remedied with open-source solutions like the Fiddler Auditor that his team has built.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Podcast, SuperDataScience, YouTube. Tags: generativeai, AI, data engineering, LLMs

Observing LLMs in Production to Automatically Catch Issues

Added on June 20, 2023 by Jon Krohn.

Today, Amber Roberts and Xander Song provide a technical deep dive into the major challenges (such as drift) that A.I. systems (particularly LLMs) face in production. They also detail solutions, such as open-source ML Observability tools.

Both Amber and Xander work at Arize AI, an ML observability platform that has raised over $60m in venture capital.

Amber:
• Serves as an ML Growth Lead at Arize, where she has also been an ML engineer.
• Prior to Arize, worked as an AI/ML product manager at Splunk and as the head of A.I. at Insight Data Science.
• Holds a Master’s in Astrophysics from the Universidad de Chile in South America.

Xander:
• Serves as a developer advocate at Arize, specializing in their open-source projects.
• Prior to Arize, he spent three years as an ML engineer.
• Holds a Bachelor’s in Mathematics from UC Santa Barbara as well as a BA in Philosophy from the University of California, Berkeley.

Today’s episode will appeal primarily to technical folks like data scientists and ML engineers, but we made an effort to break down technical concepts so that it’s accessible to anyone who’d like to understand the major issues that A.I. systems can develop once they’re in production as well as how to overcome these issues.

In the episode, Amber and Xander detail:
• The kinds of drift that can adversely impact a production A.I. system, with a particular focus on the issues that can affect Large Language Models (LLMs).
• What ML Observability is and how it builds upon ML Monitoring to automate the discovery and resolution of production A.I. issues.
• Open-source ML Observability options.
• How frequently production models should be retrained.
• How ML Observability relates to discovering model biases against particular demographic groups.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Podcast, SuperDataScience, YouTube. Tags: ML observability, LLM, AI

Six Reasons Why Building LLM Products Is Tricky

Added on June 16, 2023 by Jon Krohn.

Many of my recent podcast episodes have focused on the bewildering potential of fine-tuning open-source Large Language Models (LLMs) to your specific needs. There are, however, six big challenges when bringing LLMs to your users:

1. Strictly limited context windows (a token-counting sketch follows this list)
2. LLMs are slow and compute-intensive at inference time
3. "Engineering" reliable prompts can be tricky
4. Prompt-injection attacks make you vulnerable to data and IP theft
5. LLMs aren't (usually) products on their own
6. There are legal and compliance issues
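
As a small illustration of challenge #1, here is a sketch that counts tokens with OpenAI's open-source tiktoken tokenizer so a prompt fits within a model's context window. The 4,096-token window and the output budget are illustrative assumptions; every LLM has its own limit.

```python
import tiktoken

CONTEXT_WINDOW = 4096      # illustrative limit; varies by model
RESERVED_FOR_OUTPUT = 512  # leave room for the model's reply

enc = tiktoken.get_encoding("cl100k_base")

def fits(prompt: str) -> bool:
    """True if the prompt leaves enough room for the reply."""
    return len(enc.encode(prompt)) <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

def truncate(prompt: str) -> str:
    """Crude head-truncation down to the available token budget."""
    budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT
    return enc.decode(enc.encode(prompt)[:budget])

print(fits("How many tokens is this?"))  # True
```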

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Five-Minute Friday, SuperDataScience, YouTube. Tags: llms, ai, Data Science

Generative Deep Learning, with David Foster

Added on June 13, 2023 by Jon Krohn.

Today, bestselling author David Foster provides a fascinating technical introduction to cutting-edge Generative A.I. concepts including variational autoencoders, diffusion models, contrastive learning, GANs and (my favorite!) "world models".

David:
• Wrote the O'Reilly book “Generative Deep Learning”; the first edition from 2019 was a bestseller while the second edition was released just last week.
• Is a Founding Partner of Applied Data Science Partners, a London-based consultancy specialized in end-to-end data science solutions.
• Holds a Master’s in Mathematics from the University of Cambridge and a Master’s in Management Science and Operational Research from the University of Warwick.

Today’s episode is deep in the weeds on generative deep learning pretty much from beginning to end and so will appeal most to technical practitioners like data scientists and ML engineers.

In the episode, David details: 
• How generative modeling is different from the discriminative modeling that dominated machine learning until just the past few months.
• The range of application areas of generative A.I.
• How autoencoders work and why variational autoencoders are particularly effective for generating content.
• What diffusion models are and how latent diffusion in particular results in photorealistic images and video.
• What contrastive learning is.
• Why “world models” might be the most transformative concept in A.I. today.
• What transformers are, how variants of them power different classes of generative models such as BERT architectures and GPT architectures, and how blending generative adversarial networks with transformers supercharges multi-modal models.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Computer Science, Interview, Podcast, SuperDataScience, YouTube. Tags: ai, ml, Data Science, generativeai

Open-Source “Responsible A.I.” Tools, with Ruth Yakubu

Added on June 9, 2023 by Jon Krohn.

In today's episode, Ruth Yakubu details what Responsible A.I. is and open-source options for ensuring we deploy A.I. models — particularly the Generative variety that are rapidly transforming industries — responsibly.

Ruth:
• Has been a cloud expert at Microsoft for nearly seven years; for the past two, she’s been a Principal Cloud Advocate who specializes in A.I.
• Previously worked as a software engineer and manager at Accenture.
• Has been a featured speaker at major global conferences like Websummit.
• Studied computer science at the University of Minnesota.

In this episode, Ruth details:
• The six principles that determine whether a given A.I. model is responsible or not.
• The open-source Responsible A.I. Toolbox that allows you to quickly assess how your model fares across a broad range of Responsible A.I. metrics.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Podcast, SuperDataScience, YouTube. Tags: AI, Microsoft, generativeai, responsible AI

Tools for Building Real-Time Machine Learning Applications, with Richmond Alake

Added on June 6, 2023 by Jon Krohn.

Today, the astonishingly industrious ML Architect and entrepreneur Richmond Alake crisply describes how to rapidly develop robust and scalable Real-Time Machine Learning applications.

Richmond:
• Is a Machine Learning Architect at Slalom Build, a huge Seattle-based consultancy that builds products embedded with analytics and ML.
• Is Co-Founder of two startups: one uses computer vision to correct people’s form in the gym and the other is a generative A.I. startup that works with human speech.
• Creates/delivers courses for O'Reilly and writes for NVIDIA.
• Previously worked as a Computer Vision Engineer and as a Software Developer.
• Holds a Master’s in Computer Vision, ML and Robotics from the University of Surrey.

Today’s episode will appeal most to technical practitioners, particularly those who incorporate ML into real-time applications, but there’s a lot in this episode for anyone who’d like to hear about the latest tools for developing real-time ML applications from a leader in the field.

In this episode, Richmond details:
• The software choices he’s made up and down the application stack — from databases to ML to the front-end — across his startups and the consulting work he does.
• The most valuable real-time ML tools he teaches in his courses.
• Why writing for the public is an invaluable career hack that everyone should be taking advantage of.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Podcast, Professional Development, SuperDataScience, YouTube. Tags: ML, AI, ML applications, ML Architect

Get More Language Context out of your LLM

Added on June 2, 2023 by Jon Krohn.

The "context window" limits the number of words that can be input to (or output by) a given Large Language Model. Today's episode introduces FlashAttention, a trick that allows for much larger context windows.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Five-Minute Friday, SuperDataScience, YouTube. Tags: flash attention, ai, Data Science

Contextual A.I. for Adapting to Adversaries, with Dr. Matar Haller

Added on May 30, 2023 by Jon Krohn.

Today, the wildly intelligent Dr. Matar Haller introduces Contextual A.I. (which considers adjacent, often multimodal information when making inferences) as well as how to use ML to build a moat around your company.

Matar:
• Is VP of Data and A.I. at ActiveFence, an Israeli firm that has raised over $100m in venture capital to protect online platforms and their users from malicious behavior and malicious content.
• Is renowned for her top-rated presentations at leading conferences.
• Previously worked as Director of Algorithmic A.I. at SparkBeyond, an analytics platform.
• Holds a PhD in neuroscience from the University of California, Berkeley.
• Prior to data science, taught soldiers how to operate tanks.

Today’s episode has some technical moments that will resonate particularly well with hands-on data science practitioners but for the most part the episode will be interesting to anyone who wants to hear from a brilliant person on cutting-edge A.I. applications.

In this episode, Matar details:
• The “database of evil” that ActiveFence has amassed for identifying malicious content.
• Contextual A.I. that considers adjacent (and potentially multimodal) information when classifying data.
• How to continuously adapt A.I. systems to real-world adversarial actors.
• The machine learning model-deployment stack she uses.
• The data she collected directly from human brains and how this research relates to the brain-computer interfaces of the future.
• Why being a preschool teacher is a more intense job than the military.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Podcast, SuperDataScience, YouTube, Professional Development. Tags: contextual AI, AI, Data Science

Business Intelligence Tools, with Mico Yuk

Added on May 26, 2023 by Jon Krohn.

Today's guest is the straight shooter Mico Yuk, who pulls absolutely no punches in her assessment of, well, anything! ...but particularly about vendors in the business intelligence and data analytics space. Enjoy!

Mico:
• Is host of the popular Analytics on Fire Podcast (top 2% worldwide).
• Co-founded the BI Brainz Group, an analytics consulting and solutions company that has taught over 15,000 students analytics, visualization and data storytelling courses — including at major multinationals like Nestlé, FedEx and Procter & Gamble.
• Authored the "Data Visualization for Dummies" book.
• Is a sought-after keynote speaker and TV-news commentator.

In this episode, Mico details:
• Her BI (business intelligence) and analytics framework for persuading executives with data storytelling.
• What the top BI tools are on the market today.
• The BI trends she’s observed that could predict the most popular BI tools of the coming years.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Computer Science, Interview, Podcast, Professional Development, SuperDataScience, YouTube. Tags: data visualization, data analytics, Data Science, BI

XGBoost: The Ultimate Classifier, with Matt Harrison

Added on May 23, 2023 by Jon Krohn.

XGBoost is typically the most powerful ML option whenever you're working with structured data. In today's episode, world-leading XGBoost XPert (😂) Matt Harrison details how it works and how to make the most of it.

Matt:
• Is the author of seven best-selling books on Python and Machine Learning.
• His most recent book, "Effective XGBoost", was published in March.
• Teaches "Exploratory Data Analysis with Python" at Stanford University.
• Through his consultancy MetaSnake, he’s taught Python at leading global organizations like NASA, Netflix, and Qualcomm.
• Previously worked as a CTO and Software Engineer.
• Holds a degree in Computer Science from Stanford.

Today’s episode will appeal primarily to practicing data scientists who are keen to learn about XGBoost, or to deepen their existing expertise, from a world-leading educator on the library.

In this episode, Matt details:
• Why XGBoost is the go-to library for attaining the highest accuracy when building a classification model.
• Modeling situations where XGBoost should not be your first choice.
• The XGBoost hyperparameters to adjust to squeeze every bit of juice out of your tabular training data and his recommended library for automating hyperparameter selection.
• His top Python libraries for other XGBoost-related tasks such as data preprocessing, visualizing model performance, and model explainability.
• Languages beyond Python that have convenient wrappers for applying XGBoost.
• Best practices for communicating XGBoost results to non-technical stakeholders.
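
For readers who would like to try XGBoost right away, here is a minimal sketch using its scikit-learn-compatible Python API; the dataset and hyperparameter values are illustrative defaults, not Matt's recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=200,   # number of boosted trees
    max_depth=4,        # depth of each tree
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```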

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Computer Science, Data Science, SuperDataScience, YouTube. Tags: XGBoost, Python, Python libraries

Automating Industrial Machines with Data Science and the Internet of Things (IoT)

Added on May 19, 2023 by Jon Krohn.

Despite poor lighting on my face in today's video version (my bad!), we've got a fascinating episode with the brilliant (and well-lit!) Allegra Alessi, who details how data science is automating industrial machines.

Allegra:
• Is Product Owner for IoT (Internet of Things) devices at BOBST, a Swiss industrial manufacturing giant.
• Previously, she worked as a Product Owner and Data Scientist for Rolls-Royce in the UK and as a Data Scientist for Alstom, the enormous train manufacturing company, in Paris.
• She holds a Master’s in Engineering from Politecnico di Milano in Italy.

In this episode, Allegra details:
• How modern industrial machinery depends on data science for real-time performance analytics, for predicting issues before they happen, and for fully automating its operations.
• The tech stack her team uses to build data-driven IoT platforms.
• The key methodologies she uses to be effective at product management.
• The kinds of data scientists that might be ideally suited to moving into a product role.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Computer Science, Podcast, Professional Development, SuperDataScience, YouTube. Tags: Automation, industrial, data science, IoT