• Home
  • Fresh Content
  • Courses
  • Resources
  • Podcast
  • Talks
  • Publications
  • Sponsorship
  • Testimonials
  • Contact

Jon Krohn


Computational Mathematics and Fluid Dynamics, with Prof. Margot Gerritsen

Added on October 3, 2023 by Jon Krohn.

Today, the extremely intelligent and super delightful Prof. Margot Gerritsen returns to the show to introduce what Computational Mathematics is, detail countless real-world applications of it, and relate it to the field of data science.

Margot:
• Has been faculty at Stanford University for more than 20 years, including eight years as Director of the Institute for Computational and Mathematical Engineering.
• In 2015, co-founded Women in Data Science (WiDS) Worldwide, an organization that supports, inspires and lowers barriers to entry for women across over 200 chapters in over 160 countries.
• Hosts the corresponding Women in Data Science podcast.
• Holds a PhD from Stanford in which she focused on Computational Fluid Dynamics — a passion she has retained throughout her academic career.

Today’s episode should appeal to anyone.

In this episode, Margot details:
• What computational mathematics is.
• How computational math is used to study fluid dynamics, with fascinating in-depth examples across traffic, water, oil, sailing, F1 racing, the flight of pterodactyls and more.
• Synaesthesia, a rare perceptual phenomenon that in her case means she sees numbers in specific colors, and how it relates to her lifelong interest in math.
• The genesis of her Women in Data Science organization and the impressive breadth of its global impact today.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Podcast, SuperDataScience, YouTube Tags Computational mathematics, data science, fluid dynamics

ChatGPT Custom Instructions: A Major, Easy Hack for Data Scientists

Added on September 29, 2023 by Jon Krohn.

Thanks to Shaan Khosla for tipping me off to a crazy easy hack to get markedly better results from GPT-4: providing Custom Instructions that prompt the algorithm to iterate upon its own output while critically evaluating and improving it.

Here's Shaan's full Custom Instructions text, which he himself has been iterating on in recent months:

"I need you to help me with a task. To help me with the task, first come up with a detailed outline of how you think you should respond, then critique the ideas in this outline (mention the advantages, disadvantages, and ways it could be improved), then use the original outline and the critiques you made to come up with your best possible solution.

"Overall, your tone should not be overly dramatic. It should be clear, professional, and direct. Don't sound robotic or like you're trying to sell something. You don't need to remind me you're a large language model, get straight to what you need to say to be as helpful as possible. Again, make sure your tone is clear, professional, and direct - not overly like you're trying to sell something."

Try it out! If you haven't used Custom Instructions before, in today's episode I talk you through how to set it up and explain why this approach is so effective. In the video version, I provide a screenshare that makes getting started foolproof.
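
For readers who prefer the API to the ChatGPT settings page, here is a minimal sketch of the same idea: the Custom Instructions become the system message that accompanies every user prompt. The instructions text is condensed from Shaan's version above, and the actual API call (which assumes the `openai` Python package and an API key) is commented out so the sketch stands alone.

```python
# Shaan's "outline, critique, improve" Custom Instructions applied via the
# chat API's system message rather than the ChatGPT settings page.

CUSTOM_INSTRUCTIONS = (
    "I need you to help me with a task. First come up with a detailed outline "
    "of how you think you should respond, then critique the ideas in this "
    "outline (mention the advantages, disadvantages, and ways it could be "
    "improved), then use the original outline and the critiques you made to "
    "come up with your best possible solution. Your tone should be clear, "
    "professional, and direct."
)

def build_messages(user_prompt: str) -> list:
    """Pair the Custom Instructions (as a system message) with a user prompt."""
    return [
        {"role": "system", "content": CUSTOM_INSTRUCTIONS},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("Summarize the bias-variance tradeoff.")
print(messages[0]["role"])  # system
# To send for real:
# from openai import OpenAI
# response = OpenAI().chat.completions.create(model="gpt-4", messages=messages)
```

The effect is the same as Custom Instructions in the ChatGPT UI: every request is prefixed with the self-critique prompt, so the model iterates on its own draft before answering.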


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Five-Minute Friday, Podcast, SuperDataScience, YouTube, Professional Development Tags ai, data science, CHATGPT, hack

Overcoming Adversaries with A.I. for Cybersecurity, with Dr. Dan Shiebler

Added on September 26, 2023 by Jon Krohn.

Recently in Detroit, my hotel randomly had a podcast studio complete with "ON AIR" sign haha. From there, I interviewed the wildly intelligent Dr. Dan Shiebler on how machine learning is used to tackle cybercrime.

Dan:
• As Head of Machine Learning at Abnormal Security, a cybercrime detection firm that has grown to over $100m in annual recurring revenue in just four years, manages a team of over 50 engineers.
• Previously worked at Twitter, first as a Staff ML Engineer and then as an ML Engineering Manager.
• Holds a PhD in A.I. Theory from the University of Oxford and obtained a perfect 4.0 GPA in his Computer Science and Neuroscience joint Bachelor’s from Brown University.

Today’s episode is on the technical side so might appeal most to hands-on practitioners like data scientists and ML engineers, but anyone who’d like to understand the state-of-the-art in cybersecurity should give it a listen.

In this episode, Dan details:
• The machine learning approaches needed to tackle the uniquely adversarial application of cybercrime detection.
• How to carry out real-time ML modeling.
• What his PhD research on Category Theory entailed and how it applies to the real world.
• The major problems facing humanity in the coming decades that he thinks A.I. will be able to help with… and those that he thinks A.I. won’t.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Podcast, SuperDataScience, YouTube Tags AI, cybersecurity, LLM, data science

Happiness and Life-Fulfillment Hacks

Added on September 22, 2023 by Jon Krohn.

Today, my 94-year-old grandmother shares the secrets behind her radiant happiness. Annie talks about the importance of community, relationships and setting daily intentions, blending time-tested wisdom with forward-thinking optimism.

In today’s episode, Annie discusses: 
• Her secret to happiness.
• How she maintains flourishing long-term relationships.
• The routines and mindset that keep her living independently, including driving herself everywhere, at 94 years old.
• The pace of technological progress in her lifetime and how A.I. could enrich her life in the years to come.

This episode is something different from the usual pure tech focus so I encourage you to provide feedback if you had strong feelings on this episode one way or another. As always, your feedback is invaluable for shaping the direction of the show.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In YouTube, SuperDataScience, Personal Improvement Tags Happiness, Future, old age

Make Better Decisions with Data, with Dr. Allen Downey

Added on September 19, 2023 by Jon Krohn.

Today's episode with many-time bestselling author Allen Downey is incredible. Learn a ton from him about making better decisions with data, including how to prepare for Black Swan events and how your core beliefs will shift over your life.

Allen:
• Is a Professor Emeritus at Olin College and Curriculum Designer at the learning platform Brilliant.org.
• Was previously a Visiting Professor of Computer Science at Harvard University and a Visiting Scientist at Google.
• Has written 18 books, which he has made available for free online but which are also published in hard copy by major publishers; his books "Think Python" and "Think Bayes", for example, were bestsellers published by O'Reilly.
• His next book, "Probably Overthinking It", is available for pre-order now.
• Holds a PhD in Computer Science from the University of California, Berkeley, as well as Bachelor's and Master's degrees from the Massachusetts Institute of Technology.

Today’s episode focuses largely on content from Allen’s upcoming book — his first book intended for a lay audience — and so should appeal to anyone who’s keen to learn from an absolutely brilliant writer and speaker on “How to Use Data to Answer Questions, Avoid Statistical Traps, and Make Better Decisions.”

In this episode, Allen details: 
• Underused techniques like Survival Analysis that can be uniquely powerful in lots of ordinary circumstances.
• How to better prepare for rare “Black Swan” events.
• How to wrap your head around common data paradoxes such as Preston’s Paradox, Berkson’s Paradox and Simpson’s Paradox.
• What the Overton Window is and how our core beliefs shift relative to it over the course of our lifetime (this is extra trippy).
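
As a taste of the survival-analysis point above, here is a minimal Kaplan-Meier estimator in plain Python, applied to invented customer-churn data. Production work would use a library such as lifelines, which also handles ties, confidence intervals and plotting; this sketch just shows the product-limit idea.

```python
# A minimal Kaplan-Meier estimator, illustrating the kind of survival
# analysis Allen highlights. `durations` are event or censoring times and
# `observed` flags whether the event occurred (False = censored).

def kaplan_meier(durations, observed):
    """Return (time, survival probability) pairs at each observed event time."""
    n_at_risk = len(durations)
    survival = 1.0
    curve = []
    # Process events before censorings at tied times (the standard convention).
    for t, event in sorted(zip(durations, observed), key=lambda p: (p[0], not p[1])):
        if event:  # an observed event lowers the product-limit estimate
            survival *= 1 - 1 / n_at_risk
            curve.append((t, survival))
        n_at_risk -= 1  # censored subjects simply leave the risk set
    return curve

# Toy data: months until churn; False means still subscribed (censored).
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, False, True]
for t, s in kaplan_meier(durations, observed):
    print(f"S({t}) = {s:.3f}")
```

The key property on display: censored subjects still inform the estimate by shrinking the risk set, rather than being thrown away.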

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Podcast, SuperDataScience, YouTube Tags black swan events, MIT, Data Science, AI

Using A.I. to Overcome Blindness and Thrive as a Data Scientist

Added on September 15, 2023 by Jon Krohn.

Today's guest is the remarkable Tim Albiges, who lost the ability to see as an adult. Thanks to A.I. tools, as well as learning how to learn by sound and touch, he is now thriving as a data scientist and pursuing a fascinating PhD!

Tim was working as a restaurant manager eight years ago when he tragically lost his sight.

In the face of countless alarming and discriminatory acts against him on account of his blindness, he taught himself Braille and auditory learning techniques (and learned to raise math equations and diagrams with a special thermoform machine so that he can feel them) in order to return to college and study computing and data science.

Not only did he succeed in obtaining a Bachelor’s degree in computing (with First-Class Honours), he is now pursuing a PhD at Bournemouth University full-time, in which he’s applying machine learning to solve medical problems. His first paper was published in the peer-reviewed journal Sensors earlier this year.

Today’s inspiring episode is accessible to technical and non-technical listeners alike.

In it, Tim details:
• Why a career in data science can be ideal for a blind person.
• How he’s using ML to automate the diagnosis of chronic respiratory diseases.
• The techniques he employs to live a full and independent life, with a particular focus on the A.I. tools that assist him both at work and at leisure.
• How, as a keen athlete, he's adapted his approach to fitness in order to run the London Marathon and enjoy a gripping team sport called goalball.


The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Five-Minute Friday, Data Science, Interview, Podcast, SuperDataScience, YouTube Tags AI, data careers, ML, tech, blindness

Llama 2, Toolformer and BLOOM: Open-Source LLMs with Meta’s Dr. Thomas Scialom

Added on September 12, 2023 by Jon Krohn.

Thomas Scialom, PhD is behind many of the most popular Generative A.I. projects including Llama 2, the world's top open-source LLM. Today, the Meta A.I. researcher reveals the stories behind Llama 2 and what's in the works for Llama 3.

Thomas:
• Is an A.I. Research Scientist at Meta.
• Is behind some of the world’s best-known Generative A.I. projects including Llama 2, BLOOM, Toolformer and Galactica.
• Is contributing to the development of Artificial General Intelligence (AGI).
• Has lectured at many of the top A.I. labs (e.g., Google, Stanford, MILA).
• Holds a PhD from Sorbonne University, where he specialized in Natural-Language Generation with Reinforcement Learning.

Today’s episode should be equally appealing to hands-on machine learning practitioners as well as folks who may not be hands on but are nevertheless keen to understand the state-of-the-art in A.I. from someone who’s right on the cutting edge of it all.

In this episode, Thomas details: 
• Llama 2, today’s top open-source LLM, including what it was like behind the scenes developing it and what we can expect from the eventual Llama 3 and related open-source projects.
• The Toolformer LLM that learns how to use external tools.
• The Galactica science-specific LLM, why it was brought down after a few days, and how it might eventually re-emerge in a new form.
• How RLHF — reinforcement learning from human feedback — shifts the distribution of generative A.I. outputs from approximating the average of human responses to excellent, often superhuman quality.
• How soon he thinks AGI — artificial general intelligence — will be realized and how.
• How to make the most of the Generative A.I. boom as an entrepreneur.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, Professional Development, SuperDataScience, YouTube Tags LLaMA2, GPT4, AGI, opensource, LLM, Meta, Toolformer

Code Llama

Added on September 8, 2023 by Jon Krohn.

Meta's Llama 2 offered state-of-the-art performance for an "open-source"* LLM... except on tasks involving code. Now Code Llama is here and it magnificently fills that gap by outperforming all other open-source LLMs on coding benchmarks.

In Data Science, SuperDataScience, YouTube, Podcast, Five-Minute Friday Tags code, Llama, Meta, AI, LLM, GPT4

Image, Video and 3D-Model Generation from Natural Language, with Dr. Ajay Jain

Added on September 5, 2023 by Jon Krohn.

Today, brilliant ML researcher Ajay Jain, Ph.D explains how a full-length feature film could be created using Stable-Diffusion-style generative A.I. — these models can now output flawless 3D models and compelling video clips.

Ajay:
• Is a Co-Founder of Genmo AI, a platform for using natural language to generate stunning state-of-the-art images, videos and 3D models.
• Prior to Genmo, he worked as a researcher on the Google Brain team in California, in the Uber Advanced Technologies Group in Toronto and on the Applied Machine Learning team at Facebook.
• Holds a degree in Computer Science and Engineering from MIT and did his PhD within the world-class Berkeley A.I. Research (BAIR) lab, where he specialized in deep generative models.
• Has published highly influential papers at all of the most prestigious ML conferences, including NeurIPS, ICML and CVPR.

Today’s episode is on the technical side so will likely appeal primarily to hands-on practitioners, but we did our best to explain concepts so that anyone who’d like to understand the state of the art in image, video and 3D-model generation can get up to speed.

In this episode, Ajay details:
• How the Creative General Intelligence he’s developing will allow humans to express anything in natural language and get it.
• How feature-length films could be created today using generative A.I. alone.
• How the Stable Diffusion approach to text-to-image generation differs from the Generative Adversarial Network approach.
• How neural nets can represent all the aspects of a visual scene so that the scene can be rendered as desired from any perspective.
• Why a self-driving vehicle forecasting pedestrian behavior requires similar modeling capabilities to text-to-video generation.
• What he looks for in the engineers and researchers he hires.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In YouTube, SuperDataScience Tags 3D, AI, data science, AGI

LangChain: Create LLM Applications Easily in Python

Added on September 2, 2023 by Jon Krohn.

Today's episode is a fun intro to the powerful, versatile LLM-development framework LangChain. In it, Kris Ograbek talks us through how to use LangChain to chat with previous episodes of SuperDataScience! 😎

Kris:
• Is a content creator who specializes in creating LLM-based projects — with Python libraries like LangChain and the Hugging Face Transformers library — and then using the projects to teach these LLM techniques.
• Previously, he worked as a software engineer in Germany.
• He holds a Master’s in Electrical and Electronics Engineering from the Wroclaw University of Science and Technology.

In this episode, Kris details:
• The exceptionally popular LangChain framework for developing LLM applications.
• Specifically, he introduces how LangChain is so powerful by walking us step-by-step through a chatbot he built that interactively answers questions about episodes of the SuperDataScience podcast.

Having listened to the podcast for years, Kris flips the script on me at the end of the episode and asks me some of his burning questions, ones that perhaps many other listeners have wondered about too.
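
For the curious, the core pattern behind a chatbot like Kris's can be sketched without LangChain at all: retrieve the most relevant transcript chunk, then assemble a grounded prompt for an LLM. This is a deliberately naive word-overlap retriever with made-up episode snippets; LangChain wires the same steps together with embeddings, vector stores and chains.

```python
# Retrieval-then-prompt, the pattern LangChain packages up, in plain Python.
# The episode snippets below are invented for illustration.

def retrieve(question: str, chunks: list) -> str:
    """Return the chunk sharing the most words with the question (naive retrieval)."""
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

def build_prompt(question: str, chunks: list) -> str:
    """Ground the LLM prompt in the retrieved context."""
    context = retrieve(question, chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Episode 101: a conversation about t-SNE and dimensionality reduction.",
    "Episode 102: a conversation about computational fluid dynamics.",
]
prompt = build_prompt("Which episode covers fluid dynamics?", chunks)
print("102" in prompt)  # True: the relevant chunk was retrieved
```

A real pipeline would swap the word-overlap scoring for embedding similarity, but the shape of the program is the same.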

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In YouTube, SuperDataScience, Data Science Tags SuperDataScience, data science, LangChain, LLM, Python

Big A.I. R&D Risks Reap Big Societal Rewards, with Meta’s Dr. Laurens van der Maaten

Added on August 29, 2023 by Jon Krohn.

By making big research bets, the prolific Meta Senior Research Director Dr. Laurens van der Maaten has devised or supported countless world-changing machine-learning innovations across healthcare, climate change, privacy and more.

Laurens:
• Is a Senior Research Director at Meta, overseeing swathes of their high-risk, high-reward A.I. projects with application areas as diverse as augmented reality, biological protein synthesis and tackling climate change.
• Developed the "CrypTen" privacy-preserving ML framework.
• Pioneered web-scale weakly supervised training of image-recognition models.
• Along with the iconic Geoff Hinton, created the t-SNE dimensionality reduction technique (this paper alone has been cited over 36,000 times).
• In aggregate, his works have been cited nearly 100,000 times!
• Holds a PhD in machine learning from Tilburg University in the Netherlands.

Today’s episode will probably appeal primarily to hands-on data science practitioners, but there is tons of content in this episode for anyone who’d like to appreciate the state of the art in A.I. across a broad range of socially impactful, super-cool applications.

In this episode, Laurens details: 
• How he pioneered learning across billions of weakly labeled images to create a state-of-the-art machine-vision model.
• How A.I. can be applied to the synthesis of new biological proteins with implications for both medicine and agriculture.
• Specific ways A.I. is being used to tackle climate change as well as to simulate wearable materials for enhancing augmented-reality interactivity.
• A library just like PyTorch but where all the computations are encrypted.
• The wide range of applications of his ubiquitous dimensionality-reduction approach.
• His vision for the impact of A.I. on society in the coming decades.
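
As a concrete taste of the dimensionality-reduction work mentioned above, here is what t-SNE actually computes, sketched on a toy dataset in plain Python: Gaussian affinities between points in the high-dimensional space and heavy-tailed Student-t similarities in the low-dimensional map. A real implementation also tunes a per-point bandwidth to a target perplexity and minimizes the KL divergence between the two distributions by gradient descent; the fixed `sigma` here is a simplifying assumption.

```python
# The two similarity distributions at the heart of t-SNE, on toy data.
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def gaussian_affinities(points, sigma=1.0):
    """p_ij: Gaussian similarities in the original space, normalized over all pairs."""
    n = len(points)
    w = {(i, j): math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
         for i in range(n) for j in range(n) if i != j}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

def student_t_similarities(embedding):
    """q_ij: heavy-tailed (one degree of freedom) similarities in the low-D map."""
    n = len(embedding)
    w = {(i, j): 1 / (1 + sq_dist(embedding[i], embedding[j]))
         for i in range(n) for j in range(n) if i != j}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]  # two close points, one far away
p = gaussian_affinities(points)
print(p[(0, 1)] > p[(0, 2)])  # True: neighbors get far higher affinity
```

The heavy tails of the Student-t distribution are what let t-SNE place dissimilar points far apart without crushing local neighborhoods, which is why its plots separate clusters so cleanly.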

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Podcast, SuperDataScience, YouTube Tags Ai, Meta, PyTorch, R&D, ML

ChatGPT Code Interpreter: 5 Hacks for Data Scientists

Added on August 25, 2023 by Jon Krohn.

The ChatGPT Code Interpreter is surreal: It creates and executes Python code for whatever task you describe, debugs its own runtime errors, displays charts, does file uploads/downloads, and suggests sensible next steps all along the way.

Whether you write code yourself today or not, you can take advantage of GPT-4's stellar natural-language input/output capabilities to interact with the Code Interpreter. The mind-blowing experience is equivalent to having an expert data analyst, data scientist or software developer with you to instantaneously respond to your questions or requests.

As an example of these jaw-dropping capabilities (and given the data-science-focused theme of my show), I use today's episode to demonstrate the ChatGPT Code Interpreter's full automation of data analysis and machine learning. If you watch the episode on YouTube, you can even see the Code Interpreter in action as I interact with it using natural language alone.

Over the course of today's episode/video, the Code Interpreter:
1. Receives a sample data file that I provide it.
2. Uses natural language to describe all of the variables that are in the file.
3. Performs a four-step Exploratory Data Analysis (EDA), including histograms, scatterplots that compare key variables and key summary statistics (all explained in natural language).
4. Preprocesses all of my variables for machine learning.
5. Selects an appropriate baseline ML model, trains it and quantitatively evaluates its performance.
6. Suggests alternative models and approaches (e.g., grid search) to get even better performance and then automatically carries these out.
7. Optionally provides Python code every step of the way and is delighted to answer any questions I have about the code.

The whole process is a ton of fun and, again, requires no coding abilities to use (the "Code Interpreter" moniker could be misleadingly intimidating to non-coding folks). Even as an experienced data scientist, however, I would estimate that in many everyday situations use of the Code Interpreter could decrease my development time by a crazy 90% or more.

The big caveat with all of this is whether you're comfortable sharing your code with OpenAI. Don't provide proprietary company code to it without clearing that with your firm first, and if you do use proprietary code with it, turn "Chat history & training" off in your ChatGPT Plus settings. To circumvent the data-privacy issue entirely, you could alternatively try Meta's newly released "Code Llama — Instruct 34B" Large Language Model on your own infrastructure. Code Llama won't, however, be as good as the Code Interpreter in many circumstances and will require some technical savvy to get up and running.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Five-Minute Friday, Data Science, Podcast, SuperDataScience, YouTube Tags CHATGPT, code, ML, AI

Vicuna, Gorilla, Chatbot Arena and Socially Beneficial LLMs, with Prof. Joey Gonzalez

Added on August 22, 2023 by Jon Krohn.

Vicuna, Gorilla and the Chatbot Arena are all critical elements of the new open-source LLM ecosystem, and the extremely knowledgeable and innovative Prof. Joseph Gonzalez is behind all of them. Get the details in today's episode.

Joey:
• Is an Associate Professor of Electrical Engineering and Computer Science at the University of California, Berkeley.
• Co-directs the Berkeley RISE Lab, which studies Real-time, Intelligent, Secure and Explainable systems.
• Co-founded Turi (acquired by Apple for $200m) and more recently Aqueduct.
• His research is integral to major software systems including Apache Spark, Ray (for scaling Python ML), GraphLab (a high-level interface for distributed ML) and Clipper (low-latency ML serving).
• His papers—published in top ML journals—have been cited over 24,000 times.
• Developed Berkeley's upper-division data science class, which he now teaches to over 1000 students per semester.

Today’s episode will probably appeal primarily to hands-on data science practitioners but we made an effort to break down technical terms so that anyone who’s interested in staying on top of the latest in open-source Generative A.I. can enjoy the episode.

In it, Prof. Gonzalez details:
• How his headline-grabbing LLM, Vicuna, came to be and how it arose as one of the leading open-source alternatives to ChatGPT.
• How his Chatbot Arena became the leading proving ground for commercial and open-source LLMs alike.
• How his Gorilla project enables open-source LLMs to call APIs, making it an open-source alternative to ChatGPT’s powerful plugin functionality.
• The race for longer LLM context windows.
• How both proprietary and open-source LLMs will thrive alongside each other in the coming years.
• His vision for how A.I. will have a massive, positive societal impact over the coming decades.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Podcast, SuperDataScience, YouTube Tags Vicuna, open source, LLMs, data science, AI, ML

Large Language Model Leaderboards and Benchmarks

Added on August 19, 2023 by Jon Krohn.

Llamas, Alpacas, Koalas, Falcons... there is a veritable zoo of LLMs out there! In today's episode, Caterina Constantinescu breaks down the LLM Leaderboards and evaluation benchmarks to help you pick the right LLM for your use case.

Caterina:
• Is a Principal Data Consultant at GlobalLogic, a full-lifecycle software development services provider with over 25,000 employees worldwide.
• Previously, she worked as a data scientist for financial services and marketing firms.
• Is a key player in data science conferences and Meetups in Scotland.
• Holds a PhD from The University of Edinburgh.

In this episode, Caterina details:
• The best leaderboards (e.g., HELM, Chatbot Arena and the Hugging Face Open LLM Leaderboard) for comparing the quality of both open-source and proprietary Large Language Models (LLMs).
• The advantages and issues associated with LLM evaluation benchmarks (e.g., evaluation-dataset contamination is a big issue because the top-performing LLMs are often trained on all the publicly available data they can find... including benchmark-evaluation datasets).
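
To make the contamination point concrete, a toy check might measure how many of a benchmark item's n-grams already appear in a model's training text. Real decontamination analyses are far more sophisticated (normalization, longer n-grams, fuzzy matching at corpus scale); this is only a sketch of the idea.

```python
# A toy benchmark-contamination check via word n-gram overlap.

def ngrams(text: str, n: int = 3) -> set:
    """Lower-cased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(train_text: str, benchmark_item: str, n: int = 3) -> float:
    """Fraction of the benchmark item's n-grams found in the training text."""
    bench = ngrams(benchmark_item, n)
    return len(bench & ngrams(train_text, n)) / len(bench) if bench else 0.0

train = "the quick brown fox jumps over the lazy dog every single day"
leaked = "quick brown fox jumps over the lazy dog"
fresh = "which planet is closest to the sun"
print(contamination_score(train, leaked))  # 1.0: item was seen in training
print(contamination_score(train, fresh))   # 0.0: item looks unseen
```

A leaderboard score on the `leaked` item tells you about memorization, not capability, which is exactly why contaminated benchmarks mislead.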

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Data Science, Interview, SuperDataScience, YouTube Tags LLMs, hugging face, Data Science, ai

Feeding the World with ML-Powered Precision Agriculture

Added on August 15, 2023 by Jon Krohn.

Earth's population may peak around 10 billion later this century. To feed everyone while also avoiding climate disaster, A.I. is essential. Today, three leaders from Syngenta detail how ML is transforming agriculture and assuring our future.

In Podcast, SuperDataScience, YouTube Tags Syngenta, Agriculture, ML, data science, sustainability, ai

Jon’s “Generative A.I. with LLMs” Hands-on Training

Added on August 11, 2023 by Jon Krohn.

Today's episode introduces my two-hour "Generative A.I. with LLMs" training, which is packed with hands-on Python demos in Colab notebooks. It covers both open-source LLM options (Hugging Face; PyTorch Lightning) and commercial ones (the OpenAI API).

In Data Science, Five-Minute Friday, SuperDataScience, YouTube Tags hands-on learning, AI, LLMs, hugging face, open ai

How Data Happened: A History, with Columbia Prof. Chris Wiggins

Added on August 8, 2023 by Jon Krohn.

Today, Chris Wiggins — Chief Data Scientist at The New York Times and faculty at Columbia University — provides an enthralling, witty and rich History of Data Science. Chris is an extraordinarily gifted orator; don't miss this episode!

Chris:
• Is an Associate Professor of Applied Math at Columbia University.
• Has been Chief Data Scientist at The New York Times for nearly a decade.
• Co-authored two fascinating recently published books: "How Data Happened: A History from the Age of Reason to the Age of Algorithms" and "Data Science in Context: Foundations, Challenges, Opportunities".

The vast majority of this episode will be accessible to anyone. There are just a couple of questions near the end that cover content on tools and programming languages that are primarily intended for hands-on practitioners.

In the episode, Chris magnificently details: 
• The history of data and statistics from its infancy centuries ago to the present.
• Why it’s a problem that most data scientists have limited exposure to the humanities.
• How and when Bayesian statistics became controversial.
• What we can do to address the key issues facing data science and ML today.
• His computational biology research at Columbia.
• The tech stack used for data science at the globally revered New York Times.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Interview, Data Science, Podcast, SuperDataScience, YouTube Tags data science, Columbia, NY Times, bayesianstatistics

LLaMA 2 — It’s Time to Upgrade your Open-Source LLM

Added on August 4, 2023 by Jon Krohn.

If you've been using fine-tuned open-source LLMs (e.g. for generative A.I. functionality or natural-language conversations with your users), it's very likely time you switch your starting model over to Llama 2. Here's why:

In Five-Minute Friday, Podcast, SuperDataScience, YouTube Tags LLaMA2, AI, generativeai, LLMs, NLP, ChatGPT

Generative A.I. without the Privacy Risks (with Prof. Raluca Ada Popa)

Added on August 1, 2023 by Jon Krohn.

Consumers and enterprises dread that Generative A.I. tools like ChatGPT breach privacy by using convos as training data, storing PII and potentially surfacing confidential data as responses. Prof. Raluca Ada Popa has all the solutions.

Today's guest, Raluca:
• Is Associate Professor of Computer Science at University of California, Berkeley.
• Specializes in computer security and applied cryptography.
• Her papers have been cited over 10,000 times.
• Is Co-Founder and President of Opaque Systems, a confidential computing platform that has raised over $31m in venture capital to enable collaborative analytics and A.I., including allowing you to securely interact with Generative A.I.
• Previously co-founded PreVeil, a now-well-established company that provides end-to-end document and message encryption to over 500 clients.
• Holds a PhD in Computer Science from MIT.

Despite Raluca being such a deep expert, she does such a stellar job of communicating complex concepts simply that today’s episode should appeal to anyone who wants to dig into the thorny issues around data privacy and security associated with Large Language Models (LLMs) and how to resolve them.

In the episode, Raluca details:
• What confidential computing is and how to do it without sacrificing performance.
• How you can perform inference with an LLM (or even train an LLM!) without anyone — including the LLM developer! — being able to access your data.
• How you can use commercial generative models like OpenAI’s GPT-4 without OpenAI being able to see sensitive or personally identifiable information you include in your API query.
• The pros and cons of open-source versus closed-source A.I. development.
• How and why you might want to seamlessly run your compute pipelines across multiple cloud providers.
• Why you should consider a career that blends academia and entrepreneurship.
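
As one illustration of keeping sensitive data away from a hosted LLM, the snippet below redacts obvious PII client-side before a query is ever sent. To be clear, this regex approach is my own simplified stand-in for the general goal: Opaque's actual solution rests on confidential computing in hardware enclaves, not on redaction.

```python
# A crude client-side PII redactor, applied before any text reaches an LLM API.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders before an API call."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

query = "Email jane.doe@example.com or call 555-867-5309 about SSN 123-45-6789."
print(redact(query))
# Email [EMAIL] or call [PHONE] about SSN [SSN].
```

Redaction like this protects the obvious cases but misses context-dependent PII (names, addresses), which is precisely the gap that cryptographic approaches such as Raluca's aim to close.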

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Computer Science, Data Science, Interview, Podcast, Professional Development, SuperDataScience, YouTube Tags ai, data science, genai, privacy, computer science

“The Dream of Life” by Alan Watts

Added on July 28, 2023 by Jon Krohn.

For episode #700 today, I bring you the "Dream of Life" thought experiment originally penned by Alan Watts. You are terrifically powerful (particularly now that you're armed with A.I.!) — are you making good use of your power?

Also, time flies, eh? Another hundred episodes in the bag today! Thanks for listening, providing feedback and otherwise contributing to making SuperDataScience, with over a million downloads per quarter, the most listened-to podcast in the data science industry. We've got some serious awesomeness lined up for the next hundred episodes — I can't wait for the amazing, inspiring, mind-opening conversations.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

In Five-Minute Friday, Podcast, SuperDataScience, YouTube Tags quantum physics, meditation, Alan Watts