Historically, when we deploy a machine learning model into production, the parameters that the model learned during its training on data were the sole driver of the model’s outputs. With the Generative LLMs that have taken the world by storm in the past few years, however, the model parameters alone are not enough to get reliably high-quality outputs. For that, the so-called decoding method that we choose when we deploy our LLM into production is also critical.
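To make the stakes concrete, here is a minimal sketch of how decoding choices surface at inference time with the Hugging Face transformers API; the model ("gpt2") and the specific hyperparameter values are illustrative placeholders, not recommendations from the episode.

```python
# A minimal sketch of how decoding choices surface at inference time, using the
# Hugging Face transformers API. The model ("gpt2") and settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The decoding method we choose", return_tensors="pt")

# Greedy decoding: always pick the single most probable next token.
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Beam search: track several candidate continuations and return the best one.
beams = model.generate(**inputs, max_new_tokens=30, num_beams=4, do_sample=False)

# Nucleus (top-p) sampling: sample from the smallest set of tokens whose
# cumulative probability exceeds p, with temperature controlling randomness.
sampled = model.generate(
    **inputs, max_new_tokens=30, do_sample=True, top_p=0.92, temperature=0.8
)

for name, ids in [("greedy", greedy), ("beam", beams), ("top-p", sampled)]:
    print(name, "->", tokenizer.decode(ids[0], skip_special_tokens=True))
```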
Decoding Speech from Raw Brain Activity, with Dr. David Moses
Dr. David Moses and his colleagues have pulled off a miracle with A.I.: allowing paralyzed patients to "speak" through a video avatar in real time — using brain waves alone. In today's episode, David details how ML makes this possible.
David:
• Is an adjunct professor at the University of California, San Francisco.
• Is the project lead on the BRAVO (Brain-Computer Interface Restoration of Arm and Voice) clinical trial.
• The success of this extraordinary BRAVO project led to an article in the prestigious journal Nature and a YouTube video that already has over 3 million views.
Today’s episode does touch on specific machine learning (ML) terminology at points, but otherwise should be fascinating to anyone who’d like to hear how A.I. is facilitating real-life miracles.
In this episode, David details:
• The genesis of the BRAVO project.
• The data and the ML models they’re using on the BRAVO project in order to predict text, speech sounds and facial expressions from the brain activity of paralyzed patients.
• What’s next for this exceptional project including how long it might be before these brain-to-speech capabilities are available to anyone who needs them.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
AI Emits Far Less Carbon Than Humans (Doing the Same Task)
There's been a lot of press about Large Language Models (LLMs), such as those behind ChatGPT, using vast amounts of energy per query. In fact, however, a person doing the same work emits 12x to 45x more carbon from their laptop alone.
Today’s "Five-Minute Friday" episode is a quick one on how “The Carbon Emissions of Writing and Illustrating Are Lower for AI than for Humans”. Everything in today’s episode is based on an arXiv preprint with that title by researchers from UC Irvine, the Massachusetts Institute of Technology, and other universities.
For writing a page of text, for example, the authors estimate:
• BLOOM open-source LLM (including training) produces ~1.6g CO2/query.
• OpenAI's GPT-3 (including training) produces ~2.2g CO2/query.
• Laptop usage for 0.8 hours (the average time to write a page) emits ~27g CO2 (that's 12x GPT-3).
• Desktop usage for the same amount of writing time emits ~72g CO2 (32x GPT-3).
For creating a digital illustration:
• Midjourney (including training) produces ~1.9g CO2/query.
• DALL-E 2 produces ~2.2g CO2/query.
• A human takes ~3.2 hours for the same work, emitting ~100g CO2 (45x DALL-E 2) on a laptop or ~280g CO2 (127x DALL-E 2) on a desktop.
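As a quick sanity check, here is a tiny Python sketch that reproduces the ratios above from the paper's per-task estimates; small differences (e.g., 32x vs. 33x for desktop writing) come down to rounding in the underlying figures.

```python
# A quick sanity check of the ratios cited above, using the paper's estimates.
gpt3, dalle2 = 2.2, 2.2                              # grams CO2 per query (incl. training, amortized)
human_writing = {"laptop": 27, "desktop": 72}        # grams CO2 for ~0.8 h of writing
human_drawing = {"laptop": 100, "desktop": 280}      # grams CO2 for ~3.2 h of illustrating

for device, grams in human_writing.items():
    print(f"Writing on a {device}: {grams / gpt3:.0f}x GPT-3")
for device, grams in human_drawing.items():
    print(f"Illustrating on a {device}: {grams / dalle2:.0f}x DALL-E 2")
```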
There are complexities here, such as what humans do with their time instead of writing or illustrating; if it’s spent driving, for example, then the net impact would be worse. As someone who’d love to see the world at net negative carbon emissions ASAP through innovations like nuclear fusion and carbon capture, however, I have been getting antsy about how much energy state-of-the-art LLMs use, but this simple article turned that perspective upside down. I’ll continue to use A.I. to augment my work wherever I can... and hopefully get my day done earlier so I can get away from my machine and enjoy some time outdoors.
Hear more detail in today's episode or check out the video version to see figures as well.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
OpenAI’s DALL-E 3, Image Chat and Web Search
Today's episode details three big releases from OpenAI: (1) DALL-E 3 text-to-image model, which "exactly" adheres to your prompt. (2) Image-to-text chat. (3) Real-time web search integrated into ChatGPT (which seems to lag behind Google's Bard).
So, first, DALL-E 3 text-to-image generation:
• Appears to generate images that are on par with Midjourney V5, the current state-of-the-art.
• The big difference is that apparently DALL-E 3 will actually generate images that adhere “exactly” to the text you provide.
• In contrast, incumbent state-of-the-art models typically ignore words or key parts of the description, even though their image quality is often stunning.
• This adherence to prompts extends even to language that you’d like to include in the image, which is mega.
• Watch today's YouTube version for examples of all the above.
In addition, using Midjourney is a really bizarre user experience because it's done through Discord where you provide prompts and get results alongside dozens of other people at the same time. DALL-E 3, in contrast, will be within the slick ChatGPT Plus environment, which could completely get rid of the need to develop text-to-image prompt-engineering expertise in order to get great results. Instead, you can simply have an iterative back-and-forth conversation with ChatGPT to produce the image of your dreams.
Next up is image-to-text chat in ChatGPT Plus:
• We've known this was coming for a while.
• Works stunningly well in the tests I've done so far.
• Today's YouTube version also shows an example of this.
Finally, real-time web search with Bing is now integrated into ChatGPT Plus:
• In my personal (anecdotal) tests, this lagged behind Google's Bard.
• Bard is also free, so if real-time web search is what you're after, there doesn't seem to be a reason to pay for ChatGPT Plus. That said, for state-of-the-art general chat plus now image generation and text-to-image chat (per the above), ChatGPT Plus is well worth the price tag.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
ChatGPT Custom Instructions: A Major, Easy Hack for Data Scientists
Thanks to Shaan Khosla for tipping me off to a crazy easy hack to get markedly better results from GPT-4: providing Custom Instructions that prompt the algorithm to iterate upon its own output while critically evaluating and improving it.
Here's Shaan's full Custom Instructions text, which he himself has been iterating on in recent months:
"I need you to help me with a task. To help me with the task, first come up with a detailed outline of how you think you should respond, then critique the ideas in this outline (mention the advantages, disadvantages, and ways it could be improved), then use the original outline and the critiques you made to come up with your best possible solution.
"Overall, your tone should not be overly dramatic. It should be clear, professional, and direct. Don't sound robotic or like you're trying to sell something. You don't need to remind me you're a large language model, get straight to what you need to say to be as helpful as possible. Again, make sure your tone is clear, professional, and direct - not overly like you're trying to sell something."
Try it out! If you haven't used Custom Instructions before, in today's episode I talk you through how to set it up and explain why this approach is so effective. In the video version, I provide a screenshare that makes getting started foolproof.
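If you'd like to apply the same trick outside the ChatGPT interface, a system message plays an analogous role to Custom Instructions in the OpenAI API. Here is a minimal sketch, assuming the official openai Python package (v1+) and an OPENAI_API_KEY environment variable; the user prompt is just a placeholder task.

```python
# A minimal sketch: Custom Instructions-style prompting via the OpenAI API.
# Assumes the openai package (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

custom_instructions = (
    "I need you to help me with a task. First come up with a detailed outline of "
    "how you think you should respond, then critique the ideas in this outline, "
    "then use the original outline and the critiques to produce your best possible "
    "solution. Your tone should be clear, professional, and direct."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": custom_instructions},  # plays the Custom Instructions role
        {"role": "user", "content": "Draft a project plan for an A/B test."},  # placeholder task
    ],
)
print(response.choices[0].message.content)
```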
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Using A.I. to Overcome Blindness and Thrive as a Data Scientist
Today's guest is the remarkable Tim Albiges, who lost the ability to see as an adult. Thanks to A.I. tools, as well as learning how to learn by sound and touch, he is now thriving as a data scientist and pursuing a fascinating PhD!
Tim was working as a restaurant manager eight years ago when he tragically lost his sight.
In the face of countless alarming and discriminatory acts against him on account of his blindness, he taught himself Braille and auditory learning techniques (and to raise math equations and diagrams using a special thermoform machine so that he can feel them) in order to be able to return to college and study computing and data science.
Not only did he succeed in obtaining a Bachelor’s degree in computing (with First-Class Honours), he is now pursuing a PhD at Bournemouth University full-time, in which he’s applying machine learning to solve medical problems. His first paper was published in the peer-reviewed journal Sensors earlier this year.
Today’s inspiring episode is accessible to technical and non-technical listeners alike.
In it, Tim details:
• Why a career in data science can be ideal for a blind person.
• How he’s using ML to automate the diagnosis of chronic respiratory diseases.
• The techniques he employs to live a full and independent life, with a particular focus on the A.I. tools that assist him both at work and at leisure.
• How, as a keen athlete, he's adapted his approach to fitness in order to run the London Marathon and enjoy a gripping team sport called goalball.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Code Llama
Meta's Llama 2 offered state-of-the-art performance for an "open-source"* LLM... except on tasks involving code. Now Code Llama is here and it magnificently fills that gap by outperforming all other open-source LLMs on coding benchmarks.
ChatGPT Code Interpreter: 5 Hacks for Data Scientists
The ChatGPT Code Interpreter is surreal: It creates and executes Python code for whatever task you describe, debugs its own runtime errors, displays charts, does file uploads/downloads, and suggests sensible next steps all along the way.
Whether you write code yourself today or not, you can take advantage of GPT-4's stellar natural-language input/output capabilities to interact with the Code Interpreter. The mind-blowing experience is equivalent to having an expert data analyst, data scientist or software developer with you to instantaneously respond to your questions or requests.
As an example of these jaw-dropping capabilities (and given the data science-focused theme of my show), I use today's episode to demonstrate the ChatGPT Code Interpreter's full automation of data analysis and machine learning. If you watch the episode on YouTube, you can even see the Code Interpreter in action while I interact with it using only natural language.
Over the course of today's episode/video, the Code Interpreter:
1. Receives a sample data file that I provide it.
2. Uses natural language to describe all of the variables that are in the file.
3. Performs a four-step Exploratory Data Analysis (EDA), including histograms, scatterplots that compare key variables, and key summary statistics (all explained in natural language).
4. Preprocesses all of my variables for machine learning.
5. Selects an appropriate baseline ML model, trains it and quantitatively evaluates its performance.
6. Suggests alternative models and approaches (e.g., grid search) to get even better performance and then automatically carries these out.
7. Optionally provides Python code every step of the way and is delighted to answer any questions I have about the code.
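For a sense of the kind of code the Code Interpreter writes and executes behind the scenes, here is a minimal scikit-learn sketch of that same workflow; the file name ("sample_data.csv") and target column ("target") are hypothetical placeholders rather than the dataset from the episode.

```python
# A minimal sketch of the workflow the Code Interpreter automates. The file name
# ("sample_data.csv") and target column ("target") are hypothetical placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1-3. Load the file and run a quick EDA.
df = pd.read_csv("sample_data.csv")
print(df.describe(include="all"))

X, y = df.drop(columns="target"), df["target"]
numeric = X.select_dtypes("number").columns
categorical = X.columns.difference(numeric)

# 4. Preprocess numeric and categorical variables.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# 5. Train and evaluate a baseline model.
pipeline = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipeline.fit(X_train, y_train)
print("Baseline accuracy:", pipeline.score(X_test, y_test))

# 6. Grid search over regularization strength for better performance.
grid = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print("Best CV accuracy:", grid.best_score_)
```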
The whole process is a ton of fun and, again, requires no coding abilities to use (the "Code Interpreter" moniker could be misleadingly intimidating to non-coding folks). Even as an experienced data scientist, however, I would estimate that in many everyday situations use of the Code Interpreter could decrease my development time by a crazy 90% or more.
The big caveat with all of this is whether you're comfortable sharing your code with OpenAI. I wouldn't provide proprietary company code to it without clearing that with your firm first, and — if you do use proprietary code with it — I'd turn "Chat history & training" off in your ChatGPT Plus settings. To circumvent the data-privacy issue entirely, you could instead try Meta's newly released "Code Llama — Instruct 34B" Large Language Model on your own infrastructure. Code Llama won't, however, be as good as the Code Interpreter in many circumstances, and it will require some technical savvy to get up and running.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Jon’s “Generative A.I. with LLMs” Hands-on Training
Today's episode introduces my two-hour "Generative A.I. with LLMs" training, which is packed with hands-on Python demos in Colab notebooks. It details both open-source (Hugging Face; PyTorch Lightning) and commercial (OpenAI API) LLM options.
LLaMA 2 — It’s Time to Upgrade your Open-Source LLM
If you've been using fine-tuned open-source LLMs (e.g. for generative A.I. functionality or natural-language conversations with your users), it's very likely time you switch your starting model over to Llama 2. Here's why:
“The Dream of Life” by Alan Watts
For episode #700 today, I bring you the "Dream of Life" thought experiment originally penned by Alan Watts. You are terrifically powerful (particularly now that you're armed with A.I.!) — are you making good use of your power?
Also, time flies, eh? Another hundred episodes in the bag today! Thanks for listening, providing feedback and otherwise contributing to making SuperDataScience, with over a million downloads per quarter, the most listened-to podcast in the data science industry. We've got some serious awesomeness lined up for the next hundred episodes — I can't wait for the amazing, inspiring, mind-opening conversations.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Brain-Computer Interfaces and Neural Decoding, with Prof. Bob Knight
In today's extraordinary episode, Prof. Bob Knight details how ML-powered brain-computer interfaces (BCIs) could allow real-time thought-to-speech synthesis and the reversal of cognitive decline associated with aging.
This is a rare treat as "Dr. Bob" doesn't use social media and has only made two previous podcast appearances: on Ira Flatow's "Science Friday" and a little-known program called "The Joe Rogan Experience".
Dr. Bob:
• Is Professor of Neuroscience and Psychology at University of California, Berkeley.
• Is Adjunct Professor of Neurology and Neurosurgery at UC San Francisco.
• Over his career, has amassed tens of millions of dollars in research funding, 75 patents, and countless international awards for neuroscience and cognitive computing research.
• His hundreds of papers have together been cited over 70,000 times.
In this episode, Bob details:
• Why the “prefrontal cortex” region of our brains makes us uniquely intelligent relative to all the other species on this planet.
• The invaluable data that can be gathered by putting recording electrodes through our skulls and directly into our brains.
• How "dynamic time-warping" algorithms allow him to decode imagined sounds, even musical melodies, through recording electrodes implanted into the brain.
• How BCIs are life-changing for a broad range of illnesses today.
• The extraordinary ways that advances in hardware and machine learning could revolutionize medical care with BCIs in the coming years.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
CatBoost: Powerful, efficient ML for large tabular datasets
CatBoost is making waves in open-source ML as it's often the top approach for tasks as diverse as classification, regression, ranking, and recommendation. This is especially so if working with tabular data that include categorical variables.
With this justifiable excitement in mind, today's "Five-Minute Friday" episode of SuperDataScience is dedicated to CatBoost (short for “category” and “boosting”).
CatBoost has been around since 2017, when it was released by Yandex, a tech giant based in Moscow. In a nutshell, CatBoost — like the more established (and regularly Kaggle-leaderboard-topping) XGBoost and LightGBM — is at its heart a decision-tree algorithm that leverages gradient boosting. So that explains the “boost” part of CatBoost.
The “cat” (“category”) part comes from CatBoost’s superior handling of categorical features. If you’ve trained models on categorical data before, you’ve likely experienced the tedium of the preprocessing and feature engineering it requires. CatBoost comes to the rescue here, handling categorical variables automatically via techniques such as target encoding and one-hot encoding, eliminating the need for extensive manual preprocessing or feature engineering.
In addition to CatBoost’s superior handling of categorical features, the algorithm also makes use of:
• A specialized boosting scheme known as Ordered Boosting, which uses permutations of the training examples so that each example's residual is computed by a model that never saw that example's own target, reducing the target leakage ("prediction shift") that standard gradient boosting can suffer from.
• Symmetric (oblivious) decision trees, which apply the same split condition across an entire level of the tree and have a fixed depth, enabling faster training than XGBoost and training times comparable to LightGBM, which is famous for its speed.
• Regularization techniques, including the well-known L2 regularization as well as the ordered boosting and symmetric trees already discussed, which together make CatBoost less likely to overfit to training data than other boosted-tree algorithms.
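As a minimal sketch of how little boilerplate this requires in practice (the file and column names below are hypothetical placeholders, not from the episode):

```python
# A minimal CatBoost sketch. The DataFrame and column names ("plan", "region",
# "churned") are hypothetical placeholders for any tabular dataset.
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")           # hypothetical file
X, y = df.drop(columns="churned"), df["churned"]
cat_features = ["plan", "region"]           # raw string categories: no manual encoding needed

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = CatBoostClassifier(
    iterations=500,       # number of boosted (symmetric) trees
    depth=6,              # fixed tree depth
    l2_leaf_reg=3.0,      # L2 regularization on leaf values
    verbose=0,
)
model.fit(X_train, y_train, cat_features=cat_features)  # categories handled internally
print("Test accuracy:", model.score(X_test, y_test))
```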
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Lossless LLM Weight Compression: Run Huge Models on a Single GPU
Many recent episodes have been focused on open-source Large Language Models that you can download and fine-tune to particular use cases depending on your needs or your users’ needs. I’ve particularly been highlighting LLMs with seven billion to 13 billion model parameters because models of this size can typically be run on a single consumer GPU, making them relatively manageable and affordable both to train and to have in production.
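As a rough back-of-the-envelope illustration of why that size range is manageable (the 24 GB consumer-GPU figure, e.g. an RTX 3090 or 4090, is my assumption, and activation and KV-cache memory are ignored):

```python
# Rough back-of-the-envelope memory estimates for an LLM's weights alone.
# Assumes a 24 GB consumer GPU (e.g., an RTX 3090/4090) and ignores activation
# and KV-cache memory, so real headroom is somewhat smaller.
GB = 1024 ** 3

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / GB

for n_params in (7e9, 13e9):
    for precision, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{n_params / 1e9:.0f}B @ {precision}: "
              f"{weight_memory_gb(n_params, nbytes):.1f} GB")
```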
Six Reasons Why Building LLM Products Is Tricky
Many of my recent podcast episodes have focused on the bewildering potential of fine-tuning open-source Large Language Models (LLMs) to your specific needs. There are, however, six big challenges when bringing LLMs to your users:
1. Strictly limited context windows
2. LLMs are slow and compute-intensive at inference time
3. "Engineering" reliable prompts can be tricky
4. Prompt-injection attacks make you vulnerable to data and IP theft
5. LLMs aren't (usually) products on their own
6. There are legal and compliance issues
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Get More Language Context out of your LLM
The "context window" limits the number of words that can be input to (or output by) a given Large Language Model. Today's episode introduces FlashAttention, a trick that allows for much larger context windows.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
StableLM: Open-source “ChatGPT”-like LLMs you can fit on one GPU
Known for its widely popular text-to-image generator Stable Diffusion, Stability AI has now released the first models in its open-source StableLM suite of language models, marking a significant advancement in the AI domain.
The Chinchilla Scaling Laws
The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, I cover this ratio and the LLMs that have arisen from it (incl. the new Cerebras-GPT family).
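As a rough illustration of the ratio in question (the Chinchilla paper by Hoffmann et al. implies roughly 20 training tokens per model parameter at the compute-optimal point):

```python
# A rough illustration of the compute-optimal ratio implied by the Chinchilla
# paper (Hoffmann et al., 2022): roughly 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20  # approximate Chinchilla-optimal ratio

for n_params in (1e9, 7e9, 70e9):
    optimal_tokens = TOKENS_PER_PARAM * n_params
    print(f"{n_params / 1e9:>4.0f}B params -> ~{optimal_tokens / 1e12:.2f}T training tokens")
```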
Open-source “ChatGPT”: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0
Want a GPT-4-style model on your own hardware and fine-tuned to your proprietary language-generation tasks? Today's episode covers the key open-source models (Alpaca, Vicuña, GPT4All-J, and Dolly 2.0) for doing this cheaply on a single GPU 🤯
We begin with a retrospective look at Meta AI's LLaMA model, which was introduced in episode #670. LLaMA's 13-billion-parameter variant achieves performance comparable to GPT-3 while being significantly smaller and more manageable. This efficiency makes it possible to run, and even fine-tune, the model on a single GPU, democratizing access to advanced AI capabilities.
The focus then shifts to four models that surpass LLaMA in terms of power and sophistication: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0. Each of these models presents a unique blend of innovation and practicality, pushing the boundaries of what's possible with AI:
Alpaca
Developed by Stanford researchers, Alpaca is an evolution of the 7 billion parameter LLaMA model, fine-tuned with 52,000 examples of instruction-following natural language. This model excels in mimicking GPT-3.5's instruction-following capabilities, offering high performance at a fraction of the cost and size.
Vicuña
Vicuña, a product of collaborative research across multiple institutions, builds on both the 7 billion and 13 billion parameter LLaMA models. It's fine-tuned on 70,000 user-shared ChatGPT conversations from the ShareGPT repository, achieving GPT-3.5-like performance with unique user-generated content.
GPT4All-J
GPT4All-J, released by Nomic AI, is based on EleutherAI's open source 6 billion parameter GPT-J model. It's fine-tuned with an extensive 800,000 instruction-response dataset, making it an attractive option for commercial applications due to its open-source nature and Apache license.
Dolly 2.0
Dolly 2.0, from data analytics giant Databricks, builds upon EleutherAI's 12-billion-parameter Pythia model. It's fine-tuned with 15,000 human-generated instruction-response pairs, offering another open-source, commercially viable option for AI applications.
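As a hedged sketch of how you might try one of these models yourself with Hugging Face transformers (the Hub ID "databricks/dolly-v2-3b" is Dolly 2.0's smaller sibling, which is friendlier to a single consumer GPU; the prompt and settings are placeholders):

```python
# A minimal sketch of loading an open-source instruction-tuned model with
# Hugging Face transformers. "databricks/dolly-v2-3b" is the smallest Dolly 2.0
# variant; swap in another Hub ID if you have more GPU memory available.
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,   # roughly halves memory vs. fp32
    device_map="auto",            # place weights on available GPU(s)/CPU
    trust_remote_code=True,       # Dolly ships a custom instruction-following pipeline
)

print(generate("Explain gradient boosting in one short paragraph."))
```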
These models represent a significant shift in the AI landscape, making it economically feasible for individuals and small teams to train and deploy powerful language models. With a few hundred to a few thousand dollars, it's now possible to create proprietary, ChatGPT-like models tailored to specific use cases.
The advancements in AI models that can be trained on a single GPU mark a thrilling era in data science. These developments not only showcase the rapid progression of AI technology but also significantly lower the barrier to entry, allowing a broader range of users to explore and innovate in the field of artificial intelligence.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
LLaMA: GPT-3 performance, 10x smaller
By training (relatively) small LLMs for (much) longer, Meta AI's LLaMA architectures achieve GPT-3-like outputs at as little as a thirteenth of GPT-3's size. This means cost savings and much faster execution time.
LLaMA, a clever nod to LLMs (Large Language Models), is Meta AI's latest contribution to the AI world. Based on the Chinchilla scaling laws, LLaMA adopts a principle that veers away from the norm: unlike its predecessors, which boasted hundreds of billions of parameters, LLaMA emphasizes training smaller models for longer durations to achieve enhanced performance.
The Chinchilla Principle in LLaMA
The Chinchilla scaling laws, introduced by Hoffmann and colleagues, postulate that, for a fixed compute budget, training a smaller model on more data can yield better performance than training a larger model on less. LLaMA, with its 7-billion- to 65-billion-parameter models, is a testament to this principle. For perspective, GPT-3 has 175 billion parameters, making the smallest LLaMA model just a fraction of its size.
Training Longer for Greater Performance
Meta AI's LLaMA pushes the boundaries by training these relatively small models for significantly longer than is conventional. It also contrasts with recent top models like Chinchilla, GPT-3, and PaLM, which relied on training data that is not publicly available: LLaMA uses entirely open-source data, including English Common Crawl, C4, GitHub, Wikipedia, and other public datasets, adding to its appeal and accessibility.
LLaMA's Remarkable Achievements
LLaMA's achievements are notable. The 13-billion-parameter model (LLaMA 13B) outperforms GPT-3 on most benchmarks despite having roughly 13 times fewer parameters, which means LLaMA 13B can offer GPT-3-like performance on a single GPU. The largest LLaMA model, at 65 billion parameters, competes with giants like Chinchilla 70B and PaLM, and all of this preceded the release of GPT-4.
This approach signifies a shift in the AI paradigm – achieving state-of-the-art performance without the need for enormous models. It's a leap forward in making advanced AI more accessible and environmentally friendly. The model weights, though intended for researchers, have been leaked and are available for non-commercial use, further democratizing access to cutting-edge AI.
LLaMA not only establishes a new benchmark in AI efficiency but also sets the stage for future innovations. Building on LLaMA's foundation, models like Alpaca, Vicuna, and GPT4ALL have emerged, fine-tuned on thoughtful datasets to exceed even LLaMA's performance. These developments herald a new era in AI, where size doesn't always equate to capability.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.