The Chinchilla Scaling Laws dictate the amount of training data needed to optimally train a Large Language Model (LLM) of a given size. For Five-Minute Friday, I cover this ratio and the LLMs that have arisen from it (incl. the new Cerebras-GPT family).
Open-source “ChatGPT”: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0
Want a GPT-4-style model on your own hardware and fine-tuned to your proprietary language-generation tasks? Today's episode covers the key open-source models (Alpaca, Vicuña, GPT4All-J, and Dolly 2.0) for doing this cheaply on a single GPU 🤯
We begin with a retrospective look at Meta AI's LLaMA model, which was introduced in episode #670. LLaMA's 13-billion-parameter variant achieves performance comparable to GPT-3 while being dramatically smaller and more manageable. This efficiency makes it possible to run, and even fine-tune, the model on a single GPU, democratizing access to advanced AI capabilities.
The focus then shifts to four instruction-fine-tuned models that build on this progress and go beyond the base LLaMA in practical usefulness: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0. Each of these models presents a unique blend of innovation and practicality, pushing the boundaries of what's possible with AI:
Alpaca
Developed by Stanford researchers, Alpaca is an evolution of the 7 billion parameter LLaMA model, fine-tuned with 52,000 examples of instruction-following natural language. This model excels in mimicking GPT-3.5's instruction-following capabilities, offering high performance at a fraction of the cost and size.
Vicuña
Vicuña, a product of collaborative research across multiple institutions, builds on both the 7 billion and 13 billion parameter LLaMA models. It's fine-tuned on 70,000 user-shared ChatGPT conversations from the ShareGPT repository, achieving GPT-3.5-like performance with unique user-generated content.
GPT4All-J
GPT4All-J, released by Nomic AI, is based on EleutherAI's open source 6 billion parameter GPT-J model. It's fine-tuned with an extensive 800,000 instruction-response dataset, making it an attractive option for commercial applications due to its open-source nature and Apache license.
Dolly 2.0
Dolly 2.0, from database giant Databricks, builds upon EleutherAI's 12-billion-parameter Pythia model. It's fine-tuned with 15,000 human-generated instruction-response pairs, offering another open-source, commercially viable option for AI applications.
These models represent a significant shift in the AI landscape, making it economically feasible for individuals and small teams to train and deploy powerful language models. With a few hundred to a few thousand dollars, it's now possible to create proprietary, ChatGPT-like models tailored to specific use cases.
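To make that concrete, here is a minimal sketch of what instruction fine-tuning can look like with the Hugging Face transformers and datasets libraries. The base checkpoint (EleutherAI/pythia-2.8b here) and the instructions.json file are illustrative placeholders for whatever licensed model and proprietary instruction-response data you choose; a real single-GPU run would typically also add parameter-efficient techniques such as LoRA or 8-bit loading, which are omitted for brevity.

```python
# Minimal sketch: instruction fine-tuning a causal LM with Hugging Face.
# "EleutherAI/pythia-2.8b" and "instructions.json" are illustrative placeholders;
# swap in whichever base checkpoint and dataset you are licensed to use.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "EleutherAI/pythia-2.8b"      # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Expect records like {"instruction": "...", "response": "..."}.
dataset = load_dataset("json", data_files="instructions.json")["train"]

def format_and_tokenize(example):
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="instruction-tuned-model",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=3,
                           learning_rate=2e-5,
                           fp16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The prompt template here mirrors the instruction/response style popularized by Alpaca; the exact format matters less than applying it consistently at both training and inference time.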
The advancements in AI models that can be trained on a single GPU mark a thrilling era in data science. These developments not only showcase the rapid progression of AI technology but also significantly lower the barrier to entry, allowing a broader range of users to explore and innovate in the field of artificial intelligence.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
LLaMA: GPT-3 performance, 10x smaller
By training (relatively) small LLMs for (much) longer, Meta AI's LLaMA architectures achieve GPT-3-like outputs at as little as a thirteenth of GPT-3's size. This means cost savings and much faster execution time.
LLaMA, a clever nod to LLMs (Large Language Models), is Meta AI's latest contribution to the AI world. Based on the Chinchilla scaling laws, LLaMA adopts a principle that veers away from the norm. Unlike its predecessors, which boasted hundreds of billions of parameters, LLaMA emphasizes training smaller models for longer durations to achieve enhanced performance.
The Chinchilla Principle in LLaMA
The Chinchilla scaling laws, introduced by Hoffmann and colleagues, show that for a fixed compute budget, training a smaller model on more data typically outperforms training a larger model on less. LLaMA, with its 7-billion to 65-billion-parameter models, is a testament to this principle. For perspective, GPT-3 has 175 billion parameters, making the smallest LLaMA model just 4% of its size.
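The Chinchilla result is often boiled down to a rule of thumb of roughly 20 training tokens per model parameter. The short Python sketch below applies that heuristic (it is an approximation, not an exact law) to the LLaMA and GPT-3 model sizes:

```python
# Rough Chinchilla-style token budgets, using the commonly cited heuristic of
# ~20 training tokens per parameter from Hoffmann et al. (2022).
# This is a rule of thumb, not an exact prescription.
CHINCHILLA_TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

for name, n_params in [("LLaMA 7B", 7e9), ("LLaMA 13B", 13e9),
                       ("LLaMA 65B", 65e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: ~{compute_optimal_tokens(n_params) / 1e12:.2f} trillion tokens")
```

For comparison, Meta reports training LLaMA 65B on roughly 1.4 trillion tokens, in line with what this heuristic suggests, whereas GPT-3 was trained on around 300 billion tokens, far below its Chinchilla-optimal budget.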
Training Longer for Greater Performance
Meta AI's LLaMA pushes the boundaries by training these relatively smaller models on far more data, for significantly longer, than conventional approaches. LLaMA also stands apart from last year's top models, such as Chinchilla and PaLM, as well as GPT-3, which relied on training data that is proprietary or undocumented. LLaMA instead uses entirely publicly available data, including datasets like English Common Crawl, C4, GitHub, Wikipedia, and others, adding to its appeal and accessibility.
LLaMA's Remarkable Achievements
LLaMA's achievements are notable. The 13-billion-parameter model (LLaMA 13B) outperforms GPT-3 on most benchmarks despite having roughly 13 times fewer parameters, which means LLaMA 13B can offer GPT-3-like performance while running on a single GPU. The largest model, LLaMA 65B, competes with giants like Chinchilla 70B and PaLM 540B, and all of this preceded the release of GPT-4.
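A back-of-the-envelope calculation shows why the 13-billion-parameter model is single-GPU territory for inference while GPT-3 is not: at 16-bit precision every parameter occupies two bytes, and 8-bit quantization halves that again. The sketch below is just that arithmetic, counting weights only and ignoring activations and the KV cache.

```python
# Back-of-the-envelope GPU memory needed just to hold model weights.
# Ignores activations, the KV cache, and framework overhead.
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

for n_params, label in [(13e9, "LLaMA 13B"), (175e9, "GPT-3 175B")]:
    for bytes_per_param, precision in [(2, "fp16"), (1, "int8")]:
        gb = weight_memory_gb(n_params, bytes_per_param)
        print(f"{label} @ {precision}: ~{gb:.0f} GB")
```

At 8-bit precision, LLaMA 13B's roughly 13 GB of weights fit on a single 24 GB consumer card, whereas GPT-3-scale weights run to hundreds of gigabytes and must be sharded across many GPUs.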
This approach signifies a shift in the AI paradigm – achieving state-of-the-art performance without the need for enormous models. It's a leap forward in making advanced AI more accessible and environmentally friendly. The model weights, though intended for researchers, have been leaked and are available for non-commercial use, further democratizing access to cutting-edge AI.
LLaMA not only establishes a new benchmark in AI efficiency but also sets the stage for future innovations. Building on LLaMA's foundation, models like Alpaca, Vicuña, and GPT4All have emerged, fine-tuned on thoughtfully curated datasets to exceed even LLaMA's performance on instruction-following tasks. These developments herald a new era in AI, where size doesn't always equate to capability.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
GPT-4 Has Arrived
SuperDataScience episode #666 — appropriate for an algorithm that has folks (quixotically) signing a letter to pause all A.I. development. In this first episode of the GPT-4 trilogy, I introduce GPT-4's staggering capabilities in just ten minutes.
A Leap in AI Safety and Accuracy
GPT-4 marks a significant advance over its predecessor, GPT-3.5, in terms of both safety and factual accuracy. It is reportedly 82% less likely to respond with disallowed content and 40% more likely to produce factually correct responses. Despite improvements, challenges like sociodemographic biases and hallucinations persist, although they are considerably reduced.
Academic and Professional Exam Performance
The prowess of GPT-4 becomes evident when revisiting queries initially tested on GPT-3.5. Its ability to summarize complex academic content accurately and its human-like response quality are striking. In one test, GPT-4's output was mistaken for human writing by GPTZero, an AI-detection tool, underscoring its sophistication. In another, the Uniform Bar Exam, GPT-4 scored in the 90th percentile, a massive leap from GPT-3.5's 10th percentile.
Multimodality
GPT-4 introduces multimodality, handling both language and visual inputs. This capability allows for innovative interactions, like recipe suggestions based on fridge contents or transforming drawings into functional websites. This visual aptitude notably boosted its performance in exams like the Biology Olympiad, where GPT-4 scored in the 99th percentile.
The model also demonstrates proficiency in numerous languages, including low-resource ones, outperforming other major models in most languages tested. This linguistic versatility extends to its translation capabilities between these languages.
The Secret Behind GPT-4’s Success
While OpenAI has not disclosed the exact number of model parameters in GPT-4, it's speculated that they significantly exceed GPT-3's 175 billion. This increase, coupled with more and better-curated training data, and the ability to handle vastly more context (up to 32,000 tokens), are likely contributors to GPT-4's enhanced performance.
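As a practical aside, you can check how much of that 32,000-token window a given prompt would consume using OpenAI's tiktoken library; the sketch below assumes the cl100k_base encoding (the tokenizer family used by GPT-4) and treats the reserved reply budget as an arbitrary placeholder.

```python
# Rough check of whether a prompt fits in a 32,000-token context window.
# cl100k_base is the encoding used by the GPT-4 model family; the reserved
# reply budget below is an arbitrary placeholder.
import tiktoken

CONTEXT_WINDOW = 32_000
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str, reserved_for_reply: int = 1_000) -> bool:
    """Return True if the prompt plus a reply budget fits in the context window."""
    n_tokens = len(enc.encode(prompt))
    print(f"Prompt is {n_tokens} tokens.")
    return n_tokens + reserved_for_reply <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following contract: ..."))
```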
Reinforcement Learning from Human Feedback (RLHF)
GPT-4 incorporates RLHF, a method that fine-tunes the model using human feedback, typically rankings of candidate responses, so that its outputs align more closely with what people actually want. This approach already proved effective in previous models like InstructGPT.
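To make that a little more concrete, the human feedback usually enters through a reward model trained on pairwise preferences: given two candidate responses, it learns to score the human-preferred one more highly, and a later reinforcement-learning stage (commonly PPO) then optimizes the language model against that reward. The PyTorch sketch below shows only the pairwise preference loss, with an arbitrary stand-in scorer for the reward model.

```python
# Minimal sketch of the pairwise preference loss used to train an RLHF reward model.
# The reward model here is an arbitrary stand-in scorer that maps a fixed-size
# response representation to a scalar; the later RL stage (e.g. PPO) is not shown.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def preference_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen response's score above the rejected one's."""
    score_chosen = reward_model(chosen)      # shape (batch, 1)
    score_rejected = reward_model(rejected)  # shape (batch, 1)
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy batch: stand-in feature vectors for human-preferred vs. rejected responses.
loss = preference_loss(torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
print(f"preference loss: {loss.item():.3f}")
```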
GPT-4 represents a monumental step in AI development, balancing unprecedented capabilities with improved safety measures. Its impact is far-reaching, offering new possibilities in various fields and highlighting the importance of responsible AI development and use. As we continue to explore its potential, the conversation around AI safety and ethics becomes increasingly vital.
The SuperDataScience GPT-4 trilogy comprises:
• #666 (today): an introductory overview by yours truly
• #667 (Tuesday): world-leading A.I.-monetization expert Vin Vashishta joins me to detail how you can leverage GPT-4 to your commercial advantage
• #668 (next Friday): world-leading A.I.-safety expert Jeremie Harris joins me to detail the (existential!) risks of GPT-4 and the models it paves the way for
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
MIT Study: ChatGPT Dramatically Increases Productivity
With all of this ChatGPT and GPT-4 news, I was wondering whether these generative A.I. tools actually result in the productivity gains everyone supposes them to. Well, wonder no more…
How to Build Data and ML Products Users Love
What makes people latch onto data products and come back for more? In today's episode, Brian T. O'Neill unveils the processes and teams that make data and A.I. products engaging and sticky for users.
Brian:
• Founded and runs Designing for Analytics, a consultancy that specializes in designing analytics and ML products so that they are adopted.
• Hosts the "Experiencing Data" podcast, an entertaining show that covers how to use product-development methodologies and UX design to drive meaningful user and business outcomes with data.
In today's episode, Brian details:
• What data product management is.
• Why so many data projects fail.
• How to develop machine learning-powered products that users love.
• The teams and skill sets required to develop successful data products.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
A.I. Talent and the Red-Hot A.I. Skills
What skills and traits do the best A.I. talent have? And how do you attract the best A.I. talent to your firm? Jaclyn Rice Nelson of Tribe AI, the world's most prestigious ML collective, joins today's episode to fill us in.
Jaclyn:
• Is Co-Founder/CEO of Tribe A.I., a "collective" of ML engineers and data scientists that drop into companies to accelerate their A.I. capabilities.
• Previously worked in senior roles at Google and CapitalG, Alphabet's growth equity fund.
In today's episode, she details:
• What characterizes the very best A.I. talent.
• What skills you should learn today to be tomorrow’s top A.I. talent.
• How to attract the top engineers and data scientists to your firm.
• The specific category of A.I. project that her clients are suddenly demanding tons of help with.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
A.I. Speech for the Speechless
Thanks to a new lip-reading A.I., non-verbal medical patients can now "speak" to their clinicians and loved ones…
SparseGPT: Remove 100 Billion Parameters but Retain 100% Accuracy
Today's episode isn't specifically about GPT-3 itself, however. It's about just how massive these large language models have become, and how we can prune them to compress them.
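SparseGPT itself uses an efficient second-order, Hessian-based pruning procedure that doesn't fit in a few lines, so the sketch below instead shows the simplest baseline, magnitude pruning, in which the smallest-magnitude weights are zeroed out. It is only meant to make the idea of "removing parameters" concrete, not to reproduce SparseGPT.

```python
# Simplest pruning baseline: zero out the smallest-magnitude weights in a matrix.
# This is plain magnitude pruning, not SparseGPT's second-order method.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with roughly the smallest `sparsity` fraction zeroed."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))
pruned = magnitude_prune(w, sparsity=0.5)
print(f"fraction of weights zeroed: {np.mean(pruned == 0.0):.2f}")
```

In practice, pruned models are stored in a sparse format or rely on hardware-supported structured sparsity so that the zeroed weights translate into real memory and latency savings.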
A Framework for Big Life Decisions
The biggest decisions we make involve trade-offs between professional opportunity (money) and our personal life (love). Today, Stanford labor economist Prof. Myra Strober provides a framework for making a big choice.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
What I Learned in 2022
To cap 2022 off, like I did to cap 2021 off, I’m covering the five big lessons that I learned over the course of the year:
The Equality Machine
Many recent books and articles spread fear about data collection and A.I. Today's guest, Prof. Orly Lobel, offers the antidote with her book "The Equality Machine" — an optimistic take on the future of data science.
Liquid Neural Networks
Liquid Neural Networks are a new, biology-inspired deep learning approach that could be transformative. I think they're super cool, and Adrian Kosowski, PhD, introduced them to me for today's Five-Minute Friday episode.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Resilient Machine Learning
Machine learning is often fragile in production. For today's Five-Minute Friday episode, Dr. Dan Shiebler details how we can make ML more resilient.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
The Critical Human Element of Successful A.I. Deployments
For today's episode, I sat down with the prolific data-science instructor, author and practitioner Keith McCormick to discuss how critical user considerations are for developing a successful A.I. application.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Subword Tokenization with Byte-Pair Encoding
When working with written natural language data as we do with many natural language processing models, a step we typically carry out while preprocessing the data is tokenization. In a nutshell, tokenization is the conversion of a long string of characters into smaller units that we call tokens.
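Byte-pair encoding, the subword tokenization method in the title, builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols in the training corpus. Here's a compact sketch of that training loop on a toy corpus; real tokenizers, such as those used by GPT-family models, operate on bytes and far larger corpora, with many practical details omitted here.

```python
# Toy byte-pair encoding (BPE) training loop: repeatedly merge the most frequent
# adjacent symbol pair. Real BPE tokenizers work on bytes and large corpora and
# add many practical details omitted here.
from collections import Counter

def get_pair_counts(words) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

corpus = ["low", "lower", "lowest", "newer", "newest"]
words = Counter(tuple(w) + ("</w>",) for w in corpus)  # characters + end-of-word marker

for step in range(8):  # learn 8 merges
    pair = get_pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair}")
```

Each learned merge becomes a vocabulary entry, so frequent character sequences end up as single tokens while rare words are split into smaller subword pieces.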
Imagen Video: Incredible Text-to-Video Generation
For today's Five-Minute Friday episode, it's my pleasure to introduce you to the Imagen Video model, published just a few weeks ago by researchers from Google.
Burnout: Causes and Solutions
What really is Burnout? What causes it? And how can you prevent or treat it? Prof. Christina Maslach — world-leading researcher and author on Burnout — joins me for today's episode to unpack these questions.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
OpenAI Whisper: General-Purpose Speech Recognition
One of the challenges that has held machines back from approaching human-level speech recognition, as Whisper now does, has been acquiring sufficiently large amounts of high-quality, labeled training data. "Labeled" in this case means audio of speech that has a corresponding text transcript associated with it. With enough of these labeled data, a machine learning model can learn to take in speech audio as an input and then output the correct corresponding text.
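For a sense of how simple the resulting model is to use, transcription with the open-source openai-whisper package takes only a few lines. In the sketch below, "base" is one of several available model sizes and meeting.mp3 is a placeholder for your own audio file.

```python
# Minimal transcription with OpenAI's open-source whisper package
# (pip install openai-whisper). "base" is one of several model sizes;
# "meeting.mp3" is a placeholder for your own audio file.
import whisper

model = whisper.load_model("base")
result = model.transcribe("meeting.mp3")
print(result["text"])

# Whisper also detects the spoken language and can translate speech to English:
translated = model.transcribe("meeting.mp3", task="translate")
print(translated["text"])
```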
The Joy of Atelic Activities
You might think to yourself “I could be spending this time productively!” But pushing past these inner calls for productivity and leaning into the initial discomfort of atelic activities is likely to be rewarding. When you’re consumed by telic activities, by always pursuing outcomes, you’re missing out on being, on appreciating being alive for the fleeting moments that you have.