AI Digest Weekly - Week 16, Apr 16th - Apr 23rd, 2023

Summaries of top open-sourced AI projects in the domains of LLMs, computer vision, and NLP

Aankur Bhatia
6 min read · Apr 23, 2023
Created using DALL-E (OpenAI)

Here are the research summaries for the top open-sourced AI projects of the week, with their GitHub links.

For week 14's top AI research paper summaries, please read here.

1. Instruction Tuning with GPT-4

Paper link: www.arxiv.org/pdf/2304.03277.pdf

GitHub link: www.github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM

LLaMA vs GPT-4 Comparison

This GitHub repository introduces GPT-4-LLM, a project that uses GPT-4 to generate data for building instruction-following LLMs (large language models). The repository contains two sets of instruction-following data for fine-tuning LLMs: English and Chinese. GPT-4 also generates comparison data that can be used to train reward models. Additionally, a set of 9K "unnatural instructions" can be used to measure the gap between GPT-4 and instruction-tuned models. Researchers can use these datasets to advance the state of the art in instruction tuning for LLMs. However, the data is intended and licensed for research use only, and models trained on it should not be used outside of research.
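To give a feel for the data, here is a minimal sketch that loads the English instruction-following split and renders one example as an Alpaca-style prompt. The file name alpaca_gpt4_data.json, the instruction/input/output keys, and the prompt wording are assumptions based on the Alpaca format the project follows; check the repository's data folder for the exact names.

```python
import json

# Hypothetical file name; the repository ships English and Chinese splits.
with open("alpaca_gpt4_data.json") as f:
    examples = json.load(f)  # assumed: a list of {"instruction", "input", "output"} dicts

def to_prompt(ex: dict) -> str:
    """Render one example as an Alpaca-style training prompt (illustrative wording)."""
    if ex.get("input"):
        return (
            "### Instruction:\n" + ex["instruction"]
            + "\n\n### Input:\n" + ex["input"]
            + "\n\n### Response:\n" + ex["output"]
        )
    return (
        "### Instruction:\n" + ex["instruction"]
        + "\n\n### Response:\n" + ex["output"]
    )

print(to_prompt(examples[0]))
```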

The performance of the LLMs was evaluated on the Helpfulness, Honesty, and Harmlessness criteria proposed by Anthropic. Two instruction-tuned LLaMA models were compared, fine-tuned on data generated by GPT-4 and GPT-3 respectively. LLaMA-GPT-4 performs substantially better than LLaMA-GPT-3 on the "Helpfulness" criterion, and similarly to the original GPT-4 across all three criteria, suggesting a promising direction for developing state-of-the-art instruction-following LLMs.

Researchers can use the included IPython notebook plots/main_plots.ipynb to plot the results. The repository also includes the code to fine-tune LLaMA, a recipe that follows standard Hugging Face training code.
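As a rough illustration of what such a recipe looks like, the sketch below fine-tunes a causal language model on the GPT-4 instruction data with the standard Hugging Face Trainer. The checkpoint name, file name, field names, and hyperparameters are placeholders rather than the repository's exact settings; the repo's own training script should be preferred.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "decapoda-research/llama-7b-hf"  # placeholder LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token      # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Assumed Alpaca-style fields; adjust to the actual JSON schema.
def tokenize(ex):
    text = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"
    return tokenizer(text, truncation=True, max_length=512)

data = load_dataset("json", data_files="alpaca_gpt4_data.json")["train"].map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-gpt4-sft",
                           per_device_train_batch_size=1,
                           num_train_epochs=3,
                           learning_rate=2e-5),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```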

2. MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models

Paper link: www.minigpt-4.github.io

GitHub link: www.github.com/Vision-CAIR/MiniGPT-4

Chatting with MiniGPT-4

MiniGPT-4 is a model that aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer. It is trained in two stages. In the first stage, traditional pretraining is performed on around 5 million image-text pairs, taking about 10 hours on 4 A100 GPUs; this allows Vicuna to understand images, but its generation ability suffers. To address this issue and improve usability, the authors propose a novel way to create high-quality image-text pairs and build a small dataset of 3,500 pairs. The second, fine-tuning stage trains on this dataset with a conversation template, significantly improving generation reliability and overall usability. MiniGPT-4 yields emerging vision-language capabilities similar to those demonstrated in GPT-4.
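The key architectural idea, a single trainable projection between a frozen vision encoder and a frozen LLM, can be sketched conceptually as follows. This is not the repository's code; the dimensions (768 for the BLIP-2 Q-Former output, 5120 for Vicuna-13B's hidden size, 32 query tokens) are illustrative values.

```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """Conceptual sketch: a single linear layer maps frozen visual features
    from the BLIP-2 Q-Former into the frozen LLM's embedding space."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)   # the only trainable piece

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_query_tokens, vision_dim)
        return self.proj(visual_tokens)              # (batch, n_query_tokens, llm_dim)

# The projected tokens are prepended to the text embeddings fed to the frozen
# Vicuna model; only self.proj receives gradient updates during alignment.
proj = VisionToLLMProjection()
fake_visual = torch.randn(2, 32, 768)                # e.g. 32 Q-Former query tokens
print(proj(fake_visual).shape)                       # torch.Size([2, 32, 5120])
```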

The installation process involves cloning the repository, creating and activating a Python environment, preparing the Vicuna weights and setting their path in the model config file, downloading the pretrained checkpoint, and setting its path in the evaluation config file.

The training of MiniGPT-4 consists of two alignment stages. In the first stage, the model is trained on image-text pairs from the LAION and CC datasets to align the vision and language models. In the second stage, a small, high-quality image-text dataset, created by the model itself together with ChatGPT, is used to further align MiniGPT-4. The commands to launch both stages are provided, along with instructions for downloading and preparing the datasets.
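Conceptually, the second stage wraps each (image, detailed description) pair in a conversation template before fine-tuning. The sketch below shows the idea only; the exact prompt wording and the image placeholder token are assumptions, so consult the repository's prompt templates for the real ones.

```python
# Assumed placeholder marking where projected image embeddings are inserted.
IMAGE_PLACEHOLDER = "<Img><ImageHere></Img>"

def build_stage2_sample(instruction: str, detailed_description: str) -> dict:
    """Wrap one (image, description) pair in an illustrative conversation template."""
    prompt = f"###Human: {IMAGE_PLACEHOLDER} {instruction} ###Assistant: "
    return {"prompt": prompt, "target": detailed_description}

sample = build_stage2_sample(
    "Describe this image in detail.",
    "A small wooden cabin sits beside a frozen lake at sunrise...",
)
print(sample["prompt"])
```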

MiniGPT-4's model architecture follows BLIP-2, is built on top of LAVIS, and leverages the strong language ability of Vicuna with only 13B parameters. These capabilities make it a useful tool for generating text grounded in images.

3. Web LLM

Paper link: www.mlc.ai/web-llm/

GitHub link: www.github.com/mlc-ai/web-llm

Chat with WebLLM

Web LLM is a project that aims to bring language model chat directly into web browsers. It is an attempt to build AI assistants with privacy protection by running everything inside the browser, with no server support, accelerated by WebGPU. Instead of calling a server, Web LLM bakes LLMs directly into the client side and runs them in the browser, which reduces cost and improves personalization and privacy.

The project faces some hurdles, such as running the models in an environment without the usual GPU-accelerated Python frameworks, planning memory usage carefully, and compressing the weights aggressively so that the models fit into memory. In addition, Web LLM aims to provide a repeatable and hackable workflow that lets anyone develop and optimize these models in a productive, Python-first approach and deploy them universally, including on the web.
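A quick back-of-the-envelope calculation shows why aggressive weight compression is unavoidable for in-browser inference; the figures below count weights only and ignore activations and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # a 7B-parameter model, roughly the size of Vicuna-7B
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(n_params, bits):.1f} GB")

# 32-bit ~ 28.0 GB, 16-bit ~ 14.0 GB, 4-bit ~ 3.5 GB: only the compressed
# variant plausibly fits within a browser tab's GPU memory budget.
```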

The project relies on machine learning compilation (MLC) technology that builds on the open-source ecosystem, including Hugging Face, model variants from LLaMA and Vicuna, wasm, and WebGPU. It also uses TVM Unity, an ongoing development in the Apache TVM community. The solution builds the language model's IRModule in TVM with native dynamic shape support, reducing both the amount of computation and memory usage. Finally, the project provides instructions for local deployment, including installing TVM Unity, satisfying the prerequisites for web deployment, and preparing the dependencies needed for the web build.

4. Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Paper link: www.arxiv.org/pdf/2304.06939.pdf

GitHub link: www.github.com/allenai/mmc4

Multimodal C4 Dataset

The Allen Institute for AI (AI2) has released a new open dataset called Multimodal-C4, which contains more than 500 million images interleaved with the text they appear alongside. The corpus has been curated and preprocessed into four subsets, including Multimodal-C4 fewer-faces, which contains 385 million images and 79 million documents. The dataset is freely available for download and use by researchers.

The dataset includes image features extracted with the CLIP ViT-L/14 model and image-by-text similarity matrices, as well as text and URL information, and it can be used to train machine learning models for image captioning, multimodal learning, and more. The data is available in JSON format and can be downloaded directly from Google Cloud Storage; researchers can also request access to the complete Multimodal-C4 dataset or the raw images via a Google form. The dataset is expected to advance research in computer vision and natural language processing by enabling researchers to train models on a large-scale, open corpus of multimodal data.
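The shipped image-by-text similarity scores are the kind of thing one can reproduce for a single document with CLIP ViT-L/14 through Hugging Face transformers, as in the hedged sketch below; the image path and sentences are made up, and the dataset's own JSON field names are not reproduced here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")  # placeholder image from a document
texts = ["A dog playing in the snow.", "A plate of pasta on a table."]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarities between the image and each sentence (higher = better match).
image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).squeeze(0))
```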

5. StableLM: Stability AI Language Models

Paper link: www.stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models

GitHub link: www.github.com/stability-AI/stableLM/

Stability AI has released its StableLM series of language models, with the initial set including 3B and 7B parameter models. StableLM-Alpha is trained on a new dataset three times the size of The Pile, containing 1.5 trillion tokens, and the context length for these models is 4,096 tokens. StableLM-Tuned-Alpha has also been fine-tuned on five recent datasets for conversational agents: Nomic-AI's gpt4all, Stanford's Alpaca, RyokoAI's ShareGPT52K, Databricks Labs' Dolly, and Anthropic's HH. The models are hosted on the Hugging Face hub.

The StableLM models can write poetry and short stories, tell jokes, and engage in chit-chat. StableLM-Tuned should be used with prompts that include a system prompt reinforcing the model's helpful and harmless nature. The StableLM series will include models of different sizes, with 15B, 30B, 65B, and 175B models in progress, and a technical report documenting the model specifications and training settings is upcoming. StableLM's base models are released under CC BY-SA-4.0.
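A minimal generation sketch for the tuned checkpoints follows. The hub id stabilityai/stablelm-tuned-alpha-7b and the <|SYSTEM|>/<|USER|>/<|ASSISTANT|> prompt format reflect the model card at release time and may change, so verify them against the Hugging Face hub before relying on this.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-tuned-alpha-7b"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# System prompt reinforcing the model's helpful-and-harmless persona.
system_prompt = (
    "<|SYSTEM|># StableLM Tuned (Alpha version)\n"
    "- StableLM is a helpful and harmless open-source AI language model.\n"
)
prompt = system_prompt + "<|USER|>Write a short poem about the sea.<|ASSISTANT|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```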

Thanks for reading. As usual, you can reach out to me on LinkedIn and subscribe to my YouTube channel for a video explanation. See you next week.


Aankur Bhatia

Aankur works as the Chief Data Scientist for a large multinational company. He is passionate about the application of ML to cybersecurity and holds over 20 patents.