AI Digest Weekly - Week 11, Mar 13th - 19th, 2023

Top 10 AI Research Paper Summary

Aankur Bhatia
6 min read · Mar 20



Hi everyone, continuing our journey in exploring the top research papers in AI, including Large Language Models (LLMs), computer vision, and NLP, here are the top 10 for this week.

For last week’s (Week 10) top paper summary, please read here

1. Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition

Vid2Avatar: 3D Avatar Construction

Digital avatars are becoming increasingly common in daily life, especially in areas like video games, virtual reality, robotics, and movies. However, generating high-fidelity 3D avatars has traditionally required expensive and specialized equipment. A new tool called Vid2Avatar simplifies this process by generating high-quality 3D avatars from videos captured in the wild, without the need for professional equipment or complicated setups. Vid2Avatar uses neural fields to separate the human subject from the background, models the human body and the background separately, and optimizes both to produce an accurate and detailed reconstruction. This approach could benefit many applications that rely on digital avatars.

2. Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

CaFo Model

Researchers have introduced a new approach, CaFo, for few-shot visual recognition. The method employs a “Prompt, Generate, then Cache” pipeline to generate more training data, adaptively integrate predictions, and improve the model’s generalization capacity. CaFo combines pre-trained models such as CLIP, DINO, DALL-E, and GPT-3, each of which contributes a different form of prior knowledge to enhance few-shot learning. The method achieves state-of-the-art results on 11 datasets for few-shot classification without requiring additional annotated data.

3. MathPrompter: Mathematical Reasoning using Large Language Models


Large Language Models (LLMs) often struggle with arithmetic reasoning tasks, producing incorrect responses that can erode user trust. Researchers have proposed MathPrompter, a technique that improves LLM performance on mathematical problems and increases confidence in the results. MathPrompter uses the zero-shot chain-of-thought (CoT) technique to generate multiple algebraic expressions or Python functions that answer the same mathematical problem in different ways; when these independent solutions agree, the output can be trusted more. Zero-shot-CoT approaches can help address the limitations of LLMs in arithmetic reasoning tasks, including challenging mathematics problems found in contests or standardized tests.
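To make the "solve it several ways and compare" idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the two candidate functions are hypothetical stand-ins for what an LLM might return for a question such as "each item costs a, you buy b of them and pay a fee of c", and the names `candidates_agree` and `answer_with_consensus` are my own. The candidates are evaluated on random inputs, and an answer is returned only if they all agree.

```python
import random


def algebraic_candidate(a, b, c):
    # Hypothetical candidate 1: an algebraic expression, a * b + c.
    return a * b + c


def python_candidate(a, b, c):
    # Hypothetical candidate 2: a Python function for the same question,
    # written differently (repeated addition instead of multiplication).
    total = c
    for _ in range(b):
        total += a
    return total


def candidates_agree(candidates, n_trials=5, seed=0):
    """Evaluate every candidate on random inputs and check they all match."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        args = [rng.randint(1, 20) for _ in range(3)]
        if len({f(*args) for f in candidates}) != 1:
            return False
    return True


def answer_with_consensus(candidates, values):
    """Return an answer only if the candidates agree on random inputs."""
    if not candidates_agree(candidates):
        return None  # no consensus, so do not trust any candidate
    return candidates[0](*values)


print(answer_with_consensus([algebraic_candidate, python_candidate], (4, 3, 2)))  # 14
```

If one candidate is wrong, the random-input check fails and no answer is returned, which is the behaviour you want when trust matters more than always producing a number.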

4. Meta-Semi: A Meta-Learning Approach for Semi-Supervised Learning

Meta-Semi is a meta-learning-based semi-supervised learning (SSL) algorithm developed by researchers from Tsinghua University. It addresses the difficulty of finding optimal hyper-parameters in real-world scenarios with scarce annotated data. Meta-Semi generates “pseudo-labeled” data from unlabeled data, removes unreliable samples, and trains the model on the remaining data. It uses dynamic weighting to reweight previously pseudo-labeled samples and outperforms state-of-the-art deep networks and SSL algorithms on image classification benchmarks. It does require slightly more training time, which is seen as an area for improvement in future research.
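The "pseudo-label, then drop unreliable samples" step can be illustrated with a very simple stand-in: plain confidence-threshold pseudo-labeling. This is a generic SSL technique, not Meta-Semi's meta-learned weighting, and the function name and the 0.9 threshold below are assumptions chosen for illustration only.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Keep unlabeled samples whose top predicted class is confident enough.

    probabilities: one list of class probabilities per unlabeled sample.
    Returns (sample_index, predicted_label, confidence) for each kept sample.
    """
    kept = []
    for index, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[label]
        if confidence >= threshold:  # a fixed cutoff, assumed for this sketch
            kept.append((index, label, confidence))
    return kept


# Hypothetical model outputs for three unlabeled images and two classes.
model_outputs = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]
print(pseudo_label(model_outputs))  # [(0, 0, 0.95), (2, 1, 0.98)]
```

The second sample is discarded because the model is unsure about it. As described above, Meta-Semi replaces this kind of hard cutoff with dynamic weighting of the pseudo-labeled samples.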

5. The Wisdom of Hindsight Makes Language Models Better Instruction Followers

Hindsight Instruction Relabeling (HIR)

Language models raise ethical concerns because they can generate fake information or toxic text, or fail to follow instructions. Reinforcement learning (RL) algorithms can address these issues but require extensive training and can overlook failure cases. Hindsight Instruction Relabeling (HIR) is a new algorithm that improves language models by using both successful and failed instruction-output pairs to align them better with human instructions. HIR outperforms baseline models on various reasoning tasks and does not need additional RL training.
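Here is a rough sketch of what "relabeling in hindsight" can look like. The data format, the instruction wording, and the function name are all assumptions made for illustration, not the paper's exact setup. The idea shown: instead of throwing away a failed output, rewrite the instruction so that it describes what the output actually did, which turns every pair into a valid training example.

```python
def relabel_in_hindsight(samples):
    """Rewrite each instruction so it matches what the output actually did.

    samples: dicts with 'question', 'output' and 'expected' keys
    (a hypothetical format chosen for this sketch).
    """
    relabeled = []
    for sample in samples:
        succeeded = sample["output"] == sample["expected"]
        instruction = (
            "Give a correct answer to this question."
            if succeeded
            else "Give a wrong answer to this question."
        )
        relabeled.append(
            {
                "instruction": instruction,
                "question": sample["question"],
                "output": sample["output"],
            }
        )
    return relabeled


model_attempts = [
    {"question": "2 + 2", "output": "4", "expected": "4"},
    {"question": "3 * 3", "output": "6", "expected": "9"},
]
for pair in relabel_in_hindsight(model_attempts):
    print(pair)
```

After relabeling, both pairs are consistent with their instructions, so they can be used for ordinary supervised fine-tuning without an RL loop.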

6. PETALS: Collaborative Inference and Fine-tuning of Large Models


Pretrained language models can perform real-world tasks with light fine-tuning or prompting. Larger models perform better, with modern models having hundreds of billions of parameters, but their use is limited by memory and computational costs. Offloading model parameters to slower but more affordable memory can make LLMs more accessible, but at the cost of high latency. The PETALS framework lets multiple participants collaborate on inference and fine-tuning of large language models, with enhancements such as dynamic quantization and load balancing improving performance. Security, privacy, rewards, and ongoing improvement are all considered in the framework, and the code is freely available along with a deployed chat application.
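To give a feel for why quantization helps with memory, here is a minimal sketch of 8-bit "absmax" quantization in plain Python. This is a generic scheme that I am using as an illustration; the summary above does not specify which quantization method PETALS uses, and the function names are my own. Each float is replaced by a small integer in the range -127 to 127 plus one shared scale factor.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 0.003]  # hypothetical weights, non-empty list assumed
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is at most half the scale, so the largest weight in a block determines how much precision the small ones keep.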

7. Alpaca: A Strong, Replicable Instruction-Following Model

Alpaca: a 7B-parameter model that performs comparably to the 175B GPT-3.5

Training high-quality instruction-following models is expensive because it requires a powerful pretrained language model and high-quality data. Stanford’s Center for Research on Foundation Models (CRFM) has released Alpaca, an instruction-following model based on Meta AI’s LLaMA 7B that is compact and cheap to reproduce. They generated a dataset of 52K unique instructions and their outputs for under $500 using OpenAI’s API, and fine-tuned the model with Hugging Face’s training framework. Alpaca performs similarly to text-davinci-003 but has some shortcomings, such as a tendency towards hallucination, toxicity, and stereotyping.

8. Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

VALL-E X: a cross-lingual codec language model

Traditional text-to-speech (TTS) models produced robotic-sounding outputs, but deep neural networks have enabled more human-like speech. Microsoft has developed VALL-E X, a cross-lingual neural codec language model that reduces the foreign accent problem in cross-lingual speech synthesis. VALL-E X uses a multilingual in-context learning framework: it trains a multilingual conditional codec language model to predict the acoustic token sequences of target-language speech, using source-language speech and target-language text as prompts. The model has been evaluated on English and Chinese using the LibriSpeech and EMIME datasets, demonstrating high-quality zero-shot cross-lingual speech synthesis. VALL-E X beats strong baselines in speaker similarity, speech quality, translation quality, and speech naturalness, as well as in human evaluation.

9. Phone2Proc: Bringing Robust Robots Into Our Chaotic World

Phone2Proc: Training Embodied AI Agents in Real World

Embodied AI enables physical agents to interact with the real world in a way that mimics human behavior. However, deploying agents trained in simulation to the real world has been challenging. Phone2Proc is a lightweight approach that uses a cellphone to scan an environment and generate targeted training scene variations for agents. It makes use of Apple’s RoomPlan API and offers real-time feedback to aid accurate scanning. Phone2Proc achieves a success rate of 70.7%, compared with 34.7% for the ProcTHOR baseline. It is also resilient to different types of scene disturbance and environmental dynamism, including crowded spaces, changes in lighting, and movement of target objects. These results suggest that Phone2Proc is a promising method for training embodied AI agents that perform well in the real world.

10. Meet in the Middle: A New Pre-training Paradigm

MIM: two language models approaching data from either side

Microsoft researchers have proposed a new pretraining and inference paradigm called “Meet in the Middle” (MIM) to improve the effectiveness and consistency of language models. MIM uses two LMs that read tokens in opposing directions and co-regularize each other to leverage both the prefix and suffix in pretraining. It provides a quick and effective inference process and outperforms several baselines in common evaluation criteria. MIM improves the consistency of the two LMs and aids in early termination of the generation process during infilling tasks.
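The "meet in the middle" idea for infilling can be shown with a toy example. In this sketch the two language models are replaced by hypothetical lookup tables (`next_token` reads left to right, `prev_token` reads right to left), and the stopping rule is a simplified single-token match that I assume for illustration; it is not the paper's actual procedure. The two sides take turns generating, and generation stops early as soon as one side produces the token the other side already has.

```python
def infill(prefix, suffix, next_token, prev_token, max_steps=50):
    """Fill the gap between prefix and suffix from both ends at once."""
    left_gen, right_gen = [], []
    for _ in range(max_steps):
        left_side = prefix + left_gen
        right_side = right_gen + suffix

        # Forward model proposes the token after the current left side.
        forward = next_token[left_side[-1]]
        if forward == right_side[0]:  # the two sides have met
            return left_gen + right_gen
        left_gen.append(forward)

        # Backward model proposes the token before the current right side.
        backward = prev_token[right_side[0]]
        if backward == left_gen[-1]:  # the two sides have met
            return left_gen + right_gen
        right_gen.insert(0, backward)
    return None  # the sides never met within max_steps


# Toy "models" over the sequence a b c d e (stand-ins for real LMs).
next_token = {"a": "b", "b": "c", "c": "d", "d": "e"}
prev_token = {"e": "d", "d": "c", "c": "b", "b": "a"}

print(infill(["a"], ["e"], next_token, prev_token))  # ['b', 'c', 'd']
```

Each model only had to generate about half of the missing span before the match was detected, which is the intuition behind the early termination mentioned above.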

So, that’s a wrap for this week. For a video explanation, please visit the following link and subscribe to my YouTube channel



Aankur Bhatia

Aankur works as the Chief Data Scientist for a large multinational company. He is passionate about applying ML to cybersecurity and holds over 20 patents.