AI Digest Weekly — Top AI Research Papers Summary

Week of Mar 5th to 12th

Aankur Bhatia
7 min read · Mar 12



Guys, here are the top 10 research papers in AI (LLMs, Computer Vision, NLP) for this week.

For last week’s top paper summary, please read here

1. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

Researchers at Microsoft Research and Columbia University address the problem of memory distortion and hallucination in large language models (LLMs) that encode vast amounts of knowledge. They propose a system called LLM-Augmenter that uses plug-and-play modules to ground LLM responses in external knowledge stored in task-specific databases. The system iteratively revises responses using feedback generated by utility functions, which reduces hallucinations without sacrificing fluency. The effectiveness of LLM-Augmenter was empirically validated on task-oriented dialog and open-domain QA, using Knowledge F1 and BLEU-4 as performance metrics. Overall, LLM-Augmenter was found to be an effective way to augment black-box LLMs with external knowledge.
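To make the Knowledge F1 metric mentioned above more concrete, here is a minimal sketch. It assumes Knowledge F1 is the token-overlap F1 between a model response and the evidence text, with lowercasing and whitespace tokenization. The function name `knowledge_f1` is hypothetical, and the paper's exact normalization may differ.

```python
from collections import Counter


def knowledge_f1(response: str, knowledge: str) -> float:
    """Token-overlap F1 between a response and a knowledge snippet.

    Assumption: lowercased, whitespace-split tokens. Real evaluations
    often also strip punctuation and stop words.
    """
    resp_tokens = response.lower().split()
    know_tokens = knowledge.lower().split()
    if not resp_tokens or not know_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(resp_tokens) & Counter(know_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(know_tokens)
    return 2 * precision * recall / (precision + recall)
```

A response that copies none of the evidence scores 0, and one that reproduces it exactly scores 1, so a higher value suggests the response is better grounded in the retrieved knowledge.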

2. 3D-aware Blending with Generative NeRFs


Traditional image blending techniques rely on 2D affine transformations and fail to account for 3D geometric features such as pose or shape. To address this issue, researchers at CMU proposed a 3D-aware image blending method based on generative Neural Radiance Fields (NeRFs). The blending is performed in the NeRF's latent representation space, which reduces the complexity of the data. 3D alignment is achieved through a CNN encoder, which infers the camera pose of each input image and the latent code of the image itself. The method outperforms the classic methods in terms of photorealism and can disentangle color and geometric changes during blending, producing view-consistent results.

3. MimicPlay: Long-Horizon Imitation Learning by Watching Human Play


Teaching robots to perform manipulation tasks efficiently has been a persistent challenge. Imitation learning teaches robots by having them imitate human demonstrations, but it can be time-consuming. Researchers at Stanford, NVIDIA, Georgia Tech, Caltech, and UT Austin have developed a new approach called MimicPlay, which combines hierarchical imitation learning with learning from play data: it uses human play data together with demonstration data to teach robots long-horizon manipulation tasks. MimicPlay has been found to improve the robot’s ability to perform complex tasks, with greater efficiency and better generalization.

4. Language Is Not All You Need: Aligning Perception with Language Models


Microsoft researchers have developed a Multimodal Large Language Model (MLLM) called KOSMOS-1 that integrates vision with Large Language Models (LLMs). KOSMOS-1 natively supports language, perception-language, and vision tasks, and can handle both perception-intensive tasks and natural language tasks. The model can perform zero- and few-shot multimodal learning, was evaluated on nonverbal reasoning with the Raven IQ test, and supports multi-turn interactions across modalities. According to the authors, MLLMs outperform LLMs in common sense reasoning, and the perception-language alignment opens up new applications and opportunities. The researchers trained KOSMOS-1 on web-scale multimodal datasets, including text data, image-text pairs, and arbitrarily interleaved images and text.

5. Approximate, adapt, anonymize (3A): A framework for privacy preserving training data release for machine learning

The legal, ethical, and trust problems associated with training and applying machine learning models in industries that deal with sensitive information, such as healthcare, slow down the development of this technology. To address this, Amazon researchers have developed a system for creating synthetic data that protects privacy while remaining useful for ML. The main contributions are a flexible privacy-preserving data generation framework and the introduction of cluster-based, rather than random, mixing for preserving differential privacy, which yields significant accuracy gains over previous methods. To maintain the data distribution, the method samples from the vicinity of cluster centroids instead of sampling at random.
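The difference between random mixing and cluster-based mixing can be illustrated with a small sketch. This is not the paper's algorithm: it assumes cluster labels are already available, mixes random pairs of points drawn from the same cluster, and optionally adds Gaussian noise. The function name `cluster_mix` and the `noise_std` parameter are hypothetical, and the noise here does not by itself provide a differential privacy guarantee.

```python
import random
from collections import defaultdict


def cluster_mix(points, labels, n_samples, noise_std=0.0, seed=0):
    """Create synthetic points by mixing pairs from the same cluster.

    Assumptions (illustrative only): `labels` holds a precomputed
    cluster id for each point, and `noise_std` is an uncalibrated
    Gaussian noise scale, not a differential privacy mechanism.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    cluster_ids = sorted(clusters)

    synthetic = []
    for _ in range(n_samples):
        members = clusters[rng.choice(cluster_ids)]
        a, b = rng.choice(members), rng.choice(members)
        w = rng.random()  # mixing weight in [0, 1)
        # Convex combination keeps the sample near its own cluster.
        mixed = [w * x + (1 - w) * y + rng.gauss(0.0, noise_std)
                 for x, y in zip(a, b)]
        synthetic.append(mixed)
    return synthetic
```

Because both parents come from the same cluster, a mixed sample stays close to that cluster, whereas mixing two arbitrary points could land in a region where no real data lives.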

6. Internet Explorer: Targeted Representation Learning on the Open Web


Pre-training vision models on large datasets such as ImageNet, or on the image-text data behind models such as CLIP, and then fine-tuning them can be successful, but it requires a lot of effort to collect and label data. To address this, researchers from CMU and UC Berkeley have proposed a new approach called Internet Explorer, which treats the Internet as a dataset and uses reinforcement learning-inspired agents to actively search for relevant visual data that improves feature quality on the target dataset. The model leverages self-supervised learning to learn useful representations from unlabeled images downloaded from the Internet, and uses WordNet concepts to learn which queries are relevant. Internet Explorer matches CLIP with a ResNet-50 backbone in performance while reducing compute and the number of training images by orders of magnitude, making it a smart and cost-effective way to collect data and learn to solve image classification tasks.
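The idea of an agent that learns which queries are worth issuing can be sketched as a simple bandit. This is a deliberate simplification and not the method from the paper: it keeps a running mean reward per concept and samples the next query with a softmax over those estimates. The class name `ConceptSampler`, the `temperature` parameter, and the reward values are all hypothetical.

```python
import math
import random


class ConceptSampler:
    """Softmax bandit over query concepts (illustrative sketch only)."""

    def __init__(self, concepts, temperature=0.5, seed=0):
        self.estimates = {c: 0.0 for c in concepts}  # running mean reward
        self.counts = {c: 0 for c in concepts}
        self.temperature = temperature
        self.rng = random.Random(seed)

    def probabilities(self):
        # Subtract the max estimate for numerical stability.
        top = max(self.estimates.values())
        weights = {c: math.exp((v - top) / self.temperature)
                   for c, v in self.estimates.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self):
        probs = self.probabilities()
        concepts = list(probs)
        return self.rng.choices(
            concepts, weights=[probs[c] for c in concepts], k=1)[0]

    def update(self, concept, reward):
        # Incremental mean of the rewards observed for this concept.
        self.counts[concept] += 1
        n = self.counts[concept]
        self.estimates[concept] += (reward - self.estimates[concept]) / n
```

In a loop, one would call `sample()` to pick a concept, download images for it, score how useful they were for the target dataset, and pass that score to `update()`, so that useful concepts are queried more often over time.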

7. Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages


Universal speech models (USMs) are machine learning models trained to recognize spoken language across different languages and accents. They have potential applications in virtual assistants, speech-to-text transcription, and language translation. Google’s 1,000 Languages Initiative aims to develop an ML model that supports the world’s 1,000 most-spoken languages. USM uses the Conformer, a convolution-augmented transformer, as the encoder. The training process involves unsupervised learning on speech audio covering hundreds of different languages, plus an optional pre-training stage that uses text data. The model achieves an average word error rate of less than 30% across 73 languages using less than 3,000 hours of data per language, which is very impressive.
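For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Below is a minimal sketch that assumes whitespace tokenization and no text normalization; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

On this scale, the sub-30% figure quoted above corresponds to a value below 0.30, meaning fewer than three word-level errors per ten reference words on average.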

8. PaLM-E: An Embodied Multimodal Language Model


Google and TU Berlin have developed PaLM-E, a single multimodal model that incorporates continuous inputs from an embodied agent’s sensor modalities, such as images and state estimates, into the same latent embedding space as language tokens. PaLM-E achieves state-of-the-art performance on the OK-VQA benchmark without task-specific fine-tuning and is evaluated on three robotic manipulation domains, common visual-language tasks, and language tasks. This technique is a significant step towards connecting the reasoning abilities of large language models to real-world visual and physical sensor modalities, which is crucial for solving real-world problems in computer vision and robotics.

9. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models


Microsoft’s Visual ChatGPT is a new approach that aims to improve human-machine interaction by connecting ChatGPT with Visual Foundation Models (VFMs). LLMs like T5, BLOOM, and GPT-3 have made significant advances but are not adept at handling visual information. Visual ChatGPT uses a prompt manager to enable interaction with VFMs: it specifies input and output formats, handles histories, priorities, and conflicts, and transforms visual information into a language format. The system has shown success in executing complex visual tasks such as generating a red flower from a yellow flower image.

10. Prismer: A Vision-Language Model with An Ensemble of Experts


Prismer is a vision-language model developed through a collaboration between Imperial College London, NVIDIA, ASU, and Caltech. It uses an ensemble of pre-trained domain experts to handle visual QA and image captioning tasks. Only a few components need to be trained, because most of the network weights are inherited from publicly available pre-trained domain expert models. The model uses multi-modal auxiliary knowledge to capture semantics and information about input images. It shows strong multi-modal reasoning performance on tasks like image captioning, image classification, and visual QA while using only 13M publicly available image/alt-text examples. Prismer is extremely data-efficient in training, reducing the GPU hours needed to reach performance equivalent to other SOTA vision-language models.




Aankur Bhatia

Aankur works as the Chief Data Scientist for a large multinational company. He is passionate about applying ML to cybersecurity and holds over 20 patents.