Silicon vs. Carbon

187 bookmarks

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. The project page can be found at https://qiuyu96.github.io/CoDeF/.
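The lifting idea in this abstract (edit the canonical image once, then propagate the result to every frame through the deformation field) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the nested-list image format, the precomputed per-pixel coordinate field, and the nearest-neighbour sampling are all assumptions made for illustration.

```python
def render_frame(canonical, deformation):
    """Render one frame by sampling the canonical image at deformed coordinates.

    canonical:   2D list (H x W) of pixel values.
    deformation: 2D list of (row, col) coordinates into `canonical`, one per
                 output pixel. This precomputed field is a hypothetical
                 stand-in for a learned deformation field.
    """
    height, width = len(canonical), len(canonical[0])
    frame = []
    for row_coords in deformation:
        out_row = []
        for (r, c) in row_coords:
            # Nearest-neighbour lookup, clamped to the image bounds.
            rr = min(max(int(round(r)), 0), height - 1)
            cc = min(max(int(round(c)), 0), width - 1)
            out_row.append(canonical[rr][cc])
        frame.append(out_row)
    return frame


def lift_image_algorithm(canonical, deformations, image_algorithm):
    """Apply an image algorithm once, then propagate the result to all frames."""
    edited = image_algorithm(canonical)
    return [render_frame(edited, d) for d in deformations]


# Toy data: a 2x2 canonical image and two frames.
canonical = [[1, 2], [3, 4]]
identity = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
shifted = [[(0, 1), (0, 1)], [(1, 1), (1, 1)]]  # every pixel reads column 1


def invert(img):
    """A stand-in 'image algorithm': invert pixel values around 10."""
    return [[10 - v for v in row] for row in img]


frames = lift_image_algorithm(canonical, [identity, shifted], invert)
```

Because the image algorithm runs only once, on the canonical image, every frame inherits exactly the same edit, which is the intuition behind the cross-frame consistency the abstract claims.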

arxiv.org #research #arxiv.org #W32 #AUG #2023

·arxiv.org·Aug 17, 2023

Generative Agents: Interactive Simulacra of Human Behavior

Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this...

arxiv.org #arxiv.org #W32 #AUG #2023

·arxiv.org·Aug 9, 2023

Shortest Path to Boundary for Self-Intersecting Meshes

We introduce a method for efficiently computing the exact shortest path to the boundary of a mesh from a given internal point in the presence of self-intersections. We provide a formal definition of shortest boundary paths for self-intersecting objects and present a robust algorithm for computing the actual shortest boundary path. The resulting method offers an effective solution for collision and self-collision handling while simulating deformable volumetric objects, using fast simulation techniques that provide no guarantees on collision resolution. Our evaluation includes complex self-collision scenarios with a large number of active contacts, showing that our method can successfully handle them by introducing a relatively minor computational overhead.

arxiv.org #arxiv.org #W32 #AUG #2023

·arxiv.org·Aug 6, 2023

CoDi: Generate Anything from Anything All At Once through Composable Diffusion

Deformable Neural Radiance Fields creates free-viewpoint portraits (nerfies) from casually captured videos.

arxiv.org #arxiv.org #W32 #AUG #2023

·codi-gen.github.io·Aug 6, 2023

Aman's AI Journal • Papers List

Aman's AI Journal | Course notes and learning material for Artificial Intelligence and Deep Learning Stanford classes.

arxiv.org #research #questions #arxiv.org #W32 #AUG #2023

·aman.ai·Aug 6, 2023

Category Taxonomy

arxiv.org #arxiv.org #research #AUG #2023 #W31

·arxiv.org·Aug 6, 2023

2022 arXiv annual report

arxiv.org #arxiv.org #research #AUG #2023 #W31

·info.arxiv.org·Aug 6, 2023

arXiv submission rate statistics - arXiv info

arxiv.org #arxiv.org #research #AUG #2023 #W31

·info.arxiv.org·Aug 6, 2023

Two million articles and counting! – arXiv blog

arxiv.org #arxiv.org #research #AUG #2023 #W31

·blog.arxiv.org·Aug 6, 2023

Heat-assisted detection and ranging

Nature - Heat-assisted detection and ranging is experimentally shown to see texture and depth through darkness as if it were day, and also perceives decluttered physical attributes beyond RGB or...

arxiv.org #arxiv.org #research #AUG #2023 #W31

·nature.com·Aug 6, 2023

New acoustic attack steals data from keystrokes with 95% accuracy

A team of researchers from British universities has trained a deep learning model that can steal data from keyboard keystrokes recorded using a microphone with an accuracy of 95%.

arxiv.org #arxiv.org #research #AUG #2023 #W31

·www-bleepingcomputer-com.cdn.ampproject.org·Aug 6, 2023

Record Once, Post Everywhere: Automatic Shortening of Audio Stories for Social Media | Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology

arxiv.org #arxiv.org #research #2023 #W31 #AUG

·dl.acm.org·Aug 5, 2023

High Fidelity Neural Audio Compression

We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model, including the training objective, architectural changes, and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
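The loss balancer described in this abstract, where a weight sets a loss's share of the overall gradient rather than scaling the raw loss, can be sketched roughly as follows. This is a simplified illustration, not the released code: the function name, the plain-list gradients, the `total_norm` parameter, and the use of an instantaneous gradient norm are all assumptions, and the published method may differ in details such as how norms are averaged over time.

```python
import math


def balance_gradients(gradients, weights, total_norm=1.0):
    """Rescale per-loss gradients so each contributes its weighted share.

    gradients:  dict name -> list of floats (gradient of that loss with
                respect to the model output).
    weights:    dict name -> non-negative weight for that loss.
    total_norm: assumed reference norm for the combined gradient.
    """
    weight_sum = sum(weights.values())
    combined = [0.0] * len(next(iter(gradients.values())))
    for name, grad in gradients.items():
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            continue  # a zero gradient contributes nothing
        # Normalize the gradient, then give it weight / weight_sum of the total.
        scale = total_norm * weights[name] / (weight_sum * norm)
        for i, g in enumerate(grad):
            combined[i] += scale * g
    return combined


# Two losses whose raw gradients differ in scale by several orders of magnitude.
grads = {"reconstruction": [300.0, 0.0], "adversarial": [0.0, 0.004]}
weights = {"reconstruction": 3.0, "adversarial": 1.0}
balanced = balance_gradients(grads, weights)
```

With weights 3 and 1, the two losses end up contributing three quarters and one quarter of the gradient magnitude, regardless of how large their raw gradients were.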

arxiv.org #arxiv.org #research #AUG #2023 #W31

·arxiv.org·Aug 4, 2023

Southrye - (₳) (@Southrye) / X

-'- Cardano Ambassador -- A #Futurist, #Blockchain and #Decentralization fan. Exploring AI 🤖 10 yrs experience in Enterprise IT solution design. 🇨🇦

arxiv.org #arxiv.org #research #twitter #2023 #W31 #AUG

·twitter.com·Aug 4, 2023

Multi-level Temporal-channel Speaker Retrieval for Robust Zero-shot Voice Conversion

Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of the speaker without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representation during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook the variation of speaker information richness in temporal and frequency channel dimensions of speech. This insufficient speaker modeling hampers the ability of the VC model to accurately represent unseen speakers who are not in the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to flexibly adapt to the dynamic-variant speaker characteristic in the temporal and channel axis of the speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), to find out when and where speaker information appears in speech. It retrieves variable-length speaker representation from both temporal and channel dimensions under the guidance of a pre-trained SV model. Besides, inspired by the hierarchical process of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations from multi-granularity levels. Furthermore, to achieve better speech disentanglement and reconstruction, we introduce a cycle-based training strategy to simulate zero-shot inference recurrently. We adopt perceptual constraints on three aspects, including content, style, and speaker, to drive this process. Experiments demonstrate that MTCR-VC is superior to the previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.

arxiv.org #arxiv.org #research #AUG #2023 #W31

·arxiv.org·Aug 4, 2023

Gorilla

arxiv.org #arxiv.org #research #AUG #2023 #W31

·gorilla.cs.berkeley.edu·Aug 3, 2023

Audio Super Resolution

arxiv.org #arxiv.org #research #2023 #JUL #W31

·kuleshov.github.io·Jul 31, 2023

PDP: Parameter-free Differentiable Pruning is All You Need

DNN pruning is a popular way to reduce the size of a model, improve the inference latency, and minimize the power consumption on DNN accelerators. However, existing approaches might be too...

arxiv.org #arxiv.org #2023 #JUL #W29 #W30 #research

·arxiv.org·Jul 23, 2023

Objaverse: A Universe of Annotated 3D Objects

Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.

arxiv.org #2023 #JUL #W30 #arxiv.org #research

·arxiv.org·Jul 29, 2023

A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions

arxiv.org #2023 #JUL #W30 #arxiv.org #research

·arxiv.org·Jul 29, 2023

EmotionBox: A music-element-driven emotional music generation system based on music psychology

With the development of deep neural networks, automatic music composition has made great progress. Although emotional music can evoke listeners' different auditory perceptions, only a few research studies have focused on generating emotional music. This paper presents EmotionBox, a music-element-driven emotional music generator based on music psychology that is capable of composing music given a specific emotion and, unlike previous methods, does not require a music dataset labeled with emotions. In this work, pitch histogram and note density are extracted as features that represent mode and tempo, respectively, to control music emotions. The specific emotions are mapped from these features through Russell's psychology model. The subjective listening tests show that EmotionBox has competitive performance in generating different emotional music and significantly better performance in generating music with low-arousal emotions, especially peaceful emotion, compared with the emotion-label-based method.
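The two features named in this abstract can be sketched in a few lines. The exact definitions used in the paper are not given here, so the following is an assumption: a normalized 12-bin pitch-class histogram over MIDI note numbers as a proxy for mode, and notes per second as a proxy for tempo. Function names and input formats are hypothetical.

```python
def pitch_class_histogram(midi_pitches):
    """Normalized 12-bin histogram of pitch classes (C=0, ..., B=11)."""
    counts = [0] * 12
    for pitch in midi_pitches:
        counts[pitch % 12] += 1
    total = len(midi_pitches)
    if total == 0:
        return [0.0] * 12
    return [count / total for count in counts]


def note_density(onset_times, duration_seconds):
    """Number of note onsets per second over the given duration."""
    return len(onset_times) / duration_seconds


# A C major arpeggio (C4, E4, G4, C5) played over two seconds.
pitches = [60, 64, 67, 72]
onsets = [0.0, 0.5, 1.0, 1.5]

histogram = pitch_class_histogram(pitches)
density = note_density(onsets, duration_seconds=2.0)
```

Here half of the histogram mass falls on pitch class C and a quarter each on E and G, while the density is two notes per second.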

arxiv.org #2023 #JUL #W30 #arxiv.org #research

·frontiersin.org·Jul 26, 2023

Meta-Transformer: A Unified Framework for Multimodal Learning

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks including fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer

arxiv.org #2023 #JUL #W30 #arxiv.org #research

·arxiv.org·Jul 25, 2023

Brain2Music: Reconstructing Music from Human Brain Activity

The process of reconstructing experiences from human brain activity offers a unique lens into how the brain interprets and represents the world. In this paper, we introduce a method for reconstructing music from brain activity, captured using functional magnetic resonance imaging (fMRI). Our approach uses either music retrieval or the MusicLM music generation model conditioned on embeddings derived from fMRI data. The generated music resembles the musical stimuli that human subjects experienced, with respect to semantic properties like genre, instrumentation, and mood. We investigate the relationship between different components of MusicLM and brain activity through a voxel-wise encoding modeling analysis. Furthermore, we discuss which brain regions represent information derived from purely textual descriptions of music stimuli. We provide supplementary material including examples of the reconstructed music at https://google-research.github.io/seanet/brain2music

arxiv.org #2023 #JUL #W30 #arxiv.org #research

·arxiv.org·Jul 25, 2023

Learning from Pixels with Expert Observations

In reinforcement learning (RL), sparse rewards can present a significant challenge. Fortunately, expert actions can be utilized to overcome this issue. However, acquiring explicit expert actions can be costly, and expert observations are often more readily available. This paper presents a new approach that uses expert observations for learning in robot manipulation tasks with sparse rewards from pixel observations. Specifically, our technique involves using expert observations as intermediate visual goals for a goal-conditioned RL agent, enabling it to complete a task by successively reaching a series of goals. We demonstrate the efficacy of our method in five challenging block construction tasks in simulation and show that when combined with two state-of-the-art agents, our approach can significantly improve their performance while requiring 4-20 times fewer expert actions during training. Moreover, our method is also superior to a hierarchical baseline.

arxiv.org #arxiv.org #2023 #JUL #W30 #research

·arxiv.org·Jul 25, 2023

EmotionPrompt: Leveraging Psychology for Large Language Models Enhancement via Emotional Stimulus

Large language models (LLMs) have achieved significant performance in many fields such as reasoning, language understanding, and math problem-solving, and are regarded as a crucial step to artificial general intelligence (AGI). However, the sensitivity of LLMs to prompts remains a major bottleneck for their daily adoption. In this paper, we take inspiration from psychology and propose EmotionPrompt to explore emotional intelligence to enhance the performance of LLMs. EmotionPrompt operates on a remarkably straightforward principle: the incorporation of emotional stimulus into prompts. Experimental results demonstrate that EmotionPrompt, using the same single prompt templates, significantly outperforms the original zero-shot prompt and Zero-shot-CoT on 8 tasks with diverse models: ChatGPT, Vicuna-13b, Bloom, and T5. Further, EmotionPrompt was observed to improve both truthfulness and informativeness. We believe that EmotionPrompt heralds a novel avenue for exploring interdisciplinary knowledge for human-LLM interaction.
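The principle stated in this abstract, incorporating an emotional stimulus into a prompt, amounts to simple string composition. The sketch below is illustrative only: the function name is hypothetical, and the stimulus sentence and the sample task prompt are placeholder wording rather than the templates evaluated in the paper.

```python
def add_emotional_stimulus(prompt, stimulus="This is very important to my career."):
    """Append an emotional stimulus sentence to an existing prompt.

    The default stimulus is an illustrative placeholder, not a confirmed
    template from the paper.
    """
    return f"{prompt.rstrip()} {stimulus}"


base_prompt = "Determine whether the review is positive or negative."
emotional_prompt = add_emotional_stimulus(base_prompt)
```

The original task prompt is left unchanged, so the same template can be evaluated with and without the stimulus.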

arxiv.org #arxiv.org #2023 #JUL #W30 #research

·arxiv.org·Jul 25, 2023

HDHumans: A Hybrid Approach for High-fidelity Digital Humans

arxiv.org #2023 #JUL #W30 #arxiv.org #research

·people.mpi-inf.mpg.de·Jul 23, 2023

Retentive Network: A Successor to Transformer for Large Language Models

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance....

arxiv.org #arxiv.org #2023 #JUL #W29 #research

·arxiv.org·Jul 21, 2023

Neural Haircut: Prior-Guided Strand-Based Hair Reconstruction

Generating realistic human 3D reconstructions using image or video data is essential for various communication and entertainment applications. While existing methods achieved impressive results...

arxiv.org #arxiv.org #2023 #JUL #W29 #research

·arxiv.org·Jul 19, 2023

Robust flight navigation out of distribution with liquid neural networks

science.org

arxiv.org #arxiv.org #2023 #JUL #W28 #research

·science.org·Jul 17, 2023

Neural Relighting with Subsurface Scattering by Learning the...

Reconstructing and relighting objects and scenes under varying lighting conditions is challenging: existing neural rendering methods often cannot handle the complex interactions between materials...

arxiv.org #2023 #JUL #W27 #arxiv.org #research

·arxiv.org·Jul 3, 2023
