International Journal of Science and Research Archive
International, Peer reviewed, Open access Journal ISSN Approved Journal No. 2582-8185

ISSN Approved Journal || eISSN: 2582-8185 || CODEN: IJSRO2 || Impact Factor 8.2 || Google Scholar and CrossRef Indexed

Multi-Modal generative AI systems: Bridging text, vision and speech with advanced LLM Architectures

Dinesh John *

Independent Researcher, USA.

Research Article

 

International Journal of Science and Research Archive, 2023, 09(02), 1044-1058.
Article DOI: 10.30574/ijsra.2023.9.2.0619
DOI url: https://doi.org/10.30574/ijsra.2023.9.2.0619

Received on 28 June 2023; revised on 20 August 2023; accepted on 23 August 2023

Rapid advances in artificial intelligence (AI) have produced multi-modal generative AI systems that handle text, vision, and speech within a single framework. These systems generate high-quality, context-aware outputs for problems that span several data types. Advances in large language models (LLMs), including GPT-4, have been critical to developing techniques for merging these modality streams and have substantially advanced generative AI.
Multi-modal systems could transform many industries. In creative fields, they support the production of art, music, and literary works from varied input data. In healthcare, they help produce diagnostic insights from reports, images, audio, and other data types. Most experiments with self-driving cars indicate that real-time decision-making relies on multi-modal AI that combines visual, voice, and text information. Likewise, human-computer interaction becomes more effective when multi-modal systems provide more intuitive end-user interactions.
This article explores the underlying technologies of multi-modal generative AI, in which transformer-based LLMs serve as the foundation for cross-modal integration. It discusses the challenges of data alignment, scalability, and generalization, as well as the ethical factors that should be considered. Addressing these issues would make multi-modal AI systems more reliable and more general-purpose. The article also surveys future directions, including cross-modal transfer learning, interactive AI, and fairness in AI, illustrating how these systems can transform various applications and reshape how people interact with machines.

Multi-modal generative AI; Text; Vision; Speech; Large language models; Creativity; Automation; Human-machine interaction; Problem-solving; Innovation; Inclusivity; Accessibility; Data alignment; Computational efficiency; Robustness; Ethical considerations; Training datasets; Fairness; Biases; Performance; Scalability; Optimization; Hardware acceleration; Real-world scenarios; Testing; Transparency; Accountability; Regulatory frameworks; Healthcare; Education; Storytelling; Immersive narratives; Scientific research; Climate science; Cross-disciplinary; Cultural divides; Linguistic barriers; Societal impact; Collaboration; Sustainability; Global challenges; Paradigm shift; Transformative power; Human ingenuity

https://ijsra.net/sites/default/files/fulltext_pdf/IJSRA-2023-0619.pdf

Dinesh John. Multi-Modal generative AI systems: Bridging text, vision and speech with advanced LLM Architectures. International Journal of Science and Research Archive, 2023, 09(02), 1044-1058. Article DOI: https://doi.org/10.30574/ijsra.2023.9.2.0619.

Copyright © Author(s). All rights reserved. This article is published under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and source, a link to the license is provided, and any changes made are indicated.


All statements, opinions, and data contained in this publication are solely those of the individual author(s) and contributor(s). The journal, editors, reviewers, and publisher disclaim any responsibility or liability for the content, including accuracy, completeness, or any consequences arising from its use.
