M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark (2024)

Wei Song1,2,3  Yadong Li2  Jianhua Xu2  Guowei Wu4  Lingfeng Ming2  Kexin Yi2
Weihua Luo2  Houyi Li2  Yi Du4  Fangda Guo5  Kaicheng Yu1
1AutoLab, Westlake University  2AI Business, Alibaba Group  3Zhejiang University
4Key Laboratory of Behavioral Science, Institute of Psychology, CAS
5Key Laboratory of AI Safety, Institute of Computing Technology, CAS
{songwei, kyu}@westlake.edu.cn, adonlee.lyd@alibaba-inc.com
Co-first authors; work done during Wei Song’s internship at Alibaba. Co-second authors. Corresponding author.

Abstract

As recent multi-modality large language models (MLLMs) have shown formidable proficiency on various complex tasks, there has been increasing attention on whether these models could eventually mirror human intelligence. However, existing benchmarks mainly focus on evaluating task performance alone, such as the accuracy of identifying the attribute of an object. Combining well-developed cognitive science to understand the intelligence of MLLMs beyond superficial achievements remains largely unexplored. To this end, we introduce the first cognition-driven multilingual and multimodal benchmark to evaluate the general intelligence ability of MLLMs, dubbed M3GIA. Specifically, we identify five key cognitive factors based on the well-recognized Cattell-Horn-Carroll (CHC) model of intelligence and propose a novel evaluation metric. In addition, since most MLLMs are trained to perform in different languages, a natural question arises: is language a key factor influencing the cognitive ability of MLLMs? As such, we go beyond English to encompass other languages based on their popularity, including Chinese, French, Spanish, Portuguese and Korean, to construct our M3GIA. We make sure all the data relevant to cultural backgrounds are collected from their native context to avoid English-centric bias. We collected a significant corpus of data from human participants, revealing that the most advanced MLLMs reach the lower boundary of human intelligence in English, while a pronounced disparity remains in the other five languages assessed. We also reveal an interesting winner-takes-all phenomenon that is aligned with discoveries in cognitive studies. Our benchmark will be open-sourced, with the aspiration of facilitating the enhancement of cognitive capabilities in MLLMs.

1 Introduction

In 1956, researchers across different domains, including mathematics, cognitive psychology and computer science, pointed out an interesting direction, dubbed artificial intelligence (AI). The formal definition is “The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”[33]. Through extensive efforts in pursuing artificial intelligence, the field has converged to a paradigm of data-driven machine learning models, which are still deeply intertwined with cognitive science as they often mirror basic cognitive mechanisms, e.g., convolutional neural networks[23] and the attention mechanism[50]. Recent advances, such as GPT-4o[36], demonstrate that MLLMs can outperform humans on various complex tasks[1, 53] and shed light on emergent abilities that come with increasing data and model scale[55]. In light of these developments, our aim is to evaluate these state-of-the-art models through the lens of cognitive science, as it directly aligns with the primary motivation of AI research.

To explore the mental intelligence emerging from these large models, efforts have been directed toward analyzing them from a psychological perspective. Some pioneering works report that LLMs have demonstrated human-like cognition[6, 22]. For instance, Theory of Mind (ToM) has been applied to assess large models, revealing that GPT-4 exhibits ToM capabilities similar to human inference patterns[7, 22, 20]. Meanwhile, Multimodal Large Language Models (MLLMs), which use powerful LLMs as their brains to process and integrate multimodal information, have exhibited impressive emergent abilities, such as generating website code from images[65], understanding the meaning of a meme[57], and math reasoning[16]. Thanks to their ability to process information from a broader spectrum of sources, they exhibit a more holistic cognitive process, resembling human cognition more closely than models confined to purely linguistic input.

[Figure]

Existing multi-modality benchmarks, such as MMBench[28], MME[19], and MM-Vet[61], have attempted to compartmentalize model capabilities across multiple tasks. For instance, MMBench covers 20 different abilities, encompassing function reasoning, physical property reasoning, object localization and social reasoning. However, they often fail to provide a persuasive explanation for their selection of dimensions, as they tend to be mired in subjectivity and lack a solid theoretical underpinning. Moreover, as depicted in Figure 1 (left), their ability dimensions are still rather task-oriented, neglecting a systematic evaluation of the models’ underlying cognitive abilities that govern task performance through the lens of cognitive science. This oversight raises concerns that benchmarks might devolve into mere training targets rather than instruments for true insight, failing to provide a holistic measure of the models’ capabilities[41]. In short, the ability to solve specific tasks is insufficient to reflect the true level of intelligence, as supported by a psychological study[37], and formally evaluating the cognitive factors of MLLMs remains largely unexplored.

In this paper, we close the gap by introducing the first benchmark that comprehensively evaluates the cognitive abilities of MLLMs under the theoretical umbrella of the well-recognized Cattell-Horn-Carroll (CHC) Model of Intelligence[42], dubbed M3GIA. As shown in Figure 1 (right), based on the CHC Model, we categorize the cognitive capacities of current MLLMs into five dimensions: Fluid reasoning (Gf), Comprehension-Knowledge (Gc), Visual processing (Gv), Reading and Writing (Grw), and Quantitative knowledge (Gq), and collect corresponding questions as a measurement. In addition, as using multi-lingual data to scale up the capability of MLLMs becomes a de-facto standard, we are curious whether languages make any impact on their cognitive abilities. As such, we extend our benchmark to include five more languages, namely Chinese, Spanish, French, Portuguese and Korean, roughly based on their population, to disentangle the language factor from cognitive ability.

To evenly assess the five cognitive dimensions, we refer to human intelligence tests, such as Raven’s Progressive Matrices Test[38] and the Woodcock-Johnson IV Tests of Cognitive Abilities (WJ IV)[44], and establish broad question types that correspond to these cognitive dimensions, which are further subdivided into 18 narrow question types (see later Sec.3). All in all, our M3GIA contains 1,800 questions, where over half are carefully designed from scratch following the standard. The test for each language maintains consistency in terms of the number of questions, structure, and distribution of question types. In addition, to highlight the multilingual nature of our benchmark, we collect data relevant to cultural backgrounds from native language sources rather than simply translating them from English, thereby avoiding English-centric bias.

We evaluate 24 MLLMs, including state-of-the-art closed- and open-source ones. In general, the latest advancements in MLLMs have achieved performance levels that fall within the lower boundary of human intelligence in English. Yet, there remains a pronounced disparity in the other five languages assessed. We also notice that MLLMs’ proficiency in one cognitive domain often translates into superior performance across other domains as well. This phenomenon interestingly aligns with the pattern observed in human intelligence and empirically suggests the existence of General Intelligence Ability (GIA) in MLLMs.

2 Related works

Evaluation Benchmark for MLLMs.

As multimodal large language models (MLLMs) exhibit remarkable generalization capabilities across a broad spectrum of downstream tasks, relying exclusively on their performance within single vision-language tasks, such as visual recognition[21], image description[11, 2, 60], scene text understanding[47, 46], and external knowledge[32], is insufficient to fully uncover their comprehensive performance. Researchers have therefore turned to a new paradigm of constructing all-round benchmarks to assess a broader spectrum of challenging multimodal tasks[58, 56, 25, 28, 19, 61]. Another trend in MLLM assessment is the use of human exam questions[31, 30, 64, 62, 63]. For instance, AGIEval[64] sources questions from standardized exams such as college entrance exams and lawyer qualification tests. While these benchmarks make progress in evaluating the human-centric abilities of MLLMs, they may not be suitable for evaluating the intelligence of MLLMs, because research in psychology points out that superficial performance on tasks alone is not a solid indicator of human intelligence[37].

General Intelligence Ability and the CHC Theory.

Arising from the empirical fact that an individual’s proficiency in one area frequently correlates with high performance in other areas, Charles Spearman first introduced General Intelligence Ability (GIA) in 1904[48]. This construct refers to the idea that a single underlying factor, known as the g-factor, can account for the positive correlations among cognitive abilities and reflect the general intelligence that fundamentally underlies an individual’s intelligence. To concretely understand GIA, numerous attempts have been made to model the structure of human cognition. John Carroll’s Three-Stratum Model[9] elaborated on this with a hierarchical structure of intelligence, including a general “g” factor and specific cognitive abilities. Howard Gardner’s Multiple Intelligence Theory[18] proposed diverse forms of intelligence, while Sternberg’s Triarchic Theory[49] focused on practical, creative, and analytical aspects. These theories collectively contributed to the development of the Cattell-Horn-Carroll (CHC) model of intelligence, which is to date the most comprehensive and empirically validated structural model of human cognition[35], integrating various aspects of cognition into a unified framework. Recent studies[14] have primarily focused on evaluating the performance of large language models on sophisticated psychological tasks, neglecting the assessment of models’ intelligence from the foundational standpoint of cognitive models. Our M3GIA constitutes the first attempt to bring the latest cognitive science modeling of intelligence into MLLM evaluation to address this gap.

3 M3GIA

Concretely, we introduce the first cognition-inspired multilingual and multimodal benchmark to evaluate the general intelligence ability of large models. In short, our M3GIA distinguishes itself from existing benchmarks as follows:

  • Cognition Inspired: In contrast to existing benchmarks that focus on task-level evaluation, we study the intelligence of large models from a cognition perspective. The benchmark dissects the cognitive abilities of contemporary MLLMs into five foundational factors, as per the Cattell-Horn-Carroll theory. This cognitive theory underpins the structure of our evaluation, informing the specific types of questions devised to test each cognitive skill.

  • Multilingual Coverage: To comprehensively measure the cognitive abilities of multimodal large models across multiple languages, M3GIA is constructed to span six languages: English, French, Chinese, Spanish, Portuguese, and Korean. In order to mitigate English-centric bias, all data relevant to cultural backgrounds have been sourced from native language resources, except for questions that transcend cultural considerations—such as the Raven test and number series problems.

The subsequent content of this section is organized as follows: In sec.3.1, we introduce the five-factor cognitive model of M3GIA and discuss the design philosophy behind it. In sec.3.2, we describe how we designed and collected the questions for M3GIA and provide some statistical data on M3GIA.

[Figure]

3.1 The Five-factor Cognitive Model of M3GIA

To formally study the intelligence level of large models, we start from the state-of-the-art cognitive model, Cattell-Horn-Carroll (CHC)[42], which is by far the most empirically validated structural model of human cognition[35]. The CHC theory articulates a hierarchical framework of human cognitive abilities divided into three strata: general intelligence “g” (stratum III), broad cognitive abilities (stratum II), and narrow cognitive abilities (stratum I). The theory has now expanded to include 16 broad abilities and over 80 narrow abilities. These broad but domain-specific abilities are nevertheless positively associated with one another. This positive manifold is accounted for in the CHC model by a general factor of intelligence (“g”) at stratum III. While there is ongoing discourse regarding the exact delineation of the narrow abilities, 9 of the 16 broad cognitive abilities have achieved substantial consensus and are well-supported by empirical evidence and practical application[8]. These include Fluid Reasoning (Gf), Comprehension-Knowledge (Gc), Visual Processing (Gv), Auditory Processing (Ga), Short-term Memory (Gsm), Long-term Retrieval (Glr), Processing Speed (Gs), Quantitative Knowledge (Gq), and Reading and Writing Abilities (Grw).

[Figure]

As shown in Fig.2, the structure of our M3GIA is underpinned by a five-factor hierarchical cognitive model, which is derived from the CHC model of cognitive abilities. Although Large Language Models exhibit cognitive processes similar to humans, they also differ in internal mechanisms, particularly with regard to processing speed (Gs) and memory (Gwm, Glr), which are greatly related to external technologies beyond the model itself, such as external databases and retrieval-augmented generation (RAG)[24]. Additionally, given that the majority of current MLLMs, with the exception of a select few closed-source models, are not yet expanded to embrace the auditory modality, we have not included the Ga (Auditory Processing) factor in this version of M3GIA, reserving it as one of the directions for future expansion. Consequently, based on consultations with psychology experts, we have chosen to assess the cognitive abilities of current MLLMs in this iteration of M3GIA by focusing on five key CHC factors, Gc, Grw, Gq, Gf, and Gv, from among the nine most frequently identified CHC factors.

Interestingly, the five factors we select align closely with those of the renowned Stanford-Binet Test, Fifth Edition (SB5)[40], which was also constructed upon five cognitive factors derived from the CHC theory. Specifically, the five cognitive factors identified in the SB5 are Fluid Reasoning (FR), Knowledge (KN), Quantitative Reasoning (QR), Visual-Spatial Processing (VS), and Working Memory (WM). Except for Working Memory (WM), which we have substituted with Grw, these factors align directly with our selected factors, corresponding to Gf, Gc, Gq, and Gv, respectively. This alignment is noteworthy, as the selection of these factors for the SB5 was based on extensive research on school achievement and expert ratings of the importance of these factors in the assessment of reasoning, especially in giftedness assessment[39].

3.2 Question Design and Collection

Our M3GIA contains a total of 1,800 multiple-choice problems, of which 1,200 are Visual Question Answering (VQA) questions. The ground truths for 630 questions are human-annotated, while the answers for the remaining 1,170 questions were gathered from the Internet.

As shown in Fig.1, we have devised five broad question clusters: reasoning, visual-spatial, common sense, mathematics and comprehension, corresponding respectively to the assessment of the five CHC cognitive factors – Gf, Gv, Gc, Gq, and Grw. See the Appendix for more details. To prevent the assessment of any particular ability from being constrained to a fixed and singular perspective, we have stratified each of the five clusters into 2-4 narrow question types that reflect different perspectives on a broad CHC construct. This subdivision results in a total of 18 distinct question types, each designed to tap into different facets of the ability being measured. Consequently, any generalizations made from a cluster are based on two or more samples of ability, which reduces the possibility of over-generalizing from a narrow sampling of ability.

Moreover, as illustrated in the right part of Fig.2, the five cognitive factors are not isolated but rather overlap with each other. For example, Fluid Reasoning (Gf) not only has a process facet (inductive vs. deductive reasoning) but also a content facet (verbal, spatial, and quantitative), each of which overlaps with other broad abilities[43]. In order to conduct a comprehensive measurement of this overlapping nature, our narrow question types include not only tests that measure each cognitive factor individually but also cover the parts where these factors overlap. The corresponding relationships between the question types, the cognitive factors and their intersections are also shown in Fig.2.

What is more, to ensure that our assessment remains anchored in reality, we incorporate real-world problems into the evaluations of cognitive abilities. Specifically, each broad question type includes not only abstract cognitive test questions but also typical real-world problems that require the use of one or more cognitive abilities. This approach enables us to conduct a more accurate and practical assessment of how well these abilities are applied outside controlled, test-like environments.To ensure a balanced and comprehensive evaluation for each ability, we have tried our best to maintain an even distribution among problems associated with different abilities during data collection.

Examples of the narrow question types can be seen in Fig.3, while more detailed descriptions are included in the Appendix.

3.3 Metrics

We use two types of metrics in our evaluation benchmark. For each narrow question type, we follow existing benchmarks[28, 19] and use accuracy. However, to holistically compare cognitive ability, we design a novel metric, the general intelligence ability (GIA) score, based on findings in the cognitive science field. To compute the GIA scores of the models and validate the consistency of the cognitive structure between MLLMs and human intelligence, we adopt a standard psychometric approach, utilizing a confirmatory factor analysis (CFA) model developed from our collected human evaluation data. See the Appendix for more details about the CFA process.

4 Evaluation Results

In this section, we evaluate a total of 24 MLLMs and 480 human participants using our M3GIA. The MLLMs comprise both closed-source models, such as GPT-4o[36], and open-source models[27, 26, 59, 4, 52, 29, 12], including LLaVA[27] and Mini-Gemini[26]. Our evaluation of the MLLMs is conducted under a zero-shot setting to assess the capability of models to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark. For all models, we conduct prompt engineering on the validation set and use the most effective prompt for the zero-shot setup in the experiments. All experiments are conducted with NVIDIA A800 GPUs[27, 26].

Human Performance Baseline.

To establish a reference for human cognitive levels against MLLMs, we collected 480 valid sets of test data from human subjects using electronic questionnaires. These 480 participants were from native countries of the six selected languages, with 80 individuals per language. The 1,800 questions of M3GIA are then divided into six complete sub-questionnaires by language, with each individual only responsible for completing the sub-questionnaire corresponding to their native language. See supplementary for more details.

4.1 Accuracy Score on Five Cognitive Factors

Table 1: Accuracy (%) on each cognitive factor. Gf is broken down into Induction (I), Deductive Reasoning (RG), Quantitative Reasoning (RQ), and an overall Gf score.

| Type (LLM size) | Model | ViT size | Gf: I | Gf: RG | Gf: RQ | Gf: Overall | Gc | Gq | Grw | Gv | Overall Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | Average Performance | - | 86.8 | 60.0 | 71.2 | 69.7 | 79.1 | 65.4 | 78.1 | 81.1 | 76.9 |
| API | GPT-4o | - | 58.0 | 59.2 | 33.9 | 50.1 | 72.3 | 42.8 | 79.6 | 46.3 | 59.8 |
| API | GPT-4v | - | 56.7 | 56.3 | 40.9 | 51.9 | 74.8 | 46.4 | 77.5 | 52.4 | 59.2 |
| API | Gemini-1.5-Pro | - | 54.3 | 56.4 | 41.8 | 54.3 | 75.8 | 60.8 | 77.1 | 53.8 | 62.4 |
| API | Gemini-Pro | - | 39.0 | 30.8 | 22.7 | 32.4 | 56.5 | 31.7 | 67.1 | 43.1 | 46.5 |
| API | Claude3-Sonnet | - | 39.7 | 32.9 | 27.3 | 34.0 | 58.3 | 34.2 | 61.3 | 43.9 | 47.0 |
| API | Claude3-Haiku | - | 35.3 | 35.8 | 30.3 | 33.1 | 55.8 | 33.3 | 57.9 | 36.4 | 43.1 |
| OSS (Large) | Mini-Gemini-34b | 0.3B | 37.7 | 37.5 | 30.6 | 34.8 | 61.0 | 34.2 | 62.9 | 45.7 | 48.2 |
| OSS (Large) | Mini-Gemini-8*7b | 0.3B | 28.7 | 30.0 | 26.7 | 30.3 | 58.1 | 35.0 | 61.3 | 41.9 | 44.8 |
| OSS (Large) | LLaVA-v1.6-34b | 0.3B | 20.7 | 40.0 | 28.5 | 30.8 | 53.8 | 36.4 | 61.7 | 40.4 | 42.8 |
| OSS (Large) | Yi-VL-34b | 0.6B | 25.0 | 32.9 | 35.8 | 29.5 | 48.1 | 29.2 | 54.6 | 35.7 | 38.2 |
| OSS (Large) | InternVL-chat-v1.2-plus | 6B | 45.0 | 42.5 | 32.4 | 42.5 | 64.6 | 41.4 | 66.7 | 47.5 | 51.9 |
| OSS (Medium) | Mini-Gemini-13b | 0.3B | 22.3 | 29.2 | 23.3 | 24.3 | 41.5 | 26.1 | 44.2 | 28.3 | 32.9 |
| OSS (Medium) | LLaVA-v1.5-13b | 0.3B | 17.7 | 26.3 | 15.2 | 19.9 | 42.1 | 20.3 | 40.0 | 28.8 | 30.4 |
| OSS (Medium) | LLaVA-v1.6-vicuna-13b | 0.3B | 23.3 | 19.6 | 24.5 | 23.1 | 36.7 | 26.9 | 47.5 | 28.5 | 33.2 |
| OSS (Small) | Fuyu-8b | - | 21.7 | 22.1 | 27.3 | 23.3 | 27.3 | 24.4 | 27.1 | 24.9 | 25.1 |
| OSS (Small) | Mini-Gemini-8b | 0.3B | 37.3 | 29.6 | 31.8 | 30.4 | 51.5 | 30.6 | 56.3 | 36.1 | 41.4 |
| OSS (Small) | LLaVA-v1.5-7b | 0.3B | 18.0 | 25.0 | 15.8 | 19.7 | 41.5 | 19.7 | 35.0 | 25.7 | 28.4 |
| OSS (Small) | LLaVA-v1.6-vicuna-7b | 0.3B | 21.3 | 22.9 | 18.2 | 20.5 | 36.5 | 19.4 | 32.9 | 26.9 | 31.5 |
| OSS (Small) | LLaVA-v1.6-mistral-7b | 0.3B | 24.3 | 25.8 | 24.5 | 24.9 | 38.5 | 24.2 | 36.7 | 32.1 | 28.9 |
| OSS (Small) | Deepseek-VL-7b | 0.38B | 32.3 | 29.2 | 22.1 | 28.3 | 50.4 | 24.4 | 54.2 | 32.4 | 37.5 |
| OSS (Small) | Yi-VL-6b | 0.6B | 25.2 | 35.5 | 26.2 | 28.8 | 35.6 | 29.0 | 54.5 | 30.8 | 34.4 |
| OSS (Small) | Qwen-VL | 1.9B | 18.7 | 23.8 | 25.2 | 22.5 | 41.0 | 27.5 | 42.5 | 30.1 | 32.1 |
| OSS (Small) | CogVLM2-LLaMA3-Chinese | 10B | 29.7 | 21.7 | 29.7 | 26.5 | 54.8 | 27.2 | 37.9 | 40.3 | 38.7 |

We report the accuracy of each type of question for the 24 models alongside the average human performance for each cognitive ability in Table 1. We categorize the models into groups by their types, where open-source (OSS) MLLMs are grouped according to the size of their LLMs. It is observed that even the most advanced MLLMs only marginally meet the passing line (60) for overall accuracy, e.g., Gemini-1.5-Pro (62.4) / GPT-4o (59.8) vs. human (76.9). Notably, these models excel in domains related to verbal skills and knowledge, such as Gc and Grw. This success can likely be attributed to the powerful language capabilities inherent in large language models, bolstered by their extensive training datasets.

However, a significant performance gap remains between MLLMs and humans in areas like Visual-Spatial Abilities (Gv) and Fluid Reasoning (Gf). This is particularly evident in the Visual-Spatial domain, where all models lag considerably behind human capabilities, e.g., Gemini-1.5-Pro (53.8) vs. human (81.1). This underscores a substantial opportunity for advancements in the visual aspects of MLLMs. See the supplementary material for case studies. Furthermore, our findings also highlight a pronounced deficiency in the Fluid Reasoning (Gf) capability among all MLLMs, particularly in tasks involving Induction (I) and Quantitative Reasoning (RQ). However, it is surprising to note that in the domain of Deductive Reasoning (RG), the most advanced MLLMs, such as GPT-4o, are approaching the average human level, with a score of 59.2 compared to 60.0 for human participants. This might be attributed to the strategy of using synthetic reasoning data to enhance such ability[13].

Overall, MLLMs perform well in crystallized intelligence (Gc), possibly owing to their extensive training data, while even the most advanced MLLMs still show a large gap to humans in fluid intelligence. This demonstrates that our benchmark M3GIA can measure the difference between the crystallized and fluid intelligence of MLLMs from a cognitive perspective, which is the key difference between M3GIA and other benchmarks.

[Figure]

[Figure]

Winner Takes All.

More importantly, our findings reveal an intriguing Winner Takes All phenomenon that merits further attention beyond the initial observations. Specifically, we note a consistent trend within each group of models in Table 1 where proficiency in one cognitive domain often translates into superior performance across other domains as well. In particular, despite the diversity in score distribution among different abilities, there is a noteworthy pattern: the models achieving the top and second-best scores across various cognitive abilities are predominantly the same two models within each group.

This shows an interesting consistency with the pattern observed in human intelligence, which empirically suggests the existence of General Intelligence Ability (see Sec.2). Therefore, it offers compelling evidence that general intelligence ability, also identified as the general factor of intelligence (“g”) at stratum III of the CHC model, has also emerged in large models. Furthermore, it suggests that as MLLMs evolve towards more comprehensive cognitive processes, they too demonstrate a foundational GIA factor that simultaneously governs a variety of cognitive abilities.

4.2 Multilingual GIA Scores

By collecting a large amount of testing data from human subjects, we adopted a CFA (Confirmatory Factor Analysis) model to calculate GIA scores that reflect a comprehensive intelligence factor. Since the questions for each language are not exactly the same, we need to establish a separate CFA model for each language. To our surprise, the model built with human data showed high explanatory validity for the test results of MLLMs (cor > 0.93). This indicates, to some extent, that the cognitive structure of MLLMs indeed shows similarities to that of humans. We report the GIA scores of each language for some MLLMs of different sizes in Table 2 and Fig.5. It is observed that the current state-of-the-art MLLMs have reached the minimum level within the human subjects’ confidence interval in English. However, these MLLMs still exhibit a significant performance gap compared to humans in other languages.

Table 2: GIA scores per language, and the same scores normalized against the human average (human = 100).

| Model | GIA En | GIA Ch | GIA Fr | GIA Sp | GIA Pt | GIA Ko | Norm. En | Norm. Ch | Norm. Fr | Norm. Sp | Norm. Pt | Norm. Ko |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 16.01 | 16.69 | 19.52 | 16.22 | 16.00 | 18.05 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| GPT-4o | 13.85 | 11.46 | 12.37 | 13.12 | 12.80 | 13.01 | 86.5 | 68.7 | 63.3 | 80.9 | 80.0 | 72.1 |
| GPT-4v | 12.61 | 10.95 | 13.83 | 14.04 | 12.12 | 12.25 | 78.8 | 65.6 | 70.8 | 86.5 | 75.8 | 67.9 |
| LLaVA-1.6-34b | 11.47 | 9.25 | 11.35 | 7.96 | 10.67 | 9.04 | 71.6 | 55.4 | 58.1 | 49.1 | 66.7 | 50.1 |
| LLaVA-1.6-13b | 6.96 | 6.89 | 8.71 | 7.75 | 6.94 | 7.75 | 43.5 | 41.3 | 44.6 | 47.8 | 43.4 | 42.9 |
| LLaVA-1.6-7b | 6.75 | 5.99 | 7.67 | 6.74 | 6.01 | 5.93 | 42.1 | 35.9 | 39.3 | 41.5 | 37.6 | 32.9 |
| Mini-Gemini-34b | 11.00 | 9.96 | 12.75 | 11.52 | 9.45 | 10.69 | 68.7 | 59.7 | 65.3 | 71.0 | 59.1 | 59.2 |
| Mini-Gemini-13b | 8.68 | 7.76 | 8.65 | 7.73 | 7.10 | 7.98 | 54.2 | 46.5 | 44.3 | 47.7 | 44.4 | 44.2 |
| Mini-Gemini-8b | 9.32 | 8.11 | 11.25 | 9.76 | 8.99 | 7.08 | 58.2 | 48.6 | 57.6 | 60.2 | 56.2 | 39.3 |
| Qwen-72b | 11.68 | 10.75 | 10.20 | 10.50 | 9.71 | 9.76 | 72.9 | 64.4 | 52.2 | 64.7 | 60.7 | 54.1 |
| Qwen-32b | 10.58 | 9.79 | 9.62 | 10.11 | 9.25 | 9.18 | 66.1 | 58.7 | 49.3 | 62.3 | 57.8 | 50.9 |
| Qwen-14b | 8.46 | 8.76 | 9.15 | 8.49 | 8.79 | 8.32 | 52.8 | 52.5 | 46.9 | 52.4 | 54.9 | 46.1 |
| Qwen-7b | 8.56 | 8.93 | 8.98 | 8.42 | 8.77 | 8.41 | 53.4 | 53.5 | 46.0 | 51.9 | 54.8 | 46.6 |
| Qwen-1.8b | 7.34 | 6.56 | 8.01 | 7.37 | 6.49 | 6.48 | 45.8 | 39.3 | 41.0 | 45.5 | 40.6 | 35.9 |

[Figure]

[Figure]

To further investigate the influence of LLM size on the GIA score, we conducted an ablation study with the Qwen series. In order to strictly control variables like training data and ViT components, we trained the models ourselves using the same training data for pretraining and fine-tuning, and we also use the same ViT component (CLIP-ViT-L-14) across the series. Overall, the GIA scores of the models increase with the number of LLM parameters. However, we observed a somewhat counterintuitive phenomenon: there is often no improvement in cognitive abilities from 7B to 14B, and there seems to be an emerging point of General Intelligence Ability between 14B and 32B.

5 Conclusion

This paper has presented M3GIA, the first evaluation benchmark that comprehensively evaluates the cognitive abilities of MLLMs under the theoretical umbrella of the well-recognized Cattell-Horn-Carroll (CHC) Model of Intelligence. Based on the CHC theory, we identified five key cognitive factors for current MLLMs: Fluid reasoning (Gf), Comprehension-Knowledge (Gc), Visual processing (Gv), Reading and Writing (Grw), and Quantitative knowledge (Gq), and designed five broad types of questions to measure them. In order to meet the pressing need for multilingual assessment, our evaluation data span six languages, English, Chinese, French, Spanish, Portuguese and Korean, and are collected from native language sources. We conducted a series of experiments to comparatively analyze the cognitive abilities of various MLLMs against human performance, and discussed how factors like the size of the LLM component impact cognitive abilities.

6 Limitations and Discussions

This version of M3GIA does not include all the broad cognitive factors of the CHC model, such as auditory processing (Ga) and olfactory processing (Go). As advancements in MLLMs incorporate a broader range of modalities, more factors from the CHC framework can be integrated, ensuring that M3GIA remains at the forefront of evaluating future generations of multimodal models.

Appendix A Dataset Documentation

A.1 Motivation

M3GIA is a multimodal and multilingual benchmark designed to evaluate the cognitive abilities and general intelligence of MLLMs under the theoretical underpinning of human cognition. Existing benchmarks still mainly focus on task performance alone, instead of leveraging well-developed cognitive science to understand the intelligence of MLLMs beyond superficial achievements. As described in the paper, these approaches have several limitations. We aim to bridge this gap through M3GIA, providing helpful insights into the development of artificial intelligence models with true intelligence. The creation of the dataset is funded by AI Business, Alibaba Group.

A.2 Composition

  • M3GIA contains a total of 1,800 multiple-choice problems, of which 1,200 are Visual Question Answering (VQA) questions, while the remaining 600 are textual questions. We ensure that all VQA tasks necessitate reliance on images for resolution and cannot be resolved with text alone (see Sec.D.2). The ground truths for 630 questions are human-annotated, while the answers for the remaining 1,170 questions were gathered from the Internet. M3GIA includes question sets in six languages, comprising Chinese, English, Spanish, Korean, Portuguese, and French, each with 300 questions.

  • Each question is labeled with one or several CHC factors, with involved factors marked as ‘1’ and non-involved factors marked as ‘0’. Each question is also annotated with the question cluster and the narrow question type to which it belongs, to facilitate the calculation of accuracy rates.

  • M3GIA is self-contained. We bear all responsibility in case of violation of rights.

  • The dataset does not contain any information that might be offensive, insulting, or threatening.

A.3 Usage and Distribution

  • The evaluation dataset is released at https://huggingface.co/datasets/Songweii/M3GIA.

  • M3GIA is released under the Apache 2.0 license.

  • The data is saved in Parquet format, where an example is shown in the README.md file. An example code snippet is also provided showing how to read and process the data.
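Since the evaluation files are stored in Parquet, a minimal loading sketch is given below; the file name is hypothetical and the columns are only inspected rather than assumed, so the README on the Hugging Face page remains the authoritative reference for the actual schema.

```python
# Minimal sketch for loading one language split of M3GIA from a local Parquet file.
# The file path is hypothetical; column names depend on the released schema, so we
# only inspect them here instead of assuming specific fields.
import pandas as pd

df = pd.read_parquet("m3gia_en.parquet")  # hypothetical local file name

print(df.shape)             # one language split should contain 300 questions
print(df.columns.tolist())  # question text, options, answer, cluster, CHC factor flags, ...

# Iterate over a few records, e.g., to build evaluation prompts.
for _, row in df.head(3).iterrows():
    print(row.to_dict())
```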

A.4 Maintenance

  • M3GIA will be managed and maintained by our research group. For any questions, please contact Wei Song (songweii@zju.edu.cn) and Prof. Kaicheng Yu (kyu@westlake.edu.cn), who are responsible for maintenance.

  • If we further expand our dataset or find any errors, we will update the dataset and results in the leaderboard accordingly. It will be updated on our website.

Appendix B Definitions of the CHC factors

According to the Cattell-Horn-Carroll (CHC) Model of Intelligence[42, 43], the definitions of the five cognitive factors are as follows:

Comprehension-Knowledge (Gc), also known as Crystallized Intelligence, is the knowledge of culture that is incorporated by individuals through a process of “acculturation”[34]. Gc is typically described as the breadth and depth of acquired knowledge of the language, information and concepts of a culture, and the application of the knowledge. Gc is primarily a store of verbal or language-based declarative (knowing what) and procedural (knowing how) knowledge acquired during general life experiences.In short, Gc reflects the ability to apply and reason using previously learned experiences and common knowledge.[42]

Fluid Reasoning (Gf) is the broad ability involved in reasoning, forming concepts, and solving problems using unfamiliar information or in novel situations. It includes inductive, deductive, and quantitative reasoning and is typically evident in mental operations, such as inferential reasoning, forming concepts, classification of unfamiliar stimuli and recognizing patterns.[34, 42] Furthermore, there are three factors that are generally considered the hallmark indicators of Gf:

  • Induction (I). The ability to observe a phenomenon and discover the underlying principles or rules that determine its behavior.

  • Deductive Reasoning (RG). This ability, also known as general sequential reasoning, refers to the capacity to reason logically using known premises and principles step by step.

  • Quantitative Reasoning (RQ). The ability to reason, either with induction or deduction, with numbers, mathematical relations, and operators.

Visual-spatial Processing (Gv) is the ability to perceive, analyze, synthesize, and think with visual patterns, or more succinctly, "the ability to make use of simulated mental imagery to solve problems". Once the eyes have transmitted visual information, the visual system of the brain automatically performs a large number of low-level computations (e.g., edge detection, light/dark perception, color differentiation, motion detection, and so forth). The results of these low-level computations are used by various higher-order processors to infer more complex aspects of the visual image[42]. Gv abilities are typically measured by tasks (figural or geometric stimuli) that require the perception and transformation of visual shapes, forms, or images and/or tasks that require maintaining spatial orientation with regard to objects that may change or move through space[34].

Reading and Writing (Grw) is the depth and breadth of knowledge and skills related to written language. It is worth noting that, although reading and writing are clearly distinct activities, the underlying sources of individual differences in reading and writing skills do not differentiate between the two activities cleanly[42]. It appears that the ability that is common across all reading skills also unites all writing skills.

Quantitative Knowledge (Gq) is the depth and breadth of knowledge related to mathematics. Specifically, it is the ability to comprehend quantitative concepts and relationships and to manipulate numerical symbols. It consists of acquired knowledge about mathematics, such as knowledge of mathematical symbols (e.g., ∫, π, ∑, ∞, ≠, ≤, +, −, ×, ÷, and many others), operations (e.g., addition/subtraction, multiplication/division, exponentiation/nth rooting, factorials, negation, and many others), and computational procedures (e.g., long division, reducing fractions, the quadratic formula, and many others). Gq abilities are typically measured by tests that include measures of math calculation, applied problems (or math problem solving), and general math knowledge (e.g., Arithmetic on the Wechsler Scales, Quantitative Reasoning on the SB5).

Appendix C Introduction to the Evaluation Questions

Table: The five question clusters, their 18 narrow question types, and the number of questions per type (questions per language × 6 languages). Each type is additionally annotated with the CHC factors it involves (see Fig.2).

| Cluster | Question Type | Num |
|---|---|---|
| Common Sense | General Information | 20 × 6 |
| Common Sense | Oral Vocabulary | 15 × 6 |
| Common Sense | Logo Problem | 15 × 6 |
| Visual-spatial | Visualization | 30 × 6 |
| Visual-spatial | Picture Recognition | 15 × 6 |
| Visual-spatial | Real-world Spatial | 15 × 6 |
| Comprehension | Readings-text | 15 × 6 |
| Comprehension | Readings-VL | 10 × 6 |
| Comprehension | Comic Problem | 15 × 6 |
| Mathematics | Math Facts | 25 × 6 |
| Mathematics | Algebra | 15 × 6 |
| Mathematics | Geometry | 10 × 6 |
| Mathematics | Applied Problem | 10 × 6 |
| Reasoning | Number Series | 20 × 6 |
| Reasoning | Concept Formation | 20 × 6 |
| Reasoning | Raven’s Matrices | 10 × 6 |
| Reasoning | Syllogism Problem | 20 × 6 |
| Reasoning | Real-world Reasoning | 20 × 6 |

In this section, we will outline the five question clusters and the 18 narrow question types they encompass.

The Common Sense Cluster.

The common sense cluster is designed to measure the Gc factor of an MLLM and includes 3 narrow question types: general information, oral vocabulary and the logo problem. In general information, the model is presented with an image and is asked, “Where would you find [the object] in the picture?” or “What would you do with [the object] in the picture?” The initial items in each subtest draw from familiar everyday objects, and the items become increasingly difficult as the objects become more obscure or less familiar. Oral vocabulary consists of two subtests: Synonyms and Antonyms. In the Synonyms subtest, the model is provided with a word and is asked to choose its synonym. In the Antonyms subtest, the model is provided with a word and is asked to choose its antonym. In CHC theory, this test primarily measures a narrow aspect of Comprehension-Knowledge (Gc) referred to as lexical knowledge (VL; vocabulary knowledge), or knowledge of words and word meanings[45]. The logo problem is the real-world problem of the cluster, where a model is provided with a logo and is required to identify an abstract element within it. To achieve this, it must be deeply familiar with the element, such as confusing artistic characters or symbolic expressions of cultural elements, which requires a high level of Gc and a certain level of Gv.

The Visual-spatial Cluster.

This cluster is designed to evaluate the Gv factor and includes 3 narrow question types: visualization, picture recognition and real-world spatial. Visualization consists of two subtests: Block Rotation and Spatial Relations. In the former, the model is asked to identify the rotated 3D block that matches the original 3D block. In the latter, the model is required to identify the three or four pieces that form a complete target shape. In picture recognition, a model is asked to identify a subset of specified pictures within a field of distracting pictures. The stimuli and distracters for each item include varieties of the same type of object (e.g., several different leaves) to eliminate verbal mediation as a memory strategy[44]. The real-world spatial problem requires the model to accurately determine the relative 3D positioning of objects within an image depicting real-world scenarios. This requires the model to recognize and interpret all existing relationships in the physical world, including comprehensive 3D spatial relationships and the dynamic interconnections between the objects portrayed.

The Comprehension Cluster.

This cluster is designed to evaluate the Grw factor and includes 3 narrow question types: readings-text, readings-VL and the comic problem. In readings-text, the model is provided with long articles (about 4-6 paragraphs) and is required to answer questions related to the main ideas of the articles or the relationships between paragraphs. The articles are collected from reading comprehension exercises at middle and high school levels across the six countries. To highlight the multimodal nature of our benchmark, we designed readings-VL, where responses must be selected from image-based options in addition to the conventional text-based queries. In the comic problem, the model is provided with a comic consisting of four or more panels that make up a complete plot. To answer the questions, the model needs to understand the entire story’s connotation based on the textual dialogues between characters and the plot development. This approach evaluates the model’s ability to integrate visual narrative comprehension with textual comprehension, challenging it to understand scenarios represented both visually and textually.

The Mathematics Cluster.

This cluster is designed to evaluate the Gq factor and includes 4 narrow question types. Math facts is tailored to measure Gq alone and consists of two subtests: symbolic knowledge and geometric knowledge. The former focuses on the model’s acquired knowledge about mathematical symbols and operations, covering knowledge from elementary to university level, including arithmetic, vector operations, calculus, etc. The latter emphasizes the model’s capability to solve problems using geometric knowledge. For algebra and geometry, we source the questions from authentic middle school and high school exam papers across the six countries. Unlike math facts problems, which can be answered directly once the knowledge is acquired, these problems require a further reasoning process; thus, they not only call upon Gq but also require RQ. To evaluate the model’s ability to solve mathematical problems in real-life scenarios, we have specially designed applied problems. For example, the model might be provided with a restaurant bill and asked to calculate the total amount to be paid. Since they rely heavily on common knowledge, Gc is also annotated for this type of problem.

The Reasoning Cluster.

This cluster is designed to assess the Gf factor and includes five narrow question types. Specifically, number series, concept formation, and Raven’s Matrices are targeted at evaluating the I (induction) factor, while the syllogism problem and real-world reasoning target the RG (deductive reasoning) factor. In number series, the model is presented with a series of numbers with one or more numbers missing. The model must determine the numerical pattern and provide the missing number in the series. Concept formation measures the ability to categorize and compare[3], a basis for abstracting concepts[51]. It requires the model to examine a series of shapes or pictures, formulate a rule that applies to the items, and then figure out the item that does not coincide with the rule. The syllogism problem is a classic form of deductive reasoning, where the model is presented with two statements followed by two conclusions. The model has to take the statements to be true even if they appear to contradict commonly known facts. It is then asked to decide which of the given conclusions logically follows from the two given statements, disregarding commonly known facts. Real-world reasoning refers to logical reasoning questions rooted in real-world scenarios, where Gc is also important.

Appendix D Data Curation Process

D.1 Data Collection and Statistics

[Figure]

Data Balancing.

To ensure equal consideration for each CHC factor during the assessment, we have maintained a balanced number of questions for each cluster that measures the various CHC factors, as shown in Fig.6. Specifically, the number of questions in each cluster fluctuates around 50, with a maximum capped at 60 and a minimum threshold of 40.

Questions Crafted from Scratch.

Due to the fact that many human intelligence tests are not open to the public, and considering the novelty of some of our question types (such as the logo problem, comic problem, etc.), we could not source pre-existing QA pairs from available datasets for many questions. Consequently, we have crafted numerous questions from scratch. For these questions, ensuring the correctness of the answers and the clarity of the descriptions is particularly important. See later Sec.D.2 for more detailed information.

English-centric Bias.

Apart from questions that are completely independent of cultural background, such as Number Series and Raven’s Matrices, all data are sourced from native websites corresponding to the language. These data encompass not only text explicitly linked to cultural backgrounds but also images, since images can also convey information about cultural contexts implicitly, such as the attire of people in the image background, architectural styles specific to a region, etc.

Multimodal Nature.

As a multimodal benchmark, safeguarding the dataset’s multimodal attributes is crucial. In particular, questions related to images should require the visual information for resolution and not be solvable through text alone. This principle was rigorously adhered to during the data collection phase, and we also placed emphasis on it during the checking process (see later Sec.D.2). We further validated the importance of image information in our benchmark through an experiment that involved removing images from the evaluation dataset, as shown in Fig.4.

Table: Accuracy per question cluster when evaluating with and without the accompanying images.

| | Common Sense | Visual-Spatial | Comprehension | Mathematics | Reasoning | Overall |
|---|---|---|---|---|---|---|
| With Images | 87.0% | 48.0% | 77.8% | 46.4% | 56.5% | 60.7% |
| Without Images | 44.5% (↓) | 23.9% (↓) | 50.4% (↓) | 31.7% (↓) | 37.8% (↓) | 36.6% (↓) |

D.2 Data Quality Control

[Figure]

To further control the quality of our data, we perform the data cleaning process from three perspectives, as illustrated in Fig.7.

  • Image Quality. We traverse the dataset and locate all blurry images with resolutions lower than 100×100 px (see the sketch after this list). For questions featuring these images, we either replace them with similar questions that use high-resolution images or substitute the images with clear alternatives that convey the same meaning.

  • Accuracy Check. For the questions we designed from scratch, we have paid special attention to ensuring their correctness.

    (i) To guarantee the authenticity of the language expression in our questions, we engaged native speakers to both formulate and review the descriptions of the question stems. Specifically, after establishing the intended meaning and creating a draft version, these native speakers undertake a thorough review, culminating in the finalized version of the question descriptions.

    (ii) We employed volunteer feedback and peer review as methods to assess the clarity of our question descriptions and to detect any potential issues with the answers.

    Clarity of Descriptions: We recruited 10 volunteers for each language who were not involved in question creation to take our tests and provide feedback on any errors or unclear descriptions they encountered in the questions. After thorough discussion of their feedback, we ultimately incorporated revisions into 28 questions.

    Correctness of Answers: After the volunteers submit their answers to the electronic questionnaire, the correct answers are automatically disclosed. They are then prompted to revisit any questions they answered incorrectly and are encouraged to challenge these, offering feedback on any they assert to be correct or view as contentious. This feedback was taken seriously, and we ultimately made corrections to six instances where we recognized that the answers were indeed controversial or misleading. Besides, we also employed peer review within our group to ensure the correctness of answers. Specifically, after formulating their questions, team members swap them with each other for a round of testing. Following this exercise, if a tester has a justifiable reason for an incorrect response, they engage in a direct discussion with the question’s author. This method led to the identification of around ten answers that were deemed contentious.

  • Annotation of the CHC Factors. To ensure the rationality of the questions designed for each CHC factor and the validity of the CHC factors annotated for each question, psychologists were deeply involved and cooperated in the question design and annotation phases.
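As a concrete illustration of the image-quality check in the first bullet above, a minimal sketch of the low-resolution scan might look as follows; the directory path is hypothetical, while the 100×100 px threshold follows the description above.

```python
# Flag blurry/low-resolution images (below 100x100 px) so they can be replaced.
# The image directory is a hypothetical path; the threshold follows the text above.
from pathlib import Path
from PIL import Image

MIN_SIDE = 100  # minimum acceptable width/height in pixels

def find_low_res_images(image_dir: str) -> list[Path]:
    flagged = []
    for path in Path(image_dir).rglob("*"):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        with Image.open(path) as img:
            width, height = img.size
        if width < MIN_SIDE or height < MIN_SIDE:
            flagged.append(path)
    return flagged

print(find_low_res_images("m3gia_images/"))  # hypothetical directory
```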

Appendix E The GIA Metrics

E.1 Human Data Collection

We collected human data in each language from 80 participants using paid electronic questionnaires. Participation in the test is compensated and entirely voluntary. To protect user privacy, the test is also conducted anonymously. Each participant was required to answer all questions to be eligible for payment. To motivate participants to provide thoughtful responses, the compensation is structured incrementally, increasing with the number of questions answered accurately. Additionally, to mitigate the risk of participants choosing answers at random just for the monetary incentive, we randomly inserted several “check questions” within the questionnaire. For instance, a check question might instruct participants to “Please select option B.” If a participant answers more than two such questions incorrectly, their submission is considered invalid.
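The validity rule above (a submission is discarded if more than two check questions are answered incorrectly) can be written as a small filter; the field names below are illustrative, not the actual questionnaire format.

```python
# Discard questionnaire submissions that fail more than two embedded check questions.
# Field names are illustrative; the "more than two incorrect" rule follows the text above.
def is_valid_submission(responses: dict[str, str], check_answers: dict[str, str]) -> bool:
    wrong = sum(
        1 for qid, expected in check_answers.items()
        if responses.get(qid) != expected
    )
    return wrong <= 2

# Example: check question "chk1" instructs "Please select option B."
print(is_valid_submission({"chk1": "B", "chk2": "C"}, {"chk1": "B", "chk2": "C"}))  # True
print(is_valid_submission({"chk1": "A", "chk2": "D", "chk3": "A"},
                          {"chk1": "B", "chk2": "C", "chk3": "D"}))                 # False
```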

E.2 Calculation of the GIA Score

In this study, we employed a confirmatory factor analysis (CFA) approach to model the General Intelligence Ability (GIA) of human subjects based on the CHC theory of cognitive abilities[17]. The CHC theory posits a hierarchical structure of cognitive abilities, encompassing broad factors such as Gc, Gf, Gv, Gq, and Grw, which are further broken down into narrower tasks. Our benchmark, a comprehensive set of 1,800 multiple-choice problems corresponding to the assessment of the five CHC cognitive factors, was meticulously subdivided into 18 distinct question types, each designed to measure different facets of the cognitive abilities being assessed.

Data collection involved 80 human subjects across four different languages: Chinese, English, Portuguese, and Korean. A total of 60 subjects were utilized for model building, while the remaining 20 subjects were reserved for model validation. Subjects were administered the benchmark, and their performance on the tasks was recorded. The data comprised accuracy scores on 18 cognitive tasks, representing the 18 distinct question types. The accuracy data were first normalized to generate z-scores. Then, the EFAtools package was employed to scale the data and calculate the correlations between the variables. A series of statistical tests, including Bartlett’s test and the Kaiser-Meyer-Olkin (KMO) measure, were conducted to assess the suitability of the data for factor analysis. An overall KMO value larger than 0.6 was deemed acceptable for factor analysis[54].
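For illustration, a simplified Python sketch of this preprocessing step (per-task accuracy matrix, z-scoring, and inter-task correlations) is shown below. The actual pipeline performs the scaling, Bartlett's test, and the KMO check with the EFAtools package in R, so this is only an assumed, approximate equivalent operating on synthetic data.

```python
# Simplified sketch of the pre-analysis step: z-score the 18 per-task accuracy
# columns and inspect their correlations. The study itself uses the R package
# EFAtools (including Bartlett's test and the KMO measure); this Python version
# only mirrors the scaling/correlation part and runs on synthetic data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
task_names = [f"task_{i:02d}" for i in range(1, 19)]       # 18 narrow question types
acc = pd.DataFrame(rng.uniform(0.2, 0.9, size=(60, 18)),   # 60 subjects used for model building
                   columns=task_names)

z = (acc - acc.mean()) / acc.std(ddof=1)  # z-score each task column
corr = z.corr()                           # correlation matrix fed to the factor analysis

print(corr.round(2).iloc[:5, :5])
```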

The CFA model was constructed in accordance with the CHC theory, with the broad and narrow factors defined as per the theoretical framework. We used the lavaan package (https://www.lavaan.ugent.be/) to fit the CFA model to the pre-processed data. The CFA model structure included:

  • Gc: Measured through general information, oral vocabulary, and logo problem tasks.

  • Gv: Included visualization, picture recognition, and real-world spatial tasks.

  • Grw: Assessed through readings-text, readings-visual-language (VL), and comic problem.

  • Gq: Comprised math facts, algebra, geometry, and application problems.

  • Gf: Evaluated through number series, concept formation, Raven’s Matrices, syllogism problem, and real-world reasoning tasks.

Additionally, a General Intelligence Ability (GIA) factor was included, integrating all five broad factors. Model estimation was performed using robust Maximum Likelihood (MLR) estimation, which has been demonstrated to be more robust in the presence of multicollinearity.
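To make the structure above concrete, the sketch below writes the hierarchical measurement model in lavaan-style syntax (the study fits such a model with the lavaan R package and MLR estimation); the indicator names are illustrative shorthand for the 18 narrow question types, not the exact variable names used in the analysis.

```python
# lavaan-style specification of the five broad CHC factors plus a higher-order
# GIA factor, mirroring the structure described above. Indicator names are
# illustrative; in the paper the model is fit in R with lavaan.
CFA_MODEL = """
    Gc  =~ general_info + oral_vocab + logo_problem
    Gv  =~ visualization + picture_recognition + real_world_spatial
    Grw =~ readings_text + readings_vl + comic_problem
    Gq  =~ math_facts + algebra + geometry + applied_problem
    Gf  =~ number_series + concept_formation + ravens_matrices + syllogism + real_world_reasoning

    # Higher-order general intelligence factor integrating the five broad factors
    GIA =~ Gc + Gv + Grw + Gq + Gf
"""
print(CFA_MODEL)
```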

The model’s fit was evaluated using a range of indices, including the chi-square statistic, degrees of freedom, p-value, Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), Standardized Root Mean Square Residual (SRMR), and Akaike Information Criterion (AIC). The primary focus was on the CFI and SRMR, as they are considered more reliable indicators of model fit. A CFI larger than 0.8 or 0.9 was considered acceptable, while an SRMR equal to or lower than 0.08 was deemed acceptable[5, 15].

Upon establishing a satisfactory model fit, we employed it to calculate latent scores for the GIA on a separate set of test data. Subsequently, we calculated the Pearson correlation coefficient between the GIA latent score and the overall accuracy of the subjects on the test data to validate the model’s effectiveness. The results of this analysis provided robust evidence for the validity of the CFA model in capturing the GIA of human subjects, as indicated by the significant positive correlation between the GIA latent score and overall accuracy. This validation process underscores the model’s theoretical grounding in the CHC theory and its empirical support from the data. Subsequently, we applied the CFA model to estimate the GIA for several MLLMs, including GPT-4o[36], GPT-4v[1], LLaVA-1.6-34b[27], LLaVA-1.6-13b, LLaVA-1.6-7b, Mini-Gemini-34b[26], Mini-Gemini-8*7b, Mini-Gemini-13b, and Mini-Gemini-8b, enabling a comparative analysis of their cognitive abilities against human performance.

Appendix F Evaluation Strategy

Option Extraction

For choice extraction, we adopt a two-stage strategy. In the first stage, we employ a keyword-based rule method to parse the model output in order to obtain the chosen option. This approach proved very effective, with the answers of the majority of existing multimodal large models successfully identified at this stage. Yet, to enhance the robustness of our evaluation, we adopt a second, precautionary stage in case the parsing in the first stage fails. This involves deploying GPT-4-turbo to concisely summarize the answer choice from the original model response. If the second stage still fails, we randomly generate an option for the model as the answer to the question. It is noteworthy, though, that throughout the actual testing process thus far, we have not encountered scenarios necessitating the use of random option generation.

The rationale behind not directly resorting to large language models for option extraction in the first stage stems from the superior stability and reliability of the rule-based method. Although leveraging large language models for option extraction has become common practice in model evaluations, it still carries a certain error rate. In contrast, the rule-based method, while not infallible in parsing answers across all scenarios, nearly guarantees correctness in the instances where parsing is successful. Consequently, we advocate for an initial screening using the rule-based method, followed by the employment of large language models for extraction, as a strategy that enhances overall robustness.
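A minimal sketch of the two-stage extraction is given below; the regular expressions are illustrative rules rather than the exact ones used in our pipeline, and the second-stage LLM call is left as an injectable placeholder instead of an actual GPT-4-turbo request.

```python
# Two-stage option extraction: (1) keyword/rule-based parsing of the model output,
# (2) an LLM-based fallback (passed in as a callable placeholder), and (3) a random
# choice as a last resort. The regex rules are illustrative, not the paper's exact ones.
import random
import re

CHOICES = ["A", "B", "C", "D"]

def extract_option(model_output: str, llm_fallback=None) -> str:
    text = model_output.strip().upper()
    # Stage 1: rule-based parsing, e.g. "Answer: B", "The answer is (C)", or a bare "D".
    patterns = [
        r"ANSWER\s*(?:IS|:)?\s*\(?([ABCD])\)?",   # "answer is B" / "Answer: (C)"
        r"^\(?([ABCD])\)?[\.\):,\s]",             # output starting with the option letter
        r"^\(?([ABCD])\)?$",                      # output that is only the letter
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    # Stage 2: ask an LLM (e.g., GPT-4-turbo) to summarize the chosen option.
    if llm_fallback is not None:
        summarized = str(llm_fallback(model_output)).strip().upper()
        if summarized in CHOICES:
            return summarized
    # Stage 3: random option as a last resort (never triggered in our runs so far).
    return random.choice(CHOICES)

print(extract_option("The correct answer is (B) because ..."))  # -> "B"
```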

Scoring

In addition to the calculation of the GIA score mentioned above, our benchmark can also be broken down to calculate accuracy across various cognitive dimensions. Specifically, each question is annotated with the CHC factors it involves; factors that are involved are marked with a 1, and those that are not involved are marked with a 0. When a question involves a certain factor, the correctness of that question will contribute to the accuracy statistics for that particular CHC factor; otherwise, it will not be included in the statistics. Taking the calculation of the accuracy score of the Gc factor as an example:

\[ \mathrm{Acc\_Gc} = \frac{\sum_{i=1}^{n} Gc_i \cdot T_i}{\sum_{i=1}^{n} Gc_i} \tag{1} \]

where n is the total number of questions, Gc_i indicates whether the i-th question involves the Gc factor (1 if it does, 0 otherwise), and T_i indicates whether the i-th question was answered correctly (1 for correct, 0 for incorrect). To mitigate the effects of randomness on the evaluation results, including both the scores of the various CHC factors and the overall GIA score, we adopt a strategy of iterating five times and taking the average.
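As a concrete illustration of Eq. (1), the small sketch below computes per-factor accuracies from the 0/1 factor annotations; the record fields are illustrative stand-ins for the benchmark's annotation columns.

```python
# Per-CHC-factor accuracy as in Eq. (1): a question contributes to a factor's
# accuracy only if it is annotated with that factor (flag = 1). Field names are
# illustrative stand-ins for the benchmark's annotation columns.
FACTORS = ["Gc", "Gv", "Grw", "Gq", "Gf"]

def factor_accuracies(records: list[dict]) -> dict[str, float]:
    acc = {}
    for factor in FACTORS:
        involved = [r for r in records if r["factors"].get(factor, 0) == 1]
        if involved:
            acc[factor] = sum(r["correct"] for r in involved) / len(involved)
    return acc

# Toy example: one question involving Gc only, one involving both Gc and Gv.
records = [
    {"factors": {"Gc": 1, "Gv": 0, "Grw": 0, "Gq": 0, "Gf": 0}, "correct": 1},
    {"factors": {"Gc": 1, "Gv": 1, "Grw": 0, "Gq": 0, "Gf": 0}, "correct": 0},
]
print(factor_accuracies(records))  # {'Gc': 0.5, 'Gv': 0.0}
```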

Appendix G GIA Scores on More Languages

G.1 Ablation Study on LLM Size

Table 5: Training data and hyperparameters for pretraining and fine-tuning.

Data and Hyperparameters | Pretrain     | Finetune
data size                | 558K         | 1550K
batch size               | 256          | 128
lr                       | 1e-3         | 2e-5
lr schedule              | cosine decay | cosine decay
lr warmup ratio          | 0.03         | 0.03
weight decay             | 0            | 0
epoch                    | 1            | 1
optimizer                | AdamW        | AdamW

To further investigate the influence of LLM size on the GIA score, we conducted an ablation study with the Qwen series from 1.8B to 72B parameters. In this experiment, we applied the LLaVA architecture and used the same ViT component (CLIP-ViT-L-14). To strictly control variables, we trained all models ourselves with the same training data and the same set of hyperparameters for pretraining and fine-tuning. The pretraining data comes entirely from LLaVA-1.5, and the fine-tuning data is composed of the LLaVA-1.5 [27] dataset, the ShareGPT4V [10] dataset, and our private visual-text instruction data. We show the training data and hyperparameters for both the first-stage vision-language alignment pretraining and the second-stage visual instruction tuning in Table 5. We use greedy decoding for evaluation to ensure reproducibility. The GIA scores on the six languages are shown in Fig. 8.
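For readers who want to reproduce this setup, the configuration below restates Table 5 in the argument style of common LLaVA training scripts; the key names are illustrative assumptions, not the exact flags we used.

```python
# Illustrative restatement of Table 5; keys mimic typical LLaVA-style training arguments.
PRETRAIN = dict(
    data_size="558K",            # LLaVA-1.5 vision-language alignment data
    global_batch_size=256,
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.0,
    num_train_epochs=1,
    optim="adamw",
)

FINETUNE = dict(
    data_size="1550K",           # LLaVA-1.5 + ShareGPT4V + private visual-text instruction data
    global_batch_size=128,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.0,
    num_train_epochs=1,
    optim="adamw",
)
```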

Across the six languages analyzed, we consistently observe a significant increase in GIA scores as the number of LLM parameters grows. Surprisingly, however, scaling the LLM from 7B to 14B parameters often yields no observable performance gain, and there may even be a slight decline. This phenomenon suggests a threshold for attaining a superior level of general intelligence in MLLMs, likely situated somewhere between 14B and 32B parameters.

[Figure 8: GIA scores of the Qwen-series MLLMs (1.8B to 72B) on the six languages.]

Appendix H Case Study

[Figures 9–18: case-study examples referenced in Sections H.1–H.5.]

H.1 The Common Sense Cluster

Current advanced MLLMs excel in the common sense cluster, especially on general information and oral vocabulary questions, which is likely bolstered by their extensive training datasets. However, some MLLMs, e.g. GPT-4V, still show deficiencies on logo problems tied to cultural background. Logo problems usually contain confusing artistic characters or symbolic expressions of cultural elements, which require a high level of Gc and a certain level of Gv. As shown in Fig. 9, GPT-4V can recognize the locomotive in the logo of Chinese question 46, but it fails to recognize the Chinese character ("hang" in pinyin) in Chinese question 41, whereas GPT-4o can perfectly recognize characters containing Chinese cultural elements.

H.2 The Visual-spatial Cluster

In the visual-spatial cluster, current advanced MLLMs perform very well on the Picture Recognition questions, followed by the Real-world Spatial questions, and perform worst on the Visualization Transformation questions. The high accuracy on the Picture Recognition questions shows that advanced MLLMs already have good object recognition ability. Compared with object recognition, their ability to recognize three-dimensional spatial relationships, which can be divided into translation and rotation transformations, is much weaker. The performance on the Real-world Spatial questions shows that MLLMs can, with reasonable probability, recognize the translation relationships of objects in three-dimensional space, including up, down, left, right, front, and back. At the same time, MLLMs struggle with rotation transformations and spatial imagination in three dimensions, resulting in low accuracy on the Visualization Transformation questions. As shown in Fig. 11, over multiple inferences GPT-4o can always recognize the same cup in English question 81 and, with high probability, the spatial relationship between the two remote controls in English question 99, but it has difficulty recognizing the same blocks after rotation in English question 80 in Fig. 10.

H.3 The Comprehension Cluster

Similar to the common sense cluster, current advanced MLLMs perform very well in the comprehension cluster, including the readings-text, readings-VL, and comic problems, which can be attributed to the powerful language capabilities of the underlying LLMs. Surprisingly, GPT-4o understands scenarios represented both visually and textually in comics quite well, which shows that it can integrate visual narrative comprehension with textual comprehension. As shown in Fig. 13, in English question 146 and French question 144, GPT-4o understands the entire story's connotation from the textual dialogues between characters and the plot development; in particular, it recognizes the facial expressions and the quantitative contrast of population in English question 146. At the same time, GPT-4o still has some shortcomings in understanding the relationships between text paragraphs. As shown in Fig. 12, in English question 7, GPT-4o fails to capture the "general-specific-general" structure of the article.

H.4 The Mathematics Cluster

The mathematics cluster is designed to evaluate the Gq factor. Although current advanced MLLMs do not perform well on math problems overall, we found two interesting phenomena. First, the models perform better on algebra problems than on geometry problems, such as English question 182 in Fig. 14. This may be because the training data of the LLMs contains ample mathematical text, while the visual modules of MLLMs still struggle with abstract geometric figures and their relationships. Second, the models perform better on math-facts problems and problems that can be solved in one step by directly applying mathematical knowledge, including symbolic and geometric knowledge, than on problems that require multi-step reasoning. For example, in Fig. 14, GPT-4o can apply the Central Angle Theorem to solve English question 182, but fails to solve English question 175, which requires multi-step reasoning and calculation. In addition, GPT-4o has reached a level of practical application on simple applied mathematics problems, such as choosing the shortest flight time in English question 188, as shown in Fig. 15.

H.5 The Reasoning Cluster

The reasoning cluster is designed to evaluate the I (inductive reasoning) and RG (deductive reasoning) factors. Similar to the performance gap between geometry and algebra in the mathematics cluster, there is also a gap between deductive and inductive reasoning. Although GPT-4o approaches the average human level for deductive reasoning, it only marginally meets the passing line (60) on syllogism and real-world reasoning problems. For example, in Fig. 17, GPT-4o fails English question 286, a classic deductive reasoning problem that asks it to decide which of the given conclusions logically follows from two given statements. For inductive reasoning, GPT-4o performs quite well on number series and concept formation problems, such as English questions 214 and 237 in Fig. 16, but performs very poorly on the Raven's Matrices problems. Taking English question 254 in Fig. 18 as an example, GPT-4o mistakenly recognizes the graphic in the third row and first column as a vertical line with a black square at the bottom, when the black square is actually at the top, resulting in an incorrect selection. GPT-4o can perform effective reasoning, but with a certain probability it makes small mistakes when recognizing graphics, which shows that its visual module needs further improvement. In addition, we also show the results of GPT-4V, which misidentifies a counterclockwise rotation as clockwise and incorrectly identifies option E as an arrow pointing straight down. This shows that GPT-4V is much worse than GPT-4o in both reasoning and visual recognition.

Appendix I Limitations

  • We have observed a phenomenon in MLLMs similar to human cognition known as “winner takes all”, which corroborates the emergence of GIA within cutting-edge MLLMs. However, we have not yet been able to provide a more definitive and persuasive explanation for the underlying causes. Unraveling this will be one of the directions we dedicate ourselves to in the future.

  • We have gathered human data to construct the GIA model and to compare the cognitive abilities of current MLLMs with those of humans. However, the human data we have amassed thus far is limited, which might affect the accuracy of the GIA model and the objectivity of our findings. Hence, we aim to continue maintaining and expanding our dataset, and to collect more human participant data in the future to cover a more comprehensive and varied set of human samples.

References

  • Achiam et al. [2023] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Agrawal et al. [2019] Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., and Anderson, P. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948–8957, 2019.
  • Andrewes [2015] Andrewes, D. Neuropsychology: From Theory to Practice. Psychology Press, 2015.
  • Bai et al. [2023] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • Baumgartner & Homburg [1996] Baumgartner, H. and Homburg, C. Applications of structural equation modeling in marketing and consumer research: A review. International Journal of Research in Marketing, 13(2):139–161, 1996.
  • Binz & Schulz [2023] Binz, M. and Schulz, E. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023.
  • Bubeck et al. [2023] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  • Caemmerer et al. [2020] Caemmerer, J.M., Keith, T.Z., and Reynolds, M.R. Beyond individual intelligence tests: Application of Cattell-Horn-Carroll theory. Intelligence, 79:101433, 2020.
  • Carroll [1993] Carroll, J.B. Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press, 1993.
  • Chen et al. [2023a] Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a.
  • Chen et al. [2015] Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C.L. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Chen et al. [2023b] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., and Dai, J. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b.
  • Chung et al. [2024] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
  • Coda-Forno et al. [2024] Coda-Forno, J., Binz, M., Wang, J.X., and Schulz, E. CogBench: A large language model walks into a psychology lab. arXiv preprint arXiv:2402.18225, 2024.
  • Doll et al. [1994] Doll, W.J., Xia, W., and Torkzadeh, G. A confirmatory factor analysis of the end-user computing satisfaction instrument. MIS Quarterly, pp. 453–461, 1994.
  • Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  • Dubois et al. [2018] Dubois, J., Galdi, P., Paul, L.K., and Adolphs, R. A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756):20170284, 2018.
  • Flynn [1987] Flynn, J.R. Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101(2):171, 1987.
  • Fu et al. [2024] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. MME: A comprehensive evaluation benchmark for multimodal large language models, 2024.
  • Gandhi et al. [2024] Gandhi, K., Fränken, J.-P., Gerstenberg, T., and Goodman, N. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Goyal et al. [2017] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913, 2017.
  • Kosinski [2023] Kosinski, M. Theory of mind may have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083, 4:169, 2023.
  • Krizhevsky et al. [2012] Krizhevsky, A., Sutskever, I., and Hinton, G.E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
  • Lewis et al. [2020] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Li et al. [2023] Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., and Shan, Y. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  • Li et al. [2024] Li, Y., Zhang, Y., Wang, C., Zhong, Z., Chen, Y., Chu, R., Liu, S., and Jia, J. Mini-Gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
  • Liu et al. [2023a] Liu, H., Li, C., Li, Y., and Lee, Y.J. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.
  • Liu et al. [2023b] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023b.
  • Lu et al. [2024] Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Sun, Y., et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  • Lu et al. [2022] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  • Lu et al. [2023] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  • Marino et al. [2019] Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3195–3204, 2019.
  • McCarthy et al. [2006] McCarthy, J., Minsky, M.L., Rochester, N., and Shannon, C.E. A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Magazine, 27(4):12–12, 2006.
  • McGrew [2009] McGrew, K.S. CHC theory and the human cognitive abilities project: Standing on the shoulders of the giants of psychometric intelligence research, 2009.
  • McGrew & Evans [2004] McGrew, K.S. and Evans, J.J. Internal and external factorial extensions to the Cattell-Horn-Carroll (CHC) theory of cognitive abilities: A review of factor analytic research since Carroll's seminal 1993 treatise. Institute for Applied Psychometrics, 2004.
  • OpenAI [2024] OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/, 2024.
  • Poldrack & Yarkoni [2016] Poldrack, R.A. and Yarkoni, T. From brain maps to cognitive ontologies: Informatics and the search for mental structure. Annual Review of Psychology, 67:587–612, 2016.
  • Raven [2003] Raven, J. Raven Progressive Matrices. In Handbook of Nonverbal Assessment, pp. 223–237. Springer, 2003.
  • Roid & Barram [2004] Roid, G.H. and Barram, R.A. Essentials of Stanford-Binet Intelligence Scales (SB5) Assessment, volume 39. John Wiley & Sons, 2004.
  • Roid & Pomplun [2012] Roid, G.H. and Pomplun, M. The Stanford-Binet Intelligence Scales, volume 654. The Guilford Press, New York, NY, USA, 2012.
  • Schaeffer et al. [2024] Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2024.
  • Schneider & McGrew [2012] Schneider, W.J. and McGrew, K.S. The Cattell-Horn-Carroll model of intelligence. Contemporary Intellectual Assessment: Theories, Tests, and Issues, pp. 99–144, 2012.
  • Schneider & McGrew [2018] Schneider, W.J. and McGrew, K.S. The Cattell-Horn-Carroll theory of cognitive abilities. Contemporary Intellectual Assessment: Theories, Tests, and Issues, pp. 73–163, 2018.
  • Schrank & Wendling [2018] Schrank, F.A. and Wendling, B.J. The Woodcock–Johnson IV. Contemporary Intellectual Assessment: Theories, Tests, and Issues, 383, 2018.
  • Schrank et al. [2016] Schrank, F.A., Decker, S.L., and Garruto, J.M. Essentials of WJ IV Cognitive Abilities Assessment. John Wiley & Sons, 2016.
  • Sidorov et al. [2020] Sidorov, O., Hu, R., Rohrbach, M., and Singh, A. TextCaps: A dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 742–758. Springer, 2020.
  • Singh et al. [2019] Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326, 2019.
  • Spearman [1961] Spearman, C. "General intelligence" objectively determined and measured. The American Journal of Psychology, 1961.
  • Sternberg [1985] Sternberg, R.J. Beyond IQ: A Triarchic Theory of Human Intelligence. CUP Archive, 1985.
  • Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang [2019] Wang, P.L. Concept formation and frontal lobe function: The search for a clinical frontal lobe test. In The Frontal Lobes Revisited, pp. 189–205. Psychology Press, 2019.
  • Wang et al. [2023a] Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al. CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023a.
  • Wang et al. [2023b] Wang, X., Li, X., Yin, Z., Wu, Y., and Liu, J. Emotional intelligence of large language models. Journal of Pacific Rim Psychology, 17:18344909231213958, 2023b.
  • Watkins [2018] Watkins, M.W. Exploratory factor analysis: A guide to best practice. Journal of Black Psychology, 44(3):219–246, 2018.
  • Wei et al. [2022] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
  • Xu et al. [2023] Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., and Luo, P. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
  • Yang et al. [2023] Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., and Wang, L. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
  • Yin et al. [2024] Yin, Z., Wang, J., Cao, J., Shi, Z., Liu, D., Li, M., Huang, X., Wang, Z., Sheng, L., Bai, L., et al. LAMM: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. Advances in Neural Information Processing Systems, 36, 2024.
  • Young et al. [2024] Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., Li, H., Zhu, J., Chen, J., Chang, J., et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
  • Young et al. [2014] Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • Yu et al. [2023] Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • Yue et al. [2023] Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502, 2023.
  • Zhang et al. [2024] Zhang, W., Aljunied, M., Gao, C., Chia, Y.K., and Bing, L. M3Exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhong et al. [2023] Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
  • Zhu et al. [2023] Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.