Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), i.e., iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating errors across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline rather than repeatedly refining a single output over time, as in prior work. By distributing compute across stages and pruning low-quality branches early with process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent system that generates visually insightful charts and reports from a given dataset, and we design a reliable LLM-based judge aligned with human experts (Kendall's τ = 0.55). Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
We investigate whether unverifiable rewards in visual insight generation can be operationalized and improved via test-time scaling. To this end, we build a simple multi-agent pipeline that produces insightful reports from raw data, together with a human-aligned LLM-based judge that evaluates report quality. Using this judging signal, we then study how far quality can be scaled under a fixed compute budget.
Our multi-agent data analysis pipeline is designed to answer two key questions: (RQ1) whether judge-guided scaling can align with human experts, and (RQ2) how much performance can be gained within the same compute budget using the proposed Selective TTS.
In detail, our pipeline comprises four stages: Data Profiling, Visualization, Insight Generation, and Judge Verification, each implemented as an agent with stage-specific prompts and powered by an LLM or a vision-language model (VLM). The overall design is inspired by Gan et al. (2025).
Below we showcase high-quality insights generated by our pipeline across five diverse datasets, demonstrating consistent quality and diverse analytical perspectives.
The Medical Insurance dataset contains patient attributes (age, BMI, smoking status) and insurance charges.

Children show negligible correlation with both age (~0.04) and BMI (~0.01), reinforcing the notion that family size is not a dominant factor in determining medical charges compared to age or BMI. This lack of correlation across demographic attributes highlights the nuanced nature of medical cost drivers, where socioeconomic factors or other unmeasured variables may play a larger role in determining charges. For healthcare administrators, understanding these independence relationships is critical when designing bundled payment models or risk-adjustment mechanisms, as reliance on poorly correlated predictors like children would likely lead to inaccurate cost projections. Further exploration into non-demographic factors such as income level or chronic condition prevalence could improve cost prediction accuracy.

The age distribution of insured individuals is skewed towards younger demographics, particularly those in their early twenties, where frequencies reach their peak (~160). This observation suggests that younger age groups may be more actively seeking insurance coverage compared to older individuals. Such a trend might indicate targeted marketing strategies aimed at youths or lower barriers to entry for younger customers. However, this also raises concerns regarding the potential imbalance in risk pooling, as older individuals, who are typically associated with higher risk profiles, are underrepresented in the insured population. Insurance providers should consider revising their outreach strategies to engage and retain older demographics, possibly through tailored offerings or education campaigns, to achieve a more balanced risk portfolio.

The chart reveals a significant difference in the relationship between BMI and medical charges for smokers versus non-smokers. Specifically, the trend lines show that for individuals who smoke, there is a much steeper positive correlation between BMI and charges compared to non-smokers. Non-smokers exhibit a relatively flat trend with minimal changes in charges across different BMIs, indicating that other factors may play a more dominant role in determining their charges. In contrast, smokers experience a marked increase in charges as BMI rises, suggesting that both smoking status and BMI have compounding effects on healthcare costs. This insight highlights the need for targeted interventions, such as wellness programs or health screenings, for smokers, especially those with higher BMIs, as they appear to bear a disproportionately higher financial burden due to combined risk factors.

The chart reveals a complex relationship between age and medical charges, characterized by a general upward trend but also significant variability across different age groups. For instance, while medical charges tend to increase with age, there are multiple clusters of data points that suggest distinct subgroups within certain age ranges. For ages below 40 years, the spread of medical charges is relatively narrow, indicating a more homogeneous population in terms of healthcare spending. However, beyond 40 years, the spread widens significantly, particularly around the 50-60 age range, where there is a marked cluster of higher charges. This suggests that health complications or increased medical needs may become more common in this age group. For ages above 60, there appears to be a plateauing or even slight decrease in the upper range of medical charges, which could be due to factors such as Medicare coverage or end-of-life scenarios. Given this variability, healthcare providers should consider targeted interventions for older adults, focusing on preventive care and personalized treatment plans tailored to different health needs observed in these clusters.

The pronounced skewness in the distribution of medical charges for smokers, particularly towards the higher end, suggests that smoking may disproportionately impact healthcare costs for a smaller subset of individuals. The long right tail indicates that while some smokers experience relatively normal healthcare expenses, others encounter extremely high costs due to smoking-related complications. This skewed distribution highlights the need for differentiated healthcare approaches, where high-risk smokers are prioritized for more intensive interventions and support. For instance, insurers or public health organizations could develop tailored programs focused on early detection and prevention for this vulnerable subgroup, potentially leading to better health outcomes and cost containment.

There is a significant upward trend in medical charges as age increases, suggesting that healthcare costs rise systematically with age. This trend is evident from the scattered data points clustering more densely along an upward trajectory that aligns with the fitted regression line. While there are individual variations, particularly among younger age groups, the overall direction of the trend is consistent across both male and female populations, as indicated by the color coding. This insight suggests that older individuals may require more intensive or frequent medical interventions, leading to higher expenses. A potential next step for stakeholders could involve developing targeted healthcare cost management strategies for aging populations, focusing on preventive care or subsidies for individuals aged 60+.
The Diabetes dataset includes clinical measurements (glucose, blood pressure, insulin, BMI) and diabetes status indicators.

The DiabetesPedigreeFunction exhibits a notably low correlation with most features, including Glucose (0.14), BloodPressure (0.041), SkinThickness (0.18), Insulin (0.19), and BMI (0.14). This finding implies that family history or genetic susceptibility to diabetes, as represented by the DiabetesPedigreeFunction, does not strongly predict other physiological characteristics commonly associated with diabetes onset. This may suggest that environmental or lifestyle factors play a more dominant role than genetics in modulating these traits. Researchers may need to focus on identifying interactions between DiabetesPedigreeFunction and other unmeasured variables to explain its limited standalone predictive power. Public health initiatives could prioritize lifestyle interventions, such as diet and exercise, even in individuals with a positive family history, due to the weak genetic associations observed here.

The sharp increase in insulin levels observed in Outcome 1 around age 60 suggests a significant metabolic shift that could indicate the onset of severe diabetic complications. This peak, exceeding 400 units, stands out as an anomaly compared to the consistent trend in Outcome 0, where insulin levels remain relatively stable and decline gradually after peaking earlier. The dramatic spike likely reflects heightened insulin resistance or hyperglycemia, conditions commonly associated with advanced diabetes progression. This finding suggests that individuals with Outcome 1 may benefit from more intensive monitoring and intervention during this critical period, potentially delaying the progression of diabetes-related complications.

The chart illustrates a substantial overlap in insulin levels between 'Outcome 0' and 'Outcome 1' across various age groups. While there is a discernible separation in younger age groups, the lines for both outcomes become increasingly indistinguishable as age progresses. This overlap becomes particularly pronounced in the 50+ age range, where data points from both groups cluster around very low insulin levels. The lack of clear differentiation between outcomes later in life may suggest that age outweighs other distinguishing factors, potentially indicating shared physiological processes or limitations in distinguishing 'Outcome 1' from 'Outcome 0' based solely on insulin levels. Policymakers may need to reassess how diagnostic criteria adapt for different age groups to ensure accuracy in identifying potential health risks.

The scatter plot reveals a positive correlation between BMI and Glucose levels for individuals across both outcomes (Outcome 0 and Outcome 1). As BMI increases, there is a general upward trend in glucose levels, suggesting that higher body mass index is likely associated with higher blood sugar levels. This trend is evident through the distribution of data points, which cluster more densely along the upper-right quadrant of the chart where both BMI and Glucose are higher. Furthermore, the regression line suggests a consistent increase in glucose levels with increasing BMI, although the relationship appears non-linear, as indicated by the curvature of the data points around the line. This insight implies that interventions targeting weight management may play a crucial role in managing glucose levels, particularly for populations at risk of developing glucose-related health issues. Stakeholders such as healthcare providers could focus on promoting weight loss programs or lifestyle changes for patients with higher BMIs to potentially reduce glucose levels.

Glucose shows moderate correlations with several features: BMI (0.22), BloodPressure (0.15), and Insulin (0.33). This pattern indicates that glucose levels are influenced by a combination of body composition and metabolic activity. The presence of these correlations may highlight the complex interplay between lifestyle factors and physiological responses. For instance, interventions aimed at reducing BMI and managing BloodPressure might indirectly help stabilize glucose levels in pre-diabetic populations. Further studies could investigate whether a multi-faceted approach addressing all correlated factors leads to more effective glucose control compared to single-focus strategies. The threshold for action could be set at identifying individuals with simultaneous elevations in these correlated metrics, targeting early intervention protocols.

The relationship between glucose and insulin levels across the outcomes reveals an indirect linkage tied to metabolic health. While glucose levels differ sharply between outcomes 0 and 1, insulin levels also exhibit a distinct pattern indicative of different physiological states. For outcome 0, the low insulin levels coupled with relatively normal glucose suggest a lack of metabolic stress or insulin resistance. Conversely, the high insulin levels in outcome 1, even though glucose levels are elevated, imply compensatory insulin secretion possibly to counteract insulin resistance. This observation underscores the complex interplay between glucose metabolism and insulin response. Healthcare providers could implement strategies to assess insulin sensitivity concurrently with glucose monitoring, focusing on interventions such as lifestyle modifications or medications to address insulin resistance for patients with outcomes consistent with outcome 1.
The Customer Shopping Behavior dataset captures retail transaction data including product categories and seasonal patterns.

Clothing demonstrates a cyclical trend in popularity across seasons, peaking in Spring and rebounding in Winter. In Spring, the total purchase amount for clothing surpasses $27,000, indicating its seasonal demand spike. After declining in Summer to around $23,000, there's a noticeable recovery in Fall, hovering close to $26,000, followed by another peak in Winter nearing the Spring levels. This cyclicality likely reflects consumer behavior driven by seasonal fashion changes. Retailers may consider increasing inventory and marketing efforts during Spring and Winter to capitalize on this peak demand, focusing on innovative products and discounts in Summer to maintain sales traction.

Footwear demand exhibits noticeable seasonal variation, showing higher purchases in Summer (160) and Spring (163), compared to Fall (136) and Winter (140). This suggests that warmer seasons may trigger increased demand for lighter footwear types. Retailers might consider adjusting inventory levels or launching promotions during the warmer months to maximize sales. Conversely, the relatively lower demand in colder seasons may signal the need for different marketing strategies focused on winter boots or heavier footwear options. This seasonal disparity highlights the importance of adapting inventory management and advertising based on weather trends.

Summer has the lowest average review rating at approximately 3.71, suggesting it might be the most challenging season for maintaining customer satisfaction. This dip could stem from factors like intense competition, user fatigue, or a shift in consumer priorities during the warmer months. Given that summer marks a notable low compared to other seasons, the company might benefit from conducting seasonal analyses to pinpoint the exact causes. For instance, they could survey customers specifically in summer to understand their concerns better. Once identified, actions like implementing summer-exclusive features, improving logistics, or offering seasonal discounts could help lift ratings closer to other seasons' benchmarks.

The payment method landscape indicates a competitive environment among the top four methods (Credit Card, PayPal, Cash, and Debit Card), all contributing approximately 17%-18% of total sales. This close parity suggests that customers have diverse preferences, making it critical for businesses to maintain flexibility in their payment options. Given the similarity in usage across these methods, marketing strategies should focus on enhancing user experience and incentives across multiple platforms to capture a larger share of customer preference. The small differences (~1%) between these methods imply that even slight improvements in user convenience or loyalty programs could shift customer choices noticeably.

The seasonal sales trend reveals a significant peak during Fall, where the total purchase amount reaches its highest point at approximately $60,000. This peak is roughly $4,000 higher than the next highest season (Winter). The sharp drop observed between Spring and Summer, followed by a substantial rise from Summer to Fall, suggests that consumer spending behavior may be heavily influenced by seasonal factors such as holiday shopping or changes in purchasing habits. The similarity in values between Winter ($58,000) and Spring ($59,000) indicates a relatively stable period before the summer dip. Retailers could leverage this insight by intensifying marketing campaigns and inventory preparation ahead of Fall, potentially increasing revenue by capitalizing on the peak season. For instance, retailers might consider increasing stock levels or running promotional campaigns in early Fall to capture this demand surge.

Bank transfers exhibit a periodic pattern, with usage peaking during the bi-weekly and quarterly purchase intervals. These intervals show higher scores (88 and 109, respectively) compared to monthly and weekly intervals. This cyclical behavior may reflect structured payment cycles where consumers prefer bank transfers for routine financial planning. Businesses could leverage this by offering bank transfer discounts during these intervals, potentially increasing transaction volumes by appealing to customers with structured purchasing habits. For instance, promoting bank transfers during bi-weekly and quarterly cycles could see a 10-15% increase in usage within six months.
The World Happiness Report dataset aggregates country-level indicators including GDP, social support, and life expectancy.

The heatmap reveals a strong positive correlation between 'Log GDP Per Capita' and both 'Social Support' and 'Healthy Life Expectancy At Birth,' with correlation coefficients of 0.68 and 0.82, respectively. This suggests that higher income levels within a country are closely linked to enhanced social support systems and better health outcomes. The relatively moderate positive correlation with 'Freedom To Make Life Choices' (0.37) indicates that economic prosperity may also contribute to personal freedoms, though to a lesser extent compared to social support and life expectancy. These findings imply that policy interventions focused on improving GDP per capita might simultaneously boost various social and health indicators. However, the lack of significant correlation with 'Generosity' (-0.00) could suggest that wealth alone does not necessarily translate to increased generosity in societal behavior. A possible intervention strategy could involve reallocating funds towards social programs that emphasize community-building activities to encourage more generous behaviors alongside economic growth initiatives.

There appears to be a diminishing marginal return effect of GDP on happiness as we move to higher GDP per capita levels. While the red regression line shows a positive correlation, the slope begins to flatten as the Log GDP Per Capita increases toward the right end of the x-axis. This implies that while economic growth significantly enhances happiness initially, its impact plateaus at higher GDP levels. This observation aligns with the concept of the 'Easterlin Paradox,' which posits that beyond a certain level of affluence, further economic gains do not substantially improve subjective well-being. Policymakers targeting happiness maximization should focus on ensuring equitable distribution of wealth rather than solely pursuing economic growth once a country reaches a certain prosperity threshold. For example, redistributive policies and investments in social services may yield better results in enhancing happiness compared to further GDP-driven strategies.

There is a negative correlation between perceptions of corruption and confidence in the national government, as indicated by the concentration of high-density areas (yellow-green regions) towards the bottom-right and top-left quadrants of the heatmap. Specifically, as the perceptions of corruption increase (approaching 1.0), confidence in the government decreases, and vice versa. This suggests that public perception of corruption significantly influences trust levels in governance. The density distribution demonstrates that the majority of data points lie along this trend, highlighting the importance of addressing corruption to restore public trust. Policymakers should focus on implementing transparent anti-corruption measures to shift the population towards higher confidence levels, using interventions targeted at regions with high corruption perceptions (e.g., a policy aimed at reducing corruption levels by 0.2 points within two years).

The data demonstrates a non-linear drop in confidence as perceptions of corruption rise. While the decline is gradual at lower corruption levels (below 0.3), the rate of decline accelerates significantly once perceptions exceed 0.5. This sharp decrease suggests that public tolerance for corruption has a threshold beyond which confidence rapidly diminishes. Such non-linearity may indicate that citizens are willing to overlook minor instances of corruption but become highly sensitive to more pervasive issues. Policymakers should therefore focus intensively on preventing corruption from crossing this threshold to avoid sudden trust collapses. Developing early detection systems and rapid response mechanisms for emerging corruption could be key strategies to maintain confidence.

The Negative Affect distribution exhibits a steep decline in frequency beyond a score of 0.3. This sharp drop-off is evident as the red bars reduce dramatically towards the right side of the histogram. This pattern suggests that as affect scores increase, the likelihood of experiencing negative emotions decreases substantially, indicating a potential saturation or 'ceiling effect' for negative affect at higher scores. This could imply that individuals who report higher affect scores are less likely to report significant negative experiences, pointing to a possible threshold where positivity dominates. Public health campaigns focused on improving overall well-being might aim to guide individuals towards achieving affect scores above 0.3, as doing so seems to correlate with a marked reduction in negative emotional experiences. Strategic initiatives could measure success by tracking the shift in mean affect scores moving above this critical threshold.

The correlation matrix shows that 'Social Support' has moderately strong positive correlations with 'Life Ladder' (0.72) and 'Log GDP Per Capita' (0.68), suggesting that social cohesion and support systems contribute to both subjective well-being and economic prosperity. However, the relatively weak correlation with 'Generosity' (0.07) raises questions about whether social support independently drives charitable behavior or if it requires additional factors. For example, cultivating a sense of community through public policy could enhance social support, which in turn might indirectly encourage generosity by fostering empathy. A potential strategy could involve local governments investing in community-building initiatives, aiming to boost social support scores by 30% within five years, thereby exploring indirect pathways to influence generosity.
The Student Performance Factors dataset contains student attributes including study hours, attendance, tutoring sessions, and exam scores.

The category with 'High' access to resources and no extracurricular activities (795) is unexpectedly large, revealing a subgroup that does not utilize available opportunities. This could indicate preferences, time constraints, or misalignment between offered programs and interests. Educational institutions or policymakers should explore survey-based feedback mechanisms to understand the motivations behind non-participation. By identifying and addressing these reasons, they could achieve a 15% reduction in the 'No' category within the 'High' access group over the next academic year, thereby better aligning resource utilization with actual demand.

Despite the general positive relationship, there is a noticeable variability in exam scores even among students who studied similar amounts of time. For example, some students who studied 20 hours achieved scores as low as 60, while others scored close to 90. This variability suggests that factors other than study duration, such as study methods, prior knowledge, or individual learning styles, play a crucial role in determining exam outcomes. Therefore, schools should explore diversifying teaching strategies or personalizing study guidance to address these variations, potentially leading to more consistent performance across students. A pilot program testing different study techniques with students who study a similar amount of time could help identify more effective approaches.

While 'Medium' access to resources provides moderate exam scores, it does not appear to compensate as effectively for the lack of internet access as other conditions. For example, students with 'Medium' access and 'No' internet have a mean score of 66.8, slightly below 'High' access with 'No' internet (67.0). This suggests that 'Medium' resource levels alone might not be sufficient to offset the disadvantages of limited internet access. Schools or policy initiatives focused on boosting resources for these students could aim to move them closer to 'High' access thresholds, particularly in areas where internet penetration is low. A strategic goal could be to increase 'Medium' access scores by at least 0.5 points when 'No' internet is present, through targeted investments in resource augmentation over the next five years.

While 'Previous_Scores' show a moderate positive correlation (~0.18) with 'Exam_Score,' the influence appears weaker compared to 'Hours_Studied' and 'Attendance.' This suggests that prior academic performance has some predictive power but is not the sole determinant of current exam scores. Educational interventions aimed at improving overall performance might need to focus more on current behaviors like studying and attending classes, rather than solely relying on previous scores as predictors of success. A balanced approach addressing both past performance and current engagement could yield better results.

The chart reveals a weak positive correlation between the number of tutoring sessions and exam scores, indicated by the Pearson correlation coefficient of r=0.16. Despite this weak relationship, the trend line shows a slight upward trajectory as the number of tutoring sessions increases. However, there is significant variability in exam scores across different numbers of tutoring sessions, suggesting that other factors may play a larger role in determining exam performance. For instance, the range of scores for 1 tutoring session spans from approximately 50 to 95, which highlights the complexity of predicting outcomes based solely on tutoring sessions. This suggests that educational institutions might consider exploring additional interventions to complement tutoring, such as targeted study materials or personalized feedback, to potentially boost student performance more effectively.

For very low attendance levels (below ~50%), there is a significant variability in exam scores, indicated by the longer whiskers and scattered individual data points. This variability suggests that students with low attendance have inconsistent outcomes, possibly due to missing crucial lectures or lacking foundational knowledge. On the other hand, for higher attendance levels (above ~70%), the spread of exam scores reduces, showing more consistent performance among students. This consistency may reflect a stronger engagement with course content and a more structured learning environment. To improve overall academic outcomes, interventions targeting students with low attendance rates could be particularly beneficial, focusing on strategies to reduce absenteeism and ensure foundational knowledge is acquired.
To address RQ1 (whether judge-guided scaling can align with human experts), we designed three judges with varying degrees of strictness to evaluate insight quality. Each judge employs different evaluation criteria and reasoning processes, ranging from surface-level readability checks to deep analytical assessments. We conducted small-scale scaling experiments on two datasets: VIS Publication and Medical Insurance. Below are the prompt design snippets for the three judges:
| Easy Judge | Moderate Judge | Harsh Judge |
|---|---|---|
| Task: objective evaluation using chart-only evidence. Traits: Readability; OnTopic; TrendAlignment. Process: direct observation and scoring. Scoring: integers 0–100 per trait. Output: JSON {scores, evidence, conclusion}. | Task: objective evaluation using chart-only evidence. Traits: Correctness; Specificity; InterpretiveValue. Process: identify chart elements and rate insight clarity. Scoring: integers 0–100 per trait. Output: JSON {scores, evidence, conclusion}. | Task: objective evaluation using chart-only evidence. Traits: Correctness and Factuality; Specificity and Traceability; Insightfulness and Depth; So-what quality. CoT Process: observe chart → decompose insight → map evidence → score → conclude. Scoring: integers 0–100 per trait. Output: JSON {scores, evidence, conclusion}. |
Using Qwen2.5-VL-32B-Instruct as the generation backbone, we ran the full pipeline with test-time branching to produce final reports, each containing a visualization paired with its corresponding textual insight. To ensure robust judge design, we employed GPT-4o as a strong judging backbone throughout the human-alignment study, which helps minimize annotation bias arising from model-level limitations. For stability, each report was evaluated by the same judge three independent times, and the final score was obtained by averaging these repeated evaluations. We evaluated two representative tabular datasets (VIS Publication and Medical Insurance), each yielding approximately 1,500 final reports. To efficiently gather human preferences without annotating all reports, we employed vertical sampling: at five key quantile points (0%, 25%, 50%, 75%, 100%) along each judge's score distribution, we extracted the corresponding reports and created pairwise comparisons across the three judges. This sampling strategy yielded 30 systematically selected pairs (5 quantiles × 2 datasets × 3 judge pairs) for expert human annotation.
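To make the vertical sampling step concrete, here is a minimal sketch of quantile-based selection and pairwise comparison construction; the score arrays and judge names are illustrative placeholders rather than our actual data.

```python
import numpy as np
from itertools import combinations

def vertical_sample(scores, quantiles=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the index of the report closest to each quantile of a judge's score distribution."""
    scores = np.asarray(scores, dtype=float)
    targets = np.quantile(scores, quantiles)               # score value at each quantile point
    return [int(np.abs(scores - t).argmin()) for t in targets]

# Hypothetical usage: one score array per judge for a single dataset (~1,500 reports).
rng = np.random.default_rng(0)
judge_scores = {name: rng.uniform(40, 95, size=1500) for name in ("easy", "moderate", "harsh")}
sampled = {name: vertical_sample(s) for name, s in judge_scores.items()}

# Pairwise comparisons across the three judges at each quantile point.
pairs = [(a, sampled[a][q], b, sampled[b][q])
         for q in range(5)
         for a, b in combinations(judge_scores, 2)]
print(len(pairs))  # 15 pairs for this dataset; 30 across the two datasets
```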
We collected preferences from four expert annotators who evaluated which report in each pair demonstrated higher insight quality. To measure judge-human alignment, we computed rank correlation metrics including Kendall's τ and Spearman's ρ, along with inter-annotator agreement measured by Kendall's W. The evaluation results are summarized below:
| Judge | VIS Publication: Kendall's τ (↑) | VIS Publication: Spearman's ρ (↑) | VIS Publication: Kendall's W (↑) | Medical Insurance: Kendall's τ (↑) | Medical Insurance: Spearman's ρ (↑) | Medical Insurance: Kendall's W (↑) |
|---|---|---|---|---|---|---|
| Easy | 0.40±0.24 | 0.53±0.24 | 0.64 | 0.55±0.30 | 0.62±0.26 | 0.54 |
| Moderate | 0.45±0.17 | 0.55±0.23 | 0.51 | 0.40±0.00 | 0.55±0.08 | 0.65 |
| Harsh | 0.55±0.30 | 0.60±0.37 | 0.59 | 0.65±0.30 | 0.72±0.27 | 0.64 |
The results demonstrate that the harsh judge consistently achieves the strongest alignment with human expert preferences across both datasets, with Kendall's τ reaching 0.55 on VIS Publication and 0.65 on Medical Insurance. We therefore adopt the harsh judge as our pseudo ground-truth for subsequent scaling experiments in RQ2.
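For reference, the alignment metrics reported above can be computed with standard statistical tooling. A minimal sketch, assuming a judge's scores and each annotator's rankings over the sampled reports are already collected as equal-length arrays (all variable names are illustrative):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def judge_human_alignment(judge_scores, annotator_rankings):
    """Judge-human rank correlations plus Kendall's W for inter-annotator agreement."""
    taus = [kendalltau(judge_scores, r)[0] for r in annotator_rankings]
    rhos = [spearmanr(judge_scores, r)[0] for r in annotator_rankings]

    # Kendall's W over m annotators ranking the same n items (no tie correction).
    R = np.asarray(annotator_rankings, dtype=float)   # shape (m, n), each row a ranking 1..n
    m, n = R.shape
    rank_sums = R.sum(axis=0)
    S = ((rank_sums - rank_sums.mean()) ** 2).sum()
    W = 12 * S / (m ** 2 * (n ** 3 - n))

    return {"tau": (np.mean(taus), np.std(taus)),
            "rho": (np.mean(rhos), np.std(rhos)),
            "W": W}
```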
Using the human-aligned judge as our scoring signal, we run the full pipeline to generate a large pool of reports and find that scaling is indeed effective for this otherwise unverifiable task. As compute increases, scores consistently improve and the distributions shift upward, indicating a higher concentration of strong reports. The curves below show scaling behavior across datasets, and you can interactively click highlighted points to inspect the corresponding chart-insight pairs.
Having established the harsh judge as a reliable evaluator aligned with human experts, we now address RQ2: how to maximize insight quality under a fixed compute budget. Traditional test-time scaling approaches repeatedly refine outputs over time, but this becomes problematic in multi-stage pipelines, where errors can compound across stages, a failure mode we refer to as judge drift. Instead, we propose Selective Test-Time Scaling (Selective TTS), a process-based refinement framework that distributes compute across different stages of the pipeline rather than iterating on the final output alone.
The core idea of Selective TTS is to generate multiple candidate outputs at each stage of the pipeline, evaluate them with stage-local evaluators, and prune low-quality branches before they propagate to downstream stages. For each stage \(s\) among metadata report generation, visualization direction generation, and insight generation, we define a branching factor \(b_s\) and a pruning ratio \(\rho \in (0, 1)\). At each stage, we generate \(b_s\) candidates, evaluate them with a stage-specific judge, and retain only the top \(n'_s = \max(1, \lceil(1-\rho) \cdot b_s\rceil)\) candidates for the next stage, as sketched below. This selective pruning prevents error accumulation by eliminating poor-quality intermediate outputs early in the pipeline.
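A minimal sketch of this per-stage branch-and-prune step is given below; `generate_fn` and `judge_fn` are hypothetical stand-ins for a stage agent and its stage-specific judge (both LLM/VLM calls in our pipeline).

```python
import math
from typing import Callable, List, TypeVar

T = TypeVar("T")

def branch_and_prune(inp, generate_fn: Callable[..., T],
                     judge_fn: Callable[[T], float], b_s: int, rho: float) -> List[T]:
    """Generate b_s candidates from one input, score them with a stage-local judge,
    and keep only the top n'_s = max(1, ceil((1 - rho) * b_s)) branches."""
    candidates = [generate_fn(inp) for _ in range(b_s)]        # b_s stochastic samples
    ranked = sorted(candidates, key=judge_fn, reverse=True)    # stage-specific judge ranks them
    n_keep = max(1, math.ceil((1 - rho) * b_s))
    return ranked[:n_keep]

# Chaining across stages (hypothetical stage agents and judges):
# reports    = branch_and_prune(raw_dataset, profile_agent, metadata_judge, b_s=5, rho=0.6)
# directions = [d for r in reports
#               for d in branch_and_prune(r, viz_agent, viz_judge, b_s=5, rho=0.6)]
```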
For our four-stage pipeline (Data Profiling → Visualization → Insight Generation → Judge Verification), we employ the following pruning schedule: starting with one initial input (the raw dataset), we generate \(b_s\) metadata reports and prune to retain \(n'_s\) of them. For each retained report, we generate \(b_s\) visualization directions and prune to \(n'_s\) directions. Each direction then goes through chart rendering with quality verification. Let \(p_v \in [0,1]\) denote the chart verification pass probability, i.e., the expected rate at which generated charts pass quality checks; among the \(n'_s \cdot n'_s\) attempted visualizations, we expect \(p_v \cdot n'_s \cdot n'_s\) to pass. Finally, for each verified chart, we generate \(b_s\) insights and prune to \(n'_s\) insights for final judge verification. The result is a branching tree in which high-quality paths are preserved and low-quality branches are eliminated at each stage.
Each pruning decision is guided by an LLM-based judge tailored to its stage: a metadata judge ranks metadata reports, a visualization judge ranks candidate chart directions, and an insight judge ranks drafted insights. This process-level allocation reduces the accumulation of judging noise over time and yields more stable scaling behavior under a fixed budget.
To ensure fair comparison and practical deployment, we measure computational cost in terms of LLM API calls, which directly translates to inference cost in production settings. The total compute budget of a single run can be decomposed by enumerating all major LLM calls performed across stages. Specifically, the run-level budget \(B_{\text{run}}(\rho)\) can be written as:
| Stage | Budget Calculation (Total LLM Calls) | Description |
|---|---|---|
| Profiling | \(b_s + \mathbb{I}[\rho>0]\) | Generate \(b_s\) metadata reports (\(b_s\) calls) and prune (if applied) |
| Visualization | \(n'_s \cdot 1 + n'_s \cdot \mathbb{I}[\rho>0]\) | For each retained metadata report, generate \(b_s\) visualization directions (1 call) and prune (if applied) |
| Chart Rendering | \(2n'_s \cdot n'_s\) | For each visualization direction: code generation (1 call) + image quality verification (1 call) |
| Insight Generation | \(p_v n'_s \cdot n'_s + p_v n'_s \cdot n'_s \cdot \mathbb{I}[\rho>0]\) | For each verified chart, generate \(b_s\) insights (1 call) and prune (if applied) |
| Judge Verification | \(p_v n'_s \cdot n'_s \cdot n'_s\) | Final quality evaluation for each chart-insight pair (1 call) |
As mentioned above, \(p_v \in [0,1]\) denotes the expected rate at which generated visualizations pass quality checks. The term \(p_v n'_s\) represents the expected number of verified charts among \(n'_s\) attempted visualizations per retained meta-report. The indicator function \(\mathbb{I}[\rho>0]\) equals 1 when pruning is applied (\(\rho > 0\)) and 0 otherwise. The expected budget has asymptotic complexity \(\mathcal{O}(p_v {n'_s}^3)\) for generation and \(\mathcal{O}(\mathbb{I}[\rho>0] \cdot p_v {n'_s}^2)\) for pruning overhead, where \(n'_s = \max(1, \lceil(1-\rho) \cdot b_s\rceil)\) denotes the number of surviving branches after pruning at stage \(s\).
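To make the accounting concrete, the sketch below evaluates the run-level budget exactly as decomposed in the table; the chart pass rate \(p_v\) is estimated empirically in practice, and the value used here is only illustrative.

```python
import math

def run_budget(b_s: int, rho: float, p_v: float) -> float:
    """Expected LLM calls for one run, following the stage-wise decomposition above."""
    n = max(1, math.ceil((1 - rho) * b_s))    # surviving branches per stage, n'_s
    prune = 1 if rho > 0 else 0               # indicator I[rho > 0]

    profiling = b_s + prune                   # b_s metadata-report calls + optional prune call
    viz       = n * 1 + n * prune             # per retained report: 1 direction call + prune
    rendering = 2 * n * n                     # per direction: code generation + chart verification
    insights  = p_v * n * n * (1 + prune)     # per verified chart: 1 insight call + prune
    judging   = p_v * n * n * n               # per verified chart: n'_s final verifications
    return profiling + viz + rendering + insights + judging

# Sweeping the pruning ratio under a fixed branching factor (p_v is illustrative):
for rho in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"rho={rho}: ~{run_budget(b_s=5, rho=rho, p_v=0.9):.1f} calls per run")
```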
By fixing the total budget and varying the pruning ratio \(\rho\) and branching factor \(b_s\), we explore different compute allocation strategies. For example, a pruning ratio of \(\rho = 0.8\) with a large branching factor allows more aggressive exploration with stricter filtering, while \(\rho = 0.2\) with moderate branching provides more conservative pruning that retains more candidates at each stage.
All main experiments are conducted on the VIS Publication Dataset (Isenberg et al., 2017), and we adopt Qwen2.5-VL-32B-Instruct as the generation backbone for all agents. To obtain more accurate and stable evaluations, we employ GPT-4.1-nano as the judging backbone (stage-local evaluators and final judger) throughout the scaling experiments. To ensure sufficient diversity in test-time scaling, we set the decoding parameters to top-\(p = 0.9\), temperature \(= 1.0\), and a maximum generation length of 1500 tokens. We fix the stage-local branching factor to \(b_s = 5\), ensuring consistent diversity across all stages. Pruning applies a single ratio \(\rho\) uniformly across stages within a run; we sweep \(\rho \in \{0.2, 0.4, 0.6, 0.8\}\), while all other settings are held fixed across runs.
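For illustration, a generation call with these decoding settings might look as follows, assuming the backbone is served behind an OpenAI-compatible endpoint (e.g., vLLM); the base URL, model name string, and prompt are placeholders.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g., vLLM) hosting the generation backbone.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[{"role": "user", "content": "Draft one insight for the attached chart."}],
    temperature=1.0,   # decoding settings used to encourage test-time diversity
    top_p=0.9,
    max_tokens=1500,
)
print(response.choices[0].message.content)
```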
We establish a reference budget by running the full pipeline with no pruning (\(\rho=0\)) for 15 independent runs. All Selective TTS conditions are matched to this budget (within a small tolerance) to isolate allocation effects from total compute.
The figure below presents our main experimental findings across three dimensions: quality-variance trade-off (left), score distribution across all generated reports (middle), and compute-quality efficiency (right). The results demonstrate that Selective TTS with moderate pruning (\(\rho = 0.6\)) achieves the best balance, reaching a peak average score of 65.86 ± 8.16 compared to the baseline's 61.64 ± 13.36, a gain of 4.2 points while reducing the standard deviation by 39%.
Fig. a: Runs (left) and total final reports (right) vs. pruning ratio \(\rho\)
Fig. b: Average score (left) and standard deviation (right) vs. pruning ratio \(\rho\)
Fig. c: Sorted score curves under different \(\rho\)
Fig. b shows that mean scores increase steadily from \(\rho = 0\) to \(\rho = 0.6\) (from 61.64 to 65.86), accompanied by a clear reduction in the standard deviation (from 13.36 to 8.16). However, at \(\rho = 0.8\) (best-of-N at each stage), the average score drops and the variance rises. This degradation may stem from over-pruning: stage-local evaluators are not always aligned with the final judge, and stronger pruning can amplify these misalignments, potentially removing a larger number of high-quality candidates. Sorted curves (Fig. c) show that the lower tail is pruned away and the distribution shifts upward, while top scores are preserved. Together, these trends indicate that Selective TTS concentrates compute on promising directions, yielding higher mean quality and lower variance under a matched budget.
As \(\rho\) increases, the total number of final reports (i.e., chart-insight pairs) decreases substantially (Fig. a, right axis), since weak branches are discarded early. Under the same overall budget, this enables more independent runs (left axis), trading within-run breadth for cross-run exploration. The total number of reports is not simply \(\text{Runs} \cdot {n'_s}^3\) with \(n'_s = \max(1, \lceil(1-\rho) \cdot b_s\rceil)\), because some branches fail quality gates or execution and are dropped; when \(n'_s = 1\) (e.g., \(\rho = 0.8\)), each surviving run deterministically yields one report. Pruning adds lightweight judging overhead but recovers budget by avoiding low-quality generations downstream, netting a better allocation under the same total calls.
The table below provides detailed statistics for each pruning ratio configuration on the VIS dataset. Selective TTS progressively reduces the number of final reports while improving average quality under a comparable LLM-call budget.
| Setting | Runs | Total Final Reports | Score (Avg. ± Std.) | Gen. Budget | Prune Budget |
|---|---|---|---|---|---|
| Baseline (\(\rho = 0\)) | 15 | 1435 | 61.64 ± 13.36 | 2567 | 0 |
| \(\rho = 0.2\) | 22 | 1128 | 63.35 ± 10.56 | 2240 | 384 |
| \(\rho = 0.4\) | 40 | 876 | 65.24 ± 8.81 | 2117 | 439 |
| \(\rho = 0.6\) | 79 | 578 | 65.86 ± 8.16 | 2032 | 522 |
| \(\rho = 0.8\) | 197 | 197 | 64.97 ± 8.48 | 1970 | 591 |
Beyond this default experimental setup, we further examine whether the scaling behavior of Selective TTS is robust across multiple dimensions. In particular, we evaluate robustness with respect to compute budget definitions (LLM-call vs. token-level accounting), generator–judge backbone choices, decoding strategies and generation diversity, dataset transfer beyond the VIS Publication dataset, and variation in insight length. Across all five dimensions, we observe consistent and stable scaling trends: moderate pruning reliably improves performance, and the qualitative behavior of Selective TTS is preserved across compute definitions, model backbones, decoding regimes, datasets, and output-length distributions.
Below we present three representative reports generated under the optimal pruning ratio \(\rho = 0.6\), ranging from low to high quality as scored by our harsh judge. Each report consists of a visualization paired with textual insights. The examples illustrate how report quality varies along the score spectrum.
Several keywords, such as 'Dimensionality reduction,' 'Time-varying data,' and 'Ensemble data,' have similar frequencies (~3), indicating that these topics are equally important but distinct areas of inquiry. This suggests a diverse set of challenges and methods being explored in handling large and dynamic datasets. Researchers looking to innovate might explore novel combinations of dimensionality reduction techniques with ensemble data analysis, particularly when dealing with temporal data. Conference organizers or professional associations in the field could plan panel discussions or workshops focusing on these interconnected areas to facilitate knowledge sharing and collaboration among experts.
⚠️ Surface-level keywords without deep analysis
There appears to be a cyclical pattern in the publication trend, with periods of growth followed by stabilization or slight decline. For instance, after the initial growth phase ending in 2004, the number of publications shows stability or minor declines in subsequent years. However, there is a resurgence of growth starting around 2015, leading to another peak in 2020. This cycle may reflect periodic cycles in funding, attention, or major discoveries within the field. Understanding these cycles could help stakeholders anticipate periods of increased activity and plan resources accordingly. For example, research institutions might increase their focus on grant proposals and collaboration efforts ahead of expected growth phases, as seen leading up to the peaks in 2004 and 2020.
📊 Summarizes trends but lacks specificity
The scatter plot reveals a positive correlation between Aminer Citation Count and CrossRef Citation Count, as indicated by the upward-sloping red regression line. This suggests that as the citation count in Aminer increases, there is a corresponding increase in the citation count reported by CrossRef. However, the scatter of data points around the regression line implies variability in this relationship, meaning some papers may have citation counts that deviate significantly from the expected trend. For example, while most points are near the line, there are outliers such as the paper with approximately 3700 Aminer citations but only ~400 CrossRef citations. This could indicate discrepancies in how these platforms count citations or differences in their coverage of certain publication types or fields. Given this variability, actors like researchers or publishers might need to consider both citation databases and cross-reference results to ensure a more accurate understanding of a paper's impact.
✓ Explains discrepancies with specific numerical evidence and actionable guidance
The low-scoring report contains only surface-level keyword observations without connecting them to meaningful insights or actionable implications. The mid-scoring report identifies publication trends and proposes explanations, but remains somewhat generic. The highest-scoring report demonstrates superior analytical depth by explaining why certain outliers show different citation counts across two citation systems, providing specific numerical evidence (e.g., "3700 Aminer citations but only ~400 CrossRef citations") and offering practical guidance for researchers.
Increasing \(\rho\) generally improves overall report quality. Even low-scoring reports at \(\rho = 0.8\) are more coherent and analytical than those at smaller \(\rho\). High-scoring reports become similarly informative while remaining diverse in perspective.
We collected pairwise quality preferences from four expert annotators across reports sampled from different pruning ratios. The figure below shows mean annotator rankings, revealing that human preferences increasingly converge for the highest-ranked reports as pruning ratio increases.
Mean annotator rankings across pruning ratios converge for high-quality reports
Annotators judged high-quality reports to be comparably strong, as reflected by low variance in preferences and largely indistinguishable ratings. This trend indicates that strong reports become difficult to distinguish as their quality improves. Overall, these results suggest that Selective TTS raises the quality floor while simultaneously reducing variability among top-ranked outputs, leading to more consistent and reliable insights.
In this work, we reframed unverifiable insight generation as a test-time scaling problem and proposed Selective TTS, a process-based, stage-wise pruning strategy guided by lightweight LLM-based evaluators. Building on an end-to-end multi-agent pipeline for chart-grounded insight generation, we introduced judger selection via human alignment and formulated compute usage in terms of LLM calls. Our experiments demonstrated that selective pruning under a fixed budget progressively reduces variance, improves average final report quality, and enables broader exploration across runs without requiring additional resources.
This study closes a gap between generic reasoning agents and data-centric decision support, showing how LLM-based judges can provide a principled mechanism for scaling unverifiable reasoning. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
@misc{gan2025scalingunverifiablerewardscase,
title={Scaling Unverifiable Rewards: A Case Study on Visual Insights},
author={Shuyu Gan and James Mooney and Pan Hao and Renxiang Wang and Mingyi Hong and Qianwen Wang and Dongyeop Kang},
year={2025},
eprint={2512.22650},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2512.22650},
}