About Me

My name is Ruixuan Tu (zh_CN: 涂 睿轩, ja_JP: トゥ・ルイシュェン). I am a fourth-year undergraduate majoring in Computer Sciences (Honors), Mathematics (Honors), Data Science, Statistics, and Japanese at the University of Wisconsin–Madison (UW–Madison). I expect to graduate in May 2025 and am actively seeking Ph.D. opportunities in natural language processing (NLP) and large language models (LLMs).

I am fortunate to be advised by and work with Prof. Forrest Sheng Bao @ Iowa State CS (also Head of ML @ Vectara Inc.), Prof. Ramya Korlakai Vinayak @ UW–Madison ECE & CS & Stat, and Prof. Junjie Hu @ UW–Madison BMI & CS. I previously worked with Prof. Jerry Zhu @ UW–Madison CS.

Research Interests

Human-aligned LLMs: Although LLMs have achieved great improvements since 2018 and gained popularity, they do not always perform as well as humans once we consider issues such as hallucination, bias, and factual incorrectness. I have been working on multiple projects to align LLMs with human expectations and behaviors.

Multilingual NLP and Computational Linguistics (Japanese NLP): With Japanese as one of my majors, I connect my NLP knowledge with my Japanese linguistics and classical Japanese coursework. I have applied multilingual transfer learning from modern Japanese to classical Japanese in WakaGPT, and applied computational linguistics tools to analyze morpheme origins in Japanese literature. I have also used clustering methods to analyze role language in Japanese media (games and anime) from a computational sociolinguistics perspective.

Papers

Peer-reviewed Papers

  1. Is Semantic Chunking Worth the Computational Cost?
    Renyi Qu, Ruixuan Tu, Forrest Sheng Bao
    Findings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies
    [arXiv] [PDF]

  Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefits over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remain unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge the previous assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.
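  For context, the fixed-size baseline the study compares against can be sketched in a few lines; `chunk_size` and `overlap` below are illustrative parameters, not the paper's exact settings.

```python
def fixed_size_chunks(tokens, chunk_size=256, overlap=0):
    """Split a token sequence into consecutive fixed-size segments.

    With overlap=0 the segments are disjoint; a positive overlap makes
    each window share its last `overlap` tokens with the next one.
    """
    step = chunk_size - overlap  # assumes overlap < chunk_size
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

doc = "one two three four five six".split()
print(fixed_size_chunks(doc, chunk_size=2))
# → [['one', 'two'], ['three', 'four'], ['five', 'six']]
```

  Semantic chunking replaces the fixed `step` with boundaries chosen by sentence-embedding similarity, which is the extra computation the paper questions.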

  2. FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
    Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad
    Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies
    [arXiv] [PDF] [GitHub Repo]

  Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models, both suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. “Challenging” here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, even the best hallucination detection models achieve accuracies of only around 50% on FaithBench, indicating much room for future improvement.

  3. DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics
    Forrest Sheng Bao*, Ruixuan Tu*, Ge Luo, Yinfei Yang, Hebi Li, Minghui Qiu, Youbiao He, and Cen Chen
    Findings of the Association for Computational Linguistics: EMNLP 2023
    (Presented the paper and poster orally, in person, as co-first author at the 4th NewSumm Workshop)
    [ACL Anthology] [PDF] [Poster] [GitHub Repo]

  Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate due to the additional information provided by human-written references, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies used by some reference-based metrics to evaluate a system summary against its corresponding reference can be effectively adapted to assess it against its source document, thereby transforming these metrics into reference-free ones. Experimental results support this hypothesis. Once repurposed to be reference-free, the zero-shot BERTScore using the pretrained DeBERTa-large-MNLI model of <0.5B parameters consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.
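  The repurposing idea can be illustrated with a stand-in metric: any scorer of the form score(candidate, reference) is simply called with the source document in place of the human reference. The bag-of-words cosine below is a toy substitute for BERTScore, purely for illustration.

```python
from collections import Counter
from math import sqrt

def cosine_bow(candidate: str, reference: str) -> float:
    """Toy reference-based metric: cosine similarity of bag-of-words vectors."""
    ca, cb = Counter(candidate.lower().split()), Counter(reference.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

source = "the cat sat on the mat and watched the birds outside"
summary = "a cat sat on a mat watching birds"

# Reference-free use: the source document takes the reference's slot.
print(round(cosine_bow(summary, source), 3))
```

  DocAsRef applies exactly this substitution to much stronger metrics such as BERTScore, keeping the comparison machinery and only changing what the summary is compared against.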

  4. Funix - The laziest way to build GUI apps in Python
    Forrest Sheng Bao, Mike Qi, Ruixuan Tu, Erana Wan
    Proceedings of the Python in Science Conference 2024
    [SciPy Proceedings] [PDF] [GitHub Repo]

  The rise of machine learning (ML) and artificial intelligence (AI), especially generative AI (GenAI), has increased the need for wrapping models or algorithms into GUI apps. For example, a large language model (LLM) can be accessed through a string-to-string GUI app with a textbox as the primary input. Most existing solutions require developers to manually create widgets and link them to the arguments/returns of a function individually. This low-level process is laborious and usually intrusive. Funix automatically selects widgets based on the types of the arguments and returns of a function according to the type-to-widget mapping defined in a theme, e.g., bool to a checkbox. Consequently, an existing Python function can be turned into a GUI app without any code changes. As a transcompiler, Funix allows type-to-widget mappings to be defined between any Python type and any React component and its props, liberating Python developers into the frontend world without needing to know JavaScript/TypeScript. Funix further leverages features in Python and its ecosystem for building apps in a more Pythonic, intuitive, and effortless manner. With Funix, a developer can make it (a functional app) before they (competitors) fake it (in Figma or on a napkin).

Keywords: type hints, docstrings, transcompiler, frontend development
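  As a sketch of the idea (not Funix's actual API), the type-to-widget mapping can be driven by introspecting a function's type hints; the theme dictionary and function names below are hypothetical.

```python
import inspect

# Hypothetical "theme" mapping Python types to widget names, in the
# spirit of Funix's type-to-widget mapping. Illustrative only.
THEME = {bool: "checkbox", int: "slider", str: "textbox", float: "number"}

def widgets_for(func):
    """Pick a widget for each argument based on its type hint."""
    sig = inspect.signature(func)
    return {name: THEME.get(param.annotation, "textbox")
            for name, param in sig.parameters.items()}

def greet(name: str, shout: bool) -> str:
    msg = f"Hello, {name}!"
    return msg.upper() if shout else msg

print(widgets_for(greet))  # → {'name': 'textbox', 'shout': 'checkbox'}
```

  Because the mapping is data, swapping the theme changes every widget choice at once without touching the wrapped function, which is the design point the abstract emphasizes.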

  5. A Review in the Core Technologies of 5G: Device-to-Device Communication, Multi-Access Edge Computing and Network Function Virtualization
    Ruixuan Tu*, Ruxun Xiang*, Yang Xu, Yihan Mei
    International Journal of Communications, Network and System Sciences, 2019
    [SCIRP] [PDF]

  5G is a new generation of mobile networking that aims to achieve unparalleled speed and performance. To accomplish this, three technologies, Device-to-Device communication (D2D), multi-access edge computing (MEC), and network function virtualization (NFV) with ClickOS, have been a significant part of 5G, and this paper mainly discusses them. D2D enables direct communication between devices without relaying through a base station. In 5G, a two-tier cellular network composed of a traditional cellular system and D2D is an efficient way to realize high-speed communication. MEC offloads work from end devices and cloud platforms to widespread nodes, and connects the nodes together with outside devices and third-party providers, in order to diminish the overloading effect on any single device caused by numerous applications and to improve users' quality of experience (QoE). An NFV approach is also discussed as a way to fulfill 5G requirements: an optimized virtual machine for middleboxes named ClickOS is introduced and evaluated in several aspects. Several middleboxes have been implemented in ClickOS and shown to deliver outstanding performance.

Preprints

None at the moment.

Course Papers

  1. WakaGPT: Classical Japanese Poem Generator
    Ruixuan Tu
    Full-mark final paper for STAT 453 (Deep Learning) @ UW–Madison, Spring 2024
    [PDF] [Slide]

  Waka is a traditional form of Japanese poetry that usually follows a fixed mora-sequence format. However, generating waka is challenging for general-purpose LLMs like GPT-4 due to the scarcity of data in classical Japanese and in this kind of poetry, as well as the strict format restrictions. In this paper, we present WakaGPT, a waka composer fine-tuned from Japanese GPT-2 and its base models. Through self-supervised and semi-supervised training, WakaGPT generates waka poems with correct grammar and format.

  2. Analysis of Post-Meiji Word Origins in Japanese Literature: An Approach in Computational Linguistics
    Ruixuan Tu
    A-mark final paper for ASIAN 434 (Japanese Linguistics) @ UW–Madison, Fall 2023
    [PDF] [Slide]

  We analyzed the distribution of morpheme origins in the Aozora Bunko dataset over all morphemes, parts of speech, and origins. For the analysis, we used the morphological analysis tools MeCab and Juman++ from Kyoto University and, based on UniDic data, fine-tuned DeBERTa-v2-base-Japanese to classify morpheme origins into three categories: native, Sino-Japanese (SJ), and mixed. Our hypothesis was that, since the Japanese government advocated the use of SJ and native words before and during WWII while Western culture became more popular afterward, word origins would shift accordingly; however, the analysis instead shows a continued preference for native words, contradicting the hypothesis.

  3. Cluster Analysis of Role Languages in Visual Novel Game AIR
    Ruixuan Tu
    A-mark final paper for ASIAN 358 (Japanese Sociolinguistics) @ UW–Madison, Fall 2024
    [PDF] [Slide]

  Through our analysis of the visual novel game AIR, most keywords (特徴語, "characteristic words") extracted by our method could be recognized as yakuwarigo (role language) representing characteristics of specific individuals or groups, but the reverse does not necessarily hold: not all yakuwarigo appear among the extracted keywords. Our method reveals non-female language, casual female language, formal and polite female language, and dialectal language as clusters. We also found that different groups of script authors might affect the extracted keywords.

Method: We apply agglomerative hierarchical clustering (Ward's method with Euclidean distance) to word-frequency vectors for every speaker, and then extract significant keywords with a coefficient of specialization (CoS) > 2.
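The clustering step can be sketched with SciPy; the per-speaker frequency vectors below are toy data standing in for the real word counts.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy word-frequency vectors, one row per speaker (illustrative data).
freqs = np.array([
    [5.0, 0.0, 1.0],   # speaker A
    [4.0, 1.0, 1.0],   # speaker B
    [0.0, 5.0, 4.0],   # speaker C
    [1.0, 4.0, 5.0],   # speaker D
])

# Ward's method with Euclidean distance, as in the paper's setup.
Z = linkage(freqs, method="ward", metric="euclidean")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # speakers A/B share one label, C/D the other
```

Keyword extraction then compares each cluster's word frequencies against the corpus-wide baseline, keeping words whose specialization ratio exceeds the CoS > 2 threshold.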

  4. Optimizing Bike-Sharing Systems: A Machine Learning Approach to Predict Station Imbalances
    Ruixuan Tu, Larissa Xia, Steven Haworth, Jackson Wegner
    1st Most Creative or Interesting Project and 2nd Best Visualizations for STAT 451 (Machine Learning) @ UW–Madison, Summer 2024
    [PDF] [Slide]

  This study analyzes Divvy bike station and trip data together with American Community Survey data to predict bike station flow imbalances (overflow/underflow). The key questions are: How can demographic data and machine learning predict bike availability? Is the status of existing stations a reliable indicator for nearby stations? Using logistic regression, decision trees, and SVM for demographic data, and kNN for geographic data, combined with recursive feature elimination and grid search with cross-validation, SVM was the most effective. The status of existing Divvy stations reliably predicts the status of nearby stations.
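  A minimal sketch of that model-selection pipeline with scikit-learn, using synthetic data in place of the Divvy/ACS features; feature counts and the hyperparameter grid are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the demographic feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scale, select 5 features via recursive feature elimination,
# then classify with an SVM.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=5)),
    ("svm", SVC()),
])

# Grid search with cross-validation over the SVM's regularization strength.
search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

  Putting RFE inside the pipeline keeps feature selection within each cross-validation fold, avoiding leakage into the held-out split.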

Work Experience

Textea Inc
Software Development Engineer Intern (May 2022 — September 2022)

UW–Madison

Projects

KDE Connect (Apple Continuity-like Experience) (November 2018 — Present)

Memberships

Awards
