Exploring Open Source AI for LLM Code Translation Systems
Subject
AI Code Translation
AI Code Refactoring
Abstract
AI code generation has unlocked a new wave of possibilities in recent years. Coding assistants have evolved from simple autocompleters into full-fledged, collaborative agents capable of reasoning, debugging, refactoring and other advanced development tasks. These breakthroughs have been driven by innovations in transformer architectures, multi-agent systems and specialized code Large Language Models (LLMs). However, significant challenges remain in several areas, including code understanding and code translation for legacy and niche languages. In particular, AI code translation for legacy languages is less mature than other tasks, as it is more likely to be hindered by poor documentation, opaque code dependencies, tangled "spaghetti" code, and a shortage of training datasets and benchmarks stemming from the scarcity of publicly available repositories. This creates a significant burden for enterprises, leading to increased security risks, higher operational costs, reduced innovation and talent retention challenges. This project explores a practical approach to improving the efficiency of legacy code refactoring with open source LLMs. For demonstration purposes, the prototype is an LLM-backed coding agent that automates the migration of JavaServer Pages (JSP) to a more modern NodeJS and React architecture. It leverages methodologies such as continuous finetuning, synthetic data generation and spec-driven development, and presents a cost-effective approach for building the agentic system. Using predefined metrics as guidance, it produces a finetuned model that delivers a measurable improvement in legacy code refactoring tasks over the baseline model. In particular, using DoRA (weight-decomposed low-rank adaptation) finetuning, the prototype yielded an 18% median improvement in code-relevant metrics over the baseline model, and outperformed LoRA finetuning by a factor of 2-2.5X in overall median performance.
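A minimal sketch of the LoRA vs. DoRA contrast described above, using the Hugging Face PEFT library; the base model name and hyperparameters are illustrative assumptions, not the project's actual configuration. In PEFT, DoRA reuses the LoRA adapter configuration and is enabled by a single flag:

# Illustrative sketch, assuming Hugging Face transformers + peft (>= 0.10).
# The model name and hyperparameters below are hypothetical placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical open-source code model used as the finetuning baseline.
base = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b")

# LoRA: learns low-rank update matrices for the targeted projection layers,
# leaving the pretrained weights frozen.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# DoRA: same low-rank adapters, but each pretrained weight is decomposed into
# a magnitude component and a direction component, with the low-rank update
# applied to the direction.
dora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    use_dora=True,  # single flag switches the adapter from LoRA to DoRA
)

model = get_peft_model(base, dora_cfg)
model.print_trainable_parameters()

Because DoRA decomposes each weight into magnitude and direction and adapts them separately, it tends to track full-finetuning behavior more closely than plain LoRA at a similar trainable-parameter budget, which may help explain the performance gap reported in the abstract.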