I need to develop an application where I give an LLM a piece of code, such as a function, and the LLM finds the closest match that does the same thing, searching in one or more source files. The match may be worded differently. If the search finds identical code, it should consider that the match. I assume the LLM needed would be the same as a good coding LLM.

Is this feasible at all? How hard would this be to develop? Thanks in advance.

I'm going to start with a joke:
Q. How do I get to Carnegie Hall?
A. Practice, practice, practice.

You won't be doing what you ask without building up your skillset and learning how to create "things".
Now, if you need this today, you outsource the work. Folks underestimate projects all the time, so I'll share that the smallest commercial project I know of ran about 100K USD, and a vehicle tracking system I know of went over a million.

So let's hear what you've done so far to create applications.

I'll echo rproffitt in that anything is feasible with the right tools, skillset, money, and time. However, I think the question you really mean to ask is: how feasible is this for you?

I will start off by saying that, as long as you're only dealing with fairly simple code, I think the existing LLMs out there are capable of looking at two pieces of code and determining whether they do something similar. So the really hard work is done.


Now, how good are you at prompting LLMs? What would you even prompt the LLM to look for to consider two matches similar? I asked ChatGPT:

What is pseudocode for an algorithm to rank 1000 articles in terms of similarity to a particular article?

Here’s a pseudocode for an algorithm that ranks 1000 articles in terms of similarity to a particular article. We'll assume the algorithm uses cosine similarity or another similarity measure to compare the articles.

1. Initialize a list of 1000 articles
   articles = [article1, article2, ..., article1000]

2. Define the target article
   target_article = articleX  // The article to compare against

3. Initialize an empty list to store similarity scores
   similarity_scores = []

4. For each article in the articles list:
   a. Compute the similarity between the target_article and the current article
      similarity = compute_similarity(target_article, current_article)

   b. Append the similarity score to the similarity_scores list
      similarity_scores.append((current_article, similarity))

5. Sort the similarity_scores list in descending order by similarity score
   sorted_similarity_scores = sort(similarity_scores, key=lambda x: x[1], reverse=True)

6. Rank the articles based on their similarity to the target article
   ranked_articles = [article for article, score in sorted_similarity_scores]

7. Output the ranked list of articles
   return ranked_articles
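For what it's worth, that pseudocode translates almost line-for-line into runnable Python. Here's a minimal sketch, assuming compute_similarity() is cosine similarity over simple word counts (that choice is mine, not ChatGPT's; any similarity measure slots in the same way):

```python
import math
from collections import Counter

def compute_similarity(a, b):
    # Cosine similarity over bag-of-words counts: a simple stand-in
    # for the compute_similarity() step in the pseudocode above.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_articles(target_article, articles):
    # Steps 3-7: score every article against the target, then sort
    # descending by score and return just the articles.
    similarity_scores = [(a, compute_similarity(target_article, a))
                         for a in articles]
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    return [article for article, score in similarity_scores]
```

An identical article scores 1.0, so the "exact match wins" requirement from the original question falls out for free.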

I'm not very good at LLM prompts, so the whole compute_similarity() step is pretty meaningless as written. You would need to settle on some concrete algorithm to tell the LLM about, such that it can produce objective similarity scores between two pieces of code across all of the files.
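As one concrete (non-LLM) example of an objective score for code specifically: tokenize both snippets, mask out identifier names, and compare the resulting token sets. This is a sketch assuming the code being compared is Python, using only the standard library; the Jaccard measure and the "ID" placeholder are my choices, not anything from the thread:

```python
import io
import keyword
import tokenize

def normalized_tokens(code):
    # Tokenize Python source, dropping comments and layout tokens and
    # masking identifier names, so two functions that differ only in
    # naming or commenting produce the same token stream.
    toks = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            toks.append("ID")  # mask variable/function names
        else:
            toks.append(tok.string)
    return toks

def code_similarity(a, b):
    # Jaccard overlap of the normalized token sets: 1.0 means the two
    # snippets use exactly the same vocabulary of operators/keywords.
    sa, sb = set(normalized_tokens(a)), set(normalized_tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Under this score, `def add(a, b): return a + b` and `def total(x, y): return x + y` come out identical, which is exactly the "worded differently" case the original question cares about. A real tool would likely combine something like this with LLM or embedding-based judgment.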

What's the use-case?

"Black Duck" is an existing application for scanning your in-house proprietary software to make sure it doesn't get tainted with open source software with vague/incompatible licence terms.
