Brute force no longer sufficient to extract a beetle project from compilation output

To import an Arduino sketch into Embeetle, I roughly use the following approach:

  1. Compile the sketch file with embeetle/sys/<os>/bin/arduino-cli.exe at full verbosity.

  2. Observe the output to extract:

    • All c- and s-files that get compiled.
    • All h-dirs on the compiler’s search path.

  3. Copy all c- and s-files into the Embeetle project. Also copy all the h-files found in the h-dirs, even though many of them are actually unused.
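
As an illustration of step 2, here is a minimal Python sketch that pulls the source files and the -I directories out of one compiler command line. It assumes the verbose output contains GCC-style command lines with POSIX-style quoting; `parse_compile_command` and `SOURCE_SUFFIXES` are hypothetical names, not part of the Embeetle importer.

```python
import shlex
from pathlib import PurePosixPath

# Assumption: these suffixes cover the c- and s-files of interest.
SOURCE_SUFFIXES = {".c", ".cpp", ".s"}

def parse_compile_command(line):
    """Return (source_files, include_dirs) found in one compiler command line."""
    tokens = shlex.split(line)
    sources, include_dirs = [], []
    it = iter(tokens[1:])  # tokens[0] is the compiler executable itself
    for tok in it:
        if tok == "-I":
            # Form "-I dir": the directory is the next token.
            nxt = next(it, None)
            if nxt is not None:
                include_dirs.append(nxt)
        elif tok.startswith("-I"):
            # Form "-Idir": the directory is glued to the flag.
            include_dirs.append(tok[2:])
        elif tok == "-o":
            next(it, None)  # skip the output file name
        elif not tok.startswith("-"):
            if PurePosixPath(tok).suffix.lower() in SOURCE_SUFFIXES:
                sources.append(tok)
    return sources, include_dirs
```

Running this over every compiler line of the verbose output and merging the results gives the two lists that step 3 needs.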

This approach was good enough for regular Arduino projects. Below I’ll describe some refinements I made and where I get stuck.

1. First refinement: recursive h-file search

By copying all the h-files from each h-dir on the compiler’s searchpath, I usually had a complete project in the end. However, some h-files are a bit nasty, having include statements like this:

#include "../foo/bar.h"

There is a good chance that bar.h doesn’t appear in any of the h-dirs on the compiler’s search path. This results in an incomplete project! That’s when I implemented the first refinement to my importer code: I open each and every h-file to look for #include statements. Then I test whether each include statement points to an h-file that has already been copied. If not, I copy the newly discovered h-file too and recurse into it (doing the same thing).

This refinement proved to be successful for the Arduino Uno and Nano projects I’ve tried to import.
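
The lookup for a single include statement could be sketched as follows. This is not the importer's actual code: `find_includes` and `resolve_include` are hypothetical helpers, and the regex is a simplification that does not skip includes inside block comments. The sketch follows the GCC convention that a quoted include is first looked up relative to the directory of the including file, which is why `"../foo/bar.h"` can resolve to a file outside every h-dir.

```python
import re
from pathlib import Path

# Matches '#include "x.h"' and '#include <x.h>' at the start of a line.
INCLUDE_RE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*(["<])([^">]+)[">]', re.MULTILINE
)

def find_includes(text):
    """Return a list of (name, is_quoted) for each #include in the text."""
    return [(m.group(2), m.group(1) == '"') for m in INCLUDE_RE.finditer(text)]

def resolve_include(name, is_quoted, including_file, include_dirs):
    """Return the first existing file the include could refer to, or None."""
    candidates = []
    if is_quoted:
        # Quoted form: directory of the including file comes first.
        candidates.append(Path(including_file).parent / name)
    candidates.extend(Path(d) / name for d in include_dirs)
    for candidate in candidates:
        if candidate.is_file():
            return candidate.resolve()
    return None
```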

2. Second refinement: start the recursion from the c-files

As soon as I tried to import an ESP32 project with my approach, it failed miserably. It turns out that I made a crucial mistake in my recursive h-file search. It’s not so much the search itself, but the set of h-files I start from. You see, I start from all the h-files that can be found in the h-dirs. (h-dirs = compiler searchpaths defined with the -I flag in the verbose output)

Obviously, that means I’m considering way too many. The recursive search blows up in my face. I end up with a 250 MB project and still have enormous amounts of unresolved include statements (include statements for which nothing was found, and which are therefore marked as errors in Embeetle).

So here comes the second refinement. I start from the c- and s-files to begin the recursive search for h-files. Then I only copy those h-files that result from this recursive search.
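
A rough sketch of this second refinement, assuming the c-/s-files and h-dirs have already been extracted from the verbose output. `collect_headers` is a hypothetical name; the worklist replaces explicit recursion, and the `visited` set keeps include cycles from looping forever.

```python
import re
from pathlib import Path

INCLUDE_RE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*(["<])([^">]+)[">]', re.MULTILINE
)

def collect_headers(source_files, include_dirs):
    """Return (headers, unresolved) reachable from the given c-/s-files."""
    include_dirs = [Path(d) for d in include_dirs]
    pending = [Path(f).resolve() for f in source_files]
    visited = set(pending)
    headers, unresolved = set(), set()
    while pending:
        current = pending.pop()
        text = current.read_text(errors="replace")
        for m in INCLUDE_RE.finditer(text):
            name = m.group(2)
            search = list(include_dirs)
            if m.group(1) == '"':
                # Quoted include: look next to the including file first.
                search.insert(0, current.parent)
            found = next((d / name for d in search if (d / name).is_file()), None)
            if found is None:
                unresolved.add(name)  # nothing found for this include
                continue
            found = found.resolve()
            headers.add(found)
            if found not in visited:
                visited.add(found)
                pending.append(found)
    return headers, unresolved
```

Only the files in `headers` would be copied into the project. An h-file that sits in an h-dir but is never reached from a c- or s-file is left out.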

3. The problem

Although the second refinement is certainly an improvement, I still end up with way too many h-files. Many of them are clearly not intended for the project, e.g. they target other families of microcontrollers. That’s because my recursive search is a bit dumb. I use some regex magic in each source file to figure out the h-files it includes. However, many include statements are encapsulated in #ifdef blocks. The regex cannot tell which branch of such a block is active for the selected board, so it follows the includes in every branch.
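
To make the gap concrete, here is a deliberately simplified Python sketch of a scan that honours #ifdef, #ifndef, #else and #endif. It is not Embeetle's source analyzer. `active_includes` is a hypothetical name, and the `defined` set is assumed to come from the -D flags in the verbose output. It does not evaluate #if or #elif expressions and does not track #define or #undef inside the files, so it still over-approximates.

```python
import re

DIRECTIVE_RE = re.compile(r'^\s*#\s*(\w+)\s*(.*)$')
INCLUDE_ARG_RE = re.compile(r'["<]([^">]+)[">]')

def active_includes(text, defined):
    """List the includes that are not disabled by #ifdef/#ifndef blocks."""
    # One entry per open conditional: True (branch active), False (branch
    # inactive) or None (unknown, treated as active).
    stack = []
    includes = []
    for line in text.splitlines():
        m = DIRECTIVE_RE.match(line)
        if not m:
            continue
        keyword, rest = m.group(1), m.group(2)
        if keyword in ("ifdef", "ifndef"):
            words = rest.split()
            macro = words[0] if words else ""
            stack.append((macro in defined) == (keyword == "ifdef"))
        elif keyword == "if":
            stack.append(None)   # expression is not evaluated
        elif keyword == "elif" and stack:
            stack[-1] = None     # expression is not evaluated
        elif keyword == "else" and stack:
            if stack[-1] is not None:
                stack[-1] = not stack[-1]
        elif keyword == "endif" and stack:
            stack.pop()
        elif keyword == "include":
            if all(state is not False for state in stack):
                arg = INCLUDE_ARG_RE.match(rest)
                if arg:
                    includes.append(arg.group(1))
    return includes
```

Even this small step drops the branches guarded by a plain #ifdef on a board macro. Anything guarded by #if expressions or by macros defined in other headers needs a real preprocessor-aware analysis.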

In other words, @ygramoel I need your source analyzer to rescue me.

Right, I will look at it.