
ESP-IDF project standards and conventions

I've built a bunch of ESP32 projects using Espressif's ESP-IDF framework, and every time I start a new one, I second-guess myself on basic structure: where does the config go? How do I version the firmware? What partition layout should I use? This post is documentation-to-self, but hopefully it saves someone else from the iterative debugging I've done.

Project structure and build configuration

The ESP-IDF follows a pretty strict layout. At the top level you'll have:

  • main/: your application code (main.c is the entry point)
  • CMakeLists.txt: project-level build configuration
  • sdkconfig.defaults: sensible defaults for menuconfig options
  • partitions.csv: partition table (more on that below)
  • .clangd: language server config to fix false positives on macOS

The main/CMakeLists.txt is minimal:

idf_component_register(SRCS "main.c" 
                       INCLUDE_DIRS ".")

Keep your code in main/ until you have more than a few files, then break it out into components/mycomponent/.

OTA update patterns: ota_0 and ota_1

Never use the factory+ota_0 partition layout. I learned this the hard way when I bricked a device. Use ota_0 and ota_1 instead, so you can always fall back to the previous firmware if an update goes wrong.

Your partitions.csv should look like:

nvs,      data, nvs,     0x9000, 0x6000,
otadata,  data, ota,     0xf000, 0x2000,
ota_0,    app,  ota_0,   0x20000, 0x1C0000,
ota_1,    app,  ota_1,   0x1E0000, 0x1C0000,
spiffs,   data, spiffs,  0x3A0000, 0x60000,

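The sizes here depend on your flash size (I'm assuming 4MB). The key is that otadata tracks which OTA partition is active, and the bootloader boots whichever slot it points to. If the newly selected image fails validation, the bootloader falls back to the other slot. To also roll back an image that boots but misbehaves, enable CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE and have the app call esp_ota_mark_app_valid_cancel_rollback() once it knows it's healthy.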

I allocate 1.75 MB (0x1C0000) per OTA partition because my firmware is usually 600-800 KB, and that leaves plenty of headroom for growth without requiring a reflash to resize.
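Since I keep second-guessing the arithmetic, here's a small sanity check for the table above. It's a sketch, not part of ESP-IDF: the function names are my own, and it assumes every offset and size is written as explicit hex (no "1M" shorthand, no blank offsets) and that the flash is 4 MB.

```python
import csv
import io

FLASH_SIZE = 4 * 1024 * 1024  # assumption: 4 MB flash, as in the table above

PARTITIONS_CSV = """\
nvs,      data, nvs,     0x9000, 0x6000,
otadata,  data, ota,     0xf000, 0x2000,
ota_0,    app,  ota_0,   0x20000, 0x1C0000,
ota_1,    app,  ota_1,   0x1E0000, 0x1C0000,
spiffs,   data, spiffs,  0x3A0000, 0x60000,
"""


def parse_partitions(text):
    """Return (name, type, offset, size) tuples sorted by offset."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        fields = [f.strip() for f in fields]
        # Skip blank lines and comment lines.
        if not fields or not fields[0] or fields[0].startswith("#"):
            continue
        name, ptype, _subtype, offset, size = fields[:5]
        rows.append((name, ptype, int(offset, 16), int(size, 16)))
    return sorted(rows, key=lambda row: row[2])


def check_layout(rows, flash_size=FLASH_SIZE):
    """Return a list of human-readable problems (empty if the layout is sane)."""
    problems = []
    prev_name, prev_end = None, 0
    for name, ptype, offset, size in rows:
        if offset < prev_end:
            problems.append(f"{name} overlaps {prev_name}")
        # App partitions have to start on a 64 KB boundary.
        if ptype == "app" and offset % 0x10000:
            problems.append(f"{name} is not 64 KB aligned")
        prev_name, prev_end = name, offset + size
    if prev_end > flash_size:
        problems.append("table extends past the end of flash")
    return problems
```

Calling check_layout(parse_partitions(PARTITIONS_CSV)) returns an empty list for this table: the last partition ends exactly at 0x400000.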

Firmware versioning

Every binary that gets flashed or pushed to S3 needs a version bump. Use a simple semver string:

// main/main.c
#define FIRMWARE_VERSION "1.2.3"
#define FIRMWARE_VERSION_MAJOR 1
#define FIRMWARE_VERSION_MINOR 2
#define FIRMWARE_VERSION_PATCH 3

Then in your OTA or setup code, log it:

ESP_LOGI(TAG, "Firmware version: %s", FIRMWARE_VERSION);

This sounds obvious, but I've spent frustrating hours trying to reproduce a bug and realizing the device is running an older build than I thought. A simple version string at startup saves that pain.

If you're pushing to S3 for OTA (see below), include the version in the build artifact name: myproject-v1.2.3.bin.
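To keep the artifact name and the defines from drifting apart, something like the following could run in a build wrapper. It's a rough sketch under my own assumptions: read_version and artifact_name are hypothetical helper names, and it expects the four #define lines exactly as shown above.

```python
import re

SAMPLE_SOURCE = '''
#define FIRMWARE_VERSION "1.2.3"
#define FIRMWARE_VERSION_MAJOR 1
#define FIRMWARE_VERSION_MINOR 2
#define FIRMWARE_VERSION_PATCH 3
'''


def read_version(source):
    """Extract the semver string and check it agrees with the numeric defines."""
    match = re.search(r'#define\s+FIRMWARE_VERSION\s+"(\d+)\.(\d+)\.(\d+)"', source)
    if match is None:
        raise ValueError("FIRMWARE_VERSION string not found")
    parts = tuple(int(p) for p in match.groups())
    for name, expected in zip(("MAJOR", "MINOR", "PATCH"), parts):
        numeric = re.search(rf"#define\s+FIRMWARE_VERSION_{name}\s+(\d+)", source)
        if numeric is None or int(numeric.group(1)) != expected:
            raise ValueError(f"FIRMWARE_VERSION_{name} does not match the string")
    return ".".join(str(p) for p in parts)


def artifact_name(project, version):
    """Build the versioned binary name, e.g. myproject-v1.2.3.bin."""
    return f"{project}-v{version}.bin"
```

In practice you'd pass the contents of main/main.c instead of SAMPLE_SOURCE, and fail the build if read_version raises.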

WiFi authentication: stick with WPA2_PSK

Always use WIFI_AUTH_WPA2_PSK, not WIFI_AUTH_WPA2_WPA3_PSK. The mixed mode causes mysterious auth failures on home routers, especially older ones or devices that don't advertise both protocols equally.

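wifi_config_t wifi_config = {
    .sta = {
        // ssid and password are fixed-size uint8_t arrays, so initialize them
        // straight from the string literals; a (uint8_t *) cast won't compile.
        .ssid = CONFIG_WIFI_SSID,
        .password = CONFIG_WIFI_PASSWORD,
        .threshold.authmode = WIFI_AUTH_WPA2_PSK,
    },
};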

Even if your home network supports WPA3, the mixed mode is a footgun. If you're deploying to multiple locations or devices with varying WiFi hardware, stick with WPA2_PSK. It's universally supported and won't surprise you.

Partition sizing and resizing pain

Leave expansion room in your OTA partitions. I allocate 1.75 MB when my current firmware is 600 KB because resizing partitions later requires a USB cable, a quick esptool.py flash_id to confirm the flash size, and a full reflash. It's doable but annoying.

If your project grows and you need more space, you have two options:

  1. Resize partitions (USB reflash required)
  2. Move to a larger flash chip (harder)

Allocate generously at the start. A few MB of unused space costs nothing.

Clangd/LSP false positives on macOS

If you're using clangd or VSCode's C/C++ extension on macOS, you'll see tons of red squiggles for ESP-IDF types and macros that actually compile fine. This is because clangd doesn't know about the build configuration.

Create a .clangd file at your project root:

CompileFlags:
  CompilationDatabase: build

Build once (so compile_commands.json exists), then clangd will use it and the false positives disappear. This is especially important for understanding macro expansions and avoiding frustration while editing.

S3 OTA uploads: the cache-control trap

If you're pushing built binaries to S3 for OTA updates, always include --cache-control "max-age=30" when uploading:

AWS_PROFILE=yourprofile aws s3 cp build/myproject.bin s3://your-bucket/firmware/ \
  --cache-control "max-age=30"

Without this, CloudFront (or any other CDN in front of S3) will serve stale versions for up to 1 hour, and your devices will pull the old firmware. I've spent hours debugging "why is the new firmware not deployed?" only to realize the CDN was serving the previous build.

Set max-age to something short (30 seconds works) so you get fresh binaries on each request, not "fresh every hour."
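If the upload lives in a script rather than shell history, it's harder to forget the flag. Here's a minimal sketch that wraps the same aws s3 cp call; upload_command and upload are hypothetical names of mine, and the bucket and key are the placeholders from the command above.

```python
import subprocess


def upload_command(binary, destination, max_age=30):
    """Build the aws CLI argument list, always including a short cache lifetime."""
    return [
        "aws", "s3", "cp", binary, destination,
        "--cache-control", f"max-age={max_age}",
    ]


def upload(binary, destination, max_age=30):
    """Run the upload; raises CalledProcessError if the aws CLI fails."""
    subprocess.run(upload_command(binary, destination, max_age), check=True)
```

For example, upload("build/myproject.bin", "s3://your-bucket/firmware/myproject-v1.2.3.bin") would push the versioned binary with a 30-second max-age. AWS_PROFILE still comes from the environment, as in the shell version.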

Battery-powered sensors: ESP-NOW vs WiFi

If you're building a battery-powered sensor, ESP-NOW uses far less power than WiFi, but requires a receiver that's always on (or close to it). Tradeoffs:

WiFi:

  • Higher latency and power draw (radio startup is expensive)
  • Works everywhere without custom infrastructure
  • Built-in over-the-air update support

ESP-NOW:

  • One-way or paired communication only; you need a gateway/receiver
  • Extremely low power if you can wake, transmit, and sleep in milliseconds
  • Mesh capability with enough peers

For a humidity sensor that transmits once per hour, ESP-NOW + a gateway is overkill. For a door sensor that needs to wake and transmit immediately, ESP-NOW is worth the complexity. For anything in between, build WiFi first and optimize to ESP-NOW later if battery life becomes critical.


That's the foundation. Every new project I start, I copy the partition table, firmware version pattern, and build wrapper from the last one, and I've saved myself countless hours of "wait, why is this partition 512K?" debugging. Hopefully this saves you from the same mistakes.

Fixing Elasticsearch/Logstash/ELK's DATESTAMP grok pattern

Elasticsearch and friends, including Logstash and Pipeline Processors, love to use grok patterns. These are basically named regex patterns, allowing the complexity to be hidden behind easier-to-read labels (though they do require referring to the source). Regexes are great, with the "now you have two problems" caveat of any sufficiently advanced technology. (It could be worse: you could have 100 problems.)

What's the problem? Timestamp support. Such a trivial issue has been a problem for a long time. I'll show a fix (rather, a workaround) below.

The problem: year-first timestamps

Many datestamps in logs are in a year-first format (e.g., 2020-01-01). That makes sense, as many operating systems and languages default to ISO 8601 for a human-readable datetime format. For instance, here's a recent example from my system's dpkg.log:

2020-05-21 06:01:01 upgrade tzdata:all 2019c-3ubuntu1 2020a-0ubuntu0.20.04

Or from a Mac's log:

2020-05-24 20:04:25-07 ted-macbook-pro softwareupdated[753]: Removing client SUUpdateServiceClient pid=5347, uid=0, installAuth=NO rights=(), transactions=0 (/usr/sbin/softwareupdate)

Or from Octoprint's python-based logs:

2020-05-25 03:26:36,643 - octoprint.server.heartbeat - INFO - Server heartbeat <3

There are other common variants, like the 'T' delimiting the date and time sections, time zones, or other delimiters. I will also be addressing this format, found in the grok comment:

2020/05/25-16:02:11.5533

I'm less concerned about time zones, since real computers run on UTC.

What's wrong with these? None of them are matched by vanilla grok's DATESTAMP in any version.

Digging into the problem

Let's look at logstash in 2015. Or logstash in 2020. Or elasticsearch in 2020.

They define US-format and EU-format dates:

DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}

And in a comment, they suggest that the datestamp will be accepted with slashes:

# datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)

Which leads to an implication that the DATESTAMP would support it:

DATE %{DATE_US}|%{DATE_EU}
DATESTAMP %{DATE}[- ]%{TIME}

But... look back at the US/EU formats. No year-first format. Sometimes you'll see a weird match, like "20 April 2001", but it's just seeing "2020/04/01", slicing off the first two digits, and parsing the remaining "20/04/01" as a day-first string. Weird, huh? This explains some of the weird indexes you might find in an ELK stack, where there's something like logstash-2001.04.20 and it's almost 20 years later.

Anyhow, on to...

The easy fix

If you have the luxury of redefining DATE, you can prepend a sane format to it:

DATE_YEARFIRST %{YEAR}[\/\-\s]%{MONTHNUM}[\/\-\s]%{MONTHDAY}
DATE %{DATE_YEARFIRST}|%{DATE_US}|%{DATE_EU}

Why prepend? That keeps it from accidentally matching 2020 as above.
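To see the mis-match and the fix without spinning up Logstash, the patterns can be approximated in Python. This is only a sketch: grok runs on a different regex engine, and I've hand-expanded MONTHDAY, MONTHNUM, and YEAR from the stock pattern file (YEAR uses an atomic group there, which I've replaced with a plain group).

```python
import re

# Hand-expanded approximations of the stock grok sub-patterns.
MONTHDAY = r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])"
MONTHNUM = r"(?:0?[1-9]|1[0-2])"
YEAR = r"(?:\d\d){1,2}"

DATE_US = rf"{MONTHNUM}[/-]{MONTHDAY}[/-]{YEAR}"
DATE_EU = rf"{MONTHDAY}[./-]{MONTHNUM}[./-]{YEAR}"
DATE_YEARFIRST = rf"{YEAR}[/\-\s]{MONTHNUM}[/\-\s]{MONTHDAY}"

# Stock definition versus the redefined one from above.
DATE_STOCK = rf"(?:{DATE_US}|{DATE_EU})"
DATE_FIXED = rf"(?:{DATE_YEARFIRST}|{DATE_US}|{DATE_EU})"


def first_match(pattern, text):
    """Return the first substring the pattern matches, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


print(first_match(DATE_STOCK, "2020/04/01"))  # 20/04/01 (read as 20 April 2001)
print(first_match(DATE_FIXED, "2020/04/01"))  # 2020/04/01
```

The stock pattern can't match at the start of the string, so the search slides forward two characters and DATE_EU happily accepts what's left.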

The hard fix

But if you don't have the luxury of updating your grok patterns, you'll have to whip up a custom one:

(?<my_datetime_field>%{YEAR}/%{MONTHNUM}/%{MONTHDAY}[T\s\-]+%{TIME})

You'll notice I've simplified the delimiters from above. It's just less readable to accept them all, but you might need to do so:

(?<my_datetime_field>%{YEAR}[\/\-\s]%{MONTHNUM}[\/\-\s]%{MONTHDAY}[T\s\-]+%{TIME})
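Here's a quick way to try that last pattern against the sample log lines from earlier. Again, this is a Python approximation and not grok itself: the sub-patterns are hand-expanded, TIME is simplified (I've dropped its leading lookaround), and the named group uses Python's (?P<name>...) spelling.

```python
import re

# Hand-expanded approximations of the stock grok sub-patterns.
MONTHDAY = r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])"
MONTHNUM = r"(?:0?[1-9]|1[0-2])"
YEAR = r"(?:\d\d){1,2}"
HOUR = r"(?:2[0123]|[01]?[0-9])"
MINUTE = r"(?:[0-5][0-9])"
SECOND = r"(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)"
TIME = rf"{HOUR}:{MINUTE}:{SECOND}(?![0-9])"

CUSTOM = re.compile(
    rf"(?P<my_datetime_field>{YEAR}[/\-\s]{MONTHNUM}[/\-\s]{MONTHDAY}[T\s\-]+{TIME})"
)

SAMPLES = [
    "2020-05-21 06:01:01 upgrade tzdata:all 2019c-3ubuntu1 2020a-0ubuntu0.20.04",
    "2020-05-24 20:04:25-07 ted-macbook-pro softwareupdated[753]: Removing client",
    "2020-05-25 03:26:36,643 - octoprint.server.heartbeat - INFO - Server heartbeat <3",
    "2020/05/25-16:02:11.5533",
]


def extract(line):
    """Return the captured datetime string, or None if the line doesn't match."""
    found = CUSTOM.search(line)
    return found.group("my_datetime_field") if found else None


for sample in SAMPLES:
    print(extract(sample))
```

Note that the Mac sample loses its "-07" offset, since the pattern stops after the seconds. That's fine for my purposes, but you'd need to extend it if you care about time zones.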

So, there you go. That can easily be stuffed into Logstash, or a processor. Hooray!

Caveats

Now, there is an 8601-style year-first pattern defined, so, great if it matches your format. It didn't match enough of my variants:

TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{ISO8601_HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

There was also a year-first version added to logstash in 2016, then the day and month were flipped, then it was removed or never made it to master. Hilariously, it was typoed as 8061 the whole time. It also didn't exist in elasticsearch, only logstash. It doesn't help that the Elasticsearch version of the file was moved in 2016, then also moved in 2018, which took away the easy-to-view commit history.

Why not fix it?

Here's an issue from 2015. Here's another one. Here's a PR from 2016. The issue isn't submitting a PR, obviously. Don't get me started on discuss.elastic.co, which auto-closes and locks discussions, meaning you're pretty much guaranteed to find an out-of-date, inferior solution. If any at all.

Gavin Belson Signature Edition liquid cooled server

the gavin belson liquid cooled server, top open

I've been slowly collecting bits to build a liquid-cooled server. I have no practical reason to do so, but I've been intrigued by the concept. Finally I bit the bullet and got the following:

I then ordered a bunch of parts from EKWB:

Everything else was pretty normal to build a PC/pseudoserver. It took me quite a while to put this together. Not only did I have to wait for two batches of parts from EKWB in Slovenia and deal with a customs delay, but I also needed to fabricate a bunch of parts.

The first thing was the radiator. I designed two "sliders" to hold it in, since the width of the radiator is close to the case width. It took a lot of iterations to deal with bolt holes, drain cap of the radiator, and so on.

radiator CAD radiator mount

I then designed a block to hold the pump. It only took a couple of tries on it. I was happy to put a bit of a "lean" on the side bolt holes, which leaves room for a rounded corner on the inside of the case.

pump mount CAD

I installed the motherboard and then tried adding the GPU and the water cooling. They both required disassembling the motherboard from its mounting plate. So many screws! The GPU was even more work when I realized the X570 chipset heatsink/fan were in the way. I first took it off and thought about cutting it, but it's a big chonky piece that would have been really difficult.

The next GPU problem is the Rosewill case doesn't have expansion slots in the back. So weird! Obviously it's designed for a motherboard to be placed there, but I guess they assume nobody will need to add a card to the motherboard. So I Dremeled out the back of the case. I might design and print something to secure the card in place.

expansion slot cutout server window

To add to my list of GPU struggles, the liquid cooling makes the GPU taller than a 4U case. I took some measurements, carefully dremeled a hole, then designed a plexiglass popemobile window. I might add a second window or enlarge that one to show the cooling and RGBs, but that is optional and would come later.

I'm unhappy with a couple of things:

  1. the motherboard only seems to have two fan headers that are usable from Linux. One has the pump, the other has puller fans. I have a few other rows of fans, so I either need to daisy-chain them all together or control them from off the motherboard.
  2. the RGB LED headers aren't controllable.
  3. there are no ports for off-motherboard thermistors. I put four inline thermistors on the cooling loop but I have nowhere to plug them in.
  4. It's hard to know if the pump is running. It sends RPM back, but there's no other real verification. I haven't solved this yet, but I will probably stick a flow indicator in the loop, then 3D print an optical sensor to look through it.
  5. The cabling mess. I tried heat-shrinking the fan cables, but that made things worse, partly because it's thick heatshrink that glued everything together, so I can't remove a fan from the shrink. There's a lot of cabling coming from the PSU too. I'm delaying solving this until everything is done.

I have a lead on solving the first three things: I've designed a board around the ESP32 with three fan headers and two DRGB headers. If that works I'll add at least five inputs for the thermistors and flow control. That way I can use the DRGBs to display temperature at a glance, plus control the fans and gather/transmit all the important data back to my network.

cooling loop custom liquid cooling controller v1

What am I using it for? I have a couple things in mind. First, it sits in my Kubernetes cluster, and the GPU will let me transcode my security videos better. I thought I'd dabble with some other GPU computing, but there aren't great drivers or support for non-Nvidia cards :/. As a final use, it'd make for a nice Steam Remote gaming system, if I can figure out how to do it with a non-Windows, container-based system.

The crowning touch was to add the 3D-printed replica of the Gavin Belson signature to the case. It looks perfect!

the gavin belson liquid cooled server, top closed

Deleting (almost) all of my tweets

I decided a few months ago that my 10+ years of Twitter history wasn't helpful. It had more chance to be a liability than anything. So I finally wrote a script to delete it.

I deploy it locally as a Kubernetes cron, so all of the secrets are stored elsewhere. That means I've shared the repo:

https://github.com/tedder/tweet-cleanup

It isn't perfect (for instance, in dealing with pinned tweets) and it's still hardcoded to my username, but I'd love PRs.

Far Side comic newsfeed (RSS feed)

In late 2019, the archives of Gary Larson's awesome The Far Side comic officially came back to the web. It's fun to see them, not just the 10 that need more JPEG. Two 'daily' comics are posted per day, which is cool.

Unfortunately, there's no newsfeed. So, I fixed that. It uses the modern JSON feed spec. The feed URL is: https://dyn.tedder.me/rss/farside/daily.json

If you need an older style feed, here's an Atom XML feed: https://dyn.tedder.me/rss/farside/daily.xml
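If you'd rather consume the JSON feed from a script than a feed reader, the format is easy to walk. This is a minimal sketch with a made-up sample document (not the real feed contents); the field names come from the JSON Feed spec, where every item has an id and may have a title and url.

```python
import json

# Made-up sample in JSON Feed shape; the real feed's contents will differ.
SAMPLE_FEED = """
{
  "version": "https://jsonfeed.org/version/1",
  "title": "Example feed",
  "items": [
    {"id": "1", "url": "https://example.com/1", "title": "First"},
    {"id": "2", "url": "https://example.com/2"}
  ]
}
"""


def list_items(feed_text):
    """Return (id, title, url) for each item, tolerating missing optional fields."""
    feed = json.loads(feed_text)
    return [
        (item["id"], item.get("title", ""), item.get("url", ""))
        for item in feed.get("items", [])
    ]


for item_id, title, url in list_items(SAMPLE_FEED):
    print(item_id, title, url)
```

To use it for real, fetch the daily.json URL above with your HTTP client of choice and pass the response body to list_items.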

You can see a sample of it being used on Newsblur (even if you don't have a login).

Robocalling in 12 minutes*

On John Oliver's Last Week Tonight, he showed how they were robocalling the FCC commissioners. As an offhanded comment he said their tech person literally got it running in less than 15 minutes. I took that as a challenge, and based on a comment from Kirsten, my goal was to get audio of our Audreycat growling when she answered the phone.

I was confident because I knew I had an account ready to go at Twilio. They have developer-friendly APIs and I had previous experience of managing over 10k phone numbers on their platform, though I hadn't done it recently.

So, I bought a phone number from an area code she'd recognize and cobbled together the code from their sample:

from twilio.rest import Client
k = "+15035551212"
phone="+13095551212"
acct_sid="xx"
acct_tok="yy"

client = Client(acct_sid,acct_tok)
call = client.calls.create(
                        url='https://tedder.me/twilio/hi3.xml',
                        to=k,
                        from_=phone
                    )

print(call.sid)

And here's the XML, which tells Twilio what to do after the phone is picked up, which I put on S3:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">hi kirsten</Say>
    <Play>https://tedder.me/twilio/audrey.mp3</Play>
</Response>

Sure enough, I had her phone ringing 12 minutes later. She picked up, and a robot voice told her there was an error. What?!

So, I did it in 12 minutes, but it took another 13 minutes to figure out what was wrong. Twilio's error console said it was returning a 403 error, which didn't make sense, as I could access the content just fine.

With some further digging I found that the XML URL is hit with POST, so I knew the problem: the content was being served from S3, which only expects a GET. So I had to dig further to find all of the parameters to the Twilio code. It was actually really hard to find good documentation for the python library, as everything pointed to the basics of using the library. Finally I found the entrypoint to the documentation, and after a lot of work I got to the calls.create documentation. It was difficult to find but the answer was easy: I needed to add a parameter called method:

call = client.calls.create(
                        method='GET',
                        url='https://tedder.me/twilio/hi3.xml',
                        to=k,
                        from_=phone
                    )

After doing that it was easy. I can't say it was done in 15 minutes like John Oliver's tech dude, but I still suspect his wasn't fully working after 15 minutes either.

Using shell conditionals in AWS Codebuild

I was working on AWS Codebuild and having trouble getting a conditional to work. In my case, I wanted a "dry run" type flag. I'll use that as the example here.

Conditionals support in Codebuild's buildspec

My first problem was figuring out what shell AWS uses. I didn't find anything on the "Shells and Commands in Build Environments" documentation page, so I decided to keep it really vanilla: avoid using bash specifics and stay close to a POSIX shell. Here was my first try:

  post_build:
    commands:
    - [ "$DRY_RUN" -gt "0" ] && echo "run my command"

Looks great, right? Well, except I forgot the first rule of yaml: always run a linter. That would have shown me that I was using square brackets in a scalar, which is never a good idea.

Avoiding COMMAND_EXECUTION_ERROR

So, I quoted the whole thing:

  post_build:
    commands:
    - '[ "$DRY_RUN" -gt "0" ] && echo "run my command"'
    - |-
        [ "$DRY_RUN" -gt "0" ] && echo "alternate quoting syntax"

And that's what led me to write up this blog entry. That was returning COMMAND_EXECUTION_ERROR in the build:

[Container] 2019/02/28 21:14:25 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: [ "$DRY_RUN" -gt "0" ] && echo "run my command". Reason: exit status 1

When I googled this, I only found a couple lines of linkspam, nothing relevant. I had to iterate several times to solve it, and I ended up just using an if-fi block instead.

  post_build:
    commands:
    - |-
          if [ "$DRY_RUN" -gt "0" ]; then
            echo "run my command"
          fi

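And that works! The reason: the [ ... ] && command form returns a non-zero exit status whenever the test is false, and Codebuild treats any non-zero exit status as a failed command. An if block whose condition is false exits 0. So, TLDR: use the if-fi syntax in a quoted yaml section.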
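The difference between the two forms is easy to reproduce outside Codebuild. Here's a small sketch that runs both through sh and reports the exit status; it assumes a POSIX sh on the PATH, and exit_status is just a helper name I made up.

```python
import os
import subprocess

AND_FORM = '[ "$DRY_RUN" -gt "0" ] && echo "run my command"'
IF_FORM = 'if [ "$DRY_RUN" -gt "0" ]; then echo "run my command"; fi'


def exit_status(command, dry_run):
    """Run the command under sh with DRY_RUN set, and return its exit status."""
    env = dict(os.environ, DRY_RUN=dry_run)
    result = subprocess.run(["sh", "-c", command], env=env, capture_output=True)
    return result.returncode


print(exit_status(AND_FORM, "0"))  # 1: the failed test becomes the command's status
print(exit_status(IF_FORM, "0"))   # 0: a false condition with no else is a success
print(exit_status(AND_FORM, "1"))  # 0: the test passes and echo runs
```

So the first form only "fails" when DRY_RUN is 0, which is exactly the case where I wanted it to quietly do nothing.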

First post!

This is the obligatory "first post" of my new blog. My last blog was from a few years ago, pre-Life Changes, and I've needed something to place random geekery. I love static sites and have opinions, so this is Nikola with Markdown posts and markdown meta-data posts. I tend to write in Markdown even when I never render the content, so .. may as well.

things to write about:

  • how I sync to s3
  • my config changes

todo:

  • sync to s3
  • figure out how to add images to posts
  • smugmug?