Best HQ pic currently available of the board; I'll take better ones where you can read all the SoCs soonish. I might also be able to donate one of these boards + bed to you for testing (Creality keeps this PCB under lock and key, but I have obtained a few spares for testing).
The only viable Python in the jumble is the prtouch_v1 and hx711s code.
The v2 uses closed-source C code that runs on the MCU itself.
I don't blame you. I didn't know if any of their code would have clues/hints on approaches for reading things.
I also got my mesh reliable and tuned with their Python code. It just took some tuning of the values, and changing some hard-coded values to be configurable so I could fiddle with them a bit.
The little silver square in the lower left between the MCU and the HX711 chips could be a clock/crystal. If they get clocked together then it's reasonable to assume they will all stay in sync. If their startup times are a fixed number of clock cycles (and these chips should be simple enough that that's true), then they should all report data ready at the same time. Then the gang-read code in C makes perfect sense.
From that standpoint this would just be a specific 'driver' in my design that reads 4 values. The thing we would have to do is support reporting multiple values back to the Pi and to the endstop. Maybe that wouldn't be such a difficult modification.
The pin they call 'clk' is the serial clock, PD_SCK in the docs. It's used to cause the DOUT pin to output 1 bit per cycle.
The clock I'm talking about is the "Crystal I/O and External Clock Input", pin XI in the docs. That should be fed 11.0592 MHz. The exact number of cycles (at 11.0592 MHz) that it takes to prepare a sample is fixed. So if all the chips get the same clock signal they should be done preparing a sample "at the same time". They may boot at slightly different times, which may spread them out slightly, but whatever that spread is, it should remain constant.
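To make the lockstep argument concrete, here is some back-of-the-envelope arithmetic. The 10 SPS low-speed rate is my assumption from the HX711 docs; the only point is that the conversion length is a fixed number of XI cycles, so chips sharing one crystal stay in step:

```python
# Rough arithmetic behind the shared-crystal argument. The 10 SPS figure is an
# assumption from my reading of the HX711 docs; double check before relying on it.
F_XI = 11.0592e6                     # external crystal frequency (Hz)
SAMPLE_RATE = 10.0                   # assumed low-speed output rate (SPS)
cycles_per_sample = F_XI / SAMPLE_RATE
print(cycles_per_sample)             # ~1,105,920 XI cycles per conversion

# Any two chips fed the same XI signal finish conversion i at
#   t_start + i * cycles_per_sample / F_XI
# differing only by their (constant) startup offset t_start.
```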
It just dawned on me that the timing code has another hidden source of error that I haven’t accounted for yet. I’ve read the klipper docs and source a few times now but I must be thick…
I wrote about how the ADC prepares a sample and then it starts going stale. The chips expect you to use an interrupt to record the time and then read the value later.
Klipper doesn't use interrupts; it does timer-based polling. So there is some time between when data_ready is set and when it gets polled. I turned up the frequency of the timer function to 40 kHz to reduce this to a small value for testing. But this is inefficient, and it still suffers from another timing error:
I’m capturing the pin state in the TASK, not the timer function.
TASKs get invoked by Klipper's main loop, not at a specific time. So there is some delay between when the timer sets the flag to allow the task to execute and when the TASK gets scheduled for its next run. Critically, I don't know which of those 2 times is more accurate. It's as if you say "go check if the light is on", then someone walks to another room and shouts back "it's on". But you don't know if the light was on before they left to go check, or if it came on right as they entered the room. The current code assumes the delay is negligible and that the light is on when the "go check" command is issued. And most of the time it probably is negligible. But adding load to the MCU from other devices can change that.
To get closer to interrupt timing accuracy I need to check the data_ready pin in the timer function and record the time when the pin changes state. I can then set the flag to run the TASK which will read in the load cell value some time later. Basically trying to simulate the timing accuracy of an ISR without the ISR.
I can reduce the total overhead by changing the rest_ticks to be a larger value when the data_ready pin is set. A timing diagram should look like this:
Here 'T Delta' is the unaccounted-for source of error in the current implementation. The smaller 'TE' should be the timing error after this change: the time between runs of the timer pin-check function. This should hold even when the MCU gets busy running other TASKs that delay the eventual read.
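To sanity-check that reasoning, here is a toy simulation (plain Python, not Klipper code; all rates and delays are made-up numbers) comparing the worst-case error of the two time-stamping strategies:

```python
import random

SAMPLE_PERIOD = 1 / 320.0   # sensor produces a sample roughly every 3.1 ms
POLL_PERIOD = 1 / 400.0     # timer function wakes at 400 Hz (made-up rate)
random.seed(0)

def simulate(n=10000):
    max_te, max_t_delta = 0.0, 0.0
    for i in range(n):
        ready = i * SAMPLE_PERIOD                    # true data_ready time
        # first timer tick at or after 'ready' (random phase each sample)
        timer_time = ready + random.uniform(0.0, POLL_PERIOD)
        # the TASK runs some variable time later (main-loop scheduling delay)
        task_time = timer_time + random.uniform(0.0, 0.002)
        max_te = max(max_te, timer_time - ready)         # 'TE' above
        max_t_delta = max(max_t_delta, task_time - ready)  # 'T Delta' above
    return max_te, max_t_delta

print(simulate())  # stamping in the timer fn bounds the error to one poll period;
                   # stamping in the TASK adds the unbounded scheduling delay
```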
The implementations I have seen don't get this detail right. I think Prusa-firmware-buddy is using a true ISR as the chip makers intended, but the Klipper versions are polling at the sample rate, which has both kinds of error.
Take a look at the "bulk sensor" reading code as found in the adxl345 support. There is a simple way to get excellent timing precision.
Roughly:
If you have a sensor producing readings every ~10ms, you schedule a timer to wake a task that checks for the reading every ~7ms or so. That task looks at the “data_ready” pin and if it is set it reads the value and puts it into its host bound data queue. If the data_ready pin is not set (happens roughly every 3rd or 4th reading), then you just go back to sleep until the next timer wakeup.
Implement a “query_hx711_status” command (similar to “query_adxl345_status”). The command would report the number of samples the hx711 sensor has taken (the number of locally queued samples, plus one if the data_ready pin is set at the time of the query). That count is reported back to the host along with a high precision timestamp (timer_read_time()).
Rebuild accurate timestamps for all sensor readings in the host using the Python ClockSyncRegression class (that class is generic, but happens to be located in klippy/extras/adxl345.py).
The "magic" in this system is based on the knowledge that the hx711 is generating samples at a very precise and fixed rate. The only things we don't know are the exact sample rate and the start time of the first sample. However, by measuring the "total number of samples produced" over extended periods of time (eg, 5+ seconds) we can deduce those two pieces of information with great accuracy (even if each individual measurement is relatively coarse). Said another way, if we were to "plot a graph" of "time vs number of samples produced" over an extended period of time, then one could use a linear regression to "draw a straight line through that graph" - the "slope of that line" is the sensor reading frequency, and the "x axis intercept of that line" is the time of the first sensor reading.
The above system is at the core of how Klipper manages clock skew between MCUs, and is how Klipper gets very precise times for accelerometers.
The system is best when the chip has a queue (so that we don’t have to worry about an mcu scheduling delay that is long enough to risk losing a sample entirely). However, that possibility should be manageable.
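A minimal sketch of that regression idea, using synthetic (mcu_time, total_sample_count) measurements. The function and variable names are illustrative, not the actual ClockSyncRegression API, and the fit is done with sample count on the x axis, so the slope comes out as the sample interval (1/frequency):

```python
# Recover the sample interval and first-sample time from coarse
# (time, total_sample_count) measurements via a least-squares fit.
def fit_sample_clock(measurements):
    # measurements: list of (mcu_time, total_samples_taken) pairs
    n = len(measurements)
    mean_c = sum(c for _, c in measurements) / n
    mean_t = sum(t for t, _ in measurements) / n
    num = sum((c - mean_c) * (t - mean_t) for t, c in measurements)
    den = sum((c - mean_c) ** 2 for _, c in measurements)
    sample_interval = num / den                     # slope: seconds per sample
    time_base = mean_t - mean_c * sample_interval   # time of "sample 0"
    return time_base, sample_interval

# Synthetic data: a 332.5 SPS sensor that started at t=4482.51, with each
# measurement arriving ~0.4ms late, queried roughly every 100ms.
true_interval, t0 = 1 / 332.5, 4482.51
meas = [(t0 + c * true_interval + 0.0004, c) for c in range(0, 20000, 33)]
time_base, interval = fit_sample_clock(meas)
print(round(1.0 / interval, 2), round(time_base, 4))   # ~332.5 SPS, ~4482.5104
# Timestamp of the k-th sample is then simply: time_base + k * interval
```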
Thanks @koconnor. I have read that ClockSyncRegression class a couple times and never understood the code well enough to attempt to use it. But clearly that’s what I need to do. I’ll go back and read it again and go down that route.
I coded up my above idea, but its downside is needing increasingly high polling rates as the frequency of the sensor goes up. It would be 'ok' at 400 Hz, and maybe that's all we will ever need. But the sensors go up to 32 kHz; I won't be able to poll at 100x that rate. ClockSyncRegression sounds like it would let me poll at only a slightly higher frequency, still within the range of what's possible on the hardware.
I got my new test rig commissioned for hacking this weekend:
That's a Voron 2.4 with Prusa's Nextruder in it, talking to the onboard HX717 using the existing C code.
Initial testing is very encouraging. PROBE_ACCURACY at 5mm/s is 0.015mm of range. It will QUAD_GANTRY_LEVEL at 0.0075 without an issue at that speed.
If I drop the speed to 0.5mm/s I get 0.0012mm of range. It sees 2 or 3 microstep values, i.e. the same numbers to the 6th decimal place, over and over again. At that speed it's as good as the Euclid probe that was in this machine. This is encouraging for the linear regression integration. Hopefully I can bring this level of accuracy up to 5mm/s.
I’m going to resist the urge to print something with it and get on with the ClockSyncRegression investigation.
Ok, after 2 long days of hacking I got this verifiably working. I added the state of ClockSyncRegression to the status so I can see the true measured SPS and other stats from Moonraker:
```
hx717 nextruder_adc:
  last_message_time: 4485.671633160165
  measured_sps: 332.48968
  is_capturing: true
  last_sequence_number: 116348
  current_print_time: 4485.775916694445
  clock_sync_regression:
    sample_count_base: 1395124.5148795838
    time_base: 4482.512181349338
    sample_interval: 0.0030076121493038954
  delay_filter:
    max_delay_ticks: 216
    recurring_filter_count: 0
    max_recurring_filter_count: 4
    filtered_percent: 34.9
    filtered_count: 14761
```
It's about 10 SPS higher than the value quoted in the spec sheet (320).
The results look very good. It has about the same consistency as the probe in the tool changer did when hammering the sensor at 40 kHz, but MCU load is only about 2%. Plus this gets rid of all of the existing timing-related error codes.
Let's talk code…
ClockSyncRegression had me very confused the first few times I read it because it says chip_clock. I assumed this meant that the ADXL345 had a clock in it. But I couldn't find evidence of that in the C code, so I always came away confused but thinking it didn't apply to my work. Now I understand this is probably a naming convention carried over from the MCU timing sync code, where the MCU does have a clock. chip_clock is a unit-less value; in the ADXL345 code it's the sample count (the y value on an x/y plot).
The code that handles the message sequence numbers is tricky. I wrapped it up into a class so it's a bit clearer how the sequence numbers, messages and individual samples relate to the clock sync regression.
The stuff around duration (checking for delayed reads on the MCU) was also difficult to reason about without seeing data go through it. There is filtering, but the filter is adaptive, so it can accept large delays if it gets starved. It also has to be forced to accept the first 2 samples to prime the regression. I wrapped that up into a class as well. From the status you can see it has rejected about 1/3 of all clock sync attempts, and as many as 4 in a row. I don't know what the expected behavior is or whether I should tune that.
I still have a lot of cleanup to do before I can publish the branch again…
Yes, chip_clock is poorly named. In this context it is "number of samples". It's based on the hypothetical idea that the sensor has a clock rate of one tick per sample.
I don't think the duration check is relevant for hx711; I'd have just removed that check for hx711. It's desired on adxl345 because the SPI query itself could be delayed a notable amount of time due to background irq activity on the MCU. That is, although the query was issued at a particular clock time, it may have been received at a time notably after that. To avoid that rare case skewing the timing, the code checks that the total query time isn't excessively large (query_ticks). To avoid the case of all queries being thrown out, the host implements additional checking to loosen that query time restriction if needed (thus the duration complexity).
Since the hx711 doesn't have an SPI or I2C interface that could be delayed, it shouldn't need this additional checking. I'd think one could get very accurate timing with something like:
In case anyone is curious, I can provide a little more detail on the rationale of the duration check.
When I built the host to mcu clock synchronization code, I found that using a linear regression generally provided good results, even with notable query jitter during each clock measurement. The high-level idea is that minor timing differences between measurements will effectively be “random” and one can use a linear regression to eliminate that “noise”.
However, one challenge I found was that measurement latency does not appear as “random”. That is, if a measurement typically takes 5us, it may sometimes take 4us or even 3us, but will never ever take -2us. That is, one can never have a negative latency. In contrast, on occasion, one will find large latency outliers (eg, 50us).
During the development of the host to mcu clock synchronization code I found that this uneven distribution of latency resulted in a systemic bias to the clock estimates. It wouldn’t have been a problem if there were occasionally large negative latencies to balance the occasionally large positive latencies. Just having the outliers on one side did skew things though. I found that I could effectively counter that bias by adding a filter to discard the obvious outliers (minor outliers didn’t notably impact the linear regression). Alas, there were some rare cases where that filter could become too aggressive resulting in complete breakdown of the clock synchronization. To avoid these very rare situations, the filter is forced to be more accepting each time it finds an outlier. (The resulting code is in klippy/clocksync.py:_handle_clock().)
When I built the adxl345.py ClockSyncRegression class, I carried over those high-level ideas. The duration check is only intended to catch obvious latency outliers to prevent a notable systemic bias to the linear regression.
As above, though, unless I’m missing something, I don’t think it is needed for hx711.
You are correct. I took a shortcut by calling the existing read method, which resulted in occasional high delay due to bit-bang reads on the sensor. But I can get rid of that as you have pointed out. All of the sensors of this kind that I have looked at have some kind of data-ready pin (HX71x, ADS16* and the MPI one); it's a common feature of continuous-capture delta-sigma ADCs. So as an architecture I think it will work to require that the sensor implementer provide a pin to check.
I’ll refactor it to that before I push the code.
I also have to remove the testing & logging code I added to homing.py to gather all of the graphs I have been looking at. I'll be working on moving that into a customized Probe class.
This graph shows the range of a set of 9x50 probes taken at 9 points across the bed. The blue and orange lines labeled 'loadcell_330sps' are the HX717 on the Voron, and the red/green lines are the ADS1263 at 400 SPS on the E3D Tool Changer.
I can't see much difference between the two. The HX717 has a lower sample rate, this strain gauge has worse utilization of the bit range, and the noise floor is much worse. Basically I expected it to be noticeably worse in this test, but I'm not seeing that. I think this validates the pullback move + regression approach as able to handle noisy data well.
There are some 'interesting' things going on with the Voron. The tare value is consistently different (+20g!) between the left and right sides of the bed. I don't know if that's mechanical (cable chain?) or electrical (5V, long cables etc.). It's messing with reliability for QGL and bed meshing. I probably need to implement "continuous tare", where the tare value is computed from some n trailing samples and sent to the load cell when the endstop gets enabled, roughly as sketched below.
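A rough sketch of that continuous-tare idea. The class, window size and method names are hypothetical, not existing code:

```python
from collections import deque

class ContinuousTare:
    # Keep the last n samples seen while the toolhead is idle and use their
    # average as the tare when the endstop is armed. n = 100 is a guess.
    def __init__(self, window=100):
        self.recent = deque(maxlen=window)

    def note_sample(self, counts):
        self.recent.append(counts)          # feed every idle reading in here

    def tare_for_probe(self):
        if not self.recent:
            raise RuntimeError("no samples captured yet")
        return sum(self.recent) / len(self.recent)

# Usage: feed it idle readings, then grab the tare right before enabling the
# endstop so side-of-bed drift (the +20g offset above) gets zeroed out.
```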
I think you misunderstand or I don’t understand you. The ‘Voron’ is just the test machine I’m using. There are no links to anything that’s not already in this thread.
It no longer reports errors from the device, so the timing errors that testers have encountered should no longer be possible.
It's 'good enough' to run a printer if you probe slowly (2mm/s).
I made a HX717 branch of my voron config to keep the test system config in.
Here is the class @koconnor : MessageSequence. I have to admit that I don't fully grok the bit manipulation in there. I copied it and it works. One day I'll sit down and truly understand it. I left the delay filter in the git history if you are interested.
The above takes an incrementing 16-bit value (that could overflow) and turns it into an incrementing value that won't overflow. The first line mixes the high bits of the last sequence (which does not overflow) with the new sequence. In the second part, we check if that new sequence is less than the last sequence; since we know the value only ever increments, that must indicate an overflow, so we add (1<<16) to account for it.
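Based on that description, the pattern presumably looks something like this (a reconstruction, not the exact quoted code):

```python
# Extend an incrementing 16-bit sequence counter into a Python int that never
# overflows, assuming the true sequence only ever increases.
def extend_monotonic_seq(last_sequence, new_16bit_seq):
    # Mix the non-overflowing high bits of the last sequence with the new
    # 16-bit value reported by the MCU.
    sequence = (last_sequence & ~0xffff) | new_16bit_seq
    # If that went "backwards", the 16-bit counter must have wrapped.
    if sequence < last_sequence:
        sequence += 1 << 16
    return sequence
```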
This code has a similar purpose, but works in a slightly different way. It up-converts a 16-bit value to a value for which we don't have to worry about integer overflows. It's different in that we don't know that the value is always incrementing; the new value may be slightly smaller or slightly larger than the last fully converted value. The first line calculates a 16-bit difference between the reference sequence and the new sequence; the second line converts that value to a signed 16-bit value; the third line uses that on the reference value (which does not overflow) to find the new sequence (without overflows).
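Again based on that description, a reconstruction of the second pattern (not the exact quoted code):

```python
# Up-convert a 16-bit value near a wide reference sequence; the new value may
# be slightly ahead of or behind the reference, so the 16-bit difference is
# reinterpreted as signed before applying it.
def extend_nearby_seq(ref_sequence, new_16bit_seq):
    diff = (new_16bit_seq - ref_sequence) & 0xffff   # 16-bit difference
    diff -= (diff & 0x8000) << 1                     # reinterpret as signed 16-bit
    return ref_sequence + diff                       # apply to the wide reference
```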
Separately, the “magic” 640 value is 3200*0.100*2. 3200 is the max query rate of the adxl345, 0.100 is how often the code takes a new clock measurement (query_adxl345_status), and 2 is because we want to smooth the application of new clock estimates over 2 full clock updates.
Just as background, the reason these integer conversions are “open coded” is that Python is very slow calling functions that are implemented in Python. (It’s not slow directly calling functions implemented in C, just functions that are implemented in Python.)
The adxl345’s _extract_samples() code is called many times a second with large lists (1000+) on each invocation. Optimizing the inner loop of that code by avoiding calls to other Python functions can have a notable impact on the overall cpu usage.
Those were exactly the bits I was stuck on. Thank you for the explanation! I got the first one, but the second one had me stumped because it's trying to do the same thing as the first method but differently. I didn't grok why it needed to be different.
Perfect, I'll do similar math for the sample rate of the sensors.
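For reference, plugging my sensor's numbers into the same formula. The 100ms query interval and 2-update smoothing are assumptions carried over from the adxl345 case, not values from any existing code:

```python
# Analogous smoothing constant, following the 3200 * 0.100 * 2 pattern above.
SAMPLE_RATE = 332.5      # measured hx717 rate from the status output above (SPS)
QUERY_INTERVAL = 0.100   # assumed interval between clock/status queries (s)
SMOOTH_UPDATES = 2       # smooth over two full clock updates
print(SAMPLE_RATE * QUERY_INTERVAL * SMOOTH_UPDATES)   # ~66.5
```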
I had similar issues. I was running this on a Pi 3 and it was initially running quite hot. I tried running it at 32,000 Hz, 10x faster than the ADXL, and it tried to light the Pi on fire. I did a lot of perf analysis with Scalene. Here is what I found and what I changed in my implementation:
Using thread locks is expensive. This seemed to be the worst offender. I avoided them with Python's collections.deque, which has thread-safe append and popleft methods. It's effectively a thread-safe ring buffer (see the sketch after this list).
Bit shifting in Python is expensive if you do a lot of it. It can be replaced with a call into C by using Struct.unpack_from, e.g. struct.Struct('<i').unpack_from will unpack a little-endian signed 32-bit int. I was doing an operation to reverse the endianness which required 4 bit shifts, and this eliminated them.
Sadly, round() is expensive. It's implemented in Python and is very accurate but slow. I reasoned that rounding isn't needed, as it could be done at some higher layer, either in a gcode command or in the UI. So I just don't round the grams in my code. (NumPy's round is implemented in C, so that's an alternative I might dig into further, particularly for bulk-rounding a column.)
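A small sketch of the deque and struct.unpack_from points above. The 4-byte little-endian framing and function names here are just an example, not my actual message layout:

```python
import struct
from collections import deque

# Ring buffer without explicit locks: deque.append()/popleft() are atomic in
# CPython, so a producer thread and consumer thread can share it safely.
samples = deque(maxlen=10000)

UNPACK_INT32_LE = struct.Struct('<i').unpack_from   # one call into C per sample

def on_raw_sample(raw_bytes, offset=0):
    # Replaces four manual shift-and-or operations with a single C call.
    (counts,) = UNPACK_INT32_LE(raw_bytes, offset)
    samples.append(counts)                 # producer side (capture thread)

def drain():
    out = []
    while samples:
        out.append(samples.popleft())      # consumer side (host processing)
    return out

on_raw_sample(b'\x01\x00\x00\x00')
print(drain())                             # [1]
```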